Pointers in C are more abstract than they might seem.

A pointer refers to a memory location, and dereferencing a pointer means accessing the value stored at that location. The value of the pointer itself is the address of the location. The C standard does not specify how memory addresses are represented. This is an important point, since different architectures may use different addressing models. Most modern architectures use a linear address space or something similar. Even that is not strictly specified, however, since addresses may be physical or virtual. Some architectures do not use a numeric representation at all: the Symbolics Lisp Machine, for example, uses tuples of the form (object, offset) as addresses.

The standard does not specify the representation of pointers, but it does specify, to a greater or lesser extent, the operations on them. Below we look at these operations and at how the standard defines them. Let's start with the following example:

    #include <stdio.h>

    int main(void) {
        int a, b;
        int *p = &a;
        int *q = &b + 1;
        printf("%p %p %d\n", (void *)p, (void *)q, p == q);
        return 0;
    }

If we compile this code with GCC at optimization level 1 (-O1) and run the program on Linux x86-64, it prints the following:

 0x7fff4a35b19c 0x7fff4a35b19c 0

Note that p and q hold the same address. Yet the expression p == q evaluates to false, which seems strange at first glance. Shouldn't two pointers to the same address be equal?

Here is how the C standard defines the equality of two pointers:

C11 § 6.5.9 paragraph 6

Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.

First of all, the question arises: what is an "object"? Since we are talking about C, objects here obviously have nothing to do with objects in OOP languages such as C++. The C standard does not define the concept very strictly:

C11 § 3.15

object: region of data storage in the execution environment, the contents of which can represent values

NOTE When referenced, an object may be interpreted as having a particular type; see 6.3.2.1.

Let's figure it out. A 16-bit integer variable is a region of memory that can represent 16-bit integer values, so such a variable is an object. Do two pointers compare equal if one refers to the first byte of that integer and the other to the second byte of the same integer? That is certainly not what the language committee intended. Still, the standard gives no clear explanation on this point, and we are left to guess what was really meant.

When the compiler gets in the way

Let's return to our first example. Pointer p is derived from object a, and pointer q from object b. The latter involves pointer arithmetic, which is defined for the + and - operators as follows:

C11 § 6.5.6 paragraph 7

For the purposes of these operators, a pointer to an object that is not an element of an array behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type.

Since any pointer to a non-array object is treated as a pointer into an array of length one, the standard defines pointer arithmetic only for pointers into arrays. This is done in paragraph 8, of which the following part is of interest to us:

C11 § 6.5.6 paragraph 8

When an expression that has integer type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression. In other words, if the expression P points to the i-th element of an array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N (where N has the value n) point to, respectively, the i+n-th and i−n-th elements of the array object, provided they exist. Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object. If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.

From paragraphs 7 and 8 it follows that the expression &b + 1 must yield a valid address, so p and q are both valid pointers. Recall how the standard defines pointer equality: "Two pointers compare equal if and only if [...] one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space" (C11 § 6.5.9 paragraph 6). This is exactly the situation in our example: q points one past the end of object b, which is immediately followed by object a, to which p points. So is this a bug in GCC? The discrepancy was reported in 2014 as GCC bug #61502, but the GCC developers do not consider it a bug and do not intend to change the behavior.

Linux programmers encountered a similar problem in 2016. Consider the following code:

    extern int _start[];
    extern int _end[];

    void foo(void) {
        for (int *i = _start; i != _end; ++i) {
            /* ... */
        }
    }

The symbols _start and _end mark the boundaries of a memory region. Since they are declared extern and defined elsewhere, the compiler does not know how the arrays are actually laid out in memory. It should therefore be conservative and assume that they may follow each other in the address space. However, GCC compiled the loop condition so that it was always true, which made the loop infinite. The problem is described in a post on LKML that uses a similar code fragment. It seems that in this case the GCC authors did take the feedback into account and changed the compiler's behavior. At least I could not reproduce the error with GCC 7.3.1 on Linux x86_64.

Does Defect Report #260 provide a clue?

Defect Report #260 may shed some light on our case. It is mostly concerned with indeterminate values, but it contains a curious comment from the committee:

Implementations [...] may also treat pointers based on different origins as distinct even though they are bitwise identical.

If we take this comment literally, it makes sense that p == q evaluates to false, since p and q are derived from different, unrelated objects. It seems we are getting closer to the truth, or are we? So far we have only dealt with the equality operators; what about the relational operators?

The final answer - in relational operators?

The definition of the relational operators <, <=, > and >= for pointer comparison contains a curious point:

C11 § 6.5.8 paragraph 5

When two pointers are compared, the result depends on the relative locations in the address space of the objects pointed to. If two pointers to object types both point to the same object, or both point one past the last element of the same array object, they compare equal. If the objects pointed to are members of the same aggregate object, pointers to structure members declared later compare greater than pointers to members declared earlier in the structure, and pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values. All pointers to members of the same union object compare equal. If the expression P points to an element of an array object and the expression Q points to the last element of the same array object, the pointer expression Q+1 compares greater than P. In all other cases, the behavior is undefined.

According to this definition, the result of a relational comparison of pointers is defined only if both pointers are derived from the same object. Two examples illustrate this.

    int *p = malloc(64 * sizeof(int));
    int *q = malloc(64 * sizeof(int));
    if (p < q) // undefined behavior
        foo();

Here the pointers p and q point to two different, unrelated objects, so the behavior of the comparison is undefined. In the following example, however:

    int *p = malloc(64 * sizeof(int));
    int *q = p + 42;
    if (p < q)
        foo();

the pointers p and q point into the same object and are therefore related. They can be compared, provided that malloc did not return a null pointer.

Summary

The C11 standard does not describe pointer comparison rigorously. The most problematic point we encountered is § 6.5.9 paragraph 6, which explicitly permits comparing two pointers that refer to two different arrays. This is at odds with the comment from Defect Report #260. However, that report deals with indeterminate values, and I would not want to base my reasoning on that comment alone or interpret it in a different context. For pointer comparison, the relational operators are defined somewhat differently from the equality operators: the relational operators are defined only if both pointers are derived from the same object.

Setting the text of the standard aside, if we ask whether it makes sense to compare two pointers derived from two different objects, the answer is most likely "no". The example at the beginning of the article is a rather theoretical problem. Since the variables a and b have automatic storage duration, any assumptions about their placement in memory are unreliable. In some cases we can guess, but such code clearly cannot be ported safely, and the meaning of the program can only be determined by compiling and running it or by disassembling the code, which is contrary to any serious programming paradigm.

Overall, though, I am not satisfied with the wording of the C11 standard, and since several people have already stumbled over this problem, the question remains: why not state the rules more clearly?

Addendum
Pointers one past the last element of an array

The rules on comparison and pointer arithmetic repeatedly make an exception for pointers one past the last element of an array. Suppose the standard did not allow comparing two pointers derived from the same array when at least one of them points one past the end of the array. Then the following code would not work:

    const int num = 64;
    int x[num];

    for (int *i = x; i < &x[num]; ++i) {
        /* ... */
    }

The loop walks the entire array x of 64 elements, so the loop body must execute exactly 64 times. The condition, however, is evaluated 65 times, once more than the number of elements. During the first 64 iterations the pointer i always points into the array x, while the expression &x[num] always points one past the last element. On the 65th check i also points one past the end of x, which makes the loop condition false. This is a convenient way to traverse an array, and it relies on the exception that keeps such comparisons from being undefined behavior. Note that the standard defines only the comparison of such pointers; dereferencing them is a separate matter.

Could we change the example so that no pointer ever refers to the position one past the last element of x? Yes, but it gets more complicated. We have to change the loop condition and prevent i from being incremented in the last iteration.

    const int num = 64;
    int x[num];

    for (int *i = x; i <= &x[num-1]; ++i) {
        /* ... */
        if (i == &x[num-1]) break;
    }

This code is cluttered with technical details that distract from the actual task. It also adds an extra branch to the loop body. So I think it is reasonable that the standard makes an exception for pointers one past the last element of an array.

Note from the PVS-Studio team

When developing the PVS-Studio code analyzer, we sometimes have to deal with subtle points in order to make diagnostics more accurate or to give detailed advice to our clients. This article caught our interest because it touches on issues where we ourselves do not feel completely confident. So we asked the author for permission to publish a translation. We hope that more C and C++ programmers will read it and realize that things are not so simple, and that when the analyzer suddenly produces a strange message, one should not rush to dismiss it as a false positive :).

The article was first published in English on stefansf.de. Translations are published with the permission of the author.

Source: https://habr.com/ru/post/418023/
