Intervals in C ++, part 1: Intervals with limiters

As we wrote the C ++ Siberia conference in Novosibirsk, Eric Nibler will open. To take a closer look at Habr with this remarkable man, we decided to translate a cycle of his articles on intervals. Erik is currently working on the implementation of the Ranges library for a grant received from the standardization committee.

Recently, I have been busy with intervals, and it became clear to me that this is more than just a couple of iterators. In several posts I want to explain the concept of an interval, describe several intervals that cannot be simply or effectively expressed with the help of STL: intervals with limiters and infinite intervals. In this post, we will consider the problem of representing intervals with delimiters through STL iterators.

Intervals with limiters

To understand the new concepts, you need to have a few good examples. When thinking about an interval with delimiters, imagine a line in C ending in a null character. But at the same time we do not know the exact position of the limiter. The limiter may meet us at a previously unknown position, or, in general, the place of the limiter may be taken by some statement that becomes true. Another example is the istream interval. In this case, the limiter will be the moment when the istream extractor does not work. At the same time, there is a std :: istream_iterator in the standard - therefore, somehow you can still push intervals with delimiters in the STL. Now I will show how.

Intervals with limiters in STL

To show the complexity of the task, I present you an interval with a delimiter for the C-line with iterators that fully comply with the STL standards.

#include <cassert> #include <iostream> #include <boost/iterator/iterator_facade.hpp> struct c_string_range { private: char const *str_; public: using const_iterator = struct iterator : boost::iterator_facade< iterator , char const , std::forward_iterator_tag > { private: friend class boost::iterator_core_access; friend struct c_string_range; char const * str_; iterator(char const * str) : str_(str) {} bool equal(iterator that) const { return str_ ? (that.str_ == str_ || (!that.str_ && !*str_)) : (!that.str_ || !*that.str_); } void increment() { assert(str_ && *str_); ++str_; } char const& dereference() const { assert(str_ && *str_); return *str_; } public: iterator() : str_(nullptr) {} }; c_string_range(char const * str) : str_(str) { assert(str_); } iterator begin() const { return iterator{str_}; } iterator end() const { return iterator{}; } explicit operator bool() const { return !!*str_; } }; int main() { for(char c : c_string_range("hello world!")) std::cout << c; std::cout << 'n'; }

')
The code iterates over the sequence of characters without first searching for the end. This is done by creating a final signaling dummy interpreter, which each time it compares with a real iterator checks whether the real iterator shows a null character. All logic is in the function c_string_range :: iterator :: equal member function. It is unlikely that anyone would call such a decision beautiful.

In modern STL, intervals are defined by two iterators - the initial and final. For iterators like std :: istream_iterator or c_string_range :: iterator, where the iterator can be signaling, this method adds ramifications of the checking code of the iterators, because first you need to determine if one or both of the iterators are signaling. The expression a == b is calculated according to the following table:

a == end?	b == end?	a == b?
true	true	true
true	false	* b == 0
false	true	* a == 0
false	false	& * a == & * b

And these checks must be performed while the program is running. It is impossible to say in advance whether the iterator will be real or empty. And all these checks take away processor time. So pushing the intervals in the STL is not very convenient.

Compiler agrees

And the inconvenience here is not just my opinion. I generated the code for the following two functions:

 int c_strlen(char const *sz) { int i = 0; for(; *sz; ++sz) ++i; return i; } int range_strlen( c_string_range::iterator begin, c_string_range::iterator end) { int i = 0; for(; begin != end; ++begin) ++i; return i; }

These two functions do exactly the same thing, and, in theory, they should lead to the generation of the same code. True, intuition tells us that all these complex logical constructions around c_string_range :: iterator :: equal will not lead to anything good. In fact, the two versions of machine codes are not at all similar to each other.

 ; C_STRLEN pushl %ebp movl %esp, %ebp movl 8(%ebp), %ecx xorl %eax, %eax cmpb $0, (%ecx) je LBB1_3 xorl %eax, %eax .align 16, 0x90 LBB1_2: cmpb $0, 1(%ecx,%eax) leal 1(%eax), %eax jne LBB1_2 LBB1_3: popl %ebp ret

 ; RANGE_STRLEN pushl %ebp movl %esp, %ebp pushl %esi leal 8(%ebp), %ecx movl 12(%ebp), %esi xorl %eax, %eax testl %esi, %esi movl 8(%ebp), %edx jne LBB2_4 jmp LBB2_1 .align 16, 0x90 LBB2_8: incl %eax incl %edx movl %edx, (%ecx) LBB2_4: testl %edx, %edx jne LBB2_5 cmpb $0, (%esi) jne LBB2_8 jmp LBB2_6 .align 16, 0x90 LBB2_5: cmpl %edx, %esi jne LBB2_8 jmp LBB2_6 .align 16, 0x90 LBB2_3: leal 1(%edx,%eax), %esi incl %eax movl %esi, (%ecx) LBB2_1: movl %edx, %esi addl %eax, %esi je LBB2_6 cmpb $0, (%esi) jne LBB2_3 LBB2_6: popl %esi popl %ebp ret

Well well! How many branches and checks there are! The code is obtained using clang 3.4 with -O3 -DNDEBUG. In practice, for range_strlen, the compiler can generate better code. If he can assume that “end” is a signal iterator. But this is only an assumption.

In addition, people usually do not write the c_string_range class for working with delimited strings. They call strlen, and then do the algorithm, thus passing the string twice. But take the same istream. With the input interval, such a focus will not pass, since the very finding of the final iterator absorbs the entire interval. Therefore, there was no alternative but to make a dummy from std :: istream_iterator.

Finally, note that c_string_range :: iterator is a forward (forward) iterator, since the signal iterator cannot be reduced. The interval iterator is as strong as its signal iterator is strong - and it’s not particularly strong.

And now what?

Now it is clear that it is impossible to effectively use STL-algorithms on C-lines. So what? Well, the fact that, in general, all string algorithms cannot be used on C-lines. Look at the beautiful string algorithms in Boost.String_algo. The documentation is written about support for string types:

“Definition: a string is a range of characters that can be accessed sequentially. The first requirement for a string type is to access it through Boost.Range. This library allows you to access the elements of a string using iterators. "

So in Boost.String_algo do not like C-strings. By the way, what happens if you call std :: regex_search with a C-string? He will first call strlen! Even if your line is in megabytes, and the required substring is at the beginning, you still have to go through the entire line, only to find out where it is at the end. What is the meaning of exactly zero.

"Well, do not use the C-line," - you say. But the problem is more than C-lines. All intervals with limiters have the same problem. Only in the standard library there are istream_iterator, istreambuf_iterator, regex_iterator and regex_token_iterator, and all have signal soothers, and all of them are pushed into the library by the same method that I demonstrated earlier.

But didn’t you have such a case, that you had a desire to call a certain general algorithm, but you couldn’t do this because you wanted to leave the middle of the cycle according to some condition? Imagine that you can construct an interval with a delimiter, which will have both such a condition and an end iterator. And you can pass this interval to an algorithm that stops either when the condition is met or when the end is reached. Voila But this type of iterator would have to be crammed just like everyone else, and the algorithms that require something more than direct iterators would not work, since the signal iterator cannot be reduced ...

What to do

What I am following is this: the abstraction of the interval with a pair of iterators, familiar to all of us, was done with the aim of simple construction of abstractions. But, in the case of intervals with limiters, it turned out to be difficult to build abstractions. In addition, these intervals must be modeled by less powerful concepts than they might otherwise be. What to do? I have a solution, but we have not reached it yet - at first I want to speculate about infinite intervals. So stay with us.

The first five to get to the end of the article a small gift: 30% discount on a ticket to the conference on the promotional code Ranges

Continued:
Intervals in C ++, part 2: Infinite intervals
Intervals in C ++, part 3: we represent incrementors (Iterable)
Intervals in C ++, part 4: to infinity and beyond

Source: https://habr.com/ru/post/264201/

All Articles