
On the question of bitset

“Isn't it time, my friends, for us to take a swing at William, you know, our Shakespeare?”




I recently read a post about a custom keyboard and once again thought that it would be nice to write a reference implementation of a keyboard (that is, one you would not be ashamed to sign with your own name and put on public display). This thought has come to me not for the first time, but it always stalls at the first stage, reading the source information, because I want that stage to be easily customizable, convenient to use, versatile and efficient, and I do not like the answer "choose any two."

A necessary note: I see the keyboard implementation as a set of layers: 0 - event detection, 1 - data reading, 2 - cleaning and storing the data, 3 - generating messages, 4 - recoding, and so on. For layer 1, and the layer 0 associated with it, the most promising approach seems to me to be Anton Chizhov's templates for working with MCU pins (based on the Loki class), and maybe someday the result will not be a shame to share, but definitely not today. For now I have been thinking about layer 2.

Let us formulate the task: we need to store information about a fixed set of switches, each of which takes one of two predefined values, "closed" and "not closed". The natural candidates are boolean variables and the bitset library as a way to store the set. Since efficiency here is a categorical imperative, it is desirable to evaluate the implementation of the module from this point of view.
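To make the formulation concrete, here is a minimal sketch of such a store built directly on std::bitset; the switch count and the helper names are illustrative assumptions, not taken from any real keyboard code:

```cpp
// A minimal sketch of layer 2 as formulated above; the sizes and names are
// illustrative assumptions, not part of any actual project.
#include <bitset>
#include <cstddef>

constexpr std::size_t kSwitchCount = 64;       // hypothetical matrix size
std::bitset<kSwitchCount> g_switches;          // zero-initialized before main()

inline void on_switch_closed(std::size_t n) { g_switches.set(n); }
inline void on_switch_opened(std::size_t n) { g_switches.reset(n); }
inline bool is_closed(std::size_t n)        { return g_switches.test(n); }
```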
My first thought was to look at the source code, expecting that everything would become clear at once, but after a brief acquaintance with it I realized that studying other people's templates is not very interesting (and not very productive). Besides, the source does not give an accurate assessment of the implementation's efficiency, since that ultimately depends on the compiler. In the end the source text still had to be studied, otherwise making changes to it becomes a very long process (if, of course, we are interested in achieving a certain result), but that is a topic for a separate post.

Therefore the "black box" method was adopted: we feed various code fragments to the input and examine the generated assembly. Unfortunately, the beloved godbolt site cannot be used for the familiar AVR architecture, since that implementation does not include the library under study. You could, of course, drag it in by hand, but there is no guarantee that it would be the correct source code.

Therefore we will look at a different architecture. x51 is not available for the gcc compiler, x86 I never liked, ARM does not have an assembler that is convenient (for a human) to read, MIPS is very specific and not too common, and all sorts of SPARCs are even worse (all right, all right, I am not trying to offend anyone's favorite architecture, let us just say "no better"). But there is a great candidate: MSP430, which is based on the crystal-clear and elegant PDP architecture, and TI could not spoil it much (although the guys tried). The bitset library is available for this architecture, so the study can begin.

We begin, as trivial as it sounds, at the beginning, that is, with the declaration of the set. We immediately see that storage is allocated in four-byte words, despite the fact that the natural unit on this architecture is the two-byte word, and quite convenient and efficient work with bytes is also provided, which leads to strange incidents. One can understand the author: a 32-bit integer must be implemented everywhere and relying on it is quite natural, but in this case 8-bit storage would be preferable, and for the AVR 8-bit would be the only sensible choice.

An interesting question is how the architecture's natural word size can be determined at compile time; it will be necessary to try uint_fast8_t. We note this possible optimization and move on.
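One possible probe in this direction (an idea to try, not a verified recipe for any particular toolchain) is to let the compiler's "fast" integer types hint at a comfortable storage width instead of hard-coding 32 bits:

```cpp
// A possible probe in the spirit of the note above: derive the storage word
// width from the toolchain's "fast" type rather than assuming 32 bits.
#include <climits>
#include <cstdint>

using storage_t = std::uint_fast8_t;  // the compiler picks the width it prefers
constexpr unsigned kBitsPerWord = CHAR_BIT * sizeof(storage_t);

static_assert(kBitsPerWord >= 8, "storage word is at least a byte");
```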

Besides memory allocation, initialization is of interest. For global sets it is performed in the standard way, zeroing before main is called; for local ones also in the standard way, that is, not at all, unless an initial value is explicitly specified. As always, if there is an opportunity to declare a static set with an initial value outside a function, it should be used, so as not to get extra flags and not waste execution time on them. But we did not expect any revelations here; we just checked the general rules.

Let us start with modification of the set, for which we have square brackets and the set and reset methods. For setting element n in set M we can expect to see something like:

 M[n / w] |= (1 << (n % w)) 

where w is the number of bits in the underlying storage element. For this architecture, a statically known n (for example, 4) and optimization enabled, this leads to code like:

 bis.w #0x0010, m 

Indeed, we observe exactly such code in the right half of the window, and it is unlikely that anyone would seriously claim that a more efficient solution is possible. But that holds only for the stated conditions; for arbitrary n the picture changes completely. For the methods we observe validation of the argument with generation of the corresponding exception, and for the square brackets we see the argument restricted to the allowable range by a bit mask, with quite predictable undefined behavior otherwise; both cases fully comply with the documentation. Negative values are handled quite correctly, since indices are treated as unsigned numbers.

Note also that the value assigned to an element of the set can be not only 0 or 1, as one would expect, but any integer, to which the rule "What is one? Anything that is not zero" is applied. This is quite logical, but poorly reflected in the documentation. It is done somewhat oddly; boolean values would have been more natural. We tick the box and move on.
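The observations above can be condensed into a small illustration; it relies only on the documented bitset semantics, not on any particular implementation:

```cpp
// Checked versus unchecked access, and assignment of arbitrary integers.
#include <bitset>
#include <cstddef>

std::bitset<16> m;

void demo(std::size_t n) {
    m.set(4);         // index known to be valid: simply sets bit 4
    m.set(n);         // the method checks n and throws std::out_of_range if n >= 16
    m[n & 0x0F] = 1;  // operator[] performs no check, an out-of-range n is undefined,
                      // so here the index is masked to the valid range by hand
    m[3] = 42;        // any non-zero value is treated as 1: bit 3 becomes true
}
```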

Comparing the code generated for a statically unknown element number shows that the efficiency of both variants ([] and the methods) is very close, and low, because a certain subroutine from the standard library is called to compute (1 << n), and this subroutine shifts the 32-bit number 0x00000001 held in two registers. We cannot look at its source code, but the very fact of the call leads to sad thoughts. The point is that the architecture under consideration has no shift to the left (nor to the right) by an arbitrary number of positions, unlike all (well, many) ARMs. There is a shift by 1 position (it would be strange if there were none at all), there are shifts by 2, 3 or 4 positions (but by a number fixed in the instruction, not by a parameter), and there is the REPT prefix (but its speed leaves much to be desired). You can obtain the bit mask for a given bit number (important: a mask with exactly one set bit) in relatively short time by tricks like swapping nibbles, but that would be highly architecture-dependent code and, generally, it is better not to do so.

Therefore a universal and fast method is to store the bit masks in an array and index into it, which is also very efficient on this architecture; the code then looks like this:

 M[n / w] |= BitArray[n % w] 

which produces assembly like:

  bis.b BitArray(r0),M(r1) 

Since we are talking about templates and w is a multiple of the byte size, the division operations here are implemented very efficiently. Note the clear advantage of the minimal storage element: with a byte as the storage element an 8-byte mask array is needed, with a two-byte word 2 * 16 = 32 bytes, and with a long word 4 * 32 = 128 bytes to store all 32 bit masks; why pay more if the result does not change? We remember another possible optimization and move ahead.
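For comparison, here is a sketch of the table-of-masks variant with byte-sized storage; the names mirror the fragment above and the sizes are illustrative:

```cpp
// A sketch of the mask-table variant discussed above, with byte-sized storage.
// The mask is fetched by indexing instead of being computed by a run-time shift.
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSwitchCount = 64;
std::uint8_t M[kSwitchCount / 8];                    // 8 bits per storage byte

constexpr std::uint8_t BitArray[8] = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80   // 8 one-byte masks = 8 bytes
};

inline void set_bit(std::size_t n)   { M[n / 8] |= BitArray[n % 8]; }
inline void reset_bit(std::size_t n) { M[n / 8] &= static_cast<std::uint8_t>(~BitArray[n % 8]); }
inline bool test_bit(std::size_t n)  { return (M[n / 8] & BitArray[n % 8]) != 0; }
```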

Note one more fact: much more efficient implementations of operations on an element of the set are possible if the target architecture contains a bit-addressable memory area (this is where the earlier-rejected ARM comes to mind), where setting an element turns into BitSetAddr[n] = 1, which translates into a single assembler instruction for constant n; but there the shifts are already quite efficient, so such an optimization would be rather redundant, especially considering its limitations. In principle, a similar bit-addressable area exists in both the x51 and the AVR, but the efficient instructions there work only with constant element numbers, and the general case is not so good (frankly, bad).

Now let us take a close look at the resulting code and note the artifacts associated with storing the set in double words. For an operation that modifies an element of the set, the compiler generates a sequence of instructions that reads the corresponding double word from memory into 2 registers (I remind you that the registers are 16-bit), modifies them and writes the result back to memory. If we change only one element, the operation mask contains exactly one of the 32 possible set bits, the rest being zeros. With a statically known element number, operations that cannot change the result should be eliminated at the optimization stage. As a rule this happens, but for some operands something goes wrong and artifacts of the following form seep into the final code:

 bic #0,r0 

which looks especially funny once you notice that this register is not written back to memory, although it is read. Strictly speaking, the results of optimization are not regulated anywhere, so they may be anything and there is nothing to be offended about, but still "it does not come out neatly". We cannot influence this process directly, unless we seriously consider patching the compiler source, so we will go around it: we will help the optimizer by making its task easier.

By the way, I still cannot find an answer to the question of whether it is possible in C++, at the macro or template level, to select a different implementation for a parameter that is known at compile time versus one that is not. If anyone knows the way of the samurai, tell me in the comments; I tried constexpr and it did not work.
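I do not know a clean general answer either; one workaround, which at least lets the caller state explicitly that the index is a compile-time constant, is to provide a separate overload with the index as a template parameter (a sketch under that assumption, not an answer to the original question):

```cpp
// Two entry points: a template for indices known at compile time and a plain
// function for run-time indices. Names and sizes are illustrative.
#include <cstddef>
#include <cstdint>

std::uint8_t M[8];  // 64 bits of storage, as in the sketch above
const std::uint8_t BitArray[8] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };

// Index known at compile time: the mask becomes a constant folded into the code.
template <std::size_t N>
inline void set_bit() {
    M[N / 8] |= static_cast<std::uint8_t>(1u << (N % 8));
}

// Index known only at run time: fall back to the table lookup.
inline void set_bit(std::size_t n) {
    M[n / 8] |= BitArray[n % 8];
}

// Usage: set_bit<4>() when the index is a constant, set_bit(i) otherwise.
```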

Continuing the study, we find that the compiler reorders and merges accesses to the set (of course, if optimization is enabled), that is, the order in which various elements are set or cleared is absolutely not guaranteed and is in no way connected with the order of the statements in the source code. I failed to create a volatile set; maybe I am doing something wrong? As with any local optimization, a call to an external function forces the compiler to complete all the pending operations on a global array, but this is too heavy a solution and does not help with local ones. Well, there is probably nothing to be done; you just have to take this feature into account and not use sets to transfer information between threads over serial interfaces (that is, their software counterparts).
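As for the volatile attempt, my guess (a reading of the language rules, not a diagnosis of the exact failure) is that bitset's member functions are not volatile-qualified, so they simply cannot be called on a volatile object:

```cpp
// Why a volatile bitset is of little use: the object can be declared volatile,
// but none of its member functions can be called through that object.
#include <bitset>

volatile std::bitset<8> flags;   // the declaration itself is accepted

void poll() {
    // flags.set(1);    // does not compile: set() is not a volatile member function
    // flags[1] = 1;    // does not compile for the same reason
}
```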

A general conclusion can be drawn: the use of bitset in its current form cannot be recommended for architectures with limited resources, due to excessive costs in terms of both memory and execution time. A possible modification, which takes into account everything discussed in this post, lies on GitHub and anyone is welcome to use it. The story of how this mod was created will be posted on Habr soon; there were funny moments.

In conclusion, a small note: a "head-on" implementation of the data store, even with the optimized version of the set, will require N / 8 bytes of data memory (128 switches will require 16 bytes) and, although an operation will take O(1) time, the constant factor will be many (even up to 10 or more) MCU cycles. Therefore, taking into account the requirements of the task and introducing certain restrictions, we can offer an alternative implementation of the data storage.

If we accept that no more than one switch can be closed at any given time (we ignore all the others until the currently pressed button opens), we can easily get by with a single byte (provided there are no more than 256 switches), and writing/reading will take O(1) processor cycles with a quite modest constant factor.
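A minimal sketch of this single-byte variant, under the stated restriction, might look as follows (the names and the sentinel choice are mine; the sentinel costs one index, so strictly speaking up to 255 switches are representable here):

```cpp
// Store only the index of the single closed switch, or a sentinel if none is closed.
#include <cstdint>

constexpr std::uint8_t kNoneClosed = 0xFF;   // sentinel: no switch is closed
std::uint8_t g_closed = kNoneClosed;         // index of the closed switch, if any

inline void on_closed(std::uint8_t n) {
    if (g_closed == kNoneClosed) g_closed = n;   // ignore others while one is held
}
inline void on_opened(std::uint8_t n) {
    if (g_closed == n) g_closed = kNoneClosed;
}
inline bool is_closed(std::uint8_t n) { return g_closed == n; }
```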

This approach is easy to extend to store information about n simultaneously closed switches, but n should not be made too large, since the required memory grows and the execution time of the access operations grows linearly with the number of elements in the set, although it remains O(1) with respect to the number of switches. This increase in time could be reduced to O(log2(n)) by arranging the storage as a binary tree, but for small n this is not so important, and it is doubtful that the more complicated index calculation in the search would compensate for the reduction in the number of simple iterations. The disadvantages of this implementation include a possible failure to record an element of the set, which should be handled in the calling program (we reject the variant with a growing buffer immediately and with indignation: that is not for embedded solutions).
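As a rough sketch of the shape such an extension might take (the author's actual implementation is promised separately; names and the limit are illustrative, and the failure-to-write case is reported to the caller):

```cpp
// At most kMaxClosed switches closed at once; operations scan a small fixed
// array, so the time grows linearly with kMaxClosed but stays O(1) with
// respect to the total number of switches.
#include <cstdint>

constexpr std::uint8_t kNoneClosed = 0xFF;
constexpr unsigned kMaxClosed = 4;          // illustrative limit
std::uint8_t g_closed[kMaxClosed] = { kNoneClosed, kNoneClosed, kNoneClosed, kNoneClosed };

inline bool on_closed(std::uint8_t n) {     // false = no free slot, caller decides what to do
    for (auto &slot : g_closed)
        if (slot == kNoneClosed) { slot = n; return true; }
    return false;
}
inline void on_opened(std::uint8_t n) {
    for (auto &slot : g_closed)
        if (slot == n) { slot = kNoneClosed; return; }
}
inline bool is_closed(std::uint8_t n) {
    for (auto slot : g_closed)
        if (slot == n) return true;
    return false;
}
```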

The implementation of this approach will be given there.

Source: https://habr.com/ru/post/447688/

