How to blink 4 LEDs on Cortex-M using C++17, a tuple, and a perverted imagination

Good health to everyone!

When I teach university students to develop embedded software for microcontrollers, I use C++, and sometimes I give the most interested ones all sorts of extra tasks to identify the particularly ~~sick and~~ talented students.

This time, such students were given the task of blinking 4 LEDs using C++17 and the standard C++ library only, without pulling in additional libraries such as CMSIS and its header files that describe the register structures, and so on ... The winner is the one whose code occupies the least ROM and consumes the least RAM. Compiler optimization must be no higher than Medium. The compiler is IAR 8.40.1.
The winner ~~goes to the Canaries~~ gets a 5 for the exam.

I had not solved this task myself before, so I will describe how the students solved it and what I came up with. I warn you right away that such code is unlikely to be usable in real applications, which is why I placed this post in the "Abnormal programming" section, although who knows.

Problem statement

There are 4 LEDs on pins GPIOA.5, GPIOC.5, GPIOC.8, and GPIOC.9. They need to blink. To have something to compare against, we took this code written in C:

    void delay() {
      for (int i = 0; i < 1000000; ++i) {
      }
    }

    int main() {
      for (;;) {
        GPIOA->ODR ^= (1 << 5);
        GPIOC->ODR ^= (1 << 5);
        GPIOC->ODR ^= (1 << 8);
        GPIOC->ODR ^= (1 << 9);
        delay();
      }
      return 0;
    }

The delay() function here is purely formal: it is an ordinary loop, and it is off-limits for optimization.
It is assumed that the ports are already configured as outputs and that their clocks are enabled.
I will also note that bit-banding was not used, to keep the code portable.

This code takes 8 bytes of stack and 256 bytes of ROM at Medium optimization:

255 bytes of readonly code memory
1 byte of readonly data memory
8 bytes of readwrite data memory

The ROM figure is as high as 255 bytes because part of the memory goes to the interrupt vector table, IAR's calls that initialize the floating-point unit, all sorts of debugging functions, and the __low_level_init function, where the ports themselves were configured.

So, the full requirements:

The main() function should contain as little code as possible.
Macros are not allowed.
The compiler is IAR 8.40.1 with C++17 support.
CMSIS header files, such as #include "stm32f411xe.h", are not allowed.
The __forceinline directive may be used for inline functions.
Compiler optimization level: Medium.

Student solution

There were several solutions in all; I will show only one ... it is not optimal, but I liked it.

Since header files cannot be used, the students first made a Gpio class that refers to the port registers at their addresses. To do this they use a structure overlay; most likely the idea was taken from the article "Structure overlay":

    class Gpio {
    public:
      __forceinline inline void Toggle(const std::uint8_t bitNum) volatile {
        Odr ^= (1U << bitNum);  // flip the bit with the given number
      }

    private:
      volatile std::uint32_t Moder;
      volatile std::uint32_t Otyper;
      volatile std::uint32_t Ospeedr;
      volatile std::uint32_t Pupdr;
      volatile std::uint32_t Idr;
      volatile std::uint32_t Odr;
    };
    // static_assert(sizeof(Gpio) == sizeof(std::uint32_t) * 6);

As you can see, they first defined the Gpio class with members that must sit at the addresses of the corresponding registers, plus a method that toggles the state by pin number.
Then they defined the GpioPin structure, which contains a pointer to Gpio and the pin number:

    struct GpioPin {
      volatile Gpio* port;
      std::uint32_t pinNum;
    };
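The overlay only works if every field lands at the same offset as the hardware register it stands for. Here is a minimal host-side sketch of that check, assuming the six-register layout shown above. GpioLayout and OdrAddress are hypothetical names introduced for illustration, and the fields are public so that offsetof can see them.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the students' Gpio class with public fields,
// so that the layout can be checked at compile time.
struct GpioLayout {
  volatile std::uint32_t Moder;
  volatile std::uint32_t Otyper;
  volatile std::uint32_t Ospeedr;
  volatile std::uint32_t Pupdr;
  volatile std::uint32_t Idr;
  volatile std::uint32_t Odr;
};

// Six 32-bit registers, no padding expected.
static_assert(sizeof(GpioLayout) == sizeof(std::uint32_t) * 6);
// ODR is the sixth register, so it sits 20 bytes from the port base.
static_assert(offsetof(GpioLayout, Odr) == 20);

// Address the overlay ends up using for ODR, given a port base address.
constexpr std::uint32_t OdrAddress(std::uint32_t base) {
  return base + static_cast<std::uint32_t>(offsetof(GpioLayout, Odr));
}
```

With the base addresses used in this article, OdrAddress(0x4002'0000) gives 0x4002'0014 for port A and OdrAddress(0x4002'0800) gives 0x4002'0814 for port C.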

Then they made an array of LEDs that sit on specific port pins and walked through it, calling the Toggle() method for each LED:

    const GpioPin leds[] = {
      {reinterpret_cast<volatile Gpio*>(GpioaBaseAddr), 5},
      {reinterpret_cast<volatile Gpio*>(GpiocBaseAddr), 5},
      {reinterpret_cast<volatile Gpio*>(GpiocBaseAddr), 8},
      {reinterpret_cast<volatile Gpio*>(GpiocBaseAddr), 9}
    };

    struct LedsDriver {
      __forceinline static inline void ToggleAll() {
        for (auto& it : leds) {
          it.port->Toggle(it.pinNum);
        }
      }
    };

And here is the whole code:

    #include <cstdint>

    constexpr std::uint32_t GpioaBaseAddr = 0x4002'0000;
    constexpr std::uint32_t GpiocBaseAddr = 0x4002'0800;

    class Gpio {
    public:
      __forceinline inline void Toggle(const std::uint8_t bitNum) volatile {
        Odr ^= (1U << bitNum);  // flip the bit with the given number
      }

    private:
      volatile std::uint32_t Moder;
      volatile std::uint32_t Otyper;
      volatile std::uint32_t Ospeedr;
      volatile std::uint32_t Pupdr;
      volatile std::uint32_t Idr;
      volatile std::uint32_t Odr;
    };
    // static_assert(sizeof(Gpio) == sizeof(std::uint32_t) * 6);

    struct GpioPin {
      volatile Gpio* port;
      std::uint32_t pinNum;
    };

    const GpioPin leds[] = {
      {reinterpret_cast<volatile Gpio*>(GpioaBaseAddr), 5},
      {reinterpret_cast<volatile Gpio*>(GpiocBaseAddr), 5},
      {reinterpret_cast<volatile Gpio*>(GpiocBaseAddr), 8},
      {reinterpret_cast<volatile Gpio*>(GpiocBaseAddr), 9}
    };

    struct LedsDriver {
      __forceinline static inline void ToggleAll() {
        for (auto& it : leds) {
          it.port->Toggle(it.pinNum);
        }
      }
    };

    // delay() is the same as in the C version
    int main() {
      for (;;) {
        LedsDriver::ToggleAll();
        delay();
      }
      return 0;
    }

Statistics for their code at Medium optimization:

275 bytes of readonly code memory
1 byte of readonly data memory
8 bytes of readwrite data memory

A good solution, but it takes a lot of memory :)

My solution

Naturally, I decided not to look for easy ways and to take the matter seriously :).
The LEDs sit on different ports and different pins. The first thing to do is make a Port class, but to get rid of pointers and variables that occupy RAM, static methods are needed. The port class might look like this:

    template <std::uint32_t addr>
    struct Port {
      // port methods will go here
    };

The template parameter is the port address. In the header stm32f411xe.h, for example, the address of port A is defined as GPIOA_BASE. But we are forbidden to use headers, so we simply define our own constants. As a result, the class can be used like this:

    constexpr std::uint32_t GpioaBaseAddr = 0x4002'0000;
    constexpr std::uint32_t GpiocBaseAddr = 0x4002'0800;

    using PortA = Port<GpioaBaseAddr>;
    using PortC = Port<GpiocBaseAddr>;

To blink, we need a Toggle(const std::uint8_t bit) method that flips the required bit with an exclusive OR. The method should be static; add it to the class:

    template <std::uint32_t addr>
    struct Port {
      // __forceinline asks the compiler to always inline this function
      __forceinline inline static void Toggle(const std::uint8_t bitNum) {
        // addr + 20 is the address of the ODR register
        *reinterpret_cast<std::uint32_t*>(addr + 20) ^= (1 << bitNum);
      }
    };
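The address 0x4002'0014 cannot be written on a PC, so to try the static Toggle() idea off-target, here is a sketch that swaps the fixed address for a reference to an ordinary variable. MockPort, MockPortA, fakeOdrA, and CheckToggle are hypothetical names and not part of the original code. The stand-in register is also declared volatile, which the article's code does not do.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the ODR register of one port (hypothetical, host-only).
inline volatile std::uint32_t fakeOdrA = 0;

// Same shape as Port<addr>, but parameterized by a reference to a variable
// with static storage duration instead of a hardware address.
template <volatile std::uint32_t& odr>
struct MockPort {
  static void Toggle(const std::uint8_t bitNum) {
    odr = odr ^ (1U << bitNum);  // flip exactly one bit
  }
};

using MockPortA = MockPort<fakeOdrA>;

// Toggling the same bit twice must restore the original value.
inline void CheckToggle() {
  MockPortA::Toggle(5);
  assert(fakeOdrA == (1U << 5));
  MockPortA::Toggle(5);
  assert(fakeOdrA == 0U);
}
```

As with Port<addr>, no object of MockPort is ever created, so the type itself costs no RAM; only the stand-in register does.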

Excellent: Port<> exists and can toggle a pin. An LED sits on a specific pin, so it is logical to make a Pin class that takes Port<> and the pin number as template parameters. Since Port<> is a template, i.e. a different type for each port, we can only pass it as a generic type T.

    template <typename T, std::uint8_t pinNum>
    struct Pin {
      __forceinline inline static void Toggle() {
        T::Toggle(pinNum);
      }
    };

The bad part is that any nonsense type T with a Toggle() method can be passed in and it will work, although only Port<> types are supposed to be passed. To protect against this, let Port<> inherit from a base class PortBase, and check in the template that the type passed in really derives from PortBase. We get the following:

    constexpr std::uint32_t OdrAddrShift = 20U;

    struct PortBase {
    };

    template <std::uint32_t addr>
    struct Port : PortBase {
      __forceinline inline static void Toggle(const std::uint8_t bit) {
        *reinterpret_cast<std::uint32_t*>(addr + OdrAddrShift) ^= (1 << bit);
      }
    };

    // Pin can be instantiated only for types derived from PortBase
    template <typename T, std::uint8_t pinNum,
              class = typename std::enable_if_t<std::is_base_of<PortBase, T>::value>>
    struct Pin {
      __forceinline inline static void Toggle() {
        T::Toggle(pinNum);
      }
    };

Now the template is instantiated only if our class has a base class PortBase .
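What the constraint accepts and rejects can be checked on its own. Below is a sketch assuming the PortBase, Port, and Pin definitions above; NotAPort and IsValidPin are hypothetical names added for illustration, and Toggle() is left out because only the type check matters here.

```cpp
#include <cstdint>
#include <type_traits>

struct PortBase {};

template <std::uint32_t addr>
struct Port : PortBase {};

// A type that has nothing to do with PortBase (hypothetical, for the demo).
struct NotAPort {};

template <typename T, std::uint8_t pinNum,
          class = std::enable_if_t<std::is_base_of<PortBase, T>::value>>
struct Pin {};

// Detects whether Pin<T, 0> can be named at all.
template <typename T, typename = void>
struct IsValidPin : std::false_type {};

template <typename T>
struct IsValidPin<T, std::void_t<Pin<T, 0>>> : std::true_type {};

static_assert(IsValidPin<Port<0x4002'0000>>::value);
static_assert(!IsValidPin<NotAPort>::value);
```

Writing Pin<NotAPort, 5> directly is a hard compile error, which is exactly the protection the article is after.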
In theory, these classes can already be used. Let's see what we get without optimization:

    using PortA = Port<GpioaBaseAddr>;
    using PortC = Port<GpiocBaseAddr>;

    using Led1 = Pin<PortA, 5>;
    using Led2 = Pin<PortC, 5>;
    using Led3 = Pin<PortC, 8>;
    using Led4 = Pin<PortC, 9>;

    int main() {
      for (;;) {
        Led1::Toggle();
        Led2::Toggle();
        Led3::Toggle();
        Led4::Toggle();
        delay();
      }
      return 0;
    }

271 bytes of readonly code memory
1 byte of readonly data memory
24 bytes of readwrite data memory

Where did the extra 16 bytes of RAM and 16 bytes of ROM come from? They come from passing the bit parameter to Port's Toggle(const std::uint8_t bit) function: on entering main, the compiler saves 4 additional registers on the stack, uses them to hold the pin number for each Pin and pass it as the parameter, and restores them from the stack on leaving main. In essence this is completely useless work, since the functions are inlined, but the compiler acts in full compliance with the standard.
You can get rid of this by removing the port class altogether, passing the port address as a template parameter of the Pin class, and computing the ODR register address inside the Toggle() method:

    constexpr std::uint32_t OdrAddrShift = 20U;

    template <std::uint32_t addr, std::uint8_t pinNum>
    struct Pin {
      __forceinline inline static void Toggle() {
        *reinterpret_cast<std::uint32_t*>(addr + OdrAddrShift) ^= (1 << pinNum);
      }
    };

    using Led1 = Pin<GpioaBaseAddr, 5>;

But this does not look very nice or user-friendly. So let's hope the compiler removes the unnecessary register saving with a little optimization.

Set the optimization to Medium and look at the result:

251 bytes of readonly code memory
1 byte of readonly data memory
8 bytes of readwrite data memory

Wow, wow, wow ... that is 4 bytes less than the C code, which had:

255 bytes of readonly code memory
1 byte of readonly data memory
8 bytes of readwrite data memory

How can this be? Let's look at the assembly in the debugger for the C++ code (left) and the C code (right):

It can be seen that, first, the compiler inlined all the functions, so there are no calls at all, and second, it optimized register usage. In the C case, the compiler uses both R1 and R2 to hold the port addresses and performs extra operations after every bit toggle (it stores the address in a register, then in R1, then in R2). In the C++ case it uses only R1, and since the last 3 toggles all go to port C, there is no need to store the same port C address in the register again. As a result, 2 instructions and 4 bytes are saved.

Here it is, the miracle of modern compilers :) Well, okay. In principle we could stop here, but let's go further. I do not think anything else can be optimized, although I may be wrong; if you have ideas, write them in the comments. The amount of code in main(), however, is something we can work on.

Now I want all the LEDs to live in some container, so that a single method call toggles them all ... Something like this:

    int main() {
      for (;;) {
        LedsContainer::ToggleAll();
        delay();
      }
      return 0;
    }

We will not simply paste the toggling of 4 LEDs into the LedsContainer::ToggleAll function, because that is not interesting :). We want to put the LEDs in a container, then walk over them and call the Toggle() method on each one.

The students used an array to store pointers to the LEDs. But my LEDs have different types, for example Pin<PortA, 5> and Pin<PortC, 5>, and pointers to different types cannot be stored in one array. I could make a virtual base class for all Pin types, but then a virtual function table would appear and I would not be able to beat the students.
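The cost difference between the two approaches can be seen from the types alone. A small sketch, assuming nothing beyond the standard library; VirtualLed and StaticLed are hypothetical names, and StaticLed only mimics the shape of Pin<Port, pinNum>:

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical run-time polymorphic variant: every object of a derived
// class carries a hidden pointer to a virtual function table.
struct VirtualLed {
  virtual void Toggle() = 0;
  virtual ~VirtualLed() = default;
};

// Compile-time variant in the spirit of Pin<Port, pinNum>: no data members,
// everything is encoded in the type.
template <std::uint32_t addr, std::uint8_t pinNum>
struct StaticLed {
  static constexpr std::uint32_t mask = 1U << pinNum;
};

// An empty type needs no per-object storage of its own.
static_assert(std::is_empty_v<StaticLed<0x4002'0000, 5>>);
// A polymorphic type is never empty.
static_assert(!std::is_empty_v<VirtualLed>);
```

Because each StaticLed instantiation is a distinct type, such objects cannot share an array, which is what motivates a heterogeneous container.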

Therefore, we will use a tuple. It can store objects of different types. In our case it looks like this:

    class LedsContainer {
    private:
      constexpr static auto records = std::make_tuple(
        Pin<PortA, 5>{},
        Pin<PortC, 5>{},
        Pin<PortC, 8>{},
        Pin<PortC, 9>{}
      );
      using tRecordsTuple = decltype(records);
    };

Excellent, there is a container, and it stores all the LEDs. Now add the ToggleAll() method to it:

    class LedsContainer {
    public:
      __forceinline static inline void ToggleAll() {
        // walk over the tuple elements and toggle each LED here
      }

    private:
      constexpr static auto records = std::make_tuple(
        Pin<PortA, 5>{},
        Pin<PortC, 5>{},
        Pin<PortC, 8>{},
        Pin<PortC, 9>{}
      );
      using tRecordsTuple = decltype(records);
    };

A tuple cannot simply be walked in a loop, because an element of a tuple can only be selected at compile time. Tuple elements are accessed with the get function template. That is, if we write std::get<0>(records).Toggle(), the Toggle() method is called for the object of class Pin<PortA, 5>; if we write std::get<1>(records).Toggle(), the Toggle() method is called for the object of class Pin<PortC, 5>, and so on ...
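To make the compile-time nature of std::get concrete, here is a small host-side sketch. LedA, LedB, LedC, items, and IdAt are hypothetical names; the three empty types simply stand in for different Pin<...> instantiations.

```cpp
#include <cstddef>
#include <tuple>

// Three unrelated empty types stand in for different Pin<...> types.
struct LedA { static constexpr int id = 10; };
struct LedB { static constexpr int id = 20; };
struct LedC { static constexpr int id = 30; };

constexpr auto items = std::make_tuple(LedA{}, LedB{}, LedC{});
using Items = decltype(items);

// The number of elements is also known at compile time.
static_assert(std::tuple_size_v<Items> == 3);

// The index is a template argument, so it must be a compile-time constant;
// a run-time loop variable cannot be used here.
template <std::size_t I>
constexpr int IdAt() {
  return std::get<I>(items).id;
}

static_assert(IdAt<0>() == 10);
static_assert(IdAt<2>() == 30);
```

An out-of-range index such as IdAt<3>() is rejected by the compiler rather than failing at run time.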

One could ~~rub the students' noses in it~~ simply write it like this:

    __forceinline static inline void ToggleAll() {
      std::get<0>(records).Toggle();
      std::get<1>(records).Toggle();
      std::get<2>(records).Toggle();
      std::get<3>(records).Toggle();
    }

But we do not want to burden the programmer who will maintain this code with extra work that wastes his company's resources when, for example, another LED appears. The code would then have to be changed in two places, in the tuple and in this method, which is not good, and the company owner will not be pleased. Therefore, we walk the tuple using helper methods:

    class LedsContainer {
      friend int main();

    public:
      __forceinline static inline void ToggleAll() {
        // build the index sequence 0, 1, 2, 3 (one index per tuple element)
        // and hand it to visit()
        visit(std::make_index_sequence<std::tuple_size<tRecordsTuple>::value>());
      }

    private:
      __forceinline template <std::size_t... index>
      static inline void visit(std::index_sequence<index...>) {
        // expands into one get<i>(records).Toggle() call per index; here the
        // calls end up in the order get<3>, get<2>, get<1>, get<0>
        Pass((std::get<index>(records).Toggle(), true)...);
      }

      __forceinline template <typename... Args>
      static void inline Pass(Args...) {
        // empty: exists only so that the parameter pack can be expanded
      }

      constexpr static auto records = std::make_tuple(
        Pin<PortA, 5>{},
        Pin<PortC, 5>{},
        Pin<PortC, 8>{},
        Pin<PortC, 9>{}
      );
      using tRecordsTuple = decltype(records);
    };

It looks scary, but I warned at the beginning of the article that the ~~crazy~~ method is not quite ordinary ...
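One detail is easy to miss: the order in which function arguments are evaluated is unspecified in C++, so the Pass(...) trick does not promise any particular toggle order. If the order matters, a C++17 fold over the comma operator guarantees left-to-right evaluation. Below is a host-side sketch of that variant; Log, Step, steps, VisitInOrder, and ToggleAllInOrder are hypothetical names, and the calls are recorded in a vector only so that the order can be observed on a PC.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// Records the order of calls (hypothetical host-only helper).
inline std::vector<int>& Log() {
  static std::vector<int> log;
  return log;
}

// Stand-in for Pin<...>: an empty type with a static Toggle().
template <int id>
struct Step {
  static void Toggle() { Log().push_back(id); }
};

constexpr auto steps =
    std::make_tuple(Step<0>{}, Step<1>{}, Step<2>{}, Step<3>{});

template <std::size_t... index>
void VisitInOrder(std::index_sequence<index...>) {
  // Fold over the comma operator: operands are evaluated left to right.
  (std::get<index>(steps).Toggle(), ...);
}

inline void ToggleAllInOrder() {
  VisitInOrder(
      std::make_index_sequence<std::tuple_size_v<decltype(steps)>>{});
}
```

After one call to ToggleAllInOrder() the log holds 0, 1, 2, 3, in that order, regardless of how the compiler orders function arguments.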

At compile time, all this magic does literally the following:

    // this call:
    LedsContainer::ToggleAll();

    // expands into 4 calls:
    Pin<PortC, 9>().Toggle();
    Pin<PortC, 8>().Toggle();
    Pin<PortC, 5>().Toggle();
    Pin<PortA, 5>().Toggle();

    // and since Toggle() is inline, these turn into:
    *reinterpret_cast<std::uint32_t*>(0x40020814) ^= (1 << 9);
    *reinterpret_cast<std::uint32_t*>(0x40020814) ^= (1 << 8);
    *reinterpret_cast<std::uint32_t*>(0x40020814) ^= (1 << 5);
    *reinterpret_cast<std::uint32_t*>(0x40020014) ^= (1 << 5);
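The addresses and masks in that expansion can be double-checked at compile time without touching any hardware. A sketch, assuming the base addresses and the ODR offset used in this article; ToggleOp and Expand are hypothetical helpers introduced only for the check.

```cpp
#include <cstdint>

constexpr std::uint32_t GpioaBaseAddr = 0x4002'0000;
constexpr std::uint32_t GpiocBaseAddr = 0x4002'0800;
constexpr std::uint32_t OdrAddrShift = 20U;

// Hypothetical helper: what one Pin<Port<addr>, pinNum>::Toggle() boils
// down to after inlining.
struct ToggleOp {
  std::uint32_t odrAddress;  // address that gets written
  std::uint32_t mask;        // value XORed into it
};

constexpr ToggleOp Expand(std::uint32_t portBase, std::uint8_t pinNum) {
  return ToggleOp{portBase + OdrAddrShift, 1U << pinNum};
}

static_assert(Expand(GpiocBaseAddr, 9).odrAddress == 0x4002'0814U);
static_assert(Expand(GpiocBaseAddr, 9).mask == 0x200U);
static_assert(Expand(GpioaBaseAddr, 5).odrAddress == 0x4002'0014U);
static_assert(Expand(GpioaBaseAddr, 5).mask == 0x20U);
```

Nothing here is dereferenced, so the check also compiles and passes on a PC.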

Now let's compile and check the code size without optimization:

Code that compiles

    #include <cstddef>
    #include <tuple>
    #include <utility>
    #include <cstdint>
    #include <type_traits>

    //#include "stm32f411xe.h"

    #define __forceinline _Pragma("inline=forced")

    constexpr std::uint32_t GpioaBaseAddr = 0x4002'0000;
    constexpr std::uint32_t GpiocBaseAddr = 0x4002'0800;
    constexpr std::uint32_t OdrAddrShift = 20U;

    struct PortBase {
    };

    template <std::uint32_t addr>
    struct Port : PortBase {
      __forceinline inline static void Toggle(const std::uint8_t bit) {
        *reinterpret_cast<std::uint32_t*>(addr + OdrAddrShift) ^= (1 << bit);
      }
    };

    template <typename T, std::uint8_t pinNum,
              class = typename std::enable_if_t<std::is_base_of<PortBase, T>::value>>
    struct Pin {
      __forceinline inline static void Toggle() {
        T::Toggle(pinNum);
      }
    };

    using PortA = Port<GpioaBaseAddr>;
    using PortC = Port<GpiocBaseAddr>;

    //using Led1 = Pin<PortA, 5>;
    //using Led2 = Pin<PortC, 5>;
    //using Led3 = Pin<PortC, 8>;
    //using Led4 = Pin<PortC, 9>;

    class LedsContainer {
      friend int main();

    public:
      __forceinline static inline void ToggleAll() {
        // build the index sequence for the tuple and hand it to visit()
        visit(std::make_index_sequence<std::tuple_size<tRecordsTuple>::value>());
      }

    private:
      __forceinline template <std::size_t... index>
      static inline void visit(std::index_sequence<index...>) {
        Pass((std::get<index>(records).Toggle(), true)...);
      }

      __forceinline template <typename... Args>
      static void inline Pass(Args...) {
      }

      constexpr static auto records = std::make_tuple(
        Pin<PortA, 5>{},
        Pin<PortC, 5>{},
        Pin<PortC, 8>{},
        Pin<PortC, 9>{}
      );
      using tRecordsTuple = decltype(records);
    };

    void delay() {
      for (int i = 0; i < 1000000; ++i) {
      }
    }

    int main() {
      for (;;) {
        LedsContainer::ToggleAll();
        //GPIOA->ODR ^= 1 << 5;
        //GPIOC->ODR ^= 1 << 5;
        //GPIOC->ODR ^= 1 << 8;
        //GPIOC->ODR ^= 1 << 9;
        delay();
      }
      return 0;
    }

Assembly proof, unpacked as planned:

We see that memory use has gone overboard, by 18 bytes. The problems are the same as before, plus 12 more bytes. I did not figure out where they came from ... maybe someone can explain.

283 bytes of readonly code memory
1 byte of readonly data memory
24 bytes of readwrite data memory

Now the same thing at Medium optimization and, miraculously, we get code identical to the straightforward C++ implementation and more compact than the C code.

251 bytes of readonly code memory
1 byte of readonly data memory
8 bytes of readwrite data memory


As you can see, I won and ~~went to the Canaries~~ am enjoying a rest in Chelyabinsk :), but the students did great too and passed the exam successfully!

For those interested, the code is here

Where could this be used? Here is one idea: suppose we have parameters stored in EEPROM and a class describing them (read, write, initialize to a default value). The class is a template, along the lines of Param<float<>>, Param<int<>>, and we need, for example, to reset all parameters to their default values. This is exactly where they can all be put into a tuple, since their types differ, and SetToDefault() can be called on each one. True, with 100 such parameters it will eat a lot of ROM, but RAM will not suffer.
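A rough host-side sketch of that idea follows. Everything in it is an assumption made for illustration: the Param interface (Set, Get, SetToDefault), the params tuple, and ResetAll are hypothetical, and a plain member stands in for the EEPROM cell, so unlike the real thing this version does keep values in RAM.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical parameter class: holds a value and remembers its default.
// A real one would read from and write to EEPROM instead of a member.
template <typename T>
class Param {
 public:
  explicit constexpr Param(T defaultValue)
      : value_(defaultValue), default_(defaultValue) {}
  void Set(T v) { value_ = v; }
  T Get() const { return value_; }
  void SetToDefault() { value_ = default_; }

 private:
  T value_;
  T default_;
};

// Parameters of different types live side by side in one tuple.
inline auto params = std::make_tuple(Param<float>{1.5F}, Param<int>{42});

// Calls SetToDefault() on every element, whatever its type.
inline void ResetAll() {
  std::apply([](auto&... p) { (p.SetToDefault(), ...); }, params);
}
```

Adding a parameter then means adding one entry to the tuple; ResetAll() picks it up without further changes.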

P.S. I must admit that at maximum optimization this code comes out the same size as the C version and as my solution. All the programmer's efforts to improve the code boil down to the same assembly.

P.S.1 Thanks to 0xd34df00d for the good advice: unpacking the tuple can be simplified with std::apply(). The code of the ToggleAll() function then reduces to this:

    __forceinline static inline void ToggleAll() {
      std::apply([](auto... args) {
        (args.Toggle(), ...);
      }, records);
    }

Unfortunately, std::apply is not yet implemented in the current version of IAR, but this approach works too; see the implementation with std::apply.
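For a toolchain without std::apply, a stand-in can be put together from std::index_sequence, the same tool used in visit(). What follows is a simplified sketch, not the standard library's implementation: the namespace sketch and the names Apply, ApplyImpl, and Sum3 are hypothetical, and it handles only plain callables.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

namespace sketch {

template <typename F, typename Tuple, std::size_t... I>
constexpr decltype(auto) ApplyImpl(F&& f, Tuple&& t,
                                   std::index_sequence<I...>) {
  // Unpack the tuple into the individual arguments of f.
  return std::forward<F>(f)(std::get<I>(std::forward<Tuple>(t))...);
}

// Minimal stand-in for std::apply (no support for member pointers).
template <typename F, typename Tuple>
constexpr decltype(auto) Apply(F&& f, Tuple&& t) {
  return ApplyImpl(
      std::forward<F>(f), std::forward<Tuple>(t),
      std::make_index_sequence<
          std::tuple_size_v<std::remove_reference_t<Tuple>>>{});
}

// Small callable used to check the helper at compile time.
struct Sum3 {
  constexpr int operator()(int a, int b, int c) const { return a + b + c; }
};

static_assert(Apply(Sum3{}, std::make_tuple(1, 2, 3)) == 6);

}  // namespace sketch
```

With such a helper, the body of ToggleAll() could call sketch::Apply with the same lambda and records in place of std::apply.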

Source: https://habr.com/ru/post/457246/
