
ARM7TDMI-S (ARMv4T) vs. Cortex-M3 (ARMv7-M)

For about a dozen years now, a lot of microcontrollers on the market have been built around the ARM7TDMI core. It is quite a powerful core for single-chip solutions: it is 32-bit, runs at clock frequencies up to 100 MHz, and is single-cycle, i.e. some instructions (mainly register operations that do not touch the external processor buses) execute in one clock cycle. In computing power the ARM7TDMI core surpasses all the 8- and 16-bit chips (AVR, MCS-51, PIC12/PIC16/PIC18/PIC24, MSP430, etc.).

However, relatively recently ARM introduced the new Cortex family of cores. We will be interested in the Cortex-M3 variant, which is intended precisely to replace the ARM7TDMI in the niche of single-chip solutions.


I was lucky enough to work with NXP LPC1300 chips, or more precisely the LPC1343, based on the Cortex-M3 core, right after their official release, and a couple of projects have since been ported to them. Speaking as an experienced ARM programmer, I really liked them, although the architecture has its own quirks.

So, the Cortex-M3 is designed to replace the ARM7TDMI. In developing it, ARM Ltd. set itself the goal of increasing functionality and adding useful instructions, thereby improving code density and performance, without significantly complicating the processor logic. To do this it had to take an unprecedented step: for the first time, an ARM core is not binary-compatible with the previous families. This happened because the Cortex-M3 cannot execute 32-bit ARM code at all.

All previous cores had two modes of operation, each with its own instruction set. These modes were called ARM and Thumb: the first works with the full 32-bit instruction set, the second with a simplified set of 16-bit instructions. In fact, the core always executed ARM code; in Thumb mode a decoder was simply switched in, which mapped the 16-bit instructions onto their 32-bit counterparts on the fly.

In the Cortex-M3, 32-bit ARM code was dropped as a class. The Cortex family contains several more cores (Cortex-M0 and M1, plus the Cortex-A series), and the M3 sits in the middle: the M0 and M1 are even more simplified, while the A series, on the contrary, is designed for heavy, high-performance applications and keeps the ability to execute ARM code.

Bulkiness and low code density are a big problem for ARM cores: 32 bits for every operation make themselves felt, and on top of that a constant larger than about one byte cannot be encoded directly inside an instruction. This is exactly why the additional Thumb instruction set was introduced: it gives better code density (on average a 20-30% gain), at the cost of 5-10% of performance.

In Cortex, the Thumb idea was developed further: the set of 16-bit Thumb instructions was extended, and the result was named Thumb-2. When compiling for it, the performance drop (compared with pure ARM code) is only a few percent, while the savings in code size are still the same 20-30%.

Worth separate mention in the Thumb-2 set are high-level instructions such as IT (a construction using it is shown below); in general, the instruction set is simply crammed with features designed to help the compiler optimize C code. So, the construction in Thumb-2:

CMP r0, r1
ITE EQ ; if (r0 == r1)
MOVEQ r0, r2 ; then r0 = r2;
MOVNE r0, r3 ; else r0 = r3;


Something similar can be done in the ARM instruction set:

CMP r0, r1 ; if (r0 == r1)
MOVEQ r0, r2 ; then r0 = r2;
MOVNE r0, r3 ; else r0 = r3;


And in pure Thumb you have to work around it a bit:

CMP r0, r1 ; if (r0 == r1)
BNE .else
MOV r0, r2 ; then r0 = r2;
B .endif
.else:
MOV r0, r3 ; else r0 = r3;
.endif


If you count the sizes, the pure Thumb construction takes 5 × 2 = 10 bytes, the Thumb-2 one takes 4 × 2 = 8 bytes, and the ARM one a whole 3 × 4 = 12 bytes (even though it is only 3 instructions).
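For reference, here is the C-level construct that all three listings implement (the function and parameter names are made up for illustration); a Thumb-2 compiler is free to turn the ternary into CMP + ITE + two conditional moves, with no branch at all:

#include <stdint.h>

/* Equivalent of the listings above: may compile to CMP, ITE EQ, MOVEQ, MOVNE. */
uint32_t select_value(uint32_t a, uint32_t b, uint32_t x, uint32_t y)
{
    return (a == b) ? x : y;
}
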

However, the Keil RealView MDK compiler apparently does not know about this vaunted IT instruction: I never found it while studying the generated listings, and visually the assembler output of the compiler still looks more like ordinary Thumb. Either my source code is just peculiar, or the compiler has not yet been polished for the new core and instruction set. Unfortunately, I have no data on other compilers, although it would be interesting to see what GCC generates.

In general, an easy win on code size is advertised: supposedly the final binary will be 30-50% smaller than the same source compiled for an 8- or even a 16-bit microcontroller (for example, in the document given as the first link at the end of the article). I will say right away: this result is somewhat manipulated. It holds only for "32-bit" code, i.e. C code with an abundance of operations on int and long variables and a lot of computation (the well-known Dhrystone benchmark fits these requirements nicely). But if you take code previously written and optimized for 8 bits and port it to a 32-bit processor, the binary grows instead; in my experience the code becomes almost 1.5-2 times larger.

Another significant innovation in the Cortex-M3 is the division instruction. ARM cores have long included multiplication (with a 64-bit result) and multiply-accumulate (also with a 64-bit result); now a divide instruction has been added as well. It most likely takes quite a few cycles, but it is still much faster than a separate subroutine. However paradoxical this may sound to high-level programmers and people far from microcontrollers, hardware division is still a rarity in single-chip systems (to say nothing of floating-point instruction sets and other coprocessors, which appear only in the heaviest multimedia-oriented monsters).
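A minimal sketch of what this means for C code (the function name is made up): on a Cortex-M3 the compiler can emit a single UDIV instruction for the division below, while on an ARM7TDMI the same line becomes a call to a runtime-library routine such as __aeabi_uidiv, which costs far more cycles:

#include <stdint.h>

/* ticks / prescaler compiles to one UDIV on ARMv7-M; on ARMv4T it becomes
   a library call, since the core has no divide instruction at all. */
uint32_t scale_ticks(uint32_t ticks, uint32_t prescaler)
{
    return ticks / prescaler;
}
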

Unlike the ARM7TDMI, the Cortex has a Harvard memory architecture (separate instruction and data buses). On the AVR, this causes certain inconveniences: when programming you have to use special compiler macros and functions so that const data does not end up in RAM. Here (in fact, in all ARM cores after ARMv4, such as ARM9 and ARM11) the separate buses are not felt when programming: inside the chip they are still joined into a single address space. All ARM chips have a linear 32-bit address space of 4 GB (for x86 programmers this corresponds to the flat memory model), and all peripheral addresses, ROM and RAM are mixed within it.
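As a small illustration of the difference with AVR (the table name is made up): on AVR, data like this has to be marked PROGMEM and read through pgm_read_byte() to stay in flash, while on any ARM, with its single address space, a plain const access is enough:

#include <stdint.h>

/* Lives in flash and is read directly -- no special macros or access helpers. */
static const uint8_t sine_table[4] = { 0, 90, 180, 255 };

uint8_t sine_sample(uint8_t idx)
{
    return sine_table[idx & 3u];
}
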

Note (1): For all its advantages, the huge address space is a real nuisance when optimizing code. With 32-bit addressing, neither ARM/Thumb nor even Thumb-2 instructions can encode the full address of an object directly, so the address is placed as data next to the code and loaded with a separate instruction. This, too, hurts code size: on the MCS-51, for example, 2 bytes can be enough to read a variable from RAM, whereas on ARM you have to spend at least 2 bytes on the instruction itself plus 4 bytes just to store the address.
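A tiny sketch of where those extra bytes go (the variable name is made up): to read a global, the compiler typically first loads the variable's 32-bit address from a literal pool placed next to the code, and only then performs the actual data access:

#include <stdint.h>

volatile uint32_t g_counter;   /* some global living in RAM */

uint32_t read_counter(void)
{
    /* Roughly:  LDR r0, =g_counter   ; address fetched from the literal pool
                 LDR r0, [r0]         ; the actual read
       i.e. 4 bytes of address data on top of the instructions themselves. */
    return g_counter;
}
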

Note (2): I have always wanted to try placing code in a peripheral register (a return instruction, for example) and transferring control to it, just to watch the core's reaction. On the ARM7TDMI this trick might work thanks to its von Neumann memory organization, but the Cortex, with its Harvard architecture, will almost certainly send it off into the weeds by raising one of the fault exceptions.

The next major difference: a single stack. Whereas the ARM7TDMI allocates a separate stack for each core mode (this is not about ARM/Thumb, but about the modes the processor switches into on interrupts and exceptions), here there is essentially just one. I am not sure how to feel about this: in theory it is less flexible, but in practice it is awfully convenient. RAM is saved because there is no need to reserve a pile of stacks, and the logic of nested interrupts and system calls becomes simpler (try making a system call via the SWI software interrupt with more than 4 parameters on the ARM7TDMI; a similar dance is needed here too, but it is easier). On top of that, interrupt entry and exit latencies were reduced, as was switching between interrupts.

The second change made to speed up interrupt handling is the rejection of the VIC. Yes, there is no longer a monster called the VIC (Vectored Interrupt Controller). Again, this is a step from flexibility towards simplicity, but in a microcontroller system the need to reassign interrupt handlers on the fly is a rare case, and it is easier to roll your own mechanism for that than to configure the VIC in every project. Moreover, the interrupt table can be placed in RAM, where the handler addresses are easy to change.
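A sketch of that trick (the table size and function names are illustrative): on the Cortex-M3 the vector table base is set through the VTOR register at 0xE000ED08, so a copy of the table can live in RAM and individual entries can be rewritten at run time:

#include <stdint.h>

#define SCB_VTOR   (*(volatile uint32_t *)0xE000ED08UL)   /* vector table offset register */

typedef void (*vector_t)(void);

/* RAM copy of the vector table; VTOR requires the table to be aligned to a
   power of two large enough to cover all entries. */
__attribute__((aligned(256)))
static vector_t ram_vectors[64];

void vectors_to_ram(void)
{
    const vector_t *flash_vectors = (const vector_t *)0x00000000UL;
    for (uint32_t i = 0; i < 64u; i++)
        ram_vectors[i] = flash_vectors[i];     /* copy the table from flash */
    SCB_VTOR = (uint32_t)ram_vectors;          /* switch to the RAM copy    */
}

void set_irq_handler(uint32_t vector_index, vector_t handler)
{
    ram_vectors[vector_index] = handler;       /* reassign a handler on the fly */
}
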

Instead of the VIC we now have the NVIC and a large set of interrupt vectors at the beginning of flash. Where the ARM7TDMI interrupt vector table occupied 32 bytes at the very start, here several hundred bytes are reserved for interrupts from the various peripherals. Moreover, these are no longer jump instructions but real vectors containing addresses: the core does not transfer control into the table, it fetches the address at the appropriate offset and jumps there. From the programmer's point of view this is more convenient, prettier and more transparent.

But the main surprise is the first two vectors. Think they are reset and something else? No! At address 0 lies... the initial stack value, which the hardware loads into the stack register on reset. And at offset 4 is the address of the entry point. What does this give us? This: we can start executing the program straight from C code, with no preliminary assembler initialization. Of course, in that case you will have to copy the RW section into RAM and zero the ZI section yourself (if you refuse the compiler's help entirely).
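A minimal startup sketch of what that looks like in practice (the linker symbol names such as __etext or __stack_top are assumptions that depend on the toolchain's linker script):

#include <stdint.h>

extern uint32_t __etext;                       /* end of code = RW image in flash */
extern uint32_t __data_start__, __data_end__;  /* RW section in RAM  */
extern uint32_t __bss_start__, __bss_end__;    /* ZI section in RAM  */
extern uint32_t __stack_top;                   /* initial stack value */

int main(void);

void Reset_Handler(void)
{
    /* Copy the RW (initialized data) section from flash to RAM. */
    uint32_t *src = &__etext;
    for (uint32_t *dst = &__data_start__; dst < &__data_end__; )
        *dst++ = *src++;

    /* Zero the ZI (bss) section. */
    for (uint32_t *dst = &__bss_start__; dst < &__bss_end__; )
        *dst++ = 0u;

    main();
    for (;;) { }                               /* main() should not return */
}

/* The first two table entries: the initial stack pointer loaded by hardware
   into SP on reset, then the entry-point address. */
__attribute__((section(".isr_vector"), used))
static const void *const vector_table[] = {
    (void *)&__stack_top,
    (void *)Reset_Handler,
};
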

This explicit C orientation is noticeable in the example projects for Cortex: all the initialization has moved from assembler into C. With multiple stacks gone, there is no longer a need to set them up at the very start, and the rest of the initialization has migrated into C code.

Another interesting difference in the instruction set: high-level instructions such as WFI (wait for interrupt) and WFE (wait for event) have been added, which simplify building multi-threaded applications and operating systems. The set also contains instructions for multiprocessor systems, which suggests that multi-core single-chip solutions may appear before long.

Note: multi-core microcontrollers do exist, such as the Parallax Propeller (it already has eight 32-bit cores), but it can hardly be called a fully-fledged solution suitable for commercial use rather than hobby projects.
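Getting back to WFI: here is a sketch of the classic idle loop it enables (CMSIS offers an __WFI() intrinsic for the same thing; below the instruction is emitted with inline assembly, and the event flag is made up):

#include <stdint.h>

volatile uint32_t g_pending_events;    /* set from interrupt handlers */

void idle_loop(void)
{
    for (;;) {
        while (g_pending_events == 0u)
            __asm volatile ("wfi");    /* sleep until the next interrupt */

        /* ...process the pending events here... */
        g_pending_events = 0u;
    }
}
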

The Cortex-M3 core description also adds a timer (SysTick). It is a simple one: it can generate an interrupt at a given rate, but for an operating-system kernel, for example, nothing more is needed.

Note: a timer in the core description is a very useful and important thing. Since it is described in the core documentation and is effectively part of the licensed core, every manufacturer will include it in their chips, and, most importantly, the implementation will be identical everywhere. This is great for code portability: there is no need to write support modules for a pile of different manufacturers' timer implementations (as is the case with the ARM7TDMI). Additional timers will still be implemented by each manufacturer in its own way, but even one standard timer is a good step towards universality.
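A minimal configuration sketch, using the register layout from the ARMv7-M architecture (SysTick lives at 0xE000E010 on every Cortex-M3; the clock and tick frequencies are just example parameters):

#include <stdint.h>

#define SYST_CSR   (*(volatile uint32_t *)0xE000E010UL)   /* control and status */
#define SYST_RVR   (*(volatile uint32_t *)0xE000E014UL)   /* reload value       */
#define SYST_CVR   (*(volatile uint32_t *)0xE000E018UL)   /* current value      */

volatile uint32_t g_ticks;

void SysTick_Handler(void)                     /* same vector on any Cortex-M3 chip */
{
    g_ticks++;
}

void systick_init(uint32_t core_clock_hz, uint32_t tick_hz)
{
    SYST_RVR = core_clock_hz / tick_hz - 1u;   /* e.g. 72 MHz / 1 kHz */
    SYST_CVR = 0u;                             /* clear the counter   */
    SYST_CSR = (1u << 2) | (1u << 1) | 1u;     /* core clock, interrupt enable, run */
}
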

In conclusion, it should be said that the core documentation also describes an MPU (Memory Protection Unit) module. It is a very useful thing in complex devices where several threads are running and you really do not want the whole firmware brought down by a failure in one of them. However, the module is optional, and chip manufacturers are in no hurry to include it: even the higher-end NXP LPC1700 family lacks it, and I have not seen it from other manufacturers either. Memory protection, let alone virtual memory, remains the preserve of big, expensive monsters.
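Since the MPU is optional, its presence can be checked at run time through the MPU_TYPE register defined by the ARMv7-M architecture (a small sketch; the helper name is made up):

#include <stdint.h>

#define MPU_TYPE   (*(volatile const uint32_t *)0xE000ED90UL)

/* DREGION (bits 15:8) gives the number of supported MPU regions;
   zero means the manufacturer left the MPU out, as in the LPC1700. */
static inline uint32_t mpu_region_count(void)
{
    return (MPU_TYPE >> 8) & 0xFFu;
}
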

Related Links:

Source: https://habr.com/ru/post/92494/

