OpenOCD, ThreadX and your processor

This note may be useful to people who write bare-metal code and use ThreadX in their projects (by choice, or because the SDK imposes it). To debug code effectively under ThreadX or any other multi-threaded operating system, you need to be able to see the threads themselves, along with the stack trace and the register state of each thread.

OpenOCD (Open On-Chip Debugger) declares support for ThreadX, but does not state explicitly how far that support goes. At the time of writing, in version 0.8.0, it covers just two cores: Cortex-M3 and Cortex-R4. As fate would have it, I had to work with the Cypress FX3 chip, which is built around the ARM926EJ-S core.

Below, we look at what needs to be done to add support for your version of ThreadX on your CPU. The emphasis is on ARM, but in theory the approach should work for other processors as well. In addition, we assume that the ThreadX source code is not available and is not going to be.

Let me disappoint you right away: you will not get anywhere without assembler. No, we will not have to write any, but we will have to read it.
Let's start by getting to know how ThreadX support is implemented in OpenOCD. It is just one file: src/rtos/ThreadX.c.

A supported system is described by the ThreadX_params structure, which holds the name of the target, the "width" of a pointer in bytes, a set of offsets inside the TX_THREAD structure to the service fields that are needed, and information about how the thread context is saved on a switch (the so-called stacking info). The supported systems themselves are registered in the ThreadX_params_list array.
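To make this description more concrete, here is a minimal, self-contained sketch of such a parameter table. The struct and field names are modelled on the description above and are not copied from OpenOCD. Apart from the stack pointer offset (8, which shows up in the disassembly later in this note), the numeric offsets are placeholders, and find_params() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch only: names and layout are assumptions, not OpenOCD's definitions. */
struct stacking_sketch;                      /* describes how a context is saved */

struct threadx_params_sketch {
    const char *target_name;                 /* name of the debug target type */
    unsigned char pointer_width;             /* pointer size in bytes */
    unsigned char thread_stack_offset;       /* TX_THREAD offset of the saved sp */
    unsigned char thread_name_offset;        /* placeholder value below */
    unsigned char thread_state_offset;       /* placeholder value below */
    unsigned char thread_next_offset;        /* placeholder value below */
    const struct stacking_sketch *stacking;  /* stacking info, NULL in this sketch */
};

static const struct threadx_params_sketch params_list[] = {
    { "cortex_m3", 4, 8, 40, 48, 136, NULL },
    { "arm926ejs", 4, 8, 40, 48, 136, NULL },
};

/* Hypothetical helper: find the entry whose name matches the target type. */
static const struct threadx_params_sketch *find_params(const char *target)
{
    assert(target != NULL);
    for (size_t i = 0; i < sizeof(params_list) / sizeof(params_list[0]); i++) {
        if (strcmp(params_list[i].target_name, target) == 0)
            return &params_list[i];
    }
    return NULL;  /* unsupported target */
}
```

Adding a new CPU then amounts to adding one more entry to the array, provided the offsets and the stacking description for it are known.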

All the parameters except the last one are unproblematic: the pointer width usually equals the processor's word width, and the offsets are worked out by hand (and they almost never change).

The interesting question is where to get the stacking information, and there is a lot of it:

the stack growth direction (this one is easy);
the number of registers in the system (also easy: run info registers against the existing version of OpenOCD and count the lines);
the alignment of the frame on the stack. I found this value by trial and error: for Cortex-M3/R4 it is given as 8 bytes, for ARM926EJ-S as 0 (that is, no alignment). Strictly speaking the alignment is 4, but memory allocated with tx_byte_allocate() is already aligned, and stack usage is always a multiple of 4. In general, try the values 0, 4, and so on;
an array of offsets into the stack (relative to the current top) at which the values of specific registers are stored (the size of the array equals the number of registers in the output of info registers).
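The four items above could be collected in a structure like the one below. This is a hedged sketch, not OpenOCD's definition: all names are made up for illustration, and the rounding rule in sp_after_restore() is an assumption chosen only to show where the alignment value would come into play.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch only: field names are assumptions that mirror the four items above. */
struct reg_offset_sketch {
    int offset;        /* byte offset from the saved stack pointer */
    int width_bits;    /* register width in bits */
};

struct stacking_sketch {
    unsigned frame_size;                   /* bytes occupied by the saved context */
    int growth_direction;                  /* -1: stack grows toward lower addresses */
    unsigned num_registers;                /* number of lines in "info registers" */
    unsigned alignment;                    /* 0 means "no extra alignment" */
    const struct reg_offset_sketch *regs;  /* one entry per register */
};

/* Stack pointer the thread will have once the saved frame has been popped. */
static uint32_t sp_after_restore(const struct stacking_sketch *s, uint32_t saved_sp)
{
    assert(s->growth_direction == 1 || s->growth_direction == -1);
    /* Popping moves the pointer against the growth direction. */
    uint32_t sp = saved_sp - (uint32_t)(s->growth_direction * (int)s->frame_size);
    if (s->alignment != 0) {
        /* Assumed rule: round up to a multiple of the (power of two) alignment. */
        sp = (sp + s->alignment - 1u) & ~(s->alignment - 1u);
    }
    return sp;
}
```

With a 44-byte frame, a downward-growing stack and alignment 0, a saved stack pointer of 0x40010000 yields 0x4001002C, which is the value a debugger would report as sp for that thread.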

The last item is the most difficult and the least obvious. I can assure you right away that there is no standard approach here, and it is extremely difficult, if not impossible, to find these values by guessing.

Moreover, looking ahead, it turned out that the Cortex-M3/R4 cores use one stacking scheme, while the ARM926EJ-S uses two! All for the sake of economy.

Briefly (and also very roughly and imprecisely), here is how the scheduler works in ThreadX: it provides both a cooperative and a preemptive approach to multitasking.

The cooperative approach applies to threads of the same priority that have no time slice assigned (0). That is, if threads A and B have the same priority and thread A is running, thread B will not get control until A:

completes, or
calls a function that leads to rescheduling (sleep, waiting on a queue, mutex, semaphore, etc.)

If a time slice is set, then when it expires the thread is interrupted and control passes to the next thread in the Ready state (if the thread goes to sleep before it has used up its slice, the cooperative approach applies here too). This is where the preemptive approach comes in. It needs a timer that generates interrupts at a fixed frequency. Thread A from the example above can also be preempted by thread B if B has a higher priority.

Clearly, a thread's context is saved when it hands control to someone else and restored when it gets control back. Once we understand how this happens, we will know what has to be described in the array of register offsets.

I will not go into detail about how I found out where the main parts of the scheduler were hidden; a lot went into it: ingenuity, luck, Google and a disassembler. But I will list its main components:

_tx_timer_interrupt() is called from the timer interrupt context and is, in effect, responsible for the preemptive part of the scheduler.
_tx_thread_context_save() (or _tx_thread_vectored_context_save()) and _tx_thread_context_restore() are a pair of functions meant to be called from interrupts to save and restore the context. When the context is restored, a rescheduling attempt is made.
_tx_thread_system_return() is part of the cooperative approach. It is called at the end of any call chain that causes rescheduling.
and, finally, _tx_thread_schedule() is the most important function for the analysis and, perhaps, the simplest of those listed.

I studied the listings of all these functions, but if I had to add support for another unsupported processor, I would concentrate on the last three. I would start with the last one, and only then (if the information turned out to be insufficient) study the others.

Let's look at its listing (I replaced some of the indirect addresses with real symbols; the symbols themselves can be looked up in the ELF file with arm-none-eabi-nm):

 40004c7c <_tx_thread_schedule>:
 40004c7c: e10f2000 mrs r2, CPSR
 40004c80: e3c20080 bic r0, r2, #128 ; 0x80
 40004c84: e12ff000 msr CPSR_fsxc, r0
 40004c88: e59f104c ldr r1, [pc, #76] ; 40004cdc <_tx_thread_schedule+0x60>
 40004c8c: e5910000 ldr r0, [r1]
 40004c90: e3500000 cmp r0, #0
 40004c94: 0afffffc beq 40004c8c <_tx_thread_schedule+0x10>
 40004c98: e12ff002 msr CPSR_fsxc, r2
 40004c9c: e59f103c ldr r1, [pc, #60] ; 40004ce0 <_tx_thread_schedule+0x64>
 40004ca0: e5810000 str r0, [r1]
 40004ca4: e5902004 ldr r2, [r0, #4]
 40004ca8: e5903018 ldr r3, [r0, #24]
 40004cac: e2822001 add r2, r2, #1
 40004cb0: e5802004 str r2, [r0, #4]
 40004cb4: e59f2028 ldr r2, [pc, #40] ; 40004ce4 <_tx_thread_schedule+0x68>
 40004cb8: e590d008 ldr sp, [r0, #8]
 40004cbc: e5823000 str r3, [r2]
 40004cc0: e8bd0003 pop {r0, r1}
 40004cc4: e3500000 cmp r0, #0
 40004cc8: 116ff001 msrne SPSR_fsxc, r1
 40004ccc: 18fddfff ldmne sp!, {r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, sl, fp, ip, lr, pc}^
 40004cd0: e8bd4ff0 pop {r4, r5, r6, r7, r8, r9, sl, fp, lr}
 40004cd4: e12ff001 msr CPSR_fsxc, r1
 40004cd8: e12fff1e bx lr
 40004cdc: 4004b754 .word 0x4004b754 ; _tx_thread_execute_ptr
 40004ce0: 4004b750 .word 0x4004b750 ; _tx_thread_current_ptr
 40004ce4: 4004b778 .word 0x4004b778 ; _tx_timer_time_slice

The function is simple almost to the point of absurdity:

enable interrupts (40004c7c-40004c84)
wait until someone sets _tx_thread_execute_ptr, the pointer to the next thread to execute (40004c88-40004c94)
disable interrupts or, more precisely, restore the original value of the status register (40004c98)
store that pointer (now in r0) into _tx_thread_current_ptr (40004c9c-40004ca0)
increment tx_thread_run_count of the new current thread by 1 (40004ca4, 40004cac-40004cb0)
read tx_thread_time_slice of that thread and assign it to _tx_timer_time_slice (40004ca8, 40004cb4, 40004cbc)
load the new stack pointer stored in the thread structure, tx_thread_stack_ptr (40004cb8)
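The bookkeeping part of these steps can be modelled in C roughly as follows. This is a sketch for reading along with the listing, not ThreadX code: only the fields at offsets 4, 8 and 24 are confirmed by the disassembly, the remaining field names and the scheduler_model structure are assumptions, and the wait loop with interrupts enabled is reduced to an assertion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the first words of TX_THREAD. Pointers are stored as uint32_t so
 * that the offsets also hold when this is compiled on a 64-bit host. */
struct tx_thread_model {
    uint32_t tx_thread_id;          /* offset 0 (assumed name) */
    uint32_t tx_thread_run_count;   /* offset 4 */
    uint32_t tx_thread_stack_ptr;   /* offset 8 */
    uint32_t reserved[3];           /* offsets 12..20, not touched by the listing */
    uint32_t tx_thread_time_slice;  /* offset 24 */
};

static_assert(offsetof(struct tx_thread_model, tx_thread_run_count) == 4, "ldr r2, [r0, #4]");
static_assert(offsetof(struct tx_thread_model, tx_thread_stack_ptr) == 8, "ldr sp, [r0, #8]");
static_assert(offsetof(struct tx_thread_model, tx_thread_time_slice) == 24, "ldr r3, [r0, #24]");

/* Stand-ins for the three globals referenced at the end of the listing. */
struct scheduler_model {
    struct tx_thread_model *execute_ptr;  /* _tx_thread_execute_ptr */
    struct tx_thread_model *current_ptr;  /* _tx_thread_current_ptr */
    uint32_t timer_time_slice;            /* _tx_timer_time_slice */
};

/* Performs the bookkeeping and returns the value that would be loaded into sp. */
static uint32_t schedule_prologue(struct scheduler_model *m)
{
    struct tx_thread_model *next = m->execute_ptr;
    assert(next != NULL);                 /* the real code spins until non-NULL */
    m->current_ptr = next;                /* str r0, [r1] */
    next->tx_thread_run_count += 1;       /* ldr / add / str at offset 4 */
    m->timer_time_slice = next->tx_thread_time_slice;
    return next->tx_thread_stack_ptr;     /* ldr sp, [r0, #8] */
}
```

The returned value is the address at which the debugger has to start looking for the saved context of a thread that is not currently running.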

From 40004cb8 onward, where the stack pointer is switched, comes the code that actually restores the context of the new thread.

First, two values are read into registers r0 , r1 :

 40004cc0: e8bd0003 pop {r0, r1}

Next comes the comparison of r0 with zero:

 40004cc4: e3500000 cmp r0, #0

Obviously these values, or at least r0, are part of the context (after all, the stack pointer already points at the stack of the thread being restored), but they do not look much like registers. A comparison with zero implies some kind of branching. Continuing the analysis, we see that if r0 != 0, the following code is executed:

 40004cc8: 116ff001 msrne SPSR_fsxc, r1
 40004ccc: 18fddfff ldmne sp!, {r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, sl, fp, ip, lr, pc}^

This does look like restoring a context. Moreover, the value in r1 is the saved status register: it is written to SPSR, and the ^ suffix on an ldm that loads pc copies SPSR into CPSR. If the instruction at 40004ccc is executed, control does not go any further: the pc (r15) register is restored and the program resumes from the point at which it was interrupted.

Great, now we can write down this table:

   Offset Register
   -------- -------
   0 flag
   4 CPSR
   8 r0
   12 r1
   16 r2
   20 r3
   24 r4
   28 r5
   32 r6
   36 r7
   40 r8
   44 r9
   48 sl (r10)
   52 fp (r11)
   56 ip (r12)
   60 lr (r14)
   64 pc (r15)

Each register, and the flag too, is 32 bits or 4 bytes, so this context takes 17 * 4 = 68 bytes. Beyond it, logically, the stack continues just as it was at the moment of the interruption.

But, as we can see, this is only part of the story. There is still that flag, and if its value is 0, the following code is executed:

 40004cd0: e8bd4ff0 pop {r4, r5, r6, r7, r8, r9, sl, fp, lr}
 40004cd4: e12ff001 msr CPSR_fsxc, r1
 40004cd8: e12fff1e bx lr

Apparently this is also a context, only a somewhat shortened one. Moreover, the return from it happens as from an ordinary function rather than by restoring the pc register. Rewriting the table above, we get:

   Offset Register
   -------- -------
   0 flag
   4 CPSR
   8 r4
   12 r5
   16 r6
   20 r7
   24 r8
   28 r9
   32 sl (r10)
   36 fp (r11)
   40 lr (r14)

This context requires only 11 * 4 = 44 bytes.
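Both tables can be written down as C structures, which makes the offsets and sizes easy to double-check with the compiler. The structure names full_frame and short_frame are made up for this sketch; the field order follows the two tables above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout restored by "ldmne sp!, {r0-r12, lr, pc}^" (flag != 0). */
struct full_frame {
    uint32_t flag;       /* offset 0, non-zero */
    uint32_t cpsr;       /* offset 4 */
    uint32_t r[13];      /* offsets 8..56: r0..r12 */
    uint32_t lr;         /* offset 60 */
    uint32_t pc;         /* offset 64 */
};

/* Layout restored by "pop {r4-r11, lr}" followed by "bx lr" (flag == 0). */
struct short_frame {
    uint32_t flag;       /* offset 0, zero */
    uint32_t cpsr;       /* offset 4 */
    uint32_t r4_r11[8];  /* offsets 8..36: r4..r11 */
    uint32_t lr;         /* offset 40 */
};

/* Compile-time checks against the numbers in the tables. */
static_assert(sizeof(struct full_frame) == 17 * 4, "full frame is 68 bytes");
static_assert(sizeof(struct short_frame) == 11 * 4, "short frame is 44 bytes");
static_assert(offsetof(struct full_frame, lr) == 60, "lr in the full frame");
static_assert(offsetof(struct full_frame, pc) == 64, "pc in the full frame");
static_assert(offsetof(struct short_frame, lr) == 40, "lr in the short frame");
```

The short frame saves 68 - 44 = 24 bytes of stack per suspended thread.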

With the help of Google, the disassembler listings and a study of the procedure call conventions, we come to understand that this type of context is used when cooperative multitasking is at work, i.e. when we have called tx_thread_sleep() or something similar. Since such a switch is, in essence, just a function call, the context can be saved according to the calling convention, under which the values of registers r0-r3 and r12 do not have to be preserved across calls. Moreover, pc does not need to be saved either: all the necessary information is already in lr, the return address from tx_thread_sleep(). The benefit is obvious. The Cortex cores are usually found in systems with more memory than ARM9E ones, so there such tricks are not used and a single type of stacking is enough.

From information dug up on the Internet, the first type of context is called interrupt and is used when a thread is preempted by an interrupt; since that can happen anywhere, every register has to be saved. The second type of context is called solicited and is used when a thread is suspended by a system call that leads to rescheduling.
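Telling the two contexts apart from the debugger side comes down to reading the first word at the saved stack pointer. A minimal sketch follows, assuming a little-endian target and a raw dump of the thread's stack; the enum and function names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum context_kind { CONTEXT_SOLICITED, CONTEXT_INTERRUPT };

/* Read a little-endian 32-bit word from a raw memory dump (assumed byte order). */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* The first word at the saved stack pointer is the flag: zero means the short
 * (solicited) frame, anything else means the full (interrupt) frame. */
static enum context_kind classify_context(const uint8_t *stack_top)
{
    assert(stack_top != NULL);
    return read_le32(stack_top) == 0 ? CONTEXT_SOLICITED : CONTEXT_INTERRUPT;
}

/* Size in bytes of the saved context, from the two layouts derived earlier. */
static unsigned context_size(enum context_kind kind)
{
    return kind == CONTEXT_SOLICITED ? 11u * 4u : 17u * 4u;
}
```

This is the reason one target needs more than one stacking description: the choice has to be made per thread, at the moment its stack is inspected.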

Now everything is in place to understand which modifications OpenOCD needs:

the target registration mechanism has to be reworked so that several stacking variants can be used for one target;
the description of the target itself has to be written.

I will not show the code for the first item here; see the patch. For the second item, I will explain briefly how to build an offset table that OpenOCD understands.

First of all, we look at the output of the 'info registers' command to see how many registers it prints and in what order, and put together a skeleton like this:

 static const struct stack_register_offset rtos_threadx_arm926ejs_stack_offsets_solicited[] = {
     { , 32 }, /* r0 */
     { , 32 }, /* r1 */
     { , 32 }, /* r2 */
     { , 32 }, /* r3 */
     { , 32 }, /* r4 */
     { , 32 }, /* r5 */
     { , 32 }, /* r6 */
     { , 32 }, /* r7 */
     { , 32 }, /* r8 */
     { , 32 }, /* r9 */
     { , 32 }, /* r10 */
     { , 32 }, /* r11 */
     { , 32 }, /* r12 */
     { , 32 }, /* sp (r13) */
     { , 32 }, /* lr (r14) */
     { , 32 }, /* pc (r15) */
     { , 32 }, /* xPSR */
 };

Here 32 is the width of the register in bits. For ARM it is always 32. The first column is filled in from the tables we wrote down above while analysing the context restore. Two special values have to be taken into account: -1 means the register is not saved, and -2 marks the stack pointer, which is restored from the thread structure.

The filled-in skeleton for the solicited context is:

 static const struct stack_register_offset rtos_threadx_arm926ejs_stack_offsets_solicited[] = {
     { -1, 32 }, /* r0 */
     { -1, 32 }, /* r1 */
     { -1, 32 }, /* r2 */
     { -1, 32 }, /* r3 */
     {  8, 32 }, /* r4 */
     { 12, 32 }, /* r5 */
     { 16, 32 }, /* r6 */
     { 20, 32 }, /* r7 */
     { 24, 32 }, /* r8 */
     { 28, 32 }, /* r9 */
     { 32, 32 }, /* r10 */
     { 36, 32 }, /* r11 */
     { -1, 32 }, /* r12 */
     { -2, 32 }, /* sp (r13) */
     { 40, 32 }, /* lr (r14) */
     { -1, 32 }, /* pc (r15) */
     {  4, 32 }, /* xPSR */
 };
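To show how such a table is meant to be consumed, here is a small stand-alone sketch that decodes a solicited frame from a raw stack dump. It is not OpenOCD code: the function and macro names are hypothetical, the target is assumed to be little-endian, and unsaved registers are simply reported as 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS      17
#define REG_NOT_SAVED (-1)   /* register is not part of the frame */
#define REG_IS_SP     (-2)   /* value comes from the thread structure */

/* Offsets of r0..r15 and the status register in the solicited frame. */
static const int solicited_offsets[NUM_REGS] = {
    REG_NOT_SAVED, REG_NOT_SAVED, REG_NOT_SAVED, REG_NOT_SAVED, /* r0-r3 */
    8, 12, 16, 20,                                              /* r4-r7 */
    24, 28, 32, 36,                                             /* r8-r11 */
    REG_NOT_SAVED,                                              /* r12 */
    REG_IS_SP,                                                  /* sp */
    40,                                                         /* lr */
    REG_NOT_SAVED,                                              /* pc */
    4,                                                          /* status register */
};

/* Fill out[] from a little-endian dump of the frame. sp_value is the stack
 * pointer the debugger computed for the thread. */
static void decode_frame(const uint8_t *frame, size_t frame_len,
                         uint32_t sp_value, uint32_t out[NUM_REGS])
{
    for (int i = 0; i < NUM_REGS; i++) {
        int off = solicited_offsets[i];
        if (off == REG_NOT_SAVED) {
            out[i] = 0;                       /* nothing to read */
        } else if (off == REG_IS_SP) {
            out[i] = sp_value;
        } else {
            assert((size_t)off + 4 <= frame_len);
            const uint8_t *p = frame + off;
            out[i] = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                     ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
        }
    }
}
```

Since pc is not saved in this frame, a debugger has to obtain it some other way; the saved lr, which holds the return address, is the natural candidate.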

For the interrupt context, try writing the table yourself, or look at the source.
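For comparison, here is what the interrupt table looks like when it is derived mechanically from the first layout (flag at 0, CPSR at 4, r0 at 8 and so on up to pc at 64). Treat it as a sketch to check your own attempt against, not as the contents of the patch: the array name is chosen by analogy with the solicited one, and the struct is a local stand-in so that the snippet compiles by itself.

```c
#include <assert.h>

/* Local stand-in so the snippet is self-contained; in OpenOCD the real
 * definition comes from the rtos headers. */
struct stack_register_offset {
    signed short offset;        /* byte offset, -1: not saved, -2: stack pointer */
    unsigned short width_bits;  /* register width in bits */
};

static const struct stack_register_offset
rtos_threadx_arm926ejs_stack_offsets_interrupt[] = {
    {  8, 32 }, /* r0 */
    { 12, 32 }, /* r1 */
    { 16, 32 }, /* r2 */
    { 20, 32 }, /* r3 */
    { 24, 32 }, /* r4 */
    { 28, 32 }, /* r5 */
    { 32, 32 }, /* r6 */
    { 36, 32 }, /* r7 */
    { 40, 32 }, /* r8 */
    { 44, 32 }, /* r9 */
    { 48, 32 }, /* r10 */
    { 52, 32 }, /* r11 */
    { 56, 32 }, /* r12 */
    { -2, 32 }, /* sp (r13) */
    { 60, 32 }, /* lr (r14) */
    { 64, 32 }, /* pc (r15) */
    {  4, 32 }, /* xPSR */
};

/* One entry per register printed by "info registers". */
static_assert(sizeof(rtos_threadx_arm926ejs_stack_offsets_interrupt) /
              sizeof(rtos_threadx_arm926ejs_stack_offsets_interrupt[0]) == 17,
              "17 registers expected");
```

Unlike the solicited table, no entry here is -1: every register is present in the frame.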

What this gives you:

the list of threads: "info threads"
a backtrace for every thread: "thread apply all bt"
switching between threads: "thread 3"
switching between frames: "frame 5"
viewing the register state of each thread individually

The commands are given for gdb.

In general, happy debugging!

Resources:

Patch: ThreadX-arm926ejs.diff
Build for Win32/64, patched sources, the patch, helper scripts: openocd-0.8.0-20150206-win.tar.xz
Discussion on the Cypress forum: www.cypress.com/?app=forum&id=167&rID=106353
OpenOCD mailing list: sourceforge.net/p/openocd/mailman/message/33287429

P.S. Habr is missing a "reverse engineering" hub and syntax highlighting for the various assemblers ;-)

UPD (2015-08-15): The changes have landed in the main OpenOCD branch: openocd.zylin.com/#/c/2848

Source: https://habr.com/ru/post/249991/
