
Part 5/2, building 1: The crossroads of RocketChip Avenue and the slippery instrumentation track

In the previous four parts, preparations were made for experiments with the RISC-V RocketChip core: the core was ported to a board with an Altera (now Intel) FPGA that is "non-standard" for it. Finally, in the last part, Linux was up and running on this board. Do you know what amused me about all this? That I had to work with RISC-V assembly, C and Scala at the same time, and of all of them Scala turned out to be the lowest-level language (because the processor itself is written in it).


In this article, let's give C its due as well. Moreover, while the Scala + Chisel combination has so far been used only as a domain-specific language for explicitly describing hardware, today we will learn how to "pull" simple C functions into the processor in the form of instructions.


The ultimate goal is a trivial implementation of trivial AFL-like instrumentation, by analogy with QInst; implementing standalone instructions is only a by-product.


It is clear that commercial OpenCL-to-RTL converters exist (and more than one). I also came across mention of a certain COPILOT project for RISC-V with similar (and much more advanced) goals, but it googles poorly, and besides, it is most likely also a commercial product. I am primarily interested in open-source solutions, but even if they exist, it is still fun to try to implement one yourself, at least as a simplified case study, and then we'll see how it goes...


Disclaimer (in addition to the usual warning about "dancing with a fire extinguisher at hand"): I strongly advise against carelessly using the resulting soft core, especially with untrusted data. So far I have neither confidence in, nor even an understanding of, why the processed data could not "leak" in some corner case between processes and/or into the kernel. As for the fact that the data can simply get corrupted, I think that is obvious anyway. In general, validation and more validation still lie ahead...


To begin with, what do I call a "simple function"? For the purposes of this article, it is a function in which every branch (conditional or unconditional) only increases the program counter, and by a constant value at that. In other words, the graph of all possible transitions is a directed acyclic graph with no "dynamic" edges. The ultimate goal within this article is to be able to take such a simple function from a program, replace it with an assembly stub, and "sew" it into the processor at synthesis time, optionally making it a side effect of another instruction. Branching itself will not be demonstrated here, but in the simplest case adding it would not be difficult.


Learning to understand C (actually, no)


First of all, how are we going to parse C? Correct: we aren't. It was not in vain that I learned to parse ELF files: we just compile our C / Rust / whatever code into eBPF bytecode and parse that instead. Some difficulty comes from the fact that in Scala you cannot simply include elf.h and read structure fields. One could, of course, try JNAerator: if needed, it can generate bindings to a library, not only the structures but also the code to call it through JNA (not to be confused with JNI). But, being a Real Programmer, I will write my own bicycle and carefully copy the enumeration values and offset constants from the header file. The result and the intermediate structures are described by the following set of case classes:


```scala
sealed trait SectionKind
case object RegularSection extends SectionKind
case object SymtabSection extends SectionKind
case object StrtabSection extends SectionKind
case object RelSection extends SectionKind

final case class Elf64Header(
  sectionHeaders: Seq[ByteBuffer],
  sectionStringTableIndex: Int
)

final case class Elf64Section(
  data: ByteBuffer,
  linkIndex: Int,
  infoIndex: Int,
  kind: SectionKind
)

final case class Symbol(
  name: String,
  value: Int,
  size: Int,
  shndx: Int,
  isInstrumenter: Boolean
)

final case class Relocation(
  relocatedSection: Int,
  offset: Int,
  symbol: Symbol
)

final case class BpfInsn(
  opcode: Int,
  dst: Int,
  src: Int,
  offset: Int,
  imm: Either[Long, Symbol]
)

final case class BpfProg(
  name: String,
  insns: Seq[BpfInsn]
)
```

I will not describe the parsing process in detail: it is just dull byte shoveling out of a java.nio.ByteBuffer, and everything interesting has already been covered in the article on parsing ELF files. I will only note that opcode == 0x18 (loading a 64-bit immediate into a register) needs careful handling, since it occupies two 8-byte instruction words at once (there may be other such opcodes, but I have not come across them yet), and it is not always a load of a relocated memory address, as I initially assumed. For example, __builtin_popcountl honestly uses the 64-bit constant 0x0101010101010101. Why am I not doing an "honest" relocation by patching the loaded file? Because I want to keep symbols in symbolic form (pardon the pun), so that symbols from the COMMON section can later be replaced with registers without crutches like special handling of specially crafted addresses (which would also mean dancing around constant versus non-constant UInt).
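To make the 0x18 special case concrete, here is a rough sketch of how such a two-word instruction could be decoded into the BpfInsn case class shown above. This is my own illustration rather than the project's loader code: parseLddw is a hypothetical helper, and the buffer is assumed to be little-endian and positioned at the first of the two slots.

```scala
import java.nio.ByteBuffer

// Sketch: BPF_LD | BPF_IMM | BPF_DW (opcode 0x18) occupies two consecutive
// 8-byte instruction slots; the 64-bit immediate is split between them.
def parseLddw(buf: ByteBuffer, reloc: Option[Symbol]): BpfInsn = {
  val first  = buf.getLong()               // opcode | regs | offset | imm_lo
  val second = buf.getLong()               // pseudo-slot: only imm_hi matters
  val opcode = (first & 0xff).toInt
  require(opcode == 0x18, "not an LDDW instruction")
  val dst    = ((first >>> 8) & 0xf).toInt
  val src    = ((first >>> 12) & 0xf).toInt
  val offset = ((first >>> 16) & 0xffff).toInt
  val immLo  = (first >>> 32) & 0xffffffffL
  val immHi  = (second >>> 32) & 0xffffffffL
  val imm64  = (immHi << 32) | immLo       // e.g. 0x0101010101010101 for popcount
  // Keep the symbol in symbolic form when a relocation points at this slot,
  // otherwise store the raw 64-bit constant.
  BpfInsn(opcode, dst, src, offset, reloc.toRight(imm64))
}
```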


Building hardware from a list of instructions


So, by assumption, all possible execution paths go strictly downward through the instruction list, which means that the data flows along a directed acyclic graph, all of whose edges are defined statically. At the same time we have both purely combinational logic (that is, with no registers along the way) produced by operations on registers, and delays caused by load/store operations on memory. Thus, in the general case an operation cannot be guaranteed to complete within one clock cycle. We will keep it simple: a value will be passed around not as a bare UInt but as a (UInt, Bool) pair: the first element of the pair is the value itself, the second is a flag that the value is already correct. After all, there is little point in reading from memory while the address is not yet valid, and writing in that case is outright impossible.


The eBPF bytecode execution model assumes some kind of RAM with 64-bit addressing, as well as a small set of 64-bit registers (R0 through R10). A primitive recursive construction algorithm is proposed:
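A rough sketch of how such a construction might look (my own illustration with invented names such as EmitSketch and emit, not the article's code): every eBPF register is tracked as a (value, valid) pair, the instruction list is walked once, and each result is wired up from the pairs produced by earlier instructions. Only two opcodes are shown; the BpfCircuitConstructor interface used here appears a little further down.

```scala
import chisel3._

object EmitSketch {
  type Lazy = (UInt, Bool) // (value, "this value is already correct")

  // Pure ALU operations are combinational, so their result is valid as soon
  // as the inputs are; a load becomes valid only when the memory answer
  // arrives through doMemLoad.
  def emit(insns: Seq[BpfInsn], ctor: BpfCircuitConstructor,
           init: Map[Int, Lazy]): Map[Int, Lazy] =
    insns.foldLeft(init) { case (regs, insn) =>
      insn.opcode match {
        case 0x07 => // BPF_ALU64 | BPF_ADD | BPF_K: dst += imm
          val (v, ok) = regs(insn.dst)
          val Left(k) = insn.imm // constants only in this sketch
          regs.updated(insn.dst, (v + k.U, ok))
        case 0x79 => // BPF_LDX | BPF_MEM | BPF_DW: dst = *(u64 *)(src + off)
          val (addr, addrOk) = regs(insn.src)
          // (the sketch assumes a non-negative offset)
          regs.updated(insn.dst, ctor.doMemLoad(addr + insn.offset.U, ctor.u64, addrOk))
        // ... remaining opcodes elided ...
      }
    }
}
```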



Thus, the operations will be performed as parallel as possible, rather than step by step. Note, however, that all operations on any particular byte of memory must naturally be fully ordered, otherwise the result is unpredictable, UB. That is, *addr += 1 is fine: the write simply cannot start until the read completes (trivially, because until then we do not know what to write). But *addr += 1; return *addr; calmly returned zero or something like that. Perhaps it would be worth debugging (maybe a more subtle problem hides behind it), but such an access pattern is a so-so idea in any case, since one would have to track which memory addresses have already been touched, whereas I would like to validate values statically wherever possible. That is exactly what will be done for fixed-size global variables.


The result is the BpfCircuitConstructor trait, in which the methods doMemLoad, doMemStore and resolveSymbol are left unimplemented:


```scala
trait BpfCircuitConstructor {
  // ...
  sealed abstract class LdStType(val lgsize: Int) {
    val byteSize = 1 << lgsize
    val bitSize = byteSize * 8
    val mask: UInt = if (bitSize == 64) mask64 else ((1L << bitSize) - 1).U
  }
  case object u8 extends LdStType(0)
  case object u16 extends LdStType(1)
  case object u32 extends LdStType(2)
  case object u64 extends LdStType(3)

  def doMemLoad(addr: UInt, tpe: LdStType, valid: Bool): (UInt, Bool)
  def doMemStore(addr: UInt, tpe: LdStType, data: UInt, valid: Bool): Bool

  sealed trait Resolved {
    def asPlainValue: UInt
    def load(ctx: Context, offset: Int, tpe: LdStType, valid: Bool): LazyData
    def store(offset: Int, tpe: LdStType, data: UInt, valid: Bool): Bool
  }

  def resolveSymbol(sym: BpfLoader.Symbol): Resolved
  // ...
}
```

CPU core integration


To start with, I decided to take the simple route: attach to the processor core via the standard RoCC (Rocket Custom Coprocessor) protocol. As far as I understand, this extension is supported not by all RISC-V-compatible cores, but only by Rocket and BOOM (Berkeley Out-of-Order Machine); that is why, when the compiler work was being upstreamed, the custom0 through custom3 assembler mnemonics responsible for accelerator commands were thrown out of binutils.


In general, up to four RoCC accelerators can be added to each Rocket / BOOM processor core via the config; there are also implementation examples:


Configs.scala:


```scala
class WithRoccExample extends Config((site, here, up) => {
  case BuildRoCC => List(
    (p: Parameters) => {
      val accumulator = LazyModule(new AccumulatorExample(OpcodeSet.custom0, n = 4)(p))
      accumulator
    },
    (p: Parameters) => {
      val translator = LazyModule(new TranslatorExample(OpcodeSet.custom1)(p))
      translator
    },
    (p: Parameters) => {
      val counter = LazyModule(new CharacterCountExample(OpcodeSet.custom2)(p))
      counter
    })
})
```

The corresponding implementation is in the LazyRoCC.scala file.


The accelerator implementation consists of two classes already familiar from the memory controller: one is inherited from LazyRoCC, the other from LazyRoCCModuleImp. The second class has an io port of type RoCCIO, which contains the cmd request port, the resp response port, the mem port for accessing the L1D cache, the busy and interrupt outputs, and an exception input. There are also a page table walker port and an FPU port, which we do not seem to need yet (there is no floating-point arithmetic in eBPF anyway). For now I just want to try something out with this approach, so I will not touch interrupt. Also, as I understand it, there is a TileLink interface for uncached memory access, but I will leave it alone for now as well.
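For orientation, a minimal skeleton of such a pair of classes might look roughly like this (a sketch against the usual freechips.rocketchip.tile API; the class names are invented, and the real accelerator in this project is, of course, more involved):

```scala
import chisel3._
import freechips.rocketchip.config.Parameters
import freechips.rocketchip.tile._

// Sketch: the lazy/imp pair that the RoCC machinery expects.
class BpfAccelerator(opcodes: OpcodeSet)(implicit p: Parameters)
    extends LazyRoCC(opcodes) {
  override lazy val module = new BpfAcceleratorImp(this)
}

class BpfAcceleratorImp(outer: BpfAccelerator)(implicit p: Parameters)
    extends LazyRoCCModuleImp(outer) {
  // io has type RoCCIO: cmd / resp / mem / busy / interrupt / exception, ...
  val funct = io.cmd.bits.inst.funct   // selects the generated sub-instruction
  io.cmd.ready := true.B               // "always ready" placeholder
  io.busy := false.B                   // a real implementation tracks outstanding work
  io.interrupt := false.B
  io.resp.valid := false.B             // the generated circuit drives this when done
  io.resp.bits.rd := io.cmd.bits.inst.rd
  io.resp.bits.data := 0.U
  io.mem <> DontCare                   // this skeleton issues no memory requests
  io.mem.req.valid := false.B
}
```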


Request serializer


So, we have a port for accessing the cache, but only one. Meanwhile a function can, for example, increment a variable (which can, at best, be turned into a single atomic operation) or transform it non-trivially by loading, updating and storing it back. Finally, a single instruction can make several unrelated requests. Maybe this is not the best idea performance-wise but, on the other hand, why not, say, load three words (quite possibly already sitting in the cache), process them in parallel with combinational logic (which fits into a single cycle) and store the result. Therefore we need some circuitry that "resolves" parallel access attempts onto the single cache port.


The logic will be roughly as follows: at the beginning of generating the implementation of a specific sub-instruction (the 7-bit funct field in RoCC terms), an instance of the request serializer is created (making one global instance seems rather harmful to me, because it would create a bunch of extra dependencies between requests that can never execute simultaneously, and would most likely squander Fmax). Then every created "storer" / "loader" registers itself with the serializer, in a live queue, so to speak. On each clock cycle the first outstanding request in registration order is selected, and it is granted access on the next cycle. Naturally, such logic needs to be well covered with tests (I honestly do not have many yet, so this is not real verification, just the minimum needed to get anything intelligible at all). For testing I used the standard PeekPokeTester from the more or less official Chisel testing component, which I have already described once.
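A sketch of what such a test might look like for the Serializer class shown just below (the harness and the port names are invented; the actual tests in the repository differ): two requesters compete for one slot, the first registered one wins, and once next fires the second one gets its turn.

```scala
import chisel3._
import chisel3.iotesters.PeekPokeTester

// Hypothetical harness: two competing requesters sharing one "grant" slot.
class SerializerHarness extends Module {
  val io = IO(new Bundle {
    val computing = Input(Bool())
    val next      = Input(Bool())
    val want      = Input(Vec(2, Bool()))
    val granted   = Output(Vec(2, Bool()))
  })
  val ser = new Serializer(io.computing, io.next)
  io.granted := VecInit(io.want.map(w => ser.nextReq(w)._1))
}

class SerializerSpec(c: SerializerHarness) extends PeekPokeTester(c) {
  poke(c.io.computing, 1)
  poke(c.io.next, 0)
  poke(c.io.want(0), 1)
  poke(c.io.want(1), 1)
  step(1)
  expect(c.io.granted(0), 1)  // the first registered requester wins first
  expect(c.io.granted(1), 0)
  poke(c.io.next, 1)          // pretend the granted request left for the cache
  step(1)
  expect(c.io.granted(1), 1)  // now the second one gets its turn
}
```

Such a spec would be driven with something like chisel3.iotesters.Driver(() => new SerializerHarness)(c => new SerializerSpec(c)).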


The serializer itself came out as this sort of contraption:


```scala
import scala.collection.mutable.ArrayBuffer
import chisel3._

class Serializer(isComputing: Bool, next: Bool) {
  // "Once raised, stays raised" until the end of the current computation.
  def monotonic(x: Bool): Bool = {
    val res = WireInit(false.B)
    val prevRes = RegInit(false.B)
    prevRes := res && isComputing
    res := (x || prevRes) && isComputing
    res
  }

  private def noone(bs: Seq[Bool]): Bool = !bs.foldLeft(false.B)(_ || _)

  private val previousReqs = ArrayBuffer[Bool]()

  def nextReq(x: Bool): (Bool, Int) = {
    val enable = monotonic(x)
    val result = RegInit(false.B)
    val retired = RegInit(false.B)
    val doRetire = result && next
    val thisReq = enable && !retired && !doRetire
    val reqWon = thisReq && noone(previousReqs)
    when (isComputing) {
      when (reqWon) {
        result := true.B
      }
      when (doRetire) {
        result := false.B
        retired := true.B
      }
    } otherwise {
      result := false.B
      retired := false.B
    }
    previousReqs += thisReq
    (result, previousReqs.length - 1)
  }
}
```

Please note that here ordinary Scala code is calmly executed in the middle of constructing a digital circuit. If you look closely, you can even spot an ArrayBuffer into which pieces of the circuit are stacked (Boolean is a Scala type, while Bool is a Chisel type representing live hardware, not some boolean value known while the generator runs).
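A tiny side-by-side illustration of that distinction (not from the project):

```scala
import chisel3._

object NegateExamples {
  // Scala-level Boolean: resolved while the generator runs, so `if` decides
  // which hardware gets emitted at all.
  def maybeNegate(x: UInt, negate: Boolean): UInt =
    if (negate) ~x else x

  // Chisel-level Bool: a one-bit wire in the emitted circuit, so `Mux` builds
  // hardware that chooses at run time.
  def selectNegate(x: UInt, negate: Bool): UInt =
    Mux(negate, ~x, x)
}
```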


Working with L1D Cache


Work with the cache mostly goes through the io.mem.req request port and the io.mem.resp response port. The request port carries the traditional ready and valid signals: with the first the cache tells us it is ready to accept a request, with the second we say that the request is formed and has the correct structure; on a clock edge with ready && valid high, the request is considered accepted. In some interfaces of this kind there is also a "non-revocability" requirement: once valid has been raised, it must not be dropped until the next positive clock edge at which ready && valid holds (for convenience this expression can be obtained with the fire() method).
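In Chisel terms this is the usual Decoupled handshake; schematically it looks like this (a sketch, not the actual RocketChip code, and the real cache port carries a HellaCacheReq rather than a bare address):

```scala
import chisel3._
import chisel3.util._

// Sketch: driving a Decoupled request port and detecting acceptance.
class ReqDriver extends Module {
  val io = IO(new Bundle {
    val start = Input(Bool())
    val addr  = Input(UInt(64.W))
    val req   = Decoupled(UInt(64.W))
    val sent  = Output(Bool())
  })
  io.req.valid := io.start      // valid is asserted when the request is fully formed
  io.req.bits  := io.addr
  io.sent      := io.req.fire() // fire() is ready && valid: the cycle the request is accepted
}
```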


The resp response port, in turn, has only the valid flag, and it is the processor's problem to rake in the responses within a single clock cycle: by assumption it is "always ready", and fire() returns simply valid.


Also, as I already said, you cannot issue requests whenever you feel like it: writing "something, I know not what" is not allowed, and reading back a value that will later be overwritten based on the value just read is also somewhat strange. The Serializer class already sorts this out; we only need to feed it the signal that the current request has actually gone off to the cache: next = io.mem.req.fire(). All that remains is to make sure that the "loader" updates its response only when it has really arrived, neither earlier nor later. The convenient holdUnless method exists for exactly that. The result is approximately the following implementation:


```scala
class Constructor extends BpfCircuitConstructor {
  val serializer = new Serializer(isComputing, io.mem.req.fire())

  override def doMemLoad(addr: UInt, tpe: LdStType, valid: Bool): (UInt, Bool) = {
    val (doReq, thisTag) = serializer.nextReq(valid)
    when (doReq) {
      io.mem.req.bits.addr := addr
      require((1 << io.mem.req.bits.tag.getWidth) > thisTag)
      io.mem.req.bits.tag := thisTag.U
      io.mem.req.bits.cmd := M_XRD
      io.mem.req.bits.typ := (4 | tpe.lgsize).U
      io.mem.req.bits.data := 0.U
      io.mem.req.valid := true.B
    }
    val doResp = isComputing &&
      serializer.monotonic(doReq && io.mem.req.fire()) &&
      io.mem.resp.valid &&
      io.mem.resp.bits.tag === thisTag.U &&
      io.mem.resp.bits.cmd === M_XRD
    (io.mem.resp.bits.data holdUnless doResp, serializer.monotonic(doResp))
  }

  override def doMemStore(addr: UInt, tpe: LdStType, data: UInt, valid: Bool): Bool = {
    val (doReq, thisTag) = serializer.nextReq(valid)
    when (doReq) {
      io.mem.req.bits.addr := addr
      require((1 << io.mem.req.bits.tag.getWidth) > thisTag)
      io.mem.req.bits.tag := thisTag.U
      io.mem.req.bits.cmd := M_XWR
      io.mem.req.bits.typ := (4 | tpe.lgsize).U
      io.mem.req.bits.data := data
      io.mem.req.valid := true.B
    }
    serializer.monotonic(doReq && io.mem.req.fire())
  }

  override def resolveSymbol(sym: BpfLoader.Symbol): Resolved = sym match {
    case BpfLoader.Symbol(symName, _, size, ElfConstants.Elf64_Shdr.SHN_COMMON, false) if size <= 8 =>
      RegisterReference(regs.getOrElseUpdate(symName, RegInit(0.U(64.W))))
  }
}
```

An instance of this class is created for each generated subinstruction.


Not everything on the heap is a global variable


Hmm, so what would be a model example? What functionality would I like to achieve? AFL instrumentation, of course! In its classic form it looks like this:


```c
#include <stdint.h>

extern uint8_t *__afl_area_ptr;
extern uint64_t prev;

void inst_branch(uint64_t tag) {
  __afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
  prev = tag;
}
```

As you can see, it contains a more or less straightforward load and store (with an increment in between) of one byte from __afl_area_ptr, but prev is just begging to become a register!


This is exactly why the Resolved interface is needed: it can wrap either an ordinary memory address or a register reference. For now I only consider scalar registers of 1, 2, 4 or 8 bytes that are always read at offset zero, so for registers the ordering of accesses can be implemented relatively calmly. Here it is very useful to know that prev must first be read and used to compute the index, and only then overwritten.
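A simplified sketch of what the register-backed variant could look like. The real implementation behind RegisterReference uses the Context and LazyData types that are not shown in the excerpt above, so the signatures here are deliberately reduced to the (UInt, Bool) convention, and afterReads stands in for whatever condition orders the write after all earlier reads.

```scala
import chisel3._

// Simplified stand-in, not the project's actual Resolved implementation:
// a global variable of size <= 8 bytes backed by a hardware register.
class RegisterReferenceSketch(reg: UInt) {
  private def mask(byteSize: Int): UInt =
    ((BigInt(1) << (byteSize * 8)) - 1).U

  // Reads are purely combinational, so they are valid whenever the request is.
  def load(byteSize: Int, valid: Bool): (UInt, Bool) =
    (reg & mask(byteSize), valid)

  // The write must come after every read emitted earlier for this
  // sub-instruction; `afterReads` represents that condition.
  def store(byteSize: Int, data: UInt, valid: Bool, afterReads: Bool): Bool = {
    when (valid && afterReads) { reg := data & mask(byteSize) }
    valid && afterReads
  }
}
```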


And now the instrumentation


At some point we end up with a separate and more or less working accelerator with a RoCC interface. What now? Re-implement all of it again, pushing it through the processor pipeline? It seemed to me that fewer crutches would be needed if the coprocessor were simply activated in parallel with the instrumented instruction, with an automatically supplied funct value. Admittedly, I still had to suffer for it: I even learned to use SignalTap, because the debugging here is nearly blind, and with a five-minute recompilation after the slightest change (except for bootrom changes, those are fast) that is already too much.


As a result, the instruction decoder was patched and the pipeline was slightly "straightened out" to account for the fact that, whatever the decoder said about the original instruction, a spontaneously activated RoCC unit does not by itself imply a long-latency write to the destination register, as happens with a division operation or a data cache miss.


In general, an instruction description is a pair of ([bit pattern that recognizes the instruction], [set of values configuring the data path blocks of the processor core]). For example, default (an unrecognized instruction) looks like this (taken as is from IDecode.scala; on desktop Habr it looks, frankly, ugly):


```scala
def default: List[BitPat] =
  //           jal                                                          renf1               fence.i
  //   val     | jalr                                                       | renf2             |
  //   | fp_val| | renx2                                                    | | renf3           |
  //   | | rocc| | | renx1      s_alu1                        mem_val       | | | wfd           |
  //   | | | br| | | |  s_alu2  |       imm    dw     alu     | mem_cmd     | | | | mul         |
  //   | | | | | | | |  |       |       |      |      |       | |  mem_type | | | | | div       | fence
  //   | | | | | | | |  |       |       |      |      |       | |  |        | | | | | | wxd     | | amo
  //   | | | | | | | | scie     |       |      |      |       | |  |        | | | | | | |       | | | dp
  List(N,X,X,X,X,X,X,X,X,A2_X,  A1_X,   IMM_X, DW_X,  FN_X,   N,M_X, MT_X,  X,X,X,X,X,X,X,CSR.X,X,X,X,X)
```

... and a typical description of one of the extensions in Rocket core is implemented like this:


```scala
class IDecode(implicit val p: Parameters) extends DecodeConstants {
  val table: Array[(BitPat, List[BitPat])] = Array(
    BNE->  List(Y,N,N,Y,N,N,Y,Y,N,A2_RS2, A1_RS1, IMM_SB,DW_X,  FN_SNE,  N,M_X, MT_X, N,N,N,N,N,N,N,CSR.N,N,N,N,N),
    BEQ->  List(Y,N,N,Y,N,N,Y,Y,N,A2_RS2, A1_RS1, IMM_SB,DW_X,  FN_SEQ,  N,M_X, MT_X, N,N,N,N,N,N,N,CSR.N,N,N,N,N),
    BLT->  List(Y,N,N,Y,N,N,Y,Y,N,A2_RS2, A1_RS1, IMM_SB,DW_X,  FN_SLT,  N,M_X, MT_X, N,N,N,N,N,N,N,CSR.N,N,N,N,N),
    BLTU-> List(Y,N,N,Y,N,N,Y,Y,N,A2_RS2, A1_RS1, IMM_SB,DW_X,  FN_SLTU, N,M_X, MT_X, N,N,N,N,N,N,N,CSR.N,N,N,N,N),
    BGE->  List(Y,N,N,Y,N,N,Y,Y,N,A2_RS2, A1_RS1, IMM_SB,DW_X,  FN_SGE,  N,M_X, MT_X, N,N,N,N,N,N,N,CSR.N,N,N,N,N),
    BGEU-> List(Y,N,N,Y,N,N,Y,Y,N,A2_RS2, A1_RS1, IMM_SB,DW_X,  FN_SGEU, N,M_X, MT_X, N,N,N,N,N,N,N,CSR.N,N,N,N,N),
    // ...
```

The point is that RISC-V (not just RocketChip, but the instruction set architecture itself) is deliberately split into the mandatory subset I (integer operations) plus optional subsets such as M (integer multiplication and division), A (atomics) and so on, and this modularity is supported throughout the code.


As a result, the original method


```scala
def decode(inst: UInt, table: Iterable[(BitPat, List[BitPat])]) = {
  val decoder = DecodeLogic(inst, default, table)
  val sigs = Seq(legal, fp, rocc, branch, jal, jalr, rxs2, rxs1, scie, sel_alu2,
                 sel_alu1, sel_imm, alu_dw, alu_fn, mem, mem_cmd, mem_type,
                 rfs1, rfs2, rfs3, wfd, mul, div, wxd, csr, fence_i, fence, amo, dp)
  sigs zip decoder map { case (s, d) => s := d }
  this
}
```

has been replaced by


the same, but with an extra decoder for the instrumentation handlers and a separate flag recording why rocc was activated:
```scala
def decode(inst: UInt, table: Iterable[(BitPat, List[BitPat])],
           handlers: Seq[OpcodeHandler]) = {
  val decoder = DecodeLogic(inst, default, table)
  val sigs = Seq(legal, fp, rocc_explicit, branch, jal, jalr, rxs2, rxs1, scie,
                 sel_alu2, sel_alu1, sel_imm, alu_dw, alu_fn, mem, mem_cmd, mem_type,
                 rfs1, rfs2, rfs3, wfd, mul, div, wxd, csr, fence_i, fence, amo, dp)
  sigs zip decoder map { case (s, d) => s := d }

  if (handlers.isEmpty) {
    handler_rocc := false.B
    handler_rocc_funct := 0.U
  } else {
    val handlerTable: Seq[(BitPat, List[BitPat])] = handlers.map {
      case OpcodeHandler(pattern, funct) => pattern -> List(Y, BitPat(funct.U))
    }
    val handlerDecoder = DecodeLogic(inst, List(N, BitPat(0.U)), handlerTable)
    Seq(handler_rocc, handler_rocc_funct) zip handlerDecoder map { case (s, d) => s := d }
  }

  rocc := rocc_explicit || handler_rocc
  this
}
```

Of the changes to the processor pipeline, the least obvious was probably this one:


```diff
  io.rocc.exception := wb_xcpt && csr.io.status.xs.orR
  io.rocc.cmd.bits.status := csr.io.status
  io.rocc.cmd.bits.inst := new RoCCInstruction().fromBits(wb_reg_inst)
+ when (wb_ctrl.handler_rocc) {
+   io.rocc.cmd.bits.inst.opcode := 0x0b.U // custom0
+   io.rocc.cmd.bits.inst.funct := wb_ctrl.handler_rocc_funct
+   io.rocc.cmd.bits.inst.xd := false.B
+   io.rocc.cmd.bits.inst.rd := 0.U
+ }
  io.rocc.cmd.bits.rs1 := wb_reg_wdata
  io.rocc.cmd.bits.rs2 := wb_reg_rs2
```

Clearly, some parameters of the request to the accelerator need to be adjusted: no response is written back to a register, and funct is equal to whatever the handler decoder returned. But there is a slightly less obvious change: this command does not go to the accelerator directly (there are four of them; which one would it be?) but through a router, so we have to pretend that the command has opcode == custom0 (yes, do process it, and specifically on accelerator number zero!).


Check


In fact, this article assumes a continuation in which an attempt will be made to bring this approach to a more or less production-ready level. At the very least, we need to learn how to save and restore the context (the state of the coprocessor registers) when switching tasks. In the meantime, let us check that it at least works in greenhouse conditions:


```c
#include <stdint.h>

uint64_t counter;

uint64_t funct1(uint64_t x, uint64_t y) {
  return __builtin_popcountl(x);
}

uint64_t funct2(uint64_t x, uint64_t y) {
  return (x + y) * (x - y);
}

uint64_t instMUL() {
  counter += 1;
  *((uint64_t *)0x81005000) = counter;
  return 0;
}
```

Now add the following to main in bootrom/sdboot/sd.c:


 #include "/path/to/freedom-u-sdk/riscv-pk/machine/encoding.h" // ... ////    -   RoCC #define STR1(x) #x #define STR(x) STR1(x) #define EXTRACT(a, size, offset) (((~(~0 << size) << offset) & a) >> offset) #define CUSTOMX_OPCODE(x) CUSTOM_##x #define CUSTOM_0 0b0001011 #define CUSTOM_1 0b0101011 #define CUSTOM_2 0b1011011 #define CUSTOM_3 0b1111011 #define CUSTOMX(X, rd, rs1, rs2, funct) \ CUSTOMX_OPCODE(X) | \ (rd << (7)) | \ (0x7 << (7+5)) | \ (rs1 << (7+5+3)) | \ (rs2 << (7+5+3+5)) | \ (EXTRACT(funct, 7, 0) << (7+5+3+5+5)) #define CUSTOMX_R_R_R(X, rd, rs1, rs2, funct) \ asm ("mv a4, %[_rs1]\n\t" \ "mv a5, %[_rs2]\n\t" \ ".word "STR(CUSTOMX(X, 15, 14, 15, funct))"\n\t" \ "mv %[_rd], a5" \ : [_rd] "=r" (rd) \ : [_rs1] "r" (rs1), [_rs2] "r" (rs2) \ : "a4", "a5"); int main(void) { // ... //  RoCC extension write_csr(mstatus, MSTATUS_XS & (MSTATUS_XS >> 1)); //   bootrom       uint64_t res; CUSTOMX_R_R_R(0, res, 0xabcdef, 0x123456, 1); CUSTOMX_R_R_R(0, res, 0xabcdef, 0x123456, 2); // ...     uint64_t x = 1; for (int i = 0; i < 123; ++i) x *= *(volatile uint8_t *)0x80000000; kputc('0' + x % 10); //   !!! // ... } 

The write_csr macro comes from encoding.h, as do the custom0 through custom3 opcode values. If the extension is not enabled via mstatus, executing the custom instruction raises an illegal instruction exception. The bulky defines are there because the "ordinary" binutils shipped with RocketChip no longer knows the customX mnemonics, so the instruction word has to be assembled by hand via .word.


sdboot rebuilds quickly, gets written back to the board, and we can check the result.


Checking through GDB:


```
$ /hdd/trosinenko/rocket-tools/bin/riscv32-unknown-elf-gdb -q -ex "target remote :3333" -ex "set directories bootrom" builds/zeowaa-e115/sdboot.elf
Reading symbols from builds/zeowaa-e115/sdboot.elf...done.
Remote debugging using :3333
0x0000000000000000 in ?? ()
(gdb) x/d 0x81005000
0x81005000:     123
(gdb) set variable $pc=0x10000
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x0000000000010488 in crc16_round (data=<optimized out>, crc=<optimized out>) at sd.c:151
151       crc ^= data;
(gdb) x/d 0x81005000
0x81005000:     246
```

Checking funct1:
```
$ /hdd/trosinenko/rocket-tools/bin/riscv32-unknown-elf-gdb -q -ex "target remote :3333" -ex "set directories bootrom" builds/zeowaa-e115/sdboot.elf
Reading symbols from builds/zeowaa-e115/sdboot.elf...done.
Remote debugging using :3333
0x0000000000010194 in main () at sd.c:247
247       CUSTOMX_R_R_R(0, res, 0xabcdef, 0x123456, 1);
(gdb) set variable $a5=0
(gdb) set variable $pc=0x10194
(gdb) set variable $a4=0xaa
(gdb) display/10i $pc-10
1: x/10i $pc-10
   0x1018a <main+46>:   sw      a3,124(a3)
   0x1018c <main+48>:   addiw   a0,a0,1110
   0x10190 <main+52>:   mv      a4,s0
   0x10192 <main+54>:   mv      a5,a0
=> 0x10194 <main+56>:   0x2f7778b
   0x10198 <main+60>:   mv      s0,a5
   0x1019a <main+62>:   lbu     a5,0(a1)
   0x1019e <main+66>:   addiw   a3,a3,-1
   0x101a0 <main+68>:   mul     a2,a2,a5
   0x101a4 <main+72>:   bnez    a3,0x1019a <main+62>
(gdb) display/x $a5
2: /x $a5 = 0x0
(gdb) si
0x0000000000010198      247       CUSTOMX_R_R_R(0, res, 0xabcdef, 0x123456, 1);
1: x/10i $pc-10
   0x1018e <main+50>:   li      a0,25
   0x10190 <main+52>:   mv      a4,s0
   0x10192 <main+54>:   mv      a5,a0
   0x10194 <main+56>:   0x2f7778b
=> 0x10198 <main+60>:   mv      s0,a5
   0x1019a <main+62>:   lbu     a5,0(a1)
   0x1019e <main+66>:   addiw   a3,a3,-1
   0x101a0 <main+68>:   mul     a2,a2,a5
   0x101a4 <main+72>:   bnez    a3,0x1019a <main+62>
   0x101a6 <main+74>:   li      a5,10
2: /x $a5 = 0x4
(gdb) set variable $a4=0xaabc
(gdb) set variable $pc=0x10194
(gdb) si
0x0000000000010198      247       CUSTOMX_R_R_R(0, res, 0xabcdef, 0x123456, 1);
1: x/10i $pc-10
   0x1018e <main+50>:   li      a0,25
   0x10190 <main+52>:   mv      a4,s0
   0x10192 <main+54>:   mv      a5,a0
   0x10194 <main+56>:   0x2f7778b
=> 0x10198 <main+60>:   mv      s0,a5
   0x1019a <main+62>:   lbu     a5,0(a1)
   0x1019e <main+66>:   addiw   a3,a3,-1
   0x101a0 <main+68>:   mul     a2,a2,a5
   0x101a4 <main+72>:   bnez    a3,0x1019a <main+62>
   0x101a6 <main+74>:   li      a5,10
2: /x $a5 = 0x9
```

Source



Source: https://habr.com/ru/post/461577/

