I originally wrote this document a few years ago, while working as an execution core verification engineer at ARM. Of course, my opinion was shaped by in-depth work with the execution cores of different processors. So please take it with a grain of salt: maybe I'm too categorical.
However, I still believe that the creators of RISC-V could have done much better. On the other hand, if I were designing a 32-bit or 64-bit processor today, I would probably implement this very architecture to take advantage of the existing tooling.
The article originally described version 2.0 of the RISC-V instruction set. I have made some updates for version 2.2.
Original Foreword: Some Personal Opinion
The RISC-V instruction set has been reduced to an absolute minimum. Much attention is paid to minimizing the number of instructions, normalizing the encoding, and so on. This drive for minimalism has led to false orthogonality (such as reusing the same instruction for branches, calls, and returns) and to mandatory verbosity, which inflates both code size and instruction count.
For example, here is the C code:
int readidx(int *p, size_t idx) { return p[idx]; }
This is a simple case of indexing an array, a very common operation. This is the compilation for x86_64:
mov eax, [rdi+rsi*4]
ret
or ARM:
ldr r0, [r0, r1, lsl #2]
bx lr // return
However, for RISC-V, the following code is required:
slli a1, a1, 2
add a0, a0, a1
lw a0, 0(a0)
jalr x0, x1, 0 // return
RISC-V's simplification makes the decoder (i.e., the CPU front end) simpler at the cost of executing more instructions. But scaling the width of a pipeline is a hard problem, while decoding slightly (or highly) irregular instructions is well understood (the main difficulty arises when it is hard to determine the length of an instruction: this is especially evident in the x86 instruction set with its numerous prefixes).
The simplification of an instruction set should not be taken to the limit. A load addressed by a register plus a shifted register is a simple and very common instruction in programs, and it is very easy for a processor to implement it efficiently. If the processor cannot implement the instruction directly, it can break it down into its components relatively easily; this is a much simpler problem than fusing sequences of simple operations.
We must distinguish the “complex” instructions specific to CISC processors (complicated, rarely used, and inefficient) from the “functional” instructions common to both CISC and RISC processors, which combine a short sequence of operations. The latter are used frequently and perform well.
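To make the comparison concrete, here is a minimal C sketch (not part of the original article) of what the two styles compute. `load_indexed_fused` stands for a single register-plus-scaled-register load, while `load_indexed_split` mirrors the shift, add, load sequence that base RISC-V executes. Both function names are hypothetical, and the sketch assumes a 4-byte `int`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One "functional" operation: load from base + (idx << 2). */
int load_indexed_fused(const int *base, size_t idx)
{
    return base[idx];
}

/* The same access split into three simple steps.
   Assumption: sizeof(int) == 4, hence the shift by 2. */
int load_indexed_split(const int *base, size_t idx)
{
    uintptr_t offset = (uintptr_t)idx << 2;       /* slli: scale the index    */
    uintptr_t addr = (uintptr_t)base + offset;    /* add: effective address   */
    int value;
    memcpy(&value, (const void *)addr, sizeof value); /* lw: load the word    */
    return value;
}
```

Both functions return the same value for any valid index; the difference is only in how many separate operations the core has to fetch, decode, and track.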
Mediocre implementation
- Almost unlimited extensibility. Although this is the goal of RISC-V, it creates a fragmented, incompatible ecosystem that must be managed with extreme caution.
- The same instruction (JALR) is used for calls, returns, and register-indirect branches, so additional decoding is required for branch prediction
  - Call: Rd = R1
  - Return: Rd = R0, Rs = R1
  - Indirect branch: Rd = R0, Rs ≠ R1
  - (Strange branch: Rd ≠ R0, Rd ≠ R1)
- The variable-length encoding is not self-synchronizing (this is common: x86 and Thumb-2, for example, have a similar problem, but it causes various problems for both implementation and security, for example return-oriented programming, i.e. ROP attacks)
- RV64I requires sign extension of all 32-bit values. As a result, the upper half of 64-bit registers cannot be used to hold intermediate results, which leads to unnecessary special handling of the upper half of the registers. Zero extension would be preferable (it reduces bit toggling and can usually be optimized by tracking an “is zero” bit once the upper half is known to be zero)
- Multiplication is optional. Although fast multiplier blocks can occupy a fairly substantial area on tiny dies, it is always possible to use slightly slower circuits that reuse the existing ALU over several cycles per multiplication.
- LR/SC come with a strict forward-progress requirement for a limited subset of uses. Although this restriction is rather tight, it potentially creates problems for small implementations (especially those without a cache)
  - This looks like a substitute for a CAS instruction; see the comment below
- The FP sticky bits and the rounding mode are in the same register. This requires serializing the FP pipeline if an RMW operation is performed to change the rounding mode.
- FP instructions are encoded for 32-, 64-, and 128-bit precision, but not 16-bit (which is much more common in hardware than 128-bit)
  - This could easily be fixed: the 2'b10 encoding is free
  - Update: a decimal FP placeholder appeared in version 2.2, but there is still no half-precision placeholder. It boggles the mind.
- The way FP values are represented in the FP register file is not defined, yet it is observable (via load/store)
  - Emulator authors will hate you
  - Migration of virtual machines may become impossible
  - Update: version 2.2 requires NaN-boxing of narrower values held in wider registers
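Returning to the JALR point in this list, the extra decoding a branch predictor has to perform can be sketched in C as follows. This is an illustration only, not how any particular core does it: the enum labels and the function name are hypothetical, while the field positions (opcode 0x67, Rd in bits 11:7, Rs in bits 19:15) follow the base instruction encoding.

```c
#include <stdint.h>

/* Hypothetical labels for the JALR usage classes listed above. */
enum jalr_kind { JALR_CALL, JALR_RETURN, JALR_INDIRECT, JALR_OTHER, NOT_JALR };

/* Classify a 32-bit instruction word by inspecting its Rd and Rs fields. */
enum jalr_kind classify_jalr(uint32_t insn)
{
    uint32_t opcode = insn & 0x7f;          /* bits [6:0]   */
    uint32_t rd     = (insn >> 7) & 0x1f;   /* bits [11:7]  */
    uint32_t funct3 = (insn >> 12) & 0x7;   /* bits [14:12] */
    uint32_t rs     = (insn >> 15) & 0x1f;  /* bits [19:15] */

    if (opcode != 0x67 || funct3 != 0)
        return NOT_JALR;
    if (rd == 1)
        return JALR_CALL;                   /* Rd = R1            */
    if (rd == 0)                            /* Rd = R0            */
        return rs == 1 ? JALR_RETURN : JALR_INDIRECT;
    return JALR_OTHER;                      /* Rd is neither R0 nor R1 */
}
```

With separate call and return opcodes, the predictor could pick the return-address stack action from the opcode alone instead of comparing two register fields.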
Poor
- There are no condition codes; compare-and-branch instructions are used instead. This is not a problem in itself, but the consequences are unpleasant:
  - Reduced encoding space in conditional branches, because one or two register specifiers have to be encoded
  - No conditional select (useful for highly unpredictable branches)
  - No add with carry or subtract with carry/borrow
  - (Note that this is still better than instruction sets that write flags to a general-purpose register and then branch on the resulting flags)
- High-precision counters appear to be required by the user-level ISA. In practice, exposing them to applications is an excellent vector for side-channel attacks
- Multiplication and division are part of the same extension, and it seems that if one is implemented, the other must be too. Multiplication is much simpler than division and is common on most processors, whereas division is not.
- There are no atomic instructions in the base instruction set architecture. Multi-core microcontrollers are becoming more common, and atomic instructions like LL/SC are inexpensive (a minimal implementation within a single [multi-core] processor needs only 1 bit of processor state)
- LR/SC are in the same extension as the more complex atomic instructions, which limits flexibility for small implementations
- The general atomic instructions (other than LR/SC) do not include a CAS primitive
  - The rationale is to avoid the need for an instruction that reads five registers (Addr, CmpHi:CmpLo, SwapHi:SwapLo), but such an instruction would likely impose less implementation overhead than the guaranteed-forward-progress LR/SC that is provided as a replacement
- Atomic instructions are provided for 32-bit and 64-bit values, but not for 8-bit or 16-bit ones
- For RV32I, there is no way to transfer a DP FP value between the integer and FP register files except through memory. That is, a 64-bit double-precision floating-point number cannot be assembled directly from 32-bit integer registers: the intermediate value must first be written to memory and then loaded into the FP register file from there
- For example, the 32-bit ADD in RV32I and the 64-bit ADD in RV64I share the same encoding, and RV64I adds a separate encoding for ADD.W. This is an unnecessary complication for a processor that implements both instructions; it would have been preferable to add a new 64-bit encoding instead.
- No MOV instruction. The assembler translates the MV mnemonic into ADDI: MV rD, rS -> ADDI rD, rS, 0. High-performance processors, particularly those that reorder instructions, typically optimize MOV instructions. RISC-V chose an instruction with a 12-bit immediate operand as the canonical form of MV, so recognizing a move means checking that all 12 immediate bits are zero.
  - In the absence of a MOV, ADD rD, rS, r0 actually becomes preferable to the canonical MV, since it is easier to decode and CPUs usually optimize operations on the zero register (r0)
Awful
- JAL spends 5 bits on encoding the link register, which is always R1 (or R0 for branches)
  - This means that RV32I has a 21-bit branch displacement. This is not enough for large applications, such as web browsers, without resorting to multi-instruction sequences and/or “branch islands”
  - This is a regression from version 1.0 of the instruction set architecture!
- Despite the great effort put into uniform encoding, load and store instructions are encoded differently (the register and immediate fields are swapped)
  - Apparently, orthogonality of the destination register encoding was preferred over orthogonality of the encoding of two closely related instructions. This choice seems a bit odd given that address generation is the more timing-critical operation
- There are no load instructions with register offsets (Rbase + Roffset) or indexes (Rbase + Rindex << Scale).
- FENCE.I implies full synchronization of the instruction cache with all preceding stores, fenced or not. Implementations need to either flush the entire I$ on the fence, or snoop the D$ and the store buffer
- In RV32I, reading the 64-bit counters requires reading the upper half twice, then comparing and branching in case a carry from the lower half into the upper half occurred during the read
  - Typically, 32-bit ISAs include a “read pair of special registers” instruction to avoid this problem.
- There is no architecturally defined hint encoding space, that is, a space whose instructions do not fault on older processors (they are processed as NOP) but do something on newer CPUs
  - Typical examples of pure NOP hints are things like spinlock yield
  - Newer processors also implement more sophisticated hints (with visible side effects on the newer processors; for example, the x86 bounds-check instructions are encoded in hint space so that binaries remain backward compatible)
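The read sequence for 64-bit counters on RV32I mentioned earlier in this list can be sketched in C. This is a minimal illustration under stated assumptions: `read_counter64` and the accessor type are hypothetical names, and the counter is simulated by a global variable that advances by one tick on every access, standing in for the real counter registers.

```c
#include <stdint.h>

/* Hypothetical accessor type: returns one 32-bit half of a 64-bit counter. */
typedef uint32_t (*read_half_fn)(void);

/* Read the upper half, the lower half, and the upper half again;
   retry if the upper half changed in between. */
uint64_t read_counter64(read_half_fn read_hi, read_half_fn read_lo)
{
    uint32_t hi, lo, hi_again;
    do {
        hi = read_hi();
        lo = read_lo();
        hi_again = read_hi();
    } while (hi != hi_again);
    return ((uint64_t)hi << 32) | lo;
}

/* Simulated counter for demonstration only: every access advances it by one. */
uint64_t sim_counter = 0x1ffffffffULL;

uint32_t sim_read_hi(void) { return (uint32_t)(sim_counter++ >> 32); }
uint32_t sim_read_lo(void) { return (uint32_t)(sim_counter++); }
```

Starting the simulated counter just below a carry into the upper half forces one retry: a naive read of the upper half followed by the lower half would combine the old upper half with the new lower half, while the loop returns a consistent value.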