Compact RSA implementation for embedded applications

RSA is a well-known public key encryption algorithm. On its basis, in addition to asymmetric encryption, you can also implement an electronic signature ( EDS ). These features are attractive for embedded systems, microcontrollers. The encryption method itself is extremely simple in appearance:

C = (M ^e ) mod n

(one)

where C, M, e, n are integers, M is a plaintext, numbers e and n are a public key, C is a ciphertext. mod is the remainder of the division.

The extension looks just as simple:

M = (C ^d ) mod n

(2)

where C, M, n play the same role as for encryption, d is the private key.

In this case, n = p * q, where p and q are prime numbers (secret), e is usually 65537, d is calculated on the basis of e, p and q. Cryptographic strength is based on the fact that, for sufficiently large p and q, the task of decomposing n into factors or reversing an encryption formula without knowing p and q is not solved in a reasonable time.
')
But this seeming simplicity is deceptive. Behind it lies a huge amount of detail and complexity of implementation. Especially if the goal is to get an effective in speed and memory implementation, suitable for use in microcontrollers. I did not find the appropriate libraries on the Internet, and attempts to study the source code of libgcrypt lead to such jungle from which you can’t get out. Therefore, I wrote my compact library, which I share with respected readers.

1. Long arithmetic (Multi-precision integer arithmetics)

The first difficulties begin when trying to test the performance of RSA for small numbers. For example, if we take p = 7, q = 11, then we get n = 77. The chosen e must be mutually simple with (p-1) * (q-1), therefore 2, 3, 4 and 5 do not fit, the minimum is 7. Leaving aside the calculation method d, I’ll just give the result: d = 43. And now, although encrypting a small message is not a problem, for decryption, you need to raise C to a power of 43, and this leads to overflow even of 64-bit integers already at C = 3.

However, even without this example, it was clear that one could not do without the use of long arithmetic. After all, to ensure cryptographic resistance, p and q must be large. In modern practice of using RSA, these numbers are of the order of 2 ¹⁰²⁴ each, and the order of n is 2 ²⁰⁴⁸ when using a key length of 2048 bits. It is possible to reduce the order of numbers by 2 times (512 bits for p and q, 1024 for n), but in recent publications one can often find the opinion that 1024-bit RSA-keys no longer provide the proper level of cryptographic strength.

Long arithmetic is a loose concept. There is a desire to use the C ++ class to represent long numbers and implement the addition, subtraction, multiplication, etc. operators. My practice shows that this approach is suitable only for initial acquaintance with RSA. For implementation in microcontrollers, we need to implement not the entire set of operations, but only the most necessary ones. And implement effectively. Not in C ++, but in C or even assembly language.

Historically, at first I implemented the necessary algorithms of long-term arithmetic, in C ++, then translated them into dsPIC33F assembler and optimized, and then translated back to C. In this connection, the C code was much more efficient than the original C ++ and only slightly inferior to assembler.

1.1. Exponentiation

Obviously, first of all we need an exponentiation algorithm. The algorithm studied in school [1], which reduces the operation to a series of multiplications and has complexity O (log e), where e is the exponent, is suitable for this.

Even when using algorithms of long arithmetic, raising the 2048-bit number M to the power e = 65537 results in a bit length of 2048 * 65537 bits, which will require more than 16 megabytes to be stored in memory. The execution time of the multiplication for such numbers completely exceeds all imaginable limits. Fortunately, we do not need the result M ^{e itself} , but only the remainder of its division by n. Since exponentiation is reduced to a series of multiplications, it is necessary after each multiplication to reduce the length of the intermediate result, finding the remainder from dividing it by n. There are theorems that guarantee a correct result. Thus, the algorithm of exponentiation and the subsequent calculation of the remainder of dividing the result by n is transformed into a unified exponentiation algorithm modulo , which calculates the results of both operations simultaneously and thereby fully realizes RSA encryption using formula (1).

To implement this algorithm, we will need long multiplication operations and finding the remainder of the division. Private does not interest us. At the same time, the operand width for multiplication will be 2048 bits each; 4096-bit numbers should be divided into 2048-bit, the remainder of the division will be 2048 bits long.

Despite the good performance, the above algorithm is only suitable for encryption, where the “good” exponent e = 65537 is used. Its value is large enough for cryptographic strength; at the same time, it is quite small compared with the key width; it is a prime number, and it also has an advantageous binary representation that increases the efficiency of exponentiation. At the same time, the number d cannot be chosen arbitrarily, it depends on p, q and e, and in order of magnitude it is comparable to n, 2048 bits. If you can raise to the power of 65537 in 17 long multiplications and divisions, then to raise to the power d will take an average of about 3,072 such operations, which will reduce the decryption speed compared to encryption by more than 180 times. However, in the case of decryption, you can significantly increase the speed by using the knowledge of p and q (for the encrypting side, these numbers are usually unknown). As a result, the decryption algorithm is significantly different from the encryption with all the mathematical similarity of these operations.

My embedded implementation uses only encryption (or an electronic signature verification operation), so raising to the power of e modulo n can be done in the form of a specialized procedure rigidly bound to the value of e = 65537. In the source code text, this procedure is called mpi_powm65537 , it takes as input the numbers M and n. In her work, she uses the procedures of long multiplication and finding the remainder of the long division.

1.2. Long multiplication

There are a number of algorithms for long multiplication. The most widely known is the multiplication "in a column." This method reduces the multiplication of long numbers to a series of multiplications and additions of individual numbers. If there is a hardware multiplier in the processor, long multiplication can be implemented in the base number system of 256, 65536 and more, depending on the multiplier capacity. He will multiply the individual "digits" of long numbers.

Column multiplication is not the most efficient method. Its complexity is O (k ² ), where k is the bit width of the operands in the selected number system. There are more efficient algorithms: Karatsuba multiplication , multiplication by Fourier transform [2, 7], etc. However, these advanced methods are more difficult to implement. In conditions of lack of time, I limited myself to multiplying by a bar. This gave an acceptable speed on the selected microcontroller.

So, multiplication "in a column" reduces a long multiplication to a series of short multiplications and additions. It is reasonable to use the capacity of the existing hardware multiplier completely. For example, in PIC24 and dsPIC microcontrollers there is a multiplier 16 * 16 => 32, that is, multiplying 16-bit numbers and giving a 32-bit result. Thus, the maximum possible base of the number system for these microcontrollers is 2 ¹⁶ = 65536.

Moreover, it is important that the result is 32-bit, because all these 32 bits will be needed in the course of further calculations. For example, in x86 processors, the bit depth of the result is equal to the bit width of the operands, i.e. 32 bits. To avoid overflow, it is necessary to limit the width of the operands to 16 bits. The same is true for some other 32-bit processors. They also have to use the number system on the base of 65536, as in a 16-bit microcontroller.

When multiplying "in a column", you can calculate intermediate results in a different order. Let's say, as we are accustomed to in school, you can first multiply the entire first factor by one digit of the second, then by the second, etc., by writing the results into lines. It turns out the matrix, which further must be added line by line. But for dsPIC microcontrollers, a different approach is more effective [6]. Namely, the calculation of the matrix is not in rows, but in columns. First the first column, then the second, and so on. When a column is calculated - its values can be immediately added together, we get the figure of the final result and the transfer. Of course, it makes no sense to store all the numbers from the column - you can instead add them to the sum instead. With this approach, the multiplication operations alternate with the operations of adding the result to the accumulator. And consequently, it becomes possible to use powerful DSP- means of this microcontroller, REPEAT and MAC instructions, which take two operands from the memory in one clock cycle, multiply them, add the result to the accumulator and increase the values of the pointers. Although the algorithm still remains inefficient - O (k ² ), such an optimized implementation can compete with more efficient algorithms that spend more processor cycles for each pass of the internal cycle.

Long unsigned multiplication of 2048 * 2048 => 4096 is in my library routine mpi_muluu .

1.3. Number Representation - Little Endian or Big Endian?

Any developer of a long arithmetic library must decide in what form it represents numbers in memory. Are the low-order and then most significant digits of the number (Little Endian) located in the memory or vice versa (Big Endian). For a person, the Big Endian format is more natural, because all the numbers we work with are represented in this format. However, for the internal representation of numbers in the computer may be different.

In our case, the conditions are dictated by the algorithm of long multiplication. When it works, numbers are represented in memory in the form of 16-bit "numbers". It would be desirable to read these "numbers" from the memory as a whole, rather than one byte. In this case, you have to use the instructions of the 16-bit read processor, and they are in the case of dsPIC, and x86 - Little Endian. There are other considerations for which Little Endian is preferable, but this is the most important thing. To improve performance, we want to read the processed information from the memory as large as possible, while avoiding unnecessary transformations. Therefore, I chose Little Endian, despite the fact that many textbooks use the Big Endian format, and I do not regret it. This choice led to a beautiful and optimal code. Of course, for Big-endian processors, you should choose Big-endian format.

1.4. Long division

Effective methods of long division reduce it to multiplying by the reciprocal of the divisor number [7]. In turn, this inverse number is also calculated using multiplication, for example, by an iterative method based on the Newton method for solving equations [3, 7]. Theoretically, this could be very beneficial for RSA, because it is necessary to divide many times into the same number, so that the reverse number needs to be calculated only once. However, I did not bother with this, limiting myself to the implementation of the division “in a column” in relation to computers, which is described in [4].

If we recall the method of dividing "in a column" studied in school and try to implement it "on the forehead" on a computer, then you come across the fact that this method, with the exception of working in the binary number system, does not reduce the long division to short or other simple operations. In fact, having demolished a certain number of divisible digits, it is necessary to divide them by the whole long divisor in order to get the next private digit. How to be here? In [4], a theorem is given which makes it possible to “guess” the digit of a particular with a small error. It is valid for cases when:

the bit width of the divider is one digit less than the bit width of the partial remainder;
the first digit of the divisor is greater than or equal to half the base of the number system.

The first condition is met due to the fact that we demolish the figures of the dividend into a partial remainder in the correct way (similar to the school procedure). To satisfy the second condition, you can multiply the dividend and the divisor by the same number. However, for RSA keys, it is guaranteed that the high bit n (of the divider) is equal to 1. And this means that the 16-bit “digit” formed from the higher 16 bits of the key will be greater than or equal to 32768. Therefore, the RSA does not require the division .

In order to "guess" the digit of the quotient, it is necessary to divide the first two digits of the partial remainder by the first digit of the divisor. The resulting value is either equal to the true private digit, or exceeds it by no more than 2. Indeed, if we divide 499 by 50, then dividing 49 by 5, we get 9, which coincides with the true first digit of the private. If we divide 890 by 99 - then dividing 89 by 9, we get 9, which is 1 more than necessary. In other cases, the estimate turns out to be overestimated by 2, but never again, this is guaranteed by a theorem.

The division of two digits of the partial remainder by one digit of the divisor is a “short” division, the hardware support of which is very desirable. In dsPIC33 there is hardware support for dividing 32-bit numbers into 16-bit, with 16-bit quotients and the remainder. This is quite enough for the implementation of long division in the base number system 2 ¹⁶ = 65536. For x86, the maximum bit depth of the dividend is 32 bits, which also provides the basis for the number system 65536.

After “guessing” the quotient numbers, it is necessary to multiply this number by the divisor and subtract the product from the partial remainder. If the result is less than 0, add the divisor back and reduce the estimate of the quotient by 1. If the remainder is less than 0, add the divider again and decrease the quotient again. As a result, we will get the correct value of the quotient and the correct value of the partial balance to continue the division process.

RSA does not need to know the quotient, but only the remainder of the division is needed. In this regard, I wrote a function mpi_moduu , which calculates only the remainder. For a 2048-bit RSA, you need to divide 4096-bit numbers into 2048-bit, with a 2048-bit remainder. My procedure performs division “in place”: the dividend is replaced by the remainder after the procedure ends. This increases its efficiency, since the "demolition" of the next digit from the dividend turns into a simple increment of the pointer. Also, no additional memory is required to store the partial balance and the private (as it is not saved).

The following components of long arithmetic are necessary for the operation of the above described division algorithm:

multiplying a long number by a short (one digit);
subtraction of long numbers;
addition of long numbers;
comparison of long numbers

which are described below.

The complexity of the long division procedure is O (k ² ) for short multiplications and O (k) for short divisions, where k is the number of digits of the divider in the base number system of 65536.

1.5. Long addition and subtraction

Writing the mpi_add addition and subtraction mpi_sub procedures does not require much ingenuity and some complicated methods. Each processor has an ADC type instruction that adds two numbers and a carry bit from the previous addition. If you cascade ADC commands, then you can add operands of any capacity. This is true for subtraction and SBC instructions. The complexity is O (k). My procedures return a carry flag when adding or subtracting, which is used in the division procedure.

In the C language, there is no access to the processor flags, so you have to attract numbers that are larger than necessary. For example, when adding two 16-bit numbers, it is necessary to use at least 32-bit for storing the result and transfer, in order to avoid overflow. This again leads us to the need to work in the number system on the basis of 65536 on 32-bit processors. Well, in the assembler, the digits of digits of long numbers coincide with the digits of the processor, 16 bits in the case of dsPIC.

1.6. Long comparison

Comparison can be implemented more efficiently than subtraction, if you start processing with higher digits. If the high digit of one number is greater than the high digit of the other, then the whole number is greater, the remaining numbers can not be compared. Thus, the complexity of the comparison procedure is the same as for subtracting O (k), but statistically the comparison is faster. The procedure in the source is called mpi_cmp .

1.7. Multiply a long number by a short

Multiplying a long number by a short one (ie, by one digit), which is used when dividing long numbers, is not fundamentally different from a long multiplication. However, it lacks one level of nesting cycles. Multiplying the number of digits k by a digit in the system, based on 65536, leads to the number of digits k + 1. Since in the division procedure such multiplication immediately follows subtraction of the product from the partial remainder, I implemented both operations simultaneously in the mpi_mulsubuuk procedure, which multiplies the 2048-bit number by the 16-bit and subtracts the product from the 2064-bit number, which leads to some saving of memory and time execution.

That's all the long arithmetic needed for RSA encryption.

2. Key generation

Key generation in RSA begins with the selection of two large primes p and q. When they say: “RSA-key is 2048 bits long”, they mean that the product p * q is a 2048-bit number, the highest bit of which is 1 (otherwise, in some interpretations, this would be a 2047-bit number). The size of the numbers p and q separately is not specified, but I noticed that such p and q are generated in GnuPG , that each of them is a 1024-bit number with the most significant bit equal to 1.

On the generation of large prime numbers, you can write books, not so much as articles. But in many situations, including mine, it is not required to generate keys on the microcontroller. So I tried, and successfully, to generate keys using GnuPG.

First you need to generate a key pair (open + closed) in GnuPG in the usual way. At the same time, make sure that the RSA-key of the desired length is obtained (2048 bits are now set by default there). Despite persistent calls from GnuPG, you should refuse to protect this key with a password and store it on disk in clear text. What difference does it make? We are all the same going to “gut” this key into its p, q, n, d numbers and save them to disk in the clear, at least temporarily.
The key can be exported from the GnuPG key database undocumented in new gpg versions.

gpg --export-secret-keys --armor [key_id] >filename.asc

Where key_id is the key identifier that can be obtained by calling gpg --list-secret-keys.

You’ll get a text file with content like:

 -----BEGIN PGP PRIVATE KEY BLOCK----- Version: GnuPG v2.0.17 (MingW32) lQOXBE9IdyABCADGT3+Dj0dsVPSkzW+zPlfXc4AsuKpkE9GJNAYD3mrcF70nwk1L ...

How do we extract the numbers that make up the key from here? It turned out that in the gpg program itself and the attached utilities there are NO such tools. There is only a third-party utility " pgpdump ", which interprets the contents of the PGP Private Key Block and displays it in human-readable form. Calling it like this:

 pgpdump -i filename.asc >filename.txt

We get the parameters we need:

 Old: Secret Key Packet(tag 5)(919 bytes) Ver 4 - new Public key creation time - Sat Feb 25 07:52:32 FLE Standard Time 2012 Pub alg - RSA Encrypt or Sign(pub 1) RSA n(2048 bits) - c6 4f 7f ... RSA e(17 bits) - 01 00 01 RSA d(2040 bits) - a5 c5 ce ... RSA p(1024 bits) - d0 28 ad ... RSA q(1024 bits) - f3 e3 61 ... RSA u(1024 bits) - 90 8b 22 ... Checksum - 3e 22 ...

Here I don’t completely give real numbers, but you will easily get them for your key. It remains only to copy them where necessary, not forgetting that in this file they are presented in the format Big Endian . So in order to use them in conjunction with my procedures, you need to convert them to Little Endian, that is, rearrange all the bytes in reverse order.

Also do not forget that the public key is only the numbers e and n. The remaining numbers are the secret (private) key, and knowing any of them (d, p, q, u) will make it easy to calculate the rest and decrypt your message, or forge an electronic signature.

3. Decryption

Since most RSA implementations hide somewhere inside the number that the keys consist of, and wrap encrypted messages in all sorts of packets, then at least to check the subroutines listed in section 1, it is necessary, in addition to the encryption operation, to also implement decryption.

As mentioned above, the formula (2) decryption can be implemented "in the forehead." And although it will work a hundred times slower than encryption, due to the large value of the exponent d, but this will be enough for many applications. If there is a desire to speed up the operation, then the trick using the Chinese theorem on residues , described in [5], is applied. I do not have beautiful sources that implement all this, therefore I give only a verbal description of the algorithm:

First, the number d “decomposes” into two parts, d _p = d mod (p-1) and d _q = d mod (q-1) . At the same time, the width of each of the parts is 2 times smaller than the “key length” (digit capacity n). Further, q _inv is calculated such that (q * q _inv ) mod p = 1. That is, q _inv is the inverse of q in the ring modulo p. In English terminology, this is called " Modular multiplicative inverse ". To calculate the inverse of the element is usually used the advanced Euclidean algorithm .

When the numbers d _p , d _q and q _inv are found, the decryption of the message using the RSA method is performed as follows:

m ₁ : = (C ^{d _p} ) mod p
m ₂ : = (C ^{d _q} ) mod q
h: = ((m ₁ -m ₂ ) * q _inv ) mod p
M: = m ₂ + h * q

Note that the calculations in steps 1-3 are made in a lower bit depth: 1024 bits instead of 2048 for the formula (2). Since the modular exponentiation algorithms have a cubic complexity (quadratic of multiplication and division, and still linear of “fast” exponentiation), the speed gain is obtained when working with half the number of digits 8 times. But since we now have two expensive exponentiation operations modulo, then the total gain is 4 times compared with the direct application of formula (2). Although it is a lot, it is still not so significant that in all cases it is necessary to reject direct application of formula (2), especially considering the difficulties in calculating q _inv .

It is worth mentioning the interesting vulnerability associated with “rapid” exponentiation. This is mentioned in the source code libgcrypt, file mpi-pow.c. The attack is called “Yarom / Falkner flush + reload cache side-channel attack” and allows using the processor’s second level cache (which is common to all cores) for unauthorized applications to fully find out the value of d or its parts d _p and d _q . Both means full disclosure of the secret key, allowing the attacker to read all the messages encrypted with this key or generate an electronic signature. That is, for example, your program decrypts something with a secret key, and a malicious program at this moment is working on the same system, but on behalf of another user, so you do not know about it. And at this moment there is a key leak.

Conclusion

I presented a C language module suitable for RSA encryption or electronic signature verification on 32-bit embedded processors with clock frequencies of the order of 40 MHz and a hardware multiplier. When translating the C source code into assembler and using materials [6], you can get a realization for dsPIC33F that encrypts one RSA-2048 data block in a time on the order of 10ms at a processor clock frequency of 80 MHz.

Sources can be downloaded from GitHub .

RSA decryption (or electronic signature generation) can also be implemented on microcontrollers of this class, but this operation will be approximately 45-180 times slower.

It is possible to accelerate the library by attracting efficient multiplication and division algorithms mentioned in the article by reference.

I recommend using my routines only for embedded applications, and for “large” computers in most cases it is better to use standard tools like Windows CryptoAPI or libgcrypt.

Bibliography

1. A. G. Kushnirenko, G.V. Lebedev, R. A. Svoren. Fundamentals of computer science and computing. M.: 1990, p. 133.
2. Wikipedia. Schönhage-Strassen multiplication method
3. Wikipedia. " Long arithmetic "
4. Donald Whip. "The Art of Programming", volume 2. The algorithms
5. Wikipedia, “ RSA (cryptosystem) ” article, “A worked example” section
6. Erich Wegener, Mario Werner. "Evaluating 16-bit Processors for Elliptic Curve Cryptography". Institute for Applied Information Processing and Communications (IAIK), Graz University of Technology
7. William H. Press et al. Numerical Recipes in C ++, Second edition, Cambridge University Press 2002, page 922

Source: https://habr.com/ru/post/243425/

All Articles