New solutions to the old problem

Or transfer the bike to jet thrust

There is one very old problem, the age of which is equal to the age of the American Standard Code for Information Exchange. More specifically, the task of converting an integer to its hexadecimal representation of an ASCII string.
In this publication, we will consider converting a whole unsigned sixty-four-bit number into a string of fixed length without truncating the leading zeros.
The task at first glance seems elementary. It would be such if the ASCII table was different. But we have, what we have.
All solutions will be only for IA-32 and Intel 64 architectures.

Consider the input-output data and the algorithm for solving this problem.
Entrance:
64-bit integer unsigned number, which occupies 8 bytes at addresses from low to high. The digits of the number are in the same order. Each bit occupies 4 bits, there are two bits in each byte.
Output:
A string of ASCII characters, each occupying one byte. Each byte represents one digit of the original number. The order of the characters - the first byte (lowest address) - the most significant digit of the original number.

Algorithm:
1) Go from the older tetrads to the younger ones
2) Take a tetrad and convert it to a byte with zero extension.
3) Add 30h
4) If the value turned out more than 39h then
4.1) Add another 17 (decimal)
4.2) Go to 5
4.3) Otherwise go to 5
5) Save received byte in line
6) While not all tetrads are processed, go to 2

Decision number 1

Head-on

mov cx,8 mov si,value mov di,hexstr add si,cx ;highest byte of value dec si next_tetrade: std lodsb mov bl,al and al,0fh call digit cld stosb mov al,bl shr al,4 call digit loop next_tetrade ret digit: add al,30h cmp al,39h jnb _zero-nine ;digit greater than 9 add al,11h _zero-nine: ret

')
Like most solutions to the forehead, it is no different neither beauty nor speed. Home ugliness - conditional transition, which bypasses only one team. There are two ways of pointing beauty.
The first is to use the ancient "magic" as a sequence of AAA AAD 17 commands.
Decision number 2

AAA AAD 17

 mov cx,8 mov si,value mov di,hexstr add si,cx ;highest byte of value dec si next_tetrade: std lodsb mov bl,al and al,0fh call digit cld stosb mov al,bl shr al,4 call digit loop next_tetrade ret digit: mov ah,30h aaa aad 11h ret

Another way is to use the XLATB command.
Decision number 3

XLATB

  mov cx,8 mov si,value mov di,hexstr mov bx,hextable add si,cx dec si m3: std lodsb mov ah,al and al,0fh xlatb cld stosb mov al,ah shr al,4 xlatb stosb loop m3 ret hextable db "0123456789ABCDEF"

Both solutions are almost identical. But Solution # 2 will not work in 64-bit processor mode, due to the elimination of support for AAA and AAD commands in it.
But is it really possible, with the ability to process 8 bytes at a time, will we accept the processing of only 4 bits?
Are there ways to turn 9 (1001) into 39h (00111001) and A (1010) into 41h (01000001)?
Let's try to reveal the essence of a pair of AAA AAD teams, and select a replacement for them.

AAA AAD Replacement

  mov bl,0ah xor ah,ah div bl ; al - , ah -. ,     9    1. mov bh,al ; shl al,4 ;       11h add al,bh add al,30h ; voila!

This code, of course, is not suitable for our purposes, but it gives valuable information. Indicates that you can get a hyphenation unit for numbers greater than 9. And it shows how to use this unit later. Now, if each tetrade could be turned into a byte, then these bytes could be processed in parallel. Eight digits at a time!
Let the initial number be in rax. To isolate the lower tetrad, you just need to perform a conjunction with a mask. And so that the older tetrads are not lost, we copy them into rbx.

Unpuck

  mov rdx,0f0f0f0f0f0f0f0fh mov rbx,rax and rax,rdx shr rbx,4 and rbx,rdx

Unfortunately, it will not be possible to get the necessary transfer by division. Are there any other methods?
Actually, elementary: 10h-0ah = 6. It is enough to add 6 to each byte and get the necessary unit in the high tetrad.

Carry

  mov rdx,0f0f0f0f0f0f0f0fh mov rcx,0606060606060606h mov rbx,rax and rax,rdx mov rdi,rax ;   shr rbx,4 and rbx,rdx add rax,rcx shr rax,4 and rax,rdx ;    ,     9,  .   - .

Unlike the previous method of obtaining units of transfer using division, instead of the remainder of division, we still have the original number. That is, where was the number A - there it will remain, and will not turn into 0. Therefore, you need to add not 11h, but 41h-3ah = 7.
And now another problem arises, how to make a seven out of one? Yes, so as not to affect the neighboring bytes. After all, 7 = 0111b, it means you can get what you need by two shifts to the left and two disjunctions.

One to seven

  mov rsi,rax shl rsi,1 or rax,rsi ;11b shl rsi,1 or rax,rsi ; 111b

Now let's put the pieces together and see what happens.

Unsigned to hex ver 0.1

  mov rax,[value] mov rdx,0f0f0f0f0f0f0f0fh mov rcx,0606060606060606h mov rbx,rax mov rbp,3030303030303030h shr rbx,4 and rax,rdx and rbx,rdx mov rdi,rax mov r9,rbx add rax,rcx add rbx,rcx shr rax,4 shr rbx,4 and rax,rdx and rbx,rdx mov rsi,rax mov r8,rbx shl rsi,1 shl r8,1 or rax,rsi or rbx,r8 shl rsi,1 shl r8,1 or rax,rsi or rbx,r8 add rax,rbp add rbx,rbp add rax,rdi add rbx,r9 mov [hexstr],rax mov [hexstr+8],rbx

If you compile and run this code, one trouble will be revealed - the order of the numbers in the string is not in order. First, the numbers go backwards, and secondly, the even and odd numbers are grouped together. You can, of course, give to the conclusion and so, but we are honest and still bring order.

Deinterleaving

  mov rcx,4 highpart: rol rbx,8 shrd [hexstr],rbx,8 rol rax,8 shrd [hexstr],rax,8 loop highpart mov rcx,4 lowpart: rol rbx,8 shrd [hexstr+8],rbx,8 rol rax,8 shrd [hexstr+8],rax,8 loop lowpart ret

From what they wanted to leave, they came to that. Again byte processing in 64 mode. In order to flip the bytes, put them backwards, Intel has already made the bswap command for a long time. But for deinterliving you will have to look towards the MMX, SSE and their development. And there is such a command and its name is punpcklbw. We use our finds.

Deinterleaving ver. 2

  bswap rax bswap rbx mov [hexstr],rax mov [hexstr+8],rbx movdqu xmm0,[hexstr] movdqu xmm1,[hexstr+8] punpcklbw xmm1,xmm0 movdqu [hexstr],xmm1

Stop stop stop. If we started using SSE, maybe there is something else useful? Maybe rewrite our code completely on SSE.

Unsigned to hex ver 1.1

  movdqu xmm0,[value] pxor xmm1,xmm1 punpcklbw xmm0,xmm1 movdqa xmm1,xmm0 pand xmm1,[efes] psllq xmm0,4 pand xmm0,[efes] por xmm0,xmm1 movdqa xmm1,xmm0 paddb xmm1,[sixes] psrlq xmm1,4 pand xmm1,[efes] pxor xmm9,xmm9 psubb xmm9,xmm1 pand xmm9,[sevens] paddb xmm0,xmm9 paddb xmm0,[zeroes] movdqu [hexstr],xmm0 mov rax,[hexstr] mov rbx,[hexstr+8] bswap rax bswap rbx mov [hexstr],rbx mov [hexstr+8],rax ret efes: dq 0f0f0f0f0f0f0f0fh dq 0f0f0f0f0f0f0f0fh zeroes: dq 3030303030303030h dq 3030303030303030h sixes: dq 0606060606060606h dq 0606060606060606h sevens: dq 0707070707070707h dq 0707070707070707h

Here we have simplified unpacking, and getting the sevens is organized in another way, using one subtraction and one conjunction.

And what did we win at all? Let's compare the speed of each method.

CPU	one	2	3	four	five	6	7	eight
Core 2 Quad Q8200	670	600	150	77	70	77	76	1170
AMD C-60	290	195	290	120	105	120	140	290

Lodsb stosb with jnb
Lodsb stosb with xlatb
General purpose registers with shrd
General purpose registers with punpck
SSE 64 bit
SSE 64 bit unaligned
SSE 128 bit
Lodsb stosb with xlatb 128 bit

The values in parrots are the average number of processor ticks.
If Intel's numbers fit the theory well, then they are somewhat mysterious in AMD. A pleasant surprise was the high speed of the SSE code on the Intel processor. You can safely increase the digit capacity of the converted numbers to 256 bits with a small increase in the required time, the benefit is still a lot of free xmm registers in 64-bit mode. In general, initially it would seem a consistent task, it became possible to solve a very parallel method.
The inverse problem of converting a hexadecimal string to a number is no less interesting.

For a snack:

SSE 128 bit

  movdqa xmm0,[value] pxor xmm1,xmm1 movdqa xmm2,xmm0 punpcklbw xmm0,xmm1 movdqa xmm1,xmm0 punpckhbw xmm2,xmm1 pand xmm1,[efes] movdqa xmm3,xmm2 psllq xmm0,4 pand xmm3,[efes] pand xmm0,[efes] psllq xmm3,4 por xmm0,xmm1 pand xmm2,[efes] movdqa xmm1,xmm0 por xmm2,xmm3 paddb xmm1,[sixes] movdqa xmm3,xmm2 psrlq xmm1,4 paddb xmm3,[sixes] pand xmm1,[efes] psrlq xmm3,4 pxor xmm9,xmm9 pand xmm3,[efes] psubb xmm9,xmm1 pxor xmm10,xmm10 pand xmm9,[sevens] psubb xmm10,xmm3 paddb xmm0,xmm9 pand xmm10,[sevens] paddb xmm0,[zeroes] paddb xmm2,xmm10 movdqa [hexstr],xmm0 paddb xmm2,[zeroes] mov rax,[hexstr] movdqa [hexstr+16],xmm2 mov rbx,[hexstr+8] mov rcx,[hexstr+16] bswap rax mov rdx,[hexstr+16+8] bswap rbx bswap rcx mov [hexstr],rbx bswap rdx mov [hexstr+8+16],rax mov [hexstr+8],rcx mov [hexstr+16],rdx ret

Source: https://habr.com/ru/post/239375/

All Articles

New solutions to the old problem

Or transfer the bike to jet thrust

More articles: