📜 ⬆️ ⬇️

Compact varint - uniqueness and great values ​​for the same value.

UPD 2018.03.15: Git has been using its compact varint version for a long time. Differences in the afterword .


Attention: The code presented in the article is slightly different from the original EncodeVarint and DecodeVarint and gives other results. Be careful.


In a multiformats / unsigned-varint discussion of the correct entry of a number in a varint, it was noted that many numbers in the original varint can be written in a sequence of different lengths. This will give different blocks and their hashes with identical values ​​encoded in the protobuffer .


Original varint


The original varint simply divides the number into pieces of 7 bits. And it writes them in order from low to high, adding to each bit an upper 8th bit (MSB - the most significant bit). The value of this bit depends on whether the last bit is (0) or not (1).


Thus, for example, the value 0 can be written in many variants:


  1. 0000 0000 (0x00) varint = 0
  2. 1000 0000 0000 0000 (0x8000) varint = 0
  3. 1000 0000 1000 0000 0000 0000 (0x808000) varint = 0
    and so on.

Compact varint


I thought that it is possible to begin the values ​​of a larger container from the maximum value of the previous container + 1. After all, if we use a container of this size, that number should be greater than the maximum of the previous container.


Benefits:


  1. A unique value for each set of bytes.
    1. 0000 0000 (0x00) varint = 0
    2. 1000 0000 0000 0000 (0x8000) varint = 128
    3. 1000 0000 1000 0000 0000 0000 (0x808000) varint = 16 512
  2. Larger values ​​for a set of bytes of the same size.
    1. For 2 bytes a maximum of 16 511 which is 128 more than 16 383 for the original varint
    2. For 8 bytes, the maximum is already 567 382 630 219 904 more

We code in compact varint


As I said above, the code is a little different from the original. Here I just added one line:


 x -= 1 

And she gave uniqueness and savings.


https://github.com/ivan386/protobuf-1/blob/b76098a2adb7a080231cf14903ab7f3687667ce6/proto/encode.go#L95-L106


 func EncodeVarint(x uint64) []byte { var buf [maxVarintBytes]byte var n int for n = 0; x > 127; n++ { buf[n] = 0x80 | uint8(x&0x7F) x >>= 7 x -= 1 } buf[n] = uint8(x) n++ return buf[0:n] } 

Example


300 = 000 0010 010 1100


  1. separate the first 7 bits: " uint8(x&0x7F) , x >>= 7 "
    010 1100


  2. add the most significant bit to them (1 since there are more bits): " 0x80 | "
    1 010 1100


  3. subtract one from the remaining: " x -= 1 "
    (000 0010) - 1 = 000 0001
    This is the key and only difference from the original EncodeVarint. Due to it, we can write a larger value in the same number of bytes.


  4. add the most significant bit to them (0 since this is the last part)
    0 000 0001


  5. glue
    1 010 1100 ++ 0 000 0001 = 1 010 1100 0 000 0001

300 = 1010 1100 0000 0001 (0xAC01) compact varint


Decoding compact varint


Decryption also has not undergone great changes. Here I changed one line:


 x |= (b & 0x7F) << shift 

on


 x += (b & 0xFF) << shift 

https://github.com/ivan386/protobuf-1/blob/b76098a2adb7a080231cf14903ab7f3687667ce6/proto/decode.go#L63-L78


 func DecodeVarint(buf []byte) (x uint64, n int) { for shift := uint(0); shift < 64; shift += 7 { if n >= len(buf) { return 0, 0 } b := uint64(buf[n]) n++ x += (b & 0xFF) << shift if (b & 0x80) == 0 { return x, n } } // The number is too large to represent in a 64-bit value. return 0, 0 } 

Example


300 in compact varint number = 1 010 1100 0000 000 1 (0xAC01)


  1. share


     1 010 1100 0000 000 1 

  2. add zeros or shift " (b & 0xFF) << shift "


     1 010 1100 = 0000 000 1 010 1100 0000 000 1 = 0000 000 1 000 0000 

    To the first byte, we simply added the higher 7 zeros to align and the second byte was shifted 7 bits forward (b & 0xFF) << shift . In this case, the most significant bit is saved, unlike the original varint.


  3. now add " x += "


     0000 000 1 010 1100 + 0000 000 1 000 0000 = 0000 001 0 010 1100 → 256 + 32 + 8 + 4 = 300 

    This is a key moment of difference from the original DecodeVarint. Instead of operation "or" we make addition. Due to the most significant bit saved at the previous stage, we get a better result.



Why is it more compact:


Calculate the 2 byte maximum value


 a := []byte{255,127} // 1111 1111 0111 1111 

2 byte maximum value of compact varint: 16511
coded: 1 111 1111 0111 111 1 (0xFF7F)


  1. share
     1 111 1111 0111 111 1 
  2. add zeros or shift


     1 111 1111 = 0000 000 1 111 1111 0111 111 1 = 0111 111 1 000 0000 

  3. we fold
     0000 000 1 111 1111 + 0111 111 1 000 0000 = 1000 000 0 111 1111 = 16511 

result: 16511


2-byte maximum value of the original varint: 16383
coded: 1 111 1111 0 111 1111 (0xFF7F)


  1. share


     1 111 1111 0 111 1111 

  2. discard the high bit


     111 1111 111 1111 

  3. glue the bits
     111 1111 ++ 111 1111 → 111 1111 111 1111 = 16383 

Result: 16383


the difference between the maximum values: 16511 (compact varint) - 16383 (varint) = 128


For 2 bytes, it is not large but saves bytes with values ​​from 16384 to 16511 in comparison with the original varint.


Calculate the savings for an 8 byte record.


 // 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 0111 1111 a := []byte{255,255,255,255,255,255,255,127} 

 72624976668147839 (    8  compact varint) - 72057594037927935 (    8   varint ) = 567382630219904 (  ) 

Here we save the 9th byte on a much larger segment of values.


Code to calculate the difference:


Here, the volume needed to record all values ​​up to the specified value is calculated and the statistics of the difference between the old and new versions is displayed.


 package main import ( "fmt" ) func DecodeVarintDifference(buf []byte) (difference uint64, newVarint uint64, oldVarint uint64, n int) { for shift := uint(0); shift < 64; shift += 7 { if n >= len(buf) { return 0, 0, 0, 0 } b := uint64(buf[n]) n++ newVarint += (b & 0xFF) << shift oldVarint |= (b & 0x7F) << shift if (b & 0x80) == 0 { return newVarint - oldVarint, newVarint, oldVarint, n } } // The number is too large to represent in a 64-bit value. return 0, 0, 0, 0 } func convert(i uint64)(string){ v := []string{"B","KB","MB", "GB", "TB", "PB", "EB"} l := 0 for ; i > 1024; i = i / 1024 {l++} return fmt.Sprintf("(%d %s)",i, v[l]) } func main() { a := []byte{255,255,255,255,255,255,255,255,127} var newByteCount, newStart, oldByteCount, oldStart uint64 for i := len(a) - 1; i >= 0; i-- { difference, nVarint, oVarint, n := DecodeVarintDifference(a[i:]) newByteCount += (nVarint - newStart + 1) * uint64(n) oldByteCount += (oVarint - oldStart + 1) * uint64(n) fmt.Println("size:", n, "B") fmt.Println() fmt.Println( "new start value:", newStart, "\nnew max value:",nVarint) fmt.Println() fmt.Println( "old start value:", oldStart, "\nold max value:",oVarint) fmt.Println() fmt.Println("max value diff:", difference) oldByteCountForSameRange := oldByteCount + ( difference * uint64(n + 1) ) sizeDifference := oldByteCountForSameRange - newByteCount fmt.Println() fmt.Println("for range from 0 to", nVarint) fmt.Println("new varint data size:", newByteCount, "B", convert(newByteCount)) fmt.Println("old varint data size:", oldByteCountForSameRange, "B", convert(oldByteCountForSameRange)) fmt.Println("size differece:", sizeDifference, "B", convert(sizeDifference)) fmt.Println("percent from data size new:", float64(sizeDifference) / float64(newByteCount) * 100, "%") fmt.Println("percent from data size old:", float64(sizeDifference) / float64(oldByteCountForSameRange) * 100, "%") fmt.Println("average byte per value new:", float64(newByteCount) / float64( nVarint + 1 ) ) fmt.Println("average byte per value old:", float64(oldByteCountForSameRange) / float64( nVarint + 1 ) ) fmt.Println("-------------------------------------------------------") fmt.Println() newStart = nVarint oldStart = oVarint } } 

Testing on The Go Playground: https://play.golang.org/p/HPyb-sXMSjJ


Calculation results
 size: 1 B new start value: 0 new max value: 127 old start value: 0 old max value: 127 max value diff: 0 for range from 0 to 127 new varint data size: 128 B (128 B) old varint data size: 128 B (128 B) size differece: 0 B (0 B) percent from data size new: 0 % percent from data size old: 0 % average byte per value new: 1 average byte per value old: 1 ------------------------------------------------------- size: 2 B new start value: 127 new max value: 16511 old start value: 127 old max value: 16383 max value diff: 128 for range from 0 to 16511 new varint data size: 32898 B (32 KB) old varint data size: 33026 B (32 KB) size differece: 128 B (128 B) percent from data size new: 0.38908140312481004 % percent from data size old: 0.38757342699691155 % average byte per value new: 1.9923691860465116 average byte per value old: 2.0001211240310077 ------------------------------------------------------- size: 3 B new start value: 16511 new max value: 2113663 old start value: 16383 old max value: 2097151 max value diff: 16512 for range from 0 to 2113663 new varint data size: 6324357 B (6 MB) old varint data size: 6340997 B (6 MB) size differece: 16640 B (16 KB) percent from data size new: 0.26310975171072726 % percent from data size old: 0.2624193009395841 % average byte per value new: 2.9921297803245928 average byte per value old: 3.0000023655604675 ------------------------------------------------------- size: 4 B new start value: 2113663 new max value: 270549119 old start value: 2097151 old max value: 268435455 max value diff: 2113664 for range from 0 to 270549119 new varint data size: 1080066185 B (1 GB) old varint data size: 1082196489 B (1 GB) size differece: 2130304 B (2 MB) percent from data size new: 0.19723828313354705 % percent from data size old: 0.19685001953466882 % average byte per value new: 3.992126032418808 average byte per value old: 4.000000033265678 ------------------------------------------------------- size: 5 B new start value: 270549119 new max value: 34630287487 old start value: 268435455 old max value: 34359738367 max value diff: 270549120 for range from 0 to 34630287487 new varint data size: 172878758030 B (161 GB) old varint data size: 173151437454 B (161 GB) size differece: 272679424 B (260 MB) percent from data size new: 0.15772870369226125 % percent from data size old: 0.15748031203751395 % average byte per value new: 4.992125984801758 average byte per value old: 5.00000000040427 ------------------------------------------------------- size: 6 B new start value: 34630287487 new max value: 4432676798591 old start value: 34359738367 old max value: 4398046511103 max value diff: 34630287488 for range from 0 to 4432676798591 new varint data size: 26561157824660 B (24 TB) old varint data size: 26596060791572 B (24 TB) size differece: 34902966912 B (32 GB) percent from data size new: 0.13140604465515907 % percent from data size old: 0.1312335957776889 % average byte per value new: 5.992125984257845 average byte per value old: 6.000000000004512 ------------------------------------------------------- size: 7 B new start value: 4432676798591 new max value: 567382630219903 old start value: 4398046511103 old max value: 562949953421311 max value diff: 4432676798592 for range from 0 to 567382630219903 new varint data size: 3967210831773851 B (3 PB) old varint data size: 3971678411539355 B (3 PB) size differece: 4467579765504 B (4 TB) percent from data size new: 0.1126126126124338 % percent from data size old: 0.1124859392574144 % average byte per value new: 6.992125984252029 average byte per value old: 7.000000000000048 ------------------------------------------------------- size: 8 B new start value: 567382630219903 new max value: 72624976668147839 old start value: 562949953421311 old max value: 72057594037927935 max value diff: 567382630219904 for range from 0 to 72624976668147839 new varint data size: 580427963135197347 B (515 PB) old varint data size: 580999813345182755 B (516 PB) size differece: 571850209985408 B (520 TB) percent from data size new: 0.09852216748768333 % percent from data size old: 0.0984251968503923 % average byte per value new: 7.9921259842519685 average byte per value old: 8 ------------------------------------------------------- size: 9 B new start value: 72624976668147839 new max value: 9295997013522923647 old start value: 72057594037927935 old max value: 9223372036854775807 max value diff: 72624976668147840 for range from 0 to 9295997013522923647 new varint data size: 9803799999989973164 B (8 EB) old varint data size: 9876996826868106412 B (8 EB) size differece: 73196826878133248 B (65 PB) percent from data size new: 0.7466168922071861 % percent from data size old: 0.7410838351088465 % average byte per value new: 1.0546259842519685 average byte per value old: 1.0625 ------------------------------------------------------- 

Conclusion


My pull request to golang / protobuf lay for a year before they looked at it and sent it to the right place for my proposal.


Multiformats / unsigned-varint still uses the original varint. Now it's too late to change it.


And I hope that at least someone to comact varint will help save some bytes.


Afterword


2018.03.15


A few days later I came across an article in the Wikipedia Variable-length quantity . There, in the Removing Redundancy section, the idea presented in my article is described.


It turns out that Git has been using compact varint for a long time, but it records them from older to younger. This has several disadvantages:


  1. When encoding, an additional buffer is used which at the end is copied to the final one:
    git / varint.c # L22 (encode_varint)


     unsigned char varint[16]; 

    git / varint.c # L28 (encode_varint)


     memcpy(buf, varint + pos, sizeof(varint) - pos); 

    In the version presented in my article, the recording is done directly to the final buffer:
    proto / encode.go # L99 (EncodeVarint)


     buf[n] = 0x80 | uint8(x&0x7F) 

  2. When decoding, the unit is added separately and the high-order bit is reset:
    git / varint.c # L10 (decode_varint)


     val += 1; 

    git / varint.c # L14 (decode_varint)


     val = (val << 7) + (c & 127); 

    In my version, the most significant bit is added to the number, replacing these two operations:
    proto / decode.go # L70 (DecodeVarint)


     x += (b & 0xFF) << shift 

  3. git / varint.c is not compatible with mine since it uses a different order of bytes.

Sources


  1. Encoding | Protocol Buffers | Google Developers / Base 128 Varints
  2. Multiformats / unsigned-varint
  3. New varint with unique values
  4. Compact varint
  5. Byte order
  6. Variable-length quantity

')

Source: https://habr.com/ru/post/350796/


All Articles