UPD 2018.03.15: Git has been using its compact varint version for a long time. Differences in the afterword .
Attention: The code presented in the article is slightly different from the original EncodeVarint and DecodeVarint and gives other results. Be careful.
In a multiformats / unsigned-varint discussion of the correct entry of a number in a varint, it was noted that many numbers in the original varint can be written in a sequence of different lengths. This will give different blocks and their hashes with identical values encoded in the protobuffer .
The original varint simply divides the number into pieces of 7 bits. And it writes them in order from low to high, adding to each bit an upper 8th bit (MSB - the most significant bit). The value of this bit depends on whether the last bit is (0) or not (1).
Thus, for example, the value 0 can be written in many variants:
0000 0000 (0x00)
varint = 01000 0000 0000 0000 (0x8000)
varint = 01000 0000 1000 0000 0000 0000 (0x808000)
varint = 0I thought that it is possible to begin the values of a larger container from the maximum value of the previous container + 1. After all, if we use a container of this size, that number should be greater than the maximum of the previous container.
Benefits:
0000 0000 (0x00)
varint = 01000 0000 0000 0000 (0x8000)
varint = 1281000 0000 1000 0000 0000 0000 (0x808000)
varint = 16 512As I said above, the code is a little different from the original. Here I just added one line:
x -= 1
And she gave uniqueness and savings.
func EncodeVarint(x uint64) []byte { var buf [maxVarintBytes]byte var n int for n = 0; x > 127; n++ { buf[n] = 0x80 | uint8(x&0x7F) x >>= 7 x -= 1 } buf[n] = uint8(x) n++ return buf[0:n] }
300 = 000 0010 010 1100
separate the first 7 bits: " uint8(x&0x7F)
, x >>= 7
"010 1100
add the most significant bit to them (1 since there are more bits): " 0x80 |
"1 010 1100
subtract one from the remaining: " x -= 1
"(000 0010) - 1 = 000 0001
This is the key and only difference from the original EncodeVarint. Due to it, we can write a larger value in the same number of bytes.
add the most significant bit to them (0 since this is the last part)0 000 0001
1 010 1100 ++ 0 000 0001 = 1 010 1100 0 000 0001
300 = 1010 1100 0000 0001 (0xAC01) compact varint
Decryption also has not undergone great changes. Here I changed one line:
x |= (b & 0x7F) << shift
on
x += (b & 0xFF) << shift
func DecodeVarint(buf []byte) (x uint64, n int) { for shift := uint(0); shift < 64; shift += 7 { if n >= len(buf) { return 0, 0 } b := uint64(buf[n]) n++ x += (b & 0xFF) << shift if (b & 0x80) == 0 { return x, n } } // The number is too large to represent in a 64-bit value. return 0, 0 }
300 in compact varint number = 1 010 1100 0000 000 1 (0xAC01)
share
1 010 1100 0000 000 1
add zeros or shift " (b & 0xFF) << shift
"
1 010 1100 = 0000 000 1 010 1100 0000 000 1 = 0000 000 1 000 0000
To the first byte, we simply added the higher 7 zeros to align and the second byte was shifted 7 bits forward (b & 0xFF) << shift
. In this case, the most significant bit is saved, unlike the original varint.
now add " x +=
"
0000 000 1 010 1100 + 0000 000 1 000 0000 = 0000 001 0 010 1100 → 256 + 32 + 8 + 4 = 300
This is a key moment of difference from the original DecodeVarint. Instead of operation "or" we make addition. Due to the most significant bit saved at the previous stage, we get a better result.
a := []byte{255,127} // 1111 1111 0111 1111
2 byte maximum value of compact varint: 16511
coded: 1 111 1111 0111 111 1 (0xFF7F)
1 111 1111 0111 111 1
add zeros or shift
1 111 1111 = 0000 000 1 111 1111 0111 111 1 = 0111 111 1 000 0000
0000 000 1 111 1111 + 0111 111 1 000 0000 = 1000 000 0 111 1111 = 16511
result: 16511
2-byte maximum value of the original varint: 16383
coded: 1 111 1111 0 111 1111 (0xFF7F)
share
1 111 1111 0 111 1111
discard the high bit
111 1111 111 1111
111 1111 ++ 111 1111 → 111 1111 111 1111 = 16383
Result: 16383
the difference between the maximum values: 16511 (compact varint) - 16383 (varint) = 128
For 2 bytes, it is not large but saves bytes with values from 16384 to 16511 in comparison with the original varint.
// 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 0111 1111 a := []byte{255,255,255,255,255,255,255,127}
72624976668147839 ( 8 compact varint) - 72057594037927935 ( 8 varint ) = 567382630219904 ( )
Here we save the 9th byte on a much larger segment of values.
Here, the volume needed to record all values up to the specified value is calculated and the statistics of the difference between the old and new versions is displayed.
package main import ( "fmt" ) func DecodeVarintDifference(buf []byte) (difference uint64, newVarint uint64, oldVarint uint64, n int) { for shift := uint(0); shift < 64; shift += 7 { if n >= len(buf) { return 0, 0, 0, 0 } b := uint64(buf[n]) n++ newVarint += (b & 0xFF) << shift oldVarint |= (b & 0x7F) << shift if (b & 0x80) == 0 { return newVarint - oldVarint, newVarint, oldVarint, n } } // The number is too large to represent in a 64-bit value. return 0, 0, 0, 0 } func convert(i uint64)(string){ v := []string{"B","KB","MB", "GB", "TB", "PB", "EB"} l := 0 for ; i > 1024; i = i / 1024 {l++} return fmt.Sprintf("(%d %s)",i, v[l]) } func main() { a := []byte{255,255,255,255,255,255,255,255,127} var newByteCount, newStart, oldByteCount, oldStart uint64 for i := len(a) - 1; i >= 0; i-- { difference, nVarint, oVarint, n := DecodeVarintDifference(a[i:]) newByteCount += (nVarint - newStart + 1) * uint64(n) oldByteCount += (oVarint - oldStart + 1) * uint64(n) fmt.Println("size:", n, "B") fmt.Println() fmt.Println( "new start value:", newStart, "\nnew max value:",nVarint) fmt.Println() fmt.Println( "old start value:", oldStart, "\nold max value:",oVarint) fmt.Println() fmt.Println("max value diff:", difference) oldByteCountForSameRange := oldByteCount + ( difference * uint64(n + 1) ) sizeDifference := oldByteCountForSameRange - newByteCount fmt.Println() fmt.Println("for range from 0 to", nVarint) fmt.Println("new varint data size:", newByteCount, "B", convert(newByteCount)) fmt.Println("old varint data size:", oldByteCountForSameRange, "B", convert(oldByteCountForSameRange)) fmt.Println("size differece:", sizeDifference, "B", convert(sizeDifference)) fmt.Println("percent from data size new:", float64(sizeDifference) / float64(newByteCount) * 100, "%") fmt.Println("percent from data size old:", float64(sizeDifference) / float64(oldByteCountForSameRange) * 100, "%") fmt.Println("average byte per value new:", float64(newByteCount) / float64( nVarint + 1 ) ) fmt.Println("average byte per value old:", float64(oldByteCountForSameRange) / float64( nVarint + 1 ) ) fmt.Println("-------------------------------------------------------") fmt.Println() newStart = nVarint oldStart = oVarint } }
Testing on The Go Playground: https://play.golang.org/p/HPyb-sXMSjJ
size: 1 B new start value: 0 new max value: 127 old start value: 0 old max value: 127 max value diff: 0 for range from 0 to 127 new varint data size: 128 B (128 B) old varint data size: 128 B (128 B) size differece: 0 B (0 B) percent from data size new: 0 % percent from data size old: 0 % average byte per value new: 1 average byte per value old: 1 ------------------------------------------------------- size: 2 B new start value: 127 new max value: 16511 old start value: 127 old max value: 16383 max value diff: 128 for range from 0 to 16511 new varint data size: 32898 B (32 KB) old varint data size: 33026 B (32 KB) size differece: 128 B (128 B) percent from data size new: 0.38908140312481004 % percent from data size old: 0.38757342699691155 % average byte per value new: 1.9923691860465116 average byte per value old: 2.0001211240310077 ------------------------------------------------------- size: 3 B new start value: 16511 new max value: 2113663 old start value: 16383 old max value: 2097151 max value diff: 16512 for range from 0 to 2113663 new varint data size: 6324357 B (6 MB) old varint data size: 6340997 B (6 MB) size differece: 16640 B (16 KB) percent from data size new: 0.26310975171072726 % percent from data size old: 0.2624193009395841 % average byte per value new: 2.9921297803245928 average byte per value old: 3.0000023655604675 ------------------------------------------------------- size: 4 B new start value: 2113663 new max value: 270549119 old start value: 2097151 old max value: 268435455 max value diff: 2113664 for range from 0 to 270549119 new varint data size: 1080066185 B (1 GB) old varint data size: 1082196489 B (1 GB) size differece: 2130304 B (2 MB) percent from data size new: 0.19723828313354705 % percent from data size old: 0.19685001953466882 % average byte per value new: 3.992126032418808 average byte per value old: 4.000000033265678 ------------------------------------------------------- size: 5 B new start value: 270549119 new max value: 34630287487 old start value: 268435455 old max value: 34359738367 max value diff: 270549120 for range from 0 to 34630287487 new varint data size: 172878758030 B (161 GB) old varint data size: 173151437454 B (161 GB) size differece: 272679424 B (260 MB) percent from data size new: 0.15772870369226125 % percent from data size old: 0.15748031203751395 % average byte per value new: 4.992125984801758 average byte per value old: 5.00000000040427 ------------------------------------------------------- size: 6 B new start value: 34630287487 new max value: 4432676798591 old start value: 34359738367 old max value: 4398046511103 max value diff: 34630287488 for range from 0 to 4432676798591 new varint data size: 26561157824660 B (24 TB) old varint data size: 26596060791572 B (24 TB) size differece: 34902966912 B (32 GB) percent from data size new: 0.13140604465515907 % percent from data size old: 0.1312335957776889 % average byte per value new: 5.992125984257845 average byte per value old: 6.000000000004512 ------------------------------------------------------- size: 7 B new start value: 4432676798591 new max value: 567382630219903 old start value: 4398046511103 old max value: 562949953421311 max value diff: 4432676798592 for range from 0 to 567382630219903 new varint data size: 3967210831773851 B (3 PB) old varint data size: 3971678411539355 B (3 PB) size differece: 4467579765504 B (4 TB) percent from data size new: 0.1126126126124338 % percent from data size old: 0.1124859392574144 % average byte per value new: 6.992125984252029 average byte per value old: 7.000000000000048 ------------------------------------------------------- size: 8 B new start value: 567382630219903 new max value: 72624976668147839 old start value: 562949953421311 old max value: 72057594037927935 max value diff: 567382630219904 for range from 0 to 72624976668147839 new varint data size: 580427963135197347 B (515 PB) old varint data size: 580999813345182755 B (516 PB) size differece: 571850209985408 B (520 TB) percent from data size new: 0.09852216748768333 % percent from data size old: 0.0984251968503923 % average byte per value new: 7.9921259842519685 average byte per value old: 8 ------------------------------------------------------- size: 9 B new start value: 72624976668147839 new max value: 9295997013522923647 old start value: 72057594037927935 old max value: 9223372036854775807 max value diff: 72624976668147840 for range from 0 to 9295997013522923647 new varint data size: 9803799999989973164 B (8 EB) old varint data size: 9876996826868106412 B (8 EB) size differece: 73196826878133248 B (65 PB) percent from data size new: 0.7466168922071861 % percent from data size old: 0.7410838351088465 % average byte per value new: 1.0546259842519685 average byte per value old: 1.0625 -------------------------------------------------------
My pull request to golang / protobuf lay for a year before they looked at it and sent it to the right place for my proposal.
Multiformats / unsigned-varint still uses the original varint. Now it's too late to change it.
And I hope that at least someone to comact varint will help save some bytes.
A few days later I came across an article in the Wikipedia Variable-length quantity . There, in the Removing Redundancy section, the idea presented in my article is described.
It turns out that Git has been using compact varint for a long time, but it records them from older to younger. This has several disadvantages:
When encoding, an additional buffer is used which at the end is copied to the final one:
git / varint.c # L22 (encode_varint)
unsigned char varint[16];
git / varint.c # L28 (encode_varint)
memcpy(buf, varint + pos, sizeof(varint) - pos);
In the version presented in my article, the recording is done directly to the final buffer:
proto / encode.go # L99 (EncodeVarint)
buf[n] = 0x80 | uint8(x&0x7F)
When decoding, the unit is added separately and the high-order bit is reset:
git / varint.c # L10 (decode_varint)
val += 1;
git / varint.c # L14 (decode_varint)
val = (val << 7) + (c & 127);
In my version, the most significant bit is added to the number, replacing these two operations:
proto / decode.go # L70 (DecodeVarint)
x += (b & 0xFF) << shift
Source: https://habr.com/ru/post/350796/
All Articles