Compact varint - uniqueness and great values for the same value.

UPD 2018.03.15: Git has been using its compact varint version for a long time. Differences in the afterword .

Attention: The code presented in the article is slightly different from the original EncodeVarint and DecodeVarint and gives other results. Be careful.

In a multiformats / unsigned-varint discussion of the correct entry of a number in a varint, it was noted that many numbers in the original varint can be written in a sequence of different lengths. This will give different blocks and their hashes with identical values encoded in the protobuffer .

Original varint

The original varint simply divides the number into pieces of 7 bits. And it writes them in order from low to high, adding to each bit an upper 8th bit (MSB - the most significant bit). The value of this bit depends on whether the last bit is (0) or not (1).

Thus, for example, the value 0 can be written in many variants:

0000 0000 (0x00) varint = 0
1000 0000 0000 0000 (0x8000) varint = 0
1000 0000 1000 0000 0000 0000 (0x808000) varint = 0
and so on.

Compact varint

I thought that it is possible to begin the values of a larger container from the maximum value of the previous container + 1. After all, if we use a container of this size, that number should be greater than the maximum of the previous container.

Benefits:

A unique value for each set of bytes.
1. 0000 0000 (0x00) varint = 0
2. 1000 0000 0000 0000 (0x8000) varint = 128
3. 1000 0000 1000 0000 0000 0000 (0x808000) varint = 16 512
Larger values for a set of bytes of the same size.
1. For 2 bytes a maximum of 16 511 which is 128 more than 16 383 for the original varint
2. For 8 bytes, the maximum is already 567 382 630 219 904 more

We code in compact varint

As I said above, the code is a little different from the original. Here I just added one line:

 x -= 1

And she gave uniqueness and savings.

https://github.com/ivan386/protobuf-1/blob/b76098a2adb7a080231cf14903ab7f3687667ce6/proto/encode.go#L95-L106

 func EncodeVarint(x uint64) []byte { var buf [maxVarintBytes]byte var n int for n = 0; x > 127; n++ { buf[n] = 0x80 | uint8(x&0x7F) x >>= 7 x -= 1 } buf[n] = uint8(x) n++ return buf[0:n] }

Example

300 = 000 0010 010 1100

separate the first 7 bits: " uint8(x&0x7F) , x >>= 7 "
010 1100
add the most significant bit to them (1 since there are more bits): " 0x80 | "
1 010 1100
subtract one from the remaining: " x -= 1 "
(000 0010) - 1 = 000 0001
This is the key and only difference from the original EncodeVarint. Due to it, we can write a larger value in the same number of bytes.
add the most significant bit to them (0 since this is the last part)
0 000 0001
glue
1 010 1100 ++ 0 000 0001 = 1 010 1100 0 000 0001

300 = 1010 1100 0000 0001 (0xAC01) compact varint

Decoding compact varint

Decryption also has not undergone great changes. Here I changed one line:

 x |= (b & 0x7F) << shift

 x += (b & 0xFF) << shift

https://github.com/ivan386/protobuf-1/blob/b76098a2adb7a080231cf14903ab7f3687667ce6/proto/decode.go#L63-L78

 func DecodeVarint(buf []byte) (x uint64, n int) { for shift := uint(0); shift < 64; shift += 7 { if n >= len(buf) { return 0, 0 } b := uint64(buf[n]) n++ x += (b & 0xFF) << shift if (b & 0x80) == 0 { return x, n } } // The number is too large to represent in a 64-bit value. return 0, 0 }

Example

300 in compact varint number = 1 010 1100 0000 000 1 (0xAC01)

share
```
 1 010 1100 0000 000 1 
```
add zeros or shift " (b & 0xFF) << shift "
```
 1 010 1100 = 0000 000 1 010 1100 0000 000 1 = 0000 000 1 000 0000 
```
To the first byte, we simply added the higher 7 zeros to align and the second byte was shifted 7 bits forward (b & 0xFF) << shift . In this case, the most significant bit is saved, unlike the original varint.
now add " x += "
```
 0000 000 1 010 1100 + 0000 000 1 000 0000 = 0000 001 0 010 1100 → 256 + 32 + 8 + 4 = 300 
```
This is a key moment of difference from the original DecodeVarint. Instead of operation "or" we make addition. Due to the most significant bit saved at the previous stage, we get a better result.

Why is it more compact:

Calculate the 2 byte maximum value

 a := []byte{255,127} // 1111 1111 0111 1111

2 byte maximum value of compact varint: 16511
coded: 1 111 1111 0111 111 1 (0xFF7F)

share
```
 1 111 1111 0111 111 1 
```

add zeros or shift

 1 111 1111 = 0000 000 1 111 1111 0111 111 1 = 0111 111 1 000 0000

we fold

 0000 000 1 111 1111 + 0111 111 1 000 0000 = 1000 000 0 111 1111 = 16511

result: 16511

2-byte maximum value of the original varint: 16383
coded: 1 111 1111 0 111 1111 (0xFF7F)

share
```
 1 111 1111 0 111 1111 
```
discard the high bit
```
 111 1111 111 1111 
```

glue the bits

 111 1111 ++ 111 1111 → 111 1111 111 1111 = 16383

Result: 16383

the difference between the maximum values: 16511 (compact varint) - 16383 (varint) = 128

For 2 bytes, it is not large but saves bytes with values from 16384 to 16511 in comparison with the original varint.

Calculate the savings for an 8 byte record.

 // 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 0111 1111 a := []byte{255,255,255,255,255,255,255,127}

 72624976668147839 (    8  compact varint) - 72057594037927935 (    8   varint ) = 567382630219904 (  )

Here we save the 9th byte on a much larger segment of values.

Code to calculate the difference:

Here, the volume needed to record all values up to the specified value is calculated and the statistics of the difference between the old and new versions is displayed.

 package main import ( "fmt" ) func DecodeVarintDifference(buf []byte) (difference uint64, newVarint uint64, oldVarint uint64, n int) { for shift := uint(0); shift < 64; shift += 7 { if n >= len(buf) { return 0, 0, 0, 0 } b := uint64(buf[n]) n++ newVarint += (b & 0xFF) << shift oldVarint |= (b & 0x7F) << shift if (b & 0x80) == 0 { return newVarint - oldVarint, newVarint, oldVarint, n } } // The number is too large to represent in a 64-bit value. return 0, 0, 0, 0 } func convert(i uint64)(string){ v := []string{"B","KB","MB", "GB", "TB", "PB", "EB"} l := 0 for ; i > 1024; i = i / 1024 {l++} return fmt.Sprintf("(%d %s)",i, v[l]) } func main() { a := []byte{255,255,255,255,255,255,255,255,127} var newByteCount, newStart, oldByteCount, oldStart uint64 for i := len(a) - 1; i >= 0; i-- { difference, nVarint, oVarint, n := DecodeVarintDifference(a[i:]) newByteCount += (nVarint - newStart + 1) * uint64(n) oldByteCount += (oVarint - oldStart + 1) * uint64(n) fmt.Println("size:", n, "B") fmt.Println() fmt.Println( "new start value:", newStart, "\nnew max value:",nVarint) fmt.Println() fmt.Println( "old start value:", oldStart, "\nold max value:",oVarint) fmt.Println() fmt.Println("max value diff:", difference) oldByteCountForSameRange := oldByteCount + ( difference * uint64(n + 1) ) sizeDifference := oldByteCountForSameRange - newByteCount fmt.Println() fmt.Println("for range from 0 to", nVarint) fmt.Println("new varint data size:", newByteCount, "B", convert(newByteCount)) fmt.Println("old varint data size:", oldByteCountForSameRange, "B", convert(oldByteCountForSameRange)) fmt.Println("size differece:", sizeDifference, "B", convert(sizeDifference)) fmt.Println("percent from data size new:", float64(sizeDifference) / float64(newByteCount) * 100, "%") fmt.Println("percent from data size old:", float64(sizeDifference) / float64(oldByteCountForSameRange) * 100, "%") fmt.Println("average byte per value new:", float64(newByteCount) / float64( nVarint + 1 ) ) fmt.Println("average byte per value old:", float64(oldByteCountForSameRange) / float64( nVarint + 1 ) ) fmt.Println("-------------------------------------------------------") fmt.Println() newStart = nVarint oldStart = oVarint } }

Testing on The Go Playground: https://play.golang.org/p/HPyb-sXMSjJ

Calculation results

 size: 1 B new start value: 0 new max value: 127 old start value: 0 old max value: 127 max value diff: 0 for range from 0 to 127 new varint data size: 128 B (128 B) old varint data size: 128 B (128 B) size differece: 0 B (0 B) percent from data size new: 0 % percent from data size old: 0 % average byte per value new: 1 average byte per value old: 1 ------------------------------------------------------- size: 2 B new start value: 127 new max value: 16511 old start value: 127 old max value: 16383 max value diff: 128 for range from 0 to 16511 new varint data size: 32898 B (32 KB) old varint data size: 33026 B (32 KB) size differece: 128 B (128 B) percent from data size new: 0.38908140312481004 % percent from data size old: 0.38757342699691155 % average byte per value new: 1.9923691860465116 average byte per value old: 2.0001211240310077 ------------------------------------------------------- size: 3 B new start value: 16511 new max value: 2113663 old start value: 16383 old max value: 2097151 max value diff: 16512 for range from 0 to 2113663 new varint data size: 6324357 B (6 MB) old varint data size: 6340997 B (6 MB) size differece: 16640 B (16 KB) percent from data size new: 0.26310975171072726 % percent from data size old: 0.2624193009395841 % average byte per value new: 2.9921297803245928 average byte per value old: 3.0000023655604675 ------------------------------------------------------- size: 4 B new start value: 2113663 new max value: 270549119 old start value: 2097151 old max value: 268435455 max value diff: 2113664 for range from 0 to 270549119 new varint data size: 1080066185 B (1 GB) old varint data size: 1082196489 B (1 GB) size differece: 2130304 B (2 MB) percent from data size new: 0.19723828313354705 % percent from data size old: 0.19685001953466882 % average byte per value new: 3.992126032418808 average byte per value old: 4.000000033265678 ------------------------------------------------------- size: 5 B new start value: 270549119 new max value: 34630287487 old start value: 268435455 old max value: 34359738367 max value diff: 270549120 for range from 0 to 34630287487 new varint data size: 172878758030 B (161 GB) old varint data size: 173151437454 B (161 GB) size differece: 272679424 B (260 MB) percent from data size new: 0.15772870369226125 % percent from data size old: 0.15748031203751395 % average byte per value new: 4.992125984801758 average byte per value old: 5.00000000040427 ------------------------------------------------------- size: 6 B new start value: 34630287487 new max value: 4432676798591 old start value: 34359738367 old max value: 4398046511103 max value diff: 34630287488 for range from 0 to 4432676798591 new varint data size: 26561157824660 B (24 TB) old varint data size: 26596060791572 B (24 TB) size differece: 34902966912 B (32 GB) percent from data size new: 0.13140604465515907 % percent from data size old: 0.1312335957776889 % average byte per value new: 5.992125984257845 average byte per value old: 6.000000000004512 ------------------------------------------------------- size: 7 B new start value: 4432676798591 new max value: 567382630219903 old start value: 4398046511103 old max value: 562949953421311 max value diff: 4432676798592 for range from 0 to 567382630219903 new varint data size: 3967210831773851 B (3 PB) old varint data size: 3971678411539355 B (3 PB) size differece: 4467579765504 B (4 TB) percent from data size new: 0.1126126126124338 % percent from data size old: 0.1124859392574144 % average byte per value new: 6.992125984252029 average byte per value old: 7.000000000000048 ------------------------------------------------------- size: 8 B new start value: 567382630219903 new max value: 72624976668147839 old start value: 562949953421311 old max value: 72057594037927935 max value diff: 567382630219904 for range from 0 to 72624976668147839 new varint data size: 580427963135197347 B (515 PB) old varint data size: 580999813345182755 B (516 PB) size differece: 571850209985408 B (520 TB) percent from data size new: 0.09852216748768333 % percent from data size old: 0.0984251968503923 % average byte per value new: 7.9921259842519685 average byte per value old: 8 ------------------------------------------------------- size: 9 B new start value: 72624976668147839 new max value: 9295997013522923647 old start value: 72057594037927935 old max value: 9223372036854775807 max value diff: 72624976668147840 for range from 0 to 9295997013522923647 new varint data size: 9803799999989973164 B (8 EB) old varint data size: 9876996826868106412 B (8 EB) size differece: 73196826878133248 B (65 PB) percent from data size new: 0.7466168922071861 % percent from data size old: 0.7410838351088465 % average byte per value new: 1.0546259842519685 average byte per value old: 1.0625 -------------------------------------------------------

Conclusion

My pull request to golang / protobuf lay for a year before they looked at it and sent it to the right place for my proposal.

Multiformats / unsigned-varint still uses the original varint. Now it's too late to change it.

And I hope that at least someone to comact varint will help save some bytes.

Afterword

2018.03.15

A few days later I came across an article in the Wikipedia Variable-length quantity . There, in the Removing Redundancy section, the idea presented in my article is described.

It turns out that Git has been using compact varint for a long time, but it records them from older to younger. This has several disadvantages:

When encoding, an additional buffer is used which at the end is copied to the final one:
git / varint.c # L22 (encode_varint)
```
 unsigned char varint[16]; 
```
git / varint.c # L28 (encode_varint)
```
 memcpy(buf, varint + pos, sizeof(varint) - pos); 
```
In the version presented in my article, the recording is done directly to the final buffer:
proto / encode.go # L99 (EncodeVarint)
```
 buf[n] = 0x80 | uint8(x&0x7F) 
```
When decoding, the unit is added separately and the high-order bit is reset:
git / varint.c # L10 (decode_varint)
```
 val += 1; 
```
git / varint.c # L14 (decode_varint)
```
 val = (val << 7) + (c & 127); 
```
In my version, the most significant bit is added to the number, replacing these two operations:
proto / decode.go # L70 (DecodeVarint)
```
 x += (b & 0xFF) << shift 
```
git / varint.c is not compatible with mine since it uses a different order of bytes.

Sources

Source: https://habr.com/ru/post/350796/

All Articles

Compact varint - uniqueness and great values ​​for the same value.

Original varint

Compact varint

We code in compact varint

Example

Decoding compact varint

Example

Why is it more compact:

Calculate the 2 byte maximum value

Calculate the savings for an 8 byte record.

Code to calculate the difference:

Conclusion

Afterword

2018.03.15

Sources

More articles:

Compact varint - uniqueness and great values for the same value.