
Absurdly fast base64 encoding and decoding

About the author: Daniel Lemire is a professor of computer science at the University of Quebec (Canada). His research focuses on software performance and data engineering.

Computers store data as streams of bits. Binary files such as images, audio files, and videos can contain almost any sequence of bits.

However, we often rely on text formats; for example, web pages and emails must be text. How do we send images by email? How do we embed pictures in web pages? One option is to link to the actual binary file. Another common approach is to embed the binary file directly in the body of the email or web page using base64. Base64 is simply a standard text format that can encode any binary data. To be precise, base64 output is always valid ASCII text (and therefore also valid UTF-8). Each byte of base64 text carries 6 bits of data, so we "lose" about 2 bits per byte. As a result, the base64 version of a binary file is about 33% larger than the original. In practice, this increase in size is rarely a problem. As far as I know, email attachments are almost always base64 encoded.
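The size arithmetic above can be checked with a minimal sketch using Python's standard `base64` module. The 300-byte random payload is a stand-in chosen for illustration, not something from the article.

```python
import base64
import os

# 300 arbitrary bytes stand in for a binary file (illustrative payload).
payload = os.urandom(300)
encoded = base64.b64encode(payload)

# Every 3 input bytes (24 bits) become 4 output characters of 6 bits each.
assert len(encoded) == 4 * ((len(payload) + 2) // 3)

# The output is plain ASCII, hence also valid UTF-8.
text = encoded.decode("ascii")

overhead = len(encoded) / len(payload) - 1
print(f"{len(payload)} bytes -> {len(encoded)} characters (+{overhead:.0%})")

# Decoding recovers the original bytes exactly.
assert base64.b64decode(encoded) == payload
```

For inputs whose length is not a multiple of 3, padding characters (`=`) round the output up to a multiple of 4, so the overhead can be slightly above one third.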
When writing HTML, I find it convenient to embed images directly in the HTML code using the data URI scheme. For example, in a recent article I encoded a PNG file this way. The largest websites, such as Google, use this scheme all the time. A minor drawback is that web pages become somewhat larger (obviously) and the embedded images cannot benefit from image caching. On the other hand, the browser saves one network request.
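As a rough illustration of the data URI scheme, the sketch below builds an `<img>` tag with an inline image. The helper name `to_data_uri` is hypothetical, and the 8-byte PNG signature merely stands in for a real image file.

```python
import base64

def to_data_uri(data: bytes, mime_type: str) -> str:
    """Return a data URI that embeds `data` as base64 text."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# The 8-byte PNG signature stands in for the contents of a real image file.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature, "image/png")

# The URI goes wherever an image URL would normally go.
img_tag = f'<img src="{uri}" alt="embedded image">'
print(img_tag)
```

With a real file, `data` would be the full file contents, for example read with `open(path, "rb").read()`.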

If you are a web developer, you can use Web Storage to create a client-side database for your application. This database can hold images and arbitrary data, but everything in it must be base64 encoded.

Most database engines support binary data, but some require base64 encoding at some point, including MongoDB, Elasticsearch, Amazon SimpleDB, and Amazon DynamoDB. There are probably others.

Base64 is commonly used in cryptography to exchange keys. It is also used to pass arbitrary data as part of a URI.
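One detail worth knowing when base64 text goes into a URI: the standard alphabet contains `+` and `/`, which have special meanings there. A common workaround, not discussed in the article, is the URL-safe variant from RFC 4648, which swaps them for `-` and `_`. A minimal sketch with Python's standard library:

```python
import base64

# Three bytes chosen so that the encoding uses the two problematic characters.
data = b"\xfb\xff\xfe"

standard = base64.b64encode(data)          # uses '+' and '/'
url_safe = base64.urlsafe_b64encode(data)  # uses '-' and '_' instead

print(standard.decode("ascii"), url_safe.decode("ascii"))

# Both forms decode back to the same bytes with the matching decoder.
assert base64.b64decode(standard) == data
assert base64.urlsafe_b64decode(url_safe) == data
```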

Fortunately, base64 encoding and decoding are fast, although there are cases where speed can still become a problem. Matt Crane and Jimmy Lin found that decoding base64-encoded binary attributes in Amazon DynamoDB is slow.

How fast can you decode base64 data? On a recent Intel processor, it takes about two cycles per byte (from cache) with a fast decoder like the one built into the Chrome browser. Such a decoder spends most of its time on table lookups. This is much slower than copying data within the cache, which takes less than 0.05 cycles per byte.
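To make "table lookups" concrete, here is a minimal sketch of a scalar table-driven decoder. It is an illustration written for this text, not the decoder used in Chrome, and Python says nothing about the cycle counts quoted above. The names `DECODE_TABLE` and `decode_base64` are hypothetical.

```python
import base64

ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
INVALID = 0xFF

# 256-entry table: ASCII code of a base64 character -> its 6-bit value.
DECODE_TABLE = [INVALID] * 256
for value, char in enumerate(ALPHABET):
    DECODE_TABLE[char] = value

def decode_base64(encoded: bytes) -> bytes:
    """Decode padded standard base64; raise ValueError on invalid input."""
    if len(encoded) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    stripped = encoded.rstrip(b"=")
    if len(encoded) - len(stripped) > 2:
        raise ValueError("too much padding")
    out = bytearray()
    for i in range(0, len(stripped), 4):
        group = stripped[i:i + 4]
        # One table lookup per input character.
        values = [DECODE_TABLE[ch] for ch in group]
        if INVALID in values:
            raise ValueError("invalid base64 character")
        # A short final group is completed with zero bits.
        values += [0] * (4 - len(group))
        a, b, c, d = values
        word = (a << 18) | (b << 12) | (c << 6) | d
        # 4 characters give 3 bytes; 3 give 2; 2 give 1.
        out += word.to_bytes(3, "big")[:len(group) - 1]
    return bytes(out)

sample = b"any binary data \x00\x01\x02"
assert decode_base64(base64.b64encode(sample)) == sample
```

The sketch rejects characters outside the alphabet and misplaced padding, but for brevity it does not check that the unused trailing bits of a short final group are zero.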

Is this the best you can get?

A few years ago, Alfred Klomp showed that much better results could be achieved with vector instructions. Wojciech Muła, several colleagues (including Howard and Kurz), and I decided to take a serious look at the problem. Muła has a web page dedicated to the topic.

We found that vector instructions can speed up processing tenfold, down to only about 0.2 cycles per byte on recent Intel processors. This is still more than copying, but low enough that base64 is unlikely ever to become the bottleneck of a system. I should note that these 0.2 cycles per byte include error handling: the decoder must both decode and validate the input (for example, decoding is aborted if invalid characters are found).
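To put cycles per byte into more familiar units, the figures quoted in this article can be converted to throughput. The 3 GHz clock below is an assumption made for the arithmetic, not a number from the article, and the helper name is hypothetical.

```python
CLOCK_HZ = 3.0e9  # assumed 3 GHz clock; not a figure from the article

def throughput_gb_per_s(cycles_per_byte: float, clock_hz: float = CLOCK_HZ) -> float:
    """Convert a cost in cycles per byte into gigabytes per second."""
    return clock_hz / cycles_per_byte / 1e9

costs = [
    ("table-lookup decoder", 2.0),
    ("vectorized decoder", 0.2),
    ("copy within cache (upper bound on cost)", 0.05),
]
for name, cycles_per_byte in costs:
    rate = throughput_gb_per_s(cycles_per_byte)
    print(f"{name}: {cycles_per_byte} cycles/byte is about {rate:.1f} GB/s")
```

Under that assumption, the table-lookup decoder handles roughly 1.5 GB/s and the vectorized one roughly 15 GB/s, while an in-cache copy runs at 60 GB/s or more.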

The code for our research is available, so you can reproduce the results. Our paper is available on arXiv and has been accepted for publication in ACM Transactions on the Web.

As far as I understand, our best results have been integrated into Klomp's base64 library.

Additional materials:


Wojciech Muła, Daniel Lemire, "Faster Base64 Encoding and Decoding Using AVX2 Instructions", ACM Transactions on the Web (to appear)

Source: https://habr.com/ru/post/347864/

