A translation of one of Ben Johnson's articles from the "Go Walkthrough" series, an in-depth look at the Go standard library in the context of real-world tasks.
So far we have looked at working with streams and slices of bytes, but few programs simply shuttle bytes back and forth. Bytes on their own do not carry much meaning; it is only when we use those bytes to encode data structures that we can build truly useful applications.
This post is part of a series taking a deeper look at the standard library. Although the standard documentation provides plenty of useful information, in the context of real-world tasks it can be hard to figure out what to use and when. This series aims to show how the standard library packages are used in real applications. If you have questions or comments, you can always reach me on Twitter: @benbjohnson.
In programming we often have several overlapping words for the same concept, and encoding is one of them. It is sometimes called serialization or marshaling, but all of these mean the same thing: adding logical structure to raw bytes.
In the Go standard library, the terms encoding and marshaling are used for two different but related ideas. An encoder in Go is an object that adds logical structure to a stream of bytes, while marshaling works with bytes that are already in memory.
For example, the encoding/json package provides json.Encoder and json.Decoder for working with io.Writer and io.Reader streams, respectively. The same package also provides the json.Marshaler and json.Unmarshaler interfaces for writing bytes to and reading bytes from a slice.
There is one more important distinction in encoding. Some encoding packages operate on primitives: strings, integers, and so on. Strings are encoded with character encodings such as ASCII or Unicode. Integers can be encoded differently depending on endianness, or with variable-length encoding. Even raw bytes are often encoded with schemes like Base64 to turn them into printable characters.
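For instance, here is a minimal sketch of Base64-encoding raw bytes into printable characters with the standard encoding/base64 package (the helper name is my own):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// toBase64 turns arbitrary bytes into printable ASCII characters.
func toBase64(b []byte) string {
	return base64.StdEncoding.EncodeToString(b)
}

func main() {
	raw := []byte{0x00, 0xff, 0x10} // not printable as-is
	s := toBase64(raw)
	fmt.Println(s) // AP8Q

	// Decoding reverses the transformation.
	back, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", back) // 00 ff 10
}
```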
More often, though, when we talk about encoding we mean object encoding: turning complex in-memory values such as structs, maps, and slices into a sequence of bytes. This transformation involves many trade-offs, and over the years people have come up with many different encoding schemes.
Converting a logical structure to bytes may seem trivial at first: these structures already live in memory as bytes. Why not just use those bytes directly?
There are many reasons why the in-memory byte layout is not suitable for saving to disk or sending over the network. The first is compatibility: the in-memory layout of Go objects does not match the layout of Java objects, so two systems written in different languages would be unable to understand each other. Sometimes we also need compatibility not with another programming language but with a human. CSV, JSON, and XML are all examples of human-readable formats that can easily be viewed and edited by hand.
Adding human readability to a format, however, forces a trade-off: formats that are easy for humans to read are slower for computers to parse. Integers are a good example. Humans read numbers in decimal, while computers operate on them in binary. Humans also read numbers of varying lengths, like 1 or 1000, while computers work with fixed-size numbers of 32 or 64 bits. The performance difference may seem negligible for a single number, but it quickly becomes noticeable when parsing millions of them.
There is another trade-off we usually do not think about at first: our data structures may change over time, yet we must still be able to work with data encoded years ago. Some encodings, such as Protocol Buffers, let you describe a schema for your data and version its fields: old fields can be deprecated and new ones added. The downside is that you need the schema definition alongside the data in order to encode or decode it. Go's own gob format takes a different approach and embeds the schema in the encoded stream itself, but the disadvantage there is that the encoded data becomes quite large.
Some formats sidestep the issue entirely and go schemaless. JSON and MessagePack let you encode structures on the fly, but provide no guarantees about safely decoding data encoded by older versions of your code.
We also use systems that encode for us but that we do not usually think of as encoders. Databases, for example, are just another way to take our logical structures and persist them as bytes on disk. There is a lot going on in between, such as network calls, SQL parsing, and query planning, but at its core it is still byte encoding.
Finally, if speed is all you care about, you can use Go's internal memory format and save the data as-is. I even wrote a library for this called raw. Encoding and decoding take literally zero time. But you probably should not use it in production.
If you are one of the few people who have looked into the encoding package itself, you may be a little disappointed. It is the second smallest package after errors, containing only 4 interfaces.
The first two are the BinaryMarshaler and BinaryUnmarshaler interfaces:
```go
type BinaryMarshaler interface {
	MarshalBinary() (data []byte, err error)
}

type BinaryUnmarshaler interface {
	UnmarshalBinary(data []byte) error
}
```
They are intended for objects that can convert themselves to and from a binary format. These interfaces are used in a few places in the standard library, for example time.Time.MarshalBinary(). You will not find them in many places, because there is usually no single canonical way to represent data in binary form; as we have seen, there is a huge number of serialization formats.
At the application level, however, you will most likely pick a single encoding format. For example, you might choose Protocol Buffers for all of your data. There is usually no point in supporting multiple binary formats in one application, so implementing BinaryMarshaler can make sense there.
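A sketch of what an implementation might look like; the Coord type and its fixed 8-byte big-endian layout are assumptions chosen for this example, not an established format:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Coord implements encoding.BinaryMarshaler and encoding.BinaryUnmarshaler
// using a hypothetical fixed 8-byte big-endian layout.
type Coord struct {
	X, Y uint32
}

func (c Coord) MarshalBinary() ([]byte, error) {
	b := make([]byte, 8)
	binary.BigEndian.PutUint32(b[0:4], c.X)
	binary.BigEndian.PutUint32(b[4:8], c.Y)
	return b, nil
}

func (c *Coord) UnmarshalBinary(data []byte) error {
	if len(data) != 8 {
		return fmt.Errorf("coord: expected 8 bytes, got %d", len(data))
	}
	c.X = binary.BigEndian.Uint32(data[0:4])
	c.Y = binary.BigEndian.Uint32(data[4:8])
	return nil
}

func main() {
	b, _ := Coord{X: 1, Y: 2}.MarshalBinary()
	fmt.Printf("% x\n", b) // 00 00 00 01 00 00 00 02

	var c Coord
	if err := c.UnmarshalBinary(b); err != nil {
		panic(err)
	}
	fmt.Println(c) // {1 2}
}
```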
The next two interfaces are TextMarshaler and TextUnmarshaler:
```go
type TextMarshaler interface {
	MarshalText() (text []byte, err error)
}

type TextUnmarshaler interface {
	UnmarshalText(text []byte) error
}
```
These two interfaces are just like the previous ones, except that they operate on UTF-8 text.
Some formats define their own marshaling interfaces, such as json.Marshaler, following the same naming convention.
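As a sketch, here is a hypothetical Celsius type implementing both text interfaces; the "21.5°C" textual form is an assumption made for this example:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Celsius implements encoding.TextMarshaler and encoding.TextUnmarshaler.
// The "<value>°C" text form is an invented convention for illustration.
type Celsius float64

func (c Celsius) MarshalText() ([]byte, error) {
	return []byte(strconv.FormatFloat(float64(c), 'f', 1, 64) + "°C"), nil
}

func (c *Celsius) UnmarshalText(text []byte) error {
	s := strings.TrimSuffix(string(text), "°C")
	f, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return err
	}
	*c = Celsius(f)
	return nil
}

func main() {
	b, _ := Celsius(21.5).MarshalText()
	fmt.Println(string(b)) // 21.5°C

	var c Celsius
	if err := c.UnmarshalText(b); err != nil {
		panic(err)
	}
	fmt.Println(c)
}
```

A nice property of these interfaces is that packages like encoding/json automatically use them when a type provides them, so Celsius would serialize as its text form inside JSON as well.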
The standard library has many useful packages for encoding data. We will look at them in more detail in later articles, but here I would like to give a brief overview. Some of these packages live under encoding/, while others are elsewhere.
Most likely the first package you used when you met Go was fmt (pronounced "fumpt"). It uses C-style printf() formatting to encode and decode numbers, strings, and bytes, and it even has basic object-encoding support. The fmt package is a great, easy way to produce human-readable strings from templates, but parsing those templates takes time.
If you need better performance, you can skip printf templates and use the strconv package instead. It is a low-level package for basic formatting and scanning of strings, integers, floating-point and boolean values, and it is quite fast.
These packages, like Go itself, assume you work with UTF-8 strings. The near-total absence of non-Unicode encoding support in the standard library is most likely because the Internet has, in recent years, converged rapidly on UTF-8, and perhaps also because Rob Pike co-created both Go and UTF-8, who knows. I have apparently been lucky and have not had to deal with non-UTF-8 encodings in Go, but there are packages such as unicode/utf16, encoding/ascii85, and the entire golang.org/x/text subtree. That subtree contains a large number of excellent packages that are part of the Go project but are not covered by the Go 1 backward-compatibility guarantees.
For encoding numbers, the encoding/binary package provides big-endian and little-endian encodings, as well as variable-length integer encoding. Endianness refers to the order in which a value's bytes are laid out. For example, the uint16 representation of the number 1000 (0x03e8 in hexadecimal) consists of two bytes, 03 and e8. In big-endian form they are written in that order: "03 e8". In little-endian form the order is reversed: "e8 03". Many popular CPU architectures are little-endian, but big-endian is usually used for sending data over a network. It even has a name: network byte order.
Finally, there are a couple of packages for encoding the bytes themselves. Byte encoding is typically used to convert bytes into a printable form. For example, the encoding/hex package represents data in hexadecimal form; I personally have only used it for debugging. On the other hand, sometimes you need printable characters because you want to send data over protocols that, for historical reasons, have limited support for binary data (email, for example). The encoding/base32 and encoding/base64 packages are good examples. Another is the encoding/pem package, which is used for encoding TLS certificates.
For encoding objects, the standard library has somewhat fewer packages, but in practice they are more than enough.
Unless you have been living under a rock for the past 10 years, you have probably noticed that JSON has become the default format for encoding objects. As mentioned earlier, JSON has its drawbacks, but it is easy to use and implementations exist in almost every language, which makes it very popular. The encoding/json package provides excellent support for the format, and there are also faster third-party parser implementations for Go, such as ffjson.
And although JSON has become the dominant machine-to-machine exchange format, CSV is still a popular format for exporting data for people. The encoding/csv package provides a good interface for working with tabular data in this format.
If you work with systems built around the 2000s, you will probably have to deal with XML. The encoding/xml package provides a SAX-style, token-based interface for streaming, as well as tag-based marshaling similar to the json package. If you need more complex manipulation, things like DOM, XPath, XSD, or XSLT, you will probably need to use libxml2 via cgo.
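A sketch of the token-based (SAX-style) side of encoding/xml; the document and element names are invented:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// userNames pulls tokens one at a time, SAX-style, and collects the
// name attribute of every <user> element it encounters.
func userNames(doc string) []string {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var names []string
	for {
		tok, err := dec.Token()
		if err != nil { // io.EOF ends the token stream
			break
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "user" {
			names = append(names, se.Attr[0].Value)
		}
	}
	return names
}

func main() {
	doc := `<users><user name="alice"/><user name="bob"/></users>`
	fmt.Println(userNames(doc)) // [alice bob]
}
```

Because tokens are read one at a time, this style handles large documents without loading them fully into memory, which a DOM-style API would require.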
Go also has its own stream encoding format, gob. The package is used by net/rpc to implement remote procedure calls between Go services. Gob is easy to use, but it is not supported in other languages; gRPC looks like the more popular alternative if you need a cross-language tool.
Finally, there is the encoding/asn1 package. Its documentation is sparse, and the only link leads to a 25-page wall of text introducing newcomers to ASN.1. ASN.1 is a complex encoding scheme used mainly by X.509 SSL/TLS certificates.
Encoding provides the foundation for giving logical structure to raw bytes. Without it we would have no strings, no data structures, no databases, and no useful applications at all. What seems like a simple concept actually has a rich history of implementations and a huge set of trade-offs.
In this article we looked at the various encodings implemented in the standard library and some of their trade-offs. We saw how these packages for primitive types and objects build on our understanding of bytes and byte streams in Go. In the following articles we will look deeper into these packages and see how they are used in the context of real-world applications.
Source: https://habr.com/ru/post/309834/