This is a translation of one of Ben Johnson's articles from the "Go Walkthrough" series, an in-depth look at the Go standard library in the context of real-world tasks.
In the previous post, we figured out how to work with byte streams, but sometimes we need to work with a fixed set of bytes in memory. Although byte slices are perfectly suitable for many tasks, there are plenty of cases where the bytes package is a better fit. We will also look at the strings package today, since its API is almost identical to bytes, except that it works with strings.
This post is one of a series of articles taking a deeper look at the standard library. Although the standard documentation provides plenty of useful information, in the context of real-world tasks it can be hard to figure out what to use and when. This series aims to show how standard library packages are used in real applications. If you have questions or comments, you can always reach me on Twitter at @benbjohnson.
Rob Pike wrote an excellent, in-depth post about strings, bytes, runes and characters, but for this post I would like to give a simpler definition from a developer's point of view.
A byte slice is a mutable, resizable, sequential list of bytes. That is a bit of a mouthful, so let's break down what it means.
Say we have a byte slice:
buf := []byte{1,2,3,4}
It is mutable, so you can change its elements:
buf[3] = 5 // []byte{1,2,3,5}
You can also change its size:
buf = buf[:2]          // []byte{1,2}
buf = append(buf, 100) // []byte{1,2,100}
And it is sequential, since the bytes in memory go one by one:
1|2|3|4
Strings, on the other hand, are an immutable, fixed-size, sequential list of bytes. This means you cannot modify a string; you can only create new ones. This matters for performance: in programs that need to be very fast, constantly creating a large number of strings puts a significant load on the garbage collector.
From a developer's point of view, strings work best when you are dealing with UTF-8 data: unlike byte slices, they can be used as map keys, and most APIs use strings to represent textual data. Byte slices, on the other hand, are much better suited to raw bytes, for example when processing data streams. They are also more convenient when you want to avoid new memory allocations and reuse memory.
One of the most important features of the bytes and strings packages is that they implement io.Reader and io.Writer interfaces for working with bytes and strings in memory.
Two of the most underused functions in the standard Go library are bytes.NewReader and strings.NewReader:
func NewReader(b []byte) *Reader
func NewReader(s string) *Reader
These functions return an io.Reader implementation that wraps a byte slice or a string in memory. But they are not just readers: they also implement other related interfaces, such as io.ReaderAt, io.WriterTo, io.ByteReader, io.ByteScanner, io.RuneReader, io.RuneScanner and io.Seeker.
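For example, a strings.Reader can be seeked and then read like a file, all without leaving memory. Here is a minimal sketch of that idea (assuming a reasonably recent Go version for io.SeekStart; the sample text is made up):

r := strings.NewReader("hello world")

// Jump past the first six bytes, then read the remaining five.
if _, err := r.Seek(6, io.SeekStart); err != nil {
	log.Fatal(err)
}
buf := make([]byte, 5)
if _, err := io.ReadFull(r, buf); err != nil {
	log.Fatal(err)
}
println(string(buf)) // "world"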
I regularly see code where byte slices and strings are first written into a bytes.Buffer, and then the buffer is used as a reader:
var buf bytes.Buffer
buf.WriteString("foo")
http.Post("http://example.com/", "text/plain", &buf)
This approach requires extra memory allocations and can be slow. It is much more efficient to use a strings.Reader:
r := strings.NewReader("foobar")
http.Post("http://example.com", "text/plain", r)
This approach also works when you have multiple strings or byte slices that you want to combine with io.MultiReader():
r := io.MultiReader(
	strings.NewReader("HEADER"),
	bytes.NewReader([]byte{0,1,2,3,4}),
	myFile,
	strings.NewReader("FOOTER"),
)
The bytes package also implements the io.Writer interface for in-memory byte slices via the Buffer type. It implements almost all of the io package interfaces except io.Closer and io.Seeker. It also has a WriteString() helper method for appending a string to the end of the buffer.
I actively use Buffer in unit tests to capture the output of service logs. You can pass a buffer as an argument to log.New() and check the output later:
var buf bytes.Buffer
myService.Logger = log.New(&buf, "", log.LstdFlags)
myService.Run()
if !strings.Contains(buf.String(), "service failed") {
	t.Fatal("expected log message")
}
But in production code, I rarely use Buffer . Despite the name, I do not use it for buffered reading and writing, as there is a bufio package in the standard library especially for this.
At first glance, the bytes and strings packages seem very large, but in reality they are just a collection of simple utility functions. We can group them into several categories: comparison, inspection, prefix/suffix handling and trimming, replacement and case conversion, splitting and joining, plus a few miscellaneous helpers.
Once you see how these functions are grouped, the seemingly large API becomes much more approachable.
When you have two byte slices or two strings, there are two questions you typically need answered. First, are the two objects equal? And second, which one comes first when sorting?
The Equal() function answers the first question:
func Equal(a, b []byte) bool
This function is only in the bytes package, since strings can be compared using the == operator.
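A quick illustration of the two (the sample values here are made up):

println(bytes.Equal([]byte("gopher"), []byte("gopher"))) // true
println("gopher" == "gopher")                            // true, plain == for strings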
Although checking for equality may seem like a simple task, a common mistake is to use strings.ToUpper() for case-insensitive comparison:
if strings.ToUpper(a) == strings.ToUpper(b) {
	return true
}
This approach is inefficient because it allocates two new strings. The better approach is to use EqualFold():
func EqualFold(s, t []byte) bool
func EqualFold(s, t string) bool
The word Fold here refers to Unicode case folding. It covers the upper and lower case rules not only for A-Z but for other languages as well, and can match φ with ϕ, for example.
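A couple of small examples of what EqualFold() matches (the φ/ϕ pair is the one mentioned above):

println(strings.EqualFold("Hello", "hELLO")) // true
println(strings.EqualFold("φ", "ϕ"))         // true, folding goes beyond A-Z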
To determine the sort order of two byte slices or strings, we have the Compare() function:
func Compare(a, b []byte) int
func Compare(a, b string) int
This function returns -1 if a is less than b, 1 if a is greater than b, and 0 if a and b are equal. It is present in the strings package purely for symmetry with bytes. Russ Cox even says that "no one should use strings.Compare". It is simpler to use the built-in < and > operators.
"no one should use strings. Compare", Russ Cox
Usually you need to compare byte slices or strings when sorting data. The sort.Interface interface requires a comparison for its Less() method. To turn the ternary return value of Compare() into the boolean that Less() expects, you just check whether it equals -1:
type ByteSlices [][]byte

func (p ByteSlices) Less(i, j int) bool {
	return bytes.Compare(p[i], p[j]) == -1
}
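For context, Less() only compiles as part of a full sort.Interface implementation, so a minimal sketch (with made-up sample data) might look like this:

type ByteSlices [][]byte

func (p ByteSlices) Len() int           { return len(p) }
func (p ByteSlices) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }
func (p ByteSlices) Less(i, j int) bool { return bytes.Compare(p[i], p[j]) == -1 }

// Inside some function:
p := ByteSlices{[]byte("banana"), []byte("apple"), []byte("cherry")}
sort.Sort(p) // p is now: apple, banana, cherry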
The bytes and strings packages provide several ways to check for or find a value in a string or in a byte slice.
When validating input data, you may need to check for the presence (or absence) of certain bytes. For this there is the Contains() function:
func Contains(b, subslice []byte) bool
func Contains(s, substr string) bool
For example, you can check for certain bad words:
if strings.Contains(input, "darn") {
	return errors.New("inappropriate input")
}
If you need to know the exact number of occurrences of a substring, use Count():
func Count(s, sep []byte) int
func Count(s, sep string) int
Another use of Count() is to count the number of runes in a string. If you pass an empty slice or an empty string as the sep argument, Count() returns the number of runes + 1. This is different from len(), which returns the number of bytes. The distinction matters when you are working with multi-byte Unicode characters:
strings.Count("I ", "") // 6 len("I ") // 9
The first result may look strange, since there are actually only 6 runes, but remember that Count() returns the number of runes plus one.
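If what you actually want is the rune count itself, the unicode/utf8 package has a direct helper for that; a small sketch using the same string:

s := "I ❤ NY"
println(utf8.RuneCountInString(s)) // 6
println(strings.Count(s, ""))      // 7, i.e. rune count + 1
println(len(s))                    // 8 bytes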
Checking for an occurrence is useful, but sometimes you need to find the exact position of a substring or subslice. You can do this with the index functions:
Index(s, sep []byte) int
IndexAny(s []byte, chars string) int
IndexByte(s []byte, c byte) int
IndexFunc(s []byte, f func(r rune) bool) int
IndexRune(s []byte, r rune) int
There are several functions for different cases. Index() searches for a multi-byte subslice. IndexByte() finds a single byte in a slice. IndexRune() searches for a Unicode code point in UTF-8 encoded bytes. IndexAny() works like IndexRune(), but searches for several code points at once. Finally, IndexFunc() lets you use your own function to match the position.
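A few quick examples to make the differences concrete (the sample text is made up):

s := []byte("seafood")

bytes.Index(s, []byte("foo")) // 3
bytes.IndexByte(s, 'd')       // 6
bytes.IndexAny(s, "aeiou")    // 1, position of the first vowel
bytes.IndexFunc(s, func(r rune) bool { return r == 'f' }) // 3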
There is also a similar set of functions for finding the last occurrence, searching from the end:
LastIndex(s, sep []byte) int
LastIndexAny(s []byte, chars string) int
LastIndexByte(s []byte, c byte) int
LastIndexFunc(s []byte, f func(r rune) bool) int
I rarely use the index functions myself, because more often than not I end up writing something more complex, like a parser.
You run into prefixes quite often in programming. For example, HTTP paths are often grouped by functionality using prefixes. Another example is a special character at the beginning of a string, like "@", used to mention a user.
The HasPrefix() and HasSuffix() functions allow you to check such cases:
func HasPrefix(s, prefix []byte) bool
func HasPrefix(s, prefix string) bool
func HasSuffix(s, suffix []byte) bool
func HasSuffix(s, suffix string) bool
These functions may seem too trivial to bother with, but I regularly see the following mistake, where developers forget to check for an empty string:
if str[0] == '@' {
	return true
}
This code looks simple, but if str is an empty string you get a panic. The HasPrefix() function already performs this check for you:
if strings.HasPrefix(str, "@") {
	return true
}
The term "trimming" in the bytes and strings packages means removing bytes or runes at the beginning and / or end of a slice or string. The very generalized function for this is Trim () :
func Trim(s []byte, cutset string) []byte
func Trim(s string, cutset string) string
It removes all runes from the cutset on both sides, from the beginning and the end of the string. You can also trim only from the beginning or only from the end using TrimLeft() and TrimRight(), respectively.
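A short sketch of the three variants on a made-up string:

println(strings.Trim("***Go***", "*"))      // "Go"
println(strings.TrimLeft("***Go***", "*"))  // "Go***"
println(strings.TrimRight("***Go***", "*")) // "***Go"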
In practice, though, the most common kind of trimming is removing whitespace, and for that there is the TrimSpace() function:
func TrimSpace(s []byte) []byte
func TrimSpace(s string) string
You might think that trimming with a cutset of "\n\r" would be enough, but TrimSpace() also removes whitespace characters defined by Unicode. That includes not just spaces, line breaks, and tabs, but also unusual characters such as the "thin space" or "hair space".
TrimSpace() is, in fact, just a thin wrapper around TrimFunc(), with a function that decides which runes to trim:
func TrimSpace(s string) string {
	return TrimFunc(s, unicode.IsSpace)
}
So you can easily build your own variant that trims, say, only trailing whitespace:
TrimRightFunc(s, unicode.IsSpace)
Finally, if you want to remove a specific leading or trailing substring rather than a set of characters, there are the TrimPrefix() and TrimSuffix() functions:
func TrimPrefix(s, prefix []byte) []byte
func TrimPrefix(s, prefix string) string
func TrimSuffix(s, suffix []byte) []byte
func TrimSuffix(s, suffix string) string
They go hand in hand with the HasPrefix() and HasSuffix() functions, which check for the presence of a prefix or suffix, respectively. For example, I use them to support bash-style home directory paths ("~/") in configuration files:
// Look up user's home directory.
u, err := user.Current()
if err != nil {
	return err
} else if u.HomeDir == "" {
	return errors.New("home directory does not exist")
}

// Replace tilde prefix with home directory.
if strings.HasPrefix(path, "~/") {
	path = filepath.Join(u.HomeDir, strings.TrimPrefix(path, "~/"))
}
Sometimes you need to replace a substring or part of a slice. For most simple cases, all you need is the Replace() function:
func Replace(s, old, new []byte, n int) []byte
func Replace(s, old, new string, n int) string
It replaces occurrences of old in your string with new. If n is -1, all occurrences are replaced. This function works well for simple placeholder substitution. For example, you can let the user write a $NOW placeholder and replace it with the current time:
now := time.Now().Format(time.Kitchen)
println(strings.Replace(data, "$NOW", now, -1))
If you need to make several different replacements at once, use strings.Replacer. It takes pairs of old/new values as input:
r := strings.NewReplacer("$NOW", now, "$USER", "mary")
println(r.Replace("Hello $USER, it is $NOW"))

// Output: Hello mary, it is 3:04PM
You may think that letter case is simple: lower case and upper case, what else is there? But Go works with Unicode, and Unicode is never simple. There are three kinds of case: upper, lower, and title case.
Upper and lower case are straightforward for most languages, and the ToUpper() and ToLower() functions are enough:
func ToUpper(s []byte) []byte
func ToUpper(s string) string
func ToLower(s []byte) []byte
func ToLower(s string) string
But in some languages the case rules differ from the usual ones. For example, in Turkish, "i" upper-cases to "İ". For such special cases there are special versions of these functions:
strings.ToUpperSpecial(unicode.TurkishCase, "i")
Then there is title case and the ToTitle() function:
func ToTitle(s []byte) []byte
func ToTitle(s string) string
You may be surprised to see that ToTitle() converts all your characters to what looks like upper case:
println(strings.ToTitle("the count of monte cristo"))

// Output: THE COUNT OF MONTE CRISTO
This is because in Unicode, title case is a distinct case form, not "the first letter of each word in upper case". For most code points, title case and upper case are one and the same, but there are a few code points where they differ. For example, the code point lj (yes, that is a single code point) has the upper case form LJ but the title case form Lj.
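A small sketch of that difference using the actual digraph code points (U+01C7 through U+01C9), assuming the standard Unicode mappings:

println(strings.ToUpper("ǉ")) // "Ǉ" (U+01C7), the upper case form
println(strings.ToTitle("ǉ")) // "ǈ" (U+01C8), the title case form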
The function you most likely want in this case is Title():
func Title(s []byte) []byte
func Title(s string) string
Its output looks more like what you would expect:
println(strings.Title("the count of monte cristo"))

// Output: The Count Of Monte Cristo
There is one more way to transform data in byte slices and strings: the Map() function:
func Map(mapping func(r rune) rune, s []byte) []byte
func Map(mapping func(r rune) rune, s string) string
This function lets you provide your own function to inspect and replace each rune. To be honest, I had no idea this function existed until I started writing this post, so I have no personal usage stories to share.
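Purely as an illustrative sketch of one possible use, here is Map() applied as a rot13 transform (the example input is made up):

rot13 := func(r rune) rune {
	switch {
	case r >= 'a' && r <= 'z':
		return 'a' + (r-'a'+13)%26
	case r >= 'A' && r <= 'Z':
		return 'A' + (r-'A'+13)%26
	}
	return r
}
println(strings.Map(rot13, "Gopher")) // "Tbcure"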
Quite often you have to work with delimited strings that need to be split apart. For example, UNIX path lists are joined with colons, and the CSV format is essentially just data separated by commas.
For simple splitting into subslices or substrings, there are the Split() functions:
func Split(s, sep []byte) [][]byte
func SplitAfter(s, sep []byte) [][]byte
func SplitAfterN(s, sep []byte, n int) [][]byte
func SplitN(s, sep []byte, n int) [][]byte

func Split(s, sep string) []string
func SplitAfter(s, sep string) []string
func SplitAfterN(s, sep string, n int) []string
func SplitN(s, sep string, n int) []string
These functions break a byte slice or string on a delimiter and return the pieces as subslices or substrings. The ...After() variants include the delimiter itself in the pieces, and the ...N() variants limit the number of splits:
strings.Split("a:b:c", ":") // ["a", "b", "c"] strings.SplitAfter("a:b:c", ":") // ["a:", "b:", "c"] strings.SplitN("a:b:c", ":", 2) // ["a", "b:c"]
Splitting is a very common operation, but it usually comes up when working with CSV files or UNIX paths. For those cases I use the encoding/csv and path packages instead.
Sometimes you need to treat a run of delimiters as a single delimiter. The classic example is splitting text into words separated by variable amounts of whitespace. If you simply call Split() with a space as the separator, you get empty substrings whenever the input contains several spaces in a row. Instead, use the Fields() function:
func Fields(s []byte) [][]byte
It treats consecutive spaces as one delimiter:
strings.Fields("hello world") // ["hello", "world"] strings.Split("hello world", " ") // ["hello", "", "", "world"]
The Fields() function is a simple wrapper around another function, FieldsFunc(), which lets you supply an arbitrary function that decides whether a rune is a separator:
func FieldsFunc(s []byte, f func(rune) bool) [][]byte
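For example, here is a sketch that splits on either commas or semicolons (the input is made up); note that FieldsFunc(), like Fields(), does not return empty fields:

isSep := func(r rune) bool { return r == ',' || r == ';' }
fmt.Println(strings.FieldsFunc("a,b;c,,d", isSep)) // [a b c d]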
Another common operation when working with data is joining slices or strings together. For this there is the Join() function:
func Join(s [][]byte, sep []byte) []byte
func Join(a []string, sep string) string
A mistake I keep running into is developers implementing string joining by hand, writing something like this:
var output string
for i, s := range a {
	output += s
	if i < len(a)-1 {
		output += ","
	}
}
return output
The problem with this code is that it performs a lot of memory allocations. Since strings are immutable, each iteration creates a new string. The strings.Join() function uses a byte slice as a buffer and converts it to a string only at the very end, which minimizes the number of allocations.
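A quick sketch of the same join done with Join() (the slice contents are just sample data):

a := []string{"apple", "banana", "cherry"}
println(strings.Join(a, ",")) // "apple,banana,cherry"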
There are two functions I could not fit into any category, so they go here. First, the Repeat() function lets you build a string by repeating a smaller one. Honestly, the only time I have used it was to draw a divider line in terminal output:
println(strings.Repeat("-", 80))
Another function, bytes.Runes(), returns the runes of a byte slice interpreted as UTF-8, as a slice of runes. I have never used it, since ranging over a string with a for loop does the same thing without the extra allocation.
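For reference, here is a sketch of that loop; ranging over a string decodes one rune per iteration, and the index is the byte offset of that rune:

for i, r := range "héllo" {
	fmt.Printf("%d: %c\n", i, r)
}
// 0: h
// 1: é
// 3: l   (é takes two bytes, so the offset jumps from 1 to 3)
// 4: l
// 5: o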
Byte slices and strings are fundamental primitives in Go: they are the in-memory representations of bytes and runes. The bytes and strings packages provide a large number of helper functions as well as adapters to the io.Reader and io.Writer interfaces.
It's easy to lose sight of many of these useful functions because these packages have such large APIs, but I hope this post has helped you get more familiar with them and with the features they provide.
Source: https://habr.com/ru/post/307554/