Creating a Rust function that returns a String or &str

From the translator

This is the last article in Herman Radtke's series on working with strings and memory in Rust, which I have been translating. I find it the most useful of the series and originally wanted to start with it, but the earlier articles turned out to be necessary as well: they provide context and introduce simpler but very important aspects of the language, without which this article loses much of its value.

We have learned how to create a function that takes a String or &str as an argument. Now I want to show you how to create a function that returns a String or &str. I also want to discuss why we might need this.

To begin, let's write a function that removes all spaces from a given string. Our function might look something like this:

    fn remove_spaces(input: &str) -> String {
        let mut buf = String::with_capacity(input.len());
        for c in input.chars() {
            if c != ' ' {
                buf.push(c);
            }
        }
        buf
    }

This function allocates memory for a string buffer, walks over all the characters of the input string, and appends every character that is not a space to buf. Now the question: what if the input does not contain a single space? Then input is exactly the same as buf, and it would be more efficient not to create buf at all. Instead, we would like to simply return input to the caller. But input has type &str, while our function returns a String. We could change the type of input to String:

 fn remove_spaces(input: String) -> String { ... }

But this causes two problems. First, if input becomes a String, the caller has to transfer ownership of the input to our function and can no longer work with the same data afterwards. We should take ownership of input only if we really need it. Second, the caller's data may already be a &str, and then we force them to convert it to a String, which defeats our attempt to avoid allocating memory for buf.
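Both problems can be seen from the caller's side. The sketch below is only an illustration: remove_spaces_owned and caller_demo are hypothetical names, not part of the original article.

```rust
// Hypothetical variant that takes ownership of its input.
fn remove_spaces_owned(input: String) -> String {
    input.chars().filter(|&c| c != ' ').collect()
}

// Shows what a caller has to do to keep using its own data.
fn caller_demo() -> (String, String) {
    let name = String::from("Herman Radtke");
    // Problem 1: passing `name` directly would move it into the function,
    // so we clone it (an extra allocation) to keep using it afterwards.
    let stripped = remove_spaces_owned(name.clone());
    // Problem 2: a caller holding only a &str must allocate a String first,
    // even though this input contains no spaces at all.
    let from_slice = remove_spaces_owned("no_spaces".to_string());
    // `name` is still usable here only because of the clone above.
    (format!("{} -> {}", name, stripped), from_slice)
}
```

In both calls an allocation happens on the caller's side before the function has done any work.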

Clone-on-write

What we really want is to return our input string (&str) if it contains no spaces, and a new string (String) if it does contain spaces that we need to remove. This is where the clone-on-write type Cow comes to the rescue. Cow lets us abstract over whether we own the data (Owned) or merely borrow it (Borrowed). In our example, &str is a reference to an existing string, so this is borrowed data. If the string contains spaces, we need to allocate memory for a new String. The variable buf owns this string. Normally we would move ownership of buf by returning it to the caller. With Cow, we instead move ownership of buf into the Cow and return that.

    use std::borrow::Cow;

    fn remove_spaces<'a>(input: &'a str) -> Cow<'a, str> {
        if input.contains(' ') {
            let mut buf = String::with_capacity(input.len());
            for c in input.chars() {
                if c != ' ' {
                    buf.push(c);
                }
            }
            return Cow::Owned(buf);
        }
        Cow::Borrowed(input)
    }

Our function checks whether the input argument contains at least one space, and only then allocates memory for a new buffer. If input contains no spaces, it is simply returned as is. We add a little runtime complexity in order to optimize how we handle memory. Note that our Cow type has the same lifetime as the &str. As we said earlier, the compiler needs to track the use of the &str reference in order to know when it is safe to free the memory (or call the destructor if the type implements Drop).
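To check which branch was taken, we can match on the returned value. The following is a minimal sketch: remove_spaces is repeated so that the block stands on its own, and variant_name is a hypothetical helper added only for illustration.

```rust
use std::borrow::Cow;

fn remove_spaces<'a>(input: &'a str) -> Cow<'a, str> {
    if input.contains(' ') {
        // Slow path: build a new String without the spaces.
        Cow::Owned(input.chars().filter(|&c| c != ' ').collect())
    } else {
        // Fast path: hand the input back without allocating.
        Cow::Borrowed(input)
    }
}

// Hypothetical helper that reports which variant came back.
fn variant_name(c: &Cow<str>) -> &'static str {
    match c {
        Cow::Borrowed(_) => "Borrowed",
        Cow::Owned(_) => "Owned",
    }
}
```

For an input without spaces, the returned Cow points at the very same bytes as the input, which can be confirmed by comparing remove_spaces(input).as_ptr() with input.as_ptr().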

The beauty of Cow is that it implements the Deref trait, so you can call any method that does not mutate the data without even knowing whether a new buffer was allocated for the result. For example:

    let s = remove_spaces("Herman Radtke");
    println!("Length of string: {}", s.len());

If I need to mutate s, I can convert it into an owned value with the into_owned() method. If the Cow holds borrowed data (the Borrowed variant), this allocates memory. This approach lets us clone (that is, allocate memory) lazily, only when we really need to write to (or mutate) the value.

Example with a mutable Cow::Borrowed:

    let s = remove_spaces("Herman");    // s is the Cow::Borrowed variant
    let len = s.len();                  // immutable method call through Deref
    let owned: String = s.into_owned(); // memory is allocated for a new String

Example with a mutable Cow::Owned:

    let s = remove_spaces("Herman Radtke"); // s is the Cow::Owned variant
    let len = s.len();                      // immutable method call through Deref
    let owned: String = s.into_owned();     // no allocation, we already had a String
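Besides into_owned(), the standard Cow type also provides to_mut(), which clones borrowed data on first use and then returns a mutable reference to the owned value. The sketch below illustrates that lazy clone; shout is a hypothetical helper and does not come from the original article.

```rust
use std::borrow::Cow;

// Hypothetical helper: appends an exclamation mark in place.
fn shout<'a>(mut s: Cow<'a, str>) -> Cow<'a, str> {
    // to_mut() turns a Borrowed value into an Owned one (allocating once)
    // and returns &mut String; an Owned value is reused without cloning.
    s.to_mut().push('!');
    s
}
```

Passing Cow::Borrowed("Herman") yields an Owned value holding "Herman!", while a value that was already Owned is modified in place without being cloned first.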

The idea behind Cow is as follows:

1. Postpone memory allocation for as long as possible. In the best case, we never allocate any new memory at all.
2. Let the caller of our remove_spaces function stop worrying about memory allocation. Cow is used the same way whether or not new memory was allocated.

Using the Into trait

Earlier we talked about using the Into trait to convert a &str into a String. In the same way, we can use it to convert a &str or a String into the appropriate Cow variant. Calling .into() makes the compiler pick the correct conversion automatically. Using .into() does not slow our code down at all; it is just a way to avoid spelling out Cow::Owned or Cow::Borrowed explicitly.
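The two conversions can be seen in isolation. This is a small sketch with hypothetical function names, relying on the standard From<&str> and From<String> implementations for Cow<str>.

```rust
use std::borrow::Cow;

// A &str converts into the Borrowed variant; nothing is allocated.
fn borrowed_via_into<'a>(s: &'a str) -> Cow<'a, str> {
    s.into()
}

// A String converts into the Owned variant; the existing buffer is moved in.
fn owned_via_into(s: String) -> Cow<'static, str> {
    s.into()
}
```

In both cases the declared return type tells the compiler which conversion to pick.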

    use std::borrow::Cow;

    fn remove_spaces<'a>(input: &'a str) -> Cow<'a, str> {
        if input.contains(' ') {
            let mut buf = String::with_capacity(input.len());
            for c in input.chars() {
                if c != ' ' {
                    buf.push(c);
                }
            }
            return buf.into(); // String becomes Cow::Owned
        }
        input.into() // &str becomes Cow::Borrowed
    }

And finally, we can slightly simplify our example using iterators:

    fn remove_spaces<'a>(input: &'a str) -> Cow<'a, str> {
        if input.contains(' ') {
            input
                .chars()
                .filter(|&x| x != ' ')
                .collect::<String>()
                .into()
        } else {
            input.into()
        }
    }

Real-world uses of Cow

My example of removing spaces may seem a bit contrived, but this strategy is used in real code too. Rust itself has a function that lossily converts bytes to a UTF-8 string, replacing invalid byte sequences, and a function that translates line endings from CRLF to LF. Both functions have an optimal case in which a &str can be returned, and a less optimal case that requires allocating memory for a String. Other examples that come to mind are encoding a string into valid XML/HTML and correctly escaping special characters in an SQL query. In many cases the input is already correctly encoded or escaped, and then it is better to simply return the input string as is. If the data does need to change, we have to allocate memory for a string buffer and return that instead.
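As an illustration of both cases, the sketch below shows a deliberately simplified HTML escaper built on the same pattern. escape_html and is_borrowed are hypothetical helpers, and only three characters are escaped, so this is not a complete or production-ready implementation. The standard library's String::from_utf8_lossy follows the same convention and returns a Cow<str>.

```rust
use std::borrow::Cow;

// Hypothetical, simplified escaper: handles only '&', '<' and '>'.
fn escape_html<'a>(input: &'a str) -> Cow<'a, str> {
    // Fast path: nothing to escape, return the input untouched.
    if !input.contains(|c: char| matches!(c, '&' | '<' | '>')) {
        return Cow::Borrowed(input);
    }
    // Slow path: at least one character must be replaced.
    let mut buf = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => buf.push_str("&amp;"),
            '<' => buf.push_str("&lt;"),
            '>' => buf.push_str("&gt;"),
            _ => buf.push(c),
        }
    }
    Cow::Owned(buf)
}

// Hypothetical helper: true if no new buffer was allocated.
fn is_borrowed(c: &Cow<str>) -> bool {
    matches!(c, Cow::Borrowed(_))
}
```

With these helpers, is_borrowed(&String::from_utf8_lossy(b"hello")) is true because the bytes are already valid UTF-8, while an input containing an invalid byte such as b"hi\xFF" produces the Owned variant.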

Why use String::with_capacity()?

While we are on the subject of efficient memory management, note that I used String::with_capacity() instead of String::new() when creating the string buffer. You can use String::new() instead of String::with_capacity(), but it is much more efficient to allocate all the memory the buffer needs up front than to re-allocate it as we add new characters to the buffer.

String is really a Vec of UTF-8 encoded bytes. When we call String::new(), Rust creates a vector of zero length. When we push the character a into the string buffer, for example with buf.push('a'), Rust has to grow the vector's capacity. To do so, it allocates 2 bytes of memory. As we keep pushing characters and exceed the allocated amount, Rust doubles the size of the string by re-allocating its memory. It continues to grow the vector's capacity each time it is exceeded. The sequence of allocated capacities is 0, 2, 4, 8, 16, 32, …, 2^n, where n is the number of times Rust has detected that the allocated memory was exceeded (the exact starting size is an implementation detail and may differ between Rust versions). Re-allocating memory is very slow (correction: kmc_v3 explained that it may not be as slow as I thought). Rust not only has to ask the kernel for new memory, it also has to copy the contents of the vector from the old memory to the new one. Take a look at the source code of Vec::push to see the resizing logic for yourself.
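The growth can be observed by watching capacity() while pushing characters. The sketch below uses a change in capacity as a stand-in for a re-allocation, which is an assumption of this example; count_reallocations is a hypothetical helper, and the exact capacities you see depend on the Rust version.

```rust
// Hypothetical helper: counts how many times the capacity of `buf`
// changes while `n` one-byte characters are pushed into it.
fn count_reallocations(mut buf: String, n: usize) -> usize {
    let mut changes = 0;
    let mut last = buf.capacity();
    for _ in 0..n {
        buf.push('a');
        if buf.capacity() != last {
            // The capacity grew, so the buffer was resized.
            changes += 1;
            last = buf.capacity();
        }
    }
    changes
}
```

Calling count_reallocations(String::with_capacity(1000), 1000) reports no capacity changes, because the buffer is large enough from the start, while starting from String::new() reports at least one.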

Clarification on re-allocating memory from kmc_v3

It may not be that bad, because:

1. Any decent allocator requests memory from the OS in large chunks and then hands it out to users.
2. Any decent multi-threaded allocator also maintains per-thread caches, so access to it does not have to be synchronized all the time.
3. Very often the allocation can be grown in place, and then no data is copied at all. Maybe you allocated only 100 bytes, but if the next thousand bytes are free, the allocator will simply give them to you.
4. Even when copying is needed, it is a bytewise memcpy with a completely predictable memory access pattern, so it is probably the most efficient way to move data from memory to memory. The system libc typically includes memcpy optimizations for your particular micro-architecture.
5. Large allocations can also be "moved" by reconfiguring the MMU, which means at most one page of data has to be copied. However, changing page tables typically has a large fixed cost, so this method only suits very large vectors. I'm not sure whether jemalloc in Rust performs such optimizations.

Resizing a std::vector in C++ can be very slow because the move constructors have to be called individually for each element, and they can throw exceptions.

In general, we want to allocate new memory only when it is needed, and only as much as is needed. For short strings, such as remove_spaces("Herman Radtke"), the overhead of re-allocating memory does not matter much. But what if I want to remove all the spaces in all the JavaScript files on my site? The overhead of re-allocating the buffer will be much greater. When placing data into a vector (a String or anything else), it is very useful to specify the amount of memory that will be required when creating the vector. In the best case, you know the required length in advance, so the capacity of the vector can be set exactly. The comments in the Vec source code warn about the same thing.
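When the final length can be computed cheaply, the capacity can be set exactly. The sketch below counts the spaces first and then allocates exactly the remaining number of bytes; remove_spaces_exact is a hypothetical name, and the extra pass over the input is a trade-off that may or may not pay off for a given workload.

```rust
// Hypothetical variant that sizes its buffer exactly.
fn remove_spaces_exact(input: &str) -> String {
    // An ASCII space is always a single byte in UTF-8 and never appears
    // inside a multi-byte character, so counting bytes is safe here.
    let spaces = input.bytes().filter(|&b| b == b' ').count();
    let mut buf = String::with_capacity(input.len() - spaces);
    // Append every character that is not a space.
    buf.extend(input.chars().filter(|&c| c != ' '));
    buf
}
```

Compared with String::with_capacity(input.len()), this avoids over-allocating when the input contains many spaces, at the cost of scanning the input twice.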

Source: https://habr.com/ru/post/274565/