Rust: “Vectors are values”

Recently, I have been thinking about persistent collections for a long time, and especially about how they relate to Rust. I want to share with you my observations.

How persistent vectors are arranged, whether they are faster than traditional collections - look under the cut.

What is a persistent collection?

Usually, persistent collections are considered an unusual way to work with collections. Instead of adding elements via push , which increases the vector without creating a new instance,

 vec.push(element);

you call the add method, which leaves the source vector in its place and returns the new modified vector :

 let vec2 = vec.add(element);

It is important to note that vec does not change. This feature fits well with functional languages, it is also well used in writing multi-threaded programs.

How do persistent collections work?

I will not go into the details of any particular solution, but most of them are implemented on any kind of trees. For example, having a vector of the form [1, 2, 3, 4, 5, 6] , you can save elements not in one big block, but in the form of a tree, the sheets of which are values. In the following diagram, the set of values is divided into two child nodes pointed to by the parent node:

  [* *] // <--      | | ----- ----- 1 2 3 4 5 6

Now imagine that we want to change one of the values in the vector. For example, we want to swap 6 for 10 . This means that we must change the right node, while leaving the left node untouched. After that we will have to re-create the parent node, which will refer to the new right node.

  [* *] // <--  | | //    ,   ----- ----- 1 2 3 4 5 6 ----- | 4 5 10 // <--     | ------ | | [* *] // <--

In a balanced tree, the operation to add to a persistent vector tends to O (log n) - we have to clone some sheet and change it, then we have to copy and change all the parent nodes on the way to the root. This is much more resource-intensive than changing a regular vector, which requires only a few processor instructions .

A few notes:

if there is only one reference to the vector, you can often avoid these cloning and just change the tree in place. Later I will talk about the experimental library of persistent collections of DVec using this trick. This is difficult to implement in a garbage collection language, because you never know how many links there are to your collection. - there are many other persistent collective implementation options that aim to optimize for specific use cases. For example, here is a solution suitable for Prolog-like applications. This design uses “under the hood” variability so that you can do an insert for O (1), but it hides it from the user. Of course, you need to pay for these quick inserts: using old copies of the data structure costs more.

Persistent collections turn collections into values

In some cases, persistent collections make it possible to write easier-to-understand code. This is because they act as "ordinary meanings" and are not unique. Look at this JS code working with numbers:

 function foo() { let x = 0; let y = x; y += 1; return y - x; }

If we change y , we expect x not change. This is because x is a simple value. However, if we use an array:

 function foo() { let x = []; let y = x; y.push(22); use(x, y); }

Now, when I change y , the variable x also changes. It is possible that I need it, and perhaps not. Things become more confusing when vectors are inside objects:

 function foo() { let object = { field: [] }; ... let object2 = { field: object.field }; ... //  `object.field`  `object2.field` // ,      . ... }

Do not get me wrong, it is often convenient when object.field and object2.field are the same vector, and changes in one will be reflected on the other. However, sometimes this is not what you need. I often noticed that using persistent collections makes my code cleaner and easier to understand.

Rust is not like that

If you saw one of my speeches on Rust , then you know that they focused on the following Rust design feature:

Sharing and variability: good in their own right, disgusting together.

Simply put, when you have two pointers to the same memory location (like object.field and object2.field in the last example), then changing the data through one of the pointers is fraught with the danger of a race. This is especially pronounced in Rust, when you want to do without a garbage collector, because with a garbage collector it is not known how many pointers refer to a section of memory. This is true even with GC , for actions like object.field.push(...) can affect more objects than you think, causing bugs (in particular, but not exclusively, when several streams are working simultaneously).

What happens in Rust if you want to access one vector through two separate pointers? Let's go back to the JavaScript examples, but now let's project them onto Rust. The first example works the same as on JS:

 let x = 0; let mut y = x; y += 1; return y - x;

However, the second example with vectors will not compile:

 let x = vec![]; let mut y = x; y.push(...); use(x, y); // :  ''

The problem is that as soon as we make y = x , ownership of the value of x will be transferred to another variable, so it ( x ) can no longer be used.

Rust - ordinary vectors are values

This leads us to the following conclusion: the usual Rust collections that we use every day behave like values . Even more — any type in Rust that does not use Cell or RefCell . If your code is compiled, you know that your vector is not accessible to change through different pointers - you can replace it with a number and it will behave the same way.

From the foregoing, it follows that the persistent collections in Rust do not have to have a different interface as the usual ones . For example, I wrote a persistent vector implementation - the dogged library. The library provides a DVec type that is based on persistent vectors provided by Clojure . If you look at the methods that are available from DVec , you will see that they are the most common ( push , etc.).

Here is one example of using DVec :

 let mut x = DVec::new(); x.push(something); x.push(something_else); for element in &x { ... }

However, DVec is a persistent data structure that is implemented as a prefix tree . DVec contains an Arc (thread-safe reference counter) within itself that references internal data. When you call push , we update Arc , so now it will refer to the new vector, leaving the old one in its place.

( Arc::make_mut is a very cool method, it checks the value of the counter - if it is 1, it gives you exclusive access to the content that can be changed. If the value of the counter is not 1, then the method will slope Arc (and its contents) in place and gives you a modifiable link to this clone. If you remember how persistent data structures work, then this situation is very useful for updating the tree during a traversal. This allows you to avoid cloning in cases where only one link refers to the data structure.)

However, persistent collections are different.

The main difference between Vec and DVec is not in the supported operations, but in how expensive these operations are . When you call push on Vec , it is O (1). When you clone, it is O (n). In DVec these complexity estimates are rearranged: push requires O (log n), and cloning requires O (1).

clone on DVec increases the value of the reference count of the internal Arc , while the same operation on a normal vector clones all the data . Of course, when you call push on DVec , it will clone part of the data, rebuilding the affected parts of the tree (while Vec usually just writes to the end of the array).

However, this big O - notation, as everyone knows, speaks only of asymptotic behavior. One of the problems I encountered when working with DVec was that it is quite difficult to compete with the standard Vec in speed. Often, copying a data set is faster than updating trees and allocating memory. I realized that you have to make a lot of effort in order to justify the use of DVec - for example, you clone a vector many times, and they contain a lot of data.

Of course, performance is not everything. If you clone a vector many times, DVec should use less memory, because it can use the common parts of the data structure.

Conclusion

I tried to show how the system of ownership in Rust offers a fusion of functional and imperative styles through the use of persistent collections. This means that although the collections presented in the standard Rust library are implemented in an imperative style, they behave like normal values : when you assign another vector to one vector, if you want to keep the original vector, we have to clone it, which makes the independent vector from the old.

My observations are not new. In 1990, Phil Wadler wrote the article “Linear types can change the world,” in which he makes the same statements, albeit from a slightly different side. He says that you can provide a persistent interface (for example, the vec.add(element) method, which returns a new vector). However, if you use linear types, you can implement this as an imperative data structure (by providing vec.push(element) ), and no one will know about it.

Playing with DVec , I realized that it is very convenient to have a persistent vector that offers the same interface as the usual one. For example, it was very easy to change the DVec based ena library to work in persistent mode (using DVec ) or imperative mode (using Vec ). Simply put, the idea is to hide the type used under a uniform interface.

(Disregarding the main topic, I would say that I would like to see some experimental studies: let's say it would be convenient to have a vector that would be converted to persistent after reaching a certain length).

I think that there is one more reason for someone purposefully engaged in persistent collections. While simultaneous access and volatility can be dangerous, it is sometimes very necessary, although in Rust it is now inconvenient. I believe that we need to correct the current state of affairs, as I have some thoughts on this (I want to postpone until the next article). Rust already has persistent collections, the cloning of which, however, is inefficient.

Source: https://habr.com/ru/post/348866/

All Articles