Rust: “Unsafe Abstractions”

The unsafe keyword is an integral part of the design of the Rust language. For those who are not familiar with it: unsafe is a keyword that, in simple terms, is a way to bypass Rust's type checking.

The existence of the unsafe keyword surprises many people at first.
After all, isn't the whole point of Rust that programs do not crash
from memory errors? If so, why is there an easy way to get around
the type system? This may look like a defect in the language.

But it is not that simple; the details follow.

This note introduces the unsafe keyword and the idea of bounded unsafety.
It is in fact a precursor to a note that I hope to write a little later.
That note will discuss the Rust memory model, which specifies what can and cannot be done in unsafe code.

unsafe code adds three capabilities:

Reading and writing a static mutable variable
The closest analogue in C is a global variable (declared extern when it is shared between files).
Since such a variable can be accessed from several threads at once,
unsynchronized access leads to a data race.
Rust prevents this by default, and unsafe code is the way around the restriction.

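    static mut N: i32 = 1;

    fn add_one(n: i32) -> i32 {
        n + 1
    }

    fn main() {
        unsafe {
            // Both reading and writing a `static mut` require `unsafe`.
            N = add_one(N);
        }

        unsafe {
            // Copy the value out first: newer editions lint against taking
            // a reference to a `static mut`, which `println!` would do.
            let current = N;
            println!("{}", current); // prints 2
        }
    }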

Raw pointer dereference
The compiler does not know in advance where a raw pointer points.
The responsibility falls on the programmer, who must make sure
that the pointer refers to memory that may legitimately be accessed.

    fn add_one_ptr(n: *mut i32) {
        unsafe {
            *n = *n + 1;
        }
    }

    fn main() {
        let mut n = 5;
        add_one_ptr(&mut n as *mut i32);
        // Reading `n` here needs no `unsafe` block:
        // it is an ordinary local variable, not a `static mut`.
        println!("{}", n); // prints 6
    }

This code will most likely crash with a segmentation fault:

    unsafe {
        let ptr = 0 as *mut i32;
        *ptr = 1;
    }
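One way for the programmer to take on that responsibility is to check the pointer before dereferencing it. The sketch below is an illustration only: the helper name add_one_checked is hypothetical, and a null check alone does not prove that a pointer is valid (it may still dangle), which is why the function itself stays unsafe.

```rust
/// Adds one to the pointee if the pointer is non-null.
/// Returns `false` when the pointer is null and nothing was written.
/// (Hypothetical helper, not part of the original article.)
///
/// # Safety
/// A non-null `n` must point to a live, properly aligned `i32`
/// that nothing else accesses for the duration of the call.
unsafe fn add_one_checked(n: *mut i32) -> bool {
    // `as_mut` yields `None` for a null pointer.
    match unsafe { n.as_mut() } {
        Some(value) => {
            *value += 1;
            true
        }
        None => false,
    }
}

fn main() {
    let mut n = 5;
    // SAFETY: the pointer comes from a live local variable.
    let changed = unsafe { add_one_checked(&mut n) };
    assert!(changed);
    assert_eq!(n, 6);

    // SAFETY: the null case is handled explicitly by the function.
    let changed = unsafe { add_one_checked(std::ptr::null_mut()) };
    assert!(!changed);
}
```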

Calling unsafe code
Any unsafe operation must be marked with an unsafe block.
If a function has the unsafe specifier in its signature, its entire body is considered
unsafe, so a call to this function must be wrapped in an unsafe block.

Like this:

    unsafe fn do_dangerous_thing() {
        println!("{}", "in `unsafe` code");
    }

    fn main() {
        unsafe {
            do_dangerous_thing();
        }
    }

Yet, in my opinion, unsafe is not a flaw. In fact, it is an
important part of the language. unsafe plays the role of an escape valve: it means that we can keep the type system fairly simple while still allowing you to use whatever tricks you want in your code. We only require that you hide these tricks (unsafe code) behind safe external abstractions.

"Unsafe" code as a plugin

I think that the way interpreted languages like Ruby (or Python) use C code is a good analogy for how unsafe works in Rust. Take, say, the JSON module in Ruby. It includes both a pure Ruby implementation (JSON::Pure) and an alternative C implementation (JSON::Ext). Usually, when you use the JSON module, you are running C code, but your Ruby code
cannot tell the difference. From the outside, this code looks
the same as any other Ruby module, but inside it can use various clever tricks and perform optimizations that cannot be expressed in Ruby itself. (You can read this excellent article on Helix to learn more; it also explains how to write Ruby plugins in Rust.)

Well, the same can happen in Rust, though on a slightly different scale. For example, you can write an efficient hash table in pure Rust. Adding unsafe code can make it even faster. If this data structure is going to be used by many people, or if its performance is critical for your program,
then it may be worth the effort (which is why we use unsafe code in the implementation of the standard library). In any case, the calling Rust code treats unsafe code exactly as it treats safe code: the layers of abstraction provide a uniform
external API.

Of course, the fact that unsafe code can make a program faster does not mean that you should use it often. Just as most Ruby code is written in Ruby, most Rust code is written in safe Rust. This is all the more true because safe Rust code is already very efficient, so switching to unsafe code for the sake of performance is rarely worth the effort.

The most common use of unsafe code in Rust appears to be calling libraries written in other languages through the FFI (Foreign Function Interface). Every call to a C function from Rust is unsafe, because the compiler cannot judge the safety of the C code.
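As a rough illustration of the FFI case, the sketch below declares two functions from the C standard library and hides each call behind a safe wrapper. The wrapper names c_abs and c_strlen are hypothetical. The unsafe extern block syntax assumes a reasonably recent toolchain (Rust 1.82 or later); older compilers write the same block as plain extern "C".

```rust
use std::ffi::{c_char, c_int, CStr};

// Declarations of C standard library functions. The compiler cannot
// verify that these signatures match the C side, hence `unsafe`.
unsafe extern "C" {
    fn abs(input: c_int) -> c_int;
    fn strlen(s: *const c_char) -> usize;
}

/// Safe wrapper (hypothetical name). In C, `abs(INT_MIN)` is
/// undefined behaviour, so that input is rejected up front.
fn c_abs(x: c_int) -> Option<c_int> {
    if x == c_int::MIN {
        return None;
    }
    // SAFETY: `abs` is defined for every other `int` value.
    Some(unsafe { abs(x) })
}

/// Safe wrapper (hypothetical name). `CStr` guarantees a valid,
/// NUL-terminated string, which is exactly what `strlen` requires.
fn c_strlen(s: &CStr) -> usize {
    // SAFETY: `s.as_ptr()` points to a NUL-terminated buffer.
    unsafe { strlen(s.as_ptr()) }
}

fn main() {
    assert_eq!(c_abs(-3), Some(3));
    let s = CStr::from_bytes_with_nul(b"hello\0").unwrap();
    assert_eq!(c_strlen(s), 5);
}
```

The callers of c_abs and c_strlen never write unsafe themselves; the preconditions of the C functions are enforced by the wrappers.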

Extending the language with `unsafe` code

I think the most interesting reason to write unsafe code in Rust (or a C module in Ruby) is
to extend the capabilities of the language. Probably the most frequently cited example is the Vec type in the standard library, which uses unsafe code to manipulate uninitialized memory. Rc and Arc, which implement reference counting,
are also cases in point. However, there are far more interesting examples: Crossbeam and deque use unsafe code to implement non-blocking (lock-free) data structures, while Jobsteal and Rayon use unsafe code to implement a thread pool.

In this article we will look at one simple example: the split_at_mut method from the standard library. This method works on mutable slices. It takes an index (mid) and divides the slice into two parts at that index. It returns two smaller slices: one covering the range 0..mid and the other covering the range mid.. (through the end of the slice).
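To make the behaviour concrete, here is a small usage sketch of the standard split_at_mut method. The helper name fold_right_head_into_left is made up for this example; the point is that both halves can be mutated while both borrows are alive.

```rust
/// Adds the first element of the right part to the first element of
/// the left part, then zeroes the former. (Hypothetical helper.)
fn fold_right_head_into_left(data: &mut [i32], mid: usize) {
    // Two mutable, non-overlapping views of the same slice.
    let (left, right) = data.split_at_mut(mid);
    if let (Some(l), Some(r)) = (left.first_mut(), right.first_mut()) {
        *l += *r;
        *r = 0;
    }
}

fn main() {
    let mut v = [1, 2, 3, 4, 5];
    fold_right_head_into_left(&mut v, 2);
    // left was [1, 2], right was [3, 4, 5]
    assert_eq!(v, [4, 2, 0, 4, 5]);
}
```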

For convenience, you can imagine split_at_mut implemented as:

    impl [T] {
        pub fn split_at_mut(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            (&mut self[0..mid], &mut self[mid..])
        }
    }

This code will not compile, for two reasons:

In general, the compiler does not look closely at indices, only at the array being indexed. That is, when it sees an expression of the form foo[i], it ignores the index and treats the array as a single whole (foo[_]). As a result, it cannot tell that &mut self[0..mid] refers to a different region of memory than &mut self[mid..]. Performing such an analysis would require a much more complex type system.
In fact, the [] operator applied to a range is not built into the language: it is implemented entirely in the standard library. Therefore, even if the compiler knew that 0..mid and mid.. do not overlap, it would not follow that the results refer to non-overlapping regions of memory.

One can imagine changing the compiler so that this code sample compiles, and perhaps we will do that someday. But for now we prefer to implement methods like split_at_mut using unsafe code. This lets us keep the type system simple while still being able to write an API like split_at_mut.

Boundaries of abstraction

Viewing unsafe code as a plugin makes the idea of an abstraction boundary clear. When you write a Ruby plugin, you expect that Ruby code calling your functions will pass you ordinary Ruby objects.
Inside, you can do whatever you want, for example use a C array instead of a Ruby array. But when you return to Ruby code, you must convert the values you return back into standard Ruby objects.

The same is true for unsafe code in Rust. To client code, your code looks safe. This means you may assume that the caller passes valid values as input. It also means that every value you return must satisfy the requirements of the Rust type system. Inside the unsafe boundary, you may bend the rules as you see fit (exactly how far you may bend them is a topic for discussion; I hope to cover it in a later note).

Let's look at the split_at_mut method that we saw in the last section. To simplify the understanding, we will consider only the external interface of the function, represented by the signature:

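    impl [T] {
        pub fn split_at_mut(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            // The body is omitted so that we can focus on
            // the external interface. Safe code never needs
            // to look inside anyway.
        }
    }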

What can we learn from this signature?
To begin with, split_at_mut may rely on all of its inputs being valid (in safe code, the compiler checks that this is indeed the case). For the unsafe code inside split_at_mut, this can be expressed as the following rules:

The self argument has type &mut [T]. It follows that we receive a reference to some number (N) of elements of type T. The reference is mutable, so we know that nobody else can access the memory behind self (until the mutable reference ceases to exist). We also know that the memory is initialized.
The mid argument has type usize. All we know is that it is a non-negative integer.

One thing is notably missing. Nothing guarantees that mid is a valid index into self. It follows that the unsafe code we are about to write will have to check this itself.

When split_at_mut returns, it must ensure that its return value
matches the signature. Put simply, the function must return
two valid slices (that is, slices pointing to allocated memory). It is also important that these slices are disjoint, that is, they cover two non-overlapping regions of memory.
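The postcondition can be observed from the outside. The following sketch (the function name parts_tile_the_slice is an assumption made for this example) checks that the two slices returned by the standard split_at_mut together cover the original slice and that the right part starts exactly where the left part ends, so they cannot overlap.

```rust
/// Returns `true` if the two parts cover the whole slice and the right
/// part begins exactly at the end of the left part. (Hypothetical
/// helper; panics if `mid > data.len()`, like `split_at_mut`.)
fn parts_tile_the_slice<T>(data: &mut [T], mid: usize) -> bool {
    let len = data.len();
    let (left, right) = data.split_at_mut(mid);
    let lengths_add_up = left.len() + right.len() == len;
    // One-past-the-end pointer of `left` versus start of `right`.
    let right_starts_where_left_ends = left.as_ptr_range().end == right.as_ptr();
    lengths_add_up && right_starts_where_left_ends
}

fn main() {
    let mut v = [1, 2, 3, 4];
    assert!(parts_tile_the_slice(&mut v[..], 1));
    assert!(parts_tile_the_slice(&mut v[..], 4));
}
```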

Possible implementations

Let's look at several possible implementations of split_at_mut and determine if they are working variants or not. We have already seen that the implementation written in "pure" Rust does not work (does not compile). Let's try to implement a function using raw pointers:

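    impl [T] {
        pub fn split_at_mut(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            use std::slice::from_raw_parts_mut;

            // The `unsafe` block gives access to raw pointer operations.
            // By writing an `unsafe` block, we assert that none of the
            // actions below triggers UB (undefined behaviour).
            unsafe {
                // raw pointer to the first element of the slice
                let p: *mut T = &mut self[0];

                // pointer to the element at index `mid`
                let q: *mut T = p.offset(mid as isize);

                // number of elements from `mid` to the end
                let remainder = self.len() - mid;

                // assemble a slice covering `0..mid`
                let left: &mut [T] = from_raw_parts_mut(p, mid);

                // assemble a slice covering `mid..`
                let right: &mut [T] = from_raw_parts_mut(q, remainder);

                (left, right)
            }
        }
    }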

This version is close to the one implemented in the standard library.
However, the code relies on an assumption that the inputs do not justify: it assumes that mid is within the bounds of the slice. Nowhere is it verified that mid <= len. This means that q may point outside the slice, and that computing remainder may underflow and wrap around.
This implementation is incorrect, because it relies on more guarantees than the caller
is required to provide.
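The wrap-around is easy to demonstrate in isolation. In the sketch below (both helper names are hypothetical), wrapping_sub reproduces what len - mid yields when overflow checks are disabled, as they are by default in release builds, while checked_sub reports the problem instead. In debug builds the plain subtraction would panic.

```rust
/// What `len - mid` evaluates to when overflow checks are off.
fn remainder_wrapping(len: usize, mid: usize) -> usize {
    len.wrapping_sub(mid)
}

/// A variant that reports the out-of-range case as `None`.
fn remainder_checked(len: usize, mid: usize) -> Option<usize> {
    len.checked_sub(mid)
}

fn main() {
    // `mid` within bounds: both agree.
    assert_eq!(remainder_wrapping(5, 2), 3);
    assert_eq!(remainder_checked(5, 2), Some(3));

    // `mid` past the end of a 3-element slice: the wrapped result is
    // a huge length that `from_raw_parts_mut` would blindly trust.
    assert_eq!(remainder_wrapping(3, 5), usize::MAX - 1);
    assert_eq!(remainder_checked(3, 5), None);
}
```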

We can fix this implementation by adding an assert that mid is
a valid index (note that assert in Rust is always executed, even in optimized builds):

    impl [T] {
        pub fn split_at_mut(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            use std::slice::from_raw_parts_mut;

            // Check that `mid` is within bounds:
            assert!(mid <= self.len());

            // The same as before, with fewer comments:
            unsafe {
                let p: *mut T = &mut self[0];
                let q: *mut T = p.offset(mid as isize);
                let remainder = self.len() - mid;
                let left: &mut [T] = from_raw_parts_mut(p, mid);
                let right: &mut [T] = from_raw_parts_mut(q, remainder);
                (left, right)
            }
        }
    }

With that, we have practically reproduced the standard library implementation of this function (the real one uses a few other helper
tools, but the idea is essentially the same).
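The listings in this article are written as impl [T], which only the standard library itself may do. For experimenting, the same idea can be sketched as a free function. The name split_at_mut_checked is hypothetical, and this version uses as_mut_ptr and add rather than &mut self[0] and offset, so that it also handles an empty slice; it is not the standard library's actual code.

```rust
use std::slice::from_raw_parts_mut;

/// Free-function sketch of the checked split (hypothetical name).
/// Panics if `mid > slice.len()`.
fn split_at_mut_checked<T>(slice: &mut [T], mid: usize) -> (&mut [T], &mut [T]) {
    let len = slice.len();
    // The check that makes the `unsafe` block below sound.
    assert!(mid <= len);

    // Unlike `&mut slice[0]`, this does not panic on an empty slice.
    let p: *mut T = slice.as_mut_ptr();

    // SAFETY: `mid <= len`, so `p..p+mid` and `p+mid..p+len` both lie
    // inside the original slice and do not overlap.
    unsafe {
        (
            from_raw_parts_mut(p, mid),
            from_raw_parts_mut(p.add(mid), len - mid),
        )
    }
}

fn main() {
    let mut v = [10, 20, 30, 40];
    let (left, right) = split_at_mut_checked(&mut v, 1);
    left[0] += 1;
    right[0] += 2;
    assert_eq!(v, [11, 22, 30, 40]);
}
```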

Expanding the boundaries of abstraction

Of course, it could be that we really did want to assume that mid is in bounds and to skip the check. We cannot do this in split_at_mut itself, because it is part of the standard library. However, you can imagine a private helper method whose callers have already validated this assumption, so that the costly run-time bounds check can be avoided. In that case, split_at_mut relies on the calling code to guarantee that
mid is within the bounds of the slice. This means that split_at_mut is no longer safe code, because it places additional requirements on its inputs in order to handle memory safely.

Rust lets you express that an entire function is unsafe by placing the unsafe keyword in its signature. After that, the unsafety is no longer an internal detail of the function's implementation; it is part of the function's interface. So we can make a variant of split_at_mut, called split_at_mut_unchecked, which does not check that mid is within bounds:

    impl [T] {
        // Here the **fn** itself is declared `unsafe`. Calling such a fn
        // requires an `unsafe` block, because the fn places an
        // extra requirement on its arguments: `mid <= self.len()`.
        pub unsafe fn split_at_mut_unchecked(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            use std::slice::from_raw_parts_mut;
            let p: *mut T = &mut self[0];
            let q: *mut T = p.offset(mid as isize);
            let remainder = self.len() - mid;
            let left: &mut [T] = from_raw_parts_mut(p, mid);
            let right: &mut [T] = from_raw_parts_mut(q, remainder);
            (left, right)
        }
    }

When a fn is declared unsafe like this, calling it becomes unsafe as well. This means that whoever writes the calling code must read the function's documentation and make sure that all its conditions are met.
In this particular case, the caller must ensure that mid <= self.len().
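From the caller's side, this obligation might look like the sketch below. Both function names (split_unchecked and halves) are made up for the example. The caller can prove the precondition by construction, so no run-time check is needed, and the reasoning is recorded in a SAFETY comment next to the unsafe block.

```rust
use std::slice::from_raw_parts_mut;

/// Splits without checking `mid` (hypothetical free-function sketch).
///
/// # Safety
/// The caller must guarantee that `mid <= slice.len()`.
unsafe fn split_unchecked<T>(slice: &mut [T], mid: usize) -> (&mut [T], &mut [T]) {
    let len = slice.len();
    let p = slice.as_mut_ptr();
    // SAFETY: the caller promises `mid <= len`, so both parts stay
    // inside the original slice and do not overlap.
    unsafe {
        (
            from_raw_parts_mut(p, mid),
            from_raw_parts_mut(p.add(mid), len - mid),
        )
    }
}

/// Splits a slice into a front half and a back half.
fn halves<T>(slice: &mut [T]) -> (&mut [T], &mut [T]) {
    let mid = slice.len() / 2;
    // SAFETY: `len / 2 <= len` always holds, so the contract of
    // `split_unchecked` is met without any run-time check.
    unsafe { split_unchecked(slice, mid) }
}

fn main() {
    let mut v = [1, 2, 3, 4, 5];
    let (front, back) = halves(&mut v);
    front[0] = 10;
    back[0] = 30;
    assert_eq!(v, [10, 2, 30, 4, 5]);
}
```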

In terms of abstraction boundaries, declaring a function unsafe means that it does not form a boundary with safe Rust, the area where the compiler itself detects errors through static analysis at compile time. Instead, the function becomes part of the unsafe abstraction of the code that calls it.

Using split_at_mut_unchecked, we can change the implementation of split_at_mut so that it performs the necessary checks itself and then calls split_at_mut_unchecked:

    impl [T] {
        pub fn split_at_mut(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            assert!(mid <= self.len());

            // By writing an `unsafe` block here, we assert that we have
            // met all the requirements of `split_at_mut_unchecked`,
            // namely that `mid` is within the bounds of the slice.
            unsafe {
                self.split_at_mut_unchecked(mid)
            }
        }

        // **NB:** requires that `mid <= self.len()`.
        pub unsafe fn split_at_mut_unchecked(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            ... // as shown above
        }
    }

Unsafe abstractions and privacy

Although nothing in the language explicitly links privacy rules with the boundaries of unsafe abstractions, the two are naturally related. This is because privacy lets you control which code can modify
the fields of your data, and that is the main building block for constructing unsafe abstractions.

Earlier, we noticed that the Vec type in the standard library is implemented using unsafe code. It would not be possible without privacy. If you look at the definition of Vec , you will see that it looks like this:

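    pub struct Vec<T> {
        pointer: *mut T,  // pointer to the start of the memory
        capacity: usize,  // number of elements the memory can hold
        length: usize,    // number of elements currently initialized
    }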

The Vec implementation carefully maintains the invariant that the pointer and the first length elements it refers to are always valid. It is easy to see that if length were a public (pub) field, this invariant could not be upheld: any external code could set the length of a Vec to an arbitrary value.
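The same mechanism can be shown on a much smaller type. The sketch below is not how Vec is written; the module stack and the type FixedStack are invented for illustration. The private len field is what allows the unchecked access in last to be sound, because only code inside the module can change it.

```rust
mod stack {
    /// A tiny fixed-capacity stack (hypothetical example type).
    /// Invariant: `len <= buf.len()`, and `buf[..len]` holds the
    /// values that have been pushed.
    pub struct FixedStack {
        buf: [i32; 4],
        len: usize, // private: code outside this module cannot set it
    }

    impl FixedStack {
        pub fn new() -> Self {
            FixedStack { buf: [0; 4], len: 0 }
        }

        /// Pushes a value; returns `false` if the stack is full.
        pub fn push(&mut self, value: i32) -> bool {
            if self.len == self.buf.len() {
                return false;
            }
            self.buf[self.len] = value;
            self.len += 1; // the invariant still holds
            true
        }

        /// Returns the most recently pushed value, if any.
        pub fn last(&self) -> Option<i32> {
            if self.len == 0 {
                return None;
            }
            // SAFETY: the invariant gives `len - 1 < buf.len()`.
            Some(unsafe { *self.buf.get_unchecked(self.len - 1) })
        }
    }
}

fn main() {
    let mut s = stack::FixedStack::new();
    assert_eq!(s.last(), None);
    assert!(s.push(7));
    assert_eq!(s.last(), Some(7));
    // s.len = 100; // would not compile: field `len` is private
}
```

If len were public, the commented-out assignment would compile, and the next call to last would read out of bounds.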

For this reason, the boundaries of unsafety tend to fall into one of two categories:

single functions, like split_at_mut
a type contained in its own module, like Vec
- such a type usually has private helper functions
- it may also have helper functions that are unsafe

Types with `unsafe` interfaces

As we saw earlier, it can sometimes be useful to create unsafe functions like split_at_mut_unchecked, which serve as building blocks for safe abstractions. The same is true for types. Looking at the actual Vec implementation in the standard library, you will see that it looks slightly different from the code above:

    pub struct Vec<T> {
        buf: RawVec<T>,
        len: usize,
    }

What is this RawVec type? It turns out to be an unsafe helper type that holds a pointer and a capacity:

    pub struct RawVec<T> {
        // `Unique` is yet another `unsafe` helper type; it denotes
        // a raw pointer that is uniquely owned.
        ptr: Unique<T>,
        cap: usize,
    }

What makes RawVec an unsafe helper type? Unlike with functions, the notion of an unsafe type is rather vague. I would define it as a type that does not let you do anything useful without using unsafe code. Safe code can construct a RawVec and can even resize the buffer underlying a Vec, but to access a value in this buffer you have to go through the ptr method, which returns *mut T. That is a raw pointer, so dereferencing it is an unsafe operation. This means that, to provide useful functionality, RawVec must be wrapped in another unsafe abstraction (such as Vec, which tracks initialization).
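A simplified stand-in can make this definition tangible. The type RawBuf below is hypothetical and much simpler than the real RawVec (it cannot grow, and it stores its memory in a boxed slice of MaybeUninit rather than a raw allocation). Everything on it is safe to call, yet nothing useful can be done with the buffer without unsafe.

```rust
use std::mem::MaybeUninit;

/// Hypothetical, simplified stand-in for `RawVec`: it owns storage
/// but does not know which slots are initialized.
pub struct RawBuf<T> {
    buf: Box<[MaybeUninit<T>]>,
}

impl<T> RawBuf<T> {
    /// Safe: allocating uninitialized storage cannot break anything.
    pub fn with_capacity(cap: usize) -> Self {
        let buf: Box<[MaybeUninit<T>]> =
            (0..cap).map(|_| MaybeUninit::uninit()).collect();
        RawBuf { buf }
    }

    pub fn cap(&self) -> usize {
        self.buf.len()
    }

    /// Safe to call, but the result is a raw pointer:
    /// using it requires `unsafe`.
    pub fn ptr(&mut self) -> *mut T {
        self.buf.as_mut_ptr() as *mut T
    }
}

/// A tiny safe function built on top: it tracks initialization
/// itself by writing every slot before reading it.
fn sum_of_squares(n: usize) -> u64 {
    let mut raw = RawBuf::<u64>::with_capacity(n);
    let p = raw.ptr();
    let mut total = 0;
    for i in 0..n {
        // SAFETY: `i < raw.cap()`, and the slot is written before
        // it is read.
        unsafe {
            p.add(i).write((i * i) as u64);
            total += p.add(i).read();
        }
    }
    total
}

fn main() {
    assert_eq!(RawBuf::<u8>::with_capacity(3).cap(), 3);
    assert_eq!(sum_of_squares(4), 14); // 0 + 1 + 4 + 9
}
```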

Conclusion

Unsafe abstractions are quite powerful tools. They are what allow safe types such as Vec and Rc to be built.


Many thanks to everyone from the Rustycrate community who participated in the translation, proofreading and editing of this article, namely: born2lose, ozkriff, vitvakatu.

UPD : 3 unsafe .

Source: https://habr.com/ru/post/346336/
