
Few articles on Habré are devoted to a new movement in the IT industry: NoSQL. I decided to change that and wrote this article, part translation and part review, about one of the talks from the NoSQL conference held on October 5 in New York. It covers Riak, a system I have had the good fortune to work with lately.
What is Riak? Many of today's buzzwords can be applied to it. Riak is a document-oriented database. Riak is a decentralized key-value data store supporting the standard operations: get, put, and delete. Riak is a distributed, scalable, fault-tolerant storage solution. Riak is also an open source system, accessible over HTTP using JSON and REST. And of course, Riak is NoSQL.
If you dig deeper, you can see that Riak was strongly influenced by Amazon's Dynamo paper, by Eric Brewer's CAP theorem (Consistency, Availability, and Partition tolerance), by the architecture of the Web itself, and by the Basho development team's experience building networked systems. Basho began developing Riak in the fall of 2007 for use in two of its own applications, which have run on Riak for most of the time since.
To understand why Riak is so powerful, a little theory is needed. First, let's talk about Amazon Dynamo. The paper describing Dynamo uses three values to characterize the behavior of a distributed storage system: N, R, and W. N is the number of replicas of each value in the store. R is the number of replicas that must respond for a read operation to succeed. W is the number of replicas that must acknowledge a write operation. Riak's goal is to expose N, R, and W to the application logic. This allows Riak to be tuned to the requirements of individual applications.

Riak can use a different N value for each bucket. For example, all objects in the artists bucket will share one N value, while objects in the albums bucket may have a completely different one. The system uses a consistent hashing algorithm to decide where to store the N replicas of your data. When a request arrives, Riak hashes the text key into a 160-bit number. When a node joins the Riak cluster, it claims portions of that 160-bit key space. The node owning the portion of the key space closest to the key's hash stores the first replica; the remaining N-1 replicas are stored on the nodes owning the next N-1 portions of the 160-bit key space. Consistent hashing is very important: it allows any Riak node to handle any request. Since every node can compute which other nodes must be contacted to fulfill a request, any node can act as the coordinator for any client. There is no master server, and no single point of failure.
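As an illustration of the idea (this is not Riak's actual ring code, just a sketch of the principle), a key can be mapped onto the 160-bit ring with SHA-1, which produces exactly 160 bits:

    %% Illustrative sketch only: hash a {Bucket, Key} pair to a position
    %% on the 160-bit ring. Riak's real implementation differs in detail.
    Hash = crypto:hash(sha, term_to_binary({<<"artists">>, <<"REM">>})),
    <<RingPosition:160/integer>> = Hash.
    %% RingPosition falls in [0, 2^160); the node owning that portion of
    %% the key space stores the first replica, and the next N-1 portions
    %% hold the rest.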

Riak can use a different R value for each request: every time you read data, you may specify a different R. The R value determines how many nodes must return a successful response before Riak replies to the requesting client. Riak still tries to read all N replicas, but as soon as R of them have answered, the data is sent back to the client.

Likewise, Riak can use a different W value for each request. The W value determines how many nodes must return a successful response before Riak replies to the requesting client. Riak still tries to write all N replicas, but as soon as W of them have acknowledged, the result is sent back to the client.
Letting the client specify R and W at request time means the application can state, per request, exactly how many node failures it can tolerate. It is very simple: for any given request, N-R nodes (for reads) or N-W nodes (for writes) can be unavailable, and the cluster and data will still be fully accessible.

So, in the example we used, with N = 3 and R = W = 2, we can have 3 - 2 = 1 node unavailable in the cluster and it will still serve data. For critical data we can raise N to 10, and if we keep R or W equal to 2, up to 10 - 2 = 8 nodes can be unavailable while reads and writes still succeed. Riak exposes the N/R/W values because this is a good way to tune an application's behavior within the trade-offs described by the CAP theorem.
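The arithmetic above is simple enough to write down. A trivial Erlang helper (my own illustration, not part of any Riak API) makes the rule explicit:

    %% How many node failures a request can tolerate, given N replicas
    %% and a quorum Q (i.e. R for reads, W for writes).
    tolerable_failures(N, Q) -> N - Q.
    %% tolerable_failures(3, 2)  =:= 1  (the example above)
    %% tolerable_failures(10, 2) =:= 8  (the "critical data" example)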
If you are familiar with Eric Brewer's CAP theorem, you know there are three properties to consider when talking about data storage: consistency, availability, and partition tolerance. If you are familiar with the research, you also know that it is impossible to build a system that fully provides all three. Since you cannot have all three, most storage systems pick two. Riak not only lets you choose which to favor, it lets different applications choose a different amount of each. Most likely you will choose availability and partition tolerance: you build applications that run on real machines, and you want them to be available to users at all times. Riak's design supports exactly that (we want our applications to run all the time), which means you must be prepared to sacrifice some consistency. The Amazon Dynamo paper offers plenty of advice on how consistency can still be ensured (such as read-your-writes semantics), and I recommend rereading that document.
Here is a simple example of how a consistency problem can arise and be resolved. Consider a running cluster that holds a document at version 0.

Suddenly, the network fails. Nodes 0 and 2 are in New York, nodes 1 and 3 are in Los Angeles, and the transcontinental link goes down. How will each half of the cluster behave? If the N/R/W values are set appropriately, both halves will happily keep serving version 0 of the document, as before; clients will not even notice the failure. Now suppose a client updates the document through the New York half of the cluster (did you set N/R/W so that this would be allowed?). That client has introduced an inconsistency: clients connecting to the New York half now receive version 1 of the document, while clients connecting to the Los Angeles half still receive version 0. Now suppose the transcontinental link is restored and the two halves rejoin. What should Riak do with the two different versions of the document?
In this case, Riak uses vector clocks to determine which version of the document is more current. Vector clocks are a particular implementation of Lamport clocks (Lamport timestamps). Unlike ordinary timestamps, Lamport's scheme is designed so that ancestry and succession can be determined by simple comparison. Every time data is saved in Riak, its vector clock is incremented, so when the cluster heals, Riak can determine which data to keep. In our case, Riak will determine that version 1 is a descendant of version 0, replace version 0 with version 1, and the data will be consistent again.

Things get more interesting if, while the halves are disconnected, clients make changes on both sides of the cluster. When the cluster heals, the vector clocks will show that neither version is a descendant of the other. Riak cannot decide which version should win, so here, just as with the tunable N/R/W values, Riak hands the conflict to the application: instead of applying some arbitrary selection rule as other systems do, Riak returns both values, giving you the option of choosing the right one. Of course, if you prefer the simple rule that the last write wins, Riak has a flag to enable that behavior (the allow_mult bucket property).
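To make the "descendant" test concrete, here is a minimal self-contained vector-clock sketch in Erlang. It is my own illustration of the idea, not Riak's internal module; a clock is simply a list of {Node, Counter} pairs:

    -module(vclock_demo).
    -export([demo/0]).

    %% Record one more event observed at Node.
    increment(Node, Clock) ->
        Count = proplists:get_value(Node, Clock, 0),
        lists:keystore(Node, 1, Clock, {Node, Count + 1}).

    %% A descends B if A has seen every event B has seen.
    descends(A, B) ->
        lists:all(fun({Node, CountB}) ->
                      proplists:get_value(Node, A, 0) >= CountB
                  end, B).

    demo() ->
        V0 = increment(ny, []),        %% version 0, written in New York
        V1 = increment(ny, V0),        %% version 1 descends version 0
        true  = descends(V1, V0),
        false = descends(V0, V1),
        %% Concurrent edits on both sides of the partition:
        %% neither clock descends the other.
        VNy = increment(ny, V0),
        VLa = increment(la, V0),
        {false, false} = {descends(VNy, VLa), descends(VLa, VNy)},
        ok.

Running vclock_demo:demo() walks through exactly the scenario above: a linear update descends its ancestor, while two concurrent updates descend each other in neither direction, which is when Riak must return siblings.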

After all this theory, how about a few code examples to demonstrate how easy it is to work with Riak?
Since Riak is written in Erlang, let's start with Erlang.

The first line of code shows how a client connects to the Riak cluster. The second line creates a new object (a document, a key/value pair). The third line saves the object to Riak. The fourth fetches the object back from Riak. The last two lines change the value of our object and save it to the Riak cluster again.
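The original slide is not reproduced here, but a minimal sketch of those six steps, assuming the Erlang client API Riak shipped at the time (riak:client_connect/1, riak_object:new/3 and friends; treat the exact names as assumptions), might look like this:

    %% A sketch of the six steps described above (API names assumed,
    %% not verified against a specific Riak release).
    {ok, Client} = riak:client_connect('riak@127.0.0.1').          %% 1: join the cluster
    Obj0 = riak_object:new(<<"groceries">>, <<"mine">>, ["eggs"]). %% 2: new key/value object
    Client:put(Obj0, 2).                                           %% 3: save it (W = 2)
    {ok, Obj1} = Client:get(<<"groceries">>, <<"mine">>, 2).       %% 4: read it back (R = 2)
    Obj2 = riak_object:update_value(Obj1, ["eggs", "bacon"]).      %% 5: change the value
    Client:put(Obj2, 2).                                           %% 6: save it again

Note how the R and W values from the theory section show up directly as arguments to get and put: each request states its own quorum.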
If you do not want to use Erlang, Riak also comes with libraries for Python ...

... Riak also has libraries for Ruby ...

... Java ...

... PHP ...

... JavaScript ...

... but in fact all of these libraries talk to Riak over standard RESTful HTTP, which lets you use Riak from any system that supports HTTP, for example from command-line tools such as curl or wget.
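For example, a session with curl against a local node might look like the following (the /riak/<bucket>/<key> URL layout and the default port 8098 are assumptions on my part; check your node's configuration):

    # Store a JSON value under bucket "groceries", key "mine" (path assumed).
    curl -X PUT http://127.0.0.1:8098/riak/groceries/mine \
         -H "Content-Type: application/json" \
         -d '["eggs", "bacon"]'

    # Read it back.
    curl http://127.0.0.1:8098/riak/groceries/mine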

This is all well and good when you need to send individual values to Riak or fetch them back, but what do you do when you want to query several objects at once? This is NoSQL, right? How about a little Map/Reduce?

Riak's Map/Reduce has a lot in common with other Map/Reduce systems. Map functions run on the nodes where the data lives, which improves data locality while spreading the computation across the cluster. Where Riak's Map/Reduce differs most from other implementations is that Riak does not run the map function over all the data in a bucket. Instead, Riak lets the client supply the list of object keys the map function should run on. Map phases can emit additional keys for later phases, but the initial list of keys must always be given. You can, of course, specify as many keys as you like; if you want to run a map function over every value in a bucket, just include them all in the Map/Reduce query (the list_keys or list_bucket functions can be useful here).
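Such a query can be posted to the cluster over HTTP as JSON. A hedged sketch (the /mapred endpoint and the exact JSON shape are my recollection of the HTTP interface of that era, so treat them as assumptions; the bucket and key names are purely illustrative):

    # Run a JavaScript map function over two explicitly named keys.
    curl -X POST http://127.0.0.1:8098/mapred \
         -H "Content-Type: application/json" \
         -d '{
           "inputs": [["albums", "document"], ["albums", "monster"]],
           "query": [
             {"map": {"language": "javascript",
                      "source": "function(v) { return [v.key]; }"}}
           ]
         }'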
The ways Riak's Map/Reduce implementation differs from other systems stem largely from the strong desire to support links in Riak. The first question people ask when moving from the RDBMS world is: "How do I express relationships between my data?" The answer the developers settled on is links.

For example, if you want to relate an artist record to several album records, you add links to those albums in the artist record. Similarly, you can add links in an album record to the records of the tracks it contains. Once you have added links and told Riak how to extract them from your objects, a new Map/Reduce syntax becomes available. The example below shows a simple query that says: "Start with the artist REM, then follow every album the artist is linked to, then follow every track each album is linked to, then extract the names of those tracks." The result is a list of the names of every REM track ever released.
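Reconstructed as a JSON Map/Reduce query (again a sketch: the link-phase syntax shown here is how I remember the HTTP interface, and the bucket names and the "name" field are illustrative assumptions):

    # Walk artist -> albums -> tracks, then map out the track names.
    curl -X POST http://127.0.0.1:8098/mapred \
         -H "Content-Type: application/json" \
         -d '{
           "inputs": [["artists", "REM"]],
           "query": [
             {"link": {"bucket": "albums"}},
             {"link": {"bucket": "tracks"}},
             {"map": {"language": "javascript",
                      "source": "function(v) { return [JSON.parse(v.values[0].data).name]; }"}}
           ]
         }'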
The developers decided that link-walking was so useful that they even implemented a URL syntax for it. A GET request to a URL of the form shown below returns the same list of track objects that the Map/Reduce example above produced.
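A sketch of such a link-walking URL (the /riak/<bucket>/<key>/<bucket>,<tag>,<keep> segment format is an assumption about the HTTP interface; the underscores act as wildcards):

    # Follow links from the artist to albums, then to tracks (format assumed).
    curl http://127.0.0.1:8098/riak/artists/REM/albums,_,_/tracks,_,_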
There is a lot more to tell about Riak, including how to build pluggable storage backends, how to use the event system, how to monitor the state of a Riak cluster, and how to set up inter-cluster replication. But that will have to wait for another presentation.
If you want to learn more, visit http://riak.basho.com/ where you can read the documentation and download Riak for your own experiments. There is also a mailing list, riak-users@lists.basho.com. And of course, you can watch the video of this talk.
***
I hope you found plenty of interesting and useful information in this article. If this topic resonates with readers here, I solemnly declare that I will continue the series of articles on NoSQL. My respects go to the teams of the Translated.by and Typographer projects; thanks to them, producing this article was a pleasant pastime.