Unprecedented volumes of data are forcing developers and businesses to look at alternatives to the relational databases that have been in use for more than thirty years. Collectively, these technologies are known as “NoSQL databases”.
The main problem is that relational databases cannot cope with the loads typical of today's high-load projects. There are three specific problem areas:
- scaling horizontally across large amounts of data, as at Digg (3 terabytes for the green icons shown when a friend has dugg an article), Facebook (50 terabytes for inbox search), or eBay (2 petabytes overall)
- the performance of each individual server
- inflexible logical schema design.
Many companies need to find new ways to store and scale huge amounts of data. I recently published a translation of an article about Riak, a non-relational data store. In this article we will look at the main non-relational databases and systems that make up the NoSQL movement.
The term NoSQL was coined by Eric Evans of Rackspace when Johan Oskarsson of Last.fm wanted to organize an event to discuss open source distributed databases.
Some people disapprove of the term NoSQL because it sounds as if it defines the movement by what we do not want to do rather than by who we are. The NoSQL movement is not a movement against relational databases: NoSQL stands for “Not Only SQL”, not “No SQL at all”.
The term NoSQL covers a large number of products with completely different designs, and in a discussion the participants may sometimes be talking about quite different systems. So I propose three axes for comparing these systems: scalability, data and query model, and storage system.
I selected 10 NoSQL databases as examples. This is not a complete list, but it is enough for an evaluation.
Scalability
Some people take scalability to include replication, so to be precise: by scalability in this context we mean the automatic distribution of data across multiple servers. We call such systems distributed databases. They include Cassandra, HBase, Riak, Scalaris and Voldemort. These are your only choice if you have a volume of data that cannot be handled on a single machine or if you do not want to manage the distribution manually.
There are two things to look for in a distributed database: support for multiple data centers, and the ability to add new machines to a running cluster transparently to your applications.
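One common way to spread keys across servers so that a machine can be added without reshuffling everything is a consistent-hash ring. The sketch below is only an illustration of that idea and is not taken from any of the systems named above; the class name HashRing, the server names, the number of virtual nodes, and the use of MD5 are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, servers=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> server name
        for server in servers:
            self.add_server(server)

    def add_server(self, server: str) -> None:
        # Each server is placed on the ring at several pseudo-random points.
        for i in range(self.vnodes):
            point = _hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def server_for(self, key: str) -> str:
        # The first ring position clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{n}" for n in range(1000)]
before = {k: ring.server_for(k) for k in keys}

ring.add_server("node-d")  # grow the cluster; clients keep calling server_for()
after = {k: ring.server_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node-d")
```

The application only ever calls server_for(); when a node joins, the keys that change placement are the ones the new node takes over, and the rest stay where they were.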
Non-distributed databases include CouchDB, MongoDB, Neo4j, Redis and Tokyo Cabinet. These systems can serve as the storage layer for distributed systems: MongoDB provides limited sharding support, the separate Lounge project does the same for CouchDB, and Tokyo Cabinet can be used as a storage engine for Voldemort.
Data and Query Model
There is a huge variety of data models and query APIs in NoSQL databases.
The query APIs of the systems considered here range from Thrift and map/reduce to cursors, graph traversal, collections, nested hashes, and plain get/put.
The column family model is used in Cassandra and HBase; the idea came from the papers describing Google's Bigtable (Cassandra has drifted a little from the Bigtable ideas and introduced supercolumns). In both systems you have rows and columns as you are used to seeing them, but the rows are sparse: each row can have as many or as few columns as it needs, and the columns do not have to be defined in advance.
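The idea of sparse rows can be sketched with nested dictionaries. This is a toy model for illustration only, not the Cassandra or HBase API; the names users, insert, and get_columns, and the sample data, are hypothetical.

```python
from collections import defaultdict

# Hypothetical column family: row key -> {column name -> value}.
# No column is declared up front; each row holds only what was written to it.
users = defaultdict(dict)


def insert(row_key, column, value):
    """Write one column of one row."""
    users[row_key][column] = value


def get_columns(row_key):
    """Return the names of the columns that exist for this row."""
    return sorted(users.get(row_key, {}))


insert("alice", "email", "alice@example.com")
insert("alice", "city", "Oslo")
insert("bob", "email", "bob@example.com")
insert("bob", "last_login", "2009-11-09")  # a column no other row has

print(get_columns("alice"))  # ['city', 'email']
print(get_columns("bob"))    # ['email', 'last_login']
```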
The key/value model is the simplest and the easiest to implement, but it is inefficient if you only want to query or update part of a value. It is also difficult to build more complex structures on top of distributed key/value systems.
Document-oriented databases are essentially the next level up from key/value systems, allowing you to associate nested data with each key. Supporting queries on that nested data is more efficient than simply returning the entire BLOB each time.
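The difference between the two models can be sketched as follows. Both classes are hypothetical in-memory stand-ins written for this example, not the API of any product mentioned here; the dotted-path convention in set_field and get_field is likewise an assumption.

```python
import json


class KeyValueStore:
    """Hypothetical key/value store: values are opaque byte strings."""

    def __init__(self):
        self._data = {}

    def put(self, key, blob: bytes):
        self._data[key] = blob

    def get(self, key) -> bytes:
        return self._data[key]


class DocumentStore:
    """Hypothetical document store: values are nested dicts the store understands."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc: dict):
        self._docs[key] = doc

    def get_field(self, key, path):
        node = self._docs[key]
        for part in path.split("."):
            node = node[part]
        return node

    def set_field(self, key, path, value):
        *parents, leaf = path.split(".")
        node = self._docs[key]
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value


profile = {"name": "Alice", "address": {"city": "Oslo", "zip": "0150"}}

# Key/value: changing one field means fetching, decoding,
# modifying, and rewriting the whole value.
kv = KeyValueStore()
kv.put("user:1", json.dumps(profile).encode("utf-8"))
whole = json.loads(kv.get("user:1"))
whole["address"]["city"] = "Bergen"
kv.put("user:1", json.dumps(whole).encode("utf-8"))

# Document store: the store can address the nested field directly.
docs = DocumentStore()
docs.put("user:1", {"name": "Alice", "address": {"city": "Oslo", "zip": "0150"}})
docs.set_field("user:1", "address.city", "Bergen")
print(docs.get_field("user:1", "address.city"))  # Bergen
```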
Neo4j has a truly unique data model, storing objects and relationships as the nodes and edges of a graph. For queries that fit this model (for example, hierarchical data), it can be a thousand times faster than the alternatives.
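To see why hierarchical queries fit a graph model, consider a small sketch in plain Python. It does not use the Neo4j API; the edges structure, the helper names, and the sample hierarchy are all made up for illustration. Finding everything below a node is a walk along edges rather than a series of joins.

```python
from collections import defaultdict, deque

# Hypothetical graph: parent node -> list of child nodes.
edges = defaultdict(list)


def add_edge(parent, child):
    edges[parent].append(child)


for parent, child in [
    ("ceo", "cto"),
    ("ceo", "cfo"),
    ("cto", "dev1"),
    ("cto", "dev2"),
    ("dev1", "intern"),
]:
    add_edge(parent, child)


def descendants(start):
    """Breadth-first traversal: follow edges outward from the start node."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(descendants("cto"))  # ['dev1', 'dev2', 'intern']
```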
Scalaris is unique in offering distributed transactions across multiple keys. A discussion of the trade-offs between consistency and availability is beyond the scope of this post, but it is another aspect to consider when evaluating distributed systems.
Storage System
By storage system, I mean how data is stored within the system.
The storage system can tell us a lot about which workloads a database can handle well.
Databases that keep their data in memory are very fast (Redis can perform up to 100,000 operations per second), but they cannot work with data sets larger than the available RAM. Durability (data surviving a server crash or power failure) can also be a problem: the amount of data that can be lost between flushes to disk is potentially large (support for an append-only log is planned for new versions). Scalaris, another in-memory system, addresses durability through replication, but it does not support scaling across multiple data centers, so a power failure can still cause data loss.
Memtables and SSTables: writes are first recorded in a commit log for durability and then buffered in memory in a memtable (the details are hard to cover briefly, but you can read more in the Cassandra wiki: http://wiki.apache.org/cassandra/ArchitectureOverview ). Once enough writes have accumulated, the memtable is sorted and written to disk as an SSTable. This gives performance close to that of an in-memory system while avoiding the problems of memory-only storage. (The procedure is described in more detail in sections 5.3 and 5.4 of the Bigtable paper, as well as in “The log-structured merge-tree”.)
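The write path described above can be sketched in a few lines. This is a deliberately simplified, single-process illustration and not Cassandra's implementation: the class TinyLSM, the memtable_limit parameter, and the use of Python lists in place of on-disk files are all assumptions made for the example.

```python
import bisect


class TinyLSM:
    """Hypothetical store: commit log + memtable + immutable sorted tables."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.commit_log = []  # stands in for an append-only file on disk
        self.memtable = {}    # recent writes, held in memory
        self.sstables = []    # sorted lists of (key, value), oldest first

    def put(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, for durability
        self.memtable[key] = value            # 2. then buffer in memory
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Sort once and write out in one pass; the table is never modified again.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []  # flushed writes no longer need to be replayed

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # the newest table wins
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


store = TinyLSM(memtable_limit=3)
for k, v in [("b", 1), ("a", 2), ("c", 3), ("a", 4)]:
    store.put(k, v)

print(store.sstables)  # [[('a', 2), ('b', 1), ('c', 3)]]
print(store.get("a"))  # 4, the newer value still in the memtable
print(store.get("b"))  # 1, read from the flushed table
```

Every write is an append to the log plus an in-memory update, and the only disk writes of table data are sequential, which is where the performance benefit comes from.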
B-trees have been used in databases for a very long time. They provide robust indexing support, but performance is poor on rotational hard disks (which are still the most cost-effective), because reading or writing data involves a large number of head seeks.
An interesting variant is CouchDB's append-only B-tree, in which changes are appended to the end of the file instead of being written in place; this gives good performance when writing data to disk.
Conclusion
The NoSQL movement took off in 2009, driven by the growing number of companies working with large amounts of data. More and more systems make it possible to store, process, and manage huge data sets transparently. I hope this short article has shown you some of the strengths of NoSQL systems, and that perhaps you will contribute to the development of this movement.
