
Recently, various NoSQL databases have been gaining popularity. This article began as a study of the features of the graph database Neo4j. While gathering material, however, I wanted to systematize information about NoSQL solutions in general and graph databases in particular.
In the course of this small study, I selected for detailed consideration DBMS that are successfully used in the Web domain. And since there is “PHP” in the tags, I chose DBMS that can already be used with this language.
The article turned out to be long, so for ease of navigation I suggest using the table of contents:
- NoSQL Types
- Key-value stores
- Bigtable stores
- Graph stores
- Document Stores
- Some conclusions
NoSQL Types
All NoSQL DBMS are divided into several categories:
- Key-value stores
- Column Family (Bigtable) stores / Scalable distributed storage
- Graph Stores / Graph DBMS
- Document Stores / Document Oriented DBMS
The figure below shows schematically the data volumes handled and the complexity of that data for each of these NoSQL types.
Within each section I tried to arrange the DBMS in order of increasing functionality. This ordering may be somewhat subjective.
Some databases span several categories, for example OrientDB. According to the official description at the link above, it is both a graph and a document-oriented DBMS. Sometimes it is even classed among the key-value and Column Family stores. More details about it come later in the article.
Let's consider each category in turn:
Key-value stores
Key-value stores are the area in which NoSQL solutions most clearly show their advantage over SQL.
Many consider this direction the most promising in both the short and the long term.
Michael Widenius, the author of the original version of the open-source MySQL database, is one of them.
Key-value NoSQL stores are very popular and are developing quickly and well, apparently because there are so many of them and the competition is strong. Most of the NoSQL databases I studied while writing this article were key-value stores.
On Habr there is an article about key-value stores for PHP, with which I do not fully agree. The selection of stores it presents (Voldemort, Scalaris, MemcacheDB, ThruDB, CouchDB) seems less relevant now that almost five years have passed since it was published. Moreover, CouchDB, which is described there, is not a key-value store at all but a document-oriented DBMS (see the section about document-oriented DBMS).
MemcacheDB
Description : the same memcached, only with a BerkeleyDB backend.
Performance : the developers published test results showing an average single-thread performance of 18,868 w/s (write operations per second) and 44,444 r/s (read operations per second). The tests were run on a Dell 2950III server, which is a powerful machine even in its weakest configuration.
Installation : everything is built from source. In PHP, the usual Memcached extension from PECL is used.
License :
BSD-like License - free for commercial and non-commercial projects.
Redis
Description : On Habr there is an introductory article with a benchmark and links. There are transactions (more about them) and replication. Version 3.0 is on the way; it will introduce Redis Cluster and is expected to be significantly faster. There is a nice interactive tutorial.
Performance : ~110,000 w/s, ~81,000 r/s on mid-range hardware.
Installation : it is recommended to build both Redis and the PHP client from source. There are quite a few clients (list); personally, I would recommend phpredis for its good documentation and its support of all (or almost all) existing Redis functionality.
License :
BSD license - everything is free, but the developers accept no liability if something breaks.
Tarantool
Description : An in-memory store. It is positioned as an alternative to Redis and, according to its developers, is faster because all data is kept in memory. There is a built-in queue mechanism. There are good Habr articles describing its main features.
Installation : on Ubuntu it is installed using apt-get and a little magic (the official page); the client for PHP is built from source (github).
Performance : on par with Redis, although the test results are contradictory:
- Tarantool is faster than Redis according to its developer,
- Tarantool is on par with Redis according to an ordinary user.
License :
Simplified BSD - all for free.
Riak
Description : A database with a strong focus on fault tolerance and distribution. The focus is so strong that the company behind it recommends dedicating at least five servers to Riak in order to evaluate its capabilities. At first glance this is a key-value store, but it also offers search across all fields, secondary keys, and MapReduce. There are no transactions. There is a detailed and thorough Habr article.
Installation : there are many ways, including installation from packages for Debian/Ubuntu. For PHP there is a PECL package as well as the official PHP client.
Performance : it is not the top priority, but there are references to 2,500 operations per second.
License :
Apache 2 License - the open-source version is free to use; the commercial Riak Enterprise starts at $2,800/year per copy.
Aerospike
Description : Scalable storage for huge amounts of data with minimal latency. Transactions are on by default, and ACID support has a dedicated page. Secondary indexes appeared in version 3. The number of proprietary scaling, replication, and clustering technologies is impressive (link). I remember this system as a powerful, industrial-grade Memcached.
Installation : Aerospike is installed from the distribution package; the official client for PHP exists only for Aerospike 2 and is built from source.
Performance : the declared speed is 180,000 to 400,000 operations per second with latency measured in microseconds (source).
License :
- Community Edition - a free version with limitations: a maximum of two servers with 200 GB of data on each;
- Enterprise Edition - a 30-day trial, no restrictions. According to rumors, the cost starts at $50,000 per data center.
FoundationDB
Description : It is positioned as a comprehensive solution that is as simple as possible to install and configure. Easy scalability and easy management are the key phrases that catch the eye. Users are promised "uncompromised ACID transactions." Different data models can be used: key/value, document, and even SQL. This DBMS became especially interesting to me when I read about its performance.
Performance :
3,750,000 r/s *.
* Reading random records from RAM (cache). There are many interesting tests in the performance section of the official website, the “slowest” of which shows a result of ~235,000 operations per second (50/50 read and write operations). Read latency is under 2 ms and commit latency is under 15 ms. The results were obtained on a cluster of 24 machines, each with 16 GB RAM and 2x200 GB SSD; the test database consisted of 2 million key-value records, and all operations were transactional with the maximum isolation level and triple replication.
Installation : everything is simple here: a DEB package for Ubuntu, a PEAR package for PHP.
License :
- Community License - free to use. There are no restrictions on development and testing, but a maximum of 6 running processes in production, i.e. one process on each of six servers, two on each of three, etc.;
- Enterprise License - no limits, from $99 to $199 depending on the level of support.
Some interesting projects were not included in this list because they lack PHP support. The projects Voldemort, Scalaris, and ThruDB were also left out because of poor performance or poor documentation, and because nothing has changed for the better since 2009.
Column Family (Bigtable) stores / Scalable distributed storage
The stores presented in this section are mostly based on the design of the original Google Bigtable.
The main feature of these NoSQL systems is working with data volumes measured in terabytes.
Instant access speed matters less here; the emphasis is on distribution, fault tolerance, and the ability to process huge amounts of information.
HBase
Description : An open-source Apache project based on the original Google Bigtable design. It is developed as part of the Hadoop project. Facebook itself uses it as the basis of its messaging service. In HBase, selection is done on a single indexed field. ACID is partially supported: transactions seem to be there, but not in the most obvious form.
Installation : set up with the help of Thrift; the installation and usage process is well described in this Habr article.
Performance :
field tests with an unusual method of measuring performance: on a cluster of 7 servers (16 GB RAM, 8-core CPU, HDD), operations were performed on a table with 3 billion records. 300 read/write processes were launched simultaneously, and the time spent on each operation was measured. As a result, the average write time was 10 ms and the average read time was 18 ms.
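These latencies can be turned into a rough throughput figure for comparison with the other systems. The sketch below assumes that each of the 300 processes issues its operations back to back (a closed loop), so throughput is approximately the number of processes divided by the mean latency. The 50/50 read/write mix and the function name are my assumptions for illustration; the test itself does not report them.

```python
def estimated_throughput(clients, write_ms, read_ms, write_share=0.5):
    """Estimate operations per second for a closed loop of `clients` workers.

    Assumes every worker starts its next operation as soon as the previous
    one finishes, so throughput = clients / mean latency (Little's law).
    """
    mean_latency_s = (write_share * write_ms + (1 - write_share) * read_ms) / 1000
    return clients / mean_latency_s


# Figures from the test: 300 processes, 10 ms writes, 18 ms reads.
# The 50/50 mix is an assumption made only for this illustration.
ops_per_second = estimated_throughput(300, write_ms=10, read_ms=18)
```

Under these assumptions the cluster would be handling on the order of twenty thousand operations per second; a different read/write mix shifts the estimate accordingly.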
License :
Apache License 2.0 - use for any purpose for free.
Hypertable
Description : An interesting development, similar to HBase. It is slightly faster and has a much more familiar query syntax, HQL. Query example:
select * from QueryLogByUserID where row =^ '003269359' AND "2008-11-13 05:00:00" <= TIMESTAMP < "2008-11-13 06:00:00"
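To make the meaning of this query more concrete, here is a minimal Python sketch of the same selection over an in-memory list. It assumes that `=^` means "row key starts with" and that the TIMESTAMP bounds form a half-open interval, as written in the query. The sample cells and the `select_cells` helper are invented for the example and are not part of Hypertable.

```python
from datetime import datetime

# Hypothetical sample data: (row_key, timestamp, value) tuples.
CELLS = [
    ("003269359 2008-11-13 05:10:00", datetime(2008, 11, 13, 5, 10), "query a"),
    ("003269359 2008-11-13 06:30:00", datetime(2008, 11, 13, 6, 30), "query b"),
    ("009999999 2008-11-13 05:20:00", datetime(2008, 11, 13, 5, 20), "query c"),
]


def select_cells(cells, row_prefix, start, end):
    """Return cells whose row key starts with row_prefix and start <= ts < end."""
    return [
        cell for cell in cells
        if cell[0].startswith(row_prefix) and start <= cell[1] < end
    ]


result = select_cells(
    CELLS,
    "003269359",
    datetime(2008, 11, 13, 5, 0),
    datetime(2008, 11, 13, 6, 0),
)
```

Only the first sample cell satisfies both the prefix condition and the one-hour time window.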
There are no transactions, which is clearly stated in the first lines of the documentation on the official website.
Installation : it connects to PHP using Thrift and the official ThriftClient (github).
Performance : there are several graphs on the official site. As mentioned above, performance is similar to HBase.
License :
GNU General Public License Version 3 - use for any purpose for free. 24/7 support is available at an additional cost.
Cassandra
Description : Distributed storage, originally developed at Facebook and later handed over to Apache. Unlike the systems above, Cassandra is a decentralized distributed hash table (DHT) based on Amazon's Dynamo. It has a query language, CQL, which is very similar to SQL with some limitations. You can build queries that select several columns and add secondary indexes. Version 2.0 introduces "transactions" that work on the compare-and-swap principle.
The syntax of such a transactional request looks roughly like this:
- Adding a record
INSERT INTO users (login, email, name, login_count) values ('jbellis', 'jbellis@datastax.com', 'Jonathan Ellis', 1) IF NOT EXISTS
- Updating a record
UPDATE users SET reset_token = null, password = 'newpassword' WHERE login = 'jbellis' IF reset_token = 'some-generated-reset-token'
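The compare-and-swap idea behind both statements can be sketched in plain Python. This is only an in-memory illustration of the semantics: the `UserTable` class and its method names are hypothetical, and a real Cassandra cluster additionally has to coordinate the check across replicas.

```python
class UserTable:
    """Toy in-memory table keyed by login, illustrating compare-and-swap."""

    def __init__(self):
        self._rows = {}

    def insert_if_not_exists(self, login, row):
        """Mimics INSERT ... IF NOT EXISTS: write only when the key is absent."""
        if login in self._rows:
            return False
        self._rows[login] = dict(row)
        return True

    def update_if(self, login, expected, changes):
        """Mimics UPDATE ... IF col = value: apply changes only when every
        expected column currently holds the expected value."""
        row = self._rows.get(login)
        if row is None:
            return False
        if any(row.get(col) != val for col, val in expected.items()):
            return False
        row.update(changes)
        return True

    def get(self, login):
        return self._rows.get(login)


users = UserTable()
first = users.insert_if_not_exists(
    "jbellis",
    {"name": "Jonathan Ellis", "reset_token": "some-generated-reset-token"},
)
second = users.insert_if_not_exists("jbellis", {"name": "Someone Else"})
reset_ok = users.update_if(
    "jbellis",
    expected={"reset_token": "some-generated-reset-token"},
    changes={"reset_token": None, "password": "newpassword"},
)
# A second attempt with the same token fails because the token was cleared.
reset_again = users.update_if(
    "jbellis",
    expected={"reset_token": "some-generated-reset-token"},
    changes={"password": "other"},
)
```

The return value plays the role of the "applied" flag: the caller learns whether the condition held and can retry or give up, which is what makes the password-reset token usable only once.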
Installation : There are several ways to connect PHP and Cassandra (the same Thrift, Cassandra-PHP-Client-Library, cassandra-pdo). The last option seemed the most pleasant to me.
Performance : there are good comparative tests with graphs. According to them, on 8 servers with a 50/50 read/write ratio, Cassandra performs about 9,000 operations per second, while HBase manages about 2,500 under the same conditions.
License :
Apache License 2.0 - use for any purpose for free.
There are other Bigtable-style solutions, for example Stratosphere, HPCC, Cloudera, and Cloudata. They are not reviewed in detail for various reasons, such as lack of PHP support, low adoption, or poor documentation.
Graph Stores / Graph DBMS
This article was started because of them. I recently discovered graph NoSQL as a new way of structuring stored data and was very pleased, because in a number of projects I had had to implement basic graph DBMS functionality using rather complicated MySQL queries.
In a graph DBMS, the structure of the stored data may look like this:
If you add all the films to a graph DBMS and link each one to the actors who appear in it, you can easily query the connections between them.
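As a rough illustration of the films-and-actors idea, here is a small Python sketch that stores the links as edges and answers a "who appeared alongside this actor" question with a two-hop traversal. The film and actor names are placeholders, and the `co_stars` function is my own illustration; a graph DBMS would express the same traversal in its own query language.

```python
from collections import defaultdict

# Hypothetical data: each edge links a film to an actor who appears in it.
ACTED_IN = [
    ("Film A", "Actor 1"),
    ("Film A", "Actor 2"),
    ("Film B", "Actor 2"),
    ("Film B", "Actor 3"),
]

# Index the edges in both directions so either end can be a starting point.
film_to_actors = defaultdict(set)
actor_to_films = defaultdict(set)
for film, actor in ACTED_IN:
    film_to_actors[film].add(actor)
    actor_to_films[actor].add(film)


def co_stars(actor):
    """Actors who share at least one film with `actor` (a two-hop traversal)."""
    found = set()
    for film in actor_to_films.get(actor, ()):
        found |= film_to_actors[film]
    found.discard(actor)
    return found
```

In a relational schema the same question needs a self-join through a link table, which is the kind of query that becomes unwieldy as the number of hops grows.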
Neo4j
Description : the most successful and sought-after development in the field of graph DBMS. It fully supports ACID. It is easy to install and scales effortlessly. It already has a well-developed community, so you can quickly find answers to most questions that come up. You can read about using it with PHP in this article.
Installation : installed from its own repository; the Neo4jPHP client is used for PHP.
Performance : given the specific nature of this DBMS, it seemed odd to me to quote particular read/write speeds. It lets you run complex selections and does so many times faster than relational DBMS.
License :
- Community Edition - GPL-licensed open source, free use
- Commercial Subscription - adds a high-performance cache, enhanced horizontal scaling capabilities, support, and some other extras. The cost varies from $0 (if you are a startup of three people with an annual project turnover of less than $100,000) to infinity (for very large companies)
In this section I described only one DBMS; its most interesting competitor, OrientDB, is covered below. As it turned out, there are not that many graph databases for the Web, and for PHP in particular.
There is also Titan, which uses HBase, BerkeleyDB, or Cassandra as the back end. There is not much information about it, and even fewer ways to use it with PHP.
It is also worth mentioning FlockDB from Twitter, which can be connected to PHP using a Thrift client. But again, because so little information about this DBMS is available, it is difficult to form a complete and objective opinion about it.
Document Stores / Document Storage
In this section, we consider document-oriented stores - DBMS for hierarchical data structures. These stores are versatile: they offer high read/write speeds, take a flexible approach to the format of stored data, work easily with unstructured data, and provide ample opportunities for scaling.
MongoDB
Description : Perhaps the most popular document-oriented NoSQL DBMS. Data is stored in JSON/BSON format. Good scaling, replication, indexes, MapReduce. Transactions take the form of compare-and-swap.
Installation : MongoDB from the repository, the PHP client from PECL.
Performance : the comparative tests linked a little earlier also include results for MongoDB.
License :
GNU AGPL - open source, free use.
CouchDB
Description : An Apache project. In many ways similar to MongoDB. It differs in having no locking during read operations and in its more complicated sharding technology.
Installation : CouchDB from the repository; for the PHP client there are several options (PHPillow, PHP Object Freezer, PHP-on-Couch, an extension from PECL).
Performance : according to the results of one test, it is noticeably slower than MongoDB.
License :
Apache 2.0 - use for free.
There are many more products in this area, but they seemed rather similar to one another. Although perhaps I simply did not study them deeply enough.
OrientDB
Description : a document-oriented and, at the same time, graph DBMS.
Its closest competitor as a document-oriented DBMS is MongoDB. A separate page is devoted to this comparison.
The main advantages of OrientDB:
- full ACID support
- the ability to use foreign keys in documents (as well as in relational DBMS)
- three types of indexes (SB-Tree, Hash, MVRB-Tree) vs. only B-Tree in MongoDB
- high performance (OrientDB performs 150,000 w/s on ordinary hardware)
- simple query language similar to SQL
The query language deserves a separate mention. Compare what identical update queries look like:
- MongoDB
db.product.update( { "stock.qty": { $gt: 2 } }, { $set: { price: 9.99 } } )
- OrientDB
UPDATE product SET price = 9.99 WHERE stock.qty > 2
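Both statements describe the same intent: set the price to 9.99 for products whose `stock.qty` is greater than 2. A minimal Python sketch of that operation over in-memory documents is shown below; the sample documents and the helper name are hypothetical and only illustrate the nested-field condition.

```python
# Hypothetical documents in a "product" collection.
products = [
    {"name": "pen", "price": 12.50, "stock": {"qty": 5}},
    {"name": "ink", "price": 7.00, "stock": {"qty": 1}},
]


def update_price_where_stock_gt(docs, min_qty, new_price):
    """Set price on every document whose stock.qty is greater than min_qty.

    Returns the number of documents changed.
    """
    changed = 0
    for doc in docs:
        if doc.get("stock", {}).get("qty", 0) > min_qty:
            doc["price"] = new_price
            changed += 1
    return changed


updated = update_price_where_stock_gt(products, 2, 9.99)
```

The sketch changes every matching document, as the SQL-like statement does. Note that in the MongoDB shell, `update` modifies only the first matching document unless `multi: true` is passed, so the two queries are fully equivalent only with that option.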
Its main competitor as a graph DBMS is Neo4j. I must say that mastering the graph capabilities of OrientDB is much harder than in Neo4j. A first impression of them can be gained from this article.
Installation : installation takes some work; here is a fully working manual, and this library is recommended as a PHP client.
Performance : 150,000 w/s is promised; there is also a comparison of graph DBMS.
License :
- Community Edition - Apache 2 license, open source, free to use for any purpose, including commercial
- Enterprise Edition - extended support and extras such as Query Profiler, Metrics recording, and Live Monitor with configurable alerts, for £1,000 for the first server and £500 for each subsequent server. Half price for startups.
Some conclusions
While writing this article I found a lot of interesting and useful information, and I am glad to share it with the Habr community.
I really liked solutions such as FoundationDB, Neo4j, and OrientDB. I would like to devote a separate article to each of them.
In conclusion, I would like to share a fun picture that helps you quickly choose a NoSQL solution for your project. I saw the picture in 4dmonster's comments, for which many thanks to him.