Why am I writing this article? First, I want to help people understand the essence of NoSQL and why this type of storage should be chosen deliberately. Second, I will be happy to meet like-minded people and opponents, and perhaps to debate. And if you like this article, I will be glad to hear questions that could be covered in more detail in new articles :)
Despite the fact that there are now a great many NoSQL solutions, people are reluctant to switch to the new types of storage. Are they right? In my opinion, yes. And I will try to explain why, using the example of the different NoSQL stores I have met on my professional path.
Beginning of the story
Good day to all. This is my first “attempt at writing” for a large publication, and I hope it turns out interesting :) The term NoSQL itself is explained in this article, so let us turn to real-life examples and try to draw conclusions.
Consider the most popular DBMSs: MySQL, PostgreSQL, Oracle. They differ in many ways, but all three are well-established relational databases with rich capabilities. They let you build document workflow systems, banking applications and brochure sites for a small cafe. This is a general-purpose solution for almost any task.
What problems does a novice developer face when they meet their first SQL database?
- They need to learn SQL syntax
- They need to grasp the essence of the relational model
- They need to master the database client for their favorite development language
And that is all. After that, a person has mastered not just one database but a whole family of them, and can easily move, for example, from MySQL to Oracle (let's forget about PL/SQL and other important differences for a moment). And if you use an ORM... beauty.
This apparent simplicity can play a cruel joke on you, for example when you debug a five-line Oracle query, trying to make it more efficient. That is when you begin to understand that the only free cheese is in the mousetrap.
And yet, the convenience of selecting data with the huge range of tools the query language offers: is that not happiness?
Frankly, for more than two years I did not seriously touch MySQL or Oracle. Below I describe what distracted me and lured me away...
Alfresco
Although it requires an SQL database to work, I still consider Alfresco my first NoSQL database.
What does a person who sits down to develop on this wonderful platform for the first time need to learn? Well, actually, everything :)
It is completely different. Data structures are described in XML. Relationships are defined through so-called associations. For example, for a post, the list of its comments is an association. There is also model inheritance: one "table" can inherit from another.
There is an opinion that a NoSQL solution is necessarily a fast store. But Alfresco is a very slow one. Very, very slow. Among its shortcomings I would also name the query API. You have to access the repository in two ways: associations and objects by id are fetched through the Java API, while more complex queries that select by attributes and associations go through the Lucene query engine. The raw queries look scary, but I wrote a simple wrapper over the query engine that let you build queries like this:
Query.field(title).eq("").and(Query.field(text).like("**"));
and life became brighter and more fun. I wrote this query from memory, but my colleagues will recognize it (hello! :))
And still, it is a wonderful thing, because it is very convenient for writing workflow systems with large and complex business processes, in which documents travel around, “spending the night” with one user or another until they finally reach some result. For example, the resolution: done.
That was at the beginning of version 3; in 2011 version 4 came out. A lot of tasty things were added, and performance probably improved, but I had already been carried away by new stores...
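To illustrate the idea of such a wrapper, here is a minimal sketch of a fluent builder that renders Lucene-style query text. It is not the original code: the class name `Query` and the methods `field`, `eq`, `like` and `and` are taken from the one-liner above, while `or`, `toLucene`, the escaping rules and the field names in the usage note are assumptions made for the example.

```java
// Hypothetical reconstruction of a fluent query wrapper.
// Only the method names field/eq/like/and come from the article;
// everything else (rendering, escaping, toLucene) is assumed.
final class Query {
    private final String expr;

    private Query(String expr) {
        this.expr = expr;
    }

    // Entry point: pick the field the next condition applies to.
    static Field field(String name) {
        return new Field(name);
    }

    // Combine two conditions; parentheses keep precedence explicit.
    Query and(Query other) {
        return new Query("(" + expr + " AND " + other.expr + ")");
    }

    Query or(Query other) {
        return new Query("(" + expr + " OR " + other.expr + ")");
    }

    // Render the accumulated expression as Lucene-style query text.
    String toLucene() {
        return expr;
    }

    static final class Field {
        private final String name;

        private Field(String name) {
            this.name = name;
        }

        // Exact phrase match: field:"value"
        Query eq(String value) {
            return new Query(name + ":\"" + escape(value) + "\"");
        }

        // Wildcard match: the pattern is passed through unquoted.
        Query like(String pattern) {
            return new Query(name + ":" + pattern);
        }
    }

    // Escape backslashes and double quotes inside a quoted phrase.
    private static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

With this sketch, `Query.field("title").eq("report").and(Query.field("text").like("invoice*")).toLucene()` produces `(title:"report" AND text:invoice*)`. Each call returns a new immutable object, so partial queries can be reused safely.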
Cassandra
This is my love, to which I have stayed faithful to this day. My colleagues were not very enthusiastic about it, but I still think that was all down to the lack of RAM on the servers. Naturally, with 500 million rows of blobs on a server you need more than 8 GB of RAM... sometimes a node hung for good.
But... very fast writes and fast reads. Full control over the data, and confidence that the database will not become a bottleneck in write or read speed. I still use it in my own projects and it has not let me down yet. A distinctive feature of this database is that it is hard to kill. I am never afraid that the server will go down hard and I will have to do a restore, as happens, for example, with MongoDB with default settings.
Requests to the database are made through the Thrift API, which looks very scary. It lacks necessary conveniences such as a connection pool. You put in a set of bytes and you get back, in essence, a set of bytes. I solved this problem too, as with Alfresco, only on a larger scale: I had to write an ORM framework that became a layer on top of Thrift and did not sacrifice performance. There were open-source alternatives to my reinvented wheel, but they all seemed inconvenient for the tasks at hand.
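The "bytes in, bytes out" problem can be sketched as follows. This is not the framework described above, only a minimal illustration of the mapping idea under stated assumptions: the classes `Post` and `ByteRowMapper` are hypothetical, only `String` and `long` fields are supported, and a row is modeled as a plain map from column name to raw bytes rather than a real Thrift call.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical entity used only for illustration.
final class Post {
    String title;
    long views;
}

// Hypothetical mapper: one column per field, values stored as raw bytes.
final class ByteRowMapper {

    // Object -> columns: what would be handed to the byte-oriented client.
    static Map<String, byte[]> toColumns(Object entity) {
        Map<String, byte[]> columns = new LinkedHashMap<>();
        for (Field f : entity.getClass().getDeclaredFields()) {
            if (skip(f)) continue;
            try {
                f.setAccessible(true);
                Object value = f.get(entity);
                if (value != null) columns.put(f.getName(), encode(value));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return columns;
    }

    // Columns -> object: rebuild the entity from the bytes read back.
    static <T> T fromColumns(Class<T> type, Map<String, byte[]> columns) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            T entity = ctor.newInstance();
            for (Field f : type.getDeclaredFields()) {
                byte[] raw = columns.get(f.getName());
                if (skip(f) || raw == null) continue;
                f.setAccessible(true);
                f.set(entity, decode(f.getType(), raw));
            }
            return entity;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Static and compiler-generated fields are not part of the row.
    private static boolean skip(Field f) {
        return Modifier.isStatic(f.getModifiers()) || f.isSynthetic();
    }

    private static byte[] encode(Object value) {
        if (value instanceof String) {
            return ((String) value).getBytes(StandardCharsets.UTF_8);
        }
        if (value instanceof Long) {
            return ByteBuffer.allocate(Long.BYTES).putLong((Long) value).array();
        }
        throw new IllegalArgumentException("unsupported type: " + value.getClass());
    }

    private static Object decode(Class<?> type, byte[] raw) {
        if (type == String.class) return new String(raw, StandardCharsets.UTF_8);
        if (type == long.class || type == Long.class) return ByteBuffer.wrap(raw).getLong();
        throw new IllegalArgumentException("unsupported type: " + type);
    }
}
```

A real layer of this kind would also need key handling, connection pooling and cached field metadata, since per-call reflection like this would likely be too slow for a store chosen for its write speed.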
Thanks to my team lead and my patient colleagues, who selflessly started using my product and immediately threw a ton of bug reports at me :))
And still, Cassandra kept devouring memory and hanging when it ran short...
Riak
My acquaintance with it was short. I read about it on Habr: cool. I read the website: cool. I installed it and began testing. First, the lack of the query functionality I needed was off-putting. Second, the database behaved very strangely when writing 20 million rows: it simply died. The restarted database behaved even more strangely: with 20 million rows on board, it took 10 minutes to load, for some reason loading only one of the four cores at 100%.
This was my personal research, so I did not want to waste more time on this database.
Hypertable
This database seemed to be the salvation, since it was not very memory-hungry at a billion records per server and was very fast at writes. Although, of course, the write speed there depends entirely on the chosen timeout for flushing to disk.
The Thrift API caused no problems after Cassandra; all that remained was to add Hypertable support to the ORM.
But this database was so flaky, and its logs so uninformative, that one could only wonder how the product could be called stable. Attempts to find fellow sufferers with the same problems online came to nothing. You could simply restart the database and never see it come back up. And bringing it up took a shaman's tambourine: restart twice, delete the logs, restart another 2-3 times. Or 5 times. Since the problems did not show up immediately, it almost made it into production. All in all, not an option...
MySQL
(just for example)
Sad faces of colleagues, sad me. NoSQL had not solved our problems. Everything was in vain. Reluctantly we tested MySQL on our tasks, and on 3 billion records it performed quite well. This upset me completely, with thoughts like “How can that be! It's NoSQL! Big data!” I had to run MySQL on real data. Naturally: no joins, no complex relations. I must say that the real data changed the picture, and one of the tasks was never solved with MySQL. That was it: a 4-second query was beyond the limit, even with a heavily optimized query, this time with joins and SQL features. But MySQL coped with the other task quite well. The main thing is the right number of rows in a write batch.
In general, we were financially constrained; buying many powerful servers was out of the question. We used what we were given and tried to save as much as possible.
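The remark about batch size can be illustrated with a small sketch. The article does not say how the writes were actually batched, so the `WriteBatcher` class below is hypothetical: it only splits rows into fixed-size groups, each of which would then be sent in one round trip (for example as a multi-row INSERT, or through JDBC's `PreparedStatement.addBatch` and `executeBatch`).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits rows into fixed-size write batches.
final class WriteBatcher {

    static <T> List<List<T>> partition(List<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < rows.size(); from += batchSize) {
            int to = Math.min(from + batchSize, rows.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(rows.subList(from, to)));
        }
        return batches;
    }
}
```

The right `batchSize` is not something the code can derive. It has to be found by measuring write throughput on real data and real hardware, which is exactly the kind of testing the article argues for.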
MongoDB
In parallel with the databases listed above, I have used and still use this one too. It is also a favorite: I have already used it in 6 projects. Among its conveniences are a handy ORM framework for Java, Morphia, rich querying capabilities, scalability and speed.
Of course, there are nuances here:
- it is highly desirable to use a Mongo version > 2
- be careful about restarting the server without shutting Mongo down cleanly, unless it has been properly tuned
- read about mongorestore and journaling :)
In my opinion, this database is remarkable as a transitional one between SQL solutions and the NoSQL world. What are its advantages for me personally? It is schema-free, has simple queries, is document-oriented and scales. I like the very paradigm this database embodies.
And yet, of my 6 Mongo projects, 3-4 could have been written on MySQL without any trouble. I wrote them on Mongo simply because I like Mongo.
Hadoop
I started using this thing recently, about 3 months ago, when I moved to a new job. Hadoop is an ecosystem of solutions for storing and processing huge amounts of data. Once you grasp the essence of MapReduce and Hadoop, you are struck by the simplicity of the algorithms and principles at the foundation of this solution. Yet this simplicity lets you process 200 gigabytes of text as if it were a small article. The point is that a set of simple ideas yields a fast, simple solution. And if the data does not seem to be processed fast enough, add a node to the cluster.
Of course, understanding the essence, studying the Hadoop sources and implementing the first computing jobs takes some time.
But the main surprise of this solution for me was that a database may not be needed at all if you need to store and process really big data.
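The simplicity of the MapReduce idea can be shown with a word count written in plain Java, without the Hadoop API. This is only a single-process sketch of the three phases; the class `MiniMapReduce` and its method names are made up for the example, and a real Hadoop job distributes the same steps across the nodes of a cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of map, shuffle and reduce.
final class MiniMapReduce {

    // Map phase: one input line in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values of one key into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: runs the three phases in sequence over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }
}
```

Because `map` looks at one line at a time and `reduce` at one key at a time, neither needs to see the whole data set. That independence is what lets the work be spread over more machines when the data grows.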
As a conclusion, I would like to express my personal opinion on all this:
There is no single solution for every task. The closest thing to one, in spite of everything, is an SQL solution. Each NoSQL store is a tool that solves only a certain range of tasks, and at the same time requires filing down by hand, studying its internals and careful tuning, or even writing your own client.
Addition to the conclusion:
You need to think for yourself first of all, because there will be no silver bullet. And however clear the database manual is, the number of surprises will not decrease. Current NoSQL solutions are young and therefore not flawless tools. Nevertheless, some of them are quite ready for production use, for example MongoDB, Redis, HBase, Cassandra.
But in my opinion you have to arrive at the answer to "what to use, and in which case" yourself, by testing and researching solutions for your specific problem.