
Cassandra Cluster Rescue Experience

I happened to rescue a Cassandra cluster that had all but gone under. It was an interesting experience that I would like to share, because under normal conditions most databases behave more or less the same, while the level of stress during a failure can differ enormously.







About the project




The service that uses Cassandra needs to store the N most recent events for each user. Events arrive far more often than users read them, and in most cases the recorded data will never be read at all - it will simply be superseded by newer events. There are not many databases in the world that handle write-intensive workloads well, but Cassandra is one of them. Writing to the cluster (at minimal consistency) is much faster than reading. It also helps that data only ever needs to be selected by the primary key - the user id.
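
As a toy illustration of the access pattern (just a conceptual model of "keep the N most recent events per user, addressed by user id" - not the actual Cassandra schema), it boils down to something like this:

from collections import defaultdict, deque

N = 1000  # hypothetical: how many recent events to keep per user

# each user id maps to a bounded buffer; a new event silently pushes the
# oldest one out, which is exactly why most writes are never read
feeds = defaultdict(lambda: deque(maxlen=N))

def write_event(user_id, event):
    feeds[user_id].append(event)    # cheap, happens constantly

def read_feed(user_id):
    return list(feeds[user_id])     # lookup by primary key only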



Something went wrong





The person who launched the service did not take the documentation seriously enough and never balanced the ring. The catch is that when a node is added automatically, it is assigned half of the largest segment that exists at the moment it joins. As a result, the five nodes that were brought up almost simultaneously ended up in a rather bizarre configuration, with two servers loaded much more heavily than the other three.



nodetool ring

                                           161594151652226679147006417055137150248
X1  Up  106.92 GB   38459616957251446947579138011358024346  |<--|
X2  Up  261.58 GB   87228834825681839886276714150491220347  |   ^
X3  Up  268.08 GB  136691754424709629435450736264931173936  v   |
X4  Up  148.58 GB  151190524462851319585265604946253553766  |   ^
X5  Up   72.71 GB  161594151652226679147006417055137150248  |-->|
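
With the RandomPartitioner the token space runs from 0 to 2^127 and each node owns the interval from the previous token (going around the ring) up to its own, so the skew is easy to quantify straight from the output above. A quick sketch:

RING = 2 ** 127  # RandomPartitioner token space

tokens = {
    "X1": 38459616957251446947579138011358024346,
    "X2": 87228834825681839886276714150491220347,
    "X3": 136691754424709629435450736264931173936,
    "X4": 151190524462851319585265604946253553766,
    "X5": 161594151652226679147006417055137150248,
}

nodes = sorted(tokens, key=tokens.get)
for prev, node in zip(nodes[-1:] + nodes[:-1], nodes):
    owned = (tokens[node] - tokens[prev]) % RING
    print(node, "owns", round(100 * owned / RING, 1), "% of the ring")

Three nodes own roughly 27-29% of the ring each, the other two only about 6-9%; add replication on top and the disk usage above stops being surprising.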





Each of the five servers had a 260GB hard drive. Two nodes went down after running out of disk space, and the whole cluster was choking under the load.



The documentation, for its part, repeatedly warns against letting the ring segments be chosen automatically in production, and even provides a formula and PHP code for calculating the tokens by hand.
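
The recipe amounts to spacing the tokens evenly around the 2^127 ring. The documentation gives it in PHP; roughly the same thing in Python:

def balanced_tokens(node_count):
    """Evenly spaced initial tokens for the RandomPartitioner (0 .. 2**127)."""
    return [i * (2 ** 127) // node_count for i in range(node_count)]

for i, token in enumerate(balanced_tokens(5)):
    print("node", i + 1, "->", token)

Each node is then started with its own precomputed token instead of letting bootstrap pick one.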



Resuscitation





First, while Cassandra is shut down you can do whatever you like with its data files. We moved one of the heavy, old files (30GB) to NFS and put a symlink in its place. We started the cluster, checked the service - it worked. Total repair time from the moment the problem was detected: 15 minutes, almost all of which was spent copying the file to NFS.
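
The whole trick fits in two lines. The paths below are hypothetical (the real SSTable name and NFS mount point are specific to that cluster), but the idea is simply that Cassandra keeps finding the file at its old path:

import os
import shutil

# hypothetical paths: a large, old SSTable and an NFS mount
sstable = "/var/lib/cassandra/data/App/Feeds-123-Data.db"
nfs_copy = "/mnt/nfs/Feeds-123-Data.db"

shutil.move(sstable, nfs_copy)   # frees up local disk space
os.symlink(nfs_copy, sstable)    # Cassandra still opens the file at the old path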



Second, I immediately enabled caching in the database. Cassandra has a fairly decent caching mechanism that significantly reduces hard disk access; in our application, at least, cache hit rates were between 80% and 90%.



nodetool setcachecapacity App Feeds 200000 1000000



Note: the cache size is specified in records, not bytes, so you need a reasonable idea of the average record size in order not to overshoot. I, of course, overshot, allocated more than the available memory, and a few hours later got my first node going down with OutOfMemory. The cure is a simple restart and a more careful cache size.
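
A back-of-the-envelope check would have saved me that OutOfMemory. The numbers below are hypothetical (the average row size has to be measured on real data), but the arithmetic is the whole point: the row cache keeps entire rows in memory, so its capacity in records translates directly into heap.

key_cache_entries = 200000    # from the setcachecapacity call above
row_cache_entries = 1000000

avg_row_size = 4 * 1024       # hypothetical: measure on real data
key_overhead = 100            # rough per-entry bookkeeping, also a guess

estimated = row_cache_entries * avg_row_size + key_cache_entries * key_overhead
print("roughly %.1f GB of heap just for the caches" % (estimated / 2 ** 30))

Several gigabytes for the row cache alone is an easy way to push the JVM heap over the edge.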



Attempt to cure





So the service came back to life and coped with the load, but you cannot live for long with an unbalanced cluster and part of the data sitting on NFS. After reading the documentation, I tried the nodetool move command and spent almost a week trying to make it work. The essence of the problem was that the data never actually moved between the nodes: a streams directory with the data prepared for transfer appeared on the source node, but the transfer itself (which can be watched with the nodetool streams command) always hung, sometimes at the very beginning.



That is how I first ran into bug 1221. After reading the fix, I tried to upgrade to the latest version, but there I was caught by bug 1760. In the end I did update the cluster to 0.6.5, but it did not help much: the cluster remained stuck in its unbalanced state.



I have to say that the tools for managing the cluster are not merely poor, they are rudimentary. You can issue only a handful of commands and monitor their progress by indirect signs. That is all.



To my great joy, by that time management had shelled out for a training seminar from Riptano, the company behind Cassandra that develops it and provides paid support. It was at this seminar that I discovered the Tao of Cassandra.



The Tao of Cassandra





Do not try to cure anything. If a node goes down, finish it off, wipe it clean, and bring it back as a brand-new one. That was the message of the seminar, and it explains why the cluster management tools are so rudimentary: in the authors' view, management consists of two basic operations - 1) add a node; 2) remove a node.



That is exactly how I finally managed to fix the cluster. The nodes were removed from the cluster one by one, wiped, and started again with correctly calculated tokens. Of course, I had to sit down with pen and paper and work out a restart procedure that would avoid downtime; it turned out to be an interesting, though not very difficult, exercise in combinatorics.
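
The combinatorics is essentially this: with SimpleStrategy a key lives on the node that owns its range plus the next RF-1 nodes clockwise, so before taking a node out you have to check that every range still keeps a live replica. A small checker, with RF=2 as an assumption (I will not vouch for the real replication factor of that cluster):

RF = 2                                 # assumption; use the cluster's real value
ring = ["X1", "X2", "X3", "X4", "X5"]  # nodes in token order

def replicas(owner_index):
    """SimpleStrategy: the range owner plus the next RF-1 nodes clockwise."""
    return {ring[(owner_index + j) % len(ring)] for j in range(RF)}

def safe_to_remove(node):
    """Every range keeps at least one live replica while this node is out."""
    return all(replicas(i) - {node} for i in range(len(ring)))

for node in ring:
    print(node, "can be taken out on its own:", safe_to_remove(node))

With a replication factor of at least 2, taking the nodes out strictly one at a time leaves every range with a live replica, which is what makes a rebalance without downtime possible, at least at consistency level ONE.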



It did not go without bugs, of course. Bug 1676 haunted me the whole time: a bootstrapping node would receive 50GB of its new data and then calmly sit there doing nothing. Restarting the service brought over the next 50GB, and so on until everything had arrived.



Conclusion





The cluster did get fixed. My opinion of Cassandra changed from "what is this student hack job, there are no tools at all" to "a surprisingly stable database". After all, the cluster ran on damaged servers for two months - on two of them HDD access was slowed down to the speed of NFS - and all that time the service as a whole stayed alive and users did not really complain.



During this time I learned a great deal about the internals of this database, talked to its creators (amazingly responsive and intelligent people), and towards the end of the process even started to enjoy it.

Source: https://habr.com/ru/post/114160/


