Our main metrics store is Cassandra; we have been using it for more than three years. Until now we had always managed to solve every problem with Cassandra's built-in diagnostic tools.
Cassandra has quite informative logging (especially at the DEBUG level, which can be enabled on the fly), detailed metrics exposed via JMX, and a rich set of utilities (nodetool, sstable*).
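For example, raising the log level on a live node looks roughly like this (a sketch; the nodetool subcommands for this should be available on the 2.1+ branch we run, check nodetool help on your version):

$ nodetool setlogginglevel org.apache.cassandra DEBUG   # takes effect without a restart
$ nodetool getlogginglevels                             # verify the current levels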
But recently we ran into a rather interesting problem, and we had to think hard and read the Cassandra source code to figure out what was going on.
It all started when Cassandra's read response time on one of the tables began to grow.
At the same time the logs were empty and there were no background processes (compactions) running. We quickly noticed that the number of SSTables in the bunches table (where we store the values of our clients' metrics) was growing.
Moreover, this was happening on only three servers out of nine.
Then we were stuck for quite a while, googling and reading JIRA, but there were no similar bugs there. Meanwhile time was passing and we needed at least a temporary solution, since the response time was growing almost linearly. We decided to force a compaction of the bunches table, but because the documentation does not make clear whether nodetool compact starts compaction on one node or on all of them, we initiated the process via JMX.
The thing is, we had already learned from bitter experience that changing the compaction strategy via ALTER TABLE is fraught with a full compaction starting simultaneously on all nodes of the cluster, so this time we wanted a more controllable way.
This time it turned out that nodetool compact starts compaction only on the node it is run against.
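For the record, the per-node invocation itself is trivial (okmeter and bunches are our keyspace and table names):

$ nodetool compact okmeter bunches   # run locally on the problem node, affects only this node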
After that compaction finished, the number of SSTables went down, but it immediately started to grow again.
So we ended up with a crutch that let us keep Cassandra's performance at an acceptable level in manual mode: we set the forced compaction up in cron on the problem nodes (sketched below), and the read response time went back to normal.
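The crutch itself is nothing more than a crontab entry per problem node; the schedule and file name here are illustrative, not the ones we actually used:

# /etc/cron.d/cassandra-force-compact (illustrative)
0 */4 * * *  cassandra  nodetool compact okmeter bunches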
At the moment we are running Cassandra 2.1.15, but in JIRA we found several similar bugs that were fixed in 2.2+.
Since we had no better ideas at the time, we decided to upgrade one of the problematic nodes to 2.2 (all the more so since we were going to do it anyway, and our application had already been tested against 2.2). The upgrade went smoothly, but it did not solve our problem.
In order not to introduce additional entropy into the situation, we decided not to upgrade the whole cluster and rolled that node back to 2.1 (this is done by removing the node from the cluster and adding it back with the old version).
It became clear that this problem could not be solved in one quick pass and it was time to go read the Cassandra code. Since we ultimately store time series in this table, we use DateTieredCompactionStrategy with the following settings:
{ 'class': 'org.apache.cassandra.db.compaction.DateTieredCompactionStrategy', 'base_time_seconds': '14400', 'max_sstable_age_days': '90' }
For our case this ensures that data for the same time interval sits side by side and does not get mixed with older data. At the same time, data older than 90 days should not be compacted at all, which avoids unnecessary disk load, since that data is definitely not going to change anymore.
A hypothesis appeared: what if our SSTables are not being compacted because Cassandra thinks they are older than 90 days?
The time Cassandra relies on here is the internal timestamp that every column has. Cassandra either assigns the current timestamp when the data is written, or the client can set it explicitly:
INSERT INTO table (fld1, fld2) VALUES (val1, val2) USING TIMESTAMP 123456789;
(we do not use this feature).
While checking the metadata of all the SSTables with the sstablemetadata utility, we found an anomalous timestamp value:
$ sstablemetadata /mnt/ssd1/cassandra/okmeter/bunches-3f892060ef5811e5950a476750300bfc/okmeter-bunches-ka-377-Data.db | head
SSTable: /mnt/ssd1/cassandra/okmeter/bunches-3f892060ef5811e5950a476750300bfc/okmeter-bunches-ka-377
Partitioner: org.apache.cassandra.dht.RandomPartitioner
Bloom Filter FP chance: 0.010000
Minimum timestamp: 1458916698801023
Maximum timestamp: 5760529710388872447
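By the way, a loop along these lines (paths as in our layout above) is enough to check every SSTable of the table for such an anomaly:

for f in /mnt/ssd*/cassandra/okmeter/bunches-*/okmeter-bunches-ka-*-Data.db; do
    echo "== $f"
    sstablemetadata "$f" | grep -E 'Minimum timestamp|Maximum timestamp'
done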
But the newly created SSTables had perfectly normal timestamps, so why weren't they being compacted? In the code we found this:
/**
 * Gets the timestamp that DateTieredCompactionStrategy considers to be the "current time".
 * @return the maximum timestamp across all SSTables.
 * @throws java.util.NoSuchElementException if there are no SSTables.
 */
private long getNow()
{
    return Collections.max(cfs.getSSTables(), new Comparator<SSTableReader>()
    {
        public int compare(SSTableReader o1, SSTableReader o2)
        {
            return Long.compare(o1.getMaxTimestamp(), o2.getMaxTimestamp());
        }
    }).getMaxTimestamp();
}
So "now" in our case was:

$ date -d @5760529710388
Sat Dec 2 16:46:28 MSK 184513

That is, DTCS thought "now" was some 182 thousand years in the future, every other SSTable looked far older than max_sstable_age_days = 90 against it, and there was no hope of any compaction at all :)
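(The arithmetic, for the record: SSTable cell timestamps are microseconds since the epoch, so stripping the last six digits of the Maximum timestamp above gives the Unix seconds fed to date.)

$ echo $((5760529710388872447 / 1000000))
5760529710388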
On each of the three problem servers there was one "broken" SSTable of considerable size (60 GB, 160 GB and 180 GB). We moved the smallest of them aside, turned it into a human-readable file of 125 GB with sstable2json and started grepping it. It turned out that there was a single broken column (a secondary metric of some test project) that could safely be removed.
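Roughly, that exercise looked like this (a sketch: sstable2json needs the table schema available locally and the rest of the SSTable's component files next to the Data.db; the file names follow the examples above):

$ sstable2json okmeter-bunches-ka-377-Data.db > bunches-377.json   # ~125 GB of JSON
$ grep -n 5760529710388872447 bunches-377.json                     # locate the row carrying the bogus timestamp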
Cassandra has no standard way to remove data from an SSTable, but the sstablescrub utility is the closest thing in spirit. Looking at Scrubber.java, it became clear that it does not look at the timestamps at all and a clean patch would be quite hard to make, so we made an ugly one:
--- a/src/java/org/apache/cassandra/db/compaction/Scrubber.java
+++ b/src/java/org/apache/cassandra/db/compaction/Scrubber.java
@@ -225,6 +225,11 @@ public class Scrubber implements Closeable
                 if (indexFile != null && dataSize != dataSizeFromIndex)
                     outputHandler.warn(String.format("Data file row size %d different from index file row size %d", dataSize, dataSizeFromIndex));
 
+                if (sstable.metadata.getKeyValidator().getString(key.getKey()).equals("226;4;eJlZUXr078;1472083200")) {
+                    outputHandler.warn(String.format("key: %s", sstable.metadata.getKeyValidator().getString(key.getKey())));
+                    throw new IOError(new IOException("Broken column timestamp"));
+                }
+
                 if (tryAppend(prevKey, key, dataSize, writer))
                     prevKey = key;
             }
where 226;4;eJlZUXr078;1472083200 is the key of the bad record, which we knew from our exercises with sstable2json.
And it worked!
A separate pleasant surprise was that sstablescrub works very fast, almost at disk write speed. Since an SSTable is an immutable structure, any modification creates a new SSTable, so scrub needs enough free disk space. For us this turned out to be a problem: we had to run scrub on another server and copy the cleaned SSTable back to the right one.
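The whole dance looked approximately like this (a sketch, not our exact commands: the spare host needs the same Cassandra version with the patched Scrubber, the okmeter.bunches schema and enough free space; the staging paths are illustrative):

# on the problem node: move all components of the broken sstable out of the data directory
mv /mnt/ssd1/cassandra/okmeter/bunches-3f892060ef5811e5950a476750300bfc/okmeter-bunches-ka-377-* /mnt/sata1/broken/

# on the spare host: put those files into its data directory for okmeter.bunches and scrub offline
sstablescrub okmeter bunches

# back on the problem node: copy the cleaned sstable into the table's data directory and pick it up
nodetool refresh okmeter bunches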
After the broken record was stripped out, our table started compacting again on its own.
But on one node we noticed that a rather hefty SSTable, more than 100 GB in size, was constantly being compacted with itself:
CompactionTask.java:274 - Compacted 1 sstables to [/mnt/ssd1/cassandra/okmeter/bunches-3f892060ef5811e5950a476750300bfc/okmeter-bunches-ka-5322,]. 116,660,699,171 bytes to 116,660,699,171 (~100% of original) in 3,653,864ms = 30.448947MB/s. 287,450 total partitions merged to 287,450. Partition merge counts were {1:287450, }
As soon as that compaction finished, the same SSTable immediately started compacting itself again; on the graph of disk read operations you could see the file migrating from ssd1 to ssd2 and back.
There was also an error in the log about insufficient space:
CompactionTask.java:87 - insufficient space to compact all requested files
SSTableReader(path='/mnt/ssd2/cassandra/okmeter/bunches-3f892060ef5811e5950a476750300bfc/okmeter-bunches-ka-2135-Data.db'),
SSTableReader(path='/mnt/ssd1/cassandra/okmeter/bunches-3f892060ef5811e5950a476750300bfc/okmeter-bunches-ka-5322-Data.db')
But why compact a single SSTable at all? We had to figure out how SSTables are actually selected for compaction in DateTieredCompactionStrategy.
With DEBUG-level logging it became clear what was going on in our case.
I won't venture to judge whether this is a bug or a feature (maybe when TTLs are used there really is a need to compact a single file), but we had to fix it somehow. We decided that we simply needed to find a way to let Cassandra finish compacting these SSTables properly.
Our servers have two SATA disks and two SSDs each, and it was the SSDs that were short on space. So we decided to move the candidates of this problematic compaction onto a SATA disk and leave symlinks in their place on the SSD, thereby freeing up SSD space for the resulting SSTable (a sketch below). It worked, and compaction on this machine went back to normal.
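A sketch of the trick (the staging directory is an assumed path on one of the SATA disks, the SSTable generations are the ones from the log above; doing this under a running node is at your own risk):

STAGE=/mnt/hdd2/compaction-stage
mkdir -p "$STAGE"
# move the components of one compaction candidate off the SSD and leave symlinks behind
for f in /mnt/ssd1/cassandra/okmeter/bunches-*/okmeter-bunches-ka-5322-*; do
    mv "$f" "$STAGE/$(basename "$f")"
    ln -s "$STAGE/$(basename "$f")" "$f"
done
# likewise for the second candidate, okmeter-bunches-ka-2135-*, which lives on ssd2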
On the disk graphs you could see how hdd2 got involved in the process.
Source: https://habr.com/ru/post/310000/