
Moving to Yandex ClickHouse



Alexander Zaitsev answers questions about moving to Yandex ClickHouse. This is a transcript of his HighLoad++ 2016 talk.

Hello everybody! Over these two days the conference has had two ClickHouse meetups of about two hours each; today's ran almost three hours. After that, Victor and Alexey gave a wonderful talk, and it would seem there is nothing left to add. In fact, there is.
I'll tell you how to move to ClickHouse if you already have something in place. If you are starting from nothing, everything is simple: take ClickHouse and build on it from scratch. But if you already have a working system, it is much harder. Right now you are fine: you have realized that ClickHouse is a great system, and Victor and Alexey will answer all your questions. I have no doubt they will further assure you that this is the right decision.

In fact, if you are actually going to migrate, not everything is as rosy as they say, because ClickHouse is quite different from everything you have dealt with in the past.

All of this is a completely new and quite specific experience, and it takes a lot of work to make everything come out well. Some simple things work right away, for example loading logs. Some not-so-simple things do not work right away. In the end, I am convinced that everyone will succeed, because we succeeded: we broke through all the obstacles.

Who are we?


I work for LifeStreet, an advertising network. It is a fairly large network; we have been on the market for more than 10 years, and we optimize various kinds of advertising entities: campaigns, ads, parts of ads, bidding, integration with external systems, and so on.

It is a thoroughly alive company that earns tens of millions of dollars a year. We do not have as much data as Yandex, but still about half as much: some 10 billion events a day. All of it is stored in a large analytical system used by internal and external users alike, as well as by machine learning, optimization and other algorithms. Our infrastructure is very good and battle-tested, built on Vertica since 2010; we were among the very first to use Vertica in Russia, although the company itself is American. I have spoken about this at HighLoad twice, the last time in 2013, when I surveyed the other options for building analytical systems. ClickHouse did not exist then, and I found nothing different and better than Vertica. At the time there was nothing better.

But we kept looking. We liked Vertica, yet there were some issues, and sooner or later we wanted to move off it. Last year a thing called Snowflake appeared, a data warehouse built entirely on Amazon. It scales remarkably and brings up servers on demand, that is, you can shut servers down for the night, turn them on during the day, and not pay for the idle time. All the data is stored in Amazon S3, and so on. It is good because it was built by former Vertica engineers; they would not do a bad job.

Everything is fine with us so far, but we want to leave Vertica.

Then one fine moment, like a rabbit out of a hat, ClickHouse appears. Everyone rushes to take a look at it. It is fast, it really runs very fast; the guys told the truth. It scales remarkably: plant one rabbit and in a month you have a whole family. It is a genuinely cool thing; many thanks from all of us.

Our CTO sends me a letter as follows:


Sasha, ClickHouse has similar use cases: they run an advertising network, they count clicks, and so on. They have released the system, and on paper everything looks perfect, practically exactly what we need. If we can use it, it will be very cool. Please take a look.

No sooner said than done


What is ClickHouse in the summer-fall of 2016?

First, it had only just appeared, a completely new thing. That is usually bad for any product: there are likely many bugs, and so on. It was an internal Yandex project, their first external product, with prospects that were still unclear.

Released in Open Source, good. So what?

At that time there were no independent installations. Now they say production installations exist, but back then it was unknown who ran it, with what results and under what loads. There were no case studies to look at to understand that this was really what you needed. There was no support: Alexey was the only support that existed, and he was one person. There was no roadmap, and it was unclear how the product would develop; the company disclosed no plans. It was open source, but all the plans were internal. As announced, there were only three developers on ClickHouse itself, plus a few more helping from Yandex.Metrica.

From the documentation it was immediately clear that there were many known limitations, already documented. How many unknown ones there were was impossible to say, but clearly they existed too.

The history of other projects that were simply dumped into open source also weighed on us, because such projects were not always successful. For example, Facebook released Cassandra into open source when it was very raw; in the end Facebook no longer uses it, though Cassandra got lucky and others develop it now.

If you have been to Peter Zaitsev's talks, he has probably spoken about TokuDB and TokuMX, which are included in Percona Server. There was a company called Tokutek in Boston. It made a wonderful storage engine for MySQL that stored and compressed data in blocks, much better than InnoDB for a number of workloads. The company was wonderful, but sales went very badly; in the end it went bankrupt, and in order not to lose its results, it released everything into open source.

With this list we came to our director at LifeStreet, the one responsible for this area and for the money. We told him: "Paul, we want to replace Vertica and your analytics infrastructure with this thing developed in Russia". What do you think he said? He is a polite man, but he said: "Are you out of your mind? You want to replace something that works and makes me money?"

Fortunately, he is not the one who makes these decisions, but that is what he said.

Then we went to the developers, the people who know how to build on top of databases, who know SQL and all those things. They began to look at ClickHouse and realized that much of what they expected was simply not there:


What do you think the developer said when he saw this list?

He said: "And they call this a database?" By "they" he meant Yandex. Nevertheless, despite understanding that the system had serious limitations, we decided to try.

We decided to try first of all because the subject area and the task were really close to ours, and if the task is close, the solution is likely to work for us too. Besides, in Russia the authority of Yandex is very high: I know quite a few people who worked or work at Yandex, and they are all top-class professionals. There was also immediate, active community interest in the topic, which showed that it was genuinely interesting and could not be empty noise. It became very interesting to us; we really wanted to try, because it is an interesting thing. And it is free.

We tried it - we made a quick small pilot project and realized that it really works. Yandex is telling the truth. It works fast.

It is the first database in our memory - and in recent years we have checked quite a few - that was not worse than Vertica. On Yandex's benchmarks it was better; on our tests it was about even, a little better in some places, a little worse in others. That is a very strong result.

Many features - we saw that too. And very good documentation; as documentation for a project that has only just been open-sourced, it was and remains excellent. In my opinion, of course, some things are missing, but first, you can always ask, and second, what is there is enough to start, and not only to start. There is a thoroughly alive forum on Google Groups where, by his own admission, Alexey spends almost half of his working time answering questions. The answers are always very detailed; he tries to grasp the essence of the problem, and that convinced us that at Yandex the attitude is more than serious.

Well, I have already spoken about Alexey. Thanks to his answers to questions and to personal mail, it even got to the point where we came to Yandex in person: I arrived with one of my developers to talk with Alexey and look him in the eye. We looked and realized: "It can be done."

So, off we went


As I said, the main problem was that the existing system had to be moved to ClickHouse: the data schemas, the loading scripts, the OLAP service that delivered this data to all our clients, internal and external, and also some administration procedures.

There is a fundamental problem with any move: when you move somewhere else, something will not suit you. You move to England, you have to drive on the other side. You move to America, your plug does not fit the outlet. You have habits and things you are used to; they work for you, you move, and they stop working. This relocation problem always exists wherever you move, even between very similar systems, and with ClickHouse it was the main difficulty. The approaches we had grown used to in databases such as MySQL, Oracle and Vertica never worked head-on in ClickHouse; they could never be applied directly.

The most important thing, of course, is the schema; everything depends on the data schema. The schemas used in analytical problems are a star embedded in a cube, or a cube unfolded into a star. Tricky geometry.



The table in the middle is usually called "facts", and the tables on the sides are called "dimensions". In addition, there is a synthetic concept, the metric: a function that aggregates over the facts.

The schema does not have to look exactly as drawn here, because there are always two extremes. You can put everything into the fact table, including all the dimensions, and that is very expensive on disk: a lot of space is needed, and getting a distinct list is costly. If you have a billion records and want the list of all countries in them, then no matter how efficient the column storage is - unless it keeps the list of countries separately, of course - you have to read it all, which is expensive. And you cannot change anything: the fact table is usually never modified.

The other extreme is keeping everything in dimension tables, and here ClickHouse plays a dirty trick on you, so to speak, because its joins are very bad; I will say more about them later, join is of very limited use. And again there are no updates, even though dimensions are precisely the thing you often need to update or extend.

Let me dwell on this a little.


Dimensions can be divided into three rough types. The first is static directories that never change, or change very rarely: a list of countries, the time hierarchy, the geography hierarchy and the like. Second, a company usually has some kind of CRM and ERP, and those contain changeable directories: a list of clients, a list of advertising campaigns, characteristics of ads, and so on. They change, but not often; in a normal OLTP database something changes once in a while, and those changes need to be reflected in the analytical database. And besides that there is data from which arbitrary attributes arrive - for example, application IDs or phone models - that change so often and so quickly, with new ones constantly appearing, that it is almost impossible to build a static or even slowly changing reference book. You never know where your traffic will come from.

As it turned out, ClickHouse has a wonderful thing called dictionaries. Victor said a little about them, and I will cover them in detail. A dictionary lets you associate values with a key; to borrow an analogue from Cassandra, it is like a column family: a number of columns of values behind a key. Dictionaries can be loaded from different sources, for example from files or MySQL databases. They are described in XML, and each dictionary can be described in its own file, which makes it easy to keep them under version control and a little easier to manage.
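A minimal dictionary description might look roughly like this; the path, names and MySQL source here are all hypothetical, just to illustrate the shape:

```xml
<!-- e.g. /etc/clickhouse-server/dict_country.xml (hypothetical) -->
<dictionaries>
  <dictionary>
    <name>country</name>
    <source>
      <mysql>
        <host>mysql-host</host><port>3306</port>
        <user>reader</user><password>secret</password>
        <db>crm</db><table>countries</table>
      </mysql>
    </source>
    <layout><flat/></layout>          <!-- fastest layout, size-limited -->
    <structure>
      <id><name>country_id</name></id>
      <attribute>
        <name>country_name</name>
        <type>String</type>
        <null_value></null_value>
      </attribute>
    </structure>
    <lifetime>300</lifetime>          <!-- refresh period, in seconds -->
  </dictionary>
</dictionaries>
```

One dictionary per file keeps versioning simple, exactly as described above.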

They are updated, although with some nuances, which I will discuss next.

Access to them is through functions - not the ones Victor showed, since he demonstrated the built-in dictionaries baked into ClickHouse, but the general-purpose dictionary access functions for external dictionaries.

Dictionaries have certain limitations. One of them is that keys in ClickHouse must be of type UInt64; no other type is allowed. And since ClickHouse has no implicit type casts, if the key column in some table has a different type, you have to cast it explicitly, which is not very convenient.
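In practice that means a cast at every call site. A sketch, with a hypothetical fact table whose key column is, say, UInt32:

```sql
-- dictGetString expects the key as UInt64, so cast by hand:
SELECT dictGetString('country', 'country_name', toUInt64(country_id)) AS country
FROM facts
LIMIT 10;
```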

The second limitation, significant for us though maybe not for everyone, is that there is no direct way to get all the values of a column. You cannot find out from the dictionary itself what is in it; you can only look at the source, but you cannot ask ClickHouse head-on.

There is also a size limit. There are three dictionary layouts, and the most efficient and fast one is capped at 500 thousand rows - another magic number. The other layouts are less limited but slower. You cannot update a dictionary on demand or on a change in the source, again with some nuances; we have to track this ourselves. Every node of the cluster is independent: if you have 100 nodes and need to update a dictionary, you have to go to every node to refresh it. And in some cases queries over dictionaries can be slow.



We came up with a trick of our own - know-how; I do not know whether Yandex uses it now. We attach an extra table to each dictionary in which we store only the keys.

What does this give us?




On the one hand, it makes it possible to obtain all the entries of a dictionary; on the other, it lets you optimize certain queries. If you have a very, very large table and you run a query with a condition on a dictionary function, that function is called for every row, which is probably not fast on a very large table. But with the separate key table you can rewrite the request: first pull out the matching set of keys with a very simple query, and then a very efficient select does the rest.

These are of course simplified requests. In reality, there will be some other conditions, but the idea is probably understandable.
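Under a hypothetical schema (fact table `facts`, key table `country_keys` holding just the dictionary keys), the rewrite might look like this:

```sql
-- Naive: the dictionary function runs for every row of the big table.
SELECT count()
FROM facts
WHERE dictGetString('country', 'country_name', country_id) = 'Germany';

-- Rewritten: resolve the few matching keys first, then a fast IN scan.
SELECT count()
FROM facts
WHERE country_id IN
(
    SELECT country_id
    FROM country_keys
    WHERE dictGetString('country', 'country_name', country_id) = 'Germany'
);
```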

Updating dictionaries is also important to us, because the data changes. The standard method is the timer, which defaults to 5 minutes. What is bad about it? If you have many dictionaries and many servers, you get a lot of connections, for example to MySQL, and that can become a burden.

You need to set up some replicas or otherwise spread the load. But if you are lucky enough to have the source table in MyISAM, ClickHouse itself notices that the dictionary has changed: a MyISAM table exposes its last modification time in its status, and ClickHouse knows how to look at it and pick up the changes.

Another great way is to touch the config. If you log in to the server and touch the file describing the dictionary, ClickHouse decides the dictionary has changed and reloads it. But you have to walk through every server in the cluster.

And finally, the last method, which we invented but have not tried: put the dictionary source in a shared file somewhere and load it from there. A shared file is easier to change than walking through all the nodes.

In general, as Victor noted, dictionaries took away 80% of the joins, so for us this is a real silver bullet - a whole shop of silver bullets, even - that makes the star schema (the star in the cube) at least workable.

But tables are still needed sometimes.


First, tables are needed for the keys. Second, for attributes from web traffic that do not fit into dictionaries. Third, for joins by a complex key: if you have a composite key and for some reason cannot reduce it to a hash, a dictionary will not do.

The same if you need to join on a range of dates: you have a table with date ranges and some value, and you need that value as of today, or yesterday, and so on. A specific use case; it can always be solved some other way, but it does not work with dictionaries.

As soon as a table appears, the question "how do we update the table?" appears. ClickHouse says you should not do that, and if you really need to update, rewrite it entirely. Well, if you really want to, you can.

For this there is such a thing as ReplacingMergeTree. It does, in principle, let you update a record, but it is very much eventually consistent: you add a new record, and at some point it becomes the current one, but you do not know when. You can call OPTIMIZE and then, with luck, it becomes current immediately, or you can add FINAL to the query and it also pulls out the current record. Yandex says FINAL is slow and inefficient, though for small tables there is no difference.
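A sketch with a hypothetical table, using the old-style engine syntax of that period (date column, primary key, index granularity, version column):

```sql
CREATE TABLE campaigns
(
    event_date  Date,      -- the mandatory partitioning date
    campaign_id UInt64,
    name        String,
    version     UInt64     -- the row with the max version wins on merge
) ENGINE = ReplacingMergeTree(event_date, (campaign_id), 8192, version);

-- An "update" is just an insert with a higher version:
INSERT INTO campaigns VALUES (today(), 42, 'new name', 2);

-- Either force the collapse now (may be expensive)...
OPTIMIZE TABLE campaigns;
-- ...or read the current row without waiting for merges:
SELECT * FROM campaigns FINAL WHERE campaign_id = 42;
```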

Updating individual fields is inconvenient; updating the whole record is what works. Separate fields are awkward because the current values have to be pulled out with a separate select and then written back.

There is also the problem that in ClickHouse every table must have a partitioning key, which is a date. Even if you do not need a date, you still have to add one. And sometimes you are tempted to put something useful into that date; but if you do, ReplacingMergeTree stops working for you, because rows in different partitions never collapse.

Deleting is even worse. You cannot delete from a table; you can only drop it entirely, or overwrite everything, say with zeros. For facts there is such a thing as CollapsingMergeTree, which was explained in detail at the last meetup. It is an analogue of a banking "reversal": you add a copy of the record with the opposite sign, and then add an entry with the correct values. It is again eventually consistent, that is, someday it will be correct. Properly speaking, you can only correct metrics this way, that is, money or plain numbers; and usually it is the result of aggregation that gets corrected rather than the data itself. The data may someday be corrected too, but not necessarily.
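A sketch of the "reversal" pattern with a hypothetical facts table:

```sql
CREATE TABLE fact_events
(
    event_date  Date,
    campaign_id UInt64,
    clicks      UInt64,
    cost        Float64,
    sign        Int8       -- +1: real row, -1: cancels an earlier row
) ENGINE = CollapsingMergeTree(event_date, (campaign_id), 8192, sign);

-- Correct a previously written row (clicks 100 -> 90, cost 5.0 -> 4.5):
INSERT INTO fact_events VALUES (today(), 42, 100, 5.0, -1);  -- the reversal
INSERT INTO fact_events VALUES (today(), 42, 90, 4.5, 1);    -- the correction

-- Until the rows actually collapse, multiply metrics by sign in aggregates:
SELECT campaign_id,
       sum(clicks * sign) AS clicks,
       sum(cost * sign)   AS cost
FROM fact_events
GROUP BY campaign_id;
```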

The next thing to discuss before going on is sharding and data distribution. Every company has different requirements here. Yandex, for example, can shard by account or client: they have a huge number of users who spread very well across servers. As I recall, they built a two-level scheme for this, and everything is very convenient.

Our company is not like that. Companies like ours, ad networks, have perhaps not so many clients and have different analysis needs. You cannot shard the data cleanly, and you have to invent other ways to do it well. You really need to think about it, because the performance of both loading and queries depends on the sharding.

Further, ClickHouse has such remarkable things as replicated tables. You take one rabbit and put it on two different servers, and you get two rabbits. Or on three servers, and you get three rabbits. One server falls over, the second remains.

There is also such a thing as a distributed table. You take a rabbit and cut it into pieces (not literally), and put the pieces on different servers. To get the whole rabbit you have to go to all the servers, but each small piece is handled faster, and the result comes out the same.

These approaches should be combined: you have a rabbit, and each piece of it lies on several servers. You can lose a server and still get to your data quickly.

If you sum up all the previous knowledge, you get a FarmVille farm built out of ClickHouse: the facts are laid out across many, many shards, the shards are replicated, and on top of them sits a Distributed table to read from all of it.

Each shard is replicated at least once, preferably twice. Dimension tables are replicated to all nodes, because you need to filter the data on every node. And some tables can be stored alongside the facts, co-sharded with them: the ones that arrived together with the traffic. Since they came with the data, we keep them next to the data locally and send nothing extra over the network.

And if the design side is done correctly, you can build loading that does nothing superfluous: it copies nothing it does not need and sends nothing anywhere it should not. To make it work, try not to load through the Distributed table: cut the rabbits into pieces yourself and put the pieces into the right shards. Use Distributed only for selects.

In addition, even if you do not need replication, it is better to use Replicated tables.

Because in that case ZooKeeper is working and keeping checksums of the inserted blocks. If something breaks in the middle of a load - there are no transactions, so you do not know what got written and what did not - you can simply retry the same insert: ZooKeeper compares the checksums and writes only the blocks that, from its point of view, have not been written yet. This gives a kind of pseudo-consistency. Temporary and intermediate tables also always help.

You can write many small chunks of data into intermediate tables, then combine them and write them to the main table in large blocks. Or write into a temporary table, do some manipulations - attach some dictionaries, do joins, transform the data - and only then write into the main table. Done correctly, loading is very fast; we made it faster than Vertica, definitely, not by much, but still.
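A sketch of that staging pattern under a hypothetical schema; all table and column names are illustrative:

```sql
-- A cheap local buffer table; a simple Log engine is enough here.
CREATE TABLE fact_staging
(
    event_date  Date,
    campaign_id UInt64,
    country_id  UInt64,
    clicks      UInt64,
    cost        Float64
) ENGINE = Log;

-- ...many small INSERTs land in fact_staging...

-- Then one large, enriched insert into the shard-local main table:
INSERT INTO fact_events_local
SELECT event_date,
       campaign_id,
       dictGetString('country', 'country_name', country_id) AS country,
       clicks,
       cost
FROM fact_staging;

DROP TABLE fact_staging;   -- recreate it for the next batch
```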

Another sensitive issue that everyone trying to move something to ClickHouse will face is aggregation.

What is it?


If you have a bunch of carrots, you can take one carrot, write the number 10 on it and say it is a new carrot. But it will not be the same as the originals; it is no longer just a carrot, it has lost its individuality. Yandex believes aggregation is not needed at all, precisely because you lose the details: better to keep everything, always.

That said, not everyone at Yandex is of this opinion, because ClickHouse itself has tools for aggregation: AggregatingMergeTree and SummingMergeTree. They work as tables into which you insert rows, and on the way out they aggregate what you inserted and store the aggregated, compressed data.

Best of all, if you go this route, you can define an AggregatingMergeTree or SummingMergeTree as a materialized view over the facts. You simply insert facts, and off to the side, magically, an aggregate appears. It has its own problems and subtleties, which I will not go into here, but it is nevertheless a working solution and may help someone.
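A sketch of that setup: a materialized view over a hypothetical facts table, with a SummingMergeTree target maintaining daily totals as facts are inserted:

```sql
CREATE MATERIALIZED VIEW fact_daily
ENGINE = SummingMergeTree(event_date, (event_date, campaign_id), 8192)
AS
SELECT event_date,
       campaign_id,
       sum(clicks) AS clicks,
       sum(cost)   AS cost
FROM fact_events
GROUP BY event_date, campaign_id;

-- Inserts go into fact_events as usual; fact_daily fills itself.
```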

Then we got to accessing the data. Straight away there is a lot of interesting specificity not found in other databases: arrays and higher-order functions for working with them, or ARRAY JOIN, which unfolds an array into rows. There is a rather unusual scoping of select expressions: you can put an alias on an expression or a column anywhere and then use it anywhere else, not only in HAVING but further along the same select. Sometimes it is convenient, but sometimes it leads to errors that are very hard to catch, especially if you reuse a name that is already used somewhere.
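A small illustration of that alias reuse, with a hypothetical table and columns:

```sql
SELECT clicks + impressions AS events,
       events * 2           AS doubled   -- reuses the alias just defined
FROM fact_events
WHERE events > 100;                      -- aliases work even in WHERE
```

Convenient, right up until `events` accidentally shadows a real column somewhere.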

There is also interesting join specificity that is not SQL at all, not standard at all. ClickHouse never guarantees uniqueness of the primary key, so to cope with that they came up with ANY vs ALL JOIN.

What is it?

ALL JOIN returns everything on the other side that matches the join condition, while ANY returns a single record, the first one found. So if the tables you join contain non-unique records, that is not a problem for you: with ANY you receive at most one match per key.
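With hypothetical `facts` and `campaigns` tables, where `campaigns` may contain duplicate keys:

```sql
-- ALL: every matching right-side row comes back, so fact rows can multiply.
SELECT campaign_id, name
FROM facts
ALL LEFT JOIN campaigns USING campaign_id;

-- ANY: at most one right-side row per key, the first one found.
SELECT campaign_id, name
FROM facts
ANY LEFT JOIN campaigns USING campaign_id;
```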

There is interesting specificity here too: PREWHERE vs WHERE. PREWHERE is evaluated first, reading only the columns it mentions; the remaining columns are read, and WHERE applied, only for the rows that pass. On large scans that saves a lot of reading.

GLOBAL IN and GLOBAL JOIN are among the things we liked for distributed queries. And approximate calculations - sampling and the probabilistic models Victor spoke about - can also be used in many cases where the data array is very large and exact accuracy is not required.

But this is all good.

Now, the pain. What is missing from this dialect of SQL - which is sometimes hard to even call SQL - that you really want.

First, auto-casting; I will keep repeating this. Second, aliases for tables, and it is understandable why they are absent: with this join syntax you can join a table only once per select - joining it a second time simply will not work - so there is nothing to disambiguate and table aliases are unnecessary.

To do two JOINs you have to nest a sub-select and join the second time one level up; I will show this later. And JOIN supports only USING, which means the column must be named the same on both sides and have the same type. If the type differs, you again have to make a sub-select to cast or rename it.
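A sketch of both workarounds at once, with hypothetical tables: the second join sits one level up from a sub-select, and a mismatched key is cast and renamed inside it:

```sql
SELECT campaign_id, campaign_name, country_name
FROM
(
    -- first join, plus casting/renaming the key for the second one
    SELECT campaign_id,
           name AS campaign_name,
           toUInt64(geo_id) AS country_id
    FROM facts
    ANY LEFT JOIN campaigns USING campaign_id
)
ANY LEFT JOIN countries USING country_id;
```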

Standard SQL lets you write group by / order by 1,2,3..., but ClickHouse did not support positional references. For slicing and dicing, where the select, group by and order by lists are generated dynamically, order by 1,2,3... saves a lot of effort, and its absence is felt.

There are no NULLs and no coalesce, which we were very used to.

Drivers were another pain: there was no native JDBC driver then, only the HTTP interface, so the missing pieces had to be written around it ourselves.

Most of this is worked around with sub-selects, especially around join. A join can appear only once at each level, so to join twice you nest sub-selects, and conditions from where that concern the joined table you have to push down into the sub-select yourself. It works, but you have to get used to it!

It is great, though, how you can rewrite coalesce, with NULLs replaced by zeros, through the higher-order function arrayFilter. It is my favorite example. What immediately surprises developers here - and I hope surprises you too - is that arrays are numbered starting from 1, not from 0. That happens too.
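The trick looks roughly like this, with hypothetical columns a, b, c where 0 plays the role of NULL:

```sql
-- Drop the zeros and take the first survivor; note the index is 1, not 0.
SELECT arrayFilter(x -> x != 0, [a, b, c])[1] AS first_non_zero
FROM some_table;
```

If all three are zero the filtered array is empty, and indexing past the end yields the type's default value, 0, so the "no value" case degrades gracefully.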

And the next thing - I do not know whether it is pain or not - is administration.


Administering ClickHouse looks like this:


It is a "skilled hands" kit, do-it-yourself: you have all the parts, and you can assemble anything out of them.

Everything is very, very flexible and very, very manual. One of the main pains is that any DDL operation (CREATE TABLE, ALTER TABLE or anything else) has to be performed on every node. You go to each node and run the create or alter, and if you forget a node, some query will eventually fail on it.

Then there is ZooKeeper. Replicated tables do not work without it, so alongside a ClickHouse cluster you also end up administering a ZooKeeper ensemble: a separate system with its own quirks. If something goes wrong with ZooKeeper, replication stalls until you sort it out.

The usual rule, "as long as it works, don't touch it", applies, but you still have to watch it. Monitoring, again, you assemble yourself: neither ClickHouse nor ZooKeeper came with anything ready-made.

Then there is what I would call the Yandex Way: ClickHouse is developed first of all for Yandex's own use case. If your use case is close to theirs, everything works wonderfully; if not, you will have to adapt, and your feature requests may wait a long time.

Releases of ClickHouse come out often, and there is no separate notion of a stable release. Our approach is to run one release behind the latest, "latest minus one", and let someone else catch the fresh bugs first.

Yandex does run unit tests and tests everything on its own data, but regressions still slip through, so test every upgrade on your own data and your own workload before rolling it out.

And finally: ClickHouse is not a traditional RDBMS, and you should not expect it to behave like one. Things taken for granted in MySQL, Oracle or Vertica may simply not exist here, while things impossible there may turn out to be easy.

Nevertheless, despite everything, we made the move, and it works. ClickHouse is worth it.

That is all I have.



Source: https://habr.com/ru/post/322620/

