
Evolution of the Analytical Infrastructure (continued)

In the previous article I explained how and why we chose Vertica. In this part I will try to describe the features of this unusual database, which we have been using for more than two years. Writing this article took a little longer than planned, mostly because I had to be technically detailed on the one hand and accessible on the other, while not violating the NDA. In the end I chose a compromise: I will try to describe how Vertica is built and how it works in principle, without going into the details.

Part 3. Vertica. Simply Fast


Simply Fast: this Vertica slogan did not come out of nowhere. It really is very fast. Fast even with the "out of the box" settings, as our tests showed while we were selecting the solution. During the migration of our infrastructure we learned how to make Vertica even faster and squeeze maximum performance out of it. But first things first.

To begin with, Vertica lacks quite a few familiar things. This may be surprising, but compared with many popular SQL databases, Vertica has no:

If the absence of indexes is compensated by a special way of storing data, the remaining limitations are rather a forced necessity, the price paid for performance. Some things are not very convenient to do, but they are not a serious problem. The lack of a cache deserves a separate mention. Vertica assumes that the data volumes used for analysis (and these are terabytes) will not fit into a cache anyway, so there is no point in having one. Every query goes to disk for its data. There is a special area of memory used for intermediate storage during writes and updates, but as a rule everything lives on disk. This somewhat contradicts the established belief that the slowest part of a database is the disk subsystem, and that one should therefore strive to keep as much data as possible in memory. And yet.

I will not dwell on the high load performance: it is essentially limited by how fast the disks can write, given that the data is encoded to reduce its volume. From the users' point of view, the performance of SQL queries is much more interesting. Vertica is a specialized analytical database, so you should not expect high performance from it on simple key lookups. Its strength is fully revealed on aggregate queries with filters, which are typical for analytics.
High performance of analytical queries is achieved through several features of the physical organization of data storage:

- column-oriented storage;
- specific encoding methods for the columns;
- the ability to have many physical representations (projections) of a table, optimized for different tasks;
- near-linear scaling across servers and cores (MPP).


Let us dwell on these features and how they affect performance in more detail.

Column-oriented storage

Tables used in analytical tasks usually have tens or hundreds of columns describing some events (facts): sales statistics, statistics of calls or Internet connections, statistics of impressions and clicks, price dynamics of a market asset, and so on. The characteristics of an event are usually called dimensions, and all of them together form a multidimensional data cube. However, a person never analyzes all the characteristics at once; they look, first, at certain slices of the data and, second, at projections onto a relatively small number of dimensions. (Decision theory specialists established long ago that a person cannot analyze more than seven parameters at once.) Translated into database language, a slice is a predicate, a where condition, and a projection onto specific dimensions is aggregation with group by. I am simplifying a little; there are also joins, analytical functions and so on, but those are details. In any case, out of all the columns of a table, a query uses only a small number. Therefore, the ability to read from disk not all the data but only part of it speeds up queries considerably. Physically, each column is stored in one or several relatively small files that never change (they are only created or deleted), so reading a column is a very fast operation.
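For clarity, here is what a typical "slice plus projection" query might look like; the table and column names below are made up for illustration:

    -- A slice (the WHERE predicate) plus a projection onto two dimensions
    -- (the GROUP BY): only four of the table's many columns are touched,
    -- so a column store reads just those columns from disk.
    SELECT country, device_type, SUM(amount) AS revenue
    FROM   sales_facts
    WHERE  event_date BETWEEN '2012-06-01' AND '2012-06-30'
    GROUP  BY country, device_type;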

Specific Encoding Methods for Columns

In Vertica, column encoding serves two purposes:

- fast data retrieval, replacing indexes;
- more compact placement of data on disk.


As you remember, there are no indexes in Vertica. Instead, the data itself is "laid out" on disk so that it can be found quickly. The layout is called a projection. This hints that a projection may contain not all of the columns, but we will come back to that later. Fast search is achieved through a combination of ordering (order by) and RLE (run length encoding, i.e. storing a value together with the number of its consecutive repetitions). How RLE works together with ordering is easiest to show with an example. Imagine a table with the fields event_date, amount and another 100 fields. There are 365 days in a year and, say, a billion records per day. If a typical query is to sum up money over a period (a day, a week, etc.), then we use RLE encoding for event_date and sort by event_date. Once the table is sorted by event_date, the event_date column consists of blocks of 1,000,000,000 identical values. RLE encoding means that physically, for each day, the event_date column stores only the date and the number of repetitions (a billion in our case). That makes 365 such entries per year; reading them from disk and finding the right one takes almost no time. The event_date entry contains something like a pointer to the blocks with the remaining columns. So to execute "select sum(amount) from t where event_date='2012-07-01'", it is enough to read the event_date column and then the amount block or blocks that belong to the requested date. This is very fast. Obviously, there can be several ordered RLE columns, and obviously this encoding works best for columns with a small number of unique values, which is typical of dimensions.
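A sketch of what such a projection might look like in Vertica's DDL; the table is the hypothetical t from the example above, and the projection name is made up:

    -- A projection sorted by event_date with RLE encoding: on disk the
    -- event_date column collapses into roughly 365 (value, repeat count)
    -- pairs per year instead of billions of rows.
    CREATE PROJECTION t_by_date (
        event_date ENCODING RLE,
        amount
    )
    AS SELECT event_date, amount FROM t
    ORDER BY event_date;

    -- The query from the text then reads only the tiny event_date column
    -- and the amount blocks belonging to the requested date.
    SELECT SUM(amount) FROM t WHERE event_date = '2012-07-01';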

In general, this is somewhat similar to an index, but it works a bit differently. Imagine that we also have a column account_type (two values) and, say, department (five values), and that we often need to search by them as well. With indexes, we would have to create a separate one for each case. In Vertica, a single projection is enough, and it will serve searches on all three columns in any combination, for example with the ordering: order by account_type, department, event_date, all columns RLE-encoded. It might seem that searching by event_date alone would then "break": with an index on account_type, department, event_date that is exactly what would happen. In Vertica, however, a query on event_date will run only slightly slower than in the previous example. Instead of one event_date block, you now have to read ten of them (2 account_type * 5 department), find "your" event_date in each, and read the corresponding amount blocks. Given how small the event_date blocks are thanks to RLE, the difference is almost imperceptible.
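Under the same assumptions, the combined projection could be sketched like this:

    -- One projection ordered by the three low-cardinality columns, each
    -- RLE-encoded, serves searches on account_type, department and
    -- event_date in any combination.
    CREATE PROJECTION t_by_type_dept_date (
        account_type ENCODING RLE,
        department   ENCODING RLE,
        event_date   ENCODING RLE,
        amount
    )
    AS SELECT account_type, department, event_date, amount FROM t
    ORDER BY account_type, department, event_date;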

I hope the idea is clear. Instead of an index, some of the columns on disk are sorted and cleverly linked to each other. This is usually illustrated with a picture or a table, which, unfortunately, will not appear here.

Besides search, RLE encoding is also used to speed up group by, since the data is already laid out in such a way that sometimes no sorting is needed at all. As with search, this acceleration (the so-called pipelined group by) can also kick in when the group by field is not in the first position of the projection's sort order.
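A sketch, assuming the projection from the previous example: when the group by columns follow the projection's sort order, Vertica can aggregate the stream as it arrives instead of building a hash table (if memory serves, the plan labels this a pipelined rather than a hash group by).

    -- account_type and department form a prefix of the projection's
    -- ORDER BY, so rows arrive already grouped and can be aggregated
    -- on the fly, without an extra sort or hash table.
    SELECT account_type, department, SUM(amount)
    FROM   t
    GROUP  BY account_type, department;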

Vertica supports many other column encodings, but the rest serve mainly to pack the data more compactly, which of course also affects performance, though not primarily.

In addition to encoding, Vertica supports traditional data partitioning and, accordingly, partition elimination (partition pruning), which in some cases further reduces the amount of data read from disk.
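A sketch of monthly partitioning for the same hypothetical schema (the partition expression below is one common idiom, not necessarily the one we used):

    -- Partition the fact table by month; queries filtering on event_date
    -- can then skip whole partitions before touching any column data.
    CREATE TABLE t (
        event_date   DATE,
        account_type VARCHAR(16),
        department   VARCHAR(32),
        amount       NUMERIC(18, 2)
        -- ... the other ~100 columns
    )
    PARTITION BY EXTRACT(YEAR FROM event_date) * 100 + EXTRACT(MONTH FROM event_date);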

Ability to have many physical views of the table optimized for different tasks

After reading the previous section, a reasonable suspicion may arise that a single physical representation of a table, a projection, is not always enough, since even the most sophisticated RLE / order by combination does not cover every case. That is why Vertica supports multiple projections per table. At least one projection must contain all the columns (the super-projection); the rest may contain only a subset. This makes it possible to build small, specialized projections for different kinds of queries. The approach resembles materialized views, but MVs are a "bolt-on" feature, whereas projections are the primary and only way data is stored. Vertica's engineers managed to make the use of projections completely invisible to users: projections keep themselves up to date, and the query optimizer, as far as I can tell, correctly picks the most suitable projection for each query.
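You do not have to take the optimizer's word for it: the chosen projection can be checked in the query plan, and the catalog can be queried for a table's projections (a sketch; the system view name is as I recall it from the versions we used):

    -- Show which projection the optimizer picked for a given query.
    EXPLAIN SELECT SUM(amount) FROM t WHERE event_date = '2012-07-01';

    -- List the projections that exist for the table.
    SELECT projection_name
    FROM   v_catalog.projections
    WHERE  anchor_table_name = 't';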

Having grasped the basic principles, it is quite easy to design projections for specific tasks yourself. For beginners, or for the lazy, Vertica ships with a special utility, DB Designer, which helps generate an optimal design from a set of tables and test queries.

All design changes can be made on a live database. New projections are built from the existing super-projection. This, of course, takes a toll on performance, but for small projections (in terms of the number of columns) it does not take much time.

Linear scaling. MPP

Vertica scales almost linearly with the number of:

- servers (nodes) in the cluster;
- processor cores on each server.


Scaling across servers is achieved by evenly "spreading" (segmenting) projections over the servers (nodes) of the cluster. The segmentation method is specified by the developer; usually it is a hash function of one or several columns, but it can also be an explicitly specified condition. When a query is executed, each server processes its own part of the data, using all of its cores, and the results are then combined on the server that initiated the query. Nothing extraordinary, except that it really works. Not every table makes sense to segment. Small dimension tables, which are almost always used only in joins, are usually not segmented, and a full copy is kept on every node. For the same table, some projections can be segmented while others are not. All of this allows very flexible tuning of the physical design for specific tasks to achieve optimal performance.
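A sketch of both cases under assumed names (a large fact table and a small dimension):

    -- A large fact projection spread across the cluster by a hash of a
    -- high-cardinality column; each node holds only its share of the rows.
    CREATE PROJECTION facts_seg
    AS SELECT * FROM facts
    SEGMENTED BY HASH(user_id) ALL NODES;

    -- A small dimension kept unsegmented, i.e. fully replicated on every
    -- node, so joins against it never need to move data over the network.
    CREATE PROJECTION dim_country_all
    AS SELECT * FROM dim_country
    UNSEGMENTED ALL NODES;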

Besides performance, a cluster also gives you fault tolerance: each block of data is stored in two or more copies on different nodes of the cluster. The degree of duplication is called the K-safety level. With the standard level of 1, a cluster survives the loss of anywhere from one node (guaranteed) up to floor(N/2) nodes in the best case. However, we found that the performance of a cluster with one or more failed nodes may, for some (not all) kinds of queries, drop dramatically, by orders of magnitude. The Vertica folks confirmed that this is a bug and that it will be fixed (it already is in Vertica 6), and we managed to rewrite our queries so as not to hit this problem. Incidentally, hardware fails quite often, and we repeatedly had to live on a "lame" cluster.

Vertica supports transparent cluster expansion, and the process is fully controllable. After new nodes are added, all projections need to be rebalanced, either manually or automatically. On a production system it is better to do this manually during off-peak hours, since rebalancing loads the disks heavily. Starting with version 5, data is stored on disk in a way that makes rebalancing convenient (elastic cluster): it is enough to move blocks around without re-encoding them, which is much faster and cheaper in terms of performance.

A little more practice. We ended up running several copies of the database on Vertica on different hardware for different tasks (see my not very successful article about backup and data cloning). This happened because, for all its scalability, a cluster is relatively slow at short, simple queries; network latency and the extra "pass" needed to combine results from different nodes probably take their toll. The strength of a cluster shows when a query genuinely needs a lot of data, heavy sorting, and so on.

Vertica extensions

Vertica was one of the first RDBMS vendors to offer integration with Hadoop, so it was natural to try running the same task on both Vertica and Hadoop. In a way we followed in the footsteps of the paper "A Comparison of Approaches to Large-Scale Data Analysis". For the experiment we used a real task of counting unique visitors across a number of dimensions. It should be noted that the Vertica adapter is not very fast, so we tested both through the adapter and by first storing the data in HDFS.

We tried several implementations of this task. The test scenario was counting the number of unique visitors, broken down by country and a few other dimensions. The volume of test data was about 300 million records, of which 60-70 million were unique. We intentionally did not use a cluster, since both Vertica and Hadoop scale almost linearly and the results of a single-server experiment would be sufficient. It is clear that in reality there is little point in running Hadoop on a single server. The results were as follows:


As a result of this experiment, which was rather a side project for us, we confirmed that the integration works and that Hadoop, as expected, is rather slow. The benefit of writing queries directly in Pig/Hive is rather dubious, and it disappears entirely when the same task can be expressed in SQL. Perhaps we will yet encounter tasks that require full-fledged MapReduce.

Hadoop is not the only way to extend Vertica's functionality. Vertica lets you write UDFs in C++, including your own aggregate and analytical functions, and in the latest version it offers integration with the R language. We had tried integrating with R before (and, as an interesting exercise, even with J), solving problems that are awkward to express in SQL; ODBC or our own REST service was used for that. Now this integration is even simpler.
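Very roughly, plugging in a C++ UDx boils down to registering the compiled library and declaring the functions it exports; the library path, library name and factory name below are placeholders:

    -- Register the shared library once per database...
    CREATE LIBRARY AnalyticsLib AS '/opt/vertica/udx/analytics_lib.so';

    -- ...then expose a function from it; it becomes callable in plain SQL.
    CREATE FUNCTION my_scalar_fn AS LANGUAGE 'C++'
        NAME 'MyScalarFnFactory' LIBRARY AnalyticsLib;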

Conclusion

Vertica has now been in production for more than a year and a half. We have both single-server and cluster installations. Vertica has proven to be a very reliable database: we have never lost data or had an outage caused by the database itself. There were bugs in one of the early versions that led to incorrect results for some queries, but they were fixed quickly. There were performance problems, especially on the "lame" cluster, but they were solved too. At the moment we load about a billion facts per day into our Vertica databases (and on one of the servers, 5 billion), which are aggregated into a couple of hundred million records in various aggregates. Vertica executes about 10 thousand user queries per day, plus roughly twice as many queries from runtime systems, monitoring and other internal robots. Most user queries, which typically ask for statistics over days or weeks, run in 5-10 seconds, and the (fairly simple) runtime queries in 1.5-2 seconds. This is not an OLTP database but an analytical one, so these response times are quite satisfactory. Moreover, we are now migrating to a more powerful cluster, which cuts query execution time by a factor of 3 on average.

But time does not stand still, and we are thinking about the need for something like a multi-tier data warehouse or in-memory OLAP. More about that in the next article.

P.S. On July 11th I will be able to answer any Vertica questions at our meetup in Moscow: habrahabr.ru/events/836

Source: https://habr.com/ru/post/147254/

