
Evolution of analytical infrastructure

With this article I am opening a series of posts about analytical infrastructure in general and about Vertica, a database still exotic in Russia, in particular. The articles describe the experience of a number of projects at my company, LifeStreet, and do not claim to be exhaustive; still, where possible, I will try to give broader overviews. Before getting to Vertica itself, I want to tell a little about how we arrived at it. Let's start with the history of how the analytical infrastructure developed in our company.

Part 1. A bit of history, theory and practice


Traditionally, we follow an iterative process when developing anything new. First a quick prototype is built to get a feel for some subject or technology area. Then, starting from the prototype, the architecture and design are developed “properly”, preferring reasonably good solutions that are quick to implement over academically correct but long and complicated ones. Then the notion of “properly” changes, and the architecture is reworked “the way it actually needs to be”. And so on. All changes happen on a live and dynamically growing business, which demands a careful, evolutionary approach. So it was with the analytical platform.

The first version of the “infrastructure” was put together “on the knee” in two days back in 2006, when the company had 4 developers and about the same number of people on the business side. It was MySQL with a single table and one Java class that processed and loaded nginx logs. Even this minimal setup was already enough to start running marketing experiments and making money. Then came a long period of design and requirements specification, which produced a data model, the structure of facts, dimensions, metrics, aggregates, and so on. All of this was aligned with an ontology of online advertising that was being developed in parallel. Within a few months, the first version of our analytical data warehouse was built on MySQL, including all the main components: the physical and logical schema, and ETL written partly as stored procedures and partly in Java. As the client-side tool we began using MicroStrategy, which we later abandoned in favor of our own development. Looking ahead, I will note that this infrastructure worked for 4 years with minor changes, and the company rapidly conquered markets on top of it.
However, we understood that the MySQL solution was temporary: the business required a scalable infrastructure capable of handling up to a billion facts per day. Optimistic estimates promised up to 400 terabytes of raw data per year. So we made a very reasonable, but wrong, decision to go with Oracle. Over the course of a year, we built a data warehouse on Oracle, on proper hardware and software and in line with standard Oracle practices such as materialized views, and it even worked, and quite quickly. In addition to the speed of Oracle itself, we were able to scale ETL horizontally. If only it weren't for one “but”: Oracle turned out to be impossible to support within the processes going on in the company. And the processes were simple: constant incremental development, adding new fields, entities, and so on. Any new field that even indirectly affected a materialized view triggered a cascade of rebuilds that completely blocked the system. Attempts to overcome this came to nothing. In the end, we abandoned Oracle and decided to suffer with MySQL a bit longer until we found a better option. Around that time, by the way, we came across Tokutek and its product TokuDB.
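
To make the pain concrete, a typical aggregate in such a warehouse is a materialized view of roughly this shape (an illustrative sketch in Oracle's dialect; the names and columns are invented, not our actual DDL):

-- every change to fact_table that such a view depends on, even an indirect one,
-- can force the view (and everything built on top of it) to be rebuilt
create materialized view mv_daily_country as
select trunc(f.event_date) as event_day,
       g.country,
       sum(f.impressions)  as impressions
from fact_table f
join dim_geo g on g.geo_key = f.geo_key
group by trunc(f.event_date), g.country;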

Tokutek, located in the suburbs of Boston, is one of the many companies in the MySQL ecosystem doing good work. TokuDB is a storage engine for MySQL with several unique properties that are very useful for analytical applications. This is probably a good place to step aside for a moment and talk about schema design for analytical applications.

The traditional approach to designing a database schema for analytics is the so-called star schema. The idea is that there are large central fact tables and many relatively small dimension tables. A fact table contains metrics and dimension keys. If you draw the tables and relationships in some editor, or simply on a whiteboard, you get a star or a sun, with the fact table in the center and the dimension tables at the ends of the rays; hence the name. In dimension tables, data is often denormalized and organized into hierarchies. The textbook examples are the hierarchies of time and geography. For example, the following table:

dim_geo
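
A sketch of what such a dimension might look like (the exact columns and rows here are purely illustrative):

create table dim_geo (
    geo_key int not null primary key,
    city    varchar(64),
    state   varchar(64),
    country char(2)
);

-- geo_key | city          | state      | country
--       1 | San Francisco | California | US
--       2 | Los Angeles   | California | US
--       3 | New York      | New York   | US
--       4 | Moscow        | Moscow     | RU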

From the point of view of data normalization this is bad, since the same state and country are repeated many times. But from the point of view of the analytical model it is, on the contrary, convenient and efficient, since it avoids extra joins.

Star schemas and hierarchies dictate typical query patterns. For example, suppose we want daily statistics on the number of impressions of our ads in a particular country.

select date(event_date), sum(impressions) from fact_table join dim_geo using (geo_key) where country = 'RU' group by 1; 


What is notable about this query? Those who can read execution plans will immediately guess that it will produce a large index range scan on geo_key for Russia, if fact_table has an index on geo_key, and a full scan if it does not. Then grouping and aggregation by date(event_date). If the table has hundreds of millions or billions of rows, such a query is unlikely to be fast on MySQL.

Even from this simple example we can conclude that, first, analytical queries “like” a variety of indexes. Second, these indexes are sparse, that is, a single index value corresponds to many records. And third, aggregating and sorting a large number of rows is a fairly typical operation. We will return to this later; for now, back to the capabilities of TokuDB and how they help in analytical applications.

The most important one is the proprietary index technology that uses not binary trees but fractal trees. They have nothing to do with fractals, it is just a nice word, and I will not explain in detail how they work: it is complicated. (Although, if there is interest, I can write a separate article, or point you to an old post on mysqlperformanceblog with links to the academic materials.) From the user's point of view, the key difference from binary trees is the performance of inserts and deletes on large tables. Those who work seriously with databases know that in MySQL, once a table reaches hundreds of millions of rows or more, adding new records can take considerable time, especially if the table has several indexes that do not fit into the index cache. This happens because, first, to “walk” the index you have to read it in pieces from different places on disk, and second, adding a new leaf can cause (and does cause) an index rebalance, which moves or rewrites significant parts of the index. In short, a serious problem. TokuDB's fractal indexes solve it: the performance of inserts and deletes degrades very little as the table grows and as indexes are added. Among other things, this is achieved by laying out the whole path through the index tree on disk so that it can be read sequentially. Plus clever caching.
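
To give a sense of scale, here is a heavily simplified sketch of a fact table with a handful of secondary indexes (the column names are illustrative):

create table fact_table (
    fact_id      bigint not null auto_increment,
    event_date   datetime not null,
    geo_key      int not null,
    campaign_key int not null,
    impressions  int not null,
    clicks       int not null,
    primary key (fact_id),
    key idx_event_date (event_date),
    key idx_geo (geo_key),
    key idx_campaign (campaign_key)
) engine = TokuDB;

With hundreds of millions of rows, every insert into a table like this has to update three secondary indexes; on a B-tree engine that means scattered disk reads and rebalancing, which is exactly where fractal indexes help.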

The second major advantage of TokuDB is clustering indexes. The idea is that the entire record is stored together with the index. This significantly affects query speed, especially for queries over sparse indexes, because otherwise the index first yields references to the data, and only then is the data itself read. In MySQL, MyISAM always stores the index and the data separately, while InnoDB stores the data sorted by the primary key and keeps the remaining indexes separate. For analytical workloads, a single index is usually not enough.
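
In TokuDB's SQL dialect a clustering secondary index was declared roughly like this (syntax quoted from memory, so treat it as a sketch):

-- the clustering keyword tells TokuDB to store a full copy of each row inside
-- this index, so a range scan over geo_key never has to go back to the primary
-- key for the remaining columns
alter table fact_table add clustering index idx_geo_clustered (geo_key);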

These two competitive advantages led us to throw Oracle out and live on MySQL with TokuDB for another 3 years. The product was constantly evolving, and Tokutek readily provided support even though we were on a free license. By the time we had almost moved to Vertica, two even “sweeter” features had appeared in TokuDB: hot index creation and hot table modification. In the standard MySQL engines these operations lock the table, and if the table is several tens of gigabytes, they lock it for a long time.
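
The statements themselves are perfectly ordinary; only their locking behavior differs between engines (the column and index names are illustrative):

-- in the stock MySQL engines of that era both statements rebuild the table
-- under a lock; TokuDB advertised performing them "hot", without blocking
alter table fact_table add column conversions int not null default 0;
alter table fact_table add index idx_date_geo (event_date, geo_key);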

So, by the end of 2009 our infrastructure is MySQL with TokuDB, loading about 150-200 million facts per day. The raw facts are not kept for long; instead, during loading they are rolled up into a couple dozen aggregates of different granularity. Some aggregates are kept “forever”, others store data only for a few days or weeks. It could be longer, but we are constrained by the free TokuDB license, so no aggregate exceeds 50 GB. We have several more or less identical systems as backups or for specific tasks. The main client interface is MicroStrategy, but we have already begun moving to our own solution, having developed the first versions of a universal analytical query service in the then-new Scala language. We have problems with both performance and scalability; they are not yet crippling us, but they already bite. We have become true professionals at configuring MySQL and designing applications for it the “right” way. But it is not enough.
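
Rolling facts up into aggregates during loading can be sketched roughly like this (an illustrative sketch; the real aggregates, keys and granularities were different):

-- agg_daily_geo has a unique key on (event_day, geo_key), so loading a new
-- batch simply folds fresh facts into the existing totals
insert into agg_daily_geo (event_day, geo_key, impressions, clicks)
select date(event_date), geo_key, sum(impressions), sum(clicks)
from staging_facts
group by 1, 2
on duplicate key update
    impressions = impressions + values(impressions),
    clicks = clicks + values(clicks);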

Part 2. Choosing a specialized analytical database


TokuDB bought us enough breathing room to calmly run an internal competition for the best specialized analytical database. To understand what can be special about such a database, let us return to the example with country statistics from the first part.

Imagine that fact_table has 100 columns. This is quite a realistic figure: today we have even more. The query uses only three of them, 'event_date', 'impressions' and 'geo_key', but a traditional database stores data in rows. Regardless of how many columns a query actually needs, the entire row is read from disk, which is very inefficient for wide tables. This unpleasant effect was noticed quite a long time ago, and the first database to use a column-oriented internal storage structure was Sybase's Sybase IQ, developed in the mid-90s. As far as I understand, it never got the recognition it deserved. In the mid-2000s, however, the topic was revisited, and the pioneer was the well-known Michael Stonebraker with the C-Store research project, which later became the basis of the commercial platform Vertica. I learned about C-Store in mid-2008 from a translated article on CITForum, and that was a turning point.
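
A back-of-envelope estimate shows why this matters: if each of the 100 columns takes, say, 8 bytes on average, a row store reads on the order of 800 bytes per row for this query, while a column store touches only the three needed columns, roughly 24 bytes per row, a difference of more than thirty times in disk I/O even before compression (the byte sizes here are assumptions chosen purely for illustration).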

The advantages of column-oriented storage are not limited to the fact that only the columns needed by a query have to be read from disk. Data within a column usually compresses very well. For example, if a column contains only two distinct values, they can be encoded with bits; if a handful, with a dictionary; if the values cluster around some center, you can encode deltas. And so on, there are many options. All of this lets you compress data several times or even dozens of times, significantly reducing the volume of disk operations. Another obvious property is that adding columns is very easy, practically free. There are other, less obvious advantages as well. What all these properties have in common is that they suit databases used for analytical tasks and analytical query patterns very well.

At the end of 2009, there were only a few commercial databases on the market that offered column-oriented data storage.

They were at varying degrees of readiness and availability, but we tried them all except Oracle's. For testing, we prepared a table with 25 columns and 12 billion rows. Eight test queries measured various aspects of performance; all of them involved summation and filtering, and some also grouping and subqueries. The reference system was MySQL with TokuDB. Where a database could scale by adding servers to a cluster, we also tested how much that improved performance.
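
For flavor, the test queries were of roughly this shape (an illustrative sketch, not one of the actual eight; the table and column names are invented):

select campaign_key, sum(impressions), sum(clicks)
from fact_table
where event_date >= '2009-10-01'
  and geo_key in (select geo_key from dim_geo where country = 'US')
group by campaign_key;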

The project took about two months of one person's time, since each system required some study, tuning and data loading. The results of all experiments were collected in a summary table, and Vertica came out ahead of everyone. On a single server it ran 20-100 times faster than the reference MySQL, and 2-4 times faster than Greenplum, which took second place. Vertica scaled linearly, unlike Greenplum, where adding a second server improved performance by only half. And, as we understood later, that was not the limit: with a better physical design Vertica could be even faster. The choice was obvious; all that remained was to negotiate a favorable contract and begin migrating our infrastructure to the new database.

About that, and about Vertica's features, in the next article.

Source: https://habr.com/ru/post/146224/

