A few words of introduction
HP Vertica has already been described on Habré and in other places, but most of that information boiled down to theory. Until recently Vertica (we call her "Vertika" and I propose to treat the name as feminine) was used in real production mostly in the States and a little in Europe, until the guys from LifeStreet Media wrote about it on Habré. A year and a half has passed since we started working with Vertica; our data warehouse now holds dozens of terabytes of data. Per minute the server processes thousands of queries, many of which go over tens of billions of records. Data loading runs non-stop in real time at roughly 150 GB per day... In short, I decided it was worth filling the gap and sharing what it feels like to ride genuinely modern technology under BigData.
Who will find this useful
I think this will be useful for developers, architects and integrators who face the task of storing and analyzing data that is big in volume, content and complexity of analysis. Moreover, Vertica now finally has a sane, free, fully functional Community Edition. It lets you deploy a cluster of up to 3 servers and load up to 1 TB of raw data into the warehouse. Given Vertica's performance and how easily solutions are deployed on it, I think this offer is worth considering when choosing a data warehouse for companies whose data volume fits into 1 TB.
One paragraph on how we chose it
Briefly, and without trying to start a holy war:
When choosing a data warehouse server we cared about the pricing principles, high performance and scalability with large data volumes, the ability to load data in real time from many different sources, ease of launching projects on our own and minimal maintenance cost. On all of these criteria Vertica came out on top for us, beating IBM Netezza and EMC Greenplum. The latter could not fully meet all our requirements, which would have meant additional development and maintenance costs for our project, whose budget is not that large.
What does Vertica look like from an architect's point of view?
The architect is the most important person in a Vertica data warehouse project: the success and performance of the warehouse depend on them first of all. The architect has two difficult tasks: to choose the right hardware for the Vertica cluster and to design the physical database model correctly.
What does the technical architecture affect?
Here are my personal impressions, in decreasing order of criticality:
- Being an MPP server, Vertica first of all places strict requirements on the network architecture. If you have 3 servers in a cluster and a 3 TB database, the data exchange between them and the users is barely noticeable in terms of network cost. If you have 6 servers and 30 TB of data, at peak load the network can become the weakest point of the whole cluster. Now imagine 100 servers and a database of more than a petabyte. Scary? :)
- Correctly assembled arrays on the local disks of the cluster servers give you a good balance between performance, storage volume and hardware cost. Built a RAID on cheap, slow, high-capacity disks? Cheap and cheerful, but slow. Put everything on fast disks of smaller capacity? Expensive and fast, but there is not enough room for the database and you have to add yet another server to the cluster. Read the documentation on Flex technology in Vertica, hung both fast and slow disks on the cluster servers, and distributed the storage of table columns (projections) according to whether queries use them as filters or as returned data? Well done, you have the best cluster for the job. If you also add local fast disks to the servers and move the Vertica temp space onto them, everything looks great.
- An open secret: there is never too much memory, and it is cheaper to buy servers with plenty of RAM straight away than to add it to existing servers later. Remember that analytical queries over billions of records want a lot of memory; nobody has repealed physics. A single session of a single analyst can easily eat 20 gigabytes. Count your analysts, count the heavy queries your ELT will run, and do not panic and start pricing a terabyte of RAM: Vertica lets you balance resource consumption with resource pools. Starting from the estimated storage volume divided by the number of servers in the cluster, and taking the expected load into account, you can find the golden mean for the amount of RAM per cluster server and insure yourself against peak loads with resource pools.
- What always surprises me in Vertica after Sybase IQ is the low load on the processors. Where IQ keeps all the cores busy at full tilt during heavy queries, Vertica under the same conditions looks as if the processors are on holiday compared to Sybase. Both servers are column-oriented, both compress all stored data. A miracle, and yet the fact remains: the CPU is not the most critical resource for Vertica, although I hope you will not read this as a claim that 2 cores are enough. Miracles do not happen: as everywhere, running queries occupy cores, and a heavy query can parallelize its execution into several threads and occupy several cores. So choose the number of cores per cluster server based on the expected load.
What does the physical database model affect?
It is hard to assign criticality levels here, everything is important, so the list comes without numbering:
- Compared to Sybase IQ, Vertica works surprisingly well with JOINs even between large tables, but column-oriented DBMSs were originally designed to get rid of unnecessary joins and to simplify data schemas. Wherever possible, use denormalization. It hardly increases the cost of storing data, but it speeds up query execution and simplifies query writing by removing unnecessary tables from the query. The downside is that the denormalized data in the tables counts toward the licensed volume. But think about it: writing the product name and price into the fact table takes about 30 bytes per record on average, so a billion fact records cost about 30 GB of license. Not such a fortune that it is worth scrimping on bytes.
- I used to think that table partitioning was not the most important part of designing a data schema; on older versions of Vertica it really was not, but looking at the new versions I have stopped thinking so. Partitioning splits table data into logical containers that Vertica finds easier to manipulate when storing and accessing data. The verdict of the Vertica developers is harsh: there should be neither too many partition keys nor too few. In the first case Vertica will choke while executing queries, wading through a cloud of partition containers; in the second it will spend a lot of time merging and reading containers hundreds of gigabytes in size on disk. When planning partitions, also remember that Vertica can quickly and efficiently delete partitions or move their containers to other tables by partition key, and this should be used (see the partition-maintenance sketch after this list). The conclusion is simple: do not build partitions around logical attributes such as cities or customers. Ask yourself: will you ever have a reason to delete from the fact table, or move to an archive, all the information about one city or one customer? Deleting or moving the data of a past period, day, week or month is, on the other hand, a good reason to think about a date-based partition key for a fact table. For those who believe in the future of their company and in the long, happy life of the data warehouse, and who do not agree that data need not be stored forever, Vertica has a great function, MERGE_PARTITIONS. It merges a specified interval of partition keys into a single container and keeps the number of partition keys in a table within reasonable limits. As for dimensions, I hope it is obvious: they definitely do not need partitioning.
- Any MPP server, be it Vertica, Hadoop or Cassandra, likes data evenly distributed across the cluster servers. Any motorist will confirm that poor wheel balancing is felt immediately while driving, and MPP is no exception. If you segment data storage by city, one cluster server will get 15-million-strong Moscow and another the 40-thousand-strong Uryupinsk. I think it is already clear which server will do all the work on queries while the rest of the cluster simply stands idle waiting for the "Moscow" server to finish. If segmentation is not specified explicitly, Vertica by default hashes all of the table's fields. This gives a uniform distribution of data between the cluster servers, but there are cases when it is worth specifying it thoughtfully by hand. That can make sense when queries frequently group data by certain fields, provided the number of records per value of those fields is always roughly uniform. For example, we have roughly the same number of sales facts for each city every day. Of course Moscow produces tens of thousands of records per day and Uryupinsk only a dozen, but since there are many cities in Russia and many days in a year, if we segment the fact table by the hash of the sale day and the city, the records of that same Moscow will be spread evenly across the cluster servers by day, and Uryupinsk will be just a statistical error in how the table's data is balanced across the servers. What do we gain with this segmentation? A query grouping by day and city will run more optimally than with segmentation on all fields: each server only has to aggregate its data by city and day and send the final results to the initiator server, which returns the result to the session. Otherwise each server would pick up its data, partially aggregate what it has and pass it to the initiator for further aggregation. When is such a game worth the candle? In my opinion, only if there will be queries covering thousands of cities over long periods of sales. That is, in this example such segmentation is clearly not worth doing and plain uniform segmentation of the sales table is enough; as they say, the example is there to show, not to copy (the DDL sketch after this list shows where the segmentation clause goes). With segmentation of dimensions everything is also simple: for dimensions with a reasonable number of records it is easier to mirror a full copy on every server, and for the rest to distribute the data evenly between the nodes. Remember that with a mirrored dimension the only cost for each server in a join of a fact table with the dimension is to pull its copy of the dimension records into memory and start the join; otherwise it would have to receive pieces of the dimension from every other server and only then join them. The network cost of a mirror is lower on the principle of "everyone got it from one place" rather than "everyone got it from everyone".
- One of Vertica's favorite marketing points: there are no indexes. That is true, there are no indexes, but there are sort orders and encodings for stored data, and it is precisely on these that the efficiency of your table or projection for various queries depends, as well as which projection the query optimizer will prefer when building an execution plan. Honestly, a description of all the nuances of how best to encode and sort the fields of projections would amount to a large thesis on Voodoo, so I will limit my recommendations to simple design rules:
- When choosing how to store a column's values, Vertica goes by its data type by default. Since no algorithm can guess that one BIGINT column is a fact measure and another is a dimension identifier, when describing tables do not skip the ENCODING option on the columns that will be used for filtering and aggregation in queries, that is, the ones holding dimension values rather than fact measures. With this column option described correctly, Vertica stores and searches the column more cheaply, and the optimizer takes it into account when building more efficient plans. Do not forget about GROUPED columns either: if some fields are always returned and processed together in queries, that is a good reason to group their storage in one place and reduce the cost of reading and assembling records.
- When assigning a sort order to a table or projection, remember that no single sort order suits all cases and it will not save you from creating projections. So choose the table's sort order for the most frequently executed queries, and cover the remaining cases with additional projections.
- When choosing the fields and their order in the sort, follow these rules: put first the columns that queries most often filter by equality and that are closest to unique; in the middle of the sort it is good to place fields used in joins or searched with the IN operator; last come the fields used in range comparisons. For our long-suffering sales table the most beneficial order would be: Region, Customer, Date_Sales. This sort order covers all queries that filter by region and/or customer within a given date range, and if those queries also use the sort fields in ORDER BY, executing them becomes even easier for Vertica (the DDL sketch after this list uses exactly this order).
- Since you still cannot be completely sure that your encoding, sorting and segmentation are optimal, you can always make a knight's move: deploy a prototype of the table on Vertica without explicitly specifying segmentation and sorting, fill it with some amount of data, write the most typical queries against this table into a file and run them through the Database Designer in Vertica's adminTools utility. The designer will analyze the queries and create the projections it considers best in terms of data encoding and sorting. You can then run the queries against the table, look at their plans, evaluate how effective the proposed projections are, and only then create the production table using the sorting, segmentation and field encoding suggested in the generated projections, plus, if needed, additional projections for queries not covered by the table's own sort order that the designer proposed.
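To make the points about encoding, sorting, segmentation and partitioning more concrete, here is a minimal DDL sketch for the sales example used above. All names (the dwh schema, fact_sales, dim_city and the columns) are hypothetical, and the exact clauses should be double-checked against the documentation of your Vertica version.

```sql
-- Hypothetical denormalized fact table: partitioned by month,
-- manually segmented by day + city, sorted for region/customer/date filters.
CREATE TABLE dwh.fact_sales (
    sale_date     DATE           NOT NULL,
    region_id     INTEGER        NOT NULL ENCODING RLE,  -- low-cardinality dimension key, leads the sort
    city_id       INTEGER        NOT NULL,
    customer_id   INTEGER        NOT NULL,
    product_name  VARCHAR(100),                          -- denormalized from the product dimension
    product_price NUMERIC(12, 2),
    amount        NUMERIC(12, 2)
)
ORDER BY region_id, customer_id, sale_date               -- most frequent equality filters first
SEGMENTED BY HASH(sale_date, city_id) ALL NODES          -- omit this clause to get the default uniform hashing
PARTITION BY EXTRACT(YEAR FROM sale_date) * 100 + EXTRACT(MONTH FROM sale_date);

-- Hypothetical small dimension mirrored on every node, so a join with the
-- fact table never has to move dimension rows across the network.
CREATE TABLE dwh.dim_city (
    city_id   INTEGER PRIMARY KEY,
    city_name VARCHAR(100),
    region_id INTEGER
)
ORDER BY city_id
UNSEGMENTED ALL NODES;
```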
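And a sketch of the partition maintenance mentioned above, under the same assumptions. The functions DROP_PARTITION, MOVE_PARTITIONS_TO_TABLE and MERGE_PARTITIONS exist in Vertica, but their availability and signatures differ between versions, so treat this as an illustration rather than a recipe.

```sql
-- Drop everything loaded for January 2013 as one cheap container-level operation.
SELECT DROP_PARTITION('dwh.fact_sales', '201301');

-- Or move an old month into an archive table instead of deleting it.
SELECT MOVE_PARTITIONS_TO_TABLE('dwh.fact_sales', '201301', '201301', 'dwh.fact_sales_archive');

-- Collapse a whole old year into a single container to keep the number
-- of partition keys in the table within reasonable limits.
SELECT MERGE_PARTITIONS('dwh.fact_sales', '201201', '201212');
```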
What does Vertica look like from the ETL/ELT developer's point of view?
A developer of ETL logic lives an easy and free life: Vertica has all the standard data access drivers (ODBC, JDBC, .NET, Python). There is also an impressive set of built-in facilities for batch loading of flat files with the COPY command, which can be extended with your own parsers, filters and validators. Files can be loaded both from the cluster servers themselves and from client machines via COPY LOCAL. Vertica's JDBC driver supports batch inserts through prepared statements, automatically turning the batches of inserted values into a COPY bulk load, which gives a high insert rate. The only fly in the ointment is that extensions to batch loading can only be written in C, which complicates development. Judging by the latest rumors, Vertica is moving confidently toward integration with the Java world (probably to get closer to Hadoop), so it is quite possible that in the near future such things can be written and plugged in as Java. As for the performance and efficiency of parallel loading of large data volumes, Vertica's architecture takes care of that entirely; the ETL developer does not need any special knowledge of how to organize real-time loading or distribute the load.
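A minimal sketch of batch loading with COPY, assuming a hypothetical stage.sales_raw table and made-up file paths:

```sql
-- Bulk-load a delimited file that already sits on a cluster node,
-- writing rejected rows and the reasons for rejection to side files.
COPY stage.sales_raw
FROM '/data/incoming/sales_20140101.csv'
DELIMITER '|' NULL ''
REJECTED DATA '/data/incoming/sales.rejected'
EXCEPTIONS '/data/incoming/sales.exceptions'
DIRECT;

-- Load a file from the client workstation, streaming it over the network.
COPY stage.sales_raw
FROM LOCAL '/home/etl/sales_20140101.csv'
DELIMITER ',' ENCLOSED BY '"';
```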
From the point of view of an ELT logic developer there are only two complaints about Vertica: there is no stored procedure language, and there is no way to write your own functions other than primitives in the form of expressions or in C. The Vertica team promises to fix the latter soon with support for Java functions, but nobody has promised anything about stored procedures yet, so ELT logic has to be stored and executed by your ETL tools. Otherwise Vertica satisfies even the most demanding script developer: full support for the ANSI SQL standard, many functions familiar from other database servers, simple work with data types and their casting, local and global session temporary tables for intermediate results, a MERGE statement, fast UPDATE and DELETE over large data sets, and OLAP functionality extended with time slices (TIME SERIES), time joins (TIME JOIN) and much more. All this makes it easy to write complex queries that transform and move data from the staging area into the data marts, compute aggregates and KPIs, and do the other work of calculating data in the warehouse. As practice has shown, developers who know Oracle, MSSQL or Sybase IQ quickly find a common language with Vertica at the level of complex queries. For Oracle developers the absence of a stored procedure language and cursors may even be a useful push to change their paradigm for designing data calculation logic. Perhaps this is one of the reasons the Vertica developers keep dodging the question of stored procedure support; IMHO they may be afraid that cursors would kill performance (a joke, of course, but with a grain of truth).
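A minimal sketch of a typical ELT step inside Vertica, assuming hypothetical stage and dwh schemas: intermediate results land in a session-local temporary table and are then merged into a target mart.

```sql
-- Aggregate freshly loaded rows into a temp table that lives for the session.
CREATE LOCAL TEMPORARY TABLE tmp_daily_sales
ON COMMIT PRESERVE ROWS AS
SELECT city_id,
       sale_date,
       SUM(amount) AS total_amount,
       COUNT(*)    AS sales_cnt
FROM stage.sales_raw
GROUP BY city_id, sale_date;

-- Upsert the aggregates into the target mart with a single MERGE statement.
MERGE INTO dwh.daily_sales t
USING tmp_daily_sales s
   ON t.city_id = s.city_id AND t.sale_date = s.sale_date
WHEN MATCHED THEN
    UPDATE SET total_amount = s.total_amount, sales_cnt = s.sales_cnt
WHEN NOT MATCHED THEN
    INSERT (city_id, sale_date, total_amount, sales_cnt)
    VALUES (s.city_id, s.sale_date, s.total_amount, s.sales_cnt);
```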
What does Vertica look like from a BI developer's point of view?
From the point of view of a BI developer, Vertica clearly lacks some analogue of parameterized views, such as table functions in MSSQL or stored procedures in Sybase IQ. Complex BI queries often filter data by given parameters somewhere in the subqueries, with aggregation on top of the subquery results. The inability to wrap such queries in Vertica as saved queries with parameters forces BI developers to copy complex queries between universes, extracts and whatever else their BI tools have. Otherwise, as with ELT developers, Vertica fully covers analytical queries of any level of complexity. It has no limitations in SQL queries; there is support for OLAP, WITH, the TIME SERIES, TIME JOIN and EVENT SERIES functions and extensions, work with geo data, IP addresses, URLs, and so on. Vertica's own metadata is easy to pick up in BI tools, and there are no difficulties working with Vertica from any of the major BI products. Those interested in how Vertica interacts with BI should pay attention to a wonderful pairing: Vertica + Tableau. Together these two products let analysts produce powerful analysis right in real time; I will just say that Tableau has been much appreciated in our company. I believe the main pains of BI developers, namely the performance of ad-hoc queries and functional limitations, are simply absent in Vertica, which has a positive effect on the speed and quality of developing BI solutions.
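As an illustration of the time series extensions mentioned above, here is a minimal TIMESERIES sketch over a hypothetical dwh.sales_events table (sale_ts timestamp, city_id, amount); it produces one row per minute and city even for minutes with no events, carrying the last observed value forward with constant interpolation.

```sql
SELECT city_id,
       slice_time,
       TS_FIRST_VALUE(amount, 'CONST') AS amount_at_slice
FROM dwh.sales_events
TIMESERIES slice_time AS '1 minute'
OVER (PARTITION BY city_id ORDER BY sale_ts);
```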
What does Vertica look like from an admin's point of view?
Vertica administration can be considered conditionally zero. That does not mean Vertica needs no administration; it means administration is needed sporadically, as required. Instead of a dedicated full-time administrator, the server can be administered remotely, or by the architect or an ETL or BI developer. The administration itself breaks down into a number of categories:
- Managing roles and users. A standard process of describing users in the database, assigning them to roles and defining role access to database objects. The work is infrequent and is done as users and roles are added (a sketch follows after this list).
- Managing cluster load. A more complex process that requires an understanding of the Vertica server architecture. It takes an analysis of the current load that different processes and user groups put on the cluster, in order to allocate the resources of the Vertica servers optimally across resource pools. With resource pools you can split query execution into categories and describe resource usage for each according to its needs: the reserved memory size, the maximum memory consumption, the number of concurrent connections, the priority for obtaining resources, the maximum allowed query run time and the limits on processor core usage. Competent design of resource pools and distribution of users across them keeps the cluster balanced even under peak load, and lets you divide execution effectively between real-time tasks, operational reports and long analytical queries (see the resource pool sketch after this list). Typically this work is done on a production server that already has well-established workloads and a clear picture of when and how peak loads occur; as long as the load profile does not change, there is no point in changing the resource pools.
- Managing cluster servers. Adding new servers to the cluster, replacing servers or removing them from the cluster with subsequent data rebalancing is not complicated and is done through a utility. It does, however, require a clear understanding of how the server works and some planning, so that the partial drop in cluster performance during the rebalance does not coincide with other costly work in the data warehouse, such as reloading a large volume of data from one table into another or loading from an external source. In our practice, adding 3 more servers to a 3-server cluster, with data rebalanced between them, took about a day with a slight loss of performance that in the end none of the users noticed.
- Cluster recovery. If one of the cluster servers fails, the Vertica administrator can restart it, provided the server is physically alive and can see the other servers on the network, or replace it with another machine if the hardware has died. While a server is out of the cluster there is a partial drop in warehouse performance, because another server in the cluster has to take over the work of the stopped one. The administrator has utilities for starting or replacing a failed server; Vertica then takes over the automatic recovery of data on that server and brings it back into operation. If the server hardware is reliable, this work is rare: in our practice a cluster server has dropped out of Vertica only twice, the first time because of a RAM failure and the second time when a RAID controller died. In both cases the cluster kept working while the servers were being repaired, and users and data-loading processes continued to work with the warehouse as usual.
- Upgrading the server version. This is done by uploading the distribution package to one of the Vertica servers, stopping Vertica for about 10 minutes, running the upgrade installation and starting Vertica again. Any administrator who can copy files to Linux and run programs can handle this job.
- Query optimization. If certain queries turn out to be too slow, it may be time to make another projection for the table. The administrator can call the Database Designer from Vertica's adminTools utility and run the problematic queries through it. The designer analyzes what exactly these queries are missing for a comfortable life and produces ready-made recipes in the form of projections on the tables (a hand-made example of the same idea is sketched after this list). Whether to follow these recipes is up to you. The Vertica developers claim that their American and European clients simply create the recommended projections without a second thought and everything is fine for them. I personally double-check what the designer proposes; mostly I have no complaints, but sometimes I still think that some of the proposed projections are superfluous and will take more space on the server disks than they will bring in query speedup. This mainly applies to cases where a query does not have to finish in a couple of seconds and the session can perfectly well wait a few tens of seconds without any complaints from the user or the process.
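A minimal sketch of the role and user management described in the first item, with hypothetical names:

```sql
-- A read-only role for BI users over a hypothetical dwh schema.
CREATE ROLE bi_reader;
GRANT USAGE ON SCHEMA dwh TO bi_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA dwh TO bi_reader;

-- A new analyst gets the role, enabled by default at login.
CREATE USER jdoe IDENTIFIED BY 'change_me';
GRANT bi_reader TO jdoe;
ALTER USER jdoe DEFAULT ROLE bi_reader;
```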
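And a sketch of a resource pool for long analytical queries; the pool name is made up, and parameters such as RUNTIMECAP and QUEUETIMEOUT may differ between Vertica versions, so check the documentation for yours.

```sql
-- A pool with a modest guaranteed reservation, a hard per-node memory cap,
-- limited concurrency and a ceiling on query run time.
CREATE RESOURCE POOL analysts_pool
    MEMORYSIZE '4G'
    MAXMEMORYSIZE '24G'
    PLANNEDCONCURRENCY 4
    MAXCONCURRENCY 8
    PRIORITY 30
    RUNTIMECAP '30 minutes'
    QUEUETIMEOUT 300;

-- Route a user into the pool and let them use it.
GRANT USAGE ON RESOURCE POOL analysts_pool TO jdoe;
ALTER USER jdoe RESOURCE POOL analysts_pool;
```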
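Finally, a hand-made version of the "one more projection" remedy for a slow query pattern, again with hypothetical names; in practice the Database Designer would propose something similar.

```sql
-- The slow pattern: heavy filtering by customer over a long date range.
EXPLAIN SELECT customer_id, SUM(amount)
FROM dwh.fact_sales
WHERE customer_id = 12345 AND sale_date >= '2013-01-01'
GROUP BY customer_id;

-- A query-specific projection sorted and segmented for that pattern.
CREATE PROJECTION dwh.fact_sales_by_customer
AS SELECT customer_id, sale_date, amount
   FROM dwh.fact_sales
   ORDER BY customer_id, sale_date
   SEGMENTED BY HASH(customer_id) ALL NODES;

-- Populate the new projection in the background.
SELECT START_REFRESH();
```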
What does Vertica look like from management's point of view?
To begin with, let me list the things our management has never had a headache over and never needed explanations for:
- When new data sources have to be added for analysis quickly: new schemas and tables are created, loaders are written for them; the work is transparent and clear.