
HP Vertica, the first project launched in the Russian Federation: one and a half years of real production experience

By way of an introduction


Habr and other resources have already covered HP Vertica, but the information mostly boiled down to theory. Until recently Vertica (we call it "Vertika" around the office; I propose treating the name as feminine) was used in real production mostly in the States and a little in Europe; the guys from LifeStreet Media wrote about it on Habr. A year and a half has passed since we started working with Vertica, and our data warehouse now holds dozens of terabytes of data. Every minute the server processes thousands of queries, many of which scan tens of billions of records. Data is loaded continuously, in real time, at about 150 GB per day. In short, I decided it was worth filling the gap and sharing what it actually feels like to ride genuinely modern BigData technology.

Who will find this useful


I think this will be useful to developers, architects and integrators who face the task of storing and analyzing data that is big in volume, content and complexity of analysis. Moreover, Vertica now finally has a sane, free, fully functional Community Edition. It lets you deploy a cluster of up to 3 servers and load up to 1 TB of raw data into the warehouse. Given Vertica's performance and how easy it is to get a solution up and running, I think this offering deserves consideration as the data warehouse for any company whose data fits into 1 TB.

How we chose it, in one paragraph


Briefly, and without trying to start a holy war:
When choosing a data warehouse server we cared about the pricing model, high performance and scalability on large data volumes, the ability to load data in real time from many different sources, the ease of launching the project on our own, and minimal cost of ownership. On all of these criteria Vertica came out on top for us, beating IBM Netezza and EMC Greenplum. The latter two could not fully meet all our requirements, which would have meant extra development and maintenance costs for a project whose budget was not that large.

What does Vertica look like from an architect's point of view?


The architect is the most important person around a Vertica data warehouse: the success and performance of the warehouse depend on them first and foremost. The architect faces two difficult tasks: choosing the right hardware stuffing for the Vertica cluster and designing the physical database model correctly.

What does the technical architecture affect?

Here are my personal impressions, in decreasing order of criticality:
  1. Being an MPP server, Vertica first of all places strict demands on the network architecture. If you have 3 servers in a cluster and a 3 TB database, the data exchange between the nodes and the users is barely noticeable in network terms. With 6 servers and 30 TB of data, at peak loads the network can become the weakest point of the whole cluster. Now imagine 100 servers and more than a petabyte of data. Scary? :)
  2. Correctly assembled arrays on the local disks of the cluster servers give you the balance between performance, storage volume and hardware cost. Built a RAID on cheap, slow, high-capacity disks? Cheap and cheerful, but slow. Put everything on fast disks of smaller capacity? Expensive and fast, but there is not enough room for the database, and you have to add yet another server to the cluster. Read the documentation on Vertica's Flex technology, attached both fast and slow disks to the cluster servers, and distributed the storage of table columns (projections) according to whether they serve queries as filters or as returned data? Well done, you have the best cluster for the job. And if you also add fast local disks to the servers and move Vertica's temp space onto them, everything will look great.
  3. An open secret: memory is never plentiful, and it is cheaper to buy servers with a lot of RAM up front than to add it to existing servers later. Remember that analytical queries over billions of records want a lot of memory; nobody has repealed physics. A single session of a single analyst can easily eat 20 gigabytes of RAM. Count your analysts, count the heavy queries your ELT will run, and do not panic or start pricing a terabyte of memory: Vertica lets you balance resource consumption with resource pools (see the sketch after this list). Starting from the estimated storage volume divided by the number of servers in the cluster, and taking the expected load into account, you can find the golden mean for RAM per server and insure yourself against peak loads with resource pooling.
  4. What keeps surprising me about Vertica after Sybase IQ is the low CPU load. Where IQ keeps all processors fully busy while executing heavy queries, Vertica under the same conditions looks like it is on holiday compared to Sybase. Both servers are column-oriented, and both compress all stored data. A miracle, and yet the fact remains: processors are not the most critical resource for Vertica. I hope you will not read this as advice that 2 cores are enough; miracles do not happen, and as everywhere else running queries occupy cores, and a heavy query can parallelize its execution into multiple threads and occupy multiple cores. So choose the number of cores per cluster server based on the expected load.
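To make the resource pooling point concrete, here is a rough sketch of how a pool for analysts might be declared. The pool name, user name and sizes are made up purely for illustration, not a recommendation:

    -- A sketch only: a dedicated pool for ad-hoc analyst queries.
    -- Name and sizes are hypothetical.
    CREATE RESOURCE POOL analysts_pool
        MEMORYSIZE '8G'          -- memory reserved for the pool per node
        MAXMEMORYSIZE '20G'      -- hard per-node cap, borrowing from GENERAL
        PLANNEDCONCURRENCY 4     -- expected number of concurrent queries
        QUEUETIMEOUT 300;        -- seconds a query may wait for resources

    -- Attach a user to the pool:
    ALTER USER analyst1 RESOURCE POOL analysts_pool;

With a setup like this, one runaway analyst session queues or fails inside its own pool instead of starving the ELT load stream.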


What does the physical database model affect?

It is hard to assign criticality levels here: everything in the physical model matters, from how projections are designed to how columns are sorted, encoded and segmented across the nodes. A minimal projection sketch follows below.
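Since projections are the heart of Vertica's physical model, here is a minimal sketch of what one looks like. The table, column names and encoding choices are hypothetical, chosen only to illustrate the idea:

    -- A sketch only: a projection optimized for date-filtered queries.
    -- Table and column names are illustrative.
    CREATE PROJECTION sales_by_date
    (
        sale_date ENCODING RLE,  -- RLE suits low-cardinality sorted columns
        customer_id,
        amount
    )
    AS SELECT sale_date, customer_id, amount
    FROM sales
    ORDER BY sale_date, customer_id            -- sort order drives filter speed
    SEGMENTED BY HASH(customer_id) ALL NODES;  -- spread rows across the cluster

The sort order, encodings and segmentation key are exactly the knobs the architect turns to match projections to the actual query workload.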


What does Vertica look like from the ETL/ELT developer's point of view


The ETL developer lives an easy and free life: Vertica has all the standard data access drivers (ODBC, JDBC, .NET, Python). There is also an impressive set of built-in facilities for bulk loading flat files via the COPY command, and they can be extended with your own parsers, filters and validators. Files can be loaded both from the cluster servers themselves and from local workstations via COPY LOCAL. Vertica's JDBC driver supports batch inserts through prepared statements, automatically converting the batched values into a bulk COPY load, which gives a high record-insertion rate. The only fly in the ointment is that extensions to the bulk loader can only be written in C, which immediately complicates development. Judging by the latest rumors, Vertica is confidently moving towards integration with the Java world (probably to get closer to Hadoop), so it is quite possible that in the near future such things will be writable in Java as well. As for the performance and efficiency of loading large data volumes in parallel, Vertica's architecture takes care of that entirely; the ETL developer needs no special knowledge of how to organize real-time loading or distribute the load.
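For a feel of the loading workflow, here is a minimal COPY sketch. The schema, table, file paths and delimiter are all hypothetical:

    -- A sketch only: bulk-loading a flat file from a cluster node.
    -- Names and paths are illustrative.
    COPY facts.clicks FROM '/data/incoming/clicks.csv'
        DELIMITER ',' NULL '' DIRECT;  -- DIRECT writes straight to disk storage

    -- The same file pushed from a client workstation instead:
    COPY facts.clicks FROM LOCAL '/home/etl/clicks.csv' DELIMITER ',';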

From the ELT developer's point of view, only two complaints can be made about Vertica: there is no stored procedure language, and user-defined functions can only be written either as simple SQL expressions or in C. The Vertica team promises to fix the latter soon with support for Java functions, but nobody has promised anything about stored procedures, so ELT logic has to be stored and executed by your ETL tools. Otherwise Vertica satisfies even the most demanding script developer: full support for the ANSI SQL standard, many functions familiar from other database servers, straightforward work with data types and their casting, local and global session temporary tables for intermediate results, a MERGE statement, fast UPDATE and DELETE over large data sets, and OLAP functionality extended with time slicing (TIME SERIES), time-based joins (TIME JOIN) and much more. All of this makes it easy to write complex queries that transform and move data from the staging area into the data marts, compute aggregates and KPIs, and do the rest of the work of calculating data in the warehouse. As practice has shown, developers who know Oracle, MSSQL or Sybase IQ quickly find a common language with Vertica at the level of complex queries. For Oracle developers, the absence of a stored procedure language and cursors may even be a useful push to change their paradigm for developing data calculation logic. Perhaps that is one of the reasons Vertica's developers keep dodging the question of stored procedure support; IMHO they may be afraid that cursors would kill performance (a joke, of course, but with a grain of truth).
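As an illustration of a typical ELT step, here is a minimal MERGE sketch for upserting a staging table into a target. All schema, table and column names are made up:

    -- A sketch only: upsert from staging into a target dimension.
    -- Names are illustrative.
    MERGE INTO dwh.dim_customer t
    USING stage.customer s
        ON t.customer_id = s.customer_id
    WHEN MATCHED THEN
        UPDATE SET name = s.name, updated_at = s.loaded_at
    WHEN NOT MATCHED THEN
        INSERT (customer_id, name, updated_at)
        VALUES (s.customer_id, s.name, s.loaded_at);

A pattern like this, driven by an external ETL scheduler, substitutes for the stored procedures Vertica does not have.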

What does Vertica look like from a BI developer's point of view


From the BI developer's point of view, Vertica clearly lacks some analogue of parameterized views, such as table functions in MSSQL or stored procedures in Sybase IQ. In complex BI queries you often filter data by given parameters somewhere deep in the subqueries, with aggregation happening on top of their results. The inability to factor such queries out into something like saved parameterized queries forces BI developers to copy complex queries between the various universes, extracts and other artifacts their BI tools have. Otherwise, as with ELT development, Vertica fully covers the entire range of analytical queries of any complexity. It has no SQL limitations: there is OLAP support, WITH, the TIME SERIES, TIME JOIN and EVENT SERIES functions and extensions, and there is support for working with geo data, IP addresses, URLs and so on. Vertica's own warehouse metadata plays well with BI, and none of the major BI products have any trouble working with Vertica. Anyone interested in BI on top of Vertica should pay attention to one wonderful combination: Vertica + Tableau. Together these two products give analysts powerful analysis output literally in real time; I will say that Tableau has been much appreciated in our company. I believe the main pains of BI developers, namely ad-hoc query performance and functional limitations, are simply absent in Vertica, which has a positive effect on the speed and quality of developing BI solutions.
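To show what the time-series extension looks like in practice, here is a minimal sketch that resamples raw events into one-minute slices, carrying the last known value forward. The table and columns are hypothetical:

    -- A sketch only: gap-fill raw ticks into 1-minute slices.
    -- Table and column names are illustrative; ts is a TIMESTAMP column.
    SELECT slice_time,
           symbol,
           TS_FIRST_VALUE(price, 'CONST') AS price_at_slice  -- hold last value
    FROM ticks
    TIMESERIES slice_time AS '1 minute'
        OVER (PARTITION BY symbol ORDER BY ts);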

What does Vertica look like from the admin's point of view


Administration of Vertica can be considered close to zero. That does not mean Vertica needs no administration at all; it means administration is required only occasionally, on demand. Instead of a dedicated full-time administrator on staff, the server can be administered remotely, or by the architect or an ETL or BI developer. The administration itself can be divided into a number of routine categories.
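Most of that occasional day-to-day work boils down to queries against the system tables. A minimal sketch of two typical checks, assuming the standard v_catalog and v_monitor system schemas, looks like this:

    -- A sketch only: routine health checks via Vertica system tables.
    -- Is every node in the cluster up?
    SELECT node_name, node_state FROM v_catalog.nodes;

    -- How loaded are the resource pools right now?
    SELECT pool_name, memory_inuse_kb, running_query_count
    FROM v_monitor.resource_pool_status;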


What does Vertica look like from management's point of view


To begin with, let me list the things our management never had a headache over and we never had to explain why:
  1. New data sources need to be added for analysis quickly: new schemas and tables are created, loaders are written for them, and the work stays transparent and clear.
  2. New BI tools need to be connected for the analysts: Vertica supports ANSI SQL, so they hook up without problems.





Source: https://habr.com/ru/post/190740/

