
In late October, a new version of HP Vertica was released. The development team continued its glorious tradition of naming BigData releases after heavy construction equipment and gave the new version the code name Excavator.
Having studied the new features of this version, I think the name was chosen well: everything needed to work with big data has already been implemented in HP Vertica, and what remains is to balance and polish what exists, that is, to dig.
You can view the full list of new features in this document:
http://my.vertica.com/docs/7.2.x/PDF/HP_Vertica_7.2.x_New_Features.pdf
Below I will briefly go through the most significant changes from my point of view.
Licensing policy changed
In the new version, the algorithm for calculating the licensed data size was changed:
- For tabular data, the 1-byte separator for numeric and date-time fields is no longer counted;
- For data in the Flex zone, the licensed size is calculated as 1/10 of the size of the loaded JSON.
Thus, after upgrading to the new version, the licensed size of your storage will decrease, which will be especially noticeable on large data warehouses occupying tens or hundreds of terabytes.
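As a rough back-of-the-envelope illustration of both rules (the row count, column mix, and JSON volume below are invented for the example), the effect can be estimated like this:

```python
# Back-of-the-envelope estimate of the new license accounting.
# All numbers below are invented for illustration.

rows = 10_000_000_000        # hypothetical fact table: 10 billion rows
numeric_or_date_cols = 6     # fields whose 1-byte separator no longer counts

# Tabular data: licensed size shrinks by 1 byte per such field per row.
saved_gb = rows * numeric_or_date_cols / 1024**3
print(f"Tabular licensed size shrinks by ~{saved_gb:.1f} GB")   # ~55.9 GB

# Flex zone: licensed size is now 1/10 of the raw loaded JSON volume.
json_loaded_tb = 5.0         # hypothetical: 5 TB of JSON loaded into Flex
print(f"Flex data counts as {json_loaded_tb / 10:.1f} TB of license")
```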
Added official support for RHEL 7 and CentOS 7
Now it is possible to deploy a Vertica cluster on more modern Linux operating systems, which I think should please system administrators.
Optimized database catalog storage
The format for storing the database catalog in Vertica had remained unchanged for quite a few versions. Given the growth not only of data volumes but also of the number of objects in databases and the number of nodes in clusters, it had ceased to meet the efficiency requirements of high-load data warehouses. In the new version, the catalog was optimized to reduce its size, which has a positive effect on the speed of synchronizing it between nodes and of working with it during query execution.
Improved integration with Apache solutions
Added integration with Apache Kafka:

This solution makes it possible to organize near-real-time loading of streams through Kafka: the product collects JSON data from the streams and loads it in parallel into the Vertica Flex storage zone. This makes it easy to set up streaming data loads without involving expensive software or resource-intensive development of your own ETL jobs.
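As a minimal sketch of a direct micro-batch load (the topic, broker address, and table name are my own placeholders, and the exact UDx parameters should be checked against the integration documentation), via the new native Python driver:

```python
import vertica_python  # the new native Python driver

# Placeholder connection parameters.
conn_info = {'host': 'vertica-node1', 'port': 5433,
             'user': 'dbadmin', 'password': '***', 'database': 'dwh'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # A Flex table stores the incoming JSON without a predefined schema.
    cur.execute("CREATE FLEX TABLE IF NOT EXISTS clicks_flex()")
    # Pull a 10-second micro-batch from the hypothetical topic 'clicks'
    # (partition 0, starting offset 0) using the Kafka integration UDxs.
    cur.execute("""
        COPY clicks_flex
        SOURCE KafkaSource(stream='clicks|0|0',
                           brokers='kafka-broker:9092',
                           duration=interval '10 seconds')
        PARSER KafkaJSONParser()
    """)
```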
Support was also added for loading files in Avro format from Apache HDFS. This is quite a popular format for storing data on HDFS, and support for it was genuinely missing before.
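A sketch of such a load, assuming the bundled HDFS connector and the Flex Avro parser (the namenode URL, path, and table name are placeholders; verify the connector options in the documentation):

```python
import vertica_python

conn_info = {'host': 'vertica-node1', 'port': 5433,
             'user': 'dbadmin', 'password': '***', 'database': 'dwh'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("CREATE FLEX TABLE IF NOT EXISTS events_avro()")
    # Read Avro files over WebHDFS and parse them into the Flex table.
    cur.execute("""
        COPY events_avro
        SOURCE Hdfs(url='http://namenode:50070/webhdfs/v1/data/events/*',
                    username='hdfs')
        PARSER favroparser()
    """)
```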
Vertica's integration with Hadoop has become so common among customers that there is no longer a separate Hadoop integration package to install: it now ships with Vertica itself. Just remember to remove the old Hadoop integration package before installing the new version!
Added drivers for Python
Python now has its own native, full-featured driver for Vertica, officially supported by HP. Previously, Python developers had to make do with ODBC drivers, which created inconvenience and additional difficulties. Now they can work with Vertica simply and comfortably.
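A minimal usage sketch (connection parameters are placeholders; I am assuming the vertica-python package here, the open-source incarnation of the native driver):

```python
import vertica_python

# Placeholder connection parameters for the example.
conn_info = {'host': 'vertica-node1', 'port': 5433,
             'user': 'dbadmin', 'password': '***', 'database': 'dwh'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Query the catalog for cluster state, as any SQL client would.
    cur.execute("SELECT node_name, node_state FROM nodes ORDER BY node_name")
    for node_name, node_state in cur.fetchall():
        print(node_name, node_state)
```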
Improved JDBC driver functionality
Added the ability to execute multiple queries simultaneously within a single session (Multiple Active Result Sets). A session building a complex analytical report with different sections can now launch all the necessary queries at once and consume their results as they complete; data that the session has not yet fetched from the server is cached on the server side.
Also added is client-side functionality for calculating the hash of field values, analogous to calling the HASH function in Vertica. This makes it possible, even before loading records into a warehouse table, to determine which nodes they will be placed on for a given segmentation key.
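Since the JDBC hashing API mirrors Vertica's built-in HASH function, the same check can be sketched server-side (the key value here is invented for the example):

```python
import vertica_python

conn_info = {'host': 'vertica-node1', 'port': 5433,
             'user': 'dbadmin', 'password': '***', 'database': 'dwh'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # HASH() is the function segmented projections use to route rows;
    # 42 stands in for a hypothetical segmentation-key value.
    cur.execute("SELECT HASH(42)")
    print(cur.fetchone()[0])  # the hash that determines the target node
```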
Expanded management of the cluster node recovery process
Added functionality that allows you to set the recovery priority of tables when nodes are recovered. This is useful if you want to balance cluster recovery yourself, determining which tables are restored first and which can wait until last.
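A sketch of what this could look like; I have not verified the exact clause, so treat the RECOVER PRIORITY syntax below as an assumption to check against the 7.2 documentation (the table name is also invented):

```python
import vertica_python

conn_info = {'host': 'vertica-node1', 'port': 5433,
             'user': 'dbadmin', 'password': '***', 'database': 'dwh'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Higher priority -> the table is restored earlier during node recovery.
    # NOTE: hypothetical syntax, verify against the documentation.
    cur.execute("ALTER TABLE fact_sales RECOVER PRIORITY 100")
```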
Added new backup mechanism functionality
- You can back up to local node hosts;
- You can restore a schema or a table from a full or object backup;
- Using the COPY_PARTITIONS_TO_TABLE function, you can organize shared data storage between several tables with the same structure. After partitions are copied from table to table, both tables physically reference the same ROS containers of the copied partitions; when changes are later made to those partitions, each table gets its own version of the changes. This makes it possible to take fast snapshots of table partitions into other tables, with a guarantee that the original table's data stays intact and without the disk cost of storing the copied data (see the sketch after this list);
- With object restoration, you can specify the behavior when the object being restored already exists. Vertica can create it if it is not yet in the database, skip restoring it, restore it from the backup, or create a new object alongside the existing one, prefixing its name with the backup name and date.
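A minimal sketch of such a partition snapshot (table names and the partition key range are invented; COPY_PARTITIONS_TO_TABLE takes the source table, the partition key range, and the target table):

```python
import vertica_python

conn_info = {'host': 'vertica-node1', 'port': 5433,
             'user': 'dbadmin', 'password': '***', 'database': 'dwh'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # The target must have the same structure as the source table.
    cur.execute("CREATE TABLE IF NOT EXISTS sales_snap "
                "LIKE sales INCLUDING PROJECTIONS")
    # Snapshot the 2015 partitions: both tables now share the same ROS
    # containers until one of them modifies the data (copy-on-write).
    cur.execute("SELECT COPY_PARTITIONS_TO_TABLE"
                "('sales', '2015-01', '2015-12', 'sales_snap')")
```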
Improved optimizer performance
When joining tables with the HASH JOIN method, processing the join could take a long time if both joined tables had a large number of records: Vertica has to build a hash table from the values of the outer table and then, while scanning the inner table, probe the hash table it built. In the new version, populating the hash table is parallelized, which should significantly improve join speed with this method.
For query plans, it is now possible to script a plan using hints in the query: specify an explicit join order for tables, their join and segmentation algorithms, and list projections that may or may not be used when executing the query. This lets you push the optimizer more flexibly toward effective query plans. And so that BI systems can benefit from such optimization when running standard queries, without having to embed hints themselves, Vertica added the ability to save such scripted queries: any session that executes a query matching the saved pattern receives the already-described optimal plan and runs with it.
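This mechanism is known as directed queries; a rough sketch of the flow (the query and label are invented, and the exact statement syntax is worth checking in the documentation):

```python
import vertica_python

conn_info = {'host': 'vertica-node1', 'port': 5433,
             'user': 'dbadmin', 'password': '***', 'database': 'dwh'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Save the optimizer's current plan for this query under a label.
    cur.execute("""
        CREATE DIRECTED QUERY OPTIMIZER 'sales_by_region'
        SELECT region, SUM(amount) FROM sales GROUP BY region
    """)
    # Once activated, any session running a matching query gets the saved plan.
    cur.execute("ACTIVATE DIRECTED QUERY sales_by_region")
```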
To speed up queries with many calculations in computed fields or conditions, including LIKE predicates, JIT compilation of query expressions was added to Vertica. Previously, expressions were interpreted, which noticeably degraded the execution speed of queries containing, for example, dozens of LIKE expressions.
Extended data integrity checking functionality
Previously, when constraints were declared on tables, Vertica enforced only the NOT NULL condition during data loading and modification. PK, FK, and UK constraints were fully checked only for single-row DML INSERT and UPDATE statements, as well as for the MERGE operator, whose algorithm directly depends on PK and FK integrity being respected. It was, however, possible to check all constraints for violations using a special function that returned the list of violations.
Now, in the new version, you can enable checking of all constraints for bulk DML operators and COPY, on all tables or only the ones you need. This lets you implement data-cleanliness checks more flexibly and choose between load speed and the simplicity of integrity checking. If data arrives in the warehouse from reliable sources and in large volumes, it is reasonable not to enable constraint checking on such tables; if the incoming volume is not critical but the data's cleanliness is questionable, it is easier to enable the checks than to implement them yourself in ETL.
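The "special function" mentioned above is ANALYZE_CONSTRAINTS, which remains available for on-demand checks; a quick sketch (the table name is invented):

```python
import vertica_python

conn_info = {'host': 'vertica-node1', 'port': 5433,
             'user': 'dbadmin', 'password': '***', 'database': 'dwh'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Returns one row per constraint violation found in the table.
    cur.execute("SELECT ANALYZE_CONSTRAINTS('public.orders')")
    for violation in cur.fetchall():
        print(violation)
```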
Deprecation announcements
Alas, product development always not only adds functionality but also sheds the obsolete. Not much has been declared deprecated in this version of Vertica, but there are a couple of significant announcements worth noting:
- Ext3 file system support
- Support for pre-join projections
Both items are quite critical for Vertica customers. Those who have worked with this server for a long time may easily still have a cluster on the old ext3 file system, and I know that many use pre-join projections to optimize queries against constellation schemas. In any case, no specific version for removing support for these features has been announced, so I think Vertica customers have at least a couple more years to prepare.
Summing up my impressions of the new version
This article lists only half of what was added to Vertica. The scope of the expanded functionality is impressive, but I covered only what is relevant to all data warehouse projects. If you use full-text search, geolocation, advanced security, and the other cool features implemented in Vertica, you can read about all the changes to them at the link I gave at the beginning of the article, or in the documentation for the new version of Vertica:
https://my.vertica.com/docs/7.2.x/HTML/index.htm
Speaking for myself: having worked on different projects with large HP Vertica data warehouses of dozens of terabytes, I rate the changes in the new version very positively. It really delivers a lot of what I had wished for and makes developing and maintaining data warehouses easier.