
HP Vertica: Big Data DBMS

One of the problems of modern business is an oversupply of data: a huge amount of information is scattered across different storage systems, databases, file servers and so on. The information is plentiful, yet decisions still need to be made promptly.

Tools for working with such big data are not keeping up with its growth. The resulting problems include:
- a high proportion of manual labor,
- the inability to implement the analysis in real time,
- low search accuracy and lack of consistency,
- inefficient processing of unstructured information.

One solution is HP Vertica, a specialized database designed for real-time analysis of big data that works much faster than a traditional DBMS.
Working with data
HP Vertica stores and compresses data more efficiently because it stores it by column rather than by row. Clustering allows a linear increase in system performance as more resources are connected on the fly, while reducing storage volume and search time. Storing data by column makes it possible to read from disk only the fields involved in a query rather than entire records.
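The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table, its column names and the "fields read" counter are hypothetical and do not reflect Vertica's actual storage format.

```python
# Hypothetical three-column table; names and values are illustrative only.
rows = [
    {"user_id": 1, "country": "UA", "revenue": 10.0},
    {"user_id": 2, "country": "GE", "revenue": 5.5},
    {"user_id": 3, "country": "UA", "revenue": 7.25},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_revenue_row_store(rows):
    """Row layout: every whole record is touched to read a single field."""
    fields_read = 0
    total = 0.0
    for row in rows:
        fields_read += len(row)  # the full record is read
        total += row["revenue"]
    return total, fields_read


def total_revenue_column_store(columns):
    """Column layout: only the 'revenue' column is touched."""
    values = columns["revenue"]
    return sum(values), len(values)


columns = to_columns(rows)
row_total, row_fields = total_revenue_row_store(rows)
col_total, col_fields = total_revenue_column_store(columns)
```

Both functions return the same total, but the column version touches three values while the row version touches nine. On a wide table with many columns the gap grows accordingly.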

Data in columns is compressed by recording the number of repetitions together with the field value (run-length encoding), by delta encoding of consecutive values, and by LZO (Lempel-Ziv-Oberhumer) compression for unsorted columns and columns with a large number of unique values. In addition, special compression algorithms are used for floating-point numbers, dates, and a number of other field types. Together these provide a compression ratio of over 90%. An important point is that in most cases operations can be performed on the data without decoding it, which reduces not only the required storage capacity and the number of disk accesses, but also the load on processors and memory.
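The first two techniques are simple enough to sketch in Python. This is a minimal illustration of run-length and delta encoding in general, not Vertica's implementation; all function names are made up for the example. The last function shows the idea of working on encoded data directly, here by counting matching values without expanding the runs.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encoding: store each value once with its repeat count."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def delta_encode(values):
    """Delta encoding: keep the first value, then successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    values = []
    current = 0
    for delta in deltas:
        current += delta
        values.append(current)
    return values


def count_equal_rle(pairs, target):
    """Count rows equal to target directly on the encoded runs."""
    return sum(count for value, count in pairs if value == target)
```

For a sorted column such as `["GE", "GE", "UA", "UA", "UA"]`, `rle_encode` yields `[("GE", 2), ("UA", 3)]`, and `delta_encode([1000, 1001, 1003, 1006])` yields `[1000, 1, 2, 3]`. Run-length encoding pays off on sorted columns with few distinct values, which is why the sort order of a column matters for compression.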



Processing of a large number of parallel queries is also accelerated by keeping copies of columns in different sort orders in different projections, which are selected automatically.

Aggressive compression makes it possible to store multiple copies of the same columns in different "projections" of the database, which are sets of columns stored together. Besides keeping different copies on different disks, a projection can be divided by the value of one of its fields into segments that are stored and processed on different machines.
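Both ideas, a projection as a sorted subset of columns and its division into segments by a field value, can be sketched as follows. This is a simplified model under stated assumptions: the table is hypothetical, and CRC32 stands in for whatever hash function the real system uses.

```python
import zlib

# Hypothetical table; column names and values are illustrative only.
rows = [
    {"user_id": 1, "country": "UA", "revenue": 10.0},
    {"user_id": 2, "country": "GE", "revenue": 5.5},
    {"user_id": 3, "country": "UA", "revenue": 7.25},
    {"user_id": 4, "country": "TJ", "revenue": 1.0},
]


def build_projection(rows, columns, sort_by):
    """A projection here is a subset of columns kept in a chosen sort order."""
    subset = [{name: row[name] for name in columns} for row in rows]
    return sorted(subset, key=lambda r: tuple(r[name] for name in sort_by))


def segment_for(key, node_count):
    """Map a field value to a node with a deterministic hash (an assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def segment_projection(projection, key_field, node_count):
    """Split a projection into one segment per node by hashing key_field."""
    segments = [[] for _ in range(node_count)]
    for row in projection:
        segments[segment_for(row[key_field], node_count)].append(row)
    return segments


# Two projections over the same data with different sort orders.
by_country = build_projection(rows, ["country", "revenue"], ["country"])
by_user = build_projection(rows, ["user_id", "revenue"], ["user_id"])
segments = segment_projection(by_user, "user_id", 3)
```

Every row lands in exactly one segment, and rows with the same key value always land on the same node, so each machine can process its segment independently.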

To work with data that has already been accumulated, Vertica provides a standard SQL interface (ANSI SQL-99) with extensions for analytical queries. The platform is compatible with data cleaning and reporting tools, as well as with business intelligence solutions from Cognos, Informatica, Business Objects and SAS. This makes it easy to migrate databases and to use other analytic applications that work through a standard SQL interface or through ODBC, JDBC, or ADO.NET connectors.



Analytical "crane"
In August 2014, a significantly updated version, HP Vertica 7.1, was released under the code name Dragline (a dragline excavator), continuing the tradition of names taken from large-scale construction. The main innovations of this version are:
- support for working directly with unstructured data,
- text analysis,
- geospatial analytics,
- improved workload management,
- support for projection units and much more.

HP Vertica 7 introduces Flex Zone, a special area for storing and processing unstructured data. It lets you create Flex tables, load data into them from CSV, JSON and other files, and query them, joining this data with Vertica relational tables in the same query. The data in these tables is stored on the cluster nodes in a special format, but according to the same principles as relational data. Compression, mirroring, and segmentation are all available for unstructured data.

The advantage of Flex Zone is that it is not an external solution integrated with Vertica, but native support for unstructured data. This keeps hybrid queries that combine tables of structured and unstructured data fast.
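The idea of a hybrid query can be illustrated with a small Python sketch. It is a conceptual model only: it does not use Vertica or its Flex table format, and the records, the `users` table and the function name are all hypothetical. Records without a fixed schema are loaded from JSON lines and joined with a relational-style table, and keys missing from a record simply come back as `None`, much as a missing column would read as NULL.

```python
import json

# Hypothetical JSON input: records do not share the same set of keys.
json_lines = [
    '{"user_id": 1, "event": "login", "device": "ios"}',
    '{"user_id": 2, "event": "purchase", "amount": 4.99}',
    '{"user_id": 9, "event": "login"}',
]

# "Flex-like" table: each record keeps whatever keys it arrived with.
flex_events = [json.loads(line) for line in json_lines]

# Relational-style table with a fixed schema: user_id -> name.
users = {1: "alice", 2: "bob"}


def join_events_with_users(events, users):
    """Inner join on user_id; keys absent from a record are read as None."""
    joined = []
    for event in events:
        name = users.get(event.get("user_id"))
        if name is not None:
            joined.append({
                "name": name,
                "event": event.get("event"),
                "amount": event.get("amount"),
            })
    return joined


joined = join_events_with_users(flex_events, users)
```

The third record has no matching user and is dropped by the join, while the first record has no `amount` key and yields `None` for it.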



Clustering capabilities
HP Vertica's fault tolerance is provided by a special mechanism for keeping copies of data, called K-Safety. The mechanism is designed to keep the system available in 24x7x365 operation. The cluster can withstand the failure of several nodes without stopping query execution. Each data segment and its copies are stored on different nodes of the cluster; at K-safety level K, the cluster tolerates the loss of up to K nodes. If nodes fail, the system continues to function using the copies of the segments, and access to this data is automatic. When a failed node is replaced, its data is restored from the copies of the segments stored on healthy nodes.

In addition, clustering lets performance grow in proportion to the number of nodes, providing fault tolerance as well as scaling. Because the cluster has no shared resources, it wastes no time waiting on locks for them, so no distributed lock management is needed. The Vertica architecture also does without logging, since logging often becomes a bottleneck when loading data. Instead, the system keeps multiple copies of columns on different nodes of the cluster.
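The effect of keeping copies on different nodes can be sketched as follows. The ring-style placement used here, where a segment's copies go to the next nodes in order, is an assumption made for illustration and is not taken from the article; the function names are hypothetical as well.

```python
def place_copies(segment_count, node_count, k):
    """Place each segment on k + 1 distinct nodes, walking around a ring.

    Illustrative placement only: segment i goes to node i (modulo the
    node count) and its k copies go to the k following nodes.
    """
    if k >= node_count:
        raise ValueError("need more nodes than the K-safety level")
    return {
        segment: [(segment + offset) % node_count for offset in range(k + 1)]
        for segment in range(segment_count)
    }


def is_available(placement, failed_nodes):
    """All data is reachable if every segment has at least one live copy."""
    failed = set(failed_nodes)
    return all(
        any(node not in failed for node in nodes)
        for nodes in placement.values()
    )


# Four segments on four nodes with one extra copy of each segment (K = 1).
placement = place_copies(segment_count=4, node_count=4, k=1)
```

With this placement any single node can fail without losing data. Two failures are survivable only if the failed nodes do not hold both copies of the same segment: losing nodes 0 and 2 is fine, losing nodes 0 and 1 is not.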

Because analytics is most often needed in real time, Vertica has a special mechanism for loading data continuously without slowing down reads. Data is written to the WOS (Write Optimized Store), a special area in RAM, while reads are served from disk, from the ROS (Read Optimized Store). Data in the WOS is neither sorted nor indexed, yet it is available to queries even before it is moved to the ROS.

Records migrate from the WOS to the ROS in large blocks, automatically and asynchronously, through a special process called the Tuple Mover. Because this process handles the entire WOS at once, it can move records very efficiently, sorting many records together and writing them to disk in batches.
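The interplay of the two stores can be modeled with a toy class. This is a minimal sketch, assuming plain Python lists in place of RAM and disk structures; the class and method names are hypothetical, and the real Tuple Mover runs asynchronously rather than on demand.

```python
class TinyStore:
    """Toy model of a write-optimized buffer and read-optimized containers."""

    def __init__(self):
        self.wos = []  # in-memory buffer: unsorted, append-only
        self.ros = []  # list of sorted containers, standing in for disk

    def insert(self, value):
        """Writes only append to the WOS, so loading stays cheap."""
        self.wos.append(value)

    def move_out(self):
        """Tuple-mover-like step: sort the whole WOS and flush it in one batch."""
        if self.wos:
            self.ros.append(sorted(self.wos))
            self.wos = []

    def query(self, predicate):
        """Reads see both stores, so fresh rows are visible before the move."""
        hits = [v for container in self.ros for v in container if predicate(v)]
        hits.extend(v for v in self.wos if predicate(v))
        return sorted(hits)


store = TinyStore()
for value in (5, 3, 9):
    store.insert(value)
fresh_hits = store.query(lambda v: v > 4)  # answered from the WOS alone
store.move_out()
store.insert(1)
all_hits = store.query(lambda v: True)  # combines ROS and WOS
```

Sorting happens once per batch instead of once per insert, which is why moving the whole buffer at a time is efficient.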



Benefits
Statistics from existing deployments show that work with databases is accelerated by up to 1000 times. The average compression ratio compared with other systems is 10:1, and data is loaded for further analysis 10 times faster, approaching real time.

Unlike other solutions on the market, HP Vertica is not tied to a specific hardware platform: users choose the equipment they need. Recommended configurations are available, however.

Vertica was designed from the start for a horizontally scalable environment and is licensed not per processor but by the amount of data loaded into the system, so it integrates easily into cloud environments such as VMware vSphere or Amazon Elastic Compute Cloud. The advantage of a virtualized environment is speed of deployment: all nodes in a Vertica cluster are identical, so a ready-made virtual machine image can be quickly installed on existing equipment.

HP Vertica comes with the Database Designer tool for tuning the system automatically. Vertica offers simple integration and reporting through SQL, JDBC, ODBC, and ADO.NET. There is also a free version, Vertica Community Edition, which lets analysts build their own applications and share experience with the Vertica user community.



Real-world example
One of the largest installations of the Vertica DBMS at the moment is at a company that develops online games for social networks. The system serves about 200 million active players, with up to 40 million playing simultaneously. The daily data stream is 3 TB. A cluster of 200 machines provides instant analysis and delivers information to players in the form of recommendations. The installation runs 24x7x365 without "windows" for data loading, analyzing incoming and historical data in real time. However, this is far from the limit. The largest client is Facebook, with a data volume of several petabytes and a cluster of several hundred nodes. Data can currently be loaded into a cluster at 40 TB per hour.



We distribute HP solutions in Ukraine, Georgia and Tajikistan. For prices and questions, write to abo@muk.ua or send a private message.
Catalog of all solutions and services of the distributor MUK
Authorized Hewlett-Packard Training Courses
Upcoming Hewlett-Packard courses:
February 16-17, 2015, (Kiev, TC MUK) - Infrastructure management through HP OneView
February 11-13, 2015 (Kiev, TC MUK) - HP BladeSystem Virtual Connect
February 23-24, 2015 (Kiev, TC MUK) - Implementing MSA 2000 Storage Solutions
MUK-Service - all types of IT repair: warranty, non-warranty repair, sale of spare parts, contract service

Source: https://habr.com/ru/post/249715/

