Simple and accessible on analytical databases

Interest in Big Data technologies is constantly growing, and the term is becoming increasingly popular, many people want to talk about it, discuss the prospects and opportunities in this area. However, few specify what companies are represented in this market, do not describe the solutions of these companies, and also do not talk about the methods underlying the Big Data solutions. The field of information technology related to the storage and processing of data has undergone significant changes to the present moment and represents a rapidly growing market, which means it is a tasty morsel for many world-famous and small start-up companies in this field. A typical large company has several dozen operational databases that store data on the company's operational activities (transactions, stocks, balances, etc.) that analysts need for business analysis. Since complex, unanticipated requests can lead to unpredictable load on operational databases, analysts' requests for such databases are limited. In addition, analysts need historical data, as well as data from several sources. In order to provide analysts access to data, companies create and maintain so-called data warehouses, which are informational corporate databases designed to prepare reports, analyze business processes and support decision-making systems. Data warehouses also serve as a source for evaluating the effectiveness of marketing campaigns, forecasting, searching for new potential markets and audiences for sale, and all sorts of analysis of previous periods of company activity. As a rule, a data warehouse is a subject-oriented database built on a temporary basis, i.e. all data changes are tracked and recorded over time, which allows to track the dynamics of events. Also, data warehouses store long-term data - this means that they are never deleted or overwritten - only new data is entered, it is necessary to study the dynamics of data changes over time. And finally, data warehouses, in most cases, are consolidated with several sources, i.e. data enters the data warehouse from several sources, and, before it enters the data warehouse, this data is tested for consistency and reliability.

In February 2012, research and consulting company Gartner presented its analytical report for data warehousing. As part of this report, data warehouses are defined as a DBMS that manages and maintains one or more logical databases in the repository. In addition, the data warehouse DBMS must support the relational data model, as well as be able to provide access to data through software interfaces so that third-party analytic applications can use the data stored in the data warehouse. In addition to this, the data warehouse DBMS should have mechanisms that isolate various types of loads from each other, as well as control various user access parameters within the same data instance.

Since one of the Gartner data warehouse requirements in this analytical report was support for the relational data model, the solutions based on the Apache Hadoop platform, respectively, were not included in the number of applicants for consideration in this report. In addition, Apache Hadoop does not quite fit the requirements put forward for modern data warehouses, as it has a low speed of execution of complex analytical queries compared to other solutions presented in this report, and obviously cannot perform the tasks put forward by companies to modern data warehouses. Despite this, the Apache Hadoop platform is a fast-developing project and, perhaps in the future, solutions based on this platform will take their place alongside solutions from other manufacturers. I would also like to note that Apache Hadoop is not designed to analyze data in real time - it is a solution for storing unstructured data with the ability to analyze this data in the future, which does not meet the requirements of companies to be able to quickly analyze data for data warehouses. In addition, the MapReduce paradigm is not suitable for the whole class of analytical tasks that arise before analysts. However, most of the solutions discussed in this article support integration with Apache Hadoop.

According to the results of the analysis carried out in this report, Gardner compiled the so-called Magic Quadrant, where he placed the companies according to the fields of this square.
')

In this article I would like to dwell on the decisions of companies located in the upper right corner of the Gartner magic quadrant. To be more precise, not on the solutions themselves, but on those properties of the solutions that allowed them to become leaders and analyze these properties, as well as assess their applicability in the future.

Before you begin to consider the common features inherent in the leaders in the field of data warehousing, it makes sense to dwell on the solution from Oracle and consider it closer, as this solution, in general, differs from competitors. For data warehouses, Oracle offers the use of its Exadata Database Machine X2-2 hardware platform. However, this is only hardware - Oracle Database licenses are purchased separately. In order to achieve maximum performance, as well as fault tolerance, you need to purchase separately licenses for Oracle Real Application Clusters and partitioning. Oracle claims that, in this form, the data warehouse can use both the OLTP system and the DW solution at the same time. This statement is true in a limited number of cases - often the existence of loads of different types on one platform leads to the non-optimal operation of both. The Exadata architecture implies a division into two types of servers - these are the Exadata Database Server X2-2 compute nodes, where the Oracle Database is located and the Exadata Storage Server nodes that store data. To be more precise, Exadata Storage Server Software is located on Exadata Storage Server, which includes the so-called “Cell Services”, which, in turn, allows the use of “Smart Scan” technology. "Smart Scan" takes up almost all the work on a simple search and sample data, thereby freeing the compute nodes from unnecessary load. This architecture provides the ability for Oracle to load balance, and reduce the amount of data transferred between the two types of nodes, but has a significant drawback, or rather limited scalability, which does not allow the Exadata platform to scale linearly. This leads to the fact that, starting from a certain point, the addition of new nodes does not bring a significant increase in performance, and at the same time is a rather expensive operation, since Oracle hardware and software is expensive. Thus, we can conclude that Oracle data warehousing solutions will not be able to meet the companies' future expectations, although they provide acceptable performance for the current moment. The fact that Oracle is in second place after Teradata is caused, in my opinion, by the aggressive marketing policies of Oracle, and also by the fact that many customers already using Oracle decided to consolidate their databases on the Oracle hardware platform, thereby reducing maintenance costs and support without losing at the same time on performance. Therefore, Oracle does not have the features that other leaders have in this review, but despite this, extensive experience in database development and a huge number of customers made it possible to be in the sector of leaders.

Overview of Oracle Exadata and the discovery of its shortcomings from the company Teradata

The same is true for Microsoft, the SQL Server 2008 R2 DBMS Business Data Warehouse and Fast Track Data Warehouse DBMS solutions on the market. In fact, these solutions do not have sufficient technical functionality in comparison with their competitors, while at the same time offering a lot of marketing "chips". Microsoft sales are largely based on the dumping policy, as well as the game that there are a lot of specialists familiar with these solutions in the market, their training should not take much time, quality and availability of manufacturer support. Also important is the fact that Microsoft solutions can be quickly and efficiently integrated into the existing infrastructure of most companies. With a large number of customers, as well as an extensive network of partners and resellers, Microsoft has managed to achieve good sales by now. Despite this, existing Microsoft solutions are lagging behind their competitors technologically, which leads to the idea that they will not be able to compete with solutions from other manufacturers in the future. In addition to this, Microsoft does not change its tradition and releases products with a large number of errors, while delaying the correction of these errors for a long time, which also does not contribute to the distribution of Microsoft products. The Microsoft PDW solution, released in November 2010, does not have a single review as a system used for industrial operation. This suggests that no one was able to make this solution work more or less stable, providing the current needs of companies.

However, in this article I would like to consider exactly those features that allowed the decisions of other leaders to be ahead, and also hope that in the future they will be able to meet the ever-growing needs of companies.

Before proceeding to the review, it would be good to understand which of the solutions is supplied in the form of software, and which in the form of a pre-configured hardware-software complex. Pre-configured hardware and software complex eliminates the procedure of configuration and configuration, reducing the time required for its implementation, but its cost is significantly higher. The solution, which comes in the form of software, allows you to choose the hardware platform itself and not overpay for the equipment, and also makes it possible to easily increase performance and capacity by adding new servers in the future. In this list, you can highlight IBM Netezza, which comes in the form of proprietary equipment, namely on the blade servers of IBM. This means dependence on the manufacturer, to be exact from IBM, i.e. if the need arises expansion, you will have to contact IBM and buy additional equipment and licenses from it. In addition, the expansion process for IBM Netezza is non-trivial and requires a long migration process, during which the system must be in downtime - this leads to downtime and loss of time. A note on vendor dependency is also valid for Oracle and Teradata solutions. Some solutions can be supplied both in hardware form and in software:

DWH	Software	Hardware
EMC Greenplum	X	X
HP Vertica	X
IBM InfoSphere Warehouse (DB2)	X
Ibm netezza		X
Microsoft SQL Server 2008 DBMS R2 BDW		X
Microsoft SQL Server 2008 DBMS R2 PDW		X
Microsoft SQL Server 2008 DBMS R2 FTDW		X
Oracle exadata		X
Sybase iq	X
Teradata Database 14		X

Here it is necessary to clarify that Microsoft solutions are in fact different types of the same progenitor - Microsoft SQL Server 2008 DBMS R2 and, in general, differ only in the hardware platform on which they are delivered (different servers from HP and Dell), as well as the addition of some software features. IBM InfoSphere Warehouse is a DB2 10 LUW (Linux, Unix, Windows) with the DPF (Database Partitioning Feature) capabilities, respectively, has the same capabilities as DB2 10, but in addition to them can be deployed using the MPP architecture.

MPP architecture

Mass-parallel architecture (Massive Parallel Processing, MPP) is a class of parallel computing systems consisting of a set of nodes, where each node is an autonomous unit independent of others. If this definition is applied to the data storage area, then its meaning will best reflect the term “distributed databases”. Each node in a distributed database is a full-fledged DBMS that operates independently of the others. The distributed database itself is a collection of independent, autonomous nodes connected by a communication network. All data in such a network is distributed evenly among the nodes, i.e. Each node stores its own unique part of the data, logically, nevertheless, representing a single database. For a user, a distributed database looks like a single, unbroken database. When accessing a distributed DBMS, a query is executed in parallel by all database nodes, performing search and selection only in its own unique piece of data, which can significantly increase the speed of access to data. The advantages of this architecture are obvious - it is linear scalability, which provides stable and predictable performance parameters and system development. The trend in data warehousing regarding the architecture is such that in the future only solutions based on the MPP architecture will remain, as they allow you to process huge amounts of information on standard hardware. The disadvantage is also obvious - in order to comply with the functional requirements of ACID, the system must receive a response from each node, therefore the communication network between nodes must have high bandwidth as well as fault tolerance. In reality, for modern data warehouses, existing data transfer standards are enough (10 Gigabit), and with the introduction of a standard of 100 Gigabit, the problem of transfer speed between nodes will completely disappear, since the disk subsystem will be a bottleneck (also not a fact, because In-memory DBs use RAM to store data).

Teradata was the pioneer in the MPP architecture for data warehousing, which introduced the first system with such an architecture in 1984. Since then, Teradata has come a long way, has accumulated a lot of useful experience in this field and is deservedly a leader. In the 2000s, solutions with similar architecture from other manufacturers appeared - these are EMC Greenplum, HP Vertica, Sybase IQ, IBM Netezza, IBM InfoSphere Warehouse and Microsoft PDW. It should be noted that the architecture of MPP can exist in two qualities - it is shared nothing and shared everything. In the first case, each node does not share system resources with other nodes, allocating and using the resources it needs independently. In the second case, the node uses shared resources, referring to a mechanism for obtaining the necessary resources. Each approach has its own advantages and disadvantages - shared nothing does not allow utilizing all resources efficiently enough, leaving some of them to stand idle in any case. Shared everything utilizes resources more efficiently, however, it requires an interprocess negotiation mechanism when using shared resources, which results in additional time required for such coordination when allocating resources. The system resources here are RAM and disk subsystem. Oracle and Microsoft went their own way, using the SMP (Symmetric Multiprocessing) architecture in their designs. This architecture is characterized by the fact that several processes work with a single database. To increase the speed of access to data, Oracle and Microsoft recommend using partitioning, dividing the table into several logical sections, and thus dividing the data into smaller pieces. This approach helps if we need to select a small part of the data in the table, but it is inefficient if we have to use most of the data contained in the table (so-called Full Scan) for analysis, which is traditional for data warehouses with a star or a snowflake Is inevitable. For a great understanding of the principles on which each decision is based, I will give a table:

DWH	SMP	MPP
DWH	SMP	Shared nothing	Shared everything
EMC Greenplum		X
HP Vertica		X
IBM InfoSphere Warehouse (DB2)		X
Ibm netezza		X
Microsoft SQL Server 2008 DBMS R2 BDW	X
Microsoft SQL Server 2008 DBMS R2 PDW		X
Microsoft SQL Server 2008 DBMS R2 FTDW	X
Oracle exadata	X
Sybase iq			X
Teradata Database 14		X

As can be seen from the table, most manufacturers prefer to use a more productive version of shared nothing, despite the fact that it does not utilize part of the system resources, allowing them to stand idle. It is the MPP architecture that is most suitable for data warehouses, since it provides linear scalability, allowing you to increase the capacity of the data warehouse in accordance with the ever-growing requirements. However, a number of solutions located in the upper right quadrant of the Gartner Magic Quadrant are characterized by another property that helps increase data access speed and, accordingly, provide them to analysts as quickly as possible.

Column storage

The idea of storing data not in rows, but in columns is not new and belongs to Sybase, which released Sybase IQ in 1996 - the world's first database that supports column data storage. The basis for this idea was that the load on the data warehouse and operational databases is radically different in nature. OLTP , , (insert/update), , , . , OLTP , , . , , , 10 , , , 10 , , 7 . , , , , . , , , . DWH – Oracle, Microsoft IBM - , , - . , . , , EMC Greenplum Teradata, , .. , . , , OLTP, .. «/» .

In order to understand which data storage model each solution uses, I’ll give a table:

DWH	Row-oriented	Column-oriented
EMC Greenplum	X	X
HP Vertica		X
IBM InfoSphere Warehouse (DB2)	X
Ibm netezza	X
Microsoft SQL Server 2008 DBMS R2 BDW	X
Microsoft SQL Server 2008 DBMS R2 PDW	X
Microsoft SQL Server 2008 DBMS R2 FTDW	X
Oracle exadata	X
Sybase iq 15		X
Teradata Database 14	X	X

, Sybase IQ . Sybase IQ , , . Sybase IQ 15 , Fast Projection Default Index. , Sybase IQ, 10 . , , , , .. . , , Sybase IQ , , , . , , Sybase IQ , .

HP Vertica, (projections), , Oracle, , . HP Vertica, , , , , . , HP Vertica, , , , – , . , , , HP Vertica . , HP Vertica .

, HP Vertica, . , , HP Vertica – Write-Optimized Column Store (WOS) Read-Optimized Column Store (ROS). , ROS WOS, Tuple Mover , ROS. , , . , HP Vertica – , .. – , WOS, , Tuple Mover, ROS.

– external tables, , . Oracle Exadata, IBM InfoSphere Warehouse, IBM Netezza, EMC Greenplum Sybase IQ. EMC Greenplum , Oracle Exadata - SMP . , – IBM Netezza (Zone Maps), — Sybase IQ, IBM DB2 , , .

Microsoft SQL Server-a SQL Server Integration Services (SSIS), , , Data Transformation Services (DTS), SQL Server . , , «Integration Services Package», . DTS Microsoft SQL Server 2000, , ( «», ), .

/ , Teradata. Teradata Parallel Transporter, , / , , .

Gartner . , . , , , , SQL. , , :

HP Vertica ,

IBM Netezza , Full Scan, , -

. , , . , , , .

, . Teradata EMC Greenplum. MPP shared nothing, ( ), . , , , , .

Teradata - Oracle Exadata, .. . Teradata , , . Teradata , Teradata , .

EMC Greenplum , - , HP Vertica IBM Netezza, , , Greenplum , HP IBM. EMC, , Greenplum , . , , . EMC Big Data, , - – Data Science Summit 2012, EMC -. EMC Big Data , , . , EMC – , , Greenplum, . , , Greenplum , , Teradata, .

Gartner DWH Report

Source: https://habr.com/ru/post/149641/

All Articles

Simple and accessible on analytical databases

MPP architecture

Column storage

More articles: