
Oracle vs Teradata vs Hadoop

This article focuses on Large and Very Large data warehouses, but for a complete picture, small and medium ones are also mentioned in the classification.

The article is written for professionals for whom the main criterion in working with databases is speed. It deals with systems built around the brute-force full scan (the Oracle folks have already tensed up, while the Teradata folks are rejoicing).

Let's look at which data volumes and which workloads Oracle, Teradata, and Hadoop/NoSQL are each best suited for.

1) On small volumes, it is more cost-effective to use NoSQL without Hadoop, if you begrudge the $900 for Oracle SE One. Its main advantage is price: NoSQL databases, as a rule, are free. A small amount of data also implies a simple data model and little in-database development.
2) On medium and large volumes, Oracle has significant advantages over Teradata and Hadoop. Its main strengths are:
1. Very high maturity of the technology and product, and the number of deployments, compared to Hadoop.
2. A very rich set of features that significantly simplifies and speeds up development compared to both.
3. I suspect Oracle is cheaper to operate than Hadoop, given the cost of server hosting and electricity.
4. Price, compared to Teradata. If you do not buy Exadata but build your own server, I think the price difference from Hadoop will not be huge.

Oracle scales well, but it has a bottleneck: the storage subsystem, which is shared by everything. So up to a certain limit, Oracle shows some of the best processing speeds.

The fastest self-built storage arrays I have seen deliver 18 GB/s (though I am sure faster ones exist). An Exadata Full Rack, thanks to custom tuning of the entire hardware and software stack, delivers 25 GB/s.

However, it often happens that even Oracle's full scan performance is not enough.

Let me explain with an example. In 2007 at Beeline, 170 million records a day landed in one table: every call across all of Russia. Analyzing such a table directly, or "running over it" in database slang, is unrealistic; no amount of hard drive throughput is enough. In such cases an optimization is applied: on top of this fact table, several large aggregates of about 4 million records a day are built. And on top of these large aggregates, many smaller aggregates are built for specific tasks and reports. This kind of optimization can be done on Oracle, on Teradata, and on Hadoop; the idea is sketched below.
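
Engine aside, the first-level rollup is just a grouped summation over the day's detail records. A minimal sketch in Python (file names and column layout are assumptions for illustration; in practice this would be SQL on Oracle/Teradata or a MapReduce job on Hadoop):

```python
# Roll the daily fact table (one call per line) up into a daily aggregate:
# (region, tariff) -> number of calls and total duration.
import csv
from collections import defaultdict

totals = defaultdict(lambda: [0, 0])  # key -> [call_count, total_seconds]

with open("calls_2007_06_01.csv", newline="") as f:  # ~170M detail rows
    for row in csv.DictReader(f):                    # assumed columns
        key = (row["region"], row["tariff"])
        totals[key][0] += 1
        totals[key][1] += int(row["duration_sec"])

with open("agg_calls_2007_06_01.csv", "w", newline="") as f:  # ~4M aggregate rows
    writer = csv.writer(f)
    writer.writerow(["region", "tariff", "calls", "duration_sec"])
    for (region, tariff), (calls, seconds) in sorted(totals.items()):
        writer.writerow([region, tariff, calls, seconds])
```

The smaller, report-specific aggregates are then built from this output in exactly the same way.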

This scheme has three drawbacks:
1. If business users need a new field that is not in the aggregates, adding it is a very long development process: the field has to be pushed through every aggregate in the chain.
2. Not all ad-hoc reports are possible on such a system. And the whole point of ad-hoc is that a report is needed here and now; if it cannot be produced, that is either a loss for the company, or by the time the answer arrives the question is already stale and no longer needed.
3. Very complex ETL.

It is precisely to address these drawbacks that Hadoop or Teradata can be applied.

3) On extra large volumes, Hadoop can be used.
This technology has two advantages:
1. Almost unlimited linear scalability. You can provision 25, 125, or 1000 gigabytes per second.
2. Price: everything is free. Except the hardware, of course.

Disadvantage:
1. Writing MapReduce procedures is usually laborious, so ad-hoc queries will never be as simple as in SQL (see the sketch below).
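
For comparison, here is roughly what a trivial "total call duration per region" query costs in MapReduce, written as a hypothetical Hadoop Streaming job in Python (the field layout and paths are assumptions); in SQL it is a one-line GROUP BY:

```python
#!/usr/bin/env python3
# In SQL: SELECT region, SUM(duration) FROM calls GROUP BY region;
# As a Hadoop Streaming job, e.g.:
#   hadoop jar hadoop-streaming.jar -input /data/calls -output /data/out \
#       -mapper "job.py map" -reducer "job.py reduce" -file job.py
import sys

def mapper():
    # Tab-separated call records; assume field 2 = region, field 5 = duration.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        print(f"{fields[2]}\t{fields[5]}")

def reducer():
    # Hadoop sorts mapper output by key, so equal regions arrive together.
    current, total = None, 0
    for line in sys.stdin:
        region, duration = line.rstrip("\n").split("\t")
        if region != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = region, 0
        total += int(duration)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```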

I did not compare the performance of Oracle and Hadoop on the same hardware, but I think Hadoop would lose to Oracle significantly. If we consider disk speed alone: Exadata delivers 25 GB/s, while an ordinary office 7.2K drive delivers about 100 MB/s, so you would need 250 ordinary computers to match it. A typical computer costs 20 thousand rubles and consumes about 200 watts; Exadata consumes 7600 watts. Hadoop, it turns out, is very unprofitable in terms of electricity, and that is without even counting the fact that in Exadata everything is doubly redundant.
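
The back-of-the-envelope arithmetic behind those figures:

```python
# Matching Exadata's scan rate with single-disk commodity machines.
exadata_mbps = 25 * 1000            # 25 GB/s in MB/s
disk_mbps = 100                     # one office 7.2K drive
computers = exadata_mbps // disk_mbps
print(computers)                    # 250 machines just to equal the throughput

cluster_watts = computers * 200     # ~200 W per machine
print(cluster_watts)                # 50000 W for the DIY cluster
print(round(cluster_watts / 7600, 1))  # ~6.6x Exadata's 7600 W
```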

4) On extra large volumes, Teradata.
Teradata copes much better with coarse data-processing methods such as the full scan. Teradata follows a shared-nothing ideology, very similar to Hadoop/NoSQL: the data lies on a set of servers, and each server processes its own part independently. But Teradata has a significant drawback: a rather poor toolkit. It is inconvenient to work with, and compared to Oracle it is not as mature a product. As for price, a full cabinet of Teradata and an Exadata Full Rack cost about the same, around $5 million.

I will also mention a shortcoming common to Teradata and Hadoop: the data must somehow be distributed across the nodes. The distribution key can be a natural key, that is, a business key, or a surrogate one. Time is not suitable here; this is not partitioning. The incoming data must land evenly across all the nodes. Region, for example, is a bad attribute for Beeline: Moscow alone accounts for 30%. So it has to be either some kind of surrogate key or a hash key, as the sketch below illustrates.
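
A toy model of why the distribution key matters (illustrative Python, not vendor code; the 30% Moscow share is taken from the text, the rest of the distribution is assumed):

```python
# Distributing by region skews the load; hashing a subscriber id does not.
import random
from collections import Counter

NODES = 30
# Moscow ~30% of traffic, 70 other regions ~1% each (assumed).
regions = ["Moscow"] * 30 + [f"Region{i:02d}" for i in range(70)]

calls = [(random.choice(regions), subscriber_id)
         for subscriber_id in range(100_000)]

by_region = Counter(hash(region) % NODES for region, _ in calls)
by_hash = Counter(hash(sub_id) % NODES for _, sub_id in calls)

print(max(by_region.values()))  # ~30000 rows pile up on Moscow's node
print(max(by_hash.values()))    # ~3334 rows per node, i.e. an even spread
```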

Teradata's advantage is that it effectively has triple partitioning, while Oracle has double. When a single partition holds 170 million rows, this matters a great deal: if you split those 170 million into 85 subpartitions by region, and in Teradata additionally across 30 nodes, the final data slice can be computed very quickly (the arithmetic is spelled out below).
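
In the article's own numbers (a simplified model; real pruning depends on the query):

```python
# How much each extra slicing level helps.
rows_per_day = 170_000_000      # one daily partition of the fact table
region_parts = 85               # subpartitions by region
nodes = 30                      # Teradata nodes

print(rows_per_day // region_parts)           # 2000000 rows after region pruning
print(rows_per_day // region_parts // nodes)  # 66666 rows per node, in parallel
```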

Teradata's limit:
Thanks to its shared-nothing technology and the BYNET V5 interconnect, Teradata can scale up to 2048 nodes at 76 TB (10K drives) per node, for a total of 234 PB. A single Exadata rack is only 672 TB (7.2K) or 200 TB (15K). Scaling Exadata out is not particularly beneficial: the disk space is shared by everyone! And if you combine the disk space of two racks (whether Exadata even allows this, I do not know), everything will run into the performance of the 40-gigabit network between the racks. More precisely, rack 1 will have fast, wide access to its own drives but slow access to rack 2's drives, and vice versa.

It should also be kept in mind that Teradata and Exadata have columnar, hybrid compression, averaging up to 4-6x. NoSQL databases have compression too, but probably not as effective as in these monsters, whose development cost a great deal of money.
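
What that means for effective scan speed, roughly (a simplification: compressed pages are read at the same physical rate but hold more logical data):

```python
# Logical scan rate implied by 4-6x compression on 25 GB/s of physical I/O.
raw_gbps = 25
for ratio in (4, 6):
    print(f"{ratio}x -> ~{raw_gbps * ratio} GB/s of uncompressed data per second")
```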

To complete the picture, it is worth mentioning that:
Oracle has two cache levels: RAM and SSD flash cards.
Teradata has one level, memory, but it has its own know-how: temperature-based data storage.
Thanks to the second cache level and the absence of MPP, Exadata is much better suited for OLTP loads.

Conclusion: if you have no ad-hoc queries, all queries are known in advance, and your data does not exceed 600 TB, take Oracle; it is very convenient to work with. If you have more, take Teradata or Hadoop.
If you have more than 100 TB of data and many ad-hoc queries, take Teradata or Hadoop.

P.S. I wanted to add the Oracle + Lustre combination to the article, but I realized that it gives Oracle nothing: everything again runs into the performance of the 40-gigabit network.

Source: https://habr.com/ru/post/235465/

