
You don't need Hadoop - you just don't have that much data

I was asked in an interview: "How much experience do you have with big data and Hadoop?" I replied that I use Hadoop regularly, but rarely with data volumes larger than a few TB. I am essentially a big data novice - I understand the ideas and have written the code, but never at serious scale.

The next question was: "Can you do a simple group-and-sum in Hadoop?" Of course I can, and I asked for an example of the data format.

They handed me a flash drive with all 600 MB of their data (yes, all of it, not a sample). For reasons I don't understand, they were unhappy with my solution, which used pandas.read_csv and no Hadoop at all.
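In fairness, the whole solution fit in a few lines of Python. A minimal sketch, with an invented file name and column names (the real format was theirs):

    import pandas as pd

    # Load all 600 MB at once - it fits in memory with room to spare.
    df = pd.read_csv("data.csv")

    # The same group-and-sum they wanted from Hadoop, in one line.
    totals = df.groupby("customer_id")["amount"].sum()
    totals.to_csv("totals.csv")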

Hadoop restricts you. It allows only one kind of computation, which in pseudocode looks like this:
Scala: collection.flatMap( (k,v) => F(k,v) ).groupBy( _._1 ).map( _.reduce( (k,v) => G(k,v) ) )

SQL: SELECT G(...) FROM table GROUP BY F(...)

A few years ago I explained it this way:
Goal: count the number of books in the library.

Map: you count the odd-numbered shelves, and I count the even-numbered ones (the more people we have, the sooner we finish this part).

Reduce: we get together and add up the counts each of us collected.

The only things you can change in this computation are F(k,v) and G(k,v). Well, you can also tune the performance of the intermediate steps (and that is not the most pleasant process). Everything else is strictly fixed.
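To make that skeleton concrete, here is a toy in-memory version of the computation in Python - the only moving parts really are F and G (the library data below is invented):

    from functools import reduce
    from itertools import groupby
    from operator import itemgetter

    def map_reduce(records, F, G):
        # Map: F turns each record into zero or more (key, value) pairs.
        pairs = sorted((kv for rec in records for kv in F(rec)),
                       key=itemgetter(0))
        # Shuffle + Reduce: group the pairs by key, fold each group with G.
        return {key: reduce(G, (v for _, v in grp))
                for key, grp in groupby(pairs, key=itemgetter(0))}

    # The library example: each record is (library, books_on_shelf).
    shelves = [("main", 12), ("main", 7), ("annex", 5)]
    print(map_reduce(shelves, lambda rec: [rec], lambda a, b: a + b))
    # {'annex': 5, 'main': 19}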

Hadoop forces every computation through transformation, grouping, and aggregation (chains of such steps are also allowed). Many computations simply do not fit this Procrustean bed. The only reason to endure it is that it lets you scale computations to enormous volumes of data. Most likely, your data is several orders of magnitude smaller. But because "Hadoop" and "big data" are buzzwords, half the world stubbornly squeezes itself onto this bed without any need.

But I have hundreds of megabytes of data! Excel won't even load that!


Too much for Excel is not yet big data. There are many great tools for data this size. My favorite is Pandas, built on top of NumPy. You can load hundreds of megabytes into memory in a format convenient for fast processing. On my three-year-old laptop, NumPy multiplies 100,000,000 floating-point numbers in the blink of an eye. Matlab and R also work well for this purpose.
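Don't take the blink of an eye on faith - here is a quick timing sketch you can run yourself (100,000,000 doubles occupy about 800 MB, so you need a few GB of free RAM):

    import time
    import numpy as np

    x = np.random.rand(100_000_000)   # ~800 MB of float64 values
    t0 = time.perf_counter()
    y = x * 1.0000001                 # elementwise multiply of 10^8 floats
    print(f"{time.perf_counter() - t0:.3f} s")  # a fraction of a second on a laptop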

In addition, hundreds of megabytes often yield to a simple Python script that reads a file line by line, processes it, and writes the results to another file.
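For instance, a sketch of such a script, filtering a hypothetical tab-separated log into a CSV - only one line is held in memory at any moment:

    # Stream a large log file line by line; the file names and the
    # "ERROR" predicate are placeholders for whatever you need.
    with open("events.log") as src, open("errors.csv", "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "ERROR":
                dst.write(",".join(fields) + "\n")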

And I have 10 GB!


I just bought a new laptop. Its 16 GB of RAM cost me $141.98, and a 256 GB SSD another $200 (preinstalled by Lenovo). Besides, if you load a 10-gigabyte CSV into Pandas, it will often take noticeably less memory, because numeric strings like "17284932583" are stored as 4- or 8-byte integers, and "284572452.2435723" as floating-point numbers. In the worst case, you simply won't be able to fit everything in memory at once.
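And even that worst case is a loop, not a cluster: pandas.read_csv takes a chunksize parameter that streams the file piece by piece. A sketch with made-up column names:

    import pandas as pd

    # Declaring dtypes up front skips type inference and saves memory;
    # chunksize streams the 10 GB file a million rows at a time.
    total = 0.0
    for chunk in pd.read_csv("big.csv",
                             dtype={"user_id": "int64", "amount": "float64"},
                             chunksize=1_000_000):
        total += chunk["amount"].sum()
    print(total)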

And I have 100 GB / 500 GB / 1 TB!


A 2 TB hard drive costs $94.99, a 4 TB one $169.99. Buy one and put Postgres on it.
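Getting the data in there is a few lines too. A sketch assuming SQLAlchemy, a local database named mydb, and made-up table and column names:

    import pandas as pd
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://localhost/mydb")  # assumed connection string

    # Stream the CSV into Postgres without holding it all in memory.
    for chunk in pd.read_csv("events.csv", chunksize=500_000):
        chunk.to_sql("events", engine, if_exists="append", index=False)

    # The index is what makes later queries fast.
    with engine.begin() as conn:  # begin() commits the DDL on exit
        conn.execute(text("CREATE INDEX idx_events_user ON events (user_id)"))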

Hadoop << SQL and Python


In terms of expressing your computations, Hadoop is strictly inferior to SQL. There is no computation expressible in Hadoop that could not be written more easily either in SQL or as a simple Python script that reads files line by line.

SQL is a simple query language with minimal abstraction leakage, widely used by programmers and business analysts alike. Queries in SQL are usually quite simple, and they usually run very fast - if your database is properly indexed, a query rarely needs more than a second.

Hadoop has no concept of indexes; it can only do full table scans. And Hadoop is full of leaky abstractions - at my last job I spent more time fighting Java out-of-memory errors and file fragmentation than on the (fairly simple) analysis I actually wanted to do.

If your data is not structured like a SQL table (for example: plain text, JSON, binary files), it is usually enough to write a small Python or Ruby script that processes it line by line. In cases where SQL is a poor fit, Hadoop will be less annoying from a programming point of view, but it still holds no advantage over simple scripts.
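For example, a sketch of aggregating a JSON-lines file - the field names "user" and "ms" are invented for illustration:

    import json

    # Sum request time per user, reading one JSON record per line.
    totals = {}
    with open("requests.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            totals[rec["user"]] = totals.get(rec["user"], 0) + rec["ms"]
    print(totals)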

Besides being harder to program, Hadoop is almost always slower than the simple alternatives. SQL queries can run very fast thanks to sensible use of indexes: to compute a join, PostgreSQL simply looks up the needed key in the index (if there is one, of course). Hadoop requires a full table scan followed by a full sort. The sort can be sped up by splitting the data across several machines, but then you have to ship the data to them. When processing binary data, Hadoop needs repeated trips to the NameNode to locate it, just as a Python script needs repeated trips to the file system.

And I have more than 5 TB of data!


Well, you have my sympathy - here you really can't get by without Hadoop. Few other reasonable options exist (though large servers with many hard drives may still be in the game), and most of them cost far more.

The only advantage of Hadoop is scaling. If you have a single table holding many terabytes of data, Hadoop may be a good option. If you don't, avoid it like the plague. It isn't worth it: you'll get the same results with less effort and in less time using traditional methods.
P.S. Advertising

I have a startup whose goal is to provide data analysis (both big and small), real-time recommendations, and optimization for publishers and e-commerce sites. If you want to become a beta user, write to stucchio@gmail.com.

Also, I am a consultant. If your company needs a strategy for working with big data in the cloud, I can help. But fair warning: it's quite possible I'll show you Pandas and suggest a comparison instead of pushing Hadoop right away.
P.P.S. Hadoop is a great tool.

I have nothing against Hadoop. I regularly use it for tasks that would be hard without it. (Tip: I prefer Scalding to Hive and Pig. It lets you write in Scala, a good programming language, and makes it easy to chain jobs without hiding the underlying MapReduce.) Hadoop is an excellent tool that makes certain trade-offs and targets certain usage scenarios. I just hope I've convinced you to think hard before spinning up a Hadoop cluster in the cloud to handle 500 MB of Big Data.

Source: https://habr.com/ru/post/194434/
