
Hive vs Pig: why do I have so many ETL jobs?


"Better to lose a day, and then fly there in five minutes." (c)



Hello, colleagues.
I want to share some thoughts on the differences between Hive and Pig, two frameworks from the Hadoop ecosystem. They are in fact very similar products with the same goal: to take over the entire technical implementation of MapReduce and, in exchange, let you describe data processing at a more abstract level. In this article we will see what queries look like in each system and try to understand in which cases one or the other is the better choice.

Hive


So let's start with Hive. Its main feature is HQL (Hive query language), a SQL-like query language. It lets you work with data the usual way, as if we were dealing with an ordinary relational database. Queries can be run interactively in the Hive shell or as scripts from the command line.

A simple example:
 -- Describe the input data: an external table over the bid log directory
 CREATE EXTERNAL TABLE win_bids_log (
     date_field string,
     request_id string,
     user_ssp_id string,
     dsp_id string,
     win_price int
 )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 LOCATION 'hdfs://input/bid-logs';

 -- Describe the output table
 CREATE EXTERNAL TABLE win_bids_by_dsp (
     dsp_id string,
     win_bids_count int,
     win_price int
 )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE
 LOCATION 'hdfs://output/win-bids-by-dsp';

 -- Count the wins and sum the win prices per DSP
 INSERT OVERWRITE TABLE win_bids_by_dsp
 SELECT dsp_id, COUNT(dsp_id), SUM(win_price)
 FROM win_bids_log
 GROUP BY dsp_id;

As you can see, everything is quite simple and clear, and it is pleasant to write queries in a familiar language. But this happiness lasts only until you have to deal with more complex queries.

A more complex example:

 INSERT OVERWRITE TABLE movieapp_log_stage
 SELECT * FROM (
     SELECT custid, movieid,
            CASE WHEN genreid > 0 THEN genreid ELSE -1 END genreid,
            time,
            CAST((CASE recommended WHEN 'Y' THEN 1 ELSE 0 END) AS INT) recommended,
            activity,
            CAST(null AS INT) rating,
            price
     FROM movieapp_log_avro
     WHERE activity IN (2, 4, 5, 11)
     UNION ALL
     SELECT m1.custid, m1.movieid,
            CASE WHEN m1.genreid > 0 THEN m1.genreid ELSE -1 END genreid,
            m1.time,
            CAST((CASE m1.recommended WHEN 'Y' THEN 1 ELSE 0 END) AS INT) recommended,
            m1.activity,
            m1.rating,
            CAST(null AS FLOAT) price
     FROM movieapp_log_avro m1
     JOIN (
         SELECT custid, movieid,
                CASE WHEN genreid > 0 THEN genreid ELSE -1 END genreid,
                MAX(time) max_time,
                activity
         FROM movieapp_log_avro
         GROUP BY custid, movieid, genreid, activity
     ) m2
     ON (m1.custid = m2.custid
         AND m1.movieid = m2.movieid
         AND m1.genreid = m2.genreid
         AND m1.time = m2.max_time
         AND m1.activity = 1
         AND m2.activity = 1)
 ) union_result;


Of course you can figure it out, but you have to admit that the query clearly lacks structure. It would be nice to lay everything out step by step, with comments, wouldn't it?
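For contrast, here is roughly the layout we are after, sketched in Pig Latin (the framework discussed below). This is only a fragment, not a full port of the query above, and the path and field types are assumed for illustration:

 -- a sketch: each stage gets its own name and comment
 log = LOAD '/data/movieapp_log' USING PigStorage('\t')
       AS (custid:int, movieid:int, genreid:int, time:chararray,
           recommended:chararray, activity:int, rating:int, price:float);
 -- keep only the activities of interest
 views = FILTER log BY activity == 2 OR activity == 4 OR activity == 5 OR activity == 11;
 -- normalize genreid and recommended, exactly as the CASE expressions above do
 norm = FOREACH views GENERATE custid, movieid,
        (genreid > 0 ? genreid : -1) AS genreid, time,
        (recommended == 'Y' ? 1 : 0) AS recommended, activity, price;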

To sum up:

Hive pluses:

- a familiar SQL-like language: anyone who knows SQL can start writing queries right away;
- simple samples are quick to write and easy to read.

Hive cons:

- complex queries turn into monolithic blocks that are hard to read and maintain;
- the processing logic cannot easily be split into separate, commented steps.

Pig



Now let's talk about Pig. It is based on Pig Latin, a procedural data-flow language that takes a little time to get used to.
Let's dig into it and, along the way, find out how it differs from Hive.


In terms of capabilities, Pig can do much the same as Hive. The difference lies in the data model and the language, and it is precisely this difference that puts Pig on a completely different level.

Let's look at Pig in more detail.
The framework works with two special data structures: Tuple (an ordered set of fields) and Bag (a collection of tuples).
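To make the two structures concrete, here is a minimal sketch (the relation name, path and fields are assumed for illustration); DESCRIBE prints the schema, where () denotes a Tuple and {} denotes a Bag:

 -- a sketch: grouping produces a bag of tuples, visible in the schema
 data = LOAD '/tmp/bids' USING PigStorage('\t') AS (dsp_id:chararray, bid:int);
 grouped = GROUP data BY dsp_id;
 DESCRIBE grouped;
 -- prints roughly: grouped: {group: chararray, data: {(dsp_id: chararray, bid: int)}}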


The basic Pig Latin operators are LOAD and STORE for reading and writing data, FOREACH … GENERATE for per-record processing, GROUP for grouping, FLATTEN for unrolling grouped structures, and aggregation functions such as SUM and COUNT. We will walk through each of them below.

Let's look at an example of how data is transformed step by step while working with Pig. We will work with the log of an RTB exchange; the data is a tab-separated file that looks like this:

 2014.02.14 14:08:27.711    56949    45234534553459    DSP-2    12
 2014.02.14 14:08:28.712    61336    45221696259999    DSP-1    56
 2014.02.14 14:08:29.713    74685    45221699381039    DSP-2    89

Pig - load data (LOAD)

The LOAD function is used for loading; we also specify the separator character '\t' and a signature for the data (types can be given where needed).

 -- Clean up the output directory in HDFS (Pig can run Hadoop fs commands)
 fs -rm -f -r -skipTrash /data/pig/out
 -- Load the log into the 'raw_data' relation
 raw_data = LOAD '/data/pig/example/' USING PigStorage('\t')
            AS (time, bid_id, user_id, dsp_id, bid:int);

The output is a set of Tuples with the structure below. In queries, fields are accessed with a dot, for example: raw_data.dsp_id
 raw_data -> one tuple per log record
 -------------------------------------------------------------------
 time, bid_id, user_id, dsp_id, bid
 -------------------------------------------------------------------
 (2014.02.14 14:08:27.711, 56949, 45234534553459, DSP-2, 12)
 (2014.02.14 14:08:28.712, 61336, 45221696259999, DSP-1, 56)
 (2014.02.14 14:08:29.713, 74685, 45221699381039, DSP-2, 89)
 (2014.02.14 14:08:30.714, 56949, 45221695781716, DSP-1, 21)
 (2014.02.14 14:08:25.715, 27617, 45221682863705, DSP-3, 22)
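Since the signature may carry types, here is a hedged variant of the same LOAD with every field typed explicitly (the particular types are assumptions; untyped fields default to bytearray):

 -- a sketch: the same LOAD, but with an explicit type for every field
 typed_data = LOAD '/data/pig/example/' USING PigStorage('\t')
              AS (time:chararray, bid_id:long, user_id:chararray, dsp_id:chararray, bid:int);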

Pig - iterative data processing (FOREACH … GENERATE)
FOREACH … GENERATE lets you iterate over a data set, apply operations to every record, or simply pass through the fields you need and drop everything else.

 -- Trim the timestamp down to a date with SUBSTRING
 norm_data = FOREACH raw_data GENERATE SUBSTRING(time, 0, 10) AS date, dsp_id, bid;

The output is the same set, but with the timestamp trimmed down to a date and only the fields we need: date, dsp_id and bid.

 norm_data -> tuples with a trimmed date and two of the original fields
 ---------------------------------------
 date, dsp_id, bid
 ---------------------------------------
 (2014.02.14, DSP-2, 12)
 (2014.02.14, DSP-1, 56)
 (2014.02.14, DSP-2, 89)
 (2014.02.14, DSP-1, 21)
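FOREACH … GENERATE bodies can also contain expressions, not just field lists. Here is a small sketch using Pig's conditional (bincond) operator to derive a new field; the threshold and bucket names are made up for illustration:

 -- a sketch: derive a price bucket with the ?: (bincond) operator
 bucketed = FOREACH norm_data GENERATE date, dsp_id, bid,
            (bid > 50 ? 'high' : 'low') AS bucket;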

Pig - grouping data (GROUP)
GROUP lets you group data, producing a non-trivial structure at the output.

 -- Group by dsp_id and date
 group_norm_data = GROUP norm_data BY (dsp_id, date);

The output contains:
- the group key, which can be accessed through the group prefix;
- a bag of the grouped records, accessible through the norm_data prefix.
 group_norm_data -> (key): { (norm_data), (norm_data) }
 ----------------------------------------------------------------------------------
 (group), bag of norm_data
 ----------------------------------------------------------------------------------
 ( (DSP-1, 2014.02.14), {(2014.02.14, DSP-1, 56), (2014.02.14, DSP-1, 21)} )
 ( (DSP-1, 2014.02.17), {(2014.02.17, DSP-1, 34), (2014.02.17, DSP-1, 24)} )
 ( (DSP-2, 2014.02.14), {(2014.02.14, DSP-2, 89), (2014.02.14, DSP-2, 12)} )
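To make the two prefixes concrete, here is a hedged sketch that reads the key through group and aggregates the bag through norm_data (COUNT is covered in more detail below):

 -- a sketch: the key is read via the 'group' prefix, the bag via the relation name
 per_group = FOREACH group_norm_data GENERATE group.dsp_id, group.date,
             COUNT(norm_data) AS records;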

Pig - flattening grouped data (FLATTEN)
Sometimes the aggregates need to be unrolled back into a flat, linear structure ("straightened out").
The FLATTEN function does exactly that.

 -- Unroll the grouped structure into a flat one
 ft_group_norm_data = FOREACH group_norm_data GENERATE FLATTEN(group), FLATTEN(norm_data);

From the complex grouped structure we get back a flat set of Tuples.

 ft_group_norm_data -> flat tuples: the flattened key followed by the flattened record
 -----------------------------------------------------------------------
 dsp_id, date, date, dsp_id, bid
 -----------------------------------------------------------------------
 (DSP-1, 2014.02.14, 2014.02.14, DSP-1, 56)
 (DSP-1, 2014.02.14, 2014.02.14, DSP-1, 21)
 (DSP-1, 2014.02.15, 2014.02.15, DSP-1, 15)
 (DSP-1, 2014.02.15, 2014.02.15, DSP-1, 31)
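Note the duplicated columns above: flattening both the key and the bag repeats dsp_id and date. A hedged alternative is to flatten only the key and rename its fields to avoid ambiguity:

 -- a sketch: flatten only the group key and give its fields explicit names
 keys_only = FOREACH group_norm_data GENERATE FLATTEN(group) AS (dsp_id, date),
             COUNT(norm_data) AS records;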

Pig - aggregation functions (SUM)
Let's count something, for example the total daily bids made by each bidder.

 -- For each group, sum up the bids
 sum_bids_dsp = FOREACH group_norm_data GENERATE group, SUM(norm_data.bid) AS bids_sum;


 sum_bids_dsp -> the group key plus the aggregate bids_sum
 ------------------------------------------------------
 group, bids_sum
 ------------------------------------------------------
 ( (DSP-1, 2014.02.16), 82)
 ( (DSP-1, 2014.02.17), 58)
 ( (DSP-2, 2014.02.14), 101)
 ( (DSP-2, 2014.02.16), 58)
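SUM is not the only built-in aggregate; COUNT, AVG, MIN and MAX work on a bag in exactly the same way. A quick sketch over the same grouping (the output field names are illustrative):

 -- a sketch: several aggregates computed over the same grouped relation
 bid_stats = FOREACH group_norm_data GENERATE group, COUNT(norm_data) AS records,
             AVG(norm_data.bid) AS avg_bid, MAX(norm_data.bid) AS max_bid;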

Pig - GROUP ALL
You often need to count the number of records in a sample. Simply applying COUNT to the relation will not work: the data must first be rolled up into a single group, and only then can the aggregation functions be applied.

 -- Roll everything up into a single group,
 -- so that the aggregation functions can be applied to it.
 group_all = GROUP sum_bids_dsp ALL;


The output is a single group with the key "all" and a bag containing all the previous records.
 ( all, { ((DSP-1, 2014.02.14), 77), ((DSP-1, 2014.02.15), 67), ((DSP-1, 2014.02.16), 82),
          ((DSP-1, 2014.02.17), 58), ((DSP-2, 2014.02.14), 101), ((DSP-2, 2014.02.16), 58),
          ((DSP-2, 2014.02.17), 123), ((DSP-3, 2014.02.14), 22), ((DSP-3, 2014.02.15), 109),
          ((DSP-3, 2014.02.16), 136), ((DSP-3, 2014.02.17), 81) } )

Now let's compute the record count and the overall sum:
 summary = FOREACH group_all GENERATE COUNT(sum_bids_dsp), SUM(sum_bids_dsp.bids_sum); 

Output
 ------------------------------------------------------
 count, sum
 ------------------------------------------------------
 (11, 914)
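To round the pipeline out, the result can be written back to HDFS with STORE; presumably this is why the script removed /data/pig/out at the very beginning:

 -- write the final relation into the output directory cleaned up earlier
 STORE summary INTO '/data/pig/out' USING PigStorage('\t');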

In my opinion, this is exactly what we need: the data processing is laid out as an ordered sequence of steps, each of which can be commented.
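Put together, the whole walkthrough reads as a single commented script (assembled from the steps above):

 -- clean the output directory
 fs -rm -f -r -skipTrash /data/pig/out
 -- load the raw RTB log
 raw_data = LOAD '/data/pig/example/' USING PigStorage('\t') AS (time, bid_id, user_id, dsp_id, bid:int);
 -- trim the timestamp down to a date
 norm_data = FOREACH raw_data GENERATE SUBSTRING(time, 0, 10) AS date, dsp_id, bid;
 -- group by bidder and day
 group_norm_data = GROUP norm_data BY (dsp_id, date);
 -- daily sum of bids per bidder
 sum_bids_dsp = FOREACH group_norm_data GENERATE group, SUM(norm_data.bid) AS bids_sum;
 -- roll everything into one group and compute the totals
 group_all = GROUP sum_bids_dsp ALL;
 summary = FOREACH group_all GENERATE COUNT(sum_bids_dsp), SUM(sum_bids_dsp.bids_sum);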

To sum up:

Pig pluses:

- processing is an explicit, ordered sequence of named steps;
- every step can be commented and its intermediate result inspected on its own;
- complex transformations stay readable instead of collapsing into one monolithic query.

Pig cons:

- Pig Latin is yet another language to learn, so there is an entry barrier;
- for people who live in SQL, simple samples take longer to write than in Hive.

Summary:


If you and your colleagues know SQL well, work with it daily, and are not scared off by monstrous queries, then Hive is a great solution. But if you work with SQL only occasionally and your data workflow does not fit into simple queries, then it is definitely worth spending a day to get to grips with Pig. Later on, this can save a lot of time for you and your colleagues.

Source: https://habr.com/ru/post/223217/

