
Hive vs Pig: why do I have so many ETL jobs?


"Better to lose a day, and then fly there in five minutes." (c)



Hello, colleagues.
I want to share some thoughts on the differences between Hive and Pig, two frameworks from the Hadoop ecosystem. They are in fact very similar products with the same goal: to take over the entire technical implementation of MapReduce and, in exchange, let you describe data processing at a more abstract level. In this article we will see what queries look like in each system and try to understand in which cases one or the other is the better choice.

Hive


So let's start with Hive. Its main feature is HQL (Hive query language), a SQL-like query language. It lets you work with data the usual way, as if we were dealing with an ordinary relational database. Queries can be run interactively in the Hive shell or as scripts from the command line.

A simple example:
 -- Describe the input data: an external table over the bid log directory
 CREATE EXTERNAL TABLE win_bids_log (
     date_field string,
     request_id string,
     user_ssp_id string,
     dsp_id string,
     win_price int
 )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 LOCATION 'hdfs://input/bid-logs';

 -- Describe the output table
 CREATE EXTERNAL TABLE win_bids_by_dsp (
     dsp_id string,
     win_bids_count int,
     win_price int
 )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE
 LOCATION 'hdfs://output/win-bids-by-dsp';

 -- Count the wins and sum the win prices per DSP
 INSERT OVERWRITE TABLE win_bids_by_dsp
 SELECT dsp_id, COUNT(dsp_id), SUM(win_price)
 FROM win_bids_log
 GROUP BY dsp_id;

As you can see, everything is quite simple and clear, and it is pleasant to write queries in a familiar language. But this happiness lasts only until you have to deal with more complex queries.

A more complex example:

 INSERT OVERWRITE TABLE movieapp_log_stage
 SELECT * FROM (
     SELECT custid, movieid,
            CASE WHEN genreid > 0 THEN genreid ELSE -1 END genreid,
            time,
            CAST((CASE recommended WHEN 'Y' THEN 1 ELSE 0 END) AS INT) recommended,
            activity,
            CAST(null AS INT) rating,
            price
     FROM movieapp_log_avro
     WHERE activity IN (2, 4, 5, 11)
     UNION ALL
     SELECT m1.custid, m1.movieid,
            CASE WHEN m1.genreid > 0 THEN m1.genreid ELSE -1 END genreid,
            m1.time,
            CAST((CASE m1.recommended WHEN 'Y' THEN 1 ELSE 0 END) AS INT) recommended,
            m1.activity,
            m1.rating,
            CAST(null AS FLOAT) price
     FROM movieapp_log_avro m1
     JOIN (
         SELECT custid, movieid,
                CASE WHEN genreid > 0 THEN genreid ELSE -1 END genreid,
                MAX(time) max_time,
                activity
         FROM movieapp_log_avro
         GROUP BY custid, movieid, genreid, activity
     ) m2
     ON (m1.custid = m2.custid
         AND m1.movieid = m2.movieid
         AND m1.genreid = m2.genreid
         AND m1.time = m2.max_time
         AND m1.activity = 1
         AND m2.activity = 1)
 ) union_result;


Of course you can figure it out, but you have to admit that the query clearly lacks structure. It would be nice to lay everything out step by step, with comments, wouldn't it?
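For contrast, here is roughly the layout we are after, sketched in Pig Latin (the framework discussed below). This is only a fragment, not a full port of the query above, and the path and field types are assumed for illustration:

 -- a sketch: each stage gets its own name and comment
 log = LOAD '/data/movieapp_log' USING PigStorage('\t')
       AS (custid:int, movieid:int, genreid:int, time:chararray,
           recommended:chararray, activity:int, rating:int, price:float);
 -- keep only the activities of interest
 views = FILTER log BY activity == 2 OR activity == 4 OR activity == 5 OR activity == 11;
 -- normalize genreid and recommended, exactly as the CASE expressions above do
 norm = FOREACH views GENERATE custid, movieid,
        (genreid > 0 ? genreid : -1) AS genreid, time,
        (recommended == 'Y' ? 1 : 0) AS recommended, activity, price;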

To sum up:

Hive pluses:

- a familiar SQL-like language: anyone who knows SQL can start writing queries right away;
- simple samples are quick to write and easy to read.

Hive cons:

- complex queries turn into monolithic blocks that are hard to read and maintain;
- the processing logic cannot easily be split into separate, commented steps.

Pig



Now let's talk about Pig. It is based on Pig Latin, a procedural data-flow language that takes a little time to get used to.
Let's dig into it and, along the way, find out how it differs from Hive.


In terms of capabilities, Pig can do much the same as Hive. The difference lies in the data model and the language, and it is precisely this difference that puts Pig on a completely different level.

Let's look at Pig in more detail.
The framework works with two special data structures: Tuple (an ordered set of fields) and Bag (a collection of tuples).
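To make the two structures concrete, here is a minimal sketch (the relation name, path and fields are assumed for illustration); DESCRIBE prints the schema, where () denotes a Tuple and {} denotes a Bag:

 -- a sketch: grouping produces a bag of tuples, visible in the schema
 data = LOAD '/tmp/bids' USING PigStorage('\t') AS (dsp_id:chararray, bid:int);
 grouped = GROUP data BY dsp_id;
 DESCRIBE grouped;
 -- prints roughly: grouped: {group: chararray, data: {(dsp_id: chararray, bid: int)}}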


The basic Pig Latin operators are LOAD and STORE for reading and writing data, FOREACH … GENERATE for per-record processing, GROUP for grouping, FLATTEN for unrolling grouped structures, and aggregation functions such as SUM and COUNT. We will walk through each of them below.

Let's look at an example of how data is transformed step by step while working with Pig. We will work with the log of an RTB exchange; the data is a tab-separated file that looks like this:

 2014.02.14 14:08:27.711    56949    45234534553459    DSP-2    12
 2014.02.14 14:08:28.712    61336    45221696259999    DSP-1    56
 2014.02.14 14:08:29.713    74685    45221699381039    DSP-2    89

Pig - load data (LOAD)

The LOAD function is used for loading; we also specify the separator character '\t' and a signature for the data (types can be given where needed).

 -- Clean up the output directory in HDFS (Pig can run Hadoop fs commands)
 fs -rm -f -r -skipTrash /data/pig/out
 -- Load the log into the 'raw_data' relation
 raw_data = LOAD '/data/pig/example/' USING PigStorage('\t')
            AS (time, bid_id, user_id, dsp_id, bid:int);

The output is a set of Tuples with the structure below. In queries, fields are accessed with a dot, for example: raw_data.dsp_id
 raw_data -> one tuple per log record
 -------------------------------------------------------------------
 time, bid_id, user_id, dsp_id, bid
 -------------------------------------------------------------------
 (2014.02.14 14:08:27.711, 56949, 45234534553459, DSP-2, 12)
 (2014.02.14 14:08:28.712, 61336, 45221696259999, DSP-1, 56)
 (2014.02.14 14:08:29.713, 74685, 45221699381039, DSP-2, 89)
 (2014.02.14 14:08:30.714, 56949, 45221695781716, DSP-1, 21)
 (2014.02.14 14:08:25.715, 27617, 45221682863705, DSP-3, 22)
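Since the signature may carry types, here is a hedged variant of the same LOAD with every field typed explicitly (the particular types are assumptions; untyped fields default to bytearray):

 -- a sketch: the same LOAD, but with an explicit type for every field
 typed_data = LOAD '/data/pig/example/' USING PigStorage('\t')
              AS (time:chararray, bid_id:long, user_id:chararray, dsp_id:chararray, bid:int);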

Pig - iterative data processing (FOREACH … GENERATE)
FOREACH … GENERATE lets you iterate over a data set, apply operations to every record, or simply pass through the fields you need and drop everything else.

 -- Trim the timestamp down to a date with SUBSTRING
 norm_data = FOREACH raw_data GENERATE SUBSTRING(time, 0, 10) AS date, dsp_id, bid;

The output is the same set, but with the timestamp trimmed down to a date and only the fields we need: date, dsp_id and bid.

 norm_data -> tuples with a trimmed date and two of the original fields
 ---------------------------------------
 date, dsp_id, bid
 ---------------------------------------
 (2014.02.14, DSP-2, 12)
 (2014.02.14, DSP-1, 56)
 (2014.02.14, DSP-2, 89)
 (2014.02.14, DSP-1, 21)
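FOREACH … GENERATE bodies can also contain expressions, not just field lists. Here is a small sketch using Pig's conditional (bincond) operator to derive a new field; the threshold and bucket names are made up for illustration:

 -- a sketch: derive a price bucket with the ?: (bincond) operator
 bucketed = FOREACH norm_data GENERATE date, dsp_id, bid,
            (bid > 50 ? 'high' : 'low') AS bucket;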

Pig - grouping data (GROUP)
GROUP lets you group data, producing a non-trivial structure at the output.

 -- Group by dsp_id and date
 group_norm_data = GROUP norm_data BY (dsp_id, date);

The output contains:
- the group key, which can be accessed through the group prefix;
- a bag of the grouped records, accessible through the norm_data prefix.
 group_norm_data -> (key): { (norm_data), (norm_data) }
 ----------------------------------------------------------------------------------
 (group), bag of norm_data
 ----------------------------------------------------------------------------------
 ( (DSP-1, 2014.02.14), {(2014.02.14, DSP-1, 56), (2014.02.14, DSP-1, 21)} )
 ( (DSP-1, 2014.02.17), {(2014.02.17, DSP-1, 34), (2014.02.17, DSP-1, 24)} )
 ( (DSP-2, 2014.02.14), {(2014.02.14, DSP-2, 89), (2014.02.14, DSP-2, 12)} )
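To make the two prefixes concrete, here is a hedged sketch that reads the key through group and aggregates the bag through norm_data (COUNT is covered in more detail below):

 -- a sketch: the key is read via the 'group' prefix, the bag via the relation name
 per_group = FOREACH group_norm_data GENERATE group.dsp_id, group.date,
             COUNT(norm_data) AS records;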

Pig - flattening grouped data (FLATTEN)
Sometimes the aggregates need to be unrolled back into a flat, linear structure ("straightened out").
The FLATTEN function does exactly that.

 -- Unroll the grouped structure into a flat one
 ft_group_norm_data = FOREACH group_norm_data GENERATE FLATTEN(group), FLATTEN(norm_data);

From the complex grouped structure we get back a flat set of Tuples.

 ft_group_norm_data -> flat tuples: the flattened key followed by the flattened record
 -----------------------------------------------------------------------
 dsp_id, date, date, dsp_id, bid
 -----------------------------------------------------------------------
 (DSP-1, 2014.02.14, 2014.02.14, DSP-1, 56)
 (DSP-1, 2014.02.14, 2014.02.14, DSP-1, 21)
 (DSP-1, 2014.02.15, 2014.02.15, DSP-1, 15)
 (DSP-1, 2014.02.15, 2014.02.15, DSP-1, 31)
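Note the duplicated columns above: flattening both the key and the bag repeats dsp_id and date. A hedged alternative is to flatten only the key and rename its fields to avoid ambiguity:

 -- a sketch: flatten only the group key and give its fields explicit names
 keys_only = FOREACH group_norm_data GENERATE FLATTEN(group) AS (dsp_id, date),
             COUNT(norm_data) AS records;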

Pig - aggregation functions (SUM)
Let's count something, for example the total daily bids made by each bidder.

 -- For each group, sum up the bids
 sum_bids_dsp = FOREACH group_norm_data GENERATE group, SUM(norm_data.bid) AS bids_sum;


 sum_bids_dsp -> the group key plus the aggregate bids_sum
 ------------------------------------------------------
 group, bids_sum
 ------------------------------------------------------
 ( (DSP-1, 2014.02.16), 82)
 ( (DSP-1, 2014.02.17), 58)
 ( (DSP-2, 2014.02.14), 101)
 ( (DSP-2, 2014.02.16), 58)
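SUM is not the only built-in aggregate; COUNT, AVG, MIN and MAX work on a bag in exactly the same way. A quick sketch over the same grouping (the output field names are illustrative):

 -- a sketch: several aggregates computed over the same grouped relation
 bid_stats = FOREACH group_norm_data GENERATE group, COUNT(norm_data) AS records,
             AVG(norm_data.bid) AS avg_bid, MAX(norm_data.bid) AS max_bid;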

Pig - GROUP ALL
You often need to count the number of records in a sample. Simply applying COUNT to the relation will not work: the data must first be rolled up into a single group, and only then can the aggregation functions be applied.

 -- Roll everything up into a single group,
 -- so that the aggregation functions can be applied to it.
 group_all = GROUP sum_bids_dsp ALL;


The output is a single group with the key "all" and a bag containing all the previous records.
 ( all, { ((DSP-1, 2014.02.14), 77), ((DSP-1, 2014.02.15), 67), ((DSP-1, 2014.02.16), 82),
          ((DSP-1, 2014.02.17), 58), ((DSP-2, 2014.02.14), 101), ((DSP-2, 2014.02.16), 58),
          ((DSP-2, 2014.02.17), 123), ((DSP-3, 2014.02.14), 22), ((DSP-3, 2014.02.15), 109),
          ((DSP-3, 2014.02.16), 136), ((DSP-3, 2014.02.17), 81) } )

Now let's compute the record count and the overall sum:
 summary = FOREACH group_all GENERATE COUNT(sum_bids_dsp), SUM(sum_bids_dsp.bids_sum); 

Output
 ------------------------------------------------------
 count, sum
 ------------------------------------------------------
 (11, 914)
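To round the pipeline out, the result can be written back to HDFS with STORE; presumably this is why the script removed /data/pig/out at the very beginning:

 -- write the final relation into the output directory cleaned up earlier
 STORE summary INTO '/data/pig/out' USING PigStorage('\t');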

In my opinion, this is exactly what we need: the data processing is laid out as an ordered sequence of steps, each of which can be commented.
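Put together, the whole walkthrough reads as a single commented script (assembled from the steps above):

 -- clean the output directory
 fs -rm -f -r -skipTrash /data/pig/out
 -- load the raw RTB log
 raw_data = LOAD '/data/pig/example/' USING PigStorage('\t') AS (time, bid_id, user_id, dsp_id, bid:int);
 -- trim the timestamp down to a date
 norm_data = FOREACH raw_data GENERATE SUBSTRING(time, 0, 10) AS date, dsp_id, bid;
 -- group by bidder and day
 group_norm_data = GROUP norm_data BY (dsp_id, date);
 -- daily sum of bids per bidder
 sum_bids_dsp = FOREACH group_norm_data GENERATE group, SUM(norm_data.bid) AS bids_sum;
 -- roll everything into one group and compute the totals
 group_all = GROUP sum_bids_dsp ALL;
 summary = FOREACH group_all GENERATE COUNT(sum_bids_dsp), SUM(sum_bids_dsp.bids_sum);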

To sum up:

Pig pluses:

- processing is an explicit, ordered sequence of named steps;
- every step can be commented and its intermediate result inspected on its own;
- complex transformations stay readable instead of collapsing into one monolithic query.

Pig cons:

- Pig Latin is yet another language to learn, so there is an entry barrier;
- for people who live in SQL, simple samples take longer to write than in Hive.

Summary:


If you and your colleagues know SQL well, work with it daily, and are not scared off by monstrous queries, then Hive is a great solution. But if you work with SQL only occasionally and your data workflow does not fit into simple queries, then it is definitely worth spending a day to get to grips with Pig. Later on, this can save a lot of time for you and your colleagues.

Source: https://habr.com/ru/post/223217/

