Hi, Habr! In this article we will continue to look at the capabilities of Hive, an engine that translates SQL-like queries into MapReduce jobs.
In the previous article, we looked at the basic features of Hive, such as creating tables, loading data, and performing simple SELECT queries. Now let's talk about advanced features that will allow you to squeeze the most out of Hive.

User Defined Functions
One of the main obstacles when working with Hive is the rigidity of standard SQL. This problem can be solved with language extensions, the so-called User Defined Functions. Quite a lot of useful functions are built right into the Hive language. Here are some of the most interesting, in my opinion (information taken from the official documentation):
Json
A fairly common task when working with big data is processing unstructured data stored in JSON format. To work with JSON, Hive supports the special get_json_object function, which allows you to extract values from JSON documents. To retrieve values from an object, a limited version of the JSONPath notation is used. The following operations are supported:
- $: Returns the root object
- .: Accesses a child object
- []: Indexes into an array
- *: Wildcard for []
Examples of working with JSON from the official documentation. Suppose there is a table src_json consisting of a single column (json) and a single row:
{"store": {"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net", "owner":"amy" }
Examples of queries to the table:
hive> SELECT get_json_object(src_json.json, '$.owner') FROM src_json;
amy
hive> SELECT get_json_object(src_json.json, '$.store.fruit[0]') FROM src_json;
{"weight":8,"type":"apple"}
hive> SELECT get_json_object(src_json.json, '$.non_exist_key') FROM src_json;
NULL
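If you need to extract several fields from the same document at once, Hive also has the built-in json_tuple table-generating function, which is used together with LATERAL VIEW. A small sketch on the same src_json table (output formatting may differ depending on the client):
hive> SELECT t.owner, t.email FROM src_json LATERAL VIEW json_tuple(src_json.json, 'owner', 'email') t AS owner, email;
amy	amy@only_for_json_udf_test.net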
Xpath
Similarly, if the data that needs to be processed with Hive is stored not in JSON but in XML, it can be processed using the xpath function, which allows parsing XML with the corresponding query language. An example of parsing XML data with xpath:
hive> select xpath('<a><b>b1</b><b>b2</b></a>', 'a/*/text()') from sample_table limit 1;
["b1","b2"]
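Besides xpath, Hive also provides typed variants such as xpath_string, xpath_int and xpath_double, which return a single scalar value instead of an array. A couple of hedged examples in the same spirit (assuming the same sample_table):
hive> select xpath_string('<a><b>b1</b><b>b2</b></a>', 'a/b[1]') from sample_table limit 1;
b1
hive> select xpath_int('<a><b>1</b><b>2</b></a>', 'sum(a/b)') from sample_table limit 1;
3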
Other useful built-in functions:
The built-in library contains a fairly rich set of functions. They fall into several groups:
- Mathematical functions (sin, cos, log, ...)
- Functions for working with dates and times (from_unixtime, to_date, current_date, hour(string date), datediff, ...) - a very rich set of functions for converting dates and times
- Functions for working with strings. Both general-purpose functions are supported, such as length, reverse, regexp, and more specific ones, such as parse_url or the already mentioned get_json_object
- Various system functions: current_user, current_database, ...
- Cryptographic functions - sha, md5, aes_encrypt, aes_decrypt ...
A complete list of built-in Hive functions can be found here.
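As a small illustration, here are a few of these functions combined in one query (a sketch on the same sample_table; the availability of some functions, for example md5, depends on the Hive version):
hive> select length(reverse('Hive')), parse_url('http://example.com/path?k1=v1', 'HOST'), md5('hello') from sample_table limit 1;
4	example.com	5d41402abc4b2a76b9719d911017c592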
Writing your own UDF
The built-in Hive functions are not always enough to solve a problem. If there is no suitable built-in function, you can write your own UDF. This is done in Java.
Let us walk through the creation of a custom UDF using the example of a simple function that converts a string to lowercase:
1. Create the com/example/hive/udf package and create the Lower.java class in it:
mkdir -p com/example/hive/udf
edit com/example/hive/udf/Lower.java
2. Implement the Lower class itself:
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
    // Hive calls evaluate() for every input row; a NULL input produces a NULL output
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;
        }
        return new Text(s.toString().toLowerCase());
    }
}
3. Add the necessary libraries to the CLASSPATH (the paths to the jar files may differ slightly in your Hadoop distribution):
export CLASSPATH=/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar
4. Compile our UDF and build the jar archive:
javac com/example/hive/udf/Lower.java
jar cvf my_udf.jar *
5. In order to use the function in Hive, you need to explicitly declare it:
hive> ADD JAR my_udf.jar;
hive> create temporary function my_lower as 'com.example.hive.udf.Lower';
hive> select my_lower('HELLO') from sample_table limit 1;
hello
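A temporary function exists only within the current session. Starting from Hive 0.13 you can also register a permanent function whose jar is stored, for example, in HDFS. A sketch with a hypothetical path:
-- the jar location below is just an example; use whatever path is actually accessible to Hive
CREATE FUNCTION my_lower AS 'com.example.hive.udf.Lower' USING JAR 'hdfs:///user/hive/udf/my_udf.jar';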
Transforming tables using scripts
Another way to extend the standard Hive functionality is to use the TRANSFORM method, which allows you to convert data using custom scripts in any programming language (this is especially suitable for those who do not like Java and do not want to write UDFs in it).
The syntax for using the command is as follows:
SELECT TRANSFORM(<columns>) USING <script> as <new_columns>
<script> - in this case, is a program that receives data on stdin, converts it, and writes the converted data to stdout. In essence, this is very similar to the streaming interface for running MapReduce jobs, which we wrote about in the article Big Data from A to Z. Part 2: Hadoop.
Example: Suppose we have a table with the salaries of users who are paid in different currencies:
+-------------------+---------------------+-----------------------+
| user_salary.name  | user_salary.salary  | user_salary.currency  |
+-------------------+---------------------+-----------------------+
| alexander         | 100000              | RUB                   |
| evgeniy           | 4000                | EUR                   |
| alla              | 50000               | RUB                   |
| elena             | 1500                | EUR                   |
+-------------------+---------------------+-----------------------+
We want to get a table with ruble salaries for all users. To do this, we write a Python script that performs the conversion:
import sys

# exchange rate used to convert EUR salaries to RUB
EXCHANGE_RATE = 75

for line in sys.stdin:
    name, salary, currency = line.rstrip("\n").split('\t')
    if currency == 'EUR':
        print(name + "\t" + str(int(salary) * EXCHANGE_RATE))
    else:
        print(name + "\t" + salary)
The script assumes that the input data arrives in TSV format (columns are separated by tabs). If the table contains a NULL value, the script receives the string '\N'.
Then use this script to convert the table:
0: jdbc:hive2://localhost:10000/default> select transform(name, salary, currency) using 'python transform_to_rub.py' as (name, rub_salary) from user_salary;
+------------+-------------+
|    name    | rub_salary  |
+------------+-------------+
| alexander  | 100000      |
| evgeniy    | 300000      |
| alla       | 50000       |
| elena      | 112500      |
+------------+-------------+
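Note that on a real cluster the script has to be shipped to the nodes that execute the tasks. This is usually done with the ADD FILE command before calling TRANSFORM (a sketch assuming the script lies in the directory the client was started from):
hive> ADD FILE transform_to_rub.py;
hive> select transform(name, salary, currency) using 'python transform_to_rub.py' as (name, rub_salary) from user_salary;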
In fact, the TRANSFORM operation makes it possible to completely replace classic MapReduce with Hive.
MapJoin
As we wrote in the article about techniques and strategies for working with MapReduce, implementing a JOIN of two tables generally requires several MapReduce jobs. Since Hive runs on top of MapReduce, a JOIN is an expensive operation for it as well.
However, if one of the two tables being joined fits entirely into the RAM of each node, you can get by with a single MapReduce job by loading that small table into memory. This pattern is called MapJoin. For Hive to use MapJoin, you need to give it a hint ("hint" in Hive terminology).
Example:
SELECT /*+ MAPJOIN(time_dim) */ COUNT(*) FROM store_sales JOIN time_dim ON (ss_sold_time_sk = t_time_sk)
In this example, it is assumed that the table store_sales is large, while the table time_dim is small and fits in memory. /*+ MAPJOIN(time_dim) */ is exactly that hint telling Hive to run a MapJoin.
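In recent Hive versions the hint is often not even needed: with the hive.auto.convert.join option, Hive decides to use MapJoin on its own when the smaller table fits under the configured size threshold (hive.mapjoin.smalltable.filesize, in bytes). A sketch; the exact defaults depend on the version:
hive> SET hive.auto.convert.join=true;
hive> SET hive.mapjoin.smalltable.filesize=25000000;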
Transactional model
The transactional ACID model implies support for 4 basic properties:
- Atomicity - an operation is either performed in full, completely changing the data, or it fails and leaves no trace behind.
- Consistency - after the application performs an operation, its result becomes available for all subsequent operations.
- Isolation - the operations of some users do not have side effects on other users.
- Durability - changes made by a successfully completed operation are preserved even in the event of a system failure.
Generally speaking, Hive is not well suited to working with changing data, but there are several cases where support for changing data is needed. First of all, these are:
- Data added in streaming mode (from systems such as Flume or Kafka). You want the data to be available for analysis in Hive as soon as it arrives
- Schema updates - for example, adding a new column to a Hive table. You want the column either to be successfully added to every record, or to fail and not be added to any
- Sometimes you still need to update individual records.
For these purposes, starting from version 0.14, Hive supports a transactional model, implemented by three operations: INSERT, UPDATE and DELETE.
Support for these operations is very limited:
- Currently only ORC format files are supported.
- By default, transaction support is disabled. To enable it, you must make the appropriate changes to the Hive configuration (see the sketch after this list).
- There is no support for the BEGIN, COMMIT, ROLLBACK commands familiar from relational databases.
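A minimal sketch of what such a configuration and a transactional table can look like (the parameter names are taken from the official documentation; the exact set depends on the Hive version, and the table name here is hypothetical):
-- settings required for transactional mode (some belong in hive-site.xml / on the metastore side)
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;

-- a hypothetical transactional table: it must be bucketed and stored as ORC
CREATE TABLE user_salary_acid (name string, salary int, currency string)
CLUSTERED BY (name) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');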
Transaction support is implemented using delta files. That is, when an update operation is performed, the data in the source file is not updated; instead, a new file is created that records which rows were changed. Later, Hive merges them using the compaction operation (the same approach is used in HBase).
In general, since transaction support is very limited, it is worth thinking very seriously before using this functionality in Hive. It may be worth looking in the direction of HBase or traditional relational databases.
Conclusion
In this and the previous article in the series, we looked at the main features of Hive, a powerful tool that simplifies working with MapReduce jobs. Hive is a great fit for analysts who are used to working with SQL, it can be easily integrated into existing infrastructure thanks to JDBC driver support, and given its support for User Defined Functions and custom transformations, it allows data processing to be moved entirely from classic MapReduce to Hive. However, Hive is not a "silver bullet" - for frequently updated data it is worth looking at tools such as HBase and classic relational databases.
In the next articles of the series, we will continue to look at tools for working with big data and methods for processing it.
Youtube Channel about data analysis