Hi Habr! Not so long ago, I had the pleasure of attending a
PyData Moscow meeting at the Yandex site. I can not call myself python developer, but I have interests in the field of analytics and data analysis. Having visited this event, I learned about the existence of ClickHouse DBMS, developed in Yandex and uploaded on
GitHub under an open license. Column SQL DBMS with domestic roots aroused my interest. In this article, I will share the experience of installing and configuring ClickHouse, as well as trying to access it from a Spring application using Hibernate.
Assembly and installation
I began my introduction to the DBMS from the
documentation page.
Very surprised by the recommendations on the use of RAM. Developments are advised to use as much memory as you have data (in any case, as long as the amount of data is less than 200 GB). It is not entirely clear whether we are talking about the total memory of all the machines in a cluster or about the memory of each machine. In any case, I could be content with only 8 GB of memory of my virtual machine. In this case, the size of the test data set was more than 60 GB in uncompressed form.
The first difficulty with which I had to stock up was the recomnation to use Ubuntu and the rest of the rpm packages (UPD: I later discovered this
repository ). I prefer the rpm system and made decisions to build a DBMS from source codes under Centos 7.
')
The guys from Yandex prepared us
instructions for assembly. Here are some interesting moments:
- We need gcc as much as version 6, when only 4.8 goes out of the box in centos. Curiously, what is the reason for this limitation? I did not spend time studying the code clickHouse, perhaps using the new C ++ standard, which is not supported in version 4. Anyway, GCC 6 also had to be assembled from source codes. The clickhouse build instructions describe the gcc build process, but again it applies primarily to Ubuntu. The following article helped. Building the compiler took me 1.5 hours.
- The second interesting feature of the assembly was the dependence of clickhouse on libmysqlclient. I wonder how and why the client functions of MySQL are used in clikchouse?
The assembly of the DBMS itself took about 40 minutes. The setup of clickhouse in my case was to edit the configuration
/usr/local/etc/clickhouse-server/config.xml
, since I decided to keep the database on a separate section. I was able to start the server using
sudo /usr/local/bin/clickhouse-server --config /usr/local/etc/clickhouse-server/config.xml
I decided to check the work of the DBMS on an example from
this post by downloading data on air travel in the US from 1987 to 2015. The download process and query examples are given above. It took me 24 minutes to load, the size of the database on the disk was 14 GB with the amount of downloaded data of 61.6 GB.
The execution time of the test queries from the article above on my machine was:
Request | Time with |
---|
which directions were the most popular in 2015 | 2.067
|
from which cities more flights depart | 3.744 |
from which cities you can fly the maximum number of destinations | 7.012 |
How does the delay of departure of flights from the day of the week | 3.195 |
from which cities, planes are often delayed with departure for more than an hour | 3.392 |
what are the longest flights | 12.466 |
distribution of arrival delay times, by airline | 4.596 |
which airlines stopped flights | 1.653 |
which cities began to fly more in 2015 | 0.380 |
flights to which cities are more dependent on seasonality | 8.806 |
Once again, I note that the amount of memory of my machine was only 8 GB, which is almost 8 times less than recommended by the developers. In this regard, the numbers are higher - no more than a very rough estimate of how clickhouse will behave on a developer's laptop, rather than in production. In general, the DBMS worked stably and, in my opinion, fairly quickly. In the analyzed table there were 166 million records.
We are friends with ClickHouse and Java
The next step in my study of clickhouse is to attempt to access it from Java. The guys from Yandex posted jdbc
driver .
Connecting it to a maven project is very simple:
<dependency> <groupId>ru.yandex.clickhouse</groupId> <artifactId>clickhouse-jdbc</artifactId> <version>${clickhouse-jdbc-version}</version> </dependency>
When using there are only a few points that unknowingly can cause difficulties. First of all, clikchouse supports several protocols. By default, the native binary protocol uses port 9000, and the http protocol - 8123. The JDBC driver can only work with http, so we specify the port 8123. The name of the database is default. User name and password, by default, do not need to be set. In order for clickhouse to allow connections from remote machines, add the
<listen_host>::</listen_host>
line in the configuration file.
Further, all as in the usual JDBC:
private static final String DB_URL = "jdbc:clickhouse://localhost:8123/default"; private final Connection conn; public ClickHouseJDBCDemo() throws SQLException { conn = DriverManager.getConnection(DB_URL); } public void popularYearRoutes(int year) throws SQLException { String query = "SELECT " + " OriginCityName, " + " DestCityName, " + " count(*) AS flights, " + " bar(flights, 0, 20000, 40) AS bar " + "FROM ontime WHERE Year = ? GROUP BY OriginCityName, DestCityName ORDER BY flights DESC LIMIT 20"; long time = System.currentTimeMillis(); try (PreparedStatement statement = conn.prepareStatement(query)) { statement.setInt(1, year); try (ResultSet rs = statement.executeQuery()) { Util.printRs(rs); } } System.out.println("Time: " + (System.currentTimeMillis() - time) +" ms"); }
As you can see, nothing unusual. Full sample code is available
here .
And what about Hibernate?
My next step was an attempt to connect ORM. Naively believing that once there is JDBC, there will be ORM, I reached for Spring Data JPA. But, unfortunately, I ran into a number of problems here. First of all, when connecting to the database, you must specify SQL Dialect. I could not find the implemented dialect for ClickHouse.
As a result, I simply created a descendant of the standard
org.hibernate.dialect.Dialect
:
public class ClickHouseDialect extends Dialect { }
Now connect to the database becomes possible. The problem begins when you run queries. The fact is that clickhouse
does not support alias for table names. In other words, a query of the form:
SELECT alias.column from tablename alias
for clickHouse is not valid.
The problem is that judging by all hibernate does not allow to disable the generation of alias for tables by defining SQL Dialect. In any case, I could not find how to achieve this.
Having a little rummaged in the hibernate code, I began to come to the conclusion that this functionality is sewn pretty
deeply in its core. However, my experience with hibernate is not that great. I would be glad if anyone could share a known way to circumvent this limitation.
Total
Explore clickhouse was an exciting experience. The DBMS showed itself from the good side. It was possible to collect it from source codes, execute queries using the console client and jdbc. Unfortunately, incomplete support for ANSI SQL constructs did not allow Hibernate to be easily connected. However, it should take into account the purpose of the database. Home niche clickhouse - complex analytical queries, most of which still have to fully or partially write by hand. Although having the ability to simply view and sample data "out of the box" using the ORM would not be superfluous.
I think it will be extremely interesting to watch the development of the project, especially given the growing attention to clikchouse in the community lately and the active support of the project from Yandex.