📜 ⬆️ ⬇️

Access ClickHouse with JDBC

Hi Habr! Not so long ago, I had the pleasure of attending a PyData Moscow meeting at the Yandex site. I can not call myself python developer, but I have interests in the field of analytics and data analysis. Having visited this event, I learned about the existence of ClickHouse DBMS, developed in Yandex and uploaded on GitHub under an open license. Column SQL DBMS with domestic roots aroused my interest. In this article, I will share the experience of installing and configuring ClickHouse, as well as trying to access it from a Spring application using Hibernate.

Assembly and installation


I began my introduction to the DBMS from the documentation page.

Very surprised by the recommendations on the use of RAM. Developments are advised to use as much memory as you have data (in any case, as long as the amount of data is less than 200 GB). It is not entirely clear whether we are talking about the total memory of all the machines in a cluster or about the memory of each machine. In any case, I could be content with only 8 GB of memory of my virtual machine. In this case, the size of the test data set was more than 60 GB in uncompressed form.

The first difficulty with which I had to stock up was the recomnation to use Ubuntu and the rest of the rpm packages (UPD: I later discovered this repository ). I prefer the rpm system and made decisions to build a DBMS from source codes under Centos 7.
')
The guys from Yandex prepared us instructions for assembly. Here are some interesting moments:


The assembly of the DBMS itself took about 40 minutes. The setup of clickhouse in my case was to edit the configuration /usr/local/etc/clickhouse-server/config.xml , since I decided to keep the database on a separate section. I was able to start the server using sudo /usr/local/bin/clickhouse-server --config /usr/local/etc/clickhouse-server/config.xml

I decided to check the work of the DBMS on an example from this post by downloading data on air travel in the US from 1987 to 2015. The download process and query examples are given above. It took me 24 minutes to load, the size of the database on the disk was 14 GB with the amount of downloaded data of 61.6 GB.
The execution time of the test queries from the article above on my machine was:
RequestTime with
which directions were the most popular in 20152.067
from which cities more flights depart3.744
from which cities you can fly the maximum number of destinations7.012
How does the delay of departure of flights from the day of the week3.195
from which cities, planes are often delayed with departure for more than an hour3.392
what are the longest flights12.466
distribution of arrival delay times, by airline4.596
which airlines stopped flights1.653
which cities began to fly more in 20150.380
flights to which cities are more dependent on seasonality8.806

Once again, I note that the amount of memory of my machine was only 8 GB, which is almost 8 times less than recommended by the developers. In this regard, the numbers are higher - no more than a very rough estimate of how clickhouse will behave on a developer's laptop, rather than in production. In general, the DBMS worked stably and, in my opinion, fairly quickly. In the analyzed table there were 166 million records.

We are friends with ClickHouse and Java


The next step in my study of clickhouse is to attempt to access it from Java. The guys from Yandex posted jdbc driver .

Connecting it to a maven project is very simple:

  <dependency> <groupId>ru.yandex.clickhouse</groupId> <artifactId>clickhouse-jdbc</artifactId> <version>${clickhouse-jdbc-version}</version> </dependency> 

When using there are only a few points that unknowingly can cause difficulties. First of all, clikchouse supports several protocols. By default, the native binary protocol uses port 9000, and the http protocol - 8123. The JDBC driver can only work with http, so we specify the port 8123. The name of the database is default. User name and password, by default, do not need to be set. In order for clickhouse to allow connections from remote machines, add the <listen_host>::</listen_host> line in the configuration file.

Further, all as in the usual JDBC:

  private static final String DB_URL = "jdbc:clickhouse://localhost:8123/default"; private final Connection conn; /** * Creates new instance * @throws SQLException in case of connection issue */ public ClickHouseJDBCDemo() throws SQLException { conn = DriverManager.getConnection(DB_URL); } /** * Queries db to get most popular flight route for ths given year * @param year year to query * @throws SQLException in case of query issue */ public void popularYearRoutes(int year) throws SQLException { String query = "SELECT " + " OriginCityName, " + " DestCityName, " + " count(*) AS flights, " + " bar(flights, 0, 20000, 40) AS bar " + "FROM ontime WHERE Year = ? GROUP BY OriginCityName, DestCityName ORDER BY flights DESC LIMIT 20"; long time = System.currentTimeMillis(); try (PreparedStatement statement = conn.prepareStatement(query)) { statement.setInt(1, year); try (ResultSet rs = statement.executeQuery()) { Util.printRs(rs); } } System.out.println("Time: " + (System.currentTimeMillis() - time) +" ms"); } 

As you can see, nothing unusual. Full sample code is available here .

And what about Hibernate?


My next step was an attempt to connect ORM. Naively believing that once there is JDBC, there will be ORM, I reached for Spring Data JPA. But, unfortunately, I ran into a number of problems here. First of all, when connecting to the database, you must specify SQL Dialect. I could not find the implemented dialect for ClickHouse.

As a result, I simply created a descendant of the standard org.hibernate.dialect.Dialect :

 public class ClickHouseDialect extends Dialect { } 

Now connect to the database becomes possible. The problem begins when you run queries. The fact is that clickhouse does not support alias for table names. In other words, a query of the form:

 SELECT alias.column from tablename alias 

for clickHouse is not valid.

The problem is that judging by all hibernate does not allow to disable the generation of alias for tables by defining SQL Dialect. In any case, I could not find how to achieve this.

Having a little rummaged in the hibernate code, I began to come to the conclusion that this functionality is sewn pretty deeply in its core. However, my experience with hibernate is not that great. I would be glad if anyone could share a known way to circumvent this limitation.

Total


Explore clickhouse was an exciting experience. The DBMS showed itself from the good side. It was possible to collect it from source codes, execute queries using the console client and jdbc. Unfortunately, incomplete support for ANSI SQL constructs did not allow Hibernate to be easily connected. However, it should take into account the purpose of the database. Home niche clickhouse - complex analytical queries, most of which still have to fully or partially write by hand. Although having the ability to simply view and sample data "out of the box" using the ORM would not be superfluous.

I think it will be extremely interesting to watch the development of the project, especially given the growing attention to clikchouse in the community lately and the active support of the project from Yandex.

Source: https://habr.com/ru/post/332112/


All Articles