
Making a fast tour search based on ClickHouse

In this article, we will look at how to build a search over a database of tours (a tour is essentially a combination of a hotel and a flight) and compare two options - ClickHouse and MySQL (with two engines - InnoDB and MyISAM).

Why searching for tours is difficult


Tour operators (TezTour, TUI, Natalie Tours, etc.) sell their packages in a way that is not obvious at first glance:


The search is then performed over these combinations, whose number can reach hundreds of millions or even billions. An example of such a search form can be seen at TezTour: the departure city, the type of accommodation and the country can each be selected only one at a time, while the remaining parameters the user can choose freely.
Although the total number of tours (combinations) runs into the hundreds of millions, for each fixed set of parameters (departure city, type of accommodation, country) there are, in the worst case, tens of millions of options. Even for that many tours the search is not trivial, because we need to find records matching arbitrary criteria specified by the user, and the sorting can be more or less arbitrary as well (usually by price, but that is not the only possible criterion). In this article, we will look at a simplified architecture for real-time tour search based on MySQL and ClickHouse, without taking stops into account (a slang term meaning that some options have run out of hotel rooms or seats on the plane, and such tours must be excluded from the results). We will learn how to make the search fast and show results sorted by any field.

Where to get the data


Initially, I wanted to take a real database of some tour operator and load it as test data for our system - this would let us check the adequacy of our search by comparing our results with the results on the tour operator's website. Unfortunately, I could not easily find a tour database in the public domain, so instead I will limit myself to a set of pseudo-random data that imitates real tours.

The set of fields


Since the departure city, country, and type of accommodation are fixed for each search, we can create tables named after the corresponding parameters, for example tours_12_55_1001, where 12, 55 and 1001 are the IDs of the city, country and type of accommodation, respectively. This dramatically reduces the number of records we need to scan and at the same time saves space, because we do not store duplicate values in the tables.
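As a rough illustration (the IDs and table name here are hypothetical; every such table shares the structure of the tours table defined later in the article):

 -- 12 = departure city ID, 55 = country ID, 1001 = accommodation type ID
 CREATE TABLE tours_12_55_1001 LIKE tours;
 -- A search for that fixed combination then scans only this one table:
 SELECT * FROM tours_12_55_1001 WHERE price BETWEEN 500 AND 600 ORDER BY price LIMIT 20;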

However, even for one such table there may be tens of millions of variants, and for the test sample we will generate exactly 10 million records with the following structure:


For different countries and different tour operators the set of fields may differ (for example, a departure airport may also be present, or the price may be in a different currency and no longer fit into uint16, etc.), but we will focus on this variant as the most basic one.

Any programmer can write the generation code in ten minutes; here is an example in Go:

Hidden text
package main

import (
    "encoding/csv"
    "flag"
    "math/rand"
    "os"
    "strconv"
)

var n = flag.Int("n", 10000000, "How many entries to generate")

func main() {
    // parse the -n flag; without this call the flag value would be ignored
    flag.Parse()

    wr := csv.NewWriter(os.Stdout)
    defer wr.Flush()

    // header row with the column names
    wr.Write([]string{
        "tour_id", "package_id", "date_start", "stars", "pansion",
        "nights", "region_id", "hotel_id", "airport_id", "price",
    })

    // tour_id increases monotonically, the rest are random values in
    // more or less realistic ranges
    for i := 0; i < *n; i++ {
        wr.Write([]string{
            strconv.FormatInt(int64(i), 10),
            strconv.FormatInt(rand.Int63(), 10),
            strconv.FormatInt(rand.Int63n(90), 10),
            strconv.FormatInt(rand.Int63n(10), 10),
            strconv.FormatInt(rand.Int63n(10), 10),
            strconv.FormatInt(rand.Int63n(30), 10),
            strconv.FormatInt(rand.Int63n(1000), 10),
            strconv.FormatInt(rand.Int63n(1000000), 10),
            strconv.FormatInt(rand.Int63n(10), 10),
            strconv.FormatInt(rand.Int63n(10000), 10),
        })
    }
}


The resulting CSV file takes up about 500 MB and contains 10 million records in which tour_id increases monotonically and the remaining fields take random values in more or less realistic ranges.

Search criteria


We need to be able to search by any set of values for each of the fields and with any sorting, but, as a rule, the ranges of departure dates and durations are not very wide - people rarely look for tours with a departure-date range from, say, March 1 to May 1. We can assume that a search with a wide range of departure dates should also work, but response time and resource consumption need to be optimized for the case when this range is significantly limited, say, to no more than 10 days on average. This knowledge will be useful when choosing indexes.

MySQL implementation


I chose MySQL because I am well acquainted with this database, and also because it is relatively well suited for this task.

Let's try 3 slightly different approaches: plain InnoDB, InnoDB + primary key (date_start, tour_id) and MyISAM. The table structure will be the same for all 3 approaches; the differences are only in the engine and in which indexes we define:

 CREATE TABLE `tours` (
   `tour_id` bigint(20) NOT NULL,
   `package_id` bigint(20) NOT NULL,
   `date_start` tinyint(4) NOT NULL,
   `stars` tinyint(4) NOT NULL,
   `pansion` tinyint(4) NOT NULL,
   `nights` tinyint(4) NOT NULL,
   `region_id` smallint(6) NOT NULL,
   `hotel_id` int(11) NOT NULL,
   `airport_id` tinyint(4) NOT NULL,
   `price` smallint(6) NOT NULL
 )

InnoDB


First, let's try the standard InnoDB engine with straightforward indexes - PRIMARY KEY (`tour_id`), KEY `date_start` (`date_start`). The key on date_start is there to speed up selection over a range of departure dates, and the PRIMARY KEY on tour_id relies on tour_id increasing monotonically, which should give us optimal insertion speed.
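For clarity, a minimal sketch of adding exactly these two keys on top of the table definition above (they could just as well be declared inline in CREATE TABLE):

 ALTER TABLE tours
   ADD PRIMARY KEY (`tour_id`),
   ADD KEY `date_start` (`date_start`);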

Let's insert our data into a default installation of MySQL 5.7.17 (innodb_buffer_pool_size = 128 MB; the rest of the settings are under the spoiler):

Default innodb settings
innodb_adaptive_flushing ON
innodb_adaptive_flushing_lwm 10
innodb_adaptive_hash_index ON
innodb_adaptive_hash_index_parts 8
innodb_adaptive_max_sleep_delay 150000
innodb_api_bk_commit_interval 5
innodb_api_disable_rowlock OFF
innodb_api_enable_binlog OFF
innodb_api_enable_mdl OFF
innodb_api_trx_level 0
innodb_autoextend_increment 64
innodb_autoinc_lock_mode 1
innodb_buffer_pool_chunk_size 134217728
innodb_buffer_pool_dump_at_shutdown ON
innodb_buffer_pool_dump_now OFF
innodb_buffer_pool_dump_pct 25
innodb_buffer_pool_filename ib_buffer_pool
innodb_buffer_pool_instances 1
innodb_buffer_pool_load_abort OFF
innodb_buffer_pool_load_at_startup ON
innodb_buffer_pool_load_now OFF
innodb_buffer_pool_size 134217728
innodb_change_buffer_max_size 25
innodb_change_buffering all
innodb_checksum_algorithm crc32
innodb_checksums ON
innodb_cmp_per_index_enabled OFF
innodb_commit_concurrency 0
innodb_compression_failure_threshold_pct 5
innodb_compression_level 6
innodb_compression_pad_pct_max 50
innodb_concurrency_tickets 5000
innodb_data_file_path ibdata1:12M:autoextend
innodb_data_home_dir
innodb_deadlock_detect ON
innodb_default_row_format dynamic
innodb_disable_sort_file_cache OFF
innodb_doublewrite ON
innodb_fast_shutdown 1
innodb_file_format Barracuda
innodb_file_format_check ON
innodb_file_format_max Barracuda
innodb_file_per_table ON
innodb_fill_factor 100
innodb_flush_log_at_timeout 1
innodb_flush_log_at_trx_commit 1
innodb_flush_method
innodb_flush_neighbors 1
innodb_flush_sync ON
innodb_flushing_avg_loops 30
innodb_force_load_corrupted OFF
innodb_force_recovery 0
innodb_ft_aux_table
innodb_ft_cache_size 8000000
innodb_ft_enable_diag_print OFF
innodb_ft_enable_stopword ON
innodb_ft_max_token_size 84
innodb_ft_min_token_size 3
innodb_ft_num_word_optimize 2000
innodb_ft_result_cache_limit 2000000000
innodb_ft_server_stopword_table
innodb_ft_sort_pll_degree 2
innodb_ft_total_cache_size 640000000
innodb_ft_user_stopword_table
innodb_io_capacity 200
innodb_io_capacity_max 2000
innodb_large_prefix ON
innodb_lock_wait_timeout 50
innodb_locks_unsafe_for_binlog OFF
innodb_log_buffer_size 16777216
innodb_log_checksums ON
innodb_log_compressed_pages ON
innodb_log_file_size 50331648
innodb_log_files_in_group 2
innodb_log_group_home_dir ./
innodb_log_write_ahead_size 8192
innodb_lru_scan_depth 1024
innodb_max_dirty_pages_pct 75.000000
innodb_max_dirty_pages_pct_lwm 0.000000
innodb_max_purge_lag 0
innodb_max_purge_lag_delay 0
innodb_max_undo_log_size 1073741824
innodb_monitor_disable
innodb_monitor_enable
innodb_monitor_reset
innodb_monitor_reset_all
innodb_old_blocks_pct 37
innodb_old_blocks_time 1000
innodb_online_alter_log_max_size 134217728
innodb_open_files 2000
innodb_optimize_fulltext_only OFF
innodb_page_cleaners 1
innodb_page_size 16384
innodb_print_all_deadlocks OFF
innodb_purge_batch_size 300
innodb_purge_rseg_truncate_frequency 128
innodb_purge_threads 4
innodb_random_read_ahead OFF
innodb_read_ahead_threshold 56
innodb_read_io_threads 4
innodb_read_only OFF
innodb_replication_delay 0
innodb_rollback_on_timeout OFF
innodb_rollback_segments 128
innodb_sort_buffer_size 1048576
innodb_spin_wait_delay 6
innodb_stats_auto_recalc ON
innodb_stats_include_delete_marked OFF
innodb_stats_method nulls_equal
innodb_stats_on_metadata OFF
innodb_stats_persistent ON
innodb_stats_persistent_sample_pages 20
innodb_stats_sample_pages 8
innodb_stats_transient_sample_pages 8
innodb_status_output OFF
innodb_status_output_locks OFF
innodb_strict_mode ON
innodb_support_xa ON
innodb_sync_array_size 1
innodb_sync_spin_loops 30
innodb_table_locks ON
innodb_temp_data_file_path ibtmp1:12M:autoextend
innodb_thread_concurrency 0
innodb_thread_sleep_delay 10000
innodb_tmpdir
innodb_undo_directory ./
innodb_undo_log_truncate OFF
innodb_undo_logs 128
innodb_undo_tablespaces 0
innodb_use_native_aio OFF
innodb_version 5.7.17
innodb_write_io_threads 4

 LOAD DATA LOCAL INFILE 'tours.csv' INTO TABLE tours FIELDS TERMINATED BY ',' 

During the load, we make sure that the bottleneck is the MySQL server and not the client (the client's CPU usage should not be at 100%), and look at the query execution time:

 Query OK, 10000000 rows affected, 11 warnings (1 min 35.09 sec) Records: 10000001 Deleted: 0 Skipped: 1 Warnings: 11 

It seems that 1.5 minutes to insert 10 million records is OK for InnoDB. The warnings relate to the first line of the CSV file, which contains the column names.

The resulting table size is 490 MB of data and 162 MB of indexes. While the query was running, MySQL wrote 3.5 GB to disk, and the CPU load was around 100% (1 of 4 cores on the test laptop was utilized).

Before running any search queries, let's look at the other data insertion options.

InnoDB + PRIMARY KEY (date_start, tour_id)


Since we know in advance that most queries will touch only a relatively small range of departure dates, we can cluster the data by departure date instead of tour_id - in this case, reading over a date_start range causes fewer random reads from disk (or from the buffer pool), which means searching such a table should be faster.
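For clarity, a minimal sketch of this variant (same columns as before; tour_id stays in the key to keep it unique):

 ALTER TABLE tours
   ADD PRIMARY KEY (`date_start`, `tour_id`);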

Let's see how things go with the insert:

 Query OK, 10000001 rows affected, 10 warnings (1 min 13.34 sec) Records: 10000001 Deleted: 0 Skipped: 0 Warnings: 10 

Contrary to expectations, inserting into such a table went even a little faster than into the "normal" structure. This is probably because we no longer have an "extra" index on date_start, as it is included in the PRIMARY KEY. Slightly less was written to disk than last time - only 3 GB.

The size of the table is 538 MB of data and 0 MB of indexes (there are no secondary indexes, and in InnoDB the PRIMARY KEY is stored as part of the data).

MyISAM


In past years MyISAM was the default engine in MySQL, and it is famous for fast inserts and fast full scans - sounds like an ideal candidate for our task. The table structure remains the same as in the first InnoDB variant.

 Query OK, 10000000 rows affected, 11 warnings (40.39 sec) Records: 10000001 Deleted: 0 Skipped: 1 Warnings: 11 

The result is definitely better, although not by an order of magnitude. Tuning the settings could yield better results for both MyISAM and InnoDB, but that is a topic for a separate large article. 350 MB was written to disk, and the table ended up at 286 MB of data + 205 MB of index. I could not figure out why the monitoring tools reported fewer bytes written to disk than the resulting MyISAM table size; readers can try to reproduce the experiment on their own MySQL instance and compare the numbers.

ClickHouse


The last option is the database from Yandex called ClickHouse. It is a column-oriented database with automatic query parallelization and a bunch of other goodies. There are binary builds for Linux and even Docker containers, and macOS support is claimed. I could not find any binary builds for macOS, so I built it myself and uploaded the result: yadi.sk/d/RffdbxaM3GL7Ac (the build requires gcc-6, several dozen gigabytes of space and about 2 hours on my laptop). I will run the tests on macOS.

First create a table:

 CREATE TABLE tours (
     tour_id UInt64,
     package_id UInt64,
     date_start UInt8,
     stars UInt8,
     pansion UInt8,
     nights UInt8,
     region_id UInt16,
     hotel_id UInt32,
     airport_id UInt8,
     price UInt16,
     date Date MATERIALIZED toDate(date_start)
 ) ENGINE = MergeTree(date, date_start, 8192)

When creating a MergeTree table (the recommended default engine), you need to specify a column of the Date type, by which the data is partitioned. Here I was a bit lazy and simply made a MATERIALIZED date column with the value toDate(date_start). With real data it would be easier to make date_start itself a Date column - in ClickHouse a date column does not take up much space thanks to column compression.
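For reference, a sketch of that simpler variant (not what is benchmarked below, and it assumes date_start holds real dates rather than the 0-89 offsets in our test data):

 CREATE TABLE tours (
     tour_id UInt64,
     package_id UInt64,
     date_start Date,
     stars UInt8,
     pansion UInt8,
     nights UInt8,
     region_id UInt16,
     hotel_id UInt32,
     airport_id UInt8,
     price UInt16
 ) ENGINE = MergeTree(date_start, date_start, 8192)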

ClickHouse, insert data


Now insert the data:

 $ cat tours.csv | pv | clickhouse --client --query 'INSERT INTO tours FORMAT CSVWithNames'
 524MiB 0:00:04 [ 123MiB/s]

If you do not have the pv utility installed yet, I recommend installing it: it prints the progress of an operation and shows how many bytes have passed through the pipe.

You can see that inserting the CSV file took 4 seconds, at a speed of about 100 MB/s (roughly what the authors claim). During the insert the server spent 7 seconds of CPU time.

The data takes up 378 MB, and 378 MB was also written to disk (apparently the entire volume was sorted in memory and written out in one piece).

Unfortunately, since our test data is random, we did not manage to get significant benefits from column compression. Real data should compress better, so the final size should be even smaller.

Data insertion times, comparison


Let's collect all inserts in one table:
Method        Time         Table size   Written to disk
InnoDB        95 seconds   652 MB       3.5 GB
InnoDB + PK   73 seconds   538 MB       3.0 GB
MyISAM        40 seconds   491 MB       350 MB?
ClickHouse    4 seconds    378 MB       378 MB

Sample queries


We will run several queries that correspond to more or less realistic requests arriving at a tour search:

 -- 1. The cheapest tours overall
 SELECT * FROM tours ORDER BY price ASC LIMIT 20;
 -- 2. The cheapest tours departing within a 10-day window (1/9 of all dates)
 SELECT * FROM tours WHERE date_start BETWEEN 40 AND 50 ORDER BY price ASC LIMIT 20;
 -- 3. 5-night tours priced between 500 and 600
 SELECT * FROM tours WHERE nights = 5 AND price BETWEEN 500 AND 600 ORDER BY price ASC LIMIT 20;
 -- 4. Tours priced between 500 and 600, longest and highest-rated first
 SELECT * FROM tours WHERE price BETWEEN 500 AND 600 ORDER BY nights DESC, stars DESC LIMIT 20;
 -- 5. The cheapest tours to a specific hotel within a date range
 SELECT * FROM tours WHERE date_start BETWEEN 10 AND 20 AND hotel_id = 767036 ORDER BY price LIMIT 20;

For MyISAM, the times in parentheses are for queries run with IGNORE INDEX (date_start).
For ClickHouse, the times in parentheses are for queries that select only tour_id and price instead of *.

So, the results:
Method        Query 1, ms   Query 2, ms   Query 3, ms   Query 4, ms   Query 5, ms
InnoDB        4300          3800          3800          3800          3700
InnoDB + PK   4600          600           4000          4000          540
MyISAM        1300          1600 (1000)   800           700           1500 (670)
ClickHouse    120 (40)      40 (14)       150 (54)      180 (80)      14 (8)

And the same with the cold cache (on the laptop SSD):
Method        Query 1, ms   Query 2, ms   Query 3, ms   Query 4, ms   Query 5, ms
InnoDB        5500          5000          4800          4700          4700
InnoDB + PK   12700         1960          12100         12000         1720
MyISAM        14800 (1060)  800           920           799           14900 (767)
ClickHouse    570 (188)     177 (90)      591 (224)     608 (254)     122 (55)

Analysis


You can see that, without tuning, MyISAM suits MySQL best here, and the index on date_start only makes things worse, especially with a cold cache (this is because MyISAM handles random row access poorly: it reads rows one at a time with a read() system call, which with a cold cache has disastrous consequences even on an SSD). Some queries run faster with the InnoDB + PK scheme, but only with a warm cache. With a cold start MyISAM is far preferable, and without an index on date_start (instead, the table could be split by date ranges and the queries run in parallel, as sketched below).
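A hypothetical sketch of that idea (the table names and range boundaries here are made up; the application would query the needed ranges concurrently and merge the small sorted result sets itself):

 -- One MyISAM table per departure-date range, same structure as tours
 CREATE TABLE tours_d00_29 LIKE tours;   -- date_start 0..29
 CREATE TABLE tours_d30_59 LIKE tours;   -- date_start 30..59
 CREATE TABLE tours_d60_89 LIKE tours;   -- date_start 60..89
 -- A search for departures between 40 and 50 then touches a single range table
 SELECT * FROM tours_d30_59 WHERE date_start BETWEEN 40 AND 50 ORDER BY price LIMIT 20;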

Since ClickHouse is a column-oriented database, SELECT * queries do not suit it very well - this is visible in the query execution times. Where possible, you should select only the columns you actually need (the metadata for a tour_id should be duplicated in a safe store somewhere anyway, and can be fetched from there instead of selecting it directly from ClickHouse). ClickHouse also automatically parallelizes query execution across all available cores and scans only the required columns, so with a warm cache it is very fast. With a cold start it is especially noticeable why selecting only the necessary columns is preferable.
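For example, the "narrow" variant behind the bracketed ClickHouse timings above is simply query 2 restricted to the two columns the ranking needs:

 SELECT tour_id, price FROM tours WHERE date_start BETWEEN 40 AND 50 ORDER BY price ASC LIMIT 20;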

However, even with a cold start, ClickHouse leads in query execution time in every case, without exception.

Conclusion


In this article, we built a simplified model of tour search (ignoring stops) based on MySQL (3 different approaches) and ClickHouse, and in all cases the database from Yandex showed a significant speed advantage, sometimes by 2-3 orders of magnitude.

I hope this article helps those who build tour search make it faster and cheaper, and gives a basic understanding of which tasks ClickHouse is suitable for. Thank you for reading to the end; I will be glad to hear your comments.

P.S. If you can tune MySQL or ClickHouse to get better results and publish your benchmarks, please write in the comments - it will be a useful addition to the article, since my figures are only approximate, and proper settings can change the results severalfold for some types of queries.

Links


  1. Clickhouse
  2. Tour Data Generator
  3. ClickHouse binaries compiled under macOS

Source: https://habr.com/ru/post/324846/

