A company provides a big-data platform for storing and effectively managing genetic data. For the platform to be fully usable, dynamic queries issued at runtime have to complete in no more than two seconds. How do you get past this barrier? To rework the existing system, the team decided to use SQL Data Warehouse. Read on for the details!

Situation
Since approximately 99.5% of the human genome is common to all people, sequencing typically compares a sample's sequence with a selected, known reference genome. The comparison results are saved to a raw data file with a .vcf extension, which contains only the differences between the genome sample and the reference genome. The file holds roughly one million lines, each representing a gene mutation: a gene index and the value that differs from the reference genome's value at that index.
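For illustration, here is a minimal sketch (in Python) of reading such difference records. The tab-separated body columns follow the standard VCF layout (CHROM, POS, ID, REF, ALT), while the file name and the choice of which fields to keep are assumptions made for this example.

```python
import csv

# Minimal sketch: read the variant lines of a VCF file and keep only the
# position (gene index) and the value that differs from the reference.
# "sample.vcf" and the selected fields are assumptions for illustration.
def read_variants(path):
    variants = []
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or row[0].startswith("#"):   # skip VCF header lines
                continue
            chrom, pos, _id, ref, alt = row[:5]     # standard VCF body columns
            variants.append({"chrom": chrom, "pos": int(pos),
                             "ref": ref, "alt": alt})
    return variants

if __name__ == "__main__":
    sample = read_variants("sample.vcf")            # ~1 million records per sample
    print(len(sample), "variants loaded")
```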
The resulting VCF file then goes through a normalization procedure that prepares it for the various scripts available on the platform. Normalization is a sequence of steps that adds metadata and annotations from external gene databases, and it produces a huge file of about one million rows and roughly 50 columns.
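A hedged sketch of the idea behind the annotation step: the sample variants are joined with metadata from an external gene database by gene identifier. The file names, column names, and the use of pandas are assumptions; the platform's real normalization pipeline is not described here in that much detail.

```python
import pandas as pd

# Sketch of the annotation step: enrich sample variants with metadata from an
# external gene database. File and column names are assumptions for illustration.
variants = pd.read_csv("sample_variants.tsv", sep="\t")        # ~1M rows
annotations = pd.read_csv("gene_annotations.tsv", sep="\t")    # external reference metadata

# A left join keeps every sample variant and attaches the annotation columns,
# yielding the wide (~50-column) normalized file.
normalized = variants.merge(annotations, on="gene_id", how="left")
normalized.to_csv("normalized_sample.tsv", sep="\t", index=False)
```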
Problem
Doctors and researchers need the ability to run dynamic queries at runtime, using a dashboard to review a patient's sequencing results. The basic query filters samples by the contents of several columns. Extended queries join a sample with a reference database (about 60 million rows). A more detailed scenario involves a series of join operations within a single query, which allows genetic analysis of a pedigree together with lookups against the external reference database.
Since queries are generated from the dashboard, the result had to arrive in less than two seconds; if it took longer, the operation was considered a failure. Processing such queries at runtime (for example, when each sample contains a large number of rows and columns) demands significant computational resources and takes a lot of time.
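To make the scenario concrete, here is a hypothetical example of the kind of query the dashboard generates: filtering a sample by several columns and joining it with the reference database. The table and column names are invented for illustration, not the platform's actual schema.

```python
# Hypothetical example of a dashboard-generated query: filter a sample by
# several columns and join it with the reference database. Table and column
# names are illustrative; "TOP 100" would become "LIMIT 100" on Impala/Spark.
DASHBOARD_QUERY = """
SELECT TOP 100 s.gene_id, s.position, s.alt_value, r.clinical_significance
FROM   sample_variants AS s
JOIN   reference_db    AS r ON r.gene_id = s.gene_id
WHERE  s.quality >= 30
  AND  s.chromosome = '17'
ORDER BY s.position;
"""

print(DASHBOARD_QUERY)  # in practice, submitted to whichever storage system is under test
```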
Collaboration
The goal of our cooperation with this promising company was to measure the performance of query processing in order to further evaluate the effectiveness of a number of scalable storage system architectures:
Running Impala on Cloudera Cluster
Impala is recommended for processing SQL queries in interactive, exploratory analytics over large data sets, while the Hive and MapReduce modules are better suited to long-running batch tasks.
Impala also uses a columnar storage format, which handles queries of the type described above much more efficiently. This is due to how the data is physically laid out on disk: because values from the same column in different rows are stored next to each other, a SELECT query that projects a few columns reads only the blocks of those columns from disk. For more information about column storage, see here.
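As a small illustration of column pruning (shown here with Spark reading a Parquet file, since the exact Impala setup is not reproduced), projecting two columns means only those columns' blocks are fetched from disk. The path and column names are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("column-pruning-sketch").getOrCreate()

# Reading a columnar (Parquet) file and projecting two columns: only the blocks
# of gene_id and position are read from disk, not the full ~50-column rows.
# The path and column names are assumptions for this example.
variants = spark.read.parquet("hdfs:///genomics/normalized_sample.parquet")
top_hits = (variants
            .select("gene_id", "position")
            .where(variants.position > 1_000_000)
            .limit(100))
top_hits.show()
```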
Running Spark in an HDInsight Cluster
Spark is the technology recommended first for processing big data stored in HDFS, and it also prepares the system for adopting other technologies (for example, Impala). For more information about Spark, see here.
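A minimal sketch of what such preparation might look like: Spark loads the normalized, tab-separated sample from HDFS and stores it as Parquet, so that Impala or Spark SQL can query it efficiently later. The paths are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prepare-parquet-sketch").getOrCreate()

# Sketch of using Spark to prepare data in HDFS: load the normalized,
# tab-separated sample and store it as Parquet for later querying by
# Impala or Spark SQL. Paths are assumptions for illustration.
normalized = (spark.read
              .option("sep", "\t")
              .option("header", True)
              .option("inferSchema", True)
              .csv("hdfs:///genomics/normalized_sample.tsv"))

normalized.write.mode("overwrite").parquet("hdfs:///genomics/normalized_sample.parquet")
```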
SQL Data Warehouse
SQL Data Warehouse is a distributed database management system built for massively parallel processing (MPP). By distributing data and processing across nodes, SQL Data Warehouse provides flexible scaling options. For more information about SQL Data Warehouse, see here.
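To illustrate the MPP idea, here is a sketch of how the reference table might be defined in SQL Data Warehouse with hash distribution on the join key, so that joins on gene_id stay local to a distribution. The table and column names are assumptions, not the project's actual schema.

```python
# Sketch of a possible table layout in SQL Data Warehouse. Table and column
# names are assumptions; the statement would be submitted through any SQL
# client connected to the warehouse.
CREATE_REFERENCE_TABLE = """
CREATE TABLE dbo.reference_db
(
    gene_id    BIGINT       NOT NULL,
    position   BIGINT       NOT NULL,
    alt_value  VARCHAR(64),
    annotation VARCHAR(256)
)
WITH
(
    DISTRIBUTION = HASH (gene_id),     -- co-locate rows that join on gene_id
    CLUSTERED COLUMNSTORE INDEX        -- columnar storage for scan-heavy queries
);
"""

print(CREATE_REFERENCE_TABLE)
```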
Queries
We agreed on a set of queries that cover the typical scenarios. We used the same sample files for the different storage systems and measured how long each system took to process the queries against those files. For clarity, the queries below are described "in the image and likeness of SQL"; each storage system processes them in its own syntax.
The reference database table contains 60 million rows with a structure similar to the VCF file, plus additional metadata for each gene. During normalization, this data is used to annotate the rows of the sample VCF file.
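A hedged sketch of how such a measurement could be taken: wall-clock time around query execution, with a generic DB-API connection standing in for whichever system is under test. This harness is an illustration, not the exact tooling used in the benchmark.

```python
import time

def measure_query(connection, sql):
    """Run one query and return its wall-clock duration in seconds.

    `connection` is any DB-API compatible connection (Impala, Spark Thrift
    server, SQL Data Warehouse, ...). This helper is an assumption made for
    illustration, not the benchmark's actual tooling.
    """
    cursor = connection.cursor()
    start = time.perf_counter()
    cursor.execute(sql)
    cursor.fetchall()                      # force the full result to be produced
    return time.perf_counter() - start
```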
The following table describes the queries and the measured processing times in seconds (DWU = Data Warehouse Units):
| # | Scenario | Impala (Cloudera, 3 × D13), s | Spark in HDInsight (3 × D12), s | Spark in Cloudera (3 × DS13, 5 P30 disks each), s | SQL Data Warehouse (400 DWU), s | SQL Data Warehouse (1000 DWU), s |
|---|---|---|---|---|---|---|
| 1 | Select from the sample, sort by gene position, take the first 100 records | 2 | 5 | 2 | 1 | 1 |
| 2 | Select from the reference database table, filter by one field, sort by another field, take the first 100 records | 11 | 96 | 38 | 2 | 6 |
| 3 | Join the sample with the reference database table by gene identifier, sort by gene position, take the first 100 records | 775 | 347 | 168 | 15 | 6 |
| 4 | Join the sample with the reference database table by gene identifier, take the first 100 records | 211 | 275 | 121 | 1 | 1 |
| 5 | Join the sample with the reference database table by gene identifier, filter by one field, sort by gene position, take the first 100 records | 13 | 111 | 61 | 1 | 1 |
| 6 | Join the sample with the reference database table by gene identifier, group by one field, sort by another field, take the first 100 records | 37 | 41 | 23 | 5 | 2 |
| 7 | Select from the reference database table, group by one field, sort by another field, take the first 100 records | 12 | 25 | 20 | 15 | 7 |
| 8 | Select rows from the sample file that do not exist in the reference database table, take the first 100 records | 37 | 367 | 432 | 5 | 2 |
| 9 | Join four samples by gene identifier, filter by one field, sort by another field, take the first 100 records | 7 | 16 | 20 | 3 | 2 |
| 10 | Add 1 million records to a table containing 60 million records | 7 | 35 | 33 | 4 | 3 |
| 11 | Add 1 million records to a new empty table | 4.5 | 35 | 33 | 6 | 21 |
| 12 | Add 60 million records to a new empty table | 140 | 500 | 280 | 214 | 124 |
Conclusion
Given the results obtained, none of the alternatives above fully satisfies the processing-time requirement (2 s).
We recommend the following steps.
- In the short term: use a relational DBMS (for example, SQL Server, MySQL, PostgreSQL) to process most of the real-time queries sent from the dashboard that must complete in less than two seconds. Make sure the execution plan does not involve a full table scan (see the sketch after this list).
- Deploy a small Hadoop cluster (including Spark, Impala, HDFS, and Parquet). Use it to offload to Impala those individual complex queries from the dashboard that do not require a sub-second response from the relational database. Over time, increase the number of queries processed outside the relational DBMS, using Impala for the most complex and resource-intensive tasks.
- Use Impala and Spark for additional ad hoc queries and to optimize data preparation. Pre-process the raw data with Spark and Impala, then load the relevant data slices into the relational DBMS. Use Impala to build new ad hoc queries in the web application, and adjust the interface accordingly (for example, "Wait...", "Loading...") so that the increased response time does not hurt the user experience.
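As a minimal, self-contained illustration of the "no full table scan" advice from the first recommendation, the sketch below uses SQLite as a stand-in for whichever relational DBMS is chosen; the table, column, and index names are assumptions. SQL Server, MySQL, and PostgreSQL offer the same check through their own execution-plan tooling.

```python
import sqlite3

# SQLite as a stand-in for the chosen relational DBMS; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sample_variants (
        gene_id  INTEGER,
        position INTEGER,
        quality  REAL
    )
""")

# Without an index, a filter on gene_id forces a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sample_variants WHERE gene_id = 42"
).fetchall()
print(plan)   # plan mentions a SCAN of sample_variants (full table scan)

# An index on the filtered column lets the planner do an index search instead.
conn.execute("CREATE INDEX idx_variants_gene ON sample_variants (gene_id)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sample_variants WHERE gene_id = 42"
).fetchall()
print(plan)   # plan now mentions a SEARCH using idx_variants_gene
```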