
How to write less code for MapReduce, or why the world needs yet another query language: the story of Yandex Query Language

Historically, many parts of Yandex developed their own systems for storing and processing large amounts of data, tailored to the needs of particular projects. Efficiency, scalability, and reliability always came first in that work, so as a rule there was no time left to build convenient interfaces for these systems. A year and a half ago, the development of large infrastructure components was split off from the product teams into a separate area. The goals were to move faster, reduce duplication among similar systems, and lower the entry threshold for new internal users.



Very soon we realized that a common high-level query language could help here: it would provide uniform access to the existing systems and remove the need to re-implement typical abstractions on top of the low-level primitives these systems use. Thus began the development of Yandex Query Language (YQL), a universal declarative query language for data storage and processing systems. (Let me say right away: we know this is not the first thing in the world to be called YQL, but we decided that it would not get in the way and kept the name.)

Ahead of our meetup, which will be devoted to Yandex infrastructure, we decided to tell Habrahabr readers about YQL.

Architecture


Of course, we could have looked to the ecosystems popular in the open source world, such as Hadoop or Spark. But they were never seriously considered: we needed support for the data warehouses and computing systems already widespread at Yandex. Largely for this reason, YQL was designed and implemented to be extensible at every level. We go through the levels one by one below.



In the diagram, user requests travel from top to bottom, but we will discuss the elements involved in reverse order, from bottom to top, so that the story is more coherent. First, a few words about the currently supported backends or, as we call them, data providers:



Core


Technically, although YQL consists of relatively isolated components and libraries, it is provided to internal users primarily as a service. From their point of view, this makes YQL a “one-stop shop” and minimizes the organizational overhead of, for example, obtaining access or configuring firewalls for each backend. In addition, both implementations of classic MapReduce at Yandex require a client process that synchronously waits for a transaction to complete; the YQL service takes this on itself and lets users work in a “fire and forget” mode: launch a query and come back for the results later. Compared with distribution as a library, however, the service model also has disadvantages. For example, incompatible changes and releases have to be handled much more carefully, otherwise user processes can break at the most inconvenient moment.
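The asynchronous mode described above, where the service does the waiting on the user's behalf, can be pictured with a minimal in-process sketch. Everything in it is hypothetical: the class `QueryService`, its methods `submit`, `is_done` and `result`, and the stand-in `run_query` are illustration names, not the YQL API.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


def run_query(query):
    # Hypothetical stand-in for a long-running backend operation.
    return f"result of: {query}"


class QueryService:
    """Hypothetical service that waits for operations instead of the client."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._operations = {}

    def submit(self, query):
        # Returns immediately with an operation id; the service keeps waiting.
        op_id = uuid.uuid4().hex
        self._operations[op_id] = self._pool.submit(run_query, query)
        return op_id

    def is_done(self, op_id):
        return self._operations[op_id].done()

    def result(self, op_id):
        # Called later, possibly long after the query was launched.
        return self._operations[op_id].result()


service = QueryService()
op_id = service.submit("SELECT 1")
# ... the client is free to do something else and come back later ...
answer = service.result(op_id)
```

The client only keeps the operation id; the blocking wait lives inside the service, which is the part a MapReduce client process would otherwise have to do itself.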

The main entry point to the YQL service is an HTTP REST API implemented as a Java application on Netty. It not only launches incoming computation requests but also has a wide range of supporting duties:

Using Java made it possible to implement all this business logic quite quickly, thanks to ready-made asynchronous clients for all the necessary systems. Since the latency requirements are not too strict, there were few problems with garbage collection, and after switching to G1 they almost disappeared. In addition, ZooKeeper is used for synchronization between nodes, including for the publish-subscribe pattern when sending notifications.

The execution of user queries is orchestrated by separate C++ processes called yqlworker. They can run either on the same machines as the REST API or remotely, because the two communicate over the network using MessageBus, a protocol developed and widely used at Yandex. A copy of yqlworker is created for each request using the fork system call (without exec). This scheme provides sufficient isolation between the requests of different users and, thanks to copy-on-write, wastes no time on initialization.
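The fork-without-exec scheme can be illustrated with a small Unix-only sketch. It is written in Python rather than C++, and the names `run_in_forked_worker`, `handle` and `PRELOADED` are hypothetical; it shows only the general technique, not how yqlworker is actually built.

```python
import os
import pickle


def run_in_forked_worker(handler, request):
    """Run handler(request) in a forked child and return its result.

    The child starts with a copy-on-write view of the parent's already
    initialized memory, so nothing is re-initialized per request.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: no exec, so all preloaded state is immediately available.
        os.close(read_fd)
        exit_code = 0
        try:
            payload = pickle.dumps(handler(request))
            with os.fdopen(write_fd, "wb") as out:
                out.write(payload)
        except BaseException:
            exit_code = 1
        finally:
            os._exit(exit_code)  # leave without running the parent's cleanup
    # Parent: read the child's answer, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as inp:
        payload = inp.read()
    _, status = os.waitpid(pid, 0)
    if status != 0 or not payload:
        raise RuntimeError("worker failed")
    return pickle.loads(payload)


# Expensive state, initialized once in the parent (hypothetical example).
PRELOADED = {"udf_registry": ["len", "upper"]}


def handle(request):
    # Mutations here stay in the child and never leak into the parent.
    PRELOADED["udf_registry"].append(request)
    return len(PRELOADED["udf_registry"])


count = run_in_forked_worker(handle, "custom")
```

The child sees the preloaded registry without any startup work, while its modification is invisible to the parent, which is the isolation property described above.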

As can be seen from the high-level architecture diagram, Yandex Query Language has two representations: a syntax based on SQL and a syntax based on s-expressions.

Regardless of the chosen syntax, the query is turned into a computation graph (Expression Graph) that logically describes the required data processing using primitives popular in functional programming. These primitives include λ-functions, mapping (Map and FlatMap), filtering (Filter), folding (Fold), sorting (Sort), application (Apply), and many others. For the SQL syntax, a lexer and parser based on ANTLR v3 build an Abstract Syntax Tree, which is then used to construct the computation graph. For the s-expression syntax, the parser is almost trivial: the grammar is extremely simple, and programs operate on these same abstractions anyway.
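To give a feel for a graph built from such primitives, here is a toy evaluator. The node layout (a tuple of primitive name, input node, and extra arguments) is an assumption made for this sketch and is not YQL's internal representation; only the primitive names come from the text.

```python
from functools import reduce

# Each node is a tuple: (primitive name, input node, extra arguments...).
# A plain Python list plays the role of a source table.


def evaluate(node):
    if isinstance(node, list):
        return node  # leaf: literal input rows
    op, source, *args = node
    rows = evaluate(source)
    if op == "Map":
        return [args[0](row) for row in rows]
    if op == "FlatMap":
        return [item for row in rows for item in args[0](row)]
    if op == "Filter":
        return [row for row in rows if args[0](row)]
    if op == "Sort":
        return sorted(rows, key=args[0])
    if op == "Fold":
        # args[0] is the folding function, args[1] the initial state.
        return reduce(args[0], rows, args[1])
    raise ValueError(f"unknown primitive: {op}")


lines = ["to be or", "not to be"]
graph = (
    "Fold",
    ("Filter",
     ("FlatMap", lines, lambda line: line.split()),
     lambda word: word != "or"),
    lambda total, _word: total + 1,
    0,
)
word_count = evaluate(graph)
```

The graph splits lines into words, drops the word "or", and counts what remains. Because the query is plain data plus λ-functions, it can be inspected and rewritten before anything is executed.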

Next, to produce the desired result, the query passes through several stages, returning to earlier ones if necessary:

At any stage of its lifetime, a query can be serialized back into the s-expression syntax, which is extremely convenient for diagnostics and for understanding what is going on.
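A minimal sketch shows why parsing s-expressions and printing them back is so cheap. The sample expression is made up for illustration and is not real YQL syntax; the sketch also ignores quoted strings and does no error recovery for unbalanced input.

```python
def tokenize(text):
    # Parentheses are the only punctuation, so padding them is enough.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(text):
    """Parse one s-expression into nested lists of atom strings."""
    tokens = tokenize(text)
    position = 0

    def read():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            items = []
            while tokens[position] != ")":
                items.append(read())
            position += 1  # consume ")"
            return items
        if token == ")":
            raise SyntaxError("unexpected ')'")
        return token

    tree = read()
    if position != len(tokens):
        raise SyntaxError("trailing tokens after expression")
    return tree


def serialize(tree):
    """Turn nested lists back into s-expression text."""
    if isinstance(tree, list):
        return "(" + " ".join(serialize(item) for item in tree) + ")"
    return tree


source = "(Filter (Map input double) is_even)"
tree = parse(source)
round_trip = serialize(tree)
```

Since the tree is just nested lists, any intermediate form of the query can be dumped with `serialize` and read by a person.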

Interfaces


As mentioned in the introduction, one of the key requirements for YQL was usability. The public interfaces therefore receive special attention and are being developed very actively.

Console client




The picture shows the interactive mode with autocompletion, syntax highlighting, color themes, notifications, and other decorations. But the console client can also be run in an input-output mode that works with files or standard streams, which makes it possible to integrate it into arbitrary scripts and regular processes. It supports both synchronous and asynchronous execution of operations, viewing a query plan, attaching local files, navigating clusters, and other basic features.

Such rich functionality appeared for two reasons. On the one hand, Yandex has a noticeable group of people who prefer to work mainly in the console. On the other hand, it bought us time to develop a full-featured web interface, which we will talk about later.

A curious technical nuance: the console client is implemented in Python but is distributed as a statically linked native application with a built-in interpreter and no dependencies, built for Linux, OS X, and Windows. In addition, it can update itself automatically, just like modern browsers. All of this was fairly easy to set up thanks to Yandex's internal infrastructure for building code and preparing releases.

Python library




Python is the second most common programming language at Yandex after C++, so the YQL client library is implemented in it. In fact, it was originally developed as part of the console client and was then split out into an independent product so that it could be used in other Python environments without reinventing similar code.

For example, many analysts like to work in the Jupyter environment, for which the so-called %yql magic was created on top of this client library:



Along with the console client, two special subprograms are shipped that launch a preconfigured Jupyter or IPython with the client library already available. They are the ones shown above.

Web interface




We saved the main tool for learning YQL, developing queries, and doing analytics for last. Because the web interface is free of the console's technical limitations, all YQL features are available there in a more visual form and are always at hand. Some of the interface features are shown in examples of other screens:


… and more

All endpoints of the REST API are annotated in code, and detailed online documentation is generated automatically from these annotations using Swagger. From it you can try sending requests without writing a single line of code. This makes it easy to use YQL even if none of the options listed above suits you for some reason. For example, if you like Perl.

Features


It is time to talk about what kinds of tasks can be solved with Yandex Query Language and what features are available to users. This part is kept high-level so as not to lengthen an already long post.

SQL



User Defined Functions

Not all types of data transformation are convenient to express declaratively. Sometimes it is easier to write a loop or use a ready-made library. For such situations, YQL provides a mechanism of user-defined functions (User Defined Functions, or UDFs):



Aggregation functions

Internally, the aggregation functions use a common framework with support for DISTINCT and for execution both at the top level and in GROUP BY (including ROLLUP/CUBE/GROUPING SETS from the SQL:1999 standard). The functions themselves differ only in business logic. Here are some examples:


For performance reasons, a map-side combiner (in MapReduce terms) is created automatically for aggregation functions, and the intermediate aggregation results are merged in Reduce. DISTINCT currently always works exactly (without approximate calculations), so it requires an additional Reduce to mark unique values.
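The combiner idea and the extra deduplication step for DISTINCT can be sketched on plain Python lists. This illustrates the general MapReduce technique with a COUNT aggregate; the function names and the sample data are hypothetical and do not reflect how YQL generates its operations.

```python
from collections import Counter, defaultdict


def map_with_combiner(rows):
    """Map stage: pre-aggregate one input shard into partial counts per key."""
    partial = Counter()
    for key, _value in rows:
        partial[key] += 1
    return list(partial.items())  # at most one record per key per shard


def reduce_counts(partials_from_all_mappers):
    """Reduce stage: merge partial counts that share a key."""
    totals = defaultdict(int)
    for partial in partials_from_all_mappers:
        for key, count in partial:
            totals[key] += count
    return dict(totals)


def count_distinct(shards):
    """Exact COUNT(DISTINCT value) per key, with an extra dedup step."""
    # Extra "Reduce": collapse duplicates of each (key, value) pair.
    unique_pairs = {(key, value) for shard in shards for key, value in shard}
    # Then count what is left per key.
    return dict(Counter(key for key, _value in unique_pairs))


shards = [
    [("ru", "a"), ("ru", "a"), ("en", "b")],
    [("ru", "c"), ("en", "b")],
]
counts = reduce_counts(map_with_combiner(shard) for shard in shards)
distinct = count_distinct(shards)
```

The combiner shrinks each shard to one record per key before anything is sent to Reduce. The DISTINCT variant cannot use the same shortcut directly, because duplicates may sit in different shards, hence the additional pass.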

Table JOINs

Joining tables by key is one of the most popular operations and is needed often, but implementing it correctly in terms of MapReduce is almost a science. Logically, all the standard modes are available in Yandex Query Language, plus several additional ones:



To hide the details from users, on MapReduce-based backends the JOIN execution strategy is selected on the fly, depending on the required logical type and the physical properties of the participating tables (so-called cost-based optimization):

Strategy | Short description | Available for logical types
Common join | 1-2 Map + Reduce | All
Map-side join | 1 Map | Inner, Left, Left only, Left semi, Cross
Sharded Map-Join | k parallel Maps (k <= 4 by default) | Inner, Left semi with unique right, Cross
Reduce Without Sort | 1 Reduce, but requires properly pre-sorted input | In development
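The Map-side join strategy from the table can be sketched as follows: the smaller table is loaded into memory as a hash index, and the larger one is only streamed through a single pass. The sketch assumes that "Left only" means left rows without a match and "Left semi" means left rows with at least one match; the function name, the mode strings, and the sample tables are hypothetical.

```python
def map_side_join(left_rows, right_rows, key, mode="inner"):
    """Join by loading the (small) right table into memory.

    Rows are dicts. Supported modes: inner, left, left_only, left_semi.
    Column name clashes other than the key are not handled, and an
    unmatched row in "left" mode simply has no right-hand columns.
    """
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    for left in left_rows:  # the large table is only streamed
        matches = index.get(left[key], [])
        if mode == "inner":
            for right in matches:
                yield {**left, **right}
        elif mode == "left":
            for right in matches or [{}]:
                yield {**left, **right}
        elif mode == "left_only":
            if not matches:
                yield left
        elif mode == "left_semi":
            if matches:
                yield left
        else:
            raise ValueError(f"unsupported mode: {mode}")


users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "item": "book"}, {"id": 1, "item": "pen"}]

inner = list(map_side_join(users, orders, "id"))
left = list(map_side_join(users, orders, "id", mode="left"))
left_only = list(map_side_join(users, orders, "id", mode="left_only"))
left_semi = list(map_side_join(users, orders, "id", mode="left_semi"))
```

No Reduce stage is needed because every mapper holds the whole small table. That works only while the small table fits in memory, which is the kind of physical property a strategy choice has to take into account.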


Development directions


Among our immediate and medium-term plans for Yandex Query Language:




Summing up





Source: https://habr.com/ru/post/312430/

