
Building reliable data processing: lambda architecture inside Google BigQuery

In this article I want to share an approach that allowed us to stop the chaos around data processing. I used to consider this chaos and the re-processing that follows it inevitable, but now we have forgotten what that is like. I'll show an example implementation on BigQuery, but the trick itself is fairly universal.

We have a fairly standard process for working with data. Raw source data is regularly loaded into a single repository, in our case BigQuery. Some sources (our own production) send data every hour; others (usually third-party sources) send it once a day.

The data is then processed into a usable form for various consumers: internal dashboards, reports to partners, results that go into production and influence the behavior of the product. These operations can be quite complex and involve several data sources, but for the most part we handle them inside BigQuery with SQL + UDFs, storing the results in separate tables there.

The obvious way to organize this processing is to schedule the operations. If the data is loaded daily at 01:00, we schedule the processing at 01:05. If a source is loaded around the 5th minute of every hour, we schedule the processing for the 10th minute of every hour. A 5-minute lag is not critical for users, and in theory everything should just work.

But the world is cruel! Data does not always arrive on time, or does not arrive at all until someone fixes the load. If your hourly load finished at the 11th minute and the transformation launched at the 10th, please wait another hour to see that data in the dashboard. And if the operation uses several sources, the situation gets even more fun.

Moreover, the raw data that gets loaded is not always correct (data in general is not always correct!). Periodically it has to be cleaned up or reloaded, and then all the operations have to be re-run with the right parameters so that everything gets fixed.
All of these are, of course, problems with the raw data, and they need to be solved. But this is a war you can never finally win: something will always break. If the data source is internal, your developers will be busy with cool new features rather than with reliability. If it is third-party data, things are even worse. What I want is for the processing at least not to get in the way: as soon as the raw data is fixed, all consumers should immediately see the correct results.

This really is a big problem. So how can it be solved?

Solution #1 - get rid of the processing


If processing leads to problems, then don't do it! Don't do any processing at all and don't store intermediate results. As soon as a user needs results, compute everything on the fly from the raw data. Given the speed of BigQuery, this is quite realistic, especially if all you do with the data is GROUP BY date and count(1), and only the last 14 days are needed.
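As a sketch, "compute on the fly" for such a case could look like this (the table and column names are hypothetical):

SELECT
  DATE(event_time) AS date,
  COUNT(1) AS events              -- the whole transformation is a simple aggregation
FROM dataset.raw_events           -- raw data, loaded as-is
WHERE DATE(event_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)  -- only the last 14 days
GROUP BY date
ORDER BY date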

Most analytics queries are exactly like this, so we actively use this solution. But the approach does not work for complex transformations.

One problem is code complexity. If you pile all the operations into a single SQL query, it becomes unreadable. Fortunately, this is solved with views: logical tables in BigQuery that store no data and are generated from a SQL query on the fly. This greatly simplifies the code.
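For example, a complex transformation can be split into a view plus a short final query (a sketch with hypothetical names):

-- The view stores no data; BigQuery expands it on the fly.
CREATE OR REPLACE VIEW dataset.clean_events AS
SELECT
  user_id,
  DATE(event_time) AS date,
  event_type
FROM dataset.raw_events
WHERE user_id IS NOT NULL;

-- Downstream queries read the view like a regular table.
SELECT date, COUNT(1) AS events
FROM dataset.clean_events
GROUP BY date;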

The other problem is performance, and here everything is bad. No matter how fast and cheap modern databases are, if you run a complex transformation over a year of historical data on every request, it will take time and cost money. There is no way around that. This makes the strategy inapplicable in a fairly large share of cases.

Solution #2 - build a complex system


If there is no way to do without a processing management system, then that system has to be built well: not just a cron schedule of scripts, but a system that monitors data loads and decides when to run which transformations. The pub/sub pattern is probably a good fit here.

But there is a problem: if building a complex system is hard enough, maintaining it and catching its bugs is even harder. The more code, the more problems.

Fortunately, there is a third solution.

Solution #3 - lambda architecture! ...well, something like that


Lambda architecture is a well-known approach to data processing that combines the advantages of scheduled (batch) processing and real-time processing: a batch layer periodically recomputes results over the accumulated data, a speed layer handles the most recent data on the fly, and a serving layer combines the two for consumers.



Usually all of this is built out of several separate systems, but we use essentially the same trick entirely inside BigQuery.

And this is how it works:

Scheduled processing (batch layer). We run daily SQL queries that transform the currently available data and store the results in tables. All these queries have the following structure:
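(A sketch of this structure; the table and column names are hypothetical, and X = 14 days is just an illustration.)

-- Recompute the last X days from the raw data...
SELECT
  DATE(event_time) AS date,
  COUNT(1) AS events
FROM dataset.raw_events
WHERE DATE(event_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)
GROUP BY date

UNION ALL

-- ...and keep the older, already-computed results as they are.
SELECT
  date,
  events
FROM dataset.table_static
WHERE date < DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)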



The results of this query are stored in table_static, overwriting it. Yes, BigQuery lets you save query results into a table that the query itself reads from. So we take the old, already-computed data (to avoid recalculating it) and union it with the freshly transformed data. X days is the window over which we want to keep recalculating the data so that all possible corrections to the raw data are picked up. The assumption is that within X days (the exact number is chosen per source) all adjustments will have been made, everything broken will have been repaired, and the data will no longer change.

Real-time access (speed layer + serving layer). Both of these roles are combined into a single SQL query:
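(The same query as above, saved as the table_live view; the names are still the hypothetical ones from the earlier sketch.)

CREATE OR REPLACE VIEW dataset.table_live AS
SELECT
  DATE(event_time) AS date,
  COUNT(1) AS events
FROM dataset.raw_events
WHERE DATE(event_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)
GROUP BY date
UNION ALL
SELECT
  date,
  events
FROM dataset.table_static
WHERE date < DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY);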



Yes, this is the same query! We save it as a view named table_live, and all consumers (dashboards, other queries, and so on) read their results from this view. Since views in BigQuery exist only at the logical level (a query is stored, not data), every time the view is accessed it recalculates the last X days on the fly, and all changes in the raw data are reflected in the results.

Since the query is the same in both cases, in practice, to avoid duplicating code, the daily query (the batch layer) simply looks like this:

SELECT * FROM table_live 
(and save the results to table_static)
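In standard SQL terms this daily job could be sketched like so (hypothetical dataset name; it assumes that replacing table_static with a query that itself reads table_static via the view behaves like an ordinary overwrite of the destination table):

-- Compute SELECT * FROM table_live and overwrite table_static with the result.
CREATE OR REPLACE TABLE dataset.table_static AS
SELECT * FROM dataset.table_live;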

This approach has several important advantages:

* Consumers always read table_live, so they see up-to-date results without waiting for the next scheduled run.
* Corrections to the raw data within the last X days are reflected automatically; no manual re-processing is needed.
* Only the last X days are recalculated on the fly; everything older comes precomputed from table_static, so queries stay fast and cheap.
* The batch and real-time paths share a single query, so there is no code duplication.
PS If you like using date-sharded tables in BigQuery (we love them very much), there is a solution for that too, but that is a topic for another post. Hint: the functions for working with such tables don't complain if some of the tables are actually just views.

PPS If views in BigQuery supported caching (the way it works for regular queries), that would be really cool. It would essentially turn them into materialized views, and our approach would be even more effective. If you agree, you can star the corresponding feature request so that it gets implemented sooner.

Source: https://habr.com/ru/post/279491/

