
The Rise of the Data Engineer


I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was already a data engineer.

I was not promoted or assigned to this new position. Rather, Facebook came to realize that the work we were doing transcended classic business intelligence. The role we ultimately created for ourselves was an entirely new discipline, and my team and I were at the forefront of this transformation. We developed new approaches, ways of solving problems, and tools. In doing so, we most often ignored the traditional methods. We were pioneers. We were data engineers!

Data engineering?


Data science as an independent discipline is going through a period of adolescent self-assertion and self-definition. Data engineering could be called its “younger sibling”, which has been going through something similar. Data engineering takes cues from its “older relative” while searching for its place and an identity of its own. Like data scientists, data engineers write code. They are highly analytical and take a keen interest in data visualization.

But unlike data scientists, and inspired by a more mature parent discipline, software engineering, data engineers build their own tools, infrastructure, frameworks, and services. In fact, we are much closer to software engineering than to data science.

In relation to previously established roles, data engineering can be seen as a superset of business intelligence and data warehousing that brings in more elements of software engineering. The discipline incorporates specialization in operating distributed “Big Data” systems, the extended Hadoop ecosystem, stream processing, and computation at scale.

In smaller companies, where no dedicated data infrastructure team exists yet, data engineers take on the role of building and maintaining that infrastructure within the organization. This includes tasks like running platforms on Hadoop/Hive/HBase, Spark, or something similar.

In smaller ecosystems, people tend to use hosted services such as Amazon or Databricks, or get support from companies like Cloudera or Hortonworks, which essentially subcontract the data engineering role to other companies.

In larger ecosystems, there is a tendency toward specialization and the creation of a formal role to manage this area, since the team's need for data infrastructure keeps growing. Automating that work frees the team to solve higher-level problems. As the engineering aspect of the data engineer's position grows, aspects of the original business intelligence role become secondary: for example, less emphasis is placed on creating and maintaining portfolios of reports and dashboards.

We now have a solid set of self-service tools with which analysts, data scientists, and the general “information worker” are becoming more data-savvy and can consume data autonomously.

ETL is changing


We are observing a massive shift away from drag-and-drop ETL (Extract, Transform and Load) tools toward a programmatic approach. Product know-how in platforms like Informatica, IBM DataStage, Cognos, AbInitio or Microsoft SSIS is not common among modern data engineers; it is being replaced by more general software engineering skills, along with an understanding of programmatic or configuration-driven platforms like Airflow, Oozie, Azkaban or Luigi. It is also fairly common for data engineers to develop and manage their own job orchestrator or scheduler.
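To make the contrast concrete, here is a minimal sketch of what a code-defined ETL job looks like on one of these platforms. It assumes Airflow 2.4+; the DAG name, tasks and business logic are hypothetical, purely illustrative rather than anyone's production pipeline.

```python
# A minimal sketch of a code-defined ETL job (assumes Airflow 2.4+).
# Table names and logic are hypothetical stubs for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    """Pull raw events for the execution date (stub)."""
    print(f"extracting events for {context['ds']}")


def transform(**context):
    """Clean and aggregate the raw events (stub)."""
    print("transforming events")


def load(**context):
    """Write the result into the warehouse (stub)."""
    print("loading fact_events")


with DAG(
    dag_id="events_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Plain Python objects instead of drag-and-drop boxes: the pipeline
    # lives in source control and can be reviewed, versioned and tested.
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```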

There are many reasons why complex pieces of software are not developed with drag-and-drop tools: ultimately, code is the best abstraction there is for software. While the reasoning on this topic is beyond the scope of this article, it is easy to conclude that everything that applies to any other software applies to writing ETL as well.

Code allows for arbitrary levels of abstraction, lets you express logical operations in a familiar way, lends itself to collaboration, and integrates well with source version control. The fact that ETL tools evolved to expose graphical interfaces looks like a detour in the history of data processing.

It must be emphasized that the abstractions exposed by traditional ETL tools are off-target. There is no doubt that the need to abstract the complexity of data processing, computation and storage exists. But the solution is not to expose ETL primitives (such as source/target pairs, aggregation, filtering) in a drag-and-drop fashion. The abstractions needed are of a higher level. For example, a necessary abstraction in a modern data environment is the configuration of experiments in an A/B testing framework.

What are all the experiments? How will they run? What are the related treatments? What percentage of users should take part in the test? When are the results expected? What will we measure? In this case, we want a framework that receives precise, high-level input, performs complex statistical computation and delivers final results. We expect that adding a new entry will simply trigger additional computation and the data will be updated. It is important to understand that in this example the input parameters of the abstraction are not those offered by a traditional ETL tool, and that building such an abstraction through drag-and-drop would be unmanageable.
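As an illustration, here is a minimal sketch of what such a higher-level abstraction could look like: a declarative experiment configuration (all names, metrics and parameters are hypothetical) that a framework consumes to derive its inputs, statistics and final results.

```python
# A hypothetical declarative config for one A/B experiment. The framework,
# not the analyst, turns this into the actual extraction, statistics and
# reporting jobs; adding a new entry simply triggers extra computation.
EXPERIMENTS = [
    {
        "name": "new_signup_flow",
        "treatments": ["control", "one_page_form"],   # related treatments
        "exposure": 0.10,             # share of users entering the test
        "start_date": "2024-01-01",   # when the first results are expected
        "metrics": ["signup_rate", "time_to_signup"],  # what we measure
    },
]
```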

For a modern data engineer, traditional ETL tools are largely obsolete because their logic cannot be expressed as code. As a result, the abstractions one actually needs cannot be built intuitively with these tools.

Now, knowing that traditional ETL tools will not cut it, one can argue for rebuilding this field from scratch. A new stack, new tools, a new set of rules and, in many cases, a new generation of specialists are needed.

Data modeling is changing


Typical modeling techniques, like the star schema, that defined our approach to data modeling for the analytics workloads associated with data warehouses, are less relevant to us than they once were. The traditional best practices of data warehousing are losing ground on a changing stack. Storage and compute are cheaper than ever, and with the advent of distributed databases that scale out linearly, the scarcer resource is engineering time. Here are some of the changes observed in data modeling techniques:

- Further denormalization: maintaining surrogate keys in dimensions can be tricky, and it makes fact tables less readable. Using natural, human-readable keys and dimension attributes directly in fact tables is becoming more common, reducing the need for costly joins that can be heavy on distributed databases (see the sketch after this list). Encoding and compression in serialization formats like Parquet or ORC offset most of the performance loss that would normally come with denormalization.
- Blobs and complex types: modern databases have growing native support for blobs and nested types, which opens new moves in the data modeler's playbook and allows fact tables to store multiple grains at once when needed.
- Dynamic schemas: with the growing popularity of document stores and blob support in databases, it is becoming easier to evolve a schema without executing DML, enabling an iterative approach to warehousing instead of requiring full consensus up front.
- Systematically snapshotting dimensions (storing a full copy of a dimension on each ETL cycle, usually in distinct table partitions) is a simple, generic way to handle slowly changing dimensions now that storage is cheap.
- Conformance, as in conformed dimensions and metrics, is still extremely important, but with pipelines that need to move fast and more teams contributing, it is less of an imperative and more of a trade-off.
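As a toy illustration of the denormalization trade-off above (the schema and data are made up; runnable with Python's built-in sqlite3), compare a classic star-schema join with a denormalized fact table carrying a natural key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Classic star schema: a narrow fact table joined to a dimension.
cur.execute("CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE fact_sales (country_id INTEGER, amount REAL)")

# Denormalized variant: the natural key lives directly in the fact table,
# trading storage for fewer joins on a distributed engine.
cur.execute("CREATE TABLE fact_sales_wide (country_name TEXT, amount REAL)")

cur.execute("INSERT INTO dim_country VALUES (1, 'France')")
cur.execute("INSERT INTO fact_sales VALUES (1, 9.99)")
cur.execute("INSERT INTO fact_sales_wide VALUES ('France', 9.99)")

# The same question, with and without the join.
print(cur.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_country d USING (country_id)
    GROUP BY d.name
""").fetchall())
print(cur.execute("""
    SELECT country_name, SUM(amount) FROM fact_sales_wide GROUP BY country_name
""").fetchall())
```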
Roles and responsibilities


The data warehouse


“A data warehouse is a copy of transaction data specifically structured for query and analysis,” — Ralph Kimball.

“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process,” — Bill Inmon.

The data warehouse is more relevant than ever, and data engineers are responsible for many aspects of its construction and operation. The data warehouse is the data engineer's focal point, and everything revolves around it.

The modern data warehouse is more open than it used to be. Data scientists, analysts and software engineers now take part in its construction and operation at the same time. Data has become too central to any company's activity to restrict access to it, and more and more kinds of specialists manage it. While this allows the organization to scale its workflows and meet its information needs, it often results in a far more chaotic and imperfect piece of infrastructure.

Data engineering teams often own certified, high-quality areas of the data warehouse. At Airbnb, for example, there is a set of “core” schemas managed by the data engineering team, where service-level agreements (SLAs) are clearly defined and measured and naming conventions are strictly followed. The business metadata and documentation there are of the highest quality, and the related pipeline code follows a clear set of best practices.

Often, the data engineering team also becomes a “center of excellence” that defines standards, applies the best solutions, and runs certification processes for database objects. Such a team can take part in educating other specialists by sharing its best practices, helping other engineers become better citizens of the data warehouse. Facebook, for example, has its own Data Camp education program, and Airbnb has a Data University, where engineers are trained to work with data.

Data engineers are the “librarians” of the data warehouse: the people who catalog and organize metadata and define the processes by which data is filed and extracted. In the fast-growing and partly chaotic world of data, metadata management and tooling are becoming a vital component of any modern platform.

Performance and Optimization


Data is becoming ever more strategic as companies grow, and their data infrastructure budgets are getting impressive. This makes it increasingly rational for data engineers to spend time on performance tuning and on optimizing data processing and storage. Since budgets in this area are rarely reduced, optimization means achieving more with the same resources, or “straightening” exponential growth in workload and costs into linear form.

Knowing the enormous and growing complexity of the data engineering stack, we can assume that optimizing such a stack is no easy task either. While it can be easy to get big wins with little effort, the law of diminishing returns usually applies.

It is, of course, in the data engineer's interest to build infrastructure that scales with the company and to be resource-conscious at all times.

Data integration


Data integration, the practice of integrating businesses and systems through the exchange of data, is as important and as challenging as ever. Software as a Service (SaaS) is becoming the new standard way for companies to operate, and the need to synchronize referential data across these systems is becoming increasingly critical. Moreover, if we want to bring the data generated on the SaaS side into our warehouse so that it can be analyzed alongside the data we already have, the company needs new management standards for it. SaaS products do have their own analytics offerings, but they systematically lack the perspective that the rest of your data provides, so more often than not it is necessary to pull some of that data back. Letting SaaS offerings redefine referential data without integrating and sharing a common primary key leads to a disaster that should be avoided at all costs. No one wants to manually maintain two lists of customers or employees in two different systems, or worse, to do fuzzy matching between them.

Company executives often sign deals with SaaS providers without taking data integration challenges into account. The integration workload is systematically downplayed by vendors to facilitate sales, and it ultimately falls on the shoulders of data engineers as unplanned, underappreciated work. Not to mention that typical SaaS APIs are often poorly designed, lack clear documentation, and are “agile” in the worst sense: you can expect them to change without prior notice from the vendor.
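A minimal sketch of the primary-key problem described above (the endpoint, field names and mapping are all hypothetical): when pulling SaaS records into the warehouse, an explicit mapping between the vendor's IDs and your own keys beats fuzzy matching after the fact.

```python
import json
import urllib.request

# Hypothetical SaaS endpoint returning customer records as JSON.
SAAS_URL = "https://api.example-saas.com/v1/customers"


def fetch_saas_customers():
    with urllib.request.urlopen(SAAS_URL) as resp:
        return json.load(resp)


# Explicit key mapping: vendor id -> our warehouse customer_id,
# maintained as data rather than as ad-hoc fuzzy matching at query time.
KEY_MAP = {"cus_8sKx": 1042, "cus_1bQz": 1043}


def to_warehouse_rows(records):
    rows = []
    for rec in records:
        customer_id = KEY_MAP.get(rec["id"])
        if customer_id is None:
            # Unknown vendor key: route to a review queue instead of guessing.
            continue
        rows.append({"customer_id": customer_id, "email": rec["email"]})
    return rows
```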

Services


Data engineers operate at a higher level of abstraction, and in some cases that means providing services and tooling to automate work that data engineers, data scientists or analysts would otherwise do manually.

Here are a few examples of services that data engineers and data infrastructure engineers may build and operate:

- data ingestion: services and tooling around “scraping” databases, loading logs, and fetching data from external stores or APIs;
- metric computation: frameworks to compute and summarize engagement-, growth- or segmentation-related metrics;
- anomaly detection: automating data consumption to alert people when anomalous events occur or when trends change significantly (a tiny sketch follows this list);
- metadata management: tooling that helps generate and consume metadata, making it easy to find information in and around the data warehouse;
- experimentation: A/B testing and experimentation frameworks are often a critical piece of a company's analytics, with a significant data engineering component;
- instrumentation: analytics starts with logging events and the attributes related to those events, and data engineers have a vested interest in making sure high-quality data is captured upstream;
- sessionization: pipelines specialized in understanding series of actions in time, allowing analysts to understand user behavior.
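As a tiny illustration of the anomaly-detection service above (the threshold and data are made up), the core often boils down to comparing today's value of a metric against its recent history:

```python
import statistics


def detect_anomaly(history, today, z_threshold=3.0):
    """Flag `today` if it lies more than `z_threshold` standard deviations
    from the mean of the metric's recent `history`."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return False
    return abs(today - mean) / stdev > z_threshold


# E.g. daily signups over the last week vs. today's count.
print(detect_anomaly([120, 118, 131, 125, 119, 128, 122], 240))  # True
```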
Just like software engineers, data engineers should constantly look for ways to automate their work and build abstractions that let them climb the complexity ladder. The degree to which processes can be automated varies from one environment to another, but the need to automate is common across the board.

Skills Required


Knowledge of SQL: if English is the language of world business, then SQL is the language of data. How successful a businessman can you be if you don't speak good English? Technologies and generations change, but SQL stands firmly on its feet as the lingua franca of the data world. A data engineer should be able to express any degree of complexity in SQL, using constructs like correlated subqueries and window functions. SQL/DML/DDL primitives are simple enough to hold no secrets for a data engineer. Beyond the declarative nature of SQL, the engineer must be able to read and understand database execution plans and have an idea of how all the steps work, how indices work, and how the various join algorithms and the distributed dimension of a plan behave.
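For instance, a data engineer should be able to read and write a window-function query like the following at sight (toy data; runnable with Python's sqlite3, which supports window functions from SQLite 3.25 on):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, day TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("eu", "2024-01-01", 10), ("eu", "2024-01-02", 20),
     ("us", "2024-01-01", 5), ("us", "2024-01-02", 7)],
)

# Running total per region: a window function, one of the constructs
# the text expects a data engineer to express without effort.
for row in con.execute("""
    SELECT region, day, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
    FROM sales
"""):
    print(row)
```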

Methods of data modeling: for a data engineer, entity-relationship modeling should be a cognitive reflex, along with a clear understanding of normalization and a sharp intuition about the trade-offs of denormalization. A data engineer should also be familiar with dimensional modeling and the related concepts and vocabulary.

ETL design: writing efficient, resilient and “evolvable” ETL pipelines is key.

Architectural projections: like any professional in a given field of expertise, a data engineer needs a high-level understanding of most of the systems, platforms, libraries and other resources at their disposal: the properties, use cases and subtleties of the different flavors of databases, computation engines, stream processors, message queues, workflow orchestrators, serialization formats and related technologies. When designing solutions, they should be able to make good choices about which technologies to use and have a vision of how to make them work together.

Finally


Over the past years working at Facebook, Airbnb and Yahoo!, and interacting closely with data teams at companies like Google, Netflix, Amazon, Uber, Lyft and dozens of companies of all sizes, I have observed a growing consensus about what data engineering is evolving into, and I felt the need to share some of my observations. I hope this article can serve as a kind of manifesto for data engineering, and I hope to spark reactions from the community working in related fields!

Source: https://habr.com/ru/post/321030/

