Who are the data engineers, and how are they?

Hello again! The title of the article speaks about itself. On the eve of the start of the course "Data Engineer" we suggest to find out who the data engineers are. The article has a lot of useful links. Enjoy your reading.

A simple guide on how to catch the wave of Data Engineering and not let it pull you into the deep.
')
One gets the impression that today everyone wants to become a Data Scientist. But what about Data Engineering? In fact, it is a kind of hybrid data analyst and data scientist; A data engineer is usually responsible for managing workflows, processing pipelines, and ETL processes . Due to the importance of these functions, it is currently another popular professional jargon that is actively gaining momentum.

High wages and huge demand are only a small part of what makes this job extremely attractive! If you want to join the ranks of heroes, it’s never too late to start learning. In this post I have gathered all the necessary information to help you take the first steps.

So, let's begin!

What is Data Engineering?

To be honest, there is no better explanation than this:

“A scientist can open a new star, but cannot create it. He will have to ask the engineer to do it for him. ”

–Gordon Lindsay Glegg

Thus, the role of a data engineer is quite weighty.

From the name it follows that data engineering is associated with data, namely with their delivery, storage and processing. Accordingly, the main task of engineers is to provide a reliable infrastructure for data. If we look at the AI hierarchy of needs, data engineering takes the first 2-3 steps: collecting, moving and storing, preparing data .

What does a data engineer do?

With the advent of big data, the responsibility has changed dramatically. If earlier these experts wrote large SQL queries and distilled data using tools such as Informatica ETL, Pentaho ETL, Talend, now the requirements for data engineers have increased.

Most companies with open vacancies for the post of data engineer have the following requirements:

Excellent knowledge of SQL and Python.
Experience with cloud platforms, in particular, Amazon Web Services.
Preferably knowledge of Java / Scala.
Good understanding of SQL and NoSQL databases (data modeling, data storage).

Keep in mind, this is only the most necessary. From this list, we can assume that data engineers are experts in the field of software development and backend.
For example, if a company begins to generate a large amount of data from different sources, your task as a data engineer is to organize the collection of information, its processing and storage.

The list of tools used in this case may differ, it all depends on the volume of this data, the speed of their arrival and heterogeneity. Most companies do not encounter big data at all, so you can use a SQL database (PostgreSQL, MySQL, etc.) with a small set of scripts that send data to the storage as a centralized repository, the so-called data repository.

IT giants such as Google, Amazon, Facebook or Dropbox have higher requirements: knowledge of Python, Java or Scala.

Experience with big data: Hadoop, Spark, Kafka.
Knowledge of algorithms and data structures.
Understanding the basics of distributed systems.
Experience with data visualization tools such as Tableau or ElasticSearch will be a big plus.

That is, there is a clear shift towards big data, namely in their processing under high loads. These companies have increased requirements for system resiliency.

Data Engineers Vs. data scientists

Well, it was a simple and fun comparison (nothing personal), but in fact everything is much more complicated.

First, you need to know that there is a lot of ambiguity in the distinction between the roles and skills of a data scientist and a data engineer. That is, you can easily be puzzled by what skills are necessary for a successful data engineer. Of course, there are certain skills that overlap with both roles. But there are also a number of diametrically opposed skills.

Data science is a serious matter, but we are moving towards a world with a functional date, where practitioners are able to do their own analytics. To enable data pipelines and integrated data structures, you need data engineers, not scientists.

Is a data engineer more in demand than a data scientist?

- Yes, because before you can prepare a carrot cake, you must first collect, peel and stock up on carrots!

A data engineer understands programming better than any data scientist, but when it comes to statistics, everything is exactly the opposite.

But the advantage of a data engineer: without him / her, the value of the prototype model, most often consisting of a horrible quality code fragment in a Python file obtained from a data scientist and somehow producing a result, tends to zero.

Without a data engineer, this code will never become a project, and no business problem will be effectively solved. A data engineer is trying to turn this all into a product.

Basic Information Data Engineer Should Know

So, if this job awakens the light in you and you are enthusiastic - you are able to learn it, you can master all the necessary skills and become a real rock star in the field of data mining. And, yes, you can accomplish this even without programming skills or other technical knowledge. It is difficult, but possible!

What are the first steps?

You should have a general idea of what is what.

First of all, Data Engineering refers to computer science. More specifically, you need to understand efficient algorithms and data structures. Secondly, since data engineers work with data, an understanding of the principles of the databases and the structures underlying them is necessary.

For example, conventional B-tree SQL databases are based on the B-Tree data structure, as well as, in modern distributed repositories, LSM-Tree and other hash-table modifications.

* These steps are based on the wonderful article by Adil Khashtamov . So, if you know Russian, support this author and read his post .

1. Algorithms and data structures

Using the right data structure can significantly improve the performance of the algorithm. Ideally, we should all study data structures and algorithms in our schools, but this is rarely ever covered. In any case, it’s never too late to read.
So here are my favorite free courses for studying data structures and algorithms:

Plus do not forget about the classic work on the algorithms of Thomas Kormen - Introduction to algorithms . This is the perfect reference when you need to refresh your memory.

To improve your skills, use Leetcode .

You can also dive into the world of databases with the stunning videos of Carnegie Mellon University on Youtube:

2. Learning SQL

Our whole life is data. And in order to extract this data from the database, you need to "speak" with them in the same language.

SQL (Structured Query Language - Structured Query Language) is the language of communication in the data area. Regardless of what someone says, SQL lived, is alive and will live for a very long time.

If you have been in development for a long time, you probably noticed that rumors about the imminent death of SQL appear periodically. The language was developed in the early 70s and is still very popular among analysts, developers and just enthusiasts.
Without SQL knowledge in data engineering, there’s nothing to do, since you will inevitably have to create queries to extract data. All modern big data stores support SQL:

Amazon Redshift
HP Vertica
Oracle
SQL Server

... and many others.

To analyze a large layer of data stored in distributed systems, such as HDFS, SQL: Apache Hive, Impala, etc. mechanisms were invented. You see, it is not going to go anywhere.

How to learn SQL? Just do it in practice.

For this, I would recommend to get acquainted with an excellent textbook, which, by the way, is free, from Mode Analytics .

A distinctive feature of these courses is the presence of an interactive environment in which you can write and execute SQL queries directly in the browser. Resource Modern SQL will not be superfluous. And you can apply this knowledge in Leetcode tasks in the Databases section.

3. Programming in Python and Java / Scala

Why study the Python programming language, I already wrote in the article Python vs R. Choosing the best tool for AI, ML and Data Science . As for Java and Scala, most of the tools for storing and processing huge amounts of data are written in these languages. For example:

Apache Kafka (Scala)
Hadoop, HDFS (Java)
Apache Spark (Scala)
Apache Cassandra (Java)
Hbase (java)
Apache Hive (Java)

To understand how these tools work, you need to know the languages in which they are written. The functional approach of Scala allows you to effectively solve the problems of parallel data processing. Python, unfortunately, cannot boast of speed and parallel processing. In general, knowledge of several languages and programming paradigms has a good effect on the breadth of approaches to solving problems.

To immerse yourself in the Scala language, you can read Scala Programming from the author of the language. Twitter has also published a good introductory guide - Scala School .

As for Python, I consider Fluent Python the best intermediate book.

4. Tools for working with big data

Here is a list of the most popular tools in the big data world:

Apache spark
Apache kafka
Apache Hadoop (HDFS, HBase, Hive)
Apache cassandra

More information on building large blocks of data can be found in this amazing interactive environment . The most popular tools are Spark and Kafka. They are definitely worth exploring, it is advisable to understand how they work from the inside. Jay Kreps (co-author Kafka) in 2013 published a monumental work of The Log: what every software developer should know about the abstraction of real-time data integration , by the way, the main ideas from this talmuda were used to create Apache Kafka.

The introduction to Hadoop can be the Complete Hadoop Guide (free of charge) .
The most comprehensive Apache Spark guide for me is Spark: the complete guide .

5. Cloud platforms

Knowledge of at least one cloud platform is in the list of basic requirements for applicants for the position of data engineer. Employers give preference to Amazon Web Services, in second place is the cloud platform Google, and closes the top three Microsoft Azure.

You should be well versed in Amazon EC2, AWS Lambda, Amazon S3, DynamoDB.

6. Distributed systems

Working with big data implies the presence of clusters of independently working computers, the connection between which is carried out over the network. The larger the cluster, the greater the likelihood of failure of its member nodes. To become a cool expert in the field of data, you need to delve into the problems and existing solutions for distributed systems. This area is old and complex.

Andrew Tanenbaum is considered a pioneer in this area. For those who are not afraid of theory, I recommend his book “Distributed Systems” , for beginners it may seem difficult, but it really helps you hone your skills.

I consider “Designing Data-intensive Applications” by Martin Kleppmann to be the best introductory book. By the way, Martin has a wonderful blog . His work will help systematize knowledge about building a modern infrastructure for storing and processing big data.

For those who like to watch videos, Youtube has a course on Distributed Computer Systems .

7. Data conveyors

Data conveyors are something you can't live without as a data engineer.

Most of the time, a data engineer builds a so-called pipeline date, that is, creates a process for delivering data from one place to another. These can be custom scripts that go to the external service API or make a SQL query, supplement the data and put it in a centralized repository (data repository) or unstructured data repository (data lakes).

To summarize: the data engineer’s main checklist

To summarize, you need a good understanding of the following:

Information Systems;
Software development (Agile, DevOps, Design Techniques, SOA);
Distributed systems and parallel programming;
Database basics - planning, design, operation and troubleshooting;
Designing Experiments - A / B tests to prove concepts, determine the reliability, performance of systems, and also to develop reliable ways to promptly provide good solutions.

These are just a few requirements in order to become a data engineer, so study and understand the data systems, information systems, continuous delivery / deployment / integration, programming languages and other topics on computer science (not in all subject areas).

And finally, the last but very important thing that I want to say.

The way of becoming Data Engineering is not as simple as it may seem. He does not forgive, frustrates, and you must be prepared for this. Some moments in this journey may push you to quit. But this is a real work and educational process.

Just don't embellish it from the start. The whole point of the journey is to learn as much as possible and be ready for new challenges.

Here is a great picture I encountered that illustrates this point well:

And yes, do not forget to avoid burnout and rest. This is also very important. Good luck!

How do you like the article, friends? We invite you to a free webinar , which will be held today at 20.00. As part of the webinar, we will discuss how to build an efficient and scalable data processing system for a small company or a startup with minimal costs. As a practice, we will become familiar with Google Cloud data processing tools. See you!

Source: https://habr.com/ru/post/452670/

All Articles

Who are the data engineers, and how are they?

More articles: