
“Without a data engineer, the value of the analyst's model tends to zero” - an interview with data engineer Nikolai Markov

Hi, Habr! Data engineering is becoming increasingly popular, and many companies are gradually opening relevant vacancies. We interviewed Nikolai Markov, Senior Data Science Engineer at Aligned Research Group LLC and lecturer on the Big Data Specialist and Data Engineer programs, about what data scientists and data engineers should be able to do, what they often lack, and how to find your place in data analysis.



- Tell us about your path to data engineering. What attracted you to working with data, and why, instead of the popular and attractive Data Scientist role, did you become a Data Engineer, a role whose value people are only now beginning to discuss and understand, and to build educational programs around?

I started out as a software engineer and have been writing Python code for a long time, since university. Frankly, though, I wasn't very good at mathematics at school. When Data Science began to take shape as a separate field, I realized that this is one of those areas where people really make computers do what they were meant for. That is, not shuffling bytes from one place to another or drawing pretty interfaces (I once worked in a web studio; it was a useful and interesting experience, but not every project there delivered any real benefit).

There really is a global, universal purpose in this area, but the catch is that you need to know mathematics well. That was one of the main reasons I didn't go straight into data science, since I would have had to radically change my course of development. For data analysts, Python and Spark are just tools; the foundation is the mathematical apparatus, parameter selection, and model tuning. I liked programming too much to dive headlong into another pool.

Then I suddenly realized that people who are good at solving math problems have a weakness of their own: they often have no experience with production coding, with building a finished product. And I cheekily concluded that I had an advantage there.

That is how I came to Data Engineering. First, it lets me do what I love, but in a field that is more interesting to me. Second, it is an in-demand area. It's cool when you get paid for something you truly enjoy doing, right?

The longer I do this, the more clearly I see that Data Science is a very diverse field, and everyone can find a niche in it. When I began to teach, I set myself the goal of getting this idea across to students: it is extremely rare to find people who can do everything, both solve the math and write the code, so you should look at any problem more broadly. There are a lot of tasks in this area: not only models, but also designing the architecture, deployment, and setting up the complete workflow chain from a prototype to a finished product. All these steps are important, and each requires an appropriate specialist.

- What difficulties have you faced professionally? What were the challenges at the beginning of your career and later on?

The same difficulties as everyone else: inexperience at first, and having to deal with many different technologies. You join a new company and your stack changes completely. I remember that a few years ago, when I came to Mail.ru Group to write Python, I was put on 24/7 support of a script written in Perl (and I would not call it an example of quality code), a language I had not known before and had to learn from scratch. Such things happen quite often; on the other hand, they let you gain experience in different areas and look at a task from a different angle.

Accordingly, when I moved closer to analytics, mathematics became one of the main problems, but the many books I bought helped me: on algebra, statistics, probability theory and so on. For more than a year I was buried in the literature; at first I had to force myself every day, then it became a habit. In the end it seems to have given good results. Though that is probably not for me to judge.

- What do you see as the advantages and disadvantages of a data engineer's work compared to a data scientist's?

It's like in that old story: as a data engineer, you understand programming better than any data scientist, while with statistics it is the other way around. It is a little sad that other people get to do the research, the mathematics, and the selection of clever features. But the data engineer has this advantage: without one, the value of the prototype model tends to zero. That prototype most often consists of a piece of terrible-quality code in a Python file (and it's good if it's Python and not some R!) that came from a data scientist and somehow produces some kind of output. Without the data engineer, this code will never become a project, and no business problem will be solved. Until then it is just a quick-and-dirty experiment. The data engineer tries to turn it into a product.

- Today a huge number of qualified specialists in the field are leaving to work abroad. What do you think Russian companies need to do to prevent this outflow of personnel?

The recipes are the same as everywhere: a competitive salary, benefits, all the usual things. With the crisis, it has become more profitable for a great many people to work remotely for a foreign company: the salary is higher and there are more opportunities. Our companies are in a difficult situation, and in my opinion the only thing they can do is select candidates carefully and not be afraid to invest in smart people. Perhaps you should not hire a bunch of students who were studying linear algebra only yesterday, but rather spend more of the budget on a couple of good specialists who can solve specific business problems. For example, say we have a store and want to build a recommendation system for the site. Then it makes sense to search specifically for specialists in that area, instead of spending money on people who heard somewhere that Data Science is trending and want to work in it, but in fact don't always qualify even as juniors.



- Why did you decide to take up teaching?

I have always liked writing articles. Besides, historically everyone in my family has been a scientist or a teacher, so it turns out it's in the blood. I also like this philosophical principle: if you want to understand a problem yourself, the surest way is to sit down and write a serious article about it; and if you have already reached a serious level but want to consolidate it so that you know it by heart, you can go and teach.

My collaboration with the guys from Newprolab as a teacher began when I attended the first intake of the “Big Data Specialist” course and didn't really like the way programming was taught: engineering details were sorely lacking. I went up to the guys, offered my help, and off it went. Now I teach Python there.

- Tell us about your approach to teaching. What, in your opinion, is the most important thing in the process of learning programming and data analysis?

My approach is to get people interested, to infect them with energy. I like it when they sincerely dig in and ask questions, and for me, frankly, it is one more opportunity to chat about a topic that interests both me and those around me. The most important thing, it seems to me, is to cultivate curiosity in yourself; without it, it is difficult to get any benefit from the lessons.

As for specific things, I try, for example, to make presentations that are not just pretty pictures but something you can keep for yourself: all links are clickable, and you can read this or that article, so that in the end people have a base to build on.

- Based on your teaching experience, which aspects of programming and, in particular, data analysis do students find most difficult? What is the reason for this?

Tricky question. I now work with a very mixed audience: the data analysis course attracts both people from top (and not so top) management, who are trying to understand what machine learning is and how to apply it in business, and experienced engineers, who already know how to code but do not yet fully know how these skills apply to data analysis. As a result, the former find programming very difficult, because they have never done it in their lives, while the latter need to recall mathematics, literally starting from the first year of university. On top of that, the toolset for data analysis is often quite different from the standard toolset of a typical IT company, which is also a challenge for course participants.

- How do you manage to keep a balance in teaching between these groups with different backgrounds?

Credit is due to the organizers here. When people start working in teams, some help others, and most often such mixed teams work well: people solve problems from different areas and pull each other up to a more or less common level. Then, after they have written some basic things in Python and solved a few labs together, people gel and feel more confident. It also matters that, unlike at university, people have gathered here in person to study a specific problem and spend a large chunk of their time on it. When you spend a lot of time and energy on something, it becomes important, and you sincerely want to get to the bottom of it and “push harder”. And yes, I also think the course fee is an additional incentive to learn. Although there are good free courses too, such as the one from ODS (I'll say more about it a little later).



- From your observations, which soft and hard skills do both beginner and experienced data scientists and data engineers often lack to become truly high-class specialists?

It seems I have already come to terms with the fact that data scientists do not have to program very well; it is not their job. There are still people and courses that hold that a data scientist should be able not only to train models but also to write perfect code, draw visualizations, build pipelines, and give presentations. In fact, different tasks call for different skills. Of course, a model written on a whiteboard with a marker or in a notepad is not of much use either; it still needs to be some kind of code, something better than a spreadsheet with macros in Excel.

With engineers, the situation is that they are not required to have a deep understanding of mathematics or of how these models work. But for data engineers to turn these models into products effectively, they need, in my opinion, experience in technology companies. For me, one of the most basic problems with data scientists and junior engineers is that they don't even know how a workflow is usually organized. Ideally there should be code review, the code should be kept in a version control system, there should be continuous integration, and so on. Of course, as I said above, you shouldn't expect a data scientist to know all this well. But it certainly doesn't hurt engineers to know it. That said, even smart engineers working in technology companies do not always have these skills, either because they built their infrastructure from scratch without following anyone's example, or because there were no right people nearby to explain why it is worth doing things this way and not otherwise.

- What life hacks can you share with people who are starting out in working with data, both for studying the field and for building a career? Tell us how you see the best path in data analysis.

As for studying the field, I can indulge in a bit of shameless PR: there is a very good Data Science community in Russia, Open Data Science, with a Slack channel of 6500 people, many of whom are always ready to help, so make use of it, especially at the start. Of course, English is a mandatory skill; if yours is weak, improve it first, because, despite the efforts of publishers, the best books on data analysis are still in English.

In terms of building a career, not only in Data Science but in any field, a very important skill today is the ability to explain to people how what you have built works. The ability to present your work is extremely important, especially for a data scientist. As I have already said, no one will understand a model written on a sheet of paper; it is another matter when there is a plotted graph, a visual presentation. I understand that there are introverts who want to sit in a corner and code or calculate something, but it is better to learn to overcome yourself at least sometimes.

With data engineering it is trickier still, because companies that are now starting to build Data Science departments already have IT processes that are geared in a different direction. They have long had a product and an IT department that supports the site or service and periodically, carefully, rolls out new features so that nothing breaks. Moreover, the older the company, the stricter its IT processes and the heavier its legacy: all code strictly goes through 10 stages of code review, reaches production no sooner than 10 days after it is committed, and so on. These are good practices for building stable products, but bad ones for building a Data Science infrastructure. Data analysis is changing rapidly, and the field requires experimentation: you quickly knock a model together, launch the project, watch how it behaves, and correct something.

So from the engineer's point of view, besides the ability to push the necessary ideas through, you also need the ability to integrate into existing infrastructure. For example, at the company I work for, before we had proper data engineering, there was simply no adequate dialogue between Data Science and Operations. On top of that, there was a very old data analysis infrastructure and an in-house build system written over many years. When we started trying to build Data Science around this, we realized that the environment was holding us back. As a result, a rather small team had to build a company within the company, that is, a small sub-infrastructure. And it seems to me that this is not just our local problem; it is happening in every company that starts introducing such things.

And to be effective, you need to learn how to talk to people. Of course, these skills lean toward management, but one way or another we have to deal with these things. The field is quite new, the transition to new practices is happening everywhere, and people who can lead this transition from legacy to data-driven insights and break through the inertia will always be valued and respected.

Useful resources for data engineers from Nikolai Markov


Books



Courses





Source: https://habr.com/ru/post/340582/

