
“If you want to create something really cool, you have to dig deeper and know how your code works in the system, on the hardware.”

Hello, Habr! It is curious how many programmers and developers have discovered data science or data engineering and built a successful career in big data. Ilya Markin, a software engineer at Directual, is one such developer who switched to data engineering. We talked about his experience as a team lead and his favorite data engineering tool; Ilya also told us about conferences and interesting Java-focused channels, about Directual from both the user and the technical side, about computer games, and more.


- Ilya, thank you for taking the time to meet. Congratulations on your fairly recent move to a new company and on the birth of your daughter; you have plenty on your plate. So, straight to the first question: what were you offered to do at Directual that was interesting enough to make you leave DCA?
- Probably I should first tell you what I did at DCA. I joined DCA (Data-Centric Alliance) after completing the Big Data Specialist program. At that moment I was actively interested in the topic of big data and realized that this was the area in which I wanted to develop. After all, where there is a lot of data, there are plenty of interesting engineering problems to solve. The program helped me quickly immerse myself in the big data ecosystem; there I got the necessary basic knowledge about Hadoop, YARN, the MapReduce paradigm, HBase, Spark, Flink, and much more, and about how it all works under high load.

I was invited for an interview by the guys from DCA. DCA is a major player in the RTB market (Real Time Bidding is an advertising technology that organizes a real-time auction between sellers and buyers of advertising; the object of bidding at the online auction is the right to show an ad to a specific user. RTB relies on maximally accurate targeting of the visitor - Ed.). DCA had high coverage of Runet users: about 600 million cookies, and a cookie is not the same as a user - one user can have many cookies: different browsers, different devices. We received dozens of terabytes of data about page visits per day. All of this was processed, and each cookie was assigned to a certain set of segments. This way we could identify, for example, cat lovers aged 20 to 25 living in Moscow, in order to offer them food for their beloved cat at a store near home. And there are many such examples, some quite simple, some complex. Under the hood there was a lot of Java, Scala, and C++. I joined the company as a developer, and six months later I became a team lead.

I left DCA at the end of spring; by that time I was tired of the managerial load and began looking at technical positions. It turned out that I could go a week without writing any code. We met with the team, discussed interesting solutions, thought through the architecture, wrote up tasks. When I took something from the list, I sometimes did not have time to complete it, because there were too many team-lead duties. Maybe the problem was in me and I simply could not allocate my time correctly.

And yet I gained rewarding experience. First, working with the team and with the business: it is interesting to be at the junction of development and business, when you receive a request to implement some functionality, you think it over and evaluate the options. Often you have to decide which will be more useful in a particular situation: hack something together quickly "on the knee", or spend two weeks, or even more, but deliver a stable, properly working solution.

- And which kind of solution was chosen most often - "on the knee" or the two-week one?

- Deep down, a developer is always a perfectionist; he can endlessly fiddle with some interesting task, reworking and optimizing it. Of course, you need to know when to stop. We usually chose solutions somewhere in the middle.

Secondly, I was finally in a position where you can participate in decision-making and be aware of what is happening in the company. I don't like to just sit and code in my corner; I want to know what is happening with the product, how it performs, how users react.

Thirdly, I began to conduct interviews and found myself "on the other side of the barricades", so to speak. Before the first interview I was very nervous: I read the resume and thought: "Damn, now a star will come in, and I don't even know half of what he has written. What will I even talk to him about?" But in the process of communicating you sober up and understand why demand in the IT market exceeds supply. It is difficult to find a good specialist; most often he is sitting somewhere where everything suits him. Finding a ready-made specialist for your specific tasks and technologies, one who won't need to be trained or retrained, is practically unrealistic; you have to use connections and ask friends, acquaintances, and colleagues. Networking is very important here. For example, I brought to the company a friend of mine whom I was sure of and with whom I had worked at my previous job. We also hired a recent university graduate who had little experience with our stack, but during the interview I realized he was a very promising guy.

People often work with frameworks rather than with the underlying tools, and I think this is a problem now. A candidate with two years of experience as a Hadoop big data developer comes in, you start asking how Hadoop works and what parts it consists of, and the person does not know. Since Hadoop provides certain interfaces to simplify working with it, that is enough for a certain range of tasks. And often a person never goes beyond the scope of these interfaces; that is, he writes code from point A to point B, and what happens to the packaged code after he submits it to the system no longer concerns him. For many this is enough; they don't want to understand it more deeply. Conducting interviews is an excellent experience, not only in hiring: it also builds your confidence as a specialist, which is very useful.

Why Directual? When I was the coordinator of the Data Engineer program, Artem Marinov and Vasya Safronov from Directual came to speak to us. Artyom, by the way, had once interviewed me at DCA (networking pays off again), and now invited me to talk. They needed a Scala developer, but they were ready to consider a Java developer who understood how the JVM works under the hood. And so I ended up here.

- So what were you offered to do at Directual that was so interesting? What attracted you?

- Directual is an ambitious startup that implements all the projects it announces; that is, it does what it promises. I was pleased to become part of the team and take an active part in everything being built. It was also important to me that the company pays its own way by working with clients and does not live off investors' money.

I will tell you a little about the project, both from the user side and from the technical side.

Directual's slogan is "Let people create!". That is the main idea: to give any person without knowledge or experience of writing code the ability to program in our visual editor.

How it works: through a browser, the user can assemble "cubes" on our platform (read: the functional nodes of a process), that is, build a scenario that will process incoming data. The data can be absolutely anything. The processed output can take different forms, from a PDF report to a notification sent to several administrators. Simply put, any business process can be programmed in minutes without being able to write code. The company works in two directions: boxed solutions for corporate clients and a cloud option for a wide range of users.
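Conceptually, a scenario of "cubes" is just a chain of transformations applied to each incoming record. A minimal sketch of that idea, assuming a record is a simple key-value map (all names here are illustrative, not Directual's real API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: a scenario is an ordered chain of "cubes" (nodes),
// each one transforming an incoming record and passing it on.
public class Scenario {
    private final List<UnaryOperator<Map<String, Object>>> nodes = new ArrayList<>();

    // Append one functional node, like dragging a cube into the visual editor.
    public Scenario addNode(UnaryOperator<Map<String, Object>> node) {
        nodes.add(node);
        return this;
    }

    // Push the record through every node in order.
    public Map<String, Object> process(Map<String, Object> record) {
        for (UnaryOperator<Map<String, Object>> node : nodes) {
            record = node.apply(record);
        }
        return record;
    }
}
```

A branching block would be one more node type that routes the record to one of several sub-chains; the sketch keeps only the linear case for brevity.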

To make it clearer how this works, I will give a few examples.
Any online store has a number of functional stages ("cubes" in our case), from showing goods to the customer to adding them to the cart and arranging delivery to the final consumer. Using the platform, we can collect and analyze data: purchase frequency, purchase times, the user's path, and so on, which lets us interact with customers more closely (for example, develop seasonal offers and individual discounts). However, this by no means makes our platform a site builder for online stores!

Directual copes well with automating logistics processes and the HR operations of large companies, as well as with building any other technological solution, from a farm for growing greens to a smart home. On the platform, for example, you can create a Telegram bot in a few clicks; almost every employee who works on the core of the system has a bot of their own. Someone made a librarian assistant, someone a bot that helps to learn English words.

In a sense we "take away" the work of some programmers, because now there is no need to turn to them for help, prepare a technical specification, and check the results. Now you just need to know how your business should work and understand the processes themselves; we do the rest.

- Listen, but software for a greens-growing farm, for example, has existed for a long time. What makes you different?

- Yes, that's true, there are off-the-shelf solutions for greens-growing farms. However, you do not develop that software yourself, you buy a ready-made product. With our platform you can tailor the software to yourself, your business, and your tasks, and you don't need to hire developers.

- And what exactly are you doing?

- The company is divided into two parts: development of the core of our system, and the project office, which is essentially our customer zero, so to speak. I work on the system core.

As I said, we want to give anyone the opportunity to work on our platform, and to that end we are building our cloud. And there are plenty of challenges. Here is the difficulty: say there are 10 thousand users, each has several data-flow scenarios, and each flow has 10-20 branching blocks. Imagine the load on the hardware. And we need to be able to isolate everything cleanly so that one client's processes do not interfere with another's and do not slow their work down. If one client has a problem we need to solve, fixing it must not hurt the work of another client.

Since the user does not need to think about how it all works under the hood, he is freed from choosing the storage. We support different databases, both relational and NoSQL, and the system treats them all uniformly. The client does not need to think about this either: when an account is created, the system, depending on the tasks, helps make the best choice of storage.
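Treating different databases uniformly usually means hiding them behind one storage interface. A minimal sketch of that pattern, with illustrative names (not Directual's actual API), where the in-memory backend stands in for a relational or NoSQL one:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical storage abstraction: callers see one contract,
// while the backend could be relational, NoSQL, or anything else.
public interface RecordStore {
    void put(String id, Map<String, Object> record);
    Optional<Map<String, Object>> get(String id);
}

// One possible backend; a JDBC- or HBase-backed class would
// implement the same interface and be swapped in transparently.
class InMemoryStore implements RecordStore {
    private final Map<String, Map<String, Object>> data = new HashMap<>();

    public void put(String id, Map<String, Object> record) {
        data.put(id, record);
    }

    public Optional<Map<String, Object>> get(String id) {
        return Optional.ofNullable(data.get(id));
    }
}
```

The account-creation step would then simply pick which `RecordStore` implementation to instantiate for the client.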

Our platform is a good example of a highly loaded distributed system, and my task is to write good code so that it all works flawlessly. As a result, I got what I wanted here: I work with the tools that interest me.


- And how did you come to the field of work with data?

- At my first job I mostly did the same type of tasks in a fairly narrow segment (read: parsed XML :)), and I quickly grew tired of it. I started listening to podcasts and realized how big the world around me is, how many technologies everyone is talking about: Hadoop, Big Data, Kafka. Then I realized I had to study, and the "Big Data Specialist" program turned up at just the right moment. As it turned out, I wasn't mistaken: the first module (MapReduce, Hadoop, Machine Learning, DMP systems - Ed.) was very useful and I wanted to study it, but the second module, about recommender systems, I simply didn't know where to apply and never touched. Then I went to DCA to work on what interested me. There a colleague told me that besides the data scientist there is also a data engineer in this field, and explained who that is and how he can be useful to a company.

After that, you announced a pilot launch of the Data Engineer program, and of course I decided to go. I already knew some of the products covered by the program, but for me it was a good overview of the tools; it structured everything in my head, and I finally understood what a data engineer should work with.

- But most companies do not separate these two positions, these two professional profiles: they try to look for universal specialists who will collect and prepare the data, build the model, and bring it to production under highload. What do you think this is connected with, and how correct is it?

- On the "Big Data Specialist" program I really liked the talk by Pavel Klemenkov (then at Rambler & Co); he spoke about ML pipelines and mentioned programmer-mathematicians. He was talking precisely about such universal specialists: they exist, there are few of them, and they are very expensive. That is why Rambler & Co tries to grow them in-house and look for strong people. Such specialists really are hard to find.

I believe that if you really have a lot of data and need scrupulous work with it (and not just predicting a person's gender and age or boosting click probability, for example), then these should be two different people. There is a 20/80 rule: a data scientist is 80% data science and 20% able to write something for production, while a data engineer is 80% software engineer and 20% aware of what models exist, how to apply them, and what to compute, without going deep into the math.

- Tell us about your most important discovery in data science / data engineering. Maybe the use of some tool or algorithm radically changed your approach to solving problems?

- Probably the fact that, given enough data, you can extract a lot of useful information to act on. Even if at times you don't know what this raw, anonymized data is, you can still do something with it: split it into groups, find peculiarities, simply express patterns in numbers using mathematical methods. True, analysts could do this before too, but the fact that it has become more accessible as hardware has grown more powerful - that's cool! The threshold for entering data science has dropped; you no longer need to know that much to try doing something with the available tools.

- What was your biggest failure at work? What lesson did you learn from it?

- I will probably disappoint you: I haven't had one yet; maybe it's still ahead. I honestly thought back and recalled, but there was nothing of the sort - very boring. It's like with sysadmins: if you haven't "dropped prod" or "wiped the database", you're not a real admin. So I'm probably not a real developer.

- What data engineering tools do you use most often and why? What is your favorite instrument?

- I like Apache Kafka very much. A cool tool in terms of both the functionality it provides and the engineering behind it. A peculiarity of Kafka is the close relationship between its code and the operating system it runs on, Linux (read: "it works fast and well"). That is, it uses various native Linux facilities that allow excellent performance even on weak hardware. I believe that in our field it should be this way: it is not enough just to know a programming language and a couple of frameworks for it. If you want to create something really cool that will be pleasant to use not only for you but for others too, then you need to dig deeper and know how your code works in the system, on the hardware.
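One concrete example of that closeness to the OS is Kafka's use of zero-copy transfer: `FileChannel.transferTo` maps to Linux's `sendfile(2)`, so log data can move from the page cache to a socket without passing through user space. A small sketch of the same JDK call (transferring file-to-file here for simplicity, where a broker would transfer file-to-socket):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Demonstrates the zero-copy primitive Kafka relies on: transferTo hands
// the copy to the kernel instead of reading bytes into the JVM heap.
public class ZeroCopy {
    public static long copy(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long pos = 0;
            long size = in.size();
            // transferTo may move fewer bytes than requested, so loop until done.
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
            return pos;
        }
    }
}
```

On Linux, with a socket as the target channel, the JDK can turn this into a single `sendfile` call, which is exactly what makes serving consumers from the page cache so cheap.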

- What conferences do you attend? What specialized blogs and channels do you read?

- As I said, it all started with podcasts, namely "Debriefing", by guys from the Java world.

There is also https://radio-t.com - a cool Russian-language podcast on high-tech and IT topics, one of the most popular (if I'm not mistaken) in our language.

I follow the news from JUG.ru; the guys put on cool hardcore conferences and organize meetups. I try to attend the Moscow ones, and the St. Petersburg ones too. The top Java conference is JPoint in Moscow (its St. Petersburg counterpart is Joker); I always go to JPoint or watch it online.

I watch what Confluent is doing - the guys who make money on enterprise support for Kafka and are its main committers. They also develop convenient open-source tools around Apache Kafka; I try to use their builds.

The Netflix tech blog on Medium is a cool resource about the solutions behind one of the largest platforms for delivering video content to users. Highload and distributed systems galore.

Telegram channels: https://t.me/hadoopusers - a place where you can chat in our language on data engineering topics; https://t.me/jvmchat - people of the Java world discussing its problems, their own, and more.

- Maybe something else for the soul?

- I grew up on computer games; I used to play very actively, but now there is no time for that. At some point I thought: "If I can't play games, what's stopping me from studying this area?" So when I have free time, I pick up some Java, C#, or C++ game framework and write and build something with it. It rarely reaches a finished product, but I enjoy it.

Blitz:
— Java or Python?
— Java.

— Data Science or Data Engineering?
— Data Engineering.

— ?
— It depends.

Source: https://habr.com/ru/post/424363/

