Since 2014, Lomonosov Moscow State University has hosted Technosphere, an educational program in data mining and information retrieval run by Mail.Ru Group. Its students study various disciplines in this field and intern in the relevant departments of the company, as well as in the laboratory at Moscow State University that we opened in the fall of 2014. We have already written about Technosphere here and here; in this article we want to describe in more detail the curriculum, its results, and the work of the university laboratory, and to share short interviews with the program's interns.

Technosphere Program
Currently, the program consists of four semesters and ten disciplines, some of which run for a full year (two semesters). Every discipline emphasizes practice: students carry out a project individually or in groups. Initially, the Technosphere program lasted one year, but we soon realized that this was not enough time and extended it to two years, lengthening the study of certain disciplines. We added only one new discipline, “Introduction to data analysis”, which covers the necessary material from mathematics and statistics as well as the main tools used (programming languages, libraries, etc.). Lectures for this course are available on our YouTube channel. We expanded the rest of the disciplines. In particular, a part on in-depth study of C++ was added to the course on multithreaded programming in C++, and the information retrieval block became the key one.
Also, to help those who come from non-specialist faculties, we added a preparatory course, "Algorithms and Data Structures". In this course, students analyze various basic algorithms and learn to make a reasoned choice of data structures for specific tasks. These lectures are also available in the corresponding YouTube playlist.
Course projects (some interesting examples)
Introduction to data analysis. During the semester, students must carry out reproducible research based on open data. The work is done in groups of 4-6 people, and each team has its own topic. The study is performed in Jupyter; Python, R, and Java are allowed. The work should follow a methodology similar to CRISP-DM, which defines the stages of the study.
Students are offered a choice of initial data sets (open data from the US government, data from a US sociological service, UN data, the open data portal of the European Union), and they can also find another data set on their own.
Algorithms for intelligent processing of large amounts of data. The project solves the task of classifying users of the social network Twitter. As input, students are given a list of user identifiers; for some of them, the category to which the user belongs is known. An example of a category is an interest in computer games (interested / not interested). The goal of the project is to create an algorithm that most accurately predicts the category of the users for whom it is not given. Along the way, students collect the data themselves using open web service APIs, and implement and apply various algorithms for feature construction and selection, as well as machine learning algorithms.
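To make the setup concrete, here is a minimal sketch of the "label some users, predict the rest" scheme. It is not the students' solution: the choice of a multinomial Naive Bayes classifier, the token features, the category names, and the toy data are all assumptions made for illustration, and the sketch assumes that the words from each user's posts have already been collected.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_users):
    """Count tokens per category. labeled_users: list of (tokens, label) pairs."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_users:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def predict(model, tokens):
    """Return the category with the highest log-posterior (Laplace smoothing)."""
    class_counts, token_counts, vocabulary = model
    total_users = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_users)
        denominator = sum(token_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: tokens extracted from the posts of labeled users.
labeled = [
    (["dota", "stream", "gpu"], "gamer"),
    (["steam", "dota", "raid"], "gamer"),
    (["recipe", "garden", "tea"], "not_gamer"),
    (["garden", "news", "tea"], "not_gamer"),
]
model = train_naive_bayes(labeled)

# Users whose category is not given in the input list.
unlabeled = {"user_a": ["dota", "gpu", "news"], "user_b": ["tea", "garden"]}
predictions = {user: predict(model, tokens) for user, tokens in unlabeled.items()}
```

In a real project the interesting work lies in what this sketch skips: collecting the data, constructing and selecting features, and comparing several learning algorithms.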
Methods for processing large amounts of data. The final semester project is devoted to detecting mutations in cells. It uses the p53 Mutants data set, whose distinguishing feature is a heavily skewed label distribution: fewer than 1% of the samples are positive. The work can be done in any programming language, but students most often use Python and C++.
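A skew like this changes how a model has to be evaluated: a classifier that always answers "negative" is more than 99% accurate and completely useless. The sketch below illustrates the point on synthetic labels (not the p53 Mutants data); the sample counts and the two example predictors are made up for the illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Synthetic labels: 5 positives out of 1000 samples (0.5% positive).
y_true = [1] * 5 + [0] * 995

# Predictor 1: always answers "negative".
always_negative = [0] * 1000
baseline = metrics(y_true, always_negative)

# Predictor 2: finds 4 of the 5 positives at the cost of 6 false alarms.
detector = [1] * 4 + [0] * 1 + [1] * 6 + [0] * 989
detected = metrics(y_true, detector)
```

Here `baseline` has the higher accuracy (0.995 against 0.993) but zero recall, while `detected` recovers 80% of the positives. On data like this, precision, recall, or F1 say far more than accuracy.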
The two-year course will conclude with a semester fully dedicated to developing a graduation project in teams. What it will be, we will tell you in the spring of next year, when the students reach the fourth semester.

Where do students intern?
The knowledge and skills that students gain in Technosphere can be applied in various departments: advertising technologies (ad targeting, news categorization), Mail.Ru Search, Antispam. Currently, out of 23 Technosphere graduates, 12 are interns in the Tarantool, My World, and Mail projects and some others. We asked them to share their impressions.
Svyatoslav Feldsherov
- Which division do you work in? - I have been working in the Tarantool division for about three months. I chose it precisely because I more or less knew what the team was doing and I liked the atmosphere there. In addition, our lead, Konstantin Osipov, taught a semester course in Technosphere; it was difficult and delightful in its own way, and it was clear right away that he had a lot of cool tasks.
- What problems do you solve? - I am working on a large, substantial task: adding SQL support to Tarantool. Another Technosphere graduate is working on it with me. At the moment, communication with Tarantool takes place in Lua, and we are implementing the ability to use SQL. The main idea is to let more users work with our system. We mainly write in C and C++.
- Where are the results of your work used? - Nowhere yet. We are building a prototype; after that there will be a refactoring stage and a major rework of our results by more experienced team members. In general, we still have a long way to go.
- What are your development plans? Which units would you like to work in? - I like what I'm doing now. If everything goes well and I see that this is my thing, I want to go deeper into this area. But sometimes you read articles and talk with friends, and you really want to drop everything and go back to data analysis (this was the main focus of training in Technosphere). For now it is completely unclear how things will turn out. Time will tell.
Mikhail Galkov
- Which division do you work in, and what are your impressions? - I work in the recommender systems department, although I expected to work in Search. I did not know much about recommender systems; I had only encountered them in Technosphere lectures, so at first I had to catch up and read many scientific articles on the topic. But I had enough motivation: we had very ambitious tasks that were really interesting to solve. Almost everything we were told in the lectures was confirmed in one way or another. Big data really is “big”. Understanding the algorithms and tools for processing it helped a lot, and the split of time between working with data and working with algorithms turned out to be almost as predicted; for myself I estimate it at about 70 to 30.
- What problems did you solve at the very beginning of the internship? - At first, I studied articles describing various algorithms, investigated their properties, and compared them using various metrics.
- What problems are you solving now? - Mostly I build algorithms for our universal recommender system and prepare data.
- Where are the results of your work used? - In Odnoklassniki and in the main Search.
- What are your development plans? Which units would you like to work in? - For now I want to keep working on recommendations, but it would be interesting to apply this knowledge in Search itself.
- What are your impressions of your mentor? - I was very lucky that Dmitry Solovyov turned out to be my mentor. Besides his useful comments and his help in setting goals and reaching them, I especially appreciate the exchange of ideas and his willingness to try new things.
Technosphere Laboratory for Students

In the autumn of 2014, only half a year after the launch of Technosphere, we decided to open our own laboratory, where the project's students could work on real tasks from the company's divisions. The division most interested in delegating tasks to students turned out to be one of the units of the advertising technologies department, which works on audience segmentation. We asked the head of this unit, Arthur Kadurin, a few questions. Read our interview below.
- What tasks were set for the students when the laboratory was being designed? - Initially, we planned to give the students various tasks, both research and applied. But as a business, we found it hard to believe right away that students would be able to do part of the work remotely, outside the company's walls, because it is important for us that a task is not just completed, but completed on time. At the same time, we really wanted to try working with students, so from all the tasks the department was solving we singled out the one best suited to remote work: keeping the categories in the catalog of advertising topics up to date. This is important both for the users of our system (advertisers want to choose the right target audience so that they can spend their advertising budget more efficiently) and for visitors to the sites where our advertising system runs (they are shown fewer uninteresting ads and react less negatively to banners).
- And who works in the laboratory? - When the laboratory was created, we already had two employees who were graduates of Technopark at Bauman MSTU, so the initial plan was to recruit about ten lab assistants from all the educational programs. But since the proposed tasks are quite specific and require training, we decided to recruit students only from Technosphere.
In total, about eight people work in the laboratory at any one time. Unfortunately, there is turnover. It stems from the fact that the program used to last one year, and after finishing Technosphere the students moved to the company's divisions. Now that the program lasts two years, the situation has improved.
- What problems does the laboratory solve? - We now have several fundamentally different types of tasks. The mandatory part of the work is, in essence, assessor labeling. The students collect thematic sites and pages and write regular expressions or select keywords, so that we can more accurately determine a user's interest or notice that the interest has changed. In addition, they assess the quality of their colleagues' work, checking on anonymized data whether the system labels pages correctly, and fix errors. This work probably takes the most time.
However, a significant part of the labeling process is automated in one way or another, and the students do much of this automation themselves. They write scripts to process the site tree, collect keywords, and prepare data for writing regular expressions, and they also automate the creation of the regular expressions themselves.
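As an illustration of what "automating the creation of regular expressions" can look like, the sketch below builds one pattern per category from a keyword list and uses it to label text. The category names, the keywords, and the helper functions `build_pattern` and `categorize` are hypothetical; the laboratory's actual scripts and catalog are not described in this article.

```python
import re


def build_pattern(keywords):
    """Compile one case-insensitive, whole-word pattern from a keyword list."""
    # Longer keywords go first so that multi-word phrases are tried before
    # their shorter parts.
    alternatives = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


# Hypothetical categories and keywords, for illustration only.
CATEGORY_KEYWORDS = {
    "cars": ["sedan", "suv", "used car", "test drive"],
    "real estate": ["apartment", "mortgage", "real estate"],
    "clothing": ["jacket", "sneakers", "dress"],
}
PATTERNS = {name: build_pattern(words) for name, words in CATEGORY_KEYWORDS.items()}


def categorize(text):
    """Return the sorted list of categories whose pattern matches the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


labels = categorize("Leather jacket and a used SUV")
```

The word boundaries (`\b`) matter: without them the keyword "dress" would also fire on "addressing". Matching alone is the easy part; the expert check of whether a label is actually right is the part that, as noted below, cannot be automated.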
Naturally, we work with a huge number of domains and an even greater number of pages, so the collected data often has to be stored and processed with Hadoop; in that case the students write the MapReduce jobs themselves and run them on our training cluster.
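For readers unfamiliar with the model, here is a minimal single-process sketch of the MapReduce idea: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. The job shown (counting pages per domain) and the function names are assumptions chosen for illustration; a real job would run on Hadoop, which distributes these same steps across the cluster.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_page(url):
    """Map step: emit (domain, 1) for every URL that has a host part."""
    host = urlsplit(url).hostname
    if host:
        yield host, 1


def reduce_counts(domain, values):
    """Reduce step: sum the counts emitted for one domain."""
    return domain, sum(values)


def run_job(urls):
    """Local stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for url in urls:
        for key, value in map_page(url):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reduce_counts(key, values) for key, values in groups.items())


pages = [
    "https://example.com/cars/1",
    "https://example.com/cars/2",
    "https://example.org/flats",
    "not a url",
]
pages_per_domain = run_job(pages)
```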
However, part of the work, namely the expert assessment, cannot be automated. And the main value of the lab assistants' work for us is precisely this “human” expertise. At first we thought it would be good to manage a complete update of the topic catalog once a year; now, thanks in part to the automation proposed by the lab assistants, we fit into 4-6 months and will most likely get faster still.
In addition to tasks directly related to the thematic catalog, at the end of last year we decided to take up research work in the laboratory. These are probably not scientific problems in the full sense either, because they still come from business, but they are research nonetheless, and the KPIs include papers and conference talks. For this we selected two students from the “production” part of the laboratory, as well as a curator from our side, my colleague Larisa Markeeva. We cannot yet boast of significant results, but I can say that what the students have already done is encouraging.
- How is the laboratory organized? - Besides me, the students are supervised by our colleague from Moscow State University, Sergey Stupnikov. Sergey has been working at the VMK faculty since 2008 and is also a senior research fellow at the Institute of Informatics Problems of the Russian Academy of Sciences. In the Technosphere laboratory he reviews the students' work every week, which is what makes it possible to delegate tasks from the company to the university.
We asked the curator, Sergey Stupnikov, about the internal tasks of the laboratory.
- Are there differences between formulating and solving business problems and scientific ones? - In business, it is often enough to find a suitable solution in the literature and adapt it to the company's needs. Of course, the implementation has to be brought up to production level and tuned for efficiency. Scientific problems, by contrast, are characterized by the emergence of a new idea and its prototyping.
- Tell us about the tasks that were originally set for the laboratory. - The laboratory is a division of Technosphere that brings together the project's students and allows them to solve both purely practical tasks and tasks with a scientific component, and to be paid for it. My responsibilities include coordinating the interns' work and reporting to the data analysis department and the research and education department. On the scientific side there are now several interesting tasks. A couple of them involve the Giraph framework, which implements a computational model for iterative graph analysis. Various interesting algorithms can be implemented on it. Efforts are now focused on aspects of deep learning. This method has great prospects, and company employees are working on it as well.
- How does working in the laboratory help students in research? - Admittedly, we have recently been more occupied with applied tasks. But the work helps students gain various skills, including experience of communicating with management and within a team. I hope that scientific results from the students' work will appear in the future.
I can single out several components of working in the laboratory. The most important is the ability and desire to generate new ideas, perhaps small at first, both applied and scientific. There are also practical skills:
- tasks with a scientific component involve reading and analyzing scientific articles, including ones in English;
- you need to be able to prototype your solutions (program the algorithms) and demonstrate their effectiveness;
- you should be able to present your results and write them up as reports and articles.
- When is the best time to start? - The laboratory requires an investment of time and self-discipline. Working in the laboratory lets students avoid spreading themselves thin across different areas (students often earn a living in jobs outside their specialty) and concentrate on one: data analysis. Technosphere students have assignments for their main university studies and for Technosphere itself; if they are doing well in both, I recommend taking on additional work in the laboratory.
- What would you like to wish students? - I would like them to choose an interesting field that they will work in throughout their lives. In addition to the opportunity to study, the university now offers a chance to get acquainted with various business areas, where you can find an interesting application for your knowledge and skills. I am glad that MSU students have this opportunity.
Laboratory interns Anton Goy and Miras Amir also shared their impressions of working in the laboratory.
Anton Goy
- What did you expect from the internship, and what are your impressions? - It's great that my studies, home, and work are all close to each other. While working I got to grips with many methods. More ambitious tasks have appeared. I have two mentors. They help resolve contentious questions, which is great.
- Where are the results of your work used? - In targeted advertising. We show users ads based on their past behavior. If a person has visited sites that are in our catalog, the system recognizes this and shows what else they might be interested in.
- Were there any funny stories at work? - I came across different topics, and some of them were funny. Thanks to this work, I now know my way around every type of alcoholic beverage and know everything about the credit system. Sometimes the site names are funny, too.
- How does the work help in your research? - I am currently working on a neurophysiology project: identifying interconnected brain areas from images. There is a lot of data, and it has to be processed correctly. In this work I will use the knowledge gained in Technosphere and the laboratory.
- What do you want to do? - Machine learning and data analysis. From the next academic year I plan to move to the office, into a division of the company that has suitable tasks.
Miras Amir
- What are your impressions of working in the laboratory? - I have only been working in the laboratory for a short time, less than a month. In two weeks I learned a few new things, got to know the team, and as a result have grown noticeably. The mentor sets us practical tasks, and they are very different from the tasks at the university. You immediately look at a problem in a different way and develop professionally.
- How does the work help in your research? - Since the work in the laboratory is closely related to the disciplines of my curriculum at the MMP department, I think the combination has a positive effect on my academic performance and self-development: it helps me to be more responsible in my work and to manage my time well.
- How do you see your development? - A data analysis vacancy recently caught my eye; I think that once I gain experience here, I will apply for a more serious position in the company.
- What would you wish those who are still studying and not yet working? - I would like them to organize their time so that they can start working soon, because that is how teamwork skills develop and experience accumulates.

After a year of the laboratory's work, we took stock and calculated the effectiveness of the students' work compared with contractors and employees who had no basic training in data analysis. It turned out that the students did a great job. They not only completed the tasks set by the laboratory's mentor at Moscow State University, Sergey Stupnikov, but also independently developed and proposed automation tools for these tasks. Some of these tools were slightly extended and introduced into the company's business processes.
In the fall of 2015, a second direction was launched within the laboratory: scientific research. The company also devotes time to such questions. In particular, we are actively exploring the use of neural networks in business tasks, but unfortunately we are not yet ready to try them out “in battle”: running them requires a huge investment of resources, and the benefits of the new methods are not yet obvious.
Nevertheless, work is underway, and we have brought interns from the laboratory in to work on these problems. At the moment, two of the six interns in the laboratory are engaged in research.
The students were given the MCL (Markov Cluster) algorithm to work on. Its purpose is to cluster the nodes of a graph.
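For orientation, here is a compact single-machine sketch of the textbook MCL procedure: add self-loops, make the adjacency matrix column-stochastic, then alternate expansion (squaring the matrix) and inflation (raising entries to a power and renormalizing) until the matrix stops changing. This is not the interns' implementation; the inflation value, the thresholds, the helper names, and the example graph are illustrative assumptions.

```python
def adjacency_from_edges(n, edges):
    """Build a symmetric 0/1 adjacency matrix for an undirected graph."""
    matrix = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        matrix[a][b] = matrix[b][a] = 1.0
    return matrix


def normalize_columns(matrix):
    """Scale every column in place so that it sums to 1."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                matrix[i][j] /= total
    return matrix


def multiply(a, b):
    """Plain dense matrix product of two n x n matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, max_iterations=50, tolerance=1e-6):
    """Return clusters as sorted lists of node indices."""
    n = len(adjacency)
    # Self-loops are a standard MCL preprocessing step.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    normalize_columns(m)
    for _ in range(max_iterations):
        expanded = multiply(m, m)                                        # expansion
        inflated = [[x ** inflation for x in row] for row in expanded]   # inflation
        normalize_columns(inflated)
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tolerance:
            break
    # Each row that still carries weight lists the members of one cluster.
    clusters = {frozenset(j for j, x in enumerate(row) if x > 1e-3) for row in m}
    return sorted(sorted(c) for c in clusters if c)


# Two triangles joined by a single bridge edge (2-3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
clusters = mcl(adjacency_from_edges(6, edges))
```

The inflation parameter controls granularity: larger values tend to produce more, smaller clusters. The dense matrices used here only work for tiny graphs, which is why large inputs call for a distributed setting.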
In March 2016, Larisa Markeeva became the curator of this direction. Together with Arthur Kadurin, she set the students a scientific task.
The interns had to understand the basic formulation of the RBM (restricted Boltzmann machine) algorithm and implement it on the Giraph framework. In other words, the students are to build a distributed implementation on Giraph, a graph-processing system based on the Pregel architecture.
This technique makes it possible to pre-train neural networks. A neural network trained on the cluster can later be used, for example, to optimize CTR in advertising.
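As background, the sketch below shows the core of RBM training on a single machine: one step of contrastive divergence (CD-1) for binary units. It is a toy illustration, not the distributed Giraph version the interns are building; the class name `TinyRBM`, the learning rate, the layer sizes, and the training patterns are all assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """Restricted Boltzmann machine with binary units, trained with CD-1."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.n_visible, self.n_hidden = n_visible, n_hidden
        self.w = [[self.rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_visible = [0.0] * n_visible
        self.b_hidden = [0.0] * n_hidden

    def hidden_probs(self, v):
        """P(h_j = 1 | v) for every hidden unit."""
        return [sigmoid(self.b_hidden[j] +
                        sum(v[i] * self.w[i][j] for i in range(self.n_visible)))
                for j in range(self.n_hidden)]

    def visible_probs(self, h):
        """P(v_i = 1 | h) for every visible unit."""
        return [sigmoid(self.b_visible[i] +
                        sum(h[j] * self.w[i][j] for j in range(self.n_hidden)))
                for i in range(self.n_visible)]

    def sample(self, probs):
        """Draw a binary vector from independent Bernoulli probabilities."""
        return [1.0 if self.rng.random() < p else 0.0 for p in probs]

    def cd1_update(self, v0, learning_rate=0.1):
        """One CD-1 step: positive phase on the data, negative phase on a reconstruction."""
        h0 = self.hidden_probs(v0)
        v1 = self.visible_probs(self.sample(h0))  # reconstruction (probabilities)
        h1 = self.hidden_probs(v1)
        for i in range(self.n_visible):
            for j in range(self.n_hidden):
                self.w[i][j] += learning_rate * (v0[i] * h0[j] - v1[i] * h1[j])
            self.b_visible[i] += learning_rate * (v0[i] - v1[i])
        for j in range(self.n_hidden):
            self.b_hidden[j] += learning_rate * (h0[j] - h1[j])


# Toy training data: two complementary binary patterns.
patterns = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
rbm = TinyRBM(n_visible=4, n_hidden=2)
for _ in range(200):
    for pattern in patterns:
        rbm.cd1_update(pattern)
reconstruction = rbm.visible_probs(rbm.hidden_probs(patterns[0]))
```

An RBM is a bipartite graph of visible and hidden units in which each unit only needs the states of its neighbors, which is what makes a vertex-centric, message-passing system in the Pregel style a plausible fit.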
As we have already mentioned, for the time being we will use neural networks only in experiments. This work may eventually be released as open source.
The research internships make use of various technologies: Java and Hadoop, with Python and notebooks for visualization and prototyping. There are also plans to use Apache Spark. The algorithms that the interns research and develop could potentially form the basis of bachelor's or master's theses.
We asked one of the interns, Pavel Kovalenko, about work in the research unit of the laboratory:
- What are your impressions of the lab? - I was invited to work at the laboratory in February of last year. For me, this is my first experience of working in a team. The laboratory's format is very well suited to combining with studies: work from home plus a weekly meeting at the university.
- What problems did you solve at the very beginning? - At the beginning we worked as assessors. We had to use regular expressions to label the pages of large sites by the topic of their content (cars, real estate, clothing, etc.). Collecting a training set is the first step in solving the problem of automated page classification by topic, which several of the laboratory's students are currently working on.
- What problems are you solving now? - At the beginning of this year, my colleague Alexander Shcherbakov and I were offered an interesting new task: working with Apache Giraph (an add-on to Hadoop for distributed graph processing), namely creating a distributed version of the restricted Boltzmann machine. We are told that someone at Mail.Ru Group needs to apply the Boltzmann machine to really big data.
- How does Technosphere help in your research and studies? - For me, Technosphere has been an invaluable experience. Perhaps I joined a little early, at the beginning of my second year. Many things were unclear to me. In particular, the Data Mining course requires a good knowledge of probability theory, which I did not have at all. Technosphere influenced my choice of department at VMK: I really liked the Data Mining course and the ideas of machine learning, so I joined the department of mathematical forecasting methods, which in fact deals with machine learning. Technosphere's Hadoop course is very helpful for working in the laboratory. Where else can you get practical skills working on such a cluster?
It so happened that the tasks we do in the laboratory are closely related to my studies at the department, so one does not interfere with the other but, on the contrary, complements it. Applying the knowledge gained in lectures helps me to better understand how the algorithms work.
- What are your development plans? - I do not want to look that far ahead. I believe that at university most of your time should be devoted to study and research. That is why the laboratory's format is so convenient: you can work at any convenient time outside your studies, and you don't need to travel far.
- What are your impressions of your mentor? - With the new task we got a new mentor, Larisa Markeeva. It is a pleasure to work with her, because she really knows this area well (Hadoop and Giraph) and is always open to communication, helps to resolve difficulties as they arise, and gives advice on implementation. A funny story happened recently in this connection. For our work we needed Hive (an add-on to Hadoop for distributed data processing using SQL-like commands). Hive was installed on the training cluster, but because of a misconfiguration it did not work at all. Larisa wrote to someone in the company who was in charge of administering the cluster, and he promised to look into it soon. The next day, he quit Mail.Ru Group. Hopefully we weren't the ones who drove him to it. :)
Mail.Ru Group Data Science Championships
In addition to training in the field of big data, Mail.Ru Group also holds two major championships where students can try their hand: the Russian AI Cup and the ML Boot Camp.
Russian AI Cup is an annual championship in programming artificial intelligence, using game strategies as the example.
It has been held four years in a row, and each time the task is different. This year, participants had to program the behavior of a racing car so that it avoided obstacles, did not hit the walls of the track, and also shot at other participants' cars.
The winners traditionally receive valuable prizes (the prize fund is about a million rubles), but most importantly, all participants improve their skills in programming artificial intelligence. Following the competition, the winner of Russian AI Cup 2015 got a job at Mail.Ru Group and is now developing artificial intelligence for the company's game projects.
ML Boot Camp is a new Mail.Ru Group initiative for training machine learning developers. On the platform, participants can learn to solve machine learning and data analysis problems, try their hand at contests, and win valuable prizes. Those interested can practice between competitions by working through the educational material and solving test problems.
