
Into a master's program without exams: the new "Big Data" track at the "I am a Professional" Olympiad

We continue our series about "I am a Professional", a competition for bachelor's, master's, and specialist degree students. It is held with the support of the country's strongest universities. Today we will talk about a new competition track, “Big Data”, which is supervised by ITMO University.

Sberbank is the general partner of the “I am a Professional” Olympiad for the tracks supervised by ITMO University: Computer Science, Information and Cyber Security, and Big Data.


Christoph Scholz / Flickr / CC BY-SA

A few words about the "I am a Professional" competition


The Olympiad is open to students of a wide range of specialties.

This year there are 54 tracks, including mathematics, artificial intelligence, software engineering, the Internet of Things, photonics, and many others.

Why participate? Winners get the opportunity to enter Russian universities without entrance exams and to take internships at the Olympiad's major partner companies: Yandex, Sberbank, MRG, and others. Students who perform well will also have the chance to attend winter schools, where they can meet industry experts.

How to participate. Registration is open until November 22. The online qualifying round runs from November 24 to December 9. Those who have completed at least two online courses from the list approved by the organizers may skip it. The final stages begin in February 2019.

They will be held in person at universities across the country. ITMO University supervises five tracks of the Olympiad at once. We have already written about some of them, including “Robotics”. Today we present the "Big Data" track, which is new to the Olympiad this year.

The "Big Data" track: what you need to know


Many events and seminars around the world are devoted to Big Data.

Notable international conferences include SIGMOD, SIGKDD, and ICML. More and more events of this kind are taking place in our country as well, for example DataFest, the Big Data Conference from Rusbase, and numerous meetups on technologies for managing and analyzing Big Data.

ITMO University takes part in such events and also holds its own, among them the YSC (Young Science Conference) series, a lecture by German Gref, and a recent closed workshop held at MRG. Big data plays an important role in the development of new IT systems and of solutions in other fields. ITMO University actively works on both applying and developing Big Data technologies.

For example, staff of the ITMO University Department of High Performance Computing created Exarch, a semantic distributed data store. It provides fast access to data and optimizes its processing. Exarch cuts the execution time of simple tasks in half compared with tools like HDFS and Cassandra.

Given the university's experience and research interests in working with big data, we could not miss the opportunity to open such a track within the “I am a Professional” project. This track of the Olympiad is supervised by Alexander V. Bukhanovsky, Doctor of Technical Sciences and director of the Mega-Faculty of Translational Information Technologies at ITMO University. He and his team, which includes graduate students of the university, are currently preparing the tasks.

The Big Data track consists of two sub-tracks: "Data Analysis, Statistics, and Machine Learning" and "Distributed Computing and Systems Technologies". The first is related to mathematics and approaches to processing large amounts of data. The second is built around programming and high-performance computing aimed at optimizing analytical processes.

Participants will use the Yandex.Contest platform and the most popular programming languages for working with Big Data: Java, Scala, and Python.

Java and Scala are more commonly used by specialists known as Data Engineers, for ETL and ELT and for implementing basic algorithms. Python is more often the tool of those known as Data Scientists. All of these languages are supported by Apache Spark, currently the most widespread and popular solution for processing large data.

Note that programming tasks will not be offered at the online qualifying stage. This is due to a limitation of the Yandex.Contest platform: it is not possible to connect real data sets for processing. This issue will be resolved by the in-person stage of the competition.

Preparing for the Olympiad


A special program has been prepared for the participants, which includes three webinars on the track's subject. They are given by lecturers from leading universities, who explain and work through examples of Olympiad tasks.

Here is an example of one of the basic questions on big data.

A large array of different raster photo images in 64-bit BMP format is evenly distributed across 1000 independent storage nodes in a single local area network. A cluster of 100 computing nodes is used to find images of people in these files.

In a single run on all nodes, processing was only 52 times faster than on a single node. Does this mean that:

  • A. The cluster is too small; more compute nodes are needed to increase efficiency;
  • B. The images differ in size, so higher efficiency is objectively unattainable;
  • C. The communication channel between the storage and the cluster is too weak;
  • D. It is not yet clear; a series of additional experiments in various configurations is needed.

Answer: D. The cause cannot be established from a single measurement, since depending on the conditions either option B or option C may apply.
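
To put the numbers from this question in perspective, here is a minimal Python sketch. It computes the parallel efficiency implied by a 52-fold speedup on 100 nodes and, assuming Amdahl's law applies, the serial fraction that would produce such a speedup. The function names are ours, and Amdahl's law is only one possible model, not the explanation given in the task.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    return speedup / nodes


def implied_serial_fraction(speedup, nodes):
    """Serial fraction that Amdahl's law would need to explain the speedup.

    Obtained by solving speedup = 1 / (f + (1 - f) / nodes) for f
    (this quantity is also known as the Karp-Flatt metric).
    """
    return (1.0 / speedup - 1.0 / nodes) / (1.0 - 1.0 / nodes)


def amdahl_speedup(serial_fraction, nodes):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)


# Numbers taken from the example question: 100 nodes, 52x speedup.
efficiency = parallel_efficiency(52, 100)
serial = implied_serial_fraction(52, 100)

print(f"efficiency: {efficiency:.2f}")
print(f"implied serial fraction: {serial:.4f}")
print(f"check, Amdahl speedup: {amdahl_speedup(serial, 100):.1f}")
```

The efficiency comes out at 0.52, and under this model a non-parallel share of slightly less than 1% of the work would be enough to explain the result. A single measurement gives only these aggregate figures. It does not show whether the loss comes from load imbalance, the network, or something else, which is why additional experiments are needed.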

The lecture given by Alexander Bukhanovsky:


The second lecture covers the technological aspects of big data processing. It was given by Alexander Visheratin, senior researcher at the Research and Technological Institute of ITMO University:


In general, to solve the Olympiad tasks you need to study the typical mechanisms underlying basic Big Data processing operations. These are the patterns used in the Apache Spark and Apache Flink frameworks (for example, the shuffle and broadcast operations). It is also worth studying how iterative algorithms used for machine learning on big data work, such as Expectation-Maximization. Knowledge of the data structures and storage organization principles used in modern stores such as Cassandra or ClickHouse will not hurt either.
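
As a rough illustration of the shuffle and broadcast operations, the following sketch simulates a join of two data sets in plain Python in two ways. This is a toy model, not Spark or Flink code: the function names, the partition layout, and the "records moved" counter are our assumptions, introduced only to show the idea.

```python
from collections import defaultdict


def shuffle_join(left_parts, right_parts, num_buckets):
    """Join two partitioned data sets by re-partitioning both sides by key."""
    left_buckets = [defaultdict(list) for _ in range(num_buckets)]
    right_buckets = [defaultdict(list) for _ in range(num_buckets)]
    records_moved = 0

    # "Shuffle": every record from both sides is routed to the bucket
    # chosen by the hash of its key.
    for parts, buckets in ((left_parts, left_buckets), (right_parts, right_buckets)):
        for part in parts:
            for key, value in part:
                buckets[hash(key) % num_buckets][key].append(value)
                records_moved += 1

    # Matching keys now sit in the same bucket, so each bucket joins locally.
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        for key, left_values in left.items():
            for left_value in left_values:
                for right_value in right.get(key, []):
                    joined.append((key, left_value, right_value))
    return joined, records_moved


def broadcast_join(large_parts, small_table):
    """Join by copying the small table to every partition of the large one."""
    lookup = defaultdict(list)
    for key, value in small_table:
        lookup[key].append(value)

    # "Broadcast": only the small table travels, once per large partition.
    records_moved = len(small_table) * len(large_parts)

    joined = []
    for part in large_parts:
        for key, large_value in part:
            for small_value in lookup.get(key, []):
                joined.append((key, large_value, small_value))
    return joined, records_moved


# Hypothetical data: two partitions of events and one small table of users.
events = [[(1, "click"), (2, "view")], [(1, "buy"), (3, "view")]]
users = [(1, "alice"), (2, "bob")]

shuffled, shuffle_cost = shuffle_join(events, [users], num_buckets=2)
broadcasted, broadcast_cost = broadcast_join(events, users)

print(sorted(shuffled) == sorted(broadcasted))
print(shuffle_cost, broadcast_cost)
```

Both functions produce the same joined rows. The difference is in what has to travel: the shuffle variant moves every record from both sides, while the broadcast variant moves only copies of the small table. Broadcasting pays off only while one side stays small relative to the other.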

We also recommend the Yandex courses on Big Data processing:


By the way, completing two of these courses will let you skip the qualifying round in the “Big Data” track and go straight to the in-person stage of the Olympiad.

Source: https://habr.com/ru/post/429346/

