In 2016-2017 we noticed that every one of our conferences featured one to three talks on Big Data, neural networks, artificial intelligence, or machine learning. It became clear that a good standalone conference could be built around this topic, and that is what I will tell you about today.
Tempting: we decided to bring scientists, practicing engineers, and architects together under one roof and put the emphasis on technology. It sounds like an obvious move, but it is not.
Difficult: dig a little deeper and you will see that everyone works on their own separate problems, apart rather than together.
Scientists build neural networks in theory; architects design distributed systems for corporations to process huge data streams in real time, without any ultimate goal of unifying access to them; practicing engineers write all this software for very narrow tasks, after which it cannot be transferred to anything else. In short, everyone tends their own garden bed and stays out of the neighbor's... Right? No!
In fact: everyone is working on a part of the whole. Just as Smart Data itself is distributed by nature (and “smart data” is a rather narrow translation), so are the people who work with it: they form a distributed network of diverse developments that can sometimes produce unexpected combinations. That is the foundation of smart data, in all its beauty and practical significance.
So, what the puzzle pieces are and who creates them, you can see, and even discuss with the creators, at the SmartData 2017 Piter conference on October 21, 2017. Details under the cut.

There will be many words below: we stand for big and smart data, even though an announcement traditionally implies a short, capacious text, as brief and precise as a sniper shot on a clear summer night.
Usually we divide all the talks into three or four categories, but not this time: each talk is an independent story.
First came the Name. And "The name is a feature".
The ambiguity and triteness of the title (even we are exploiting this turn of phrase for the second time in a month) would seem to be beyond good and evil. But that has little to do with the content of the opening keynote by Vitaly Khudobakhshov.
Keynote: Vitaly khud Khudobakhshov - “Name is a feature”. Strange as it may sound to an educated person, the probability of being single “depends” on your name. We will talk about love and relationships, or rather, about what social network data can tell us about them. It is roughly the same as saying: “The probability of being hit by a car is higher if your name is Seryozha than if it is Kostya!” Sounds pretty crazy, doesn't it? At the very least, unscientific.
So we will talk about the most unexpected and counterintuitive observations that data analysis in social networks makes possible. Of course, we will not ignore the statistical significance of such observations, the influence of bots, and spurious correlations.
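As a toy illustration of what such a significance check involves (with invented counts, not real Odnoklassniki data), here is a minimal two-proportion z-test in pure Python:

```python
import math

def two_proportion_ztest(k1, n1, k2, n2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                    # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value
    return z, p_value

# Invented toy numbers: 3100 of 10000 users with name A are "single"
# versus 3000 of 10000 users with name B.
z, p = two_proportion_ztest(3100, 10000, 3000, 10000)
print(round(z, 2), round(p, 3))
```

Even a seemingly noticeable one-percentage-point gap can fail to reach significance at this sample size, which is exactly why such checks matter.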
Vitaly Khudobakhshov works with one of the largest analytics datasets in our country: he has had access to it since 2015 as a leading analyst at Odnoklassniki, where he deals with various aspects of data analysis.
Dmitry Bugaychenko - “From click to prediction and back: Data Science pipelines in OK”. Vitaly's smart data and Dmitry's machine learning are connected; even parts of their career paths are similar. For Dmitry, big data analysis at Odnoklassniki became a unique chance to combine theoretical training and a scientific foundation with the development of real, in-demand products - a chance he gladly took when he joined five years ago.
During the talk we will cover the storage and processing technologies of the Hadoop ecosystem, and much more besides. The talk will be useful to those who do machine learning not only for fun, but also for profit.
As an example, we will take one complex task: personalizing the OK news feed. Without going deep into details, we will discuss data collection (batch and real-time), ETL, and the processing needed to obtain a model.
But obtaining a model is not enough, so we will also talk about how to serve model-based predictions in a complex, highly loaded distributed environment and how to use them to make decisions.
And if you want to immerse yourself completely in the practical algorithmics of machine learning and working with big data, it is best to talk to someone who has passed one of the toughest algorithm schools in the country, the School of Data Analysis (ShAD). Meet Anna Veronika Dorogush.
Anna Veronika Dorogush - “CatBoost - the next generation of gradient boosting”. A double name for a talk with a double focus: the features of CatBoost, Yandex's new open-source gradient boosting algorithm. The talk covers how a technology able to work with categorical features was developed, and why it was released as open source. It will also give CatBoost's answers to the eternal questions: what (to apply)? where (does it work)? who (should pay attention)?
The talk will be useful to machine learning and data specialists: after hearing it, they will understand how to use CatBoost most effectively and where it can bring benefit right now.
Artyom ortemij Grigoriev - “Crowdsourcing: how to tame the crowd?”. A couple of experts from OK plus a couple from Yandex already make a crowd, and with it comes a fourth, even more applied task. Machine learning and data analysis tasks often require collecting a large amount of manual labeling, and with few annotators the work can take months. Artyom will show how to do it quickly and cheaply! This experience transfers to other tasks (and there certainly are such tasks) where many workers are needed for a limited time. Drawing on the experience of building and using Toloka, Yandex's crowdsourcing platform, the talk will address quality control, worker motivation, and various models for aggregating labeling results.
Artem has been at Yandex since 2010; as a team lead he is responsible for the infrastructure for collecting expert judgments, the services for assessors, and the Yandex.Toloka crowdsourcing platform.
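To give a taste of the simplest aggregation model the talk touches on, here is a sketch of majority voting over noisy labels (the tasks, workers, and labels are invented; Toloka's real aggregation models are far more sophisticated):

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Aggregate noisy labels; `labels` is a list of (task_id, worker_id, label)."""
    votes = defaultdict(Counter)
    for task, _worker, label in labels:
        votes[task][label] += 1
    # Pick the most frequent label per task.
    return {task: counter.most_common(1)[0][0] for task, counter in votes.items()}

raw = [
    ("img1", "w1", "cat"), ("img1", "w2", "cat"), ("img1", "w3", "dog"),
    ("img2", "w1", "dog"), ("img2", "w2", "dog"), ("img2", "w3", "dog"),
]
print(majority_vote(raw))  # img1 -> cat, img2 -> dog
```

More advanced models weight each worker by an estimated accuracy instead of counting every vote equally.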
Alexander alexkrash Krasheninnikov - “Hadoop high availability: Badoo experience”. Big and smart data requires a certain infrastructure - let's be honest, a very demanding one. So what should you choose?
Hadoop infrastructure is a popular solution for distributed storage and data processing. Good scalability and a mature ecosystem have earned Hadoop a solid place in the infrastructure of many information systems. But the more responsibility is placed on this component, the more important it is to ensure its fault tolerance and high availability.
Alexander's talk is about ensuring high availability of the components of a Hadoop cluster. In addition, we will talk:
- about the “zoo” we have to deal with; *
- about why high availability matters: the system's failure points and the consequences of failures;
- about the tools and solutions available for this;
- about practical implementation experience: preparation, deployment, testing.
The talk will be most useful to those who already use Hadoop (to deepen their knowledge); the rest of the audience will find it interesting for the architectural solutions used in this software stack.
* Alexander understands like no one else what a “zoo” is: he heads the Data Team at Badoo, building data processing tools for ETL and all kinds of statistics on Hadoop infrastructure. He has more than 10 years of experience in web development. For best results he does not shy away from explosive mixtures of programming languages (Java, PHP, Go), databases (MySQL, Exasol), and distributed computing technologies (Hadoop, Spark, Hive, Presto).
Alexander Sibiryakov asibiryakov - “Automatic search for contact information on the Internet”. The more data there is, the more “information garbage” comes with it. After five years at Yandex and two at Avast! as an architect, Alexander built a system for automatic resolution of false positives. After that, his interest in large-scale data processing and information retrieval only grew, as usually happens with work you genuinely enjoy.
Alexander's talk describes a distributed robot that crawls the web, finding and extracting contact information from corporate websites. In essence it consists of two components: a web robot that fetches the content, and a separate application that analyzes it and extracts the data.
The main focus of the talk will be on the extraction itself: finding a working architecture, a sequence of algorithms, and ways to collect training data. The talk will be useful to anyone who processes web data or builds solutions for processing large volumes of data.
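To make the extraction step concrete, here is a naive baseline sketch of regex-based contact extraction (the patterns and the sample page are invented for illustration; the system in the talk uses a real crawler and trained models, not just regexes):

```python
import re

# Naive patterns for emails and phone-like sequences.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def extract_contacts(text):
    """Return contact candidates found in a page's text."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": [p.strip() for p in PHONE_RE.findall(text)],
    }

page = "Contact us: sales@example.com or call +7 (812) 123-45-67."
print(extract_contacts(page))
```

A production pipeline would add deduplication, validation, and a classifier to separate real contacts from noise.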
Alexey natekin Natekin - “Cards, boosting, 2 chairs”. By the unanimous opinion of my colleagues, Alexey is quite charismatic, which fits well with the fact that he is the dictator and coordinator of Open Data Science, the largest data science community in Eastern Europe. On top of that, Alexey produces serious machine learning and data analysis projects.
And now about the talk. Everyone loves gradient boosting. It delivers excellent results in most practical applications, and the phrase “stack xgboost” has become a meme. Usually this means boosted decision trees, trained on CPUs and machines with plenty of RAM. Recently many people have bought video cards for various reasons and thought: why not run boosting on them, given how much neural networks are accelerated on GPUs? Unfortunately, it is not so simple: GPU boosting implementations exist, but their usefulness comes with many nuances. At Alexey's talk, let's figure out together whether you need a video card in 2017-2018 to train gradient boosting.
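To make the idea of boosted decision trees concrete, here is a toy sketch of gradient boosting with decision stumps under squared loss (data and hyperparameters are invented; real libraries such as XGBoost or CatBoost are vastly more elaborate):

```python
def fit_stump(xs, residuals):
    """Find the threshold split minimizing squared error on the residuals."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=50, lr=0.1):
    """Fit stumps to the residuals (negative gradient of squared loss)."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = boost(xs, ys)
print(round(model(1), 2), round(model(6), 2))
```

Each round fits a weak learner to the current residuals; the GPU question in the talk is about how these split searches parallelize on graphics hardware.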
P.S. By the way, information has its own expiration date, different for each type of data. In Alexey's case, a newer photo was provided; you can see it on the conference website in order to identify him correctly.
Sergey snikolenko Nikolenko - “Deep convolutional networks for object detection and image segmentation” (originally announced as “Another victory for robots: AlphaGo and deep learning with reinforcement”). And here is a vivid illustration of another kind of information expiration date: while this post was being prepared, the entire concept of the talk changed.
As a result, in Sergey's talk we will discuss how networks that recognize individual objects evolve into networks that pick objects out from among masses of others. We will talk about the famous YOLO, about single-shot detectors, and about the line of models from R-CNN to the very recently published Mask R-CNN. And, in principle, about the fact that convolutional neural networks have long been the main class of models for image processing, and we live with that.
By the way, that is not all the data you can get from Sergey Nikolenko, a specialist in machine learning (deep learning, Bayesian methods, natural language processing, and much more) and the analysis of algorithms (network algorithms, competitive analysis). He is the author of more than 100 scientific publications, several books, and the courses “Machine Learning”, “Training Deep Networks”, and others.
Vladimir vlkrasil Krasilshchik - “Back to the Future of the Modern Banking System”. A historical note: “the first distributed online banking information exchange system in the Russian Federation began its life in 1993, and it was not Sberbank”.
Modern Big Data banking systems are not just about processing and storing hundreds of millions of transactions per day and interacting with global trading platforms at cosmic speeds, but also about tight control and reporting to auditors and regulators. The talk will look at what Audit-Driven Development is and where it came from, show how to organize a bitemporal store of facts so as not to embarrass yourself before the supervisory authorities, and argue that any modern distributed system simply must have a time machine built in. The “universal formula of a fact” will also be revealed, along with what the tasks of so-called “analytics” most often turn out to be.
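To illustrate what a “built-in time machine” can mean, here is a toy sketch of a bitemporal fact store (the field names and the balance example are invented for illustration, not taken from the talk):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Fact:
    key: str
    value: object
    valid_from: date      # when the fact became true in the real world
    recorded_at: date     # when the system learned about it

facts = [
    Fact("balance", 100, valid_from=date(2017, 1, 1), recorded_at=date(2017, 1, 1)),
    # A late correction: on Feb 1 we learned the Jan 1 balance was really 90.
    Fact("balance", 90, valid_from=date(2017, 1, 1), recorded_at=date(2017, 2, 1)),
]

def as_of(key, valid_on, known_on):
    """The 'time machine': what did we believe on `known_on` about `valid_on`?"""
    candidates = [f for f in facts
                  if f.key == key and f.valid_from <= valid_on
                  and f.recorded_at <= known_on]
    candidates.sort(key=lambda f: (f.valid_from, f.recorded_at))
    return candidates[-1].value if candidates else None

print(as_of("balance", date(2017, 1, 15), known_on=date(2017, 1, 15)))  # before the correction
print(as_of("balance", date(2017, 1, 15), known_on=date(2017, 2, 15)))  # after the correction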
A little about the speaker, or rather, a lot: Vladimir graduated with honors from the Mathematical Support department of the Saint Petersburg Electrotechnical University “LETI” and has been developing software for government, educational, and financial institutions, as well as automotive and telecommunications companies, for more than 14 years. He works in the St. Petersburg office of Yandex as a Yandex.Market developer. Vladimir is a resident of JUG.ru, the Russian community of Java developers, and speaks at industry conferences such as JPoint, Joker, JBreak, and PGDay.
Ivan Drokin - “No data? No problem! Deep Learning on CGI”. So the words “dataset”, “convolutional networks”, and “recurrent networks” do not frighten you and need no decoding? Then Ivan Drokin's talk on deep convolutional networks has found its audience.
“We need to go deeper.” Deep convolutional networks are currently the state-of-the-art algorithms in many computer vision problems. However, most of these algorithms require huge training samples, and the quality of the model depends entirely on the quality and quantity of the data. In a number of tasks, collecting data is difficult or sometimes outright impossible. *
The talk examines an example of training deep convolutional networks to localize an object's key points on a fully synthetic dataset.
* In one such experiment, the data was collected through questionnaires and digitized by students of a university as part of a thesis. Even with modern data tools, processing a single array took two weeks just to enter the data into the system, and a production project needs many times more data. Do you have spare “student” manpower to collect it? No? Then, I hope, you have also put Artem Grigoriev's talk into your plan.

Your immersion instructor is Ivan Drokin, co-founder and chief science officer of Brain Garden, a company specializing in the development and implementation of full-cycle intelligent solutions. His professional interests include applying deep learning to the analysis and processing of natural language, images, and video streams, as well as reinforcement learning and question-answering systems. He has deep expertise in financial markets, hedge funds, bioinformatics, and computational biology.
Artyom onexdrk Marinov - “Segmenting 600 million users in real time every day”. We will not draw the “Big Brother” analogy here, since the data in the project behind Artyom's talk is collected in anonymized form - which is exactly why the analogy does not hold.
Every day, users perform millions of actions on the Internet. The FACETz DMP needs to structure this data and perform segmentation to identify user preferences (ta-dam!). In his talk, Artyom will tell how, using Kafka and HBase, you can:
- segment 600 million users daily after migrating from MapReduce to real time;
- handle 5 billion events every day;
- maintain statistics on the number of unique users in a segment during stream processing;
- track the effects of changes to segmentation parameters.
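One standard trick for keeping unique-user counts in a stream without storing every ID is an approximate sketch. Here is a compact HyperLogLog-style illustration in pure Python (this is an assumption about the general technique, not the code from the talk):

```python
import hashlib
import math

M = 256                     # number of registers, a power of two
registers = [0] * M

def add(user_id):
    """Feed one stream event into the sketch."""
    h = int.from_bytes(hashlib.sha1(user_id.encode()).digest()[:8], "big")
    j = h & (M - 1)                   # low 8 bits pick the register
    w = h >> 8                        # remaining 56 bits
    rank = 56 - w.bit_length() + 1    # position of the leftmost 1-bit
    registers[j] = max(registers[j], rank)

def estimate():
    """Harmonic-mean estimate of the number of distinct items seen."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    if raw <= 2.5 * M:                # small-range correction via empty registers
        zeros = registers.count(0)
        if zeros:
            return M * math.log(M / zeros)
    return raw

for i in range(50_000):
    add(f"user-{i}")
print(int(estimate()))  # approximately 50 000 (typical error ~ 1.04/sqrt(M), i.e. ~6.5%)
```

The whole sketch is 256 small integers, regardless of whether the segment holds thousands or hundreds of millions of users, which is what makes it viable at stream scale.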
Artyom Marinov has worked in advertising technology since 2013. For the last two years he led the development of the DMP FACETz at Data-Centric Alliance, and he now works at Directual. Before that, he led the development of a number of advertising projects at Creara for several years. He specializes in Big Data and high-load systems; his main languages are Java and Scala, and he has been in the profession for about 8 years.
Aleksey a_potap Potapov - “Deep Learning, Probabilistic Programming and Meta-Computations: the Intersection Point”. So: there are generative and discriminative models, which can serve as approaches to determining the parameters of linear classifiers, which in turn can serve as a way to solve classification problems in which a decision is made by applying a linear operator to the input data, which accordingly must possess the property of linear separability, and the operation of linear classification for two classes can be represented as a mapping of objects in multidimensional space onto a hyperplane... in the house that Jack built. *
That paragraph was meant to prepare you for the fact that there will be many generative and discriminative models, their connections with each other, and their practical application within the two most promising approaches to machine learning: deep learning and probabilistic programming.
* That run-on sequence is not part of the talk by Aleksey Potapov, a professor at ITMO's Department of Computer Photonics and Video Informatics.
To understand how much Alexey loves his work, it is enough to look at his activity over the last two years: two textbooks and 27 scientific works published, including 12 papers in refereed journals and one monograph; 5 software registration certificates received; plus participation in 5 international conferences. Though perhaps that is not the real indicator.
Ivan ibegtin Begtin - “Open Data: on the availability of government data and how to search for it”. Ivan is quite a well-known person, but in case someone does not know him, a brief introduction.
Ivan Begtin is the director and co-founder of ANO Infoculture, a member of the advisory council under the Government, a member of Alexei Kudrin's Committee of Civil Initiatives, winner of the Vlast N4 prize for political journalism (2011), laureate of the PressName award in the “Special Attention Zone” nomination (2012), and co-founder of the all-Russian contest Apps4Russia. He is the author of the public projects “State Expenditures”, “Public Incomes”, “Gosludi”, “The State and Its Information”, and many others, and the Open Knowledge Foundation's ambassador to Russia.
Why all this background? Ivan's biography is the best introduction to his talk on open government data.
The public policy of open data, in Russia and worldwide, gives an unlimited number of users access to data created within government. This opens new opportunities for businesses ready to build new products on this data and develop existing ones, but it requires knowledge and understanding of how data collection, analysis, and publication are organized.
Ivan's talk will explain how and for what purpose this data is collected and how the government uses it - and, of course, how to access it and use it in your own project. Besides, the data is not always reliable and errors accumulate in it, so we will also look at how those errors can be taken into account.
And maybe, if it is already being collected from us in such quantities, it is time to start using it?
Mikhail Kamalov - “Recommender systems: from matrix factorizations to deep learning in online mode”. “And with this talk we recommend taking the talk by Artem Marinov, which best emphasizes the practical significance of this material.” Sound familiar?
Today, recommender systems are actively used both in entertainment (YouTube, Netflix) and in Internet marketing (Amazon, AliExpress). The talk will therefore cover the practical aspects of using deep learning, collaborative and content-based filtering, and temporal filtering as approaches in recommender systems.
It will additionally consider building hybrid recommender systems and adapting these approaches for online learning on Spark. Your immersion into the practical systems serving ordinary users - that is, all of us - will be guided by Mikhail Kamalov, an analyst at EPAM Systems since 2016 and an expert in NLP and information retrieval tasks.
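To make the “matrix factorization” end of that spectrum concrete, here is a minimal SGD-trained factorization on an invented toy ratings matrix (dimensions and hyperparameters are illustrative, not from the talk):

```python
import random

random.seed(0)
# Sparse observed ratings: (user, item) -> rating.
ratings = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1, (2, 1): 4, (2, 2): 5}
n_users, n_items, k = 3, 3, 2   # k latent factors

P = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    """Predicted rating is the dot product of user and item factors."""
    return sum(P[u][f] * Q[i][f] for f in range(k))

lr, reg = 0.05, 0.01
for epoch in range(500):
    for (u, i), r in ratings.items():
        err = r - predict(u, i)
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)   # gradient step with L2 penalty
            Q[i][f] += lr * (err * pu - reg * qi)

rmse = (sum((r - predict(u, i)) ** 2
            for (u, i), r in ratings.items()) / len(ratings)) ** 0.5
print(round(rmse, 3))
```

Once trained, `predict(u, i)` also gives scores for unseen (user, item) pairs, which is precisely what makes factorization a recommender rather than a lookup table.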
Andrei Boyarov - “Deep Learning: Recognizing Scenes and Points of Interest in Images”. To everyone who has read this far: thank you for your patience and interest. As mentioned at the beginning of this article, in the field of big and smart data there are no studies or talks divorced from the common mission. So, in the spirit of the previous description, for this talk we can recommend Ivan Drokin's talk as a deeper dive into convolutional neural networks.

Andrei's talk deals with building a system that solves the scene recognition problem using a state-of-the-art approach based on deep convolutional neural networks - an important matter for Mail.Ru Group, where Andrei works as a research programmer.

The task of recognizing points of interest grows out of scene recognition: among all scene images, we need to single out those showing various famous places: palaces, monuments, squares, temples, and so on. In solving this problem, however, it is important to keep the rate of false positives low. The talk will cover a solution to landmark recognition built on top of a scene recognition neural network.
Alexander AlexSerbul Serbul - “Applied machine learning in e-commerce: scenarios and architecture of pilot and production projects”. Being practically at the finish line of assembling our network of talks, we still need to check it against hands-on experience. Alexander Serbul will help us with that: at 1C-Bitrix LLC he oversees quality control of integration and implementation, along with the AI, deep learning, and big data directions. Alexander also acts as an architect and developer in the company's projects related to high load and fault tolerance (Bitrix24), and advises partners and clients on the architecture of high-load solutions and the effective use of 1C-Bitrix cluster technologies with modern cloud services (Amazon Web Services and others).

All of the above went into a talk about the pilot and production projects implemented at the company using a variety of popular and “rare” machine learning algorithms: from recommender systems to deep neural networks. The projects focus on technical implementation on the Java (deeplearning4j), PHP, and Python (keras/tf) platforms, using the open libraries Apache Mahout (Taste), Apache Lucene, Jetty, and Apache Spark (including Streaming), plus the spectrum of tools in Amazon Web Services. Attention is paid to the relevance of particular algorithms and libraries and to the market demand for them.

Among the implemented projects:
- clustering Bitrix24 users with Apache Spark
- churn and CLV prediction on big data
- [...] 20 000 [...]
- LSH
- content-based [...]
- Bitrix24 (n-grams, [...])
Keynote: Ivan Yamshchikov - “Neurona: why did we teach a neural network to write poems in the style of Kurt Cobain?”. What connects Kurt Cobain, the band Civil Defense, art (classical, musical, and visual), and big data? If the habrapost “Machine Learning: State of the Art” has not yet caught your eye on Habr's front page, no matter: there is still time. And while that article is being read, you can build the links to the other talks: from neural networks and GPUs, architectural monuments, recognition, and Deep Learning, we finally arrive at the question of “artificial intelligence”.

There are many examples of machine learning and artificial neural networks in business, but in this talk Ivan will speak about the creative possibilities of AI: how Neurona, Neural Defense, and Pianola were made. In conclusion, he will summarize the current tasks in building creative AI and answer the questions of why this is important and interesting.

Ivan Yamshchikov is currently a researcher at the Max Planck Institute (Leipzig, Germany) and a consultant at Yandex. He studies new principles of artificial intelligence that could help us understand how our brain works.

So, our network is complete and awaits the meeting. All the talks have been checked for compatibility and complementarity. * You can plan your route through the final program on the conference website. We also recommend making use of the discussion areas, where you can always talk to a speaker in person after the session, without being limited to the questions at the end of the talk.

* The recommendations are not mandatory and represent the personal opinion of the author, which any member of our Program Committee may choose not to share. The author himself is preoccupied with questions of AI and its further interaction with IoT, so he looks at everything with bias.