Today, Yandex joined CERN openlab. Our partnership with the European Organization for Nuclear Research (CERN) is moving to a new stage: CERN scientists will get access to Yandex's MatrixNet machine learning technology, as well as to new computing power, and Yandex becomes an associate member of the CERN openlab project. Other openlab members include Intel, HP, Oracle, Siemens, and Huawei.
Yandex's cooperation with CERN began in 2011, when we first provided CERN with our server capacity. In April last year, our developers built a search engine for events of the LHCb experiment. LHCb is one of the four main experiments at CERN and an example of how, in modern science, processing experimental data has become as important as collecting it. The LHCb experiment studies events involving the b quark (b stands for "beauty"; the Russian name translates as "lovely"). The data on these events amounts to thousands of terabytes per year. Thanks to the search index we created, CERN scientists can retrieve the information they need instantly.
In modern fundamental science, the computing capacity to process and interpret results has become as important as the technical resources for running experiments. Today, especially at CERN, there is so much data that without complex algorithms even a scientist will find it difficult to draw accurate conclusions from an experiment. Very few companies have technologies suited to this purpose.
We asked Andrei Ustyuzhanin, who manages the CERN partnership project at Yandex, why CERN needs Yandex's help and how work with the experimental data is organized. Watch the video, or read the more detailed text version below.
Andrei, what do you do at Yandex, and how did you come to lead the company's work with CERN?
At Yandex, I work with data and with machine learning algorithms. And it turned out that machine learning algorithms are the common ground for cooperation between Yandex and CERN.
Before explaining how Yandex came to have joint projects with CERN, could you remind us what CERN does?
CERN is the world's largest physics research laboratory. On the one hand, conducting experiments there requires very serious equipment. On the other hand, powerful algorithms and a lot of computing resources are needed to make sense of these experiments. And that is exactly what our company has.
Why doesn't CERN have them itself?
For Yandex, a search company that makes money from advertising, having high-quality algorithms for processing accumulated data is a matter of life and death. We specialize in algorithms. For CERN, this topic has not yet become as critical.
Besides, the people who come to CERN have a background in physics. They know the Standard Model, the Schrödinger equation, the Standard Model Lagrangian, and so on. But that does not mean they have strong computer science training.
Tell us more: why is data processing important for CERN?
The accumulated, so-called raw, data is very, very large. Collisions recorded at CERN currently occur at a rate of 20 million per second. That produces a huge amount of data, roughly the volume of the entire Internet every day. Clearly, it is impossible to write all of it to disk arrays.
So fairly complex filtering algorithms are needed: on the one hand to preserve the necessary information, and on the other to keep the accuracy of the analysis at the next stage high enough to confirm or refute a physical hypothesis.
To process the data, we first build a model of what the event should look like; we start from the ideal result. The model is built with a quantum event simulator, which, by the way, also runs on the Yandex servers connected to the GRID. The simulation shows what data the detectors should record after the decay and what these events look like.
This simulated data is accumulated and then goes through the same processing as regular events that came out of the detector, so we can use it as a reference for comparison. To decide whether a given decay took place, we look at the event and, with the help of an algorithm, have to answer one question: does what was seen in the real detector resemble what we were able to model, calculate, and record? We compare. If it does, we conclude that the decay happened.
And how exactly is machine learning used here?
The most subtle part of what I have described is deciding that two events are similar. Each event looks very complex. When proton beams collide, they break up into thousands of different particles, which are registered by the detectors. After light processing of the detector data, we already know that certain particles were present. The trick is that some particles never reach the detector plates; they decay into something else first. So we have to determine, with some probability, what they could have decayed into, and then judge whether what we actually see could be similar to the event we are looking for or not.
To plug in a machine learning algorithm (MatrixNet in particular), we need to formalize these events somehow, and in raw form they take up quite a lot of space. We do not need 100 kilobytes with a very detailed description of what happened and how it looked. We need to define a set of features that are characteristic enough that, taken together, they tell us with high probability whether this was the event we want or not. For the decays on which we tested MatrixNet, nine features were identified. One of them, for example, is the B meson lifetime.
Next, we take a sample that contains both the events we want and the events we do not want, that is, positive and negative examples. We train our classifier on files with these features (it could be MatrixNet or another machine learning method), and then apply the trained classifier to the features computed for the data we want to analyze.
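The training step described above can be sketched in a few lines. This is only an illustration under loud assumptions: MatrixNet itself is not reproduced here, a plain logistic regression stands in for it, the two feature columns (a lifetime-like value and a second, hypothetical `impact` value) replace the nine real features, and the sample is randomly generated toy data rather than simulated or detector events.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    # Probability that feature vector x describes a signal event.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(features, labels, epochs=200, lr=0.1):
    """Fit weights and bias by batch gradient descent on the log loss.

    Stand-in for the real classifier; NOT MatrixNet.
    """
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def make_toy_sample(n_per_class, seed=0):
    # Hypothetical features: [lifetime-like value, impact-like value].
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n_per_class):
        features.append([rng.gauss(2.0, 0.5), rng.gauss(-1.0, 0.5)])  # positive example
        labels.append(1)
        features.append([rng.gauss(0.0, 0.5), rng.gauss(1.0, 0.5)])   # negative example
        labels.append(0)
    return features, labels


features, labels = make_toy_sample(200)
weights, bias = train_logistic(features, labels)
accuracy = sum(
    (predict_proba(weights, bias, x) > 0.5) == bool(y)
    for x, y in zip(features, labels)
) / len(labels)
```

The point of the sketch is the shape of the workflow: each event is reduced to a short feature vector, labeled examples of both kinds are collected, and the fitted model is then applied to feature vectors from other events.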
Are the features computed from simulator data?
The data can come from the simulator or be real, provided we know for sure that the event did not contain what we are looking for, which can be established from certain data. In our case there was a simulator of good events and a simulator of bad events. Features were computed from these events and MatrixNet was trained on them; then the same features were computed for real events, and the MatrixNet formula was run on them to get a prediction. MatrixNet does not say whether an event was bad or good. It gives a probability that this was the event we are looking for.
What data do the machine learning algorithms learn from?
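Because the classifier returns a probability rather than a verdict, the analyst still has to choose a cut. A minimal sketch, assuming events arrive as (identifier, probability) pairs; the event names, the scores, and both thresholds are invented for illustration:

```python
def select_events(scored_events, threshold):
    """Keep the ids of events whose signal probability is at least `threshold`."""
    return [event_id for event_id, prob in scored_events if prob >= threshold]


# Hypothetical classifier output: (event id, probability of being signal).
scored = [("evt-1", 0.97), ("evt-2", 0.12), ("evt-3", 0.55), ("evt-4", 0.81)]

loose = select_events(scored, 0.5)  # keeps more candidates, lets in more noise
tight = select_events(scored, 0.9)  # keeps fewer candidates, cleaner sample
```

A looser threshold keeps more true decays but also more background; a tighter one does the opposite.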
Either on fully simulated data, or on a mix of simulated and real data. The subtlety is that we have to know for certain that the negative examples really are bad, that is, that they do not contain the event we are looking for. To establish this, we use a small trick: all the events that MatrixNet gives us are plotted on a mass scale. We know that energy equals mass times the speed of light squared, so if we collide two particles with a certain energy, we can tell which particles could be produced in the collision: the mass of the particles produced cannot exceed the collision energy. We plot everything we have measured on a mass graph, and we know that what we need lies in a certain range.
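For a pair of muons, the quantity placed on the mass scale is the invariant mass of the pair. In units where c = 1, the familiar E = mc² generalizes to m² = E² − |p|², applied to the summed energy and momentum of the two muons. The sketch below assumes momenta are given as (px, py, pz) tuples in MeV; the function names are hypothetical, and the muon mass is the standard value of about 105.658 MeV.

```python
import math

MUON_MASS_MEV = 105.658  # muon rest mass in MeV (units with c = 1)


def muon_energy(p, mass=MUON_MASS_MEV):
    # Relativistic energy from momentum: E = sqrt(|p|^2 + m^2).
    px, py, pz = p
    return math.sqrt(px * px + py * py + pz * pz + mass * mass)


def invariant_mass(p1, p2):
    """Invariant mass of a two-muon system from the two momentum vectors."""
    total_e = muon_energy(p1) + muon_energy(p2)
    px, py, pz = (a + b for a, b in zip(p1, p2))
    m_squared = total_e * total_e - (px * px + py * py + pz * pz)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(m_squared, 0.0))


# Two back-to-back muons: the total momentum cancels, so the mass is just the
# sum of their energies.
mass = invariant_mass((2000.0, 0.0, 0.0), (-2000.0, 0.0, 0.0))
```

If the two muons come from the decay of a single parent particle, this value clusters around the parent's mass, which is what makes the mass plot useful.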
That is, the pair of muons can only show up around the mass of the B meson, which is about 5366 MeV. We are interested in events with a certain mass. If a B meson decayed into two muons, the event should fall within some margin around the mass of that B meson, and this mass is known. So every event we find is either in the signal range or somewhere off to the side, and in that case it is definitely noise, which is what we do not want to see. To establish that an event is bad, it is enough to measure its mass. If it falls in these side ranges, the event is bad and can be used as a negative example for training our algorithm. These are real events that we can add to the training set. That answers the question of what MatrixNet learns from.
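The side-range trick can be written down as a simple labeling rule. This is a sketch under assumptions: the window center and widths used below are placeholders chosen only for illustration (not physical values or the analysis cuts), the event dictionaries and function names are hypothetical, and the extra "ignored" category for events far from the window is an addition of this sketch.

```python
def label_by_mass(mass, center, half_width, sideband_width):
    """Classify an event by how far its measured mass lies from `center`."""
    offset = abs(mass - center)
    if offset <= half_width:
        return "signal_region"  # may contain the decay we are looking for
    if offset <= half_width + sideband_width:
        return "sideband"       # certainly background: usable as a negative example
    return "ignored"            # too far from the window to be useful here


def negative_examples(events, center, half_width, sideband_width):
    # Real events from the side ranges, to be added to the training set.
    return [
        e for e in events
        if label_by_mass(e["mass"], center, half_width, sideband_width) == "sideband"
    ]


# Placeholder numbers for illustration only (MeV).
events = [{"id": 1, "mass": 5010.0}, {"id": 2, "mass": 5200.0},
          {"id": 3, "mass": 4800.0}, {"id": 4, "mass": 6000.0}]
negatives = negative_examples(events, center=5000.0, half_width=60.0,
                              sideband_width=300.0)
```

Only the side-range events are returned; events inside the signal window are left unlabeled, because that is where the decay being searched for might be.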
And what exactly is Yandex's role in CERN's work?
In 2011, when our cooperation was just beginning, Yandex provided CERN with a cluster of machines for modeling physical events: several dozen servers connected to CERN's GRID computing network. These servers helped physicists understand what a particular event looks like in their detector, what kind of "pictures" they will see, and what these pictures correspond to.
In April last year, at the second stage of our joint work, we launched a search project that made it possible to find events in the accumulated data using a few simple features.
The next stage, which we are planning for 2013, is to introduce a more advanced way of searching and filtering events, based on MatrixNet machine learning technology. At Yandex it is used for tasks that seem completely unrelated to physics: search, spam filtering in mail, predicting clicks on advertising banners, and so on. With its use at CERN, our cooperation goes further, and Yandex makes an increasingly substantial contribution to the development of science at CERN.
And why do we need this cooperation ourselves? What does it mean for Yandex?
Yandex was originally founded by physicists. This is a company whose leadership cares about the questions of how our universe works. CERN needs high-quality processing of its accumulated data, and this is the key to making discoveries in modern physics. Discoveries are now made not so much with high-quality instruments as through analysis of the results those instruments produce. This is exactly where Yandex sees an opportunity to apply its technologies to science. In addition, by testing our algorithms on this data, we see them getting better and better.
Russia is not yet a member of CERN, only an observer country. Yandex is joining CERN. Tell us, what is the difference between membership of a country and membership of a company?
In December last year, Russia applied for associate membership at CERN, so it will become a member in the near future. But company membership and country membership are completely different things.
A company is mainly interested in applying its technology to the data that CERN has and in improving that technology through the cooperation. A state is interested in obtaining global research results and turning them into practical technologies: new medical devices, new space technologies, or perhaps a new type of fuel, who knows. The two move at different speeds: to some extent, the greater the mass, the smaller the acceleration =)
What other companies are now members of CERN's openlab?
At the moment, six commercial companies are members of CERN openlab: Intel, HP, Oracle, Siemens, Huawei and, since January 2013, Yandex.
Are there any discoveries that have been made using Yandex technologies?
We are only at the start of this road. Last December, we tried using MatrixNet to analyze the very rare decay of a B meson into two muons. Different theories predicted different rates for this decay. The Standard Model said there should be about 3.5 such decays per billion; other theories said there should be one or two orders of magnitude more. The discovery process itself consisted of passing all the data CERN had accumulated over the previous two years through various machine learning algorithms, MatrixNet among them. The goal was to estimate how many such decays occurred during those two years.
If the observed number matched the prediction of one of the theories, that counted as evidence in its favor. The numbers were calculated and compared, and it turned out that the Standard Model predicts best. This somewhat narrowed the room for the other theories, such as the supersymmetric models that aspire to be a theory of everything.
In the future, we plan to open the MatrixNet interface to CERN employees, so that they can use it if they wish. Compared with the algorithm currently used at CERN by default, Yandex's technology selects events more accurately: a clearer signal and less noise. We hope that MatrixNet will help physicists make many more interesting discoveries.