CTOcast #2: Ignatius Kolesnichenko (iBinom - analysis of the human genome)

This is the second episode of our podcast about technologies, processes, infrastructure, and people in IT companies. Today's guest on CTOcast is Ignatius Kolesnichenko, Technical Director of iBinom.

A few words about our guest and about iBinom:

Ignatius Kolesnichenko graduated from the Faculty of Mechanics and Mathematics of Moscow State University and from the School of Data Analysis. He has worked at Yandex since 2009: he started in Yandex.Traffic and has spent the last couple of years on distributed computing. In 2013 he became a co-founder of iBinom. He also leads seminars on computational complexity.

iBinom was founded in 2013 (its founders include Andrei Afanasyev, Valery Ilyinsky, and Ignatius Kolesnichenko). The company is developing a SaaS solution for analyzing human genome data. Doctors can use the results of this analysis without special knowledge of bioinformatics, which is what makes the iBinom service unique. A beta version of the project is ready and is being actively tested with doctors and clinics.

Text version of the podcast (1st part)

About Olympiad Programming, Education and Personal Experience

Alexander Astapenko: Ignatius, I suggest we start not with iBinom but with you and your career. Can you tell us a little about how you became interested in programming, about working at Yandex, and how you eventually came to iBinom?

Ignatius Kolesnichenko: ... I first learned more or less what programming is at Lyceum No. 1511 at MEPhI. I had done some programming before that, but none of it was serious. One day I read an announcement and thought: "Interesting, it turns out there are programming olympiads! I need to go and see what that is." I went, had a look, and found out that I needed to know a lot of things. You have to be able to program, and it turned out that until then I could not ...

Alexander Astapenko: As far as I know, you had an interesting story with olympiads. Or did it start later, at the university?

Ignatius Kolesnichenko: I started taking part in competitions at school. ... And when I was applying to university, I was essentially choosing between physics and mathematics. At some point I decided on mathematics, because I really liked programming. I thought that I would come to Mekhmat (the Faculty of Mechanics and Mathematics) and there would certainly be clubs that would train me. It turned out that Mekhmat has no programming clubs of that kind. I had to spend a long time looking for people and for a team before I could start taking part at all. But if you are persistent enough, a team can be found and everything works out.

It's great that there are a lot of people at the university who have a lot of olympiad programming experience and who won various olympiads back at school. You get to know them, and they teach you things. You start taking part and become part of this community. There is no dedicated club, but the community itself exists, so there is someone to compete with and someone to train with. Within a couple of years we had improved enough that we started taking good places in the quarterfinals and went on to the semifinals. The strongest guys, of course, made it to the ACM ICPC World Finals. I did not get to the finals, but it was a great experience nevertheless.

Alexander Astapenko: For those who are not familiar with it, can you say in a few words what kinds of tasks you encountered at the olympiads? Does taking part in such events give you applied experience, and is it useful in real life and in development?

Ignatius Kolesnichenko: I will answer the second question right away. Yes! Now about the tasks. A task in olympiad programming consists of two parts. Often one of them is simple and the other, coming up with an algorithm, is difficult. ... All tasks in olympiad programming are checked automatically. Each task has a set of tests that is hidden from you. There is a limit on the time your program may run on the tests and on the amount of memory it may use. You need to come up with and write an algorithm that fits within these limits. The tasks fall into two types: those where it is difficult to come up with an algorithm, and those where the algorithm is not very hard to invent but there are many details and you need to write and implement everything carefully. So an olympiad programmer's skill consists of two parts: coming up with the idea and implementing it neatly and in good time, since there are 5 hours for 10 tasks ...

Is there any benefit from this? Yes, definitely. A good programmer needs algorithms: he must be able to estimate running time and memory, and he must be able to write programs efficiently. It is also desirable to do it quickly, because even if you write excellent code but spend two days on each small thing, that is no good and you will not get much written. At some point, of course, these paths diverge: the further you go, the more olympiad programming is geared toward regular training and toward honing the skill of writing algorithms ...

Pavel Pavlov: What else distinguishes a good programmer from a bad one, apart from understanding algorithms, algorithmic thinking and the ability to solve such problems?

Ignatius Kolesnichenko: An olympiad background is certainly not enough to be a good programmer, because in real life you face completely different tasks and problems.

First, real programs are much larger. At olympiads, programs range from 100 to 400 or 500 lines. In real life you have to write systems that consist of tens of thousands of lines and are very complex and large. Each individual detail may be quite simple, but thinking through how everything interacts is very difficult. Another important point is the ability to design an API. That is one part.

The second part: since the programs are large, you need to be able to work on them with the future in mind. It is not the case that we write the code, it works, and we close it and forget about it. You need to cover the code with tests and keep in mind that someone will read it in the future. Olympiad programmers do not think at all about anyone reading their code, so variables have single-letter names and functions are often not factored out; the code is one solid sheet. The main thing is to write it quickly and make it work. That is fine when you need to write a quick prototype and check whether an idea works. But it is completely unsuitable for code that goes into production and has to be maintained, developed, and so on.

These skills are very unstructured and non-trivial and, in my opinion, are acquired simply through experience. You cannot read a book and learn how to cover code with tests, design a system's architecture, or design its API properly. You do it once, twice, three times, and by the third time you realize that you are now doing it reasonably well.

Pavel Pavlov: You touched on the topic of the education system ... Was it difficult to find people who shared your interest in developing your skills and knowledge? How adequate do you think the level of education in universities and schools is? Or do you basically have to immerse yourself in some narrow community to get this kind of knowledge?

Ignatius Kolesnichenko: There really is a problem in Russian education. We have top technical universities (Moscow State University, MEPhI, MIPT, Baumanka) where the programming courses, especially everything related to industrial programming, do not match at all the level that the field now demands or the level taught in European or American universities.

On the other hand, there are a lot of talented people. Russia has one of the strongest olympiad communities, and all of these top universities have people who do olympiad programming. You can join this crowd and fill the gaps in your knowledge there. They often give lectures or simply share knowledge with each other. But, of course, this is not quite the right approach, because this is how olympiad programmers are raised, and they then still have to be retrained for industry.

... But it seems that the situation is changing. There is, for example, a wonderful place called the School of Data Analysis. In addition, as far as I know, Yandex is opening a new faculty. More precisely, HSE is opening a new faculty with the support of Yandex. I have a feeling that it will be very good, but we will see.

Alexander Astapenko: When you hire people and see that a candidate has taken part in programming contests, is that an important factor for you?

Ignatius Kolesnichenko: It is certainly a plus if a person has gone through olympiad programming, but it is not decisive.

Alexander Astapenko: University, Olympiad programming ... What happened next?

Ignatius Kolesnichenko: ... While I was studying at the School of Data Analysis, I was invited to an interview at Yandex. It was something of a shock at the time: you are in your third year, and you can already work, earn money, and even solve interesting problems. I joined as an intern and stayed one for, I think, more than a year. ... Later I worked on more complex things: I worked in Yandex.Traffic, where our team rewrote the existing infrastructure. It was the first big system that I saw and took some part in. That was great.

About iBinom

Alexander Astapenko: Tell me about the iBinom company ... About the idea, how the project appeared and how it started.

Ignatius Kolesnichenko: It all happened very simply. I had a good friend from Mekhmat who, together with a biologist, started the company Genotek. They were building a service like 23andMe: a user comes to you, you take a saliva sample, analyze it, and tell them about their predispositions to hereditary diseases. The service is, for the most part, entertainment; people come just for fun. They have a little spare money and are ready to spend it to learn something new and interesting about themselves. One evening I was talking with my friend, and he said: "Look, we have a biologist and he has a problem ... Would you be interested in it?" The task was to search for mutations in exome analysis of the genome.

... Roughly speaking, we discussed this in December, and in February I came back with the news that everything seemed to be working: what you used to do locally in 8 hours, I can do in 30 minutes. That is how the prototype came about. Then the guys said: "Listen, we are of course not interested in just a prototype. We have an idea of how to monetize it. Let's do it." I thought it over and decided that I should join.

There was a conflict of interest in the sense that, on the one hand, I was working at Yandex and liked it, and on the other hand this was also a super interesting and completely different problem. It would have been foolish to miss this opportunity, so I decided to squeeze my personal time and start doing two things at once. We have now been working on the startup in a more or less serious way for a little over a year.

Alexander Astapenko: Does “Genotek” still exist?

Ignatius Kolesnichenko: Yes, the company exists.

Alexander Astapenko: And there is no conflict of interests?

Ignatius Kolesnichenko: It does something a little different. There are some algorithms there too, but they are not its main specialization. A user comes to them, they have to take a saliva sample, run it through the instrument, and then, of course, do some computer analysis and present the result on a nice-looking site: "You have a predisposition to this and that ..."

Alexander Astapenko: Is it like the 23andMe you mentioned?

Ignatius Kolesnichenko: Yes, this is the Russian equivalent of 23andMe.

Alexander Astapenko: Let's talk about how iBinom works.

Ignatius Kolesnichenko: What is the human genome? The human genome is a very long sequence that consists of 23 pieces. The 23 pieces are chromosomes, and each chromosome comes as a pair. This sequence is about 3 billion characters long in total, and it contains regions of interest. In principle all of it may be interesting, but there are special regions, the so-called coding regions, in other words genes. There are not that many of them: not 3 billion characters but, I don't know, 50 million. We need to find out what is happening there: whether these genes work or not, which mutations are present, and what they affect.

The first challenge is how to read it all. Long ago, about 40 to 50 years ago, a simple manual method for reading a single piece was invented. All of our DNA can be thought of as a long string over four letters: A, C, T, G. From the point of view of algorithms and analysis, you can treat DNA as letters and not think about the fact that these are nucleic acids and so on. So you need to read it. About 50 years ago people learned to do that. True, reading a thousand characters takes a person a day. And we need not a thousand characters but 50 million! It is not at all clear how to do that. There was a very large project, the Human Genome Project, in which a billion dollars was invested, if not more. The goal of that project was to read the entire human genome, assemble it, and understand it.

But there is another problem: we can read pieces of 50 or 100 characters, and we can read such pieces from different places, but we understand very poorly where these places actually are and how to assemble the human genome from them afterwards. Where is science in this area now? It has learned to read these pieces in very large quantities and very quickly but, again, we do not know where they come from. We take the entire human genome, all the chromosomes, break them into small pieces 100 to 1000 characters long, and read each one from beginning to end. After that we have a large number of such reads and need to assemble the entire genome from them. To make assembly possible at all, this procedure is repeated many times, at least 30 and often more, so that each letter, each position in the genome, is covered many times. If we do this only once, we have no information about how the pieces should be glued together, and we will not be able to glue them. So there have to be many such pieces, with high coverage. Then you need to apply some magic, a complex algorithm that assembles the entire genome from these pieces. This task is difficult to this day; assembling the genome of a new organism is very hard. This is what is called genome assembly.
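The idea of coverage can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not iBinom's code: the "genome" is a random 2000-letter string, reads have a fixed length and no errors, and the function names are invented for the example. The start position of each read is kept only so that depth can be counted; a real sequencer does not report it, which is exactly the difficulty described above.

```python
import random

def simulate_reads(genome, read_len, coverage, seed=0):
    """Cut fixed-length reads at random positions until the average coverage is reached."""
    rng = random.Random(seed)
    n_reads = coverage * len(genome) // read_len
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    # The start is stored next to the read purely for bookkeeping in this demo.
    return [(start, genome[start:start + read_len]) for start in starts]

def per_base_depth(genome_len, reads):
    """Count how many reads cover each position of the genome."""
    depth = [0] * genome_len
    for start, seq in reads:
        for i in range(start, start + len(seq)):
            depth[i] += 1
    return depth

letters = random.Random(1)
genome = "".join(letters.choice("ACGT") for _ in range(2000))

reads = simulate_reads(genome, read_len=100, coverage=30)
depth = per_base_depth(len(genome), reads)
mean_depth = sum(depth) / len(depth)

print("reads:", len(reads))
print("mean depth:", mean_depth)
print("least covered position has depth:", min(depth))
```

Average depth is simply the number of reads times the read length divided by the genome length. The least covered position is usually well below that average, which is one reason a single pass over the genome is not enough.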

Analyzing the genome of a particular person is a little easier. The same kind of analysis is done: small pieces are read from the genome, though only from the interesting coding regions. The entire genome is usually not read, although that can also be done. Then we use the wonderful property that all people are very similar to each other: we differ by no more than 1%. So if we take a reference human genome and our reads, we do not have to assemble a new genome from the reads; we can compare the reads with the reference genome directly, find where they occur in it, and see how they differ. That is, more or less, how mutations and other changes in a person's genome are analyzed.
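Comparing reads with a reference can be sketched in a few lines. The example below is a deliberately naive illustration, not the method iBinom uses: every read is tried at every position of a toy reference and placed where it has the fewest mismatching letters, with substitutions only and no insertions or deletions. Real aligners rely on indexes rather than brute force. The sequences and function names are invented for the example.

```python
def best_alignment(reference, read):
    """Return (position, mismatches) for the placement with the fewest mismatches."""
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches

def call_differences(reference, reads):
    """Map each reference position to the letters the reads show there instead."""
    diffs = {}
    for read in reads:
        pos, _ = best_alignment(reference, read)
        for offset, letter in enumerate(read):
            if reference[pos + offset] != letter:
                diffs.setdefault(pos + offset, set()).add(letter)
    return diffs

reference = "ACGTTGCATGCCAGTACGGATCCATGA"
# A toy "patient" genome: the reference with A replaced by G at position 12.
sample = reference[:12] + "G" + reference[13:]
# Three overlapping reads taken from the sample; their origin is not passed on.
reads = [sample[5:17], sample[8:20], sample[10:22]]

diffs = call_differences(reference, reads)
print(diffs)  # {12: {'G'}}
```

All three reads agree on the same substitution, which is why high coverage helps: a difference seen in many independent reads is less likely to be a reading error.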

Alexander Astapenko: By the way, "the perfect human genome" is an interesting phrase. It has such echoes of the middle of the last century.

Ignatius Kolesnichenko: It is not an ideal genome but a reference one ...

Alexander Astapenko: Does it depend on the current political situation?

Ignatius Kolesnichenko: No, it does not depend on the current political situation. This is a purely biological, or rather bioinformatics, scientific term.

Alexander Astapenko: Doesn't it depend on skin color either?

Ignatius Kolesnichenko: It varies, actually. There are different assemblies. Roughly speaking, people in Europe and people in Russia are different populations. In general, you can assemble an average person for Russia and an average person for Europe, and they will be slightly different. If we analyze a person who lives in Europe, it is more reasonable to compare them with the European reference than with the African one.

Alexander Astapenko: Or in Russia.

Ignatius Kolesnichenko: Or in Russia. Well, Russia is still similar to Europe in this sense; they are closer, and Africa, as far as I know, is a little further away. But the differences are still insignificant, and a comparison is possible. The standard genome that is used everywhere was assembled from 100 or 1000 completely different people. They took 100 or 1000 different people, analyzed them all, and assembled a common genome from all their data together. If we took one particular person, they might have many hereditary diseases that are simply recessive or for some other reason do not manifest themselves, and then we would not get a real reference. To assemble such a reference, it is better to take as many people as possible. If we take the majority at each position, then that majority is most likely the popular allele, the popular letter, at that position in our population. And most likely it is the good, correct one, while the one that is rare is wrong and may cause something.
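Taking the majority at each position can be shown on toy data. This is a minimal sketch assuming five already aligned sequences of equal length; real reference genomes are built with far more elaborate procedures, and the data and function names here are invented for the example.

```python
from collections import Counter

def consensus(sequences):
    """Pick the most common letter at every position of equally long sequences."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )

def rare_alleles(sequence, reference):
    """List (position, reference letter, observed letter) where a sequence differs."""
    return [
        (i, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference, sequence))
        if ref != alt
    ]

# Five toy "people"; two of them carry one rare letter each.
people = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAT", "ACGTAC"]

reference = consensus(people)
print(reference)                           # ACGTAC
print(rare_alleles(people[2], reference))  # [(2, 'G', 'C')]
print(rare_alleles(people[3], reference))  # [(5, 'C', 'T')]
```

With only one person as the source, that person's rare letters would end up in the reference; with many people, they are outvoted.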

Let me explain what the whole analysis is about, because this is only the first part. It is the first technically difficult part: the raw data produced by the instrument (the sequencer), which generates short sequences, the reads. You simply put your saliva or blood, any sample, into the sequencer, and various reagents are added. It runs for, I don't know, 12 hours and writes the reads to the hard disk. There are very many of them, with high coverage. The typical amount of data for exome analysis (analysis of human genes) is from 2 or 3 GB to 30 GB.

Alexander Astapenko: In one of your videos I came across the figure of 200 GB.

Ignatius Kolesnichenko: 200 GB is a complete genome. But in fact the full genome is of little practical interest; it is interesting rather in a purely scientific sense.

Alexander Astapenko: So in real life it is from 2 to 30 GB, somewhere like that?

Ignatius Kolesnichenko: More or less, yes. When we did the service, we wanted to be able to work with the full genome, too, because there are few who want ...
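The sizes quoted in this exchange are roughly consistent with a back-of-the-envelope estimate. The sketch below assumes 30x coverage (the minimum mentioned earlier) and about 2 bytes stored per letter read, one for the letter itself and one for a quality score. That storage figure is an assumption for illustration and was not stated in the conversation; the function name is invented as well.

```python
def raw_data_gigabytes(target_letters, coverage, bytes_per_letter=2):
    """Estimate raw read data size: every target letter is read `coverage` times."""
    return target_letters * coverage * bytes_per_letter / 1e9

# Coding regions: about 50 million letters, as estimated in the interview.
exome_gb = raw_data_gigabytes(50_000_000, coverage=30)
# Complete genome: about 3 billion letters.
genome_gb = raw_data_gigabytes(3_000_000_000, coverage=30)

print(exome_gb)   # 3.0
print(genome_gb)  # 180.0
```

Under these assumptions the coding regions come to about 3 GB and the complete genome to about 180 GB, in line with the 2 to 30 GB and 200 GB figures above; higher coverage moves the first number toward the upper end of that range.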

Well, we found changes in the genome: for example, the reference genome has the letter A at some position, and for some reason I have the letter G there. On its own, that information is completely incomprehensible.

Ignatius Kolesnichenko: Yes, there are a lot of mistakes. The sequencer itself makes mistakes, and they have to be corrected somehow. Our algorithms make mistakes too. For example, we find 200 critical mutations that look scary, but in real life the person is healthy and only has problems with the liver.

Alexander Astapenko: And you say: no, if we said the morgue, then it's the morgue! Right?

Ignatius Kolesnichenko: We try to show in full which databases we used and what we found, so that the doctor can see the whole chain. If he sees the whole chain, he can go and double-check it manually. For a specific disease the doctor is interested in, say, the 3 mutations that we reported. Checking 3 mutations is feasible; not easy, of course, but quite possible. The doctor can double-check, and then he is sure: yes, these 3 mutations really are associated with this disease. And then he takes responsibility and begins to treat the patient.

Source: https://habr.com/ru/post/227161/
