
Ignatius Kolesnichenko: "You won't ask for money from a bacterium"

Introducing the second episode of our podcast about technologies, processes, infrastructure, and people in IT companies. Today's guest on CTOcast is Ignatius Kolesnichenko, Technical Director of iBinom (human genome analysis).

Listen to the podcast

1st part of the text version of the podcast

Text version of the podcast (2nd part)



ABOUT TECHNOLOGIES



Pavel Pavlov: There is a lot one could ask about algorithms, but let's start with the most obvious question. At first you surely used existing algorithms, that is, it was a matter of adapting them. Did you manage to create something of your own? How were your service and approach built?

Ignatius Kolesnichenko: I would say that in roughly half the cases we use existing algorithms as-is. But you have to understand that even existing algorithms need tuning. For example, different sequencers produce data in somewhat different formats and read lengths, and the algorithms perform with varying quality on such data. We had to figure out which algorithms to apply to which data and what the results would be.

As for the alignment algorithm itself: at some point we had the idea, and we even tried to write our own, but the task turned out to be difficult and hardly feasible for a startup. If you look at any well-known alignment algorithm, it is 30,000 lines of heavily optimized C++ that two or three people have been developing for, say, the last four years. Obviously, we cannot reproduce that overnight.

Pavel Pavlov: What was your main criterion when choosing algorithms? Performance? Accuracy?

Ignatius Kolesnichenko: The criterion was a compromise: we had to be satisfied with the results, and everything had to really fit into the promised hour. Most algorithms fail precisely on that parameter, they are very slow. They may be good at something else, but not speed. In general, they are probably not very well suited to analyzing the human genome, simply because these algorithms are used for more than the human genome. Scientists just as readily analyze bacteria and plants, and a bacterial genome is not 3 billion base pairs but, say, 10 million, which makes the task essentially simpler.

Alexander Astapenko: But a bacterium is not a very solvent client, as far as I understand, right?

Ignatius Kolesnichenko: Naturally. You won't ask for money from a bacterium.

Pavel Pavlov: Sorry!

Ignatius Kolesnichenko: About our own algorithms... Naturally, the first stage is the hardest, and there we use one of the existing algorithms. Next comes the variant calling stage: you need to determine what kind of mutation occurred. Here we use both our own algorithm and one that already exists; the final choice hasn't been made yet. And then there is another interesting stage: a mutation has occurred, say, and you need to understand how likely it is to break the protein whose coding it participates in. What did we do? There are already many algorithms that say: "This mutation breaks such-and-such a protein with such-and-such probability." We gathered these algorithms, used their results as features, and trained our own model on those features. So here our own algorithm is at work.
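The last step described here, combining the scores of existing effect predictors as features for a custom model, can be sketched as follows. This is a minimal illustration in pure Python, not iBinom's actual model: the predictor scores, training data, and choice of logistic regression are all assumptions.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Tiny SGD logistic regression over predictor scores (pure Python)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(pathogenic)
            g = p - yi                        # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Each row: "probability damaging" scores from three hypothetical
# third-party predictors; label 1 = known pathogenic mutation.
X = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.9], [0.2, 0.1, 0.3], [0.1, 0.2, 0.1]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
print(predict(w, b, [0.85, 0.9, 0.8]) > 0.5)  # high scores -> pathogenic
print(predict(w, b, [0.10, 0.15, 0.2]) < 0.5) # low scores -> benign
```

The point of stacking predictors this way is that the combined model can learn which upstream tools to trust more on which kinds of input.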

Pavel Pavlov: And what was harder: finding implementations of these algorithms or, for example, building the platform that runs all the computation and analysis, the purely infrastructural part?

Ignatius Kolesnichenko: These are somewhat different tasks, but it seems to me the infrastructure part was a bit harder. With the analysis of algorithms, the difficulty is in doing it correctly, and it also requires a lot of computational resources, but on the whole the task is simpler. Whereas building an infrastructure that runs everything on Amazon, lets users upload data with resumable uploads, and ties it all together into a single service that works like clockwork is not so easy.

Pavel Pavlov: It turns out that both problems were solved, first of all, by you as the company's chief technical specialist?

Ignatius Kolesnichenko: Yes, I solved both kinds of problems. But actually, when we first started, I said: "Well, listen, guys, there are 20 hours a week that I can spend on you. But obviously we will not get far on 20 hours a week." So the first thing we did was go looking for a CTO who would be in charge of everything. I helped him, naturally, but more in terms of thinking through the architecture; the implementation was done without me. I, in fact, did the experiments and research and chose which algorithm we would use and how. The whole part concerning launch and storage on Amazon was also my task.

Pavel Pavlov: It probably makes sense to walk through the entire workflow in order. Let's start with the moment when users try to upload their 2--20 GB to your S3 storage. How does this process work? What services do you use?

Ignatius Kolesnichenko: There are no special secrets. The user uploads the data; it is proxied and lands in S3. On the back end everything is done carefully, of course: the upload is kept simple, and it can be resumed if the connection suddenly drops. Once the upload finishes, a MapReduce job is launched that aligns all this data and analyzes it, and at the output we have a list of mutations.
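The resumable-upload idea can be sketched roughly as follows. This is a toy illustration, not iBinom's implementation: in production the same effect would typically be achieved with S3 multipart uploads behind the proxy, and `FakeStore`, `CHUNK`, and the failure simulation are invented for the demo.

```python
import io

CHUNK = 4  # bytes per chunk; tiny for the demo (multi-MB in practice)

class FakeStore:
    """Stands in for server-side storage (S3 behind the proxy)."""
    def __init__(self):
        self.parts = []
    def append(self, data):
        self.parts.append(data)
    def offset(self):
        return sum(len(p) for p in self.parts)  # bytes already received
    def content(self):
        return b"".join(self.parts)

def upload(src, store, fail_after=None):
    """Send chunks starting from the offset the server already has."""
    src.seek(store.offset())  # key idea: resume, don't restart
    sent = 0
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            return True
        if fail_after is not None and sent >= fail_after:
            return False  # simulated dropped connection
        store.append(chunk)
        sent += 1

data = b"ACGT" * 6  # stand-in for a FASTQ payload
store = FakeStore()
upload(io.BytesIO(data), store, fail_after=2)  # connection drops mid-upload
done = upload(io.BytesIO(data), store)         # resumes from server offset
print(done and store.content() == data)        # True
```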

Since there aren't that many mutations (about 50 thousand for a human exome), we can analyze them locally, which is what we do. We have set up various databases with whose help we examine the mutations and understand where they are at all, whether they code for proteins or not. From these databases we also pull links to articles and publications about the mutations. Then we build a PDF report listing the 50 most significant mutations. In the same report we collect a lot of additional information gathered while running all the algorithms, for example, how many mutations there are of each kind. We show this to the user so he can picture what happened to his data and how we processed it.
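Picking the most significant mutations and tallying summary counts for the report reduces, in its simplest form, to a sort and a histogram. A minimal sketch with an invented annotation schema and invented scores (the interview does not describe the actual significance scoring):

```python
from collections import Counter

# Hypothetical annotated mutations: (id, category, significance score).
mutations = [
    ("chr1:12345 A>G", "missense", 0.91),
    ("chr2:4242 C>T", "synonymous", 0.05),
    ("chrX:777 G>A", "nonsense", 0.98),
    ("chr7:555 T>C", "intronic", 0.10),
    ("chr3:999 G>C", "missense", 0.64),
]

TOP_N = 2  # 50 in the real report

# Most significant first, truncated to the report size.
ranked = sorted(mutations, key=lambda m: m[2], reverse=True)[:TOP_N]
# Per-class counts ("how many mutations of each kind") for the summary.
by_category = Counter(m[1] for m in mutations)

print([m[0] for m in ranked])   # ['chrX:777 G>A', 'chr1:12345 A>G']
print(by_category["missense"])  # 2
```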

There is also a personal account where the user can register; his data is, in fact, kept with us. He can also change the settings and choose what to show in the report and what not.

Pavel Pavlov: And in most cases is the user, that is, a medical specialist, able to figure out these settings and get the result he wants?

Ignatius Kolesnichenko: A difficult question. It seems that nobody really tweaks the settings; people rather just take the report and work with it. The default settings are both suitable and not quite suitable at the moment. We already understand that simply handing out mutations is not enough. So we are finishing a system that will be interactive: here are the mutations we have already found, and here are disease symptoms. The doctor will indicate the symptoms and the diseases he is interested in, and we will figure out how that relates to the mutations.

Pavel Pavlov: And is this expectation based on some kind of feedback?

Ignatius Kolesnichenko: Yes. To be honest, every second user said that he gets our mutations and then still has to do extra work to reach any conclusion. We realized we need to aggregate all this information and help the user solve his problem in one place, as quickly and easily as possible.

Pavel Pavlov: Let me come back to Amazon. For problems involving MapReduce and Hadoop there are specialized startups and cloud services that solve the problem more efficiently and, in some respects, even more cheaply. By and large, if all you need is MapReduce and S3 storage, then the service is clearly not used at full capacity.

Ignatius Kolesnichenko: No, of course, not at full capacity. As far as I know, everyone who specializes in deploying Hadoop and standing up that stack is better suited to, say, banks that already have their own 10 machines, 10 servers. They just need it deployed and configured and need help building processes. We are not like that: we cannot keep 10 machines running all the time, because right now it would be very expensive for us.

We have a stream of several analyses per day, so it's easier for us to spin up a cluster of 5--10 machines a few times, analyze everything within an hour, and shut it down. Right now that is cheaper. Probably at some point, if we grow, it will be more profitable to keep our own cluster of just 10 machines, and then it makes sense to turn to specialized services.
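The economics of spinning up an ephemeral cluster a few times a day versus keeping one always on can be checked with back-of-the-envelope arithmetic. The hourly rate below is an assumed figure for illustration, not an actual AWS price:

```python
# When does an always-on cluster beat an ephemeral one?
HOURLY_RATE = 0.50   # $/machine-hour, assumed on-demand price
MACHINES = 10
RUNS_PER_DAY = 3     # "several analyses per day"
HOURS_PER_RUN = 1    # the promised one-hour analysis

ephemeral_daily = HOURLY_RATE * MACHINES * RUNS_PER_DAY * HOURS_PER_RUN
always_on_daily = HOURLY_RATE * MACHINES * 24

print(ephemeral_daily)  # 15.0 $/day
print(always_on_daily)  # 120.0 $/day

# Ephemeral only loses once utilization approaches the full day:
breakeven_runs = 24 / HOURS_PER_RUN
print(breakeven_runs)   # 24.0 one-hour runs/day before always-on pays off
```

At a few runs per day the ephemeral cluster is roughly an order of magnitude cheaper, which matches the trade-off described above.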

Pavel Pavlov: Did you manage to achieve any serious improvement in performance and thereby reduce computation time and save money?

Ignatius Kolesnichenko: Yes, Amazon has a nice feature called spot instances, which let you cut costs while losing almost nothing. They sell their spare capacity at a price up to 10 times lower than the usual rate for a machine.

And so we tweaked and experimented, but overall I can't say we gained anything significant, maybe 10--15%. The main problems with Amazon are that everything runs out of the box, which is by and large designed to run Java code, while all our code is written in C++, and there were some problems with binary compatibility.

Pavel Pavlov: Can you tell us a bit more about the incompatibility? I just haven't had to deal with it before.

Ignatius Kolesnichenko: Amazon has Elastic MapReduce, a ready-made Hadoop image, but you can't manage it in any way. Strictly speaking, Amazon is responsible for keeping it working, so they configured it once in the corresponding image and handed it to you. All you can do is bring up similar virtual machines. So you bring up a similar VM, tweak something on it, send over a compiled binary, and it unexpectedly crashes inside libc when a new thread creates something. There's the problem!

Pavel Pavlov: Well, can the process be automated to track these errors? I take it this happens fairly regularly?

Ignatius Kolesnichenko: No, it's rather a one-off. The problem happens when we update something on the machine. An update comes out, and the machine, say, was quite old; you update it, libc gets updated along with it, and suddenly everything breaks.

This is a technical problem, but a solvable one. There is no particular point in automation here, because we run the same things all the time. We are not a public service; users don't compile binaries themselves and send them to us. We have, I don't know, 5 binaries, and you just need to make sure they launch successfully and run to completion.

To my regret, I learned that Hadoop does not seem very well suited for running native binary code like this. It is still heavily geared toward making things convenient for Java, and for people trying to run third-party code everything is rather inconvenient. There were problems with the interface, with how to specify the settings correctly. In the new Hadoop everything is launched in separate containers, and there is also a Java layer that needs its own memory allocation. That is, your whole task runs as a child of the Java VM. There is a container, a JVM lives in it, your program lives under the JVM, and that Java layer buffers the data it reads and writes. You need to account for all of this correctly to make maximum use of the machine's memory without anything crashing or running out of memory.
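The memory layout described here (a container limit that must cover the JVM wrapper, its I/O buffers, and the native C++ child) amounts to a budgeting exercise. The numbers below are illustrative assumptions, not Hadoop defaults:

```python
# One Hadoop container running a native child under a JVM wrapper.
# The container limit must cover ALL residents, or YARN kills the task.
CONTAINER_MB = 4096      # what we ask YARN for
JVM_HEAP_MB = 512        # the Java wrapper / streaming layer heap
IO_BUFFERS_MB = 256      # buffers for piping data to/from the binary
JVM_OVERHEAD_MB = 256    # JVM native overhead (thread stacks, etc.)

# Budgeting down: how much is left for the aligner binary itself.
native_budget = CONTAINER_MB - JVM_HEAP_MB - IO_BUFFERS_MB - JVM_OVERHEAD_MB
print(native_budget)     # 3072 MB

# Budgeting up: given what the binary needs, what container to request.
binary_needs = 6144
request = binary_needs + JVM_HEAP_MB + IO_BUFFERS_MB + JVM_OVERHEAD_MB
print(request)           # 7168 MB
```

Forgetting any one of the non-binary terms is exactly the failure mode described: the task's total footprint exceeds the container limit and the job is killed.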

Pavel Pavlov: Again, it turns out that if you had a static cluster, some stable configuration, there would be far fewer problems?

Ignatius Kolesnichenko: Yes, that's true. But for now we can't afford a static cluster.

Pavel Pavlov: And how long does a typical computation take, from upload to the final PDF file?

Ignatius Kolesnichenko: It depends on the amount of data, but in general an hour. With 2 GB of data, the postprocessing that queries the various databases takes a minute; with 30 GB it is already 10--15. Unfortunately, this stage is very hard to parallelize; you can't easily put it on MapReduce, since the databases you need to search are dozens of gigabytes. And since the cluster is not static, we can't simply ship those tens of gigabytes to all the machines, because everything starts to hit the network limit. About 5 minutes of cluster startup is taken up by sending our reference genome, which weighs 3 GB, plus all the indices that go with it (about 10 GB in total), to every machine.

Pavel Pavlov: On Amazon? But there is a fairly serious channel there, at least 1 Gbps, isn't there?

Ignatius Kolesnichenko: It turns out to be less, by and large, because S3 lives in one place and the cluster comes up somewhere else. It is all in one region, of course, but the network is clearly not infinite, and pulling down 10 GB takes minutes. Unfortunately, it's not the case that you have a clean gigabit, divide 10 GB by 100 MB/s, and get 100 seconds. It doesn't work that way; everything turns out much slower.
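The gap between the naive "10 GB over a gigabit" estimate and the observed transfer time can be made concrete. The 30% efficiency factor is purely an assumption for illustration, not a measured AWS figure:

```python
# Naive estimate from the conversation vs. a discounted effective rate.
GB = 1024                  # MB
data_mb = 10 * GB
line_rate = 100.0          # MB/s, the "clean gigabit" figure
ideal_s = data_mb / line_rate
print(round(ideal_s))      # 102 s, the back-of-envelope ~100 s number

efficiency = 0.3           # assumed real-world fraction of line rate
real_s = data_mb / (line_rate * efficiency)
print(round(real_s / 60))  # ~6 min, closer to the observed ~5 min startup
```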

Pavel Pavlov: Is it possible to get Amazon to somehow optimize such infrastructural aspects related to network topology and costs?

Ignatius Kolesnichenko: I tried talking to support, and it seems they don't really allow that.

Pavel Pavlov: That is, even if you choose the same data center, the same availability zone, traffic can still go through a bunch of routers?

Ignatius Kolesnichenko: It's hard to tell how many routers it goes through. They have an internal network set up, and it is all very opaque. Externally you see some IP addresses; internally there are others. It is their own internal network, and it is impossible to understand exactly what happens inside. It also happens that different machines come up. That doesn't seem to affect anything; it's just an interesting fact.

Pavel Pavlov: Well, do they at least try to level out the computing power of their instances?

Ignatius Kolesnichenko: Yes. I can't say that one machine was worse or slower than another.

Pavel Pavlov: Is the input the user uploads in some standardized format? Does it depend on the sequencer or on something else?

Ignatius Kolesnichenko: Rather, on different standards. It happens, for example, that some field of programming has 20 standards. Let's write a 21st that will unite all 20! Well, and then it turns out there are now simply 21 standards. The same thing often happens here. Some new company, laboratory, or sequencer manufacturer says: "Listen, I thought it over and realized the old format is so-so. It has such-and-such problems. I'll make a new format." And indeed, the old format has those problems. But the other companies carry on as before, and as a result there is one more format. Still, things aren't really that bad with the input formats.

Alexander Astapenko: Ignat, are there already services that store these data in the cloud, so the user doesn't need to upload them through a web interface?

Ignatius Kolesnichenko: Yes, there is such a trend. Sequencer providers now tell their clients: "Let's not attach a hard disk to the sequencer; plug in this green dongle, and we will stream everything straight into our cloud." Illumina, for example, has its own cloud where all the data is uploaded automatically. And when the doctor runs an analysis, they send him a link. That's it; no extra steps are needed.

ABOUT iBinom TEAM



Alexander Astapenko: What people are on your team now? Whom are you going to hire, and in which direction will you move?

Ignatius Kolesnichenko: Now, due to lack of money, the team has shrunk, but initially it was divided into two parts. The first part built the service and the web interface and developed it further. That was our CTO, who no longer works with us, plus another excellent web developer. In addition, we had one junior developer.

Alexander Astapenko: Can you name the technologies, so it's more or less clear?

Ignatius Kolesnichenko: The web, back end and front end, is on Node.js, plus Jade and CSS. On the back end there is also some Python and Bash, but that is all about building the report and running things on Amazon.

That is one part of the team. The second part, which we had and still have, is research. The idea for the whole startup belongs to our biologist Valery, who also once worked at Genotek. In the research part of the team we had 4--5 people at different times. We tried algorithms and studied new problems.

For example, this task: you have the analyzed genomes of mom, dad, and child, and you need to answer more precisely the question of the child's mutations, to understand which of the child's mutations are new, ones the parents don't have.
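In its simplest form, the trio (mom/dad/child) task reduces to a set difference over called variants. Real trio callers work with genotype likelihoods and sequencing-error models; this sketch with invented variant tuples only shows the core idea:

```python
def de_novo(child, mother, father):
    """Variants present in the child but in neither parent."""
    return child - (mother | father)

# Variants as (chromosome, position, alt-allele) tuples; illustrative data.
mom   = {("chr1", 100, "A"), ("chr2", 200, "T")}
dad   = {("chr1", 100, "A"), ("chr3", 300, "G")}
child = {("chr1", 100, "A"), ("chr3", 300, "G"), ("chr5", 500, "C")}

print(sorted(de_novo(child, mom, dad)))  # [('chr5', 500, 'C')]
```

The inherited variants drop out, leaving only the candidate de novo mutation; the harder, "more accurate" part the interview alludes to is deciding whether such a leftover is a real mutation or a calling error.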

There is one more interesting problem, concerning the mother-fetus system: we take a blood sample from a pregnant woman, and fragments of her baby's DNA float in that blood. You can try to isolate and analyze them. That is, when we analyze, we see reads of both the mother's and the child's DNA. This way you can try to detect the child's mutations in advance. It is a very promising area, but so far, unfortunately, the quality metrics here are rather low.

Alexander Astapenko: If we are talking about the back-end, then there were three of you?

Ignatius Kolesnichenko: Not counting me.

Alexander Astapenko: That is just four people, right?

Ignatius Kolesnichenko: Yes, but now it is me and one developer. So now our goal is to build some new piece of the service, to finish something. The service is ready, it works, it doesn't fall over, but since we don't have money, we can't really develop it further or rework it right now.

Alexander Astapenko: So are you the official CTO of the team, the company?

Ignatius Kolesnichenko: Officially, I may not have that role. As I said, we found a CTO in June, and until we ran out of money he was, in fact, in charge of all back-end development.

Alexander Astapenko: And now that post has passed to you?

Ignatius Kolesnichenko: Yes. I launched the beta more or less myself, and I also did all the finishing touches myself. Formally I may not be the CTO, but in fact I perform that role's functions.

Alexander Astapenko: How do you picture a CTO in your company? What role should he play in a project of this type?

Ignatius Kolesnichenko: The CTO's duty is to think about the architecture, about the future of the system. He must understand well how the back end works, how it all runs on MapReduce, what problems and difficulties there are. Our CTO should be interested in understanding biotechnology and the algorithms. He also needs to lead the research to some extent. The CTO must understand how it all works on Amazon and manage the development of the back end and the front end.

Pavel Pavlov: It turns out that, in your understanding, the CTO focuses on the technical issues the team and the project are working on. But should he look beyond the company, see what is going on around, somehow keep his bearings in the industry?

Ignatius Kolesnichenko: He must, of course. But you need to understand that it is hard technically, and even hard time-wise, to take all of that in. There are different arrangements; as far as I can see, there should be 2 or 3 people who communicate closely, each with his own area of responsibility. In my view, the CTO's area of responsibility is development, of course: the system architecture, an understanding of how it works, where it will go, and what technical difficulties there are. He should also, of course, look at the world around him, but that is not his main duty.

Pavel Pavlov: There are such roles as architect, lead developer, team lead, etc. And at the same time there is a CTO. Do these concepts intersect for you somehow? Is there any substitution?

Ignatius Kolesnichenko: Maybe there is. Much depends on the company's scale. While the company is small (fewer than 10 people) and hands are short, then, in my understanding, the CTO should even program some things, some complicated things. At a minimum, he must read and review all the code.

When a company grows, stratification naturally occurs: the CTO stops reading the code and starts thinking more about the architecture, the product, and its technical development. Team leads and senior developers appear who take over the CTO's previous duties.

The same goes for the research side, which begins to split into groups, and the company grows a large sales group. Scaling simply becomes necessary there, whereas in the current state you need a CTO who can do a little bit of everything.

ABOUT COMPETITORS



Alexander Astapenko: Are there any competitors? In Russia? Worldwide?

Ignatius Kolesnichenko: There are competitors. One of the main ones is probably the desktop program CLCBio, which is already 8 years old. In principle, it solves the same problems. What are its drawbacks? First, the analysis takes quite a long time: a full human genome takes 12 hours. Second, it is quite complex and solves a million different tasks at once, so biologists spend a lot of time just learning to work with it. Otherwise, of course, it can produce all the necessary information. It is one of the richest firms in the field today.

Alexander Astapenko: And are there service models?

Ignatius Kolesnichenko: There are service models too. As I said, various companies are building cloud platforms for bioinformatics computing. The problem with that kind of service is that it is not geared toward bioinformaticians, scientists, or doctors at all. They do not aim to investigate human hereditary diseases. Their goal is to cover the segment from raw data to mutations; further than that they usually don't go.

There are also competitors who do the same thing we do, but rather in manual mode. The client receives mutations and sends them to the company, which spends, say, a week or two analyzing the data and sends back a report with a detailed description of the mutations and an assessment of their significance. Such companies cover the second area: understanding, from a mutation, its connection with hereditary diseases. Here we can still develop very far. I am not completely confident we can automatically reproduce everything they do by hand, but we will naturally strive for it.

Alexander Astapenko: And what about sequencing companies?

Ignatius Kolesnichenko: They are also thinking about cloud platforms and building something, but in what form is not very clear.

Alexander Astapenko: Suppose iBinom manages to attract serious investment and the role of determining the company's technological future falls to you. Can you describe where you need to take iBinom?

Ignatius Kolesnichenko: We have a big new direction: transcriptome analysis and solving the problem of determining the type of cancer. There are hundreds or even thousands of different cancer types that require different medicines. This is a very difficult scientific task: people can solve it in certain specific cases, but from the algorithmic point of view everything is more complicated.

The first stage is the same: you take the data, sequence it, and get mutations, and then you dig into what these mutations affect. That is where my interaction with the biologist comes in. It is actually very interesting for me to work with a scientist who can do something like this with already existing tools. My task is to figure out how he does it, how the tools work, and to assemble from them something unified and working.

Alexander Astapenko: And what about the consumer side? How do you plan to develop?

Ignatius Kolesnichenko: We have an idea to rework the web interface, to make it more convenient. But that is a routine technical task; there are no interesting challenges there in terms of infrastructure and programming.

There was also an idea to eliminate the gap between the user uploading the data and the analysis starting. Since we effectively run a single kind of analysis, we could analyze everything while the upload is still in progress. A nontrivial task for the future.

Source: https://habr.com/ru/post/228227/
