If you ask a passerby what biology is, they will most likely say something like “the science of living nature”. Ask about computer science, and they will say it deals with computers and information. If we are not afraid of being intrusive and ask a third question (what is bioinformatics?), they will surely be stumped. That is understandable: not everyone knows about this field even at EPAM, although the company does have bioinformaticians. Let's figure out why this science matters to humanity in general and to EPAM in particular. After all, someone might ask us about it on the street one day.

Why biology can no longer cope without computer science, and what cancer has to do with it
To carry out a study, it is no longer enough for biologists to collect samples and look through a microscope. Modern biology deals with colossal amounts of data. Often the data simply cannot be processed by hand, so many biological problems are solved with computational methods. We do not have to look far for an example: the DNA molecule is so small that it cannot be seen under a light microscope. And even where it can be seen (under an electron microscope), visual inspection does not help with many problems.
Human DNA consists of three billion nucleotides. A lifetime is not enough to analyze them all by hand and find the region of interest. Well, it might just be enough, one lifetime per molecule, but that is too slow, expensive and unproductive, so genomes are analyzed with computers.
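A rough back-of-the-envelope check of that claim, assuming (purely for illustration) that a person could inspect one nucleotide per second without ever stopping:

```python
GENOME_LENGTH = 3_000_000_000      # nucleotides in human DNA, as stated above
NUCLEOTIDES_PER_SECOND = 1         # assumed manual reading speed (hypothetical)
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Years of uninterrupted work needed to look at every nucleotide once.
years_needed = GENOME_LENGTH / NUCLEOTIDES_PER_SECOND / SECONDS_PER_YEAR
print(f"{years_needed:.0f} years of non-stop reading")
```

Under that assumption the result is roughly a full human lifetime for a single molecule, which is why the work is handed to computers.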
Bioinformatics is the full set of computational methods for analyzing biological data: sequenced DNA and protein structures, micrographs, signals, databases of experimental results, and so on.

Sometimes DNA sequencing is needed to choose the right treatment. The same disease, when caused by different hereditary disorders or by environmental exposure, must be treated differently. The genome also contains regions that are not linked to the development of a disease but determine, for example, the response to certain therapies and drugs. This is why different people with the same disease may respond differently to the same treatment.
Bioinformatics is also needed to develop new drugs. Their molecules must have a specific structure and bind to a specific protein or DNA segment. Computational methods help model the structure of such a molecule.
The achievements of bioinformatics are widely used in medicine, above all in cancer treatment. DNA also encodes predisposition to other diseases, but cancer treatment receives the most attention. This area is considered the most promising, financially attractive and important, and also the most difficult.
Bioinformatics at EPAM
At EPAM, bioinformatics is the business of the Life Sciences division. It develops software for pharmaceutical companies and for biological and biotechnology laboratories of all sizes, from start-ups to leading global companies. Only people who understand biology, can design algorithms and can program are able to handle such work.
Bioinformaticians are hybrid specialists. It is hard to say which knowledge comes first for them, biology or computer science; put that way, they need both. What probably matters most is an analytical mind and a willingness to study a lot. EPAM has biologists who went on to study computer science, as well as programmers and mathematicians who additionally studied biology.
How to become a bioinformatician
Maria Zueva, developer: “I received a standard IT education and then took the EPAM Java Lab courses, where I became fascinated by machine learning and data science. When I finished the lab, I was told: ‘Go to Life Sciences, they do bioinformatics and they are hiring right now.’ I won't pretend otherwise: that was the first time I had heard the word ‘bioinformatics’. I read about it on Wikipedia and went.
A whole group of newcomers was recruited into the unit at that time, and we studied bioinformatics together. We started by reviewing the school curriculum on DNA and RNA, then went through the problems bioinformatics deals with and the approaches and algorithms used to solve them, and learned to work with specialized software.”
Gennady Zakharov, business analyst: “I am a biophysicist by training, and in 2012 I defended my candidate's dissertation in genetics. For some time I worked in science doing research, and I still do. When the opportunity arose to apply scientific knowledge in industry, I seized it immediately.
My job is quite unusual for a business analyst. Financial matters, for example, pass me by; I am more of a subject-matter expert. I need to understand what customers want from us, get to the bottom of the problem and write high-level documentation, that is, the specification for the programmers, and sometimes build a working prototype. During a project I stay in touch with both developers and customers so that each side can be sure the team is doing what is required of it. In effect, I am a translator from the language of customers (biologists and bioinformaticians) into the language of developers and back.”
How to read the genome
To understand what EPAM's bioinformatics projects are about, you first need to understand how a genome is sequenced, because the projects we are going to describe are directly related to reading the genome. Let us ask the bioinformaticians to explain.
Mikhail Alperovich, head of the bioinformatics unit: “Imagine that you have ten thousand copies of War and Peace. You ran them through a shredder, mixed everything up, pulled a random handful of paper strips out of the pile and tried to reassemble the original text from them. You also have the manuscript of War and Peace, and you will need to compare the text you assemble against it to catch typos (and there will be typos). Modern sequencing machines read DNA in much the same way. DNA is isolated from cell nuclei and split into fragments of 300-500 nucleotide pairs (remember that nucleotides in DNA are linked in pairs). The molecules are fragmented because no modern machine can read a genome from start to finish: the sequence is too long, and errors accumulate as it is read.
Back to War and Peace after the shredder. To restore the original text of the novel, we need to read all the pieces and put them in the correct order. In effect, we read the book many times over in tiny fragments. It is the same with DNA: the sequencer reads each stretch of the sequence many times, with overlaps, because we are analyzing not one DNA molecule but many.
The fragments obtained are then aligned: each one is “attached” to the reference genome to determine which part of the reference it corresponds to. Next, variations are identified in the aligned fragments, that is, significant differences between the reads and the reference genome (the typos in the book compared with the reference manuscript). This is done by programs called variant callers. It is the hardest part of the analysis, which is why there are many different variant callers, and they are constantly being improved while new ones are developed.
The vast majority of the mutations found are neutral and affect nothing. But some of them encode a predisposition to hereditary diseases or the ability to respond to particular types of therapy.”
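The align-then-compare procedure described above can be sketched in a few lines. This is a toy illustration only, not how production aligners or variant callers work: each read is placed by brute-force search for the position with the fewest mismatches, and the function names `align_read` and `find_variants`, as well as the sequences, are invented for the example.

```python
def align_read(reference, read):
    """Return (position, mismatches) for the best placement of read on reference."""
    best_pos, best_mismatches = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for ref_base, base in zip(window, read) if ref_base != base)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


def find_variants(reference, reads):
    """Collect (position, reference_base, read_base) differences over all reads."""
    variants = set()
    for read in reads:
        pos, _ = align_read(reference, read)
        for offset, base in enumerate(read):
            if reference[pos + offset] != base:
                variants.add((pos + offset, reference[pos + offset], base))
    return sorted(variants)


# Made-up reference ("the manuscript") and overlapping reads ("paper strips").
reference = "ACGTTAGCCATGGACTTGCA"
reads = ["ACGTTAGC", "TAGCCTTGG", "TGGACTTGCA"]  # the second read carries one "typo"

variants = find_variants(reference, reads)
print(variants)
```

Real tools replace the brute-force search with indexed alignment and weigh each difference statistically, but the overall shape (place the read, then compare it with the reference) is the same.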
For the analysis, a sample containing many cells is taken, and therefore many copies of the cell's complete DNA set. Each small DNA fragment is read several times to minimize the chance of error. If even one significant mutation is missed, the patient may receive the wrong diagnosis or the wrong treatment. Reading each DNA fragment only once is not enough: that single reading may be wrong, and we would never know. If we read the same fragment twice and get one correct and one incorrect result, it is hard to tell which of the two is true. But if we have a hundred readings and 95 of them show the same result, we can conclude that it is the correct one.
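The reasoning about one, two and a hundred readings can be expressed as a simple majority vote over the bases observed at one position. The sketch below is an illustration under assumed thresholds: `min_depth` and `min_fraction` are hypothetical parameters chosen for the example, not values taken from any real pipeline.

```python
from collections import Counter


def consensus_base(observed_bases, min_depth=10, min_fraction=0.9):
    """Return the majority base, or None if the evidence is too weak.

    min_depth and min_fraction are illustrative thresholds (assumptions).
    """
    if len(observed_bases) < min_depth:
        return None  # too few readings to trust any of them
    base, count = Counter(observed_bases).most_common(1)[0]
    if count / len(observed_bases) < min_fraction:
        return None  # readings disagree too much
    return base


single = consensus_base(["A"])                   # one reading: cannot be verified
split = consensus_base(["A", "G"])               # two conflicting readings
deep = consensus_base(["A"] * 95 + ["G"] * 5)    # 95 of 100 readings agree

print(single, split, deep)
```

With these thresholds only the third case yields a call, which mirrors the argument in the text: agreement among many independent readings is what makes a result trustworthy.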
Gennady Zakharov: “To analyze cancer, both healthy and diseased cells must be sequenced. Cancer arises from mutations that a cell accumulates over its lifetime. If the mechanisms responsible for growth and division break down, the cell begins to divide without limit, regardless of the body's needs; in other words, it becomes a cancerous tumor. To understand what exactly caused the cancer, a sample of healthy tissue and a sample of the tumor are taken from the patient. Both samples are sequenced, the results are compared, and the differences show which molecular mechanism is broken in the cancer cell. Based on this, a drug is chosen that is effective against cells with that “breakdown”.”
Bioinformatics: production and open source
The bioinformatics division at EPAM has both production and open-source projects. Part of a production project can become open source, and an open-source project can become part of production (for example, when an open-source EPAM product needs to be integrated into a client's infrastructure).
Project No. 1: variant caller
For one of its clients, a large pharmaceutical company, EPAM upgraded a variant caller. What makes it special is that it can find mutations that other similar programs miss. The program was originally written in Perl and had complex logic. At EPAM it was rewritten in Java and optimized, and it now runs 20, if not 30, times faster.
The source code of the program is available on GitHub.
Project No. 2: 3D molecule viewer
There are many desktop and web applications for visualizing the structure of molecules in 3D. Being able to picture how a molecule looks in space is extremely important, for example, in drug development. Suppose we need to synthesize a drug with a targeted effect. First we need to design the drug molecule and make sure it will interact with the right proteins exactly as required. Real molecules are three-dimensional, so they are analyzed as three-dimensional structures too.
For 3D viewing of molecules, EPAM built an online tool that initially worked only in a browser window. Based on this tool, we then developed a version that visualizes molecules in an HTC Vive virtual reality headset. The headset comes with controllers that let you rotate and move a molecule, position it against another molecule, and rotate individual parts of it. Doing all this in 3D is much more convenient than on a flat monitor. This part of the EPAM bioinformatics work was done together with the Virtual Reality, Augmented Reality and Game Experience Delivery department.
The program is still being prepared for publication on GitHub, but in the meantime there is a link where you can view its demo version.
You can see how the application works in the video.
Project No. 3: NGB genome browser
A genome browser visualizes individual DNA reads, variations and other information produced by genome analysis utilities. Once the reads have been aligned to the reference genome and the mutations have been found, the scientist needs to check whether the machines and algorithms worked correctly. The patient's diagnosis and treatment depend on how accurately the mutations in the genome are identified. In clinical diagnostics, therefore, the scientist has to supervise the work of the machines, and the genome browser helps with that.
For bioinformatics developers, the genome browser helps analyze complex cases in order to find errors in the algorithms and understand how they can be improved.
NGB (New Genome Browser), the new genome browser from EPAM, runs on the web, yet in speed and functionality it is on a par with its desktop counterparts. The market lacked such a product: earlier online tools were slower and could do less than desktop ones. Many clients now choose web applications for security reasons. With an online tool, nothing has to be installed on the scientist's work computer. It can be used from anywhere in the world by signing in to the corporate portal. The scientist does not have to carry a work computer everywhere or download all the necessary data to it, which can be very large.
Gennady Zakharov, business analyst: “On the open-source utilities I acted partly as the customer: I defined the task. I studied the best solutions on the market, analyzed their strengths and weaknesses, and looked for ways to improve on them. We needed to build web solutions that were no worse than their desktop counterparts and also add something unique to them.
In the 3D molecule viewer, that was virtual reality support; in the genome browser, it was improved handling of variations. Mutations can be complex. Changes in cancer cells sometimes affect vast regions. Extra chromosomes appear, and pieces of chromosomes or whole chromosomes disappear or are joined together in random order. Individual pieces of the genome can be copied 10-20 times. Such data is, first, harder to derive from the reads and, second, harder to visualize.
We developed a visualizer that correctly reads information about such extensive structural rearrangements. We also built a set of visualizations that show whether hybrid proteins were formed when chromosomes were joined. If an extended variation affects several proteins, a single click calculates and shows what happens as a result of the variation and which hybrid proteins are produced. In other visualizers, scientists had to track this information manually; in NGB it takes one click.”
How to study bioinformatics
As we have already said, bioinformaticians are hybrid specialists who need to know both biology and computer science. Self-education plays an important role here. EPAM does have an introductory bioinformatics course, but it is intended for employees who will use this knowledge on a project, and classes are held only in St. Petersburg. Still, if you are interested in bioinformatics, there are ways to learn:
1) An introductory course in genetic diagnostics from the company 23andMe.
2) Several courses on Coursera (including a couple of courses in Russian: an introduction to bioinformatics and metagenomics).
3) Courses on Stepik from the Institute of Bioinformatics: molecular biology and genetics, molecular phylogenetics, genetic engineering, and an introduction to high-throughput sequencing technologies. A complete list of the institute's courses can be found on its official website.
4) Lectures by Pavel Pevzner, a professor at the University of California at San Diego and a specialist in bioinformatics.
5) If you live in St. Petersburg, you can attend the guest lectures at the Institute of Bioinformatics. They are free.