In this article, I want to share our experience of developing a pipeline that uses Docker to analyze biomedical data. Some readers will probably be interested in the bioinformatics pipeline itself, and others in the use of Docker, so the article is divided into two parts.
Part 1. The bioinformatics pipeline, or what we did and why.
Technologies for reading the DNA sequences of living organisms are steadily developing. Scientists are learning more and more about which parts of DNA are responsible for what and how they affect the body as a whole. All of this work holds great medical potential. Aging, cancer, genetic diseases - DNA sequence analysis can be a powerful tool in combating them. We built a pipeline that analyzes the sequence of certain regions of a person's DNA and tries to predict whether that person is at risk of cardiomyopathy, a genetically determined heart disease.
Why read only some parts of the DNA? Reading all of it (3.2 billion nucleotides, or "letters") would cost much more, and to understand whether a particular person carries the "errors" that lead to a given genetically determined disease, it is enough to read only the regions that affect the development of that disease. This is much cheaper.
How do scientists know which parts of the DNA to read? By comparing the genomes of healthy and sick people. These data are quite reliable, because today humanity knows the complete genomes of more than a thousand people around the world.
Once the reads for the necessary DNA regions have been obtained, we need to predict whether their owner faces the disease or not. That is, to determine whether the sequence in these regions looks like that of a healthy person, or whether it contains the deviations that occur in sick people.
Quite a lot of research of this kind is being done, so there are established best practices for it. They describe in detail how to align the cleaned reads to the reference genome (that is, the genome of some abstract healthy person), how to find the "single-letter" differences between them, and how to analyze those differences: weed out the insignificant ones and look the rest up in biomedical databases. For all these steps, the bioinformatics community has developed many programs: bwa, gatk, annovar, etc. The pipeline is designed so that the output of one program is fed as the input of the next, and so on, until the desired result is obtained. There are many ways to implement such a pipeline; in our work, inspired by the excellent course "Management of Computing", we used snakemake.
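To give a feel for it, a snakemake rule is a small block that declares its inputs, its outputs, and the shell command that turns one into the other. A minimal sketch of an alignment step (the rule name and file paths here are hypothetical, not taken from our pipeline):

# Align reads to the reference genome with bwa.
# All paths below are illustrative placeholders.
rule align:
    input:
        ref="reference/genome.fa",
        reads="data/{sample}.fastq"
    output:
        "aligned/{sample}.sam"
    shell:
        "bwa mem {input.ref} {input.reads} > {output}"

snakemake chains such rules automatically: when the output of one rule matches the input of another, they are executed in the right order.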

With the help of the pipeline, we analyzed data for one family, some of whose members had been diagnosed with cardiomyopathy (in the figure they are outlined in red). Variants were found (that is, deviations from the reference genome) which, according to medical databases, occur in people with this disease (in the figure they are marked in blue and green).

What conclusions can be drawn from all this? As expected, the pipeline by itself cannot deliver a diagnosis. It can only provide information on the basis of which a doctor decides whether the person is at risk or not. This ambiguity comes from the fact that cardiomyopathy is a complex disease that depends on both genetic and external factors. The biochemical mechanisms of its development are not known (it is all complicated), so it is impossible to say exactly which sets of variants will lead to the disease. All that exists is statistics on the sick and the healthy, which allow a doctor to assess the probability of the disease and, if necessary, start treatment in time.
We also attempted to assess the quality of the pipeline's work. As mentioned above, the pipeline finds variants - "single-letter" deviations of a person's DNA sequence from the reference genome - and then analyzes them and looks them up in biomedical databases. The most ambiguous stage, the one that requires fine-tuning, is finding these variants. Here we need to strike a balance between redundancy - when there are too many variants, most of which are garbage - and insufficiency - when variants are selected so strictly that we lose the necessary information. The quality testing therefore came down to checking how well the pipeline finds variants in data for which we know the "correct answer". For this data we took Genome in a Bottle - a human genome read as accurately as possible, for which reliable variant data exist. The quality check showed an 85% match, which is pretty good.
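For intuition, such a comparison can be done, for example, with bcftools (a sketch under the assumption that both VCF files are bgzip-compressed and indexed; we are not claiming this is the exact tool used in the pipeline):

# Intersect our call set with the Genome in a Bottle truth set.
# Writes variants private to each file and variants shared by both
# into the comparison/ directory.
$ bcftools isec -p comparison/ giab_truth.vcf.gz our_calls.vcf.gz

The match rate can then be computed from the sizes of the shared and private sets.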
Part 2. Using Docker
If the main idea of this article were expressed in one sentence, it would be: "Use Docker in your pipelines, it is much more convenient." Indeed, what problems usually come up with pipelines? If the pipeline lives on your work computer, you can inadvertently change the environment or the dependencies of the programs it uses, and automatic updates are possible - all of which can make the pipeline compute slightly differently than before or start producing errors. Deploying the pipeline on a new computer can also be problematic: you have to install all the programs and, again, keep track of versions and dependencies and take the operating system into account. With Docker none of these problems arise, and to run the pipeline on a new computer you do not have to install anything at all (except Docker).
The idea behind Docker is that each program used by the pipeline runs in an isolated container, in which the developer sets up the necessary dependencies and environment. Everything needed is described in a corresponding Dockerfile; the docker build command then builds a container image that can be uploaded to dockerhub. When someone (the pipeline or another user) wants to use this program with these dependencies, they simply download the desired image from dockerhub, and the docker create command creates the necessary container on their computer.
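On the consumer side this boils down to a couple of commands (username/imagename and mycontainer are hypothetical placeholders):

$ docker pull username/imagename                        # download the image from dockerhub
$ docker create --name mycontainer username/imagename   # create a container from the image
$ docker start mycontainer                              # start the container

In practice, docker run combines the last two steps and pulls the image automatically if it is missing locally.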

Our Docker pipeline is available on github. Every time it calls a program, the pipeline starts the corresponding container, passes it the necessary parameters, and the computation runs. In fact, all the programmer's work came down to writing a Dockerfile for each container. It specifies the base image (FROM), which commands to execute in that image (RUN), which files to add (ADD); you can also specify the working directory (WORKDIR), into which the folder with the data needed for the computation is mounted when the container starts. An image is built from the Dockerfile:
$ docker build -t imagename .
And it is pushed to any registry, for example, dockerhub.com.
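Publishing might look like this (username is a hypothetical account name; docker push requires a prior docker login):

$ docker tag imagename username/imagename    # name the image under your account
$ docker push username/imagename             # upload it to the registry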
Below we describe some typical cases from our pipeline. You can read more about the Dockerfile syntax on the official website.
You need to run a standard program installed from the repositories, for example picard-tools. The Dockerfile will be:
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y picard-tools \
    && mkdir /root/source
WORKDIR /root/source
$ docker run -it --rm -v $(pwd):/root/source picard-tools picard-tools MarkDuplicates INPUT={input} OUTPUT={output[0]} METRICS_FILE={output[1]}
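Inside the pipeline, such a call sits in the shell section of a snakemake rule. A rough sketch of how the picard-tools step above could be wired in (the rule name and file paths are hypothetical):

# -it is omitted because snakemake runs commands non-interactively
rule mark_duplicates:
    input:
        "aligned/{sample}.sorted.bam"
    output:
        "dedup/{sample}.bam",
        "dedup/{sample}.metrics.txt"
    shell:
        "docker run --rm -v $(pwd):/root/source picard-tools "
        "picard-tools MarkDuplicates INPUT={input} "
        "OUTPUT={output[0]} METRICS_FILE={output[1]}"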
You need to run your own shell script, annotation_parser.sh, which parses a file. For this you can use the stock ubuntu image:
$ docker run -it --rm -v $(pwd):/root/source -w="/root/source" ubuntu:16.04 /bin/sh scripts/annotation_parser.sh {input} {output}
You need to run a program that is not in the standard repositories (in our case, annovar). The Dockerfile:
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y perl wget \
    && mkdir /root/source
ADD annovar /root/annovar
ENV PATH="/root/annovar:${PATH}"
WORKDIR "/root/source"
$ docker run -it --rm -v $(pwd):/root/source annovar table_annovar.pl {input} reference/humandb/ -buildver hg38 -out {params.name} -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation gx,r,f,f,f -nastring . -vcfinput
You need to run a Java application. Here we use ENTRYPOINT (https://docs.docker.com/engine/reference/builder/#entrypoint), which lets us run the container as if it were an executable file.
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y default-jre \
    && mkdir /root/source
ADD GenomeAnalysisTK.jar /root/GenomeAnalysisTK.jar
WORKDIR "/root/source"
ENTRYPOINT ["/usr/bin/java", "-jar", "/root/GenomeAnalysisTK.jar"]
$ docker run -it --rm -v $(pwd):/root/source gatk -R {input[0]} -T HaplotypeCaller -I {input[1]} -o {output}