Practical bioinformatics

I ran into a severe shortage of information on bioinformatics in the Russian-language segment of the internet. Whether or not that impression is accurate, I want to give the reader the kind of introduction, call it practical bioinformatics, that I badly lacked when I was getting to know the subject. In this post I want to describe the path I had to travel to reach the point where I no longer flinch at requests like "here's a FASTQ file, build me a bedGraph for the genome browser". So that we can go on to talk about the interesting things later, I will first skim through the definitions and the primary data processing programs, without which it is hard to speak the same language.

First, some definitions. We treat each chromosome as a one-dimensional coordinate axis running from 1 to roughly 10^8; the length of the axis is the length of the chromosome, and every point on it is an integer. Through a large number of experiments, biologists and chemists have been able to describe parts of the chromosomes with good accuracy (about 90%). These descriptions are called annotations. An annotation contains the lengths of the chromosomes and the coordinates of individual chromosome regions, the best known of which are genes, introns and exons. There is a huge number of such regions, but their main property is that each one is a segment, or a set of segments, on the coordinate axis. Some segments may contain others or overlap them in some way. Annotations for the human, mouse and other genomes can be viewed on a number of websites.
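To make the "segments on an axis" picture concrete, here is a minimal Python sketch. The `Feature` type, the 1-based inclusive convention and all coordinates are my own assumptions for illustration; real annotation formats differ in their conventions.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """An annotated segment on one chromosome (1-based, inclusive ends)."""
    chrom: str
    start: int
    end: int
    kind: str  # e.g. "gene", "exon", "intron"


def overlaps(a: Feature, b: Feature) -> bool:
    """True if the two segments share at least one coordinate."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def contains(outer: Feature, inner: Feature) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer.chrom == inner.chrom
            and outer.start <= inner.start
            and inner.end <= outer.end)


# Made-up coordinates, for illustration only.
gene = Feature("chr1", 1000, 5000, "gene")
exon = Feature("chr1", 1000, 1200, "exon")
other = Feature("chr1", 4900, 6000, "gene")

print(contains(gene, exon))   # True
print(overlaps(gene, other))  # True
print(contains(gene, other))  # False
```

With millions of segments a pairwise check like this becomes too slow, which is where sorted lists or interval trees come in.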

Biologists and chemists run their experiments, and as a result of their manipulations on cells they obtain a solution containing relatively small fragments of DNA or RNA (I don't want to go into how the two differ or coincide; for our purposes each is just a sequence of nucleotides). This solution is run through a sequencing machine, whose output is a set of short strings. These strings are the ends of the DNA or RNA fragments that were in the solution. The strings are typically only 36-50 bases long (a base is the unit of length, one nucleotide); sometimes they are longer, but at the time of writing apparently no longer than 200. These fragments, produced by the sequencer and defined by their nucleotide sequence, are called reads. Note that a read is characterized only by its nucleotide sequence, not by its location in the genome. Sometimes each sequence is accompanied by a string of quality values that gives, for every position, the probability that the nucleotide reported there is correct. A FASTA file is a file without these probabilities; a FASTQ file includes them.
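As a rough illustration of what a FASTQ record looks like and how the quality string relates to probabilities, here is a small Python sketch. It assumes the simple four-lines-per-record layout and the Phred+33 encoding (some older Illumina files use an offset of 64); the function names and the sample record are made up.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the "+" separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading "@"


def error_probabilities(qual: str, offset: int = 33) -> List[float]:
    """Convert a quality string to per-base error probabilities.

    Assumes Phred encoding: Q = ord(char) - offset, P(error) = 10 ** (-Q / 10).
    """
    return [10 ** (-(ord(c) - offset) / 10) for c in qual]


# A made-up record: four confident bases followed by an uncalled N.
sample = io.StringIO("@read1\nACGTN\n+\nIIII!\n")
for name, seq, qual in read_fastq(sample):
    print(name, seq, error_probabilities(qual))
```

Strictly speaking the stored value is the probability that the base call is wrong, so the probability that the nucleotide is correct is one minus that number.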

Next, depending on whether the experiment produced fragments of DNA or of RNA, one of two sequencing methods is used: ChIP-seq or RNA-seq, respectively. See http://en.wikipedia.org/wiki/DNA_sequencing for more details.

Once the expensive sequencing machines have finished and written their results to a FASTA/FASTQ file, the resulting sequences have to be located in the genome. For ChIP-seq, the bowtie program performs this search very quickly, placing millions of reads in about 5 minutes. In other words, it searches for occurrences of a 36-50 character string, over an alphabet of at least four letters, in strings with a total length on the order of 10^9. Why "at least": besides the standard A/T/G/C alphabet, the symbol N is often used to stand for any possible letter (see http://genome.ucsc.edu/FAQ/FAQdownloads.html#download5 for details). The program has many parameters. For example, you can allow one or two mismatches in a read (up to a maximum of 3). It can report not just one occurrence of a read in the genome but many. It can also filter the results in clever ways: for example, if a read is found at a single place in the genome with one mismatch, and at many places with two or three mismatches, only the first result is reported. This process of locating reads in the genome is called mapping (also alignment). The algorithm behind such a fast search is very interesting, but it deserves a separate article; alternatively, the developers' website links to an English-language paper that explains how the algorithm behind bzip2 (the Burrows-Wheeler transform) led them to this solution. There are many mapping programs, as well as sites that do the mapping online; searching Google for the keywords blast, eland + genome will turn up more information.
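To show what "find the read with at most k mismatches and keep only the best hits" means, here is a deliberately naive Python sketch. It is not bowtie's algorithm (bowtie uses an index and is incomparably faster); the function names and the toy genome are assumptions, and an N is simply counted as a mismatch.

```python
from typing import List, Tuple


def hamming(a: str, b: str, limit: int) -> int:
    """Count mismatches between equal-length strings, stopping past `limit`."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                break
    return mismatches


def map_read(genome: str, read: str, max_mismatches: int = 2) -> List[Tuple[int, int]]:
    """Return (1-based position, mismatches) for every hit within the limit."""
    hits = []
    for i in range(len(genome) - len(read) + 1):
        d = hamming(genome[i:i + len(read)], read, max_mismatches)
        if d <= max_mismatches:
            hits.append((i + 1, d))
    return hits


def best_stratum(hits: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep only the hits with the fewest mismatches."""
    if not hits:
        return []
    best = min(d for _, d in hits)
    return [(pos, d) for pos, d in hits if d == best]


genome = "ACGTACGTTTGACCA"  # toy "genome"
hits = map_read(genome, "ACGT", max_mismatches=2)
print(hits)                # [(1, 0), (5, 0), (12, 2)]
print(best_stratum(hits))  # [(1, 0), (5, 0)]
```

This scan costs time proportional to genome length times read length for every read, which is exactly why real mappers build an index of the genome first.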
For RNA-seq the procedure is a bit more complicated: the reads are first mapped in the same way as for ChIP-seq, and then the reads that were not found at that stage are processed further. A good program that relies heavily on bowtie for this is tophat. Because of splicing and the isoforms it produces, parts of a single read may lie in different places in the genome. For example, the first 15 characters may fall in one region of the genome and the remaining 11 in another. The place where a read is divided in this way, between the end of one exon and the beginning of the next, is called a splice junction. Tophat finds such junctions and also identifies possible new isoforms of known genes.
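The idea of a read split across two exons can be sketched in a few lines of Python. This is only a toy illustration with exact matching and made-up sequences, not how tophat works internally; `spliced_hits`, `min_anchor` and `min_intron` are names I invented for the example.

```python
from typing import List, Tuple


def find_all(text: str, pattern: str) -> List[int]:
    """Return every 0-based start position of `pattern` in `text`."""
    positions, start = [], text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def spliced_hits(genome: str, read: str, min_anchor: int = 3,
                 min_intron: int = 1) -> List[Tuple[int, int, int]]:
    """Find ways to place `read` as two exact pieces separated by a gap.

    Returns (left_start, right_start, split) tuples with 1-based genome
    positions; `split` is the number of read characters in the left piece.
    """
    hits = []
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        for l in find_all(genome, left):
            left_end = l + split  # 0-based, exclusive
            for r in find_all(genome, right):
                if r - left_end >= min_intron:
                    hits.append((l + 1, r + 1, split))
    return hits


# Toy genome: exon, intron, exon (sequences are invented).
exon1, intron, exon2 = "CCAGTCAC", "GTAAATTTAG", "CTGGACTA"
genome = exon1 + intron + exon2
read = exon1[-4:] + exon2[:4]  # spans the junction: "TCACCTGG"

print(read in genome)              # False: no contiguous match
print(spliced_hits(genome, read))  # [(5, 19, 4)]
```

In real data the split point is often ambiguous by a base or two when the end of an exon resembles the end of the intron, which is one reason real tools also use known splice-site signals.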

The output of these programs is a SAM/BAM file, which contains the information from the FASTA/FASTQ file plus the coordinates of each read on the corresponding strand of the chromosome. The path from the library to the SAM/BAM file is often called the pipeline, and in many laboratories it runs as a routine, automated process, so it is worth asking which parameters and software versions are used by default. This is essentially where the introductory part ends and the research begins. Note that from this point on we have data that, with some degree of confidence, share one coordinate system: the coordinate axis, the annotations, and the reads with their coordinates.
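For orientation, here is a minimal Python sketch that pulls the coordinate and strand out of one line of a text SAM file. It relies only on the mandatory tab-separated columns and on two flag bits (0x4 for unmapped, 0x10 for reverse strand); the `Alignment` type and the sample line are invented, and for real work a dedicated library is a better choice.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    name: str
    chrom: str
    pos: int     # 1-based leftmost mapped position
    strand: str  # "+" or "-"
    mapq: int
    cigar: str
    seq: str
    qual: str


def parse_sam_line(line: str) -> Optional[Alignment]:
    """Parse one SAM alignment line; return None for headers and unmapped reads."""
    if line.startswith("@"):  # header line
        return None
    f = line.rstrip("\n").split("\t")
    flag = int(f[1])
    if flag & 0x4:  # 0x4: read is unmapped
        return None
    strand = "-" if flag & 0x10 else "+"  # 0x10: mapped to the reverse strand
    return Alignment(f[0], f[2], int(f[3]), strand, int(f[4]), f[5], f[9], f[10])


# A made-up alignment line with the 11 mandatory columns.
line = "read1\t16\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n"
print(parse_sam_line(line))
```

BAM is the compressed binary form of the same records, so it cannot be read line by line like this.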

Now the data analysis can begin, ranging from simple checks of uniformity, continuity and complexity to elaborate statistical calculations that help divide the reads into groups: noise, background level and enrichment. This analysis is needed so that particular data can later be discarded on justified grounds.
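One of the simplest first looks at mapped data is read coverage, which is also what a bedGraph track for a genome browser shows. The Python sketch below assumes reads given as (chromosome, 1-based start, length) and ignores strand and gaps in the alignment; the function name is my own. bedGraph itself uses 0-based, half-open intervals.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def coverage_bedgraph(reads: Iterable[Tuple[str, int, int]]) -> List[str]:
    """Turn mapped reads into bedGraph lines (chrom, start, end, depth).

    `reads` holds (chrom, start, length) with 1-based start positions.
    """
    # +1 where a read starts, -1 just past where it ends.
    deltas: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for chrom, start, length in reads:
        deltas[chrom][start - 1] += 1           # 0-based start
        deltas[chrom][start - 1 + length] -= 1  # 0-based exclusive end

    lines = []
    for chrom in sorted(deltas):
        depth, prev = 0, None
        for pos in sorted(deltas[chrom]):
            # Emit the stretch since the previous breakpoint if it was covered.
            if prev is not None and depth > 0 and pos > prev:
                lines.append(f"{chrom}\t{prev}\t{pos}\t{depth}")
            depth += deltas[chrom][pos]
            prev = pos
    return lines


# Two overlapping made-up reads of length 4.
for row in coverage_bedgraph([("chr1", 1, 4), ("chr1", 3, 4)]):
    print(row)
```

For the two reads above this prints three intervals with depths 1, 2 and 1.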

If the introductory part caught your interest, I can go into more detail on each stage. Unfortunately, even this simple introduction already takes up several pages of dry text without pictures, so I did not venture to describe the programs and mechanisms I have written in this post. What attracts me most is the paragraph above, where I briefly mentioned statistics. I would like to cover the existing libraries and mechanisms for working with such data. The data-mining methods described on Habr (various kinds of clustering) could also be brought in here. How can the Poisson distribution be applied to analyze data that has no control? How can a more complex chain of Poisson tests and F-tests be used to find regions of the genome enriched in reads (a Dirac delta function)? Are ready-made interval libraries such as boost.intervals and boost.icl useful?
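As one possible starting point for the Poisson question, here is a small Python sketch that flags windows whose read count is improbably high under a uniform background. The window counts, the threshold and the choice of the mean count as the background rate are all assumptions made for illustration, and no multiple-testing correction is applied.

```python
import math
from typing import List, Sequence


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def enriched_windows(counts: Sequence[int], alpha: float = 1e-3) -> List[int]:
    """Return indices of windows whose count is unlikely under the background.

    The background rate is taken to be the mean count per window.
    """
    lam = sum(counts) / len(counts)
    return [i for i, c in enumerate(counts) if poisson_sf(c, lam) < alpha]


# Made-up read counts in ten equal windows; window 4 stands out.
counts = [2, 3, 1, 2, 30, 2, 3, 1, 2, 4]
print(enriched_windows(counts))  # [4]
```

Note that the enriched window itself inflates the estimated background rate, and that computing the tail as one minus a sum loses precision for very small probabilities; both are reasons why this is a sketch and not a method.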

And of course, if this topic is of interest, perhaps someone will suggest how and where to dig further, or add something of their own. Perhaps someone will even write their own article. At this stage, solving biological problems without mathematics and programming is definitely impossible. There are English-language resources where similar questions are discussed, such as www.seqanswers.com . In the future, though, I would like to move away from describing finished products and toward discussing whether the mathematical and statistical methods used in these programs are justified, and whether new methods could be applied.

At the current stage of our work we have been trying to find parameters by which reads can be filtered into those that are interesting for research and those that are noise. Without a control, the task is far from trivial. It has been decided that in the future a control will be added to the sequencing library, which will make it possible to measure the error level, but there will be no getting away from statistics in any case.

Source: https://habr.com/ru/post/137069/
