Genome Browsers

Visualization plays an important role in bioinformatics. Scientists in this field work with huge amounts of information, and it helps to be able to take it in at a glance and build a picture of it in your head. A vivid example of a visualization tool is the genome browser, which is what I want to talk about.

As many people remember from school biology, a genome consists of a set of chromosomes, and a chromosome is a DNA molecule made of two strands twisted into a helix. Each strand is a sequence of nucleotides carrying one of four nitrogenous bases: adenine (A), guanine (G), cytosine (C) and thymine (T). Given one strand, it is easy to reconstruct the other if you remember that adenine pairs with thymine (the mnemonic is "Antoshka-Timoshka") and guanine pairs with cytosine ("goose-chick", which works by first letters in the Russian original). Some stretches of DNA are called genes: RNA is transcribed from them, and proteins are then synthesized from that RNA. Proteins are built from 20 types of amino acids (plus a couple of exotic ones), each of which is encoded by three nucleotides.
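The pairing rule can be sketched in a few lines of Python. This is only an illustration; the function names are made up for this example. It also uses the fact that the two strands run in opposite directions, so the second strand is conventionally written reversed.

```python
# Pairing rule from the text: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the paired bases in the same left-to-right order."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand written in its own reading direction."""
    return complement(strand)[::-1]


print(complement("ATGGCC"))          # TACCGG
print(reverse_complement("ATGGCC"))  # GGCCAT
```

Taking the reverse complement twice gives back the original strand, which is a handy sanity check.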

A genome browser is a one-dimensional map that displays a nucleotide sequence (say, a chromosome or a single gene) together with accompanying information. The information is usually organized into blocks called tracks. For example, there may be a track with genes or a track with individual nucleotides. Individual entities on a track are often called features.
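To make the terms concrete, here is a minimal sketch of how tracks and features might be modeled. The class names, the fields and the half-open coordinate convention are assumptions made for this example, not the data model of any particular browser, and the genes are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    name: str
    start: int         # 0-based, inclusive (an assumed convention)
    end: int           # exclusive
    strand: str = "+"  # "+" for the forward strand, "-" for the reverse


@dataclass
class Track:
    title: str
    features: list[Feature] = field(default_factory=list)

    def visible(self, view_start: int, view_end: int) -> list[Feature]:
        """Features overlapping the half-open window [view_start, view_end)."""
        return [f for f in self.features
                if f.start < view_end and f.end > view_start]


# Hypothetical genes, purely for illustration.
genes = Track("Genes", [
    Feature("geneA", 100, 500, "+"),
    Feature("geneB", 450, 900, "-"),
    Feature("geneC", 2000, 2600, "+"),
])
print([f.name for f in genes.visible(400, 1000)])  # ['geneA', 'geneB']
```

A real browser would use an index (for example, an interval tree or binning) instead of a linear scan, since a track can hold millions of features.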

There are genome browsers tailored to small bacterial genomes, but a universal browser has to show both the long chromosomes of vertebrates and individual nucleotides. The longest human chromosome (chromosome 1) contains about 250 million base pairs, so the scale has to change by a factor of about a million. Naturally, information is displayed differently at different scales. For example, in the picture above there is a UCSC Genes track showing the whole SOD1 gene and fragments of neighboring genes. At this scale the exon-intron structure of the gene is displayed. Exons (the parts that remain in the RNA after splicing and eventually encode the protein) are drawn as filled rectangles, and introns (the gaps between exons) are marked with arrows that show the direction in which the gene is read (here SOD1 lies on the forward DNA strand and BC041449 on the reverse strand). And this is what a piece of the SOD1 gene looks like when zoomed in:
UCSC genome browser; SOD1 gene

Here the scale is fine enough to show the amino acid sequence for the parts of the gene that encode the protein. Each amino acid is denoted by a letter of the Latin alphabet.
What else can be seen in a genome browser? At the most detailed scale you can see individual nucleotides on both the forward and the reverse DNA strand:

Each nucleotide has a standard color, so you can still enjoy the coloring even when the letters themselves no longer fit:
Ensembl genome browser

If you zoom out a little further, you can judge the nucleotide composition from a special GC content track:
BioUML genome browser - GC content

Here red means that G and C nucleotides make up less than 50% at that position, and blue means more. One might think that A, C, G and T are just four equivalent states of a two-bit cell encoding genetic information, and that the proportion of G and C tells us nothing interesting. However, a GC base pair forms three hydrogen bonds, while an AT pair forms only two. GC pairs are therefore stronger and harder to break, and enrichment in GC or AT pairs affects the chemical processes in that region of DNA.
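A GC content track of this kind can be approximated with a simple windowed count. The sketch below is an assumption about how such a track might be computed, not the algorithm of any specific browser; the window size and the handling of exactly 50% (treated as blue here) are arbitrary choices for the example.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_track(seq: str, window: int) -> list[float]:
    """GC fraction for consecutive non-overlapping windows."""
    return [gc_fraction(seq[i:i + window])
            for i in range(0, len(seq), window)]


def color(fraction: float) -> str:
    # Color convention from the text: red below 50%, blue otherwise.
    return "red" if fraction < 0.5 else "blue"


values = gc_track("ATATATGCGCGGATTA", 4)
print(values)                      # [0.0, 0.5, 1.0, 0.0]
print([color(v) for v in values])  # ['red', 'blue', 'blue', 'red']
```

When zoomed out, each window would typically correspond to one pixel column, so the window size grows with the scale.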

What else can you see? There are usually tracks with genomic variations that, for example, distinguish different people from one another. Variations often take the form of point mutations, that is, single-nucleotide substitutions (single-nucleotide polymorphisms, SNPs). Many of these were found by comparing the sequenced genomes of different people and are stored in dedicated databases (for example, dbSNP):

The fragment shown contains quite a few SNPs (19 in 356 nucleotides, more than 5%). However, many of them are synonymous. Since the 4³ = 64 possible nucleotide triplets encode only 20 amino acids, some substitutions do not change the resulting protein. Other substitutions fall into non-coding regions (for example, introns), so they may not affect the result either (although they can).
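The redundancy of the code is easy to check programmatically. The sketch below builds the standard genetic code from its conventional compact form (bases in TCAG order, `*` for stop codons) and tests whether a single substitution within a codon is synonymous. The helper name `is_synonymous` is invented for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def is_synonymous(codon: str, position: int, new_base: str) -> bool:
    """True if replacing codon[position] with new_base keeps the amino acid."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    return CODON_TABLE[codon] == CODON_TABLE[mutated]


print(len(CODON_TABLE))                       # 64
print(len(set(CODON_TABLE.values()) - {"*"}))  # 20
print(is_synonymous("GGT", 2, "C"))           # True: both encode glycine
print(is_synonymous("GAT", 2, "A"))           # False: aspartate -> glutamate
```

Substitutions in the third position of a codon are the ones most often synonymous, as the glycine example shows.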

Another interesting feature is the comparison of the human genome with the genomes of other species. Non-trivial algorithms compute a multiple alignment of the genomes, and the browser displays it. The topmost image of the post shows a schematic alignment of human with rhesus macaque, mouse, dog, elephant, opossum, chicken, frog (Xenopus tropicalis) and zebrafish. Matching fragments are shown in dark. Notice that the darkest regions coincide with the coding regions of the genes. The same picture has a plot of site conservation among mammals (Mammal Cons), which correlates with this as well. And here is the multiple alignment zoomed in:

A dash means that human has a nucleotide at that position and the other species does not. An orange vertical line (for example, in the dog row between two thymines) means the opposite: the other species has nucleotides that human lacks. Their number is indicated above (the nucleotides themselves are not listed). The coding region is shown in amino acid form, so synonymous substitutions are not visible. Chicken and fish apparently have no similar region. You can also see how similar the macaque is to human.

At the most zoomed-out scale, the karyotype of the chromosome becomes visible:

You can use the karyotype to get your bearings if you remember, for example, which band your favorite gene lies in. The crossing in the middle is the centromere.

There are many other predefined tracks. Some browsers can load tracks from the web using a dedicated protocol, DAS. And, of course, genome browsers let scientists add their own tracks (there are special file formats for this). User tracks can show, for example, DNA regions bound by a specific protein (such as a transcription factor), either predicted or obtained experimentally (for example, with ChIP-Seq). If you have, say, sequenced your own genome, you can upload the result and compare it with the reference and with known SNPs.
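As an illustration of such a file format, here is a sketch that writes a few binding sites in BED, one of the tab-separated formats accepted for custom tracks (for example, by the UCSC Genome Browser). The text above does not name a specific format, so choosing BED is an assumption, and the track name, coordinates and scores are invented.

```python
def to_bed(track_name: str, sites: list[tuple]) -> str:
    """Format sites as BED text with a leading track line.

    Each site is (chrom, start, end, name, score, strand);
    BED coordinates are 0-based with an exclusive end.
    """
    lines = [f'track name="{track_name}" description="Example binding sites"']
    for chrom, start, end, name, score, strand in sites:
        lines.append("\t".join(
            [chrom, str(start), str(end), name, str(score), strand]))
    return "\n".join(lines) + "\n"


# Hypothetical transcription factor binding sites.
sites = [
    ("chr21", 1000, 1020, "site1", 900, "+"),
    ("chr21", 5000, 5018, "site2", 450, "-"),
]
bed_text = to_bed("my_tf", sites)
print(bed_text)
```

The resulting text can be saved to a file and uploaded as a custom track in a browser that accepts BED.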

There are plenty of genome browsers. Wikipedia alone lists about thirty, and that is definitely not all of them. Many are specialized: tailored to a particular organism or a particular type of data. Popular desktop browsers include Integrated Genome Browser and Integrative Genomics Viewer (as you can see, nobody put much effort into the names). Both are Java applications and can be launched via Java Web Start.

Of course, it is often more convenient to use a genome browser on the web. Most of the pictures above were taken in the UCSC Genome Browser and the Ensembl Genome Browser. Both generate images on the server. There are more modern technical solutions too. AnnoJ, for example, renders images on the client in a canvas, receiving JSON data from the server (there is a demo for Arabidopsis, biologists' favorite weed). There is also JBrowse. It is unique in a way because it has no server-side code. Track and sequence data are prepared in advance on the server as static files, which the browser loads via AJAX. User files are processed through the File API.

The perfect genome browser does not exist. In my opinion the main problem is speed. It is especially noticeable on the web, although desktop browsers have delays too. Some tracks at certain scales are either generated very slowly or switched off altogether. Visualization requires crunching a lot of information, which is perhaps not always stored in an optimal form. So if anyone feels like taking this on, there is every chance of beating the competition.

Source: https://habr.com/ru/post/170429/
