The best-known database of sequenced genomes, NCBI, contains a large number of systematic errors. Because of this, the data are practically unusable as they stand, and it is all the more impossible to study the mechanism of mutations (and, consequently, evolution) with them: you end up investigating human errors made during sequencing rather than natural mutations. Before the data can be used, the database has to be cleaned up.
This is a time-consuming task, and it cannot be done for just the one organism you happen to need. So I would like to find people willing to build their own Russian-language resource similar to NCBI, but with corrected information.
This article shows how widespread the errors in the NCBI genomes are, how to verify this yourself, and some ways to correct them.
Where are the genomes
All sequenced genomes are on the FTP server at ftp.ncbi.nih.gov/genomes/. The bacterial genomes are under ftp.ncbi.nih.gov/genomes/Bacteria/, and they are a good place to start.
We need the file all.fna.tar.gz, which contains the genomes of about 2000 bacteria. What is a genome? It is a chain of DNA: the letters A, T, C, G. Download and unpack the archive, and you get a set of directories named after species in Latin. Each directory usually contains several NC_###### files, and each file holds a separate so-called locus: one DNA molecule (a chromosome or a plasmid).
For simplicity, we will work with RNA, since proteins are somewhat more complicated to process. For this we need two more files:
1. all.rnt.tar.gz - contains the list and location (start, end, direction) of all RNAs at a specific locus
2. all.frn.tar.gz - contains the RNA sequences themselves, cut out of the DNA (don't be surprised: although this is RNA, T is not replaced with U here, because this is the DNA code from which the RNA will be transcribed)
How to perform preprocessing
These files are not very convenient to process as they are. A .fna file has a comment in the first line, followed by the DNA code in lines of 70 characters each. This is unsuitable for searching, so the comment has to be deleted and the lines joined into a single line with no breaks. The file processed this way gets the extension .fna.txt.
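The flattening step can be sketched in Python roughly as follows. This is a minimal illustration, not the author's actual tool: the function names `flatten_fasta` and `write_flat_copy` are made up for this example, and it assumes the comment line starts with `>` as in ordinary FASTA files.

```python
from pathlib import Path


def flatten_fasta(text: str) -> str:
    """Drop the '>' comment line(s) and join the sequence lines into one string."""
    return "".join(
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.startswith(">")
    )


def write_flat_copy(fna_path: Path) -> Path:
    """Write the flattened sequence next to the source file as <name>.fna.txt."""
    out_path = fna_path.with_name(fna_path.name + ".txt")
    out_path.write_text(flatten_fasta(fna_path.read_text()))
    return out_path


# Tiny made-up record, only to show the effect of flattening.
sample = ">hypothetical header line\nACGTAC\nGGTTAA\n"
print(flatten_fasta(sample))  # ACGTACGGTTAA
```

Reading a whole file into memory is acceptable for bacterial genomes, which are only a few megabytes each.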
There is another nuance: RNA can be transcribed from left to right or from right to left. Since DNA is a double helix, right to left means transcription from one strand of the DNA, and left to right means transcription from the other, complementary strand.
This means that to find, for example, an RNA for which the .rnt file (from all.rnt.tar.gz) gives a negative direction, searching the .fna.txt file is useless: we will find nothing. We need to create a reverse file (give it the extension .fna_.txt). To build it, we read the .fna.txt file letter by letter from the end and make the complementary replacements:
T = A; G = C; A = T; C = G; M = K; R = Y; W = W; S = S; Y = R; K = M; V = B; H = D; D = H; B = V
The first four are well known. The rest are rather unexpected :) They are rare, but they do occur. These codes are used when a base could not be determined unambiguously during sequencing: for example, if G cannot be distinguished from A, R is written, and so on.
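As a sketch of how the reverse file could be produced, the replacement table above maps directly onto Python's `str.maketrans`. The name `reverse_complement` is chosen here for illustration. Characters outside the table (such as N, which is its own complement) are passed through unchanged, which is an assumption of this example rather than something the procedure above specifies.

```python
# Complement table copied from the list above:
# T=A; G=C; A=T; C=G; M=K; R=Y; W=W; S=S; Y=R; K=M; V=B; H=D; D=H; B=V
COMPLEMENT = str.maketrans(
    "TGACMRWSYKVHDB",
    "ACTGKYWSRMBDHV",
)


def reverse_complement(seq: str) -> str:
    """Complement every letter, then read the result from the end."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCR"))  # YGCAT
```

Applying the function twice returns the original sequence, which is a cheap sanity check on the table.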
Finding errors, or cross-analysis
I call this process cross-analysis. The idea is simple: from the .frn file of interest, take the sequence of one RNA and look for matches across the entire set of .fna.txt and .fna_.txt files.
How many matches do you think you will get? Plenty. It may turn out that the .rnt file has no entry corresponding to a match. Most often, the start and end given in the .rnt file are not quite right but shifted by 1 or 3 positions. Even the direction may be different. I have also come across more serious errors, where the record says Ile tRNA but the sequence is in fact Met tRNA.
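A minimal sketch of the search step for one locus might look like this. The names `find_all` and `cross_search` are hypothetical, and the sketch assumes that `forward` holds the content of a .fna.txt file and `reverse` the content of the matching .fna_.txt file. Hits are reported as 1-based, inclusive positions on the forward strand, so they can be compared directly with the start and end listed in the .rnt file.

```python
def find_all(haystack: str, needle: str) -> list[int]:
    """Return the 0-based offsets of all (possibly overlapping) occurrences."""
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits


def cross_search(rna: str, forward: str, reverse: str) -> list[tuple[int, int, str]]:
    """Find rna on both strands; return (start, end, strand) in forward coordinates."""
    n, m = len(forward), len(rna)
    hits = [(p + 1, p + m, "+") for p in find_all(forward, rna)]
    # Offset p in the reverse file corresponds to forward positions n-p-m+1 .. n-p.
    hits += [(n - p - m + 1, n - p, "-") for p in find_all(reverse, rna)]
    return hits


# Toy locus: the reverse string is the reverse complement of the forward one.
forward = "AAACCGTT"
reverse = "AACGGTTT"
print(cross_search("CGG", forward, reverse))  # [(4, 6, '-')]
```

An annotation whose start and end differ from every hit by a small constant offset is a candidate for the kind of shift described above.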
The error rate is almost 50%. I do not know how one can work with data this erroneous. Why the people at NCBI do not run such a simple cross-analysis, I do not know either.
But think how many erroneous conclusions biologists draw by trusting these data.
At the same time, once the errors are corrected, the same method makes it possible to run an experiment like the one described in "Interesting results about the evolutionary systematics of prokaryotes, or 'many-species origin'". It is worth noting separately how elementary the method is, and yet it yields exact facts that can tell us a great deal.
A simple example
Open the sequenced organism Chlamydophila pneumoniae TW-183 and search for the tag "CpBt08". The location given there is complement(266485..266557), which is the start and the end, respectively. There is also a link to GeneID: 3284349, and from there a FASTA link, which gives the sequence
CGGGGACTTAGCTTAGTTGGTAGAGCGTCTGATTTGCATTCAGAAGGTCAGGAGTTCGAATCTCCTAGTCTCC
This is wrong. It should actually be
GGGGACTTAGCTTAGTTGGTAGAGCGTCTGATTTGCATTCAGAAGGTCAGGAGTTCGAATCTCCTAGTCTCCA
(this sequence really is present in the full DNA sequence, it is just identified incorrectly in the annotation).
Put them one under the other:
CGGGGACTTAGCTTAGTTGGTAGAGCGTCTGATTTGCATTCAGAAGGTCAGGAGTTCGAATCTCCTAGTCTCC
GGGGACTTAGCTTAGTTGGTAGAGCGTCTGATTTGCATTCAGAAGGTCAGGAGTTCGAATCTCCTAGTCTCCA
and you can see that the difference is a shift of one position.
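The shift can also be detected programmatically. The sketch below uses a hypothetical helper, `find_shift`, which assumes both sequences have the same length and tries small offsets, smallest first. A positive result means the second sequence starts that many positions later than the first.

```python
def find_shift(a: str, b: str, max_shift: int = 3):
    """Return k if b equals a moved by k positions (on the overlap), else None."""
    for k in sorted(range(-max_shift, max_shift + 1), key=abs):
        if k >= 0:
            left, right = a[k:], b[: len(b) - k]
        else:
            left, right = a[: len(a) + k], b[-k:]
        if left and left == right:
            return k
    return None


annotated = "CGGGGACTTAGCTTAGTTGGTAGAGCGTCTGATTTGCATTCAGAAGGTCAGGAGTTCGAATCTCCTAGTCTCC"
corrected = "GGGGACTTAGCTTAGTTGGTAGAGCGTCTGATTTGCATTCAGAAGGTCAGGAGTTCGAATCTCCTAGTCTCCA"
print(find_shift(annotated, corrected))  # 1
```

For highly repetitive sequences more than one offset can match, so the result should be treated as a hint rather than a proof.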
Why?
Now go to another organism, Chlamydophila pneumoniae CWL029, and look for the tag CPnt08. In the same way we find the gene:
GGGGACTTAGCTTAGTTGGTAGAGCGTCTGATTTGCATTCAGAAGGTCAGGAGTTCGAATCTCCTAGTCTCCA
Do you think this is just a different sequence? No, it is the same one, only shifted. The question, of course, is which of the two is correct. The hardest part is deciding this automatically. You have to decide based on the error rate and some knowledge of what RNA sequences can look like. For tRNA, however, there is a more specific criterion: check that the anticodon at positions 34-36 matches, and that the sequence ends in CCA (which every tRNA should have).
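The two tRNA criteria can be sketched as simple string checks. The function names are hypothetical. Note one loud assumption: `naive_anticodon` treats positions 34-36 as plain string offsets, which matches the standard tRNA numbering only when the gene has no insertions or deletions before the anticodon loop, so a real check would need a proper structural alignment.

```python
def has_cca_end(trna: str) -> bool:
    """True if the sequence ends with the CCA tail at its 3' end."""
    return trna.upper().endswith("CCA")


def naive_anticodon(trna: str) -> str:
    """Letters at 1-based positions 34-36, taken as plain string offsets."""
    return trna[33:36]


annotated = "CGGGGACTTAGCTTAGTTGGTAGAGCGTCTGATTTGCATTCAGAAGGTCAGGAGTTCGAATCTCCTAGTCTCC"
corrected = "GGGGACTTAGCTTAGTTGGTAGAGCGTCTGATTTGCATTCAGAAGGTCAGGAGTTCGAATCTCCTAGTCTCCA"

for name, seq in (("annotated", annotated), ("corrected", corrected)):
    print(name, has_cca_end(seq), naive_anticodon(seq))
```

On the example above, only the corrected sequence passes the CCA test, which agrees with the manual comparison.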
P.S. If anyone is interested enough to try looking for errors, correcting them, or even running an experiment similar to mine on other data, please send me a private message.