This article is a continuation of two others:
“Interesting results about the evolutionary systematics of prokaryotes, or ‘many-species origins’”,
“Genomes of sequenced organisms - errors in the bases”.
After them, I had the honor of receiving feedback both from interested readers and from professionals in this field. As you could see, there was also a rather lively discussion. On the one hand, I would like to respond to the comments received.
On the other, I want to set up a new experiment, and I would like to involve those who are interested in such things. If you do not have time, maybe you have free CPU time :)?
NCBI position
Thanks to Kalobok, who works at NCBI, we managed to find out why “people from NCBI do not do such simple cross-analysis” (see “Genomes of sequenced organisms - errors in the bases”). I must say that the conversation with Kalobok was not very pleasant. At first, like many commenters, he tried to lecture me, hinting in every possible way that I understood nothing, was doing everything wrong, and so on.
Here are a few characteristic quotes from the correspondence:
“The essence of your claims ... is related to the incorrect use of data. ... there is nothing to correct. You just need to use the data correctly.”
“As for the anticodons, I already said that I do not remember the details, so I cannot comment. Judging by your level of knowledge, you do not have much reason to talk about this topic either. First, figure out whether it is possible to compare the genomes of different organisms at all. [If you] answer that it is not, there is nothing further to discuss. Try again to understand that NCBI does not employ dropouts, but professionals. And I am not inclined to suspect them of turning out data of which more than half is defective. Rather, I will assume that a layperson who has wandered into someone else's field is mistaken.”
“Bear in mind that NCBI is focused primarily on biologists. They are completely satisfied with the available data and tools. And a single non-biologist programmer with incomprehensible ideas makes no difference.”
“Here is the opinion of the head of the bacterial genomes group after reading the article: Yes, quite naive ... Such work has been going on for the past 20 years. And this is some kind of lone craftsman.”
And so on in that vein. Probably few people would keep their composure in a discussion conducted in such a tone. But alas, such are the modern manners of people who have received a diploma in biology (biophysics, biochemistry ...), have started to understand programming a bit, and now work in a respectable place.
How to survive in this evil world :)
What is a person to do who has no relevant diploma, but has knowledge in a narrow field that is not his own? Alas, he will always be treated accordingly, from “your post is teeming with the self-confidence of a gifted student” to remarks in an instructive, patronizing tone:
“But in general it's very interesting. Correct the language, add a literature review, describe the methods and results, remove the speculation from the discussion, and maybe something will come of it.”
“> and there they will only twirl a finger at their temple - They won't, it is normal work. If you add references, it could earn a diploma. But the discussion with the author shows that it would not have gone any further.”
But the main thing is to understand a few psychological points. A person with a diploma and a comfortable position becomes arrogant. Even without the relevant knowledge (not in the broad sense, but for this particular task), he allows himself to speak in a spirit of “superiority over the interlocutor”. As a rule, the discussion is not conducted on the substance. Instead, the interlocutor's weakest argument, or a deliberate speculation, is singled out. Then comes the advice “my son, go and read this, learn that”, which as a rule is irrelevant to the question. Strong arguments are ignored, and in the end it comes down to whether you have a diploma, who your father and mother are, and so on.
I have run into this repeatedly over the past ten years or so. My advice is simple: ignore it. Do not fall for provocations, and do not go and study what you are told to study; you do not need it. The first such case was at school, when I spoke about literary characters, for example about “The Master and Margarita” or about Natasha Rostova from “War and Peace”. I was asked how I could speak negatively about something I had not read. That time I gave in and read “The Master ...”, and then I was able to speak on the subject with full thoroughness. It was easier with Natasha (I only skimmed the novel): we wrote an essay, and I argued there convincingly that this girl does not deserve so much attention. The grade was excellent, with a comment along the lines of “everything is very well justified with quotations, but it may be worth looking at it from the other side, as a manifestation of the Russian soul ...”. Not worth it, I said then, and went off into adulthood :)
Over time, I came to begrudge the time spent on this kind of thing. Why literature, you ask? Because it is exactly the same story, and that is where it all starts: either you will be led, or you will decide for yourself.
And yet none of this means that such discussions should be avoided. Never allow yourself to behave like your opponent (although that is often not easy). Look for the grains of truth in his words. There are few of them, but if the opponent is discussing with you at all, he is already interested, and from time to time he lets slip something useful for you. Learn how to filter.
Anyway, sorry for this lyrical digression. Now to the substance.
How did things end with NCBI?
As expected, they admitted their mistakes, but did so while putting a brave face on it :)
“The data that you took from FTP is the original sequences and annotations sent by the researchers. They are not yet verified by the NCBI, and there may be quite a few errors there [they are marked as] NCBI review ... Even in verified data, such errors can occur, both because the checks are not 100% reliable and for historical reasons (for many of the old records, NCBI relied on the data submitters and made no additional checks, and those records are still there). Such data are periodically reviewed and corrected. One of the biologists clearly said that if for some reason he used raw data from the genome, he would simply correct them manually, but most likely he would use tRNAdb [this is another database, where there is less data, but it has been corrected].”
“Here, by the way, another colleague responded. He says that our standard data-checking program currently does not check the correctness of tRNA at all, because that turns out to be very expensive in computing power. They plan to write a separate program for this, but for now there are higher-priority tasks. So wait.”
So, digressions aside, a fact remains a fact. One can “pounce” on a non-biologist programmer for a long time, but more than 50% of the data from NCBI is not verified, and this is now a reliable and acknowledged fact. This should not be taken as a criticism of NCBI: they do a lot and hold a lot of good information, which is valuable even with errors. It is simply for the information of the biologists who told fairy tales in the comments to the previous articles.
It seems they are going to correct this data, but it is not a priority for them, since many of these errors go unnoticed, and those who do notice them correct them for themselves. And they will do the correcting themselves, because they do not trust lists of errors from others.
We will not wait for the fixes. But what can be done without them?
The main criticism of the article “Interesting results about the evolutionary systematics of prokaryotes, or ‘many-species origins’” was the claim that “one gene cannot be considered as a measure”. I fully agree with this, and the new experiments should fix it.
Some numbers. NCBI now holds about 2000 bacterial genomes. In preparation for the experiment, I extracted all the sequences annotated as tRNA. There turned out to be more than 40 thousand unique variants. But alas, there are many errors among them.
But I thought that the stage of full error correction could be skipped. How? I sorted these tRNAs by length and by the presence of CCA at the end of the sequence. It must be said that the CCA ending is obligatory for any mature tRNA, and the length can range from 74 to 96 nucleotides.
NCBI contains many wonders, from a “tRNA” one nucleotide long to ones longer than 1300 :) (you cannot mention it without a smile). Therefore, I removed the sequences shorter than 70 or longer than 100 nucleotides, as well as those that do not end in CCA.
About 20,000 sequences remain. These are the NCBI tRNAs most likely to be free of errors. The other half can be sorted out later.
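The filter described above can be sketched in a few lines of Python. This is only an illustration, not the code actually used for the experiment: the function names are hypothetical, and it assumes the bounds 70 and 100 are inclusive and that sequences are written in the DNA alphabet as plain strings.

```python
def looks_like_trna(seq, min_len=70, max_len=100, tail="CCA"):
    """Return True if the sequence passes the length and CCA-end filter.

    Assumption: the 70 and 100 bounds are inclusive.
    """
    s = seq.strip().upper()
    return min_len <= len(s) <= max_len and s.endswith(tail)


def filter_candidates(seqs):
    """Deduplicate sequences, keep plausible tRNAs, sort by length."""
    unique = {s.strip().upper() for s in seqs}
    kept = (s for s in unique if looks_like_trna(s))
    return sorted(kept, key=lambda s: (len(s), s))


if __name__ == "__main__":
    raw = ["A", "G" * 67 + "CCA", "G" * 70, "g" * 72 + "cca", "G" * 98 + "CCA"]
    for s in filter_candidates(raw):
        print(len(s), s[-3:])
```

Sorting by length first keeps the output easy to inspect by eye, which matches the "sort by length and CCA end" step in the text.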
In fact, for the planned experiment it makes no difference whether a particular sequence of 70-100 nucleotides contains errors or not. Why? Because I am going to check against the genomes of the 2000 bacteria whether such sequences really occur there, so the errors will be excluded. And whether it is actually a tRNA or not is a secondary matter. The main thing is that significant stretches of DNA coincide in different organisms. A match of a sequence 70-100 nucleotides long between genomes is far from accidental. Already at lengths above 10 the probability of a chance match approaches zero, and at 70-100 this is such a significant part of the genome that it cannot coincide in different organisms simply by chance.
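To get a feel for how quickly chance matches become implausible, here is a rough back-of-the-envelope sketch. It rests on strong simplifying assumptions that are not from the article: bases are treated as independent and equally likely, both DNA strands are counted, and the genome length of 4 million bases is just an illustrative value. Real genomes are not random, so this is only an order-of-magnitude guide.

```python
def expected_chance_hits(seq_len, genome_len, both_strands=True):
    """Expected number of exact chance occurrences of one fixed sequence
    in a random genome (uniform, independent bases: a simplification)."""
    positions = max(genome_len - seq_len + 1, 0)
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** seq_len


if __name__ == "__main__":
    genome_len = 4_000_000  # illustrative bacterial genome size (assumption)
    for n in (10, 15, 20, 70, 100):
        print(n, expected_chance_hits(n, genome_len))
```

Under this model the expected count falls by a factor of four with every extra nucleotide, so at 70-100 nucleotides it is vanishingly small.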
So what am I doing now? I take these 20,000 tRNAs and find which bacteria each of them is present in. If a sequence is present in only one organism, it is not interesting, and most likely it is an erroneous sequence. This eliminates a substantial percentage of the errors.
If a sequence is present in more than one organism, this is one association (link) between two organisms.
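A minimal sketch of this presence check is given below. It is not the actual program: the function names are hypothetical, genomes are assumed to fit in memory as plain strings keyed by organism name, matching is assumed to be exact, and searching the reverse-complement strand as well is my assumption rather than something the text states.

```python
from itertools import combinations

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def organisms_containing(seq, genomes):
    """Names of genomes containing seq exactly, on either strand (assumption)."""
    rc = reverse_complement(seq)
    return {name for name, dna in genomes.items() if seq in dna or rc in dna}


def build_associations(trnas, genomes):
    """Map each pair of organisms to the sequences they share.

    Sequences found in fewer than two organisms are dropped as
    uninteresting or probably erroneous.
    """
    links = {}
    for seq in trnas:
        hosts = sorted(organisms_containing(seq, genomes))
        if len(hosts) < 2:
            continue
        for pair in combinations(hosts, 2):
            links.setdefault(pair, []).append(seq)
    return links


if __name__ == "__main__":
    toy_genomes = {"A": "TTTACGTACCAGGG", "B": "GGGACGTACCATT", "C": "TTTTTTTT"}
    print(build_associations(["ACGTACCA", "TTTTTTTT"], toy_genomes))
```

A plain substring scan like this is slow for 20,000 sequences against 2,000 genomes, which is consistent with the CPU-time problem; an index over the genomes would be the usual way to speed it up.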
Then the question arose of how to visualize this well. The idea is that an organism is a class. The current phylogenetic taxonomy, in the form of a tree, is inheritance between classes.
A tRNA is a class property, and the aggregation of these properties in different organisms is horizontal gene transfer (the same association).
Having generated the corresponding code skeleton, you can render it automatically as UML and see all these links in a class diagram.
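One possible way to get such a diagram, sketched here as an assumption, is to emit PlantUML text instead of a real code skeleton. The article does not name a UML tool, and the function names, the input dictionaries, and the tRNA identifiers below are hypothetical.

```python
import re


def ident(name):
    """Turn an organism or taxon name into a safe identifier."""
    return re.sub(r"\W+", "_", name).strip("_")


def to_plantuml(parents, trna_by_organism, shared_links):
    """Render the taxonomy and shared tRNAs as PlantUML class-diagram text.

    parents:          child taxon -> parent taxon (inheritance)
    trna_by_organism: organism -> tRNA identifiers (class properties)
    shared_links:     (organism, organism) -> shared tRNA identifiers
    """
    lines = ["@startuml"]
    for org in sorted(trna_by_organism):
        lines.append(f"class {ident(org)} {{")
        for trna in sorted(trna_by_organism[org]):
            lines.append(f"  {trna}")
        lines.append("}")
    for child, parent in sorted(parents.items()):
        lines.append(f"{ident(parent)} <|-- {ident(child)}")
    for (a, b), ids in sorted(shared_links.items()):
        lines.append(f"{ident(a)} o-- {ident(b)} : {', '.join(sorted(ids))}")
    lines.append("@enduml")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_plantuml(
        {"Escherichia coli": "Enterobacteriaceae"},
        {"Escherichia coli": ["trna_1"], "Salmonella enterica": ["trna_1"]},
        {("Escherichia coli", "Salmonella enterica"): ["trna_1"]},
    ))
```

The resulting text can be fed to PlantUML to draw the tree as inheritance arrows and the shared tRNAs as aggregation links. With thousands of organisms the picture will be dense, so drawing one subtree at a time is probably more practical.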
What is the problem?
Now the problem is CPU time. I am building a database of the presence of the 20,000 tRNAs in the 2,000 bacterial genomes. Only about 100 tRNAs are processed per day. Therefore, I would be grateful to anyone who is interested and can help with processor time: a kind of rudimentary distributed project :)
If anyone is interested, write to me in private messages. You will need about 50 GB of hard disk space and some time for me to explain what is what. Then I can send packets of 100 tRNAs for processing, and you send back the results when they are done.
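For a sense of scale, the split into packets and the resulting timeline can be sketched as follows. The function names are hypothetical, and the estimate assumes every volunteer machine processes about 100 tRNAs per day, the same rate as quoted above, which may not hold in practice.

```python
import math


def make_packets(trnas, packet_size=100):
    """Split the list of tRNAs into packets of at most packet_size items."""
    return [trnas[i:i + packet_size] for i in range(0, len(trnas), packet_size)]


def days_needed(total, per_day_per_machine=100, machines=1):
    """Rough number of days to process everything (equal machines assumed)."""
    return math.ceil(total / (per_day_per_machine * machines))


if __name__ == "__main__":
    packets = make_packets([f"trna_{i}" for i in range(20_000)])
    print(len(packets), "packets")
    for machines in (1, 10, 50):
        print(machines, "machine(s):", days_needed(20_000, machines=machines), "days")
```

At 100 tRNAs per day, a single machine needs 200 days for 20,000 sequences, and ten machines working at the same rate bring that down to 20 days.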