
Analysis of bacterial genomes. Continuation

In the previous article, the discussion turned out to be rather heated. We have since opened our own site, where more detailed information will appear (for the address, write to me). I promised a sequel about my experiment, so if you are interested in the problems of building evolutionary trees, read on.

№1. Selection of all homologous sequences (paralogs)


In the last article, we compared evolutionary trees built on the 16S and 23S genes. My method differs in that it compares what has not mutated across organisms. In earlier articles on Habr, I suggested using tRNA, because these are the most conserved sequences, but that gave too little information. So I wondered: how do you find all the sequences that have not mutated across organisms? To do this in a realistic amount of time, I used a small trick. Before any DNA sequence is inherited, it will almost certainly (if it is useful) be present in the genome in several copies. In other words, we are talking about paralogs.
If a gene is duplicated within one organism as a result of a chromosomal mutation, its copies are called paralogs.

So if we find all the paralogs in one organism, then wherever inheritance occurred they were passed on to other organisms. We then only need to select the ones that have not had time to mutate.

That is, we do the following:
1. In each DNA (the genome of an organism), we look for anything that has duplicates of 50 to 150 characters.
2. For each duplicate found, we look for its occurrences in all organisms, i.e. we build a database of how ALL the paralogs occur across the set of genomes.

(So as not to get sidetracked by how this is done, I will describe it either in a separate article or, more likely, if there is interest, in an article on our site in due course.)
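Until that write-up exists, here is a minimal Python sketch of the two steps above. It is not the procedure actually used in the experiment: it assumes exact matches only, a single fixed window length `k`, and one strand. The names `repeated_kmers` and `occurrence_table` and the toy genomes are made up for illustration, with `k=4` standing in for the 50 to 150 character range.

```python
from collections import Counter


def repeated_kmers(genome: str, k: int) -> set[str]:
    """Step 1: every length-k substring that occurs at least twice in one genome."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= 2}


def occurrence_table(genomes: dict[str, str], k: int) -> dict[str, set[str]]:
    """Step 2: map each duplicate found anywhere to the organisms that contain it."""
    candidates: set[str] = set()
    for seq in genomes.values():
        candidates |= repeated_kmers(seq, k)
    return {
        kmer: {name for name, seq in genomes.items() if kmer in seq}
        for kmer in candidates
    }


# Toy genomes (hypothetical data, not real sequences).
genomes = {
    "org_a": "ACGTACGTTT",  # ACGT occurs twice here, so it is a candidate
    "org_b": "GGACGTCC",    # contains ACGT once
    "org_c": "TTTTGGGG",    # does not contain ACGT
}
table = occurrence_table(genomes, k=4)
print(table)
```

Each entry of `table` corresponds to a set of records of the form "sequence X occurs in organism Y". On real genomes a scan like this would be far too slow and memory-hungry; an index structure such as a suffix array would be the usual replacement, but that is beyond this sketch.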

№2. Building the evolutionary tree


I have already described how to build an evolutionary tree with my method, so here I will focus on the results of the cross-check. As a reminder, the cross-check of two trees, one built on the 23S rRNA gene and one on the 16S rRNA gene (the latest result of the project The All-Species Living Tree), gave the following error distribution (unlike in the previous article, converted here to percentages of the total):



I was hoping that my approach would give better results, but alas, it gave results of about the same quality, though different in substance. First, about quality. The cross-check was done as follows. About a million occurrences of paralogs were found in the organisms' genomes, i.e. there are a million records of the form "DNA sequence ID such-and-such occurs in organism such-and-such". For the cross-check I split this set randomly into two samples, built a tree on each, and compared the two trees in the same way as before. The result was as follows:
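The random split itself is simple; a minimal Python sketch is shown here. The record layout `(sequence_id, organism_id)`, the function name `split_records`, and the fixed seed are assumptions for illustration. Building the tree from each half is not shown.

```python
import random


def split_records(records, seed=0):
    """Shuffle a copy of the records and cut it into two halves of (almost) equal size."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Hypothetical records: "sequence seqN occurs in organism orgM".
records = [(f"seq{i}", f"org{i % 3}") for i in range(10)]
half_a, half_b = split_records(records)
print(len(half_a), len(half_b))
```

Every record lands in exactly one half, so the two trees are built from disjoint evidence.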



Thus, the confidence in these trees is essentially the same. Both are about 50% correct.

Of course, it may be that the genomes do not contain so much information that half of the sample is enough to reproduce the same tree. So I decided to use the available information as economically as possible, with the following cross-analysis: take all the available information to build a complete tree, and compare it with the half-trees. That is, take the tree built on the entire million records and compare it first with the tree built on one half and then with the tree built on the other. In the figure below (and in full resolution by the link), the tree is drawn from the full sample, and the nodes that are fairly stable, i.e. for which the cross-analysis gave no more than one error, are shown in red.

As you can see, things are not so bad: some branches are entirely red. But the closer to the root, the less information there is, and the position of species in the tree fails the check.

But here is what is interesting: I then compared my tree with the tree of the project The All-Species Living Tree (after reducing both to the same set of species). It turned out that they coincide by only 25%.
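To make "reducing to the same set of species" and "coincide" concrete, here is a small Python sketch. It is not the error measure used for the figures in this article. As an assumption it scores agreement as the share of non-root clades (sets of species under one node) that both rooted trees contain. Trees are written as nested tuples, and all names and toy trees are hypothetical.

```python
def leaves(tree):
    """Set of species names under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def prune(tree, keep):
    """Drop species not in `keep`; collapse nodes left with a single child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    return children[0] if len(children) == 1 else tuple(children)


def clades(tree, is_root=True):
    """Leaf sets of all internal nodes except the root (which always matches)."""
    if isinstance(tree, str):
        return set()
    found = set()
    for child in tree:
        found |= clades(child, is_root=False)
    if not is_root:
        found.add(leaves(tree))
    return found


def agreement(tree_a, tree_b):
    """Shared clades divided by all distinct clades, after pruning to common species."""
    common = leaves(tree_a) & leaves(tree_b)
    a, b = clades(prune(tree_a, common)), clades(prune(tree_b, common))
    return len(a & b) / len(a | b) if a | b else 1.0


# Toy trees: they share species a, b, c, d; "e" and "x" are pruned away.
tree_a = ((("a", "b"), "c"), ("d", "e"))
tree_b = ((("a", "b"), "d"), ("c", "x"))
score = agreement(tree_a, tree_b)
print(round(score, 3))  # 0.333: only the clade {a, b} is shared out of three
```

With a measure of this kind, two trees can agree inside small groups (here `a` and `b` stay together) and still get a low overall score because the groups are attached in different places.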

This raises an important question of interpretation; perhaps someone can tell me what it could mean. It turns out that my method of building trees can be trusted, and apparently so can the classical method used in the project The All-Species Living Tree. The two do not differ significantly in their level of internal agreement. But why do they not agree with each other? They are supposed to show two variants of the same thing. How can there be two half-truths at once that coincide by only 25%?


Full-size versions can be found here and here.

I also think that the inconsistencies are not random and arise somewhere at the level of families of organisms. In the second version of the tree image, you can see that species cluster into groups: within a group there are many coincidences, while the position of the groups themselves is inaccurate.

There are two options. Either there really is too little data so far, with too few intermediate species sequenced. Or, at levels above the family, organisms really have no common ancestor, and evolution does not follow Darwin? At least for the time being, we have no reliable data that a common ancestor existed at all.

Source: https://habr.com/ru/post/168593/

