Project “Prokaryotic Genome” - a scientific startup

This project was conceived long ago. About 5 years ago I came to believe that many results in genomics can be obtained by people far from biology, which I certainly am. Of course, over this time I have picked up some terminology and learned a little about how the specialists work. But the more I learned about how they work, the less I could accept it. I believe they needlessly complicate a great deal, so that an already difficult field becomes impassable, while much of it can be done quite simply and well. And yes, I am trying to compete with them (only in a certain narrow area, of course), however naive that may look.

The whole problem of this project is that I am its only full-fledged participant. Of course, I have managed to talk with many people during this time, and many of them had a real influence on the project. Thanks to all of them. It is clear that a non-profit project cannot count on much success. Indeed, behind every scientific project there is usually a solid injection of funding on the order of millions and a team of serious scientists. We have neither; there is only humanism and enthusiasm.

Therefore, first of all, I need advice from those who have experience launching such projects on a non-commercial basis. Second, we need a team of programmers (if necessary, I will spare you from having to know biology :)). And right now I would like to find enthusiasts who could do some work, modest as it may be, on the project's home web page (please write to me at tac@inbox.lv or by personal message on Habr). And of course, any other responses and suggestions are important.
Below I will explain the idea and what the project lays claim to, as well as the current results, which in the worst case are comparable to those produced by the specialists. I am quite self-critical, so I am always ready to listen to criticism, preferably aimed not at me but at the project.

From idea to computer experiments

I will not explain the raw idea: a lot of ground has already been covered, and I described it in previous articles on Habr. [However, I will add a couple of words, because in the comments below many complain that I started abruptly. The main idea and task of the project is to understand how bacteria evolved and how their DNA changed step by step. To do this, we build a tree of species divergence and analyze it.] I will describe a new, so to speak full-scale, experiment. But first I need to introduce the problem area and then work out how to evaluate the results of the experiment.

Phylogenetic signal

Here we will try to discuss this term, which a biologist brought to my attention.

Where animals descend from a common ancestor, it is believed possible to build a single tree-like hierarchical structure of the origin of species. It makes no fundamental difference which characters are taken as the basis. Simply, the more genes are included in the analysis, the fewer poorly supported sections remain in the tree. Conversely, if the objects being classified do not originate from a common ancestor, no single tree-like hierarchical structure exists. The classification of such objects either differs fundamentally depending on the set of characters (genes) used, or has a fundamentally non-tree-like form.

Agreement between "trees" built on different characters is said to indicate the presence of a "phylogenetic signal". The smaller the differences between trees built on different sets of genes, the stronger the "phylogenetic signal". Importantly, though, the converse does not hold.

It is often said that this signal is indeed present and that the trees do coincide. But this is not entirely true: I came across an article that is rather more critical on this subject.

First, they indicate that:

It is assumed that by analyzing a multitude of genes it is possible to strengthen the phylogenetic signal until it exceeds the noise and to achieve the correct resolution of conflicts between different genes. But

[a number of specific examples follow]

All this suggests that the current methods of phylogeny reconstruction from a large number of genes do not eliminate the artifacts known for single genes. Here, the assumptions of evolutionary models, differences in the rates of evolution of species, errors in alignment and in the choice of orthologs, and insufficiently representative taxonomic sampling can all have an effect. To eliminate the artifacts of multigene phylogenetic analysis, data selection is proposed, which, of course, makes the analysis less formal. Thus, the practice of modern phylogenomics shows that the statistical support for phylogeny reconstructions increases with the number of genes being compared, but a high level of statistical support for the tree as a whole or its individual nodes cannot serve as an indicator of the correctness of phylogenetic reconstruction.

And secondly they ask:

How can one find a gene or nucleotide worthy of unlimited trust? The shorter the geological period during which the stem group existed, the less likely it is that a randomly selected gene will carry a synapomorphy without being subject to homoplasies and reversions. There is one sure way to get the winning ticket in a lottery: buy the whole print run. Given the pace at which sequencing technology and computer processing are developing, for genomes this may in a few years seem not such a stupid idea. On the other hand, if the similarity due to kinship between species is large, it will be found in many genes chosen at random and even, probably, in a single sufficiently long gene, such as 18S or 28S rRNA.

This is what is called a classic of biology. And now let's try to think about it.

In previous articles I proposed the tRNA gene for the role of such a "trustworthy" gene and showed what comes of it. This gene is no worse than rRNA, which currently enjoys "unlimited trust". In this article [and in its continuation] I will go further and show what happens if you "buy the whole print run". But before that, we need to figure out what is wrong with the option in which rRNA enjoys "unlimited trust".

And it turns out that the choice of one gene or nucleotide sequence over another is not the issue at all. They are right to dream (though, strangely, they do not actually do it) of comparing across a large variety of genes. The point is the method. It is statistical in nature, and those who look at it a little more soberly, like the authors of the article quoted above, admit that it has problems: "Here, the assumptions of evolutionary models, differences in the rates of evolution of species, errors in alignment and in the choice of orthologs, and insufficiently representative taxonomic sampling can all have an effect."

Each of these, separately, degrades the phylogenetic signal in one way or another. My biggest complaint is about alignment errors (I will not explain what alignment is; read about it on Wikipedia). It is because of alignment that one has to deal with statistics and the errors that come with it. Nobody currently knows how to do alignment correctly, especially for short sequences, because it does not really take into account the conservation of certain fragments. To do that, the hydrogen bonds in the tertiary structure would have to be taken into account, but this is usually not done during alignment.

But rRNA is, first, long, and second, although there are many individual errors, statistically they still yield some kind of signal. What quality that signal has we will see below, by comparing the trees built on 16S rRNA and 23S rRNA (the longest RNA sequences that make up the ribosome). These trees were obtained in The All-Species Living Tree project. Third, plenty of articles are now being written about building phylogenetic trees, yet for some reason a question such as "analysis of the prevalence of phylogenetic signal over noise" is not discussed.

And what about the alternative?

The only way to answer criticism like the one above ("a high level of statistical support for the tree as a whole or its individual nodes cannot serve as an indicator of the correctness of phylogenetic reconstruction") is to move away from statistical reasoning, in which common sense does not allow 100% certainty, to conclusions of a deterministic nature. For that, you need to get rid of alignment in the analysis and choose nucleotide sequences that can be analyzed without alignment.

I am surprised that experts neither propose nor even see this alternative, although it at least shows more stable results. Why? Let's look into this.

After all, whatever tree I derive, confidence in it will be no greater and no less than in other trees. But those trees were built by experts (as in The All-Species Living Tree project), while this one, people will say, was built by a "charlatan". And there will always be objections.

Likewise, any method is vulnerable to criticism as long as there is no confidence in its results. Therefore, we need a criterion for the correctness of the results. The stability of the "phylogenetic signal" is a candidate for such a criterion.

But before adopting it, I would like the reader to understand why this signal may be unstable at all. There can be 3 reasons:

1. Evolution does not follow Darwin, i.e. organisms simply have no common ancestor and never had one. Considering, first, that horizontal gene transfer is a known phenomenon, and second, that the RNA world hypothesis has already been practically proven, so that individual organisms could have arisen independently of each other, Darwinian evolution is in fact very much in question. So here we simply agree that it is more convenient for the human mind to view the origin of species hierarchically, and that Darwinian evolution is for us just a convenient way of presenting information, much like drawing charts instead of writing text.
2. Method errors. For example, alignment, which I have said I strongly distrust. Incorrect alignment is what skews the signal to a large extent.
3. A different number of examples in the sample.

When all three causes are at work, we cannot tell with certainty whether the resulting noise has an objective or a subjective cause. That is, we cannot say whether the problem lies in our method, in how representative our sample is, or whether evolution, after all, does not go entirely according to Darwin.

Researchers can very easily say: "But you know, our method works perfectly, the sample is wonderful, and the small errors you see are just how things are in nature." But first, let's quantify the errors. Second, let's replace the statistical approach with a deterministic one. Third, let's analyze everything that is available to the deterministic approach.

The advantage of the deterministic approach

To demonstrate the advantage of the deterministic approach, I will propose a thought experiment. It could be carried out in reality, but readers would simply tire of the dry presentation, and, most importantly, we have known since Aristotle's time that an experiment proves nothing in absolute terms; it only lets us say "we see this in this data, but that does not mean it cannot be otherwise". And we need to judge in absolute terms.

So, a thought experiment. Let us compare the statistical and the deterministic approach. In the statistical one, we analyze 1000 organisms on a single gene, 16S rRNA, which is long, about 1600 characters (this is what is done in the overwhelming majority of studies). Suppose we have a reliable set of rRNA for all 1000 organisms. To build a phylogenetic tree, we need to make an alignment. But before aligning, let us divide each rRNA into two equal parts and perform the alignment and the subsequent tree construction for the first and second parts separately.
Since the sample is the same, the 3rd reason has no effect. We agreed not to appeal to the 1st reason. But it is obvious that the alignment will affect the shape of the tree, at least slightly: a certain evolutionary distance is calculated from it, and for the different parts it will be a little different. As a result, the first and second trees will differ, and this is the 2nd reason, a method error.

Now for the deterministic approach. Here we focus on genes that are completely identical in different organisms; they cannot be long, because anything long is more likely to have mutated. So instead of one gene of 1600 characters we have a set of 10-20 genes of 70-150 characters each. tRNA genes, for example, fit this description. Again, suppose we have a reliable set of these genes. The question is then: if the tRNA sequences are divided into two parts and two separate trees are built, will the trees coincide or not? Answer: they will coincide 100%. This is because, when the tree is built, the sequences are in effect replaced with identifiers, and all further manipulation works only with combinations of genes. Therefore, if the genes were correctly identified from half of the sequence, no further distortion arises.
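The identifier argument can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy sequences are invented and far shorter than real tRNA genes, the function names are hypothetical, and the count of shared genes between two organisms is only a stand-in for the tree-building step, which this article does not describe. The point it shows is narrow: as long as half of a sequence still identifies the gene uniquely, the gene combinations per organism come out the same.

```python
from itertools import combinations


def gene_profiles(genomes, key=lambda seq: seq):
    """Map each organism to the set of identifiers of its genes.

    `key` decides what identifies a gene: the whole sequence by default,
    or for example only its first half. Equal keys get the same identifier.
    """
    ids = {}
    profiles = {}
    for organism, sequences in genomes.items():
        profiles[organism] = frozenset(
            ids.setdefault(key(seq), len(ids)) for seq in sequences
        )
    return profiles


def shared_gene_counts(profiles):
    """Number of gene identifiers shared by each pair of organisms."""
    return {
        (a, b): len(profiles[a] & profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Invented toy data (assumption): every half is unique to its gene.
genomes = {
    "org1": ["ACGTAC", "GGATCC", "TTGACA"],
    "org2": ["ACGTAC", "GGATCC", "CATGCA"],
    "org3": ["ACGTAC", "AAGCTT", "CATGCA"],
}

whole = shared_gene_counts(gene_profiles(genomes))
first = shared_gene_counts(gene_profiles(genomes, key=lambda s: s[: len(s) // 2]))
second = shared_gene_counts(gene_profiles(genomes, key=lambda s: s[len(s) // 2 :]))
```

Here `whole`, `first` and `second` hold the same counts, because no alignment or distance estimate is involved, only exact identity. If two different genes happened to share a half, the identifiers would merge and the claim would no longer hold, which is the "correctly identified" condition in the text.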

That is, under ideal conditions and with the same sample, the deterministic approach has a clear advantage: it has no errors of the 2nd kind (method errors, reason 2 above).

Only then can we talk about errors of the 3rd kind and how they affect the phylogenetic signal. But we must understand that in the deterministic approach we have only errors of the 3rd kind, whereas in the statistical approach, which is now accepted everywhere, we cannot separate the "noise" of the 2nd kind from that of the 3rd.

The experiment itself

№1. Comparing 16S and 23S trees

So, we need to compare two trees, one built on the 23S rRNA gene and one on the 16S rRNA gene, which are the latest results of The All-Species Living Tree project.

But only comparable things can be compared. So it is time to talk about how to measure the magnitude of the error of the 3rd kind, i.e. how the sample size and composition affect the result. Specialists would suggest statistical studies here: probability distributions, estimates of bias, variance and other murky indices and opaque coefficients. We, by contrast, must compare in such a way that every number tells us what it means.

First, the phylogenetic tree format hides one important thing: it does not explicitly show the parent, although the parent is there as the junction of lines at one level. In practice this means converting, for example, the .newick format into the .gml format, i.e. obtaining a full tree in which every ancestor has a name.
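A minimal sketch of such a conversion in Python, under stated assumptions: the input is plain Newick with no quoted labels, comments or whitespace between tokens; branch lengths are dropped; and unnamed ancestors get generated names `n1`, `n2`, ... (the naming scheme and both function names are hypothetical, not the project's actual tool). If the file carries support values as internal labels, those would be taken as names and might collide.

```python
import itertools


def parse_newick(text):
    """Parse a simple Newick string into a {node: parent} map.

    Unnamed internal nodes are given generated names n1, n2, ...
    The root maps to None.
    """
    text = text.strip().rstrip(";")
    counter = itertools.count(1)
    parent = {}
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        # "name:0.12" -> "name" (branch length is discarded)
        return text[start:pos].split(":")[0].strip()

    def read_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            children.append(read_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(read_node())
            pos += 1  # skip ")"
        label = read_label()
        if children and not label:
            label = f"n{next(counter)}"
        for child in children:
            parent[child] = label
        return label

    root = read_node()
    parent[root] = None
    return parent


def to_gml(parent):
    """Render a {node: parent} map as GML text with parent -> child edges."""
    ids = {name: i for i, name in enumerate(parent)}
    lines = ["graph [", "  directed 1"]
    for name, i in ids.items():
        lines.append(f'  node [ id {i} label "{name}" ]')
    for child, par in parent.items():
        if par is not None:
            lines.append(f"  edge [ source {ids[par]} target {ids[child]} ]")
    lines.append("]")
    return "\n".join(lines)


tree = parse_newick("((A:0.1,B:0.2):0.3,C:0.4);")
gml_text = to_gml(tree)
```

For the toy string above, the two unnamed ancestors become `n1` (parent of `A` and `B`) and `n2` (the root), so every later step can refer to ancestors by name.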

Second, there is almost 10 times more data for the 16S gene. So we need to remove the leaves that are in the 16S tree but not in the 23S tree, and vice versa. Only then do we get trees that can be compared with each other. But after such removal (cutting) of the "leaves" we cannot compare, their former ancestors may remain, and if these no longer have any other "leaves", they should be removed as well so that they do not clutter the tree.

Third, and most importantly, the pruning described above does not solve all the problems of bringing the trees to a common denominator. An ancestor may have only one leaf, and that ancestor may in turn be the only child of its own ancestor, and so on several times over. That is, we end up with "long threads" in the tree. All these "single" ancestors prevent comparison with the other tree (23S), in which they do not exist because it was built on a smaller sample; naturally, a larger sample implies a larger number of ancestors in order to reflect the divergence of species more accurately. To make the trees comparable, such "single" ancestors must be excluded and their leaves raised to the level where there is an ancestor with more than one leaf (that is, where real divergence occurs).

This "lifting of leaves to the divergence points" again leaves behind ancestors that can be eliminated, so steps 2 and 3 should be repeated until all unnecessary ancestors have been excluded.
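Steps 2 and 3 can be sketched in Python as follows. This is not the project's code: instead of literally repeating the two steps, it makes a single bottom-up pass, which should arrive at the same end state (no ancestor without leaves, no ancestor with a single child). The tree is assumed to be a non-empty `{node: parent}` map with the root mapped to `None`; the function name is hypothetical, and all node names in the example except "Escherichia_albertii" and "n23" are invented.

```python
def reduce_tree(parent, keep_leaves):
    """Restrict a {node: parent} tree to the leaves in keep_leaves.

    Leaves outside keep_leaves are cut, ancestors left without leaves
    are dropped, and ancestors left with a single child are skipped,
    so that the child is lifted to the nearest real divergence point.
    """
    kids = {}
    root = None
    for node, par in parent.items():
        kids.setdefault(node, [])
        if par is None:
            root = node
        else:
            kids.setdefault(par, []).append(node)

    # Preorder walk; reading it backwards visits children before parents.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(kids[node])

    survivor = {}  # node -> top node of its reduced subtree, or None
    reduced = {}
    for node in reversed(order):
        if not kids[node]:  # a leaf of the original tree
            survivor[node] = node if node in keep_leaves else None
            continue
        kept = [survivor[c] for c in kids[node] if survivor[c] is not None]
        if not kept:
            survivor[node] = None      # ancestor with no leaves left
        elif len(kept) == 1:
            survivor[node] = kept[0]   # "single" ancestor: lift its child
        else:
            for child in kept:
                reduced[child] = node
            survivor[node] = node
    top = survivor[root]
    if top is not None:
        reduced[top] = None
    return reduced


tree_16s = {
    "root": None,
    "n22": "root",
    "n23": "n22",
    "Escherichia_coli": "n23",
    "Escherichia_albertii": "n23",
    "Shigella_flexneri": "n22",
    "Salmonella_enterica": "root",
}
common = {"Escherichia_coli", "Shigella_flexneri", "Salmonella_enterica"}
reduced_16s = reduce_tree(tree_16s, common)
```

In this toy case the leaf "Escherichia_albertii" is cut, "n23" is left with one child and disappears, and "Escherichia_coli" is lifted to "n22". Applying the same function to both trees with the set of leaves they have in common gives two trees over the same leaf set.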

A small sketch for clarity:

On the right is the tree before any manipulation. In the center is the version where the leaf "Escherichia_albertii", which is absent from the compared tree, has been cut. On the left is the version where the superfluous ancestor "n23" has been removed. In reality the reduction is far more drastic: of 18,000 nodes only 3000 remain. It may seem that important ancestors have been removed, but if they are kept the comparison only gets worse, since the "removed" ancestors cannot appear in the smaller tree, and we need to compare comparable things, not a "kettle with a pan".

Now, if we approach the comparison strictly, the trees agree where leaves that share a parent in one tree also share a parent in the compared tree. We can count the number of such cases. But to assess closeness we also need some distribution of errors. The magnitude of an error can be calculated as follows. If a pair of "leaves" shares a parent in one tree, then in the compared tree we find their lowest common ancestor (LCA), count the intermediate ancestors from the first leaf to the LCA and from the second leaf to the LCA, add the two numbers, and plot the sum as a point on the error distribution.
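The error measure can be sketched in Python like this. It is a reading of the description above rather than the project's code, with some assumptions: both trees are `{node: parent}` maps over the same set of leaves, the function names are hypothetical, and when a parent has more than two leaf children every pair of them is counted. An error of 0 means the pair shares a parent in the compared tree too.

```python
from collections import Counter
from itertools import combinations


def ancestors(parent, node):
    """Ancestors of node, nearest first, ending with the root."""
    chain = []
    node = parent[node]
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain


def pair_error(parent, a, b):
    """Intermediate ancestors between a and the LCA plus between b and the LCA."""
    up_a = ancestors(parent, a)
    index_b = {n: i for i, n in enumerate(ancestors(parent, b))}
    for i, node in enumerate(up_a):
        if node in index_b:  # first shared ancestor is the LCA
            return i + index_b[node]
    raise ValueError("leaves have no common ancestor")


def error_distribution(tree_a, tree_b):
    """For leaf pairs sharing a parent in tree_a, tally their error in tree_b."""
    internal = {p for p in tree_a.values() if p is not None}
    siblings = {}
    for node, par in tree_a.items():
        if par is not None and node not in internal:
            siblings.setdefault(par, []).append(node)
    dist = Counter()
    for leaves in siblings.values():
        for a, b in combinations(sorted(leaves), 2):
            dist[pair_error(tree_b, a, b)] += 1
    return dist


# Toy trees (assumption): ((A,B)x,(C,D)y)r versus (((A,C)p,B)q,D)r
tree_a = {"A": "x", "B": "x", "C": "y", "D": "y", "x": "r", "y": "r", "r": None}
tree_b = {"A": "p", "C": "p", "p": "q", "B": "q", "q": "r", "D": "r", "r": None}

dist = error_distribution(tree_a, tree_b)
exact_share = dist[0] / sum(dist.values())
```

In the toy case the pair (A, B) gets error 1 and the pair (C, D) gets error 2, so `exact_share` is 0. Run on the two reduced trees, `dist` is the histogram to plot and `exact_share` is the fraction of strictly matching cases.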

As a result we get the following graph: about 50% of the cases are correct, and the rest are erroneous to some degree, although the error does tail off.

As we can see, the specialists are far from ideal: the signal comes out about 50% noisy, and although some regularity breaks through, it is unstable. So there is room for improvement.

To be continued…

This is getting long, so I will put the results of the deterministic approach in a separate article. There we will see how much the quality of the evolutionary tree (the phylogenetic signal) can be improved. The experiment is not yet fully complete, but I hope for the best :)

PS upd. There is a good chance that the issue with the site will be resolved. Thanks to the kind people :) Now we need a chief editor / image-maker for the site, someone who can correct both the grammar and the sense of the text, so that my "cheeky style" does not put off specialists and at the same time stays understandable to ordinary people.

Source: https://habr.com/ru/post/166361/
