I have already written about why such forecasting is needed: the creation of artificial intelligence. Here I will describe only the prediction algorithm itself, without unnecessary digressions.
I will consider predicting a sequence of bytes, or UTF-8 text. Predicting a sequence of fractional numbers (a graph of values) is very similar, except that values are compared not for equality but for falling into neighborhoods.
Let there be a stream of bytes (say, UTF-8 text): the incoming data to be predicted. The incoming data is appended to a saved history. Each next incoming value is also recorded in a structure for accumulating statistics:
struct Stat { uint value; uint count; };
In the root node we accumulate statistics over all past values. The probability of each value is its count divided by the total count of all values. The sum of the probabilities of the Stat entries held in root is one: this is the probability distribution of values in the incoming stream. Taking the Stat in root with the highest probability gives the first prediction: the next character arriving at the input of our algorithm will most likely be this one, with the probability computed from this Stat.
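As a minimal sketch of this zero-level bookkeeping (the function names `update_root` and `best_prediction` are mine, not from the described implementation, and Python stands in for the C-like structs of the text):

```python
from collections import Counter

def update_root(root_stats, value):
    """Count one incoming value in the zero-level statistics."""
    root_stats[value] += 1

def best_prediction(root_stats):
    """Most frequent value so far and its probability (count / total)."""
    total = sum(root_stats.values())
    value, count = root_stats.most_common(1)[0]
    return value, count / total

root_stats = Counter()
for b in b"abracadabra":
    update_root(root_stats, b)
# 'a' has been seen 5 times out of 11 values, so it is the first prediction
```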
The next value arrives at the input. First we add it to the root node's statistics. Then, for the incoming value saved at the previous step, we find or create in root's index_of_prev the Ptrn node corresponding to that previous value, and add the current value to that node's statistics as well.
The root node will be called the zero level of prediction. The Ptrn nodes in root's index_of_prev will be called the first level of prediction.
The first level gives more precise predictions than the zero level: the probabilities of values given that the previous value was such-and-such. But the sum of the probabilities in each node is still one.
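A sketch of the first prediction level, kept deliberately flat (a dictionary keyed by the previous value stands in for the Ptrn tree; the names are my own):

```python
from collections import Counter, defaultdict

def accumulate(stream):
    """Zero-level stats plus first-level stats conditioned on the previous value."""
    root = Counter()
    index_of_prev = defaultdict(Counter)
    prev = None
    for v in stream:
        root[v] += 1
        if prev is not None:
            index_of_prev[prev][v] += 1
        prev = v
    return root, index_of_prev

root, level1 = accumulate(b"abab")
# P(next = 'b' | prev = 'a') = 2/2 = 1.0, sharper than the unconditional 2/4
```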
When enough statistics have accumulated in some first-level node, so that one can argue that further accumulation in this node will not substantially change the forecast distribution, this node grows branches in its own index_of_prev: a second level is created for this node. Using the saved history, we immediately compute the statistics for these branches from the values that have already passed and been saved.
And so we keep branching, as with root, through index_of_prev. Each Ptrn identifies a pattern: the chain of consecutive values collected from the current Ptrn up to root through the owner property, together with the probability distribution of the values that follow this pattern.
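A sketch of a Ptrn node and the reconstruction of its pattern through owner (the field names follow the text; the method body is my guess at the intended structure):

```python
from collections import Counter

class Ptrn:
    def __init__(self, value=None, owner=None):
        self.value = value          # the preceding value this node branches on
        self.owner = owner          # parent Ptrn, toward root
        self.index_of_prev = {}     # deeper branches: earlier value -> Ptrn
        self.stats = Counter()      # forecast distribution following this pattern

    def pattern(self):
        """Collect the chain of values from this node up to root.
        Deeper nodes lie further back in time, so the chain comes out oldest-first."""
        chain = []
        node = self
        while node.owner is not None:
            chain.append(node.value)
            node = node.owner
        return bytes(chain)

root = Ptrn()
b_node = Ptrn(ord('b'), root)       # context: the previous value was 'b'
ab_node = Ptrn(ord('a'), b_node)    # context: 'a' then 'b' preceded the input
```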
The longer the pattern, the more certain what exactly will follow. But the longer the pattern, the less often it occurs, and there may not be enough statistics for a specific pattern. I.e. if some pattern occurred only three times in the accumulated history, a measurement over three cases is unlikely to be indicative of what value will follow it; we must use a shorter pattern, which gives greater certainty but a more diffuse prediction distribution, until the longer one accumulates enough statistics.
Some patterns that already have sufficient statistics have an accurate forecast: out of a hundred occurrences of the pattern, it was always followed by the same value. For such patterns there is no need to build longer patterns; for longer ones the resulting distribution would be the same. Although, as later refinements will show, such branches will still be needed.
I will note that I omit many details in this description that would only complicate it and obscure the general idea. One such detail: an exact forecast, say one hundred cases out of one hundred, is in fact not exact but approximate; there are no absolute probabilities in forecasting. In the course of further accumulation of statistics on this pattern, a different value will someday still occur. For now, we assume it will not change.
Let us divide the accumulated statistics into defined and undefined. For example, once sum_count_of_stat in a Ptrn exceeds one hundred, we begin to consider it defined, and from it we begin to grow the sub-branches in index_of_prev. (Although not all cases require accumulating up to one hundred.)
Those that are defined are further divided into accurate (with an exact forecast) and inaccurate.
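The defined/accurate classification just described can be sketched as follows (the threshold of one hundred comes from the text; the function names are mine, and a plain dict of value to count stands in for the Stat list):

```python
DEFINED_THRESHOLD = 100  # example value from the text; not necessarily fixed

def is_defined(stats):
    """Enough samples that the forecast distribution is unlikely to change much."""
    return sum(stats.values()) > DEFINED_THRESHOLD

def is_accurate(stats):
    """Accurate forecast: every recorded case was followed by the same single value."""
    return is_defined(stats) and len(stats) == 1
```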
Overall, we accumulate a tree of patterns and compute forecasts for the situations they identify.
Now consider forecast similarity. For patterns whose forecast we cannot determine because of insufficient or absent statistics, but for which certain signs tell us that their forecast will be similar to that of another, already defined pattern, we take the forecast from the similar one.
Each defined pattern has many sub-branches in index_of_prev. A branch of a pattern is obtained by descending from the pattern into index_of_prev: the pattern extended by one more preceding value.
Patterns can be grouped by the likeness of their branches; in other words, the accumulated set of patterns can be classified.
But similarity should be judged not by the mere presence of sub-branches, but by the forecast distributions in them. I.e. when evaluating the similarity of two patterns' branches, we look at how the forecast distributions correlate across the branches these patterns have in common.
If a certain group of patterns has accumulated in which some part of the branches is similar in prediction, then for any new pattern that we are convinced belongs to this group, the predictions of all its branches, including those still undefined, will be similar to those in the group.
Note that only part of the branches is similar. Having all the branches similar is a very rare case.
How do we determine, for new patterns whose branch statistics are not yet sufficient, whether they belong to some group?
If the pattern has at least a couple of defined branches among those used to evaluate similarity within the group being tested, and this pair of branches is similar in prediction to that of the group, then we check whether these two branches are enough to identify membership in the group.
To do this, we find all patterns for which these two sub-branches are similar, and check whether all of them belong to the tested group. If some of those found do not belong to the group, because one of their branches, a branch with defined statistics, is not similar to the corresponding branch of the group, then the two defined branches are not enough to unambiguously identify membership.
In some cases two identifying sub-branches will suffice; in others, three or more.
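The membership test above might be sketched as follows; here a pattern's branches are modeled as a plain dict of branch key to accurate-forecast value, which is my simplification of the structures in the text:

```python
def sufficient_to_identify(branch_keys, group_members, forecasts):
    """branch_keys suffice to identify the group if every pattern whose
    forecasts agree with the group's on those branches is in fact a member."""
    reference = forecasts[next(iter(group_members))]
    for name, f in forecasts.items():
        matches = all(k in f and f[k] == reference[k] for k in branch_keys)
        if matches and name not in group_members:
            return False
    return True

forecasts = {
    "p1": {"A": 1, "B": 2, "C": 3},
    "p2": {"A": 1, "B": 2, "C": 3},
    "q":  {"A": 1, "B": 2, "C": 9},  # agrees on A and B but diverges on C
}
group = {"p1", "p2"}
```

With this data, branches A and B alone are not enough to identify the group, because the outsider "q" matches on both; adding C makes the identification unambiguous.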
At this point one could describe the formulas and algorithms for computing these similarities, but I will say at once that the problem cannot be solved head-on, as described above, because of the sheer amount of computation.
Therefore we will consider only the most significant part: compare similarity only over branches with an accurate forecast. If two patterns have the same sub-branch, the forecast is accurate in both cases (i.e. the distribution consists of a single value), and the forecast values are equal, then these branches are similar. If in either case the forecast is not accurate, the branches are not similar. I.e. the similarity is zero or one. Cases where both forecasts are inaccurate are not considered, because we will group into groups with an accurate forecast and discard all the rest.
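The zero-or-one similarity measure can be sketched like this (distributions are Counters of forecast value to count; the names are mine):

```python
from collections import Counter

def accurate_value(stats):
    """The single forecast value if the distribution holds exactly one value, else None."""
    if len(stats) == 1:
        return next(iter(stats))
    return None

def branches_similar(stats_a, stats_b):
    """Similar iff both branches have an accurate forecast and it is the same value."""
    a, b = accurate_value(stats_a), accurate_value(stats_b)
    return a is not None and b is not None and a == b
```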
For each new pattern, when its forecast becomes defined, and if it is accurate, we include it in a Group inside the postfix_root object (this class is kept separate because a little later it will become a tree):
struct Postfix { List<Ptrn*> list_of_ptrn; Map<uint, Postfix*> index_of_next; uint count_on_prev_calc_similarity; };
When a certain minimum number of patterns has accumulated in a group, say one hundred, we run the similarity-recalculation algorithm (described below). And as patterns continue to accumulate, when their number becomes twice what it was at the previous calculation (stored in count_on_prev_calc_similarity), we run the recalculation again.
What does the similarity search algorithm consist of?
Suppose a lot of patterns have accumulated in one of the groups in postfix_root: for all of them the forecast is accurate and identical. From postfix_root we create the next Postfix level: for each pattern, take its leftmost value and create the corresponding Postfix in postfix_root->index_of_next, adding to it the pattern from which the leftmost value was taken, but without that left value.
In effect we have split the set list_of_ptrn into subsets by the leftmost value, removing that value from the patterns in the new subsets. We continue splitting recursively wherever the resulting Postfix has more than one hundred entries in list_of_ptrn. The patterns in these sets shrink from level to level as leftmost values are bitten off; if a pattern reaches root, i.e. its length after biting becomes zero, it does not pass to the next level.
Thus we obtain grouped sets of patterns that have at least one similar branch.
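The recursive splitting by the leftmost value can be sketched as follows (a plain dict stands in for the Postfix class, and the threshold is lowered from one hundred to three so the toy example actually splits):

```python
from collections import defaultdict

MIN_SPLIT = 3  # the text uses one hundred; lowered for illustration

def build_postfix_tree(patterns):
    """Recursively split a set of patterns by their leftmost value,
    biting that value off; empty remainders do not descend further."""
    node = {"list_of_ptrn": list(patterns), "index_of_next": {}}
    if len(patterns) > MIN_SPLIT:
        buckets = defaultdict(list)
        for p in patterns:
            if p:  # length zero after biting: the pattern stops at this level
                buckets[p[0]].append(p[1:])
        for value, rest in buckets.items():
            node["index_of_next"][value] = build_postfix_tree(rest)
    return node

tree = build_postfix_tree([b"abc", b"abd", b"xbc", b"abe", b"abf"])
```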
Next, for each set in a Postfix, we scan the possible branches of its patterns, all except the branch along which the current Postfix was reached from the Group. We iterate over the branches of a virtual tree formed by the union of the branches of the patterns in the set:
struct ScanItem { Ptrn* ptrn_of_postfix; };
Meanwhile the resulting sets of patterns are grouped by their forecast: the subgroup of the Postfix set that shares an exact forecast is a set of patterns that already have two similar branches.
Then, from a set of patterns with two similar branches, we continue scanning and splitting into subsets with three, four, etc. similar branches. Moreover, at the moment when the number of similarity branches found becomes sufficient for identification, the set no longer splits, and newly found similarity branches serve rather as confirmation for the entire remaining set.
The general scheme is as follows:
There was one large set of all patterns; from it we selected the set with accurate forecasts, divided it into subsets by forecast value, then divided by the first similarity branch and the remaining bitten-off pattern, and began to scan and split the sets of patterns with one similar branch.
When scanning, duplicates can appear: first we went backward along branch A and, after scanning, obtained B; then we started backward from B and arrived at A. Duplicated scanning is pointless, and such duplicates must be pruned.
Above, I described similarity for predicting the value immediately following the group, and called it biting off postfixes. Now the same principle is needed, but with the prefix bitten off.
Why: this will find groups of patterns whose forecast is the same. The forecast may differ between prefixes, but within one prefix it is the same. And if a new prefix is found for such a group, and the identification signs confirm that it works, then there is no need to wait for defined statistics for the whole group. I will not give the class and algorithm description; suffice it to say that it is symmetric to the postfix search.
When searching for postfixes, we derived the distances at which the same function acts: pattern X mapped uniquely to forecast Y across all distances in the group. When searching for prefixes, we look for similarity not over distances but over values: a set of values for which the forecast is the same.
When we find the next mapping X → Y over postfixes, the typed sets X and Y accumulate as a result. The method of deriving the boundaries of the values of these sets that are not adjacent in distance is still an open question; for now I proceed without it. Among the functions found, some will share the same set X. Then a value of X becomes an object with several properties: each function is a separate property. The distances used to obtain each of these functions become methods for obtaining the properties of this object. In essence, this is the basic principle of converting text (or any other kind of information) into a database: into labels with their properties.
We can predict not only a value but also a group of values (a group of similar distances or of similar values). I.e. previously the prediction target was a value of the input stream. Once the algorithm has absorbed a value and assigned it to a group, we can accumulate statistics, analogous to value prediction, on whether the analyzed value is an element of the group. Then, for new incoming values on which no statistics have accumulated at all, we can still attribute them to a group and use that. The brain identifies new words in text on the same principle: from the surrounding context we compute that this is a noun, or an adjective, along with various other properties, even though the word itself is seen for the first time.
I should also mention that postfix and prefix groups are not strictly one or the other. When scanning for similarities, an algorithm that traverses them together will be more complete: with each postfix, check all prefixes, and vice versa, with each prefix, all postfixes; with both reverse and forward scanning.
The principles described are only a piece of the full picture of what similarities should be sought:
We need to scan several consecutive similarity groups in a row, when similarity groups occur within the scanned branches of the current test group, with their own rules for summing the sets. Likewise, scanning of nested groups.
We need to build quantifier sets: defining values not by a specific memorized pattern but by a generalized pattern template, something like a regular expression.
The branch scanner needs to be able to move not only from symbol to symbol but also over group markings, and not only in one direction: say, jump over a group, turn around, and start scanning an element of the group symbol by symbol from the other end.
Then there is further work with sets, concerning continuity and ordering.
Etc….
And surely there are many properties I have not yet thought of.
PS: source code on this topic, as well as further descriptions, will be posted on the site specified in my profile.