It is often necessary to determine the set of properties of a newly created object. For example, on a site that describes goods or films, each object needs a set of tags or properties. More generally, this applies to any repository of object descriptions in which objects have properties and can be compared with each other as “similar” or “not similar”.
So, the setup: the site has a ready-made set of objects whose properties are defined and verified. A new object is then added about which we know nothing, but which site visitors can judge. The task is to spare the administrator from adding the required properties manually, so that everything is done by the site visitors themselves.
For clarity, assume we have a site dedicated to cell phones. For simplicity, the site lists 5 phones with the following made-up properties (numbered for convenience):
A> Vibrating alert (1), Radio (2), Speakerphone (3), Flashlight (4)
B> Vibrating alert (1), Speakerphone (3), MP3 player (5)
C> Flashlight (4), Detachable case (6), MP3 player (5)
D> Detachable case (6), TV (7)
E> MP3 player (5), TV (7), Radio (2)
Then a sixth device is added. Unlike the visitors, the site administrator knows nothing about it. Let it be a device with Radio (2) and TV (7).
In our example, an object has only 7 possible properties. We start by assigning all of them to the new object.
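To make the starting state concrete, here is a minimal sketch of how the catalog and the new object's candidate properties could be stored. The array layout, the variable names, and the initial weight of 0 are my assumptions for illustration.

```php
<?php
// Known phones from the example; values are property numbers.
$known = [
    'A' => [1, 2, 3, 4],
    'B' => [1, 3, 5],
    'C' => [4, 6, 5],
    'D' => [6, 7],
    'E' => [5, 7, 2],
];

// Collect every property number that occurs anywhere in the catalog.
$allProps = array_unique(array_merge(...array_values($known)));
sort($allProps);

// Assumption: the new object starts with every property as a candidate,
// each with weight 0.
$weights = array_fill_keys($allProps, 0.0);

print_r($weights);
```

Deriving the candidate list from the catalog, rather than hard-coding 1 to 7, keeps the sketch valid when phones with new properties are added.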
In the next step we need to keep only the properties the object really has. To do this, we ask site visitors to rate how similar the new object is to a known one (chosen at random). Similarity is rated on a scale from 0 to 2, where 0 is “not similar”, 1 is “there is something in common” and 2 is “very similar”. A finer scale is possible, but this one is used here for simplicity.
When comparing, we consider only the properties that both the new and the known object have. If the user chose “very similar”, we add 1 to the weight of each of these shared properties of the unknown object. For “there is something in common” we add 0.5, and for “not similar” we subtract 1.
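The update rule just described can be sketched as a single function. This is a reconstruction from the description, not the original script: the name applyAnswer and the array-based representation are hypothetical.

```php
<?php
// Hypothetical helper: apply one visitor answer to the candidate weights.
// $answer: 0 = "not similar", 1 = "there is something in common", 2 = "very similar".
function applyAnswer(array $weights, array $knownProps, int $answer): array
{
    // Weight changes taken from the text: -1, +0.5, +1.
    $delta = [0 => -1.0, 1 => 0.5, 2 => 1.0];

    // Only properties present in both the known object and the candidate set change.
    foreach ($knownProps as $prop) {
        if (array_key_exists($prop, $weights)) {
            $weights[$prop] += $delta[$answer];
        }
    }
    return $weights;
}

$weights = array_fill_keys(range(1, 7), 0.0);
$weights = applyAnswer($weights, [1, 2, 3, 4], 1); // phone A: "there is something in common"
$weights = applyAnswer($weights, [5, 7, 2], 2);    // phone E: "very similar"
$weights = applyAnswer($weights, [1, 3, 5], 0);    // phone B: "not similar"
print_r($weights);
```

After these three answers, property 2 has weight 1.5 and property 7 has weight 1, while properties 1 and 3 have dropped to -0.5.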
I sketched a small example in PHP that illustrates how the algorithm works.
The code is very primitive, but it conveys the idea.
The output is an array like the following, where the key is the property number and the value is the calculated weight:
Array
(
[2] => 1,
[7] => 0.944444
)
As the tests show, accuracy depends on the number of iterations; the minimum number of iterations is 50 (this corresponds to K_MUL * 0.5, where 0.5 is the minimum step of weight change).
Adding more known objects with varying degrees of similarity to the new one improves how accurately its properties are determined.
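For readers who want to try such a test run, here is a minimal simulation under stated assumptions. It is not the original script. The model of an honest visitor (idealAnswer), the default threshold of 0.5, and the body of process are my guesses; only the name process comes from the article, and K_MUL is not modeled at all.

```php
<?php
$known = [
    'A' => [1, 2, 3, 4],
    'B' => [1, 3, 5],
    'C' => [4, 6, 5],
    'D' => [6, 7],
    'E' => [5, 7, 2],
];
$truth = [2, 7]; // the new phone: Radio and TV (known to visitors only)

// Assumed honest visitor: nothing shared -> 0, every true property
// shared -> 2, anything in between -> 1.
function idealAnswer(array $knownProps, array $truth): int
{
    $shared = count(array_intersect($knownProps, $truth));
    if ($shared === 0) {
        return 0;
    }
    return $shared === count($truth) ? 2 : 1;
}

function simulate(array $known, array $truth, int $iterations): array
{
    $delta = [0 => -1.0, 1 => 0.5, 2 => 1.0];
    $weights = array_fill_keys(range(1, 7), 0.0);
    $names = array_keys($known);
    for ($i = 0; $i < $iterations; $i++) {
        // Show the visitor a randomly selected known phone.
        $props = $known[$names[mt_rand(0, count($names) - 1)]];
        $answer = idealAnswer($props, $truth);
        foreach ($props as $p) {
            $weights[$p] += $delta[$answer];
        }
    }
    return $weights;
}

// Scale by the largest weight and keep properties at or above the threshold.
function process(array $weights, float $threshold = 0.5): array
{
    $max = max($weights);
    if ($max <= 0) {
        return [];
    }
    $result = [];
    foreach ($weights as $prop => $w) {
        if ($w / $max >= $threshold) {
            $result[$prop] = $w / $max;
        }
    }
    return $result;
}

mt_srand(1); // fixed seed so the run is repeatable
print_r(process(simulate($known, $truth, 500)));
```

With honest answers, only properties 2 and 7 gain weight on average, so after a few hundred iterations they are the only ones left above the threshold.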
Human factor
The case considered so far is ideal. But what if a certain percentage of users answer inaccurately? To simulate this, you can randomize answers by adding the following lines to the cmp function:
if (rand(0, 100) > 70) {
    $val = rand(0, 2);
}
This simulates a situation in which roughly every third answer (about 30%) is random, so it may or may not be correct.
As the tests showed, with three times as many iterations (matching the 1/3 of potentially incorrect answers) we still get the same array with properties 2 and 7. Only occasionally do fluctuations appear, and these can be eliminated by changing the threshold in the “process” function:
Array (
[2] => 0.98648648648649 - correct property
[6] => 0.50675675675676 - fluctuation
[7] => 1 - correct property
)
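To show how a threshold change removes such a fluctuation, here is a small sketch applied to the weights from the sample output above. The helper keepAbove and the two threshold values (0.5 and 0.75) are hypothetical; the article does not say which values it used.

```php
<?php
// Normalized weights copied from the sample output above.
$weights = [2 => 0.98648648648649, 6 => 0.50675675675676, 7 => 1.0];

// Hypothetical helper: keep only properties whose weight reaches the threshold.
function keepAbove(array $weights, float $threshold): array
{
    return array_filter($weights, fn ($w) => $w >= $threshold);
}

print_r(array_keys(keepAbove($weights, 0.5)));  // properties 2, 6 and 7 survive
print_r(array_keys(keepAbove($weights, 0.75))); // only properties 2 and 7 survive
```

A higher threshold suppresses noise but needs more iterations before real properties clear it, so the value has to be tuned against the expected share of bad answers.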
Possible improvements
The first improvement is filtering out errors. With a sufficient number of comparisons, we can exclude answers that deviate from the bulk of the results and thereby improve accuracy.
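One deliberately crude way to read this idea: for each known phone, find the most frequent answer and discard the answers that disagree with it. The vote log and the names majorityAnswer and dropOutliers are hypothetical, and a real filter could be softer than this.

```php
<?php
// Hypothetical vote log: known phone => answers (0, 1 or 2) collected from visitors.
$votes = [
    'A' => [1, 1, 2, 1, 0, 1],
    'E' => [2, 2, 2, 0, 2],
];

// Most frequent answer in a list; on a tie the lower answer wins. Null if empty.
function majorityAnswer(array $answers): ?int
{
    $counts = array_count_values($answers);
    $best = null;
    foreach ([0, 1, 2] as $value) {
        if (isset($counts[$value]) && ($best === null || $counts[$value] > $counts[$best])) {
            $best = $value;
        }
    }
    return $best;
}

// Keep only the answers that agree with the majority for that phone.
function dropOutliers(array $answers): array
{
    $majority = majorityAnswer($answers);
    return array_values(array_filter($answers, fn ($a) => $a === $majority));
}

foreach ($votes as $phone => $answers) {
    printf("%s: majority=%d, kept %d of %d answers\n",
        $phone, majorityAnswer($answers), count(dropOutliers($answers)), count($answers));
}
```

Only the surviving answers would then be fed into the weight update.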
The second improvement is to vary the weight of users' votes.
Users whose answers coincide with the majority's get a higher weight for their own votes. In subsequent “similarity” votes, such a user's vote counts for more, which should also reduce the spread.
An important caveat: this algorithm is assumed to operate in a friendly environment, where users may make mistakes, but not intentionally and not en masse.
I would be glad to hear questions, suggestions, or just comments.
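A possible shape for this, as a sketch only: each user carries a vote weight that moves up when they agree with the majority and down when they do not, and that weight scales the change applied to property weights. The step of 0.1, the bounds 0.1 and 2.0, and both function names are illustrative assumptions.

```php
<?php
// Hypothetical reputation update: agreeing with the majority raises a user's
// vote weight, disagreeing lowers it. Step and bounds are illustrative only.
function updateUserWeight(float $weight, int $answer, int $majority): float
{
    $weight += ($answer === $majority) ? 0.1 : -0.1;
    return max(0.1, min(2.0, $weight));
}

// A vote then changes property weights by the usual delta times the user weight.
function weightedDelta(int $answer, float $userWeight): float
{
    $delta = [0 => -1.0, 1 => 0.5, 2 => 1.0];
    return $delta[$answer] * $userWeight;
}

$alice = 1.0;
$bob = 1.0;
foreach ([[2, 2], [1, 1], [2, 2]] as [$answer, $majority]) {
    $alice = updateUserWeight($alice, $answer, $majority); // agrees every time
}
foreach ([[0, 2], [2, 1], [2, 2]] as [$answer, $majority]) {
    $bob = updateUserWeight($bob, $answer, $majority);     // agrees only once
}
printf("alice=%.1f bob=%.1f\n", $alice, $bob);
```

Bounding the weight keeps a single long-standing user from dominating the result and keeps an unlucky one from being silenced completely.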