Can tags beat categories? Tag hierarchies

What role do sections, categories, hubs, and other faceted classifications play in our online life? Is everything about them really obvious?
All these concepts came to us from the paper past, when rigid systematization was the only way to navigate books and documents. At first, categorization was almost the only way to navigate the Internet as well. Directories bloomed and multiplied; Yahoo is a vivid example of a catalog turned into a hugely successful project with a capitalization of $32 billion.


But the development of search technology considerably undermined the authority and relevance of directories and categories. Just as with the dinosaurs, the large and cumbersome catalogs were defeated by nimble, predatory search engines.
They were defeated in the field of global navigation on the web, but in separate ecological niches, that is, on individual sites, categorization remains a classic navigation tool.

Then tags (labels) appeared, and it seemed this tool would supplant categorization as an outdated and inflexible approach. But no, everyone kept their categories. And you can often see categories and tags coexisting in the same information space.

Why did tags not become a great revelation and the gravedigger of categories?

It seems to me for the following reasons:
1. Categories show the general theme of an information source (website, weblog, etc.) well. They are groomed in advance and do not suffer from the oddities that tags suffer from (unless the tags are moderated).

2. Similarly, content placed under a category is most likely there for good reason, whereas the tags attached to it are merely the subjective opinion of whoever assigned them.

3. In forum software, categories are the basis for moderating messages. User-generated content is therefore likely to land in a more or less relevant category, which makes life easier for those who read rather than write.

4. In online stores and other systems where classification rules, categories are the basis of navigation.

In sum, apart from the item about stores, all the advantages of categories come down to the predictable quality of the content found in a category.

Looked at from the other side, tags are too unpredictable both as markers of a message's topic and as indicators of its quality.

On the other hand, it is intuitively clear that tags are also a form of categorization, just a more general and democratic one. In fact, it is because of this democratic character that tags lose to categories: excessive freedom leads to uncertainty.

For example, using a tag as a navigator within the context of a message is one thing:
I read a news item about a mobile phone and see the tag "coverage" there. When I follow this tag, I am in effect following the pair "mobile communication" + "coverage".
If the text were about breeding ultra-dairy cows, the "coverage" tag would take on a different meaning: "cow" + "coverage".
But if I meet a tag without context, for example in the site's tag cloud, how do I distinguish "coverage A" from "coverage B"?


I will allow myself a seditious thought: an unambiguous interpretation of a tag's semantic context is possible only if we know the history of the reader's interests (behavioral analysis?). That is, if our reader usually delves into texts in categories such as "networks" or "mobile networks", we can understand what they mean by the tag "coverage". If we do NOT have such information, then everything is as usual ...

What to do?

From the point of view of developing tagging technologies, this means the following.

Using tags to search

1. In doubtful cases, it is better to select messages by a set of tags rather than by a single tag.
2. This set can consist of the Main tag (the one requested as the selection criterion for posts) and additional tags, that is, the labels where the user "often grazes".
3. Additional tags should, of course, frequently occur together with the Main one, that is, be "linked": strongly correlated with it.
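The three points above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `build_query` and `select`, the threshold `min_link`, the limit `max_extra`, and the idea of keeping the reader's history as a plain list of tags are all hypothetical and do not come from the article.

```python
from collections import Counter

def build_query(main_tag, user_history, posts, min_link=0.5, max_extra=2):
    """Return [main_tag] plus the reader's habitual tags that are linked to it.

    posts: list of tag sets, one per message.
    user_history: tags the reader followed before (hypothetical log).
    min_link: assumed minimum share of main_tag posts that also carry the extra tag.
    """
    with_main = [tags for tags in posts if main_tag in tags]
    if not with_main:
        return [main_tag]
    habitual = Counter(t for t in user_history if t != main_tag)
    extras = []
    for tag, _ in habitual.most_common():
        link = sum(1 for tags in with_main if tag in tags) / len(with_main)
        if link >= min_link:
            extras.append(tag)
        if len(extras) == max_extra:
            break
    return [main_tag] + extras

def select(posts, query):
    """Keep only the posts that carry every tag in the query."""
    return [tags for tags in posts if set(query) <= tags]

posts = [
    {"coverage", "mobile networks"},
    {"coverage", "mobile networks", "tariffs"},
    {"coverage", "cows"},
    {"cows", "milk"},
]
history = ["mobile networks", "mobile networks", "tariffs", "milk"]
query = build_query("coverage", history, posts)
selection = select(posts, query)
```

For a reader who mostly follows "mobile networks", the query for "coverage" picks up that tag as well, so the post about cows drops out of the selection.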

Tagging

Yes, this is the weakest point. Tags are usually assigned by the author, which is very subjective, and experiments with folksonomy are still not encouraging. We need other mechanisms to make tag assignment more rigorous: most likely moderation, plus tagging aids such as synonym handling or a search for analogues.

Hierarchies

Tags are not just words; they are words with an implied subject area (for example, the Russian word "kosa" can mean a braid (hairstyle), a spit (coast), or a scythe (tool)).
That is, a tag has at least two parameters: (word, subject area).
The subject area, in turn, is itself a collection of tags tightly connected by frequent joint appearances.
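One possible way to picture a tag as a (word, subject area) pair is shown below, using the article's homonym example (the Russian word "kosa": braid, spit, scythe). The `Tag` type, the `resolve` helper, and the sample subject areas are hypothetical; the overlap count is an assumed heuristic, not a method proposed in the article.

```python
from typing import FrozenSet, Iterable, List, NamedTuple

class Tag(NamedTuple):
    word: str
    area: FrozenSet[str]  # subject area: tags that tend to appear with this word

def resolve(word: str, context: Iterable[str], senses: List[Tag]) -> Tag:
    """Pick the sense of `word` whose subject area overlaps the context most."""
    ctx = set(context)
    candidates = [t for t in senses if t.word == word]
    return max(candidates, key=lambda t: len(t.area & ctx))

senses = [
    Tag("kosa", frozenset({"hairstyle", "hair"})),
    Tag("kosa", frozenset({"coast", "sea", "sand"})),
    Tag("kosa", frozenset({"tool", "hay", "mowing"})),
]
chosen = resolve("kosa", ["sea", "holiday"], senses)
```

Here the context is the set of other tags on the message (or tags from the reader's history), which is exactly the information a bare tag cloud lacks.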

Tag hierarchies have tremendous potential. These are natural hierarchies, that is, paths trodden by users rather than laid out by designers.

That concludes the preamble. What follows are very specific recipes from our mathematician Sergey Lvov. I really hope he will be given a voice on Habr (popolznev).

First steps to tag hierarchy

Author: Sergey Lvov

1. What we want

Imagine a network community (network here meaning a computer network) whose members communicate in writing: a kind of virtual discussion hut. Messages (posts, statements, notes, remarks, speeches) are saved and form a big heap, in which order can be established and connections drawn in various ways.

One way to bring order, or rather to form structures in the heap of messages, is tags (thematic labels). It is assumed that users invent labels themselves and attach them to their messages. Since community members are not restricted in inventing labels, the set of labels itself turns into a big heap, and for labels to become a tool for building structures, they must themselves be put in order. There are two fundamentally different ways to bring order to a heap: manually and automatically. We are, of course, interested in the second, although sometimes some manual fitting will probably be unavoidable.

Since labels are responsible for the thematic classification of messages, bringing order to the heap of labels means establishing a measure that says how thematically close any two labels are. This measure could be constructed as a metric (a distance) in the mathematical sense of the word (that is, for any two different labels the distance between them must be a positive number that does not depend on the order in which the labels are listed, and the distance from a label to itself is zero). But it seemed more convenient to us to make the measure resemble a correlation coefficient (the minimum value, for completely unrelated labels, is 0; the maximum value is 1). Moreover, this coefficient need not be symmetric: one label may be tied to a second more strongly than the second is to the first.

2. Elements and notation

$M$ denotes the set of all messages stored in our chat room: $M = \{m_1, \dots, m_N\}$. It is clear that the set $M$ changes with time and the number of messages grows, but we are not interested in dynamics: we consider the system at an arbitrary fixed moment. $S$ is the set of all tags currently available in the system: $S = \{s_1, \dots, s_{NN}\}$, where $NN$ is the number of labels. A correspondence (relation) is defined between the sets $M$ and $S$ which, by analogy with geometry (many readers will say: with graph theory! But it was the geometers' term), can be called incidence: message $m$ and label $s$ are incident if message $m$ is labeled with $s$.

Digression. The rules of the discussion hut may allow people other than the author to label a message. For our task this does not matter now: whoever labels the messages, the system is fixed in its state at the moment in question; what matters is only whether a message and a label are incident.

If $s_1, \dots, s_r$ are the labels that mark message $m$ (one can define the tag set of message $m$: $S(m) = \{s_1, \dots, s_r\}$), then the values $e^{[s_1]}(m), \dots, e^{[s_r]}(m)$ are defined, which are called the (per-label) significances of message $m$. How exactly they are calculated is not very important for us now, but they should be calculated so that the more meaningful the message, the higher its significance. Roughly speaking, the significance of a message is the number of "pluses" given by readers. A feature of our system: a plus is given not simply to a message but to a message together with a label (or labels).

Genetic links between the messages themselves are also taken into account: each message may have "descendants" and "ancestors" ("predecessors"). If message $m_2$ is written in response to message $m_1$, then $m_1$ is called the immediate ancestor (predecessor) of $m_2$, and $m_2$ the immediate descendant of $m_1$. If message $m_2$ is written in response to a message that is an immediate descendant of $m_1$, then $m_2$ is called a direct descendant (or simply a descendant) of $m_1$. If message $m_2$ is written in response to a message that is a descendant of $m_1$, then $m_2$ is also called a (direct) descendant of $m_1$. Direct ancestors (predecessors) are defined similarly.
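The ancestor and descendant relations above amount to walking a reply tree. A minimal sketch, assuming each message stores the id of the message it answers in a hypothetical `reply_to` mapping (`None` for a message that starts a thread):

```python
from collections import defaultdict

def immediate_descendants(reply_to):
    """Map each message id to the list of messages written directly in reply to it."""
    children = defaultdict(list)
    for msg, parent in reply_to.items():
        if parent is not None:
            children[parent].append(msg)
    return children

def descendants(msg, children):
    """All direct descendants of msg: replies, replies to replies, and so on."""
    found, stack = [], list(children.get(msg, []))
    while stack:
        cur = stack.pop()
        found.append(cur)
        stack.extend(children.get(cur, []))
    return found

def ancestors(msg, reply_to):
    """Chain of predecessors of msg, starting from the immediate one."""
    chain = []
    parent = reply_to.get(msg)
    while parent is not None:
        chain.append(parent)
        parent = reply_to.get(parent)
    return chain

reply_to = {"m1": None, "m2": "m1", "m3": "m2", "m4": "m1"}
children = immediate_descendants(reply_to)
```

In this toy thread `m3` answers `m2`, which answers `m1`, so `m3` is a descendant of `m1` but not an immediate one.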

3. Approach number 0: statistics of occurrences

The system we are building does not understand anything. To it, a label is just a string of characters (the problem of homonymy is set aside; we will assume that it is solved in some way). Therefore, the system can assess the thematic proximity of labels only from the frequency of their joint appearances: it is natural to assume that if a pair of labels often occurs together, they are thematically close. Let $\mu^{[s]}$ denote the number of messages marked with label $s$, $\mu^{[u]}$ the number of messages marked with label $u$, and $\mu^{[su]}$ the number of messages marked with both $s$ and $u$ at the same time.

A first attempt at defining the coefficient of thematic connection between labels $s$ and $u$:

$$\rho_0(s, u) = \frac{\mu^{[su]}}{\mu^{[s]} + \mu^{[u]} - \mu^{[su]}} \tag{1}$$

The value in the denominator is the number of messages carrying at least one of the labels $s$, $u$.
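Formula (1) is the share of messages carrying both labels among those carrying at least one of them. A minimal sketch, assuming every message is represented simply by its set of labels; the helper names `mu` and `rho0` and the zero returned for labels that never occur are illustrative choices, not part of the article:

```python
def mu(posts, *labels):
    """Number of messages that carry all of the given labels."""
    return sum(1 for tags in posts if set(labels) <= tags)

def rho0(posts, s, u):
    """Formula (1): co-occurrences divided by messages with at least one label."""
    both = mu(posts, s, u)
    either = mu(posts, s) + mu(posts, u) - both
    return both / either if either else 0.0

posts = [
    {"topology", "homology"},
    {"topology", "homotopy"},
    {"topology"},
    {"homology"},
]
value = rho0(posts, "topology", "homology")  # 1 joint message out of 4 with either label
```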
What is wrong with formula (1)? First of all, running through all the messages can take a very long time. It would be nice to have a good sample of messages to reduce the amount of work. All the more so because the following can happen. Clearly, descendant messages will often inherit the labels of their parents. And if a long branch of dialogue starts up in the system (on forums or on LJ this happens all the time) that is of no interest to anyone except its participants, it can skew the statistics.

A way out is as follows. We call a message nodal if it has more than one immediate descendant. Further, let $\mu_j^{[s]}$, $\mu_j^{[u]}$, $\mu_j^{[su]}$ be the numbers of nodal messages marked, respectively, with label $s$, with label $u$, and with both $s$ and $u$ simultaneously. Now we correct formula (1), considering not all messages but only the nodal ones:

$$\rho_{0j}(s, u) = \frac{\mu_j^{[su]}}{\mu_j^{[s]} + \mu_j^{[u]} - \mu_j^{[su]}} \tag{2}$$
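A sketch of formula (2): first pick out the nodal messages (more than one immediate descendant), then count labels only among them. The data layout is an assumption made for illustration: `reply_to` maps a message id to the id it answers, and `tags_of` maps a message id to its set of labels.

```python
from collections import Counter

def nodal_ids(reply_to):
    """Ids of messages that have more than one immediate descendant."""
    replies = Counter(p for p in reply_to.values() if p is not None)
    return {m for m, n in replies.items() if n > 1}

def rho0_nodal(tags_of, reply_to, s, u):
    """Formula (2): the same ratio as formula (1), counted over nodal messages only."""
    nodes = [tags_of[m] for m in nodal_ids(reply_to)]
    n_s = sum(1 for t in nodes if s in t)
    n_u = sum(1 for t in nodes if u in t)
    n_su = sum(1 for t in nodes if s in t and u in t)
    either = n_s + n_u - n_su
    return n_su / either if either else 0.0

reply_to = {"a": None, "b": "a", "c": "a", "d": "b", "e": "d", "f": "d"}
tags_of = {
    "a": {"phones", "coverage"},
    "b": {"phones"},
    "c": {"phones"},
    "d": {"coverage"},
    "e": {"coverage"},
    "f": {"coverage"},
}
nodes = nodal_ids(reply_to)  # "a" and "d" each have two replies
value = rho0_nodal(tags_of, reply_to, "phones", "coverage")
```

Messages `b`, `c`, `e`, and `f` inherit labels from their parents but do not enter the statistics, which is the effect the nodal restriction is meant to have.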

We got rid of one defect of formula (1), but that is not all. It is not good that formula (1), and with it formula (2), is symmetric: in reality the relationship between labels is asymmetric. But this is easy to fix.

We introduce the coefficient of dependence of the label s on the label u :

$$\rho_{1j}(s, u) = \frac{\mu_j^{[su]}}{\mu_j^{[s]}} \tag{3}$$

The meaning of this formula is simple: the more often label $s$ occurs apart from label $u$, the weaker the dependence of $s$ on $u$.
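The asymmetry of formula (3) is easy to see on a toy example. In this sketch `node_tags` is assumed to be the list of label sets of the nodal messages only, and the name `rho1` is illustrative:

```python
def rho1(node_tags, s, u):
    """Formula (3): share of the nodal messages labeled s that are also labeled u."""
    n_s = sum(1 for t in node_tags if s in t)
    n_su = sum(1 for t in node_tags if s in t and u in t)
    return n_su / n_s if n_s else 0.0

node_tags = [
    {"topology", "homology"},
    {"topology"},
    {"topology"},
    {"topology", "homology"},
]
narrow_on_broad = rho1(node_tags, "homology", "topology")
broad_on_narrow = rho1(node_tags, "topology", "homology")
```

"homology" never appears without "topology", so it depends on it completely, while "topology" appears with "homology" only half the time. An asymmetry of this kind is what could later suggest which label sits higher in a hierarchy.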

Note that all three formulas give 1 if the same label is substituted for both arguments: $\rho_0(s, s) = \rho_{0j}(s, s) = \rho_{1j}(s, s) = 1$. This is a natural normalization condition.

Let's go further. In all the formulas presented so far, the key element is the frequency of simultaneous occurrences of two labels. However, sometimes the semantic closeness of two labels may be the very reason they do not occur together. For example, what one person calls "pomidory" another calls "tomaty" (two Russian words for tomatoes), so one user will always use only the first label and another only the second. If we rule out assigning synonym status to these two labels a priori (we are still excluding such methods from consideration), then we can hope to catch the closeness of labels that often appear together with some third label. For example, if the label "homology" often goes along with the label "topology", and the label "homotopy" often goes along with the label "topology", then even without knowing anything about homology and homotopy we can assume that these things have something in common and are thematically close. This can be formulated as follows (we still consider only the nodal messages):

$$\rho_{2j}(s, u) = \max_{v \in S} \left( \frac{\mu_j^{[sv]}}{\mu_j^{[s]}} \cdot \frac{\mu_j^{[uv]}}{\mu_j^{[u]}} \right) \tag{4}$$

Here, however, we are back to symmetry: by construction, $\rho_{2j}(s, u) = \rho_{2j}(u, s)$. The meaning of this symmetry is that we are now looking for a third label to which both compared labels are close. Note that always $\rho_{2j}(s, u) \ge \rho_{1j}(u, s)$: for $v = s$ the first factor on the right side of formula (4) equals 1 and the second coincides with $\rho_{1j}(u, s)$, so the value $\rho_{1j}(u, s)$ is certainly attained, and the maximum may be even larger. By the same argument with $v = u$, also $\rho_{2j}(s, u) \ge \rho_{1j}(s, u)$.
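Formula (4) can be sketched with the article's own example of "homology" and "homotopy", which never meet directly but share the neighbor "topology". As before, `node_tags` is assumed to hold the label sets of nodal messages; `share` and `rho2` are illustrative names.

```python
def share(node_tags, s, v):
    """Share of the nodal messages labeled s that are also labeled v."""
    n_s = sum(1 for t in node_tags if s in t)
    n_sv = sum(1 for t in node_tags if s in t and v in t)
    return n_sv / n_s if n_s else 0.0

def rho2(node_tags, s, u):
    """Formula (4): best product of the two shares over every third label v."""
    labels = set().union(*node_tags)
    return max(
        (share(node_tags, s, v) * share(node_tags, u, v) for v in labels),
        default=0.0,
    )

node_tags = [
    {"homology", "topology"},
    {"homology", "topology"},
    {"homotopy", "topology"},
    {"homotopy", "analysis"},
]
direct = share(node_tags, "homology", "homotopy")  # the labels never co-occur
via_third = rho2(node_tags, "homology", "homotopy")
```

The direct coefficient is zero, while the maximum in formula (4) is reached at `v = "topology"`: every "homology" message and half of the "homotopy" messages carry it.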

You can try to further enrich the formula:

$$\rho_{3j}(s, u) = \max\left( \max_{v \in S} \left( \frac{\mu_j^{[sv]}}{\mu_j^{[s]}} \cdot \frac{\mu_j^{[uv]}}{\mu_j^{[u]}} \right),\ \max_{v \in S} \left( \frac{\mu_j^{[sv]}}{\mu_j^{[s]}} \cdot \frac{\mu_j^{[uv]}}{\mu_j^{[v]}} \right) \right) \tag{5}$$
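Formula (5) adds a second maximum in which the last fraction is normalized by the third label $v$ instead of by $u$. A sketch under the same assumptions as before (label sets of nodal messages in `node_tags`, illustrative helper names); here the second fraction is written as `share(node_tags, v, u)`, the share of messages labeled $v$ that also carry $u$:

```python
def share(node_tags, a, b):
    """Share of the nodal messages labeled a that are also labeled b."""
    n_a = sum(1 for t in node_tags if a in t)
    n_ab = sum(1 for t in node_tags if a in t and b in t)
    return n_ab / n_a if n_a else 0.0

def rho3(node_tags, s, u):
    """Formula (5): the larger of the two maxima over every third label v."""
    labels = set().union(*node_tags)
    first = max(
        (share(node_tags, s, v) * share(node_tags, u, v) for v in labels),
        default=0.0,
    )
    second = max(
        (share(node_tags, s, v) * share(node_tags, v, u) for v in labels),
        default=0.0,
    )
    return max(first, second)

node_tags = [
    {"homology", "topology"},
    {"homology", "topology"},
    {"homotopy", "topology"},
    {"homotopy", "analysis"},
]
value = rho3(node_tags, "homology", "homotopy")
```

On this data the first maximum (0.5, through "topology") dominates; the second term matters when $u$ is rare overall but makes up a large part of some label $v$ that accompanies $s$.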

4. The possibility of other approaches

All the formulas given in the previous section are attempts to guess thematic proximity from statistics. What else can be done? We can try to use the features of our discussion hut: recall that its main highlight is the system of significances. Any of formulas (1) - (5) can be modified in the following way. The numbers $\mu$ with various indices are counts of messages satisfying one condition or another (which conditions exactly is determined by the indices attached to $\mu$). A count of messages is a sum of ones: for each message that satisfies the necessary conditions, 1 is added to the value of $\mu$. If instead of 1 we add a value that depends on the significance of the message (since at least two labels are involved in all our formulas, at least two significances can be used), we get more refined formulas. But whether this refinement is for the better is a difficult question. We will not discuss it in detail now and will postpone it for later.
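One way to read this modification, shown here for formula (1): every message contributes its significance instead of 1. The article leaves open how to combine the two significances of a message that carries both labels; taking the smaller of the two is purely an assumption of this sketch, as are the names `weighted_mu` and `rho0_weighted` and the representation of a message as a mapping from label to significance.

```python
def weighted_mu(messages, *labels):
    """Sum of significances over the messages that carry all the given labels.

    messages: list of dicts mapping label -> significance e^{[label]}(m).
    For several labels the smallest significance is used (an assumed rule).
    """
    return sum(
        min(m[label] for label in labels)
        for m in messages
        if all(label in m for label in labels)
    )

def rho0_weighted(messages, s, u):
    """Formula (1) with significance-weighted sums in place of plain counts."""
    both = weighted_mu(messages, s, u)
    either = weighted_mu(messages, s) + weighted_mu(messages, u) - both
    return both / either if either else 0.0

messages = [
    {"phones": 2, "coverage": 1},
    {"phones": 1},
    {"coverage": 1},
]
value = rho0_weighted(messages, "phones", "coverage")
```

With every significance equal to 1 this reduces to the plain counts of formula (1), and with non-negative significances the minimum rule keeps the result between 0 and 1.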

Apparently, besides what has already been said, only the methods of "manual work" and administration or moderation remain. For example, one can set up several thematic blocks in advance and, whenever a user creates a new label, suggest placing it in one of the blocks and establishing hierarchical links or bindings to already existing labels.

Source: https://habr.com/ru/post/229427/
