Tags are an integral part of every modern site and an indirect sign that the site belongs to the notorious Web 2.0.
In this article I want to talk about methods and algorithms for tagging information.
When organizing tags, there are several weak points and bottlenecks, namely:
- adding and changing the tags attached to an object
- creating and changing the tags themselves
- displaying tags on the page
- searching by tag
- assigning tag aliases
- building a tag cloud
Unfortunately, the author does not know of a universal algorithm that would easily solve all of these problems. Now to the algorithms themselves.
Ordinary many-to-many relationship
There is a huge table of tags, and there are huge tables of tagged information. They are connected through a third table, which turns out to be very large. For example, with 50,000 articles and 10,000 tags, where each article is associated with 4 tags on average, the link table has 200,000 rows.
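A minimal sketch of this layout, using Python's built-in sqlite3 module purely for illustration. The table and column names (articles, tags, article_tags) and the sample rows are assumptions, not taken from any real project. It shows the three tables, the join needed to list the tags of one article, and the grouped count that a tag cloud needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- Link table: one row per (article, tag) pair.
    CREATE TABLE article_tags (
        article_id INTEGER NOT NULL REFERENCES articles(id),
        tag_id     INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (article_id, tag_id)
    );
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO articles (id, title) VALUES (?, ?)",
                 [(1, "Full-text search"), (2, "Tag clouds")])
conn.executemany("INSERT INTO tags (id, name) VALUES (?, ?)",
                 [(1, "mysql"), (2, "search"), (3, "tags")])
conn.executemany("INSERT INTO article_tags (article_id, tag_id) VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2), (2, 3)])


def tags_for_article(conn, article_id):
    """List the tag names of one article via the link table."""
    rows = conn.execute("""
        SELECT t.name
        FROM article_tags AS link
        JOIN tags AS t ON t.id = link.tag_id
        WHERE link.article_id = ?
        ORDER BY t.name
    """, (article_id,))
    return [name for (name,) in rows]


def tag_cloud(conn):
    """Count how many articles carry each tag."""
    rows = conn.execute("""
        SELECT t.name, COUNT(*)
        FROM article_tags AS link
        JOIN tags AS t ON t.id = link.tag_id
        GROUP BY t.id, t.name
        ORDER BY t.name
    """)
    return dict(rows)


# Size of the link table from the example: articles times average tags each.
link_rows = 50_000 * 4
```

With this layout the cloud is a single grouped query, which is why it appears among the advantages of this method.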
Pros:
- no problems building tag clouds
- no problems with aliases
- no problems creating and modifying tags
- no problems with the “tag list”
Cons:
- adding and changing the tags attached to an object is laborious, since every link that changes requires a separate INSERT or DELETE. Creating a tag takes one more INSERT. If some tags are used only once (which is often the case), they consume resources by inflating the tables while bringing almost no practical benefit.
- retrieving and displaying tags requires a JOIN of 3 huge tables. From the example above: a table of 50,000 rows joined with a table of 200,000 rows joined with a table of 10,000 rows. This is already slow with this amount of data. Given that in practice another 2-3 large tables have to be joined in (for example, a user table and a rating table), the picture is far from rosy. Yes, I know that you can cache, but that is not the point here.
- searching by tag again requires joining large tables
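To illustrate the first drawback, here is a rough sketch of what changing the tags of one object involves in the many-to-many layout. It uses Python's sqlite3 module; the schema and the function name set_article_tags are hypothetical. Every link that appears or disappears costs its own INSERT or DELETE, and a previously unseen tag costs one more INSERT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE article_tags (
        article_id INTEGER NOT NULL,
        tag_id     INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (article_id, tag_id)
    );
""")


def set_article_tags(conn, article_id, new_names):
    """Make the article's tag set equal to new_names.

    Returns (added, removed) so the caller can see how many
    separate statements the change required.
    """
    current = {name for (name,) in conn.execute("""
        SELECT t.name
        FROM article_tags AS link
        JOIN tags AS t ON t.id = link.tag_id
        WHERE link.article_id = ?
    """, (article_id,))}
    wanted = set(new_names)
    to_add = sorted(wanted - current)
    to_remove = sorted(current - wanted)

    for name in to_add:
        # One INSERT if the tag itself is new...
        conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
        (tag_id,) = conn.execute(
            "SELECT id FROM tags WHERE name = ?", (name,)).fetchone()
        # ...and one INSERT for the link.
        conn.execute(
            "INSERT INTO article_tags (article_id, tag_id) VALUES (?, ?)",
            (article_id, tag_id))

    for name in to_remove:
        # One DELETE per link that goes away.
        conn.execute("""
            DELETE FROM article_tags
            WHERE article_id = ?
              AND tag_id = (SELECT id FROM tags WHERE name = ?)
        """, (article_id, name))

    conn.commit()
    return to_add, to_remove


first = set_article_tags(conn, 1, ["mysql", "search"])
second = set_article_tags(conn, 1, ["search", "tags"])
```

Note that the tag "mysql" stays in the tags table after the second call even though nothing uses it any more, which is the kind of dead weight described above for rarely used tags.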
Using full-text search
The algorithm is described in my article "Full-text search and its capabilities".
Now, how it applies directly to tags. The field with the full-text index holds the tags themselves, exactly as they were written. Objects are selected exclusively through this field, and the same field determines which tags an object belongs to. This means that if a tag is in Russian, the link to it must contain Russian letters. That causes problems, because the letters can be encoded with urlencode, and the result depends on the encoding. That is, the same tag has to be decoded differently depending on the encoding of the page. You can, of course, transliterate Russian words into Latin letters and store the result in the field alongside the Russian words. Then the tag is displayed in Russian, while the link to it is in Latin letters and the search is performed in Latin letters too. A poor way out, but a way out.
Pros:
- no problems displaying tags
- no problems searching by tag
- no problems adding tags to an object or changing them
- no problems with aliases (more precisely, there are some, but they can be solved)
- no problems creating tags
- it is easy to search by several tags at once rather than just one, and to find similar materials
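The link problem with Russian tags described earlier in this section can be reproduced in a few lines. The sketch below uses Python's urllib.parse to show that the same tag gives different percent-encoded links under UTF-8 and cp1251, and then applies the transliteration workaround. The sample tag and the TRANSLIT table are illustrative assumptions; real transliteration schemes vary.

```python
from urllib.parse import quote, unquote

tag = "поиск"  # sample tag, meaning "search"

# The same tag, percent-encoded under two page encodings.
utf8_link = quote(tag, encoding="utf-8")
cp1251_link = quote(tag, encoding="cp1251")

# Decoding only works if you know which encoding produced the link.
decoded_right = unquote(cp1251_link, encoding="cp1251")
decoded_wrong = unquote(cp1251_link, encoding="utf-8")

# Hypothetical transliteration table; one of many possible schemes.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "yo", "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "ts",
    "ч": "ch", "ш": "sh", "щ": "sch", "ъ": "", "ы": "y", "ь": "",
    "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(text):
    """Replace Russian letters with Latin ones; leave other characters alone."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text.lower())


# The Latin form is plain ASCII, so it needs no percent-encoding at all.
latin_link = quote(transliterate(tag))
```

The price of the workaround is that both forms have to be stored and kept in sync, which is presumably why it is called a poor way out.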
Cons:
- a tag cannot simply be renamed or deleted: the change has to be made in the fields of every object that carries the tag
- building tag clouds is very problematic. It can be solved like this: process the “tag” fields of all the tables, count how often each individual tag occurs (direct access to the full-text index itself for this analysis would be so nice), and build the cloud from these counts. Then cache it for a long time.
- it is difficult to make a "drop-down list of tags"
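The cloud-building and renaming procedures described in this list could look roughly like the sketch below. It assumes, purely for illustration, that each object stores its tags as one comma-separated string; the helper names split_tags, build_cloud and rename_tag are hypothetical, and caching is left out.

```python
from collections import Counter


def split_tags(field):
    """Split one stored tag field into normalized tag names."""
    return [t.strip().lower() for t in field.split(",") if t.strip()]


def build_cloud(tag_fields):
    """Scan every object's tag field and count how many objects use each tag."""
    counts = Counter()
    for field in tag_fields:
        # set() so a tag repeated within one field is counted once.
        counts.update(set(split_tags(field)))
    return counts


def rename_tag(tag_fields, old, new):
    """Renaming means rewriting the field of every object, one by one."""
    renamed = []
    for field in tag_fields:
        tags = [new if t == old else t for t in split_tags(field)]
        renamed.append(", ".join(tags))
    return renamed


# Hypothetical tag fields of three objects.
fields = ["mysql, search", "search, tags", "Search"]
cloud = build_cloud(fields)
after_rename = rename_tag(fields, "search", "fulltext")
```

Both functions walk over every object, so their cost grows with the number of tagged objects. That is the reason for caching the result for a long time.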
Alternatively, the two methods can be combined: search through the full-text index, while the tags themselves and their usage frequencies are kept in a separate table. Or variations on the same theme. This solves the problems with the drop-down list and the cloud, but it complicates displaying, adding and creating tags.
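One possible reading of the combined scheme, sketched with Python's sqlite3 module. The tag_stats table, the comma-separated tags field and the function names are all assumptions made for the example, and the full-text index on the tags field is not shown. The sketch only demonstrates the extra bookkeeping on every save, which is where the added difficulty comes from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The tags field is what the full-text index would be built on.
    CREATE TABLE articles (id INTEGER PRIMARY KEY, tags TEXT NOT NULL);
    -- Separate table: each tag and how many objects use it.
    CREATE TABLE tag_stats (name TEXT PRIMARY KEY, uses INTEGER NOT NULL);
""")


def parse(field):
    """Turn a stored tag field into a set of normalized tag names."""
    return {t.strip().lower() for t in field.split(",") if t.strip()}


def save_tags(conn, article_id, tags_text):
    """Store the tag field and keep the usage counters in step with it."""
    row = conn.execute(
        "SELECT tags FROM articles WHERE id = ?", (article_id,)).fetchone()
    old = parse(row[0]) if row else set()
    new = parse(tags_text)

    if row is None:
        conn.execute("INSERT INTO articles (id, tags) VALUES (?, ?)",
                     (article_id, tags_text))
    else:
        conn.execute("UPDATE articles SET tags = ? WHERE id = ?",
                     (tags_text, article_id))

    for name in new - old:
        conn.execute(
            "INSERT OR IGNORE INTO tag_stats (name, uses) VALUES (?, 0)",
            (name,))
        conn.execute(
            "UPDATE tag_stats SET uses = uses + 1 WHERE name = ?", (name,))
    for name in old - new:
        conn.execute(
            "UPDATE tag_stats SET uses = uses - 1 WHERE name = ?", (name,))
    # Drop tags that nothing uses any more.
    conn.execute("DELETE FROM tag_stats WHERE uses <= 0")
    conn.commit()


def cloud(conn):
    """The cloud and the drop-down list now come from one small table."""
    return dict(conn.execute(
        "SELECT name, uses FROM tag_stats ORDER BY name"))


save_tags(conn, 1, "mysql, search")
save_tags(conn, 2, "search, tags")
before = cloud(conn)
save_tags(conn, 1, "search")
after = cloud(conn)
```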
If anyone knows other ways to organize tags, I would be interested to hear about them. Constructive criticism is welcome.