
It's not us, it's life: thematic analysis for the most impatient


Why?


Relap.io currently generates 40 billion recommendations per month across 2000 media platforms of the Runet. Sooner or later almost any recommender system needs to take the content of the recommended items into account, and rather quickly runs into the need to classify it somehow: to find clusters, or at least to reduce the dimensionality, in order to describe user interests, attract advertisers, or pursue other goals, dark or not so dark.

The task sounds quite standard, and there are many well-proven algorithms and implementations: Latent Dirichlet Allocation (LDA), Probabilistic Latent Semantic Analysis (pLSA), Explicit Semantic Analysis (ESA), the list goes on. However, we decided to try to come up with something simpler, yet still viable.

There are several reasons for this decision, but the main one was laziness and an unwillingness to wait for models to train. More seriously, we have a fairly large amount of data from numerous sources (hundreds of thousands of popular articles and millions of published ones), and the problem of guessing the number of topics into which we want to sort everything was far from obvious (so much so that we could not even guess the order of magnitude: 100 topics? 1000?). With such inputs, training an LDA or pLSA model would be quite inefficient, especially considering the ever-growing corpus. We wanted something faster, perhaps less accurate, but able to sort at least 70% of the documents into buckets, discover the number and size of those buckets along the way, and then build some kind of ontology on top of them.

How?


How do we approach such a task? We obviously need some kind of generative model that links as many words as possible into topics (semantic fields).

Despite the desire to replace the bicycle with a freshly invented scooter, we do not abandon the basic principles of thematic analysis: we still represent documents as unordered "bags of words", and we consider not even the words themselves but the stems obtained by running all texts through a Porter stemmer. We do not resolve homonymy and do not store any grammatical information. The sample consists of short news items (publications, in fact only the title and the first paragraph). We also know how often each publication was read (this knowledge can be useful for ranking articles by importance, relevance, and so on).
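A minimal sketch of this preprocessing step, assuming NLTK's Snowball stemmer as a stand-in for the Porter stemmer actually used (the function names are illustrative, not the production code):

    import re
    from collections import Counter

    from nltk.stem.snowball import SnowballStemmer

    # Snowball is used here as a stand-in for the Porter stemmer mentioned
    # in the text; for Russian news it would be SnowballStemmer("russian").
    stemmer = SnowballStemmer("russian")

    TOKEN_RE = re.compile(r"\w+", re.UNICODE)

    def to_bag_of_stems(title: str, first_paragraph: str) -> Counter:
        """Turn a title + first paragraph into an unordered bag of stems."""
        text = f"{title} {first_paragraph}".lower()
        stems = (stemmer.stem(tok) for tok in TOKEN_RE.findall(text))
        return Counter(stems)  # word order and grammar are deliberately discarded

The later sketches in this post operate on such bags.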

In order to understand what we have simplified, let us first recall what LDA is:

$$ p(d, w) = \sum_{t \in T} p(d)\, p(w \mid t)\, p(t \mid d) $$

Here $p(d, w)$ is the probability of a "document-word" pair occurring; it is a sum over all topics ($T$ is the set of topics) of products of the probability of the document itself (computed as the ratio of the document length to the corpus length), the probability of the word occurring in the topic, and the weight of the topic in the document (more precisely, the probability of the topic being present in the document). All the terms of the sum can be computed via Bayes' formula; the trouble is that we do not know the prior distributions of either topics or words. Moreover, our documents are all roughly the same length, since we only store snippets of about the same size, sufficient for an annotation, so $p(d)$ is irrelevant for us: it carries no information. In other words, in the factorization

$$ p(d, w) = p(d) \sum_{t \in T} \phi_{wt}\, \theta_{td}, \qquad \phi_{wt} = p(w \mid t), \quad \theta_{td} = p(t \mid d), $$

we know neither $\phi_{wt}$ nor $\theta_{td}$, while $p(d)$ is the same for all documents, so we cannot compute anything directly. LDA works under the assumption that the document topic vectors $\theta_d$ are generated by a parametric distribution from the Dirichlet family, $\mathrm{Dir}(\alpha)$. Here we ran into the first restriction we do not like: we have no idea about the number of topics $|T|$, except that it is probably quite large, and optimization over such a family of distributions will therefore be rather computationally heavy.

What can be simplified here? Let's see what can be computed without any tricks and what benefit we can extract from it.

Let's try something quite primitive: what is $p(w \mid t)$, the probability of a topic generating a word? If we know the topic of a text, we can guess how likely a given word is to appear in it. Bayes' formula can always be "turned upside down" to compute the probability that a word belongs to a topic, or rather, the probability that a topic is present in a document containing that word.
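For reference, the "upside down" version is just Bayes' rule, written in the notation introduced above:

$$ p(t \mid w) = \frac{p(w \mid t)\, p(t)}{p(w)}, \qquad p(w) = \sum_{t' \in T} p(w \mid t')\, p(t'). $$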

The problem is that we do not have a distribution over topics, only some word statistics. It may make sense to simplify our view of the corpus, not think about how documents are generated (which is, in fact, the basis of "proper" thematic analysis), and focus exclusively on the "relationships" between words.

In our case a topic is simply a set of words with some weights, words that occur together in documents describing interrelated things. Can we check whether two words belong to the same topic? It seems to us that we can, and with fairly simple calculations. A similar approach works well for extracting collocations, but that is on ordered sets; we do not store information about word order, yet we do know how frequently words co-occur within the same document. The trouble with the joint distribution of word pairs within a document is that there are a great many pairs, and most of them are completely meaningless.

Intuitively, the assumption that two words belonging to the same topic are more likely to occur together than two words from different topics raises no particular doubts. Let us note right away that we are talking about words with a pronounced affiliation to a more or less clearly defined topic: the words "kingpin" and "Lada" are probably found in texts on automotive themes, while the words "carburetor" and "mayonnaise" are unlikely to meet (or our imagination is simply not enough to come up with an example). On the other hand, most verbs and adjectives fit harmoniously into a text on almost any subject:
A resident of the city of N was killed by the explosion of a huge kingpin

(the author is aware that kingpins do not usually explode and that there is no city called N in Russia) or
The guests were killed on the spot by a huge mayonnaise cake

If you somehow find "semantically loaded" words, then it makes sense to look at what other words are found with them.

Let's consider

$$ p(w_i \mid w_j) = \frac{p(w_i, w_j)}{p(w_j)}, $$

the probability that word $w_i$ is present in a document given that word $w_j$ is there. Such quantities can be counted directly from the corpus, but if we take all possible pairs into account, we again arrive at the need to compute on the order of $N^2$ probabilities, where $N$ is the size of our dictionary (Pushkin's vocabulary was about 40 thousand entries, while the news published by our publishers contains more than 200 thousand lemmas per hour; we will refrain from drawing conclusions for now).

The word $w_j$ acts as a kind of generator for the word $w_i$. So, if we pick dependent word generators intelligently, we can obtain some meaningful sets of words. Shall we try?
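A minimal sketch of estimating these conditional probabilities from document-level co-occurrence (pure Python over the bags of stems from the preprocessing sketch; the names are illustrative):

    from collections import Counter, defaultdict
    from itertools import combinations

    def cooccurrence_stats(docs):
        """docs: iterable of bags of stems (sets or Counters), one per document."""
        doc_freq = Counter()                 # n(w): documents containing w
        pair_freq = defaultdict(Counter)     # n(w_i, w_j): documents containing both
        for doc in docs:
            words = set(doc)
            doc_freq.update(words)
            for a, b in combinations(sorted(words), 2):
                pair_freq[a][b] += 1
                pair_freq[b][a] += 1
        return doc_freq, pair_freq

    def p_cond(w_i, w_j, doc_freq, pair_freq):
        """Estimate p(w_i | w_j) as n(w_i, w_j) / n(w_j)."""
        if doc_freq[w_j] == 0:
            return 0.0
        return pair_freq[w_j][w_i] / doc_freq[w_j]

In practice the full $N^2$ table is never built: as described below, the conditioning words $w_j$ are restricted to a short list of candidate generators.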

Let us return to the "semantically meaningful" words; the voices in the head begin to whisper, softly but persistently: "tf-idf, tf-idf".

Let's not fight the demons of the obvious and figure out how to use tf-idf to determine which words are more important than others. Compute tf-idf within each document and reduce each document to a reasonable number of keywords (simply by sorting the words by their tf-idf values and keeping the desired number of words with the highest values). There is hope that the dictionary will shrink and become easier to deal with.

On the other hand, while we have shrunk the documents, we have also inflated the "value" of words with a very narrow meaning: if the corpus contains a single article describing in detail the mating habits of Houyhnhnms, the word "Houyhnhnm" will receive a high weight in that article and become a candidate for its own semantic field, which undoubtedly has a right to exist but is unlikely to help much in the subsequent classification of new articles. This can be avoided by aggregating the tf-idf of words across the whole corpus and ranking the words once again, this time not within documents but corpus-wide.
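A sketch of this selection and aggregation step, written from scratch rather than with any particular library. Since the experiments below first use a word's average tf-idf over the documents where it occurs and then an aggregated score, the sketch returns both the averaged and the summed variants:

    import math
    from collections import Counter

    def tfidf_rankings(docs, top_k_per_doc=20):
        """docs: list of Counters (bags of stems).
        Returns per-document keyword lists plus corpus-wide word scores."""
        n_docs = len(docs)
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(set(doc))
        idf = {w: math.log(n_docs / df) for w, df in doc_freq.items()}

        summed = Counter()       # aggregated (summed) tf-idf per word
        seen_in = Counter()      # number of contributing documents, for averaging
        keywords = []
        for doc in docs:
            total = sum(doc.values())
            scores = {w: (c / total) * idf[w] for w, c in doc.items()}
            keywords.append(sorted(scores, key=scores.get, reverse=True)[:top_k_per_doc])
            for w, s in scores.items():
                summed[w] += s
                seen_in[w] += 1

        averaged = {w: summed[w] / seen_in[w] for w in summed}
        return keywords, averaged, summed

Dropping the few hundred most popular words, as done further below, then amounts to removing the entries with the lowest idf before ranking.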

What happened?


To avoid being unsubstantiated, let's see what happens if we take a sample of articles read in the last half hour and sort the words the way the demons of the obvious suggested. First, let's simply select the words with significant tf-idf (without aggregation, taking a word's tf-idf to be its average over all articles where it was encountered):
Latin, leagoo, melano, matron, material, polyp, tabl, ssd-accumulator, two-hour, tournament, fitch…

The words are not bad, in the sense that they clearly point to some well-defined topics, but it is unclear whether those topics really matter in our corpus of about 280 thousand short articles. Let's try aggregating and see what happens:
airborne, music, hare, portal, jumping, free, new, well, well, who, myself ...

The result is nonsense: frequently occurring words, even with the frequency penalty, still "float up" to the top. Let's try to fight this simply by discarding the most popular words from consideration. Without much thought, we threw out the 300 words with the lowest idf; here is what we got (for clarity, instead of the top 10 we show the top 30 words).
streets, successful, kreter, temple, thousand, lifan, fully, cr, bisquit, raised, brit, central, pres, sign, lead, included, magnetic, borodin, dol, machines, interaction, excluded, snow, pedicure, napkin, creator, latinin, michael, est, competence, raped ...

This list is already more meaningful. First, we encounter the mention of "latinin" (first place in the first list and 26th in this one); this is clearly a surname, and apparently something important has happened to this person (most likely it is Yulia Latynina, but we cannot guarantee that yet). Lifan is a car brand whose presence is expected: a significant share of the traffic in this sample comes from automotive forums. The remaining words in the list also look plausible: the traffic always includes discussions of recipes ("biscuit"), economics ("central bank"), and the like. Just by looking at the list you can hardly tell what readers are concerned about at the moment, but you can already see that the words refer to different topics and events. For now this is enough: we are only looking for a place to start, aren't we?

Let's start the actual generation of topics. For now we will not worry about how many topics we will end up with; we simply take some (large) part of the resulting list, generate topics from it, and then think about what to do with them.

Before putting the topics on public display, let's spend another couple of minutes thinking about the criterion for deciding which words to keep in a topic and which to throw out once the probabilities have been computed. A hard cutoff on the absolute value is not a good idea: depending on the size of the topic, the conditional probabilities can differ quite strongly. Instead, let's use the ratio to the most probable word in the topic: for a topic generated by the word $w_g$,

$$ r(w) = \frac{p(w \mid w_g)}{\max_{w'} p(w' \mid w_g)}, $$

we set some cutoff $\gamma$ and reject every word for which $r(w) < \gamma$.
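A minimal sketch of generating one topic from a seed word with this cutoff, reusing the counts from the co-occurrence sketch above (the exact value of $\gamma$ is not given in the article, so it is left as a parameter):

    def generate_topic(seed, doc_freq, pair_freq, gamma, stop_words=frozenset()):
        """Return a topic as (word, p(word | seed)) pairs, sorted by probability,
        keeping only words whose probability is at least gamma times the maximum
        probability in the topic (which is p(seed | seed) = 1)."""
        n_seed = doc_freq[seed]
        if n_seed == 0:
            return []
        probs = {seed: 1.0}
        for w, n_pair in pair_freq[seed].items():
            if w not in stop_words:
                probs[w] = n_pair / n_seed
        p_max = max(probs.values())
        kept = [(w, p) for w, p in probs.items() if p / p_max >= gamma]
        return sorted(kept, key=lambda wp: -wp[1])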



Let's start with a fairly relaxed requirement, set the cutoff $\gamma$ low, and see what happens. For readability we show only the initial words of each topic (the topics are quite long; the numbers after each word are probabilities):
0 = "(flight log, List ((flight log, 1.0), (mazda, 0.02639662123248224),
(mercedes-benz, 0.025020797337940742), (subaru, 0.02498880143341652),
(Prior, 0.024668842388174312), (octavia, 0.024380879247456324),
(front, 0.024316887438407882), (sedan, 0.02098931336788891),
(Kalin, 0.01785371472451526) ...

1 = "(music, List ((music, 1.0), (feat, 0.031221582297665956),
(song, 0.016469637263817317), (dj, 0.013438415681519652),
(Alexander, 0.012630089926240274), (Irin, 0.012326967768010509),
(Vladimir, 0.011417601293321209), (circle, 0.008992624027483076),
(mikha, 0.008487420430433464), (serg, 0.008386379711023543),
(Grigor, 0.007982216833383854), (Caspian, 0.007780135394564009),
(Viktor, 0.007780135394564009), (cargo, 0.007679094675154087),
(leps, 0.0072749317975143975) ...

2 = "(hare, List ((hare, 1.0), (feat, 0.03158217497955847),
(song, 0.01655764513491415), (dj, 0.013593622240392478),
(Alexander, 0.012775960752248568), (Irin, 0.012367130008176616),
(Vladimir, 0.011549468520032708), (circle, 0.009096484055600982),
(mikha, 0.00848323793949305), (serg, 0.008381030253475062),
(Grigor, 0.008074407195421096), (Caspian, 0.007869991823385119),
(Viktor, 0.007869991823385119), (cargo, 0.0077677841373671305),
(leps, 0.0073589533932951765) ...

3 = "(portal, List ((portal, 1.0), (feat, 0.031572494124859504),
(song, 0.01655256973536324), (dj, 0.013589455400020435),
(Alexander, 0.012772044548891385), (Irin, 0.012363339123326864),
(Vladimir, 0.011443751915806682), (circle, 0.009093695718810668),
(mih, 0.008480637580463881), (serg, 0.008378461224072748),
(Grigor, 0.008071932154899356), (Caspian, 0.007867579442117094),
(Viktor, 0.007867579442117094), (cargo, 0.007765403085725963),
(leps, 0.007356697660161438) ...

4 = "(download, List ((download, 1.0), (feat, 0.028736234923964345),
(song, 0.01688515993707394), (torrent, 0.016255899318300997),
(games, 0.014263240692186683), (alexander, 0.012690089145254328),
(irin, 0.012480335605663346), (vladimir, 0.01101206082852648),
(dj, 0.01101206082852648), (circle, 0.009229155742003147),
(mikha, 0.008495018353434716), (serg, 0.008390141583639224) ...

5 = "(free, List ((free, 1.0), (feat, 0.028751311647429174),
(song, 0.016684155299055613), (irin, 0.01280167890870934),
(Alexander, 0.012591815320041973), (Vladimir, 0.011017838405036727),
(dj, 0.010912906610703044), (circle, 0.009233997901364114),
(mikha, 0.008604407135362015), (serg, 0.008289611752360966),
(Grigor, 0.0080797481636936), (Caspian, 0.007869884575026232),
(Viktor, 0.007869884575026232), (cargo, 0.00776495278069255),
(leps, 0.007345225603357817) ...

6 = "(new, List ((new, 1.0), (select, 0.04891791522602125),
(set, 0.04046612882571392), (event, 0.028812908182865922),
(Telekana, 0.026892047637341526), (review, 0.02650787552823665),
(tut, 0.02612370341913177), (near Moscow, 0.025995646049430145),
(address, 0.023306441285695992), (html, 0.021641695479574848),
(news, 0.02087335126136509), (yesterday, 0.020745293891663463),
(fresh, 0.02023306441285696), (oblast, 0.019592777564348827),
(presented, 0.0145985401459854), (info, 0.014342425406582149),
(Grodno, 0.013317966448969137) ...
...
17 = (Ross, List ((Ross, 1.0), (combined, 0.0731070496083551),
(one, 0.04960835509138382), (president, 0.03953748601268183),
(Putin, 0.029093621782916825), (primaries, 0.023125699365908244),
(Hoke, 0.01939574785527788), (Vladimir, 0.019022752704214847),
(countries, 0.0175307720999627), (coach, 0.0175307720999627),
(Chapters, 0.01566579634464752), (RF, 0.015292801193584483),
(championship, 0.014546810891458413) ...


As is easy to notice, topics 1 through 5 actually describe the same thing, while topic 17, on the contrary, mixes several topics together (it is, in fact, simply the latest news about Russia). This problem needs to be solved somehow.

Let's start with the duplicate topics. The obvious solution is to merge topics that share many common words; the only thing to decide is what counts as "many" and how to measure it. You can simply use the Jaccard coefficient, or, instead of counting elements, sum their probabilities and compute a weighted Jaccard coefficient; the main thing is that, as a result, similar topics get merged.

Recall that the Jaccard coefficient for finite sets $A$ and $B$ is computed by the formula

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|}. $$
It does not make sense to apply it to topics directly: for a subset whose cardinality is much smaller than that of the larger set, the Jaccard coefficient will be much less than one, which is logical but does not suit us. Instead we can, for example, sort topics by length and compare the shortest topics against the longer ones; if more than half of a topic's words coincide with the words of some other topic, the two should be merged (a sketch of this pass follows below). Let's see what we got (again, the first 10 or so words of each topic):
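A minimal sketch of the merge pass: overlap is measured as the fraction of the shorter topic's words found in a longer, already accepted topic; how to combine probabilities when merging is not specified in the article, so taking the maximum here is an assumption.

    def merge_topics(topics, overlap_threshold=0.5):
        """topics: dict {seed_word: {word: probability}}.
        Greedily folds each topic into the first accepted topic that shares
        more than `overlap_threshold` of its words."""
        # Shortest topics are considered first, as described in the text.
        order = sorted(topics, key=lambda seed: len(topics[seed]))
        merged = {}
        for seed in order:
            words = topics[seed]
            target = None
            for other_seed, other_words in merged.items():
                shared = len(set(words) & set(other_words))
                if shared / len(words) > overlap_threshold:
                    target = other_seed
                    break
            if target is None:
                merged[seed] = dict(words)
            else:
                # Assumption: keep the maximum probability for words present in both.
                for w, p in words.items():
                    merged[target][w] = max(merged[target].get(w, 0.0), p)
        return merged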
0 = "(Borzhurna, List ((Borzhurna, 1.0), (mazda, 0.02639662123248224), (mercedes-benz, 0.025020797337940742), (subaru, 0.02498880143341652)), (prior, 0.024668842388174312), (octavors), (a D), and,), (), (1). ), (sedan, 0.02098931336788891), (Kalin, 0.01785371472451526), ​​(astra, 0.017789722915466818) ...

1 = "(music, List ((downloading, 1.0), (hare, 1.0), (portal, 1.0), (free, 1.0), (music, 1.0), (feat, 0.030372759594695493), (song, 0.016629833474044852), (torrent, 0.016255899318300997), (games, 0.014263240692186683), (alexander, 0.012691999938535306), (dj, 0.012509292152232418), (irin, 0.012467890282777335),

2 = "(new, List ((new, 1.0), (review, 0.5132539377641183), (opt),), (26469259179654335), (complete set, 0.24270002476527985), (event, 0.028812908182865922), (telly, 0.02689204763734152888588565922), (tel (near Moscow, 0.025995646049430145), (address, 0.023306441285695992), (html, 0.021641695479574848), (news, 0.02087335126136509), (Tuesday, 0.020745293891663463),

3 = "(s (mn, 0.01952130368358513), (count, 0.018842301816329992), (women, 0.01850280088270243), (cf. 0.01850280088270243), (term, 0.01782379901544729),

4 = "(fret, List ((fret, 1.0), (prior, 0.5873114649209881), pick-up box, psikstek, 0.43989485258120614), (seda, 0.4370012256029478), (Kalin, 0.4247207293150209), (0) (grant, 0.0830172777075432), (3d, 0.0777496839443742), (sport, 0.07026217566672686),

5 = "(Kot, List ((Kot, 1.0), (thing, 0.04718016238753566), (one hundred, 0.028966425279789335), (books, 0.0230414746543777778), (help, 0.021724818959842), (your, 0.017774851876234364), (reference, 0.01576495959842), (your, 0.017774851876234364)) (useful, 0.014702655255650647), (stars, 0.013385999561114768), (change, 0.013385999561114768), (wants, 0.01294711432960281), (girls, 0.01163045863506693), (words, 0.01141101601931095, 0.01463045863506693), (words, 0.01141101601931095, 0.01463045863506693), (words, 0.01141101601931095), (words, 0.011630863506693) , 0.010752688172043012),
...
20 = (recipe, List ((recipe, 1.0), (prepared, 0.1684643040575244) (poshagov, 0.13405238828967642) (culinary, 0.11145351823317926) (lettuce, 0.1027221366204417) (dishes, 0.04982023626091423) (prigotov, 0.044170518746789934) ( cake, 0.0421160760143811), (pie, 0.04006163328197227), (soup, 0.03543913713405239), (chicken, 0.03338469440164355), (fast, 0.029275808936825888),
...

Finally! Grigory Leps appears in only one topic, the Russian car industry has separated from abstract car talk, and the chicken has ended up in one pile with soup, pie, and the adjective "fast"! Against this generally satisfying background, topic 2 stands out: it is unclear what it is about. Looking at it more closely, you can see that many of its words could belong to any other topic. You can get rid of such words simply by deleting those that appear in more than $k$ topics, where $k$ is some chosen parameter; it is also worth deleting words with low idf: we did not use them as generators, but nothing prevents them from showing up when the probabilities are computed.
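A sketch of that cleanup pass, assuming the merged topics from the previous sketch and an idf table like the one computed earlier ($k$ and the idf threshold are illustrative parameters):

    from collections import Counter

    def prune_topics(topics, idf, k, min_idf):
        """Drop words that occur in more than k topics or whose idf is below min_idf."""
        topic_count = Counter()
        for words in topics.values():
            topic_count.update(words)
        pruned = {}
        for seed, words in topics.items():
            pruned[seed] = {
                w: p for w, p in words.items()
                if topic_count[w] <= k and idf.get(w, 0.0) >= min_idf
            }
        return pruned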

Let's go back a little and look at the very first generated list: can it be used somehow? Naturally, every word will generate something, there will be many such topics, and merging them will take a long time: the merge pass is quadratic in the number of topics. Is it possible to get a big topic out of words like "ssd-drive"? The demons of common sense insist that it is, and keep repeating words like "hierarchy" and "ontology"; we only need to interpret these concepts as primitively as possible and slip them Bayes' formula once again.

Let's try the following: think of the hierarchy as a tree, with the most general concepts at the root and the narrowest terms in the leaves. In that case "ssd-drive" is a leaf of a tree at whose root sits "computer", or "technology", or something like that; if we can recover at least part of this tree, we will have a good, albeit incomplete, topic. Let's try pseudo-recursion over similar generators. The term "pseudo-recursion" was coined just now; by it we mean calling topic generation for each newly generated word of a freshly built topic. This operation can be repeated (after normalization) until we start getting words unsuitable for classification (we have already learned to find such words by checking their idf).
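A sketch of this pseudo-recursive expansion, reusing the hypothetical generate_topic helper from the earlier sketch (the depth limit and the low-idf stop list are simplifying assumptions):

    def expand_topic(seed, doc_freq, pair_freq, gamma, low_idf_words, max_depth=2):
        """Pseudo-recursively grow a semantic field: generate a topic for the seed,
        then generate topics for each of its words, skipping words unsuitable as
        generators (here: the low-idf stop list) and stopping at max_depth."""
        field = {}
        expanded = set()
        frontier = [seed]
        for _ in range(max_depth):
            next_frontier = []
            for word in frontier:
                if word in expanded or word in low_idf_words:
                    continue
                expanded.add(word)
                for w, p in generate_topic(word, doc_freq, pair_freq, gamma):
                    if w not in field:
                        field[w] = p
                        next_frontier.append(w)
            frontier = next_frontier
        # Normalize so the weights of the combined field sum to 1.
        total = sum(field.values()) or 1.0
        return {w: p / total for w, p in field.items()}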

Let's see what we got:
salmon = tomato, brightly, parmalat, salted, mousse, forgive, sauce, green, royal, creamy, garnish, pore, canapé, quail, festive, snack, naive, salty, grat, hello, lover, roll, bake, fish, heads, salmon, breakfast, soup, wants, potato, raviol, salted, Borodinsk, oven, cream cheese, uh, taken, potatoes, salad, pink salmon, taste, salt, package, wonderful, steak, lightly, ready, medallion, roll, cooked ...

nausea = morning, heartburn, dream, common, symptom, helping, down, strong, thirst, vomiting, rarely, roller, life, food, heaviness, feeling, urine, gestosis, izgag, forerunner, recently awakened, protein, morning, countries, dizzy, bottom, nausea

In principle, not bad: "salmon" has generated a topic about healthy eating, while the next topic is about something having gone wrong. The topics are actually longer; we are only showing small fragments. Another curious example:
Acceleration = Niv, gas, sable, Barguzin, Kalin, 4x4, hatchback, prior, v8, urban, sector, seda, universe, 5d, tested, acceleration, 4x4, gazelle, 3d

Here one of a car's characteristics generates a field about cars in general. By increasing the number of calls we can generate more words, and when there are many such fields they can be merged together, as described above.

Having stared a little at a couple of formulas and their explanations, it is easy to see that what we ended up with is, as usual, a naive Bayes classifier with assorted empirical tricks for choosing parameters and without a labeled training corpus. The latter is important for us: there is a lot of data, nobody wants to label it manually or even semi-automatically, so we need unsupervised learning. Strangely enough, despite its simplicity this approach works well on large volumes of documents and lets us at least sort them into small groups, and then sell diapers to young parents and engine oil to motorists!

Source: https://habr.com/ru/post/301922/

