
Probabilistic models: from naive Bayes to LDA, part 1

We continue the conversation. The previous article was a transition from the earlier cycle on graphical models in general ( part 1 , part 2 , part 3 , part 4 ) to a new mini-cycle on topic modeling: we talked about sampling as a method of inference in graphical models. Now we begin the path toward the latent Dirichlet allocation (LDA) model and toward how all those wonderful sampling algorithms are applied in practice. Today is the first part, in which we will figure out in which directions it makes sense to generalize the naive Bayes classifier, and along the way talk a little about clustering.




Classification: Naive Bayes


Topic modeling addresses a classical text-analysis problem: how to build a probabilistic model of a large collection of texts that can then be used, for example, for information retrieval, classification, or, in our case, for content recommendations. We have already talked about one of the simplest models that tries to solve this problem, the well-known naive Bayes classifier. You can, of course, classify texts in other ways too, using support vector machines, say, or logistic regression, or any other classifier, but naive Bayes will be most useful to us now as a starting point for further generalization.
In the naive Bayes classifier model, each document is assigned a hidden variable corresponding to its topic (taken from a predetermined discrete set, for example, "finance", "culture", "sport"), and the words of the document are conditionally independent of each other given that topic. Each topic corresponds to a discrete distribution over words, from which the words are drawn; we simply assume that the words are conditionally independent given the topic:

p(w_1, w_2, ..., w_n | c) = p(w_1 | c) p(w_2 | c) ... p(w_n | c).
As a result, each topic is a discrete distribution over words. You can imagine a huge loaded die that you roll to get the next word, or a bag of words into which you reach without looking to pull out the next one. A loaded bag is a little harder to imagine; say, every word is written on several pebbles in the bag, and different words on different numbers of pebbles (all this intuition will still come in handy, bear with me). You have one bag labeled "finance", another "culture", a third "sport":

And when you "generate" a new document from these bags in naive Bayes, you first choose a bag (by rolling a die), and then start drawing words from that bag one by one:
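To make the generative story concrete, here is a minimal sketch in Python with made-up toy numbers (the topic weights alpha and the word "bags" beta below are purely illustrative, not from the article): roll the die to pick a bag, then draw words from that bag.

```python
import random

# Toy numbers purely for illustration: alpha is the die over topics,
# beta[c] is the "bag of words" for topic c.
alpha = {"finance": 0.5, "culture": 0.3, "sport": 0.2}
beta = {
    "finance": {"market": 0.4, "bank": 0.4, "goal": 0.2},
    "culture": {"film": 0.5, "stage": 0.3, "market": 0.2},
    "sport":   {"goal": 0.6, "match": 0.3, "bank": 0.1},
}

def generate_document(n_words):
    # First roll the die to pick a topic (a bag)...
    topic = random.choices(list(alpha), weights=list(alpha.values()))[0]
    # ...then pull words out of that bag independently, one by one.
    words = random.choices(list(beta[topic]),
                           weights=list(beta[topic].values()), k=n_words)
    return topic, words

print(generate_document(5))  # e.g. ('sport', ['goal', 'match', 'goal', 'goal', 'bank'])
```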



The graphical model of this process looks very simple (we talked about how to read these pictures in the first part of the previous cycle, and the second part even worked through the naive Bayes classifier as an example). On the left is the full graphical model, in the center is the model of a single document, with a plate hiding the repeated identical variables, and on the right is the same model for the entire dataset, with the variables α and β drawn explicitly; they contain the parameters of all these discrete distributions:



The picture also shows that naive Bayes consists of one discrete distribution (the die), which determines the relative "weights" of the topics, and several discrete distributions (the bags), one per topic, which give the probability of each word within a topic.
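And classification in this model is just Bayes' rule: score each topic by its weight times the probabilities of the document's words in that topic's bag, and take the best one. A small sketch reusing the toy alpha and beta defined above:

```python
def classify(words, alpha, beta):
    # p(c | words) is proportional to alpha_c * prod_j beta_c[w_j];
    # words missing from a bag get probability zero in this toy version.
    best_topic, best_score = None, -1.0
    for topic, weight in alpha.items():
        score = weight
        for w in words:
            score *= beta[topic].get(w, 0.0)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

print(classify(["goal", "match", "goal"], alpha, beta))  # -> 'sport'
```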

And in order to train naive Bayes (as with any other classifier, there is no getting around it), you need a labeled set of texts on which these distributions can be learned:



When training naive Bayes, we know the topic of each document, and it remains only to learn the word distribution of each topic separately and the probabilities of the topics themselves:

α_c = n_c / N,    β_{c,w} = n_{c,w} / Σ_{w'} n_{c,w'},

where N is the number of documents in the dataset, n_c is the number of documents with topic c, and n_{c,w} is the number of times word w occurs in documents of topic c.
To do this, it suffices to count how many times each word occurs in documents of each topic (and how often each topic occurs in the dataset), and apply Laplace smoothing to the result.
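Here is what that training procedure might look like in code; a minimal sketch, assuming the labeled dataset is given as (topic, word list) pairs, with the smoothing constant as a parameter (all names are illustrative):

```python
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs, smoothing=1.0):
    """Estimate the topic weights alpha and per-topic word distributions beta
    from labeled documents. `labeled_docs` is assumed to be a list of
    (topic, list_of_words) pairs; names are illustrative."""
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for topic, words in labeled_docs:
        topic_counts[topic] += 1
        word_counts[topic].update(words)
        vocabulary.update(words)

    # alpha: how often each topic occurs in the dataset.
    total_docs = sum(topic_counts.values())
    alpha = {c: n / total_docs for c, n in topic_counts.items()}

    # beta: word frequencies within each topic, with Laplace smoothing
    # so that words unseen in a topic do not get zero probability.
    beta = {}
    for topic in topic_counts:
        total = sum(word_counts[topic].values()) + smoothing * len(vocabulary)
        beta[topic] = {w: (word_counts[topic][w] + smoothing) / total
                       for w in vocabulary}
    return alpha, beta
```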

Two directions immediately suggest themselves in which the naive Bayes classifier could be extended and improved: first, move from supervised learning on labeled texts to unsupervised learning on texts without labels; and second, allow a single document to contain several topics rather than exactly one. These extensions will gradually lead us to the LDA model.

From classification to clustering


It is logical to start with the first direction, moving to datasets without labels, because even if we build a model in which each document has several topics, labeling a large dataset by hand, and with several topics per document at that, would be extremely difficult. So let us imagine that we have a dataset consisting of texts, and assume, as in naive Bayes, that each document was generated from a single topic. The only difference is that now we do not know which one, and the task looks like this:



Let us assume (we will do the same in LDA; to get rid of this assumption we would need nonparametric methods, which are still far ahead of us) that we know the number of potential categories (different bags of words); only their contents are unknown. Thus, the probabilistic model looks exactly the same,
and the distributions in it are the same. The only difference is that previously we knew the topic of each document, and now we do not. In other words, the joint distribution from which documents are generated is the same: if we denote by w the vector of words of a document D and by c its category, the joint likelihood of the document is

p(w, c) = α_c Π_j β_{c, w_j},
where β_c is the distribution over the bag of words corresponding to category c; and the overall likelihood that we need to maximize during training is, accordingly,

L(α, β) = Π_D p(w_D, c_D) = Π_D α_{c_D} Π_j β_{c_D, w_{D,j}},
and we still need to maximize it over α and β_c. But whereas before we did this with known c, so that the maximization problem split across the separate topics and was essentially trivial, we now have to maximize with unknown c, that is, effectively take the expectation over them:

L(α, β) = Π_D Σ_c α_c Π_j β_{c, w_{D,j}}.
How to do it?

What we have arrived at is actually a completely standard clustering problem: there are many objects (texts) in a multidimensional space (of words), and we need to partition them into separate classes so that the texts within one class are as similar as possible to each other and at the same time as dissimilar as possible from the texts in other classes. So a standard tool for training clustering models can help us here: the EM algorithm (from the words Expectation-Maximization; in a moment it will become clear why). The EM algorithm is designed for situations where the model has hidden variables (as we have c), and we need to maximize the likelihood of the model over its parameters (in our case α and β) while taking the expectation over the hidden variables. The EM algorithm first initializes the model with some initial values and then iteratively repeats two steps: the E-step, which computes the expected values of the hidden variables given the current parameters, and the M-step, which re-optimizes the parameters with those expectations fixed.
In our case, at the E-step we simply compute, with α and β fixed, the probabilities that each document belongs to each category:

p(c | D) ∝ α_c Π_j β_{c, w_{D,j}}.
And then, at the M-step, we treat these membership probabilities as fixed and optimize α and β (preferably with smoothing terms α_0 and β_0 corresponding to certain prior distributions on the parameters α and β):

α_c ∝ α_0 + Σ_D p(c | D),    β_{c,w} ∝ β_0 + Σ_D n_{D,w} p(c | D),

where n_{D,w} is the number of times word w occurs in document D.
Then all of this is repeated until α and β converge; this is the EM algorithm for clustering, applied to the case of discrete attributes (words in documents).
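Putting the E-step and M-step together, here is a minimal sketch of this EM clustering in Python (assumptions: documents is a list of word lists whose words all appear in vocabulary; the smoothing constants alpha0 and beta0 and the random initialization are illustrative choices, not prescribed in the article):

```python
import math
import random
from collections import Counter

def em_cluster(documents, n_topics, vocabulary, n_iter=50, alpha0=1.0, beta0=0.01):
    """A minimal sketch of EM clustering under the "one topic per document"
    model. `documents` is a list of word lists; every word is assumed to be
    in `vocabulary`; parameter names and smoothing values are illustrative."""
    # Random initialization of topic weights alpha and word distributions beta.
    alpha = [1.0 / n_topics] * n_topics
    beta = [{w: random.random() + 0.01 for w in vocabulary} for _ in range(n_topics)]
    for b in beta:
        z = sum(b.values())
        for w in b:
            b[w] /= z

    counts = [Counter(doc) for doc in documents]
    for _ in range(n_iter):
        # E-step: p(c | D) proportional to alpha_c * prod_w beta_{c,w}^{n_{D,w}},
        # computed in log space for numerical stability.
        resp = []
        for cnt in counts:
            logs = [math.log(alpha[c])
                    + sum(n * math.log(beta[c][w]) for w, n in cnt.items())
                    for c in range(n_topics)]
            m = max(logs)
            weights = [math.exp(l - m) for l in logs]
            z = sum(weights)
            resp.append([x / z for x in weights])
        # M-step: re-estimate alpha and beta from the soft assignments,
        # with smoothing so that no probability collapses to zero.
        alpha = [alpha0 + sum(r[c] for r in resp) for c in range(n_topics)]
        z = sum(alpha)
        alpha = [a / z for a in alpha]
        for c in range(n_topics):
            for w in vocabulary:
                beta[c][w] = beta0 + sum(r[c] * cnt[w] for r, cnt in zip(resp, counts))
            z = sum(beta[c].values())
            for w in vocabulary:
                beta[c][w] /= z
    return alpha, beta, resp
```

Each document ends up with a soft assignment resp[d] over topics rather than a hard label; if you need a hard clustering, take the argmax of each row.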

The general idea of why the EM algorithm works is as follows: at each E-step, the EM algorithm in fact constructs an auxiliary function of the model parameters that touches the likelihood function at the current approximation and stays below it everywhere else (it is a lower bound on the likelihood function). Here is a picture from one of the sources on this topic (frankly, I forgot which one):

And then, at the M-step, the EM algorithm finds the parameters that maximize this auxiliary function. Obviously, in doing so we move to a point at which the overall likelihood increases. I will not go into the proof that the EM algorithm really does all this and works correctly; only the general idea is important for us here.

What's next


Today we took the first of two steps from the naive Bayes classifier to LDA: we moved from the classification problem, in which each document in the training set must have a fixed category, to the clustering problem, in which documents do not have to be labeled at all. Note, by the way, that if some of your documents are still labeled (this setting is called semi-supervised learning), it is very easy to add them to the resulting model: you just need to fix the corresponding variables c_i(D), setting them to one for the labeled topic and zero for all the others, and not update these variables within the EM algorithm, leaving their values fixed (see the small sketch below). Such a "seed" dataset will, by the way, be very useful for further training of the model.
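As an illustration of that tweak, here is a tiny hypothetical helper for the EM sketch above: labels is assumed to map a document index to its known topic index, and after every E-step we simply overwrite the soft assignments for those documents.

```python
def pin_labeled_documents(resp, labels, n_topics):
    """Fix c_i(D) for labeled documents: probability one for the known topic,
    zero for all the others (hypothetical helper for the em_cluster sketch;
    `labels` maps document index -> topic index)."""
    for d, topic in labels.items():
        resp[d] = [1.0 if c == topic else 0.0 for c in range(n_topics)]
    return resp
```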

Next time we will take the second step toward latent Dirichlet allocation: from categories assigned to a text as a whole, we will move on to topics, of which there can be several in each document.

Source: https://habr.com/ru/post/228249/

