
Running LDA in the real world. A detailed guide

Foreword


On the Internet there are many tutorials explaining how LDA (Latent Dirichlet Allocation) works and how to put it into practice. Examples of training LDA are usually shown on "exemplary" datasets, for example the "20 newsgroups dataset" that ships with sklearn.


The peculiarity of learning on "exemplary" datasets is that the data there is always in order and conveniently gathered in one place. When training production models on data obtained straight from real sources, everything is usually the opposite:



I prefer to learn from examples that are as close as possible to production reality, because that way you get a much better feel for the problem areas of a particular kind of task. That is how it was with LDA, and in this article I want to share my experience of running LDA from scratch on completely raw data. Part of the article is devoted to obtaining that data, so that the example takes on the shape of a full-fledged engineering case.


Topic modeling and LDA.


To begin with, let's consider what LDA does in general and what tasks it is used for.
Most often, LDA is used for topic modeling tasks. These are tasks of clustering or classifying texts in such a way that each class or cluster contains texts with similar topics.


In order to apply LDA to a text dataset (hereinafter, the text corpus), the corpus must be converted into a term-document matrix.


A term-document matrix is a matrix of size N × W, where N is the number of documents in the corpus and W is the size of the corpus vocabulary, i.e. the number of unique words that occur in our corpus. Cell (i, j) of the matrix contains the number of times word j occurs in document i.
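As a tiny illustration (the toy corpus and variable names below are mine, not from the article), this is what such a matrix looks like when built with sklearn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ['the cat sat on the mat',
              'the dog sat on the log']

cv = CountVectorizer()
X = cv.fit_transform(toy_corpus)   # sparse matrix of shape N x W

print(cv.get_feature_names())      # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(X.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]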


For a given term-document matrix and a predetermined number of topics T, LDA builds two distributions:


  1. The distribution of topics over documents (in practice, a matrix of size N × T).
  2. The distribution of words over topics (a matrix of size T × W).

For the "distribution of topics over documents" matrix, the cell values are the probabilities that a given topic is present in a given document (or the proportion of the topic in the document, if we view the document as a mixture of different topics).


For the "distribution of words over topics" matrix, the values are, correspondingly, the probabilities of finding word j in a text on topic i; qualitatively, these numbers can be seen as coefficients characterizing how characteristic a given word is for a given topic.
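To make this concrete, here is a minimal sketch (my own illustration, assuming an N × W term-document matrix X like the one above) of how these two matrices come out of sklearn's LatentDirichletAllocation:

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

T = 10  # number of topics
lda = LatentDirichletAllocation(n_components=T)
lda.fit(X)                      # X is the N x W term-document matrix

doc_topic = lda.transform(X)    # N x T: distribution of topics over documents
topic_word = lda.components_    # T x W: unnormalized distribution of words over topics

# normalize the rows of components_ to get word probabilities within each topic
topic_word = topic_word / topic_word.sum(axis=1, keepdims=True)
print(doc_topic.shape, topic_word.shape)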


It should be noted that "topic" here is not the everyday sense of the word. LDA identifies T topics, but what those topics are and whether they correspond to any familiar text topics such as "Sport", "Science" or "Politics" is unknown. In this case it is more appropriate to think of a topic as an abstract entity, defined by a row of the word-topic distribution matrix, which corresponds to a given text with a certain probability; if you like, you can picture it as a family of characteristic word sets that co-occur, with the corresponding probabilities (from the table), in a certain set of texts.


If you want to study in more detail, with formulas, how LDA learns and works, here are some materials (which the author used):



Getting wild data


For our "laboratory work", we need a custom dataset with its own shortcomings and quirks. You could get one in many places: download reviews from Kinopoisk, Wikipedia articles, news from some news portal; we will take a slightly more extreme option - posts from VKontakte communities.


We will do it as follows:


  1. Pick some VK user.
  2. Get the list of all their friends.
  3. For each friend, take all of their communities.
  4. For each community of each friend, download its first n (n = 100) posts and concatenate them into a single text - the content of that community.

Tools and Articles


To download the posts, we will use the vk module for Python, which works with the VK API. One of the trickiest parts of writing an application that uses the VKontakte API is authorization; fortunately, the code that does this work has already been written and is publicly available - besides vk, I used a small authorization module, vkauth.


Links to the modules used and to articles on the VK API:



Write the code


So, using vkauth, we log in:


#authorization of app using modules imported.
app_id = '6203169'
perms = ['photos', 'friends', 'groups']
API_ver = '5.68'

Auth = VKAuth(perms, app_id, API_ver)
Auth.auth()

token = Auth.get_token()
user_id = Auth.get_user_id()

#starting session
session = vk.Session(access_token=token)
api = vk.API(session)

Along the way, a small module was written containing all the functions needed to download the content in a suitable format; they are listed below, let's go over them:


def get_friends_ids(api, user_id):
    '''
    For a given API object and user_id returns a list of all his friends ids.
    '''
    friends = api.friends.get(user_id=user_id, v='5.68')
    friends_ids = friends['items']
    return friends_ids


def get_user_groups(api, user_id, moder=True, only_open=True):
    '''
    For a given API user_id returns list of all groups he subscribed to.

    Flag moder to get only those groups where user is a moderator or an admin.
    Flag only_open to get only public(open) groups.
    '''
    kwargs = {'user_id': user_id, 'v': '5.68'}
    if moder == True:
        kwargs['filter'] = 'moder'
    if only_open == True:
        kwargs['extended'] = 1
        kwargs['fields'] = ['is_closed']
    groups = api.groups.get(**kwargs)

    groups_refined = []
    for group in groups['items']:
        cond_check = (only_open and group['is_closed'] == 0) or not only_open
        if cond_check:
            refined = {}
            refined['id'] = group['id'] * (-1)
            refined['name'] = group['name']
            groups_refined.append(refined)
    return groups_refined


def get_n_posts_text(api, group_id, n_posts=50):
    '''
    For a given api and group_id returns first n_posts concatenated as one text.
    '''
    wall_contents = api.wall.get(owner_id=group_id, count=n_posts, v='5.68')
    wall_contents = wall_contents['items']

    text = ''
    for post in wall_contents:
        text += post['text'] + ' '
    return text

The final pipeline looks like this:


#id of user whose friends you gonna get, like: https://vk.com/id111111111
user_id = 111111111

friends_ids = vt.get_friends_ids(api, user_id)

#collecting all groups
groups = []
for i, friend in tqdm(enumerate(friends_ids)):
    if i % 3 == 0:
        sleep(1)
    friend_groups = vt.get_user_groups(api, friend, moder=False)
    groups += friend_groups

#converting groups to dataFrame
groups_df = pd.DataFrame(groups)
groups_df.drop_duplicates(inplace=True)

#reading content (content == first 100 posts)
for i, group in tqdm(groups_df.iterrows()):
    name = group['name']
    group_id = group['id']

    #Different kinds of fails occur during scraping
    #For example there are names of groups with slashes
    #Like: 'The Kaaats / Indie-rock'
    try:
        content = vt.get_n_posts_text(api, group_id, n_posts=100)
        dst_path = join(data_path, name + '.txt')
        with open(dst_path, 'w+t') as f:
            f.write(content)
    except Exception as e:
        print('Error occurred on group:', name)
        print(e)
        continue

    #needed because of the requests limitation in the VK API.
    if i % 3 == 0:
        sleep(1)

Fails


In general, the data-downloading procedure itself is nothing difficult; you only need to pay attention to two points:


  1. Sometimes, due to the privacy settings of some communities, you will get access errors; these and other errors are handled by placing try/except in the right place.
  2. VK has a limit on the number of requests per second.

When making a large number of requests, for example in a loop, we will also run into errors. This problem can be solved in several ways:


  1. The blunt, direct way: insert sleep(some_time) every 3 requests. This takes one line and slows the download down considerably, but when the data volumes are not large and there is no time for more sophisticated methods, it is quite acceptable. (Implemented in this article.)
  2. Look into Long Poll requests: https://vk.com/dev/using_longpoll

In this article the simple and slow method was chosen (a minimal sketch of such throttling is given below); in the future I may write a micro-article about ways to bypass or relax the restriction on the number of requests per second.
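As an illustration of option 1, here is a minimal throttling sketch; the helper name throttled_calls and the 3-calls-per-pause default are my own assumptions, not code from the original pipeline:

from time import sleep

def throttled_calls(func, args_list, calls_per_pause=3, pause=1.0):
    '''
    Calls func(*args) for every args tuple in args_list, sleeping after
    every `calls_per_pause` calls to stay under the VK API rate limit.
    '''
    results = []
    for i, args in enumerate(args_list):
        if i > 0 and i % calls_per_pause == 0:
            sleep(pause)
        results.append(func(*args))
    return results

# usage sketch, e.g. collecting the groups of all friends:
# groups_lists = throttled_calls(vt.get_user_groups,
#                                [(api, friend, False) for friend in friends_ids])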


The result


From a fairly ordinary user with ~150 friends, 4679 texts were obtained - each characterizing a certain VK community. The texts vary greatly in size and are written in many languages; some of them are not suitable for our purposes, but we will get to that a little further on.


Main part




Let's walk through all the blocks of our pipeline - first the mandatory ones (the "ideal" path), then the rest, which are of the greatest interest.


CountVectorizer


Before training LDA, we need to represent our documents as a term-document matrix. This usually includes operations such as:



In sklearn, all of these steps are conveniently implemented within a single class, sklearn.feature_extraction.text.CountVectorizer.


Documentation link


All you need to do is:


count_vect = CountVectorizer(input='filename',
                             stop_words=stopwords,
                             vocabulary=voc)
dataset = count_vect.fit_transform(train_names)

LDA


As with CountVectorizer, LDA is well implemented in sklearn and other frameworks, so there is little point in devoting much space to the implementations themselves in this purely practical article.


Documentation link


All you need to run LDA is:


#training LDA
lda = LDA(n_components=60,
          max_iter=30,
          n_jobs=6,
          learning_method='batch',
          verbose=1)
lda.fit(dataset)

Preprocessing


If we take our texts right after downloading them and convert them into a term-document matrix using CountVectorizer with the default tokenizer, we will get a matrix of size 4679x769801 (on the data I use).


Our vocabulary size will be 769801. Even if we assume that most of the words are informative, we are still unlikely to get a good LDA - something like the "curse of dimensionality" awaits us, not to mention that on almost any computer we will simply fill up all the RAM. In fact, most of these words are completely uninformative. A huge part of them are:



In addition, many VK groups specialize exclusively in images - they have almost no text posts - and the texts corresponding to them are degenerate; in the term-document matrix they will give us almost entirely zero rows.


So, let's sort all of this out!
We tokenize all the texts, remove punctuation and numbers from them, and look at the histogram of the distribution of texts by word count:
[Histogram: distribution of texts by word count]


We remove all texts shorter than 100 words (there are 525 of them).
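A minimal sketch of this length filter, assuming the downloaded texts have already been read into a list of strings texts with matching file names in names (these variable names are my assumption):

import re

def tokenize(text):
    # keep alphabetic tokens only, dropping punctuation and numbers
    return re.findall(r'[^\W\d_]+', text.lower())

lengths = [len(tokenize(text)) for text in texts]

min_len = 100
kept = [(name, text) for name, text, l in zip(names, texts, lengths) if l >= min_len]
names, texts = map(list, zip(*kept))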


Now for the vocabulary:
Removing all tokens (words) consisting of non-letter characters is quite acceptable for our task. CountVectorizer does this by itself; and even if it didn't, I don't think examples are needed here (they are in the full version of the code for the article).


One of the most common procedures for reducing the vocabulary size is removing so-called stopwords - words that carry no semantic load and/or have no topical coloring (remember, our task is topic modeling). In our case such words are, for example:



The nltk module ships with ready-made lists of stopwords for Russian and English, but they are rather weak. On the Internet you can find more stopword lists for any language and add them to the nltk ones. That is what we will do. We take additional stopwords from here:



In practice, when solving specific tasks, stopword lists are gradually adjusted and extended as models are trained, since every particular dataset and task has its own specific uninformative words. We too will pick up custom stopwords after training our first-generation LDA.


The stopword-removal procedure itself is built into CountVectorizer - all we need is their list.
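A minimal sketch of assembling such a list, assuming the extra stopwords were saved to a local file extra_stopwords.txt, one word per line (the file name is my assumption):

import nltk
from nltk.corpus import stopwords as nltk_stopwords

nltk.download('stopwords')

# nltk lists for Russian and English
stopwords = set(nltk_stopwords.words('russian')) | set(nltk_stopwords.words('english'))

# extend with a custom list downloaded from the web
with open('extra_stopwords.txt') as f:
    stopwords |= {line.strip() for line in f if line.strip()}

stopwords = list(stopwords)  # CountVectorizer(stop_words=...) accepts a list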


Is what we have done enough?




Most of the words in our vocabulary are still not very informative for training LDA and are not on the stopword list. So let's apply one more filtering method to our data.


idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )


where
t is a word from the vocabulary,
D is the corpus (the set of texts),
d is one of the corpus texts.
We compute the IDF of all our words and cut off the words with the largest idf (very rare words) and with the smallest (common words).


#'training' (tf-)idf vectorizer.
tf_idf = TfidfVectorizer(input='filename',
                         stop_words=stopwords,
                         smooth_idf=False)
tf_idf.fit(train_names)

#getting idfs
idfs = tf_idf.idf_

#sorting out too rare and too common words
lower_thresh = 3.
upper_thresh = 6.
not_often = idfs > lower_thresh
not_rare = idfs < upper_thresh

mask = not_often * not_rare

good_words = np.array(tf_idf.get_feature_names())[mask]

#deleting punctuation as well.
cleaned = []
for word in good_words:
    word = re.sub("^(\d+\w*$|_+)", "", word)
    if len(word) == 0:
        continue
    cleaned.append(word)

The vocabulary obtained after the above procedures is already quite suitable for training LDA, but we will also perform stemming - in our dataset the same words often occur in different grammatical cases. For stemming we used pymystem3.


#Stemming
m = Mystem()
stemmed = set()
voc_len = len(cleaned)
for i in tqdm(range(voc_len)):
    word = cleaned.pop()
    stemmed_word = m.lemmatize(word)[0]
    stemmed.add(stemmed_word)
stemmed = list(stemmed)
print('After stemming: %d' % (len(stemmed)))

After applying the filtering described above, the vocabulary size dropped from 769801 to 13611, and with data like this one can already get an LDA model of acceptable quality.


Testing, applying and tuning LDA


Now that we have the dataset, the preprocessing and models trained on that dataset, it would be good to check how adequate our models are, and also to build some applications on top of them.


As a first application, consider the task of generating keywords for a given text. In a fairly simple version, this can be done as follows:


  1. Get the distribution of topics for the text from LDA.
  2. Pick the n (for example, n = 2) most pronounced topics.
  3. For each of these topics, pick the m (for example, m = 3) most characteristic words.
  4. We end up with a set of n * m words describing the text.

Let's write a simple interface class that will implement this method of generating keywords:


#Let's do a simple interface class
class TopicModeler(object):
    '''
    Interface object for simple usage of CountVectorizer + LDA.
    '''
    def __init__(self, count_vect, lda):
        '''
        Args:
            count_vect - CountVectorizer object from sklearn.
            lda - LDA object from sklearn.
        '''
        self.lda = lda
        self.count_vect = count_vect
        self.count_vect.input = 'content'

    def __call__(self, text):
        '''
        Gives topics distribution for a given text
        Args:
            text - raw text via python string.
        returns:
            numpy array - topics distribution for a given text.
        '''
        vectorized = self.count_vect.transform([text])
        lda_topics = self.lda.transform(vectorized)
        return lda_topics

    def get_keywords(self, text, n_topics=3, n_keywords=5):
        '''
        For a given text gives n_keywords top keywords for each of the n_topics top topics of the text.
        Args:
            text - raw text via python string.
            n_topics - int, how many top topics to use.
            n_keywords - how many top words of each topic to return.
        returns:
            list of n_topics*n_keywords keywords for a given text.
        '''
        lda_topics = self(text)
        lda_topics = np.squeeze(lda_topics, axis=0)
        n_topics_indices = lda_topics.argsort()[-n_topics:][::-1]

        top_topics_words_dists = []
        for i in n_topics_indices:
            top_topics_words_dists.append(self.lda.components_[i])

        shape = (n_keywords * n_topics, self.lda.components_.shape[1])
        keywords = np.zeros(shape=shape)
        for i, topic in enumerate(top_topics_words_dists):
            n_keywords_indices = topic.argsort()[-n_keywords:][::-1]
            for k, j in enumerate(n_keywords_indices):
                keywords[i * n_keywords + k, j] = 1

        keywords = self.count_vect.inverse_transform(keywords)
        keywords = [keyword[0] for keyword in keywords]
        return keywords

Let's apply our method to several texts and see what we get:
Community: Paints of the World Travel Agency
Keywords: ['photo', 'social', 'travel', 'community', 'travel', 'euro', 'accommodation', 'price', 'poland', 'departure']
Community: Food Gifs
Keywords: ['butter', 'st', 'salt', 'pc', 'dough', 'cooking', 'onion', 'pepper', 'sugar', 'gr']


The results above are not cherry-picked and look quite adequate. In fact, these are results from an already tuned model. The first LDAs trained for this article gave significantly worse results; among the keywords one could often see, for example:


  1. Components of web addresses: www, http, ru, com ...
  2. Common words.
  3. Units of measure: cm, meter, km ...

The tuning of the model was done as follows:


  1. For each topic, take the n (n = 5) most characteristic words.
  2. Compute their idf over the training corpus.
  3. Add the 5-10% most widespread of them to the stopwords.

Such a “purge” should be carried out carefully, looking through those 10% of words beforehand. More precisely, they should be treated as candidates for deletion, from which the words to actually remove are then picked by hand; a sketch of collecting such candidates is given below.
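A minimal sketch of collecting these deletion candidates, assuming the trained lda and the fitted count_vect and tf_idf vectorizers from earlier; the helper name and the exact share are my assumptions, and words that differ between the two vocabularies (e.g. due to stemming) are simply skipped:

import numpy as np

def stopword_candidates(lda, count_vect, tf_idf, n_top_words=5, share=0.05):
    '''
    Take the n_top_words most characteristic words of every topic and
    return the `share` fraction of them with the lowest idf
    (the most widespread ones) as candidates for the stopword list.
    '''
    lda_vocab = np.array(count_vect.get_feature_names())
    word_to_idf = dict(zip(tf_idf.get_feature_names(), tf_idf.idf_))

    top_words = set()
    for topic in lda.components_:
        top_indices = topic.argsort()[-n_top_words:][::-1]
        top_words.update(lda_vocab[top_indices])

    scored = [(w, word_to_idf[w]) for w in top_words if w in word_to_idf]
    scored.sort(key=lambda pair: pair[1])  # lowest idf == most widespread

    n_candidates = max(1, int(share * len(scored)))
    return [w for w, _ in scored[:n_candidates]]

# review the candidates by hand before adding them to `stopwords`
candidates = stopword_candidates(lda, count_vect, tf_idf)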


Around the 2nd-3rd generation of models, this method of selecting stopwords gives, for the top 5% most widespread top words of the distributions:
['any', 'completely', 'correctly', 'easy', 'next', 'internet', 'small', 'way', 'difficult', 'mood', 'so much', 'set', 'option', 'name', 'speech', 'program', 'competition', 'music', 'goal', 'film', 'price', 'game', 'system', 'play', 'company', 'nicely']


More applications


The first thing that comes to my mind is to use the topic distribution of a text as its 'embedding'; in this interpretation, visualization or clustering algorithms can be applied to these embeddings, and the final 'effective' thematic clusters can be found this way.


Let's do this:


term_doc_matrix = count_vect.transform(names)
embeddings = lda.transform(term_doc_matrix)

kmeans = KMeans(n_clusters=30)
clust_labels = kmeans.fit_predict(embeddings)
clust_centers = kmeans.cluster_centers_

embeddings_to_tsne = np.concatenate((embeddings, clust_centers), axis=0)

tSNE = TSNE(n_components=2, perplexity=15)
tsne_embeddings = tSNE.fit_transform(embeddings_to_tsne)
tsne_embeddings, centroids_embeddings = np.split(tsne_embeddings,
                                                 [len(clust_labels)],
                                                 axis=0)

At the output we get the following picture:
[Figure: tSNE projection of the topic-distribution embeddings]


The crosses are the centers of gravity (centroids) of the clusters.


The picture of the tSNE embeddings shows that the clusters selected with KMeans form fairly connected and, most often, spatially separable sets.


Everything else is up to you.


Link to all code: https://gitlab.com/Mozes/VK_LDA



Source: https://habr.com/ru/post/417167/

