
Topic modeling of GitHub repositories

[Figure: word cloud]
Topic modeling is a branch of machine learning devoted to extracting abstract "topics" from a set of "documents", where each "document" is represented as a bag of words, i.e. its words together with their frequencies. A good introduction to topic modeling is given by Prof. K. V. Vorontsov in his lectures at the Yandex School of Data Analysis (ShAD) [PDF]. The best-known topic model is, of course, Latent Dirichlet Allocation (LDA). Konstantin Vyacheslavovich managed to generalize all possible bag-of-words topic models under the framework of additive regularization (ARTM); in particular, LDA is one of the many models ARTM covers. The ARTM ideas are implemented in the BigARTM project.

Topic modeling is usually applied to text documents. At source{d} (a startup in Spain) we digest the big data obtained from GitHub repositories (and will soon take on every publicly available repository in the world). Naturally, the idea arose to interpret each repository as a bag of words and unleash BigARTM on it. In this article we describe how we carried out what is, in fact, the world's first topic-modeling study of the largest collection of open source projects, what came of it, and how to repeat it. Docker inside!

TL;DR:

    docker run srcd/github_topics apache/spark

(replace apache/spark with any GitHub repository you like).

» OpenDocument table with the extracted topics.
» JSON with the extracted topics.
» Trained model: 40 MB, gzipped pickle for Python 3.4+, Pandas 0.18+.
» Dataset on data.world.

Theory


The probabilistic topic model on a set of documents $D$ describes the frequency of a word $w$ in a document $d$ through topics $t$:


p(w|d) = \sum_{t \in T} p(w|t)\, p(t|d)


Here $p(w|t)$ is the probability that word $w$ belongs to topic $t$, and $p(t|d)$ is the probability that topic $t$ belongs to document $d$; the formula above is simply the law of total probability, under the independence assumption $p(w|d,t) = p(w|t)$. Words are taken from a dictionary $W$, and topics belong to a set $T$, which is just a set of indexes $[1, 2, \dots, n_t]$.

We need to recover $p(w|t)$ and $p(t|d)$ from a given set of documents $\left\{d \in D: d = \left\{w_1 \dots w_{n_d}\right\}\right\}$. It is usually assumed that $\hat{p}(w|d) = \frac{n_{dw}}{n_d}$, where $n_{dw}$ is the number of occurrences of $w$ in document $d$; however, this implies that all words are equally important, which is not always true. By "importance" we mean a measure that is negatively correlated with the overall frequency of the word across documents. Denote the recovered probabilities $\hat{p}(w|t) = \phi_{wt}$ and $\hat{p}(t|d) = \theta_{td}$. Our task is thus reduced to a stochastic matrix factorization, which is an ill-posed problem:

\frac{n_{dw}}{n_d} \approx \Phi \cdot \Theta = (\Phi S)(S^{-1}\Theta) = \Phi' \cdot \Theta'


In machine learning problems, regularization is usually a way to improve a model's behavior on unseen data (it reduces overfitting, model complexity, and so on); in our case it is simply necessary, since the factorization above is only defined up to an arbitrary non-degenerate matrix $S$.

Problems of this kind are solved with the maximum likelihood method:

\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt} \theta_{td} \to \max_{\Phi, \Theta}


subject to the constraints


\phi_{wt} > 0;\quad \sum_{w \in W} \phi_{wt} = 1;\quad \theta_{td} > 0;\quad \sum_{t \in T} \theta_{td} = 1.


The essence of ARTM is to naturally add regularization in the form of additional terms:


\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt} \theta_{td} + R(\Phi, \Theta) \to \max_{\Phi, \Theta}


Since this is a simple addition, we can combine different regularizers in a single optimization, for example sparsify the matrices and increase the independence of the topics. LDA is formulated in ARTM terms like this:


R(\Phi, \Theta)_{Dirichlet} = \sum_{t,w} (\beta_w - 1) \ln \phi_{wt} + \sum_{d,t} (\alpha_t - 1) \ln \theta_{td}


The variables $\Phi$ and $\Theta$ can be computed efficiently with an iterative EM algorithm. Dozens of ready-made ARTM regularizers are ready for battle as part of BigARTM.
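
To make the EM scheme concrete, here is a tiny illustrative numpy sketch of the unregularized case (PLSA): the E-step distributes every count $n_{dw}$ over topics, and the M-step re-normalizes the accumulated counters into $\Phi$ and $\Theta$. This is a toy with dense matrices and names of our own choosing, not BigARTM code.

    import numpy as np

    def plsa_em(n_dw, n_topics, n_iter=50, seed=0):
        """Toy EM for p(w|d) = sum_t phi_wt * theta_td, without regularization.

        n_dw: (n_docs, n_words) matrix of word counts.
        Returns phi of shape (n_words, n_topics) and theta of shape (n_topics, n_docs).
        """
        rng = np.random.default_rng(seed)
        n_docs, n_words = n_dw.shape
        phi = rng.random((n_words, n_topics))
        phi /= phi.sum(axis=0, keepdims=True)      # columns sum to 1: p(w|t)
        theta = rng.random((n_topics, n_docs))
        theta /= theta.sum(axis=0, keepdims=True)  # columns sum to 1: p(t|d)

        for _ in range(n_iter):
            n_wt = np.zeros_like(phi)
            n_td = np.zeros_like(theta)
            for d in range(n_docs):
                # E-step: p(t|d,w) is proportional to phi_wt * theta_td
                p_tdw = phi * theta[:, d]                        # (n_words, n_topics)
                p_tdw /= p_tdw.sum(axis=1, keepdims=True) + 1e-12
                # M-step counters: distribute the observed counts n_dw over topics
                contrib = n_dw[d][:, None] * p_tdw
                n_wt += contrib
                n_td[:, d] += contrib.sum(axis=0)
            phi = n_wt / (n_wt.sum(axis=0, keepdims=True) + 1e-12)
            theta = n_td / (n_td.sum(axis=0, keepdims=True) + 1e-12)
        return phi, theta

The regularized ARTM updates differ only in the M-step: before normalization, the counters receive the extra terms $\phi_{wt}\,\partial R/\partial \phi_{wt}$ and $\theta_{td}\,\partial R/\partial \theta_{td}$, clipped at zero.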

At this point the forced retelling of the ShAD lectures ends, and there begins

Practice


As of October 2016, about 18 million GitHub repositories were available for analysis. There are actually many more; we simply dropped forks and "hard forks" (forks that GitHub does not mark as such). Each repository became a document d, and each name in its source code a word w. Source code analysis was done with the same tools as in our early experiments with deep learning on source code (see our presentations from the recent RE·WORK conferences in Berlin and London): initial classification with github/linguist and parsing based on Pygments. Plain text files, such as README.md, were discarded.
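
As a rough illustration of the extraction step (the production pipeline ran on Spark and additionally relied on github/linguist; the function below and its filtering rules are a simplification of our own), identifiers can be pulled out of a single source file with Pygments like this:

    from pygments.lexers import guess_lexer_for_filename
    from pygments.token import Token
    from pygments.util import ClassNotFound

    def extract_names(path):
        """Return the identifier tokens of one source file (best effort)."""
        with open(path, encoding="utf-8", errors="ignore") as f:
            code = f.read()
        try:
            lexer = guess_lexer_for_filename(path, code)
        except ClassNotFound:
            return []                       # not a language Pygments recognizes
        return [value
                for tok_type, value in lexer.get_tokens(code)
                if tok_type in Token.Name]  # classes, functions, variables, ...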

Names should not be extracted from the source "head-on": for example, class FooBarBaz adds 3 words to the bag: foo, bar and baz, while int wdSize adds two: wdsize and size. In addition, the names were stemmed with the Snowball stemmer from NLTK, although we did not specifically study how much this helps. The final preprocessing stage was computing a logarithmic variant of TF-IDF weighting (again, we did not investigate this specifically, we just copied the usual NLP recipe) and filtering out names that are too rare or too common; in our case the thresholds were 50 and 100,000, respectively.
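
A minimal sketch of this preprocessing, assuming a simple camelCase/snake_case splitter, NLTK's Snowball stemmer, and one particular logarithmic TF-IDF variant; the exact rules of the real pipeline (for instance, how too-short fragments like "wd" are handled, or whether the 50/100,000 thresholds apply to document or total frequencies) may well differ:

    import math
    import re
    from collections import Counter

    from nltk.stem import SnowballStemmer

    stemmer = SnowballStemmer("english")
    # "FooBarBaz" -> Foo, Bar, Baz; "wd_size" -> wd, size; "HTTPServer" -> HTTP, Server
    NAME_RE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])")

    def bag_of_words(names):
        """Bag of stemmed sub-tokens for one repository."""
        return Counter(stemmer.stem(part.lower())
                       for name in names
                       for part in NAME_RE.findall(name))

    def log_tf_idf(bags, min_df=50, max_df=100000):
        """Logarithmic TF-IDF over per-repository bags, with frequency cut-offs."""
        df = Counter(tok for bag in bags for tok in bag)   # document frequencies
        n_docs = len(bags)
        keep = {tok for tok, d in df.items() if min_df <= d <= max_df}
        return [
            {tok: (1.0 + math.log(tf)) * math.log(n_docs / df[tok])
             for tok, tf in bag.items() if tok in keep}
            for bag in bags
        ]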
After ARTM produced its result, the topics had to be named manually, based on their keywords and representative repositories. The number of topics was set to 200, and as it turned out, it should have been higher, because GitHub holds a great many topics. This tedious work took a whole week.

Preprocessing was performed on Dataproc, i.e. Spark in the Google cloud, and the main computation was done locally on a powerful machine. The resulting sparse matrix was about 20 GB in size and had to be converted into the Vowpal Wabbit text format so that the BigARTM CLI could digest it. The data was ground through fairly quickly, in a couple of hours:

    bigartm -c dataset_vowpal_wabbit.txt -t 200 -p 10 --threads 10 \
            --write-model-readable bigartm.txt \
            --regularizer "0.05 SparsePhi" "0.05 SparseTheta"
    Parsing text collection... OK.
    Gathering dictionary from batches... OK.
    Initializing random model from dictionary... OK.
    Number of tokens in the model: 604989
    ================= Processing started.
    Perplexity      = 586350
    SparsityPhi     = 0.00214434
    SparsityTheta   = 0.422496
    ================= Iteration 1 took 00:11:57.116
    Perplexity      = 107901
    SparsityPhi     = 0.00613982
    SparsityTheta   = 0.552418
    ================= Iteration 2 took 00:12:03.001
    Perplexity      = 60701.5
    SparsityPhi     = 0.102947
    SparsityTheta   = 0.768934
    ================= Iteration 3 took 00:11:55.172
    Perplexity      = 20993.5
    SparsityPhi     = 0.458439
    SparsityTheta   = 0.902972
    ================= Iteration 4 took 00:11:56.804
    ...

-p sets the number of passes (iterations). We were not sure which regularizers to use, so only the "sparsity" ones were activated. The lack of detailed documentation did not help (the developers have promised to fix it). It is worth noting that peak RAM usage was no more than 30 GB, which is very impressive compared to gensim and, God forgive me, sklearn.
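
For those who prefer the Python API to the CLI, a rough equivalent of the run above might look like the sketch below. It is written from memory of the artm package and is not the code we actually ran; note in particular that in the Python API a negative tau is what makes the smooth/sparse regularizers sparsify, and we did not check how exactly that maps onto the CLI's "0.05 SparsePhi" weights.

    import artm

    # Convert the Vowpal Wabbit file into BigARTM batches (done once).
    bv = artm.BatchVectorizer(data_path="dataset_vowpal_wabbit.txt",
                              data_format="vowpal_wabbit",
                              target_folder="batches")

    model = artm.ARTM(num_topics=200, dictionary=bv.dictionary)

    # Negative tau sparsifies Phi and Theta.
    model.regularizers.add(artm.SmoothSparsePhiRegularizer(name="sparse_phi", tau=-0.05))
    model.regularizers.add(artm.SmoothSparseThetaRegularizer(name="sparse_theta", tau=-0.05))
    model.scores.add(artm.PerplexityScore(name="perplexity", dictionary=bv.dictionary))

    model.fit_offline(batch_vectorizer=bv, num_collection_passes=10)
    print(model.score_tracker["perplexity"].last_value)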


Topics


As a result, the 200 topics can be divided into the following groups:



Concepts


Perhaps the most interesting group, with a bunch of facts about everyday life extracted from code:


  1. There is cheese in pizza, and many repositories mention it.
  2. Terms from mathematics, linear algebra, cryptography, machine learning, digital signal processing, genetic engineering, elementary particle physics.
  3. Days of the week. Monday, Tuesday, etc.
  4. All sorts of facts and characters from RPG and other fantasy games.
  5. IRC and its nicknames.
  6. Many design patterns (thanks, Java and PHP).
  7. Colors, including some exotic ones (thanks, CSS).
  8. The email has CC, BCC, and it is sent via SMTP protocol and received by POP / IMAP.
  9. How to create a good datetime picker. It seems to be a very typical project on github, hehe.
  10. People work for money and spend it on buying houses and driving (obviously, from home to work and back).
  11. All kinds of "iron".
  12. A comprehensive list of HTTP, SSL, Internet, Bluetooth and WiFi terms.
  13. Everything you want to know about memory management.
  14. Everything you need if you want to build your own Android-based firmware.
  15. Bar codes. A huge number of different kinds.
  16. People. They are divided into men and women, they live and have sex.
  17. Excellent list of text editors.
  18. Weather. Many typical words.
  19. Open licenses. Strictly speaking, they should not have formed a separate topic: license names and texts should not, in theory, end up among the extracted names. From our experience with Pygments, some languages are supported much worse than others, and apparently some files were parsed incorrectly.
  20. Commerce. The shops offer discounts and sell goods to customers.
  21. Bitcoin and blockchain.

Human languages


The list of topics includes Spanish, Portuguese, French and Chinese. Russian did not form a separate topic, which says more about the level of our programmers on GitHub, who write straight in English, than about a small number of Russian repositories. In this respect, the Chinese repositories are something else.


Programming languages


An interesting find among the programming languages is the "Non-native English PHP" topic, associated with PHP projects written by people for whom English is not their native language. Apparently these two groups of programmers write code in noticeably different ways. In addition, there are two topics related to Java: JNI and bytecode.


General IT


This group is not as interesting. There are many repositories with OS kernels: large, noisy and, despite our efforts, polluting some of the topics. Still, a few things are worth mentioning.



Communities


This group is the largest, with almost 100 topics. Many repositories turned out to be personal cloud backups of text editor configurations, especially Vim and Emacs. Since Vim got only one topic while Emacs got two, I hope this finally settles the argument about which editor is better!

We saw websites on every known web engine, written in Python, Ruby, PHP, Java, Javascript, etc. PHP sites use the Wordpress, Joomla, Yii, VTiger, Drupal, Zend, Cake and Symfony engines (the last for some reason together with Doctrine), one topic each. Python: Django, Flask, Google AppEngine. Ruby: Rails and only Rails. Raaails. Sites in Java collapsed into one mixed topic. And of course there was room for sites on Node.js.

It turned out that many projects use Tesseract, an open source OCR engine. In addition, many use Caffe (and none of them TensorFlow).

Quake 3 / idTech 3 is so popular among game developers that it earned a separate topic. Unity3D got two, the bulk of them student projects and hobby crafts.
Cocos2D is also popular and has 2 topics. Finally, there were 3 topics about OpenGL + WebGL; the difference is probably in the way the API is used and the bindings involved (GLUT, etc.).

Unsurprisingly, Chef, the configuration management tool, shares a topic with cooking (recipes, kitchens, etc.). However, WinAPI unexpectedly ended up in the same topic as repositories about Pokemon. Our guess is that stemming made WinAPI symbol names look like Pokemon names...


Games


Many topics are related to SDL , as well as to Minecraft and RPG.


What can be downloaded


We prepared a Docker image so that anyone can run our trained model on an arbitrary GitHub repository. You just need to run


    docker run srcd/github_topics apache/spark

and you will see the top 5 topics. The image contains a serialized topic-word matrix, which is also available separately: link. The format is a pickle (protocol 4) with a tuple of length 2: the first element is a Pandas 0.18+ SparseDataFrame, and the second is a list with the IDF values. In addition, there is an OpenDocument table and a JSON file with the extracted topics.
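
If you want to use the matrix without Docker, a sketch of loading it and scoring a repository might look like this. The layout of the tuple (topics as rows, words as columns, IDF aligned with the columns), the file name and the weighting are all our assumptions from the description above, so treat it as a starting point rather than what the Docker image actually does.

    import pickle

    import numpy as np

    with open("github_topics.pickle", "rb") as f:       # hypothetical file name
        topic_word_df, idf = pickle.load(f)             # SparseDataFrame + IDF list

    topic_word = topic_word_df.to_dense().values        # assumed (n_topics, n_words)
    word_index = {w: i for i, w in enumerate(topic_word_df.columns)}

    def top_topics(bag, k=5):
        """Rank topics for a repository given its bag of stemmed identifiers."""
        vec = np.zeros(topic_word.shape[1])
        for token, count in bag.items():
            i = word_index.get(token)
            if i is not None:
                vec[i] = (1.0 + np.log(count)) * idf[i]  # same log-TF-IDF guess as above
        scores = topic_word @ vec
        best = np.argsort(scores)[::-1][:k]
        return [(topic_word_df.index[i], float(scores[i])) for i in best]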


Findings


As already mentioned, 200 topics are too few: many topics turn out to be merged from two or three, or simply blurred. For a truly rigorous analysis we should have used 500 or 1000, but then manual labeling would have been out of the question. It is hard to make sense of the endless PHP topics if you are not into it :). Years of reading Habr articles definitely helped, and still I felt out of my depth. But it turned out interesting nevertheless. The most impressive achievement of ARTM, in my opinion, is extracting topics about people, nature, science, and even design patterns from nothing but names in source code.

We plan to add README files to the model, and possibly other text sources. Perhaps they will strengthen the Concepts group.

PS


Mining source code in the classical machine learning sense (and not just naively collecting metrics from ASTs and the like) is a new field, not very popular yet, with practically no scientific papers. In the long term we want to (and roughly understand how to) replace part of programmers with a deep neural network that translates natural-language descriptions of business tasks into code. It sounds fantastic, but the technology has actually matured, and if it works, the revolution will be steeper than industrialization. We are sorely short of people, and we are hiring!

The main difficulty in this business is getting access to the data. The GitHub API limits registered users to 5000 requests per hour, which is of course not enough when you want 18 million repositories. There is the GHTorrent project, but it is only a faint shadow of the data we collected. We had to build a dedicated pipeline in Go that uses go-git for highly efficient cloning. As far as we know, three companies have a complete replica of GitHub: GitHub, SourceGraph and source{d}.

Source: https://habr.com/ru/post/312596/

