docker run srcd/github_topics apache/spark
(replace apache/spark with any GitHub repository you like).

A probabilistic topic model on a collection of documents $D$ describes the frequency of a word $w$ in a document $d$ through a set of topics $T$:

$$p(w \mid d) = \sum_{t \in T} \varphi_{wt}\,\theta_{td},$$

under the conditions $\varphi_{wt} \ge 0$, $\theta_{td} \ge 0$, $\sum_{w \in W} \varphi_{wt} = 1$ and $\sum_{t \in T} \theta_{td} = 1$, where $\varphi_{wt}$ is the probability of word $w$ in topic $t$ and $\theta_{td}$ is the probability of topic $t$ in document $d$.
The essence of ARTM is that regularization is added naturally, in the form of extra terms in the log-likelihood:

$$\sum_{d \in D}\sum_{w \in d} n_{dw} \ln \sum_{t \in T} \varphi_{wt}\theta_{td} \;+\; R(\Phi, \Theta) \;\to\; \max_{\Phi,\Theta}.$$

Since this is a simple addition, we can combine different regularizers in one optimization, for example sparsify the matrices and increase the independence of the topics. LDA is formulated in ARTM terms like this:

$$R(\Phi, \Theta) = \sum_{t,w} (\beta_w - 1)\ln \varphi_{wt} + \sum_{d,t} (\alpha_t - 1)\ln \theta_{td}.$$
The variables $\varphi_{wt}$ and $\theta_{td}$ can be computed efficiently with an iterative EM algorithm. Dozens of ready-made ARTM regularizers are available, battle-ready, as part of BigARTM.
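For illustration, here is a minimal sketch of what such a combination of regularizers looks like in the BigARTM Python API (we used the CLI below; the class names and parameter values here follow the artm package and are an approximation, not our exact configuration):

```python
import artm  # BigARTM Python bindings

# Build batches from a Vowpal Wabbit file (hypothetical paths).
batches = artm.BatchVectorizer(data_path='dataset_vowpal_wabbit.txt',
                               data_format='vowpal_wabbit',
                               target_folder='batches')

model = artm.ARTM(num_topics=200,
                  dictionary=batches.dictionary,
                  scores=[artm.PerplexityScore(name='perplexity',
                                               dictionary=batches.dictionary)])

# Combining regularizers is literally just adding terms:
# sparsify Phi and Theta, and decorrelate the topics.
model.regularizers.add(artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.05))
model.regularizers.add(artm.SmoothSparseThetaRegularizer(name='sparse_theta', tau=-0.05))
model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelate', tau=1e5))

model.fit_offline(batch_vectorizer=batches, num_collection_passes=10)
print(model.score_tracker['perplexity'].last_value)
```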
This concludes the forced retelling of the ShAD (Yandex School of Data Analysis) lecture; our own story begins.
As of October 2016, about 18 million repositories on GitHub were available for analysis. There are actually many more; we simply dropped forks and "hard forks" (forks that GitHub does not mark as such). We treated each repository as a document $d$, and each name in its source code as a word $w$.
The source code analysis was done with the same tools as in our earlier experiments on deep learning over source code (see our presentations from the recent RE·WORK conferences in Berlin and London): initial classification with github/linguist and parsing based on Pygments. General-purpose text files, such as README.md, were discarded.
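As a rough sketch of this step (not our actual pipeline; the file name is made up), identifiers can be pulled out of a source file with a Pygments lexer by keeping only the Name tokens:

```python
from pygments.lexers import guess_lexer_for_filename
from pygments.token import Name


def extract_names(path):
    """Yield identifier tokens (class, function, variable names) from one source file."""
    with open(path, encoding='utf-8', errors='ignore') as f:
        code = f.read()
    lexer = guess_lexer_for_filename(path, code)
    for token_type, value in lexer.get_tokens(code):
        if token_type in Name:   # keep identifiers, drop keywords, literals, comments
            yield value


print(list(extract_names('RDD.scala'))[:10])  # hypothetical file
```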
Names should not be extracted from the source "head-on": for example, class FooBarBaz adds three words to the bag: foo, bar and baz, while int wdSize adds two: wdsize and size. In addition, the names were stemmed with the Snowball stemmer from NLTK, although we did not specifically explore the benefits of doing so. The final preprocessing stage consisted of computing the logarithmic version of the TF-IDF weighting (again, we did not investigate this specifically, we simply copied the usual NLP recipe) and filtering out names that were too rare or too common; in our case the thresholds were 50 and 100000, respectively.
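A minimal sketch of this name preprocessing, assuming a simple camelCase-splitting heuristic and the thresholds mentioned above (the exact rules in our pipeline differed in details):

```python
import math
import re
from collections import Counter, defaultdict

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
PIECE = re.compile(r'[A-Z]?[a-z]{3,}')  # heuristic: sub-words of at least 3 letters


def split_name(name):
    """FooBarBaz -> ['foo', 'bar', 'baz'];  wdSize -> ['wdsize', 'size']."""
    parts = [p.lower() for p in PIECE.findall(name)]
    lowered = name.lower()
    return parts if ''.join(parts) == lowered else [lowered] + parts


def bag_of_names(names):
    """Turn a stream of identifiers from one repository into a stemmed bag of words."""
    return Counter(stemmer.stem(w) for name in names for w in split_name(name))


def log_tfidf(bags, min_df=50, max_df=100000):
    """Logarithmic TF-IDF over all repositories, dropping too rare / too common names."""
    df = defaultdict(int)
    for bag in bags.values():
        for word in bag:
            df[word] += 1
    n_docs = len(bags)
    return {
        repo: {w: (1 + math.log(tf)) * math.log(n_docs / df[w])
               for w, tf in bag.items() if min_df <= df[w] <= max_df}
        for repo, bag in bags.items()
    }
```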
After ARTM produced its result, we had to name the topics manually, based on their keywords and representative repositories. The number of topics was set to 200 and, as it turned out, should have been larger, because GitHub covers a great many topics. This tedious work took a whole week.
Preprocessing was performed on Dataproc, i.e. Spark in the Google cloud, while the main processing was done locally on a powerful machine. The resulting sparse matrix was about 20 GB in size and had to be converted into the Vowpal Wabbit text format so that the BigARTM CLI could digest it. The data was ground through fairly quickly, in a couple of hours:
bigartm -c dataset_vowpal_wabbit.txt -t 200 -p 10 --threads 10 --write-model-readable bigartm.txt --regularizer "0.05 SparsePhi" "0.05 SparseTheta"
Parsing text collection... OK.
Gathering dictionary from batches... OK.
Initializing random model from dictionary... OK.
Number of tokens in the model: 604989
================= Processing started.
Perplexity      = 586350
SparsityPhi     = 0.00214434
SparsityTheta   = 0.422496
================= Iteration 1 took 00:11:57.116
Perplexity      = 107901
SparsityPhi     = 0.00613982
SparsityTheta   = 0.552418
================= Iteration 2 took 00:12:03.001
Perplexity      = 60701.5
SparsityPhi     = 0.102947
SparsityTheta   = 0.768934
================= Iteration 3 took 00:11:55.172
Perplexity      = 20993.5
SparsityPhi     = 0.458439
SparsityTheta   = 0.902972
================= Iteration 4 took 00:11:56.804
...
The -p flag sets the number of iterations. We were not sure which regularizers to use, so only the "sparsity" ones were activated. The lack of detailed documentation took its toll (the developers have promised to fix it). It is worth noting that peak RAM usage was no more than 30 GB, which is very impressive compared with gensim and, God forgive me, sklearn.
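For reference, the Vowpal Wabbit text format that BigARTM reads is roughly one document per line: a document label followed by token:weight pairs. A sketch of the conversion (function and file names are made up for illustration):

```python
import re


def write_vowpal_wabbit(weighted_bags, path='dataset_vowpal_wabbit.txt'):
    """weighted_bags: {repo_name: {token: tf-idf weight}} -> Vowpal Wabbit text file."""
    with open(path, 'w', encoding='utf-8') as out:
        for repo, bag in weighted_bags.items():
            # the document label must not contain spaces, colons or pipes
            doc_id = re.sub(r'[\s:|]', '_', repo)
            pairs = ' '.join('%s:%.3f' % (token, w) for token, w in bag.items())
            out.write('%s %s\n' % (doc_id, pairs))
```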
As a result, 200 topics can be divided into the following groups:
Perhaps the most interesting group, with a bunch of facts about everyday life extracted from the data:
The list of topics includes Spanish, Portuguese, French and Chinese. Russian did not form a separate topic, which more likely indicates the higher proficiency of our programmers on GitHub, who write directly in English, than a small number of Russian repositories. In this respect, the Chinese repositories are simply striking.
An interesting find among the programming languages is the "Non-native English PHP" topic, associated with PHP projects written by people for whom English is not their native language. Apparently, these two groups of programmers write code in fundamentally different ways. In addition, there are two topics related to Java: JNI and bytecode.
This group is not so interesting. There are many repositories containing OS kernels: large, noisy, and, despite our efforts, they muddled some of the topics. Still, a few things are worth mentioning:
This group is the largest, numbering almost 100 topics. Many repositories turned out to be people's cloud backups of their text editor configurations, especially for Vim and Emacs. Since Vim got only one topic while Emacs got two, I hope this puts an end to the argument about which editor is better!
We encountered sites built on every known web engine, written in Python, Ruby, PHP, Java, JavaScript, etc. PHP sites use WordPress, Joomla, Yii, VTiger, Drupal, Zend, Cake and Symfony (the latter, for some reason, together with Doctrine), with a topic for each. Python: Django, Flask, Google App Engine. Ruby: Rails and only Rails. Raaails! Java sites collapsed into one mixed topic. And of course there was room for sites on Node.js.
It turned out that many projects use Tesseract, an open-source OCR engine. In addition, many use Caffe (and not a single one uses TensorFlow).
Quake III / id Tech 3 is so popular among game developers that it earned a separate topic. Unity3D got two, and the bulk of the first one is student projects and hobby crafts.
Cocos2D is also popular and has two topics. Finally, there were three topics about OpenGL + WebGL; the difference probably lies in how the API is used and which bindings are involved (GLUT, etc.).
Unsurprisingly, Chef, a configuration management tool, ended up sharing a topic with cooking (recipes, kitchen, etc.). More unexpectedly, WinAPI landed in the same topic as repositories about Pokemon. Our guess is that stemming made WinAPI symbol names look like Pokemon names...
Many topics are related to SDL, as well as to Minecraft and RPGs.
We prepared a Docker image so that anyone can run our trained model on an arbitrary GitHub repository. You just need to run
docker run srcd/github_topics apache/spark
and you will see the top 5 topics. Inside the image there is a serialized matrix of topics and words; it is also available separately: link. The format is pickle protocol 4 containing a tuple of length 2: the first element is a pandas 0.18+ SparseDataFrame, and the second is a list with the IDF weights. In addition, there is an OpenDocument spreadsheet and a JSON file with the extracted topics.
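For example, the serialized matrix can be loaded and inspected like this (a sketch: the file name is hypothetical, and the orientation of the DataFrame should be checked against the actual artifact):

```python
import pickle

with open('github_topics.pickle', 'rb') as f:   # hypothetical file name
    topics, idf = pickle.load(f)                # (SparseDataFrame, list of IDF weights)

print(topics.shape)        # assumed to be topics x stemmed names
print(topics.index[:5])    # first few topic labels

# Top 10 names of the first topic, assuming rows are topics:
print(topics.loc[topics.index[0]].sort_values(ascending=False)[:10])
```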
Mining source code in the sense of classical machine learning (and not just naively collected metrics over ASTs and production) is a new field, not yet very popular, with practically no scientific papers. Looking ahead, we want to, and roughly understand how to, replace part of programmers' work with a deep neural network that translates descriptions of business tasks in natural language into code. It sounds fantastic, but the technology has actually matured, and if it works, there will be a revolution steeper than industrialization. We are sorely short of people. We are hiring!
The main difficulty in this business is getting access to the data. The GitHub API limits registered users to 5000 requests per hour, which is of course not enough if we want to fetch 18 million repositories. There is the GHTorrent project, but it is only a faint shadow of the data we collected. We had to build a dedicated pipeline in Go that uses go-git for ultra-efficient cloning. As far as we know, only three companies have a complete replica of GitHub: GitHub itself, SourceGraph and source{d}.
Source: https://habr.com/ru/post/312596/