GitHub hosts more than 300 programming languages, ranging from well-known languages such as Python, Java, and JavaScript to esoteric languages such as Befunge, known only to small groups of people.
Top 10 programming languages hosted on GitHub by number of repositories

One of the challenges GitHub faces is recognizing different programming languages. When code lands in a repository, identifying its language matters for search, vulnerability alerts, syntax highlighting, and the structured presentation of repository content to users.
At first glance, language recognition is a simple task, but this is not entirely true.
Linguist is the tool we currently use to detect programming languages on GitHub. It is a Ruby application that uses several language detection strategies, including file name and file extension information. It also takes Vim and Emacs modelines into account, as well as the content at the top of the file (the shebang line). Linguist resolves language ambiguity heuristically and, if that fails, falls back on a naive Bayes classifier trained on a small sample of data.
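To make one of these strategies concrete, here is a toy sketch of shebang-based detection for a file without an extension. The regex and the interpreter-to-language mapping are our own assumptions for illustration, not Linguist's actual implementation.

```python
import re

# Assumed sample mapping from interpreter name to language;
# Linguist's real tables are much larger.
SHEBANG_LANGS = {
    "python": "Python", "python3": "Python",
    "sh": "Shell", "bash": "Shell",
    "ruby": "Ruby", "node": "JavaScript",
}

def language_from_shebang(source: str):
    """Guess a language from the shebang line, if one is present."""
    first_line = source.splitlines()[0] if source else ""
    match = re.match(r"#!\s*(?:/usr/bin/env\s+)?\S*?(\w+)\s*$", first_line)
    return SHEBANG_LANGS.get(match.group(1)) if match else None

print(language_from_shebang("#!/usr/bin/env python3\nprint('hi')"))  # Python
```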
Although Linguist predicts quite well at the file level (84% accuracy), everything breaks down when files are strangely named, and even more so when they have no extension at all. This makes Linguist of little use for content such as GitHub Gists or code snippets in READMEs, issues, and pull requests.
To make language detection more robust in the long term, we developed a machine learning classifier called OctoLingua. It is based on an Artificial Neural Network (ANN) architecture that can handle language prediction in non-trivial scenarios. The current version of the model can make predictions for the top 50 programming languages on GitHub and surpasses Linguist in accuracy.
More about OctoLingua
OctoLingua was written from scratch in Python using Keras with a TensorFlow backend, and it was built to be accurate, reliable, and easy to maintain. In this section we describe our data sources, the model architecture, and OctoLingua's performance tests. We also describe the process of adding support for a new language.
Data sources
The current version of OctoLingua was trained on files obtained from Rosetta Code and from a set of internally crowdsourced repositories. We limited our language set to the 50 most popular languages on GitHub.
Rosetta Code was an excellent starter dataset because it contains source code written for the same tasks in different programming languages. For example, the code for generating
Fibonacci numbers is available in C, C++, CoffeeScript, D, Java, Julia, and others. However, the language coverage was not uniform: for some programming languages there were only a few files, while others simply contained too little code. We therefore had to supplement our training set with additional sources, which significantly improved language coverage and the performance of the final model.
Our process for adding a new language is not fully automated. We programmatically collect source code from publicly available repositories on GitHub, selecting only repositories that meet minimum qualification criteria, such as having a minimum number of forks, covering the target language, and covering specific file extensions. At this data collection stage, we determine the primary language of a repository using Linguist's classification.
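As an illustration, the collection step could query the GitHub search API for public repositories that match the target language and clear a fork threshold. This is a hedged sketch: the threshold, page size, and star-based sorting are assumptions, not our actual pipeline.

```python
import requests

def candidate_repos(language: str, min_forks: int = 10):
    """Find public repos whose primary language matches the target."""
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={
            # the `language:` qualifier relies on Linguist's classification
            "q": f"language:{language} forks:>={min_forks}",
            "sort": "stars",
            "per_page": 50,
        },
        headers={"Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    return [item["full_name"] for item in resp.json()["items"]]
```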
Features: leveraging prior knowledge
Traditionally, memory-based architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are used to solve text classification problems with neural networks. However, the ways programming languages differ in vocabulary, file extensions, structure, library import style, and other details prompted us to take another approach: one that uses all of this information by extracting features in tabular form to train our classifier. The features are extracted as follows (see the sketch after this list):
- Top 5 special characters in the file
- Top 20 tokens in the file
- File extension
- The presence of certain special characters commonly used in source code, such as colons, curly braces, and semicolons
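Here is a minimal sketch of what this tabular feature extraction could look like. The special-character set, the token regex, and the helper name extract_features are assumptions; OctoLingua's actual preprocessing is not public.

```python
import re
from collections import Counter

# An assumed set of "special" characters, not OctoLingua's actual one.
SPECIAL_CHARS = set(":;{}()[]<>#$%&*+-=/\\|@!?'\"`~^.")

def extract_features(path: str, source: str) -> dict:
    """Turn a single source file into a flat, tabular feature record."""
    special_counts = Counter(ch for ch in source if ch in SPECIAL_CHARS)
    token_counts = Counter(re.findall(r"\w+", source))
    features = {
        # Top 5 special characters in the file
        "top_special": [c for c, _ in special_counts.most_common(5)],
        # Top 20 tokens in the file
        "top_tokens": [t for t, _ in token_counts.most_common(20)],
        # File extension (empty if the file has none)
        "extension": path.rsplit(".", 1)[-1].lower() if "." in path else "",
    }
    # Presence of special characters common in source code
    for ch in (":", ";", "{", "}"):
        features[f"has_{ch}"] = ch in source
    return features
```

In a real pipeline these records would still need to be vectorized (for example, one-hot encoded) before being fed to the network.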
Artificial Neural Network (ANN) model
We use the features above as input to a two-layer neural network built with Keras on a TensorFlow backend.
The diagram below shows that the feature extraction step produces an n-dimensional tabular input for our classifier. As information travels through the layers of our network, it is regularized with dropout, and the result is a 51-dimensional output representing the probability that the code is written in each of the top 50 languages on GitHub, plus the probability that it is not written in any of them.
ANN structure of the original model (50 languages + 1 for “other”)

We used 90% of our source corpus for training. In addition, at the model training step a portion of the file extensions were removed so that the model would learn the vocabulary of the files rather than their extensions, which predict the programming language so well.
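A minimal Keras sketch of such a network follows. The input width, hidden layer sizes, and dropout rate are assumptions; only the overall shape (dense layers with dropout and a 51-way softmax output) comes from the description above.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 51   # top 50 languages + 1 "other" bucket
INPUT_DIM = 256    # assumed width of the vectorized feature table

model = models.Sequential([
    layers.Input(shape=(INPUT_DIM,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),                 # regularization via dropout
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # per-language probabilities
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```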
Performance tests
OctoLingua vs Linguist
In the table below, we show the F1 score (the harmonic mean of precision and recall) for OctoLingua and Linguist, computed on the same test set (10% of our original data source).
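As a reminder of the metric, here is a toy computation with scikit-learn; the labels below are made up purely for illustration.

```python
from sklearn.metrics import f1_score

# Hypothetical labels, only to illustrate the metric reported below.
y_true = ["Python", "Java", "Ruby", "Python", "Go"]
y_pred = ["Python", "Java", "Python", "Python", "Go"]

# Weighted F1: per-class harmonic mean of precision and recall,
# averaged with each class weighted by its support.
print(f1_score(y_true, y_pred, average="weighted", zero_division=0))  # 0.72
```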
Three tests are shown. In the first, the dataset was left untouched; in the second, file extensions were removed; in the third, file extensions were shuffled in order to confuse the classifier (for example, a Java file could have the extension “.txt” and a Python file the extension “.java”).
The intuition behind shuffling or removing file extensions in our test suite is to assess OctoLingua's robustness in classifying files when a key feature is missing or misleading. A classifier that does not depend heavily on the extension would be extremely useful for classifying logs and code snippets, since in those cases people usually do not provide accurate extension information (for example, many code-related logs have a .txt extension).
The table below shows that OctoLingua performs well under all of these conditions, which suggests that the model learns mainly from the vocabulary of the code rather than from meta-information such as the file extension. Linguist, by contrast, misclassifies the language as soon as the correct file extension information is missing.
OctoLingua vs. Linguist performance on the same test suite

Effect of removing file extensions when training the model
As mentioned earlier, during training we removed a certain percentage of file extensions from the data to force the model to learn the vocabulary of the files. The table below shows the performance of our model with different shares of file extensions removed during training.
Performance of OctoLingua with different shares of file extensions removed

Note that a model trained with file extensions intact performs significantly worse on test files without extensions or with shuffled extensions than on ordinary test data. On the other hand, when the model is trained on a dataset in which a portion of the file extensions have been removed, its performance barely drops on the modified test suites. This confirms that removing extensions from a portion of the files during training pushes our classifier to learn more from the vocabulary of the code. It also shows that the file extension feature tended to dominate, preventing more weight from being given to the content features.
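A sketch of what this training-time augmentation could look like, using the feature records from the earlier extract_features sketch; the 50% default drop rate is an assumption.

```python
import random

def drop_extensions(samples, rate=0.5, seed=42):
    """Blank out the extension feature for a random fraction of samples."""
    rng = random.Random(seed)
    augmented = []
    for features in samples:
        features = dict(features)       # copy so the original is untouched
        if rng.random() < rate:
            features["extension"] = ""  # pretend the file has no extension
        augmented.append(features)
    return augmented
```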
New language support
Adding a new language to OctoLingua is a fairly simple process. It starts with collecting a large number of files in the new language (we can do this programmatically, as described in the “Data sources” section). These files are split into training and test sets and then run through our preprocessor and feature extractor. The new dataset is added to the existing pool, and the test set lets us verify that the accuracy of our model remains acceptable. A sketch of this flow follows.
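A hypothetical sketch of that flow, reusing the extract_features helper from the earlier sketch; the 90/10 split and the pool layout are assumptions.

```python
from sklearn.model_selection import train_test_split

def add_language(paths, sources, language, pool):
    """Fold a newly collected language into the existing data pool."""
    rows = [extract_features(p, s) for p, s in zip(paths, sources)]
    train_rows, test_rows = train_test_split(rows, test_size=0.1,
                                             random_state=0)
    pool.setdefault("train", []).extend((r, language) for r in train_rows)
    pool.setdefault("test", []).extend((r, language) for r in test_rows)
    return pool  # the model is then retrained and re-evaluated on pool["test"]
```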
Adding a new language to OctoLingua

Our plans
OctoLingua is currently at an “advanced prototyping” stage. Our language classification engine is already reliable, but it does not yet support all of the programming languages available on GitHub. Beyond expanding language support, which is relatively straightforward, we aim to provide language detection at different levels of code granularity. Our current implementation already allows us, with a small modification of the machine learning pipeline, to classify code snippets. Taking the model to a stage where it can reliably detect and classify embedded languages also seems within reach.
We are also considering publishing the source code of our model, but we would first like to see interest from the community.
Conclusion
Our goal in developing OctoLingua is to build a service that provides reliable source code language detection at different levels of granularity: from files or code snippets down to, potentially, line-level language detection and classification. All of our work on this service is aimed at supporting developers in their day-to-day development work and at creating the conditions for writing high-quality code.
If you are interested in contributing to our work, please feel free to contact us on Twitter @github!