📜 ⬆️ ⬇️

The most popular words in two terabytes of code

Hello friends! I analyzed 2TB code here and got the most popular words in different programming languages. The results can be viewed in the form of tag clouds and a simple list:

image
The site is here , and its source can be read on github .

Under the cat described in detail about how the data was collected, how the site was built and how the clouds fit. And a little bit of observation.
')
Enjoy reading!

Observations



How?


I collected the data using BigQuery. Google, together with GitHub, posted a full snapshot of the sources on a public date set github_repos . The index was built at the end of 2016.

When building clouds, I impose some restrictions on the data:


How was the data collected?


BigQuery is an awesome platform. The contents of all files from the githab are stored in plain text in tables .

FileContent
File 1.h// File 1 content \ n # ifndef FOO \ n # define FOO ...
File 2.h// File 2 content \ n # ifndef BAR \ n # define BAR ...

BigQuery allows you to write an ordinary SQL query and execute it with amazing speed.

At first I decided to break the contents of all files into words, and then use GROUP BY to count them.

WordHow many times have I met
File2
content2
......

Unfortunately, this approach takes the word out of context. I really wanted to show words with examples of how they are used:

image

How to be? Instead of a simple word break, I created an intermediate table, where the files are broken line by line:
LineHow many times did I meet a string
// File 1 contentone
#ifndef FOOone
#ifndef FOOone
......

Such intermediate storage reduces the size of the processed data from ~ 2TB to ~ 12GB.

Now, to get the most popular words from this table, we can break each line into individual words, but at the same time keep the original line:
LineWord
// File 1 contentFile
// File 1 contentcontent
#ifndef FOOifndef
#define FOOFOO
......

It would seem that almost nothing has changed. But, in this interpretation, we can use window functions to get the top 10 lines for each word ( SELECT ... OVER (PARTITION BY ...) - as in this question on StackOverflow ).

The current request code can be found here: extract_words.sql .

By the way ... My SQL is at a very basic level. Because if you, dear reader, find an error, or know a more suitable way - please let me know.

How to draw a tag cloud?


At the heart of all tag developers I know is this algorithm:

`w`:
1. `w` (x, y)
2. - 1.

This code can run indefinitely, because we either stop trying after several iterations, or reduce the font size until the word fits.

For simplicity, words can be thought of as rectangles. We are trying to place all the rectangles on the screen so that no rectangle crosses the occupied pixels on the screen.

The most resource-intensive part of this algorithm is intersection checking. Especially at the end, when all the free space is basically already taken, finding a new area where you can insert a word becomes very difficult (and sometimes not possible).

Different implementations are trying to speed up this part of the algorithm by indexing the occupied space.


I wanted to try something different. Instead of indexing the occupied space, I wanted to have an index of the free area. So you can immediately select a large rectangle in which there is guaranteed a place for a new arrival.

For the index, I used a quad tree . Each intermediate tree node stores information about how many free / occupied pixels there are. So you can instantly sweep aside quadrants in which there are not enough free pixels.

The easiest way to see it in the picture. Here is the quad-tree of the JS logo:

image

White empty rectangles are free space. If you need to add a new rectangle which is smaller than any of these empty rectangles - we can safely draw it there.

This approach gives good results, but can lead to visual artifacts. After all, no new rectangle can be located at the intersection of quadrants:

image

Moreover, what if none of the free quadrants are large enough? And if the adjacent quadrants are combined, then there is enough space?

The union of free quadrants was my next step. I just "expand" quads to the left / right of the target. This slows down the build time a little, but reduces artifacts and gives better results:

image

By the way ... My handler code is not available with the site. It was quickly written and is difficult to use in other contexts. If you need a good handler, look at
amueller / word_cloud

How was the site made?


Text rendering


In general, I was pleased with the speed of building a tag cloud. But in the context of my site this speed was not enough.

I used SVG to draw words on the screen. The rendering of so many text-based SVG elements can easily block the UI stream for a few seconds. Where are the positions for the tag cloud?

Fortunately, you can remove the stacker offline. Instead of counting the positions of the words on the fly, when the browser opens the page, I decided to count the positions once, save them to a file, and then draw them statically. This allows us to focus on optimizing the UI flow.

In order not to block the browser for long periods of time, you need to break all the work into small pieces, and perform it asynchronously. At one iteration of the event loop, we add N words and exit the function so that the browser can handle other events. At the next iteration, we add more words, and so on.

For these purposes, I wrote anvaka / rafor . This library is an adaptive, asynchronous `for` loop based on requestAnimationFrame() . All iterations are performed at different stages of the event cycle, and thereby reduces the load on the UI stream. The initial loading of the site looks smoother.

Navigation and Zoom


Using the mouse, keyboard, or touch-screen, you can zoom in, delete a map, and move it around the screen, just like Google Maps does. All this is done using the panzoom library.

Application model


I use vue.js for UI. It is easy to use and quick to use. It is especially cool to have vue components in separate files - you don't often have to switch between js / markup / styles. Hot-reload makes development especially enjoyable.

Application state is stored in a single appState object. When you select a programming language, words and their context are loaded asynchronously.

For the exchange of events between components, I use my mini-library ngraph.events . Initially, I made it for a speedy exchange of events in my graph libraries. But here it works fine as a dispatcher.

Finally, anvaka / query-state binds the query string to bidirectional binding to the choice of programming language.

image

Finally


This project has been my evening hobby for the past two months. Despite all the flaws of the tag clouds, I was very interested.

I sincerely hope that you enjoyed this study too :)!

Thank you, dear reader, for your attention. And special thanks to my soul mate for her endless support and hints.

Source: https://habr.com/ru/post/320256/


All Articles