Do you use MySQL, Postgres or MongoDB in your work, and maybe even Apache Spark? Would you like to know how these projects started and where they are heading now? In this article I present a visualization that shows exactly that.
Today there is a huge variety of data processing tools for every taste: from classic relational databases to modern tools for processing event streams in real time. Open-source projects are especially popular and well loved by developers. Any problem that arises with such technologies will not be swept under the rug by a vendor like Oracle (which would, of course, offer you a workaround), but will be openly discussed and eventually fixed. And you can not only report a problem yourself, but also fix it yourself and contribute your code back to the community.
At the same time, almost all open-source projects now host their source code on GitHub: either the main repository lives there, or at least an up-to-date mirror is maintained. This version control history contains a wealth of information about every change ever made to the source code of each project stored there. What if we analyzed this information?
For the analysis, I took a list of the most popular data processing tools available today and analyzed their repositories. Then, for each user who had contributed to these tools, I sampled their recent activity on GitHub and collected the most popular repositories they had contributed to. As a result, the list of repositories that made it into the visualization is not limited to projects I personally know, but reflects the actual state of the community more objectively. That is why projects like Node.js, Docker and Kubernetes ended up in the visualization, even though they are only indirectly related to data processing.
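The article does not show the code for this sampling step, but a minimal sketch of it could look like the following. It uses the public GitHub REST API endpoint for a user's recent events via the `requests` library; the token and the seed list of contributors are placeholders, and the exact endpoints and filtering the author used may differ.

```python
import os
from collections import Counter

import requests

GITHUB_API = "https://api.github.com"
# Personal access token; unauthenticated requests are heavily rate-limited.
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}


def repos_touched_by(username, pages=3):
    """Return a Counter of repositories the user recently pushed to."""
    touched = Counter()
    for page in range(1, pages + 1):
        resp = requests.get(
            f"{GITHUB_API}/users/{username}/events/public",
            headers=HEADERS,
            params={"page": page, "per_page": 100},
        )
        resp.raise_for_status()
        for event in resp.json():
            if event["type"] == "PushEvent":
                touched[event["repo"]["name"]] += 1
    return touched


# Hypothetical seed list: contributors found in the initial set of repositories.
seed_contributors = ["alice", "bob"]

popular = Counter()
for user in seed_contributors:
    popular.update(repos_touched_by(user))

# Repositories most often pushed to by the seed contributors.
print(popular.most_common(20))
```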
In essence, this visualization has one remarkable property: it is a completely independent look at open-source data processing projects, free of any marketing. After all, it analyzes real changes made to real code by real people. In my opinion, this is great, because vendors advertising their products have done a lot to undermine trust in various analytical reports.
So, within the scope of this work, 150 GitHub repositories were analyzed, containing more than one and a half million commits by 8,333 unique developers. Python was used to download the data, parse it and talk to the GitHub API, Postgres to store and analyze it, and Matplotlib for rendering. The visualization itself was mostly hand-crafted, and the algorithm that moves the graph vertices was also written by hand: in essence, each vertex is attracted to the vertices it is connected to and repelled by nearby ones.
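The author's layout code is not published in the article, but the idea described above is a basic force-directed layout. Here is a minimal sketch of one iteration of such a layout, with made-up constants; unlike the description above, it applies an inverse-square repulsion to all pairs rather than only to nearby vertices, which has a similar effect since distant pairs contribute almost nothing.

```python
import numpy as np


def layout_step(pos, edges, attract=0.05, repel=0.5, min_dist=1e-3):
    """One iteration of a naive force-directed layout.

    pos   -- array of shape (n, 2) with current vertex coordinates
    edges -- list of (i, j) index pairs for connected vertices
    """
    forces = np.zeros_like(pos)

    # Attraction: each vertex is pulled toward the vertices it shares an edge with.
    for i, j in edges:
        delta = pos[j] - pos[i]
        forces[i] += attract * delta
        forces[j] -= attract * delta

    # Repulsion: pairs of vertices push apart, much more strongly when close.
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            delta = pos[i] - pos[j]
            dist = max(np.linalg.norm(delta), min_dist)
            push = repel * delta / dist**3
            forces[i] += push
            forces[j] -= push

    return pos + forces


# Toy usage: three projects, two edges (shared contributors).
positions = np.random.rand(3, 2)
for _ in range(100):
    positions = layout_step(positions, edges=[(0, 1), (1, 2)])
```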
Here is the visualization itself. I recommend watching it at the highest quality, so that all project names are clearly legible.
Each vertex of the graph represents one project. The area of its circle is proportional to the number of unique people who made changes to that project in the 10 weeks preceding the moment shown in the visualization (see the scale at the top of the video). The size of the project's label also depends on the number of unique contributors: large yellow text for the biggest projects, smaller text for smaller ones, and no label at all for the smallest. An edge is drawn between projects A and B if there is a person who made changes to both of them within those same 10 weeks. This seemed reasonable to me, because it ties together forks of the same technology as well as closely related projects such as Apache Hadoop and Apache HBase.
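The actual analysis ran in Postgres, but the rule for vertex sizes and edges can be illustrated with a short, self-contained Python sketch. It assumes a list of (project, author, week) records for every commit; the window length follows the description above, while the sample data and names are invented for illustration.

```python
from itertools import combinations

WINDOW = 10  # weeks preceding the visualized moment

# Hypothetical commit records: (project, author, week number).
commits = [
    ("apache/hadoop", "alice", 101),
    ("apache/hbase", "alice", 104),
    ("apache/spark", "bob", 105),
    ("apache/hadoop", "carol", 106),
]


def graph_at(week, commits):
    """Vertex sizes and edges for the 10-week window ending at `week`."""
    contributors = {}  # project -> set of authors active in the window
    for project, author, w in commits:
        if week - WINDOW < w <= week:
            contributors.setdefault(project, set()).add(author)

    # Vertex size: number of unique contributors (circle area is proportional to it).
    sizes = {project: len(people) for project, people in contributors.items()}

    # Edge between two projects if at least one person contributed to both.
    edges = [
        (a, b)
        for a, b in combinations(contributors, 2)
        if contributors[a] & contributors[b]
    ]
    return sizes, edges


sizes, edges = graph_at(week=106, commits=commits)
print(sizes)  # {'apache/hadoop': 2, 'apache/hbase': 1, 'apache/spark': 1}
print(edges)  # [('apache/hadoop', 'apache/hbase')]
```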