Habra Analytics Tools: Compare Hubs

If you do not read Habr on weekends, you most likely missed the launch of the mini-project Habra Analytics Tools. The goal is simple: to provide Habrahabr authors with tools for analyzing articles and hubs. The first tools are devoted to analyzing and comparing hubs, and they are useful above all for assessing the audience of articles. For example, the graph on the right shows which hubs are also read by subscribers of the Google hub (the height of each bar is the percentage of subscribers who also read the hub on the X axis), and the graph on the left shows its position relative to two other hubs (Artificial Intelligence and Linux).

We discussed the construction of Venn diagrams earlier in this article; the source code is available on GitHub (as are ready-made executable files for Windows, Linux and Mac OS). You will also need to download and unpack data.7z (it is needed for this article as well).

Here we will talk about the preferences of readers of a particular hub in relation to all other Habrahabr hubs at once.

The source code of the tool, called hubs, is available on GitHub; executable files for Windows and Linux are available there as well.

Why do you need it?

The most obvious use of the hubs tool is to analyze the preferences of the target audience. Imagine that we are writing for a corporate blog and want to find out what else our subscribers are reading. What topics interest them? Consider as an example the corporate Yandex blog:

For comparison, we give the Yandex hub (not a corporate blog):

From the two graphs, we see that the preferences differ considerably (although both audiences share some love for Google).

This difference arises primarily because the audiences of the blog and the hub themselves overlap only partially:

Thus, the Venn diagram tells us that the two audiences differ considerably, and the two histograms above show exactly how the readers' tastes differ. So the decision on whether to publish in the corporate blog, in the regular hub, or in both can take into account the topic of the article and how well it matches the user preferences shown in the histograms above.
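
The regions of such a Venn diagram can be computed directly from two sets of subscribers. Below is a minimal sketch, not the tool's actual code: the function name venn_regions and the toy user ids are assumptions made for illustration only.

```python
def venn_regions(a, b):
    """Return the sizes of the three Venn regions: only a, both, only b."""
    return len(a - b), len(a & b), len(b - a)


# Hypothetical subscriber ids, not real Habrahabr data.
blog = {1, 2, 3, 4, 5, 6}   # readers of the corporate blog
hub = {5, 6, 7, 8}          # readers of the regular hub

only_blog, both, only_hub = venn_regions(blog, hub)
print(only_blog, both, only_hub)  # 4 2 2

# Publishing in both places reaches the union of the two audiences.
print(len(blog | hub))  # 8
```

The smaller the middle region is relative to the two outer ones, the more the choice of where to publish matters.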

Similar hubs

If the Venn diagram shows the mutual arrangement of two hubs and answers the question "How will the audience of hub X grow if we add Y?", then the hubs histogram answers a different question: which hubs Z_1, Z_2, ..., Z_n resemble hub X?

Here we provide two metrics to compare hubs:

If z% of the readers of hub X also read hub Y, then X ~ Y = z. For example, if 10% of the readers of the Cosmonautics hub are subscribed to the C++ hub, then 10 is the degree of similarity of the Cosmonautics hub to the C++ hub (this relation is not symmetric).
Jaccard coefficient: J(X, Y) = |X ∩ Y| / |X ∪ Y|, where X and Y are the sets of subscribers of the two hubs (this relation is symmetric).
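
Both metrics are easy to express over sets of subscribers. The following is a small sketch under the assumption that each hub is represented as a set of user ids; the function names also_read and jaccard and the toy data are hypothetical and are not taken from hubs.py.

```python
def also_read(x, y):
    """Percentage of subscribers of x who also subscribe to y (not symmetric)."""
    if not x:
        return 0.0
    return 100.0 * len(x & y) / len(x)


def jaccard(x, y):
    """Jaccard coefficient |x & y| / |x | y| (symmetric, between 0 and 1)."""
    union = x | y
    if not union:
        return 0.0
    return len(x & y) / len(union)


# Toy data: 10 Cosmonautics readers, 5 C++ readers, one reader in common.
space = set(range(1, 11))
cpp = {10, 11, 12, 13, 14}

print(also_read(space, cpp))  # 10.0: 1 of 10 space readers also reads cpp
print(also_read(cpp, space))  # 20.0: 1 of 5 cpp readers also reads space
print(jaccard(space, cpp) == jaccard(cpp, space))  # True
```

The toy data reproduces the 10% example above and shows why the first metric depends on the direction of comparison while the Jaccard coefficient does not.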

The first metric is best suited for a natural interpretation of the preferences of readers of the hub, and the second can be useful for automatic clustering of hubs in the directory. Let us give an example with an assessment of the preferences of readers of the hub:

In this histogram, we see that the corporate hub has several main groups of readers; let's call them "development" (algorithms, programming, etc.), "security", "open source", "refactoring" and "operating systems". This can be taken into account when writing articles, for example by emphasizing aspects that interest one of these groups of readers.
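
As for the automatic clustering mentioned above, one simple way to sketch it is to link any two hubs whose Jaccard coefficient reaches a threshold and then take the connected groups. This is only an illustration of the idea, not a feature of the tool: the function cluster_hubs, the threshold value and the toy subscriber sets are all assumptions.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard coefficient of two sets (0.0 if both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def cluster_hubs(subscribers, threshold):
    """Group hubs linked by a Jaccard coefficient >= threshold."""
    parent = {name: name for name in subscribers}

    def find(name):
        # Follow parent links to the representative of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(subscribers, 2):
        if jaccard(subscribers[a], subscribers[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in subscribers:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical hubs and user ids.
subscribers = {
    "algorithms": {1, 2, 3, 4},
    "programming": {2, 3, 4, 5},
    "linux": {7, 8, 9},
    "open_source": {8, 9, 10},
    "space": {20, 21},
}

print(cluster_hubs(subscribers, threshold=0.5))
# [['algorithms', 'programming'], ['linux', 'open_source'], ['space']]
```

A symmetric metric is convenient here because the link between two hubs does not depend on which of them is taken first.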

Code, documentation and examples

To install, download one of the following:

the Windows or Linux executable
the sources: hubs.py and src/

You also need to download the data.7z archive (~15 MB, ~200 MB unpacked) and unpack it into the same directory as the script. Then, depending on the downloaded version, run python hubs.py, ./hubs.efl or hubs.exe. We will stick to the first option.

Basic commands

The script is a console application, so its most important command is help, available through the -h and --help flags:
python hubs.py -h
example output:

The main command to display a histogram is “what else are hub subscribers reading”:
python hubs.py --alsoread space
example output:

For each hub, the program uses the short name taken from the link to that hub. For example, for the Cosmonautics hub at habrahabr.ru/hub/space, the name in the program is space. The following command prints the available short names together with the full names of the hubs; all operations use the short Latin names from this list (the same ones used in URLs on Habr):
python hubs.py --hublist
Ideally used with the grep command:

As noted earlier, yandex is both a corporate blog and a hub; the --company key is used to resolve the ambiguity. Therefore, to build a diagram for the corporate Yandex blog, call:
python hubs.py --alsoread yandex --company

To build a histogram based on the Jaccard coefficient, use the --similar key:
python hubs.py --similar space

To display the maximum (minimum) values for up to N hubs, by inclusion (the percentage of shared readers) or by Jaccard coefficient, without the histogram itself, use the --max (--min) key.
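
Conceptually, such a selection is a top-N query over the per-hub scores. The sketch below illustrates only the idea and says nothing about how hubs.py implements these keys; the function top_n and the scores are made up for the example.

```python
import heapq


def top_n(scores, n, largest=True):
    """Return up to n (hub, score) pairs with the largest or smallest scores."""
    pick = heapq.nlargest if largest else heapq.nsmallest
    return pick(n, scores.items(), key=lambda item: item[1])


# Hypothetical similarity scores of other hubs relative to one hub.
scores = {"cpp": 10.0, "linux": 35.5, "google": 42.0, "algorithms": 27.3}

print(top_n(scores, 2))                 # [('google', 42.0), ('linux', 35.5)]
print(top_n(scores, 2, largest=False))  # [('cpp', 10.0), ('algorithms', 27.3)]
```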

Ideas for the following tools

Article monitor: after publishing an article, you call monitor $article_id, and it records and plots the change in views (upvotes, etc.) over time, as well as shares and likes on social networks and possibly readers' comments.

Web interface: all the same, but accessible through the web.

Ideas, comments, help from the audience and suggestions are especially welcome.

Source: https://habr.com/ru/post/221087/
