At the moment there are about 350 hubs. The site's functionality lets you sort them by name and by index, but not by any other parameter - for example, by the number of posts - and that is exactly what I wanted.
I was inspired by the article on rating hub posts, and I decided to do something similar, but build a rating of the hubs themselves.
In the first half of the article I will present the ratings of hubs and companies, along with a short analysis of them. In the second half I will describe in detail how I used Java and the JSoup library to parse Habr's HTML pages, and what interesting phenomena and problems I ran into. At the end of the article I will post the full source code of the program.
All 4 full ratings as a web page
Hub rating
By the number of subscribers
While sorting through the hubs, interesting things came to light. For example, I did not know that there are hubs with zero posts - and there were as many as 4 of them! Moreover, each of them had more than 500 subscribers.
The top three hubs - Chulan, I am PR and Web Development - lead both in the number of posts and in the number of readers. Chulan ("The Closet") is in 1st place because the administration moves deleted articles there. Next comes Information Security, which is wildly popular on Habr.
Unfortunately, I could not understand why the Habrahabr hub is marked as offtopic. It is in 13th place by the number of posts, and it has more than 80K subscribers. Does writing on the site about the site itself really count as going off topic?
It saddened me that the Java hub is not as high as I would like.
Company rating
Although initially I planned to build a rating only for hubs, in the comments to the article a good idea was put forward - to do the same for companies. The code did not need many changes.
There are a lot of companies - 1343. Therefore I will post only the top 30 and the last 10. Here is an interesting point: for some reason Habr shows
All (1331), although my program counted 1343 - and that, in fact, is correct. You can check by hand: multiply 67 pages by 20 companies each and add 3 more - you get 1343.
By the number of subscribers
To begin with, I was surprised that there are two kinds of missing companies - "the company is deactivated" and "the page was not found" - even though, I repeat, all companies were taken from the list. For the first kind I set the number of posts to -2; there are quite a few such companies. And three companies whose names consist only of digits lead to "page not found" - I marked those with -3. So it goes. There are also plenty of companies with zero posts - for example,
Apple. I wonder why create an account for a company and never write from it at all?
Actually, if from those 1343 companies registered on Habr we remove the non-existent ones and the ones without posts, only 321 remain. So it goes.
Development
For a very long time I tried to figure out the Habrahabr API. As it turned out, it is closed and still under development. However, in correspondence with
support@habrahabr.ru I was told that they have nothing against parsing their pages. In fact, that is exactly how the Habr clients for Android work (at the moment).
When it comes to projects "for myself", I choose my beloved Java. This time it did not let me down either - the
JSoup library let me get the necessary data out of an HTML page in a few lines. But first, let's discuss how the hub pages are arranged.
Pages with hubs are located at
habrahabr.ru/hubs/pageN/, where N is a number starting from 1. Therefore, if we want a complete list of all the hubs, we need to download and analyze these pages until they run out. Each page contains a list of hubs. The format of a list item is fairly simple and easily parsed. It looks like this:
<div class="hub " id="hub_50">
  <div class="habraindex">1 280,58</div>
  <div class="info">
    <div class="title">
      <a href="http://habrahabr.ru/hub/infosecurity/"> </a>
      <span class="profiled_hub" title=" "></span>
    </div>
    <div class="buttons">
      <input type="button" class="mini blue subscribeHub" value="" data-id="50">
      <input type="button" class="mini hidden unsubscribeHub" value="" data-id="50">
    </div>
    <div class="clear"></div>
    <div class="stat">
      <a href="http://habrahabr.ru/hub/infosecurity/subscribers/" class="members_count">91741 </a>,
      <a href="http://habrahabr.ru/hub/infosecurity/posts/">3385 </a>
    </div>
  </div>
</div>
Let's write a method that returns a list of all the hubs on the site:
static List<Hub> getAllHubs() {
    ArrayList<Hub> fullHubsList = new ArrayList<>();
    String urlHubsIncomplete = "http://habrahabr.ru/hubs/page";
    int pageNum = 1;
    do {
        String urlHubs = urlHubsIncomplete + pageNum;
        try {
            Document doc = Jsoup.connect(urlHubs).get();
            Elements hubs = doc.select(".hub");
            if (hubs.size() == 0) {
                break;
            }
            for (Element hubElem : hubs) {
                Hub hub = new Hub(hubElem);
                fullHubsList.add(hub);
            }
            pageNum++;
        } catch (Exception e) {
            e.printStackTrace();
            break;
        }
    } while (true);
    return fullHubsList;
}
We spin an infinite while loop, forming a new URL on each iteration. Then, using
Jsoup.connect(urlHubs).get(), we get the HTML document with the list of hubs and their parameters. As is easy to see, the
div with information about a hub has the class
hub - so by calling
doc.select(".hub") we get a list of these elements. If its size is zero, we have passed the last page and have already analyzed all the hubs - then we exit the loop.
Next, we go through all the hub elements and for each we create an object of type
Hub, passing our
org.jsoup.nodes.Element into the constructor. It contains HTML code of the same format as above. Now
let's abstract away from everything else - that is what OOP is for. Before us there is only the piece of HTML presented above, and the class into which we need to pack it. We write a skeleton for our class:
import org.jsoup.nodes.Element;

public class Hub {
    String title;
    int posts;
    boolean profiled;
    int membersCount;
    float habraindex;
    String url;

    public Hub(Element hubElem) {
    }
}
Let's write the constructor. To begin, we will do the simplest part - get the data from the title tag. To do this, we first extract the title div itself:
<div class="title">
  <a href="http://habrahabr.ru/hub/infosecurity/"> </a>
  <span class="profiled_hub" title=" "></span>
</div>
We parse it like this:
Element titleDiv = hubElem.select(".title").get(0);
Element tagA = titleDiv.getElementsByTag("a").get(0);
title = tagA.text();
url = tagA.attr("href");
profiled = (hubElem.select(".profiled_hub").size() != 0);
Next, we want to parse the number of subscribers and posts - the actual parameters by which we will sort. But here we immediately run into the first problem: the tag contains the string
"91741 subscribers", which we cannot just take and convert to an Integer - it contains letters! This is where
regular expressions come to the rescue. We quickly write a small method that takes a string, cuts everything out of it except digits, and converts the result to an int.
\D matches any NON-digit character, and
+ means "occurs 1 or more times". That is, we replace every run of non-digits with an empty string.
private int getNumbers(String str) {
    String numbers = str.replaceAll("\\D+", "");
    return Integer.valueOf(numbers);
}
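As a quick sanity check, here is how this extraction behaves on strings like the ones found on the page (a standalone sketch; the sample strings and the class name NumbersDemo are illustrative, not from the original source):

```java
public class NumbersDemo {
    // Same regex-based extraction as the getNumbers method above
    static int getNumbers(String str) {
        String numbers = str.replaceAll("\\D+", "");
        return Integer.valueOf(numbers);
    }

    public static void main(String[] args) {
        System.out.println(getNumbers("91741 subscribers")); // 91741
        System.out.println(getNumbers("3385 posts"));        // 3385
    }
}
```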
Now we can retrieve our values with peace of mind:
String membersCountFullStr = hubElem.select(".members_count").get(0).text();
membersCount = getNumbers(membersCountFullStr);
String statFullStr = hubElem.select(".stat").get(0).getAllElements().get(2).text();
posts = getNumbers(statFullStr);
In principle, we could stop here, but for the sake of interest I decided to extract all the available information about the hub. And here a very interesting second problem arose, which will be the
highlight of the article: how to parse the habraindex?
To start with, you should replace the comma with a period and remove the extra spaces. But this is not enough! The parser still throws an error if you copy the habraindex from the page and paste it into the code as, say,
Double.valueOf("-1.11"). Yet if you type the same number in by hand, everything works. And visually, in my
IDEA, the two look absolutely identical!
It turns out that Habr's designers simply used a
dash instead of a
minus - a character with a different code - and the parser, of course, will not swallow it. Take note. The essence of the problem is
as follows:
System.out.println((int)'-');
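To see the difference for yourself, compare the character codes (a minimal demo; I use the en dash here as the stand-in for whatever dash character the page actually contained):

```java
public class DashDemo {
    public static void main(String[] args) {
        System.out.println((int) '-'); // ASCII hyphen-minus: 45
        System.out.println((int) '–'); // en dash (U+2013): 8211
        // Double.parseDouble accepts only the ASCII minus:
        System.out.println(Double.parseDouble("-1.11")); // -1.11
        // Double.parseDouble("–1.11") would throw NumberFormatException
    }
}
```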
Some time ago, in my article on
Cunning Java Tasks, I looked at a trap where two visually identical characters turn out to have different codes. Now I have run into a similar problem in the wild.
Therefore, the code for extracting the habraindex is a little more complicated:
String rawHabraIndex = hubElem.select(".habraindex").get(0).text();
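The remaining cleanup, following the steps described above (comma to period, strip the spaces, dash to ASCII minus), might look roughly like this - the method name parseHabraindex and the exact dash character are my assumptions, not from the original source:

```java
public class HabraindexParse {
    // Illustrative cleanup: comma → period, en dash → ASCII minus, strip whitespace
    static float parseHabraindex(String raw) {
        String cleaned = raw.replace(',', '.')
                            .replace('\u2013', '-') // the sneaky dash
                            .replaceAll("\\s+", "");
        return Float.parseFloat(cleaned);
    }

    public static void main(String[] args) {
        System.out.println(parseHabraindex("1 280,58")); // 1280.58
        System.out.println(parseHabraindex("–1,11"));    // -1.11
    }
}
```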
Next, we write a comparator on posts as a nested static class of Hub:
public static class ComparePosts implements Comparator<Hub> {
    @Override
    public int compare(Hub o1, Hub o2) {
        return o2.posts - o1.posts;
    }
}
And sort by it somewhere in main:
List<Hub> hubs = getAllHubs();
Collections.sort(hubs, new Hub.ComparePosts());
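The sorting step can be tried out in isolation with a stripped-down stand-in for the Hub class (hypothetical names; note that Integer.compare avoids the int-overflow that the "o2.posts - o1.posts" idiom could in principle hit on extreme values):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortDemo {
    // Minimal stand-in for the Hub class from the article
    static class Hub {
        String title;
        int posts;
        Hub(String title, int posts) { this.title = title; this.posts = posts; }
    }

    // Descending sort by posts; Integer.compare is overflow-safe
    static void sortByPostsDesc(List<Hub> hubs) {
        Collections.sort(hubs, new Comparator<Hub>() {
            @Override
            public int compare(Hub o1, Hub o2) {
                return Integer.compare(o2.posts, o1.posts);
            }
        });
    }

    public static void main(String[] args) {
        List<Hub> hubs = new ArrayList<>(Arrays.asList(
                new Hub("Java", 120), new Hub("Chulan", 300), new Hub("Empty", 0)));
        sortByPostsDesc(hubs);
        for (Hub h : hubs) {
            System.out.println(h.title + ": " + h.posts);
        }
    }
}
```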
That's it - the task is complete! Sorting by the number of subscribers is done the same way. Then I wrote code that prints both lists to the console in a form that can be pasted straight into an article - which is what you saw at the beginning.
Fetching all the hubs takes about 10 seconds. The source code can be
downloaded here. We compile and run it like this, not forgetting to install
JSoup and replace the path with your own:
javac -cp .;"C:\prog\lib\jsoup-1.7.3.jar" com/kciray/habrahubs/Main.java
java -cp .;"C:\prog\lib\jsoup-1.7.3.jar" com.kciray.habrahubs.Main
In addition, I adapted the same classes to collect statistics on companies. There, it would seem, everything is the same - except that to find out the number of posts in a company's blog, a separate page had to be downloaded for each company, which took about 5 minutes. To speed this up I made the download multi-threaded, and discovered that Habr does not allow more than 5-7 pages to be loaded at a time. In the end I serialized the resulting
ArrayList<CompanyBlog> and saved it to disk. This file of about 100 kilobytes lies
alongside the second source archive - you can work with it.
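The throttled multi-threaded download described above could be sketched like this (fetchPage here is a hypothetical stand-in for the real Jsoup.connect(url).get() call, and the pool size of 5 reflects the limit I observed):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetchSketch {
    // Stand-in for a real page download; in the real program this
    // would call Jsoup.connect(url).get() and parse the result
    static String fetchPage(int n) {
        return "page-" + n;
    }

    // Habr seemed to reject more than 5-7 parallel requests,
    // so we cap the thread pool at 5
    static List<String> fetchAll(int pageCount) {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        List<Future<String>> futures = new ArrayList<>();
        for (int i = 1; i <= pageCount; i++) {
            final int page = i;
            futures.add(pool.submit(() -> fetchPage(page)));
        }
        List<String> results = new ArrayList<>();
        try {
            // Collect results in page order
            for (Future<String> f : futures) {
                results.add(f.get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(10));
    }
}
```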
If you are interested in the
full rating in a
more compact form, I have posted it
as a web page.