
Rating hubs and companies by posts / subscribers

At the moment there are about 350 hubs. The site lets you sort them by name and by index, but not by other parameters, such as the number of posts, though I would like to.

I was inspired by the article on rating hub posts, and I decided to do something similar, but this time rate the hubs themselves.

In the first half of the article I will present the ratings of hubs and companies, along with a brief analysis of them. In the second half I will describe in detail how I used Java and the JSoup library to parse Habr's HTML pages, and what interesting quirks and problems I ran into. At the end of the article I will post the full source code of the program.


All 4 ratings (in full) as a web page

Hub rating


When I sorted the hubs, interesting things came to light. For example, I did not know that there are hubs with zero posts. And there were as many as 4 of them! Moreover, each of them had more than 500 subscribers.

The top three hubs, Chulan ("The Closet"), I am PR, and Web development, lead both in the number of posts and in the number of readers. Chulan holds 1st place because that is where the administration moves deleted articles. Next comes Information Security, which is wildly popular on Habr.

Unfortunately, I did not understand why the Habrahabr hub is marked as offtopic. By the number of posts it would be in 13th place, and it has more than 80K subscribers. Does writing on the site about the site itself really count as going off topic?

It saddened me that the Java hub is not as high as I would like.

Company rating

Although I initially planned to build a rating only for hubs, a good idea was suggested in the comments to the article: do the same for companies. The code did not have to change much.

There are a lot of companies: 1343. Therefore, I will post only the top 30 and the last 10 of them. Here is an interesting point: for some reason Habr shows All (1331), although my program counted 1343, which is, in fact, correct. You can check by hand: multiply 67 pages by 20 companies each and add 3 more, which gives 1343.


To begin with, I was surprised that there are two kinds of missing companies: "the company is deactivated" and "page not found". And I repeat, all companies were taken from the official list. I marked the first kind with a post count of -2; there are quite a few such companies. Three companies whose names consist of digits lead to "page not found"; I marked those with -3. So it goes. There are also plenty of companies with zero posts, Apple for example. I wonder why create a company account and never write from it at all?

In fact, if from those 1343 companies registered on Habr we remove the non-existent ones and those without posts, only 321 remain.

Development

For a very long time I tried to figure out the Habrahabr API. As it turned out, it is closed and still under development. However, in correspondence with support@habrahabr.ru I was told that they have nothing against parsing their pages. In fact, that is exactly how the Habr clients for Android work (at the moment).

For projects "for myself" I choose my beloved Java. It did not let me down this time either: the JSoup library allowed me to get the necessary data out of an HTML page in a few lines. But first let's look at how the hub pages are organized.

Pages with hubs are located at habrahabr.ru/hubs/pageN/, where N is a number starting from 1. Therefore, if we want a complete list of all the hubs, we need to download and analyze these pages until they run out. Each page contains a list of hubs. The format of a list item is fairly simple and easy to parse. It looks like this:
<div class="hub " id="hub_50">
  <div class="habraindex">1 280,58</div>
  <div class="info">
    <div class="title">
      <a href="http://habrahabr.ru/hub/infosecurity/"> </a>
      <span class="profiled_hub" title=" "></span>
    </div>
    <div class="buttons">
      <input type="button" class="mini blue subscribeHub" value="" data-id="50">
      <input type="button" class="mini hidden unsubscribeHub" value="" data-id="50" "="">
    </div>
    <div class="clear"></div>
    <div class="stat">
      <a href="http://habrahabr.ru/hub/infosecurity/subscribers/" class="members_count">91741 </a>,
      <a href="http://habrahabr.ru/hub/infosecurity/posts/">3385 </a>
    </div>
  </div>
</div>


Let's write a method that returns us a list of all the hubs on the site:
static List<Hub> getAllHubs() {
    ArrayList<Hub> fullHubsList = new ArrayList<>();
    String urlHubsIncomplete = "http://habrahabr.ru/hubs/page";
    int pageNum = 1;
    do {
        String urlHubs = urlHubsIncomplete + pageNum;
        try {
            Document doc = Jsoup.connect(urlHubs).get();
            Elements hubs = doc.select(".hub");
            if (hubs.size() == 0) {
                break;
            }
            for (Element hubElem : hubs) {
                Hub hub = new Hub(hubElem);
                fullHubsList.add(hub);
            }
            pageNum++;
        } catch (Exception e) {
            e.printStackTrace();
            break;
        }
    } while (true);
    return fullHubsList;
}

We spin an infinite while loop, forming a new URL on each iteration. Then, using Jsoup.connect(urlHubs).get(), we directly get the HTML document with the list of hubs and their parameters. As you can easily see, the div with information about a hub has the class hub, so by calling doc.select(".hub") we get a list of those elements. If its size is zero, we have gone past the last page and have already analyzed all the hubs, so we exit the loop.

Next, we iterate over all the hub elements and for each create an object of type Hub, passing our org.jsoup.nodes.Element into the constructor. It contains HTML of the same format shown above. Now let's abstract away from everything else; that is what OOP is for. We have only that piece of HTML presented above and a class to pack it into. We write a skeleton for our class:
import org.jsoup.nodes.Element;

public class Hub {
    String title;
    int posts;
    boolean profiled;
    int membersCount;
    float habraindex;
    String url;

    public Hub(Element hubElem) {
    }
}

Let's write the constructor. We will start with the simplest part: getting the data from the title tag. To do this, we first extract the title div itself:
<div class="title">
  <a href="http://habrahabr.ru/hub/infosecurity/"> </a>
  <span class="profiled_hub" title=" "></span>
</div>

We parse it like this:
Element titleDiv = hubElem.select(".title").get(0);
Element tagA = titleDiv.getElementsByTag("a").get(0);
title = tagA.text();
url = tagA.attr("href");
profiled = (hubElem.select(".profiled_hub").size() != 0);

Next, we want to parse the number of subscribers and posts, the actual parameters we will sort by. But here we immediately hit the first problem: the tag contains the string "91741 subscribers", which we cannot just take and convert to an Integer, because it contains letters! This is where regular expressions come to the rescue. We quickly write a clever method that takes a string, cuts everything out of it except digits, and converts the result to an int. \D means "not a digit", and + means "occurs 1 or more times". That is, we replace every run of non-digit characters with the empty string.
private int getNumbers(String str) {
    String numbers = str.replaceAll("\\D+", "");
    return Integer.valueOf(numbers);
}
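As a quick sanity check, here is the same logic in a standalone demo class (the class name and the sample strings are mine, not from the original program):

```java
public class GetNumbersDemo {
    // Same logic as the article's getNumbers(): strip every non-digit run, then parse
    static int getNumbers(String str) {
        String numbers = str.replaceAll("\\D+", "");
        return Integer.valueOf(numbers);
    }

    public static void main(String[] args) {
        System.out.println(getNumbers("91741 subscribers")); // 91741
        System.out.println(getNumbers("3385 posts"));        // 3385
    }
}
```

One caveat of this approach: digits separated by letters get concatenated ("page 2 of 10" would become 210). That is fine here, since each tag we parse holds exactly one number.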

Now we can extract our values with peace of mind:
String membersCountFullStr = hubElem.select(".members_count").get(0).text();
membersCount = getNumbers(membersCountFullStr);
String statFullStr = hubElem.select(".stat").get(0).getAllElements().get(2).text();
posts = getNumbers(statFullStr);

In principle, we could stop here, but for the sake of interest I decided to extract all available information about a hub. And here arose a very curious second problem, which will be the highlight of this article: how do you parse the habraindex?

To start with, you need to replace the comma with a period and remove the extra spaces. But that is not enough! The parser still throws an error if you copy the habraindex from the page and paste it into the code: Double.valueOf("–1.11"). Yet if you type the same number in manually, everything works. And visually, in my IDEA, the two look absolutely identical!

It turns out that Habr's designers simply used a dash instead of a minus sign. It has a different character code, and the parser, of course, refuses to swallow it. Take note. The essence of the problem is this:
System.out.println((int) '-'); // 45
System.out.println((int) '–'); // 8211

Back in my article on tricky Java tasks I covered the catch where you cannot visually distinguish a lowercase "l" from a capital "I". Now I have run into a similar problem.

Therefore, the code for extracting the habraindex turns out a little more complicated:
String rawHabraIndex = hubElem.select(".habraindex").get(0).text(); // "1 265,92"
char minus = 45;  // '-'
char dash = 8211; // '–'
String niceHabraIndex = rawHabraIndex.replaceAll(" ", "").replace(",", ".").replace(dash, minus); // "1265.92"
habraindex = Float.valueOf(niceHabraIndex);

Next, we write a comparator on posts as a nested static class inside Hub:
public static class ComparePosts implements Comparator<Hub> {
    @Override
    public int compare(Hub o1, Hub o2) {
        return o2.posts - o1.posts;
    }
}

And sort with it somewhere in main:
List<Hub> hubs = getAllHubs();
Collections.sort(hubs, new Hub.ComparePosts());

That's it, the task is done! Sorting by the number of subscribers works the same way. Next I wrote code that prints both lists to the console in a form that can be pasted straight into the article, and that is what I did at the beginning.
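For completeness, here is a minimal standalone sketch of the subscriber sort (the Hub stand-in and sample data are mine; on Java 8+ a Comparator.comparingInt chain can replace the hand-written comparator class, and Integer.compare avoids the overflow risk of the "o2.x - o1.x" idiom on large counts):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SubscriberSortDemo {
    // Minimal stand-in for the article's Hub class: only the fields we sort on
    static class Hub {
        String title;
        int membersCount;
        Hub(String title, int membersCount) {
            this.title = title;
            this.membersCount = membersCount;
        }
    }

    public static void main(String[] args) {
        List<Hub> hubs = new ArrayList<>(Arrays.asList(
                new Hub("Chulan", 700),
                new Hub("Java", 500),
                new Hub("Information Security", 91741)));
        // Descending order by subscriber count
        hubs.sort(Comparator.comparingInt((Hub h) -> h.membersCount).reversed());
        System.out.println(hubs.get(0).title); // Information Security
    }
}
```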

Fetching all the hubs takes about 10 seconds. The source code can be downloaded here. We compile and run it like this, not forgetting to add JSoup and replace the path with your own:
javac -cp .;"C:\prog\lib\jsoup-1.7.3.jar" com/kciray/habrahubs/Main.java
java -cp .;"C:\prog\lib\jsoup-1.7.3.jar" com.kciray.habrahubs.Main

In addition, I reworked the same classes to collect statistics on companies. Everything there is seemingly the same; however, to find out the number of posts in a company's blog, a separate page had to be downloaded for each company, which took about 5 minutes. To speed this up I made the download multi-threaded, and found that Habr does not allow more than 5-7 pages to be loaded at a time. Finally, I serialized the ArrayList<CompanyBlog> and saved it to disk. That file, about 100 kilobytes, ships with the second source archive, so you can work with it directly.
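The multi-threaded download with a concurrency cap can be sketched roughly like this (a standalone illustration, not the original code: fetchPostCount is a fake stand-in for downloading and parsing one page with Jsoup.connect(url).get(), and the 5-thread pool reflects the 5-7 simultaneous pages Habr tolerated):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetchSketch {
    // Hypothetical stand-in for downloading and parsing one company page;
    // in the real program this would call Jsoup.connect(url).get()
    static int fetchPostCount(String companyUrl) {
        return companyUrl.length(); // fake "post count" for the demo
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = Arrays.asList("a", "bb", "ccc");
        // A fixed pool of 5 threads caps the number of concurrent requests
        ExecutorService pool = Executors.newFixedThreadPool(5);
        List<Future<Integer>> results = new ArrayList<>();
        for (String url : urls) {
            results.add(pool.submit(() -> fetchPostCount(url)));
        }
        int total = 0;
        for (Future<Integer> f : results) {
            total += f.get(); // blocks until that page is processed
        }
        pool.shutdown();
        System.out.println(total); // 6
    }
}
```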

If you are interested in the full rating in a more compact form, I have posted it as a web page.

Source: https://habr.com/ru/post/211775/

