At the moment there are about 350 hubs. The site's functionality lets you sort them by name and by index, but not by any other parameter - for example, by the number of posts - and that is exactly what I wanted.
I was inspired by the article on rating hub posts, and I decided to do something similar, but build a rating of the hubs themselves.
In the first half of the article I will present the ratings of hubs and companies, along with a short analysis of them. In the second half I will describe in detail how I used Java and the JSoup library to parse Habr's HTML pages, and what interesting phenomena and problems I ran into. At the end of the article I will post the full source code of the program.
All 4 full ratings as a web page
Hub rating
By the number of subscribers
While sorting through the hubs, interesting things came to light. For example, I did not know that there are hubs with zero posts - and there were as many as 4 of them! Moreover, each of them had more than 500 subscribers.
The top three hubs - Chulan, I am PR and Web Development - lead both in the number of posts and in the number of readers. Chulan ("The Closet") is in 1st place because the administration moves deleted articles there. Next comes Information Security, which is wildly popular on Habr.
Unfortunately, I could not understand why the Habrahabr hub is marked as offtopic. It is in 13th place by the number of posts, and it has more than 80K subscribers. Does writing on the site about the site itself really count as going off topic?
It saddened me that the Java hub is not as high as I would like.
Company rating
Although initially I planned to build a rating only for hubs, in the comments to the article a good idea was put forward - to do the same for companies. The code did not need many changes.
There are a lot of companies - 1343. Therefore I will post only the top 30 and the last 10. Here is an interesting point: for some reason Habr shows
All (1331), although my program counted 1343 - and that, in fact, is correct. You can check by hand: multiply 67 pages by 20 companies each and add 3 more - you get 1343.
By the number of subscribers
To begin with, I was surprised that there are two kinds of missing companies - "the company is deactivated" and "the page was not found" - even though, I repeat, all companies were taken from the list. For the first kind I set the number of posts to -2; there are quite a few such companies. And three companies whose names consist only of digits lead to "page not found" - I marked those with -3. So it goes. There are also plenty of companies with zero posts - for example,
Apple. I wonder why create an account for a company and never write from it at all?
Actually, if from those 1343 companies registered on Habr we remove the non-existent ones and the ones without posts, only 321 remain. So it goes.
Development
For a very long time I tried to figure out the Habrahabr API. As it turned out, it is closed and still under development. However, in correspondence with
support@habrahabr.ru I was told that they have nothing against parsing their pages. In fact, that is exactly how the Habr clients for Android work (at the moment).
When it comes to projects "for myself", I choose my beloved Java. This time it did not let me down either - the
JSoup library let me get the necessary data out of an HTML page in a few lines. But first, let's discuss how the hub pages are arranged.
Pages with hubs are located at
habrahabr.ru/hubs/pageN/, where N is a number starting from 1. Therefore, if we want a complete list of all the hubs, we need to download and analyze these pages until they run out. Each page contains a list of hubs. The format of a list item is fairly simple and easily parsed. It looks like this:
<div class="hub " id="hub_50">
  <div class="habraindex">1 280,58</div>
  <div class="info">
    <div class="title">
      <a href="http://habrahabr.ru/hub/infosecurity/"> </a>
      <span class="profiled_hub" title=" "></span>
    </div>
    <div class="buttons">
      <input type="button" class="mini blue subscribeHub" value="" data-id="50">
      <input type="button" class="mini hidden unsubscribeHub" value="" data-id="50">
    </div>
    <div class="clear"></div>
    <div class="stat">
      <a href="http://habrahabr.ru/hub/infosecurity/subscribers/" class="members_count">91741 </a>,
      <a href="http://habrahabr.ru/hub/infosecurity/posts/">3385 </a>
    </div>
  </div>
</div>
Let's write a method that returns a list of all the hubs on the site:
static List<Hub> getAllHubs() {
    ArrayList<Hub> fullHubsList = new ArrayList<>();
    String urlHubsIncomplete = "http://habrahabr.ru/hubs/page";
    int pageNum = 1;
    do {
        String urlHubs = urlHubsIncomplete + pageNum;
        try {
            Document doc = Jsoup.connect(urlHubs).get();
            Elements hubs = doc.select(".hub");
            if (hubs.size() == 0) {
                break;
            }
            for (Element hubElem : hubs) {
                Hub hub = new Hub(hubElem);
                fullHubsList.add(hub);
            }
            pageNum++;
        } catch (Exception e) {
            e.printStackTrace();
            break;
        }
    } while (true);
    return fullHubsList;
}
We spin an infinite while loop, forming a new URL on each iteration. Then, using
Jsoup.connect(urlHubs).get(), we get the HTML document with the list of hubs and their parameters. As is easy to see, the
div with information about a hub has the class
hub - so by calling
doc.select(".hub") we get a list of these elements. If its size is zero, we have passed the last page and have already analyzed all the hubs - then we exit the loop.
Next, we go through all the hub elements and for each we create an object of type
Hub, passing our
org.jsoup.nodes.Element into the constructor. It contains HTML code of the same format as above. Now
let's abstract away from everything else - that is what OOP is for. Before us there is only the piece of HTML presented above, and the class into which we need to pack it. We write a skeleton for our class:
import org.jsoup.nodes.Element;

public class Hub {
    String title;
    int posts;
    boolean profiled;
    int membersCount;
    float habraindex;
    String url;

    public Hub(Element hubElem) {
    }
}
Let's write the constructor. To begin, we will do the simplest part - get the data from the title tag. To do this, we first extract the title div itself:
<div class="title">
  <a href="http://habrahabr.ru/hub/infosecurity/"> </a>
  <span class="profiled_hub" title=" "></span>
</div>
We parse it like this:
Element titleDiv = hubElem.select(".title").get(0);
Element tagA = titleDiv.getElementsByTag("a").get(0);
title = tagA.text();
url = tagA.attr("href");
profiled = (hubElem.select(".profiled_hub").size() != 0);
Next, we want to parse the number of subscribers and posts - the actual parameters by which we will sort. But here we immediately run into the first problem: the tag contains the string
"91741 subscribers", which we cannot just take and convert to an Integer - it contains letters! This is where
regular expressions come to the rescue. We quickly write a small method that takes a string, cuts everything out of it except digits, and converts the result to an int.
\D matches any NON-digit character, and
+ means "occurs 1 or more times". That is, we replace every run of non-digits with an empty string.
private int getNumbers(String str) {
    String numbers = str.replaceAll("\\D+", "");
    return Integer.valueOf(numbers);
}
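As a quick sanity check, here is how this extraction behaves on strings like the ones found on the page (a standalone sketch; the sample strings and the class name NumbersDemo are illustrative, not from the original source):

```java
public class NumbersDemo {
    // Same regex-based extraction as the getNumbers method above
    static int getNumbers(String str) {
        String numbers = str.replaceAll("\\D+", "");
        return Integer.valueOf(numbers);
    }

    public static void main(String[] args) {
        System.out.println(getNumbers("91741 subscribers")); // 91741
        System.out.println(getNumbers("3385 posts"));        // 3385
    }
}
```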
Now we can retrieve our values with peace of mind:
String membersCountFullStr = hubElem.select(".members_count").get(0).text();
membersCount = getNumbers(membersCountFullStr);
String statFullStr = hubElem.select(".stat").get(0).getAllElements().get(2).text();
posts = getNumbers(statFullStr);
In principle, we could stop here, but for the sake of interest I decided to extract all the available information about the hub. And here a very interesting second problem arose, which will be the
highlight of the article: how to parse the habraindex?
To start with, you should replace the comma with a period and remove the extra spaces. But this is not enough! The parser still throws an error if you copy the habraindex from the page and paste it into the code as, say,
Double.valueOf("-1.11"). Yet if you type the same number in by hand, everything works. And visually, in my
IDEA, the two look absolutely identical!
It turns out that Habr's designers simply used a
dash instead of a
minus - a character with a different code - and the parser, of course, will not swallow it. Take note. The essence of the problem is
as follows:
System.out.println((int)'-');
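To see the difference for yourself, compare the character codes (a minimal demo; I use the en dash here as the stand-in for whatever dash character the page actually contained):

```java
public class DashDemo {
    public static void main(String[] args) {
        System.out.println((int) '-'); // ASCII hyphen-minus: 45
        System.out.println((int) '–'); // en dash (U+2013): 8211
        // Double.parseDouble accepts only the ASCII minus:
        System.out.println(Double.parseDouble("-1.11")); // -1.11
        // Double.parseDouble("–1.11") would throw NumberFormatException
    }
}
```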
Some time ago, in my article on
Cunning Java Tasks, I looked at a trap where two visually identical characters turn out to have different codes. Now I have run into a similar problem in the wild.
Therefore, the code for extracting the habraindex is a little more complicated:
String rawHabraIndex = hubElem.select(".habraindex").get(0).text();
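The remaining cleanup, following the steps described above (comma to period, strip the spaces, dash to ASCII minus), might look roughly like this - the method name parseHabraindex and the exact dash character are my assumptions, not from the original source:

```java
public class HabraindexParse {
    // Illustrative cleanup: comma → period, en dash → ASCII minus, strip whitespace
    static float parseHabraindex(String raw) {
        String cleaned = raw.replace(',', '.')
                            .replace('\u2013', '-') // the sneaky dash
                            .replaceAll("\\s+", "");
        return Float.parseFloat(cleaned);
    }

    public static void main(String[] args) {
        System.out.println(parseHabraindex("1 280,58")); // 1280.58
        System.out.println(parseHabraindex("–1,11"));    // -1.11
    }
}
```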
Next, we write a comparator on posts as a nested static class of Hub:
public static class ComparePosts implements Comparator<Hub> {
    @Override
    public int compare(Hub o1, Hub o2) {
        return o2.posts - o1.posts;
    }
}
And sort by it somewhere in main:
List<Hub> hubs = getAllHubs();
Collections.sort(hubs, new Hub.ComparePosts());
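The sorting step can be tried out in isolation with a stripped-down stand-in for the Hub class (hypothetical names; note that Integer.compare avoids the int-overflow that the "o2.posts - o1.posts" idiom could in principle hit on extreme values):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortDemo {
    // Minimal stand-in for the Hub class from the article
    static class Hub {
        String title;
        int posts;
        Hub(String title, int posts) { this.title = title; this.posts = posts; }
    }

    // Descending sort by posts; Integer.compare is overflow-safe
    static void sortByPostsDesc(List<Hub> hubs) {
        Collections.sort(hubs, new Comparator<Hub>() {
            @Override
            public int compare(Hub o1, Hub o2) {
                return Integer.compare(o2.posts, o1.posts);
            }
        });
    }

    public static void main(String[] args) {
        List<Hub> hubs = new ArrayList<>(Arrays.asList(
                new Hub("Java", 120), new Hub("Chulan", 300), new Hub("Empty", 0)));
        sortByPostsDesc(hubs);
        for (Hub h : hubs) {
            System.out.println(h.title + ": " + h.posts);
        }
    }
}
```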
That's it - the task is complete! Sorting by the number of subscribers is done the same way. Then I wrote code that prints both lists to the console in a form that can be pasted straight into an article - which is what you saw at the beginning.
Fetching all the hubs takes about 10 seconds. The source code can be
downloaded here. We compile and run it like this, not forgetting to install
JSoup and replace the path with your own:
javac -cp .;"C:\prog\lib\jsoup-1.7.3.jar" com/kciray/habrahubs/Main.java
java -cp .;"C:\prog\lib\jsoup-1.7.3.jar" com.kciray.habrahubs.Main
In addition, I adapted the same classes to collect statistics on companies. There, it would seem, everything is the same - except that to find out the number of posts in a company's blog, a separate page had to be downloaded for each company, which took about 5 minutes. To speed this up I made the download multi-threaded, and discovered that Habr does not allow more than 5-7 pages to be loaded at a time. In the end I serialized the resulting
ArrayList<CompanyBlog> and saved it to disk. This file of about 100 kilobytes lies
alongside the second source archive - you can work with it.
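The throttled multi-threaded download described above could be sketched like this (fetchPage here is a hypothetical stand-in for the real Jsoup.connect(url).get() call, and the pool size of 5 reflects the limit I observed):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetchSketch {
    // Stand-in for a real page download; in the real program this
    // would call Jsoup.connect(url).get() and parse the result
    static String fetchPage(int n) {
        return "page-" + n;
    }

    // Habr seemed to reject more than 5-7 parallel requests,
    // so we cap the thread pool at 5
    static List<String> fetchAll(int pageCount) {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        List<Future<String>> futures = new ArrayList<>();
        for (int i = 1; i <= pageCount; i++) {
            final int page = i;
            futures.add(pool.submit(() -> fetchPage(page)));
        }
        List<String> results = new ArrayList<>();
        try {
            // Collect results in page order
            for (Future<String> f : futures) {
                results.add(f.get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(10));
    }
}
```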
If you are interested in the
full rating in a
more compact form, I have posted it
as a web page.