What is the blogosphere today? You may not agree with me, but in my opinion 80% of what people on the Runet mean by the word "blogosphere" lives on LiveJournal. Yes, Yandex indexes a large number of blog sites; there is also LiveInternet, diary.ru, and the blogs on mail.ru. And much more. But try to remember the last time you read something interesting and worthy of attention on a LiveInternet blog. Or anything at all on the mail.ru blogs?
It is well known that LiveJournal is ruled by the "thousanders", bloggers with thousands of readers (and, more recently, by the "ten-thousanders").
Let's take a closer look: who are they, the top bloggers of the Runet?
I quickly threw together a robot that visited the profiles of a thousand bloggers: the top thousand by the "Friend of" count, according to the LiveJournal rating. There is also the so-called Yandex authority rating, but let's not talk about sad things today.
The robot collected the profile data and carefully piled it all into one heap. It is written in C#. I will not bore you with unnecessary technical details; everything is quite simple and straightforward: go to a page, parse it for the required fields, save them, move on to the next one.
And so 1000 times.
Here is the function that takes a page URL as input and returns the page's HTML as a string. The result can then be parsed with ordinary string functions or with regular expressions.
    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    // Downloads the page at strURL and returns its HTML as a string
    // (an empty string if the request fails for any reason).
    private string GetPageByURL(string strURL)
    {
        try
        {
            // prepare the request for the page we will be asking for
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(strURL);
            // execute the request; "using" releases the connection afterwards
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            // we will read data via the response stream
            using (Stream resStream = response.GetResponseStream())
            // decode the whole stream as UTF-8 in one go: decoding each
            // 8 KB chunk separately could split a multi-byte character
            // at a chunk boundary and corrupt Cyrillic text
            using (StreamReader reader = new StreamReader(resStream, Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
        catch (Exception)
        {
            // any network or protocol error yields an empty page
            return "";
        }
    }
Now we loop over the rating pages, from
www.livejournal.com/ratings/users/?page=1...
to
www.livejournal.com/ratings/users/?page=50
We fetch them with the function above, then scan them as strings and collect the usernames and their "Friend of" counts into an ArrayList.
We get a list of 1000 people. Then we go through it in a loop, visit the pages http://[username].livejournal.com/profile and parse them for the remaining fields.
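The username-collection step can be sketched roughly as follows. This is only an illustration under assumptions: the article does not show the markup of the rating pages, so the sketch assumes each blogger is linked as http://<username>.livejournal.com/, and the names `RatingPageParser` and `ExtractUserNames` are made up for the example. It also uses a generic `List<string>` instead of the ArrayList mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RatingPageParser
{
    // ASSUMPTION: every user on a rating page is linked as
    // http://<username>.livejournal.com/ ; the real markup may differ.
    private static readonly Regex UserLink = new Regex(
        @"https?://([a-z0-9-]+)\.livejournal\.com/",
        RegexOptions.IgnoreCase);

    // Returns the distinct usernames found in the HTML,
    // in order of first appearance.
    public static List<string> ExtractUserNames(string html)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var names = new List<string>();
        foreach (Match m in UserLink.Matches(html))
        {
            string name = m.Groups[1].Value.ToLowerInvariant();
            // "www" is the site itself, not a journal
            if (name != "www" && seen.Add(name))
                names.Add(name);
        }
        return names;
    }
}
```

Each name returned this way can be substituted into the profile URL pattern above.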
After that, we write everything to a database or a file, or simply dump it onto a page and copy and paste it from there into Excel by hand.
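If the file route is chosen, a CSV file opens directly in Excel. Below is a minimal sketch of the quoting that requires, since fields such as "1,698,002 comments received" contain commas. The class and method names are hypothetical and not part of the original robot.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvExport
{
    // Quotes a field by the usual CSV convention: wrap it in double quotes
    // and double any embedded quotes when it contains , " or a line break.
    public static string Escape(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Joins the fields of one record into a single CSV line.
    public static string ToLine(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(f => Escape(f)));
    }
}
```

The resulting lines can then be written out with `System.IO.File.WriteAllLines`, one line per blogger.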
So that LiveJournal would not take offense at my robot, I put a substantial delay between requests; they warn quite sternly that if you come to them with your robots and do not wipe your feet, you will be banned. As a result the whole process took more than a day: writing the robot, testing it, running it, and formatting the results. I admit it could have been done in a screen and a half of PHP and two hours all told, but .NET is more familiar to me.
The result was this table.
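A throttled crawl of this kind might look like the sketch below. It is an illustration only: the article does not say how long the delay was, the `http://` prefix on the rating URL is an assumption, and `PoliteCrawler`, `RatingPageUrls` and `FetchAll` are hypothetical names. The download function is passed in as a delegate, so `GetPageByURL` or any other fetcher can be plugged in.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class PoliteCrawler
{
    // Builds the rating-page URLs for pages 1..pageCount (50 in the article).
    // ASSUMPTION: the pages are served over plain http.
    public static List<string> RatingPageUrls(int pageCount)
    {
        var urls = new List<string>();
        for (int page = 1; page <= pageCount; page++)
            urls.Add("http://www.livejournal.com/ratings/users/?page=" + page);
        return urls;
    }

    // Fetches every URL in turn, sleeping between consecutive requests
    // so that the server is never hit in a tight loop.
    public static List<string> FetchAll(
        IEnumerable<string> urls, Func<string, string> fetch, TimeSpan delay)
    {
        var pages = new List<string>();
        bool first = true;
        foreach (string url in urls)
        {
            if (!first) Thread.Sleep(delay); // no pause before the first request
            first = false;
            pages.Add(fetch(url));
        }
        return pages;
    }
}
```

With a delay of several seconds, about 1050 requests (50 rating pages plus 1000 profiles) add up to hours of running time on their own.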
User | Friend of | Friends | City | Region | Country | Journal entries | Total Comments | Created on | Last Updated | Account Type |
drugoi | 69145 | 749 | Moscow | | Norway | 13,188 | 1,698,002 comments received, 66,105 comments posted | 2002-03-02 | 1 hour ago | Permanent Account |
tema | 68601 | 24 | South Palmyra | | Russian Federation | 3,638 | 2,049,489 comments received, 6,880 comments posted | 2001-09-04 | 4 hours ago | Permanent Account |
navalny | 52840 | 10,000 | Moscow | Moscow | Russian Federation | 2,306 | 957,191 comments received, 14,365 comments posted | 2006-04-19 | 3 hours ago | Paid Account |
sergeydolya | 51964 | 1991 | | | | 870 | 243,261 comments received, 28,394 comments posted | 2007-11-09 | 1 day ago | Permanent Account |
pesen_net | 48525 | 202 | Riga | | Russian Federation | 187 | 53,083 comments received, 10,084 comments posted | 2007-04-22 | 6 weeks ago | Paid Account |
zyalt | 35617 | 384 | Moscow | Moscow | Russian Federation | 1,619 | 246,360 comments received, 11,344 comments posted | 2006-07-26 | 22 hours ago | Paid Account |
dolboeb | 33820 | 1942 | Moscow | | Russian Federation | 8,335 | 522,484 comments received, 38,400 comments posted | 2001-02-06 | 58 minutes ago | Permanent Account |
belonika | 33151 | 4604 | | | | 781 | 208,475 comments received, 36,079 comments posted | 2008-09-08 | 6 hours ago | Paid Account |
eprst2000 | 31454 | 11 | Moscow time | Moscow | Russian Federation | 460 | 46,324 comments received, 3,724 comments posted | 2002-08-22 | 1 week ago | Paid Account |
tebe_interesno | 29831 | 612 | Moscow | Moscow | Russian Federation | 547 | 31,679 comments received, 8,823 comments posted | 2007-06-25 | 10 weeks ago | Paid Account |
mi3ch | 29827 | 738 | Moscow | Moscow | Russian Federation | 6,930 | 374,776 comments received, 44,883 comments posted | 2003-04-03 | 2 hours ago | Permanent Account |
shpilenok | 29637 | 119 | | Bryansk region | Russian Federation | 303 | 57,348 comments received, 4,461 comments posted | 2009-01-11 | 6 hours ago | Paid Account |
zhgun | 26081 | 29 | | | | 188 | 22,301 comments received, 8,626 comments posted | 2002-04-28 | 5 weeks ago | Paid Account |
mantrabox | 25572 | 373 | | | Russian Federation | 2,915 | 60,720 comments received, 17,850 comments posted | 2002-12-29 | 1 week ago | Paid Account |
olegtinkov | 25291 | 11 | Moscow | | Russian Federation | 638 | 137,481 comments received, 6,277 comments posted | 2009-02-21 | 18 hours ago | Paid Account |
radulova | 24682 | 595 | | Moscow | Russian Federation | 8,622 | 874,385 comments received, 31,657 comments posted | 2004-11-14 | 1 hour ago | Paid Account |
tanyant | 24282 | 199 | | | | 318 | 67,802 comments received, 6,868 comments posted | 2007-12-14 | 2 weeks ago | Plus Account |
stillavin | 23615 | 1703 | Moscow | Moscow | Russian Federation | 1,299 | 311,283 comments received, 18,247 comments posted | 2006-08-23 | 3 days ago | Paid Account |
mzadornov | 22568 | 80 | Moscow | | Russian Federation | 161 | 62,221 comments received, 136 comments posted | 2009-09-15 | 3 days ago | Plus Account |
miumau | 21495 | 47 | Berlin | | Germany | 2,957 | 163,632 comments received, 13,520 comments posted | 2002-02-27 | 1 hour ago | Paid Account |
...
The whole table would not fit into a Habr post, in either height or width, but the complete file with 1000 entries is in Google Docs. The data is current as of today, July 21, 2011, and is unlikely to change significantly over the next couple of months, or even half a year.
I could not resist building a couple of charts, although everyone is free to use this data as they see fit.
Even simply sorting the columns in ascending or descending order reveals interesting details.
For example, sorting by the number of friends, we find that the most friends belong not to navalny, who has 10,000 of them (although the limit for mere mortals on LiveJournal is 5,000 friends), but to a certain user inexi, who has 20,624.
Or, for example, we sort by the number of journal entries. The most prolific writer is, of course, cypa (who else?). Since 2003 he has made 43,390 entries.
And sorting in the opposite direction, we immediately find a curious bot: blog_d_medvedev. Since the day it was created in 2009, this pseudo-blogger has not made a single entry in its journal, yet 5,816 people have added it as a friend. Obviously some kind of robot, apparently just a toy in the wrong hands. Surely there was some cheating involved: friend marathons, rating manipulation, vote rigging, the usual.
Continuing to sort, we learn that the oldest blog in the top 1000 was created on March 31, 2000, and the youngest only three months ago, in April of this year.
The top also contains 139 Basic accounts, 560 Paid accounts, 15 Permanent accounts, 284 Plus accounts and one Early Adopter (and who is that, by the way? billycorgan: what is he doing in the Russian top if he lives in the USA and writes in English?).
It turns out there are not that many paid accounts in the first 1000: just over half of the total.
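For those who prefer code to spreadsheet sorting, the same kind of breakdown can be sketched with LINQ. The `BloggerRecord` type and the method names below are hypothetical, and the sketch assumes the rows have already been loaded from the statistics file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record holding three of the table's columns.
public class BloggerRecord
{
    public string User;
    public int FriendOf;
    public string AccountType;
}

public static class TopStats
{
    // Number of bloggers per account type, largest group first.
    public static List<KeyValuePair<string, int>> CountByAccountType(
        IEnumerable<BloggerRecord> rows)
    {
        return rows
            .GroupBy(r => r.AccountType)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.Ordinal)
            .ToList();
    }

    // Share of one account type, as a percentage of all rows.
    public static double SharePercent(
        IEnumerable<BloggerRecord> rows, string accountType)
    {
        var list = rows.ToList();
        if (list.Count == 0) return 0.0;
        int matching = list.Count(r => r.AccountType == accountType);
        return 100.0 * matching / list.Count;
    }
}
```

Applied to the figures above, 560 Paid accounts out of 1000 entries works out to 56%.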

Or, for example, a breakdown by country:

In short, there is plenty of work here for analysts, statisticians, promotion specialists of every kind and other curious idlers.
At first I thought of making this an online, constantly updated service, but then I decided that nobody would pat me on the head for my robot's 1000 daily requests (in fact, even more) to the LiveJournal servers. So I limited myself to a one-time snapshot.
The statistics file may be freely distributed; no restrictive copyright applies to it.
UPD: I would be grateful if someone could tell me how to let any user sort the columns in Google Docs without letting them change the results, i.e. distort the data itself.
In any case, the file can be saved to your computer via the File > Download As > Excel menu, and you can then sort it however you like at home in Microsoft Office.