📜 ⬆️ ⬇️

Reading old articles Habra with pictures

Some time ago I decided to refresh my knowledge and read something about graphs. “Well, of course, there should be good articles on Habré!” I thought, and I was right. There are lots of articles. But they mostly look like this: one , two , three . Open and guess from one attempt why it is absolutely impossible to understand something from these articles, although it is written in quite understandable language. No pictures! Well, how to study graphs without pictures? No

A newbie on Habré puzzled to ask: “How so - no pictures? There is habrastorage.org! Yes there is. But it was not always, but the images automatically became perezalivatsya and did become only in July 2013 . And before that, the pictures were hosted anywhere — on all sorts of radicals, imagejakah, even on dropbox, people used to naively try to upload something. As a result, we have on Habré a bunch of articles from 2006-2013 with missing pictures.

Let's fix this!

')

Plan


At the first stage we face the following tasks:


Implementation


In general, only the lazy Habr has not yet paris, well, we are not lazy. Moreover, what is there to write in Python + requests:



Fill the code on the virtual machine in the cloud, run it, come back after 2 days (you could, of course, parse in several streams - but I remember somewhere in the FAQ Habr asked not to pull it with bots more than 2 times per second).

results


In total, the articles "Before the Advent of Habrastoraj" found 157601 pictures placed on the "left" image hosting. Of these, 92549 links are still valid, and 65052 links are no longer.



Well, ok, we have links to 65052 inaccessible pictures in articles on Habré. What to do with it? Get them from the archive.org cache, of course! He in order and invented!

You can check the availability of pictures in the web archive with a simple query:
http://archive.org/wayback/available?url=%image_url% 


For example, the link to the image img513.imageshack.us/img513/3580/pic1e.jpg, which is absent in the habrahabr.ru/post/63982 article, is fully accessible at http://web.archive.org/web/20131103061340/http:// img513.imageshack.us/img513/3580/pic1e.jpg

There is, however, one problem. Sometimes the Internet archive claims that it has a link to the cached image, but in fact it does not. Lying in general. Those. we will have to check every link to the cached image. Well, nothing, check it out .

Fill, run, wait half a day.

results


The web archive turned out to be available 13863 pictures of those that are no longer available on the original links in articles on Habré.



This whole experiment gave us a good “average temperature in the hospital”: we now know that the picture filled in for random hosting has a chance to survive about 58% over the next 2-9 years. We also know that archive.org is useful and sometimes helps, but the chances of recovering a broken link to a picture on Habré with it are 21.3%.

Conclusions to the first part of the article


So, now we have an array with more valid links to the pictures and the second array , with broken links and corresponding links to the available pictures in the web archive. In this place, you could ask the administration of Habr to take this data and write 4 lines of code to reload it all on habrastorage.org and update the links in the existing articles, but I do not know if they will be doing this. And I want to read articles in a normal way! And so we will go our own way. You can, of course, say “Read articles directly from the web archive!”, But this is somehow not particularly instructive and what would all this data collection be like.

The second impulse may be the desire to write an extension for Chrome, substituting bad references for good ones, but I don’t want to do this for a number of reasons:


Therefore, we will go the other way and write something that solves all the above problems. What exactly? And you will learn about this in the next article .

PS A logical division into two articles is added for readability, since the method used in the second article has nothing to do with the pictures on Habré, and vice versa.

UPD. As correctly suggested in the comments - you need to check and redirects. The script is fixed (thanks to the encyclopedist ) and re-run. As a result, another 10683 valid links to pictures were received. Files with new data are uploaded to GitHub (see links in the article).

Source: https://habr.com/ru/post/251259/


All Articles