
Broken links - some statistics

Having seen today's topic "D'Artagnan and the Internet, or working on the problem of broken links", I decided to share some statistics I collected while writing my master's thesis.

One of the tasks in my thesis was to solve the problem of broken links on a single resource. To show how pressing the problem is, I downloaded a dump of the Wikipedia database and had a program check whether 700 thousand external links in the articles still worked.

It turned out that 20% of the links are broken!


A link was considered broken in the following cases:
○ The domain has been removed from DNS.
○ The HTTP connection fails.
○ The server returns an HTTP 4xx or 5xx response code, most often a deleted page (404), access denied (403), or a server error (500).
○ An internal page redirects to the main page.
○ The HTTP 3xx redirects never end.
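The criteria above can be sketched in code. This is only an illustration, not the program used for the thesis: the names `classify`, `fetch_http`, `Response`, `DnsError` and `ConnectError` are hypothetical, the redirect limit of 10 is an arbitrary assumption, and `fetch_http` assumes that a `HEAD` request is enough to learn the status code (some servers answer `HEAD` differently from `GET`).

```python
import http.client
import socket
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urljoin, urlsplit


@dataclass
class Response:
    status: int
    location: Optional[str] = None  # Location header of a 3xx response


class DnsError(Exception):
    """The host name could not be resolved."""


class ConnectError(Exception):
    """The HTTP connection or exchange failed."""


def is_root(url: str) -> bool:
    """True if the URL points at the main page of a site."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


def classify(url: str, fetch: Callable[[str], Response],
             max_redirects: int = 10) -> Optional[str]:
    """Return None if the link looks alive, otherwise a reason string."""
    seen = set()
    current = url
    for _ in range(max_redirects + 1):
        if current in seen:
            return "redirect loop"
        seen.add(current)
        try:
            resp = fetch(current)
        except DnsError:
            return "dns"
        except ConnectError:
            return "connect"
        if 400 <= resp.status < 600:
            return f"http {resp.status}"
        if 300 <= resp.status < 400 and resp.location:
            target = urljoin(current, resp.location)
            # An internal page that ends up on the main page counts as broken.
            if not is_root(url) and is_root(target):
                return "redirect to main page"
            current = target
            continue
        return None  # 2xx, or a 3xx without a Location header
    # Too many hops: treated the same as an endless redirect chain.
    return "redirect loop"


def fetch_http(url: str, timeout: float = 10.0) -> Response:
    """One possible fetcher built on the standard library (assumes HEAD)."""
    parts = urlsplit(url)
    secure = parts.scheme == "https"
    conn_cls = http.client.HTTPSConnection if secure else http.client.HTTPConnection
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn = conn_cls(parts.netloc, timeout=timeout)
        try:
            conn.request("HEAD", target)
            resp = conn.getresponse()
            return Response(resp.status, resp.getheader("Location"))
        finally:
            conn.close()
    except socket.gaierror as exc:  # name resolution failed
        raise DnsError(url) from exc
    except (OSError, http.client.HTTPException) as exc:
        raise ConnectError(url) from exc
```

Passing the fetcher in as a parameter keeps the classification rules separate from the network code, so the rules can be exercised with canned responses. A real check would call `classify(url, fetch_http)`.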

The program also tracked cases where the page content had been replaced with something else, as well as PHP, ASP and similar program errors, but this data is not included in the statistics.

The database was obtained in August 2009.
Three checks were then made:
● October 2009 - 20.7% of links are broken
● November 2009 - 22.4%
● April 2010 - 23.8%

A gradual increase in the number of broken links can be seen. At the same time, only 4% of the links that had been broken earlier started working again. That is, in the overwhelming majority of cases the failure is irreversible.
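To make the two figures concrete, here is a minimal sketch of how the share of broken links and the recovery rate could be computed from two check runs. The function names and the five sample URLs are invented for illustration; they are not the data from the study.

```python
def broken_share(check: dict) -> float:
    """Fraction of links marked broken in one check run (url -> is_broken)."""
    return sum(check.values()) / len(check)


def recovery_rate(earlier: dict, later: dict) -> float:
    """Fraction of links broken in the earlier run that work in the later one."""
    broken_before = [url for url, broken in earlier.items() if broken]
    if not broken_before:
        return 0.0
    # A link missing from the later run is conservatively treated as still broken.
    recovered = [url for url in broken_before if not later.get(url, True)]
    return len(recovered) / len(broken_before)


# Made-up sample: True means the link was broken at the time of the check.
first_check = {"a": True, "b": True, "c": False, "d": False, "e": False}
second_check = {"a": True, "b": False, "c": True, "d": False, "e": False}

print(broken_share(first_check))                 # share broken in the first run
print(recovery_rate(first_check, second_check))  # share of those that came back
```

Note that the two numbers are independent: the overall share of broken links can grow even while some links recover, because new failures are counted separately from recoveries.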

The figure below shows the breakdown of broken links by cause of failure:

The same check run on the catalog of links to external sites on the federal educational portal www.edu.ru showed a similar picture: 24.5% of the links do not work.

Of course, this study is not rigorous or scientific, and the results are not very accurate. The checked links probably belonged to old versions of articles; I could not verify this. But it is clear that the problem of broken links exists. Some more numbers:

According to DomainTools, about 100,000 domain names cease to exist every day, and overall the number of defunct domains is more than 3 times the number of existing ones (for the .com, .org, .net, .info, .biz and .us zones).
Archive.org claims that the average lifetime of a web page is between 44 and 75 days.

What to do

If you need external links to keep working, you can use one of the following methods:

1. Periodically test links automatically and hide or delete the broken ones.
This approach is suitable when working links are not critical, or when you only need to link to a site as a whole. There are ready-made programs that implement this principle: PHP Spider, ht://Check, VEinS, etc.

2. Save a copy of the resource on your own server and link to the copy.
This approach is preferable to the first when it is important to give users access to the resource indefinitely. It also rules out the page content being replaced with something else. However, it raises the question of copyright compliance for the saved copies.
This method is better suited to a link to a specific page or document, since keeping a copy of an entire site is quite difficult.
An example of a service built on this principle is Peeep.us. In addition, the Internet Archive digital library has been collecting copies of publicly available web pages since 1996 and serves them through its Wayback Machine service.

3. A combination of methods 1 and 2: link to the original resource, and if it stops working or its content changes, link to the saved copy instead.

4. URN, PURL: it is not very clear how these can actually be used in practice.
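Method 3 comes down to a small decision made each time a link is rendered. The sketch below is one possible way to express it; the `LinkRecord` structure and the `href_for` function are hypothetical, and it assumes a separate periodic check (as in method 1) has already filled in the `is_broken` and `content_changed` flags.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkRecord:
    original_url: str
    saved_copy_url: Optional[str]  # None if no copy was ever stored
    is_broken: bool                # set by the periodic check
    content_changed: bool = False  # set if the page content was replaced


def href_for(record: LinkRecord) -> Optional[str]:
    """Pick the URL to show to the user, or None to hide the link."""
    if not (record.is_broken or record.content_changed):
        return record.original_url
    # Fall back to the saved copy; with no copy this returns None,
    # which degrades to method 1 (hide the broken link).
    return record.saved_copy_url


alive = LinkRecord("http://example.com/doc", "/copies/1.html", is_broken=False)
dead = LinkRecord("http://example.com/gone", "/copies/2.html", is_broken=True)
print(href_for(alive))
print(href_for(dead))
```

Preferring the original while it works keeps traffic and attribution with the source site, and the saved copy is only exposed when the original can no longer be trusted.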

Source: https://habr.com/ru/post/102527/
