
Mirror creation algorithm (website mirror)



Description


This guide describes how to build a system of mirrors for various software projects. It covers the main difficulties involved and the ways to overcome them. It is intended for system administrators and SEO specialists. Creating a system of software mirrors includes the following steps:
  1. Creating an address
  2. Allocating the required disk space
  3. Creating mirrors
  4. Adding mirrors to the list of mirrors (mirrors list).

1. Introduction


The purpose of creating a system of mirrors is to obtain links from pages of sites with a high PR (PageRank) value.
As a rule, the well-known sites of software developers have a fairly high PR value. At the same time, to reduce the load on the servers from which end users download the software, they welcome mirrors of their software (a full site mirror for small volumes, file repositories for large volumes). Without going into the history of this practice, we note just one thing: when a site creates a mirror of a software product, it is added to the mirrors list on the source site, which automatically increases its PR.
The purpose of this document is to present the work done so that the results can be analyzed, the problem points in creating a system of mirrors identified, and proposals for overcoming them developed.
The document reflects the main results of work carried out over four months, from September to December 2011.

2. Creating an address


2.1. Creating a mirror for your own needs

If a mirror is created for your own needs, there are no problems: create a regular website and place each mirror in its own directory.

2.2. Creating a mirror for a partner

The easiest way to create a mirror for a partner is to have the partner create a CNAME record that points the address of the partner's mirror at the address of your own mirror.
For example:
We have: the partner's site domain.com and our own website fatcow.com .
We create: the mirror mirrors.domain.com , served by our site fatcow.com .
To set up the mirror, the partner must add the following record on their DNS server:
mirrors.domain.com CNAME fatcow.com
Next, in the settings of your web server, create an entry for the site mirrors.domain.com , specifying the root directory of fatcow.com as its root directory.
If necessary, mirrors can be created as separate subdomains in the same way; in this case, however, a separate CNAME record is required for each subdomain, for example:
apache.domain.com CNAME apache.fatcow.com
putty.domain.com CNAME putty.fatcow.com
Obviously, each such record needs a matching entry in the configuration of your own web server.
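When there are many subdomain mirrors, the list of records to send to the partner can be generated rather than typed by hand. The following is only a minimal sketch: the function name build_cname_records is hypothetical, and the domains are the ones from the example above.

```php
<?php
// Build CNAME lines in the form used above: "<name>.<partner> CNAME <name>.<own>".
// The function name and arguments are illustrative; substitute your real domains.
function build_cname_records(string $partnerDomain, string $ownDomain, array $names): array
{
    $records = [];
    foreach ($names as $name) {
        $records[] = sprintf('%s.%s CNAME %s.%s', $name, $partnerDomain, $name, $ownDomain);
    }
    return $records;
}

// Prints one record per line for the two subdomains from the example.
echo implode("\n", build_cname_records('domain.com', 'fatcow.com', ['apache', 'putty'])), "\n";
```

The same list of names can then be reused to create the matching entries in the web server configuration.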

3. Required disk space



3.1. General considerations

One of the key issues in creating a system of mirrors is having enough disk space to host them. If you know in advance how much disk space is required, the question is easily settled. But is it possible to estimate the required volume before you actually start synchronizing the first project?
Unfortunately, in our case things went a little differently. For lack of experience (both our own and anyone else's) in this area, we started building the system of mirrors with a very modest 250 GB of disk space. We quickly realized that this was clearly not enough, and the disk space was expanded to 500 GB by attaching a second disk of the same size (250 GB). Although Linux with LVM (Logical Volume Manager) makes such an operation fairly painless, for some time afterwards there were malfunctions (up to a complete server shutdown), most likely caused by unstable operation of the file system.
Furthermore, the real volume of the hosted data is quite large. The full repositories of some operating systems (Linux distributions and the various BSDs, with all their releases, including obsolete and unsupported ones, binaries and sources, software packages, and so on) easily reach hundreds of gigabytes or more. Because of this, we once again had to revise our estimate of the required volume.
TABLE 1
Date          Disk capacity   Number of mirrors
07/15/2011    250 GB          6
08/16/2011    500 GB          9
12/09/2011    2000 GB         12

[Figure: graph of increasing disk space]
In early December, we decided to increase the volume to at least 1 TB (which raises the cost of the dedicated server by $25/month). It was clear that the data should be moved to a single disk of that size, since attaching, for example, another 500 GB disk would drastically reduce server reliability: with three working disks, the probability of a failure is roughly three times higher. The possibility of further expansion of the disk space also has to be taken into account.
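The "three times" estimate can be checked with a short calculation. This is a sketch under assumptions that the original does not state: each disk fails independently with the same probability p over a given period, and the combined volume (for example, an LVM volume spanning all disks without redundancy) is lost when any one disk fails.

```latex
P_{\text{fail}}(n) = 1 - (1 - p)^{n} \approx n\,p \quad \text{for } p \ll 1,
\qquad
P_{\text{fail}}(3) = 1 - (1 - p)^{3} \approx 3p .
% Illustrative value p = 0.03: 1 - 0.97^3 = 0.0873, against 0.03 for a single disk
```

With an illustrative p = 0.03, three disks give about 0.087 against 0.03 for one disk, so the factor of three holds as an approximation for small p.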
Note that we did not plan to use RAID: if data is lost, it can be restored from the source sites within 1-3 days (for mirrors with large amounts of data). The recovery speed depends on the bandwidth of the channels and on the load on both the source server and our own. Doing without RAID reduces the total cost of the equipment. However, great attention must be paid to the integrity of the data of your own services.
We settled on a 2 TB disk (which raises the cost of the dedicated server by $35/month) because the hoster had one in stock, which somewhat eased the nagging question of how long 1 TB would last. However, moving the data to the 2 TB disk ran into considerable difficulties (perhaps because of the insufficient qualification of our hoster). The first attempt to connect the 2 TB disk with the transferred data failed; as the hoster explained, the CentOS 5.6 installed on our server cannot boot from a 2 TB disk. After that, CentOS 6.1 was installed on this disk and the data was transferred. Since a number of services had been configured for CentOS 5.6, a number of tasks had to be carried out after the system was installed. In particular:

All this periodic connecting and disconnecting of drives, of course, destabilizes the server and, ultimately, reduces performance; the work can drag on for several weeks. We therefore recommend starting a mirror system with at least 1 TB of disk space, which will spare you troubles like those described above.
If you believe that disk space is not a concern for you, you can skip this section. In real life, however, no resource is unlimited, and we strongly recommend estimating the volume of a mirror before installing it.
First of all, look on the product owner's website for the amount of disk space needed to create a mirror. This information is usually given in the mirror installation instructions. If there are no instructions, or they do not give this figure, you will have to estimate the required volume yourself.
The easiest way is to download the software you plan to mirror from the source site gradually via FTP. This lets you build the mirror step by step (directory by directory) while keeping the volume under control.
In some cases the initial download can be done with rsync, run from the command line. However, given the fairly large volume of mirrors, you need to check the occupied disk space periodically so that you can interrupt the download if necessary.
Be careful: the initial download of a mirror can take more than a day! The ideal solution here is to have a server monitoring system.
Generally speaking, there are two more ways to estimate the volume of a mirror that cost less time and carry less risk of filling the disk:
The first is to obtain the information from the data owner, which is not always possible.
The second is to use a utility that counts the total size of all mirror files over some protocol (HTTP, FTP) without downloading them. Unfortunately, we did not find any ready-made solution, so we had to write a PHP script that estimates the volume of a mirror on a remote FTP server. Using it significantly reduced the time spent on installing mirrors.
Below is the source code of the script.
<?php
// ==============================================================
// Directory size on a remote FTP server
// Usage: php path_to_script host_name directory
// ==============================================================

// Recursively sums the sizes of all files under $dir.
function ftp_dir_size($connect, $dir)
{
    $dir_size = 0;
    $file_list = ftp_rawlist($connect, $dir);
    if (!is_array($file_list)) {
        return $dir_size; // listing failed, nothing to count
    }
    foreach ($file_list as $file) {
        // Unix-style line: attributes, links, owner, group, size,
        // month, day, year (or time), name. The limit of 9 keeps
        // names that contain spaces in one piece.
        $dim = preg_split("/\s+/", $file, 9);
        if (count($dim) < 9) {
            continue; // not a file entry, e.g. "total 12"
        }
        list($attr, $links, $owner, $group, $size, $month, $day, $year, $f_name) = $dim;
        $pr_dir = substr($attr, 0, 1);

        // skip the "." and ".." entries to avoid endless recursion
        if ($f_name != '.' && $f_name != '..') {
            // directory
            if ($pr_dir == 'd') {
                $t_dir = rtrim($dir, "/") . "/" . $f_name;
                $dir_size = $dir_size + ftp_dir_size($connect, $t_dir);
            }
            // file
            if ($pr_dir == '-') {
                echo "*** " . $f_name . " - " . $size . "\n";
                $dir_size = $dir_size + (int) $size;
            }
        }
    }
    return $dir_size;
}
// ==============================================================
set_time_limit(3600);
if ($argc < 3) {
    echo "Usage: php " . $argv[0] . " host_name directory\n";
    exit(1);
}
$host = $argv[1];
$dir = $argv[2];
echo $host . "\n";
echo $dir . "\n";
$user = "anonymous";
$password = "";
$connect = ftp_connect($host);
if ($connect) {
    $login = ftp_login($connect, $user, $password);
    if ($login) {
        if (ftp_chdir($connect, $dir)) {
            $dir = ftp_pwd($connect);
            echo "new directory - " . $dir . "\n";
        } else {
            echo "Cannot change directory\n";
        }
        $size = ftp_dir_size($connect, $dir);
        // print directory size
        echo "\nDirectory size = " . $size . "\n";
    }
    ftp_close($connect);
} else {
    echo "Cannot connect to " . $host . "\n";
}
?>
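To see what the script relies on, it helps to look at how a single line of a Unix-style FTP listing breaks down into fields. The sketch below is illustrative only: the helper name parse_list_line and the sample line are assumptions, and servers that return listings in another format (for example, Windows-style) would need a different parser.

```php
<?php
// Parse one line of a Unix-style FTP listing (as returned by ftp_rawlist).
// Returns ['type' => 'dir'|'file'|'other', 'size' => int, 'name' => string],
// or null for lines that are not entries (for example "total 12").
function parse_list_line(string $line): ?array
{
    // The limit of 9 keeps a file name containing spaces in one piece.
    $fields = preg_split('/\s+/', trim($line), 9);
    if (count($fields) < 9) {
        return null;
    }
    $flag = $fields[0][0]; // first character of the attribute string
    $type = $flag === 'd' ? 'dir' : ($flag === '-' ? 'file' : 'other');
    return ['type' => $type, 'size' => (int) $fields[4], 'name' => $fields[8]];
}

// Hypothetical sample line; the fifth field is the size in bytes.
$entry = parse_list_line('-rw-r--r--   1 ftp  ftp   1048576 Dec 09  2011 release notes.txt');
echo $entry['name'], ' - ', $entry['size'], "\n";
```

Only entries of type 'file' contribute to the total; entries of type 'dir' are the ones to descend into.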
To estimate the disk space occupied by mirrors on your own server, it is convenient to use the ncdu utility, which displays a summary for each directory:
[Screenshot: ncdu output]
We recommend tabulating the results, which makes it possible to assess the dynamics (or stability) of each mirror's size.
[Screenshot: table of mirror sizes]

4. Creating a mirror



4.1. Creating an rsync-based mirror

As a rule, the most common way to create a mirror is to synchronize with the rsync server of the source site using the rsync utility. If the software owner's site has instructions for configuring rsync, there are no problems at all. For example, the site http://cran.r-project.org/ has fairly detailed instructions for setting up a mirror. The part concerning the rsync configuration reads:
[Screenshot: rsync instructions from the CRAN site]
These instructions contain at least two points of interest to us:

Following these instructions, we create a cron job that starts rsync as prescribed, and the setup is complete. We recommend logging the work of rsync, which makes it possible to find out, if necessary, why synchronization failed.
Full rsync documentation is available at http://rsync.samba.org/ . The documentation section ( http://rsync.samba.org/documentation.html ) covers the possible uses of rsync.
One important point should be noted. If disk space is short, it is preferable to use the --delete option instead of --delete-after. With --delete-after, the new files are downloaded first, and only then are the files that no longer exist on the source site removed. This brings the mirror into full agreement with the original as quickly as possible, which can be important for frequently changing mirrors such as mozilla. With --delete, the files that are no longer present on the source site are removed first, and then the new ones are downloaded. (This describes older rsync versions; since rsync 3.0, plain --delete removes files during the transfer, and --delete-before explicitly requests removal before the transfer starts.) This mode is recommended for fairly large mirrors (CentOS, DragonflyBSD).
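The choice between the two deletion modes, together with logging, can be captured in a small wrapper that a cron job calls. This is a minimal sketch, not the configuration used by the authors: the function name build_rsync_command, the rsync URL, and the paths are all hypothetical, and the options a given project requires should be taken from its own mirroring instructions.

```php
<?php
// Build the rsync command line for one mirror update.
// $lowOnDisk = true selects --delete (free space first, suited to large mirrors);
// $lowOnDisk = false selects --delete-after (new files first).
function build_rsync_command(string $source, string $target, string $logFile, bool $lowOnDisk): string
{
    $deleteFlag = $lowOnDisk ? '--delete' : '--delete-after';
    return 'rsync -a ' . $deleteFlag
        . ' --log-file=' . escapeshellarg($logFile)
        . ' ' . escapeshellarg($source)
        . ' ' . escapeshellarg($target);
}

// Hypothetical source module and local paths, shown for illustration only.
$cmd = build_rsync_command(
    'rsync://mirror.example.org/project/',
    '/srv/mirrors/project/',
    '/var/log/mirror-project.log',
    true
);
echo $cmd, "\n";
// To actually run it from a cron-driven script: passthru($cmd, $exitCode);
```

Writing the log through --log-file keeps one file per mirror, which makes it easier to see why a particular synchronization failed.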
All other settings are up to you, as long as they do not contradict the instructions.
In some cases, the software owner's site has no information on setting up a mirror ( gcc.gnu.org ):
[Screenshot: mirrors page of gcc.gnu.org]
In this case, the mirror cannot be installed without first contacting the software owner (for details, see section 5).

4.2. Creating FTP Mirror

A rarer case is a mirror based on FTP. In general, the configuration sequence is the same as described above, except that an FTP client is used for synchronization instead of rsync. Note that the otherwise well-proven wget utility (like many others) is not suitable here: besides downloading files, we also need to remove the files that are no longer present on the source site. Perhaps the best choice is the lftp utility ( lftp.yar.ru ) in mirror mode.
The procedure for configuring lftp (including mirror mode) is described on the developer's website ( http://lftp.yar.ru/lftp-man.html ).
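As an illustration of mirror mode, the sketch below assembles an lftp call that downloads a remote directory and removes local files that have disappeared from the source. The function name build_lftp_command, the host, and the paths are hypothetical, and the sketch assumes directory names without spaces; check the lftp manual for the options your version supports.

```php
<?php
// Build an lftp command that mirrors $remoteDir into $localDir.
// "mirror --delete" removes local files that are no longer on the remote site;
// "quit" makes lftp exit after the mirror command finishes.
function build_lftp_command(string $host, string $remoteDir, string $localDir): string
{
    $script = 'mirror --delete --verbose ' . $remoteDir . ' ' . $localDir . '; quit';
    return 'lftp -e ' . escapeshellarg($script) . ' ' . escapeshellarg('ftp://' . $host);
}

// Hypothetical FTP host and paths, shown for illustration only.
echo build_lftp_command('ftp.example.org', '/pub/project', '/srv/mirrors/project'), "\n";
```

As with rsync, the resulting command can be started from cron, with its output redirected to a log file.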

4.3. Control disk space

After the mirrors are installed, we recommend setting up monitoring of the available disk space with alerts (e-mail, SMS), so that the server does not stop because the disk has filled up completely.
Obviously, with a large number of mirrors, disk space monitoring deserves serious attention. Unlike your own software, any mirror can grow sharply, for example when a new version is released. For automatic control of disk space we recommend a monitoring system such as Munin (http://muninmonitoring.org/). It is a fairly customizable, extensible system. Besides visual monitoring of parameters through a web interface, it can send e-mail alerts when a monitored parameter reaches a given threshold.
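Until a full monitoring system is in place, a threshold check of this kind can be approximated with a few lines run from cron. This is only a sketch and not a substitute for Munin: the function names, the path, the threshold, and the e-mail address are assumptions, and mail() works only if the server has a configured mail transport.

```php
<?php
// Percentage of a filesystem in use, given total and free bytes.
function disk_usage_percent(float $totalBytes, float $freeBytes): float
{
    if ($totalBytes <= 0) {
        return 0.0;
    }
    return round(100.0 * ($totalBytes - $freeBytes) / $totalBytes, 1);
}

// Sends an alert and returns true when usage reaches the threshold.
// Returns false when usage is below the threshold or cannot be read.
function check_disk(string $path, float $thresholdPercent, string $alertEmail): bool
{
    $total = disk_total_space($path);
    $free = disk_free_space($path);
    if ($total === false || $free === false) {
        return false;
    }
    $used = disk_usage_percent($total, $free);
    if ($used >= $thresholdPercent) {
        mail($alertEmail, 'Disk space alert', sprintf('%s is %.1f%% full', $path, $used));
        return true;
    }
    return false;
}

// Example call (hypothetical path, threshold and address):
// check_disk('/srv/mirrors', 90.0, 'admin@example.com');
```

A threshold well below 100% leaves time to react, since a single new release can add many gigabytes to a mirror overnight.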
In some cases the volume of a mirror can be reduced by synchronizing only stable versions of the software. Obviously, this must be agreed with the owner of the software. Tuning the mirror becomes more difficult in this case, but the gain is obvious.

5. Adding a mirror to the list of mirrors


Adding a mirror to the list of mirrors on the product owner's site is, from my point of view, the most complicated procedure in the whole process of organizing mirrors. All the operations described so far can be performed on your own, whereas adding a mirror to the list of mirrors, of course, requires the consent of the site owner.

5.1. Option 1

The most trivial case is when the mirror installation instructions say something like “install a mirror, write to us at mirrors@example.com and we will add you to our list of mirrors”.
This was the case, for example, when installing the following mirrors:

However, even in this case success is not guaranteed right away. Perhaps the letter did not get through the first time. Perhaps people meant to do it but did not get to it immediately and then forgot. You never know the reasons. The response may also simply be delayed.
As an example, here is the correspondence on creating a mirror for ImageMagick:

In any case, if no answer is received, I recommend sending the request for inclusion in the list of mirrors at least three times. In my personal experience, that is the largest number of letters it took before we received an answer about including our site in the list of mirrors.

5.2. Option 2

In some cases, the mirror installation instructions say: “Install a mirror, send a letter and we will consider including you in the list of mirrors”. In other words, there is a possibility that your mirror will not be included in the list. Nevertheless, do not give up; calmly do the work and then proceed as in Option 1.

5.3. Option 3

In some cases, submitting a request for inclusion in the list of mirrors requires a subscription to a mailing list, as with CentOS (http://www.centos.org/). In this case, your requests for inclusion in the list of mirrors should be sent to the mailing list.
Example of correspondence with CentOS:


5.4. Text of the letter

So, you are about to send a letter to the product owner, either asking whether a mirror can be created and clarifying some technical details, or asking for inclusion in the list of mirrors once the mirror has been created.
In either case, you must provide a minimum set of data:

Before sending the letter, read the mirror installation instructions (if any) once again. In some cases, product owners ask you to specify, for example, the maximum number of connections to your server, or how often synchronization is run; once a day is usually enough, but sometimes twice a day is requested. Be sure to include this data in your letter.

5.5. Signature and address

The signature of the letter: name, position, e-mail address. The ideal position is site administrator. The e-mail address matters here! An address whose domain matches the domain of the mirror is desirable. A common exception to this rule is a GMail address: it is normal for a server administrator to have mail on GMail.
Below is a screenshot of a real letter (used when installing the CentOS mirror):
[Screenshot: letter for the CentOS mirror]
Screenshot of a real letter used when installing the ImageMagick mirror:
[Screenshot: letter for the ImageMagick mirror]

6. Distribution of mirrors


Since almost any server already runs a web server, we can distribute our mirrors over HTTP right away.
However, we strongly recommend also configuring an FTP server for distributing the mirrors via FTP. At a minor cost (installing and configuring an FTP server), this gives at least three advantages:

Also, as an additional way to reduce the load on the main web server, we can recommend installing a second web server, for example lighttpd, to serve your own services.

7. Organization of work


Organizing a system of mirrors is a fairly long job. Installing a single mirror, correspondence included, can take weeks (by installation we mean reaching the ultimate goal: inclusion of the mirror in the list of mirrors on the source site). Creating the whole system turns into a process that can last for months.
Throughout this time, the work on your own server must be coordinated with the correspondence. Moreover, mirrors are usually being created for several source sites at once. It is therefore well worth systematizing the planning and tracking of the work (http://is.gd/WeFybI). The planning document should include the following data for each mirror:

The current state of the work and the planned operations should be recorded for each mirror.
In the same document or, if that is more convenient for processing the information, in a separate document for each mirror, keep a record of the correspondence about the mirror. It should contain the text of your letter and the date it was sent, the text of the answer and the date it was received, and the e-mail addresses of the sender and the recipient.
This makes the process of creating a system of mirrors much more organized.

8. Conclusion


Over a period of 4 months, the PR value of the fatcow.com domain was raised from zero to PR = 4.
Table 2 shows the data on the sites whose mirrors have been installed and added to their lists of mirrors.
TABLE 2
Project name     Website address        Q, GB   PR   Q / PR
Mozilla          www.mozilla.org        23      9    2.6
Opera            www.opera.com          13      9    1.4
KDE              www.kde.org            21      7    3.0
ImageMagick      www.imagemagick.org    4       7    0.6
DragonFly BSD    www.dragonflybsd.org   177     6    29.5
CentOS           www.centos.org         142     7    20.0

Note. Q is the disk space occupied by the mirror as of December 15, 2011.
The total disk space occupied by the mirrors listed in Table 2 is 380 GB.
The last column of the table shows the ratio Q / PR, a kind of indicator of the quality of a mirror. The lower it is (the measure is, of course, somewhat arbitrary), the better the mirror suits our purposes.
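The indicator is simple to compute for any candidate mirror before committing disk space to it. A minimal sketch follows; the function name q_per_pr is hypothetical, and the two sample rows use the Q and PR values of Mozilla and DragonFly BSD from Table 2.

```php
<?php
// Q / PR: gigabytes of disk space spent per unit of the source site's PageRank.
// Lower values mean a cheaper mirror for the same PR.
function q_per_pr(float $sizeGb, int $pageRank): float
{
    return round($sizeGb / $pageRank, 1);
}

// [Q in GB, PR] for two rows of Table 2.
$mirrors = [
    'Mozilla'       => [23, 9],
    'DragonFly BSD' => [177, 6],
];
foreach ($mirrors as $name => [$q, $pr]) {
    printf("%-14s %5.1f\n", $name, q_per_pr($q, $pr));
}
```

Sorting candidate mirrors by this value gives a rough order in which to install them.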
Two conclusions follow from the results presented in the table:
1. Hosting operating systems as mirrors is quite expensive. Creating such mirrors is advisable only if limited hosting can be agreed on, for example only stable releases of the product.
As an example, take DragonFly BSD, whose total volume is approximately 400 GB. A particularly large share of the disk space is taken up by software packages for the different versions. For this reason, synchronization was limited to the latest versions of DragonFly BSD:
[Screenshot: synchronization limited to the latest DragonFly BSD versions]
As a result, the required disk space was at least halved, to an acceptable 200 GB.
However, such an agreement is unlikely to be reached in every case. The disk space problem therefore depends directly on how far you aim to raise PR.
2. First of all, plan to host system-wide software that runs on different operating systems (web servers, FTP servers (ProFTPd), database servers (MySQL), etc.).
Based on the analysis of Table 2, we adjusted the selection of mirrors to be hosted. At present, the mirrors listed in Table 3 are in progress (at various stages of readiness, since the work is suspended while the disk is being replaced).
TABLE 3
Project name   Website address      Q, GB   PR   Q / PR
apache         www.apache.org       28      9    3.1
cpan           www.cpan.org         10      7    1.4
cran           cran.r-project.org   74      7    10.6
gcc            gcc.gnu.org          27      7    3.9
openssl        www.openssl.org      0.374   8    0.05
putty          www.putty.org        0.017   6    0.003
xemacs         www.xemacs.org       0.071   6    0.012

Obviously, a PR value of four is far from the limit.
The main cost items are:

It should be borne in mind that, alongside supporting the system of mirrors, this server is used for other work, which reduces the cost attributable to the mirrors. Total spending is about $500 per month.
Translated from English, source: webhostinggeeks.com

Source: https://habr.com/ru/post/142324/

