
A mirror is a copy of data from one information resource hosted on another. Mirrors make the same information available from several sources. For example, *nix distributions are distributed this way: copies of the repositories are stored on numerous mirrors in different parts of the world. Mirrors spread the load sensibly and keep package downloads fast.
Our company runs its own package mirror, which stores copies of the repositories of popular Linux distributions. In this article we describe in detail how it is built.
When we launched our cloud server project in 2010, we chose a net-install model for it: distributions are installed by the “native” installer from one of the official mirrors. With this model you always get the latest software versions with all the latest changes made by the distribution maintainers. Another advantage of net-install is that it eliminates a number of problems associated with cloned instances (the need to regenerate SSH keys, file system UUIDs, and so on).
We chose mirror.yandex.ru as the main mirror because it is close to us and contains all the repositories we need. At first it suited us well. Then the unexpected happened: the number of installations grew, engineers began testing templates, and in the end Yandex, outraged by the huge number of identical requests, simply closed access to its mirror for our subnets.
We began looking for a solution that would ensure stability and minimize the likelihood of emergencies. Our first idea was to set up nginx as a proxy server in front of several mirrors. This seemed reasonable and reliable: even if one of the uplinks (upstream mirrors) went down, we could easily download files from another. However, we immediately ran into the heterogeneous structure of mirrors: for example, the CentOS repository could be in /centos on one uplink, in /CentOS on another, and in /www/mirror/srv/pub/centos on a third.
Universal mirrors that carry the repositories of all the distributions we need (CentOS, Debian, Ubuntu, OpenSUSE) can be counted on the fingers of one hand, so we had to compile a separate list of mirrors for each distribution.
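A minimal sketch of such per-distribution mirror lists, written in Python for illustration only. The host names and the candidate_urls helper are hypothetical and are not taken from our actual configuration; the point is only that each mirror keeps the same repository under its own path prefix.

```python
# Hypothetical uplinks: every distribution has its own list, and every
# mirror stores the repository under a different path prefix.
UPSTREAMS = {
    "centos": [
        "http://mirror-a.example.org/centos/",
        "http://mirror-b.example.org/CentOS/",
        "http://mirror-c.example.org/www/mirror/srv/pub/centos/",
    ],
    "debian": [
        "http://mirror-a.example.org/debian/",
        "http://mirror-d.example.org/pub/debian/",
    ],
}


def candidate_urls(distro, relative_path):
    """Return the URL of relative_path on every uplink known for distro."""
    suffix = relative_path.lstrip("/")
    return [base + suffix for base in UPSTREAMS[distro]]
```

A proxy or a sync script can then try the returned URLs in order and fall back to the next one when an uplink fails.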
Having put this idea into practice, we ran into much more serious difficulties:
- the speed of uplinks is inconsistent: the same host often serves 5-10 Mb/s and a couple of hours later no more than 5-20 Kb/s. Since the installer downloads packages one by one, these differences in speed make the installation time unpredictable;
- some uplinks were misconfigured: in response to a request for an RPM package they sometimes returned the HTML page “It works!”;
- on some uplinks, packages listed in the catalog were missing, or were present but had the wrong checksums. This could happen, for example, because of the wrong order of synchronization with upstream: index files first and packages afterwards, rather than the other way around. Errors could also come from an incorrect rsync configuration that wrote files in place instead of saving the contents to a temporary file and then replacing the original atomically.
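The checks that would have caught these failures can be sketched in Python as follows. This is an illustration under stated assumptions, not code from our mirror: the function names are hypothetical, and the magic-byte test only rejects obvious non-packages such as an HTML page. The last function shows the "temporary file, then atomic replacement" pattern mentioned above.

```python
import hashlib
import os
import tempfile

RPM_MAGIC = b"\xed\xab\xee\xdb"  # first four bytes of an RPM file
DEB_MAGIC = b"!<arch>\n"         # a .deb file is an ar archive


def looks_like_package(data):
    """Reject obvious non-packages such as an HTML "It works!" page."""
    return data.startswith(RPM_MAGIC) or data.startswith(DEB_MAGIC)


def checksum_matches(data, expected_sha256):
    """Compare the SHA-256 of the payload with the value from the catalog."""
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()


def save_atomically(data, destination):
    """Write to a temporary file in the same directory, then rename it over
    the destination, so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(destination))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, destination)
    except BaseException:
        os.unlink(tmp_path)  # do not leave the temporary file behind
        raise
```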
Because of all these difficulties, automatic installations failed more than once. To get rid of the failures once and for all, we created our own mirror, mirror.selectel.ru. It is available only from Selectel's IP addresses (we pay for outgoing traffic and do not risk opening it to the public, since that could easily reach 10-20 gigabits).
Having created our own mirror, we solved all the problems described above. Our own mirror also brought the following advantages:
- synchronization with uplinks happens without interrupting service and does not affect the working copy served to clients in any way;
- a synchronized copy replaces the current one only if the checksums of all new packages match;
- if an uplink is unavailable for some reason or returns erroneous data, the mirror continues to serve data from the old but working copy;
- synchronization with uplinks is configured per distribution: some distributions can be synchronized less frequently than others. Partial cloning of some repositories is also possible.
The same mirror is also used to install operating systems on dedicated servers.
How repositories are organized
As a rule, a repository consists of two main parts: a catalog (index) and a pool (package storage).
The catalog stores information about all packages in the repository: name, description, architecture, version, checksums, and in some cases also dependencies and package contents. The catalog also indicates where in the pool the file for each version of each package is located.
The package files themselves are stored in the pool. They can be arranged in a hierarchy or simply placed in a single directory.
RPM repositories
At the root of each RPM repository is a directory with the catalog files, repodata. All sections of the catalog are described in the repomd.xml file. Each section is stored as a separate file in the repodata directory. The description gives the path to the file containing the section, as well as its checksum.
The contents of the repomd.xml file might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
  <revision>1362531727</revision>
  <data type="primary"> <!-- The primary section: an XML database containing information about the repository packages -->
    <checksum type="sha256">87aa4c4e19f9a3ec93e3d820f1ea6b6ece8810cb45f117a16354465e57a1b50d</checksum>
    <open-checksum type="sha256">77b5cfcf2c06156858a14a52595e1f69cd8cbb58c09699a3ea4391379260e943</open-checksum>
    <location href="repodata/87aa4c4e19f9a3ec93e3d820f1ea6b6ece8810cb45f117a16354465e57a1b50d-primary.xml.gz"/>
    <timestamp>1362531876</timestamp>
    <size>2043735</size>
    <open-size>12931923</open-size>
  </data>
  <data type="primary_db"> <!-- The primary_db section: the same as primary, but as an SQLite database -->
    <checksum type="sha256">243fdef956d09cb6d022e894e40d145f497bcf3d6d2bed79814e1c88452b9d29</checksum>
    <open-checksum type="sha256">533872a158160ac3a83746a676c125b5cfb2411725079502b0d5be4f4d05196e</open-checksum>
    <location href="repodata/243fdef956d09cb6d022e894e40d145f497bcf3d6d2bed79814e1c88452b9d29-primary.sqlite.bz2"/>
    <timestamp>1362531897.21</timestamp>
    <database_version>10</database_version>
    <size>3605913</size>
    <open-size>14942208</open-size>
  </data>
  ...
</repomd>
The RPM catalog consists of the following sections:
- primary - contains a description of all the packages stored in the repository, the paths to the files of these packages and their checksums;
- filelists - contains lists of files included in each package;
- group - contains descriptions of groups of packages installed using yum groupinstall;
- other - contains additional information (for example, changelogs).
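To show how the section descriptions in repomd.xml can be read, here is a short Python sketch based on the standard xml.etree.ElementTree module. The parse_repomd function and the shape of its result are assumptions made for illustration; real tools such as yum use their own parsers.

```python
import xml.etree.ElementTree as ET

# Default namespace used by the elements of repomd.xml.
REPO_NS = {"repo": "http://linux.duke.edu/metadata/repo"}


def parse_repomd(xml_text):
    """Map each section type to the location and checksum of its file."""
    root = ET.fromstring(xml_text)
    sections = {}
    for data in root.findall("repo:data", REPO_NS):
        checksum = data.find("repo:checksum", REPO_NS)
        location = data.find("repo:location", REPO_NS)
        sections[data.get("type")] = {
            "href": location.get("href"),
            "checksum_type": checksum.get("type"),
            "checksum": checksum.text.strip(),
        }
    return sections
```

With the result, a validator can download each href and compare the file's digest with the recorded checksum before trusting the section.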
Packages are structured and grouped differently in different operating systems. For example, CentOS stores all package files in the Packages directory in the repository root, and keeps a separate repository for each supported architecture.
OpenSUSE stores packages for all architectures in a single repository, with separate pools in the i686, x86_64, etc. directories.
DEB repositories
In DEB repositories, all packages are stored in a common pool. This avoids duplicating packages that are included in several releases. A separate directory is created in the repository for each release.
Parsing of the catalog begins with the file /dists/[distribution]/Release (distribution here means the release code name: squeeze, wheezy, jessie). It contains a list of the release components, as well as the sizes and checksums of all index files. The Release file is signed by the archive's maintainers; the signature is stored in the Release.gpg file (sometimes the contents of Release together with the signature are in the InRelease file).
The contents of the pool are described in index files of two types: Packages (binary packages) and Sources (source packages).
The path to the Packages file is /dists/[distribution]/[component]/binary-[architecture]/Packages, and the path to the Sources file is /dists/[distribution]/[component]/source/Sources.
Note: index files are sometimes compressed with gzip or bzip2; in that case the extension .gz or .bz2, respectively, is added to the file name. Some clients also support LZMA (.lzma), XZ (.xz) and LZIP (.lz).
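The path rules and the checksum list of the Release file can be illustrated with a small Python sketch. The function names are hypothetical; the only assumption about the file itself is the usual layout in which a "SHA256:" field is followed by indented lines of the form "digest size path", with paths relative to the release directory.

```python
def packages_index_path(distribution, component, architecture):
    """Build the path of a Packages index relative to the repository root."""
    return f"dists/{distribution}/{component}/binary-{architecture}/Packages"


def parse_release_sha256(release_text):
    """Return {index path: (sha256 digest, size)} from a Release file."""
    entries = {}
    in_sha256 = False
    for line in release_text.splitlines():
        if line.startswith("SHA256:"):
            in_sha256 = True
        elif in_sha256 and line.startswith(" "):
            digest, size, path = line.split()
            entries[path] = (digest, int(size))
        else:
            in_sha256 = False  # any other field ends the SHA256 list
    return entries
```

Compressed variants such as Packages.gz appear in the same list under their own names, so the same lookup works for them.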
Here is an example of an entry from the Packages file:
Package: openssh-server
Source: openssh
Version: 1:6.2p2-6
Installed-Size: 747
Maintainer: Debian OpenSSH Maintainers
Architecture: amd64
Replaces: openssh-client (<= 2.16), libcomerr2 (>= 1.01), libgssapi-krb5-2 (>= 1.10+dfsg~), libkrb5-3 (>= 1.6.dfsg.2), libpam0g (>= 0.99.7.1), libselinux1 (>= 1.32), libssl1.0.0 (>= 1.0.1), libwrap0 (>= 7.6-4~), zlib1g (>= 1:1.1.4), openssh-client (= 1:6.2p2-6), sysv-rc (>= 2.88dsf-24) | file-rc (>= 0.8.16), libpam-runtime (>= 0.76-14), libpam-modules (>= 0.72-9), adduser (>= 3.9), dpkg (>= 1.9.0), lsb-base (>= 4.1+Debian3), procps
Recommends: xauth, ncurses-term
Suggests: ssh-askpass, rssh, molly-guard, ufw, monkeysphere, openssh-blacklist, openssh-blacklist-extra
Conflicts: rsh-client (<< 0.16.1-1), sftp, ssh (<< 1:3.8.1p1-9), ssh-krb5 (<< 1:4.3p2-7), ssh-nonfree (<< 2), ssh-socks, ssh2
Description: secure shell (SSH) server, for secure access from remote machines
Multi-Arch: foreign
Homepage: http://www.openssh.org/
Description-md5: 842cc998cae371b9d8106c1696373919
Tag: admin::login, implemented-in::c, interface::daemon, network::server,
 protocol::ssh, role::program, security::authentication,
 security::cryptography, use::login, use::transmission
Section: net
Priority: optional
Filename: pool/main/o/openssh/openssh-server_6.2p2-6_amd64.deb
Size: 257438
MD5sum: 1f18e568c17d81cc2c493ee48c93a03f
SHA1: 207f131bbd4d709a47bcb69c997520c998ed7593
SHA256: 242b7f041292dea0702b24e19dc6355f47147796b227f1024665920a493641f2
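An entry of this kind is easy to read programmatically. The following Python sketch is a simplified, hypothetical parser (parse_packages is not part of any Debian tool): it splits a Packages index into entries separated by blank lines and joins continuation lines, which begin with whitespace, to the preceding field.

```python
def parse_packages(text):
    """Parse a Packages index into a list of {field: value} dictionaries."""
    packages, current, last_field = [], {}, None
    for line in text.splitlines():
        if not line.strip():                    # a blank line ends an entry
            if current:
                packages.append(current)
            current, last_field = {}, None
        elif line[0] in " \t" and last_field:   # continuation of a field
            current[last_field] += "\n" + line.strip()
        else:
            field, _, value = line.partition(":")
            last_field = field
            current[field] = value.strip()
    if current:
        packages.append(current)
    return packages
```

The Filename, Size and SHA256 fields of each entry are what a validator needs to locate a package in the pool and verify it.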
How our mirror works
The repository of each distribution is stored on the mirror in two copies: a shadow (background) copy and a working (foreground) copy. Both are kept on a separate LVM volume, which makes it possible to add disk space on the fly. The working copy is the one served to clients by nginx. The shadow copy is synchronized with the upstream mirror and then thoroughly validated.
Validation includes checking the catalog and its digital signature (if any), as well as the checksums of all index files. Verifying the checksums of all packages is difficult: the pools of some repositories hold tens or even hundreds of gigabytes of packages. Therefore checksums are verified only for new packages, that is, those that rsync has touched. After validation, the shadow and working copies are swapped with a simple mv. This makes the swap practically atomic (three quick mv calls are enough to swap the directories) and minimizes possible downtime. Files that are already open continue to be served during the swap.
After the swap, the new shadow copy is brought up to date locally from the working copy.
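The three-step swap can be sketched in Python with os.rename, which does the same job as mv when both directories are on the same file system (an assumption of this sketch). The function name and the ".old" suffix are hypothetical and are not taken from mirror-sync.

```python
import os


def swap_copies(working, shadow):
    """Swap two sibling directories with three renames.

    Between the first and the second rename the working path briefly does
    not exist, so the swap is nearly atomic rather than strictly atomic.
    """
    parked = working + ".old"     # hypothetical temporary name
    os.rename(working, parked)    # 1. move the current copy aside
    os.rename(shadow, working)    # 2. publish the validated copy
    os.rename(parked, shadow)     # 3. the old copy becomes the new shadow
```

Because renaming does not touch file contents, a server that already holds a file open keeps reading it from the renamed directory.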
Mirror-sync
The algorithm described above is implemented in our set of scripts called mirror-sync, which we recently published on GitHub under the GNU GPL license. We hope that our ideas will be useful to a wide audience and that some of our readers will draw on our experience when building their own mirrors. We will certainly take all comments and suggestions for improving the mirror into account in our future work.
Readers who cannot comment on posts on Habr are welcome to comment on our blog.