
Selecting and configuring SDS Ceph

Hello everyone, dear readers and practitioners!

I had hands-on experience with all sorts of Block/File Storage on SANs and was, on the whole, happy, until the task came up of figuring out what Object Storage actually is and, since there are already plenty of solutions on the market, choosing one...

Why Object Storage?


Well, first of all, it is a system designed specifically for storing objects, and it is first and foremost Storage.

Secondly, the system elegantly turns into WORM (write once, read many) storage and back.
Thirdly, data pools can be separated per user, with flexible quotas for users, pool sizes, object counts and the number of buckets per pool. In other words, normal storage administration functionality.
Fourth, fifth, and so on: anyone who knows why they need it will surely find more than one advantage. And perhaps some disadvantages too.
Requirements for the selection of object storage:

- the de facto standard OS is CentOS;
- the object storage must be redundant across data centers;
- if possible, completely free.
Rather meager and vague requirements.

I started the search process with an overview of what was available:


Having chosen Ceph, I set off down a path full of rakes to step on; there were plenty of them, but I got past them all with optimism.

The first thing to do was choose a version. I did not install the latest Luminous, which had only just been released, because we are a serious company. :) There were plenty of complaints about Hammer. I liked Jewel as a stable release. I created a local repository, synced it with ceph.com, put the sync job into cron, and published it through nginx. We will also need EPEL:

# Ceph-Jewel
/usr/bin/rsync -avz --delete --exclude='repo*' rsync://download.ceph.com/ceph/rpm-jewel/el7/SRPMS/ /var/www/html/repos/ceph/ceph-jewel/el7/SRPMS/
/usr/bin/rsync -avz --delete --exclude='repo*' rsync://download.ceph.com/ceph/rpm-jewel/el7/noarch/ /var/www/html/repos/ceph/ceph-jewel/el7/noarch/
/usr/bin/rsync -avz --delete --exclude='repo*' rsync://download.ceph.com/ceph/rpm-jewel/el7/x86_64/ /var/www/html/repos/ceph/ceph-jewel/el7/x86_64/
# EPEL7
/usr/bin/rsync -avz --delete --exclude='repo*' rsync://mirror.yandex.ru/fedora-epel/7/x86_64/ /var/www/html/repos/epel/7/x86_64/
/usr/bin/rsync -avz --delete --exclude='repo*' rsync://mirror.yandex.ru/fedora-epel/7/SRPMS/ /var/www/html/repos/epel/7/SRPMS/
# Ceph-Jewel
/usr/bin/createrepo --update /var/www/html/repos/ceph/ceph-jewel/el7/x86_64/
/usr/bin/createrepo --update /var/www/html/repos/ceph/ceph-jewel/el7/SRPMS/
/usr/bin/createrepo --update /var/www/html/repos/ceph/ceph-jewel/el7/noarch/
# EPEL7
/usr/bin/createrepo --update /var/www/html/repos/epel/7/x86_64/
/usr/bin/createrepo --update /var/www/html/repos/epel/7/SRPMS/
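
On the cluster nodes the local mirror can then be plugged in with a repo file along these lines (the server name repo.example.local and the URLs are illustrative, not from the original setup; adjust them to your nginx server):

# /etc/yum.repos.d/ceph-local.repo - illustrative local-mirror repo file
[ceph-jewel]
name=Ceph Jewel x86_64 (local mirror)
baseurl=http://repo.example.local/repos/ceph/ceph-jewel/el7/x86_64/
enabled=1
gpgcheck=0

[ceph-jewel-noarch]
name=Ceph Jewel noarch (local mirror)
baseurl=http://repo.example.local/repos/ceph/ceph-jewel/el7/noarch/
enabled=1
gpgcheck=0

[epel-local]
name=EPEL 7 (local mirror)
baseurl=http://repo.example.local/repos/epel/7/x86_64/
enabled=1
gpgcheck=0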


and proceeded with further planning and installation.

First, you need to draw up the solution architecture, even if it is only a test one, so that it can easily be scaled to production, which is what we will strive for. I ended up with the following:

- three OSD nodes in each data center;
- three MON nodes across three sites (one each), which provide the quorum majority for the Ceph cluster (monitor nodes can share hardware with other roles, but I chose to make them virtual and host them on VMware);
- two RGW nodes (one in each data center), which provide API access to the Object Storage over the S3 or Swift protocol (RadosGW nodes can also be co-located with other roles, but I chose to make them virtual and host them on VMware as well);
- a node for deployment and centralized management (a virtual server that migrates between data centers in VMware);
- a node for monitoring the cluster and its current/historical performance (the same story as with the deploy node).

Plan the networks. I used a single network for the whole Ceph cluster "ecosystem". The deployment, monitoring and RGW nodes had two networks attached:

- Ceph cluster network, for access to resources;
- “public” network, for access “from outside” to these nodes.
The official documentation recommends using separate networks inside the cluster for heartbeat and for data movement between OSD nodes, although the same documentation notes that a single network reduces latency... I chose a single network for the entire cluster.
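
For reference: if separate networks were used, the split would be declared in the [global] section of ceph.conf roughly like this (the subnets are illustrative):

[global]
# client-facing traffic
public network = 192.168.10.0/24
# OSD replication and heartbeat traffic
cluster network = 192.168.20.0/24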

Cluster installation begins with a basic procedure: preparing the server nodes running CentOS.
- set up the repositories if they are local (for example, like mine); the EPEL repository is also needed;
- on every node, enter information about all cluster nodes into /etc/hosts. If the infrastructure uses DHCP, it is better to create address reservations and still fill in /etc/hosts (a sketch of this and the NTP step follows the user-creation example below);
- configure NTP and synchronize time; this is critical for the correct operation of Ceph;
- create a user for managing the Ceph cluster; any name will do, the main thing is that it is not named ceph.

For example:

sudo useradd -d /home/cephadmin -m cephadmin
sudo passwd cephadmin
echo "cephadmin ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/cephadmin
sudo chmod 0440 /etc/sudoers.d/cephadmin
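
And a minimal sketch of the /etc/hosts and NTP steps mentioned above (host names follow this article's naming scheme, the addresses are illustrative):

# /etc/hosts fragment on every node (illustrative addresses)
cat << 'EOF' | sudo tee -a /etc/hosts
10.10.10.11  ceph-cod1-mon-n1
10.10.10.21  ceph-cod1-osd-n1
10.10.10.22  ceph-cod1-osd-n2
EOF
# ... and so on for all nodes of the cluster

# time synchronization
sudo yum install -y ntp
sudo systemctl enable ntpd
sudo systemctl start ntpd
ntpq -p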


- install the SSH server on the deployment node, generate keys for the created user, copy the keys to all cluster nodes, and set up sudo with the NOPASSWD option. The goal is passwordless login for this user from the deployment node to every node of the cluster (a sketch follows the config example below);
- create a config file in the created user's .ssh directory, describe all the nodes in it, and set permissions 600 on the file;

[cephadmin@ceph-deploy .ssh]$ cat config
Host ceph-cod1-osd-n1
Hostname ceph-cod1-osd-n1
User cephadmin
...................
Host ceph-cod2-osd-n3
Hostname ceph-cod2-osd-n3
User cephadmin
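
The key generation and distribution mentioned above boils down to something like this, run as the cephadmin user on the deployment node (the node list is abbreviated):

# generate a key pair and copy the public key to each cluster node
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
for node in ceph-cod1-osd-n1 ceph-cod1-osd-n2 ceph-cod2-osd-n3; do
    ssh-copy-id cephadmin@$node
done
# restrict permissions on the .ssh config created above
chmod 600 ~/.ssh/config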


- open ports 6789/tcp and 6800-7100/tcp in firewalld, if you decide to keep it enabled (a sketch of these commands follows below);
- disable SELinux (although starting with the Jewel release, proper SELinux policies come with the installation);
- on the cluster management node, run yum install ceph-deploy.
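
A minimal sketch of the firewall and SELinux steps (assuming firewalld stays enabled):

sudo firewall-cmd --permanent --add-port=6789/tcp
sudo firewall-cmd --permanent --add-port=6800-7100/tcp
sudo firewall-cmd --reload

# if you disable SELinux instead of relying on the shipped policies
sudo setenforce 0
sudo sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config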

It seems everything is ready! Now on to installing and configuring the cluster itself.


In the home directory of our user, create a directory where the cluster configs will live. It is better not to lose this folder, since recovering it would be very problematic. As the cluster is configured and various roles and services are bolted on, this directory fills up with files and keys.

Create the first MON node of our future cluster: ceph-deploy new #name_of_our_MON_node. A ceph.conf file appears in the directory created earlier; the cluster description will now be entered into it, and its contents will be applied to the nodes we need.

We install Ceph Jewel itself on all nodes: ceph-deploy install --release=jewel --no-adjust-repos #node1 #node2 ... #node_N. The --no-adjust-repos key must be used if the installation repository is local, so that the installer looks for paths in the existing /etc/yum.repos.d/*.repo files instead of trying to register its own repository. Of the Jewel line, the stable release is installed by default unless specified otherwise.

After a successful installation, we initialize the cluster: ceph-deploy mon create-initial.
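
Put together, with the node names used later in this article, the sequence looks roughly like this (a sketch, not a transcript of the actual run):

cd ~/ceph-cluster
ceph-deploy new ceph-cod1-mon-n1
ceph-deploy install --release=jewel --no-adjust-repos \
    ceph-cod1-mon-n1 ceph-cod1-mon-n2 ceph-cod2-mon-n1 \
    ceph-cod1-osd-n1 ceph-cod1-osd-n2 ceph-cod1-osd-n3 \
    ceph-cod2-osd-n1 ceph-cod2-osd-n2 ceph-cod2-osd-n3 \
    ceph-cod1-rgw-n1 ceph-cod2-rgw-n1
ceph-deploy mon create-initial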

When the cluster is initialized, the initial configuration, including the fsid, is written to the ceph.conf file. If this fsid is later changed or lost by the cluster, the cluster will "fall apart" and, as a consequence, the data will be lost! So, once the initial configuration is in ceph.conf, we boldly open it (after making a backup) and start editing in the values we need. When pushing it out to the nodes, the --overwrite-conf option must be specified. Here is the approximate content of our config:

[root@ceph-deploy ceph-cluster]# cat /home/cephadmin/ceph-cluster/ceph.conf
[global]
fsid = #-_
mon_initial_members = ceph-cod1-mon-n1, ceph-cod1-mon-n2, ceph-cod2-mon-n1
mon_host = ip-address1,ip-address2,ip-address3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

#Choose reasonable numbers for number of replicas and placement groups.
osd pool default size = 2 # Write an object 2 times
osd pool default min size = 1 # Allow writing 1 copy in a degraded state
osd pool default pg num = 256
osd pool default pgp num = 256

#Choose a reasonable crush leaf type
#0 for a 1-node cluster.
#1 for a multi node cluster in a single rack
#2 for a multi node, multi chassis cluster with multiple hosts in a chassis
#3 for a multi node cluster with hosts across racks, etc.
osd crush chooseleaf type = 1

[client.rgw.ceph-cod1-rgw-n1]
host = ceph-cod1-rgw-n1
keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-cod1-rgw-n1/keyring
rgw socket path = /var/run/ceph/ceph.radosgw.ceph-cod1-rgw-n1.fastcgi.sock
log file = /var/log/ceph/client.radosgw.ceph-cod1-rgw-n1.log
rgw dns name = ceph-cod1-rgw-n1.**.*****.ru
rgw print continue = false
rgw frontends = "civetweb port=8888"

[client.rgw.ceph-cod2-rgw-n1]
host = ceph-cod2-rgw-n1
keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-cod2-rgw-n1/keyring
rgw socket path = /var/run/ceph/ceph.radosgw.ceph-cod2-rgw-n1.fastcgi.sock
log file = /var/log/ceph/client.radosgw.ceph-cod2-rgw-n1.log
rgw dns name = ceph-cod2-rgw-n1.**.*****.ru
rgw print continue = false
rgw frontends = "civetweb port=8888"
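
After editing, the config can be pushed out and the gateway instances created with ceph-deploy; roughly like this (the node list is abbreviated):

# push the edited ceph.conf, overwriting the copies on the nodes
ceph-deploy --overwrite-conf config push ceph-cod1-mon-n1 ceph-cod1-rgw-n1 ceph-cod2-rgw-n1
# deploy the RadosGW instances described in the [client.rgw.*] sections
ceph-deploy rgw create ceph-cod1-rgw-n1 ceph-cod2-rgw-n1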


There are also a couple of comments:

- if the ceph-radosgw.* service does not start: in my case the problem was with creating the log file. I solved it by simply creating the file by hand and setting mode 0666 on it;
- you can choose the frontend for the gateway between civetweb, fastcgi and apache; here is what the official documentation says:
As of firefly (v0.80), Ceph Object Gateway is running on Civetweb (embedded into the ceph-radosgw daemon) instead of Apache and FastCGI. Using Civetweb simplifies the Gateway installation and configuration.
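
Once the gateways are running, a quick sanity check and a first S3/Swift user can look roughly like this (the user name is illustrative):

# civetweb should answer on the port set in ceph.conf with an XML response
curl http://ceph-cod1-rgw-n1:8888
# create a user for further testing of the S3 API
radosgw-admin user create --uid=testuser --display-name="Test User"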

In short, in this file I have set the global variables, the replication rules and the cluster "survival" rule, and declared the RGW nodes admitted to the cluster. As for the osd pool default pg num and osd pool default pgp num parameters, I came across an interesting note:
For example, with a count of 64 total PGs ... it seems Ceph expects somewhere between 20 and 32 PGs per OSD: a value below 20 gives you one error, and a value above 32 gives another.
So, with 9 OSDs the minimum value would be 9 * 20 = 180 and the maximum 9 * 32 = 288. I chose 256 and set it dynamically.

And this one:
PG (Placement Group) is a placement group, a logical collection of objects in Ceph that is replicated across OSDs. One group can store data on several OSDs, depending on the complexity of the system. The formula for calculating the number of placement groups for Ceph is as follows:

Number of PGs = (number of OSDs * 100) / number of replicas

The result should be rounded to the nearest power of two (for example, if the formula gives 700, it rounds to 512).
PGP (Placement Group for Placement purposes) is the number of placement groups used for placement; it must be equal to the total number of placement groups.
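
If the values need to be changed later on an already existing pool, this can be done at runtime; a sketch (the pool name default.rgw.buckets.data is illustrative):

# pg_num can only be increased, never decreased
ceph osd pool set default.rgw.buckets.data pg_num 256
ceph osd pool set default.rgw.buckets.data pgp_num 256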

And on all nodes where the key is present, we change its permissions: sudo chmod +r /etc/ceph/ceph.client.admin.keyring
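
If the admin keyring and config have not yet reached a node, ceph-deploy can distribute them; roughly (the node list is abbreviated):

ceph-deploy admin ceph-cod1-mon-n1 ceph-cod1-osd-n1 ceph-cod2-osd-n1
sudo chmod +r /etc/ceph/ceph.client.admin.keyring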

Before an OSD device is added to the cluster, some preparatory work needs to be done on the OSD nodes:

We partition the disk that will be used for journals. Ideally this is an SSD (though not necessarily, a dedicated disk is enough), divided into no more than 4 equal partitions, since the partitions must be primary:

parted /dev/SSD
mkpart journal-1 1 15G
mkpart journal-2 15G 30G
mkpart journal-3 30G 45G
mkpart journal-4 45G 60G


And format them as xfs. Then we change the ownership of the disks on the OSD nodes that are intended for journals and will be managed by Ceph:

chown ceph:ceph /dev/sdb1
chown ceph:ceph /dev/sdb2
chown ceph:ceph /dev/sdb3


And be sure to change the partition type GUID on these partitions, so that udev works correctly and the right device ownership is applied after a reboot. I stepped on this rake when, after a reboot, an OSD node came up but its services went into the failed state, because udev, working exactly as designed, had assigned the default owner and group root:root. As they say, the result exceeded expectations... To prevent this from happening, we do the following:

sgdisk -t 1:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /dev/sdb
# 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 is the GPT partition type GUID for a Ceph journal


After that, from our deployment node we run ceph-deploy disk zap and ceph-deploy osd create. This completes the basic installation of the cluster, and its state can be checked with the ceph -w and ceph osd tree commands.
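
With the ceph-deploy syntax of the Jewel era this looks approximately like this (the device names are illustrative: sdc is a data disk, sdb1 is one of the journal partitions prepared above):

# wipe the data disk and create an OSD on it, pointing it at the journal partition
ceph-deploy disk zap ceph-cod1-osd-n1:sdc
ceph-deploy osd create ceph-cod1-osd-n1:sdc:/dev/sdb1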

But how do we ensure fault tolerance across the data centers?

As it turned out, Ceph has a very powerful tool for this: working with the crushmap.
Several levels of abstraction can be introduced into this map, and I did it in the most primitive way: I introduced the concept of a rack and placed the hosts into racks according to their data center. From that moment on, my data was redistributed so that anything written to one rack necessarily had a replica in the other rack, since the CRUSH algorithm considers keeping both replicas in a single "rack" unreliable. :) That's all: shutting down all the nodes of one data center really did leave the data available. (A sketch of the corresponding CLI commands follows the tree below.)

[cephadmin@ceph-deploy ceph-cluster]$ ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.17200 root default
-8 0.58600 rack ceph-cod1
-2 0.19499 host ceph-cod1-osd-n1
0 0.04900 osd.0 up 1.00000 1.00000
1 0.04900 osd.1 up 1.00000 1.00000
2 0.04900 osd.2 up 1.00000 1.00000
3 0.04900 osd.3 up 1.00000 1.00000
-3 0.19499 host ceph-cod1-osd-n2
4 0.04900 osd.4 up 1.00000 1.00000
5 0.04900 osd.5 up 1.00000 1.00000
6 0.04900 osd.6 up 1.00000 1.00000
7 0.04900 osd.7 up 1.00000 1.00000
-4 0.19499 host ceph-cod1-osd-n3
8 0.04900 osd.8 up 1.00000 1.00000
9 0.04900 osd.9 up 1.00000 1.00000
10 0.04900 osd.10 up 1.00000 1.00000
11 0.04900 osd.11 up 1.00000 1.00000
-9 0.58600 rack ceph-cod2
-5 0.19499 host ceph-cod2-osd-n1
12 0.04900 osd.12 up 1.00000 1.00000
13 0.04900 osd.13 up 1.00000 1.00000
14 0.04900 osd.14 up 1.00000 1.00000
15 0.04900 osd.15 up 1.00000 1.00000
-6 0.19499 host ceph-cod2-osd-n2
16 0.04900 osd.16 up 1.00000 1.00000
17 0.04900 osd.17 up 1.00000 1.00000
18 0.04900 osd.18 up 1.00000 1.00000
19 0.04900 osd.19 up 1.00000 1.00000
-7 0.19499 host ceph-cod2-osd-n3
20 0.04900 osd.20 up 1.00000 1.00000
21 0.04900 osd.21 up 1.00000 1.00000
22 0.04900 osd.22 up 1.00000 1.00000
23 0.04900 osd.23 up 1.00000 1.00000
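
One way to get such a layout without hand-editing the decompiled crushmap is through the CLI; a rough sketch matching the tree above (the rule name is illustrative, and whether a separate rule is needed depends on your chooseleaf settings):

# create rack buckets and hang them under the default root
ceph osd crush add-bucket ceph-cod1 rack
ceph osd crush add-bucket ceph-cod2 rack
ceph osd crush move ceph-cod1 root=default
ceph osd crush move ceph-cod2 root=default
# move the hosts of each data center into "their" rack
ceph osd crush move ceph-cod1-osd-n1 rack=ceph-cod1
ceph osd crush move ceph-cod2-osd-n1 rack=ceph-cod2
# ... and so on for the remaining OSD hosts

# a replicated rule that separates replicas at rack level;
# in Jewel the pool parameter is still called crush_ruleset
ceph osd crush rule create-simple replicate-by-rack default rack
# check the rule's id with "ceph osd crush rule dump" before assigning it
ceph osd pool set default.rgw.buckets.data crush_ruleset 1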


P.S. What would I like to say in conclusion?

In principle, Ceph met my expectations as object storage. Of course, it has its configuration peculiarities, and the way pools are organized for this type of storage is not quite "by the book": here is a pool with data, and over there is a pool with metadata. I will try to cover that separately.

I also tried playing around with block storage. Unfortunately, Ceph cannot export block devices over FC, or at least I did not find a way to do it. As for iSCSI, to me that is not the true and happy path, although it does work, even with MPIO.

Three sources that I found most useful:
the official documentation
an unofficial guide
a cheat sheet

Source: https://habr.com/ru/post/338782/

