We had five racks, ten optical switches, configured BGP, a couple of dozen SSDs and a pile of SAS disks of all colors and sizes, plus Proxmox and a desire to shove all our static data into our own S3 storage. Not that all of this was strictly needed for virtualization, but once you start using open source, you follow your hobby all the way. The only thing that bothered me was BGP. There is nothing in the world more helpless, irresponsible and immoral than internal BGP routing. And I knew that pretty soon we would plunge into it.
The task sounded trivial: there was a CEPH cluster, and it did not work very well. It had to be made "good".
The cluster I inherited was heterogeneous, hastily set up and practically untuned. It consisted of two groups of different nodes, with one common subnet acting as both the cluster and the public network. The nodes were packed with four types of disks: two types of SSD, grouped into two separate placement rules, and two types of HDD of different sizes, grouped into a third one. The problem of different sizes was solved with different OSD weights.
The work itself split into two parts: tuning the operating system and tuning CEPH itself.
High latency hurt both writes and rebalancing. Writes, because the client does not get an acknowledgement of a successful write until the data replicas in the other placement groups confirm it; and since the replica distribution rules in our CRUSH map were one replica per host, the network was involved in every single write.
So the first thing I decided to do was tune the existing network, while in parallel trying to convince everyone to move to separate cluster and public networks.
To begin with, I tweaked the network card settings, starting with the queues.
What we had:
root@ceph01:~# ethtool -l ens1f1
Channel parameters for ens1f1:
Pre-set maximums:
RX:             0
TX:             0
Other:          1
Combined:       63
Current hardware settings:
RX:             0
TX:             0
Other:          1
Combined:       1

root@ceph01:~# ethtool -g ens1f1
Ring parameters for ens1f1:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             256
You can see that the current parameters are far from the maximums. Increased them:
root@ceph01:~# ethtool -G ens1f0 rx 4096
root@ceph01:~# ethtool -G ens1f0 tx 4096
root@ceph01:~# ethtool -L ens1f0 combined 63
Guided by a great article
https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networking-stack-sending-data/
I increased the transmit queue length, txqueuelen, from 1000 to 10000:
root@ceph01:~#ip link set ens1f0 txqueuelen 10000
And, following the Ceph documentation,
https://ceph.com/geen-categorie/ceph-loves-jumbo-frames/
I increased the MTU to 9000:
root@ceph01:~#ip link set dev ens1f0 mtu 9000
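A quick way to verify that jumbo frames actually pass end to end is a ping with the do-not-fragment flag and a payload of 8972 bytes (9000 minus 28 bytes of IP and ICMP headers); the address here is just an example of a neighbouring node, not one taken from the article:

root@ceph01:~# ping -M do -s 8972 -c 3 10.10.10.3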
I added all of the above to /etc/network/interfaces, so that it is applied at boot:
root@ceph01:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto ens1f0
iface ens1f0 inet manual
post-up /sbin/ethtool -G ens1f0 rx 4096
post-up /sbin/ethtool -G ens1f0 tx 4096
post-up /sbin/ethtool -L ens1f0 combined 63
post-up /sbin/ip link set ens1f0 txqueuelen 10000 mtu 9000

auto ens1f1
iface ens1f1 inet manual
post-up /sbin/ethtool -G ens1f1 rx 4096
post-up /sbin/ethtool -G ens1f1 tx 4096
post-up /sbin/ethtool -L ens1f1 combined 63
post-up /sbin/ip link set ens1f1 txqueuelen 10000 mtu 9000
After that, following the same article, I began thoughtfully turning the knobs of the 4.15 kernel. Given that the nodes have 128G of RAM, we ended up with this sysctl configuration file:
net.core.rmem_max = 56623104        # maximum receive socket buffer size, 54M
net.core.wmem_max = 56623104        # maximum send socket buffer size, 54M
net.core.rmem_default = 56623104    # default receive socket buffer size, 54M
net.core.wmem_default = 56623104    # default send socket buffer size, 54M
net.ipv4.tcp_rmem = 4096 87380 56623104    # min / default / max TCP receive buffer
net.ipv4.tcp_wmem = 4096 65536 56623104    # min / default / max TCP send buffer
net.core.somaxconn = 5000           # maximum length of the listen() backlog
net.ipv4.tcp_timestamps = 1         # enable TCP timestamps (RFC 1323)
net.ipv4.tcp_sack = 1               # enable TCP selective acknowledgements
net.core.netdev_max_backlog = 5000  # packets queued per CPU when the NIC receives faster
                                    # than the kernel can process them (default 1000)
net.ipv4.tcp_max_tw_buckets = 262144    # maximum number of sockets held in TIME-WAIT
net.ipv4.tcp_tw_reuse = 1           # allow reusing TIME-WAIT sockets for new connections
net.core.optmem_max = 4194304       # maximum ancillary buffer size per socket
net.ipv4.tcp_low_latency = 1        # prefer low latency over higher throughput
net.ipv4.tcp_adv_win_scale = 1      # share of the receive buffer reserved for overhead
                                    # versus the advertised TCP window
net.ipv4.tcp_slow_start_after_idle = 0  # do not collapse the congestion window after idle
net.ipv4.tcp_no_metrics_save = 1    # do not cache TCP metrics of closed connections
net.ipv4.tcp_syncookies = 0         # disable syncookies
net.ipv4.tcp_ecn = 0                # disable Explicit Congestion Notification
net.ipv4.conf.all.send_redirects = 0    # do not send ICMP redirects
net.ipv4.ip_forward = 0             # the node is not a router, disable forwarding
net.ipv4.icmp_echo_ignore_broadcasts = 1    # ignore broadcast ICMP ECHO requests
net.ipv4.tcp_fin_timeout = 10       # how long a socket stays in FIN-WAIT-2 (default 60)
net.core.netdev_budget = 600        # packets processed per SoftIRQ cycle (default 300)
net.ipv4.tcp_fastopen = 3           # enable TCP Fast Open on both the client and server side
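To apply these settings right away and have them survive a reboot, the file just needs to live under /etc/sysctl.d/ (the file name below is an example, not one from the article):

root@ceph01:~# cp ceph-tuning.conf /etc/sysctl.d/80-ceph-tuning.conf
root@ceph01:~# sysctl --system      # reloads every sysctl configuration file, including the new one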
The cluster network was given its own 10Gbps interfaces in a separate flat network. Each machine has a dual-port Mellanox 10/25 Gbps card, plugged into two separate 10Gbps switches. Aggregation was done with OSPF, since bonding with LACP for some reason showed a total bandwidth of 16 Gbps at most, while OSPF successfully used both tens on each machine. Future plans were to use RoCE on these Mellanoxes to squeeze the latency down further. How this part of the network was set up:
More details on OSPF configuration: The main task is to aggregate two links and have fault tolerance.
Two network interfaces are configured in two simple flat networks - 10.10.10.0/24 and 10.10.20.0/24
1: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    inet 10.10.10.2/24 brd 10.10.10.255 scope global ens1f0
2: ens1f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    inet 10.10.20.2/24 brd 10.10.20.255 scope global ens1f1
through which the machines see each other.
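The routing daemon config itself is not shown in the article, so below is only a minimal sketch of what it could look like with FRR (or Quagga); the per-host loopback /32 and the idea of binding Ceph traffic to it are my assumptions, not the author's documented setup:

interface lo
 ip address 10.10.30.2/32
!
router ospf
 ospf router-id 10.10.30.2
 network 10.10.10.0/24 area 0.0.0.0
 network 10.10.20.0/24 area 0.0.0.0
 network 10.10.30.2/32 area 0.0.0.0

With every node announcing such a loopback into area 0, each peer is reachable via two equal-cost routes (one per /24), so traffic between the loopbacks is ECMP-balanced across both 10G links and survives the loss of either one.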
The next step was to optimize disk performance. For SSDs I changed the scheduler to noop, for HDDs to deadline. Roughly speaking, NOOP works on the "first come, first served" principle, which in English simply means FIFO (First In, First Out): requests are queued in the order they arrive. DEADLINE is more read-oriented, and a process in the queue gets almost exclusive access to the disk for the duration of its operation. This suits our setup perfectly, since only one process works with each disk: the OSD daemon.
(Those who wish to dive into the I/O schedulers can read about them here:
http://www.admin-magazine.com/HPC/Articles/Linux-IO-Schedulers
For those who prefer to read in Russian: https://www.opennet.ru/base/sys/linux_shedulers.txt.html )
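For reference, a sketch of how the scheduler is switched on the fly (the device names are examples; to make the change survive a reboot you would normally put this into a udev rule or the elevator= kernel parameter):

root@ceph01:~# echo noop > /sys/block/sdb/queue/scheduler       # an SSD
root@ceph01:~# echo deadline > /sys/block/sdc/queue/scheduler   # an HDD
root@ceph01:~# cat /sys/block/sdc/queue/scheduler               # the active scheduler is shown in brackets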
Linux tuning recommendations also advise increasing nr_requests:

nr_requests

The value of nr_requests determines how many I/O requests get buffered before the I/O scheduler sends them to the block device; some guides suggest setting it to twice the queue depth of the device, to give the scheduler more requests to reorder.
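As an illustration of that advice (the device name and the queue depth of 128 are assumptions for the example, not values from this cluster):

root@ceph01:~# cat /sys/block/sdc/device/queue_depth        # suppose this reports 128
root@ceph01:~# echo 256 > /sys/block/sdc/queue/nr_requests  # 2 x the queue depth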
BUT! The Ceph developers themselves convince us that their internal priority system works better:

WBThrottle and/or nr_requests
Filestore uses buffered I/O for writes; this brings a number of advantages when the filestore journal sits on faster media. Client requests are acknowledged as soon as the data hits the journal, and the data is flushed to the data disk itself later using standard Linux mechanisms. This makes it possible for spindle OSDs to provide write latency similar to SSDs for small bursts of writes. The delayed write-back also lets the kernel itself reorder the I/O requests to disk, in the hope of either merging them or letting the existing disk heads pick a more optimal path across the platters. The net effect is that you can squeeze slightly more I/O out of each disk than would be possible with direct or synchronous I/O.
However, a problem arises when the volume of incoming writes to the Ceph cluster outruns the capabilities of the underlying disks. In that scenario the total number of I/O operations waiting to be written to disk can grow uncontrollably, and the I/O queue fills up the whole disk and the Ceph queues. Read requests suffer in particular, as they get stuck behind write requests that can take several seconds to flush to the main disk.
To defeat this problem, Ceph has a write-back throttling mechanism built into filestore, called WBThrottle. It is designed to limit the total number of lazy-write I/O operations that can queue up, and to start the flush earlier than the kernel itself would have done. Unfortunately, testing shows that the default values may still not cut the behaviour down to a level that reduces the impact on read latency. Tuning can change this, shorten the overall write queues and make the impact less severe. There is a trade-off, however: by reducing the maximum number of entries allowed to be queued, you reduce the kernel's ability to maximize its efficiency at reordering incoming requests. It is worth thinking a little about what matters more for your particular application and workloads, and tuning accordingly.
To control the depth of this write-back queue, you can either reduce the overall maximum number of outstanding I/O operations with the WBThrottle settings, or lower the maximum number of outstanding operations at the block layer of your kernel. Both effectively control the same behaviour, and your preferences decide which one you implement.
It should also be noted that Ceph's operation priority system is more efficient with shorter queues at the disk level. By shrinking the queue to a given disk, the primary queueing point moves up into Ceph, where it has more control over which I/O operation gets priority. Consider the following example:
echo 8 > /sys/block/sda/queue/nr_requests
http://onreader.mdl.ru/MasteringCeph/content/Ch09.html#030202
And a few more kernel tweaks to make your box soft and silky and squeeze a bit more performance out of the hardware:
kernel.pid_max = 4194303            # maximum PID value, i.e. more processes allowed
kernel.threads-max = 2097152        # system-wide limit on the number of threads
vm.max_map_count = 524288           # maximum number of memory map areas a process may have
                                    # (malloc, mmap, mprotect and madvise all create them)
fs.aio-max-nr = 50000000            # maximum number of concurrent asynchronous I/O requests
vm.min_free_kbytes = 1048576        # keep 1G of RAM free so the OOM killer does not come for the OSDs
vm.swappiness = 10                  # start swapping only reluctantly; with 128G of RAM
                                    # the default of 60 is far too eager
vm.vfs_cache_pressure = 1000        # reclaim the dentry/inode caches much more aggressively
                                    # than the default of 100
vm.zone_reclaim_mode = 0            # disable NUMA zone reclaim
vm.dirty_ratio = 20                 # share of RAM that may hold dirty pages before writers
                                    # are blocked; 20% of 128G is about 25.6G
vm.dirty_background_ratio = 3       # share of dirty pages at which pdflush/flush/kdmflush
                                    # starts background writeback
fs.file-max = 524288                # maximum number of open file handles
Ceph settings I would like to dwell on separately:
osd:
  journal_aio: true                 # use asynchronous I/O for the journal
  journal_block_align: true         # align journal I/O to block boundaries
  journal_dio: true                 # use direct I/O for the journal
  journal_max_write_bytes: 1073714824
  journal_max_write_entries: 10000
  journal_queue_max_bytes: 10485760000
  journal_queue_max_ops: 50000
  rocksdb_separate_wal_dir: true    # keep the RocksDB WAL in a separate directory,
                                    # with an eye on moving it to NVMe later
  bluestore_block_db_create: true
  bluestore_block_db_size: '5368709120 #5G'
  bluestore_block_wal_create: true
  bluestore_block_wal_size: '1073741824 #1G'
  bluestore_cache_size_hdd: '3221225472 # 3G'
  bluestore_cache_size_ssd: '9663676416 # 9G'
  keyring: /var/lib/ceph/osd/ceph-$id/keyring
  osd_client_message_size_cap: '1073741824 #1G'
  osd_disk_thread_ioprio_class: idle
  osd_disk_thread_ioprio_priority: 7
  osd_disk_threads: 2
  osd_failsafe_full_ratio: 0.95
  osd_heartbeat_grace: 5
  osd_heartbeat_interval: 3
  osd_map_dedup: true
  osd_max_backfills: 2              # how many backfills may run to or from a single OSD
  osd_max_write_size: 256
  osd_mon_heartbeat_interval: 5
  osd_op_threads: 16
  osd_op_num_threads_per_shard: 1
  osd_op_num_threads_per_shard_hdd: 2
  osd_op_num_threads_per_shard_ssd: 2
  osd_pool_default_min_size: 1      # minimum number of replicas needed to acknowledge a write
  osd_pool_default_size: 2          # default number of replicas per object
  osd_recovery_delay_start: 10.000000
  osd_recovery_max_active: 2
  osd_recovery_max_chunk: 1048576
  osd_recovery_max_single_start: 3
  osd_recovery_op_priority: 1
  osd_recovery_priority: 1          # keep recovery priority low so client I/O wins
  osd_recovery_sleep: 2
  osd_scrub_chunk_max: 4
Some of the parameters that were tested on QA with version 12.2.12 are missing from Ceph 12.2.2, for example osd_recovery_threads. Therefore, the plans included upgrading production to 12.2.12. Practice showed that versions 12.2.2 and 12.2.12 are compatible within a single cluster, which makes a rolling update possible.
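The rolling update itself is not described in the article, so here is only a rough sketch of how it is usually done, node by node (package and unit names assume a Debian-style install with systemd; adjust to your environment):

root@ceph01:~# ceph osd set noout                  # keep CRUSH from marking restarting OSDs out
root@ceph01:~# apt-get update && apt-get install -y ceph ceph-osd ceph-mon ceph-mgr
root@ceph01:~# systemctl restart ceph-mon.target ceph-mgr.target ceph-osd.target
root@ceph01:~# ceph -s                             # wait for HEALTH_OK before touching the next node
root@ceph01:~# ceph osd unset noout                # once the last node is done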
Naturally, for testing we needed the same version as in production, but at the time I started working with the cluster only the newer one was available in the repository. Since the differences within the minor version are not that big (1393 lines in the configs against 1436 in the new version), we decided to start testing the new one (it would have to be updated anyway, so why cling to the old junk).
The only thing we tried to keep at the old version was the ceph-deploy package, because some of the utilities (and some of the employees) were tailored to its syntax. The new version was quite different, but it had no effect on the operation of the cluster itself, so we left it at version 1.5.39.
Since the ceph-disk command explicitly says that it is deprecated and asks us, dear friends, to use ceph-volume instead, we started creating OSDs with that command, without wasting time on the outdated one.
The plan was to build a mirror out of two SSD disks and place the OSD journals on it, while the OSDs themselves live on spindle SAS drives. That way we protect ourselves against data loss when a disk holding the journal fails.
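The mirror itself is not shown in the article; a minimal sketch of how such an md RAID1 is usually assembled (device names here are examples, not the actual disks from this cluster):

root@ceph01-qa:~# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
root@ceph01-qa:~# mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # so the array is assembled at boot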
We started building the cluster following the documentation:
root@ceph01-qa:~# cat /etc/ceph/ceph.conf

[client]
rbd_cache = true
rbd_cache_max_dirty = 50331648
rbd_cache_max_dirty_age = 2
rbd_cache_size = 67108864
rbd_cache_target_dirty = 33554432
rbd_cache_writethrough_until_flush = true
rbd_concurrent_management_ops = 10
rbd_default_format = 2

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster network = 10.10.10.0/24
debug_asok = 0/0
debug_auth = 0/0
debug_buffer = 0/0
debug_client = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_filer = 0/0
debug_filestore = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_journal = 0/0
debug_journaler = 0/0
debug_lockdep = 0/0
debug_mon = 0/0
debug_monc = 0/0
debug_ms = 0/0
debug_objclass = 0/0
debug_objectcatcher = 0/0
debug_objecter = 0/0
debug_optracker = 0/0
debug_osd = 0/0
debug_paxos = 0/0
debug_perfcounter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_rgw = 0/0
debug_throttle = 0/0
debug_timer = 0/0
debug_tp = 0/0
fsid = d0000000d-4000-4b00-b00b-0123qwe123qwf9
mon_host = ceph01-q, ceph02-q, ceph03-q
mon_initial_members = ceph01-q, ceph02-q, ceph03-q
public network = 8.8.8.8/28       # obviously not the real one :)
rgw_dns_name = s3-qa.mycompany.ru
rgw_host = s3-qa.mycompany.ru

[mon]
mon allow pool delete = true
mon_max_pg_per_osd = 300          # raised so the PG-per-OSD limit is not hit while the
                                  # cluster is still being populated with OSDs
mon_osd_backfillfull_ratio = 0.9
mon_osd_down_out_interval = 5
mon_osd_full_ratio = 0.95         # keep in mind that part of every SSD is already carved out
                                  # for the db partitions (bluestore_block_db_size), so the
                                  # full ratios need some headroom
mon_osd_nearfull_ratio = 0.9
mon_pg_warn_max_per_osd = 520

[osd]
bluestore_block_db_create = true
bluestore_block_db_size = 5368709120 #5G
bluestore_block_wal_create = true
bluestore_block_wal_size = 1073741824 #1G
bluestore_cache_size_hdd = 3221225472 # 3G
bluestore_cache_size_ssd = 9663676416 # 9G
journal_aio = true
journal_block_align = true
journal_dio = true
journal_max_write_bytes = 1073714824
journal_max_write_entries = 10000
journal_queue_max_bytes = 10485760000
journal_queue_max_ops = 50000
keyring = /var/lib/ceph/osd/ceph-$id/keyring
osd_client_message_size_cap = 1073741824 #1G
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7
osd_disk_threads = 2
osd_failsafe_full_ratio = 0.95
osd_heartbeat_grace = 5
osd_heartbeat_interval = 3
osd_map_dedup = true
osd_max_backfills = 4
osd_max_write_size = 256
osd_mon_heartbeat_interval = 5
osd_op_num_threads_per_shard = 1
osd_op_num_threads_per_shard_hdd = 2
osd_op_num_threads_per_shard_ssd = 2
osd_op_threads = 16
osd_pool_default_min_size = 1
osd_pool_default_size = 2
osd_recovery_delay_start = 10.0
osd_recovery_max_active = 1
osd_recovery_max_chunk = 1048576
osd_recovery_max_single_start = 3
osd_recovery_op_priority = 1
osd_recovery_priority = 1
osd_recovery_sleep = 2
osd_scrub_chunk_max = 4
osd_scrub_chunk_min = 2
osd_scrub_sleep = 0.1
rocksdb_separate_wal_dir = true
# create a monitor on the first node
root@ceph01-qa:~# ceph-deploy mon create ceph01-q
# collect the keys
root@ceph01-qa:~# ceph-deploy gatherkeys ceph01-q
# the remaining monitors come up on their own, since the config already has
# mon_initial_members = ceph01-q, ceph02-q, ceph03-q
root@ceph01-qa:~# ceph-deploy mon create-initial
# put the bootstrap keys where the daemons expect them
root@ceph01-qa:~# cat ceph.bootstrap-osd.keyring > /var/lib/ceph/bootstrap-osd/ceph.keyring
root@ceph01-qa:~# cat ceph.bootstrap-mgr.keyring > /var/lib/ceph/bootstrap-mgr/ceph.keyring
root@ceph01-qa:~# cat ceph.bootstrap-rgw.keyring > /var/lib/ceph/bootstrap-rgw/ceph.keyring
# push the admin keyring and the config to the node
root@ceph01-qa:~# ceph-deploy admin ceph01-q
# and, finally, create the manager daemon
root@ceph01-qa:~# ceph-deploy mgr create ceph01-q
The first thing this version of ceph-deploy and a 12.2.12 cluster stumbled over was an error when trying to create an OSD with its db on a software RAID:
root@ceph01-qa:~# ceph-volume lvm create --bluestore --data /dev/sde --block.db /dev/md0
blkid could not detect a PARTUUID for device: /dev/md1
Indeed, blkid could not detect a PARTUUID, so the partitions had to be created by hand:
# label the mirror with GPT first
root@ceph01-qa:~# parted /dev/md0 mklabel GPT
# then carve out partitions sized to match bluestore_block_db_size: '5368709120 #5G'
# there are 20 OSDs per node, so 20 partitions are needed
root@ceph01-qa:~# for i in {1..20}; do echo -e "n\n\n\n+5G\nw" | fdisk /dev/md0; done
With the partitions in place, we moved on to creating the OSDs (test ones, nothing to lose yet) and hit the next quirk: when a BlueStore OSD is created with a db device but without an explicitly specified WAL device, it fails like this:
root@ceph01-qa:~# ceph-volume lvm create --bluestore --data /dev/sde --block.db /dev/md0
 stderr: 2019-04-12 10:39:27.211242 7eff461b6e00 -1 bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid
 stderr: 2019-04-12 10:39:27.213185 7eff461b6e00 -1 bdev(0x55824c273680 /var/lib/ceph/osd/ceph-0//block.wal) open open got: (22) Invalid argument
 stderr: 2019-04-12 10:39:27.213201 7eff461b6e00 -1 bluestore(/var/lib/ceph/osd/ceph-0/) _open_db add block device(/var/lib/ceph/osd/ceph-0//block.wal) returned: (22) Invalid argument
 stderr: 2019-04-12 10:39:27.999039 7eff461b6e00 -1 bluestore(/var/lib/ceph/osd/ceph-0/) mkfs failed, (22) Invalid argument
 stderr: 2019-04-12 10:39:27.999057 7eff461b6e00 -1 OSD::mkfs: ObjectStore::mkfs failed with error (22) Invalid argument
 stderr: 2019-04-12 10:39:27.999141 7eff461b6e00 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0/: (22) Invalid argument
Neither the documentation nor the mailing lists (at the time of writing) offered a sane explanation for this error; the only thing that helped was giving the OSD its own WAL partition explicitly (even though, when no WAL is specified, it is supposed to simply live together with the db).
Since the plan was to move the WAL to NVMe eventually anyway, creating a separate partition for it did not turn out to be wasted effort.
root@ceph01-qa:~#ceph-volume lvm create --bluestore --data /dev/sdf --block.wal /dev/md0p2 --block.db /dev/md1p2
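To double-check which data, db and WAL devices each OSD actually ended up with, ceph-volume can list them:

root@ceph01-qa:~# ceph-volume lvm list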
This is how all the OSDs were created. A reminder: the journal/db partitions live on the SSD mirror, while the OSDs themselves sit on the SAS spindles.
With 20 OSDs assembled on each node, we ended up with the following picture:
root@eph01-q:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 14.54799 root default
-3 9.09200 host ceph01-q
0 ssd 1.00000 osd.0 up 1.00000 1.00000
1 ssd 1.00000 osd.1 up 1.00000 1.00000
2 ssd 1.00000 osd.2 up 1.00000 1.00000
3 ssd 1.00000 osd.3 up 1.00000 1.00000
4 hdd 1.00000 osd.4 up 1.00000 1.00000
5 hdd 0.27299 osd.5 up 1.00000 1.00000
6 hdd 0.27299 osd.6 up 1.00000 1.00000
7 hdd 0.27299 osd.7 up 1.00000 1.00000
8 hdd 0.27299 osd.8 up 1.00000 1.00000
9 hdd 0.27299 osd.9 up 1.00000 1.00000
10 hdd 0.27299 osd.10 up 1.00000 1.00000
11 hdd 0.27299 osd.11 up 1.00000 1.00000
12 hdd 0.27299 osd.12 up 1.00000 1.00000
13 hdd 0.27299 osd.13 up 1.00000 1.00000
14 hdd 0.27299 osd.14 up 1.00000 1.00000
15 hdd 0.27299 osd.15 up 1.00000 1.00000
16 hdd 0.27299 osd.16 up 1.00000 1.00000
17 hdd 0.27299 osd.17 up 1.00000 1.00000
18 hdd 0.27299 osd.18 up 1.00000 1.00000
19 hdd 0.27299 osd.19 up 1.00000 1.00000
-5 5.45599 host ceph02-q
20 ssd 0.27299 osd.20 up 1.00000 1.00000
21 ssd 0.27299 osd.21 up 1.00000 1.00000
22 ssd 0.27299 osd.22 up 1.00000 1.00000
23 ssd 0.27299 osd.23 up 1.00000 1.00000
24 hdd 0.27299 osd.24 up 1.00000 1.00000
25 hdd 0.27299 osd.25 up 1.00000 1.00000
26 hdd 0.27299 osd.26 up 1.00000 1.00000
27 hdd 0.27299 osd.27 up 1.00000 1.00000
28 hdd 0.27299 osd.28 up 1.00000 1.00000
29 hdd 0.27299 osd.29 up 1.00000 1.00000
30 hdd 0.27299 osd.30 up 1.00000 1.00000
31 hdd 0.27299 osd.31 up 1.00000 1.00000
32 hdd 0.27299 osd.32 up 1.00000 1.00000
33 hdd 0.27299 osd.33 up 1.00000 1.00000
34 hdd 0.27299 osd.34 up 1.00000 1.00000
35 hdd 0.27299 osd.35 up 1.00000 1.00000
36 hdd 0.27299 osd.36 up 1.00000 1.00000
37 hdd 0.27299 osd.37 up 1.00000 1.00000
38 hdd 0.27299 osd.38 up 1.00000 1.00000
39 hdd 0.27299 osd.39 up 1.00000 1.00000
-7 6.08690 host ceph03-q
40 ssd 0.27299 osd.40 up 1.00000 1.00000
41 ssd 0.27299 osd.41 up 1.00000 1.00000
42 ssd 0.27299 osd.42 up 1.00000 1.00000
43 ssd 0.27299 osd.43 up 1.00000 1.00000
44 hdd 0.27299 osd.44 up 1.00000 1.00000
45 hdd 0.27299 osd.45 up 1.00000 1.00000
46 hdd 0.27299 osd.46 up 1.00000 1.00000
47 hdd 0.27299 osd.47 up 1.00000 1.00000
48 hdd 0.27299 osd.48 up 1.00000 1.00000
49 hdd 0.27299 osd.49 up 1.00000 1.00000
50 hdd 0.27299 osd.50 up 1.00000 1.00000
51 hdd 0.27299 osd.51 up 1.00000 1.00000
52 hdd 0.27299 osd.52 up 1.00000 1.00000
53 hdd 0.27299 osd.53 up 1.00000 1.00000
54 hdd 0.27299 osd.54 up 1.00000 1.00000
55 hdd 0.27299 osd.55 up 1.00000 1.00000
56 hdd 0.27299 osd.56 up 1.00000 1.00000
57 hdd 0.27299 osd.57 up 1.00000 1.00000
58 hdd 0.27299 osd.58 up 1.00000 1.00000
59 hdd 0.89999 osd.59 up 1.00000 1.00000
Your own hierarchy (racks, hosts, and the OSDs inside them) is built with commands like these:
root@ceph01-q:~# ceph osd crush add-bucket rack01 root          # create a new bucket of type root
root@ceph01-q:~# ceph osd crush add-bucket ceph01-q host        # create a new host bucket
root@ceph01-q:~# ceph osd crush move ceph01-q root=rack01       # move the host under the new root
root@ceph01-q:~# ceph osd crush add 28 1.0 host=ceph02-q        # add an OSD with weight 1.0 to a host
# and to take things back out of the tree:
root@ceph01-q:~# ceph osd crush remove osd.4
root@ceph01-q:~# ceph osd crush remove rack01
A word of caution: in this version of Ceph the command ceph osd crush move ceph01-host root=rack01 could simply hang, and CTRL+C was the only way out. The issue is described here:
https://tracker.ceph.com/issues/23386
Another thing that had to be done was removing the default rule replicated_ruleset from the crushmap, which meant editing it by hand:
root@ceph01-prod:~# ceph osd getcrushmap -o crushmap.row           # dump the current crushmap to a file
root@ceph01-prod:~# crushtool -d crushmap.row -o crushmap.txt      # decompile it into readable form
root@ceph01-prod:~# vim crushmap.txt                               # delete the rule replicated_ruleset by hand
root@ceph01-prod:~# crushtool -c crushmap.txt -o new_crushmap.row  # compile it back
root@ceph01-prod:~# ceph osd setcrushmap -i new_crushmap.row       # load the edited crushmap into the cluster
Caution: these operations trigger a rebalance of placement groups between OSDs. On a large cluster it takes a while, so plan for it.
Another peculiarity worth knowing: a freshly created OSD does not land where you want it by itself, it lands in the root default bucket.
So the idea was to build a separate root for ssd, move the hosts under it, do the same for hdd, attach the OSDs to the proper hosts and finally drop the root default.
Step by step it went like this.
Create the roots — one for ssd and one for hdd:
root@ceph01-q:~# ceph osd crush add-bucket ssd-root root
root@ceph01-q:~# ceph osd crush add-bucket hdd-root root

Then the racks and the hosts inside them:
# create the racks:
root@ceph01-q:~# ceph osd crush add-bucket ssd-rack01 rack
root@ceph01-q:~# ceph osd crush add-bucket ssd-rack02 rack
root@ceph01-q:~# ceph osd crush add-bucket ssd-rack03 rack
root@ceph01-q:~# ceph osd crush add-bucket hdd-rack01 rack
root@ceph01-q:~# ceph osd crush add-bucket hdd-rack02 rack
root@ceph01-q:~# ceph osd crush add-bucket hdd-rack03 rack
# create the hosts:
root@ceph01-q:~# ceph osd crush add-bucket ssd-ceph01-q host
root@ceph01-q:~# ceph osd crush add-bucket ssd-ceph02-q host
root@ceph01-q:~# ceph osd crush add-bucket ssd-ceph03-q host
root@ceph01-q:~# ceph osd crush add-bucket hdd-ceph01-q host
root@ceph01-q:~# ceph osd crush add-bucket hdd-ceph02-q host
root@ceph01-q:~# ceph osd crush add-bucket hdd-ceph03-q host
# add OSDs 0 through 3 (the SSDs of host ceph01-q) to the ssd-ceph01-q bucket
root@ceph01-q:~# ceph osd crush add 0 1 host=ssd-ceph01-q
root@ceph01-q:~# ceph osd crush add 1 1 host=ssd-ceph01-q
root@ceph01-q:~# ceph osd crush add 2 1 host=ssd-ceph01-q
root@ceph01-q:~# ceph osd crush add 3 1 host=ssd-ceph01-q
# ... and the same for the remaining OSDs and hosts
Having moved everything into ssd-root and hdd-root, the root default bucket is left empty and can be deleted:
root-ceph01-q:~#ceph osd crush remove default
Next, we create the distribution rules that will later be bound to the pools: a rule defines which root the pool's data goes to and how the replicas are spread (for example, across hosts or across racks). Before choosing, it is worth reading the documentation:
http://docs.ceph.com/docs/jewel/rados/operations/crush-map/#crushmaprules
root-ceph01-q:~# ceph osd crush rule create-simple rule-ssd ssd-root host firstn
root-ceph01-q:~# ceph osd crush rule create-simple rule-hdd hdd-root host firstn
# the last two arguments set the failure domain: here replicas are spread across hosts.
# in the long run, once the data is spread over several racks, it makes sense to recreate
# the rule with the rack failure domain, for example:
# ceph osd crush rule create-simple rule-ssd ssd-root rack firstn
Finally, we create the pools that will later be hooked up to PROXMOX:
# ceph osd pool create {NAME} {pg_num} {pgp_num}
root-ceph01-q:~# ceph osd pool create ssd_pool 1024 1024
root-ceph01-q:~# ceph osd pool create hdd_pool 1024 1024
root-ceph01-q:~# ceph osd crush rule ls                            # list the rules
root-ceph01-q:~# ceph osd crush rule dump rule-ssd | grep rule_id  # find out the rule ID
root-ceph01-q:~# ceph osd pool set ssd_pool crush_rule 2           # bind the pool to that rule
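The hdd pool gets bound to its rule the same way; the rule id below is only an example — use whatever ceph osd crush rule dump rule-hdd reports on your cluster:

root-ceph01-q:~# ceph osd crush rule dump rule-hdd | grep rule_id
root-ceph01-q:~# ceph osd pool set hdd_pool crush_rule 3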
When creating a pool, do not forget that the number of placement groups (pg_num) has to be chosen in advance, based on how much data is expected to land in it and on how many OSDs will serve it. There is a ceiling of roughly 300 PGs per OSD (the mon_max_pg_per_osd setting above), and in this version the number of placement groups can later only be increased, never decreased.
The PG calculator (linked below) helps with the math.
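For a rough sense of where numbers like 1024 come from: a commonly used rule of thumb behind the calculator is pg_num per pool ≈ (number of OSDs serving the pool × 100) / replica count, rounded to a power of two. For the 48 HDD OSDs in the tree above with two replicas that gives 48 × 100 / 2 = 2400, so about 2048; for the 12 SSD OSDs it gives 12 × 100 / 2 = 600, so 512 or 1024. This is only a ballpark; the calculator does the exact math.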
Useful links:
https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networking-stack-sending-data
http://www.admin-magazine.com/HPC/Articles/Linux-IO-Schedulers
http://onreader.mdl.ru/MasteringCeph/content/Ch09.html#030202
https://tracker.ceph.com/issues/23386
https://ceph.com/pgcalc/
Source: https://habr.com/ru/post/456446/