We have often written about cluster-in-a-box solutions and Windows Storage Spaces, and the most common question is: how do you choose the right disk configuration?
The conversation usually starts like this:
- I want 100500 IOPS!
or:
- I want 20 terabytes.
Then it turns out that not that many IOPS are actually needed and a few SSDs would do, while the 20 terabytes (4 disks these days) are expected to perform well too, so in practice the answer is tiered storage.
How to approach this correctly and plan ahead?
You need to answer several key questions in sequence, for example:
- The priority of data integrity protection (for example, in case of disk failure).
- Performance is required, but not at the cost of protection.
- Moderate capacity requirement.
- The size of the budget for building a mixed solution.
- Ease of management and monitoring.
The process of creating storage on Storage Spaces
Main technologies:
- Storage Spaces - virtualized storage: fault-tolerant and high-performance.
- Failover Clustering - High availability storage access.
- Scale-Out File Server (SOFS) and Cluster Shared Volumes (CSV) - scalable and unified access to the storage.
- SMB3 - fault-tolerant, high-performance access protocol using SMB Multichannel, SMB Direct, and SMB client redirection.
- System Center, PowerShell, and In-box Windows Tooling - Manage / Configure / Maintain.
Advantages of Storage Spaces: a flexible and inexpensive data storage solution built entirely on commodity hardware and Microsoft software.
The material assumes familiarity with the basics of Storage Spaces (available here and here).
Steps:
1. Decision evaluation - determine the key requirements for the solution and the characteristics of a successful installation.
2. SDS design (preliminary) - select the building blocks (hardware and software) that meet the requirements, choose the topology and configurations.
3. Test deployment - verify the chosen solution in a real environment, possibly at a reduced scale (Proof-of-Concept, PoC).
4. Validation - check compliance with the requirements from step 1. Usually starts with a synthetic load (SQLIO, Iometer, etc.); the final check is carried out under a real workload.
5. Optimization - based on the results of the previous steps and the difficulties revealed, adjust and optimize the solution (add/remove/replace hardware blocks, modify the topology, reconfigure the software, etc.), then return to step 4.
6. Deployment (in the production environment) - once the initial design has been optimized to meet the stated requirements, the final version is rolled out into the production environment.
7. Operation - the working phase: monitoring, troubleshooting, upgrading, scaling, etc.
The first step was described above.
Defining the Software Defined Storage architecture (preliminary)
Highlights:
- Tiering.
- Counting the number of HDDs and SSDs.
- Optimizing the SSD:HDD ratio.
- Determining the required number of disk shelves.
- Requirements for SAS HBAs and the cable infrastructure.
- Determining the number of storage servers and their configuration.
- Determining the number of disk pools (Pools).
- Configuring the pools.
- Counting the number of virtual disks.
- Defining the virtual disk configurations.
- Calculating the optimal virtual disk size.
General principles
You must install all updates and patches (as well as firmware).
Try to make the configuration symmetrical and complete.
Consider possible failures and create a plan to eliminate the consequences.
Why firmware is important
Key requirements: tiering
Tiered storage significantly increases the overall performance of the storage system. However, SSDs have a relatively small capacity and a high price per gigabyte, and they somewhat increase the complexity of managing the storage.
Usually, tiering is used anyway, with the "hot" data placed on the fast SSD tier.
Key requirements: type and number of hard drives
The type, capacity and number of disks should reflect the desired capacity of the storage system. SAS and NL-SAS drives are supported, but expensive 10/15K SAS drives are rarely needed when tiering with an SSD tier is used.
A typical choice*: 2-4 TB NL-SAS, a single model from one manufacturer with the latest stable firmware version.
* Typical values reflect current recommendations for virtualized workloads. Other load types require additional work.
Capacity calculation:

Where:
a) Number of hard drives.
b) The number of disks is rounded up to the nearest even value.
c) Typical data block size (in gigabytes).
d) Reserve for capacity expansion.
e) Resiliency overhead: two-way mirror, three-way mirror, parity and other layouts provide different margins of durability and consume different amounts of raw capacity.
f) Each disk in Storage Spaces stores a copy of the metadata (about pools and virtual disks); if needed, also reserve space for the Fast Rebuild technology.
g) Disk capacity in gigabytes.
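These variables can be put together in a short PowerShell sketch. It reflects one reading of the formula (data size scaled by the expansion reserve and the number of data copies, divided by the usable capacity of a single disk, rounded up to an even number); all figures, including the per-disk reserve, are placeholders rather than values from the article.

```powershell
# Capacity-based HDD count (sketch); all inputs are example values.
$dataSizeGB       = 40000   # c) total data to store, e.g. 1000 VMs x 40 GB
$expansionReserve = 0.15    # d) reserve for capacity expansion
$dataCopies       = 3       # e) resiliency overhead: 2 for a two-way mirror, 3 for a three-way mirror
$diskSizeGB       = 4000    # g) raw capacity of one HDD
$perDiskReserveGB = 8       # f) rough allowance for metadata / Fast Rebuild per disk (illustrative)

$rawNeededGB = $dataSizeGB * (1 + $expansionReserve) * $dataCopies
$hddCount    = [math]::Ceiling($rawNeededGB / ($diskSizeGB - $perDiskReserveGB))
if ($hddCount % 2) { $hddCount++ }   # b) round up to an even number

"HDDs required by capacity: $hddCount"
```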
Performance calculation:

Where:
a) Number of hard drives.
b) The number of disks is rounded up to the nearest even value.
c) The number of IOPS planned to be obtained from the HDD tier.
d) Write percentage.
e) Write penalty, which differs by resiliency type and must be taken into account for the system as a whole (for example, with a two-way mirror one write operation causes 2 writes to disk, with a three-way mirror already 3).
f) Estimated per-disk write performance (IOPS).
g) Estimated per-disk read performance (IOPS).
* These calculations only provide a starting point for system planning.
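A similar sketch for the performance estimate: reads are divided by the per-disk read IOPS, writes are multiplied by the write penalty of the chosen mirror type and divided by the per-disk write IOPS, and the result is rounded up to an even number. The function name and the sample figures are ours for illustration; the same calculation applies to the SSD tier in the next section, just with SSD figures.

```powershell
# Performance-based disk count (sketch); works for the HDD or SSD tier alike.
function Get-DiskCountByIops {
    param(
        [double]$TargetIops,       # c) IOPS expected from this tier
        [double]$WriteFraction,    # d) share of writes, 0..1
        [int]   $WritePenalty,     # e) 2 for a two-way mirror, 3 for a three-way mirror
        [double]$WriteIopsPerDisk, # f) estimated per-disk write IOPS
        [double]$ReadIopsPerDisk   # g) estimated per-disk read IOPS
    )
    $readLoad  = $TargetIops * (1 - $WriteFraction) / $ReadIopsPerDisk
    $writeLoad = $TargetIops * $WriteFraction * $WritePenalty / $WriteIopsPerDisk
    $count = [math]::Ceiling($readLoad + $writeLoad)
    if ($count % 2) { $count++ }   # b) round up to an even number
    return $count
}

# Example: 10K IOPS from the HDD tier, 40% writes, three-way mirror, 140/130 R/W IOPS per disk
Get-DiskCountByIops -TargetIops 10000 -WriteFraction 0.4 -WritePenalty 3 `
                    -WriteIopsPerDisk 130 -ReadIopsPerDisk 140
```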
Key requirements: type and number of SSDs
With the hard drives sorted out, it is time to count the SSDs. Their type, quantity and capacity are based on the desired maximum performance of the disk subsystem.
Increasing the capacity of the SSD tier allows the built-in tiering engine to move more of the workload to the fast tier.
Since the number of columns (Columns, the data parallelization mechanism) in a virtual disk must be the same for both tiers, increasing the number of SSDs usually makes it possible to create more columns and thereby also increase HDD-tier performance.
A typical choice: 200-1600 GB MLC-based SSDs, a single model from one manufacturer with the latest stable firmware version.
Performance calculation *:

Where:
a) The number of SSD drives.
b) The number of SSDs is rounded up to the nearest even value.
c) The number of IOPS planned to be obtained from the SSD tier.
d) Write percentage.
e) Write penalty, which differs by resiliency type and must be taken into account when planning.
f) Estimated per-SSD write performance (IOPS).
g) Estimated per-SSD read performance (IOPS).
* Only for a starting point. Usually, the number of SSDs significantly exceeds the required minimum for performance due to additional factors.
The recommended minimum number of SSD in the shelves:
| Array type | 24-disk JBOD (2 columns) | 60-disk JBOD (4 columns) |
|---|---|---|
| Two-way mirror | 4 | 8 |
| Three-way mirror | 6 | 12 |
Key requirements: SSD:HDD ratio
The ratio is a balance of performance, capacity and cost. Adding SSDs improves performance (most operations are served by the SSD tier, the number of columns in the virtual disks can grow, etc.), but significantly increases the cost and reduces the potential capacity (a small SSD takes the place of a high-capacity disk).
A typical choice*: SSD:HDD from 1:4 to 1:6.
* By disk count, not capacity.
Key requirements: number and configuration of disk shelves (JBOD)
Disk shelves differ in many ways: the number of disks they hold, the number of SAS ports, etc.
Using multiple shelves allows them to be included in the fault-tolerance scheme (via the enclosure awareness feature), but it also increases the disk space required for Fast Rebuild.
The configuration should be symmetrical across all shelves (meaning cabling and disk placement).
A typical choice: number of shelves >= 2, IO modules per shelf - 2, a single model, the latest firmware versions and a symmetrical arrangement of disks in all shelves.

Usually, the number of disk shelves follows from the total number of disks, with a margin for expansion and for enclosure-level fault tolerance (the enclosure awareness feature) and/or extra SAS paths for throughput and resiliency.
Calculation:

Where:
a) Number of JBODs.
b) Rounded up.
c) The number of HDD.
d) SSD count.
e) Free drive slots.
f) Maximum number of disks in JBOD.
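As a quick sketch of this calculation (assuming the shelf count is simply HDDs plus SSDs plus the desired free slots, divided by the slots per shelf and rounded up; the numbers are placeholders):

```powershell
# Number of JBOD shelves (sketch)
$hddCount = 144; $ssdCount = 36; $freeSlots = 0; $slotsPerJbod = 60
$jbodCount = [math]::Ceiling(($hddCount + $ssdCount + $freeSlots) / $slotsPerJbod)   # b) rounded up
"JBOD shelves required: $jbodCount"
```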
Key requirements: SAS HBAs and cables
The SAS cable topology must connect each storage server to the JBOD shelves by at least one SAS path (that is, at least one SAS port on the server goes to one SAS port on each JBOD).
Depending on the number and type of disks in the JBOD shelves, the total throughput can easily reach the limit of a single 4x 6G SAS port (~2.2 GB/s).
Recommended:
- Number of SAS ports per server >= 2.
- Number of SAS HBAs per server >= 1.
- Latest SAS HBA firmware versions.
- Multiple SAS paths to JBOD shelves.
- Windows MPIO settings: Round-Robin.
Key requirements: storage server configuration
Here you need to take into account server performance, the number of servers in the cluster for fault tolerance and serviceability, the load profile (total IOPS, throughput, etc.), the use of offload technologies (RDMA), the number of SAS ports per JBOD, and the multipath requirements.
Recommended: 2-4 servers; 2 processors with 6+ cores each; memory >= 64 GB; two local hard drives in a mirror; two 1+ GbE ports for management and two 10+ GbE RDMA ports for data; a BMC port, either dedicated or combined with 1 GbE, supporting IPMI 2.0 and/or SMASH; SAS HBAs as in the previous paragraph.
Key requirements: number of disk pools
At this step we take into account that pools are units of both management and fault tolerance. A failed disk in a pool affects all virtual disks located on that pool, and every disk in the pool holds metadata.
Increasing the number of pools:
- Increases the reliability of the system, since there are more independent fault-tolerance domains.
- Increases the amount of disk space reserved for rebuilds (i.e. extra overhead), since Fast Rebuild works at the pool level.
- Increases the complexity of system management.
- Reduces the number of columns in a virtual disk (VD), reducing performance, because a VD cannot span multiple pools.
- Reduces the time needed to process pool metadata, for example during a virtual disk rebuild or a pool ownership transfer after a cluster node failure (a performance improvement).
Typical choice:
- The number of pools from 1 to the number of disk shelves.
- Disks / Pool <= 80
Key requirements: pool configuration
The pool holds defaults for the virtual disks created on it and several settings that affect storage behavior:

| Option | Description |
|---|---|
| RepairPolicy | Sequential vs. Parallel (lower I/O load but slower, or higher I/O load but faster, respectively). |
| RetireMissingPhysicalDisks | With Fast Rebuild, a dropped disk does not trigger a pool repair when the value is Auto (if the value is Always, the pool repair always starts onto a spare disk, not immediately but 5 minutes after the first failed write to that disk). |
| IsPowerProtected | True means all write operations are acknowledged as complete without confirmation from the disk. Without power-loss protection, an outage can cause data corruption (a similar technology has appeared on hard drives). |
Typical choice:
- Hot Spares: No
- Fast Rebuild: Yes
- RepairPolicy: Parallel (default)
- RetireMissingPhysicalDisks: Auto (default, MS recommended)
- IsPowerProtected: False (default, but if enabled - performance increases significantly)
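These typical choices map onto the in-box Storage Spaces cmdlets roughly as in the sketch below; the pool name and the subsystem wildcard are made up for the example, and the values simply restate the settings listed above.

```powershell
# Create a pool from all poolable disks and apply the settings above (sketch).
$disks = Get-PhysicalDisk -CanPool $true
New-StoragePool -StorageSubSystemFriendlyName "*Storage Spaces*" `
                -FriendlyName "Pool01" -PhysicalDisks $disks

# No hot spares are designated (all disks stay in the pool so Fast Rebuild has spare capacity);
# parallel repair, Auto retirement of missing disks, power protection off.
Set-StoragePool -FriendlyName "Pool01" `
                -RepairPolicy Parallel `
                -RetireMissingPhysicalDisks Auto `
                -IsPowerProtected $false
```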
Key requirements: number of virtual disks (VD)
Take the following into account:
- The ratio of SMB shares serving clients to CSV volumes to the corresponding VDs should be 1:1:1.
- Each tiered virtual disk has a dedicated write-back cache (WBC); increasing the number of VDs can improve performance for some workloads (other things being equal).
- Increasing the number of VDs increases management complexity.
- A larger number of evenly loaded VDs makes it easier to redistribute the load of a failed node across the remaining cluster nodes.

A typical choice of VD numbers: 2-4 per storage server.
Key requirements: virtual disk configuration (VD)
Three layouts are available - Simple, Mirror and Parity - but only Mirror is recommended for virtualization workloads.
Compared to a 2-way mirror, a 3-way mirror doubles the protection against disk failure at the cost of a slight drop in performance (a higher write penalty), less usable space and a correspondingly higher cost.

Comparison of types of mirroring:
| Number of pools | Mirror type | Capacity overhead | Disk failures tolerated per pool | Disk failures tolerated per system |
|---|---|---|---|---|
| 1 | 2-way | 50% | 1 disk | 1 disk |
| 1 | 3-way | 67% | 2 disks | 2 disks |
| 2 | 2-way | 50% | 1 disk | 2 disks |
| 2 | 3-way | 67% | 2 disks | 4 disks |
| 3 | 2-way | 50% | 1 disk | 3 disks |
| 3 | 3-way | 67% | 2 disks | 6 disks |
| 4 | 2-way | 50% | 1 disk | 4 disks |
| 4 | 3-way | 67% | 2 disks | 8 disks |
For its Cloud Platform System (CPS), Microsoft recommends a 3-way mirror.
Key requirements: number of columns
Usually, increasing the number of columns increases VD performance, but latency can also grow (more columns means more disks that must acknowledge each operation). The formula for calculating the maximum number of columns in a particular mirrored VD:

Where:
a) Rounding down.
b) The number of disks in a smaller level (usually the number of SSDs).
c) Fault tolerance by disk. For 2-way mirror it is 1 disk, for 3-way mirror it is 2 disks.
d) The number of copies of the data. For a 2-way mirror it is 2, for a 3-way mirror it is 3.
If the maximum number of columns is chosen and a disk then fails, the pool no longer has enough disks to satisfy the column count, and Fast Rebuild becomes unavailable.
The usual choice is 4-6 columns (one less than the calculated maximum, so that Fast Rebuild can operate). Setting the column count explicitly is only available in PowerShell; when the disk is created through the graphical interface, the system picks the maximum possible number of columns, up to 8.
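One way to express this in PowerShell, consistent with the SSD-minimum table above (maximum columns = disks in the smaller tier divided by the number of data copies, rounded down) and with the recommendation to use one column less so Fast Rebuild has room to work; treat it as a sketch, not the exact formula used by the stack.

```powershell
# Column count for a mirrored, tiered virtual disk (sketch).
$ssdPerPool = 12   # b) disks in the smaller (SSD) tier
$dataCopies = 3    # d) 2 for a two-way mirror, 3 for a three-way mirror
$maxColumns = [math]::Floor($ssdPerPool / $dataCopies)   # a) rounded down
$columns    = [math]::Max(1, $maxColumns - 1)            # one less, to leave room for Fast Rebuild
"Maximum columns: $maxColumns, recommended: $columns"
```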
Key requirements: virtual disk options

| Virtual disk option | Considerations |
|---|---|
| Interleave | For random loads (virtualized workloads, for example), the interleave size should be greater than or equal to the largest active I/O request, since any request larger than the interleave is split into several operations, which reduces performance. |
| WBC cache size | The default is 1 GB, a reasonable balance between performance and failover behavior for most workloads (a larger cache increases failover time: when a CSV volume is moved, the cache must be flushed and rebuilt, which can look like a temporary denial of service). |
| IsEnclosureAware | Improved protection against failures; recommended whenever possible. To enable it, the number of disk shelves must match the resiliency level (available only through PowerShell): 3 shelves for a 3-way mirror, 4 for dual parity, and so on. This feature lets the system survive the complete loss of a disk shelf. |
Typical choice:
Interleave: 256K (default)
WBC Size: 1GB (default)
IsEnclosureAware: use whenever possible.
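For reference, the same options expressed with the Windows Server 2012 R2 cmdlets; this is a sketch with illustrative pool, tier and size values, not commands taken from the article.

```powershell
# Define the storage tiers on the pool (tier names and sizes are placeholders).
$ssdTier = New-StorageTier -StoragePoolFriendlyName "Pool01" -FriendlyName "SSDTier" -MediaType SSD
$hddTier = New-StorageTier -StoragePoolFriendlyName "Pool01" -FriendlyName "HDDTier" -MediaType HDD

# Tiered, three-way mirrored virtual disk with explicit columns, interleave, WBC size and enclosure awareness.
New-VirtualDisk -StoragePoolFriendlyName "Pool01" -FriendlyName "VD01" `
                -ResiliencySettingName Mirror -NumberOfDataCopies 3 -NumberOfColumns 4 `
                -Interleave 256KB -WriteCacheSize 1GB -IsEnclosureAware $true `
                -StorageTiers $ssdTier, $hddTier -StorageTierSizes 1TB, 20TB
```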
Key requirements: virtual disk size
Obtaining the optimal VD size on a pool requires a calculation for each storage tier and then summing the optimal tier sizes. Space also has to be reserved for Fast Rebuild and metadata, as well as for the internal rounding that happens during allocation (buried deep in the Storage Spaces stack, so we simply reserve for it).
Formula:

Where:
a) A conservative approach that leaves slightly more unallocated space in the pool than Fast Rebuild strictly needs. The value is in GiB (powers of 2; GB is based on powers of 10).
b) Value in GiB.
c) Reserved space for Storage Spaces metadata (every disk in the pool holds metadata for both the pool and the virtual disks).
d) Reserved space for Fast Rebuild: >= the size of one disk (+8 GiB) for each tier of the pool in a disk shelf.
e) The size of each storage tier is rounded down to a whole number of "tiles" (Spaces splits each disk in a pool into pieces called tiles, from which the array is built); the tile size equals the Storage Spaces allocation unit (1 GiB) multiplied by the number of columns, so we round down to the nearest whole number of tiles to avoid allocating too much.
f) Write-back cache size, in GiB, for the given tier (1 for the SSD tier, 0 for the HDD tier).
g) The number of disks in the given pool tier.
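A numeric sketch of that arithmetic, under one possible reading of the variables above: the tier's raw capacity divided by the number of data copies, minus the metadata and Fast Rebuild reserves, split across the pool's virtual disks, rounded down to whole tiles, minus the write-back cache. The reserve figures are illustrative only.

```powershell
# Optimal tier size for one virtual disk (sketch, GiB throughout).
$disksInTier = 12            # g) disks of this tier in the pool
$diskSizeGiB = 745           # raw size of one disk in GiB (e.g. an 800 GB SSD)
$dataCopies  = 3             # three-way mirror
$metadataGiB = 4             # c) reserve for pool/VD metadata (illustrative)
$rebuildGiB  = $diskSizeGiB + 8   # d) Fast Rebuild reserve: one disk + 8 GiB for the tier
$vdsPerPool  = 2
$columns     = 4
$wbcGiB      = 1             # f) 1 for the SSD tier, 0 for the HDD tier

$usableGiB = ($disksInTier * $diskSizeGiB) / $dataCopies - $metadataGiB - $rebuildGiB   # a) conservative
$perVdGiB  = $usableGiB / $vdsPerPool
$tileGiB   = 1 * $columns                                    # e) tile = 1 GiB allocation unit x columns
$tierSize  = [math]::Floor($perVdGiB / $tileGiB) * $tileGiB - $wbcGiB
"Tier size per virtual disk: $tierSize GiB"
```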
Spaces-Based SDS: Next Steps
We have settled on the initial estimates; now we move on. Changes and improvements (and there will be some) will be made based on the testing results.
Example
What do you want to get?
- A high level of fault tolerance.
- Target performance: 100K IOPS from the SSD tier and 10K IOPS from the HDD tier under a random load of 64K blocks with a 60/40 read/write mix.
- Capacity: 1000 virtual machines of 40 GB each, plus a 15% reserve.
Equipment:
HDDs: 2/4/6/8 TB
IOPS (read/write): 140/130 (declared specs: 175 MB/s, 4.16 ms)
SSDs: 200/400/800/1600 GB
IOPS:
Read: 7000 IOPS @ 460 MB/s (declared: 120K)
Write: 5500 IOPS @ 360 MB/s (declared: 40K)
SSDs under real workloads show figures that differ significantly from the declared ones :)
Disk shelves: 60 disks, two SAS modules, 4 ports each
ETegro Fastor JS300.
Input data
Tiering is needed because high performance is required.
High reliability and not very large capacity are required, so we use 3-way mirroring.
Drive selection (based on performance and budget requirements): 4 TB HDDs and 800 GB MLC SSDs.
Calculations:
Spindles by capacity:
Spindles by performance (we want to get 10K IOPS):

The total capacity significantly exceeds the initial estimates, as it is necessary to meet the performance requirements.
SSD tier:
SSD:HDD ratio:
Number of disk shelves:
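The estimates above can be sketched in PowerShell with this example's figures (a write penalty of 3 is assumed for the three-way mirror); the results are the raw counts, which are rounded up in the next step for a symmetrical layout.

```powershell
# Worked example (sketch): 1000 VMs x 40 GB + 15%, 10K HDD IOPS, 100K SSD IOPS, 60/40 read/write.
$penalty = 3   # three-way mirror

# Spindles by capacity: 1000 * 40 GB * 1.15 reserve * 3 copies / 4000 GB per disk
$byCapacity = [math]::Ceiling(1000 * 40 * 1.15 * 3 / 4000)

# Spindles by performance, HDD tier (140/130 R/W IOPS per disk)
$hdd = [math]::Ceiling(10000 * 0.6 / 140 + 10000 * 0.4 * $penalty / 130)
if ($hdd % 2) { $hdd++ }                                      # -> 136

# SSD tier (7000/5500 R/W IOPS per SSD)
$ssd = [math]::Ceiling(100000 * 0.6 / 7000 + 100000 * 0.4 * $penalty / 5500)
if ($ssd % 2) { $ssd++ }                                      # -> 32

# Ratio and shelf count (60 slots per JBOD)
$ratio   = [math]::Round($hdd / $ssd, 2)                      # -> 4.25
$shelves = [math]::Ceiling(($hdd + $ssd) / 60)                # -> 3

"Capacity: $byCapacity, HDD: $hdd, SSD: $ssd, SSD:HDD = 1:$ratio, JBODs: $shelves"
```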
Disk placement in the shelves:
Increase the number of disks to get a symmetrical distribution and an optimal SSD:HDD ratio. For ease of expansion it is also recommended to fill the disk shelves completely.
SSD: 32 -> 36
HDD: 136 -> 144 (SSD:HDD = 1:4)
SSD / Enclosure: 12
HDD / Enclosure: 48
SAS cables
For a fault-tolerant connection and maximum bandwidth, each disk shelf must be connected with two cables (two paths from the server to each shelf). In total, 6 SAS ports per server.
Number of servers
Based on the fault tolerance, IO, budget and multipathing requirements, we take 3 servers.
Number of pools
The number of disks in a pool must be 80 or less; we take 3 pools (180 / 80 = 2.25, rounded up).
HDDs / Pool: 48
SSDs / Pool: 12
Pool configuration
Hot Spares: No
Fast Rebuild: Yes (reserve enough space)
RepairPolicy: Parallel (default)
RetireMissingPhysicalDisks: Always
IsPowerProtected: False (default)
Number of virtual disks
Based on the requirements, we use 2 VDs per server, for a total of 6, distributed evenly across the pools (2 per pool).
Virtual disk configuration
Based on the requirements for resilience to data loss (for example, a failed disk may not be replaceable for several days) and the load, we use the following settings:
Resiliency: 3-way mirroring
Interleave: 256K (default)
WBC Size: 1GB (default)
IsEnclosureAware: $true
Number of columns
Virtual disk size and storage levels
Totals:
Storage servers: 3
SAS ports / server: 6
SAS paths from the server to each shelf: 2
Disk shelves: 3
Number of pools: 3
Number of VD: 6
Virtual Disks / Pool: 2
HDD: 144 @ 4 TB (~576 TB of raw space), 48 per shelf, 48 per pool, 16 per shelf per pool
SSD: 36 @ 800 GB (~28 TB of raw space), 12 per shelf, 12 per pool, 4 per shelf per pool
Virtual disk size: SSD tier + HDD tier = 1110 GB + 27926 GB = 28.4 TB
Total usable capacity: 28.4 * 6 ≈ 170 TB
Overhead: 1 - 170 / (576 + 28) ≈ 72%
