
Testing flash storage: IBM FlashSystem 840

Last year we wrote about testing the IBM FlashSystem 820 (RamSan). This time, thanks to one large customer, an IBM FlashSystem 840 came into our hands. The system has been on the market for about a year, its "childhood diseases" are behind it, and it is time to evaluate its professional capabilities.


Testing method


During testing, the following tasks were addressed:

Testbed Configuration


For the testing, two different test beds were assembled in succession at the customer's site.
For the tests of groups 1 and 2, the load is generated by a single server, and the test bed has the layout shown in the figure:
Figure 1. Block diagram of test bed 1.

Server: IBM x3850 X5, connected directly to the IBM FlashSystem 840 storage system via eight 8 Gb FC links.

For the group 3 tests, an IBM x3650 M4 server is added to the described test bed, also connected directly to the IBM FlashSystem 840 storage system. At this stage, each server is connected to the storage system via four optical links.
Figure 2. Block diagram of test bed 2.
As additional software, Symantec Storage Foundation 6.1 is installed on the test server, providing logical volume management (VxVM) and dynamic multipathing (DMP).

On the test server, the following settings were made to reduce disk I/O latency (a sketch of how they might be applied is shown after the list):
  • Changed the I/O scheduler from cfq to noop by assigning the value noop to the scheduler parameter of the Symantec VxVolume devices;
  • Added a parameter to /etc/sysctl.conf that minimizes the queue size at the level of the Symantec logical volume manager: vxvm.vxio.vol_use_rq = 0;
  • Increased the limit of simultaneous I/O requests to a device to 1024 by assigning the value 1024 to the nr_requests parameter of the Symantec VxVolume devices;
  • Disabled checking for the possibility of merging I/O operations by assigning the value 1 to the nomerges parameter of the Symantec VxVolume devices;
  • Increased the queue depth on the FC adapters by adding the line options qla2xxx ql2xmaxqdepth=64 to the /etc/modprobe.d/modprobe.conf configuration file.
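
For reference, the settings above come down to a few configuration-file lines and standard Linux block-layer tunables in sysfs. The sketch below shows how they might look; the device name is a placeholder, since in the test bed the values were applied to the devices backing the Symantec VxVM volume:

    # /etc/sysctl.conf -- minimize queuing at the VxVM level
    vxvm.vxio.vol_use_rq = 0

    # /etc/modprobe.d/modprobe.conf -- queue depth of the QLogic FC adapters
    options qla2xxx ql2xmaxqdepth=64

    # Standard block-layer tunables in sysfs (device name is a placeholder)
    echo noop > /sys/block/<device>/queue/scheduler
    echo 1024 > /sys/block/<device>/queue/nr_requests
    echo 1    > /sys/block/<device>/queue/nomerges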

On the storage system, the following configuration is performed to partition the disk space:
  • The flash modules are configured as RAID 5;
  • For the tests of groups 1 and 2, 8 LUNs of equal size are created on the storage system, with a total capacity covering the entire usable capacity of the disk array; the LUN block size is 512 bytes; the created LUNs are presented to the single test server. For the group 3 tests, 16 LUNs of equal size are created, with a total capacity covering the entire usable capacity of the disk array; 8 of them are presented to each of the 2 test servers.

Software used in the testing process


To create a synthetic load on the storage system, the Flexible I/O Tester (fio) utility, version 2.1.4, is used. All synthetic tests use the following fio parameters in the [global] section (a sample job-file fragment is shown after the list):
  • thread=0
  • direct=1
  • group_reporting=1
  • norandommap=1
  • time_based=1
  • randrepeat=0
  • ramp_time=10
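
As an fio job file, the [global] section above looks like this (taken verbatim from the list, nothing else added):

    [global]
    thread=0
    direct=1
    group_reporting=1
    norandommap=1
    time_based=1
    randrepeat=0
    ramp_time=10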

The following utilities are used to collect performance indicators under synthetic load:
  • iostat, part of the sysstat package version 9.0.4, with the txk keys;
  • vxstat, part of Symantec Storage Foundation 6.1, with the svd keys;
  • vxdmpadm, part of Symantec Storage Foundation 6.1, with the -q iostat keys;
  • fio version 2.1.4, to generate a summary report for each load profile.

Performance indicators during the tests are collected with the iostat, vxstat and vxdmpadm utilities at 5-second intervals (sample invocations are sketched below).
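
The collection commands might look roughly as follows. The txk and svd keys and the 5-second interval come from the description above; the disk group name and the exact vxstat interval syntax are assumptions:

    iostat -txk 5                # extended per-device statistics, every 5 seconds
    vxstat -g testdg -svd -i 5   # VxVM volume, subdisk and disk statistics, every 5 seconds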

Testing program.


Tests are performed by creating a synthetic load with fio on a block device: a striped logical volume (stripe, 8 columns, stripe unit size = 1 MiB) created using Veritas Volume Manager from the 8 LUNs presented from the system under test (a sketch of how such a volume might be created is shown below).
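
Such a volume might be created with vxassist roughly as follows; the disk group name, volume name and size are illustrative assumptions, while the 8 columns and the 1 MiB stripe unit come from the description above:

    # 8-column striped volume with a 1 MiB stripe unit across the 8 presented LUNs
    vxassist -g testdg make testvol 10t layout=stripe ncol=8 stripeunit=1m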
Testing consisted of 3 groups of tests:
Group 1: Tests implementing a continuous random-write load with varying I/O block size.

When creating the test load, the following fio parameters are used in addition to those defined earlier (a sample job section is sketched after the list):
  • rw=randwrite;
  • blocksize=4K;
  • numjobs=64;
  • iodepth=64.
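
Together with the [global] section shown earlier, the Group 1 job might be described by a section like this; the ioengine, the VxVM volume path and the runtime value are assumptions added for illustration (the article itself fixes only the parameters listed above):

    [group1-randwrite]
    rw=randwrite
    blocksize=4k
    numjobs=64
    iodepth=64
    ; assumed asynchronous engine and VxVM volume path
    ioengine=libaio
    filename=/dev/vx/rdsk/testdg/testvol
    ; 18 hours, needed because time_based=1
    runtime=64800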

The group consists of three tests that differ in the total volume of the LUNs presented from the tested storage system and in the I/O block size:
  • A write test performed on a fully provisioned storage system: the total volume of the presented LUNs equals the effective capacity of the storage system; test duration is 18 hours;
  • Write tests with varying block size (4, 8, 16, 32, 64, 1024K), performed on a fully provisioned storage system; the duration of each test is 1 hour, with a 2-hour pause between tests;
  • Write tests with varying block size (4, 8, 16, 32, 64, 1024K), performed on a storage system filled to 70%; the duration of each test is 1 hour, with a 2-hour pause between tests. For this test, 8 LUNs are created on the tested storage system with a total capacity equal to 70% of its effective capacity. The created LUNs are presented to the test server, where a volume is assembled from them using Symantec VxVM, and the test load is applied to that volume.

Based on the test results, the following graphs are generated from the data output by the vxstat command, combining the results of the tests:
  • IOPS as a function of time;
  • Bandwidth as a function of time;
  • Latency as a function of time.

The collected information is analyzed and conclusions are drawn about:
  • the presence of performance degradation under long-term write load;
  • the performance of the storage system's service processes (garbage collection) that limit the write performance of the disk array during long peak load;
  • the degree to which the I/O block size influences the performance of the storage service processes;
  • the amount of space reserved by the storage system for smoothing out the service processes;
  • the influence of the storage fill level on the performance of the service processes.

Group 2: Performance tests of a disk array with different types of load generated by a single server, executed at the block device level.

During testing, the following load types are investigated (a sample job section for one of the combinations is sketched after the list):
  • load profiles (fio parameters varied: rw=randrw, rwmixread):

  1. 100% random write;
  2. 30% random write, 70% random read;
  3. 100% random read.

  • block sizes: 1KB, 8KB, 16KB, 32KB, 64KB, 1MB (fio parameter varied: blocksize);
  • I/O processing methods: synchronous, asynchronous (fio parameter varied: ioengine);
  • number of load-generating processes: 1, 2, 4, 8, 16, 32, 64, 128 (fio parameter varied: numjobs);
  • queue depth (for asynchronous I/O): 32, 64 (fio parameter varied: iodepth).
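
For illustration, one combination from this matrix (mixed profile with 70% reads, 8K block, asynchronous I/O with a 32-deep queue, 16 processes) might be described by a job section like the one below; the engine name and the volume path are assumptions:

    [group2-mixed-8k]
    rw=randrw
    rwmixread=70
    blocksize=8k
    numjobs=16
    iodepth=32
    ; assumed asynchronous engine and VxVM volume path
    ioengine=libaio
    filename=/dev/vx/rdsk/testdg/testvol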

The test group consists of a set of tests covering all possible combinations of the above load types. To level out the influence of the storage system's service processes (garbage collection) on the test results, a pause is inserted between tests equal to the amount of data written during the test divided by the throughput of the storage service processes (determined from the results of the first group of tests); a worked example is given below.
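For instance, taking the Group 1 result of roughly 1200 MB/s for the garbage-collection throughput on medium and large blocks, a test that wrote about 4 TB of data would be followed by a pause of roughly 4,000,000 MB / 1200 MB/s ≈ 3,300 seconds, i.e. a little under an hour. The 4 TB figure here is purely illustrative; the GC throughput is taken from the Group 1 conclusions below.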
Based on the test results, the following graphs are generated for each combination of load profile, I/O processing method and queue depth, from the data output by fio after each test; each graph combines tests with different I/O block sizes:
  • IOPS as a function of the number of load-generating processes;
  • Bandwidth as a function of the number of load-generating processes;
  • Latency (clat) as a function of the number of load-generating processes.

The obtained results are analyzed, and conclusions are drawn about the load characteristics of the disk array at latencies of about 1 ms or less, about the maximum performance of the array, and about its performance under single-threaded load. The optimal block size for working with the array is also determined: the block size at which the maximum number of I/O operations can be performed while transferring the maximum amount of data.

Group 3: Disk array performance tests under different types of load generated by two servers, executed at the block device level;

To perform the tests of this group, one more server is added to the test bed. The disk array is divided into 16 LUNs of equal size, together occupying the entire storage capacity; 8 LUNs are presented to each server. The tests are conducted in the same way as in group 2, except that the load is generated by two servers simultaneously. The total performance obtained by both servers during each test is evaluated. Based on the results, a conclusion is drawn about the impact of the number of load-generating servers on storage performance.

Test results


Group 1: Tests implementing a continuous random-write load with varying I/O block size.

Conclusions:
1. Under prolonged write load, at a certain point in time a significant degradation of storage performance is recorded (Figure 3). The performance drop is expected and is a feature of SSD operation (the "write cliff"), associated with the activation of garbage collection (GC) processes and their limited performance. The disk array performance recorded after the write cliff (after the drop) can be considered the maximum sustained performance of the array.
Figure 3. Change in I/O rate (IOPS), data transfer rate (bandwidth) and latency during a long write test with a 4K block.

2. The block size under prolonged write load affects the performance of the GC process: with small blocks (4K) GC runs at about 640 MB/s, while with medium and large blocks (16K-1M) it runs at about 1200 MB/s.

3. The difference between the maximum time the storage system sustained peak performance in the first long test and in the subsequent equivalent test with the 4K block is explained by the fact that the storage system was not completely filled before the start of testing.

4. The maximum time the storage system operates at peak performance differs significantly between the 4K block and all other block sizes, which is most likely due to the limit on the storage space reserved for the GC processes.

5. About 2 TB is reserved on the storage system for its service processes.

6. In the tests on a storage system filled to 70%, the performance drop occurs somewhat later (by about 10%). There are no changes in the speed of the GC processes.

Charts and tables (the images are omitted here; in the original post all pictures are clickable):

Block-device performance graphs under the long write load generated by a single server: data transfer rate (bandwidth), IOPS and latency, for a fully provisioned storage system (100% formatted) and for a storage system filled to 70%.

Table 1. Dependence of storage performance on block size under long-term write load.

Group 2: Performance tests of a disk array with different types of load generated by a single server, executed at the block device level.

The main test results presented in the graphs are also summarized in tables.
Tables and graphs (the images are omitted here; in the original post all pictures are clickable):

Table 2. Storage performance with one load-generating process (jobs=1).

Table 3. Maximum storage performance at latencies under 1 ms.

Table 4. Maximum storage performance at latencies up to 3 ms.

Table 5. Maximum storage performance under the different load profiles.

Block-device performance graphs for the different load types generated by a single server, grouped by I/O method (synchronous, asynchronous with queue depth 32, asynchronous with queue depth 64) and by load profile (random read, random write, mixed 70% read / 30% write).

Conclusions:


1. The maximum recorded performance figures for the storage system (averaged over the duration of each test, 3 minutes):

Write:

Read:

Mixed load (70/30 read/write):

Minimum recorded latency:


2. The storage system enters saturation mode at

3. On read operations with large blocks (16K-1M), a throughput of more than 6 GB/s was obtained, which roughly corresponds to the total throughput of the interfaces used to connect the server to the storage system. Thus, neither the storage controllers nor the flash modules are the bottleneck of the system.

4. With the asynchronous I/O method, the array delivers 1.5-2 times higher performance on small blocks (4-8K) than with the synchronous method. On medium and large blocks (16K-1M), performance with synchronous and asynchronous I/O is approximately equal.

5. The graphs below show the dependence of the maximum obtained performance indicators of the tested storage system (IOPS and data transfer rate) on the I/O block size. The nature of the graphs leads to the following conclusions:

Maximum read and write performance with synchronous I/O at various block sizes.

Maximum read and write performance with asynchronous I/O (queue depth 32) at various block sizes.

Group 3: Disk array performance tests with different types of load generated by two servers, executed at the block device level.

For each of the tests, the performance obtained coincided, within a 5% margin of error, with the results of the group 2 tests, where the load was generated by one server. We do not provide graphs and performance data for the group 3 tests because they are identical to the results of the second group.
In other words, the study showed that the server is not the "bottleneck" of the test bed.

Findings


In general, the system showed excellent results. We were unable to identify obvious bottlenecks or obvious problems. All results are stable and predictable. Compared with our previous testing of the IBM FlashSystem 820, it is worth noting the difference in management interfaces. The 820 model is managed through a sometimes inconvenient Java applet inherited from the Texas Memory Systems RamSan 820, while the 840 has a web interface resembling the XIV Storage System and Storwize interfaces already familiar from other IBM products. Working with it is noticeably more pleasant and, ultimately, faster.
In addition, the IBM FlashSystem 840 has acquired the hot-swap capability for all components and the on-the-fly microcode updates that are necessary for enterprise-class devices. The choice of available connection interfaces and flash-module configurations has also expanded significantly.

The disadvantages, perhaps, include the performance degradation under prolonged write load. However, this is rather a limitation of today's flash memory technology, and it shows up only because the manufacturer did not artificially limit the speed of the system. Even under prolonged maximum write load, and after the performance drop, the storage system showed remarkable results.


PS The author expresses cordial thanks to Pavel Katasonov, Yuri Rakitin and all other company employees who participated in the preparation of this material.

Source: https://habr.com/ru/post/253785/

