
I am posting a translation of a fairly long article on using MySQL on ZFS, including comparative benchmarks of ZFS and UFS.
I apologize in advance for the translation quality - I have never formally studied English, and I will gratefully accept corrections.
About ZFS
When Sun Microsystems released the ZFS file system in 2004, it set out to remove all practical limits on file storage and make the mysteries of storage management a thing of the past. ZFS is a 128-bit file system with, to name just a few of its properties, copy-on-write transactional semantics, fast snapshots, and optional compression.
The most significant benefit of using ZFS is the ease of administration. With ZFS, traditional tasks such as formatting a disk or editing /etc/fstab disappear entirely, and building mirrors or RAID arrays comes down to meaningful, easily remembered commands with an interactive prompt. ZFS does for storage management what MySQL does for databases, and it is open source just like MySQL. ZFS was originally released for Solaris, but a Linux port is in progress, Apple is including it in OS X 10.5, and it ships with FreeBSD 7. Read more on the ZFS Community page at opensolaris.org.
ZFS administration relies mainly on two commands: zpool, for managing storage pools, and zfs, for managing file systems.
A storage pool is the key abstraction: a pool can consist of several physical devices and can host several file systems. Whenever you add storage to a pool, it becomes available to any file system that needs it. To use an entire disk as ZFS storage, you can run the following command:
# zpool create zp1 c2t0d0
Here, zp1 is the name of the pool, and c2t0d0 is a disk device. If Solaris-style device naming looks unfamiliar, keep in mind that on Linux and other operating systems you would use their native device names.
If your disk is already formatted, for example with UFS on one partition, you can create a pool on another partition:
# zpool create zp1 c2t0d0s2
You can also use a file as storage:
# zpool create zp1 ~/storage/myzfile
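Note that the backing file must already exist before the pool is created; on Solaris it can be pre-allocated, for example with mkfile (the 1 GB size here is just an illustrative choice), before running the zpool create command above:
# mkfile 1g ~/storage/myzfile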
If you already have a pool, you can build file systems on it:
# zfs create zp1/data
# zfs create zp1/logs
Do not expect to wait for the file systems to be created after running these commands - they complete instantly. Later, if you run short of free space, simply add a device to the pool and the file systems will grow.
# zpool add zp1 c3t0d0
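To confirm that the pool has grown, you can check its layout and capacity (output not shown here):
# zpool status zp1
# zpool list zp1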
ZFS and tablespaces
To appreciate what this power means for MySQL, consider the problem of managing storage under InnoDB:
innodb_data_file_path = /disk1/ibdata1:10G;/disk2/ibdata2:10G;/disk3/ibdata3:10G:autoextend
In this example, the InnoDB tablespace is split across three physical disks, perhaps because each disk has only 10 GB free, or perhaps in an attempt to spread the load. The third file is marked autoextend, which means it can grow if necessary.
Unfortunately, the attempt to balance the load across the three disks does not work in practice. InnoDB must fill the first file before it starts using the next one, and fill the second before starting on the third. Another problem is that the tablespace cannot be extended online: if you run out of space on disk number 3 and add a file on another disk, you must restart the server for the change to take effect. (You would also have to perform some tricky manipulations with autoextend in this example, but those details are beyond the scope of this discussion.)
The situation can be fixed by taking storage management out of the database's hands entirely. It is enough to point innodb_data_file_path at ZFS:
innodb_data_file_path = /dbzpool/data/ibdatafile:10G:autoextend
You can spread this storage across as many devices as you like, and ZFS will balance the load intelligently. You can combine devices, mirror them, add space when you need more, add spare disks on the fly, and remove failed disks without restarting the database. You let the database be a database, and we at MySQL can concentrate on keeping the database simple, inexpensive, and easy to use.
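For illustration, here is what a few of those operations look like; the pool name follows the earlier examples and the device names are hypothetical. The first command adds a mirrored pair of disks, the second attaches a hot spare, and the third replaces a failing disk with a new one:
# zpool add zp1 mirror c4t0d0 c5t0d0
# zpool add zp1 spare c6t0d0
# zpool replace zp1 c2t0d0 c7t0d0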
In MySQL's Falcon engine, a tablespace can be created and dropped online, but it can contain only one data file. Support for multiple data files is promised in future versions, but ZFS users have no need for that feature.
File systems, buffering, and read-ahead
The file system layer traditionally provides a small set of features that significantly improve performance for most applications. The first is buffering - an intermediate area in memory between the physical disk and the application using it.
The second is read-ahead: if an application requests a large chunk of a file, the file system reads an even larger chunk, anticipating that the application will keep reading.
A database server, however, is not a typical application. Although MyISAM, which relies on the file system buffer to cache records, is therefore not subject to the problem described in this section, many other engines use their own buffering layer. Database performance depends critically on how effectively that cache is used: every miss that forces a read from the physical disk reduces efficiency.
The problem arises when data is cached in memory twice - once by the database server and a second time by the file system. Keeping two copies of the data halves the usefulness of memory: when memory is full, it holds only half the data it could, which lowers the cache hit rate, increases the number of physical reads, and hurts efficiency. This is what we call the double buffering problem.
The next optimization, read-ahead, can also be harmful for a database server. The reads a database performs do not follow a predictable pattern, so the file system ends up reading data that will never be used. Roch Bourbonnais, an engineer and blogger at Sun, has explored this: when the traditional UFS file system "sees access to 2 consecutive pages, it reads a full cluster, which today is usually 1 MB in size." The time the disk head spends reading unwanted data could be put to far better use.
Many MySQL users, especially on Solaris, report that both problems can be solved simply by turning these features off. Accessing the file system without buffering and read-ahead is known as Direct I/O, and it can be enabled in my.cnf by setting innodb_flush_method to O_DIRECT:
innodb_flush_method = O_DIRECT
Before the release of MySQL 5.0.42 and 5.1.18 in April 2007, it was necessary on Solaris to mount the partitions containing the data files with the forcedirectio option. Fortunately, this is no longer required, and InnoDB can use direct access without forcing every application on the partition to do the same.
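For reference, the old approach looked roughly like this (the device and mount point here are hypothetical), and it forced direct I/O for everything on that partition:
# mount -o forcedirectio /dev/dsk/c2t0d0s6 /db/data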
ZFS and the I/O scheduler
When InnoDB is used on ZFS, the double buffering problem undoubtedly arises, and its effect is easy to see in the tests below. As for read-ahead, although early releases of ZFS suffered from it, current ones (OpenSolaris Nevada build 70 and later) are free of the problem. Another Sun engineer and blogger, Neelakanth Nadgir, ran ZFS database tests in 2006 and reported that the earlier performance problems were fixed in newer builds.
ZFS brings a third interesting performance feature: its I/O scheduler. The ZFS scheduler works roughly as follows:
• When an I/O request arrives in the kernel, it is placed in a queue and assigned a priority and a deadline (the higher the priority, the earlier the deadline).
• Read requests get a higher priority than write requests, because reads are effectively synchronous (the application blocks while waiting for the response). A read request therefore moves to the front of the queue.
• The scheduler looks at all pending requests and decides which to issue next. It does this by grouping the most urgent requests into a "deadline group" and sending them to disk in Logical Block Address order.
Another Sun blogger has published at least one benchmark in which the ZFS I/O scheduler appears to deliver a measurable performance benefit.
Deploying MySQL on ZFS: Details
ZFS and MySQL look like a promising combination, but how do they fit together? Quite often the database server and the file system are solving the same set of problems, and each offers its own solution. The two solutions can complement each other nicely, or they can overlap badly. Below are a few important points.
Set ZFS Recordsize to Block Size
ZFS tries to determine the block size for each file heuristically, based on the file's size and how it is used. Small files are handled in tiny chunks and huge files in large ones. The ZFS developers know that this scheme should be optimal in the general case but may not suit databases, so they allow the user to set a fixed record size when the file system is used by a database. Some of the reasoning later in this article holds only if the database page size equals the ZFS record size.
InnoDB works with its data files in 16 KB pages unless you recompile it with a different page size. Falcon uses a default page size of 4 KB but allows it to be set in my.cnf. In either case, it is important to set the ZFS recordsize equal to the page size. This must be done before any files are created on the file system; the command looks like this:
# zfs set recordsize=16K zp1/data
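You can verify the property afterwards; zp1/data is the file system created in the earlier examples:
# zfs get recordsize zp1/data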
The InnoDB doublewrite buffer and transactional storage
Some transactional systems (not InnoDB) write entire data pages to the transaction log. InnoDB, however, logs only the page number and the modified record. Since this happens at the critical moment when the user is waiting for the transaction to complete, sending less data to disk improves performance.
If those other databases crash, they restart and perform recovery from the log: they can pull whole data pages out of the log and write them into the data file.
With InnoDB, recovery is not so simple. Only a small amount of information is logged, so the data page on disk must already be correct. InnoDB tries to guarantee this with a special area of the data file called the doublewrite buffer, which holds a second copy of the most recently written pages. After a crash there are two copies of the changed data - one in the normal location and one in the doublewrite buffer - and at least one of them must be intact. The doublewrite buffer is not needed on ZFS, and by disabling it you can get a small performance boost.
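A minimal my.cnf sketch for disabling it; check that your MySQL build accepts this option before relying on it:
[mysqld]
skip-innodb_doublewrite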
What the doublewrite buffer primarily protects against is a partial write: the case where the database sent a 16 KB page to disk but only 4 KB or 8 KB of it was actually written before the power went out. In our case that guarantee comes not from the database but from the file system: ZFS ensures that a partial write can never happen.
To picture a partial write, imagine a particular disk block, say block 82, being overwritten in a traditional Unix file system at the moment the power fails. When the system reboots, the beginning of block 82 contains part of the new data, the next piece contains garbage, and the end of the block contains old data that will now never be overwritten. If that block happens to hold an InnoDB page, the checksum at the end of the page will reveal, when verified, that the page contents are corrupt.
On ZFS, something completely different happens when the database overwrites block 82 just before a power failure. Block 82 is never overwritten in place; instead, the new data is written in full somewhere else. Consequently, if power is lost during the write, block 82 remains intact. If the power stays on, the data is written elsewhere and the new block is linked into the file (at least into the current version of the file) in place of the old one. (A snapshot of the previous version of the file may exist, in which case that version keeps the old block. Database people may be reminded of multi-version concurrency control, but at the file system level this property has existed for years in FreeBSD, NetApp, and Veritas file systems.)
MySQL performance comparison: ZFS vs. UFS on OpenSolaris
I ran two separate tests comparing UFS and ZFS on Solaris. The test machine was an x86 desktop with 2 GB of RAM, running OpenSolaris Nevada build 74 and MySQL 5.0.50. The first test was a single-threaded "utility" benchmark designed to exercise common operations such as bulk inserts and ALTER TABLE. It showed no noticeable difference between the two file systems, although some individual operations were about 5% slower on ZFS than on UFS.
The second test was more interesting. It used 6 client threads accessing a 3.2 GB database, with about 85% SELECT queries interleaved with INSERTs and UPDATEs. A 3.2 GB database is too large to fit in memory, so the test is I/O-bound, and it mimics the behavior of most real applications: a small set of records (about 5% of the total) is selected much more often than the rest.
This test is especially sensitive to cache behavior. Each run included some warm-up time to fill the caches, and runs cannot legitimately be compared if their warm-up times differ. To reset the warm-up state between runs, it is sometimes necessary to remount the file system, which flushes its cache buffers.
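One way to do this for the ZFS file system from the earlier examples (for UFS, a plain umount and mount of the partition serves the same purpose):
# zfs unmount zp1/data
# zfs mount zp1/data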
ZFS vs. UFS after 5 minutes of warm-up. Higher numbers mean better performance. 3.2 GB database on a server with 2 GB of RAM.

InnoDB Buffer Pool Size | UFS Direct I/O | ZFS
---|---|---
100 MB | 36.84 | 96.21
250 MB | 45.70 | 80.69
350 MB | 50.68 | 102.75
500 MB | 57.32 | 94.99
700 MB | 61.92 | 85.95
1000 MB | 82.41 | 98.35
1200 MB | 95.01 | 88.98
1400 MB | 101.32 | 117.83
The test with a five-minute warm-up shows that database performance on UFS increases steadily with buffer size. The ZFS results are erratic and non-linear. If the warm-up time is increased to 10 minutes, the picture becomes clearer.
ZFS vs. UFS after 10 minutes of warm-up.

InnoDB Buffer Pool Size | UFS Direct I/O | ZFS
---|---|---
100 MB | 35.71 | 179.18
250 MB | 44.25 | 154.26
350 MB | 50.60 | 110.14
500 MB | 57.10 | 93.29
700 MB | 66.57 | 101.35
1000 MB | 103.47 | 114.18
1200 MB | 168.66 | 156.04
1400 MB | 325.05 | 290.14

Again, performance on UFS increases steadily with buffer size, dramatically so toward the end. The last two points, at 1200 and 1400 MB, correspond to 60 and 70% of the machine's memory, enough to hold about half of the data.
When the buffer pool is small, ZFS is significantly better than UFS: the ZFS cache is very efficient. ZFS with a 100 MB buffer pool outperforms UFS even with 1200 MB. But as the buffer pool grows, ZFS performance drops.
This behavior is caused by double buffering. Since some data sits in both caches, less data fits in memory overall, and the physical disk has to be hit more often. The effect is at its worst with a buffer pool of around 500 MB.
Beyond 500 MB, the two file systems start to behave more alike. The database server claims most of the physical memory, leaving very little for ZFS; as a result the ZFS cache shrinks to a minimum and the double buffering problem fades.
Conclusion
ZFS offers very simple and flexible administration without any real loss of performance. At worst, ZFS in these tests behaves much like UFS with Direct I/O. For InnoDB, the performance figures suggest a new strategy: set the InnoDB buffer pool small and let ZFS manage the buffering. I did not test Falcon because it was not yet in beta when I ran the tests; for Falcon on ZFS one would similarly concentrate on the record cache and shrink the page cache. And although the double buffering problem is clearly visible in the ZFS numbers, even with it ZFS is, in the worst case, no slower than UFS.
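A minimal my.cnf sketch of that strategy; the figure below is only a placeholder taken from the range used in the tests, not a tuned recommendation:
[mysqld]
innodb_buffer_pool_size = 100M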
Obviously, you will want to make sure you are running a recent (late 2007 or later) release of Solaris.
P.S. Once again, I apologize for the quality; the text is long, and polishing it further would take more time than I have already, perhaps unreasonably, spent on it. The translation was made not so much for publication as to dig deeper into the topic myself. I enjoyed it, and I hope it will be useful to someone else. In the near future I plan to run my own tests; if the community is interested, I will publish them.