Enterprise-class solid-state drives (SSDs) have emerged as a new storage technology and are already used in the design of high-performance systems. SSDs are much faster than hard disk drives (HDDs), but they are also more expensive. For now their role is to work in hybrid storage environments (SSDs and hard drives together), since their cost keeps most companies from switching to them entirely. So we are dealing with a multi-tier storage environment, where SSDs are the fastest and most expensive devices, while high-capacity hard drives are the slowest and cheapest. In this article the terms "hybrid" and "multi-tier storage" are used interchangeably.
The key to using solid-state drives successfully in a hybrid storage environment is software that automatically manages data placement and ensures that the most frequently accessed data ends up on the fast SSDs.
Teradata Virtual Storage (TVS) was developed by Teradata Labs specifically to manage such a hybrid storage environment. In this article, we will look at TVS from the point of view of its development goals and describe how it works.
This article is based on my talk at the Teradata Forum 2012 conference, held in Moscow in late November.
The efficiency of hybrid storage systems rests on the idea that different data is accessed with different frequency. For example, daily reporting and operational analysis need data for the last day or week, while data older than three years is not touched every day, and then only for one or two indicators. Access frequency is conveniently described as temperature, in the sense familiar from school physics, except that what "moves" here are data blocks rather than molecules. Data that sits as dead weight we call cold, and the most heavily used data hot.
By temperature-based data management we mean the ability to prioritize the allocation of system resources according to business rules, making better use of the capabilities of different storage devices while data volumes keep growing.
The amount of data also matters. If you look at how much of the data in a system is hot and how much is cold, you get a characteristic that is unique to a particular system at a particular point in time. It turns out that access frequency and data volume do not correlate with each other, and the combination of their averages (over some period) is precisely the characteristic of the system that we analyze.
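To make the idea concrete, here is a small illustration with made-up numbers (not measurements from any real system): a small hot core of blocks can attract most of the I/O while occupying only a small share of the capacity.

```python
from collections import namedtuple

# Illustrative only: made-up access statistics showing that access frequency
# and data volume need not correlate.
Block = namedtuple("Block", ["size_gb", "accesses"])

blocks = ([Block(size_gb=1, accesses=500) for _ in range(100)]    # small hot core
          + [Block(size_gb=1, accesses=2) for _ in range(900)])   # long cold tail

total_io = sum(b.accesses for b in blocks)
total_gb = sum(b.size_gb for b in blocks)

# Walk from hottest to coldest and see how much capacity covers 80% of all I/O.
covered_io = covered_gb = 0
for b in sorted(blocks, key=lambda blk: blk.accesses, reverse=True):
    covered_io += b.accesses
    covered_gb += b.size_gb
    if covered_io >= 0.8 * total_io:
        break

print(f"{covered_gb / total_gb:.0%} of the capacity attracts 80% of the I/O")
# With these made-up numbers: about 8% of the capacity attracts 80% of the I/O.
```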
When talking about data volume, one should also keep in mind that over the past ten years processor power and I/O subsystem throughput have grown at very different rates. Comparing them directly is not the goal of this article, but the ratio of processor seconds to I/O operations changes every year, and not in the HDD's favor.
HDD-only system configurations have evolved (and continue to evolve) as follows:
- the amount of data per node keeps growing;
- the growth comes in large increments (73 GB - 146 GB - 300 GB ...);
- as the number of disks grows, so does the rack space occupied in the data center;
- CPU performance per terabyte of data falls.
The steady growth of CPU performance has required more and more disks to keep processors and disks evenly loaded. More disks means more data. There is another option: few disks and a lot of data. But when there are many users (read: concurrent queries) in the system, a small number of disks becomes a bottleneck. After all, HDD access speed has barely changed over the past decade, and random access speed is the key characteristic of a disk subsystem under a multi-session workload.
This kind of growth suits companies and systems with low CPU requirements per terabyte. Everyone else has to accept the excess capacity of hard drives (you can no longer find an enterprise HDD smaller than 300 GB, which for Teradata means a minimum of 14 TB per node).
The arrival of SSDs on the market has changed the situation. What makes an SSD different is well known: it is flash memory behind an interface that is physically and electrically identical to an HDD's. Enterprise-class SSDs also have their own distinguishing features:
- single-level cell (SLC) technology, in which each cell stores only one bit of information;
- chip selection based on factory testing;
- error correction algorithms;
- embedded controllers with firmware for performance optimization and reliability management.
Teradata spent many years preparing for this change in storage devices and developed a virtualization layer that sits between the logical Teradata File System and the physical devices. The result was named Teradata Virtual Storage.
The key design goals were:
- Automation. Teradata Labs did not want to create new work for DBAs.
- Distribution of I/O across the devices of a hybrid storage system.
- Performance. We always keep it in mind, in whatever sense the term is understood.
Automatic operation, as a fundamental requirement, was meant to eliminate manual intervention in placing and migrating the hottest data to the fastest devices and vice versa. A DBA has plenty of other work to do besides tracking whether the users' workload profile matches how the data is laid out across the disks.
Optimal load distribution depends on many factors: the total amount of data in the system, the capacity of each type of media, the temperature of the data being accessed, the current level of activity in the system, and how quickly access patterns change. All of these change dynamically, and it makes no sense to assume that all I/O will be served by the fastest media (unless all of the system's data fits on them).
The TVS philosophy is to achieve a balanced distribution of I/O: where data is stored must match its temperature. The goal was defined as follows: at least 80% of I/O operations should be served by the fastest devices, with the rest coming from the slower tiers. Knowing in advance that the slow tiers will be hit as well, you have to choose the SSD-to-HDD ratio carefully so that the hard drives can handle the remaining, potential, 20%.
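A back-of-envelope sizing sketch of that trade-off, with purely illustrative IOPS figures rather than Teradata specifications:

```python
# Back-of-envelope sizing sketch for the 80/20 I/O split described above.
# All numbers here are illustrative assumptions, not Teradata figures.

target_system_iops = 50_000        # assumed total random-I/O demand
hdd_share = 0.20                   # the "remaining" I/O the slower tier must absorb
iops_per_hdd = 150                 # rough random-I/O capability of one enterprise HDD

hdd_iops_needed = target_system_iops * hdd_share
hdds_needed = -(-hdd_iops_needed // iops_per_hdd)   # ceiling division

print(f"HDD tier must sustain {hdd_iops_needed:.0f} IOPS "
      f"=> at least {hdds_needed:.0f} drives, before RAID overhead")
```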
By keeping hot data on fast SSDs, we shift most of the load away from the HDDs onto them. The result is a better and more tightly distributed response time from both. As soon as activity on an HDD rises, its queue grows and access times slow down or become erratic. This is especially noticeable when a short tactical query that needs a single I/O operation ends up queued behind large analytical queries. By moving the hot data to SSDs we shorten disk queues and keep response times good.
At the same time, disks and disk arrays differ in their performance characteristics. Teradata Labs has done a great deal of work profiling different arrays from different vendors, of different generations and different types (SSD/HDD). The resulting database is used by TVS at configuration time to determine how fast a particular device, or a particular region within it, is.
In particular, fast and slow zones are identified within each HDD. On a hard disk, the lower Logical Block Addresses (LBA) correspond to the outer tracks and therefore to the fastest response times. When a disk is split into partitions, each partition gets its own speed estimate, and hence its own position on the overall device-speed scale.
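The idea can be sketched roughly as follows; the zone boundaries and the grading by LBA midpoint are illustrative assumptions, not the actual TVS profiling logic.

```python
# Illustrative sketch (not TVS code): grade partitions of a disk by where
# their LBA range lies, since lower LBAs sit on the faster outer tracks.

def zone_for_partition(start_lba: int, end_lba: int, disk_max_lba: int) -> str:
    """Classify a partition as fast/medium/slow by the midpoint of its LBA range."""
    midpoint = (start_lba + end_lba) / 2 / disk_max_lba  # 0.0 = outer edge, 1.0 = inner edge
    if midpoint < 0.33:
        return "fast"
    elif midpoint < 0.66:
        return "medium"
    return "slow"

disk_max_lba = 585_937_500   # ~300 GB worth of 512-byte sectors
print(zone_for_partition(0, 195_000_000, disk_max_lba))            # fast
print(zone_for_partition(400_000_000, 585_937_500, disk_max_lba))  # slow
```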
Data in a Teradata database is distributed (ideally evenly) across virtual processors called AMPs. Each AMP stores its own portion of a user object (table), and only that AMP can access it. At the Teradata File System level, individual rows are grouped into data blocks, and those in turn into cylinders (groups of data blocks laid out contiguously on disk). A data block contains sorted rows of a single table. Blocks vary in size, with a maximum of 127.5 KB, i.e. 255 sectors of 512 bytes each. In a mature system, data blocks average about 96 KB. From the file system's point of view, a table can be thought of as the collection of its data blocks across all AMPs.
Teradata Virtual Storage distinguishes three storage speed tiers: fast, medium and slow. The ranking takes several factors into account: device type, rotation speed (for HDDs), and physical location (inner or outer HDD tracks). TVS sets aside part of the fast-device space, called the soft reserve, for critical database objects:
- Spool - temporary tables holding intermediate and final results of query execution.
- WAL (Write Ahead Log) - transaction journal records, as well as write-ahead log entries used to protect against losing changes to data blocks being modified in memory.
- DEPOT - a small data area used to protect, against a disk array failure, those data blocks that are updated "in place", when new data is written directly over existing data.
The speed of working with these objects is critical for the overall performance of the system, so we always keep them on high-speed media.
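As a sketch of this placement policy, assuming a heavily simplified model of the tiers (the function and its rules are illustrative, not TVS internals):

```python
# Simplified placement sketch (illustrative only): critical objects go to the
# soft reserve carved out of the fast tier, user data to the tier matching its
# temperature.

CRITICAL_OBJECTS = {"SPOOL", "WAL", "DEPOT"}

def choose_tier(object_type: str, temperature: str) -> str:
    if object_type in CRITICAL_OBJECTS:
        return "fast (soft reserve)"
    return {"hot": "fast", "warm": "medium", "cold": "slow"}[temperature]

print(choose_tier("WAL", temperature="cold"))    # fast (soft reserve)
print(choose_tier("TABLE", temperature="warm"))  # medium
```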
The boundaries between soft reserve, fast, medium, and slow storage areas are redefined for each platform and release of the DBMS.
During configuration, devices and device regions of different speeds are distributed evenly among the AMPs, so that each one gets equivalent performance. During operation, TVS manages how data is distributed across these devices; in other words, performance gains are achieved by redistributing data.
Managing a hybrid storage environment by hand would be a cumbersome, time-consuming job requiring constant monitoring and administrative action. With TVS we are dealing with a fully automated service.
As already mentioned, Teradata Virtual Storage distributes data automatically according to its temperature. TVS monitors how often each data block is accessed and stores the result as temperature aggregated at the cylinder level (i.e. all blocks of a cylinder always share the same temperature). Temperature here is a relative value within the system.
TVS periodically lowers the temperature of every cylinder in the system. When a cylinder is first written, its temperature is set to the system average; over time old data is accessed less often than new data, and this is captured by an aging process: each time it runs, the temperature of all cylinders in the system is reduced uniformly.
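A minimal sketch of this bookkeeping, assuming a simple numeric temperature scale and an arbitrary decay factor (TVS's real scale and aging schedule are not described here):

```python
# Illustrative temperature bookkeeping: new cylinders start at the system
# average, accesses raise the temperature, periodic aging cools everything
# uniformly. Scale and decay factor are arbitrary.

class TemperatureTracker:
    def __init__(self):
        self.temps = {}                      # cylinder id -> relative temperature

    def system_average(self) -> float:
        return sum(self.temps.values()) / len(self.temps) if self.temps else 1.0

    def new_cylinder(self, cyl_id: str):
        self.temps[cyl_id] = self.system_average()

    def record_access(self, cyl_id: str, weight: float = 1.0):
        self.temps[cyl_id] += weight

    def age(self, decay: float = 0.9):
        # Uniform cooling keeps the temperature a purely relative value.
        for cyl_id in self.temps:
            self.temps[cyl_id] *= decay

tracker = TemperatureTracker()
tracker.new_cylinder("cyl-1")
tracker.new_cylinder("cyl-2")
tracker.record_access("cyl-1", weight=5)     # cyl-1 is being used, cyl-2 is not
tracker.age()
print(tracker.temps)                         # {'cyl-1': 5.4, 'cyl-2': 0.9}
```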
But that is only part of the job of the module called the Migrator. As you might guess, the Migrator uses a cylinder's temperature to decide whether it should be moved to a device of a different speed. On each AMP, the Migrator maintains an ordered queue of "misplaced" cylinders. Those holding hot data that sits on slow media go to the top of the queue; cold data that has ended up on fast media goes to the bottom. Moving the latter brings no immediate benefit, so only when the fast media start running out of space does the Migrator begin clearing the cold data off them.
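Such a queue can be sketched with a priority heap; the misplacement score below is an assumption for illustration, and the real ordering rules inside TVS are certainly more involved:

```python
import heapq

# Ordered queue of misplaced cylinders: hot data on slow media comes out first,
# cold data on fast media last. The scoring is an assumption for illustration.

TIER_SPEED = {"fast": 2, "medium": 1, "slow": 0}

def misplacement_score(temperature: float, tier: str) -> int:
    # Positive: too hot for its tier (urgent). Negative: too cold for its tier.
    desired = 2 if temperature > 1.5 else 1 if temperature > 0.5 else 0
    return desired - TIER_SPEED[tier]

def build_queue(cylinders):
    heap = []
    for cyl_id, temp, tier in cylinders:
        score = misplacement_score(temp, tier)
        if score != 0:                               # correctly placed cylinders are skipped
            heapq.heappush(heap, (-score, cyl_id))   # max-heap via negation
    return heap

queue = build_queue([
    ("cyl-1", 3.0, "slow"),    # hot on slow media -> front of the queue
    ("cyl-2", 0.1, "fast"),    # cold on fast media -> back of the queue
    ("cyl-3", 1.0, "medium"),  # warm where it belongs -> not queued
])
while queue:
    print(heapq.heappop(queue)[1])   # cyl-1, then cyl-2
```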
For example, at some initial point in time, blocks of different temperatures sit on arbitrary media. Gradually, as the Migrator works, the picture changes: hot blocks migrate to fast media, cold blocks to slow media, and warm blocks to media of medium performance.
TVS processes run on every node. The Migrator picks cylinders from the queue and, on average, migrates one cylinder per AMP every 5 minutes, with no more than two migrations running in parallel on a node. This continues until the queue is empty. At that rate, roughly 10% of all occupied cylinders in the system can migrate in a week. The load this places on the system, about 2% of CPU and I/O, seems a more than acceptable price for a well-tuned placement of data across devices.
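A quick back-of-envelope check of that rate; the number of occupied cylinders per AMP is a made-up figure used only to relate the quoted numbers:

```python
# Back-of-envelope check of the background migration rate quoted above.
# The number of occupied cylinders per AMP is an assumed figure for illustration.

minutes_per_week = 7 * 24 * 60
migrations_per_amp_per_week = minutes_per_week // 5       # one cylinder per AMP every 5 minutes
print(migrations_per_amp_per_week)                        # 2016 cylinders per AMP per week

occupied_cylinders_per_amp = 20_000                       # assumption
share_migrated = migrations_per_amp_per_week / occupied_cylinders_per_amp
print(f"{share_migrated:.0%} of cylinders migrated per week")   # ~10%
```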
While a cylinder is migrating it cannot be written to, and the reverse also holds: a cylinder that is being written to cannot migrate.
TVS also supports another migration mode called optimize. In it the Migrator uses all available system resources to move data as fast as possible, and the same 10% of data can migrate in about 8 hours. Naturally, this is not something to run on a production system under normal load, but it is sometimes useful.
When the system is more than 95% full, the Migrator stops working.
In addition to the two modes described above, there is one more: asynchronous migration. It kicks in even less often than regular migration, namely when free space in the soft reserve, or on fast or medium-speed media, drops to 10% of capacity. Its purpose is to free space on the fast media for fresh hot data. In this mode the Migrator works until the 10% threshold is cleared or until there are no more "misplaced" cylinders; that is, hot data never leaves the fast media, and warm data never leaves the medium tier.
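The thresholds mentioned above can be combined into a simple mode decision; the structure below is inferred from the description in the text, not taken from TVS internals:

```python
# Illustrative mode selection based on the thresholds described in the text:
# stop above 95% system fill, switch to asynchronous migration when the soft
# reserve or a fast/medium tier falls below 10% free space.

def migration_mode(system_fill: float, free_space_by_tier: dict) -> str:
    if system_fill > 0.95:
        return "stopped"
    if any(free < 0.10 for tier, free in free_space_by_tier.items()
           if tier in ("soft reserve", "fast", "medium")):
        return "asynchronous"          # evict cold data from the crowded fast tiers
    return "background"                # regular slow-paced migration

print(migration_mode(0.70, {"soft reserve": 0.25, "fast": 0.08, "medium": 0.40}))  # asynchronous
print(migration_mode(0.97, {"soft reserve": 0.25, "fast": 0.30, "medium": 0.40}))  # stopped
```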
One of the more interesting TVS controls is the so-called initial data temperature. By default it is set in the system parameters, but using the query band session property (a text string) with the keyword TVSTemperature, you can override that setting for the data loaded by a session.
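In SQL terms this comes down to a SET QUERY_BAND statement issued before the load. The sketch below simply builds that statement as a string; the value HOT is an assumption for illustration, since the accepted temperature values depend on the DBMS release.

```python
# Sketch: set the TVSTemperature query band for a load session before running
# the INSERT/LOAD. The temperature value "HOT" is an assumption for illustration;
# check the DBMS documentation for the accepted values in your release.

def tvs_temperature_query_band(temperature: str) -> str:
    return f"SET QUERY_BAND = 'TVSTemperature={temperature};' FOR SESSION;"

print(tvs_temperature_query_band("HOT"))
# SET QUERY_BAND = 'TVSTemperature=HOT;' FOR SESSION;
```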
As an example, let's look at the configuration options of one of the high-end systems Teradata ships, the Active EDW 6690. Its disk-related characteristics are as follows (a rough capacity check follows the list):
- 15-20 SSDs of 400 GB each;
- hot data volume: 1.6-2.2 TB;
- 60-160 HDDs of 300 or 600 GB;
- warm and cold data volume: 4.9-26 TB;
- up to 174 disks per node.
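A rough capacity check of those figures; the interpretation of the gap between raw SSD space and hot user data (data protection, spool, soft reserve) is an assumption, not an official breakdown:

```python
# Rough capacity check for the 6690 figures above (illustrative only).

ssd_size_tb = 0.4
for n in (15, 20):
    raw_tb = n * ssd_size_tb
    print(f"{n} SSDs -> {raw_tb:.1f} TB raw SSD capacity")
# 6.0-8.0 TB of raw SSD space against 1.6-2.2 TB of hot user data: the gap is
# presumably taken up by data protection overhead, spool and the soft reserve,
# so the quoted ratio looks plausible.
```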
Only 2.5” drives are used, and the SSDs and HDDs sit in the same disk array. Moreover, you do not have to install all the drives at once: more can be added later.
When a system is just starting its life, the distribution of data by temperature cannot be predicted. For systems already in operation, however, Teradata offers to measure the actual temperature of the data. Such an assessment determines the optimal amount of data per node and the recommended share of SSDs in it.
For this assessment a data collection tool called IOCNT is used. It is installed on several nodes of the customer's existing system and gathers information on the ratio of hot to cold data. To reduce measurement error, the utility connects to several AMPs and aggregates actual data-access activity over selected periods of time.
As already noted, every system has its own temperature distribution. Still, two typical cases can be distinguished, when the average data temperature in the system is high or low, and balanced configuration options can be suggested for each.
If the average temperature of the data is high, the capacity of the fast media should be comparable to that of the slow media. Some of the system's HDDs are replaced with SSDs; such a configuration holds fewer terabytes per node and perhaps fewer disks, but the available processor power per terabyte of data is high.
In the opposite case, when a large share of the data is rarely used, we complement high-capacity HDDs with a small number of SSDs to speed up I/O for core operations and critical objects. The processor power per terabyte of data then stays practically at the baseline level.
Finally, here is an example screenshot of the TVS Monitor portlet from the Teradata Viewpoint administration tool. The portlet tracks data temperature statistics and the utilization of devices of different speeds. An administrator can see how the data is distributed right now, judge the current need for migration, and also look at how the distribution has changed over time.
