Abstract: On the new trend, software defined storage, and the congenital trauma of block devices: the expectation of infinite reliability.
A lyrical digression
On the horizon, a new buzzword: software defined <thing>. We already have an established, well-formed ecosystem around everything related to software defined networking (SDN); now it is storage's turn (SDS). Apparently, next we will get software defined computing or something of the sort, and then HP/VMware will suddenly wake up, catch up, and offer a (proprietary) "software defined enterprise", which will mean everything that came before, only more fashionable and relevant.
However, this story is not about buzzwords. Behind each of these odd names (grid, elastic, cloud) lies the further development of the technology: the construction of additional layers of interaction between components, the main motive being a departure from the granularity of a single computer, so that the terminology of the whole field moves away from "interprocess communication" and becomes self-sufficient. We can already see this, more or less as an accomplished fact, in how the web works in the magical world of JavaScript: we do not care in the slightest which servers our tasks run on; all communication happens between the browser (with its intimate details such as the DOM and JS) and an abstraction called a URI, behind which there may be one server or hundreds of different ones.
This kind of interaction looks very tempting, so it is being extended to every other area as far as possible.
Before we get to SDS, let's look at what has already happened: SDN (software defined networking).
In SDN, all network equipment (real hardware or virtual switches on virtualization hosts) acts as a dumb executor, and all the intelligent work of building the actual network is delegated to an application that "understands" what is needed and shapes the network topology accordingly. I am leaving out the names of specific technologies (OpenFlow, Big Switch, Floodlight, Nicira), because the key idea of SDN is that the network configuration is created by software, not by implementation details.
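As a very rough illustration of the idea (purely hypothetical names, not any real controller's API): the desired topology lives in software as plain data, and the controller compiles it into rules that the dumb switches simply apply.

    # A toy sketch of the SDN idea: topology is data owned by software,
    # switches only receive and apply rules. All names are hypothetical.

    DESIRED_TOPOLOGY = {
        # "tenant network" -> (switch, port) pairs that must share one segment
        "tenant-42": [("sw1", 3), ("sw2", 7), ("sw5", 1)],
    }

    def compile_rules(topology):
        """Turn the logical description into per-switch forwarding rules."""
        rules = {}
        for net, endpoints in topology.items():
            for switch, port in endpoints:
                peers = [(s, p) for s, p in endpoints if (s, p) != (switch, port)]
                rules.setdefault(switch, []).append(
                    {"network": net, "local_port": port, "forward_to": peers}
                )
        return rules

    def push(switch, switch_rules):
        """The 'dumb executor' just accepts whatever the controller computed."""
        print(f"{switch}: applying {len(switch_rules)} rule(s)")

    for sw, r in compile_rules(DESIRED_TOPOLOGY).items():
        push(sw, r)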
So what, then, is Software Defined Storage (SDS)? By analogy, it is a storage system in which all the intelligent work of building the storage is delegated to software, while the hardware and the "local software" (at the host level) act as dumb executors.
Probably the most successful and exemplary solution here is OpenStack Swift, which builds stable and scalable blob storage out of dumb disks with XFS on them, from which nothing is required except capacity and a little performance. Everything else is done by software.
But Swift is not quite "storage", it is "object storage". That is, roughly, file storage: without the ability to write into the middle of a file, and certainly not providing tens of thousands of write IOPS with microsecond latencies.
And that is exactly what the public craves. Reliable, cheap, with arbitrary and guaranteed redundancy, fault tolerance, high availability, geo-replication, auto-balancing, self-healing, built from commodity hardware (that is, cheap again), high-performance, with unlimited scaling of performance and capacity as the number of nodes grows, multi-tenant, accountable (at this point the customer, unable to contain the excitement, drops to the carpet kicking his legs). All of this, and served with a spoon, please.
In reality
The SDN-SDS analogy has one small nuance that makes everything difficult. In SDN, the network equipment (the dumb part that just obeys the control center) is required to do one thing: move bytes around. In SDS, the dumb storage devices are required not only to take bytes and carry them to/from the client, but also to store them.
This is where the biggest, most complex and most unpleasant problem lies. We can simply throw out a dying switch. We can even do it programmatically. Nobody will notice anything.
But we cannot just throw out a working "dumb" storage device. Before another one can take over, someone has to go and copy its data onto it.
Yes, yes, it is all about the storing. If we had write-only information burial grounds, implementing them would be trivial. Can't write here? Bring up another node and write there.
But, you see, we also want to read back what was written, and it was written to the node that just died. Oops.
Thus, from the point of view of the I/O process, the SDS model coincides completely with SDN. But the storing itself is a whole new, separate problem, the one known as the CAP theorem. And no solution is in sight there.
So what is to be done? If a problem cannot be solved, the conditions of the problem must be changed.
And here is where it gets interesting: when those at the top cannot and those at the bottom will not, that is the beginning of a revolution, right? Changing the problem means changing the model used to work with block devices. The whole fuss around SDS is, after all, about a file system on a block device, on which you can put a SQL database and work with it very, very fast, reliably, cheaply, consistently (and the customer slips into happy hysterics again...).
The kind TCP and the evil file system
If someone gives you a network that loses 1 packet out of 10,000, you will consider it a perfect network. All network applications, without exception, are prepared for packet loss; problems only begin to appear when losses climb into the tens of percent.
The kind, oh-so-kind TCP forgives almost everything: duplication, loss, jitter (abrupt changes in latency), changes in bandwidth, data corruption inside a packet... If things get really bad, TCP starts working slowly and sluggishly. But it works! Moreover, even if conditions become unbearable even for TCP (say, 70-80% packet loss), most network applications are prepared for a broken connection: they simply reconnect, with no far-reaching consequences.
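For illustration, this is roughly the reconnect loop that almost every network client carries in some form (the host, port, payload and retry policy here are made up for the sketch):

    import socket
    import time

    def send_with_reconnect(host, port, payload, retries=5):
        """Send a payload, reconnecting on failure -- the network client's
        everyday answer to a lossy, jittery, occasionally dead network."""
        for attempt in range(retries):
            try:
                with socket.create_connection((host, port), timeout=5) as s:
                    s.sendall(payload)
                    return s.recv(4096)      # whatever the peer answers
            except OSError:
                # Connection refused, reset, timed out... back off and retry.
                time.sleep(min(2 ** attempt, 30))
        raise RuntimeError(f"gave up after {retries} attempts")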
Now compare this with block devices. What happens if you sell a disk that loses 1 request out of 1,000,000? The evil file system will not forgive it. What if you improve the quality a hundredfold and fail only 1 request out of 100,000,000? The file system will not forgive that either. And it will not just refuse to forgive, it will take revenge in the most terrible way. If the file system detects that even 1 write request in a trillion has failed, it will refuse to work with such a disgraceful block device. At best it will remount itself read-only; at worst it will simply stop working.
And what happens to the application on whose watch the file system pulls such a stunt? Nobody knows. Maybe it just exits. Maybe it starts misbehaving. Or hangs. If the swap file was on that block device, some operating systems will simply panic. Especially if some "important" data was involved (say, a piece of a file buffer being read by cat - and the whole server, with its thousands of clients, goes off to blink the three LEDs on the keyboard).
What, for example, will a DBMS do if, as a result of an error, we change just one block out of a billion (one 4k sector on a 4TB disk)? First, it will not notice. Second, if it does notice (something about the read displeases it), it will declare the database incomplete, subject to apartheid, circumcision and deprivation of civil rights, and pronounce it basa non grata in the system.
In other words, the disk stack is expected to have infinite reliability.
The whole block stack is merciless toward errors. Vendors charge tens and hundreds of millions of rubles for systems that almost never make mistakes. But even their systems do make mistakes - less often than commodity hardware, yes. But who is better off for it, if not even one mistake per quadrillion operations is forgiven (1 bad block per 4 EB written/read, in 4k blocks)?
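The arithmetic behind that quadrillion, for the record:

    # 1 bad 4k block per 4 EB of traffic ~= 1 error per quadrillion operations
    block = 4 * 1024            # 4 KiB block
    traffic = 4 * 1024 ** 6     # 4 EiB written/read
    operations = traffic // block
    print(operations)           # 1_125_899_906_842_624, i.e. ~1.1e15, a quadrillion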
Of course, the obvious answer is to increase reliability. RAIDs, cluster systems, mainframes... We have seen this somewhere before. It turns out to be not just expensive, but prohibitively expensive. If laptops were built with mainframe technology, they would break a thousand times less often and cost a million times more.
Someone in the back is whispering something about RAID. Fine, let's look at what RAID does. A RAID takes several block devices and assembles a new block device out of them, with increased reliability (and, perhaps, performance). At the same time it imposes exactly the same requirements on the quality of the devices beneath it: one error, and the disk is declared bad. Forever and ever. Then comes a rebuild of varying degrees of gracefulness.
The most advanced proprietary solutions do allow drives to make occasional mistakes, and only eject them after a certain threshold is exceeded.
But even so, if some real problem occurs, any error of the RAID itself (for example, an I/O timeout) leads to the whole RAID being declared "bad", with the same consequences for applications using data on the file system on that RAID. In other words, the RAID is required to turn several unreliable devices into... again, infinite reliability (zero probability of failure). Probability theory is indignant.
... And the kind, forgiving TCP looks at lost souls with compassion and love.
What to do?
First, we must admit that perfect things do not exist. If DNA, with billions of years of evolution behind it, has not managed to protect itself from errors, then pinning hopes on a couple of years (or decades) of engineering is, to put it mildly, unreasonable. Errors will happen. And the main thing we need to learn to do with these errors is not to throw a tantrum over every smallest imperfection.
Got an error back? Try to repeat; if the retry fails, return the error further up the stack. The file system quietly goes and writes its metadata somewhere else if it could not write it here (instead of throwing a tantrum the size of the whole server). The DBMS, having received a write (or read) error to (or from) the journal, does not declare the database possessed and does not curse all applications using it unto the seventh generation, but simply pulls in a backup copy; if there is no backup copy, it precisely marks the data as damaged and returns an error or a damage flag. An application working with that database, on receiving such a response, does not do anything stupid but calmly works with what it has, trying to minimize the damage and honestly reporting the extent of the damage to whoever uses the data. And each level fully verifies the correctness of the data coming from the level below, instead of relying on "yes, I managed to read the number pi from the file, its value is 0x0000000000000000".
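A minimal sketch of that attitude (the read_block() stub and the db.locate()/db.mark_damaged() calls are hypothetical, not any real API): retry a few times, and on persistent failure report exactly what is damaged instead of taking the whole volume offline.

    import random

    class DamagedData(Exception):
        """Raised upward instead of remounting everything read-only."""

    def read_block(device, lba):
        """Stand-in for a real low-level read: here it just fails now and then."""
        if random.random() < 0.001:          # simulated rare media error
            raise IOError(f"read error at {device}:{lba}")
        return b"\x00" * 4096

    def resilient_read(device, lba, retries=3):
        """Retry a few times; on persistent failure, report precisely which
        block is damaged and let the caller decide -- no tantrum."""
        for _ in range(retries):
            try:
                return read_block(device, lba)
            except IOError:
                continue
        raise DamagedData(f"block {lba} on {device} is unreadable")

    def read_record(db, key):
        """The next level up: a damaged record is marked, not the end of the DB."""
        try:
            return resilient_read(db.device, db.locate(key))
        except DamagedData:
            db.mark_damaged(key)             # precise bookkeeping of what is lost
            return None                      # the application works with what it has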
"Yes, we have damaged one transaction on your card. Yes, we do not know exactly how much you were charged. But we have intermediate balances, so you can keep using the card, and we will either write the damaged record off to old age or restore it next week." That, instead of "Unknown error. Card transactions are not possible, please contact your bank's card support service."
A bite taken out of a small piece of data should not lead to damage to a larger piece of data. Hebrew mythology describes a case where, because of one bitten apple, all of humanity was written off, the whole of paradise was dismantled, the snake was stripped of its legs - in short, everyone behaved exactly the way a modern file system behaves when it discovers a bitten hard disk. As far as I know, that event is regarded as a tragic mistake. Let's not do that again. An apple got bitten? Throw away the apple, and nothing more.
Thus, the main change SDS should bring is a change in the attitude toward block device errors: 1% of failed disk operations should count as a not-great but tolerable figure, and 0.01% as excellent service.
Under such conditions it becomes possible to build services without expecting infinite reliability: reasonable expectations for reasonable money.
Block devices of the future
So what does the software defined storage of the future look like? If we allow ourselves to make mistakes occasionally, then our task is not to prevent them entirely but to reduce their number.
For example, we can parallelize operations heavily. If 1000 nodes are responsible for storing data, the failure of one or two of them means only 0.1% or 0.2% of read/write errors for us. We do not need to bother with guaranteed synchronous replication. Sure, "a node died, we dropped it from service and added a new one". In principle this is not a great situation (if a couple more fall over, we creep up to 0.4% losses, which degrades the quality of the storage). But we can restore the node from a backup. Yes, some of its data will be a day stale, and for part of the data we will lie shamelessly (return something other than what was written). But the layer above is ready for that, right? And since only 2-3% of the node's data has changed since the backup, instead of 0.1% of failed reads (and nearly 0% of failed writes, because we simply write to other nodes) we get about 0.002% of stale data on reads.
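The back-of-the-envelope arithmetic from the paragraph above, spelled out:

    nodes = 1000
    reads_hitting_restored_node = 1 / nodes     # 0.1% of all reads land on it
    stale_fraction_on_that_node = 0.02          # ~2% of its data changed since backup
    stale_reads = reads_hitting_restored_node * stale_fraction_on_that_node
    print(f"{stale_reads:.5%}")                 # 0.00200% of reads return stale data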
0.002% is, after all, 99.998% reliability. A dream? If you are prepared to accept it - yes.
And the resulting construction turns out to be remarkably simple: a Swift-like storage system for blocks, spread across a heap of servers and a heap of disks. Without strict requirements for mandatory data integrity: if we occasionally return stale data, that is merely "garbage on read", and as long as we do not do it too often, everyone is happy. We can "lose" a client's request at any moment and be sure it will be resent if needed. We can work not in the heroic revolutionary mode of "storage arrays should be made of these people: there would be no tougher arrays in the world", but in a comfortable mode, where ordinary diligence compensates, most of the time, for the rare mistakes.
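To make that concrete, here is a toy sketch of such a "Swift-like storage for blocks" (all names, the node list, the replica count and the store() call are hypothetical illustration, not any particular system's API): block addresses are hashed onto a few of many nodes, writes go to whichever replicas are alive, and a missing node costs a small percentage of traffic rather than being a fatal event.

    import hashlib

    NODES = [f"node{i:03d}" for i in range(1000)]   # a heap of servers and disks
    REPLICAS = 3

    def placement(volume_id, block_no):
        """Pick a few nodes for a block by hashing its address -- the software
        decides where data lives, the nodes just store bytes."""
        key = f"{volume_id}:{block_no}".encode()
        h = int.from_bytes(hashlib.sha256(key).digest(), "big")
        return [NODES[(h + i) % len(NODES)] for i in range(REPLICAS)]

    def store(node, volume_id, block_no, data):
        """Stand-in for the dumb node-level write."""
        print(f"{node}: stored {volume_id}/{block_no} ({len(data)} bytes)")

    def write_block(volume_id, block_no, data, alive):
        """Write to whichever replicas are alive; one dead node is 0.1% of the
        cluster, not a reason to declare the whole volume dead."""
        targets = [n for n in placement(volume_id, block_no) if n in alive]
        if not targets:
            raise IOError("no live replica for this block right now")
        for n in targets:
            store(n, volume_id, block_no, data)
        return len(targets)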
And where is the SDS?
In everything above there was not a word about SDS. So where is the "software defined" part?
In the scheme described above, the executor nodes only do what the software tells them to. The software, in turn, builds the description of what to read from where and what to write where. In principle, most of it already exists: previous-generation cluster file systems, Ceph, perhaps a BTRFS slightly over-grown to the network level, maybe Elliptics riding to the rescue - it is practically ready. What remains is to write proper multi-tenancy and the translation from the client's logical topology into commands for the "dumb hardware" (the analogue of the SDN controller) - and everything is in place.
Summary
The main conclusion: the key problem in the development of block devices today is the excessively high (infinite) expectation of block device reliability, together with the established bad tradition of inflating block device errors, blowing the damage domain up to the size of the entire problem domain (and sometimes beyond it). Giving up the demand for 100% reliability always and everywhere would make it possible, with far less effort (that is, at lower cost), to create the conditions for building (or even just applying existing) SDS solutions.