A day in the life of a hardened admin, or a story about how to tame a storage system

Today we will talk about the heroic everyday life of administrators and data storage systems. In this article we tell two real stories of storage deployments and share our experience in implementing and operating storage solutions. The names of the participants are, of course, fictitious.

Story 1. How the admin was tempered


Another hard day began for administrator Petya. In the evening a new batch of equipment arrived, including the storage system, while users were already grumbling about when they would finally get their new storage resources. So the system administrator, despite the weather and the end of the working day, is already running to his data center (or server room, whichever you have). His main goal is waiting there: the storage system he has read so much about on the manufacturer's website and whose brochures he has practically memorized. After all, he was the one who defended the purchase before his CIO, with a thousand arguments “for” and only a few “against”, and now the moment has come: happiness is very close.

Long-awaited meeting


Entering the server room, he finds the long-awaited box with the storage system. A moment later the system is already gleaming, its logo shimmering cheerfully in the glow of the LED lighting strip. The admin knows the entire TIA/EIA-569 standard and, of course, has already made sure that the server room floor will bear this "beast", because the storage system weighs no more and no less than a young forest elephant; just try lifting it. Petya dreams of seeing the new unallocated space in the storage system's web console, but the question remains: how to connect it to the existing system and how to perform the upgrade?

A new hero takes the stage


Petya is calm: service engineer Kolya from the storage vendor is coming to his rescue. He is the one who took the important storage maintenance courses (or did not; there is a first time for everything). He carries the secret documentation, which describes, and sometimes even shows, how to set the many jumpers and connect the cables for this upgrade. Having first run to the nearest market and replaced the socket from a two-phase connection to a three-phase one, as the documentation requires, and having then successfully connected the storage system to power, Petya and Kolya switch on the new system with bated breath. Suddenly they notice that the main system goes into Recovery mode, which means a serious system failure. Then Yuri Petrovich, Petya's boss, calls and, on hearing that everything is going according to plan, departs for another dimension, with a 100% guarantee of returning to his usual state tomorrow at 8:00.

Chipmunks rush to the rescue


Kolya tells Petya that no situation is hopeless: there is round-the-clock support, and he will call it right now, in the middle of the night, and ask what to do, since the documentation states in large red letters: “CONTACT SUPPORT, FURTHER ACTIONS MAY CAUSE SERIOUS DAMAGE TO YOUR HARDWARE.” Having reached support only from the security checkpoint, since there is no mobile reception anywhere else, Kolya hears a pleasant voice speaking English, which promises to help immediately, provided that Kolya sends several megabytes of debug information by e-mail. Having written down the e-mail address of his colleague from sunny India, Kolya waits for the message to leave the Outbox, willing it with all his might to reach the addressee. Petya wastes no time and checks how his systems are doing, in case disks or something else have dropped out. After receiving the reply, Kolya starts retracing his actions and discovers that the upgrade procedure in the documentation politely recommended plugging the cables into different connectors on the storage system, accompanied by the message: “WARNING! CABLE PLUGGED INCORRECTLY MAY CAUSE SERIOUS DAMAGE TO YOUR HARDWARE.”

“Eureka!” exclaims Kolya.

“How can that be?” Petya retorts. “My storage system is like a Boeing aircraft. Surely you could put a sticker next to the correct connector, so that nobody mixes them up when connecting.”

The situation soon changes: Kolya finishes his work, and the storage system returns to normal mode. All systems work as they should, and the UPS was not even overloaded, since Petya had calculated everything in advance when planning the power consumption of the new equipment.
And the long-awaited moment arrives. At 8:00, at the scheduled meeting, Petya reports that the storage system is ready for use.

Learning from mistakes?


The story told at the beginning of this article is taken from the real experience of a large company (names are fictitious) that purchased a storage system. In fact, there are many such stories, and we have plenty to tell, if only to show how our compatriots manage to solve complex problems. After all, a foreign engineer would often have no idea how to solve a complex problem without clear, step-by-step instructions from remote technical support.

Let's try to list the problems. Here are some of them:

  1. The human factor: it turned out that the company has only one experienced administrator.
  2. Unprepared engineering infrastructure: the power supply was not ready, because the three-phase connector required by the storage system was missing.
  3. Lack of qualifications: the service engineer from the storage vendor was not sufficiently qualified.
  4. Ineffective technical support: the remote support chain proved very cumbersome in practice.
  5. The “last mile”: the process of assembling the storage system is complicated and poorly documented, which led to a critical engineer error that, according to the warning in the documentation, could have brought the storage system down.

Do we really have no competent service engineers?


Many will answer that there is no such problem, since any self-respecting storage buyer will always train its staff and pay for the vendor's courses. Objectively, though, not every problem can be solved simply by sending an administrator to the vendor's courses. There are several reasons for this:

  1. The manufacturer's training courses do not teach how to commission storage systems.
  2. The goal of the courses is to instill the skills of a “user” who must not break the system and only has to perform the simple duties of keeping the storage system in working order.
  3. The courses motivate administrators to grow within this vendor's storage line and build a community of the vendor's fans, but they do not form an objective view of storage systems, i.e. they do not provide basic knowledge of storage principles and internal structure.

In general, almost no vendor training course aims to produce an autonomous flying “Karlsson” with a jet engine and a wrench, who can come to the rescue at any moment and replace the failed engine of an aircraft in flight.

And what does a storage vendor want?


The vendor has a very specific goal: to shape a clear understanding of its product, to offer its own definitions of numerous terms, such as RAID terminology and many other features, and, most importantly, to unobtrusively create the impression that it is indispensable.

Indeed, in practice any Russian system integrator depends completely on the storage vendor, and this must be said honestly. Even the presence of service centers in our country does not change the situation. The reason is simple: the storage technology of foreign vendors is developed abroad, not in Russia. As a result, the real expertise of the global vendors is not available locally. We simply lack engineers who can develop storage system software or diagnose and repair complex storage components (controllers, disks).

The situation is changing with the emergence of small companies that are starting to produce their own storage systems, and ours is not the only example today. We want to tell the community that we can design our own storage systems with the current conditions of the Russian market in mind, and this is what we do.

Lack of qualifications: what do we end up with in practice?


One administrator took courses from vendor A, another from vendor B. They decided to discuss how RAID 60 works and how many disks it needs, and they could not agree. And when it came to disk configuration, each defended his own system and vendor.
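The disk-count argument can be settled with a little arithmetic. Below is a minimal Python sketch, assuming the textbook definition of RAID 60 (a RAID 0 stripe across two or more RAID 6 groups, each with at least four disks and two disks' worth of parity). The class and field names are our own illustration, and a particular vendor may impose different limits.

```python
from dataclasses import dataclass

RAID6_PARITY_DISKS = 2   # RAID 6 spends two disks' worth of capacity on parity
RAID6_MIN_DISKS = 4      # smallest RAID 6 group: 2 data + 2 parity
RAID60_MIN_GROUPS = 2    # RAID 0 striping needs at least two members


@dataclass(frozen=True)
class Raid60Layout:
    """Hypothetical description of a RAID 60 layout (illustration only)."""
    groups: int            # number of RAID 6 groups striped together
    disks_per_group: int   # disks in each RAID 6 group
    disk_size_tb: float    # capacity of one disk, decimal terabytes

    def validate(self) -> None:
        if self.groups < RAID60_MIN_GROUPS:
            raise ValueError("RAID 60 needs at least 2 RAID 6 groups")
        if self.disks_per_group < RAID6_MIN_DISKS:
            raise ValueError("each RAID 6 group needs at least 4 disks")

    @property
    def total_disks(self) -> int:
        return self.groups * self.disks_per_group

    @property
    def usable_tb(self) -> float:
        # Every group loses two disks to parity; the rest holds data.
        data_disks = self.groups * (self.disks_per_group - RAID6_PARITY_DISKS)
        return data_disks * self.disk_size_tb


smallest = Raid60Layout(groups=2, disks_per_group=4, disk_size_tb=4.0)
smallest.validate()
print(smallest.total_disks, smallest.usable_tb)
```

Under these assumptions the smallest layout uses eight disks, half of which hold data. Larger groups improve the data-to-parity ratio at the cost of longer rebuilds.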

Different vendors present information about how a storage system is built in such a way that the consumer can understand the functional purpose of a particular part of the system, but cannot understand the operating principles of this complex system as a whole.

Consider one simple practical example.


Using a “long-range” optical SFP transceiver requires the administrator to know which operating wavelengths exist and, accordingly, which types of fiber-optic cable the transceiver supports. A simple mistake in choosing the optical cable can cost a lot of time spent hunting for the cause of storage performance problems and contacting the vendor's technical support, when the real reason lies on the surface.
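A check like this is easy to automate. The Python sketch below is a deliberately simplified illustration: it assumes the common rule of thumb that 850 nm short-wave optics are designed for multimode fiber (OM grades), while 1310 nm and 1550 nm long-wave optics are designed for single-mode fiber (OS grades). The table and function names are ours; distance limits and special cases such as mode-conditioning patch cords are ignored, so the transceiver datasheet remains the final authority.

```python
# Simplified, assumed mapping: nominal wavelength (nm) -> fiber family
# the optics are designed for. Real datasheets take precedence.
WAVELENGTH_FIBER = {
    850: "multimode",     # short-wave optics
    1310: "single-mode",  # long-wave optics
    1550: "single-mode",  # extended-reach optics
}

# Cable grade printed on the jacket -> fiber family.
CABLE_FIBER = {
    "OM1": "multimode", "OM2": "multimode",
    "OM3": "multimode", "OM4": "multimode",
    "OS1": "single-mode", "OS2": "single-mode",
}


def is_compatible(wavelength_nm: int, cable_grade: str) -> bool:
    """Return True if the cable family matches what the optics expect."""
    try:
        needed = WAVELENGTH_FIBER[wavelength_nm]
        actual = CABLE_FIBER[cable_grade.upper()]
    except KeyError as exc:
        raise ValueError(f"unknown wavelength or cable grade: {exc}") from None
    return needed == actual


# A long-wave transceiver plugged into a multimode patch cord:
print(is_compatible(1310, "OM3"))
```

Run against an inventory of ports and patch cords, a check like this flags the mismatch before anyone opens a support ticket.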

Thus, in addition to the usual skills of configuring storage systems, basic engineering knowledge of data transmission standards and protocols is required, which, unfortunately, is not fully taught in vendor courses. The reason is trivial: vendors cannot disclose how the components of their storage systems are built, because some of the characteristics declared in marketing materials might turn out to be unconfirmed.
In our opinion, many of the problems encountered when servicing storage systems can and should be detected and diagnosed automatically.

How can you protect against such problems?


In our opinion, such problems can only be solved with new knowledge and experience.

That is why we have developed an SDK that gives low-level control over input/output operations at the level of SCSI commands. Using Broadcom's Fibre Channel adapters, we obtain all the necessary information about link-level connectivity. Our SDK implements almost all commands of the SCSI SPC-3 standard. With it, you can emulate SCSI devices (disk, VTL) and analyze problem areas in a SAN.
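To give a feel for what “the level of SCSI commands” means, here is a minimal Python sketch that builds the 6-byte INQUIRY command descriptor block (opcode 12h) as laid out in SPC-3 and decodes the fixed fields of a standard INQUIRY response. This is not our SDK and does not reflect its API: the function names are hypothetical, the sample response is made up, and the code does not send anything to a device.

```python
import struct

INQUIRY_OPCODE = 0x12  # SPC-3 INQUIRY


def build_inquiry_cdb(allocation_length: int = 96,
                      evpd: bool = False, page_code: int = 0) -> bytes:
    """Build the 6-byte INQUIRY CDB (illustrative helper, not an SDK call)."""
    if not 0 <= allocation_length <= 0xFFFF:
        raise ValueError("allocation length must fit in 16 bits")
    if not 0 <= page_code <= 0xFF:
        raise ValueError("page code must fit in 8 bits")
    if not evpd and page_code != 0:
        raise ValueError("page code must be 0 when EVPD is not set")
    # Layout: opcode, EVPD bit, page code, allocation length (big-endian), control
    return struct.pack(">BBBHB", INQUIRY_OPCODE,
                       0x01 if evpd else 0x00, page_code,
                       allocation_length, 0x00)


def parse_standard_inquiry(data: bytes) -> dict:
    """Decode the fixed fields of a standard INQUIRY response."""
    if len(data) < 36:
        raise ValueError("standard INQUIRY data is at least 36 bytes")
    return {
        "device_type": data[0] & 0x1F,                   # 0x00 = direct-access block device
        "vendor": data[8:16].decode("ascii", "replace").strip(),
        "product": data[16:32].decode("ascii", "replace").strip(),
        "revision": data[32:36].decode("ascii", "replace").strip(),
    }


# Made-up response for illustration: 8 header bytes, then vendor/product/revision.
sample = bytes(8) + b"ACME    " + b"VirtualDisk     " + b"0001"
print(build_inquiry_cdb().hex(), parse_standard_inquiry(sample))
```

The revision field is where a device reports its firmware level, which is exactly the kind of detail worth capturing automatically when diagnosing a SAN.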

Does the “Russian” engineer have an inquisitive mind?


If we look at how RAID is organized in the storage systems of various vendors, the cause of the disputes, in our opinion, is simply an unwillingness to get to the heart of the matter. The paper “A Case for Redundant Arrays of Inexpensive Disks (RAID)” by David Patterson, Garth A. Gibson, and Randy Katz, which described the principles of RAID and its variants, contains all the information about how RAID is built, so there is no need to treat the particular engineering solutions of storage vendors as axioms. Of course, there are many such particular issues and differences among storage vendors. Sometimes basic knowledge of how a system works helps to dig deeper into a problem and make sense of a complex situation.

What is in my name for you? Better estimate the storage capacity


Vendors formulate their own principles of RAID operation based on commercial benefit. Take even the calculation of storage capacity, where one megabyte equals one thousand kilobytes of stored information. It is easy to calculate the losses from such calculations: the client pays for real gigabytes but systematically receives fewer of them.
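The size of the gap is simple arithmetic. The Python sketch below compares decimal units (powers of 1000) with binary units (powers of 1024), which is what many operating systems report. The function names are ours, and the calculation covers unit conversion only, not RAID or file system overhead.

```python
def decimal_tb_to_tib(terabytes: float) -> float:
    """Convert decimal terabytes (10**12 bytes) to tebibytes (2**40 bytes)."""
    return terabytes * 10**12 / 2**40


def shortfall_percent(prefix_power: int) -> float:
    """Percent by which 1000**n falls short of 1024**n.

    n = 1 for kilo, 2 for mega, 3 for giga, 4 for tera.
    """
    return (1 - 1000**prefix_power / 1024**prefix_power) * 100


for name, power in [("kilo", 1), ("mega", 2), ("giga", 3), ("tera", 4)]:
    print(f"{name}: {shortfall_percent(power):.2f}% smaller than the binary unit")

print(f"100 TB on the price list = {decimal_tb_to_tib(100):.2f} TiB in binary units")
```

The gap grows with every prefix: a little over 2% at the kilobyte level and about 9% at the terabyte level, so 100 TB on paper comes to roughly 90.95 TiB.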

Our principle is to weigh different storage systems on the same scales, that is, by the result they deliver. Of course, the set of characteristics that consumers evaluate differs for everyone and includes total cost of ownership, upgrade cost, technical support cost, and so on.

When evaluating a storage system, we recommend the following approaches:

  1. Use the same load-testing tool for all storage systems (proprietary, fio, etc.).
  2. Record the microcode version installed on the storage system at the time of testing.
  3. Check the microcode versions of the disks and record them in the report.
  4. Test with real systems rather than limiting yourself to synthetic tests.
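The checklist above can be captured in a small, repeatable harness. The Python sketch below only builds a fixed fio command line and a report record; it does not run anything. The fio options used are standard ones, while the job name, the default workload parameters, the device path, and the `EvaluationRecord` structure are assumptions made for illustration. The default pattern is read-only, because write patterns against a raw device destroy its data.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List


def fio_command(target: str, rw: str = "randread", block_size: str = "4k",
                iodepth: int = 32, runtime_s: int = 60) -> List[str]:
    """Build one fixed fio command line so every array gets the same load."""
    return [
        "fio",
        "--name=storage-eval",      # hypothetical job name
        f"--filename={target}",     # LUN or file under test
        f"--rw={rw}",               # read-only by default; writes destroy data
        f"--bs={block_size}",
        f"--iodepth={iodepth}",
        "--ioengine=libaio",        # Linux asynchronous I/O engine
        "--direct=1",               # bypass the page cache
        f"--runtime={runtime_s}",
        "--time_based",
        "--output-format=json",
    ]


@dataclass
class EvaluationRecord:
    """What to write down next to every test result (illustrative fields)."""
    array_model: str
    controller_microcode: str
    disk_microcode: Dict[str, str]   # disk slot -> microcode version
    workload: str                    # "synthetic" or the name of a real application
    command: List[str]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


record = EvaluationRecord(
    array_model="array-A",                          # placeholder values
    controller_microcode="1.2.3",
    disk_microcode={"slot0": "FW01", "slot1": "FW01"},
    workload="synthetic",
    command=fio_command("/dev/mapper/lun0"),        # hypothetical device path
)
print(record.to_json())
```

Keeping the command and the microcode versions in the same record makes results from different arrays, or from the same array before and after an update, directly comparable.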

The question arises: which of these approaches can the client actually control? In practice, the client controls none of them when choosing a storage system. The situation is aggravated by the fact that Russian companies are in effect hostages of foreign vendors' pricing policy.

Myths from storage vendors


As a rule, consumers focus on the external, marketing characteristics of storage systems. But what if we look inside the product itself? It turns out there is nothing unusual there. Recalling the practice of one three-letter vendor 10 years ago, many storage systems used an ordinary Pentium III processor. Judging by the designs of foreign vendors, the storage hardware platform is always as cheap and simple as possible. There is a common myth that a reliable storage solution requires a very complex hardware platform, and that this is what provides high reliability. A number of vendors do design complex digital components for their storage systems, but for other reasons, which usually come down to plain economics. By the roughest estimates, the total cost of the hardware in a storage system does not exceed 10-15% of its actual price for the end user. The client pays not for the hardware platform but for the software that makes this hardware work.

Many Russian developers are now “playing around” with systems based on PCI Express, which foreign companies have been actively using in storage systems for more than 10 years. The fact is that progress in storage is driven not by developing complex digital components but by creating simple, multifunctional designs in which most of the logic lives in software.

The art of a storage vendor lies precisely in creating a universal storage platform (software) that can be moved to any hardware platform at minimal cost.

Now there is a new trend among foreign storage vendors: virtualization.

A storage virtualization system is usually positioned as a universal access layer in front of any other storage systems, which hides the peculiarities of the client's existing storage zoo and at the same time improves the performance of the existing systems. It is also worth mentioning the "compatibility matrix" that storage vendors publish.

Both the first idea, storage virtualization, and the second, the compatibility matrix, are completely false. Imagine a storage vendor keeping every storage system on the market in its hangar, along with a whole staff of specially trained people who check every driver and every operating system version for compatibility and carefully enter the results into the matrix. Given the fight for financial results and tough competition in the market, many vendors simply cannot support such test beds and maintain compatibility matrices. As a result, each client applies the next software update at his own risk.

Consider a case from the practice of one client who purchased a storage virtualization system.

Story 2. How the admin tamed the storage system


Storage admin Petya starts configuring his new virtual storage system, which was installed recently. Deep expertise and self-confidence allow Petya to carefully configure data access for his servers through the new storage virtualization system. Checking the settings on each server, Petya makes sure that all “paths” to the disks lead to the new storage virtualization system. Now all services are under control, including the most important ones: e-mail and electronic payment processing.

Then comes a period of intense user load, and data access errors appear in the operating system log. Long and painful negotiations with technical support show that everything is configured correctly, and that to finally solve the performance problem the storage system must be upgraded and additional SSD capacity purchased. This prospect pleases neither Petya nor his boss, Yuri Petrovich, who had long and painfully defended the budget for the storage purchase. “What to do?” thinks Yuri Petrovich. “I could have chosen a solution from another vendor. It might have been more expensive, but it might also have been more reliable, and these problems would not have happened.”

In the end, the story concludes with a gradual migration back to the old solution and the abandonment of the new storage virtualization system. And let's not forget that depreciation charges for the new storage system continue to accrue, and it remains a burden on the organization’s budget.

Why is it possible to compete with foreign storage vendors?


In our opinion, the answer is very simple: foreign vendors rely on a vendor lock-in strategy, so any foreign storage architecture will have a serious flaw, which means there is a niche for replacing such systems. We follow all the changes at foreign vendors and understand how the solutions of more than 10 global manufacturers are built, including Hitachi, DELL (EMC), HPE, NetApp, and others.

How can you avoid problems during storage operation and get the necessary experience in storage programming?


We run a school for storage developers. First of all, we invite students of Russian universities, free of charge.

School participants will learn to work with the SCSI and NVMe protocols, study new data protection algorithms, and try this knowledge in practice with virtualization systems based on VMware. In the laboratory sessions we will cover the methods and principles of load testing, as well as the main criteria for evaluating storage performance. We will also look at data migration and describe free and effective ways to migrate data between storage systems. Anyone can sign up for our school here.

In the following articles we will talk about machine learning in storage systems and share information about our new storage system models.

Source: https://habr.com/ru/post/351662/

