
ScaleIO Testing Experience

In this post I want to share my experience testing a distributed storage system built on EMC ScaleIO 1.32.2.

I decided to try it after reading the articles "How to make a fault-tolerant storage system from domestic servers" and "Don't rush to throw out your old servers: you can assemble a fast Ethernet storage system in an hour".

First, I would question one image from the second article. According to the documentation, a cluster can consist of only two nodes, while the diagram shows three (in blue).


I wrote this post to discuss the problems that arose, since I received no response from EMC. Yes, the system was deployed on a test bench, and under the licensing terms there is no technical support from the manufacturer. But searching the web did not bring the desired result either.

Now, the characteristics of the test bench:



The first puzzle appeared right after installing Meta Data Manager on the first node. To get it configured, I had to restart the OS: when I tried to run the --add_primary_mdm command immediately after the installation finished, a connection error was persistently displayed, even though all the required ports were in the LISTENING state and all the required services were running.
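For reference, the MDM bring-up sequence in the 1.3x-era `scli` looks roughly like the sketch below. The IP addresses and password are placeholders, and apart from `--add_primary_mdm` (mentioned above) the exact flag names are my recollection of the 1.3x CLI, so treat them as assumptions:

```shell
# Register this node as the primary MDM (run on the first node).
# In my case an OS restart after installing the MDM package was
# required before this command would connect at all.
scli --add_primary_mdm --primary_mdm_ip 192.168.0.11

# Authenticate, then attach the secondary MDM and the Tie Breaker
# and switch the MDM pair into cluster mode.
scli --login --username admin --password Secret123
scli --add_secondary_mdm --secondary_mdm_ip 192.168.0.12
scli --add_tb --tb_ip 192.168.0.13
scli --switch_to_cluster_mode
```

Only after the reboot did the first command stop failing with a connection error; the rest of the sequence then completed without incident.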

After that, attaching the second node, configuring the cluster, and installing the Data Server roles went smoothly.

On each Data Server node, two storage devices were successfully attached: raw partitions on disks connected via iSCSI, plus one large file on a local disk.

A peculiarity of the iSCSI disks was that their sources were computers on the network that were turned on and off haphazardly and unpredictably, which made it possible to fully exercise the advertised fault-tolerance mechanisms: Rebuild and Rebalance. Over two weeks of monitoring, there were no complaints about these aspects of the system; everything worked flawlessly.

Problems began when I tried to increase the number of attached devices on each Data Server node. I could not find out why new devices would not attach, either with the --add_sds_device command or through the GUI: every operation ended with the error "Communication error", and the same happened on every node. Meanwhile, each of those devices was available in the OS as a block device, could be formatted, could hold files, and could be accessed over SMB without any objection.
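The failing call looked roughly like this (again, `--add_sds_device` is the command named above, while the specific parameters and pool name are illustrative placeholders from my setup):

```shell
# Attach an additional device to an existing SDS node.
# In my case every such call, on every node, failed with
# "Communication error" -- despite the device being perfectly
# usable as an ordinary block device in the OS.
scli --login --username admin --password Secret123
scli --add_sds_device --sds_ip 192.168.0.11 \
     --device_path /dev/sdc --storage_pool_name pool01
```

The GUI went through what appeared to be the same code path and produced the same error, which suggests the problem was not in the CLI itself.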

However, the most critical error surfaced only after a couple of weeks.

One day I noticed that the cluster was in Degraded status. During the night there had been power problems and the network had been partially down. Both Meta Data Manager nodes were in Secondary status, even though the Tie Breaker node was reachable over the network from both of them.

Forcing either node into the Primary role proved impossible: the administrative port was not listening, and it was not even possible to dump the cluster configuration to a file.
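These are the commands one would normally reach for in this situation; the flag names are my recollection of the 1.3x CLI reference rather than something verified against this exact build. With both MDMs stuck in Secondary and the administrative port closed, neither could even establish a connection:

```shell
# Inspect the MDM cluster state (primary/secondary/tie-breaker).
scli --query_cluster

# Normally promotes the standby MDM to Primary -- the obvious
# recovery step, but here it could not connect at all.
scli --switch_mdm_ownership
```

So the one mechanism intended to recover MDM leadership was itself unavailable precisely when it was needed.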

That is, all the Data Server and Data Client nodes kept working and exchanging data over the network, the disk volume presented to the client remained available, and data integrity was not compromised.

But the situation was a dead end: I could neither change the configuration nor add new nodes.

I tried to bring up a new Primary Meta Data Manager, create a new cluster, and attach the existing Secondary node to it. That faint hope died before it was born: the new cluster was empty (which, in principle, was clear from the start).

Another small drawback is that the GUI cannot be resized to match the current monitor resolution: its window size is fixed and designed for a resolution of at least 1280x1024.

I spent a lot of time querying Google, but could not find anything useful.

I went to the EMC website and found the online consultant window there. I asked to be put in touch with technical support and sent a letter describing the problems I had found.

In the reply (in Russian) I was asked clarifying questions. I answered them and was promised a response after some time. Having waited a week without one, I sent a reminder, but so far I have received nothing back.

My findings


The write-up of the testing in the second article linked at the beginning of this post concludes that
"Failover tests were successful".

I cannot agree with that. This is the first software-defined distributed storage system I have tested; I will gradually test others and report the results.

Source: https://habr.com/ru/post/273345/
