
Storage networks regularly suffer from unpleasant things such as a growing number of port errors or increasing signal attenuation on SFP modules. Given the high reliability of a SAN infrastructure built from two or more fabrics, the likelihood of an emergency is not that great, but a combination of negative factors can still lead to data loss or performance degradation. Imagine, for example, that FOS is being upgraded on one fabric, so all traffic goes through the second fabric, and on the link between the switch the disk array is connected to and the switch the servers are connected to, the CRC error counter on one of the trunk ports starts to grow rapidly. Or, even worse, the link drops because the signal level has fallen as the SFP module heated up, and the module heated up because of the increased utilization of that link. In such cases people usually say "Well, who knew?" or "No system is 100% reliable", and so on.
Sound architecture + proper monitoring = fault tolerance
So, the problem has been identified, and a set of measures is needed to improve the resilience of the storage network. They can be divided into two stages:
- bringing the storage network architecture in line with “SAN best practices”
- deploying a monitoring system
There is plenty of literature and training on SAN best practices, and you can always invite seasoned experts from an integrator to carry out an audit, but choosing the right way to build a good SAN monitoring system is not so easy. The reason is the tight coupling between the software developer and the switch manufacturer. I am not saying that Cisco Fabric Manager or Brocade Network Advisor are bad, but they do not let me do everything that, in my opinion, is necessary to increase the resilience of the SAN.
What to do
So the task is set and a solution has to be found. This is often complicated by a lack of money in this year's budget, or by the integrator not knowing that suitable software exists. That is not a problem, though, because all the necessary components are freely available; all that is required is to make them work together.
Let us walk through the implementation of CRC error monitoring on the ports of Brocade SAN switches; most other parameters can be monitored in the same way.
Step one, data collection protocol
The number of CRC errors can be obtained from the switches in several ways: SNMP, HTTPS, telnet and ssh. I chose the last one. Telnet is not secure and is better disabled, specific values are hard to extract over HTTPS, and the SNMP tree can differ significantly between switch models and change after an upgrade to a new FOS version.
Step two, data collection method
Linux with bash + expect is the best fit for working with ssh; this combination lets a script connect over ssh and enter commands interactively.
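Once the porterrshow text has been captured over ssh, the CRC counters still have to be pulled out of it. The following is only a minimal Python sketch of that parsing step, not the bash script used in this article. It assumes that each data row starts with a port index followed by a colon, that the "crc err" counter is the fourth value after the index (frames tx, frames rx, enc in, crc err), and that large counters are abbreviated with k, m or g suffixes; check this against the output of your FOS version. The names `to_int`, `parse_crc_errors` and `CRC_ERR_COLUMN` are hypothetical.

```python
import re

# Multipliers for abbreviated counters such as "3.2k" or "1.4g".
_SUFFIX = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}

# ASSUMPTION: columns after the port index are
# frames tx, frames rx, enc in, crc err, ... so "crc err" has index 3.
CRC_ERR_COLUMN = 3


def to_int(field):
    """Convert a counter such as '0', '37' or '2.5k' to an approximate integer."""
    field = field.strip().lower()
    if field and field[-1] in _SUFFIX:
        return round(float(field[:-1]) * _SUFFIX[field[-1]])
    return int(field)


def parse_crc_errors(output):
    """Return {port_index: crc_err} for every data row of porterrshow output."""
    result = {}
    for line in output.splitlines():
        match = re.match(r"\s*(\d+):\s+(.*)", line)
        if not match:
            continue  # header or blank line, no port index
        fields = match.group(2).split()
        if len(fields) > CRC_ERR_COLUMN:
            result[int(match.group(1))] = to_int(fields[CRC_ERR_COLUMN])
    return result
```

Because abbreviated values are rounded by the switch, a small increment on a port that already shows a suffixed counter may not be visible in this output.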
Step three, where to store the data
It makes little difference; you could even store the data in text files, but we will look at an example with MySQL. All monitoring is implemented in two scripts:
porterrshow.sh - collects the data and looks for increments in the CRC error counters
expect.tcl - handles the ssh connection
and three txt files:
temp.txt - data buffer
switches.txt - a list of SAN switches, one per line, in the format: login name password
crc.txt - report on the CRC errors found
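To make the switches.txt format concrete, here is a small Python sketch that reads such a list. It is an illustration only: the `Switch` type and `parse_switches` function are hypothetical names, and skipping blank lines and allowing spaces inside the password are assumptions rather than behavior of the original scripts.

```python
from typing import NamedTuple


class Switch(NamedTuple):
    login: str
    name: str
    password: str


def parse_switches(text):
    """Parse 'login name password' lines into a list of Switch records."""
    switches = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ASSUMPTION: blank lines are ignored
        # Split on whitespace at most twice so the password keeps any spaces.
        parts = line.split(None, 2)
        if len(parts) != 3:
            raise ValueError(f"line {lineno}: expected 'login name password'")
        switches.append(Switch(*parts))
    return switches
```

For example, `parse_switches(open("switches.txt").read())` would give one record per switch, which a collection loop can then iterate over.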
The SELECT query looks for an increase in CRC errors compared with the data collected one hour earlier. Accordingly, the script must be run once per hour, and each run must start and finish within the same clock hour. This limitation is easy to work around by adding a run sequence number field, or, at some cost in performance, by using a more complex condition on the timestamps. The expect, mysql and ssh client packages must be installed on the server. The database dbname must have a user named user with read and write permissions on the table tablename. The table tablename holds data similar to the output of the porterrshow command on the switch, plus the date and time.
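The hour-over-hour comparison can be sketched as a self-join on the collected rows. The example below is a rough illustration and not the query from porterrshow.sh: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name `porterr`, its columns, and the function `find_crc_growth` are all hypothetical. Rows are matched by switch and port, and a port is reported when its counter in the current clock hour is higher than in the previous one.

```python
import sqlite3

# Hypothetical schema: one row per switch port per collection run.
SCHEMA = """
CREATE TABLE porterr (
    switch       TEXT    NOT NULL,
    port         INTEGER NOT NULL,
    crc_err      INTEGER NOT NULL,
    collected_at TEXT    NOT NULL
)
"""

# Compare the current clock hour with the previous one for the same port.
QUERY = """
SELECT cur.switch, cur.port, cur.crc_err - prev.crc_err AS delta
FROM porterr AS cur
JOIN porterr AS prev
  ON prev.switch = cur.switch
 AND prev.port = cur.port
 AND strftime('%Y-%m-%d %H', prev.collected_at)
     = strftime('%Y-%m-%d %H', :now, '-1 hour')
WHERE strftime('%Y-%m-%d %H', cur.collected_at)
      = strftime('%Y-%m-%d %H', :now)
  AND cur.crc_err > prev.crc_err
ORDER BY cur.switch, cur.port
"""


def find_crc_growth(conn, now):
    """Return (switch, port, delta) for ports whose CRC counter grew.

    `now` is a 'YYYY-MM-DD HH:MM:SS' string inside the current run's hour.
    """
    return conn.execute(QUERY, {"now": now}).fetchall()
```

Matching on the clock hour is what makes the "start and finish within the same hour" restriction necessary: a run that straddles an hour boundary would split its rows across two buckets. A counter that was cleared on the switch produces a negative difference and is simply not reported by this sketch.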
porterrshow.sh
expect.tcl