
About InfiniBand: how we reduced ping from 7 μs to 2.4 μs (and test results)


The SX6005 InfiniBand switch: twelve 56 Gb/s FDR ports in a 1U chassis, 1.3 Tb/s of switching capacity.

Many people think of InfiniBand as something “out of this world”: expensive, and needed only for “supercomputers” (HPC) crunching 1-2 petaflops and huge volumes of data. In fact, the technology lets you build not only the fastest cluster interconnects, but also drastically cut latency in critical applications. In other words, it can do things that Ethernet can also solve, only more economically and faster. Here is an example.

Task


One of our major customers from the financial sector had a problem with the speed of two applications. The specifics: the applications had to process a large number of transactions with minimal delay. A latency of 6-7 μs was the best they had achieved by upgrading servers and polishing the software to the limit. Further optimizations promised gains of only 0.3-0.5 μs. We came and said we could cut the latency roughly in half.

Solution


We suggested moving the inter-server communication to InfiniBand. Of course, the customer’s specialists had heard of the technology, but they wanted specific on-site tests so they could check everything for themselves. We prepared a set of demo equipment, and they allocated several servers for it; nobody wanted to touch the production (“battle”) servers, for obvious reasons. So we built a fragment of a live network on IB links right at the customer’s site.

We chose equipment made by Mellanox. The main reason for picking this particular vendor is that Mellanox switches have “universal” ports that can be set in software to act either as InfiniBand or as ordinary Ethernet ports, which later allows seamless integration into the customer’s existing Ethernet network. In addition, Mellanox produces a complete range of equipment (switches, network cards for any servers, interface cables and so on), which in the future makes it possible to assemble the whole solution regardless of the type and vendor of the customer’s servers.
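As an illustration of those “universal” (VPI) ports: on ConnectX-3 adapters driven by the mlx4 driver, the port personality can be switched between InfiniBand and Ethernet from the OS. A rough sketch, with an example PCI address (Mellanox OFED also ships a connectx_port_config helper for the same task):

# Show the current type of port 1 on the adapter at PCI address 0000:04:00.0 (example address)
cat /sys/bus/pci/devices/0000:04:00.0/mlx4_port1

# Switch port 1 to Ethernet and port 2 back to InfiniBand
echo eth > /sys/bus/pci/devices/0000:04:00.0/mlx4_port1
echo ib > /sys/bus/pci/devices/0000:04:00.0/mlx4_port2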

The customer installed their applications on the servers, loaded the tests (apparently replaying real workloads from roughly the previous trading day), and the testing began.

Test results


First Test Architecture:

IB equipment: ConnectX-3 HCAs, an SX6025 FDR switch, HP ProLiant DL380 G7 servers, RHEL 6.4 with MRG 2.2.
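Before the latency runs it makes sense to confirm that the fabric itself is healthy. A minimal sketch using the standard OFED utilities (nothing here is customer-specific):

# On both nodes: check HCA and port state; an FDR link should report Active/LinkUp at a rate of 56
ibstat
ibv_devinfo

# Quick end-to-end RDMA sanity check from the OFED examples
# On the 1st node (server): ibv_rc_pingpong
# On the 2nd node (client): ibv_rc_pingpong <node1_ip>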

Testing steps:

Run the IB Read latency test:
On the 1st node: ib_read_lat -a
On the 2nd node: ib_read_lat -a <node1_ip>

Run the IB Write latency test:
On the 1st node: ib_write_lat -a -R
On the 2nd node: ib_write_lat -a -R <node1_ip>

Run the IB Send latency test:
On the 1st node: ib_send_lat -a -R
On the 2nd node: ib_send_lat -a -R <node1_ip>
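For repeatability we usually wrap runs like these in a small script. A minimal sketch of the client side (node 1 starts the same tests without an address, i.e. in server mode, before each client run; the address below is a placeholder):

#!/bin/bash
# Run the three perftest latency benchmarks against node 1 and keep the raw output.
NODE1=192.0.2.1                       # placeholder: replace with the address of node 1
OUTDIR=ib_latency_$(date +%Y%m%d_%H%M%S)
mkdir -p "$OUTDIR"

ib_read_lat  -a    "$NODE1" | tee "$OUTDIR/read_lat.txt"
ib_write_lat -a -R "$NODE1" | tee "$OUTDIR/write_lat.txt"
ib_send_lat  -a -R "$NODE1" | tee "$OUTDIR/send_lat.txt"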


First test results
ib_read_lat:
#bytes #iterations t_min [usec] t_max [usec] t_typical [usec]
2 1000 2.34 9.90 2.36
4 1000 2.34 95.75 2.37
8 1000 2.34 14.15 2.37
16 1000 2.35 14.27 2.37
32 1000 2.37 12.02 2.40
64 1000 2.38 15.85 2.42
128 1000 2.49 14.03 2.52
256 1000 2.67 11.69 2.69
512 1000 2.98 15.09 3.02
1024 1000 3.62 14.01 3.66
2048 1000 4.90 94.37 4.95
4096 1000 6.09 16.45 6.13
8192 1000 8.42 14.42 8.47
16384 1000 13.10 20.04 13.15
32768 1000 22.44 26.07 22.98
65536 1000 41.66 53.00 41.72
131072 1000 79.52 82.96 79.65
262144 1000 155.42 160.17 155.51
524288 1000 307.13 372.69 307.26
1048576 1000 610.54 619.63 610.89
2097152 1000 1217.37 1305.74 1217.84
4194304 1000 2431.34 2466.40 2431.94
8388608 1000 4859.15 4928.79 4860.07

ib_write_lat:
#bytes #iterations t_min [usec] t_max [usec] t_typical [usec]
2 1000 1.26 6.29 1.28
4 1000 1.26 7.44 1.28
8 1000 1.27 5.87 1.28
16 1000 1.27 47.73 1.29
32 1000 1.34 5.79 1.35
64 1000 1.34 5.25 1.36
128 1000 1.48 5.36 1.50
256 1000 2.22 7.44 2.26
512 1000 2.94 47.86 2.98
1024 1000 3.58 7.95 3.63
2048 1000 4.88 8.22 4.91
4096 1000 6.06 9.99 6.09
8192 1000 8.39 11.21 8.43
16384 1000 13.07 15.25 13.48
32768 1000 22.82 27.43 22.89
65536 1000 41.95 45.60 42.04
131072 1000 79.88 85.01 79.93
262144 1000 155.75 160.06 155.84
524288 1000 307.50 332.07 307.65
1048576 1000 610.99 628.83 611.27
2097152 1000 1218.10 1227.02 1218.48
4194304 1000 2432.72 2475.44 2433.46
8388608 1000 4989.11 5025.70 4991.06

ib_send_lat:
#bytes #iterations t_min [usec] t_max [usec] t_typical [usec]
2 1000 1.32 5.74 1.34
4 1000 1.32 5.18 1.34
8 1000 1.32 5.33 1.34
16 1000 1.33 5.40 1.35
32 1000 1.35 5.79 1.37
64 1000 1.40 5.43 1.42
128 1000 1.53 5.52 1.55
256 1000 2.28 5.60 2.31
512 1000 2.92 7.45 2.95
1024 1000 3.56 7.79 3.59
2048 1000 4.85 8.94 4.88
4096 1000 6.03 13.98 6.07
8192 1000 8.36 16.11 8.40
16384 1000 13.02 20.84 13.09
32768 1000 22.39 30.22 23.21
65536 1000 41.93 66.03 42.02
131072 1000 79.84 92.94 79.92
262144 1000 155.72 164.96 155.81
524288 1000 307.49 321.99 307.68
1048576 1000 610.97 626.82 611.27
2097152 1000 1218.05 1241.91 1218.48
4194304 1000 2432.68 2473.29 2433.54
8388608 1000 4968.03 4994.76 4991.17


The next test (IPoIB performance):
1. Run the Sockperf UDP test:
On the 1st node: sockperf sr -i <node1_ip>
On the 2nd node: sockperf pp -t 10 -i <node1_ip>

2. Run the Sockperf TCP test:
On the 1st node: sockperf sr -i <node1_ip> --tcp
On the 2nd node: sockperf pp -t 10 -i <node1_ip> --tcp

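For reference, running sockperf over IPoIB assumes the IPoIB interface is already configured. A minimal sketch of that setup and one UDP run (the interface name and addresses are examples, not the customer’s actual configuration):

# On both nodes: load IPoIB and configure the IB interface
modprobe ib_ipoib
ip addr add 10.10.10.1/24 dev ib0     # use 10.10.10.2/24 on the second node
ip link set ib0 up

# On the 1st node: sockperf server bound to its IPoIB address
sockperf sr -i 10.10.10.1

# On the 2nd node: 10-second UDP ping-pong against node 1 (add --tcp for the TCP variant)
sockperf pp -t 10 -i 10.10.10.1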


Second test results
Sockperf UDP test (latencies in μs):
For a packet of 16 bytes: avg-lat = 9.133 (std-dev = 1.226)
For a 64 byte packet: avg-lat = 9.298 (std-dev = 1.268)
For a packet of 256 bytes: avg-lat = 9.605 (std-dev = 1.031)
For a packet of 1024 bytes: avg-lat = 10.791 (std-dev = 1.066)
For a packet of 4096 bytes: avg-lat = 17.107 (std-dev = 1.548)
For a packet of 16384 bytes: avg-lat = 34.512 (std-dev = 2.098)
For a packet of 65506 bytes: avg-lat = 96.502 (std-dev = 3.181)

Sockperf TCP test (latencies in μs):

For a packet of 16 bytes: avg-lat = 10.509 (std-dev = 1.185)
For a 64 byte packet: avg-lat = 10.506 (std-dev = 1.154)
For a packet of 256 bytes: avg-lat = 11.552 (std-dev = 1.128)
For a packet of 1024 bytes: avg-lat = 12.409 (std-dev = 1.168)
For a packet of 4096 bytes: avg-lat = 18.991 (std-dev = 1.506)
For a packet of 16384 bytes: avg-lat = 32.937 (std-dev = 1.952)
For a packet of 65506 bytes: avg-lat = 76.926 (std-dev = 3.066)


The bottom line: a typical latency of 2.36-2.37 μs, with the best results around 2 μs. We achieved more than the customer had asked for.

And after the tests I saw that particular feeling of deep satisfaction experienced by people who know their system will now run faster. Without replacing the servers.

Ethernet and InfiniBand


It has to be said: one is not a replacement for the other. It is like comparing a passenger car with a tank: one is comfortable and fast, while on the other you can drive all the way to Europe. The important point is that you should not expect a rich set of services or elaborate routing schemes from IB; that is firmly Ethernet territory. The IB architecture is “flat”.

The flip side of that simplicity is that InfiniBand is very easy to upgrade and scale. In our practice, deploying a local IB network after the hardware is installed comes down to walking around plugging in the cables and then spending an hour at the terminal, 40 minutes of which goes to verifying the finished setup. Rebuilding an Ethernet network means changing the architecture, with many more steps; and because the technology is so flexible, and rather unforgiving towards anyone who does not know all its subtleties, the procedure resembles signing a contract with a pile of fine print. And there is a fair chance that a complex architecture will spring surprises later.

Rebuilding an Ethernet network to parameters comparable with IB takes 3-4 times longer on average. Restoring a failed IB network takes at most 4 hours, at any time of day. That figure can easily be written into an SLA; we have never hit a case, even a difficult one, where the deadline was under any real pressure. Of course, you can write the same time into an Ethernet SLA, but then a whole team of system administrators has to travel to the site.

Well, the main difference is speed.
Ethernet, like Fibre Channel, is a “bounded” technology. Yes, you can get 100 Gb of bandwidth on Ethernet by aggregating a stack of 10 Gb links. But, first, there is still a limit. Second, what will such a solution cost? And what about the cost of running it? You also have to work out how much rack space, power and cooling a core-level 100 Gb switch needs in the data center, and you need more than one of them. With IB, 100 Gb/s is not the limit, and the cost of the solution (both CAPEX and OPEX) is several times lower: less space, less loss, fewer cables, less power and air conditioning, fewer man-hours for operation, and so on. In short, for tasks of this class IB is simply faster and cheaper.

Need a practical example?
Here you go. At some point our cloud data center needed higher speeds. At the same time, the architecture had to stay very flexible, quick and painless to upgrade and to reconfigure for specific individual client requirements (this is one of the trump cards of CROC’s cloud). Having estimated the rate of change, we naturally chose IB.

Manufacturers


There used to be four main manufacturers: QLogic, Mellanox, Voltaire and Topspin. Mellanox bought Voltaire about three years ago, Topspin became part of Cisco back in 2005, and Intel acquired QLogic’s InfiniBand business in 2012 (that equipment now ships mostly as OEM). As a result, there is now essentially one dedicated player, Mellanox, offering the most complete product range; its OEM partners include HP, IBM, Violin and others.

Perspectives


Among the “space-grade” high-speed solutions, InfiniBand turns out to be very economical both to implement and to operate. Only 2-3 years ago such solutions were needed only for very serious, industry-scale tasks, but now there are more and more enterprise-class projects. The reason is simple: data volumes grow, application load grows, server performance grows, and as a result both the servers and the applications need a different class of interconnect speed.

Customers put it like this: “We lose nothing, we are not changing technology, we are not throwing Ethernet away, we just want to complement our infrastructure with InfiniBand.”

Why come to us with such questions?


Many customers appreciate that, first of all, they can see the equipment live (it runs constantly in our own projects and data centers), run tests on our sites or theirs (as in the example above), deploy a pilot and only then roll everything out. We also have people who can optimize software for specific large-scale tasks with all the network specifics in mind; that is the next stage of working with highload. And we try to explain, as professionally as we can, all the possible technological options for solving a problem, without pushing anything on anyone.

UPD. Judging by the comments, it is not entirely clear why this particular solution was chosen. To summarize: the customer needed to reduce latency from 7 μs to 3 μs for specific applications. This could also have been done with low-latency Ethernet equipment, which the customer was able to size on their own. At the same time, the customer’s technical specialists decided to compare IB against that option.

We ran the tests at the customer’s site, and their specialists confirmed the following:
  • IB fully solves the problem and even exceeds the customer’s expectations.
  • The solution is very easy to deploy on production servers running live services.
  • The solution is convenient to operate.
  • Implementation and maintenance costs are considerably lower than for comparable Ethernet solutions.

In this case, the customer used IB-compatible modules, which determined the choice of vendor.

PS If this topic interests you, I will be holding a webinar on September 12; you are welcome to join. If you need to size something right away, or want to find out whether tests can be run directly on your site, write to AFrolov@croc.ru.

Source: https://habr.com/ru/post/191306/

