How to make PCIe get along with 10-meter copper cables and 100-meter optics

Hello! My name is Anton, and I work on hardware development at YADRO.

At the end of his review article on the history and development of PCI Express, Alexey mentioned our own adapter for connecting PCI Express (hereinafter PCIe) devices externally via a cable. Today I will tell you how we tested and debugged it so that it works correctly with different cable types and lengths.

The adapter was designed to connect the components of our systems to each other over a PCIe bus through a cable. At the time of development, the existing solutions did not suit us for a number of reasons: some did not support Gen3 speeds, some used cables that we considered to have no future, and so on.

Below I will describe the problems we encountered (and still encounter) when working with our adapter. But to make them easier to understand, let us first look at some theory of how PCIe devices interact.

One caveat: the description below applies to PCIe in general, but for specifics I use the terminology from the PLX documentation, since our adapter is built around a chip from this manufacturer. Other manufacturers may name similar mechanisms and entities differently, which does not change the essence.

Tuning

PCIe devices have an equalizer in both the receive and transmit paths. Equalizer parameters can and should be changed (tuned) in order to obtain a reliable (BER < 10^-12) connection.

Transmitter tuning

PCIe transmitters have the following equalizer parameters that can be controlled:

De-emphasis or post-cursor
Pre-shoot or pre-cursor
Main or cursor

To keep the word "cursor" from cropping up constantly, from here on I will use the first name of each parameter.

These parameters determine the relationship between the amplitudes of the signal in the time domain:

The De-emphasis coefficient determines the Vb / Va ratio, the Pre-shoot coefficient is responsible for the Vc / Vb ratio, and Main is, by and large, the span between the maximum level with pre-shoot applied and the minimum level with de-emphasis applied (the span between the short top shelf and the long bottom shelf), or vice versa.

Without going too deep into technical details: De-emphasis changes the amplitude of the transmitted bit depending on the value of the previous bit, and Pre-shoot does the same depending on the value of the next bit. Main determines the overall signal swing.

The values of Main, De-emphasis and Pre-shoot are set by coefficients from 0 to 63, and the sum of all three coefficients must equal 63. That is, if you set Pre-shoot = 6 (3.5 dB) and De-emphasis = 13 (-6 dB), only 44 is left for Main. In this way the signal energy is redistributed between the high-frequency (bit transitions) and low-frequency (several 1s or 0s in a row) components.
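The arithmetic behind these numbers can be sketched in a few lines of Python. The Vb / Va and Vc / Vb ratios come from the text above; the expressions for the three voltage levels assume the usual three-tap FIR model of a transmit equalizer, and the function name is made up for this illustration.

```python
import math

FULL_SWING = 63  # the three coefficients must sum to this value


def tx_eq_levels(pre_shoot: int, de_emphasis: int) -> dict:
    """Return Main and the resulting dB ratios for a 3-tap transmit equalizer.

    Assumption: levels follow the usual 3-tap FIR model, in coefficient units.
    """
    main = FULL_SWING - pre_shoot - de_emphasis
    vb = main - pre_shoot - de_emphasis  # flat level in a long run of equal bits
    if pre_shoot < 0 or de_emphasis < 0 or vb <= 0:
        raise ValueError("coefficients leave no usable swing for Main")
    va = main - pre_shoot + de_emphasis  # first bit after a transition
    vc = main + pre_shoot - de_emphasis  # last bit before a transition
    return {
        "main": main,
        "de_emphasis_db": 20 * math.log10(vb / va),
        "pre_shoot_db": 20 * math.log10(vc / vb),
    }


if __name__ == "__main__":
    print(tx_eq_levels(pre_shoot=6, de_emphasis=13))
```

For Pre-shoot = 6 and De-emphasis = 13 this gives Main = 44, roughly -6.2 dB of de-emphasis and roughly 3.4 dB of pre-shoot, which agrees with the rounded values quoted above.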

Looking ahead: for a 10-meter cable, the optimal values in our case are 63 on Main with zero Pre-shoot and De-emphasis, or 55-57 on Main with a small Pre-shoot value. That is, over such a cable length the signal is attenuated so much that the receiver no longer cares about the edges: it struggles just to recognize that a signal is present on the line.

Receiver tuning

When a signal arrives at the receiver, the following blocks process it in turn:

ATT (attenuator);
BOOST or CTLE, an amplifier for the high-frequency component;
DFE, a block that works in a manner similar to the de-emphasis / pre-shoot stages of the transmitter (disabled by default) and is intended to eliminate intersymbol interference (ISI).

ATT transfer function - almost uniform attenuation of the signal over the entire frequency range:

The CTLE transfer function is a significant gain in the high-frequency region:

ATT and CTLE usually work in opposition: low ATT values are accompanied by high CTLE values. That is, the receiver first scales the input signal to an acceptable level and then boosts the high-frequency component, which is attenuated the most as the signal travels through the channel.
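To make the shape of these two transfer functions more tangible, here is a minimal sketch of a flat attenuator followed by a CTLE with one zero and one pole. This is a textbook-style model, not the PLX receiver: the attenuation value and the zero and pole frequencies are arbitrary assumptions chosen only for illustration.

```python
import math


def rx_gain_db(freq_hz: float,
               att_db: float = -6.0,      # assumed flat attenuation
               zero_hz: float = 1e9,      # assumed CTLE zero
               pole_hz: float = 8e9) -> float:  # assumed CTLE pole
    """Gain in dB of a flat attenuator followed by a one-zero, one-pole CTLE."""
    ctle = complex(1.0, freq_hz / zero_hz) / complex(1.0, freq_hz / pole_hz)
    return att_db + 20 * math.log10(abs(ctle))


if __name__ == "__main__":
    # 4 GHz is the Nyquist frequency of an 8 GT/s signal.
    for f in (0.0, 1e9, 4e9):
        print(f"{f / 1e9:4.1f} GHz: {rx_gain_db(f):6.2f} dB")
```

With these assumed values the attenuator shifts the whole curve down by the same amount at every frequency, while the CTLE adds a little over 11 dB at 4 GHz relative to DC.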

If you have to adjust the coefficients by hand, keep in mind that too little attenuation (and, as a result, an overly open "eye") can saturate the receive path. Too large a BOOST value increases high-frequency noise, for example from crosstalk. In general, there is no need to turn these knobs to the maximum.

Again looking ahead: with a 10-meter cable, ATT calibrates to 0x0F, the maximum value, which means no attenuation, and CTLE to around 0x09, which is quite a serious high-frequency boost. That is, at these distances the signal is attenuated so much that the knobs simply have to be turned almost to the maximum.

PCIe Gen3 Training

PCIe Gen3 differs fundamentally from Gen2 and Gen1 in that training includes an iterative phase during which the receive and transmit path parameters are adjusted. In very broad strokes, the training process is as follows:

It all starts with Gen1. There is no tuning at this stage: everything starts with standard values and they do not change in the process. In particular, for a Gen1 receiver these are ATT = 0x09, CTLE = 0x05.
The Downstream port informs the Upstream port of the initial transmitter settings (TX PRESETS) for Gen3 and sets the starting values for itself (usually they are the same).
The Upstream port configures its transmitter, and the partners switch to Gen3 mode. If the connection quality at this point does not allow a BER below 10^-4, that's it, nothing works. In other words, the starting values of the parameters often matter.
If link-up has occurred, the Downstream port starts calibrating its receiver and advising the Upstream port of new transmitter settings until it reaches a BER of 10^-12.
After that, the same procedure is repeated in the opposite direction: the Upstream port advises, and the Downstream port is tuned. Then the final link-up happens.

This mechanism, by which the link partners exchange tuning information, is called Backchannel Tuning.
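The control flow of the steps above can be caricatured as a toy loop. Only the two BER thresholds come from the text; the idea that every tuning iteration improves BER by a fixed factor, as well as all names and default values, are assumptions made purely to show the shape of the process. Real hardware does nothing this simple.

```python
LINK_UP_BER = 1e-4   # worst BER at which Gen3 link-up is still possible
TARGET_BER = 1e-12   # BER the tuning loop aims for


def backchannel_tune(start_ber: float,
                     gain_per_step: float = 100.0,  # assumed improvement per iteration
                     max_steps: int = 16):          # assumed iteration budget
    """Toy model of one direction of tuning: returns (status, ber, steps)."""
    if start_ber > LINK_UP_BER:
        # The presets were too far off: the link never comes up at Gen3.
        return ("no link", start_ber, 0)
    ber, steps = start_ber, 0
    while ber > TARGET_BER and steps < max_steps:
        ber /= gain_per_step  # the receiver asks for new settings and re-measures
        steps += 1
    status = "tuned" if ber <= TARGET_BER else "degraded"
    return (status, ber, steps)


if __name__ == "__main__":
    print(backchannel_tune(1e-3))  # starting values too poor to link up
    print(backchannel_tune(1e-5))  # links up, then converges
```

The sketch mainly illustrates why the starting presets matter: if the first attempt is worse than the link-up threshold, the iterative phase never gets a chance to run.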

Testing and fine-tuning when working with copper cables

At first power-on we, of course, ran no tests (who does?). We connected two adapters with a copper cable and saw a blinking LED. Then, using the lspci command, we checked that the remotely connected PCIe switch was visible in the system's PCIe tree and that the parameters of the established link were as expected, that is, x4 / x8 / x16 (depending on the configuration) at a speed of 8 GT/s.
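A check like this is easy to script. The sketch below assumes the verbose output of lspci (for example `lspci -vv`), where each device's negotiated link is reported on a line beginning with `LnkSta:`; the helper names and the sample strings are made up for illustration and are not output from our adapter.

```python
import re

# Matches e.g. "LnkSta: Speed 8GT/s, Width x4" and the newer
# "LnkSta: Speed 8GT/s (ok), Width x16 (ok)" variant.
LNKSTA_RE = re.compile(
    r"LnkSta:\s*Speed\s+([\d.]+)GT/s(?:\s*\([^)]*\))?,\s*Width\s+x(\d+)"
)


def parse_link_status(lspci_text: str):
    """Return a list of (speed in GT/s, lane count) for every LnkSta line."""
    return [(float(s), int(w)) for s, w in LNKSTA_RE.findall(lspci_text)]


def link_is_as_expected(lspci_text: str, speed_gt=8.0, widths=(4, 8, 16)) -> bool:
    """True if at least one link is reported and all run at the expected speed and width."""
    links = parse_link_status(lspci_text)
    return bool(links) and all(s == speed_gt and w in widths for s, w in links)


if __name__ == "__main__":
    sample = "\t\tLnkSta:\tSpeed 8GT/s, Width x4, TrErr- Train- SlotClk-"
    print(parse_link_status(sample), link_is_as_expected(sample))
```

In practice the text would come from running lspci for the specific device of interest rather than from a hard-coded string.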

Then we began to study the quality of the connection in more detail. For half-meter and three-meter cables, BER was zero. Difficulties arose with 10-meter cables, and solving them took some serious work.

Link-up over 10 meters happened right away but was accompanied by a tremendous error rate. Data could be transferred, but very slowly. No matter how much we played with TX PRESETS or how the receiver was tuned, nothing helped. We decided to enable DFE. And... nothing happened.

Further research showed that DFE has two modes of operation: the so-called EDFE (Edge DFE), which aims to widen the signal "eye" along the time axis, and CDFE (Central DFE), a test mode whose purpose is to increase the opening of the "eye" along the amplitude axis.

EDFE didn't help us.

In the PLX Capella 1 family, CDFE is turned on by a mysterious series of writes to undocumented registers, very much like entering some kind of code sequence. The manufacturer does not disclose what it actually does. But it helped us very significantly: BER dropped from 120k x 10^-12 to 1.5k x 10^-12. That is still higher than the standard requires, though, and we kept beating our heads against the wall, trying different parameters.

We played with a large number of parameters, for example receiver sensitivity, transmitter evaluation time and the number of iterations, and nothing helped. It must be said that the PLX switch hardware, together with specialized software, can produce an eye diagram for each lane. For us this function turned out to be useless, because the data capture point used to build the eye diagram is located before the DFE block, so we cannot see what the DFE does with the signal. And before the DFE we have not just a closed "eye" but literally zero on the line (although link-up still happens).

In the end, demoralization set in. We realized that the aircraft carrier could not be stopped and decided to look for the button instead: we took up another problem that we had temporarily put aside. The problem was that link-up sometimes did not occur on all lanes. This pointed to problems with the initial connection at Gen1 speed, since it is at this stage that lanes are dropped.

We switched on Gen1 calibration (again, a test mode; it is not used in standard operation). Gen1 stopped working completely (apparently it could not reach an acceptable BER), but we saw which parameter values the receiver was heading toward while trying to link up, and suddenly realized that they were in a different universe from the standard ones. Out of curiosity, we overwrote the standard 0x09 / 0x05 with 0x09 / 0x0F (turning off the attenuator for Gen1).

And we got not just a stable Gen1 link-up but also much more pleasant BER values for Gen3: 0.3 x 10^-12, which already fits within the standard. How the Gen1 receive path is related to Gen3, the manufacturer does not say. But that is how it works.
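To get a feel for what the BER figures quoted in this section mean in practice, they can be converted into errors per second on a single Gen3 lane. The only input besides the BER values themselves is the 8 GT/s line rate; the function names are made up for this illustration.

```python
LANE_RATE_BITS_PER_S = 8e9  # Gen3: 8 GT/s per lane


def errors_per_second(ber: float, lanes: int = 1) -> float:
    """Expected bit errors per second at the given BER."""
    return ber * LANE_RATE_BITS_PER_S * lanes


def seconds_per_error(ber: float, lanes: int = 1) -> float:
    """Expected time between bit errors, in seconds."""
    return 1.0 / errors_per_second(ber, lanes)


if __name__ == "__main__":
    for label, ber in [("before CDFE", 120e3 * 1e-12),
                       ("with CDFE", 1.5e3 * 1e-12),
                       ("after the Gen1 fix", 0.3e-12),
                       ("limit of the standard", 1e-12)]:
        print(f"{label:22s} {errors_per_second(ber):10.4f} errors/s per lane")
```

Per lane, this works out to about 960 errors per second before CDFE, about 12 per second with CDFE, and roughly one error every seven minutes at 0.3 x 10^-12, compared with one error every 125 seconds at the 10^-12 limit.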

Working with optical cables

Optical cables required separate tinkering because of the peculiarities of PCIe, but compared with "punching through" 10-meter copper cables it was easier.

In short, with optics everything is different than with copper.

How optics differs from copper

Receiver Detect
After start-up, PCIe transmitters begin checking the line for a load. They do this by analyzing the signal reflected from the far end of the line. If the line is not terminated by a receiver, the transmitter recognizes this and drops the lane.

The problem with optical receivers is that their termination often does not conform to the PCIe specification: they may be terminated with a different impedance or not terminated to ground at all. Seeing this, the transmitter may decide that there is no receiver and not start the link establishment procedure. To work correctly with optics, it is recommended to mask the signals from the Receiver Detect block, in other words, to force the transmitter to "see" the receiver.

The Electrical Idle state (hereinafter IDLE)
Periods with no modulation on the line, caused for example by a transition to a low-power state, can saturate the PIN diode and cause problems when exiting this state. The receiver may mistake the noise caused by the transient in the optical receiver for an exit from the IDLE state and lock onto a wrong frequency. This later leads to the lane being dropped during link establishment.

The picture below shows what the PCIe receiver sees when IDLE occurs on a line with optics:

Here the "Dead Zone" is where a constant signal level should be, and the "Settling time" can differ from lane to lane.

Inferred IDLE mode
When using the Inferred IDLE mode, the receiver analyzes the content of the transmitted data and does not rely on signal levels to detect entry into the IDLE state. This mode is preferable with an optical cable, since without it:

transients in the optical receiver may trigger an erroneous entry into the IDLE state and into the subsequent states of the LTSSM;
the transition of one partner to the IDLE state may go unrecognized by the other partner.

In PLX switches, this is the default mode.

Downtrain disable
While a link is being established, the receiver tries to lock onto the highest or lowest lane. Once it does, the remaining lanes have a limited time to complete training. For the reasons described above, different lanes may take very different amounts of time to set up, so after the link is established on the first lane, the others may simply not make it in time and will be dropped. It is recommended to prohibit the receiver from establishing a link at a width below the specified one, so that it tries to train all lanes of the port to the end.

Tuning
Optical devices are non-linear, so Backchannel Tuning cannot be relied on, since it assumes linear attenuation of the signal when calculating the calibration coefficients of receivers and transmitters. In some cases it is better to turn it off and calibrate the receiver and transmitter parameters manually.

In general, there are quite a few parameters that can be controlled to make optics work. To access them on the fly, we put a CPLD chip on the adapter. We began by applying all the parameters recommended by the switch manufacturer at once, and nothing worked. Then we discovered that we had forgotten one parameter, namely disabling the Receiver Detect block.

We applied it, and the optics came to life.

Then we returned all the parameters except Receiver Detect to their defaults. The optics kept working.

Masking Receiver Detect does not in any way interfere with work on a passive copper cable. Thus, we got a configuration that works on both types of connections.

It must be said that errors do sometimes appear in the channel, but their rate is within the standard. And since we have not actually done any tuning for optics yet, we firmly believe these errors can be defeated.

Those who have studied this issue go to great lengths to frighten developers with the claim that, because of large manufacturing variation between cables, it is rather difficult to assemble an x8 link from two x4 cables, let alone x16 from four x4 cables. So, to minimize these risks, we length-matched all the traces on the board to within 100 ps. I don't know whether that mattered, but we had no problems aggregating four x4 links into x16 over 100-meter optics.
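For scale, the 100 ps figure can be expressed in unit intervals (UI), the duration of one bit on the line. The sketch below uses only the 8 GT/s Gen3 rate; the function names are made up for this illustration, and it says nothing about how much skew a particular receiver can tolerate.

```python
def unit_interval_ps(rate_gt_per_s: float = 8.0) -> float:
    """Duration of one bit on the line, in picoseconds."""
    return 1000.0 / rate_gt_per_s


def skew_in_ui(skew_ps: float, rate_gt_per_s: float = 8.0) -> float:
    """Express a lane-to-lane skew in unit intervals."""
    return skew_ps / unit_interval_ps(rate_gt_per_s)


if __name__ == "__main__":
    print(unit_interval_ps(), "ps per UI")
    print(skew_in_ui(100.0), "UI of board-level skew budget")
```

At 8 GT/s one UI is 125 ps, so the board-level matching keeps the traces within 0.8 UI of each other.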

Common problems

The cable connection we use (both optical and copper) has the following pitfalls:

the reference clock (hereinafter, the clock) is not passed over the cable;
I2C is not passed over the cable;
PERST#, PRSNT# and other sideband signals are not passed over the cable.

Clock
In general, PCIe needs a shared reference clock for normal link operation. But the switches we use are SRIS-compatible. SRIS stands for Separate Refclk Independent SSC. In simple terms, they can not only run from different clocks, but these clocks can also use SSC.

SSC

Spread Spectrum Clocking (SSC) is modulation of the clock frequency in the range from 0 to -5000 ppm at a rate of 30-33 kHz. It is used to reduce electromagnetic interference (EMI): since the clock frequency constantly drifts a little, the radiated power is smeared across the spectrum.
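The numbers are easy to check with a small model. The sketch below assumes a 100 MHz reference clock (the usual PCIe value) and a triangular modulation profile; the text above specifies only the 0 to -5000 ppm range and the 30-33 kHz rate, so the profile shape, the 31.5 kHz default and the function name are assumptions.

```python
def ssc_frequency_hz(t_s: float,
                     nominal_hz: float = 100e6,   # assumed reference clock
                     spread_ppm: float = 5000.0,  # down-spread depth from the text
                     mod_hz: float = 31.5e3) -> float:  # assumed rate within 30-33 kHz
    """Instantaneous clock frequency for triangular down-spread SSC."""
    phase = (t_s * mod_hz) % 1.0
    # Triangle wave: 0 at the start of a period, 1 in the middle, back to 0.
    tri = 2.0 * phase if phase < 0.5 else 2.0 * (1.0 - phase)
    return nominal_hz * (1.0 - spread_ppm * 1e-6 * tri)


if __name__ == "__main__":
    period = 1.0 / 31.5e3
    for frac in (0.0, 0.25, 0.5, 0.75):
        f = ssc_frequency_hz(frac * period)
        print(f"t = {frac:4.2f} T: {f / 1e6:.4f} MHz")
```

With these assumptions the clock sweeps between 100 MHz and 99.5 MHz and never goes above the nominal frequency, which is what "0 to -5000 ppm" means.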

At the same time, normal operation is possible only when two SRIS devices are connected. Operation is not guaranteed when a host processor is connected directly to anything else. That said, we observed normal operation with a switchless cable connection from an x86 processor to a PLX device, provided that the clock used by the host processor does not use SSC. This is called SRNS, Separate Refclk with No SSC.

Our adapter has its own clock subsystem, which allows either clocking everything from the system PCIe clock or running from its own clock, which in turn can be either SSC or non-SSC.

Testing showed that we did not like operation from the system clock, although we had cherished the hope of using only it and removing all the rest of the clock-related circuitry.

Operation from the adapter's own clock, both non-SSC and SSC, is stable and reliable with passive copper cables up to 4 meters, as well as with an optical cable. For a 10-meter copper cable, normal link operation is currently achieved only with a non-SSC clock.

I2C
This is an optional interface, and in general we lose nothing by not passing it. It just needed to be noted that we do not pass it.

Other sidebands
Their absence entails, for example, the following problems:

the inability to reset the entire PCIe tree synchronously, as happens when all PCIe devices are in one box;
problems with the sudden disappearance and appearance of devices on the PCIe bus;

and others of the same kind, which can be summed up in one word: synchronization.

These problems require a more global approach and will be solved at the system integration stage. For now, we get by with powering on the host system after the target, so that the host can enumerate PCIe devices normally during its standard boot process.

This problem is also not relevant when using NTB.

Variants and analogues

As Alexey mentioned in the previous article, the adapter will come in two versions. The second version will not have a PCIe switch, only a simple redriver. This is a cheaper option, suitable for cases where port bifurcation is not needed.

I should also say that the relatively inexpensive PCIe switch chip we use has no DMA controller. Therefore, if two of these adapters are connected to each other, the speed will be low and will depend heavily on the processor: how it operates and with what data block sizes. If at least one of the adapters has a DMA controller, the actual speed is usually about 50-60% of the physical limit, that is, about 8-9 GB/s in one direction. Of course, a single data transfer operation can run at 16 GB/s. PR people like to write 32 GB/s, which is the physical limit in both directions combined. But communication on the PCIe bus is usually two-way: the target sends a Completion in response to each transaction. So operating at 32 GB/s seems very doubtful to me.
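The bandwidth figures above follow from simple arithmetic, sketched below. The 8 GT/s rate and the x16 width come from the article; the 128b/130b line coding factor is the standard Gen3 encoding, and the function name and the efficiency parameter are made up for this illustration.

```python
def gen3_bandwidth_gbytes(lanes: int,
                          efficiency: float = 1.0,
                          line_coding: float = 128 / 130) -> float:
    """One-direction Gen3 bandwidth in GB/s for the given link width."""
    raw_gbits = 8.0 * lanes  # 8 GT/s per lane, one bit per transfer
    return raw_gbits / 8.0 * line_coding * efficiency


if __name__ == "__main__":
    print(f"x16 raw line rate:      {gen3_bandwidth_gbytes(16, line_coding=1.0):.2f} GB/s")
    print(f"x16 after 128b/130b:    {gen3_bandwidth_gbytes(16):.2f} GB/s")
    print(f"x16 at 50% efficiency:  {gen3_bandwidth_gbytes(16, 0.5):.2f} GB/s")
    print(f"x16 at 60% efficiency:  {gen3_bandwidth_gbytes(16, 0.6):.2f} GB/s")
```

The raw x16 line rate is 16 GB/s per direction (hence 32 GB/s for both directions together), about 15.75 GB/s after line coding, and 50-60% of that lands at roughly 8 to 9.5 GB/s, which is in line with the estimate quoted above.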

For us, this choice of chip is quite rational: in the planned usage scenarios, at least one adapter will have a DMA controller. So paying twice as much for a more advanced PCIe switch chip with built-in DMA support makes no sense in our case.

We do not plan to sell this adapter as a separate product, since that is not our line of business; we will only ship it as part of complete solutions built from our own equipment. If you need a similar adapter with a DMA controller, such solutions have appeared on the market. You can look at the Dolphin PXH830 (HBA) or PXH832 (Host / Target), although these adapters have slightly different parameters.

The 830 model can partition its ports into 2 x8 or 1 x16; it lacks our 4 x4 mode. The 832 model can operate in 4 x4 mode, but only with passive cables up to 5 m. Support for optical cables up to 100 m long is declared for both models.

What's next?

As a result, the adapter works with copper cables up to 10 meters and with optical cables up to 100 meters, and BER in both cases is within the standard. We plan to release a second revision with fewer layers and an improved layout, and to test and play around with fine-tuning the parameters for optics to polish the results further.

Thank you for your attention.

Source: https://habr.com/ru/post/314364/
