Hello!
Recently we had a task: speed up FPGA loading. From power-up to the working state we have no more than 100 ms. Since the chip is not the newest (Altera Cyclone IV GX), a fast EPCQ flash chip simply cannot be connected to it directly. So we decided to use FPP (Fast Passive Parallel) mode, with an external Intel MAX V CPLD running the PFL (Parallel Flash Loader) core. At startup, the CPLD reads data from the flash and generates the FPP signals on its outputs.
However, before laying out the board, we assembled a DIY prototype from what was at hand and used it as a guinea pig. Unfortunately, because of the flying wires on the board, the operating frequencies had to be reduced, but the essence of FPP operation did not change, and debugging became simpler. In this article I decided to write up what came of it and how the FPGA is configured. If you are interested, read on.
What I plan to talk about:
There are two types of FPGA boot: active and passive. Active means that the FPGA itself clocks the external flash and reads the firmware from it. Passive means that some host (processor, FPGA, controller) loads the firmware into the FPGA.
Active mode is probably familiar to everyone: we connect an EPCS / EPCQ flash chip, upload the firmware via JTAG and enjoy life, since the FPGA does all the remaining work of reading the data at power-up.
Passive mode is a bit more tedious because it requires additional logic: the host must load the firmware from somewhere, someone has to put that firmware there, and so on. But passive mode is often faster. For example, Cyclone IV supports a 40 MHz DCLK with 1 data bit in Active Serial mode, versus 100 MHz with 8 data bits in FPP (Fast Passive Parallel) mode, which is 20 times the raw bandwidth.
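The "20 times" figure is simply clock frequency multiplied by bus width. A minimal sketch of that arithmetic, using only the numbers quoted above (the function and constant names are made up for illustration):

```python
def config_bandwidth_bps(dclk_hz: float, data_bits: int) -> float:
    """Raw configuration bandwidth: DCLK frequency times data bits per clock."""
    return dclk_hz * data_bits


AS_BPS = config_bandwidth_bps(40e6, 1)     # Active Serial: 40 MHz, 1 bit
FPP_BPS = config_bandwidth_bps(100e6, 8)   # FPP: 100 MHz, 8 bits

print(f"AS : {AS_BPS / 1e6:.0f} Mbit/s")
print(f"FPP: {FPP_BPS / 1e6:.0f} Mbit/s")
print(f"FPP / AS = {FPP_BPS / AS_BPS:.0f}x")
```

This is the raw bus bandwidth only; it says nothing yet about whether the source of the data can keep up.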
The main signals that are responsible for configuring FPGAs are nCONFIG, nSTATUS, CONF_DONE:
The timing diagrams of these modes differ little. Here is the one for our case, passive parallel:
The following happens in it:
Read more about configuration here.
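For readers who prefer code to timing diagrams, here is a toy model of the host side of a passive parallel load. It is only a sketch under loud assumptions: the `ToyFpga` class and `configure` function are hypothetical, all timing parameters are ignored, and error reporting via nSTATUS is left out. It shows only the order of events: pulse nCONFIG low, wait for nSTATUS to go high, clock the data in on DCLK, then check CONF_DONE.

```python
class ToyFpga:
    """Very simplified model of an FPGA in FPP x8 mode (illustration only)."""

    def __init__(self, image_size: int):
        self.image_size = image_size   # number of bytes the device expects
        self.received = 0
        self.nstatus = 0               # low until the device is ready for data
        self.conf_done = 0

    def set_nconfig(self, level: int) -> None:
        if level == 0:
            # nCONFIG low resets the device: nSTATUS and CONF_DONE go low.
            self.received = 0
            self.nstatus = 0
            self.conf_done = 0
        else:
            # Once nCONFIG is released, the device releases nSTATUS.
            self.nstatus = 1

    def clock_byte(self, data: int) -> None:
        """One DCLK rising edge with DATA[7:0] = data."""
        if self.nstatus and not self.conf_done:
            self.received += 1
            if self.received == self.image_size:
                self.conf_done = 1     # the whole image has been accepted


def configure(fpga: ToyFpga, image: bytes) -> bool:
    fpga.set_nconfig(0)                # 1. start (re)configuration
    fpga.set_nconfig(1)                # 2. release nCONFIG
    if not fpga.nstatus:               # 3. device must signal it is ready
        return False
    for byte in image:                 # 4. one byte per DCLK cycle
        fpga.clock_byte(byte)
    return bool(fpga.conf_done)        # 5. CONF_DONE high means success


image = bytes(range(16))
ok = configure(ToyFpga(image_size=len(image)), image)
print("configured:", ok)
```

A real host also has to respect the timing limits from the device handbook and keep clocking DCLK for a while after CONF_DONE so that initialization can complete.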
The configuration mode is selected with dedicated FPGA pins, MSEL. The number of modes differs between families. For example, in Cyclone V, compression and firmware encryption are selected in addition to Active / Passive, while in Cyclone IV only Active / Passive and Fast / Standard POR are selected.
FPP (Fast Passive Parallel) is a passive mode in which the configuration is loaded into the FPGA in words (1 byte or more), synchronously with the DCLK clock.
The “light version” of this mode is Passive Serial, where the firmware is loaded bit by bit.
This mode uses dedicated pins, some of which (the most significant data bits) later become normal I/O:
In higher-end chips (Stratix V, for example), you can load the firmware at 2 (FPPx16) and 4 (FPPx32) bytes per clock.
Let's calculate the FPGA boot time in passive mode. We take the start to be the moment the supply voltage on the chip reaches a certain level, and the end to be the assertion of CONF_DONE.
The time is made up of the following stages:
tPOR depends exclusively on the MSEL settings. Naturally, where load time is critical and the power scheme allows it, set the mode to Fast (see the MSEL tables above).
With tCFG, everything is a bit trickier, because it depends on:
This is what the Cyclone IV documentation, Table 8-2, says about firmware size:
For example, for the EP4CGX75 chip, the maximum uncompressed firmware size is 22,010,888 bits = 2,751,361 bytes. If we assume the external host can run at 100 MHz and transfers one byte per clock (FPP x8), loading the maximum firmware takes about 2.7e6 / 100e6 = 0.027 seconds = 27 milliseconds. With compression this time can be roughly halved, to less than 15 ms!
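The same estimate as a few lines of Python, in case you want to plug in another chip or clock. The constants come from the paragraph above; the factor of two for compression is the article's rough assumption, not a guaranteed ratio.

```python
IMAGE_BITS = 22_010_888      # EP4CGX75, uncompressed (from the text)
DCLK_HZ = 100e6              # FPP clock assumed in the text
BYTES_PER_CLOCK = 1          # FPP x8 moves one byte per DCLK cycle

image_bytes = IMAGE_BITS // 8
t_cfg_s = image_bytes / BYTES_PER_CLOCK / DCLK_HZ

# Assumption: compression roughly halves the image, as stated in the text.
t_cfg_compressed_s = t_cfg_s / 2

print(f"image size       : {image_bytes} bytes")
print(f"tCFG uncompressed: {t_cfg_s * 1e3:.1f} ms")
print(f"tCFG compressed  : {t_cfg_compressed_s * 1e3:.1f} ms (rough)")
```

The unrounded result is slightly above 27 ms because the text rounds the image size down to 2.7e6 bytes.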
However, to get DCLK = 100 MHz, the firmware has to be read at least as fast. One of the options offered by Intel / Altera is to use a MAX II / V CPLD for this: on one side the CPLD reads the flash, on the other it writes to the FPGA.
For this task the collection of ready-made IP cores includes the Parallel Flash Loader (PFL). The core supports a range of flash chips with different interfaces (QSPI, NxQSPI, CFI and so on) and lets you not only read the firmware from them but also write it over JTAG.
It also lets you store several firmware images in the flash instead of one, which makes it possible to roll back to a stable image if an update goes wrong.
See the datasheet for more details.
We chose the Flash Programming & FPGA Configuration mode with an EPCQ as the flash. This lets us meet the required load time while keeping the ability to program the flash via JTAG or from another host.
Screenshots of the settings we used:
The CPLD firmware occupies 1267 LEs, which is almost 100% of an EPM1270F256C5 (MAX II) or a 5M1270ZT144C5 (MAX V). Without the flash programming option (say, if some other host can do it), far fewer resources are used and a margin of 754 LEs appears.
Here is how the CPLD is connected:
How PFL works (approximate algorithm):
Let's see what performance we can squeeze out of the described configuration with one EPCQ. For that, we need a rough idea of how the EPCQ works.
EPCQ (an advanced version of EPCS) is a NOR flash with an SPI interface for most commands and Quad SPI for some. It reads very quickly but writes and erases very slowly.
Commands (erase, read, write) always arrive over plain SPI; what happens next depends on the command. For example:
See, for example, the datasheet of a Macronix flash chip.
Maximum EPCQ read performance is achieved using all four data bits at the maximum frequency with DDR. However, the timing characteristics scale nonlinearly:
That is, the throughput in the cases from the example above is as follows:
There are also faster flash chips, for example S25FL064L, whose frequency does not drop as the read width increases. So to squeeze out maximum speed it is better to look at those.
If you convert the 4READ / 4DTREAD modes into the time needed to read our Cyclone IV firmware, you get 78 ms / 55 ms. As a reminder, to keep up with FPP you would need 27 milliseconds (see above).
So the bottleneck in our task is the interface for reading the firmware, not FPP itself. If we needed significantly less than 100 ms, we would have to use two EPCQ flash chips. But for our case, plain 4READ is enough.
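To make the bottleneck visible, the read times quoted above can be turned back into average throughput and compared with FPP and with the 100 ms budget. This is a sketch: the constant names are invented, the read times are taken from the text as given, and the budget check deliberately ignores tPOR and any CPLD start-up overhead.

```python
IMAGE_BYTES = 2_751_361     # EP4CGX75 uncompressed image (from the text)
FPP_TIME_S = 0.027          # FPP x8 at 100 MHz (from the text)
BUDGET_S = 0.100            # power-up to working state (from the text)
READ_TIME_S = {"4READ": 0.078, "4DTREAD": 0.055}   # from the text


def implied_mbytes_per_s(read_time_s: float) -> float:
    """Average flash throughput implied by a given read time."""
    return IMAGE_BYTES / read_time_s / 1e6


for mode, t in READ_TIME_S.items():
    print(
        f"{mode:8s} {implied_mbytes_per_s(t):5.1f} MB/s, "
        f"{t / FPP_TIME_S:.1f}x slower than FPP, "
        f"within budget: {t < BUDGET_S}"
    )
```

Both modes fit into 100 ms, but neither comes close to feeding FPP at its full 100 MB/s.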
A list of supported EPCQs is provided in the PFL documentation, section 1.2.1.
Curiously, Intel / Altera not long ago discontinued its own EPCQ devices and now officially supports Micron parts instead.
To combine theory with practice, we picked up what we had on hand:
The chips were connected as follows:
A QSPI flash chip was soldered onto an adapter board and fitted into the prototype:
After completing a small quest (see the documentation, section 1.4.1) to create the firmware image that is uploaded to the EPCQ via PFL, we measured the timing. When creating the image, note that you need to specify CFI flash as the target even though an EPCQ is used.
Alas, because of the point-to-point wiring and the resulting tangle of wires, the PFL operating frequency came out at 6.25 MHz, with the EPCQ clocked at 25 MHz. We set this clock with the "Ratio between input clock and DCLK" option by setting the value to 8 (input clock = 50 MHz). Here is the plot of the FPP (blue) and EPCQ (yellow) clocks: for each FPP clock cycle there are two EPCQ DCLK cycles, because the FPP bus is twice as wide as the EPCQ bus (8 bits vs. 4).
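The clock relationships in this setup come down to two divisions. A small sketch with the values from the paragraph above (the variable names are illustrative, not PFL parameter names):

```python
INPUT_CLOCK_HZ = 50e6   # CPLD input clock (from the text)
RATIO = 8               # "Ratio between input clock and DCLK" setting
FPP_BITS = 8            # FPP x8 data bus
QSPI_BITS = 4           # quad flash data bus

fpp_dclk_hz = INPUT_CLOCK_HZ / RATIO
flash_clocks_per_fpp_clock = FPP_BITS // QSPI_BITS
fpp_bytes_per_s = fpp_dclk_hz * FPP_BITS / 8

print(f"FPP DCLK: {fpp_dclk_hz / 1e6:.2f} MHz")
print(f"flash data clocks per FPP byte: {flash_clocks_per_fpp_clock}")
print(f"FPP throughput: {fpp_bytes_per_s / 1e6:.2f} MB/s")
```

At 6.25 MB/s the FPP side is far below what the earlier calculation assumed, which is why the measured load times that follow are in the hundreds of milliseconds.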
Here is a plot of the supply voltage (yellow) and the CONF_DONE signal (blue) in Standard POR mode without compression: the total load time is 780 ms.
About 100 ms passes between the appearance of power and the nSTATUS signal indicating that the chip is ready to accept configuration data:
If you enable Fast mode (only MSEL changes, see the table above), POR completes in a matter of milliseconds and the load takes ~680 ms:
And the last step: turn on compression. To do this, tick the compression checkbox in Quartus when adding the firmware file to PFL and switch MSEL (on Cyclone IV, MSEL does not need to change). And we get a completely different picture:
Compared with the original version, without compression and without Fast POR, the load time is roughly halved, to about 400 ms.
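The three oscilloscope readings can be cross-checked against each other. In the sketch below, the first two values are the measurements quoted above; reading the last result as a total of roughly 400 ms is my interpretation of the text, so treat the derived ratios as approximate.

```python
MEASURED_MS = {
    "standard POR, no compression": 780,
    "fast POR, no compression": 680,
    "fast POR, compression": 400,    # approximate, as read from the text
}

por_saving_ms = (
    MEASURED_MS["standard POR, no compression"]
    - MEASURED_MS["fast POR, no compression"]
)
compression_gain = (
    MEASURED_MS["fast POR, no compression"]
    / MEASURED_MS["fast POR, compression"]
)
overall_gain = (
    MEASURED_MS["standard POR, no compression"]
    / MEASURED_MS["fast POR, compression"]
)

print(f"saved by Fast POR    : {por_saving_ms} ms")
print(f"gain from compression: {compression_gain:.2f}x")
print(f"overall gain         : {overall_gain:.2f}x")
```

The 100 ms saved by Fast POR matches the delay seen between power-up and nSTATUS in Standard mode, which is a reassuring sanity check on the measurements.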
Let's check whether the oscilloscope measurements agree with the calculations. We take tPOR as zero because Fast POR is used. So all that remains is to check how well the ~680 ms load time matches the firmware size and the FPP DCLK frequency.
The size of our Cyclone V firmware is ~4 megabytes (32 megabits). The FPP DCLK frequency is 6.25 MHz. Each clock transfers 8 bits, that is, 1 byte. So the estimated time is 4 / 6.25 = 0.64 seconds. Voilà!
If we raise the FPP DCLK frequency proportionally, say 8 times to 50 MHz, we get 80 ms. With compression enabled as well, it is even less.
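The estimate and its scaling with DCLK fit in one small helper. This is a sketch with a made-up function name; the image size is the approximate 4 megabytes from the text, so the result is only as precise as that figure.

```python
def load_time_s(image_mbytes: float, dclk_mhz: float,
                bytes_per_clock: int = 1) -> float:
    """Time to push an image over FPP, ignoring tPOR and any stalls."""
    return image_mbytes / (dclk_mhz * bytes_per_clock)


IMAGE_MBYTES = 4.0      # approximate image size (from the text)
MEASURED_S = 0.68       # oscilloscope reading with Fast POR (from the text)

t_prototype_s = load_time_s(IMAGE_MBYTES, 6.25)   # prototype DCLK
t_scaled_s = load_time_s(IMAGE_MBYTES, 50.0)      # DCLK raised 8 times

print(f"estimate at 6.25 MHz: {t_prototype_s * 1e3:.0f} ms")
print(f"measured            : {MEASURED_S * 1e3:.0f} ms")
print(f"difference          : {(MEASURED_S - t_prototype_s) * 1e3:.0f} ms")
print(f"estimate at 50 MHz  : {t_scaled_s * 1e3:.0f} ms")
```

The estimate lands within about 40 ms of the measurement, which is reasonable given that the image size is rounded and the helper ignores everything except the data transfer itself.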
In theory, the FPGA can load in tens of milliseconds; by our calculations for Cyclone IV, in about 30 ms. Keep in mind, however, that both the host loading the data into the FPGA and the flash the data is read from must provide matching bandwidth.
In more recent FPGAs, an EPCQ flash chip can be connected directly and used in active mode. If the design does not allow moving to a new family, you can add an external loader and use passive mode. For that we use the Parallel Flash Loader, as in our prototype.
With the described approach, PFL and one EPCQ, it is quite realistic to bring the FPGA load time down to 50 ms.
Anyone who read to the end deserves a monument. Thank you very much!
Source: https://habr.com/ru/post/358638/