Features of window filtering on FPGA

Hello! In this article we will discuss one important part of digital signal processing: window filtering of signals, in particular on FPGAs. The article shows how to design classic windows of standard length as well as “long” windows from 64K to 16M+ samples. The main development language is VHDL, and the target devices are modern Xilinx FPGAs of the latest families: UltraScale, UltraScale+, and 7-series. The article presents an implementation of CORDIC, the base core for building window functions of any length, as well as the basic window functions themselves. It also describes a design flow that uses the high-level C/C++ languages in Vivado HLS. As usual, at the end of the article you will find a link to the source code of the project.

Lead image: a typical signal flow through DSP nodes for spectrum analysis tasks.

Introduction

Many people know from a “Digital Signal Processing” course that the spectrum of an infinitely long sinusoidal signal is a delta function at the signal frequency. In practice, the spectrum of a real, time-limited harmonic signal has the form ~ sin(x) / x, and the width of its main lobe depends on the duration T of the analysis interval. Limiting a signal in time is nothing more than multiplying it by a rectangular envelope. It is known from DSP theory that multiplying signals in the time domain corresponds to convolving their spectra in the frequency domain (and vice versa), so the spectrum of a harmonic signal limited by a rectangular envelope has the form ~ sinc(x). This follows from the fact that we cannot integrate the signal over an infinite time interval, and the discrete Fourier transform, expressed as a finite sum, is limited in the number of samples. As a rule, the FFT length NFFT in modern FPGA-based digital processing devices ranges from 8 to several million points. In other words, the spectrum of the input signal is calculated over the interval T, which in many cases is equal to NFFT. By limiting the signal to the interval T, we impose a rectangular “window” with a duration of T samples. The resulting spectrum is therefore the spectrum of the product of the harmonic signal and the rectangular envelope.

Windows of various shapes were invented for DSP tasks long ago. Applied to a signal in the time domain, they can improve its spectral characteristics. The large number of different windows is primarily due to one fundamental property of any window: the trade-off between the side-lobe level and the width of the main lobe. The rule is well known: the stronger the side-lobe suppression, the wider the main lobe, and vice versa.
One of the applications of window functions is the detection of weak signals against the background of stronger ones by suppressing the side-lobe level. The main window functions in DSP tasks are the triangular, sine, Lanczos, Hann, Hamming, Blackman, Harris, Blackman-Harris, flat-top, Nuttall, Gaussian, and Kaiser windows, among many others. Most of them are expressed as a finite series: a sum of harmonic signals with specific weighting coefficients. Windows of more complex shape are calculated using an exponential (Gaussian window) or a modified Bessel function (Kaiser window) and will not be considered in this article. More information about window functions can be found in the literature, which I traditionally list at the end of the article.

The following figure shows typical window functions and their spectral characteristics, plotted in MATLAB.

Implementation

At the beginning of the article I placed a lead image that shows, in general form, a block diagram of multiplying the input data by a window function. Obviously, the easiest way to implement a window function in an FPGA is to write it into memory (block RAM or distributed RAM, it does not matter which) and then read the data out cyclically as the input signal samples arrive. As a rule, the internal memory of modern FPGAs is large enough to store window functions of relatively small size, which are then multiplied by the incoming signal. By small, I mean window functions with a length of up to 64K samples.

But what if the window function is too long? For example, 1M samples. It is easy to calculate that such a window function, stored with 32-bit coefficients, requires NRAMB = 1024 * 1024 * 32 / 32768 = 1024 RAMB36-type block RAM primitives of a Xilinx FPGA. And for 16M samples? 16 thousand memory blocks! Not a single modern FPGA has that much block memory. Even the 1M case is too much for many FPGAs, and where it does fit, it is a wasteful use of FPGA resources (and, of course, of the customer's money).
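The estimate above can be reproduced with a few lines of C++. This is only an illustrative sketch: the helper name `ramb36_blocks` is hypothetical, and it assumes 32768 usable data bits per RAMB36 block (parity bits not counted), as in the calculation above.

```cpp
#include <cassert>
#include <cstdint>

// Usable data bits in one RAMB36 block without parity (assumption taken
// from the 32768 divisor used in the text).
const std::uint64_t kRamb36DataBits = 32768;

// Hypothetical helper: number of RAMB36 blocks needed to store a window of
// `samples` coefficients, each `bits_per_sample` bits wide (rounded up).
inline std::uint64_t ramb36_blocks(std::uint64_t samples,
                                   std::uint64_t bits_per_sample) {
    assert(bits_per_sample > 0 && "coefficient width must be non-zero");
    const std::uint64_t total_bits = samples * bits_per_sample;
    return (total_bits + kRamb36DataBits - 1) / kRamb36DataBits;
}
```

With this helper, `ramb36_blocks(1ull << 20, 32)` gives 1024 blocks and `ramb36_blocks(1ull << 24, 32)` gives 16384 blocks, matching the figures in the text.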

This means we need a method for generating the samples of a window function directly in the FPGA on the fly, without loading the coefficients into block memory from an external device. Fortunately, the basic building blocks were invented for us long ago. Using an algorithm such as CORDIC (the “digit-by-digit” method), it is possible to build many window functions whose formulas are expressed through harmonic signals (Blackman-Harris, Hann, Hamming, Nuttall, etc.).
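To make the idea of on-the-fly generation concrete, here is a minimal software sketch of a 3-term cosine-sum window that keeps only a phase counter and three coefficients instead of a coefficient table. It is not the author's VHDL: the class name `CosineWindowStream` is hypothetical, `std::cos` stands in for the CORDIC core, and the window period is assumed to be a power of two (2^phase_width).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical model: yields one window sample per call, no stored table.
class CosineWindowStream {
public:
    CosineWindowStream(int phase_width, double a0, double a1, double a2)
        : period_(make_period(phase_width)), a0_(a0), a1_(a1), a2_(a2), n_(0) {}

    // w(n) = a0 - a1*cos(2*pi*n/N) + a2*cos(2*pi*2n/N), with N = 2^phase_width
    double next() {
        const double kTwoPi = 6.28318530717958647692;
        const std::uint64_t mask = period_ - 1;
        const double step = kTwoPi / static_cast<double>(period_);
        // The phase of each harmonic wraps modulo N, like a hardware counter.
        const double c1 = std::cos(step * static_cast<double>(n_));
        const double c2 = std::cos(step * static_cast<double>((2 * n_) & mask));
        n_ = (n_ + 1) & mask;
        return a0_ - a1_ * c1 + a2_ * c2;
    }

private:
    static std::uint64_t make_period(int phase_width) {
        assert(phase_width >= 1 && phase_width <= 32);
        return std::uint64_t{1} << phase_width;
    }

    std::uint64_t period_;
    double a0_, a1_, a2_;
    std::uint64_t n_;
};
```

For example, `CosineWindowStream w(10, 0.42, 0.5, 0.08);` streams a classic Blackman window with a period of 1024 samples. The state is a single counter, so the memory cost does not grow with the window length.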

CORDIC

CORDIC is a simple and convenient iterative method for computing coordinate rotations, which makes it possible to evaluate complex functions using only primitive addition and shift operations. With the CORDIC algorithm you can calculate the harmonic signals sin(x) and cos(x), find the phase with atan(x) and atan2(y, x), evaluate hyperbolic functions, rotate a vector, extract a square root, and so on.

At first, I wanted to take a ready-made CORDIC core to reduce the amount of work, but I have a long-standing dislike of Xilinx cores. After studying the repositories on GitHub, I realized that none of the available cores were suitable, for a number of reasons (poorly documented and unreadable, not universal, made for a specific task or device family, ~~written in Verilog~~, etc.). Then I asked my colleague lazifo to do this work for me. Of course, he coped with it, because implementing CORDIC is one of the simplest tasks in DSP. But since I am impatient, in parallel with his work I ~~reinvented the wheel~~ wrote my own parameterized core. Its main features are the configurable width of the output signals, DATA_WIDTH, the configurable width of the normalized input phase (from -1 to 1), PHASE_WIDTH, and the configurable calculation precision, PRECISION. The CORDIC core is built as a fully pipelined parallel circuit: on every clock cycle it is ready to accept a new input sample and perform calculations. The core spends N clock cycles on computing an output sample, where N depends on the output width (the greater the bit depth, the more iterations are needed to calculate the output value). All stages operate in parallel. CORDIC is thus the base core for creating window functions.
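As a rough illustration of the shift-and-add principle, below is a small integer reference model of a sine/cosine CORDIC with a normalized phase input. It is a sketch under assumptions, not the author's core: the names `SinCos` and `cordic_sincos`, the 30 fractional bits, and the default of 24 iterations are all chosen only for this example. Each loop iteration corresponds to what would be one pipeline stage in hardware.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct SinCos {
    double sine;
    double cosine;
};

// Hypothetical reference model.
// phase: normalized angle, -1.0 <= phase < 1.0 maps to -pi .. +pi.
SinCos cordic_sincos(double phase, int iterations = 24) {
    assert(phase >= -1.0 && phase < 1.0);
    assert(iterations >= 1 && iterations <= 30);

    const int kFracBits = 30;  // fixed-point fraction bits (assumed)
    const double kPi = 3.14159265358979323846;

    // CORDIC converges only within roughly +/-99.9 degrees, so fold the
    // angle into +/-90 degrees and remember the 180-degree turn.
    bool half_turn = false;
    if (phase > 0.5)       { phase -= 1.0; half_turn = true; }
    else if (phase < -0.5) { phase += 1.0; half_turn = true; }

    // Start from (1/K, 0) so that the accumulated CORDIC gain K cancels.
    double gain = 1.0;
    for (int i = 0; i < iterations; ++i)
        gain *= std::sqrt(1.0 + std::ldexp(1.0, -2 * i));

    std::int64_t x = std::llround(std::ldexp(1.0 / gain, kFracBits));
    std::int64_t y = 0;
    std::int64_t z = std::llround(std::ldexp(phase, kFracBits));  // angle / pi

    for (int i = 0; i < iterations; ++i) {
        // atan(2^-i) / pi: the quantity a hardware angle table would hold.
        const std::int64_t angle = std::llround(
            std::ldexp(std::atan(std::ldexp(1.0, -i)) / kPi, kFracBits));
        const std::int64_t xs = x >> i;  // arithmetic shift on mainstream compilers
        const std::int64_t ys = y >> i;
        if (z < 0) { x += ys; y -= xs; z += angle; }
        else       { x -= ys; y += xs; z -= angle; }
    }

    double s = std::ldexp(static_cast<double>(y), -kFracBits);
    double c = std::ldexp(static_cast<double>(x), -kFracBits);
    if (half_turn) { s = -s; c = -c; }
    return {s, c};
}
```

Calling `cordic_sincos(0.25)` corresponds to an angle of 45 degrees, so both outputs should be close to 0.7071. The number of iterations plays the role that the output bit depth plays in the hardware core: more iterations give a smaller residual angle and a longer pipeline.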

Window functions

In this article, I implement only those window functions that are expressed through harmonic signals (Hann, Hamming, Blackman-Harris of various orders, etc.). What is needed for this? In general terms, the formula for constructing a window is a series of finite length.

A specific set of coefficients a_k and the number of terms in the series determine the name of the window. The most popular and frequently used is the Blackman-Harris window of various orders (from 3 to 11). Below is a table of coefficients for Blackman-Harris windows:

In principle, the Blackman-Harris window set is suitable for many spectral analysis problems, and there is no need to resort to the more complex Gaussian or Kaiser windows. The Nuttall and flat-top windows are simply variations with different weights, built on the same basic principles as Blackman-Harris. It is known that the more terms in the series, the stronger the side-lobe suppression (provided the window function is chosen sensibly). The developer simply has to choose the type of window that suits the task.
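The finite series for this family of windows is commonly written as w(n) = a_0 - a_1 cos(2πn/N) + a_2 cos(4πn/N) - ..., that is, a sum of cosine harmonics with alternating signs. The sketch below assumes this common form (periodic variant, dividing by N) and is not taken from the project sources. The function names are hypothetical. The coefficient sets are the widely published Hann, Hamming, and 4-term Blackman-Harris values, not necessarily the ones in the article's table.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// w[n] = sum over k of (-1)^k * a[k] * cos(2*pi*k*n / N), n = 0 .. N-1
std::vector<double> cosine_sum_window(const std::vector<double>& a,
                                      std::size_t length) {
    assert(!a.empty() && length > 0);
    const double kTwoPi = 6.28318530717958647692;
    std::vector<double> w(length);
    for (std::size_t n = 0; n < length; ++n) {
        double acc = 0.0;
        double sign = 1.0;
        for (std::size_t k = 0; k < a.size(); ++k) {
            const double phase = kTwoPi * static_cast<double>(k) *
                                 static_cast<double>(n) /
                                 static_cast<double>(length);
            acc += sign * a[k] * std::cos(phase);
            sign = -sign;  // alternate the sign of successive terms
        }
        w[n] = acc;
    }
    return w;
}

// Widely published coefficient sets (assumed here for illustration).
inline std::vector<double> hann_coeffs()    { return {0.5, 0.5}; }
inline std::vector<double> hamming_coeffs() { return {0.54, 0.46}; }
inline std::vector<double> blackman_harris4_coeffs() {
    return {0.35875, 0.48829, 0.14128, 0.01168};
}
```

Switching from one window to another only changes the coefficient vector, which is why a single hardware structure can serve a whole family of windows.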

FPGA implementation - traditional approach

All the window function cores are designed using the classical approach to describing digital circuits for FPGAs and are written in VHDL. Below is a list of the components:

bh_win_7term - 7-term Blackman-Harris, the window with maximum side-lobe suppression.
bh_win_5term - 5-term Blackman-Harris, includes the flat-top window.
bh_win_4term - 4-term Blackman-Harris, includes the Nuttall and Blackman-Nuttall windows.
bh_win_3term - 3-term Blackman-Harris.
hamming_win - Hamming and Hann windows.

The entity declaration of the 3-term Blackman-Harris window component:

 entity bh_win_3term is
     generic (
         TD        : time    := 0.5ns;  --! Time delay
         PHI_WIDTH : integer := 10;     --! Signal period = 2^PHI_WIDTH
         DAT_WIDTH : integer := 16;     --! Output data width
         XSERIES   : string  := "ULTRA" --! for 6/7 series: "7SERIES"; for ULTRASCALE: "ULTRA";
     );
     port (
         RESET  : in  std_logic; --! Global reset
         CLK    : in  std_logic; --! System clock
         AA0    : in  std_logic_vector(DAT_WIDTH-1 downto 0); -- A0
         AA1    : in  std_logic_vector(DAT_WIDTH-1 downto 0); -- A1
         AA2    : in  std_logic_vector(DAT_WIDTH-1 downto 0); -- A2
         ENABLE : in  std_logic; --! Clock enable
         DT_WIN : out std_logic_vector(DAT_WIDTH-1 downto 0); --! Output
         DT_VLD : out std_logic  --! Output data valid
     );
 end bh_win_3term;

In some cases, I used the UNISIM library to instantiate the DSP48E1 and DSP48E2 primitives directly, which ultimately makes it possible to increase the computation speed thanks to the pipelining inside these blocks. But as practice has shown, it is faster and easier to give in to laziness, write something like P = A * B + C, and specify the following attribute in the code:

 attribute USE_DSP of <signal_name>: signal is "YES";

This works well and strictly tells the synthesizer which type of primitive to implement the arithmetic function on.

Vivado HLS

In addition, I implemented all the cores using Vivado HLS. Its main advantages are high design speed (time-to-market) in the high-level C or C++ languages, fast simulation of the developed blocks thanks to the absence of the concept of a clock, flexible tuning of solutions (in terms of resources and performance) through pragmas and directives, and a low entry threshold for developers who work in high-level languages. The main disadvantage is suboptimal use of FPGA resources compared with the classical approach. It is also not possible to reach the clock speeds provided by the good old RTL methods (VHDL, Verilog, SV). And the biggest drawback is the amount of trial-and-error fiddling required, although that is characteristic of all Xilinx CAD tools. (Note: the Vivado HLS debugger and the real C++ model often produce different results, since Vivado HLS behaves unreliably with its arbitrary-precision types.)

The following picture shows the log of the CORDIC core synthesized in Vivado HLS. It is quite informative and displays a lot of useful data: the amount of resources used, the interface of the core, the loops and their properties, the computation latency, and the initiation interval for the output value (important when designing sequential and parallel circuits):

You can also see how the data flows through the various components (functions). The phase data is read on cycle zero, and the result of the CORDIC block appears on cycles 7 and 8.

The result of Vivado HLS is a synthesized RTL core created from C code. The log shows that, according to the timing analysis, the core meets all constraints:

Another big plus of Vivado HLS is that, to verify the result, it automatically generates a testbench for the synthesized RTL code based on the model that was used to test the C code. The check may be primitive, but I think it is very cool and quite convenient for comparing the behavior of the algorithm in C and in HDL. Below is a screenshot from Vivado showing the simulation of a window function core produced by Vivado HLS:

Thus, similar results were obtained for all window functions regardless of the design method, VHDL or C++. However, the first approach achieves a higher clock frequency and uses fewer resources, while the second achieves the maximum design speed. Both approaches have a right to exist.

I deliberately measured how much time I spent on development with each method. I implemented the C++ project in Vivado HLS about 12 times faster than the VHDL one.

Comparison of approaches

Let us compare the HDL and C++ source code for the CORDIC core. The algorithm, as mentioned earlier, is based on addition, subtraction, and shift operations. In VHDL it looks like this: there are three data vectors. One is responsible for the rotation angle, and the other two hold the projections of the vector onto the X and Y axes, which correspond to cos and sin (see the picture from the wiki):

As the value of Z is computed iteratively, the values of X and Y are computed in parallel. The iterative computation of the output values in HDL:

 constant ROM_LUT : rom_array := (
     x"400000000000", x"25C80A3B3BE6", x"13F670B6BDC7", x"0A2223A83BBB",
     x"05161A861CB1", x"028BAFC2B209", x"0145EC3CB850", x"00A2F8AA23A9",
     x"00517CA68DA2", x"0028BE5D7661", x"00145F300123", x"000A2F982950",
     x"000517CC19C0", x"00028BE60D83", x"000145F306D6", x"0000A2F9836D",
     x"0000517CC1B7", x"000028BE60DC", x"0000145F306E", x"00000A2F9837",
     x"00000517CC1B", x"0000028BE60E", x"00000145F307", x"000000A2F983",
     x"000000517CC2", x"00000028BE61", x"000000145F30", x"0000000A2F98",
     x"0000000517CC", x"000000028BE6", x"0000000145F3", x"00000000A2FA",
     x"00000000517D", x"0000000028BE", x"00000000145F", x"000000000A30",
     x"000000000518", x"00000000028C", x"000000000146", x"0000000000A3",
     x"000000000051", x"000000000029", x"000000000014", x"00000000000A",
     x"000000000005", x"000000000003", x"000000000001", x"000000000000"
 );

 -- NOTE: ROM_TABLE, used in the phase loop below, is not declared in this excerpt.
 pr_crd: process(clk, reset)
 begin
     if (reset = '1') then
         ---- Reset sine / cosine / angle vector ----
         sigX <= (others => (others => '0'));
         sigY <= (others => (others => '0'));
         sigZ <= (others => (others => '0'));
     elsif rising_edge(clk) then
         sigX(0) <= init_x;
         sigY(0) <= init_y;
         sigZ(0) <= init_z;
         ---- calculate sine & cosine ----
         lpXY: for ii in 0 to DATA_WIDTH-2 loop
             if (sigZ(ii)(sigZ(ii)'left) = '1') then
                 sigX(ii+1) <= sigX(ii) + sigY(ii)(DATA_WIDTH+PRECISION-1 downto ii);
                 sigY(ii+1) <= sigY(ii) - sigX(ii)(DATA_WIDTH+PRECISION-1 downto ii);
             else
                 sigX(ii+1) <= sigX(ii) - sigY(ii)(DATA_WIDTH+PRECISION-1 downto ii);
                 sigY(ii+1) <= sigY(ii) + sigX(ii)(DATA_WIDTH+PRECISION-1 downto ii);
             end if;
         end loop;
         ---- calculate phase ----
         lpZ: for ii in 0 to DATA_WIDTH-2 loop
             if (sigZ(ii)(sigZ(ii)'left) = '1') then
                 sigZ(ii+1) <= sigZ(ii) + ROM_TABLE(ii);
             else
                 sigZ(ii+1) <= sigZ(ii) - ROM_TABLE(ii);
             end if;
         end loop;
     end if;
 end process;
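The first entries of the ROM_LUT table are consistent with atan(2^-i) / π scaled to 48 fractional bits: for i = 0 this gives 0.25 * 2^48 = 0x400000000000. Under that assumption (it is an inference from the numbers, not something the article states), a similar table can be generated with the hypothetical helper below. The rounding mode used for the original table is unknown, so the smallest entries may differ by one unit.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical generator for a CORDIC angle table:
// lut[i] = round(atan(2^-i) / pi * 2^frac_bits)
std::vector<std::uint64_t> make_atan_lut(int entries, int frac_bits) {
    assert(entries > 0 && frac_bits > 0 && frac_bits <= 52);
    const double kPi = 3.14159265358979323846;
    std::vector<std::uint64_t> lut(static_cast<std::size_t>(entries));
    for (int i = 0; i < entries; ++i) {
        // Elementary rotation angle of stage i, normalized so that 1.0 = pi.
        const double normalized = std::atan(std::ldexp(1.0, -i)) / kPi;
        lut[static_cast<std::size_t>(i)] = static_cast<std::uint64_t>(
            std::llround(std::ldexp(normalized, frac_bits)));
    }
    return lut;
}
```

For example, `make_atan_lut(48, 48)` produces 48 entries whose leading hexadecimal digits can be compared with the table above. Each entry is roughly half of the previous one, which is why the lower half of the table looks like a repeatedly shifted copy of one constant.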

In C++ for Vivado HLS the code looks almost the same, but it is several times shorter:

 // Unrolled loop //
 int k;
 stg: for (k = 0; k < NWIDTH; k++) {
 #pragma HLS UNROLL
     if (z[k] < 0) {
         x[k+1] = x[k] + (y[k] >> k);
         y[k+1] = y[k] - (x[k] >> k);
         z[k+1] = z[k] + lut_angle[k];
     } else {
         x[k+1] = x[k] - (y[k] >> k);
         y[k+1] = y[k] + (x[k] >> k);
         z[k+1] = z[k] - lut_angle[k];
     }
 }

As you can see, the same loop with shifts and additions is used. However, by default all loops in Vivado HLS are kept “rolled” and executed sequentially, as the C++ language intends. Adding the HLS UNROLL or HLS PIPELINE pragma converts the sequential calculations into parallel ones. This increases the FPGA resources consumed, but it allows the core to accept new values and produce results on every clock cycle.

The synthesis results for the VHDL and C++ projects are presented in the figure below. As can be seen, in terms of logic the difference is a factor of two in favor of the traditional approach. For the other FPGA resources the difference is insignificant. I did not go deep into optimizing the C++ project, but by setting different directives or partially changing the code, the amount of resources used can definitely be reduced. In both cases, timing closed at the target core frequency of ~350 MHz.

Features of the implementation

Since the calculations are performed in fixed-point format, the window functions have a number of features that must be considered when designing DSP systems on an FPGA. For example, the greater the bit width of the window function, the more accurately the window is applied. On the other hand, if the bit width of the window function is insufficient, distortions will be introduced into the resulting waveform, which will degrade the spectral characteristics. For example, a window function must be at least 20 bits wide when it is multiplied by a signal of 2^20 = 1M samples.
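One way to see why a long window needs wide coefficients is to count how many distinct values the quantized window actually takes. The sketch below does this for a Hann window. It is only an illustration under assumptions: the function name `distinct_hann_levels` is hypothetical, the coefficients are rounded to unsigned integers of the given width, and the result is not a proof of the 20-bit rule of thumb above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <set>

// Counts the distinct integer levels of a Hann window of `length` samples
// after rounding each coefficient to an unsigned `bits`-bit value.
std::size_t distinct_hann_levels(std::size_t length, int bits) {
    assert(length > 0 && bits >= 1 && bits <= 31);
    const double kTwoPi = 6.28318530717958647692;
    const double scale = std::ldexp(1.0, bits) - 1.0;  // largest code value
    std::set<long long> levels;
    for (std::size_t n = 0; n < length; ++n) {
        const double w = 0.5 - 0.5 * std::cos(kTwoPi * static_cast<double>(n) /
                                              static_cast<double>(length));
        levels.insert(std::llround(w * scale));
    }
    return levels.size();
}
```

With 8-bit coefficients a 65536-sample window collapses to only 256 levels, so it turns into a staircase with long flat runs. Increasing the width to 20 bits gives many more distinct levels for the same length.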

Conclusion

This article shows one way to design window functions without using external memory or FPGA block memory. The method uses only FPGA logic resources (and in some cases DSP blocks). Using the CORDIC algorithm, it is possible to obtain window functions of any bit depth (within reasonable limits), of any length and order, and therefore to obtain practically any set of window spectral characteristics.

In one of my projects, I managed to obtain stable 5- and 7-term Blackman-Harris window functions of 1M samples at a frequency of ~375 MHz, and also to build a CORDIC-based twiddle factor generator for an FFT at a frequency of ~400 MHz. The FPGA used was a Kintex UltraScale+ (xcku11p-ffva1156-2-e).

The link to the GitHub project is here. The project contains a mathematical model in MATLAB, the source code of the window functions and CORDIC in VHDL, as well as models of the listed window functions in C++ for Vivado HLS.

Useful articles

I also recommend a very popular book on DSP: E. Ifeachor, B. Jervis. Digital Signal Processing: A Practical Approach.

Thank you for your attention!

Source: https://habr.com/ru/post/427361/
