
Achieving Maximum Performance of the Fast Fourier Transform Based on Data Management

Today, the performance problem of digital signal processing applications is solved with specialized hardware. At the same time, such hardware also requires programs optimized for it.

There are several approaches to fitting algorithms to the machine architecture; one of them is data management. Anyone who has tried data-driven programming by hand knows that it is not a simple matter. In most cases, however, a pre-compiler can be designed that greatly simplifies the task. This article describes how to construct two data-driven FFT algorithms and how to achieve maximum performance that exceeds the theoretical estimate.

A bit of math


N time samples of any signal can be converted into N phase-frequency values using the direct Discrete Fourier Transform.

$$ X_k = \sum_{n=0}^{N-1} x_n \, \omega_N^{kn}, \qquad k = 0, \ldots, N-1, \qquad \omega_N = e^{-2\pi i / N}, \qquad i = \sqrt{-1} $$
In order to obtain the complex phase-frequency spectrum {X}, the time samples {x} must first be converted into complex form with zero imaginary part. The twiddle factor ω (the "rotating multiplier") is a vector rotating on the unit circle in the complex plane.
To simplify the calculations, the Fast Fourier Transform is used, where N is a power of 2. In this case the complex signal $x = (x_0, x_1, \ldots, x_{N-1})$ can be split into two signals $A = (x_0, x_2, \ldots, x_{N-2})$ and $B = (x_1, x_3, \ldots, x_{N-1})$ of N/2 samples each.

$$ A_k = \sum_{n=0}^{N/2-1} x_{2n} \, \omega_N^{k(2n)} $$
$$ B_k = \sum_{n=0}^{N/2-1} x_{2n+1} \, \omega_N^{k(2n+1)} = \sum_{n=0}^{N/2-1} x_{2n+1} \, \omega_N^{k} \, \omega_N^{k(2n)} = \omega_N^{k} \sum_{n=0}^{N/2-1} x_{2n+1} \, \omega_N^{k(2n)} $$

Now we can derive the formula we need: $X_k = A_k + \omega_N^k \, B_k$.

Thus, the N phase-frequency values of our signal can be obtained by applying the FFT separately to the N/2 even samples and to the N/2 odd samples. At the next step, A and B are again split in half. It is worth noting that the same approach applies to the Inverse Fourier Transform, which is not considered here.
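For the second half of the spectrum, a standard step that the article leaves implicit, the periodicity $A_{k+N/2} = A_k$, $B_{k+N/2} = B_k$ together with the identity $\omega_N^{k+N/2} = -\omega_N^k$ gives

$$ X_{k+N/2} = A_k - \omega_N^k \, B_k, \qquad k = 0, \ldots, N/2-1 $$

which is why the implementations below fill the output array in two halves from the same transformed A and B.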

C++ FFT


The recursive implementation [1] of the FFT according to the above formulas is straightforward and is given below:

#include <complex>
#include <vector>
#include <algorithm>
#include <iostream>
#include <stdlib.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

using namespace std;

typedef complex<double> w_type;

static vector<w_type> fft(const vector<w_type> &In)
{
    int i = 0, wi = 0;
    int n = In.size();
    vector<w_type> A(n / 2), B(n / 2), Out(n);

    if (n == 1) {
        return vector<w_type>(1, In[0]);
    }

    /* split the input into even (A) and odd (B) samples;
       the parity trick relies on n being even (a power of two) */
    i = 0;
    copy_if(In.begin(), In.end(), A.begin(),
            [&i](w_type e) { return !(i++ % 2); });
    copy_if(In.begin(), In.end(), B.begin(),
            [&i](w_type e) { return (i++ % 2); });

    vector<w_type> At = fft(A);
    vector<w_type> Bt = fft(B);

    /* combine the half-size transforms: Out[k] = At[k] + w^k * Bt[k],
       the second call continues wi past n/2 and fills the upper half */
    transform(At.begin(), At.end(), Bt.begin(), Out.begin(),
              [&wi, &n](w_type &a, w_type &b) {
                  return a + b * exp(w_type(0, 2 * M_PI * wi++ / n));
              });
    transform(At.begin(), At.end(), Bt.begin(), Out.begin() + n / 2,
              [&wi, &n](w_type &a, w_type &b) {
                  return a + b * exp(w_type(0, 2 * M_PI * wi++ / n));
              });
    return Out;
}

int main(int argc, char *argv[])
{
    int ln = (int)floor(log(argc - 1.0) / log(2.0));
    vector<w_type> In(1 << ln);

    /* only the first 1 << ln arguments are used */
    std::transform(argv + 1, argv + 1 + (1 << ln), In.begin(),
                   [&](const char *arg) { return w_type(atof(arg), 0); });

    vector<w_type> Out = fft(In);
    for (vector<w_type>::iterator itr = Out.begin(); itr != Out.end(); itr++) {
        cout << *itr << endl;
    }
    return 0;
}

The real parts of the complex values of the input vector are filled from the command-line parameters, whose number is truncated down to the nearest power of two. The first level of recursion splits the signal In of n values into two signals A and B of n/2 values each. The next level splits the data into 4 signals of n/4 values. The recursion stops when only a one-value signal remains. Signals A and B are transformed into At and Bt, and then into the returned vector Out. Finally, the program prints the real and imaginary parts of the phase-frequency signal from the output vector.
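A quick way to sanity-check the program (the file and binary names here are assumed, not from the article) is the transform of a unit impulse, whose spectrum is flat; note that the fifth argument is discarded by the power-of-two truncation:

    g++ -std=c++11 fft.cpp -o fft_cpp
    ./fft_cpp 1 0 0 0 0
    (1,0)
    (1,0)
    (1,0)
    (1,0)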

Levels of abstraction


The higher the level of abstraction of the application language, the less effort programming takes, but the lower the performance. You can write more efficient code by lowering the level of abstraction, at the cost of more effort. Let us examine this using the elementary FFT operation, which in complex form can be written as c = a + w * b. This formula is usually implemented through multiplication with accumulation, hence the abbreviation MAC (multiply-accumulate operation):
Prototype:

    void w_mac( w_type* cc, w_type a, w_type w, w_type b );

C++:

    *cc = a + b * exp(w_type(0, 2 * M_PI * wi / n));

Classic C:

    cc->r = a.r + w.r * b.r - w.i * b.i;
    cc->i = a.i + w.r * b.i + w.i * b.r;

Optimized C:

    register double reg;
    reg   = mac(a.r, w.r, b.r);
    cc->r = mac(reg, -w.i, b.i);
    reg   = mac(a.i, w.r, b.i);
    cc->i = mac(reg, w.i, b.r);
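The mac helper used in the Optimized C variant is not defined in the article; a minimal sketch of one plausible definition (assuming acc + x * y semantics, matching the Classic C lines above) is:

    /* Assumed semantics: multiply-accumulate, acc + x * y. On hardware with an
       FMA unit this can map to fma() from <math.h> or to a single instruction. */
    static inline double mac( double acc, double x, double y )
    {
        return acc + x * y;
    }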
The base-2 logarithm of the signal size determines the recursion depth. Several implementation styles of the n2ln function are listed below.
Prototype:

    int n2ln( int n );

C++:

    return (int)floor( log(n) / log(2.0) );

Classic C:

    int ln = 0;
    while (n >>= 1) ln++;
    return ln;

Built-in:

    return n/256 ? n/4096 ? n/16384 ? (n/32768 ? 15 : 14)
                                    : (n/8192  ? 13 : 12)
                          : n/1024  ? (n/2048  ? 11 : 10)
                                    : (n/512   ?  9 :  8)
                 : n/16   ? n/64    ? (n/128   ?  7 :  6)
                                    : (n/32    ?  5 :  4)
                          : n/4     ? (n/8     ?  3 :  2)
                                    : (n/2     ?  1 :  0);
In the last implementation, the worst-case execution time equals the average case, and the C compiler can implement it with 3 shifts and 4 conditional jumps. In addition, this implementation can be written as a macro so that constant values are computed at compile time.
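The macro form is not spelled out in the article; one possible sketch (N2LN is an assumed name, reusing the same nested conditional expression) is:

    /* Assumed macro variant of n2ln: for a constant power-of-two argument the
       whole expression folds to a constant at compile time. */
    #define N2LN(n) ((n)/256 ? (n)/4096 ? (n)/16384 ? ((n)/32768 ? 15 : 14)  \
                                                    : ((n)/8192  ? 13 : 12)  \
                                        : (n)/1024  ? ((n)/2048  ? 11 : 10)  \
                                                    : ((n)/512   ?  9 :  8)  \
                             : (n)/16   ? (n)/64    ? ((n)/128   ?  7 :  6)  \
                                                    : ((n)/32    ?  5 :  4)  \
                                        : (n)/4     ? ((n)/8     ?  3 :  2)  \
                                                    : ((n)/2     ?  1 :  0))

For a constant argument such as N2LN(4096) the compiler reduces the whole expression to 12 at build time.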

FFT in C


The classic recursive FFT implementation in C is similar to the C++ variant.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct { double r; double i; } w_type;

/* defined as in the previous section */
int n2ln( int n );
void w_mac( w_type* cc, w_type a, w_type w, w_type b );

static void fft0( w_type InOut[], int n )
{
    int i;
    w_type w, *A, *B;

    if (n == 1) return;

    A = malloc( sizeof(w_type) * n / 2 );
    B = malloc( sizeof(w_type) * n / 2 );

    /* split into even (A) and odd (B) samples */
    for (i = 0; i < n / 2; i++) {
        A[i] = InOut[i * 2];
        B[i] = InOut[i * 2 + 1];
    }

    fft0( A, n / 2 );
    fft0( B, n / 2 );

    /* combine: InOut[i] = A[i mod n/2] + w^i * B[i mod n/2] */
    for (i = 0; i < n; i++) {
        w.r = cos(2 * M_PI * i / n);
        w.i = sin(2 * M_PI * i / n);
        w_mac( &InOut[i], A[i % (n / 2)], w, B[i % (n / 2)] );
    }

    free( A );
    free( B );
}

int main( int argc, char * argv[] )
{
    int i;
    int ln = n2ln(argc - 1);
    w_type* InOut = malloc( (1 << ln) * sizeof(w_type) );

    for (i = 0; i < (1 << ln); i++) {
        InOut[i].r = atof(argv[i + 1]);
        InOut[i].i = 0;
    }

    fft0( InOut, 1 << ln );

    for (i = 0; i < (1 << ln); i++) {
        printf("%.4f %.4f\n", InOut[i].r, InOut[i].i);
    }
    free( InOut );
    return 0;
}

Compared with the C++ implementation, several algorithmic changes have been made here. First, the complex arithmetic is done manually. Second, to save memory, the input and output vectors of the C++ version are combined into a single buffer. Buffers are allocated explicitly because the transform size has to be passed to the recursive function.

C FFT implementation based on data management


Below is an example of a recursive data-driven FFT derived from the previous C implementation.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define LN_FFT 4 /* number of samples is 1 << LN_FFT */

typedef struct { double r; double i; } w_type;

/* defined as in the previous sections */
int n2ln( int n );
void w_mac( w_type* cc, w_type a, w_type w, w_type b );

/* recorded "bytecode": one deferred MAC per entry */
static struct tMac { w_type *c, *a, *b, w; } Mac[LN_FFT * (1 << LN_FFT)];
static int nMac;

/* intermediate results, one row per recursion level */
static w_type DataTrace[LN_FFT + 1][1 << LN_FFT];
static int BusyDataTrace[LN_FFT + 1];

static void calculate_macs()
{
    int i;
    for (i = 0; i < nMac; i++) {
        w_mac(Mac[i].c, *Mac[i].a, Mac[i].w, *Mac[i].b);
    }
}

static void record_mac( w_type** cc, w_type* a, w_type w, w_type *b, int n )
{
    int ln = n2ln(n);
    int i = BusyDataTrace[ln]++;

    *cc = &DataTrace[ln][i];
    Mac[nMac].c = &DataTrace[ln][i];
    Mac[nMac].w = w;
    Mac[nMac].a = a;
    Mac[nMac].b = b;
    nMac++;
}

static void fft0( w_type* InOut[], int n )
{
    int i;
    w_type w, **A, **B;

    if (n == 1) return;

    A = malloc( sizeof(w_type*) * n / 2 );
    B = malloc( sizeof(w_type*) * n / 2 );

    for (i = 0; i < n / 2; i++) {
        A[i] = InOut[i * 2];
        B[i] = InOut[i * 2 + 1];
    }

    fft0( &A[0], n / 2 );
    fft0( &B[0], n / 2 );

    for (i = 0; i < n; i++) {
        w.r = cos(2 * M_PI * i / n);
        w.i = sin(2 * M_PI * i / n);
        record_mac( &InOut[i], A[i % (n / 2)], w, B[i % (n / 2)], n );
    }

    free(A);
    free(B);
}

int main( int argc, char* argv[] )
{
    int i;
    w_type** InOut = malloc( sizeof(w_type*) * (1 << LN_FFT) );

    for (i = 0; i < (1 << LN_FFT); i++) {
        InOut[i] = &DataTrace[0][i];
        DataTrace[0][i].r = atof( argv[i % (argc - 1) + 1] );
        DataTrace[0][i].i = 0;
    }

    fft0( InOut, 1 << LN_FFT );
    calculate_macs();

    for (i = 0; i < (1 << LN_FFT); i++) {
        printf("%.4f %.4f\n", DataTrace[LN_FFT][i].r, DataTrace[LN_FFT][i].i);
    }
    free(InOut);
    return 0;
}

Here the computation buffer contains not the complex values themselves but pointers to them. The Mac array sequentially records deferred elementary FFT operations so that they can be executed later. In other words, it is a bytecode of elementary FFT operations.

A two-dimensional DataTrace array supports these operations. After calling the recursive procedure, the user must call calculate_macs to execute the generated bytecode. This procedure contains only a single loop, and it can be invoked repeatedly. This is the maximum performance for the theoretical FFT complexity of N * log2(N). The problem here is memory: Mac and DataTrace also hold N * log2(N) elements, and that is too much for low-cost embedded solutions.
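A minimal usage sketch of this property (load_next_frame is a hypothetical input routine, not part of the article's code): the bytecode is recorded once and then replayed for every new block of samples.

    fft0( InOut, 1 << LN_FFT );                 /* record the bytecode once */
    while (load_next_frame( &DataTrace[0][0], 1 << LN_FFT )) {
        calculate_macs();                       /* replay: one linear loop  */
        /* ... consume DataTrace[LN_FFT][0 .. (1 << LN_FFT) - 1] here ... */
    }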

Tabular implementation


Now it is time to split the previous FFT implementation into two programs. The first one generates the C bytecode structures, and the second executes them. With this approach the generator of C structures is in fact a pre-compiler in which various optimization strategies can be implemented without sparing resources, for example RAM optimization. Earlier, DataTrace was an array of N * log2(N) elements; in the program below its analog, the XY array, has only N + 2 elements in total.
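The generator itself is not shown in the article. A minimal sketch of its output stage (emit_mac is a hypothetical name, and it assumes a record_mac-style collector that has already remapped c, a and b to offsets inside the smaller XY array) would simply print each deferred MAC as an initializer for the fft_tbl table used below:

    #include <stdio.h>

    /* Hypothetical output stage of the pre-compiler: one line per recorded MAC,
       in the { wr, wi, c, a, b } format of the fft_tbl entries below. */
    static void emit_mac( double wr, double wi, int c, int a, int b )
    {
        printf( "{ %+.15f,%+.15f,%2d,%2d,%2d },\n", wr, wi, c, a, b );
    }

The executor generated this way for 16 points is shown below.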

#include <stdlib.h>
#include <stdio.h>

#define LN_FFT 4 /* number of fft points is 1 << LN_FFT */

#define W_0_02 1.000000000000000 /* angle  0.00 dg */
#define W_1_04 0.000000000000000 /* angle 90.00 dg */
#define W_1_08 0.707106781186548 /* angle 45.00 dg */
#define W_1_16 0.923879532511287 /* angle 22.50 dg */
#define W_3_16 0.382683432365090 /* angle 67.50 dg */

typedef struct { double r; double i; } w_type;

/* one entry per deferred MAC: twiddle factor and three offsets into XY */
static const struct fft_tbl { double wr, wi; unsigned char c, a, b; } tbl[] = {
    { W_0_02,+W_1_04,16, 0, 8}, {-W_0_02,+W_1_04,17, 0, 8},
    { W_0_02,+W_1_04, 0, 4,12}, {-W_0_02,+W_1_04, 8, 4,12},
    { W_0_02,+W_1_04, 4, 2,10}, {-W_0_02,+W_1_04,12, 2,10},
    { W_0_02,+W_1_04, 2, 6,14}, {-W_0_02,+W_1_04,10, 6,14},
    { W_0_02,+W_1_04, 6, 1, 9}, {-W_0_02,+W_1_04,14, 1, 9},
    { W_0_02,+W_1_04, 1, 5,13}, {-W_0_02,+W_1_04, 9, 5,13},
    { W_0_02,+W_1_04, 5, 3,11}, {-W_0_02,+W_1_04,13, 3,11},
    { W_0_02,+W_1_04, 3, 7,15}, {-W_0_02,+W_1_04,11, 7,15},
    { W_0_02,+W_1_04, 7,16, 0}, {-W_0_02,+W_1_04,15,16, 0},
    { W_1_04,+W_0_02,16,17, 8}, {-W_1_04,-W_0_02, 0,17, 8},
    { W_0_02,+W_1_04,17, 4, 2}, {-W_0_02,+W_1_04, 8, 4, 2},
    { W_1_04,+W_0_02, 4,12,10}, {-W_1_04,-W_0_02, 2,12,10},
    { W_0_02,+W_1_04,12, 6, 1}, {-W_0_02,+W_1_04,10, 6, 1},
    { W_1_04,+W_0_02, 6,14, 9}, {-W_1_04,-W_0_02, 1,14, 9},
    { W_0_02,+W_1_04,14, 5, 3}, {-W_0_02,+W_1_04, 9, 5, 3},
    { W_1_04,+W_0_02, 5,13,11}, {-W_1_04,-W_0_02, 3,13,11},
    { W_0_02,+W_1_04,13, 7,17}, {-W_0_02,+W_1_04,11, 7,17},
    { W_1_08,+W_1_08, 7,16, 4}, {-W_1_08,-W_1_08,17,16, 4},
    { W_1_04,+W_0_02,16,15, 8}, {-W_1_04,-W_0_02, 4,15, 8},
    {-W_1_08,+W_1_08,15, 0, 2}, { W_1_08,-W_1_08, 8, 0, 2},
    { W_0_02,+W_1_04, 0,12,14}, {-W_0_02,+W_1_04, 2,12,14},
    { W_1_08,+W_1_08,12, 6, 5}, {-W_1_08,-W_1_08,14, 6, 5},
    { W_1_04,+W_0_02, 6,10, 9}, {-W_1_04,-W_0_02, 5,10, 9},
    {-W_1_08,+W_1_08,10, 1, 3}, { W_1_08,-W_1_08, 9, 1, 3},
    { W_0_02,+W_1_04, 1,13, 0}, {-W_0_02,+W_1_04, 3,13, 0},
    { W_1_16,+W_3_16,13, 7,12}, {-W_1_16,-W_3_16, 0, 7,12},
    { W_1_08,+W_1_08, 7,16, 6}, {-W_1_08,-W_1_08,12,16, 6},
    { W_3_16,+W_1_16,16,15,10}, {-W_3_16,-W_1_16, 6,15,10},
    { W_1_04,+W_0_02,15,11, 2}, {-W_1_04,-W_0_02,10,11, 2},
    {-W_3_16,+W_1_16,11,17,14}, { W_3_16,-W_1_16, 2,17,14},
    {-W_1_08,+W_1_08,17, 4, 5}, { W_1_08,-W_1_08,14, 4, 5},
    {-W_1_16,+W_3_16, 4, 8, 9}, { W_1_16,-W_3_16, 5, 8, 9},
};

/* XY cell holding the k-th output value */
static const unsigned char OutOrder[] = {
    1,13,7,16,15,11,17,4,3,0,12,6,10,2,14,5,
};

static w_type XY[(1 << LN_FFT) + 2];

void fft_table()
{
    int i;
    register const struct fft_tbl* t;

    for (i = 0, t = tbl; i < sizeof(tbl) / sizeof(tbl[0]); i++, t++) {
        XY[t->c].r = XY[t->a].r + t->wr * XY[t->b].r - t->wi * XY[t->b].i;
        XY[t->c].i = XY[t->a].i + t->wr * XY[t->b].i + t->wi * XY[t->b].r;
    }
}

int main(int argc, char* argv[])
{
    int i;

    for (i = 0; i < (1 << LN_FFT); i++) {
        XY[i].r = atof( argv[i % (argc - 1) + 1] );
        XY[i].i = 0;
    }

    fft_table();

    for (i = 0; i < (1 << LN_FFT); i++) {
        printf("%.4f %.4f\n", XY[OutOrder[i]].r, XY[OutOrder[i]].i);
    }
    return 0;
}

This is an example of an FFT for 16 samples. Each element of the tbl array contains the complex value of the twiddle factor and three offsets that organize the calculations over the cells of the XY array. The code itself has only one "for" loop.
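To illustrate the encoding (this reading follows directly from fft_table above), the first pair of table entries expands to:

    XY[16] = XY[0] + (1 + 0i) * XY[8];   /* { W_0_02,+W_1_04,16, 0, 8} */
    XY[17] = XY[0] - (1 + 0i) * XY[8];   /* {-W_0_02,+W_1_04,17, 0, 8} */

that is, the first butterfly of the first stage, with cells 16 and 17 (the two cells beyond N) serving as scratch space.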

Caterpillar method


The following example is based on grouping the elementary FFT operations by their twiddle factors.

#include <stdlib.h>
#include <stdio.h>

#define LN_FFT 5 /* number of fft points is 1 << LN_FFT */

#define W_0_02 1.000000000000000 /* angle  0.00 dg */
#define W_0_04 0.000000000000000 /* angle 90.00 dg */
#define W_0_08 0.707106781186547 /* angle 45.00 dg */
#define W_0_16 0.923879532511287 /* angle 22.50 dg */
#define W_1_16 0.382683432365090 /* angle 67.50 dg */
#define W_0_32 0.980785280403230 /* angle 11.25 dg */
#define W_1_32 0.831469612302545 /* angle 33.75 dg */
#define W_2_32 0.555570233019602 /* angle 56.25 dg */
#define W_3_32 0.195090322016128 /* angle 78.75 dg */

typedef struct { double r; double i; } w_type;

#define X2Y(a) (a + (1 << (LN_FFT - 1)))
#define XMAC(c, a, wr, wi) \
    c->r = a->r + wr * X2Y(a)->r - wi * X2Y(a)->i; \
    c->i = a->i + wr * X2Y(a)->i + wi * X2Y(a)->r;

static w_type XY[2][(1 << LN_FFT)];
static const w_type* pOut = LN_FFT % 2 ? &XY[1][0] : &XY[0][0];
/* output reordering */
static const unsigned char OutOrder[] = {
    31,15,23,14,27,13,22,12,29,11,21,10,26, 9,20, 8,
    30, 7,19, 6,25, 5,18, 4,28, 3,17, 2,24, 1,16, 0,
};

void fft_caterpillar()
{
    int i, j, lim;
    register w_type *pc, *pa; /* pb = pa + (1 << (LN_FFT - 1)) */

    for (i = 1; i <= LN_FFT; i++) {
        pc = i % 2 ? &XY[1][0] : &XY[0][0];
        pa = i % 2 ? &XY[0][0] : &XY[1][0];
        lim = 1 << (LN_FFT - i);
        for (j = 0; j < lim; j++) {
            /* deliberate fall-through: the lower cases complete each stage */
            switch (i) {
            case 5:
                XMAC(pc, pa,  W_0_32, -W_3_32); pc++; pa += 1;
                XMAC(pc, pa,  W_1_32, -W_2_32); pc++; pa += 1;
                XMAC(pc, pa,  W_2_32, -W_1_32); pc++; pa += 1;
                XMAC(pc, pa,  W_3_32, -W_0_32); pc++; pa += 1;
                XMAC(pc, pa, -W_3_32, -W_0_32); pc++; pa += 1;
                XMAC(pc, pa, -W_2_32, -W_1_32); pc++; pa += 1;
                XMAC(pc, pa, -W_1_32, -W_2_32); pc++; pa += 1;
                XMAC(pc, pa, -W_0_32, -W_3_32); pc++; pa += 1;
                pa -= 8;
                XMAC(pc, pa, -W_0_32, +W_3_32); pc++; pa += 1;
                XMAC(pc, pa, -W_1_32, +W_2_32); pc++; pa += 1;
                XMAC(pc, pa, -W_2_32, +W_1_32); pc++; pa += 1;
                XMAC(pc, pa, -W_3_32, +W_0_32); pc++; pa += 1;
                XMAC(pc, pa,  W_3_32, +W_0_32); pc++; pa += 1;
                XMAC(pc, pa,  W_2_32, +W_1_32); pc++; pa += 1;
                XMAC(pc, pa,  W_1_32, +W_2_32); pc++; pa += 1;
                XMAC(pc, pa,  W_0_32, +W_3_32); pc++; pa += 1;
            case 4:
                XMAC(pc, pa,  W_0_16, -W_1_16); pc++; pa += 1;
                XMAC(pc, pa,  W_1_16, -W_0_16); pc++; pa += 1;
                XMAC(pc, pa, -W_1_16, -W_0_16); pc++; pa += 1;
                XMAC(pc, pa, -W_0_16, -W_1_16); pc++; pa += 1;
                pa -= 4;
                XMAC(pc, pa, -W_0_16, +W_1_16); pc++; pa += 1;
                XMAC(pc, pa, -W_1_16, +W_0_16); pc++; pa += 1;
                XMAC(pc, pa,  W_1_16, +W_0_16); pc++; pa += 1;
                XMAC(pc, pa,  W_0_16, +W_1_16); pc++; pa += 1;
            case 3:
                XMAC(pc, pa,  W_0_08, -W_0_08); pc++; pa += 1;
                XMAC(pc, pa, -W_0_08, -W_0_08); pc++; pa += 1;
                pa -= 2;
                XMAC(pc, pa, -W_0_08, +W_0_08); pc++; pa += 1;
                XMAC(pc, pa,  W_0_08, +W_0_08); pc++; pa += 1;
            case 2:
                XMAC(pc, pa, -W_0_04, -W_0_02); pc++; pa += 1;
                pa -= 1;
                XMAC(pc, pa,  W_0_04, +W_0_02); pc++; pa += 1;
            case 1:
                XMAC(pc, pa, -W_0_02, +W_0_04); pc++; pa += 1;
                pa -= 1;
            case 0:
                XMAC(pc, pa,  W_0_02, +W_0_04); pc++; pa += 1;
            }
        }
    }
}

int main(int argc, char* argv[])
{
    int i;

    for (i = 0; i < (1 << LN_FFT); i++) {
        XY[0][i].r = atof( argv[i % (argc - 1) + 1] );
        XY[0][i].i = 0;
    }

    fft_caterpillar();

    for (i = 0; i < (1 << LN_FFT); i++) {
        printf("%.4f %.4f\n", pOut[OutOrder[i]].r, pOut[OutOrder[i]].i);
    }
    return 0;
}

This is an example of an FFT for 32 samples. The number of elementary FFT operations per stage is N. The XY array is organized as a swap (ping-pong) buffer of size 2 * N. This allows the notional XMAC instruction to work with 3 memory banks simultaneously, although in practice compilers ignore this. Nevertheless, XMAC can theoretically be implemented as a single machine instruction.

The best-known FFT algorithm uses intricate address-permutation logic whose data-flow graph resembles butterfly wings. In this example, however, new addresses are obtained from old ones by simple increments, hence the simple name: the caterpillar method. It is worth noting that the long switch statement also resembles a caterpillar.

Further optimizations


The elementary complex FFT operation (c = a + ω * b) consists of 4 ordinary operations:

reg = a.real + w.real * b.real;
c.real = reg - w.imag * b.imag;
reg = a.imag + w.real * b.imag;
c.imag = reg + w.imag * b.real;

In the FFT literature [2], several optimization strategies for these calculations can be found.


Almost all of these FFT optimizations are applicable to the caterpillar method.

Architectural and machine-level optimizations are also possible. For example, the elementary FFT operation in the code above is implemented as the XMAC macro. It can be parallelized using SIMD instructions of x86 Intel AVX processors:

/* vec1 and vec2 are assumed to be __m128d locals, and neg is assumed to be
   _mm_setr_pd(1.0, -1.0), which negates the wi * b.i term; the intrinsics
   require <pmmintrin.h> (SSE3) or <immintrin.h>. */
#define XMAC(c, a, wr, wi) \
    vec1 = _mm_setr_pd(wr, wi); \
    vec2 = *(__m128d*)&X2Y(a)->r; \
    vec1 = _mm_hadd_pd(_mm_mul_pd(vec1, _mm_mul_pd(vec2, neg)), \
                       _mm_mul_pd(vec1, _mm_shuffle_pd(vec2, vec2, 1))); \
    *(__m128d*)c = _mm_add_pd(*(__m128d*)a, vec1);

The macro above performs two floating-point operations at once using 128-bit registers. But it is worth looking at the caterpillar again: because of its addressing pattern, neighboring XMACs can be combined, for example for an implementation using 512-bit registers (AVX-512).
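As an illustration of that idea, here is a sketch (not the article's code) that fuses two neighboring XMACs with consecutive pc and pa addresses into one 256-bit operation using AVX and FMA intrinsics; XMAC2 and its local names are assumptions, and the same approach extends to four fused operations with AVX-512 registers:

    #include <immintrin.h>

    /* Hypothetical fusion of two adjacent XMACs (consecutive pc/pa, twiddles
       wr0 + i*wi0 and wr1 + i*wi1); requires AVX and FMA support. */
    #define XMAC2(c, a, wr0, wi0, wr1, wi1) do {                                \
        __m256d va  = _mm256_loadu_pd(&(a)->r);       /* a0.r a0.i a1.r a1.i */ \
        __m256d vb  = _mm256_loadu_pd(&X2Y(a)->r);    /* b0.r b0.i b1.r b1.i */ \
        __m256d vwr = _mm256_setr_pd(wr0, wr0, wr1, wr1);                       \
        __m256d vwi = _mm256_setr_pd(wi0, wi0, wi1, wi1);                       \
        __m256d sw  = _mm256_permute_pd(vb, 0x5);     /* b0.i b0.r b1.i b1.r */ \
        __m256d res = _mm256_addsub_pd(_mm256_fmadd_pd(vwr, vb, va),            \
                                       _mm256_mul_pd(vwi, sw));                 \
        _mm256_storeu_pd(&(c)->r, res);                                         \
    } while (0)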

In conclusion, it should be said that there are far more directions in which the pre-compiler could be developed than there are resources to pursue them. The purpose of this article is therefore to collect additional requirements and to identify areas where this approach may be useful.

Sources


[1] S. Haynes, Notes on Recursive FFT (Fast Fourier Transform) algorithm, COSC 511, Fall 2010
[2] C. S. Burrus, Notes on the FFT, Department of Electrical and Computer Engineering, Rice University, Houston, TX 77251-1892
[3] Caterpillar Implementation Based on Generated Code
[4] Some fft_caterpillar examples, https://github.com/r35382/fft_caterpillar

Source: https://habr.com/ru/post/345650/

