The purpose of speech signal pre-processing is to obtain a set of spectral vectors that characterize the signal and are used for subsequent recognition.
The fundamental assumption made in modern recognizers is that the speech signal can be regarded as stationary (that is, its spectral characteristics are relatively constant) over an interval of several tens of milliseconds. Therefore, the main function of pre-processing is to divide the input speech signal into intervals and to obtain smoothed spectral estimates for each interval.
The typical length of a single interval is 25.6 ms. Each interval is shifted relative to the previous one by 10 ms, so neighboring intervals overlap by 15.6 ms. Pre-processing each of these intervals yields a vector of several tens of spectral values.
The block diagram of the speech pre-processing algorithm is shown in Fig.1.
The steps required to pre-process each interval of the speech signal are described in detail below.
As an example, we consider speech sampled at a frequency of 16 kHz with a resolution of 16 bits. The sampled speech signal is divided into intervals of 25.6 ms duration, that is, 409 samples (25.6 ms × 16 kHz = 409.6, truncated to 409). Consecutive intervals are shifted by 10 ms (160 samples) and therefore overlap.
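The framing arithmetic above can be checked with a short sketch. It is written in Python purely for illustration (the simulation in this article uses MATLAB), and the function name `frame_signal` and its parameters are hypothetical, not taken from the original implementation. Dropping the incomplete block at the end of the signal is also an assumption.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25.6, shift_ms=10.0):
    """Split a sequence of samples into overlapping blocks of equal length."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 409.6 truncated to 409
    shift = int(sample_rate * shift_ms / 1000)      # 160 samples
    frames = []
    start = 0
    # Assumption: a trailing block that would run past the end is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of placeholder samples at 16 kHz.
signal = [float(n) for n in range(16000)]
frames = frame_signal(signal)
```

For one second of signal this gives 98 complete blocks of 409 samples, and each block starts 160 samples after the previous one.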
Fig. 1. Block diagram of the speech signal pre-processing algorithm
Further stages of pre-processing of speech signals.
- The digitized speech signal (sampled in time and quantized in level) is divided into blocks of 25.6 ms with a shift of 10 ms, that is, blocks of 409 samples taken every 160 samples.
- As a rule, high-frequency pre-emphasis is applied to compensate for the attenuation caused by radiation from the lips. For this, each signal block is passed through a first-order filter:
S(1) = 0; S(n) = y(n) - y(n-1), n = 2, ..., 409,
where y(n) is the n-th sample in the block.
- For processing of this type, a window function is applied to each block. In this case, the Hamming window is used according to the expression
D(n) = (0.54 - 0.46 · cos(2π · (n-1) / 408)) · S(n), n = 1, ..., 409.
- Spectral estimates are obtained using the discrete Fourier transform. The block is extended to 512 elements by appending the necessary number of zeros on the right. A 512-point fast Fourier transform then gives 512 complex spectral values. Since the 512 input values are real, the resulting complex spectral values are pairwise conjugate: the second value with the 512th, the third with the 511th, and so on. Therefore, the last 256 complex values of the transform are ignored, because they are complex conjugates of earlier values and carry no new information.
- For the first 256 complex spectral values, we find their amplitudes. The Fourier amplitude spectrum is smoothed (averaged) by summing the amplitudes of the spectral coefficients within “triangular” frequency bands located on a nonlinear (logarithmic-like) Mel scale. For speech sampled at 16 kHz, 24 such frequency bands are taken.
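The steps from pre-emphasis through the amplitude spectrum can be sketched as follows. This is a minimal Python illustration using only the standard library, not the MATLAB code used for the simulation. The function names are hypothetical, and the recursive radix-2 FFT stands in for whatever FFT routine the original uses. Indices are zero-based here, so `n` in the code corresponds to `n-1` in the formulas.

```python
import cmath
import math


def pre_emphasis(block):
    """First-order filter: S(1) = 0, S(n) = y(n) - y(n-1) for later samples."""
    return [0.0] + [block[n] - block[n - 1] for n in range(1, len(block))]


def hamming(block):
    """Multiply each sample by 0.54 - 0.46*cos(2*pi*n/(N-1)), n = 0..N-1."""
    last = len(block) - 1  # 408 for a 409-sample block
    return [(0.54 - 0.46 * math.cos(2 * math.pi * n / last)) * s
            for n, s in enumerate(block)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def amplitude_spectrum(block, fft_len=512):
    """Pre-emphasize, window, zero-pad on the right, keep the first half."""
    windowed = hamming(pre_emphasis(block))
    padded = windowed + [0.0] * (fft_len - len(windowed))
    spectrum = fft(padded)
    return [abs(c) for c in spectrum[:fft_len // 2]]


# Usage: one 409-sample block of a 1 kHz test tone sampled at 16 kHz.
block = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(409)]
amps = amplitude_spectrum(block)
peak_bin = max(range(len(amps)), key=lambda k: amps[k])
```

With a bin spacing of 16000/512 = 31.25 Hz, the 1 kHz tone is expected to peak at bin index 32 (zero-based), which gives a quick sanity check of the chain.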
The Mel scale is introduced to approximate the frequency resolution of the human ear, which is roughly linear up to 1000 Hz and logarithmic above 1000 Hz.
The first amplitude coefficient, the constant (0 Hz) component of the spectrum, is ignored, and the amplitudes of the remaining 255 spectral values are averaged. Averaging is implemented as 24 triangular band-pass filters. The lower, center and upper frequencies of these bands are presented in Table 1.
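As an illustration of how such band edges can be generated, the sketch below spaces 26 points uniformly on the Mel scale between 0 Hz and 8000 Hz and uses each consecutive triple as the lower, center and upper frequency of one band. The conversion formula mel = 2595 · log10(1 + f/700) and the 0 to 8000 Hz range are assumptions: the text does not state how Table 1 was computed. Under these assumptions the first band comes out at approximately 0 Hz, 74.24 Hz and 156.4 Hz, which agrees with the values quoted for the first band in this article. The function names are hypothetical, and Python is used only for illustration.

```python
import math


def hz_to_mel(f):
    """Assumed conversion from hertz to mel."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(num_bands=24, f_low=0.0, f_high=8000.0):
    """Return a (lower, center, upper) tuple in Hz for each triangular band."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (num_bands + 1)
    # num_bands + 2 points equally spaced in mel, converted back to Hz.
    points = [mel_to_hz(m_low + i * step) for i in range(num_bands + 2)]
    return [(points[i], points[i + 1], points[i + 2])
            for i in range(num_bands)]


bands = mel_band_edges()
lower, center, upper = bands[0]
```

Each band's upper frequency is the center of the next band, so neighboring triangles overlap by half.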
Each triangular filter computes a weighted average of the amplitude spectral values whose frequencies lie between the lower and upper frequencies of that filter. An amplitude located exactly at the center frequency of the band is multiplied by a coefficient of one. As the frequency of the amplitude moves from the center toward the lower or upper limit, the coefficient decreases linearly from one to zero.
The products of the amplitudes and their coefficients are summed and divided by the number of amplitude values. The result is the weighted average for this frequency band.
The 256 amplitudes correspond to frequencies from 0 Hz to 8000 Hz, so the frequency step is 8000/256 = 31.25 Hz. This means that the first amplitude corresponds to a frequency of 0 Hz, the second to 31.25 Hz, the third to 62.5 Hz, and so on.
For example, for the first Mel-scale frequency band, the lower frequency is 0 Hz, the center frequency is 74.24 Hz, and the upper frequency is 156.4 Hz.
So the first (0 Hz), second (31.25 Hz), third (62.5 Hz), fourth (93.75 Hz), fifth (125 Hz) and sixth (156.25 Hz) amplitudes fall into the first frequency band.
According to Fig. 2, the third amplitude receives a coefficient of 62.5 / 74.24 ≈ 0.84, and the fifth amplitude a coefficient of (156.4 - 125) / (156.4 - 74.24) ≈ 0.38.
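The worked example for the first band can be reproduced with a small sketch. It is Python for illustration only, and the function names `triangle_weight` and `band_average` are hypothetical. Following the enumeration above, the 0 Hz amplitude is counted among the six values of the first band, although its coefficient is zero. That counting choice is an interpretation of the text.

```python
def triangle_weight(f, lower, center, upper):
    """Coefficient of a triangular filter: 1 at center, 0 at both limits."""
    if f <= lower or f >= upper:
        return 0.0
    if f <= center:
        return (f - lower) / (center - lower)
    return (upper - f) / (upper - center)


def band_average(amplitudes, lower, center, upper, bin_step=31.25):
    """Sum of coefficient * amplitude over the band, divided by the count."""
    total = 0.0
    count = 0
    for k, a in enumerate(amplitudes):
        f = k * bin_step  # frequency of the k-th amplitude (k = 0 is 0 Hz)
        if lower <= f <= upper:
            total += triangle_weight(f, lower, center, upper) * a
            count += 1
    return total / count if count else 0.0


# First band from the text: 0 Hz, 74.24 Hz, 156.4 Hz.
w_third = triangle_weight(62.5, 0.0, 74.24, 156.4)
w_fifth = triangle_weight(125.0, 0.0, 74.24, 156.4)

# Usage with a flat placeholder spectrum of 256 unit amplitudes.
flat_average = band_average([1.0] * 256, 0.0, 74.24, 156.4)
```

Here `w_third` and `w_fifth` correspond to the coefficients 0.84 and 0.38 derived above.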
Fig. 2.
Table 1. Mel-frequency scale
As a result of the described actions, we obtain a 24-element spectral (acoustic) vector.
Finally, we normalize the acoustic vectors within a single speech sample. To do this, we find the greatest vector length among them and multiply the values of all vectors by the reciprocal of this length.
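The normalization step can be sketched as below. The text does not say which length is meant, so the Euclidean norm is assumed here. The function name `normalize_vectors` is hypothetical, and Python is used only for illustration.

```python
import math


def normalize_vectors(vectors):
    """Scale all vectors by the reciprocal of the greatest vector length."""
    # Assumption: "length" means the Euclidean norm of an acoustic vector.
    max_len = max(math.sqrt(sum(v * v for v in vec)) for vec in vectors)
    if max_len == 0.0:
        return [list(vec) for vec in vectors]  # nothing to scale
    scale = 1.0 / max_len
    return [[v * scale for v in vec] for vec in vectors]


# Usage with two short placeholder vectors (real vectors have 24 elements).
vectors = [[3.0, 4.0], [1.0, 0.0]]
normalized = normalize_vectors(vectors)
```

After this step the longest vector in the sample has length one and all other vectors keep their relative scale.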
For the simulation of the speech pre-processing algorithm, the MATLAB environment was chosen.
clear all; close all;
signal = audioread('example.wav');  % audioread replaces the older wavread
subplot(3,3,1); plot(signal); title('example.wav');
An illustration of stages 1-5 of pre-processing a speech signal is shown in Fig.3.
Fig.3. Speech pre-processing steps
Illustrations of the individual processing steps
The first illustration shows the speech signal of example.wav, sampled at 16 kHz with 16 bits per sample.
In the second illustration, we have one block (interval) of the specified speech signal with a duration of 25.6 ms. This block corresponds to 409 samples.
In the third illustration we see one speech signal block after processing it with a first-order filter.
The fourth illustration shows one block after applying the Hamming window.
The fifth illustration gives us 512 amplitude values of the fast Fourier transform of this single block.
Since these amplitude values of the fast Fourier transform coincide in pairs (because the corresponding complex values are pairwise complex conjugate), it is sufficient to take only the first 256 amplitude values. These 256 amplitude values are shown in the sixth illustration.
The seventh illustration gives the value of a 24-element vector, the components of which are obtained after averaging 256 amplitude values within 24 “triangular” frequency bands.