
Public discussion of the GOST project on digitized audio data compression

Dear users!

Continuing the recently started tradition of publishing draft standards developed by our company within the technical committee for standardization TK-234 "Alarm systems and anti-crime protection", we present for your attention the draft standard "Security television systems. Compression of digitized audio data. Classification. General technical requirements and methods for evaluating algorithms".

We will be extremely grateful for constructive criticism of the draft; all valuable comments and suggestions will be incorporated into the next edition of the standard. The text of the standard is under the cut.
For a better understanding of the structure of this standard and the general approach, we recommend that you first familiarize yourself with the already accepted standard for compression of digitized video data that we developed in 2011.

NATIONAL STANDARD OF THE RUSSIAN FEDERATION

Security television systems. Compression of digitized audio data.


Classification. General technical requirements and methods for evaluating algorithms


Introduction
The active use in security television systems (STS) of methods for compressing digitized audio data borrowed from multimedia television applications has made it impossible to carry out investigative measures, as well as operational functions, using most existing STS.
An important distinguishing feature of compression methods for digitized audio data in an STS is the need to ensure high sound quality in the restored audio data. This standard makes it possible to bring order to the existing and newly developed methods of compressing digitized audio data intended for use in anti-crime protection systems.
As the criterion for classifying compression algorithms for digitized audio data, this standard establishes the values of quality metrics that characterize the degree of deviation between the original and the corresponding restored digitized audio data.
This standard should be applied in conjunction with GOST R 51558-2008 "Security television facilities and systems. Classification. General technical requirements. Test methods".

1 Scope
This standard applies to digital security television systems (hereinafter referred to as DSTS) and establishes general technical requirements and methods for evaluating algorithms for the compression of digitized audio data in a DSTS.
This standard applies to compression (decompression) algorithms regardless of the level of their hardware implementation.
This standard establishes a classification of compression (decompression) algorithms for digitized audio data.
This standard establishes a method for comparing various compression and decompression algorithms for digitized audio data.
This standard is used in conjunction with GOST R IEC 60065, GOST R 51558, GOST 13699, GOST 15971, and GOST R 52633.5-2011.

2 Normative references
This standard uses normative references to the following standards:
GOST R 51558-2008 Security television facilities and systems. Classification. General technical requirements and test methods
GOST R IEC 60065-2009 Audio, video and similar electronic equipment. Safety requirements
GOST 13699-91 Recording and playback of information. Terms and Definitions
GOST 15971-90 Information processing systems. Terms and Definitions
GOST R 52633.5-2011 Information security. Information security techniques. Automatic training of neural network converters biometrics-access code

3 Terms and definitions
This standard uses terms in accordance with GOST 15971-90, GOST 13699, GOST R 51558, GOST R 52633.5-2011, and GOST R IEC 60065-2009, as well as the following terms with their corresponding definitions:
1. audio data, audio signal, monophonic audio signal: an analog signal that carries information about the change in sound amplitude over time.
2. multi-channel audio signal: an audio signal formed by combining a certain number of audio signals (channels) that carry information about the same sound; intended for better sound transmission with regard to spatial orientation.
3. stereophonic audio signal, stereo audio signal, two-channel audio signal: a multi-channel audio signal consisting of two monophonic audio signals.
4. digitized audio data: data obtained by analog-to-digital conversion of audio data and represented as a sequence of bytes in a certain format (WAV or other).
5. analog-to-digital converter, ADC: a device that converts an input analog audio signal into digitized audio data.
6. sample rate: the sampling frequency of a continuous signal during its analog-to-digital conversion into digitized audio data.
7. resolution of the ADC: the number of bits with which each signal sample is encoded during analog-to-digital conversion.
8. frame: a fragment of an audio signal with a specified number of values (the frame length).
9. digitized audio data format: a representation of digitized audio data that enables its processing by digital computing means.
10. compression of digitized audio data (audio compression): processing of digitized audio data intended to reduce its volume.
11. compressed audio data: data obtained by compressing digitized audio data.
12. lossy compression of digitized audio data (lossy audio compression): compression of digitized audio data in which information is lost, so that the audio data restored by decompression differs from the original digitized audio data.
13. lossless compression of digitized audio data (lossless audio compression): compression of digitized audio data in which no information is lost, so that the audio data restored by decompression does not differ from the original digitized audio data.
14. decompression of compressed audio data (audio decompression): restoration of digitized audio data from compressed audio data.
15. decoded audio data: data obtained from compressed audio data as a result of decompression.
16. audio encoder: software, hardware, or combined hardware-software means by which digitized audio data is compressed.
17. audio decoder: software, hardware, or combined hardware-software means by which compressed audio data is decompressed.
18. audio codec: a software, hardware, or hardware-software module capable of performing both compression and decompression of audio data.
19. compression ratio: the ratio by which the volume of digitized audio data is reduced as a result of compression.
20. bit rate: an estimate of the volume of compressed audio data, expressed in bits, determined over a certain time interval and divided by the duration of that interval in seconds.
21. decoded audio data quality: an objective assessment of the correspondence of the restored audio data to the original digitized audio data on the basis of calculated quality metrics.
22. quality metric: an analytically determined parameter characterizing the degree of deviation of the restored audio data from the original digitized audio data.
23. method of evaluating a compression algorithm: a method for analytically determining the values of the quality metrics used to check compliance with the requirements for audio compression algorithms.
24. compression algorithm: a precise set of instructions and rules describing the sequence of actions by which original audio data is converted into compressed audio data; implemented by means of an audio encoder.
25. decompression algorithm: a precise set of instructions and rules describing the sequence of actions by which compressed audio data is converted into restored audio data; implemented by means of an audio decoder.
26. time-frequency metric: a quality metric based on comparing the spectrograms of the digitized and restored audio data.
27. time-amplitude metric: a quality metric based on comparing the waveforms of the digitized and restored audio data.
28. resampling of an audio signal: changing the sampling rate of an audio signal.
29. psychoacoustic model: a model used in lossy compression of audio data that exploits the characteristics of sound perception by the human ear.
30. psychoacoustic masking: the hiding, under certain conditions, of one sound by another due to the characteristics of human sound perception.
31. masking threshold: the threshold level of a signal that is indistinguishable to a human because of the effect of psychoacoustic masking.
32. noise: a set of aperiodic sounds of varying intensity and frequency that carry no useful information.
33. signal spectrum (frequency spectrum): the result of decomposing a signal into simple sinusoidal functions (harmonics).
34. discrete Fourier transform, DFT: a transform that maps N samples of a discrete signal to N samples of the discrete spectrum of the signal.
35. fast Fourier transform, FFT: an algorithm for rapidly computing the discrete Fourier transform.
36. spectrogram: a characteristic of the power density of a signal in the time-frequency plane.
37. window (window function): a weighting function used to control the effects caused by the presence of side lobes in spectral estimates (spectral leakage). The available finite data record or finite correlation sequence can conveniently be regarded as part of the corresponding infinite sequence seen through the applied window.
38. short-time Fourier transform with Hann window: a DFT with the Hann window as the weighting function.
39. artificial neural network, ANN: a mathematical model, as well as its software or hardware implementations, built in a certain sense in the image of a network of nerve cells of a living organism and used to approximate continuous functions. An artificial neural network consists of an input layer of neurons and an output layer of neurons; between these layers are one or more intermediate (hidden) layers of neurons.
40. distorted frame: a frame for which the maximum ratio of noise to masking threshold exceeds 1.5 dB.
41. peak signal-to-noise ratio, PSNR: the ratio between the maximum possible signal value and the noise power.
42. differentiation (from the Latin differentia, difference): the separation of a particular item from the general population according to certain attributes.
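
(Informative example, not part of the draft text: a minimal Python sketch illustrating terms 19 and 20, the compression ratio and the bit rate; all names are illustrative.)

    def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
        """Ratio by which the volume of digitized audio data is reduced (term 19)."""
        return original_bytes / compressed_bytes

    def bit_rate_bps(compressed_bytes: int, duration_seconds: float) -> float:
        """Volume of compressed audio data in bits, divided by the interval duration (term 20)."""
        return compressed_bytes * 8 / duration_seconds

    # Example: 5 s of 44.1 kHz / 16-bit mono PCM (441 000 bytes) compressed to 80 000 bytes.
    print(compression_ratio(441_000, 80_000))  # ~5.51
    print(bit_rate_bps(80_000, 5.0))           # 128 000 bit/s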

4 General technical requirements
The requirements for the compression of digitized audio data are aimed at assessing the quality of the restored audio data, which is determined by the quality of each individual audio fragment of the restored audio data. The size of an audio fragment is specified either in seconds or as the number of digitized values within the fragment.
The quality of an audio fragment of the restored audio data is determined by the values of the quality metrics, which characterize the degree of distortion of the restored audio data relative to the original digitized audio data. The procedure for calculating the metrics is given in Section 6 of this document.
Based on the quality metrics of the restored audio data, a compression algorithm for digitized audio data is assigned to one of three classes (see Section 5 of this document).
The assignment of a compression algorithm for digitized data to a specific class is determined by the values of the quality metrics calculated for it and by Table 1 given in Section 5.

5 Classification of compression algorithms
5.1 To assess the quality of the restored audio data and classify the compression algorithms, the following quality metrics are used: the peak signal-to-noise ratio (PSNR); the waveform difference factor; and a metric based on an objective assessment of audio data from the point of view of human perception (perceptual evaluation of audio quality, PEAQ).
5.2 The classification of compression algorithms for digitized audio data is based on the values of quality metrics that reflect those changes in digitized audio data, introduced by the compression and decompression algorithms, which can critically affect the ability to use the restored audio data to determine the presence of sound signals and to differentiate sounds and speech.
5.3 Depending on the values of the quality metrics calculated during the assessment, a compression algorithm for digitized audio data is assigned to one of the following classes (see Table 1):


Table 1 - Classification of compression algorithms
5.4 The values of the quality metrics are determined for each audio fragment (five seconds long) of the digitized audio data, and the following values are taken as the resulting assessment: the smallest value for the PSNR and PEAQ metrics; the largest value for the waveform difference factor.
To calculate the PSNR metric and the waveform difference factor, the original and restored digitized audio data must be presented at a sampling frequency of 44,100 Hz, with 16 bits per discrete digitized value and one audio channel. A five-second audio fragment in this case contains 220,500 digitized values.
To calculate the PEAQ metric, the original and restored digitized audio data must be presented at a sampling frequency of 48,000 Hz, with 16 bits per discrete digitized value and one or two audio channels. A five-second audio fragment in this case contains 240,000 digitized values per channel.
Signals with a sampling frequency other than the required one must first be resampled.
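
(Informative example: a minimal Python sketch, not part of the draft, of preparing a signal for metric calculation per 5.4, assuming NumPy and SciPy are available; the mono mixdown applies to the PSNR and waveform-difference path, and all names are illustrative.)

    import numpy as np
    from scipy.signal import resample_poly

    def prepare(signal: np.ndarray, fs: int, target_fs: int, mono: bool = True) -> np.ndarray:
        """Resample to target_fs, optionally mix down to mono, quantize to 16 bits."""
        x = signal.astype(np.float64)
        if mono and x.ndim == 2:
            x = x.mean(axis=1)                  # average the channels
        if fs != target_fs:                     # resampling (term 28)
            g = np.gcd(fs, target_fs)
            x = resample_poly(x, target_fs // g, fs // g, axis=0)
        return np.clip(np.round(x), -32768, 32767).astype(np.int16)

    # x44 = prepare(x, fs, 44_100)              # for PSNR and the waveform difference factor
    # x48 = prepare(x, fs, 48_000, mono=False)  # for PEAQ (one or two channels)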

6 Methods for evaluating compression algorithms


6.1 General description of assessment methods
The general scheme of operation of the DSTS with respect to the use of compression and decompression algorithms is shown in Figure 1.

Figure 1 - General scheme of operation of the DSTS

Analog audio data undergoes analog-to-digital conversion, which yields digitized audio data with a specific sampling rate and a specific number of bits per discrete digitized value. On a computer, the digitized audio data is stored in one of the digitized audio data formats.
The digitized audio data is compressed, producing compressed audio data.
The compressed audio data is used for archival storage or for transmission over a network, after which it is decompressed. Decompression of the compressed audio data produces restored audio data, which is played back to the operator and fed to the input of software modules for audio data analysis.
In accordance with the general scheme of operation of the DSTS, compression algorithms for digitized audio data are classified by evaluating the quality metrics of the restored audio data against the original digitized audio data. Depending on the technical implementation of a particular DSTS, there are two assessment methods:
- based on the separation of digitized audio data;
- based on the separation of audio data.
Before the quality metrics are evaluated, both audio signals (original and restored) must be converted to signals with sampling frequencies of 44,100 Hz and 48,000 Hz. For both frequencies (44,100 Hz and 48,000 Hz), the number of bits per discrete digitized value must be 16.

6.1.1 Method of evaluating algorithms based on the separation of digitized audio data
To use this method, the technical implementation of the DSTS must make it possible to obtain the digitized audio data before it is processed by the compression and decompression algorithms.
The general scheme of the implementation of the evaluation method based on the separation of digitized audio data is shown in Figure 2.

Figure 2 - General scheme of the implementation of the evaluation method based on the separation of digitized audio data
The evaluation is performed by the following sequence of actions:
- the test sequence of audio data is fed to the input of the DSTS under test;
- using the capabilities of the DSTS, the digitized and restored audio data are saved to storage devices;
- the values of the quality metrics are calculated and the compression algorithm is classified according to Table 1.

6.1.2 Method of evaluating algorithms based on the separation of audio data
The evaluation method based on the separation of audio data should be used only if the technical implementation of the DSTS does not allow the evaluation method based on the separation of digitized audio data to be applied. The use of this method requires an additional DSTS in the test bench, which serves to store the digitized audio data.
The general scheme of the implementation of the evaluation method based on the separation of audio data is shown in Figure 3.

Figure 3 - General scheme of the implementation of the evaluation method based on the separation of audio data
The evaluation is performed by the following sequence of actions:
- the test sequence of audio data is fed to the input of the DSTS under test and, via an audio signal splitter, is duplicated to the input of the second DSTS (from the test bench);
- using the capabilities of the DSTS under test, the restored audio data is saved to storage devices;
- using the capabilities of the DSTS from the test bench, the digitized audio data is saved to storage devices;
- the values of the quality metrics are calculated and the compression algorithm is classified according to Table 1.

6.2 PEAQ calculation algorithm
The PEAQ metric (basic version) is calculated in accordance with Recommendation ITU-R BS.1387-1.
Requirements for the input signals:
• the original and restored audio signals for the PEAQ calculation must have a sampling frequency of 48 kHz; signals with a different sampling frequency must first be resampled to 48 kHz;
• the signals must contain one or two audio channels (mono or stereo).

The calculation consists of five stages, described below:
— signal preprocessing, in which the signals are divided into frames and transformed into excitation patterns;
— processing of the excitation and modulation patterns;
— calculation of the output variables of the psychoacoustic model;
— normalization of the output variables;
— evaluation of the quality of the restored signal using an artificial neural network.
I. Signal preprocessing
Applying the window transform
The original digitized data is divided into frames of N = 2048 values with a 50% overlap (a step of 1024 values). The digitized data of each frame is weighted with the scaled Hann window of formula (2). The Hann window function is:
(1) $h_w[n] = \frac{1}{2}\left(1 - \cos\frac{2\pi n}{N-1}\right), \quad n = 0, 1, \ldots, N-1$
The scaled version of the Hann window function is:
(2) $h_s[n] = \sqrt{8/3}\, h_w[n]$
The transition to the frequency domain is carried out by applying the discrete Fourier transform (DFT):
(3) $X[k] = \sum_{n=0}^{N-1} h_s[n]\, x[n]\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1$
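
(Informative example: a Python sketch, assuming NumPy, of the framing and windowing of formulas (1)-(3); the DFT normalization convention follows NumPy and may differ from the formula by a constant factor.)

    import numpy as np

    N = 2048
    n = np.arange(N)
    hann = 0.5 * (1.0 - np.cos(2.0 * np.pi * n / (N - 1)))  # formula (1)
    window = np.sqrt(8.0 / 3.0) * hann                      # formula (2)

    def frame_spectra(x: np.ndarray) -> np.ndarray:
        """DFT of each windowed frame; frames of N values with a 1024-value step."""
        starts = range(0, len(x) - N + 1, N // 2)
        frames = np.stack([x[s:s + N] for s in starts])
        return np.fft.fft(frames * window, axis=1)          # formula (3)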

Model of the outer and middle ear
The frequency response of the outer and middle ear is calculated by the following formula:
(4)
Using formula (4), the vector of weighting coefficients is calculated as follows:
(5)

Using these weights (5), the weighted DFT energy is calculated:
(6)
Decomposition into critical hearing bands
Below are the formulas for the transformation to the Bark scale (7) and the inverse transformation (8):
(7) $z = 7\,\operatorname{arcsinh}(f / 650)$
where z is measured in Barks;
(8) $f = 650\,\sinh(z / 7)$
Frequency bands
The frequency bands are defined by specifying the lower, center, and upper frequencies of each band. In the Bark scale these values are given as follows:
(9)
The inverse transformation is performed by the following formulas:
(10)
The band index takes the values i = 1, 2, ..., 109.
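
(Informative example: a Python sketch of the frequency-to-Bark mapping of formulas (7) and (8), assuming the arcsinh form used in ITU-R BS.1387.)

    import numpy as np

    def hz_to_bark(f):
        """Formula (7): z = 7 * arcsinh(f / 650), z in Barks."""
        return 7.0 * np.arcsinh(f / 650.0)

    def bark_to_hz(z):
        """Formula (8): the inverse transformation."""
        return 650.0 * np.sinh(z / 7.0)

    # The 109 bands of the model could then be laid out with a constant
    # Bark-domain step between hz_to_bark(f_low) and hz_to_bark(f_high).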
Band energy
For the i-th frequency band, the energy contribution of the k-th DFT base frequency is calculated by the following formula:
(11)
The energy of the i-th frequency band is then:
(12)
Below is the final formula for the energy of the i-th frequency band:
(13)
Internal ear noise
To account for the noise generated inside the ear itself, an additive term is introduced into the energy of each frequency band:
(14)
where the internal noise is modeled as follows:
(15)
The resulting energies are referred to below as pitch patterns.
Energy spreading within a frame
The characteristic of the energy spreading on the Bark scale is calculated as follows:
(16)
where
(17)
The function S(i, l, E) has the following form:
(18)
where
(19)
Below are the formulas for calculating the terms:
(20)
and
(21)
The resulting energies are the unsmeared excitation patterns.
Energy filtering
Let n be the frame index (frames are indexed starting from n = 0), and let the energy of the n-th frame corresponding to formula (16) be denoted accordingly. Energy filtering is performed according to the following formula:
(22)
where the time constant determines the decay of the energy. The initial filtering condition is zero.
The final values are the excitation patterns.
Time constants
The time constant for filtering the i-th band is calculated as follows:
(23)
and the corresponding filter coefficient can be calculated as:
(24)
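
(Informative example: a Python sketch of the first-order smoothing of formula (22) with per-band time constants in the spirit of formulas (23)-(24). The constants tau_min = 0.008 s and tau_100 = 0.030 s, and the final maximum with the unsmoothed input, are assumptions taken from the FFT ear model of ITU-R BS.1387.)

    import numpy as np

    STEP_S = 1024.0 / 48_000.0  # time between frame starts at 48 kHz

    def alpha(f_center: np.ndarray, tau_min: float = 0.008, tau_100: float = 0.030) -> np.ndarray:
        tau = tau_min + (100.0 / f_center) * (tau_100 - tau_min)  # cf. formula (23)
        return np.exp(-STEP_S / tau)                              # cf. formula (24)

    def smooth(E: np.ndarray, a: np.ndarray) -> np.ndarray:
        """E: band energies, shape (frames, bands); a: per-band coefficients."""
        out = np.empty_like(E)
        prev = np.zeros(E.shape[1])  # zero initial condition
        for t in range(E.shape[0]):
            prev = a * prev + (1.0 - a) * E[t]  # cf. formula (22)
            prev = np.maximum(prev, E[t])       # assumed: keep at least the input energy
            out[t] = prev
        return out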

II. Pattern processing

Figure 4 below shows the scheme of the preliminary calculations described in the previous stage.

Figure 4 - Signal preprocessing scheme
The indices R and T denote the original and restored audio signals, respectively. The index k denotes the index of the frequency band (there are 109 frequency bands in total), and the index n denotes the frame number. For the recurrent formulas at this stage and the next one (stage III), zero initial conditions are always chosen.
Processing of excitation patterns
The inputs to this stage of the calculation are the excitation patterns calculated by formula (22) for the original and tested audio signals, respectively.
Correction of excitation patterns
First, filtering is performed for both audio signals by the formula:
(25)
The time constant is calculated by formulas (23) and (24), but with different parameter values. The initial condition for the filtering is set to 0.
Next, the correction factor is calculated:
(26)
The excitation patterns are corrected as follows:
(27)
Adaptation of excitation patterns
Using the same time constants and initial conditions as in the correction of the excitation patterns, the output signals calculated by formula (27) are smoothed in accordance with the following formulas:
(28)
Based on the ratio between the values calculated in (28), a pair of auxiliary signals is calculated:
(29)
If the numerator and denominator in formula (29) are both equal to zero, the corresponding values are assigned directly.
If k = 0, then
For the formation of the pattern-correction factors, the auxiliary signals are filtered using the same time constants and initial conditions as in (25):
(30)
where
(31)
(32)
As the end result of this stage of processing, the spectrally adapted patterns are obtained on the basis of formula (30):
(33)

Processing of modulation patterns
The inputs to this stage of the calculation are the unsmeared excitation patterns calculated by formula (16) for the original and tested audio signals, respectively. The purpose of this section is to calculate the modulation measures of the spectral envelopes.
First, the average loudness is calculated:
(34)
Next, the following differences are calculated:
(35)
The time constants and initial conditions are the same as in the previous section.
The modulation measures of the spectral envelopes are calculated as follows:
(36)
Loudness calculation
The loudness patterns are calculated according to the following formulas:
(37)
where
(38)
and
(39)
The parameter c = 1.07664.
The total loudness of each of the two signals is calculated as follows:
(40)

III. Calculation of the output values of the psychoacoustic model
The output characteristics of stage I are used to calculate the output characteristics of stage II in accordance with the diagram below (see Figure 5).

Figure 5 - Pattern processing scheme
In turn, the values of the previous stage (II) are used to calculate the output values of the variables of the psychoacoustic model (see Table 2 and Figure 6).

Figure 6 - Scheme for calculating the output variables of the psychoacoustic model
In total, 11 output variables of the psychoacoustic model are calculated. They are listed in Table 2.

Table 2 - Output variables of the psychoacoustic model
For two-channel audio signals, the values of the variables are calculated separately for each channel and then averaged. The values of all variables (except the ADBB and MFPDB variables) for each signal channel are calculated independently of the second channel.
General description of the parameter calculation process
All values of the output variables of the model are obtained by averaging, over all frames, the functions of time and frequency obtained at the previous step (the result is a scalar value).
The values to be averaged must lie within boundaries determined by the following condition: the beginning (or end) of the averaged data is defined as the first position, counted from the beginning (or end) of the sequence of audio signal amplitudes, at which the sum of five consecutive absolute amplitude values exceeds 200 in any of the audio channels. Frames lying outside these boundaries are ignored during averaging. The threshold value 200 applies when the amplitudes of the input audio signals are normalized to the range from -32768 to +32767; otherwise, the threshold value is calculated as follows:
(41)
where the maximum amplitude of the audio signal is used.
Further, the frame index n starts from zero at the first frame satisfying the boundary condition with the threshold, and the number of frames N is counted up to the last frame satisfying the above condition.
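
(Informative example: a Python sketch, assuming NumPy, of the boundary rule above; the function returns the first sample index at which five consecutive absolute amplitudes sum to more than the threshold.)

    import numpy as np

    def first_active_index(x: np.ndarray, threshold: float = 200.0) -> int:
        s = np.convolve(np.abs(x), np.ones(5), mode="valid")  # sums of 5 consecutive values
        idx = np.nonzero(s > threshold)[0]
        return int(idx[0]) if idx.size else len(x)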
Windowed modulation difference 1 (WinModDiff1B)
Below is the formula for calculating the instantaneous modulation difference:
(42)
The value of the instantaneous modulation difference is averaged over all frequency bands in accordance with the following formula:
(43)
The final value of the output variable is obtained by averaging formula (43) with a sliding window of L = 4 (85 ms, since each step equals 1024 digitized values):
(44)
In this case, so-called delayed averaging is used: the first 0.5 seconds of the signal do not participate in the calculations. The number of skipped frames is: (45)
In formula (45), the floor operation denotes discarding the fractional part.
Thus, in formula (44), the frame index covers only the frames that occur after the 0.5-second delay.
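
(Informative example: a Python sketch of the delayed averaging; with a frame step of 1024 values at 48 kHz, formula (45) discards floor(0.5 * 48000 / 1024) = 23 frames.)

    import numpy as np

    def delayed_mean(values: np.ndarray, fs: int = 48_000, step: int = 1024) -> float:
        skip = int(0.5 * fs / step)  # formula (45): fractional part discarded
        return float(np.mean(values[skip:]))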
Average modulation difference 1 (AvgModDiff1B)
The value of this output variable of the psychoacoustic model is calculated by the following formula:
(46)
Where
(47)
Delay averaging is also used to calculate this value.
Average modulation difference 2 (AvgModDiff2B)
First, the value of the instantaneous modulation difference is calculated by the formula:
(48)
Then, the modulation difference averaged over the frequency bands is calculated:
(49)
The final variable value of the psychoacoustic model is calculated as follows:
(50)
Where
(51)
Delay averaging is also used to calculate this value.
Noise loudness (RmsNoiseLoudB)
Below is the formula for finding the values of the instantaneous noise loudness:
(52)
where
(53)
and where:
(54)
(55)
(56)
Further, if the instantaneous loudness is less than 0, it is set to 0:
(57)
The value of the final output variable of the psychoacoustic model is obtained by averaging the instantaneous loudness:
(58)
Delayed averaging is used to calculate this value. In addition to delayed averaging, a loudness threshold is used to determine the instantaneous noise loudness value from which the averaging process starts. Thus, the averaging starts from the first value satisfying the loudness-threshold condition, but no later than 0.5 seconds from the beginning of the signal (in accordance with delayed averaging).
The loudness-threshold condition
The instantaneous noise loudness values at the beginning of both signals (original and test) are ignored until 50 ms have passed after the moment when the total loudness of one of the signals first exceeds the threshold value of 0.1.

The condition of exceeding the threshold can be represented as:
(59)
The following formula is used to calculate the number of frames to be skipped after exceeding the threshold:
(60)

The bandwidths of the original and restored audio signals (BandwidthRefB and BandwidthTestB)
The operations for calculating the bandwidths of the original and restored audio signals are described in terms of operations on the output values of the DFT, expressed in decibels (dB). First of all, the following operations are performed for each frame:
• For the restored signal: the largest component located above the 21.6 kHz frequency is found. This value is called the threshold level.
• For the original signal: performing a downward search starting at 21.6 kHz, the first value that is 10 dB above the threshold level is found. The frequency corresponding to this value is called the bandwidth of the original signal.
• For the restored signal: performing a downward search starting from the bandwidth of the original signal, the first value that exceeds the threshold level by 5 dB is found. The frequency corresponding to this value is denoted as the bandwidth of the restored signal.
If the frequency found for the original signal does not exceed 8.1 kHz, the bandwidth for this frame is ignored.
The per-frame bandwidths are expressed as DFT base frequencies.
The base DFT frequency of the n-th frame is denoted separately for the original signal and for the restored signal. To calculate the final values of the psychoacoustic model variables, the bandwidths of the original and restored signals, the following formulas are applied, respectively:
(61)
(62)
where the summation is carried out only over those frames in which the base DFT frequency exceeds 8.1 kHz.
The ratio of the noise level to the masking threshold (TotalNMRB)
The masking threshold is calculated using the following formula:
(63)
Where
(64)
The noise level is calculated as follows:
(65)
where k denotes the index of the fundamental frequency of the DFT.
The ratio of the noise level to the masking threshold in the k-th frequency band is expressed by the following formula:
(66)
The final ratio of the noise level to the masking threshold (in decibels) is calculated as:
(67)
Frame relative distortion (RelDistFramesB)
The maximum ratio of noise to the frame masking threshold is calculated as follows:
(68)
A frame is considered distorted if its maximum ratio of noise to masking threshold exceeds 1.5 dB.
The final value of the output variable of the psychoacoustic model is the ratio of the number of distorted frames to the total number of frames.
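
(Informative example: a Python sketch of the RelDistFramesB rule; the input is the per-frame maxima of formula (68), in decibels.)

    import numpy as np

    def rel_dist_frames(max_nmr_db: np.ndarray) -> float:
        """Share of distorted frames: max noise-to-masking ratio above 1.5 dB (term 40)."""
        return float(np.mean(max_nmr_db > 1.5))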

Maximum probability of detecting distortion (MFPDB)
First, the asymmetric excitation is calculated:
(69)
where
(70)
Next, the distortion detection step size is calculated:
(71)
where
(72)
The detection probability is calculated as follows:
(73)
where b is calculated as:
(74)
The number of steps above the detection threshold is calculated:
(75)
Characteristics (73) and (75) are calculated for each channel of the signal. For each frequency and time position, the total detection probability and the total number of steps above the threshold are taken as the larger value over all channels:
(76)
where indices 1 and 2 denote the channel number.
For single-channel signals, the above values are calculated as:
(77)
The following computational procedure is then performed:
(78)
where the initial condition is zero.
The maximum probability of detecting distortion is calculated by the recurrence formula:
(79)
The final value of the output variable of the psychoacoustic model is calculated as follows:
(80)
Average block distortion (ADBB)
First, the sum of the total number of steps above the detection threshold is calculated:
(81)
The summation is carried out over all values for which
The final characteristic is:
(82)
Harmonic error structure (EHSB)
The DFT outputs for the original and restored signals are denoted separately for each signal.
The following characteristic is calculated:
(83)
A vector of length M is formed from the values of D[k]:
(84)
Normalized autocorrelation is calculated by the formula:
(85)
Where
Let C[l] = C[l, 0]. Next, the following must be calculated:
(86)
When calculating (85), if the signals are equal, the normalized autocorrelation must be set equal to one in order to avoid division by 0.
A window function of the following form is introduced:
(87)
The window transform (87) is applied to the normalized autocorrelation:
(88)
Where
(89)
The power spectrum is calculated by the formula:
(90)
The search for the maximum peak of the power spectrum starts at k = 1 and ends when the stopping condition is reached. The maximum peak value found is then used to calculate the final value of the output variable of the psychoacoustic model by the following formula:
(91)
When calculating this value, low-energy frames are excluded. To identify low-energy frames, a threshold value is introduced:
(92)
where the constant is given for amplitudes stored as 16-bit integers.
The frame energy is estimated using the following formula:
(93)
When calculating the harmonic structure of the error, the frame is ignored if:
(94)

IV. Normalization of the output variables of the psychoacoustic model
The normalization of the output variables of the psychoacoustic model obtained at the previous step is performed in accordance with the following formula:
(95)
where the value of the i-th output variable of the psychoacoustic model is used, and the normalization constants are given in Table 3 below.

Table 3 - Constants for normalizing the values of the output variables of the psychoacoustic model

V. Evaluation of the quality of the restored signal using an artificial neural network
The final PEAQ value is calculated by the formula:
(96)
where bmin = -3.98 and bmax = 0.22, and the function sig(x) is an asymmetric sigmoid:
(97)
The argument of formula (96) is calculated as follows:
(98)
where the normalized value of the i-th output variable is used, I is the number of output variables (equal to 11), J is the number of neurons in the hidden layer (equal to 3), and the weights and offsets of the neural network are given in Tables 4-6 below.

Table 4 - Neural network weights

Table 5 - Neural network offsets

Table 6 - Neural network weights and offsets
The resulting value of the PEAQ metric is a real number belonging to the interval [-3.98; 0.22].
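
(Informative example: a Python sketch of stage V in the structure described by formulas (96)-(98): 11 normalized inputs, one hidden layer of 3 neurons, and a mapping to the range [-3.98; 0.22]. The logistic form of the sigmoid is an assumption, and the arrays W_hidden, b_hidden, w_out, b_out stand for the constants of Tables 4-6, which are not reproduced here.)

    import numpy as np

    B_MIN, B_MAX = -3.98, 0.22

    def sig(x):
        """Assumed logistic form of the sigmoid in formula (97)."""
        return 1.0 / (1.0 + np.exp(-x))

    def peaq_grade(x_norm, W_hidden, b_hidden, w_out, b_out):
        """x_norm: the 11 normalized output variables from formula (95)."""
        hidden = sig(W_hidden @ x_norm + b_hidden)  # J = 3 hidden neurons
        di = b_out + w_out @ hidden                 # cf. formula (98)
        return B_MIN + (B_MAX - B_MIN) * sig(di)    # cf. formula (96)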


6.3 Algorithm for calculating PSNR
The peak signal-to-noise ratio between the original audio signal and the restored audio signal is calculated by the formulas:
(99) $\mathrm{PSNR} = 10 \log_{10}\left(\mathrm{MAX}^2 / \mathrm{MSE}\right)$
(100) $\mathrm{PSNR} = 20 \log_{10}\left(\mathrm{MAX} / \sqrt{\mathrm{MSE}}\right)$
where, in turn:
(101) $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2$
Here $x_i$ and $y_i$ are the i-th digitized values of the original and restored audio signals, respectively, i = 1, 2, ..., n, and MAX is the maximum value among the digitized values of the original audio signal.
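
(Informative example: a Python sketch of the PSNR computation of formulas (99)-(101), assuming the signals are already aligned, mono, 44.1 kHz, 16-bit, per 5.4.)

    import numpy as np

    def psnr_db(original: np.ndarray, restored: np.ndarray) -> float:
        x = original.astype(np.float64)
        y = restored.astype(np.float64)
        mse = np.mean((x - y) ** 2)              # formula (101)
        peak = np.max(np.abs(x))                 # MAX over the original signal
        return 10.0 * np.log10(peak ** 2 / mse)  # formula (99)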


6.4 Algorithm for calculating the "waveform difference factor" metric
Let the original monophonic audio signal (or one channel of the original multi-channel audio signal) be given; similarly, let the restored monophonic audio signal (or one channel of the restored multi-channel audio signal) be given. Both signals consist of the same number N of values.
The arrays of signal amplitude values are represented as relative changes of the signal amplitude values:
(102)
The value K of the "waveform difference factor" metric is calculated as the standard deviation of the resulting amplitude arrays:
(103)
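
(Informative example: a Python sketch of the waveform difference factor under stated assumptions, since formulas (102)-(103) are not reproduced here: the relative change is taken as the sample-to-sample difference normalized by the peak amplitude, and K as the standard deviation of the difference of the two resulting arrays.)

    import numpy as np

    def waveform_difference_factor(x: np.ndarray, y: np.ndarray) -> float:
        dx = np.diff(x) / np.max(np.abs(x))  # assumed reading of formula (102)
        dy = np.diff(y) / np.max(np.abs(y))
        return float(np.std(dx - dy))        # assumed reading of formula (103)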


7 Methods for comparing compression algorithms for digitized audio data
7.1 Two or more compression algorithms are comparable with each other if they belong to the same class according to Table 1.
7.2 Of two or more comparable compression algorithms, the best is the one that provides the best values for at least two of the three metrics given in Table 1. The best value of a metric is the larger value for the PSNR and PEAQ metrics, and the smaller value for the "waveform difference factor" metric.
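
(Informative example: a Python sketch of the comparison rule of 7.2 for two comparable algorithms; the dictionary keys are illustrative.)

    def better(a: dict, b: dict) -> bool:
        """True if algorithm a beats b on at least two of the three metrics."""
        wins = int(a["psnr"] > b["psnr"]) + int(a["peaq"] > b["peaq"]) + int(a["wdf"] < b["wdf"])
        return wins >= 2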

References:
1. P. Kabal, An Examination and Interpretation of ITU-R BS.1387: Perceptual Evaluation of Audio Quality
2. PQevalAudio

Source: https://habr.com/ru/post/181922/

