
Audio watermark for Second Screen applications

For Second Screen applications, there are two main ways to synchronize content by audio (Automatic Content Recognition, ACR): audio fingerprints and digital watermarks. These technologies take fundamentally different approaches.

A fingerprint is a compact, distortion-resistant representation of the content itself. Recognition consists of creating an audio fingerprint and searching a database of reference samples, then extracting the required data, for example the track name and the offset of the query from its beginning. With an audio watermark, the information necessary and sufficient for recognition is hidden directly inside the audio signal itself.

I have already written about the results we achieved with fingerprint-based audio recognition. In this post I want to talk about audio watermarks and the problems we faced when building an ACR system based on them.

Disadvantages of fingerprints


Before moving on to watermarks, let us note several problems that arise when using fingerprint-based ACR.
A Second Screen application continuously records the audio stream and sends requests with fingerprints to the server. Storing the database and searching through it is, as a rule, implemented on the server side. Given that it is usually highly popular projects that are interested in Second Screen, we arrive at the need for enough resources to withstand high loads.

Successful fingerprint-based recognition is possible only if the audio fragment is unique. In real content, however, there can be audio duplicates, for example the same music behind the credits or the same background music in different scenes. Under noisy conditions, recognition of such fragments carries a high probability of false positives. Therefore, for a Second Screen application to work properly, these fragments must be identified in advance and the ACR system adjusted accordingly.

As a possible solution to these problems, the transition to digital watermarking looks quite attractive. Since all the necessary data is embedded into the audio stream in advance, recognition can be implemented entirely on the client side, and the uniqueness of the watermark would eliminate the problems of duplicates and similar audio fragments.

Requirements for digital watermarking algorithms


Depending on whether the watermark detector requires the original signal, algorithms are divided into non-blind and blind watermarking. In the context of ACR we were interested in blind algorithms, i.e. those that allow extracting the mark without access to the original audio signal. An overview of such methods can be found in [1].

The injected watermark should be transparent (inaudibility): it must not introduce distortions that significantly affect the quality of the original signal. A simple quantitative measure of transparency is SNR (Signal-to-Noise Ratio), the ratio of the power of the original signal to the power of the distortion introduced by the watermark. According to the recommendations of the IFPI (International Federation of the Phonographic Industry), SNR should be above 20 dB. Along with SNR, transparency is assessed with the ODG (Objective Difference Grade) parameter, computed by the PEAQ algorithm and ranging from 0, for imperceptible distortion, to -4, for distortion causing severe irritation. Unlike SNR, ODG takes into account properties of the human auditory system, such as frequency and temporal masking.
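As an illustration of the SNR figure used above, here is a minimal sketch (not from the article) that measures the ratio between a signal's power and the power of the watermarking distortion:

```python
import numpy as np

def snr_db(original, watermarked):
    """Ratio of original-signal power to watermark-distortion power, in dB."""
    noise = watermarked - original
    return 10 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(0)
t = np.arange(44100) / 44100
x = np.sin(2 * np.pi * 440 * t)            # one second of a 440 Hz tone
xw = x + 0.01 * rng.normal(size=x.size)    # stand-in for a watermarked copy
print(snr_db(x, xw))                       # should land well above 20 dB
```

Unlike ODG, this number says nothing about audibility: the same SNR can be inaudible in one track and clearly audible in another, which is why PEAQ exists.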

For successful recognition the watermark must be robust to signal-processing operations called attacks. It should not be erased by lossy compression, filtering, D/A-A/D conversion, added noise, etc. Resistance to attacks is assessed by the fraction of erroneously decoded bits, BER (Bit Error Rate).

An important characteristic of watermarking algorithms is throughput, i.e. the maximum amount of information that can be embedded per unit of time (data rate).

The requirements of transparency, robustness and throughput are mutually opposed: increasing one inevitably reduces the other two (Figure 1).



Looking ahead, I note that a distinctive feature of automatic content recognition systems is their comparatively high, all else being equal, requirements for watermark robustness. At the same time, the algorithm must be fast enough to run on mobile devices in real-time applications.

Audio Watermarking Techniques


An audio watermark can be thought of as modulated noise added to the original signal. Watermarking algorithms come down to shaping the spectral characteristics of the injected information so that the watermark meets the requirements placed on it.

Many schemes for embedding and extracting audio watermarks use a block-oriented approach and take the characteristics of the original signal into account (Figure 2). If necessary, the watermark is first converted into a one-dimensional bit sequence. The structure of the embedded stream is shown in Figure 3. For increased robustness, redundancy can be introduced into the watermark: simple bit replication or error-correcting codes (ECC), for example Reed-Solomon codes, LDPC (low-density parity-check) codes, etc. Each bit of the stream is embedded in a separate time block, or segment, of the audio signal.
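The simplest form of redundancy mentioned above, bit replication with majority-vote decoding, can be sketched as follows (an illustrative example, not the article's code):

```python
import numpy as np

def replicate(bits, r=3):
    """Repeat each payload bit r times before embedding."""
    return np.repeat(bits, r)

def majority_decode(chips, r=3):
    """Majority vote over each group of r extracted bits."""
    groups = np.asarray(chips).reshape(-1, r)
    return (groups.sum(axis=1) > r // 2).astype(int)

payload = np.array([1, 0, 1, 1])
tx = replicate(payload)
tx[2] ^= 1                      # a single bit flipped by channel noise
assert np.array_equal(majority_decode(tx), payload)
```

Replication tolerates isolated bit errors at the cost of dividing the data rate by r; proper ECC such as Reed-Solomon gives a much better trade-off for burst errors.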





Each signal block is described by a vector of coefficients C = (c_1, ..., c_N). In the simplest case the coefficients can be the signal samples themselves; one then speaks of labeling in the time domain. As a rule, however, to increase the robustness of the watermark to attacks, a frequency representation of the signal is used: the coefficients of the Fourier transform (DFT) [2], the cosine transform (DCT) [3], or the wavelet transform (DWT) [4]. There are works that describe signal segments using empirical mode decomposition (EMD) [5], cepstrum transforms [6], combinations of frequency transforms [7], and a number of others. For example, in [8] the vector of coefficients consists of the singular values obtained by singular value decomposition (SVD) of the matrix of DWT coefficients of a signal segment.

Embedding the watermark, or labeling, consists in changing the vector C in accordance with the chosen bit-coding technique. Techniques such as Spread Spectrum (SS), Quantization Index Modulation (QIM) and Patchwork are widely used in audio steganography.

SS coding is described by the equation C_w = C + b · (M ∘ P), where M is a perception mask (absent in some algorithms) defined as a function of the human auditory system (HAS); P is a pseudorandom sequence; b ∈ {-1, +1} is the value of the embedded bit; ∘ is the element-wise product of two vectors.
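A minimal sketch of spread-spectrum embedding C_w = C + b·(M ∘ P) with blind correlation detection might look like this (illustrative only: the perception mask is trivially flat, and the embedding strength alpha is a made-up parameter exaggerated for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

def ss_embed(coeffs, bit, prn, mask, alpha=0.5):
    """Spread-spectrum rule C_w = C + b * alpha * (M o P), with b in {-1, +1}."""
    b = 1 if bit else -1
    return coeffs + b * alpha * mask * prn

def ss_detect(coeffs_w, prn, mask):
    """Blind detection: the sign of the correlation <C_w, M o P> gives the bit."""
    return int(np.dot(coeffs_w, mask * prn) > 0)

n = 1024
coeffs = rng.normal(size=n)          # coefficient vector C of one block
prn = rng.choice([-1.0, 1.0], n)     # pseudorandom sequence P
mask = np.ones(n)                    # flat perception mask M (no real HAS model)
for bit in (0, 1):
    assert ss_detect(ss_embed(coeffs, bit, prn, mask), prn, mask) == bit
```

The detector is blind because the host coefficients C only enter the correlation as zero-mean noise; a real system replaces the flat mask with a psychoacoustic model.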

In the QIM method, the coefficients of the labeled block are produced by a modulation function that maps the original value of a coefficient c to the nearest point of the set Λ_0 = {kΔ} for b = 0 and Λ_1 = {kΔ + Δ/2} for b = 1, where Δ is the quantization step [9].
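The two lattices Λ_0 and Λ_1 of basic QIM can be sketched like this (an illustrative toy, without the DC or DM refinements discussed below; delta is an arbitrary example value):

```python
import numpy as np

def qim_embed(c, bit, delta=0.5):
    """Map coefficient c to the nearest point of the lattice
    Lambda_b = {k*delta + b*delta/2}."""
    return delta * np.round((c - bit * delta / 2) / delta) + bit * delta / 2

def qim_decode(cw, delta=0.5):
    """Pick the bit whose lattice contains the closer point to the received value."""
    d0 = abs(cw - qim_embed(cw, 0, delta))
    d1 = abs(cw - qim_embed(cw, 1, delta))
    return int(d1 < d0)

c = 1.37
for bit in (0, 1):
    # perturbations smaller than delta/4 do not change the decoded bit
    assert qim_decode(qim_embed(c, bit) + 0.1) == bit
```

The quantization step Δ directly embodies the robustness-transparency trade-off: a larger Δ tolerates more noise but distorts the coefficients more.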

Two variants of QIM are widespread: distortion compensation (DC) and dither modulation (DM). In the DC case the embedding function is c_w = α·Q_b(c) + (1 − α)·c, where α ∈ (0, 1] and Q_b is the quantizer for bit b. In DM a single quantizer Q is used, and the embedding function is c_w = Q(c + d_b) − d_b, where d_b is a dither noise intended to mask the distortions arising from quantization; one of its synthesis algorithms can be found in [9].

The Patchwork method splits the set of block coefficients into two disjoint subsets A and B. It is assumed that the difference between the elements of these subsets has a distribution with near-zero mean, i.e. E[a_i − b_i] ≈ 0. Embedding a bit comes down to changing the coefficients so that the difference of the subset means exceeds T when b = 1 and falls below −T when b = 0, where T is the detection threshold.

The relatively simple strategies shown in Figure 4 can also be used for watermark insertion and detection. Bits are encoded through functional relationships between the coefficients. The illustrations are given for three or four coefficients, but the techniques are easily generalized to a larger number. Examples of algorithms using such approaches can be found in [2, 10].



After a bit is embedded, the inverse transforms are computed, mapping the modified coefficients from the labeling space back into the time domain. The final stage of embedding is stitching all the marked blocks into a single signal containing the watermark.

A generalized scheme for extracting the watermark is shown in Figure 5. One of the key issues in the block-oriented approach is synchronization: the exact position of each block must be determined during decoding. To this end, the bit stream is reconstructed at different initial offsets and a sync code is searched for. The detection criterion is usually a threshold on the correlation coefficient or the Hamming distance. Once the sync code is found, the watermark is reconstructed by removing the redundancy introduced at the labeling stage.
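The sync-code search by Hamming distance might look like this (the sync pattern and error threshold are made up for the example):

```python
import numpy as np

SYNC = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1])  # hypothetical sync code

def find_sync(bitstream, sync=SYNC, max_errors=1):
    """Slide over the decoded bitstream and report every offset where the
    Hamming distance to the sync code is within the error threshold."""
    hits = []
    for off in range(len(bitstream) - len(sync) + 1):
        window = bitstream[off : off + len(sync)]
        if np.count_nonzero(window != sync) <= max_errors:
            hits.append(off)
    return hits

stream = np.concatenate([np.array([0, 1, 0]), SYNC, np.array([1, 0, 1, 1, 0, 1])])
stream[4] ^= 1                  # one decoding error inside the sync code
print(find_sync(stream))        # → [3]
```

The error threshold trades off missed detections against false locks; too permissive a threshold makes random payload bits look like a sync code.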



The results of our research


There are quite a few diverse publications on audio watermarking in the English-speaking segment of the Internet. Having tried some of the techniques, we did not achieve satisfying results. At an ODG transparency of about -1, our implementations of the algorithms [10-12] could not recognize the watermark on a mobile device even at distances of about 5 cm from the sound source. We also noted a strong dependence of the transparency of the watermarks under study on the nature of the original signal: for example, a watermark inaudible in rock music was clearly audible in speech. The only way to fight this was to reduce the aggressiveness of embedding (by increasing SNR), which further reduced the already low robustness of the watermark.

We decided to implement our own algorithm that would dynamically adapt to the spectral characteristics of the sound.

The watermark is embedded in the frequency domain using the short-time Fourier transform (STFT). The method is based on the effect of temporal masking: a weak signal occurring shortly before or after a strong one remains unnoticed for some time. The masking time depends on the frequency and amplitude of the signal and can reach hundreds of milliseconds.



We hide the watermark in the "shadow" of the local peaks of the spectrogram: in each time interval we select those STFT coefficients that can be changed relatively harmlessly to encode bits (Figure 6).
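The article does not disclose the actual selection rule, but as a purely illustrative sketch, picking STFT bins that sit in the shadow of a frame's dominant peak could look like this (the rel_drop_db rule and all thresholds are hypothetical):

```python
import numpy as np

def frame_spectra(x, n_fft=1024, hop=512):
    """Magnitude spectra of overlapping windowed frames (a minimal STFT)."""
    win = np.hanning(n_fft)
    frames = [x[i : i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

def maskable_bins(mag_frame, rel_drop_db=12.0):
    """Bins in the 'shadow' of the frame's dominant peak: well below the local
    maximum, yet not so weak that changing them adds audible noise."""
    peak = mag_frame.max()
    floor = peak * 10 ** (-rel_drop_db / 20)
    return np.flatnonzero((mag_frame < floor) & (mag_frame > 0.01 * peak))

fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t) + 0.05 * np.sin(2 * np.pi * 3000 * t)
spectra = frame_spectra(x)
print(len(maskable_bins(spectra[0])) > 0)   # some bins qualify for embedding
```

Frames in which no bins qualify stay unmarked, which is consistent with the small share of unmarked blocks reported below.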

The method achieves a throughput of 50 bits/s, which is quite sufficient for ACR. Admittedly, with this approach some signal blocks remain unmarked, but, as our studies showed, their share is about 0.01%. As an example, Table 1 shows the results of comparative tests. The attacks were carried out with the SMFA utility (StirMark for Audio), version 1.3.2. The parameters of the compared algorithms were chosen to give as equal a transparency as possible; ODG was in the range of -0.5 to -0.1.

Table 1 - Comparison of digital watermarking algorithms by Bit Error Rate

| Type of attack           | [10] | [11] | [12] | Our solution |
|--------------------------|------|------|------|--------------|
| AddNoise (9 dB)          | 0.11 | 0.09 | 0.2  | 0.007        |
| AddDynNoise (10 dB)      | 0.51 | 0.4  | 0.51 | 0.010        |
| AddFFTNoise (12 dB)      | 0.06 | 0.01 | 0.13 | 0.008        |
| MP3 compression 128 kbps | 0.01 | 0.01 | 0.10 | 0.005        |
| MP3 compression 32 kbps  | 0.36 | 0.3  | 0.48 | 0.005        |
| AAC compression 32 kbps  | 0.36 | 0.3  | 0.46 | 0.005        |

Note: the SNR parameter is shown in brackets after the attack name; the AddFFTNoise attack was carried out with the parameter FFTSIZE = 128.

With BER < 1%, our method withstands low-pass filtering down to 4.5 kHz, high-pass filtering up to 1.9 kHz, changes in signal level (from 1% to 150%), and a reduction of the sampling rate down to 8 kHz.

The main test, in the context of the problem being solved, was of course the watermark's robustness to acoustic propagation. Tracks with the watermark were played on a TV and recorded on various mobile devices (LG-P705, Samsung GT-P7510, HTC Desire 601, etc.). Mono signals were recorded at a sampling rate of 44.1 kHz at various distances from the sound source.

Unlike the algorithms [10-12], our watermarks are recognized on mobile devices, but their robustness is still insufficient to speak of a full replacement for fingerprints. For example, the LG-P705 successfully recognized about 85% of requests at a distance of 40 cm, while the Samsung GT-P7510 recognized 80% only at distances up to 5 cm. In signals recorded at distances over 50 cm, the watermark is no longer detected.

Only with a watermark outside the audible frequency range did we manage to synchronize at distances of more than 1 m. For the LG-P705 and HTC Desire 601, at distances of 1.5 m the share of correctly detected watermarks was 80%. Simple amplitude modulation of harmonics with frequencies above 20 kHz was used to encode bits (Figure 7).
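Amplitude modulation of a near-ultrasonic carrier can be sketched as simple on-off keying (a hypothetical illustration: the carrier frequency, bit duration, amplitude levels and RMS threshold are all made-up values, not the article's parameters):

```python
import numpy as np

def encode_ultrasonic(bits, fs=44100, carrier=21000, bit_dur=0.05):
    """Amplitude-modulate a >20 kHz carrier: loud tone for 1, quiet for 0."""
    n = int(fs * bit_dur)
    t = np.arange(n) / fs
    tone = np.sin(2 * np.pi * carrier * t)
    return np.concatenate([(0.9 if b else 0.2) * tone for b in bits])

def decode_ultrasonic(sig, fs=44100, bit_dur=0.05, threshold=0.4):
    """Recover bits by thresholding the RMS amplitude of each bit interval."""
    n = int(fs * bit_dur)
    frames = sig[: len(sig) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return (rms > threshold).astype(int)

bits = np.array([1, 0, 1, 1, 0])
assert np.array_equal(decode_ultrasonic(encode_ultrasonic(bits)), bits)
```

A 21 kHz carrier still fits under the 22.05 kHz Nyquist limit of 44.1 kHz audio, but, as noted below, lossy codecs routinely discard this band.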



However, not all mobile devices record high frequencies equally well: the same Samsung GT-P7510 never recognized this type of watermark. But the main disadvantage of such watermarks is their inability to survive lossy compression, which severely limits their applicability.

Conclusion


Of course, much remained outside the scope of our experiments, but the experience made us seriously doubt the practical applicability of watermark-based ACR, at least as far as watermarks in the audible frequency range are concerned.

Most of the works on audio watermarking we found in our research limit themselves to synthetic SMFA attacks and an assessment of compression resistance. Only a few publications investigate the robustness of a watermark under acoustic propagation and microphone recording, and they touch on this issue rather superficially, without sufficient specifics.

Our solution is relatively simple, copes quite successfully with the synthetic tests, and withstands lossy compression. However, we were unable to simultaneously guarantee watermark transparency and robustness sufficient for Second Screen applications comparable to fingerprint technologies.

Bibliography


  1. Harleen Kaur "Blind Audio Watermarking schemes: A Literature Review"
  2. Mehdi Fallahpour "High capacity robust audio watermarking scheme based on fft and linear regression"
  3. Baiying Lei "A multipurpose audio watermarking algorithm with synchronization and encryption"
  4. Hong Oh Kim "Wavelet-based audio watermarking techniques: robustness and fast synchronization"
  5. Shaik Jameer "A scheme for digital audio watermarking using empirical decomposition mode with IFM"
  6. Alok Kumar Chowdhury "A robust audio watermarking"
  7. Hooman Nikmehr “A new approach to audio watermarking using discrete wavelet and cosine transforms”
  8. Vivekananda Bhat K "An adaptive audio watermarking based on the singular value decomposition in the wavelet domain"
  9. Brian Chen "Quantization Index Modulation: A Model of Provably Good Methods for Digital Watermarking and Information Embedding"
  10. Shijun Xiang "Audio watermarking robust against D/A and A/D conversions"
  11. Hong Oh Kim "Wavelet-based Audio Watermarking Techniques: Robustness and Fast Synchronization"
  12. Jong-Tzy Wang "Adaptive Wavelet Quantization Index Modulation Technique for Audio Watermarking"

Source: https://habr.com/ru/post/254379/
