
Some methods for finding fuzzy duplicates of video

There is a fairly wide range of tasks that require the analysis of audio-visual models of reality. This applies both to static images and to video.



Below is a brief overview of some existing methods for finding and identifying fuzzy duplicates of video, with their advantages and disadvantages. Based on a structural representation of the video, a combination of methods is then built.
The overview is very brief; for details, it is better to refer to the original sources.

The term “fuzzy duplicate” means an incomplete or partial coincidence of an object with another object of a similar class. Duplicates can be natural or artificial: natural duplicates are similar objects shot under similar conditions, while artificial fuzzy duplicates are derived from the same original.
Finding fuzzy duplicates can be useful for the optical navigation of unmanned aerial vehicles (unfortunately, not always for peaceful purposes), determining the character of a terrain, compiling video catalogs, grouping search engine snippets, filtering video ads, and searching for pirated video.



The problem of finding fuzzy (near-duplicate) videos (NDV) is closely related to the problems of video classification and video search. The NDV search task can be reduced to classification, which in turn can be reduced to annotation and video search: {NDV search} ⊂ {video classification} ⊂ {video search}. Nevertheless, these are independent tasks.



General algorithm


To search for NDV, a video is divided into segments, and keyframes are extracted from each segment. The characteristics of the keyframes are used to represent the entire video, and the similarity between two videos is calculated as the similarity of the sets of these characteristics.
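This pipeline can be sketched in a few lines of Python (my own pseudocode made runnable; `extract_keyframes` and `signature` are hypothetical helpers that each concrete method defines in its own way):

```python
import numpy as np

def video_similarity(video_a, video_b, extract_keyframes, signature):
    # extract_keyframes() and signature() are hypothetical helpers:
    # the first splits a video into segments and returns one keyframe
    # per segment, the second maps a frame to a normalized feature vector.
    sigs_a = [signature(f) for f in extract_keyframes(video_a)]
    sigs_b = [signature(f) for f in extract_keyframes(video_b)]
    # Similarity of the videos = similarity of their signature sets;
    # here the mean best-match cosine similarity, one possible choice.
    return float(np.mean([max(float(np.dot(a, b)) for b in sigs_b)
                          for a in sigs_a]))
```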

Approaches


NDV search methods form two categories: methods using global characteristics (GC) and methods using local characteristics (LC). The distinction between the categories is conditional; mixed techniques are used very often. GC methods extract frame-level signatures to model spatial, color, and temporal information, summarizing global statistics of low-level features. The similarity between videos is defined as the correspondence between sequences of signatures. Such methods are useful for finding “almost identical” videos and can detect minor edits in the spatio-temporal domain, but GC methods are ineffective against artificial NDV obtained by cosmetic editing. For such duplicates, methods using the low-level characteristics of a segment or frame are more useful. Typically, these methods are affected by changes in temporal order and by the insertion or deletion of frames. Compared to global methods, segment-level approaches are slower and more memory-hungry, although they are capable of identifying copies that have undergone substantial editing.

Local characteristics


Trajectory tracking

LC methods reduce the task of finding similar videos to the task of finding duplicate images. The main steps when comparing images are: detecting interest (singular) points; selecting neighborhoods of the interest points; constructing feature vectors; selecting image descriptors; and comparing the descriptors of a pair of images. In this work, interest points of a frame are detected (using the Harris detector) and their positions are tracked throughout the video, forming a set of point trajectories.
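As a rough illustration of this step, here is a minimal Python/OpenCV sketch (my own code, not the paper's implementation): Harris corners are detected in the first frame and tracked with Lucas-Kanade optical flow, yielding one trajectory per point.

```python
import cv2

def point_trajectories(path, max_points=100):
    """Detect Harris corners in the first frame and track them through
    the video with Lucas-Kanade optical flow, producing one trajectory
    per point (a sketch, not the paper's implementation)."""
    cap = cv2.VideoCapture(path)
    ok, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev, max_points, qualityLevel=0.01,
                                  minDistance=10, useHarrisDetector=True)
    trajectories = [[tuple(p.ravel())] for p in pts]
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        for traj, p, s in zip(trajectories, new_pts, status):
            if s:  # the point was found again in this frame
                traj.append(tuple(p.ravel()))
        prev, pts = gray, new_pts
    cap.release()
    return trajectories
```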



Trajectory matching is performed via fuzzy search, and the approach makes it easy to localize fuzzy duplicates of fragments. However, the method is expensive because of the interest point extraction, and the fact that point trajectories are sensitive to camera motion makes the algorithm suitable only for finding exact copies of a video.

Average value

The authors of Non-identical Duplicate Video Detection Using the SIFT Method detect interest points in keyframes and estimate the similarity of frames based on SIFT. The similarity of two frames is calculated as the arithmetic mean of the number of matched interest points. To determine the similarity of two videos, a full conformance score (PIC) is used: the mean similarity of keyframes across the whole video. Importantly, the mean is calculated not over all possible pairs of frames but only over some of them, which saves computing resources.
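A sketch of both pieces (my own code; the 0.75 ratio test is Lowe's standard filtering trick, and the sparse `step` sampling stands in for the paper's pair selection):

```python
import cv2

def frame_similarity(frame_a, frame_b, ratio=0.75):
    """Similarity of two frames as the share of matched SIFT keypoints
    (a sketch; the ratio test is Lowe's standard filtering trick)."""
    sift = cv2.SIFT_create()
    _, des_a = sift.detectAndCompute(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), None)
    _, des_b = sift.detectAndCompute(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), None)
    if des_a is None or des_b is None:
        return 0.0
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    good = sum(1 for pair in matches
               if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance)
    return good / max(len(des_a), len(des_b))

def pic(keyframes_a, keyframes_b, step=5):
    # PIC: the mean keyframe similarity over the whole video, computed
    # over a sparse sample of pairs rather than all possible pairs.
    pairs = list(zip(keyframes_a[::step], keyframes_b[::step]))
    return sum(frame_similarity(a, b) for a, b in pairs) / len(pairs)
```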

Longitudinal frames

The work is also interesting because frames are sliced along the time axis rather than across it, as in normal video. Such a slice extracts temporal information from the video and lets the same spatial comparison methods be applied to it as to ordinary frames: SIFT is applied to the two video slices and the PIC is calculated.
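Building such a "longitudinal frame" is straightforward; here is a sketch (my own, under the assumption that one fixed pixel column per frame is stacked along the time axis):

```python
import cv2
import numpy as np

def longitudinal_slice(path, column=None):
    """Stack one pixel column from every frame along the time axis.
    The result is an image whose horizontal axis is time, so SIFT and
    the PIC can be applied to it as to an ordinary frame (a sketch)."""
    cap = cv2.VideoCapture(path)
    cols = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        c = frame.shape[1] // 2 if column is None else column
        cols.append(frame[:, c])  # one column: height x channels
    cap.release()
    return np.stack(cols, axis=1)  # height x time x channels
```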



According to the experiments, the longitudinal-frame method performs worse than the usual one, especially when the video contains many sharp camera movements. The disadvantages of the overall approach include the use of SIFT: the experiments were conducted on low-resolution video (320 × 240), and with larger frames interest point extraction becomes very expensive. On the other hand, if the PIC is applied only to regular frames, the temporal information of the video is not taken into account.

Visual words

Methods using visual words are an improved version of the direct comparison of frame interest points. They are based on quantizing interest point descriptors into "words"; frames (and whole videos) are then compared via frequency dictionaries, as is done for texts. The INRIA-LEAR video copy detection system demonstrates excellent performance of this method: keyframes are represented by SIFT features, the features are quantized into visual words, and a visual signature is built from them. However, to apply visual words, the frequency dictionaries must be built in advance for a known subject area.
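A minimal bag-of-visual-words sketch (my own approximation of the general technique, not INRIA-LEAR's actual pipeline; scikit-learn's KMeans stands in for their quantizer):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words=1000):
    # Quantize SIFT descriptors into "visual words" with k-means;
    # the vocabulary has to be trained in advance on the subject domain.
    return KMeans(n_clusters=n_words).fit(np.vstack(all_descriptors))

def visual_signature(vocab, frame_descriptors):
    # A frame (or a whole video) becomes a word-frequency histogram
    # and is compared the same way as a bag-of-words text.
    words = vocab.predict(frame_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()
```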

Global characteristics


GC methods extract color, spatial, and temporal information from the video, represent it as a sequence of symbols, and then apply string matching methods.

Locality-sensitive hashing

The work Efficiently matching sets of features with random histograms uses locality-sensitive hashing (LSH) to map the color histogram of each keyframe to a binary vector. Frame characteristics are derived from local image features and represented as sets of points in feature space; with the help of LSH, the points are mapped to discrete values, and a frame histogram is built over this set of features. The histograms are then compared as ordinary sequences.
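The core trick can be sketched in a few lines (random-projection LSH, which is one common variant; the paper's exact hashing scheme may differ):

```python
import numpy as np

def lsh_signature(histogram, n_bits=64, seed=0):
    """Map a keyframe's color histogram to a binary vector with
    random-projection LSH (a sketch of one common LSH variant)."""
    rng = np.random.default_rng(seed)  # fixed seed: same planes for all frames
    planes = rng.standard_normal((n_bits, len(histogram)))
    return (planes @ np.asarray(histogram) > 0).astype(np.uint8)

def hamming(sig_a, sig_b):
    # The binary signatures are then compared as ordinary sequences.
    return int(np.count_nonzero(sig_a != sig_b))
```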



Experimental results confirmed the method's effectiveness. But, as noted in Large-scale near-duplicate Web video search: challenge and opportunity, the method suffers from potentially large memory consumption. Temporal information is not taken into account.

Ordinal signatures

Robust video signature based on ordinal measure uses ordinal signatures to model the relative intensity distribution in a frame. The distance between two fragments is measured via the temporal similarity of the signatures. The approach makes it possible to find fuzzy duplicates with different resolutions and frame rates and with minor spatial changes of frames. The advantage of the algorithm is the ability to work in real time. The disadvantages include instability to large insertions of extra frames. The method is poorly suited to searching for natural fuzzy duplicates, for example if the object was filmed under different lighting.
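An ordinal signature is easy to compute; here is a sketch (the 3 × 3 grid is my arbitrary choice). Because only the ranks of block intensities are kept, the signature survives rescaling and uniform brightness changes:

```python
import cv2
import numpy as np

def ordinal_signature(frame, grid=(3, 3)):
    """Ranks of the mean intensities of the frame's blocks: a signature
    that survives rescaling and uniform brightness changes (a sketch)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    gh, gw = grid
    means = [gray[i * h // gh:(i + 1) * h // gh,
                  j * w // gw:(j + 1) * w // gw].mean()
             for i in range(gh) for j in range(gw)]
    return np.argsort(np.argsort(means))  # ranks 0 .. gh*gw - 1
```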



Another comparison work describes a motion signature that captures the relative change in intensity over time. Color, motion, and ordinal signatures are compared there; experiments show that the ordinal signature is the most effective.

Video as DNA

The approach proposed in Near-duplicate Video Retrieval Based on Clustering by Multiple Sequence Alignment reduces the task of finding fuzzy video duplicates to a video classification task. The method is based on multiple sequence alignment (MSA); a similar approach is used in bioinformatics to align DNA sequences. The authors use the heuristic and iterative alignment methods proposed in MUSCLE: multiple sequence alignment. The method can be described as the following sequence of steps.

  1. A DNA representation is built for each video.
  2. Each video in the database is compared with every other one, and a distance matrix is constructed.
  3. A guide tree is built from the distance matrix using the neighbor-joining method.
  4. Based on the guide tree, progressive alignment of the videos is performed.
  5. The alignment results are used to form video clusters.
  6. When searching the database, the query is compared with the cluster centers. If the similarity between the query and a cluster center exceeds a certain threshold, all the clips in that cluster are considered fuzzy duplicates of the query.


To represent a video as a genome, keyframes are selected, converted to grayscale, and divided into 2 × 2 blocks. Only (2 ⋅ 2)! = 4! = 24 spatial ordinal patterns are possible, and each pattern can be assigned a letter of the Latin alphabet.
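A sketch of this encoding (my own; the block means are ranked and the resulting permutation indexes a 24-letter alphabet):

```python
from itertools import permutations
import numpy as np

# 4! = 24 possible orderings of the 2x2 block intensities,
# each mapped to its own Latin letter.
PATTERNS = {p: chr(ord("A") + i)
            for i, p in enumerate(permutations(range(4)))}

def frame_to_letter(gray):
    """Encode one grayscale keyframe as a 'DNA' letter (a sketch)."""
    h, w = gray.shape
    blocks = [gray[:h // 2, :w // 2].mean(), gray[:h // 2, w // 2:].mean(),
              gray[h // 2:, :w // 2].mean(), gray[h // 2:, w // 2:].mean()]
    ranks = tuple(int(r) for r in np.argsort(np.argsort(blocks)))
    return PATTERNS[ranks]

def video_dna(gray_keyframes):
    # The whole video becomes a string over a 24-letter alphabet.
    return "".join(frame_to_letter(f) for f in gray_keyframes)
```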



The distance matrix is constructed using a sliding window of size n, so the video comparison is based on n-grams: two DNA representations of a video are compared only within a window. The window size and the step (how far the window shifts after each comparison) are set as parameters.
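A possible reading of this step in code (a sketch; scoring the windowed n-grams by Jaccard overlap is my assumption, the paper may score them differently):

```python
def ngram_distance(dna_a, dna_b, n=8, step=4):
    """Distance between two video 'DNA' strings based on shared n-grams
    inside a sliding window. The window size n and the step are the
    parameters mentioned above; Jaccard scoring is my assumption."""
    grams_a = {dna_a[i:i + n] for i in range(0, len(dna_a) - n + 1, step)}
    grams_b = {dna_b[i:i + n] for i in range(0, len(dna_b) - n + 1, step)}
    if not grams_a or not grams_b:
        return 1.0
    return 1.0 - len(grams_a & grams_b) / len(grams_a | grams_b)
```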

The guide tree is constructed by a greedy heuristic neighbor-joining algorithm. From the distance matrix, the two closest DNA representations are selected and combined into a single tree node as a video profile. In this model, each video's DNA is a leaf of the tree, and profiles appear in the internal nodes. Profiles are also compared with each other and merged in the same way as leaves. The result is the guide tree.

Below is a guide tree.


Advantages of the approach: it has high precision and recall and does not require special computational costs. The drawback is that the method does not take into account the temporal information of the video.

Scene change

One work introduced a suffix-array-based signature built on detecting the boundaries of scenes (shots).
Three different concepts should be distinguished here. A frame (photographic frame) is a single static picture. A scene is a set of frames connected by unity of place and time. A shot (cinematic frame) is a set of frames connected by unity of shooting; a scene may include several shots. In the literature, a shot is often called a "scene". Below we consider shots but, following that convention, will also call them "scenes".

The video contains more information than just a series of frames. Events in a video uniquely define its temporal structure, which can be represented by a set of key frames.

Here a key frame means not an I-frame, but simply some distinctive video frame; and an event means not a semantic element of the plot, but merely a change in frame content. Keyframe detection is based on finding differences between brightness histograms. After the keyframes are extracted, the distance between each keyframe and the previous one is calculated, and all the distances are recorded as a one-dimensional sequence. This sequence of scene lengths is the signature of the video. Experiments show that two unrelated clips do not share a long run of consecutive keyframes with identical scene lengths. The paper proposes an efficient way to match video signatures using a suffix array: the problem reduces to finding common substrings. The method works poorly when the video contains a lot of camera or object motion, or smooth transitions between scenes. Another disadvantage is the inability to work in real time: the entire video is needed for comparison.
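To make the idea concrete, here is a sketch (my own; a quadratic dynamic-programming search for the longest common run, whereas the paper uses suffix arrays for efficiency):

```python
def scene_length_signature(keyframe_positions):
    # The signature is the sequence of distances between consecutive
    # keyframes, i.e. the scene lengths.
    return [b - a for a, b in zip(keyframe_positions, keyframe_positions[1:])]

def longest_common_run(sig_a, sig_b):
    """Length of the longest run of identical consecutive scene lengths
    shared by two signatures. The paper finds common substrings with a
    suffix array; this quadratic DP is just an easier-to-read sketch."""
    best = 0
    dp = [[0] * (len(sig_b) + 1) for _ in range(len(sig_a) + 1)]
    for i, x in enumerate(sig_a, 1):
        for j, y in enumerate(sig_b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best
```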

The paper Duplicate search algorithm in a database of video sequences based on comparing scene change hierarchies proposes an algorithm for comparing scene (shot) trees. A binary tree is built from the detected scene changes; subtrees correspond to video fragments, and the root vertex of a subtree stores the value and position of the dominant scene change in the current fragment. During comparison, in one tree (the longer video) the vertices closest in value to the root of the second tree are found, and their coordinates are compared; the procedure is performed recursively for the remaining scene changes. The advantages of the method are its resistance to most ways of creating artificial duplicates (since only temporal information is used), the low complexity of comparing two clips, and the possibility of building hierarchical film indexes, which allows a mismatch to be caught at the early stages of verification. Like the previous approach, its disadvantage is that the characteristics of the scenes themselves are not taken into account; moreover, determining fuzzy duplicates requires having the query video in full, which is not always possible.

Conclusion


A variety of methods are used to search for NDV. At the moment, GC methods seem the most promising: they solve the problem approximately, without special computational costs. However, LC methods may be required for refinement. As shown in Large-scale near-duplicate Web video search: challenge and opportunity, combined approaches give higher accuracy than each of the methods separately.

Combination


Of particular interest are works that build a structural model of the video. A video can be viewed as a sequence of facts developing in time; in different videos, both the facts and their order may differ. The properties of the facts form the spatial characteristic of the video, while the duration and order of the facts form the temporal one. The easiest way to extract facts from a video is to use scene (shot) change points.

It is important to note that time may flow differently in two different videos, so we propose to use the ratios of scene lengths to the lengths of adjacent scenes. The relative scene lengths of two fuzzy duplicates will rarely coincide exactly, due among other things to scene boundary detection errors; sequence alignment algorithms can be applied to deal with this. But this way we compare only the order of the facts. To compare the facts themselves, internal characteristics of the scenes are required, for example characteristics of the initial and final frames; here it is convenient to borrow visual words from the LC methods. This gives us a scene descriptor.

Formally, a scene (in the sense of a shot, a cinematic frame) is a set of photographic frames within a time interval whose frames differ significantly from the frames of the neighboring intervals. The picture below shows the first frames of some video scenes.

The first frames of some video scenes


If the original video is compressed with different codecs, we obtain fuzzy duplicates of that video. Detecting the scenes in each file, we can see that the scene change points of the two files do not match.

To solve this problem, it is proposed to use relative scene lengths. The relative length of a scene is calculated as a vector of ratios of the scene's absolute length to the absolute lengths of the other scenes in the video. In practical problems, it is more convenient to calculate the ratios only for the three previous scenes rather than for all of them. This also works when the whole video is unavailable and we are dealing with a stream, for example in real-time tasks.
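A sketch of the computation (my own; three previous scenes, as suggested above):

```python
def relative_lengths(scene_lengths, window=3):
    """For each scene, the ratios of its length to the lengths of the
    three previous scenes (a sketch). Unlike absolute lengths, ratios
    survive re-encoding with a different frame rate."""
    result = []
    for i, current in enumerate(scene_lengths):
        previous = scene_lengths[max(0, i - window):i]
        result.append([current / p for p in previous])
    return result
```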

The relative scene lengths of two fuzzy duplicates will rarely coincide exactly. Moreover, many scenes may simply not be recognized, due, among other things, to scene boundary detection errors.

If the relative length of a scene in one video differs from the length of the corresponding scene in the other video by no more than a factor of two, and all the previous scenes are aligned, then the current pair of scenes expresses the same phenomenon, provided that the two videos are fuzzy duplicates of each other (the Gale-Church hypothesis). A similar approach is used in mathematical linguistics to align parallel text corpora. The smaller the difference between the relative lengths, the more likely the scenes are similar. If the lengths differ by more than a factor of two, the length of the shorter scene is added to the length of the next scene of the same video, and the combined scene is treated as one. When the relative lengths of the scenes coincide, a comparison of the internal properties of the scenes is applied.
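A greedy sketch of this alignment rule (my own code; real alignment could use dynamic programming, as the Gale-Church algorithm does for texts):

```python
def align_scenes(lengths_a, lengths_b, max_ratio=2.0):
    """Greedy scene alignment in the spirit of the Gale-Church
    hypothesis (a sketch): scenes whose lengths differ by at most a
    factor of two are paired; otherwise the shorter scene is merged
    into the next scene of the same video."""
    a, b = list(lengths_a), list(lengths_b)
    i = j = 0
    pairs = []
    while i < len(a) and j < len(b):
        if max(a[i], b[j]) <= max_ratio * min(a[i], b[j]):
            pairs.append((i, j))  # candidate pair: compare frame features next
            i, j = i + 1, j + 1
        elif a[i] < b[j]:
            if i + 1 == len(a):
                break
            a[i + 1] += a[i]  # merge the short scene into the next one
            i += 1
        else:
            if j + 1 == len(b):
                break
            b[j + 1] += b[j]
            j += 1
    return pairs
```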

Thus, the proposed scene descriptor can be described formally: it consists of a vector of ratios of the scene's length to the lengths of the other scenes, plus the characteristics of the scene's initial and final frames. It is convenient to store it together with the associations to neighboring scenes (the three previous ones), taking the Gale-Church hypothesis into account.

So what is next …


Thanks to this descriptor and a few technical tricks (semantic hashing), a fairly straightforward algorithm was built.



But its practical implementation is still in its infancy. Some results were obtained*, but for external reasons the work was temporarily suspended.

I would be glad to receive any comments and additions.

UPD 1:

Some material additions can be found in the presentation: Fuzzy Duplicate Video Search Elements



Thanks


Thanks to goodok for correcting grammatical errors.

Source: https://habr.com/ru/post/230489/

