With this post we continue our series of articles on how we built our porn filter. This time we will talk about classifying pornographic content by the characteristic movements in the frame.
It all started as a joke in conversation. After all, pornographic movements are hard to classify: they vary too much to have anything obvious in common. But we tried it anyway, were completely satisfied with the result, and the motion detector took its place in our general classifier of pornographic video content.
Classification, once again
The principle behind most machine learning systems is quite simple. To sort objects into classes A and B, we describe each object as a combination of features that can be measured in some way. Next, we statistically derive a formula that, when we substitute the feature values of a particular object into it, gives a value > 0 for objects of class A and a value < 0 for objects of class B.
For example, suppose we want to automatically distinguish ... let's say, red caviar from black caviar. The features are the color and size of the eggs. We pick several eggs of black and red caviar, measure them, and plot the results on a chart.
The dotted line separates the two classes of objects well. We see that red caviar is larger and lighter than black. Let's write down a formula for this line, for example:
z = size * c1 - color * c2 + c0,
where:
c1 and c2 are coefficients fitted statistically to the observations,
c0 is a constant.
Then, given an unknown egg, we substitute its size and color into the formula; if z > 0 we say the caviar is red, and if z <= 0 we say it is black.
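The decision rule above can be sketched in a few lines of Python. The coefficient values here are invented for illustration only (they are not fitted to any data), and we assume that size is measured in millimeters and that color is a darkness value from 0 (light) to 1 (dark), so that larger, lighter eggs get a positive z.

```python
# Hypothetical coefficients, chosen by hand for this illustration.
C1 = 1.0   # weight of the size feature (assumed: millimeters)
C2 = 4.0   # weight of the color feature (assumed: darkness, 0..1)
C0 = -2.0  # constant term


def classify_egg(size, color, c1=C1, c2=C2, c0=C0):
    """Apply z = size * c1 - color * c2 + c0 and pick a class by the sign of z."""
    z = size * c1 - color * c2 + c0
    return "red" if z > 0 else "black"


# A large, light egg and a small, dark egg (made-up measurements).
print(classify_egg(size=5.0, color=0.3))  # z = 1.8
print(classify_egg(size=3.0, color=0.9))  # z = -2.6
```

In a real system c0, c1 and c2 would come from training on measured samples, not from guesswork.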
All of this is, of course, widely known, and there are a huge number of classification algorithms; we used several of them when building our detectors.
So, let us move on to the first feature by which we will classify pornographic fragments: the nature of the movement. Probably no one will deny that scenes depicting sexual intercourse are characterized by rhythmic, repetitive movements of objects in the frame. Those are what we will look for.
Motion analysis
One of the best-known and most commonly used methods of motion analysis is optical flow; an implementation is available, for example, in the well-known OpenCV library. The principle resembles the search for motion vectors in MPEG video encoding: fragments of the image are selected in one frame and searched for in the next (for example, by minimizing the sum of absolute differences, SAD). The movement of objects corresponds to the displacement of image fragments between frames.
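To make the block-matching idea concrete, here is a minimal sketch in plain Python. It only illustrates a SAD search for a single block; it is not how OpenCV computes optical flow, and the function names, block size and search radius are our own choices for the example.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(frame_a[ay + dy][ax + dx] - frame_b[by + dy][bx + dx])
    return total


def find_motion_vector(prev, curr, x, y, size=2, radius=2):
    """Find the shift (dx, dy) at which the block from prev best matches curr."""
    height, width = len(curr), len(curr[0])
    best_cost, best_vec = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            nx, ny = x + dx, y + dy
            # Skip candidate positions that fall outside the frame.
            if nx < 0 or ny < 0 or nx + size > width or ny + size > height:
                continue
            cost = sad(prev, curr, x, y, nx, ny, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_vec = cost, (dx, dy)
    return best_vec


def make_frame(width, height, bx, by, size=2):
    """Synthetic test frame: black background with one bright square."""
    frame = [[0] * width for _ in range(height)]
    for dy in range(size):
        for dx in range(size):
            frame[by + dy][bx + dx] = 255
    return frame


prev_frame = make_frame(6, 6, bx=1, by=1)
curr_frame = make_frame(6, 6, bx=3, by=2)
print(find_motion_vector(prev_frame, curr_frame, x=1, y=1))  # (2, 1)
```

On synthetic frames like these the match is unambiguous. On smooth, evenly lit surfaces many candidate positions give almost the same SAD, which is one reason the estimated direction can come out wrong.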
However, when we tried to implement optical flow ourselves, we found that:
when the image contains rounded, softly lit shapes (a naked body), optical flow often determines the direction of movement incorrectly;
the results, that is, the motion vectors, are difficult to classify with machine learning methods;
even with our own optimized implementation of optical flow instead of OpenCV's, the computation time turned out to be unacceptably long;
besides, we are from Inventos, and our logo is a bicycle.
So we decided to go our own way.
How we do it
To determine the direction of motion, we used spatiotemporal filters built on convolving and summing the signal. This approach is applied, for example, in this development. The method is only beginning to gain wide acceptance, and we are among the first researchers to use it in practice. In particular, we had the opportunity to talk with people working on the project mentioned above. We take this opportunity to thank them for their detailed advice and help with the implementation.
Let us explain convolution with a simple two-dimensional example. Suppose you apply the “detect edges” operation to an image in a graphics editor.
The graphics editor creates a 3x3 mask, places it over the image at each pixel in turn, multiplies the corresponding numbers, and sums the products. The result is a single number; roughly speaking, the more the signal under the mask resembles the mask itself, the larger this number is.
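The multiply-and-sum operation just described can be sketched as follows. The mask values are a common textbook edge-detection kernel, picked here as an assumption; a particular graphics editor may use different numbers. Strictly speaking, the code does not flip the mask, so it computes a correlation, which gives the same result as convolution for a symmetric mask like this one.

```python
# A common 3x3 edge-detection mask (an assumption for this example).
EDGE_MASK = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]


def apply_mask(image, mask):
    """Slide the mask over the image; at each position multiply and sum.

    Only positions where the mask fits entirely inside the image are
    computed, so the result is smaller than the input.
    """
    mask_h, mask_w = len(mask), len(mask[0])
    out_h = len(image) - mask_h + 1
    out_w = len(image[0]) - mask_w + 1
    result = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            total = 0
            for j in range(mask_h):
                for i in range(mask_w):
                    total += image[y + j][x + i] * mask[j][i]
            row.append(total)
        result.append(row)
    return result


flat = [[10] * 4 for _ in range(4)]         # uniform area: no edges
dot = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]     # one bright pixel
print(apply_mask(flat, EDGE_MASK))  # [[0, 0], [0, 0]]
print(apply_mask(dot, EDGE_MASK))   # [[8]]
```

The mask sums to zero, so a uniform area gives no response, while a pixel that differs from its neighbors gives a large one.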
The filters we used work in a similar way, but in three-dimensional space: the two coordinates of a pixel in the frame, plus time (the frame number) as the third dimension.
Here is what we do:
We collect the frames of the video into a “stack”.
We apply a three-dimensional convolution to the resulting data structure. Here we can construct a mask that produces large numbers when there is movement in a preselected direction at a preselected speed.
By applying several such masks, we can estimate the amount of movement in each frame, at each pixel, in each of several preselected directions.
By summing the convolution results over all pixels, we can characterize the motion in the whole frame at any moment both qualitatively (by direction) and quantitatively.
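The steps above can be sketched on a toy example. This is not our production filter: the masks below are tiny hand-made ones (3 frames, 1 row, 3 columns) tuned to a shift of one pixel per frame to the right or to the left, and keeping only positive responses before summing is an assumption made for this sketch, since the responses of a zero-sum mask would otherwise largely cancel out.

```python
def make_direction_mask(step):
    """Toy mask: 3 frames x 1 row x 3 columns.

    step > 0 responds to motion of one pixel per frame to the right,
    otherwise to the left. Each time slice sums to zero, so a static
    picture produces no response.
    """
    mask = []
    for t in range(3):
        peak = t if step > 0 else 2 - t
        mask.append([[1.0 if i == peak else -0.5 for i in range(3)]])
    return mask


def convolve3d(stack, mask):
    """Multiply-and-sum the mask against stack[t][y][x] at every valid position."""
    mt, mh, mw = len(mask), len(mask[0]), len(mask[0][0])
    out = []
    for t in range(len(stack) - mt + 1):
        plane = []
        for y in range(len(stack[0]) - mh + 1):
            row = []
            for x in range(len(stack[0][0]) - mw + 1):
                total = 0.0
                for k in range(mt):
                    for j in range(mh):
                        for i in range(mw):
                            total += stack[t + k][y + j][x + i] * mask[k][j][i]
                row.append(total)
            plane.append(row)
        out.append(plane)
    return out


def motion_energy(stack, mask):
    """Sum the positive responses over all pixels (rectification is our assumption)."""
    response = convolve3d(stack, mask)
    return sum(max(v, 0.0) for plane in response for row in plane for v in row)


# A stack of three 1x6 frames with a bright dot moving right from x = 1.
stack = [[[1.0 if x == 1 + t else 0.0 for x in range(6)]] for t in range(3)]
print(motion_energy(stack, make_direction_mask(+1)))  # 3.0
print(motion_energy(stack, make_direction_mask(-1)))  # 1.0
```

Computing such a number for every direction at every moment in time gives one curve per direction, which is what the classifier works with later.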
The figure below shows the output of our motion filter on a short excerpt from a porn movie. We host the video itself on our own site to avoid being banned on video hosting sites. The picture in the center is the current frame of the clip. The pictures around it are the results of filtering the frame sequence with our filters; each corresponds to one of the twelve selected directions of motion. The green curves plot the amount of movement in each direction over several dozen frames.
It is noticeable that the movements typical of pornographic clips produce a characteristic, easily recognizable curve. From this curve you can also estimate the number of participants in the video and the direction and speed of their movements.
In the example above, the video contains two participants moving in opposite directions. The large periodic bursts on the green curve correspond to the man's strong movements to the left. The small bursts correspond to the woman's answering movements and the man's weaker return movements.
When there is only one participant in the video (which is common in video chats), the curve has no secondary bursts, somewhat resembles a sinusoid, and is easy to analyze. With three or more participants, the situation becomes significantly more complicated. You will surely agree that some of the partners' rarer actions can be neither modeled mathematically nor even described in words.
The speed of movement can be used to estimate the timing of the act. In our data it is noticeable that toward the end of the clip the amplitude of the movements increases and the period shortens. (This estimate is interesting but unreliable, and we do not use it in practice.)
Movement classification
Once we have the curves shown above, one last step remains: teaching the computer to distinguish curves that correspond to pornographic material from all other curves. We tried two methods:
Bayesian spam filters;
support vector machines (SVM).
Existing text spam filters based on Bayes' method can be used to evaluate the curves. Each curve can be turned into a sequence of “words” as follows:
Choose a word length. Each word will correspond to, say, 3 seconds of video.
Find the average amplitude of the curve over each time interval.
For each frame, append a “letter” to the “word”: 1 if the amplitude of the curve is above the average, and 0 if it is below.
Thus our video turns into a set of “words”. Words similar to 0110011, say, are often seen in porn movies.
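A minimal sketch of this encoding is shown below. The function name and the sample amplitudes are invented for illustration. We also assume that an incomplete interval at the end of the curve is dropped and that an amplitude exactly equal to the average gets the letter 0; the text above does not fix either detail.

```python
def curve_to_words(amplitudes, frames_per_word):
    """Turn a motion curve into binary "words", one word per time interval."""
    words = []
    last_start = len(amplitudes) - frames_per_word
    for start in range(0, last_start + 1, frames_per_word):
        window = amplitudes[start:start + frames_per_word]
        mean = sum(window) / len(window)
        # One letter per frame: 1 above the interval average, 0 otherwise.
        words.append("".join("1" if a > mean else "0" for a in window))
    return words


# Made-up curve, 7 frames per word to keep the example short.
# With 3-second words at 25 frames per second a word would have 75 letters.
print(curve_to_words([0, 5, 5, 0, 0, 5, 5], frames_per_word=7))  # ['0110011']
```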
Once the video has been turned into a set of such “words”, training an ordinary spam filter to filter pornography is just a matter of technique.
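For readers who have not looked inside a spam filter, here is a rough sketch of a naive Bayes classifier over such words. It is a simplified stand-in, not the filter we actually used; the class name, the labels and the training words are all made up for the example, and add-one smoothing is our choice for handling unseen words.

```python
import math
from collections import Counter

LABELS = ("porn", "other")


class WordBayes:
    """Tiny naive Bayes classifier with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.video_counts = Counter()

    def train(self, words, label):
        """Record the words of one labeled video."""
        self.word_counts[label].update(words)
        self.video_counts[label] += 1

    def score(self, words, label):
        """Log of prior times word likelihoods for the given label."""
        counts = self.word_counts[label]
        vocabulary = set()
        for other in LABELS:
            vocabulary |= set(self.word_counts[other])
        total = sum(counts.values())
        # Assumes every label has been trained on at least one video.
        log_p = math.log(self.video_counts[label] / sum(self.video_counts.values()))
        for word in words:
            log_p += math.log((counts[word] + 1) / (total + len(vocabulary) + 1))
        return log_p

    def classify(self, words):
        return max(LABELS, key=lambda label: self.score(words, label))


# Invented training data: one video per class.
model = WordBayes()
model.train(["0110011", "0110011", "1100110"], "porn")
model.train(["0000001", "1111111", "0001000"], "other")
print(model.classify(["0110011", "1100110"]))  # porn
```

A real filter would be trained on the words of many labeled videos; the few words here only show the mechanics.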
We also tried SVM, but because of the specific nature of the source data we settled on spam filters.
Classification accuracy
No automatic classification system gives 100% correct results. Using motion estimation alone, we achieved a classification accuracy of 78.3%.
Is that a lot or a little? On the one hand, it is not much; the error rate is still quite high. But a few points are worth noting:
This figure is for classifying whole videos. For testing we used videos that users upload to video hosting sites. If we look at the accuracy of classifying pornographic scenes (that is, fragments in which sexual intercourse is present), it was above 95%.
The motion detector complements other pornography detectors well (by color, or by objects in the frame), since it has little in common with them.
Of course, the work was not without its curiosities. Here is an example of a clip that the motion detector rates as strongly pornographic. We immediately named it “Mechanic having sex with a gimbal”.
The page that shows the motion detector's output also has more examples of clips on which the detector is mistaken.