
Accelerating the Viola-Jones Method

Recently the Viola-Jones method, which was for a long time the primary method for detecting objects in an image, has been giving ground to newer and more advanced algorithms. Nevertheless, the method remains relevant today.

Yes, a cascade classifier based on Haar features (the Viola-Jones method) is slower than an LBP cascade classifier. It is less accurate than a detector based on HOG features, and even more so than a detector based on convolutional neural networks. And yet it has a certain niche: when higher accuracy is required than an LBP cascade provides, but the more accurate detectors are too slow. An equally important factor is that a large number of trained cascades already exist for the Haar cascade classifier, including in the standard distribution of the OpenCV library. The speed of this algorithm therefore matters a great deal, which is what prompted the author, in his time, to undertake its optimization.

[Image: photo of Lena]

And what article about face detection could do without a photo of Lena?

Ways to increase performance


There are quite a few detailed descriptions of how the Viola-Jones method works, including on Habrahabr: source 1, source 2. I will therefore not dwell on its description; interested readers can consult the sources above. Instead, I would like to say a few words about the possible ways the Viola-Jones method can be accelerated.
  1. The Viola-Jones method is not a rigid, fixed algorithm: its cascades are produced by training, so its speed can vary greatly depending on the complexity and size of the object we want to detect. The simplest way to speed it up is to set a larger minimum object size or increase the step of the image pyramid. However, these methods have side effects in the form of reduced accuracy or an increased number of false positives. Retraining the cascades can sometimes yield a very significant gain in both accuracy and speed, but it is a rather costly process that requires collecting a large training sample, and the training itself is far from fast.
  2. A fairly obvious way to accelerate is to process different parts of the image in parallel threads. This is already implemented in OpenCV, so nothing needs to be done here.
  3. Use a graphics accelerator for parallel processing. This, too, is already implemented in OpenCV. The downside is that a graphics accelerator must be present.
  4. Use the processor's SIMD vector instructions to accelerate the algorithm. Vector instructions exist in almost all modern processors; unfortunately, they differ between processors, so separate implementations are needed for each. In addition, using vector instructions effectively requires a significant restructuring of the algorithm. This approach is the subject of this article.

Original (scalar) version of the algorithm


If you look at the source code of this algorithm in OpenCV, you can see numerous attempts to speed it up with various SIMD instruction sets (SSE, AVX, NEON). I have no precise numerical estimates of how successful these attempts were. However, judging by the source code, the acceleration was hardly significant: either the optimizations do not actually use vector instructions (only their scalar counterparts), or the data is loaded into vector registers element by element, which is fatal for the overall performance of the algorithm.

So let's consider the scalar version of the Haar cascade classifier algorithm (a simplified version is given below) for a given scale:

    void HaarDetect(const HaarCascade & hid, const Image & mask, const Rect & rect, int step, Image & dst)
    {
        for (ptrdiff_t row = rect.top; row < rect.bottom; row += step)
        {
            size_t p_offset = row * hid.sum.stride / sizeof(uint32_t);
            size_t pq_offset = row * hid.sqsum.stride / sizeof(uint32_t);
            for (ptrdiff_t col = rect.left; col < rect.right; col += step)
            {
                if (mask.At<uint32_t>(col, row) == 0)
                    continue;
                float norm = Norm(hid, pq_offset + col);
                if (Detect(hid, p_offset + col, 0, norm) > 0)
                    dst.At<uint32_t>(col, row) = 1;
            }
        }
    }

Here the image is scanned over all of its points (at small scales, for speed, the image is usually scanned with a step of 2). For each point a normalization factor is calculated, which depends on the brightness and contrast of the image in the neighborhood of that point:

    uint32_t Sum(uint32_t * const ptr[4], size_t offset)
    {
        return ptr[0][offset] - ptr[1][offset] - ptr[2][offset] + ptr[3][offset];
    }

    float Norm(const HaarCascade & hid, size_t offset)
    {
        float sum = float(Sum(hid.p, offset));
        float sqsum = float(Sum(hid.pq, offset));
        float q = sqsum * hid.windowArea - sum * sum;
        return q < 0.0f ? 1.0f : sqrtf(q);
    }
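To make the normalization concrete: Norm computes sqrt(windowArea * Σx² − (Σx)²) over the detection window, which is simply the window size multiplied by the standard deviation of its pixels. Below is a hypothetical scalar sketch (not the library code; it works on raw window pixels instead of integral images) that computes the same quantity:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative only: compute the Viola-Jones normalization factor for a
// window of pixel values directly. The real code obtains sum and sqsum in
// O(1) from the integral and squared-integral images.
float WindowNorm(const std::vector<uint8_t> & window)
{
    float sum = 0.0f, sqsum = 0.0f;
    for (uint8_t v : window)
    {
        sum += v;
        sqsum += float(v) * float(v);
    }
    float q = sqsum * window.size() - sum * sum; // N^2 * variance of the window
    return q < 0.0f ? 1.0f : std::sqrt(q);       // degenerate windows get norm = 1
}
```

Dividing the feature thresholds by this factor is what makes the detector tolerant to changes in illumination and contrast.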

And also the classification itself:

    float WeightedSum(const WeightedRect & rect, size_t offset)
    {
        int sum = rect.p0[offset] - rect.p1[offset] - rect.p2[offset] + rect.p3[offset];
        return rect.weight * sum;
    }

    int Detect(const HaarCascade & hid, size_t offset, int startStage, float norm)
    {
        const HaarCascade::Stage * stages = hid.stages.data();
        const HaarCascade::Node * node = hid.nodes.data() + stages[startStage].first;
        const float * leaves = hid.leaves.data() + stages[startStage].first * 2;
        for (int i = startStage, n = (int)hid.stages.size(); i < n; ++i)
        {
            const HaarCascade::Stage & stage = stages[i];
            const HaarCascade::Node * end = node + stage.ntrees;
            float stageSum = 0.0;
            for (; node < end; ++node, leaves += 2)
            {
                const HaarCascade::Feature & feature = hid.features[node->featureIdx];
                float sum = WeightedSum(feature.rect[0], offset) + WeightedSum(feature.rect[1], offset);
                if (feature.rect[2].p0)
                    sum += WeightedSum(feature.rect[2], offset);
                stageSum += leaves[sum >= node->threshold * norm];
            }
            if (stageSum < stage.threshold)
                return -i;
        }
        return 1;
    }

Here the Haar features are calculated using pre-computed integral images.
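For readers unfamiliar with integral images, here is a minimal illustrative sketch (the struct and names are mine, not from OpenCV or Simd) of how one is built and why any rectangle sum then costs only four reads - the same p0 - p1 - p2 + p3 pattern used by Sum and WeightedSum above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// integral(y, x) holds the sum of all source pixels above and to the left,
// so the sum over any rectangle is four lookups regardless of its size.
struct Integral
{
    size_t width;               // (w + 1); first row and column stay zero
    std::vector<uint32_t> data; // (w + 1) x (h + 1) table

    Integral(const std::vector<uint8_t> & src, size_t w, size_t h)
        : width(w + 1), data((w + 1) * (h + 1), 0)
    {
        for (size_t y = 0; y < h; ++y)
            for (size_t x = 0; x < w; ++x)
                data[(y + 1) * width + (x + 1)] = src[y * w + x]
                    + data[y * width + (x + 1)]  // sum above
                    + data[(y + 1) * width + x]  // sum to the left
                    - data[y * width + x];       // corner counted twice
    }

    // Sum of pixels in [left, left + w) x [top, top + h).
    uint32_t RectSum(size_t left, size_t top, size_t w, size_t h) const
    {
        return data[(top + h) * width + (left + w)]
             - data[top * width + (left + w)]
             - data[(top + h) * width + left]
             + data[top * width + left];
    }
};
```

This O(1) rectangle sum is exactly why a Haar feature of any size costs the same handful of memory accesses.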

Algorithm vectorization methods


Analysis of the above algorithm shows that the main computational cost lies in calculating the weighted integral sums (the WeightedSum function), so these must be vectorized first. There are only two ways to do this:


Feature-wise vectorization


At first glance, vectorizing the computation of the Haar features looks optimal (this is, incidentally, how it was done in OpenCV). There is one catch, however: the data over which the calculations must run are scattered randomly in memory. They must first be collected with scalar read operations, and the results then scattered back with scalar write operations. You can of course use gather-scatter operations (they exist in AVX2 and AVX-512), but their speed is comparable to, and sometimes worse than, plain scalar code.
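To illustrate the problem, here is the access pattern that feature-wise vectorization forces (a hypothetical helper, not code from the library): the four corner addresses of a Haar rectangle are unrelated to one another, so filling a short vector degenerates into independent scalar loads - essentially what a hardware gather such as _mm256_i32gather_epi32 does internally, which is why it wins little here.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sketch: gathering the four corner values of one Haar
// rectangle from an integral image. Each load may touch a different cache
// line, and no two can be merged into a single wide vector load.
inline void GatherCorners(const uint32_t * integral,
                          const size_t offsets[4], uint32_t dst[4])
{
    for (int i = 0; i < 4; ++i)
        dst[i] = integral[offsets[i]]; // four independent scalar reads
}
```

Point-wise vectorization, discussed next, avoids this by making the four loads of neighboring points contiguous instead.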

Point-wise vectorization


Vectorization over image points is less obvious (in effect, we vectorize almost along the outermost loop). However, the integral-image data for neighboring points lie side by side in memory, which allows them to be loaded efficiently with vector operations. In addition, the method scales easily to different SIMD vector lengths. This is where the advantages end, and the list of problems begins:

  1. The efficiency of the Viola-Jones algorithm rests on rejecting obviously wrong points at each stage. The fraction of points rejected depends on the training parameters and is 50% per stage by default. If we process a SIMD vector of several points, we must keep evaluating stages until all of its elements have been rejected, which reduces the method's effectiveness at later stages. Fortunately, the stages at which neighboring points are rejected correlate quite strongly. Moreover, if only one point in the vector remains to be checked, we can switch to the scalar version of the algorithm, which is more efficient in that case.
  2. To optimize for small scales, the original OpenCV algorithm checks points with a step of 2. This speeds up the scalar code but reduces the efficiency of vector instructions, since half of the computation in the first stage is now wasted. Fortunately, this problem has a rather elegant solution, described in more detail below.

SIMD version of the algorithm


In this article I give a simplified version of the algorithm for SSE4.1; the versions for AVX2, AVX-512 and NEON use the same approach.

First, consider the case when we scan all points of the image without gaps:

    void DetectionHaarDetect32fp(const HidHaarCascade & hid, const Image & mask, const Rect & rect, Image & dst)
    {
        size_t rightAligned = rect.left + Simd::AlignLo(rect.Width(), 4);
        for (ptrdiff_t row = rect.top; row < rect.bottom; row += 1)
        {
            size_t p_offset = row * hid.sum.stride / sizeof(uint32_t);
            size_t pq_offset = row * hid.sqsum.stride / sizeof(uint32_t);
            size_t col = rect.left;
            for (; col < rightAligned; col += 4)
            {
                __m128i result = _mm_loadu_si128((__m128i*)(mask.Row<uint32_t>(row) + col));
                if (_mm_testz_si128(result, _mm_set1_epi32(1)))
                    continue;
                __m128 norm = Norm32fp(hid, pq_offset + col);
                Detect32f(hid, p_offset + col, norm, result);
                _mm_storeu_si128((__m128i*)(dst.Row<uint32_t>(row) + col), result);
            }
            for (; col < rect.right; col += 1)
            {
                if (mask.At<uint32_t>(col, row) == 0)
                    continue;
                float norm = Norm32f(hid, pq_offset + col);
                if (Detect(hid, p_offset + col, 0, norm) > 0)
                    dst.At<uint32_t>(col, row) = 1;
            }
        }
    }

Here, unlike the scalar version, the image is scanned not point by point but in blocks of 4 points (only the edge points are processed the scalar way). The normalization factor for a block of four points is calculated as follows:

    inline __m128 ValidSqrt(__m128 value)
    {
        return _mm_blendv_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(value),
            _mm_cmpgt_ps(value, _mm_set1_ps(0.0f)));
    }

    inline __m128i Sum32ip(uint32_t * const ptr[4], size_t offset)
    {
        __m128i s0 = _mm_loadu_si128((__m128i*)(ptr[0] + offset));
        __m128i s1 = _mm_loadu_si128((__m128i*)(ptr[1] + offset));
        __m128i s2 = _mm_loadu_si128((__m128i*)(ptr[2] + offset));
        __m128i s3 = _mm_loadu_si128((__m128i*)(ptr[3] + offset));
        return _mm_sub_epi32(_mm_sub_epi32(s0, s1), _mm_sub_epi32(s2, s3));
    }

    inline __m128 Norm32fp(const HidHaarCascade & hid, size_t offset)
    {
        __m128 area = _mm_set1_ps(hid.windowArea);
        __m128 sum = _mm_cvtepi32_ps(Sum32ip(hid.p, offset));
        __m128 sqsum = _mm_cvtepi32_ps(Sum32ip(hid.pq, offset));
        return ValidSqrt(_mm_sub_ps(_mm_mul_ps(sqsum, area), _mm_mul_ps(sum, sum)));
    }

The classification itself for a block of four points:

    inline int ResultCount(__m128i result)
    {
        uint32_t SIMD_ALIGNED(16) buffer[4];
        _mm_store_si128((__m128i*)buffer, _mm_sad_epu8(result, _mm_setzero_si128()));
        return buffer[0] + buffer[2];
    }

    inline __m128 WeightedSum32f(const WeightedRect & rect, size_t offset)
    {
        __m128i s0 = _mm_loadu_si128((__m128i*)(rect.p0 + offset));
        __m128i s1 = _mm_loadu_si128((__m128i*)(rect.p1 + offset));
        __m128i s2 = _mm_loadu_si128((__m128i*)(rect.p2 + offset));
        __m128i s3 = _mm_loadu_si128((__m128i*)(rect.p3 + offset));
        __m128i sum = _mm_sub_epi32(_mm_sub_epi32(s0, s1), _mm_sub_epi32(s2, s3));
        return _mm_mul_ps(_mm_cvtepi32_ps(sum), _mm_set1_ps(rect.weight));
    }

    inline void StageSum32f(const float * leaves, float threshold, const __m128 & sum, const __m128 & norm, __m128 & stageSum)
    {
        __m128 mask = _mm_cmplt_ps(sum, _mm_mul_ps(_mm_set1_ps(threshold), norm));
        stageSum = _mm_add_ps(stageSum, _mm_blendv_ps(_mm_set1_ps(leaves[1]), _mm_set1_ps(leaves[0]), mask));
    }

    void Detect32f(const HidHaarCascade & hid, size_t offset, const __m128 & norm, __m128i & result)
    {
        typedef HidHaarCascade Hid;
        const float * leaves = hid.leaves.data();
        const Hid::Node * node = hid.nodes.data();
        const Hid::Stage * stages = hid.stages.data();
        for (int i = 0, n = (int)hid.stages.size(); i < n; ++i)
        {
            const Hid::Stage & stage = stages[i];
            if (stage.canSkip)
                continue;
            const Hid::Node * end = node + stage.ntrees;
            __m128 stageSum = _mm_setzero_ps();
            if (stage.hasThree)
            {
                for (; node < end; ++node, leaves += 2)
                {
                    const Hid::Feature & feature = hid.features[node->featureIdx];
                    __m128 sum = _mm_add_ps(WeightedSum32f(feature.rect[0], offset),
                        WeightedSum32f(feature.rect[1], offset));
                    if (feature.rect[2].p0)
                        sum = _mm_add_ps(sum, WeightedSum32f(feature.rect[2], offset));
                    StageSum32f(leaves, node->threshold, sum, norm, stageSum);
                }
            }
            else
            {
                for (; node < end; ++node, leaves += 2)
                {
                    const Hid::Feature & feature = hid.features[node->featureIdx];
                    __m128 sum = _mm_add_ps(WeightedSum32f(feature.rect[0], offset),
                        WeightedSum32f(feature.rect[1], offset));
                    StageSum32f(leaves, node->threshold, sum, norm, stageSum);
                }
            }
            result = _mm_andnot_si128(_mm_castps_si128(_mm_cmpgt_ps(_mm_set1_ps(stage.threshold), stageSum)), result);
            int resultCount = ResultCount(result);
            if (resultCount == 0) // all points in the vector have been rejected, we can exit:
            {
                return;
            }
            else if (resultCount == 1) // only one point remains in the vector, switch to the scalar version:
            {
                uint32_t SIMD_ALIGNED(16) _result[4];
                float SIMD_ALIGNED(16) _norm[4];
                _mm_store_si128((__m128i*)_result, result);
                _mm_store_ps(_norm, norm);
                for (int j = 0; j < 4; ++j)
                {
                    if (_result[j])
                    {
                        _result[j] = Detect32f(hid, offset + j, i + 1, _norm[j]) > 0 ? 1 : 0;
                        break;
                    }
                }
                result = _mm_load_si128((__m128i*)_result);
                return;
            }
        }
    }

As you can see, the algorithm is almost the same as the scalar version (adjusted, of course, for the use of SSE vectors instead of ordinary floating-point numbers).

Now consider the situation where the same algorithm must run with a step of 2. The simplest option is to load two vectors of 4 numbers from memory and then combine them into one (keeping only the even elements). However, this shows completely unsatisfactory performance. Fortunately, a small trick with reordering the source data helps: in the source integral images we move the even points to the beginning of each line and the odd points to the end. Loading a vector of even (or odd) elements then becomes a trivial operation, and the same scanning code can be reused.

These optimizations are easy to implement in the same way for other vector sizes (8 for AVX2 and 16 for AVX-512) and for other platforms (for example, NEON on ARM).

Optimization results


The table below shows the scan time (in milliseconds) for different implementations of the algorithm. A 1920x1080 image was used for testing; it was scanned without scaling:

| Function                    | Scalar  | SSE4.1  | AVX2    | AVX-512 |
|-----------------------------|---------|---------|---------|---------|
| Common                      | 227.499 | 111.509 | 74.952  | 54.579  |
| DetectionHaarDetect32fi (0) | 84.792  | 45.771  | 32.716  | 25.961  |
| DetectionHaarDetect32fi (1) | 162.779 | 79.007  | 50.996  | 36.841  |
| DetectionHaarDetect32fp (0) | 329.904 | 159.355 | 109.725 | 80.862  |
| DetectionHaarDetect32fp (1) | 588.270 | 268.298 | 172.396 | 114.735 |
Here the numbers in parentheses, (0) and (1), label the results for the two different cascades used in testing. DetectionHaarDetect32fp is the function that scans the image with a step of 1, DetectionHaarDetect32fi with a step of 2. Common is the geometric mean.
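For clarity, the Common row is obtained by aggregating the four per-cascade timings. A minimal sketch of the geometric mean (my helper, assuming plain log-averaging) reproduces it:

```cpp
#include <cmath>
#include <vector>

// Geometric mean via log-averaging: exp(mean(log(v))). It is preferred over
// the arithmetic mean for timings of very different magnitudes, since no
// single slow case dominates the summary.
double GeometricMean(const std::vector<double> & values)
{
    double logSum = 0.0;
    for (double v : values)
        logSum += std::log(v);
    return std::exp(logSum / values.size());
}
```

Applied to the scalar column (84.792, 162.779, 329.904, 588.270), this gives roughly 227.5 ms, matching the Common row above.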

The following table shows the acceleration achieved relative to the baseline (scalar) version:

| Function                    | SSE4.1 / Scalar | AVX2 / Scalar | AVX-512 / Scalar |
|-----------------------------|-----------------|---------------|------------------|
| Common                      | 2.04            | 3.04          | 4.17             |
| DetectionHaarDetect32fi (0) | 1.85            | 2.59          | 3.27             |
| DetectionHaarDetect32fi (1) | 2.06            | 3.19          | 4.42             |
| DetectionHaarDetect32fp (0) | 2.07            | 3.01          | 4.08             |
| DetectionHaarDetect32fp (1) | 2.19            | 3.41          | 5.13             |
And finally, for dessert, the relative acceleration from each doubling of the vector size:
| Function                    | SSE4.1 / Scalar | AVX2 / SSE4.1 | AVX-512 / AVX2 |
|-----------------------------|-----------------|---------------|----------------|
| Common                      | 2.04            | 1.49          | 1.37           |
| DetectionHaarDetect32fi (0) | 1.85            | 1.40          | 1.26           |
| DetectionHaarDetect32fi (1) | 2.06            | 1.55          | 1.38           |
| DetectionHaarDetect32fp (0) | 2.07            | 1.45          | 1.36           |
| DetectionHaarDetect32fp (1) | 2.19            | 1.56          | 1.50           |
The table clearly shows the diminishing returns from each subsequent doubling of the vector size.

Software implementation


In conclusion, I would like to say a few words about the software implementation of the algorithms described above. Within the Simd project, the C++ class Simd::Detection has been implemented; it hides the low-level work with the Simd library API under the hood. Important advantages are the compatibility of the cascades it uses (HAAR and LBP) with the cascades of OpenCV, as well as ease of use. Below is an example of face detection using this framework:

    #include <iostream>
    #include <string>

    #include "opencv2/opencv.hpp"

    #define SIMD_OPENCV_ENABLE
    #include "Simd/SimdDetection.hpp"
    #include "Simd/SimdDrawing.hpp"

    int main(int argc, char * argv[])
    {
        if (argc < 2)
        {
            std::cout << "You have to set video source! It can be 0 for camera or video file name." << std::endl;
            return 1;
        }
        std::string source = argv[1];

        cv::VideoCapture capture;
        if (source == "0")
            capture.open(0);
        else
            capture.open(source);
        if (!capture.isOpened())
        {
            std::cout << "Can't capture '" << source << "' !" << std::endl;
            return 1;
        }

        typedef Simd::Detection<Simd::Allocator> Detection;
        Detection detection;
        detection.Load("../../data/cascade/haar_face_0.xml");
        bool inited = false;

        const char * WINDOW_NAME = "FaceDetection";
        cv::namedWindow(WINDOW_NAME, 1);
        for (;;)
        {
            cv::Mat frame;
            capture >> frame;

            Detection::View image = frame;
            if (!inited)
            {
                detection.Init(image.Size(), 1.2, image.Size() / 20);
                inited = true;
            }

            Detection::Objects objects;
            detection.Detect(image, objects);

            for (size_t i = 0; i < objects.size(); ++i)
                Simd::DrawRectangle(image, objects[i].rect, Simd::Pixel::Bgr24(0, 255, 255));

            cv::imshow(WINDOW_NAME, frame);
            if (cvWaitKey(1) == 27) // press 'Esc' to exit
                break;
        }
        return 0;
    }

Source: https://habr.com/ru/post/336338/

