
Comparison of the Elbrus-4C and Elbrus-8C on several computer vision problems

In this article we will show how pattern recognition technologies perform on the Elbrus-4C and on the new Elbrus-8C: we consider several computer vision problems, briefly describe the algorithms that solve them, present benchmark results, and finally show videos.





The Elbrus-8C is a new 8-core MCST processor with a VLIW architecture. We tested an engineering sample clocked at 1.3 GHz; the frequency may increase in the production release.





Let's compare the characteristics of the Elbrus-4C and Elbrus-8C.



| | Elbrus-4C | Elbrus-8C |
|---|---|---|
| Clock frequency, MHz | 800 | 1300 |
| Number of cores | 4 | 8 |
| Operations per clock cycle (per core) | up to 23 | up to 25 |
| L1 cache, per core | 64 KB | 64 KB |
| L2 cache, per core | 2 MB | 512 KB |
| L3 cache, total | — | 16 MB |
| RAM organization | up to 3 channels DDR3-1600 ECC | up to 4 channels DDR3-1600 ECC |
| Process technology | 65 nm | 28 nm |
| Number of transistors | 986 million | 2,730 million |
| SIMD instruction width | 64 bits | 64 bits |
| Multiprocessor support | up to 4 processors | up to 4 processors |
| Production start year | 2014 | 2016 |
| Operating system | OS Elbrus 3.0-rc27 | OS Elbrus 3.0-rc26 |
| lcc compiler version | 1.21.18 | 1.21.14 |


In the Elbrus-8C the clock frequency has grown more than one and a half times, the number of cores has doubled, and the architecture itself has been improved.



For example, the Elbrus-8C can execute up to 25 instructions per clock cycle without SIMD (versus 23 for the Elbrus-4C).



Important: we have not carried out any special optimization for the Elbrus-8C. The EML library was used, but the volume of Elbrus-specific optimizations in our projects is currently clearly smaller than for other architectures: there it was built up gradually over several years, while we have been working with the Elbrus platform neither for long nor very actively. The most time-consuming functions have of course been optimized, but we have not yet gotten around to the rest.



Russian passport recognition



Naturally, we decided to start exploring this new-to-us platform by launching our product Smart IDReader 1.6, which recognizes passports, driving licenses, bank cards, and other documents. It should be noted that the standard version of this application can effectively use no more than 4 threads when recognizing a single document. For mobile devices this is more than enough, but when benchmarking desktop processors it can understate the performance of multi-core systems.



The version of Elbrus OS provided to us and the lcc compiler required no special changes to the source code, and we built our project without any difficulty. Note that this version has full C++11 support (it has also appeared in newer versions of lcc for the Elbrus-4C), which is good news.



To begin with, we decided to check how the recognition of Russian passports, which we have already written about, works on the Elbrus-8C. We tested in two modes: finding and recognizing a passport in a single frame (anywhere mode) and in video taken from a webcam (webcam mode). In anywhere mode, a passport spread is recognized in one frame, and the passport can be located in any part of the frame with arbitrary orientation. In webcam mode, only the passport page with the photo is recognized, and a series of frames is processed; here it is assumed that the passport's text lines are horizontal and the passport shifts only slightly between frames. Information obtained from different frames is integrated to improve recognition quality.



For testing, we took 1000 images for each of the modes and measured the average recognition time (i.e. time excluding image loading) when running in 1 thread and with parallelization enabled. The resulting times are given in the table below.
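As a rough illustration of the measurement procedure itself, here is a minimal timing-harness sketch. It is only a sketch: `recognize_frame` is a hypothetical stand-in for the proprietary Smart IDReader call, and the frames here are synthetic so that the program runs on its own.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for one recognition call: the real Smart IDReader
// API is not public, so this dummy just touches every pixel.
static double recognize_frame(const std::vector<unsigned char>& image) {
    double s = 0.0;
    for (unsigned char b : image) s += b * 1e-6;
    return s;
}

int main() {
    // 100 synthetic "frames" (the real test used 1000 passport images);
    // they are already decoded in memory, so file I/O is excluded.
    std::vector<std::vector<unsigned char>> frames(
        100, std::vector<unsigned char>(800 * 600 * 3, 127));

    auto t0 = std::chrono::steady_clock::now();
    double sink = 0.0;  // keep the calls from being optimized away
    for (const auto& f : frames) sink += recognize_frame(f);
    auto t1 = std::chrono::steady_clock::now();

    double avg = std::chrono::duration<double>(t1 - t0).count() / frames.size();
    std::printf("average: %.3f s/frame (checksum %.1f)\n", avg, sink);
    return 0;
}
```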



| Mode | Elbrus-4C, s/frame | Elbrus-8C, s/frame | Speedup, × |
|---|---|---|---|
| Anywhere mode, 1 thread | 4.57 | 2.78 | 1.64 |
| Anywhere mode, max threads | 3.09 | 1.78 | 1.74 |
| Webcam mode, 1 thread | 0.81 | 0.49 | 1.65 |
| Webcam mode, max threads | 0.58 | 0.34 | 1.71 |


The single-threaded results match expectations: besides the speedup from the higher clock frequency (the 8C-to-4C frequency ratio is 1300/800 = 1.625), a small additional speedup from the architectural improvements is noticeable.



When running with the maximum number of threads, the speedup in both modes was about 1.7×. But the Elbrus-8C has twice as many cores as the 4C, so where is the speedup from the four extra cores? The point is that our recognition algorithm actively uses only 4 threads and hardly scales beyond that, so the additional gain is insignificant.



We then decided to fully load all the cores of both processors by launching several passport recognition processes. Each recognition call was parallelized in the same way as in the previous experiment, but here the per-passport processing time included loading the image from a file. Time was measured on the same thousand passports. The results for the fully loaded processors are given in the table below.
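A minimal sketch of such a multi-process load might look like this; the per-frame work is a placeholder, since the real experiment ran a full Smart IDReader recognition, including image loading, in each process.

```cpp
#include <chrono>
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

// Placeholder for one full recognition call including file I/O; the real
// workload is proprietary, so this just burns CPU.
static void process_one_frame() {
    volatile double x = 0;
    for (int j = 0; j < 50000000; ++j) x += j;
}

int main() {
    const int n_procs  = 8;    // one worker process per Elbrus-8C core
    const int n_frames = 125;  // 1000 test passports split across the workers

    auto t0 = std::chrono::steady_clock::now();
    for (int p = 0; p < n_procs; ++p) {
        if (fork() == 0) {                         // child: handle its share
            for (int i = 0; i < n_frames; ++i) process_one_frame();
            _exit(0);
        }
    }
    while (wait(nullptr) > 0) {}                   // parent: wait for all workers
    auto t1 = std::chrono::steady_clock::now();

    double total = std::chrono::duration<double>(t1 - t0).count();
    std::printf("full-load time: %.3f s/frame\n", total / (n_procs * n_frames));
    return 0;
}
```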



| Mode | Elbrus-4C, s/frame | Elbrus-8C, s/frame | Speedup, × |
|---|---|---|---|
| Anywhere | 1.38 | 0.43 | 3.2 |
| Webcam | 0.47 | 0.19 | 2.5 |


For anywhere mode, the resulting speedup approached the expected ~3.6×, falling short of it because we included the time to load each image from file. In webcam mode the share of loading time is even greater, so the speedup was more modest: 2.5×.



Car Detection



Detecting objects of a given type is one of the classical problems of computer vision. It may be the detection of faces, people, abandoned objects, or any other kind of object with clear distinguishing features.



For our example, we took on the task of detecting vehicles moving in the same direction. Such a detector could be used in automatic vehicle control systems, in license plate recognition systems, and so on. Without further ado, we shot a video for training and testing with a dashcam not far from our office. As the detector we used the Viola-Jones cascade classifier [1]. Additionally, we applied exponential smoothing to the positions of found cars that are observed over several consecutive frames. Note that detection is performed only inside a ROI (region of interest) rectangle that does not span the entire frame, since there is little sense in trying to detect the interior of our own car, or vehicles that do not fit completely into the frame.



Thus, our algorithm consisted of the following steps:



  1. Cut out a ROI rectangle in the center of the frame.
  2. Convert the color ROI image to grayscale.
  3. Precompute the Viola-Jones features.

    At this stage the image is scaled, auxiliary feature images are computed (for example, directed edges), and integral sums are calculated over all feature channels so that Haar-like features can be evaluated quickly.
  4. Run the Viola-Jones classifier over multiple windows.

    Rectangular windows are enumerated with a certain step, and the classifier is run on each of them. A positive response means an object has been detected, i.e. the image inside the window looks like a car. The region containing the object is then refined: in the vicinity of the primary detection, windows of the same size are sampled with a smaller step and also fed to the classifier. All detections are saved for further processing. This procedure is repeated for several scales of the input image.

    This stage accounts for the bulk of the computational cost of the task, and it is the one we parallelized. We used the tbb library to automatically select an effective number of threads.
  5. Process the array of detections produced by the detector. Many detections can lie very close together and correspond to the same object, so we merge detections whose intersection area is sufficiently large. The result is an array of rectangles indicating the positions of the detected vehicles.
  6. Match detections on the current frame against those on the previous frame. We consider two detections to be the same object if the intersection area of the rectangles exceeds half the area of the current rectangle. The object's position is smoothed according to the formulas (a condensed code sketch of the whole pipeline follows this list):

    x_i = α·x_i + (1 − α)·x_{i−1}

    y_i = α·y_i + (1 − α)·y_{i−1}

    w_i = α·w_i + (1 − α)·w_{i−1}

    h_i = α·h_i + (1 − α)·h_{i−1}

    where (x, y) are the coordinates of the top-left corner of the rectangle, w and h are its width and height, and α is a constant coefficient chosen experimentally.
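Putting the steps together, here is a condensed sketch of the pipeline. It is not the authors' code: OpenCV's cv::CascadeClassifier stands in for their own Viola-Jones implementation (detectMultiScale internally covers the multi-scale window search and the merging of overlapping detections from steps 3-5), the smoothing coefficient value is assumed, and the track matching is simplified to a single previous frame.

```cpp
#include <opencv2/imgproc.hpp>
#include <opencv2/objdetect.hpp>
#include <vector>

static std::vector<cv::Rect> prev_boxes;  // detections from the previous frame
static const float kAlpha = 0.6f;         // smoothing coefficient (assumed value)

std::vector<cv::Rect> detect_cars(const cv::Mat& frame, cv::CascadeClassifier& cascade) {
    // Step 1: cut out a ROI rectangle in the center of the frame.
    cv::Rect roi(frame.cols / 4, frame.rows / 4, frame.cols / 2, frame.rows / 2);

    // Step 2: convert the color ROI to grayscale.
    cv::Mat gray;
    cv::cvtColor(frame(roi), gray, cv::COLOR_BGR2GRAY);

    // Steps 3-5: feature precomputation, the multi-scale window search and
    // the merging of overlapping detections all happen inside detectMultiScale.
    std::vector<cv::Rect> boxes;
    cascade.detectMultiScale(gray, boxes, /*scaleFactor=*/1.1, /*minNeighbors=*/3);

    // Step 6: match against the previous frame (same object if the overlap
    // exceeds half the current rectangle's area) and smooth exponentially.
    for (cv::Rect& b : boxes) {
        for (const cv::Rect& p : prev_boxes) {
            if ((b & p).area() * 2 > b.area()) {
                b.x      = cvRound(kAlpha * b.x      + (1 - kAlpha) * p.x);
                b.y      = cvRound(kAlpha * b.y      + (1 - kAlpha) * p.y);
                b.width  = cvRound(kAlpha * b.width  + (1 - kAlpha) * p.width);
                b.height = cvRound(kAlpha * b.height + (1 - kAlpha) * p.height);
                break;
            }
        }
    }
    prev_boxes = boxes;
    return boxes;  // rectangles are in ROI coordinates; offset by roi.tl() if needed
}
```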


Input data: a sequence of color frames of 800x600 pixels.

Hereinafter, fps (frames per second) estimates use the average running time over 10 launches of the program. Only image processing time was counted, since we were working with pre-recorded video and the frames were simply loaded from files, whereas in a real system they could, for example, arrive from a camera. Detection turned out to run at a very decent speed: 15.5 fps on the Elbrus-4C and 35.6 fps on the Elbrus-8C. On the Elbrus-8C the processor load is far from full, although at peak all cores are involved. Obviously this is because not all computations in this task were parallelized: for example, before applying the Viola-Jones detector we perform fairly heavy auxiliary transformations of each frame, and that part of the system runs sequentially.



Now it is time for a demonstration. The application interface and rendering use standard Qt5 facilities; no additional optimization was performed.



Elbrus-4C





Elbrus-8C





Visual localization



In this application we decided to demonstrate visual localization based on keypoints. Using Google Street View panoramas with GPS tags, we taught our system to determine the camera's location without using its GPS coordinates or any other external information. Such a system can be used on drones and robots as a backup navigation system, to refine the current position, or to operate where GPS is unavailable.



First, we processed the database of panoramas with GPS coordinates. We took 660 images covering approximately 0.4 km² of Moscow streets:







Then we built keypoint-based descriptions of the images. For each image we:



  1. Found keypoints at 3 frame scales (the frame itself, the frame downscaled by a factor of 4/3, and the frame downscaled by half) with the YAPE (Yet Another Point Detector) algorithm and computed RFD descriptors for them [2].
  2. Saved the frame's coordinates, its set of keypoints, and their descriptors. Since we will later compare the descriptors of the current frame's keypoints against the descriptors in our database, it is convenient to store the descriptors in a tree, using the Hamming distance as the metric (see the sketch after this list). The total size of the saved data came out to slightly more than 15 MB.
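Since RFD descriptors are binary, the Hamming distance reduces to XOR plus a population count. Below is a minimal sketch of the metric and a brute-force nearest-neighbour lookup; the 256-bit descriptor length is an assumption for illustration, and the real system searches a tree instead of scanning the whole database.

```cpp
#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// A binary descriptor modeled as 256 bits packed into four 64-bit words
// (the exact RFD length is an assumption for this sketch).
using Descriptor = std::array<std::uint64_t, 4>;

// Hamming distance: XOR the words and count the set bits.
inline int hamming(const Descriptor& a, const Descriptor& b) {
    int d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += static_cast<int>(std::bitset<64>(a[i] ^ b[i]).count());
    return d;
}

// Brute-force nearest neighbour; the real system keeps the database
// descriptors in a tree to avoid scanning all of them.
std::size_t nearest(const Descriptor& q, const std::vector<Descriptor>& db) {
    std::size_t best = 0;
    int best_d = hamming(q, db[0]);
    for (std::size_t i = 1; i < db.size(); ++i) {
        int d = hamming(q, db[i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}
```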


That concludes the preparation; now let's look at what happens directly while the program runs:



  1. Convert the color image to grayscale.
  2. Perform autocontrast.
  3. Find keypoints at the three frame scales (again with coefficients 1, 0.75, and 0.5) using the YAPE algorithm and compute RFD descriptors for them. These algorithms are partially parallelized, but a fairly large share of the computation remains sequential. Moreover, they have not yet been optimized for the Elbrus platform.
  4. For the resulting set of descriptors, search the tree for similar stored descriptors and select several of the most similar database frames. The search over the different query descriptors is parallelized using tbb (a sketch of this pattern follows the list). For the first 5 frames of the video we select the 10 nearest database frames, after that only 5.
  5. Filter the selected frames to remove “outliers”, relying on the fact that a vehicle's trajectory is normally continuous.
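The parallelization pattern for step 4 might look like this minimal tbb sketch, reusing the hypothetical Descriptor and nearest() from the previous sketch; tbb chooses the number of worker threads automatically.

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstddef>
#include <vector>

// Descriptor and nearest() are the hypothetical types from the previous
// sketch; each query descriptor is matched against the database in parallel.
std::vector<std::size_t> match_all(const std::vector<Descriptor>& queries,
                                   const std::vector<Descriptor>& db) {
    std::vector<std::size_t> result(queries.size());
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, queries.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                result[i] = nearest(queries[i], db);   // independent lookups
        });
    return result;
}
```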


Input data: a sequence of color frames of 800x600 pixels.



This system achieves 3.0 fps on the Elbrus-4C and 7.2 fps on the Elbrus-8C.



Let us show how it works:



Elbrus-4C





Elbrus-8C





Conclusion



For convenience, the results of our programs on both Elbrus processors are collected in the table:



| Test | Elbrus-4C | Elbrus-8C | Speedup, × |
|---|---|---|---|
| Car detection, fps | 15.5 | 35.6 | 2.3 |
| Visual localization, fps | 3.0 | 7.2 | 2.4 |
| Passport, anywhere mode, s/frame | 3.09 | 1.78 | 1.74 |
| Passport, webcam mode, s/frame | 0.58 | 0.34 | 1.71 |
| Passport, anywhere mode, s/frame, full CPU load | 1.38 | 0.43 | 3.2 |
| Passport, webcam mode, s/frame, full CPU load | 0.47 | 0.19 | 2.5 |


The passport recognition results turned out rather modest, because our application in its current form cannot effectively use more than 4 threads. The situation is similar for car detection and visual localization: the algorithms contain sequential sections, so linear scaling with the number of cores should not be expected. However, where nothing prevents the application from loading all processor cores, we observe a 3.2× gain, close to the theoretical limit of 3.6×. On average, the performance difference between the two MCST processor generations on our task set is about 2-3×, which is very encouraging. From the higher frequency and improved architecture alone we already get a gain of more than 1.7×. At this rate, MCST is quickly catching up with Intel and its strategy of adding 5% per year.



During the tests under full load we encountered no freezes or crashes, which speaks to the maturity of the platform. The VLIW approach developed in the Elbrus-8C allows various computer vision algorithms to run in real time, and the EML library contains a very solid set of mathematical functions that saves time for those who do not intend to optimize the code by hand. In conclusion, we ran one more experiment: launching 3 demos at once (localization, car detection, and face detection) on a single Elbrus-8C processor gave an average processor load of about 80%. No further comment needed.





We want to say a big thank-you to the companies and employees of MCST and the Brook INEUM for the opportunity to try the Elbrus-8C, and to congratulate them: the "eight" is a more than decent processor, and we wish them success!



References



[1] P. Viola, M. Jones, “Rapid Object Detection using a Boosted Cascade of Simple Features”, Proceedings of CVPR 2001.

[2] B. Fan, Q. Kong, T. Trzcinski, Z. H. Wang, C. Pan, and P. Fua, “Receptive fields selection for binary feature description,” IEEE Trans. Image Process., pp. 2583-2595, 2014.




Source: https://habr.com/ru/post/329858/


