In this article we show how recognition technologies perform on the Elbrus-4C and on the new Elbrus-8C: we look at several machine vision problems, briefly describe the algorithms used to solve them, present benchmark results, and finish with video demonstrations.
The Elbrus-8C is a new 8-core MCST processor with a VLIW architecture. We tested an engineering sample running at 1.3 GHz; the clock frequency may be higher in the production release.
Let's compare the characteristics of the Elbrus-4C and the Elbrus-8C.
Characteristic | Elbrus-4C | Elbrus-8C |
---|---|---|
Clock frequency, MHz | 800 | 1300 |
Number of cores | 4 | 8 |
Operations per clock (per core) | up to 23 | up to 25 |
L1 cache, per core | 64 KB | 64 KB |
L2 cache, per core | 2 MB | 512 KB |
L3 cache, shared | - | 16 MB |
RAM organization | up to 3 channels of DDR3-1600 ECC | up to 4 channels of DDR3-1600 ECC |
Process technology | 65 nm | 28 nm |
Number of transistors | 986 million | 2730 million |
SIMD instruction width | 64 bits | 64 bits |
Multiprocessor support | up to 4 processors | up to 4 processors |
Year of production start | 2014 | 2016 |
Operating system | Elbrus OS 3.0-rc27 | Elbrus OS 3.0-rc26 |
lcc compiler version | 1.21.18 | 1.21.14 |
In the Elbrus-8C, the clock frequency has increased by more than one and a half times, the number of cores has doubled, and the architecture itself has been improved.
For example, the Elbrus-8C can execute up to 25 operations per clock cycle without SIMD (versus 23 for the Elbrus-4C).
Important: we did not perform any special optimization for the Elbrus-8C. The EML library was used, but the amount of Elbrus-specific optimization in our projects is still noticeably smaller than for other architectures: there it was built up gradually over several years, while we have been working with the Elbrus platform for a relatively short time and not very intensively. The most time-consuming functions have, of course, been optimized, but we have not yet gotten around to the rest.
Naturally, we decided to begin exploring the new platform by running our product Smart IDReader 1.6, which recognizes passports, driving licenses, bank cards and other documents. Note that the standard version of this application cannot effectively use more than 4 threads when recognizing a single document. For mobile devices this is more than enough, but when benchmarking desktop processors it can lead to underestimating the performance of multi-core systems.
The version of Elbrus OS provided to us and the lcc compiler did not require any special changes to the source code, and we built our project without any difficulties. Note that the new compiler version fully supports C++11 (support has also appeared in newer lcc versions for the Elbrus-4C), which is good news.
To begin with, we decided to check how recognition of a Russian Federation passport, which we have written about before, works on the Elbrus-8C. We tested two modes: search and recognition of a passport in a single frame (anywhere mode) and in video taken from a webcam (webcam mode). In anywhere mode, the full passport spread is recognized in a single frame, and the passport can be located anywhere in the frame and be arbitrarily oriented. In webcam mode, only the passport page with the photo is recognized, and a sequence of frames is processed; here it is assumed that the passport text lines are roughly horizontal and that the passport shifts only slightly between frames. Information obtained from different frames is combined to improve recognition quality.
For testing, we took 1000 images for each mode and measured the average recognition time (i.e. excluding image loading time) when running in 1 thread and when running with parallelization. The resulting times are given in the table below.
Mode | Elbrus-4C, s/frame | Elbrus-8C, s/frame | Speedup, times |
---|---|---|---|
Anywhere mode, 1 thread | 4.57 | 2.78 | 1.64 |
Anywhere mode, max threads | 3.09 | 1.78 | 1.74 |
Webcam mode, 1 thread | 0.81 | 0.49 | 1.65 |
Webcam mode, max threads | 0.58 | 0.34 | 1.71 |
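As an illustration of how these measurements were organized, here is a minimal timing-harness sketch. The `Image`, `LoadImage` and `RecognizePassport` names are hypothetical stand-ins for the real Smart IDReader calls; the only point it demonstrates is that image loading stays outside the timed section, as in the numbers above.

```cpp
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

// Hypothetical stand-ins for the actual image loading and recognition calls.
struct Image {};
Image LoadImage(int /*index*/) { return Image{}; }  // load the i-th test image from disk
void RecognizePassport(const Image&) {}             // run the recognition pipeline

int main() {
    const int kFrames = 1000;                        // size of the test set
    std::vector<double> seconds;
    seconds.reserve(kFrames);

    for (int i = 0; i < kFrames; ++i) {
        Image img = LoadImage(i);                    // loading stays outside the timed section

        auto t0 = std::chrono::steady_clock::now();
        RecognizePassport(img);                      // only recognition itself is timed
        auto t1 = std::chrono::steady_clock::now();

        seconds.push_back(std::chrono::duration<double>(t1 - t0).count());
    }

    double mean = std::accumulate(seconds.begin(), seconds.end(), 0.0) / seconds.size();
    std::cout << "average recognition time: " << mean << " s/frame\n";
    return 0;
}
```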
The single-threaded results are quite consistent with expectations: besides the speedup from the higher clock frequency (the 8C-to-4C frequency ratio is 1300/800 = 1.625), a slight additional speedup from the architectural improvements is noticeable.
When running with the maximum number of threads, the speedup in both modes was about 1.7. But the Elbrus-8C has twice as many cores as the 4C, so where is the speedup from the extra 4 cores? The point is that our recognition algorithm actively uses only 4 threads and scales poorly beyond that, so the additional cores bring only a small gain.
We then decided to load all the cores of both processors fully by launching several passport recognition processes. Each recognition call was parallelized in the same way as in the previous experiment, but here the per-passport processing time included loading the image from a file. The measurements were performed on the same thousand passports. The results under full load are given below:
Mode | Elbrus-4C, s/frame | Elbrus-8C, s/frame | Speedup, times |
---|---|---|---|
Anywhere | 1.38 | 0.43 | 3.2 |
Webcam | 0.47 | 0.19 | 2.5 |
For anywhere mode, the resulting speedup approached the expected ~3.6 times but did not reach it, because the measured time included loading the image from the file. In webcam mode, the relative contribution of loading time is even greater, so the speedup was more modest: 2.5 times.
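As a rough illustration of this full-load setup, here is a minimal sketch that starts one worker process per core and splits the test set between them; `ProcessShard` is a hypothetical placeholder for "load the image from a file and recognize it", and the total wall-clock time divided by the number of images gives an s/frame figure of the kind shown above.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <chrono>
#include <iostream>

// Hypothetical placeholder: load image i from disk and run recognition on it.
void ProcessShard(int worker, int n_workers, int n_images) {
    for (int i = worker; i < n_images; i += n_workers) {
        // load image i from a file and recognize it (placeholder)
    }
}

int main() {
    const int kWorkers = 8;     // one worker per core of the Elbrus-8C
    const int kImages  = 1000;  // size of the test set

    auto t0 = std::chrono::steady_clock::now();
    for (int w = 0; w < kWorkers; ++w) {
        if (fork() == 0) {                       // each child process handles its own shard
            ProcessShard(w, kWorkers, kImages);
            _exit(0);
        }
    }
    while (wait(nullptr) > 0) {}                 // parent waits for all workers to finish
    auto t1 = std::chrono::steady_clock::now();

    double total = std::chrono::duration<double>(t1 - t0).count();
    std::cout << "throughput under full load: " << total / kImages << " s/frame\n";
    return 0;
}
```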
Detecting objects of a given type is one of the classical problems of machine vision. This may be detection of faces, people, abandoned objects, or any other class of objects with clear distinctive features.
For our example, we took the task of detecting vehicles moving in the same direction. Such a detector can be used in automatic vehicle control systems, license plate recognition systems, and so on. Without further ado, we shot a video for training and testing with a dashcam not far from our office. As the detector we used the Viola-Jones cascade classifier [1]. In addition, we applied exponential smoothing to the positions of detected cars that are observed over several consecutive frames. Note that detection is performed only inside an ROI (region of interest) rectangle that does not occupy the whole frame, since there is little point in trying to detect the interior of our own car or vehicles that do not fit completely into the frame.
Thus, our algorithm consisted of the following steps:
Input data: a sequence of color frames of 800x600 pixels.
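To make the detection loop described above concrete, here is a minimal sketch. It assumes OpenCV's `cv::CascadeClassifier` as the Viola-Jones implementation; the cascade file `cars.xml`, the video file `road.avi`, the ROI coordinates, the matching radius and the smoothing factor are all hypothetical placeholders, and the track bookkeeping is deliberately simplistic.

```cpp
#include <opencv2/imgproc.hpp>
#include <opencv2/objdetect.hpp>
#include <opencv2/videoio.hpp>
#include <vector>

int main() {
    cv::CascadeClassifier detector("cars.xml");  // hypothetical cascade trained on our footage
    cv::VideoCapture cap("road.avi");            // hypothetical dashcam recording
    const cv::Rect roi(100, 150, 600, 300);      // detection is restricted to this part of the 800x600 frame

    const double alpha = 0.3;                    // exponential smoothing factor
    std::vector<cv::Rect2d> tracks;              // smoothed positions of cars seen on previous frames

    cv::Mat frame, gray;
    while (cap.read(frame)) {
        cv::cvtColor(frame(roi), gray, cv::COLOR_BGR2GRAY);

        std::vector<cv::Rect> cars;
        detector.detectMultiScale(gray, cars);   // Viola-Jones detection inside the ROI

        for (const cv::Rect& c : cars) {
            cv::Rect2d det(c.x + roi.x, c.y + roi.y, c.width, c.height);

            // Match the detection to the nearest existing track by center distance
            // and smooth it exponentially; otherwise start a new track.
            bool matched = false;
            for (cv::Rect2d& t : tracks) {
                double dx = (det.x + det.width / 2)  - (t.x + t.width / 2);
                double dy = (det.y + det.height / 2) - (t.y + t.height / 2);
                if (dx * dx + dy * dy < 50.0 * 50.0) {
                    t.x      = alpha * det.x      + (1 - alpha) * t.x;
                    t.y      = alpha * det.y      + (1 - alpha) * t.y;
                    t.width  = alpha * det.width  + (1 - alpha) * t.width;
                    t.height = alpha * det.height + (1 - alpha) * t.height;
                    matched = true;
                    break;
                }
            }
            if (!matched) tracks.push_back(det);
        }
        // Smoothed rectangles from `tracks` can now be drawn on the frame.
    }
    return 0;
}
```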
Here and below, to estimate fps (frames per second) we used the average running time over 10 launches of the program. Only image processing time was counted, since here we were working with prerecorded video and the images were simply loaded from files, whereas in a real system they could, for example, come from a camera. The detector turned out to run at a very decent speed: 15.5 fps on the Elbrus-4C and 35.6 fps on the Elbrus-8C. On the Elbrus-8C the processor load is far from complete, although at peak all cores are involved. Obviously this is because not all computations in this problem were parallelized: for example, before applying the Viola-Jones detector we perform fairly heavy auxiliary transformations of each frame, and this part of the system runs sequentially.
Now it's time for a demonstration. The application interface and rendering are implemented with standard Qt5 tools; no additional optimization was performed.
Elbrus-4C
Elbrus-8C
In this application we decided to demonstrate visual localization based on keypoints. Using Google Street View panoramas with GPS tags, we taught our system to determine the camera's location without using its GPS coordinates or any other external information. Such a system can be used on drones and robots as a backup navigation system, to refine the current position, or to operate where GPS is unavailable.
First, we processed a database of panoramas with GPS coordinates. We took 660 images covering approximately 0.4 km² of Moscow streets:
Then we built a keypoint-based description of the images. For each image we:
That completes the preparation; now let's move on to what happens while the program is running:
Input data: a sequence of color frames of 800x600 pixels.
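As an illustration of the runtime matching step, here is a minimal sketch. It uses OpenCV's ORB keypoints and a brute-force Hamming matcher as stand-ins for binary descriptors of the kind described in [2]; the `Panorama` structure, the match threshold and the file names are hypothetical, and the geometric verification a real system would perform is omitted.

```cpp
#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <utility>
#include <vector>

// One entry of the precomputed panorama database (structure is hypothetical).
struct Panorama {
    cv::Mat descriptors;   // binary keypoint descriptors of one Street View panorama
    double lat, lon;       // GPS coordinates attached to the panorama
};

// Return the GPS position of the database panorama that best matches the frame.
std::pair<double, double> Localize(const cv::Mat& frame,
                                   const std::vector<Panorama>& base) {
    if (base.empty() || frame.empty()) return {0.0, 0.0};

    cv::Ptr<cv::ORB> orb = cv::ORB::create(1000);       // ORB as a stand-in binary descriptor
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat desc;
    orb->detectAndCompute(frame, cv::noArray(), keypoints, desc);

    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    size_t best = 0, bestGood = 0;
    for (size_t i = 0; i < base.size(); ++i) {
        std::vector<cv::DMatch> matches;
        matcher.match(desc, base[i].descriptors, matches);

        size_t good = 0;
        for (const cv::DMatch& m : matches)
            if (m.distance < 40) ++good;                 // crude Hamming-distance threshold
        if (good > bestGood) { bestGood = good; best = i; }
    }
    return {base[best].lat, base[best].lon};             // position of the best-matching panorama
}

int main() {
    std::vector<Panorama> base;  // filled offline from the 660 panoramas (omitted here)
    cv::Mat frame = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);  // hypothetical input frame
    std::pair<double, double> pos = Localize(frame, base);
    (void)pos;                   // in the demo, the result is shown on a map
    return 0;
}
```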
This system runs at 3.0 fps on the Elbrus-4C and 7.2 fps on the Elbrus-8C.
Let us show how it works:
Elbrus-4C
Elbrus-8C
For convenience, the results of our programs on both Elbrus processors are collected in the table:
Test | Elbrus-4C | Elbrus-8C | Speedup, times |
---|---|---|---|
Car detection | 15.5 fps | 35.6 fps | 2.3 |
Visual localization | 3.0 fps | 7.2 fps | 2.4 |
Passport, anywhere mode, s/frame | 3.09 | 1.78 | 1.74 |
Passport, webcam mode, s/frame | 0.58 | 0.34 | 1.71 |
Passport, anywhere mode, s/frame, full CPU load | 1.38 | 0.43 | 3.2 |
Passport, webcam mode, s/frame, full CPU load | 0.47 | 0.19 | 2.5 |
The results for passport recognition turned out to be rather modest, since our application in its current form cannot effectively use more than 4 threads. The situation is similar for car detection and visual localization: the algorithms contain non-parallelized sections, so linear scaling with the number of cores should not be expected. However, where nothing prevents loading all the processor cores, we observe a speedup of 3.2 times, which is close to the theoretical limit of 3.6 times. On average, the performance difference between these generations of MCST processors on our set of tasks is about 2-3 times, which is very encouraging. From the higher clock frequency and architectural improvements alone we already see a gain of more than 1.7 times. At this rate, MCST is quickly catching up with Intel and its strategy of adding 5% per year.
During the tests under full load we did not run into any freezes or crashes, which speaks to the maturity of the processor architecture. The VLIW approach developed in the Elbrus-8C allows various computer vision algorithms to run in real time, and the EML library contains a very solid set of mathematical functions that saves time for those who are not going to hand-optimize their code. To conclude, we ran one more experiment, launching 3 demos at once (localization, car detection and face detection) on a single Elbrus-8C processor and getting an average processor load of about 80%. No further comments needed.
We want to say a big thank you to MCST and the Bruk INEUM, and to their employees, for the opportunity to try the Elbrus-8C, to congratulate them (the "eight" is a more than decent processor), and to wish them success!
[1] P. Viola and M. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features," Proceedings of CVPR, 2001.
[2] B. Fan, Q. Kong, T. Trzcinski, Z. H. Wang, C. Pan, and P. Fua, "Receptive fields selection for binary feature description," IEEE Trans. Image Process., pp. 2583-2595, 2014.
Source: https://habr.com/ru/post/329858/