In the previous article, we talked about upgrading one of the most popular features of Macroscop video analysis: the visitor-counting function. We decided to make it more accurate and more convenient for the user. That left one small question: how? In our case, the procedure was as follows:
1. Read scientific articles and publications;
2. Discuss, analyze and choose ideas;
3. Prototype and test;
4. Choose the best option and develop it.
After completing the first two steps, we decided to build the new visitor counting on depth information. Depth here is the vertical distance from the camera to the objects in its field of view. It tells us the height of whatever crosses the entry-exit line and therefore lets us distinguish people from other objects.
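The height test described above can be sketched in a few lines. Everything here is a hypothetical illustration: the mounting height, the height range, and the function names are our assumptions, not the actual Macroscop implementation.

```python
# Hypothetical sketch: classify a line-crossing object as a person by
# converting a depth reading (camera to object's top) into a height.
CAMERA_HEIGHT_M = 3.0              # assumed ceiling-mount height
PERSON_HEIGHT_RANGE = (1.0, 2.2)   # assumed plausible human heights, metres

def object_height(depth_to_top_m: float) -> float:
    """Depth is measured straight down from the camera to the object's top."""
    return CAMERA_HEIGHT_M - depth_to_top_m

def is_person(depth_to_top_m: float) -> bool:
    low, high = PERSON_HEIGHT_RANGE
    return low <= object_height(depth_to_top_m) <= high

print(is_person(1.3))  # ~1.7 m tall: a person
print(is_person(2.6))  # ~0.4 m tall: a cart, a dog, a box
```

A real system would of course look at a whole region of the depth map rather than a single reading, but the principle is the same.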
Depth can be obtained in several different ways, and we had to choose which one to use for the new counting. We identified four priority directions for further development.
1. Use a stereo attachment on the camera. One option for obtaining depth data is to mount a stereo attachment on the video camera's lens, which splits the image in two. The next step is to write an algorithm that processes the doubled image, matches corresponding reference points between the two halves, and produces a stereo pair from which distances can be computed and a depth map built.
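The geometry behind any such stereo pair is the classic triangulation relation: once matching points are found, their horizontal offset (disparity) gives the distance. A minimal sketch, with illustrative numbers that are not from the article:

```python
# Classic pinhole stereo relation: Z = f * B / d, where f is the focal
# length in pixels, B the baseline between the two views in metres, and
# d the disparity (pixel offset of a matched point between views).
def depth_from_disparity(focal_px: float, baseline_m: float,
                         disparity_px: float) -> float:
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Example: 700 px focal length, 6 cm baseline, 14 px disparity
print(depth_from_disparity(700, 0.06, 14.0))  # ~3.0 metres
```

The hard part in practice is not this formula but reliably matching the reference points between the two halves of the image.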
Looking ahead, we note that options involving hardware design (be it an attachment or a device) appealed to us less. We are developers; our expertise is writing algorithms and programs, and we did not want to take on work in which we are not professionals.
First, we looked for ready-made attachments to test our reasoning in practice, but we failed to find an option that suited us in all respects. Then we printed our own sample attachment on a 3D printer, but that was also unsuccessful.

In the process it also became clear that stereo attachments reduce the versatility of the solution, since not every camera can accept one. We would have had to either produce a line of different attachments or limit users in their choice of cameras.
Another limitation was that the stereo image halves the camera's field of view. Entrances and exits are sometimes wide enough that several ordinary cameras are installed to cover them; halving each camera's field of view would make life even harder for installers and raise the cost of the system. In addition, some users want not only to count visitors but also to get an overview picture from their camera.
2. Synchronize images from two cameras. In fact, we abandoned this option almost immediately. First, it would significantly raise the cost of the solution for the user. Second, solving the problem of synchronizing frames from two different cameras seemed inexpedient. Implementing it would require assembling a device with two identical cameras installed in fixed positions at a fixed distance from each other. Most importantly, we would need to receive matching frames from both cameras simultaneously. That is harder than it sounds: cameras have internal delays, and if frame processing happens at the software level, how do you determine at which millisecond a particular frame arrived relative to the frame from the other camera?
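To make the synchronization problem concrete, here is a toy sketch of one common software-level approach: pairing frames from two independent streams by nearest timestamp. The function name, the tolerance, and the sample timings are all invented for illustration; they are not from the article.

```python
# Hypothetical sketch: greedily pair frames from two cameras by the
# closest timestamps; frames with no partner within max_skew_ms are
# dropped. Real cameras drift and jitter, which is exactly the problem.
def pair_frames(ts_a, ts_b, max_skew_ms=20):
    pairs = []
    j = 0
    for ta in ts_a:
        # advance j while the next timestamp in ts_b is strictly closer
        while j + 1 < len(ts_b) and abs(ts_b[j + 1] - ta) < abs(ts_b[j] - ta):
            j += 1
        if ts_b and abs(ts_b[j] - ta) <= max_skew_ms:
            pairs.append((ta, ts_b[j]))
    return pairs

a = [0, 40, 80, 120]   # camera A frame times, ms (25 fps, steady)
b = [5, 47, 95, 121]   # camera B drifts and jitters
print(pair_frames(a, b))
```

Even this naive pairing assumes both cameras share a clock; without hardware synchronization, establishing that shared clock is itself a hard problem.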
We decided to work out other options.
3. Use the Microsoft experience. While studying the subject of depth images, we found a study from Microsoft. It described a method of using a camera's infrared illumination to determine distance. In general, the option was very interesting: take any camera with IR illumination, light up the area, estimate the brightness, and obtain the required depth data; the more brightly lit an area is, the closer it is to the camera. But it turned out that this method only solves narrow problems well, such as gesture recognition, because different materials reflect light to different degrees.
So objects may actually be at the same distance, but because their materials absorb and reflect light differently, the distances to them are interpreted differently. We verified this experimentally in the office by building a depth map of objects made from different materials placed at the same level:

As a result, this algorithm works well only on a single uniform material, whereas in our case we are dealing with a completely heterogeneous scene.
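The office experiment can be reproduced numerically with a toy model. Measured IR brightness falls off with the square of distance but also scales with the surface's reflectivity (albedo), so inverting brightness to distance under one assumed albedo mis-ranges everything else. All numbers below are illustrative, not measurements from the article.

```python
import math

# Toy intensity-ranging model: brightness ~ albedo / distance^2.
def measured_brightness(distance_m: float, albedo: float) -> float:
    return albedo / distance_m ** 2

def naive_distance(brightness: float, assumed_albedo: float = 0.5) -> float:
    """Inverts the model while (wrongly) assuming one fixed albedo."""
    return math.sqrt(assumed_albedo / brightness)

# Two objects at the same 2 m distance, made of different materials:
white_cloth = measured_brightness(2.0, albedo=0.8)
dark_leather = measured_brightness(2.0, albedo=0.2)
print(naive_distance(white_cloth))   # underestimated, ~1.58 m
print(naive_distance(dark_leather))  # overestimated, ~3.16 m
```

On a single material the assumed albedo can simply be calibrated, which is why the method works for narrow tasks like gesture recognition but not for a stream of people in arbitrary clothing.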
4. Use structured light. Structured light also works with infrared rays. But where the previous option emits rays and measures how brightly they illuminate the surfaces they hit, this one projects a pattern (for example, circles) onto the surface. The reflected pattern is read back, and from its size and distortion you can tell how far a given object is from the emitter. In the option from item 3, the map is built from ray intensity (which depends directly on the surfaces' reflectivity); here it is built from the structure of the reflected pattern, and brightness is not taken into account.
This option seemed the most advantageous to us. Moreover, we managed to find a suitable ready-made hardware device with structured light, which meant we did not have to do what is not our specialization (construct the device itself). We only had to do our job: write the processing algorithm.
We wrote the first prototype on Kinect (Microsoft's motion-sensing controller for gesture recognition). Our expectations were confirmed and the chosen approach proved workable: the device produced a depth map of acceptable accuracy. Later, however, it turned out that Kinect was not convenient for our use case in every respect. First, it is a USB device, which does not fit the infrastructure of our users (IP video systems); we would have had to build something on top of it or supply a USB-to-network adapter. The second significant limitation was that Kinect has no onboard computing power. Given that a raw depth map is quite large, transmitting it over the network without onboard processing and compression was problematic.
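A back-of-the-envelope calculation shows why a raw depth stream is heavy and why compressing it on the device matters. The resolution, bit depth, and frame rate below are assumed, Kinect-like values, and the synthetic frame is only a stand-in for a smooth indoor depth scene.

```python
import zlib

# Assumed Kinect-like stream parameters: 640x480, 16-bit depth, 30 fps.
W, H, BYTES_PER_PX, FPS = 640, 480, 2, 30
raw_bps = W * H * BYTES_PER_PX * FPS
print(f"raw stream: {raw_bps / 1e6:.1f} MB/s")   # 18.4 MB/s

# Depth varies smoothly across indoor scenes, so a generic compressor
# already shrinks a frame substantially (synthetic smooth frame here):
frame = bytes((x // 8) % 256 for x in range(W * H * BYTES_PER_PX))
packed = zlib.compress(frame, level=6)
print(f"one frame: {len(frame)} -> {len(packed)} bytes")
```

Even this generic compression makes the point: without computing power on the device to do it, every camera would push tens of megabytes per second onto the network.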
The final implementation of 3D counting includes a different structured-light device. It has its own computing power, which in the current implementation is used to compress the depth map, thereby unloading the network. We already described it in detail in the article "Deep Calculation. How do 3D technologies help to count people and make life easier?"

P.S. Developing a new solution is not always just uninterrupted code writing. Searching, reading, experimenting, printing on a 3D printer, bringing something from home to test a theory: this is the real development process. And often, to create something new and groundbreaking, you need to step away from the usual patterns of work.