
How we won the Intel RealSense Hackathon

A while ago I wrote on Habr about various technologies for obtaining a 3D image from a single camera. I ended that article with the words: "I myself, however, have not yet come across any of these cameras, which is a pity and an annoyance."
And then, less than a year later, Intel held a seminar and hackathon in Moscow on the new generation of its 3D cameras (Intel RealSense). Curiosity won out: a colleague and I signed up for the event. As it turned out, not in vain: we won the hackathon and received the developer version of the camera, which we are now putting through its paces.



This article covers two things:
  1. The camera, its pros and cons: what can be done with it, and which tasks it is not suited for.
  2. The concept we proposed at the hackathon and for which we took first place.


Camera


There is a lot to say about the camera. It turned out to be more interesting than expected, not so much in terms of image quality as in terms of the embedded mathematics. The big points:
First, instead of time-of-flight technology there is a return to structured light. I think the accuracy at the same price is higher this way. Of course, ToF has more potential, but a high-quality ToF sensor is still too expensive. The structured-light pattern is interesting and changes over time:

[Image: the structured-light pattern]

To be honest, I do not understand how they remove this pattern from the IR stream. Probably some kind of inter-frame approximation, but either way it is not noticeable.

Second, very nice math for tracking fingers and facial contours. Face contours are extracted with active appearance models, but the algorithms are extended into 3D. This made the tracking more stable and accurate. An interesting point: the face tracking also works on ordinary cameras. The accuracy is, of course, lower, but in good light it is not bad. I liked the idea. As far as I know, there simply was no stable active appearance model implementation you could take and use for free (although, of course, the 4-gigabyte SDK is off-putting).



Face tracking happens in two stages. First some fast algorithm (probably Haar) finds the region containing the face, then the face is fitted with the active appearance model; a rough sketch of the first stage is shown below. The process is robust to facial expressions and glasses, but not to head rotations of more than 15-20 degrees.
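The first stage is essentially what a stock OpenCV Haar cascade does, so here is a minimal sketch of that stage alone. This is not Intel's code: it is plain OpenCV, and the cascade file is the standard one shipped with OpenCV.

```cpp
// Sketch of the *first* stage only: coarse face localization with a Haar
// cascade (plain OpenCV, not the RealSense SDK). The second stage -- fitting
// the active appearance model in 3D -- is hidden inside the SDK.
#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/videoio.hpp>
#include <vector>

int main() {
    cv::CascadeClassifier faceCascade;
    // Standard cascade shipped with OpenCV; the path may differ on your system.
    if (!faceCascade.load("haarcascade_frontalface_default.xml"))
        return 1;

    cv::VideoCapture cap(0);            // any webcam works for this stage
    cv::Mat frame, gray;
    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        cv::equalizeHist(gray, gray);

        std::vector<cv::Rect> faces;
        faceCascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(80, 80));

        // The SDK would now fit the active appearance model inside each rect.
        for (const cv::Rect& r : faces)
            cv::rectangle(frame, r, cv::Scalar(0, 255, 0), 2);

        cv::imshow("faces", frame);
        if (cv::waitKey(1) == 27) break; // Esc to quit
    }
    return 0;
}
```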

I also liked the finger tracking. It is not perfectly stable, but the system is predictable, and you can build quite workable applications on top of it. It is probably less accurate than the Leap Motion, but it has a larger field of view. The solution is not ideal: ambiguities arise and the hand model sometimes gets fitted crookedly. Some of the built-in gestures are recognized every time; others are not recognized until the system has first seen an open hand. In the video below I tried to show the problems.



In my opinion, the potential is already there. If the quality of hand tracking improves by another factor of one and a half, this kind of control will be comparable to a touchpad.

Third, I would like to note that Kinect and RealSense occupy different niches. Kinect is aimed at large spaces and working with a person from a distance, while RealSense is for direct interaction with the system it is installed on. In many ways this determines the parameters of the 3D sensor.

Minuses


It is not without drawbacks. Since the current version is not final yet, I would like to hope they will be fixed.
The first awkward point is that the drivers are still raw. For the colleague I went with, at some point the camera refused to work at all. Only removing all the drivers and reinstalling them several times helped.

During initialization of the video stream an error periodically occurs: the whole application hangs for 20-30 seconds and does not start. The error reproduced on two computers.

The second point concerns face recognition. In my opinion, a lot of information is missing:
1) The eyes are the window to the soul. Why is there no explicit gaze direction? There is the orientation of the face, and there is the position of the pupils (from which, in theory, a direction can be derived). But once the head is rotated by more than about 5 degrees, the pupil position starts being approximated by the center of the eye, although this is not documented anywhere. I would like gaze direction to be exposed explicitly in the API.
2) Face tracking works in only two modes, FACE_MODE_COLOR_PLUS_DEPTH and FACE_MODE_COLOR. Why is there no FACE_MODE_IR_PLUS_DEPTH, or at least FACE_MODE_IR? In low light the tracking stops working. Why can't tracking use the stream in which the face is always clearly and stably visible? Many people like to sit in front of a computer in a dark room. A sketch of where this mode is selected is shown below.
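For reference, this is roughly where the mode is selected. It is a from-memory sketch of the F200-era C++ API (PXCSenseManager and friends), so exact names may differ slightly in your SDK version; the point is simply that SetTrackingMode only offers the two color-based options.

```cpp
// Sketch of choosing the face tracking mode in the F200-era RealSense SDK
// (written from memory; exact class/method names may differ). The complaint
// above: SetTrackingMode only accepts the two COLOR-based modes -- there is
// no IR-based option to fall back to in the dark.
#include "pxcsensemanager.h"
#include "pxcfaceconfiguration.h"

int main() {
    PXCSenseManager* sm = PXCSenseManager::CreateInstance();
    sm->EnableFace();

    PXCFaceModule* face = sm->QueryFace();
    PXCFaceConfiguration* cfg = face->CreateActiveConfiguration();

    // Only COLOR and COLOR_PLUS_DEPTH exist; FACE_MODE_IR_PLUS_DEPTH is what we missed.
    cfg->SetTrackingMode(PXCFaceConfiguration::FACE_MODE_COLOR_PLUS_DEPTH);
    cfg->ApplyChanges();
    cfg->Release();

    if (sm->Init() >= PXC_STATUS_NO_ERROR) {
        // ... AcquireFrame()/ReleaseFrame() loop, reading PXCFaceData ...
    }
    sm->Release();
    return 0;
}
```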

The third point is the architecture. Maybe we missed something, but we did not manage to run face recognition and hand recognition at the same time: each of these subsystems had to be initialized separately.
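For context, here is what we expected to be able to do: enable both modules on a single pipeline. Again a from-memory sketch of the F200-era API rather than working code; in our hands the combined initialization did not behave, and we ended up initializing each module in its own pipeline.

```cpp
// What we expected to work: one SenseManager pipeline with both modules enabled.
// (From-memory sketch; in practice face and hand tracking had to be
// initialized separately.)
#include "pxcsensemanager.h"

int main() {
    PXCSenseManager* sm = PXCSenseManager::CreateInstance();
    sm->EnableFace();   // face landmarks / pose
    sm->EnableHand();   // hand skeleton / gestures

    pxcStatus st = sm->Init();   // this is where things went wrong for us
    if (st < PXC_STATUS_NO_ERROR) {
        // one module refuses to start together with the other
    }
    sm->Release();
    return 0;
}
```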

The fourth downside: not all of the advertised frame rates actually work. You can, of course, try to request 640x480 video at 300 fps from the camera, but nothing is captured or saved at that quality. I would like to see a list of the modes that actually work.
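Here, once more as a from-memory sketch of the same API, is how a specific mode is requested; the only feedback you get about whether a mode really works is that Init() fails.

```cpp
// Sketch of requesting an explicit color-stream mode and checking whether the
// pipeline actually starts (from-memory F200-era API; names may differ).
// Asking for an advertised-but-unsupported mode, e.g. 640x480 @ 300 fps,
// simply makes Init() fail -- there is no list of working modes to consult.
#include "pxcsensemanager.h"
#include "pxccapture.h"

int main() {
    PXCSenseManager* sm = PXCSenseManager::CreateInstance();

    // Request a specific resolution and frame rate for the color stream.
    sm->EnableStream(PXCCapture::STREAM_TYPE_COLOR, 640, 480, 60);

    if (sm->Init() < PXC_STATUS_NO_ERROR) {
        // The requested mode is not actually available -- try another one.
    }
    sm->Release();
    return 0;
}
```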

The fifth minus is somewhat specific to the field we often work in: biometrics. If the laser wavelength were 800-900 nm rather than the 600-700 nm used in the camera, more of a person's biometric features would be visible and a recognition system could be built directly on this camera.

About the hackathon and our victory


The hackathon started after an hour and a half of lectures and examples. There were 40 people in 13 teams. The format: "here is the camera, show a project in 6 hours." Given that any video analytics is very complicated, that is not much. On the other hand, it would be strange to hold such an event in any other format (in total the whole event lasted 8 hours, and by the end the participants were quite exhausted).

Some people had prepared for the hackathon thoroughly. Some took projects made for other purposes and heavily reworked them. All of the preparation Vasya and I did took about three hours: we sat, drank tea, and thought about what could be done. There were several ideas:

  1. Take one of our existing projects and bolt the 3D camera onto it. The problem was that all our projects dealt with 2D recognition/analytics, and integrating the camera into them would clearly take more than 4 hours. Besides, it was not clear how to present that nicely.
  2. Show a demonstration effect: some simple, classic application. Control a mouse or an airplane with your eyes or hand, draw a mustache and sideburns on a face. The drawback of this option is that it is rather empty and uninteresting. Polishing it takes a long time, and a simple version would not interest the audience.
  3. Show a qualitatively new idea, achievable only with this product. Clearly, in the 4 hours of a hackathon such a thing cannot be finished, but a demonstration effect can be shown. We liked this option the most. The main problem is coming up with such an idea.

One of the things I enjoy is analyzing a person's state. Perhaps you have read one of my earlier articles on Habr about it. Naturally, my thoughts went in the same direction. Analyzing movement through 3D is not very effective, but through 3D you can extract many characteristics of the person sitting in front of the camera. The question was how and where to apply that.

The answer turned out to be surprisingly trivial and obvious as soon as Vasya asked me: "how could a 3D camera help a car?" And then we were off: there is a lot a 3D camera in a car could do.

And the most surprising thing: nobody had done this yet. There are systems for detecting driver fatigue, and in the last couple of years they have even started using video cameras. But with a 3D camera you can determine the position of the driver's head much more precisely, which lets you monitor his reactions. Besides, detecting that the driver is falling asleep is one thing; analyzing his actions and assisting him is quite another.
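To give an idea of how little is needed for a first prototype, here is a toy sketch (not our hackathon code) of one piece of such logic: take the head pitch angle that the face module already reports and raise an alert when the head stays tilted down for too long. The class, thresholds, and fake data are all invented for illustration.

```cpp
// Toy sketch of "nodding off" detection from head pitch (degrees), the kind of
// signal the RealSense face module reports for the tracked head pose.
// Not our hackathon code; thresholds are invented for illustration only.
#include <cstdio>

class NodOffDetector {
public:
    // pitchDeg: head pitch in degrees (negative = head tilted down),
    // dtSec: time since the previous sample.
    bool update(float pitchDeg, float dtSec) {
        if (pitchDeg < kPitchThresholdDeg)
            downTimeSec_ += dtSec;       // head keeps drooping
        else
            downTimeSec_ = 0.0f;         // head is back up, reset
        return downTimeSec_ > kMaxDownSec;
    }

private:
    static constexpr float kPitchThresholdDeg = -20.0f; // "head down" angle
    static constexpr float kMaxDownSec = 1.5f;          // how long is too long
    float downTimeSec_ = 0.0f;
};

int main() {
    NodOffDetector detector;
    const float dt = 1.0f / 30.0f;       // ~30 fps face tracking
    // Simulate 3 seconds: first second awake (pitch ~ -5 deg),
    // then the head drops to -35 deg and stays there.
    for (int i = 0; i < 90; ++i) {
        float pitch = (i < 30) ? -5.0f : -35.0f;
        if (detector.update(pitch, dt))
            std::printf("t=%.2fs  Wake up! Head has been down too long.\n", i * dt);
    }
    return 0;
}
```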
In four hours we put together a simple demonstration:



For this idea and demonstration we were, somewhat unexpectedly, awarded first place.

If anyone needs the sketches from our six hours of work, here they are. EmguCV is hooked up, plus a fair amount of spaghetti code. The idea is great, but how to approach a problem of this scale and level of integration is unclear. Still, such technologies may well become a stepping stone toward self-driving cars.

Source: https://habr.com/ru/post/249009/

