
How we developed a device for monitoring driver attention: the Yandex.Taxi experience



A taxi ride should be comfortable and safe. This depends not only on the quality of the car and the service, but also on how well the driver concentrates, and concentration drops when a driver is overworked. That is why, at the service level, we limit the time a driver can spend behind the wheel.

But sometimes drivers start a shift already tired: for example, a person has been busy at another job all day and decides to drive a taxi in the evening. What can be done about it? How do you tell that a driver has started a shift without getting enough sleep? You can, for example, try to assess how closely they watch the road and look for signs of fatigue, say, in the way they blink. Sounds simple? It is more complicated than it seems.

Today we will tell the readers of Habr, for the first time, how we came up with and developed a camera that can do exactly this.

So, here is the premise: the frequency and duration of blinks depend on the degree of fatigue. When we are tired, the head becomes less mobile, the direction of gaze changes less often, we blink more frequently and keep our eyes closed for longer. The differences can amount to fractions of a second or a few degrees of rotation, but they do exist. Our task was to design a device that would analyze blinks, gaze direction, yawns and head movements in order to assess the driver's attention and fatigue level.

First we decided: let's write a laptop application, install it for volunteers among our employees, and have it use the built-in camera to track the signs we need. That way we would immediately collect a large amount of data for analysis and quickly test our hypotheses.

Spoiler: it didn't work! It quickly became clear that most people working at a computer constantly look down at the keyboard and tilt their heads. The eyes are simply not visible, and you cannot even tell whether they are closed or open, whether the person is blinking or just glancing from the screen to the keyboard and back.



Then we realized that even to build a prototype we needed some kind of dedicated device. We bought the first available IP camera that works in the infrared range.

Why infrared? Lighting conditions vary: sometimes the user is in the shade, sometimes the light comes from behind or from above, and sometimes there is no light at all. If we are building a measuring device, it should work the same way under any conditions.

For the experiment, a fairly popular Xiaomi camera, the CHUANGMI, turned out to be a good fit.



It turned out that it shoots at 15 frames per second, and we needed twice that: a blink lasts from 30 to 150 ms, and at 15 frames per second we risked missing blinks shorter than 60–70 ms. So we had to modify the firmware to force the IR illumination on, get direct access to the video stream and squeeze out the required 30 frames per second. Having connected the camera to a laptop and configured it to deliver the video stream over RTSP, we started recording the first videos. The camera was placed 15 cm below the laptop's built-in camera, which gave it a much better view of the user's eyes.
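For reference, grabbing frames from such a camera over RTSP takes only a few lines with OpenCV. Here is a minimal sketch (the stream URL is a placeholder, not the real camera's address), with the frame-interval arithmetic behind the 30 fps requirement in the comments:

```python
# Minimal RTSP capture sketch; the URL below is a placeholder.
import cv2

RTSP_URL = "rtsp://192.168.1.10:554/stream1"  # hypothetical camera address

cap = cv2.VideoCapture(RTSP_URL)
if not cap.isOpened():
    raise RuntimeError("could not open the RTSP stream")

# At 15 fps the interval between frames is ~66.7 ms, so a 30-60 ms
# blink can fall entirely between two frames; at 30 fps the interval
# drops to ~33.3 ms, short enough to catch almost every blink.
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
print(f"stream reports {fps:.0f} fps, frame interval ~{1000 / fps:.1f} ms")

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # ... hand the frame off to recording / analysis ...

cap.release()
```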

Success? Again, no. After collecting several hundred videos, we realized they were useless. The daytime behavior of a laptop user differs from that of a driver: a person can get up at any moment, step out to eat, walk around or stretch, while a driver spends far more time in a seated position. So this data did not suit us.

It became clear that the only way forward was to build or buy a suitable camera and install it in a car.

It would seem elementary: buy a dashcam, turn it toward the driver, mount it in the car, and once a week collect the SD cards with the recordings. But in reality everything turned out to be not so simple.

Firstly, it is extremely difficult to find a dashcam with IR illumination, and we needed to see the face well, especially at night.

Secondly, all dashcams have wide-angle lenses, so the area containing the driver's face comes out rather small and there is nothing to make out on the recording. On top of that, lens distortion badly skews the analysis of head position and gaze direction.

Thirdly, this scheme scales poorly to ten, a hundred or more cars. We needed to collect a lot of data from different drivers in order to analyze it and draw conclusions, and manually swapping memory cards in a hundred cars every week, let alone every day, is a huge waste of time. We even tried to find a camera that would upload videos to the cloud, but there was nothing like that on the market.

There was even an idea to build "our own dashcam" from a Raspberry Pi, a camera with IR illumination and some mounts.



The result was not quite what we expected: the assembly was bulky, and the camera could not be installed separately from the computer. The problem is that with a cable longer than 50 cm, signal issues began, and the CSI ribbon cable itself is quite fragile and too wide, so it is poorly suited for installation in a car.

"We must go to Hong Kong," we decided. The goal of the trip was fairly open-ended: see what various manufacturers were doing in driver behavior analysis, buy product samples if we found any, and look for suitable technical solutions and components that we could install in cars.

We went straight to two popular electronics and components exhibitions. In the automotive electronics pavilion we saw an overwhelming dominance of dashcams, rear-view cameras and ADAS systems, but almost no one was working on driver behavior analysis. Prototypes from a few manufacturers detected falling asleep, distraction, smoking and talking on the phone, but nobody had even thought about fatigue.

In the end, we bought several camera samples and single-board computers. Two things became clear: 1) there were no off-the-shelf products that suited us; 2) the computer and the camera had to be separated so as not to obstruct the driver's view. So we took a camera board with a USB interface, a Banana Pi single-board computer as the computing unit, and, along the way, several Android media players based on Amlogic processors.



"Why media players?" you ask. In fact, the S912 and even the S905 are quite capable performance-wise and can easily handle video recording for our purposes, even with image analysis right on the device. On-device image analysis was needed so that we would not have to send the entire video stream to the server.

Let's count: a minute of well-compressed 640 × 480 H.264 video (30 FPS) takes at least 5 megabytes. That means about 300 megabytes per hour, and for a standard 8-hour shift, about 2–3 gigabytes.
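The same back-of-the-envelope arithmetic in code, for anyone who wants to play with the numbers:

```python
# Estimated video volume per shift at the article's figures.
mb_per_minute = 5                      # well-compressed 640x480 @ 30 fps H.264
mb_per_hour = mb_per_minute * 60       # 300 MB/hour
gb_per_shift = mb_per_hour * 8 / 1024  # ~2.3 GB per 8-hour shift
print(f"{mb_per_hour} MB/hour, {gb_per_shift:.1f} GB/shift")
```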

Uploading 3 gigabytes of video every day over an LTE modem is very "expensive". So we decided to record short 5-minute videos periodically, analyze everything happening in the car right on the device, and upload the results to our servers as a parsed stream of events: a set of facial points, gaze direction, head rotation, and so on.
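The article does not publish the actual wire format, so the record below is purely illustrative (all field names are our assumptions). The point is that even with all 68 landmark coordinates, one event per frame weighs far less than the video frame it describes, especially once batched and compressed:

```python
# A hypothetical per-frame event record; the schema is illustrative,
# not the actual production format.
import json
import time

event = {
    "ts": time.time(),                      # frame timestamp
    "face_detected": True,
    "landmarks": [[312, 204], [318, 203]],  # truncated: 68 (x, y) points in full
    "gaze": {"yaw": -4.2, "pitch": 1.5},    # degrees
    "head": {"yaw": 12.0, "pitch": -3.1, "roll": 0.8},
    "left_eye_closed": False,
    "right_eye_closed": False,
    "yawning": False,
}
print(len(json.dumps(event)), "bytes")      # a few hundred bytes per frame
```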

We returned from the exhibitions in high spirits, brought back a pile of necessary (and unnecessary) gadgets, and understood how we would proceed with the prototype.

The USB camera we found in Hong Kong suited us almost perfectly: a 38 × 38 mm board, a standard 12 mm lens mount, and the ability to solder IR illuminators directly onto the board.



So we immediately asked the manufacturer to build us a prototype with the components we needed. Now we understood what we wanted: a USB camera with IR illumination and a single-board PC for video processing. We decided to try everything on the market and arranged a shopping spree on AliExpress: four dozen different cameras, a dozen single-board PCs, Android media players, a collection of 12 mm lenses and many other strange devices.



The hardware question was settled. What about the software?

Quite quickly, we got a simple OpenCV-based prototype up and running: it records video, finds the driver's face, marks 68 key points on it, and recognizes blinking, yawning, head turns, and so on.
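To give an idea of what is inside such a prototype, here is a minimal sketch of the classic recipe: dlib's publicly available 68-point landmark model plus the eye aspect ratio (EAR) metric for detecting closed eyes. The 0.2 threshold is a common starting value, not our tuned production parameter, and for brevity the sketch uses dlib's own face detector instead of the Haar cascade we mention later:

```python
# Blink detection sketch: 68-point landmarks + eye aspect ratio (EAR).
import cv2
import dlib
from scipy.spatial import distance

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def eye_aspect_ratio(eye):
    # EAR = (|p2-p6| + |p3-p5|) / (2 * |p1-p4|); it drops toward zero
    # when the eye closes (Soukupova & Cech, 2016).
    a = distance.euclidean(eye[1], eye[5])
    b = distance.euclidean(eye[2], eye[4])
    c = distance.euclidean(eye[0], eye[3])
    return (a + b) / (2.0 * c)

frame = cv2.imread("driver.jpg")  # in practice, a frame from the video stream
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
for face in detector(gray):
    shape = predictor(gray, face)
    pts = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    right_eye, left_eye = pts[36:42], pts[42:48]  # iBUG 68-point indexing
    ear = (eye_aspect_ratio(left_eye) + eye_aspect_ratio(right_eye)) / 2.0
    if ear < 0.2:  # eyes are likely closed on this frame
        print(f"eyes closed, EAR = {ear:.3f}")
```

Tracking how many consecutive frames EAR stays below the threshold is what separates a normal 30–150 ms blink from the long eye closures that signal fatigue.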

The next task was to make the prototype run on a single-board PC. The Raspberry Pi was eliminated immediately: too few cores and a weak processor, you cannot squeeze more than seven frames per second out of it, and simultaneously recording video, detecting a face and analyzing it was out of the question. For the same reasons, the set-top boxes and single-board computers on the Allwinner (H2, H3, H5), Amlogic S905 and Rockchip RK3328 did not fit either, although the latter came quite close to the required performance. That left us with two candidate SoCs: the Amlogic S912 and the Rockchip RK3399.

For the Amlogic, the choice of devices was small: a TV box or the Khadas VIM2. Everything ran the same on the TV boxes and the Khadas, but the cooling of the TV boxes left much to be desired, and setting up Linux on them is not for the faint of heart: getting Wi-Fi and Bluetooth to work and making the OS see all the memory is long, difficult and unpredictable. In the end, we chose the Khadas VIM2: it comes with a proper heatsink, and the board is compact enough to hide behind the car's dashboard.



By this point, the camera board manufacturer had already sent us a test batch of one hundred units, and we were eager for battle: build a prototype, put it in a car and start collecting data.

We had a camera, we had software, we had a single-board PC, but we had not the slightest idea how to put all this into a car and connect it to the onboard power supply.

Obviously, the camera needed an enclosure and a mount. We bought two 3D printers at once to print the parts, and a contractor made us the first primitive model of the case.



Then came the difficult question of placement: where to mount the camera in the car to get a good picture without obstructing the driver's view. There were exactly three options:

  1. In the middle of the windshield.
  2. On the left A-pillar.
  3. On the rearview mirror.



At the time, it seemed best to attach the camera directly to the rear-view mirror: the mirror always points at the driver's face, so the camera would capture exactly what we needed. Unfortunately, rear-view mirror manufacturers never intended for anything to be attached to them conveniently and reliably. The cameras held poorly, fell off and blocked the view.



Nevertheless, we equipped several cars and began collecting data from them. It became clear that the design was imperfect, and performance and overheating problems surfaced when recording and analyzing the face at the same time.

Then we decided to mount the camera at eye level on the left A-pillar: it obstructs the view less and gives the camera a good angle on the driver. The case had to be redone, as the hinged mounts proved extremely unreliable: they came loose from shaking, broke, and their suction cups peeled off the glass.



We decided that for the prototype and data collection it was better to glue the cameras firmly to the glass so that no shaking or external influence could change their position. We slightly modified the case and, at the same time, load-tested the mounting on special double-sided tape. Complex and high-precision equipment was used for the testing.



Because of the performance issues, we decided to switch to a more powerful SoC and chose the NanoPi M4 single-board PC built on the Rockchip RK3399 processor.

Compared to the Khadas VIM2, it is about a third faster, has hardware video encoding and decoding, and behaves far more stably under difficult temperature conditions. Yes, we ran the cameras and boards in a freezer, heated them in an oven and subjected them to many other inhuman tests.



Since we record video not in isolation but as a timeline across the whole day, it was important for the system time on the device to be accurate. Unfortunately, most single-board computers do not have a battery-backed real-time clock. We were lucky that our NanoPi had an RTC battery connector.

We had to design a case for the computer that would physically protect it and double as a holder for the Wi-Fi and Bluetooth antennas. We also provided a spot in it for the coin-cell battery and its holder.



Next, we planned to equip one hundred cars with prototypes that would record video and stream all the telemetry to the cloud in real time: whether a driver is present, how often and for how long they blink, whether they yawn, get distracted from the road, turn their head, and so on. All these parameters (and more) let us train a model that evaluates how focused the driver is on the road and whether they are distracted or tired. To do all of this right on the device in the car, we had to completely rewrite the code: add hardware video compression, rotate logs and video recordings, regularly ship them to the server, update the software remotely, and much more.
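The production pipeline itself is not shown in the article; below is a simplified sketch of the segment-rotation idea, with illustrative paths and parameters. The real device used the SoC's hardware H.264 encoder, whereas this sketch falls back on OpenCV's software writer for brevity:

```python
# Segment-rotation sketch: record fixed-length clips into an outbox
# that a separate uploader drains. Paths and codec are illustrative.
import os
import time
import cv2

SEGMENT_SECONDS = 5 * 60             # the article's 5-minute clips
OUTBOX = "/var/spool/camera/outbox"  # hypothetical upload directory

def record_segment(cap):
    """Record one fixed-length clip and leave it for the uploader."""
    os.makedirs(OUTBOX, exist_ok=True)
    path = os.path.join(OUTBOX, f"seg_{int(time.time())}.avi")
    out = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"XVID"),
                          30.0, (640, 480))
    deadline = time.time() + SEGMENT_SECONDS
    while time.time() < deadline:
        ok, frame = cap.read()
        if not ok:
            break
        out.write(frame)
    out.release()
    return path
```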

At the same time, it became clear that our calculations and algorithms would work much better with more accurate basic facial analysis. In the first prototypes, we used the face detector built into OpenCV, based on Haar cascades, and a 68-point facial landmark model from the dlib library. We computed the head position ourselves by projecting the facial points onto the focal plane. Open-source face detection and landmarking solutions work well on frames where the face is shot frontally or in profile, but at intermediate angles they often make mistakes.
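Computing head pose by projecting facial points is a standard technique; here is a sketch of how it can be done with cv2.solvePnP and a generic 3D face model. The millimeter coordinates of the model are a widely used approximation of an "average" face, not measured values:

```python
# Head-pose sketch: fit a generic 3D face model to 2D landmarks.
import cv2
import numpy as np

# Six reference points of an "average" face, in millimeters.
MODEL_3D = np.array([
    (0.0, 0.0, 0.0),           # nose tip
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left eye, outer corner
    (225.0, 170.0, -135.0),    # right eye, outer corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0),   # right mouth corner
], dtype=np.float64)

def head_pose(image_points, frame_w, frame_h):
    """image_points: (6, 2) float64 array of the same landmarks in pixels."""
    # Approximate intrinsics: focal length ~ frame width, principal
    # point at the image center, zero lens distortion.
    f = float(frame_w)
    camera = np.array([[f, 0, frame_w / 2],
                       [0, f, frame_h / 2],
                       [0, 0, 1]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(MODEL_3D, image_points, camera, None)
    rot, _ = cv2.Rodrigues(rvec)
    angles, *_ = cv2.RQDecomp3x3(rot)  # Euler angles in degrees
    pitch, yaw, roll = angles          # x-, y-, z-axis rotations
    return pitch, yaw, roll
```

With the 68-point annotation, the six image points are usually taken from landmarks 30 (nose tip), 8 (chin), 36 and 45 (outer eye corners), and 48 and 54 (mouth corners).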

So we decided to license a good third-party face detection and landmarking solution, the VisionLabs SDK. Compared to the previous algorithms it is more resource-intensive, but it gives a noticeable gain in the quality of detection and landmarking, which in turn yields more accurate features for machine learning. With the help of our colleagues at VisionLabs, we switched to their SDK quickly and reached the performance we needed: 30 frames per second at 640 × 480.

The VisionLabs SDK uses neural networks for face analysis. The technology processes each frame, finds the driver's face in it and returns the coordinates of the eyes, nose, mouth and other key points. From this data, a normalized 250 × 250 frame is built with the face strictly in the center. This frame can then be used to compute the head position in degrees along three axes: yaw, pitch and roll. To track the state of the driver's eyes, the system analyzes the image of each eye and decides whether it is closed or open. Using IR Liveness technology, the system can also determine whether there is a live person in front of the camera or the driver has attached a photo: a normalized frame goes in, and the output is a verdict of "alive" or "not alive".

Conclusion


While we were rewriting and debugging the software, our 3D printers were printing camera and single-board PC cases day and night. Printing one kit (camera body plus PC case) took about 3–4 hours of printer time, so we had to expand production capacity to four printers. But we managed to stay on schedule.



Within two weeks, we fully equipped the first hundred cars across several taxi fleets that partner with Yandex.Taxi. Now we use them to collect video, analyze driver behavior and signs of fatigue, refine the algorithms and train models that evaluate attention and fatigue levels. Only after that (taking into account all the data and feedback from drivers and passengers) will we be ready to move to the next stage: mass production and deployment.

Unfortunately, the current technical solution is poorly suited for scaling to several thousand or tens of thousands of installations, for a number of reasons. Everything described in this article was a quick experiment whose purpose was to learn, as fast as possible, to collect data directly from cars in order to train the models. Our next big stage is to develop and put into production a device of the same dimensions but consisting of a single unit: the camera, sensors and modem will live in one compact case that we will install in cars at scale.

Source: https://habr.com/ru/post/461137/
