Voice-guided camera tracking becomes more affordable: the universal SmartCam A12 Voice Tracking solution

Tracking the active speaker in a videoconference has gained a lot of momentum over the past few years. Technology now makes it possible to run complex audio and video processing algorithms in real time, which led Polycom, almost 10 years ago, to present the world's first mass-market solution with intelligent automatic speaker tracking. For several years Polycom remained the only vendor with such a solution, but Cisco soon brought its own intelligent two-camera system to market, and it became a worthy competitor to the Polycom product. For many years this segment of video conferencing was limited to a few proprietary products. This article is devoted to the first universal voice-guided camera pointing solution that is compatible with both hardware and software video conferencing systems.
Before describing the solutions and demonstrating what they can do, I want to note an important event:
I am honored to introduce to the Habr community a new hub dedicated to video conferencing solutions. Thanks to joint efforts (mine and UFO's), video conferencing now has a home on Habr, and I invite everyone involved in this broad and timely topic to subscribe to the new hub.

Two scenarios for pointing the camera at the speaker

At the moment, integrators of video conferencing solutions choose between two ways of pointing the camera at the speaker:

Automatic - intelligent
Semi-automatic - programmable

The first option covers the solutions from Cisco, Polycom and other manufacturers, which we consider below. Here, pointing the camera at the active speaker is fully automated: vendor-specific audio and video processing algorithms let the camera choose the required position on its own.
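As a rough illustration of the audio side only, here is a minimal sketch of a textbook far-field direction estimate from the arrival-time difference between two microphones. It is a generic technique, not the patented algorithm of any vendor named here; the function name, the microphone spacing and the speed-of-sound constant are assumptions made for the example.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 °C


def bearing_from_delay(delay_s: float, mic_spacing_m: float) -> float:
    """Return the source bearing in degrees (0 = straight ahead).

    Far-field model for two microphones: sin(theta) = c * delay / spacing.
    A positive delay means the sound reached the reference microphone first.
    """
    if mic_spacing_m <= 0:
        raise ValueError("mic_spacing_m must be positive")
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1]; clamp it.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

With microphones 0.2 m apart, a delay of about 0.29 ms corresponds to a bearing of roughly 30 degrees. A real array combines several microphone pairs, and the products discussed here also add video analytics on top.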

The second option covers automation systems built on various external controllers. We will not consider them in detail, because this article is devoted to automatic speaker tracking.
The second scenario has quite a few supporters, and for good reason. Experienced integrators know that the intelligent solutions from Polycom and Cisco need ideal operating conditions for the automation to work properly. Such conditions cannot always be provided, so the following approach to camera pointing sometimes becomes the guarantee that the system works:
1. All the required presets (the pan/tilt position and the optical zoom factor) are entered manually in advance into the memory of the camera, or sometimes into the control controller. As a rule, these are a wide shot of the meeting room and a portrait shot of each conference participant.

2. Next, triggers for recalling the required preset are installed at the corresponding seats. These are either microphone consoles or wireless buttons; in general, any device that can send a clear signal to the control controller.

3. The control controller is programmed so that each trigger has its own preset. The wide shot of the room is used when all triggers are off.
As a result, when a conference (congress) system is used together with a control controller, for example, the speaker activates his personal microphone console before starting to speak, and the control system instantly recalls the saved camera position.
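The trigger-to-preset logic described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class and field names are hypothetical, the preset values are made up, and the rule that the most recently activated trigger wins when several are on is my assumption, not something a particular controller is known to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    pan: float   # pan angle in degrees
    tilt: float  # tilt angle in degrees
    zoom: float  # optical zoom factor


# Wide shot of the room (the "general plan"), used when no trigger is on.
WIDE_SHOT = Preset(pan=0.0, tilt=0.0, zoom=1.0)


class PresetController:
    """Maps triggers (microphone consoles, buttons) to stored camera presets."""

    def __init__(self, presets, wide_shot=WIDE_SHOT):
        self.presets = dict(presets)
        self.wide_shot = wide_shot
        self.active = []  # triggers currently on, oldest first

    def on_trigger(self, trigger_id, is_on):
        """Register a trigger change and return the preset to recall."""
        if trigger_id not in self.presets:
            raise KeyError(f"no preset stored for trigger {trigger_id!r}")
        if is_on and trigger_id not in self.active:
            self.active.append(trigger_id)
        elif not is_on and trigger_id in self.active:
            self.active.remove(trigger_id)
        return self.current_preset()

    def current_preset(self):
        # Assumption: the most recently activated trigger wins.
        if not self.active:
            return self.wide_shot
        return self.presets[self.active[-1]]
```

For example, a controller built with `PresetController({"mic1": Preset(-20, -5, 4)})` returns that portrait preset when `mic1` is switched on and falls back to the wide shot when it is switched off. There is no audio or video analysis anywhere in this loop, which is exactly why the scheme is so predictable.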

This scenario works reliably because the system does not need to perform voice triangulation or video analytics. You press the button and the preset is recalled, with no delays and no false positives.
Control and automation systems are used in large, complex rooms where not one but several video cameras may be installed. For small and medium-sized meeting rooms, automatic systems are quite suitable (if the budget allows).
Let's start with the founding fathers.

Polycom EagleEye Director

This solution once caused a sensation in the field of video conferencing. Polycom EagleEye Director was the first intelligent camera guidance solution. It consists of an EagleEye Director base unit and two cameras. The distinctive feature of this first implementation is that one camera is dedicated to the close-up of the speaker and the other to the wide shot of the meeting room. The general camera can even be placed away from the base unit, elsewhere in the meeting room, because it takes no direct part in automatic guidance.
The system works as follows:

1. The wide shot of the room is active: everyone is silent.
2. A speaker begins to talk. The microphone array picks up the voice, and the main camera turns toward the sound using a patented technology that includes voice triangulation. The general camera is still on air.
3. Only now does the main camera start looking for the sound source using video analytics. The system identifies the speaker by the combination of eyes, nose and mouth, frames the picture around the speaker, and puts the stream from the main camera on air.
4. The speaker changes. The microphone array detects that the voice is coming from another place, and the wide shot is switched on again.
5. Then the cycle repeats from step 2.
6. If the new speaker is in the same frame as the previous one, the system reframes on the fly, without switching the active stream to the wide shot.
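The switching sequence above can be read as a small state machine. The sketch below is my own reading of that description, not Polycom's implementation: the state names, event names and the rule that silence returns the system to the wide shot are hypothetical labels chosen for the example.

```python
# (current state, event) -> next state. All names are hypothetical labels.
TRANSITIONS = {
    ("wide", "voice_detected"): "searching",
    ("searching", "face_framed"): "closeup",
    ("closeup", "speaker_changed_out_of_frame"): "searching",
    ("closeup", "speaker_changed_in_frame"): "closeup",  # reframed on the fly
}

# Which camera is on air in each state.
ON_AIR = {
    "wide": "general camera",
    "searching": "general camera",  # main camera is still hunting for the face
    "closeup": "main camera",
}


def step(state, event):
    """Return the next state; unknown events leave the state unchanged."""
    if event == "silence":  # assumption: silence always returns to the wide shot
        return "wide"
    return TRANSITIONS.get((state, event), state)


def run(events, state="wide"):
    """Replay a list of events and return the camera shown after each one."""
    shown = []
    for event in events:
        state = step(state, event)
        shown.append(ON_AIR[state])
    return shown
```

Replaying a lively conversation through `run` makes the weak point visible: every out-of-frame speaker change passes through the "searching" state, so the general camera comes back on air each time.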

The downside, in my opinion, is that there is only one main camera. This causes a significant delay when the speaker changes. And every time the camera is being pointed, the system switches to the wide shot of the room; in a lively conversation this flickering becomes annoying.

Polycom EagleEye Director II

This is the second version of the Polycom solution, released relatively recently. The principle of operation has changed and has become more like the Cisco solution. Both PTZ cameras are now main cameras and are used to switch seamlessly from one speaker to another. The wide shot of the meeting room is now handled by a separate camera integrated into the body of the EagleEye Director II base unit. For some reason, the stream from this wide-angle camera is displayed in an additional window in the corner of the screen, occupying 1/9 of the main stream. The positioning principle is the same: voice triangulation plus video stream analysis. The weak points are the same too: if the system does not see a talking mouth, the camera will not point at the speaker. This can happen quite often: the speaker turned away, the speaker turned sideways, the speaker is a ventriloquist, the speaker covered his mouth with a hand or a document.
Both promo videos are carefully staged: two people speak in turns and open their mouths as if they were at a speech therapist's. Yet even in such idealized conditions the delay is very noticeable. The framing, on the other hand, is flawless: a comfortable portrait shot.

Cisco TelePresence SpeakerTrack 60

To describe this solution, I will use the text from the official brochure.
SpeakerTrack 60 takes a unique two-camera approach to switching quickly between participants. One camera quickly finds a close-up of the active speaker, while the other searches for and displays the next speaker. The MultiSpeaker function prevents unnecessary switching if the next speaker is already in the current frame.
Unfortunately, I did not have a chance to test SpeakerTrack 60 myself, so my conclusions rely on reports from the field and on the demo video. I counted a maximum delay of almost 8 seconds when pointing at a new speaker; the average delay, judging by the video, was 2-3 seconds.

HUAWEI Intelligent Tracking Video Camera VPT300

I came across this Huawei solution by chance. The system costs about $9K and works only with Huawei terminals. The developers have added a feature of their own: when there is no one else in the room, the layout shows video of two speakers on one screen. Judging by the specifications and the declared functionality, this is a very interesting automatic guidance system. Unfortunately, I could not find any demonstration material at all. The only video I found on the topic is an edited review of the solution, set to music with the original sound removed. So it was not possible to assess the quality of the system, and for this reason I will not consider this option.
I see that Huawei has an active blog on Habr, so perhaps colleagues there will be able to publish some useful information about this product.

New: the universal SmartCam A12 Voice Tracking solution

SmartCam A12VT is an all-in-one unit that includes two PTZ cameras for tracking speakers, two built-in cameras for analyzing the overall view of the room, and a microphone array built into the base of the housing. As you can see, there are no bulky, fragile structures like those of its competitors.
Before describing the new product, I will summarize the characteristics and features of the Cisco and Polycom solutions, so that SmartCam A12VT can be compared with the existing offerings.

Polycom EagleEye Director

Retail cost of the system without a terminal: $13K
Minimum cost of the EagleEye Director + RealPresence Group 500 solution: $19K
Average switching delay: 3 seconds
Guidance by voice + video analytics
High requirements for the speaker's face: the mouth must not be hidden
Incompatible with third-party equipment

Cisco TelePresence SpeakerTrack 60

Retail cost of the system without a terminal: $15.9K
Minimum cost of the TelePresence SpeakerTrack 60 + SX80 Codec solution: $30K
Average switching delay: 3 seconds
Guidance by voice + video analytics
Requirements for the speaker's face: not tested, no information found
Incompatible with third-party equipment

SmartCam A12 Voice Tracking

Retail cost of the system without a terminal: $6.2K
Minimum cost of the SmartCam A12VT + Yealink VC880 solution: $10.8K
Minimum cost of the SmartCam A12VT + software terminal solution: $7.7K
Average switching delay: 3 seconds
Guidance by voice + video analytics
Requirements for the speaker's face: none
Compatible with third-party equipment via HDMI

I see two main and indisputable advantages of the SmartCam A12 Voice Tracking solution:

Universal connectivity: over HDMI the system integrates with both hardware and software video conferencing terminals
Low cost: with similar functionality, A12VT is several times more affordable than the offerings above.
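To put the price gap in numbers, the figures listed above can be compared directly. This is a small sketch; the dictionary keys and the helper name are just labels for this example, and the amounts are the retail prices quoted in this article.

```python
# Retail prices in USD as quoted in this article.
CAMERA_ONLY = {
    "Polycom EagleEye Director": 13_000,
    "Cisco TelePresence SpeakerTrack 60": 15_900,
}
BUNDLES = {
    "Polycom EagleEye Director + RealPresence Group 500": 19_000,
    "Cisco TelePresence SpeakerTrack 60 + SX80": 30_000,
}
SMARTCAM_CAMERA_ONLY = 6_200
SMARTCAM_BUNDLES = {
    "SmartCam A12VT + Yealink VC880": 10_800,
    "SmartCam A12VT + software terminal": 7_700,
}


def price_ratio(other_usd, smartcam_usd):
    """How many times more expensive the other offering is, to one decimal."""
    return round(other_usd / smartcam_usd, 1)


def compare():
    """Return {(competitor, smartcam option): ratio} for every pairing."""
    ratios = {}
    for name, price in CAMERA_ONLY.items():
        ratios[(name, "SmartCam A12VT")] = price_ratio(price, SMARTCAM_CAMERA_ONLY)
    for name, price in BUNDLES.items():
        for sc_name, sc_price in SMARTCAM_BUNDLES.items():
            ratios[(name, sc_name)] = price_ratio(price, sc_price)
    return ratios
```

On these list prices the competing camera systems alone come out at about 2.1 and 2.6 times the price of the A12VT, and the complete bundles at roughly 1.8 to 3.9 times, depending on which terminal is paired with the SmartCam.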

To demonstrate the system, we recorded a video review. The goal was functional rather than promotional, so the video lacks the pathos of the Polycom promo videos. We filmed not in a showroom but in the lab meeting room of our partner, the IPMatika company.
My goal was not to hide the flaws of the system but, on the contrary, to expose its weak points and make the system fail.

In my opinion, the system passed the test. I say this with confidence because, at the time of writing, the SmartCam A12 Voice Tracking solution has already been tried in a dozen real meeting rooms at our customers' sites. The automation failed only when the recommended operating rules were violated, in particular the minimum distance to the nearest participants. If you sit very close to the camera, less than a meter away, the microphone array will not be able to recognize you, and the lens may fail to track you.

Besides the distance, there is one more requirement: the installation height of the camera.

If the camera is mounted too low, there may be problems with locating the voice. Unfortunately, placing it under the TV did not work.
Mounting the system above the display, on the other hand, is the ideal way to operate the device. A shelf for the camera is included in the kit; out of the box, only wall mounting is supported.

How SmartCam A12 Voice Tracking works

The two main PTZ lenses have equal roles: they take turns tracking the speakers and showing the wide shot. Analysis of the overall picture in the room and estimation of the distance to objects are done on the video streams from the two cameras built into the base of the system. This makes it possible to cut the lens response time when the speaker changes down to 1-2 seconds. The camera manages to alternate between participants at a comfortable pace, even when they exchange short phrases.
The video demonstration fully reflects the functionality of the SmartCam A12VT. For those who have not watched the video, I will describe in words how the automation works:

The room is empty: one of the lenses shows the wide shot, and the second is on standby, waiting for people.
People enter the room and sit down: the free lens finds the two outermost participants and frames the image around them, cutting off the empty part of the room.
While people are moving, the lenses take turns tracking everyone in the room, keeping them in the center of the frame.
A speaker begins to talk: the lens showing the wide shot stays on air. The second lens points at the speaker and only then goes on air.
The speaker changes: the lens pointed at the first speaker stays on air, while the second lens leaves the wide shot and adjusts to the new speaker.
At the moment the picture switches from the first speaker to the second, the free lens immediately adjusts to the wide shot of the room.
If everyone falls silent, the free lens shows the already framed wide shot without any delay.
If the speaker changes again, the free lens goes looking for him.
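The handoff between the two lenses can be sketched as follows. This is a simplified model of the behavior described above, not the vendor's firmware: the class and method names are hypothetical, and re-aiming is treated as instantaneous, whereas the real lens needs the 1-2 seconds mentioned earlier.

```python
WIDE = "wide shot"  # the "general plan" of the room


class TwoLensDirector:
    """Toy model: one lens is on air while the other one prepares the next shot."""

    def __init__(self):
        self.targets = [WIDE, WIDE]  # what each lens is currently framed on
        self.live = 0                # index of the lens that is on air

    @property
    def free(self):
        return 1 - self.live

    def on_air(self):
        return self.targets[self.live]

    def on_speaker(self, speaker):
        """A (new) speaker starts talking; return what is shown afterwards."""
        if self.targets[self.live] == speaker:
            return self.on_air()  # already showing this speaker
        # The free lens aims at the speaker while the live lens stays on air.
        self.targets[self.free] = speaker
        # Cut only once the free lens is framed; the lenses swap roles.
        self.live = self.free
        # The lens that just went off air returns to the wide shot at once.
        self.targets[self.free] = WIDE
        return self.on_air()

    def on_silence(self):
        """Everyone is silent: cut to the wide shot the free lens already holds."""
        if self.targets[self.live] != WIDE:
            self.live = self.free
        return self.on_air()
```

The point of the design is that the picture on air never comes from a lens that is still moving: every cut goes to a shot that has already been framed, which is what removes the wide-shot flicker of a single-main-camera system.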

Conclusion

In my opinion, this solution, presented at ISE and ISR last year, brings high technology closer, if not to the people, then certainly to business. Clearly, few people will buy such a "toy" for 400 thousand rubles for their home, but for business and corporate video conferencing it is a very affordable and convenient way to solve the task of automatic camera guidance.
Given the versatility of SmartCam A12 Voice Tracking, the system can be used as a solution built from scratch or as an extension of an existing video conferencing infrastructure. Connecting via HDMI is a big step toward the user, unlike the proprietary systems of the manufacturers described above.

I want to thank the partners who helped with the testing:
the IPMatika company, for the Yealink VC880 terminal, the meeting room, and Yura Yakushin;
the Smart-AB company, for the right to the first and exclusive review of the solution and for providing the SmartCam A12 Voice Tracking system for testing.

In the previous article, "Online Meeting Room Configurator: selecting the optimal videoconferencing solution", as a promotion for the vc4u.ru website and the VCS configurator, we announced a 10% discount on catalog prices with the code word HABR, valid until the end of summer 2019.

Discount applies to products in sections:

For the SmartCam A12 Voice Tracking solution, I am offering an additional 5% discount on top of the existing 10%, for a total of 15%, until the end of summer 2019.

I look forward to your comments and your answers in the survey!

Thank you for your attention.
Respectfully,
Kirill Usikov ( Usikoff )
Head of
CCTV and video conferencing systems
1@stss.ru
stss.ru
vc4u.ru

Source: https://habr.com/ru/post/459038/
