Why does a robot need ears? (poll: do you need OpenTod?)
The Second Law of Robotics, formulated by the famous American science fiction writer Isaac Asimov, says that a robot must obey the orders given to it by human beings. But how do you give a robot orders? Judging by most science fiction films, the most natural way to communicate with a robot is ordinary human speech. That is why we gave the robot Tod, as befits a proper servant of man, the long-awaited ability to understand voice commands and to synthesize speech in Russian. Now it is enough to say, for example, "Robot, go to the kitchen" for the robot to carry out the task. Under the cut we describe the software used for speech recognition and synthesis on the robot, and in the videos we show voice commands in action. The direction of our project depends on the community: are you interested in using the Tod robot as an open-source platform for developers? Please vote in our poll.
Speech recognition in CMU Sphinx is built on three components:
acoustic model - matches segments of the audio signal to phonemes
dictionary - lists each recognizable word with its phoneme transcription
language model - builds a sentence from the words obtained
The acoustic model is a set of sound recordings split into speech segments. For a small dictionary you can record an acoustic database yourself, but it is easier to use the one from the VoxForge.org project, which contains more than ten hours of dictated Russian speech. The next step, adapting the acoustic model, is optional, but it improves recognition of your particular voice: the phrases you dictate are added to the base acoustic model, so your pronunciation quirks are taken into account during recognition. The dictionary in CMU Sphinx is just a text file with words and their corresponding phonemes. Ours consists of the various commands used to control the robot (variants marked (2), (3) and so on are alternative pronunciations of the same word):
Vocabulary
```
without        bb je s
without(2)     bb iz
without(3)     bb is
without(4)     bb je z
without(5)     bb je s
forward        f pp i rr jo t
time           v rr je mm i
where          g dd je
two            dv aa
two-three      dv aa t rr ii
day            dd je nn
tomorrow       z aa ftr ay
hall           z aa l
hello          zdr aa stvuj
you-know       zn aa i sh
name-is        zav uu t
how            k aa k
which          k aa k ay i
which(2)       kak aa i
which(3)       kak oo j
end            kanc aa
who            kt oo
kitchen        k uu h nn uj
love           ju bb i sh
me             mm i nn ja
cute           mm ii lyj
to-me          m nn je
you-can        m oo zh y sh
my             m oo j
find           naj tt ii
weeks          nn i dd je ll i
look-back      ag ll ja t kk i
one            a dd ii n
papa           aa pp i
beer           pp ii v ay
doing          p ay zh yv aa i sh
lets-play      p ay igr aa im
bye            pak aa
weather        pag oo d ay
item           p rr id mm je t
bring          p rr i vv i zz ii
hi             p rr i vv je t
tell           r ay ska zh yy
today          ss iv oo d nn i
now            ss ij ch ja s
now(2)         ss i ch ja s
now(3)         sch ja s
how-many       sk oo ll k ay
you            tt i bb ja
you(2)         tt ja
you(3)         tt i
point          t oo ch k ay
three          t rr ii
three-four     t rr ii ch it yy rr i
you(4)         t yy
you-know(2)    u mm je i sh
four           ch it yy rr i
anything       sh t oo nn ib uu tt
anything(2)    ch t oo nn ib uu tt
anything(3)    ch t oo nn ibu tt
```
The dictionary, together with a corpus of command phrases, is then converted into a language model that the CMU Sphinx engine understands; that, in outline, is the whole speech recognition pipeline.
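The conversion can be scripted with the standard cmuclmtk command-line tools (text2wfreq, wfreq2vocab, text2idngram, idngram2lm). Below is a minimal sketch, assuming the command phrases are collected in a plain-text file commands.txt; all file names are illustrative.

```python
# Minimal sketch: build an ARPA language model from a phrase corpus
# with the cmuclmtk tools. All file names are illustrative.
import subprocess

# Step 1: build the vocabulary (text2wfreq | wfreq2vocab).
with open('commands.txt') as corpus, open('commands.vocab', 'w') as vocab:
    wfreq = subprocess.Popen(['text2wfreq'], stdin=corpus,
                             stdout=subprocess.PIPE)
    subprocess.check_call(['wfreq2vocab'], stdin=wfreq.stdout, stdout=vocab)
    wfreq.wait()

# Step 2: count word n-grams over the command phrases.
with open('commands.txt') as corpus:
    subprocess.check_call(['text2idngram', '-vocab', 'commands.vocab',
                           '-idngram', 'commands.idngram'], stdin=corpus)

# Step 3: estimate the ARPA-format language model.
subprocess.check_call(['idngram2lm', '-vocab_type', '0',
                       '-idngram', 'commands.idngram',
                       '-vocab', 'commands.vocab',
                       '-arpa', 'commands.lm'])
```

The resulting commands.lm is then handed to the recognizer together with the dictionary above.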
In ROS, any node that subscribes to the /recognizer/output topic can now receive the text of sentences built with the CMU Sphinx language model. We wrote a small voice-control node that takes the recognized phrases and converts them into patrol commands or synthesizes the robot's spoken replies. Below you will find a video on this.
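In essence, such a node is a phrase-to-command dispatcher. Here is a minimal sketch of the idea; only the /recognizer/output topic comes from the pocketsphinx package, while the phrases, speeds, and the cmd_vel topic name are illustrative.

```python
#!/usr/bin/env python
# Minimal sketch of a voice-control node: maps phrases recognized by
# pocketsphinx to velocity commands. Phrases and speeds are illustrative.
import rospy
from std_msgs.msg import String
from geometry_msgs.msg import Twist

class VoiceControl(object):
    def __init__(self):
        rospy.init_node('voice_control')
        self.cmd_pub = rospy.Publisher('cmd_vel', Twist, queue_size=1)
        rospy.Subscriber('/recognizer/output', String, self.on_phrase)

    def on_phrase(self, msg):
        cmd = Twist()
        phrase = msg.data.lower()
        if 'forward' in phrase:        # e.g. "robot, go forward"
            cmd.linear.x = 0.2
        elif 'back' in phrase:
            cmd.linear.x = -0.2
        elif 'stop' in phrase:
            pass                       # a zero Twist stops the robot
        else:
            return                     # ignore unknown phrases
        self.cmd_pub.publish(cmd)

if __name__ == '__main__':
    VoiceControl()
    rospy.spin()
```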
Speech synthesis with Festival
Recognition alone is not enough for real communication with the robot: it also needs a voice of its own. We taught our robot Tod to speak with Festival, a speech synthesis package available on Linux. Festival is likewise a joint development of several large universities; it provides good synthesis quality and supports Russian. With the Sphinx/Festival combination you can implement full-fledged dialogues. And here is a video demonstrating the use of voice commands on our robot.
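Getting Festival to talk from code is straightforward. A minimal sketch follows; the Russian voice name voice_msu_ru_nsh_clunits is an assumption, so substitute whichever voice is installed on your system.

```python
# Minimal sketch: speak a phrase through Festival. The voice name is an
# assumption; use the Russian voice actually installed on your system.
import subprocess

def say(text, voice='voice_msu_ru_nsh_clunits'):
    # Festival in --pipe mode reads Scheme commands from stdin:
    # first select the voice, then synthesize the text.
    scheme = '({0}) (SayText "{1}")'.format(voice, text)
    festival = subprocess.Popen(['festival', '--pipe'],
                                stdin=subprocess.PIPE)
    festival.communicate(scheme.encode('utf-8'))

say('I am going to the kitchen')
```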
And what else can you hear?
Speaking of tasks related to sound, it is impossible not to mention HARK, a robot audition package from Japan that greatly expands a robot's ability to process sound. Here are some of its capabilities:
sound source localization
separation of several useful sound sources (for example, the overlapping phrases of several people speaking at the same time)
noise filtering to extract "clean" speech from the audio stream
creating a three-dimensional audio effect for telepresence tasks
There is little point in using HARK with a single microphone, since most sound-processing tasks are solved with a so-called microphone array. This is where the Kinect fits in perfectly, with its array of four microphones built into the front panel. We, of course, did not miss the opportunity to use HARK in our project. While patrolling the territory, the robot must react to surrounding events, including a person addressing it. The sound source localization module provided by HARK helps the robot find its interlocutor even when the speaker is out of its direct line of sight: the task reduces to localizing the sound source and turning the head to face the person. See how it looks in our video.
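HARK networks themselves are assembled in HARK's own graphical editor, so on the ROS side only a little glue is needed. A minimal sketch follows, assuming the localization result is republished as an azimuth angle (std_msgs/Float64, in radians) on a hypothetical /sound_direction topic, and that the head pan servo accepts target angles on /head_pan_controller/command; both names are assumptions to adapt to your setup.

```python
#!/usr/bin/env python
# Illustrative sketch: turn the head toward a localized sound source.
# The /sound_direction topic and /head_pan_controller/command interface
# are assumptions; adapt them to your own message and joint names.
import rospy
from std_msgs.msg import Float64

class FaceTheSpeaker(object):
    def __init__(self):
        rospy.init_node('face_the_speaker')
        # Many servo controllers accept a target joint angle as Float64.
        self.pan_pub = rospy.Publisher('/head_pan_controller/command',
                                       Float64, queue_size=1)
        rospy.Subscriber('/sound_direction', Float64, self.on_direction)

    def on_direction(self, msg):
        # msg.data: azimuth of the sound source in radians,
        # 0 straight ahead, positive to the left.
        self.pan_pub.publish(Float64(msg.data))

if __name__ == '__main__':
    FaceTheSpeaker()
    rospy.spin()
```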
Regular readers of our blog have surely noticed that since the last post the robot Tod has not only grown smarter but has also grown up: it has acquired a manipulator and a second Kinect. In the next post we will talk about how to control the manipulator and use it to grasp objects. See you again on our blog.