Why does a robot need ears? (poll: do you need OpenTod?)
The Second Law of Robotics, formulated by the famous American science fiction writer Isaac Asimov, says that a robot must obey the orders given to it by human beings. But how do you give a robot orders? Judging by most science fiction films, the most natural way to communicate with a robot is ordinary human speech. That is why we gave the robot Tod, as befits a proper servant of man, the long-awaited ability to understand voice commands and to synthesize speech in Russian. Now it is enough to say, for example, "Robot, go to the kitchen" for the robot to carry out the task. Under the cut we describe the software used for speech recognition and synthesis on the robot, and in the videos we show voice commands in action. The direction of our project depends on the community: are you interested in using the Tod robot as an open-source platform for developers? Please vote in our poll.
Speech recognition in CMU Sphinx is built on three components:
acoustic model - matches segments of the audio signal to phonemes
dictionary - lists each recognizable word with its phoneme transcription
language model - builds a sentence from the words obtained
The acoustic model is a set of sound recordings split into speech segments. For a small dictionary you can record an acoustic database yourself, but it is easier to use the one from the VoxForge.org project, which contains more than ten hours of dictated Russian speech. The next step, adapting the acoustic model, is optional, but it improves recognition of your particular voice: the phrases you dictate are added to the base acoustic model, so your pronunciation quirks are taken into account during recognition. The dictionary in CMU Sphinx is just a text file with words and their corresponding phonemes. Ours consists of the various commands used to control the robot (variants marked (2), (3) and so on are alternative pronunciations of the same word):
Vocabulary
```
without        bb je s
without(2)     bb iz
without(3)     bb is
without(4)     bb je z
without(5)     bb je s
forward        f pp i rr jo t
time           v rr je mm i
where          g dd je
two            dv aa
two-three      dv aa t rr ii
day            dd je nn
tomorrow       z aa ftr ay
hall           z aa l
hello          zdr aa stvuj
you-know       zn aa i sh
name-is        zav uu t
how            k aa k
which          k aa k ay i
which(2)       kak aa i
which(3)       kak oo j
end            kanc aa
who            kt oo
kitchen        k uu h nn uj
love           ju bb i sh
me             mm i nn ja
cute           mm ii lyj
to-me          m nn je
you-can        m oo zh y sh
my             m oo j
find           naj tt ii
weeks          nn i dd je ll i
look-back      ag ll ja t kk i
one            a dd ii n
papa           aa pp i
beer           pp ii v ay
doing          p ay zh yv aa i sh
lets-play      p ay igr aa im
bye            pak aa
weather        pag oo d ay
item           p rr id mm je t
bring          p rr i vv i zz ii
hi             p rr i vv je t
tell           r ay ska zh yy
today          ss iv oo d nn i
now            ss ij ch ja s
now(2)         ss i ch ja s
now(3)         sch ja s
how-many       sk oo ll k ay
you            tt i bb ja
you(2)         tt ja
you(3)         tt i
point          t oo ch k ay
three          t rr ii
three-four     t rr ii ch it yy rr i
you(4)         t yy
you-know(2)    u mm je i sh
four           ch it yy rr i
anything       sh t oo nn ib uu tt
anything(2)    ch t oo nn ib uu tt
anything(3)    ch t oo nn ibu tt
```
The dictionary, together with a corpus of command phrases, is then converted into a language model that the CMU Sphinx engine understands; that, in outline, is the whole speech recognition pipeline.
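The conversion can be scripted with the standard cmuclmtk command-line tools (text2wfreq, wfreq2vocab, text2idngram, idngram2lm). Below is a minimal sketch, assuming the command phrases are collected in a plain-text file commands.txt; all file names are illustrative.

```python
# Minimal sketch: build an ARPA language model from a phrase corpus
# with the cmuclmtk tools. All file names are illustrative.
import subprocess

# Step 1: build the vocabulary (text2wfreq | wfreq2vocab).
with open('commands.txt') as corpus, open('commands.vocab', 'w') as vocab:
    wfreq = subprocess.Popen(['text2wfreq'], stdin=corpus,
                             stdout=subprocess.PIPE)
    subprocess.check_call(['wfreq2vocab'], stdin=wfreq.stdout, stdout=vocab)
    wfreq.wait()

# Step 2: count word n-grams over the command phrases.
with open('commands.txt') as corpus:
    subprocess.check_call(['text2idngram', '-vocab', 'commands.vocab',
                           '-idngram', 'commands.idngram'], stdin=corpus)

# Step 3: estimate the ARPA-format language model.
subprocess.check_call(['idngram2lm', '-vocab_type', '0',
                       '-idngram', 'commands.idngram',
                       '-vocab', 'commands.vocab',
                       '-arpa', 'commands.lm'])
```

The resulting commands.lm is then handed to the recognizer together with the dictionary above.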
In ROS, any node that subscribes to the /recognizer/output topic can now receive the text of sentences built with the CMU Sphinx language model. We wrote a small voice-control node that takes the recognized phrases and converts them into patrol commands or synthesizes the robot's spoken replies. Below you will find a video on this.
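In essence, such a node is a phrase-to-command dispatcher. Here is a minimal sketch of the idea; only the /recognizer/output topic comes from the pocketsphinx package, while the phrases, speeds, and the cmd_vel topic name are illustrative.

```python
#!/usr/bin/env python
# Minimal sketch of a voice-control node: maps phrases recognized by
# pocketsphinx to velocity commands. Phrases and speeds are illustrative.
import rospy
from std_msgs.msg import String
from geometry_msgs.msg import Twist

class VoiceControl(object):
    def __init__(self):
        rospy.init_node('voice_control')
        self.cmd_pub = rospy.Publisher('cmd_vel', Twist, queue_size=1)
        rospy.Subscriber('/recognizer/output', String, self.on_phrase)

    def on_phrase(self, msg):
        cmd = Twist()
        phrase = msg.data.lower()
        if 'forward' in phrase:        # e.g. "robot, go forward"
            cmd.linear.x = 0.2
        elif 'back' in phrase:
            cmd.linear.x = -0.2
        elif 'stop' in phrase:
            pass                       # a zero Twist stops the robot
        else:
            return                     # ignore unknown phrases
        self.cmd_pub.publish(cmd)

if __name__ == '__main__':
    VoiceControl()
    rospy.spin()
```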
Speech synthesis with Festival
Recognition alone is not enough for real communication with the robot: it also needs a voice of its own. We taught our robot Tod to speak with Festival, a speech synthesis package available on Linux. Festival is likewise a joint development of several large universities; it provides good synthesis quality and supports Russian. With the Sphinx/Festival combination you can implement full-fledged dialogues. And here is a video demonstrating the use of voice commands on our robot.
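Getting Festival to talk from code is straightforward. A minimal sketch follows; the Russian voice name voice_msu_ru_nsh_clunits is an assumption, so substitute whichever voice is installed on your system.

```python
# Minimal sketch: speak a phrase through Festival. The voice name is an
# assumption; use the Russian voice actually installed on your system.
import subprocess

def say(text, voice='voice_msu_ru_nsh_clunits'):
    # Festival in --pipe mode reads Scheme commands from stdin:
    # first select the voice, then synthesize the text.
    scheme = '({0}) (SayText "{1}")'.format(voice, text)
    festival = subprocess.Popen(['festival', '--pipe'],
                                stdin=subprocess.PIPE)
    festival.communicate(scheme.encode('utf-8'))

say('I am going to the kitchen')
```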
And what else can you hear?
Speaking of tasks related to sound, it is impossible not to mention HARK, a robot audition package from Japan that greatly expands a robot's ability to process sound. Here are some of its capabilities:
sound source localization
separation of several useful sound sources (for example, the overlapping phrases of several people speaking at the same time)
noise filtering to extract "clean" speech from the audio stream
creating a three-dimensional audio effect for telepresence tasks
There is little point in using HARK with a single microphone, since most sound-processing tasks are solved with a so-called microphone array. This is where the Kinect fits in perfectly, with its array of four microphones built into the front panel. We, of course, did not miss the opportunity to use HARK in our project. While patrolling the territory, the robot must react to surrounding events, including a person addressing it. The sound source localization module provided by HARK helps the robot find its interlocutor even when the speaker is out of its direct line of sight: the task reduces to localizing the sound source and turning the head to face the person. See how it looks in our video.
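HARK networks themselves are assembled in HARK's own graphical editor, so on the ROS side only a little glue is needed. A minimal sketch follows, assuming the localization result is republished as an azimuth angle (std_msgs/Float64, in radians) on a hypothetical /sound_direction topic, and that the head pan servo accepts target angles on /head_pan_controller/command; both names are assumptions to adapt to your setup.

```python
#!/usr/bin/env python
# Illustrative sketch: turn the head toward a localized sound source.
# The /sound_direction topic and /head_pan_controller/command interface
# are assumptions; adapt them to your own message and joint names.
import rospy
from std_msgs.msg import Float64

class FaceTheSpeaker(object):
    def __init__(self):
        rospy.init_node('face_the_speaker')
        # Many servo controllers accept a target joint angle as Float64.
        self.pan_pub = rospy.Publisher('/head_pan_controller/command',
                                       Float64, queue_size=1)
        rospy.Subscriber('/sound_direction', Float64, self.on_direction)

    def on_direction(self, msg):
        # msg.data: azimuth of the sound source in radians,
        # 0 straight ahead, positive to the left.
        self.pan_pub.publish(Float64(msg.data))

if __name__ == '__main__':
    FaceTheSpeaker()
    rospy.spin()
```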
Regular readers of our blog have surely noticed that since the last post the robot Tod has not only grown smarter but has also grown up: it has acquired a manipulator and a second Kinect. In the next post we will talk about how to control the manipulator and use it to grasp objects. See you again on our blog.