
Program, fetch!


In my previous article I covered working with Microsoft Kinect for Windows and demonstrated the sensor's capabilities with a game of cubes. Let me remind you that skeleton tracking is not the sensor's only capability, and today I would like to talk about speech recognition.

To get acquainted with the Microsoft Speech Platform, we will write a simple application in which an arbitrary object (for example, a tank) moves around a plane. I did not add voice commands to the previous example for two reasons. First, chronologically, this example came first. Second, I wanted each example to focus on a single piece of functionality (such code is easier to study).

Let's figure out which packages will be useful to us.

I will reveal a terrible secret: a sensor is not a prerequisite for speech recognition. The Speech Platform is developed independently of Kinect, yet the Kinect SDK ships with speech recognition samples. Our example will work both with the sensor and with an ordinary microphone.

First of all, you need to understand what to program. The sequence of actions is extremely simple:
  1. select a recognition engine for the required language from those installed in the system;
  2. create a command dictionary (grammar) and pass it to the engine;
  3. set the audio source for the engine: it can be Kinect, a microphone, or an audio file;
  4. tell the engine to start recognition.

Now, in Visual Studio, create a new WPF Application project. I will be writing in C#.
To begin with, we will try to find a connected sensor. The KinectSensor class exposes this through its KinectSensors property:

 KinectSensor kinect = KinectSensor.KinectSensors
     .Where(s => s.Status == KinectStatus.Connected)
     .FirstOrDefault();

The speech recognition engine is the SpeechRecognitionEngine class; its static method InstalledRecognizers() returns information about all recognizers installed in the system.

 RecognizerInfo info = SpeechRecognitionEngine.InstalledRecognizers()
     .Where(ri => string.Equals(ri.Culture.Name, "en-US", StringComparison.InvariantCultureIgnoreCase))
     .FirstOrDefault();

It is easy to guess that this gives us information (RecognizerInfo) about the English speech recognizer, if one is installed. The InstalledRecognizers method does not return engine instances, only information about them. Therefore, the next step is to create an instance of the engine by passing the recognizer ID to its constructor:

 var sre = new SpeechRecognitionEngine(info.Id); 

Now let's think for a moment. We need to control an object on a plane. Which commands do we need for that? I think four commands are enough: UP, DOWN, LEFT, RIGHT. And for variety, we can add a fifth command, EXIT. Note that I wrote the command recognition code for English, but you can choose any of the 54 available languages. Create a command dictionary and load it into the recognition engine.

 var commands = new Choices();
 commands.Add("up");
 commands.Add("down");
 commands.Add("left");
 commands.Add("right");
 commands.Add("exit");

 var gb = new GrammarBuilder(commands) { Culture = info.Culture };
 sre.LoadGrammar(new Grammar(gb));

The list of words (commands) to recognize is created in a Choices object. The next step is to build a grammar object associated with the commands' culture, and then the grammar is loaded into the recognition engine.

The engine compares each word you say with the word patterns in the grammar to determine whether you have spoken one of the commands. Remember, though, that every recognition attempt carries a certain probability of error; you will see this in the example a little further on.

Now we can define handlers for speech recognition events. The one that matters to us is the SpeechRecognized event, which fires when the engine finds a match for the spoken command in the dictionary. Its SpeechRecognizedEventArgs argument exposes the Result property, where we can find the recognized word, the confidence value (how likely it is that the word was recognized correctly), and much more. Two other events, SpeechHypothesized and SpeechRecognitionRejected, are of interest for debugging rather than real use. The first fires when the engine makes a tentative recognition hypothesis, the second when it can match a word only with low confidence.
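Before the handler below can fire, it has to be attached to the engine. The article does not show this line explicitly, so here is a minimal wiring that matches the handler name used below:

 sre.SpeechRecognized += Sre_SpeechRecognized;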

 private void Sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
 {
     // accept the result only when the confidence is at least 70%
     if (e.Result.Confidence >= 0.7)
     {
         Action handler = null;
         switch (e.Result.Text.ToUpperInvariant())
         {
             case "UP":
             case "DOWN":
             case "LEFT":
             case "RIGHT":
                 handler = () => { /* some actions */ };
                 break;
             case "EXIT":
                 handler = () => { this.Close(); };
                 break;
             default:
                 break;
         }

         if (handler != null)
         {
             // the event arrives on a background thread, so UI work goes through the Dispatcher
             Dispatcher.BeginInvoke(handler, DispatcherPriority.Normal);
         }
     }
 }
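For debugging, you can also subscribe to the two events mentioned above and log their confidence values. A minimal sketch (the logging here is purely illustrative and not part of the original example; it assumes sre is the engine created earlier and System.Diagnostics is imported):

 // log every hypothesis the engine makes while you are still speaking
 sre.SpeechHypothesized += (s, e) =>
     Debug.WriteLine("Hypothesis: {0} ({1:P0})", e.Result.Text, e.Result.Confidence);

 // log commands that were heard but matched with too little confidence
 sre.SpeechRecognitionRejected += (s, e) =>
     Debug.WriteLine("Rejected: {0} ({1:P0})", e.Result.Text, e.Result.Confidence);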

It remains to set the audio source and start recognition. Here I would like to point out a peculiarity of Kinect: its audio stream becomes ready roughly 4 seconds after initialization. This has to be taken into account, for example by creating a timer that starts recognition with a 4-second delay.
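Here is a minimal sketch of such a delayed start, assuming sre is the engine created above and the Kinect audio stream has already been attached to it (the full audio-source setup follows below):

 // start recognition only after the Kinect audio stream has had time to warm up
 var startTimer = new DispatcherTimer { Interval = TimeSpan.FromSeconds(4) };
 startTimer.Tick += (s, e) =>
 {
     startTimer.Stop();                          // fire once
     sre.RecognizeAsync(RecognizeMode.Multiple); // begin continuous recognition
 };
 startTimer.Start();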

Remember, I said at the beginning that our code works both with Kinect and with a regular microphone? To achieve this, it is enough to set the audio source correctly.

 if (kinect != null)
 {
     var audioSource = kinect.AudioSource;
     audioSource.BeamAngleMode = BeamAngleMode.Adaptive;
     var kinectStream = audioSource.Start();

     // use the Kinect audio stream as the input
     sre.SetInputToAudioStream(kinectStream,
         new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
 }
 else
 {
     // no sensor found, fall back to the default audio device (microphone)
     sre.SetInputToDefaultAudioDevice();
 }

 // Start recognition: asynchronous, multiple (continuous) recognition.
 sre.RecognizeAsync(RecognizeMode.Multiple);

As for the UI, everything is simple. Draw an object of any shape (it can even be a picture); I drew a tank.

[image: the tank]

Then add animations for movement. To avoid the comical situation of the tank driving sideways, I also added an animation that rotates it to face the right direction. Here is an example animation for the LEFT command:

 <Storyboard x:Key="LEFT">
     <DoubleAnimationUsingKeyFrames
         Storyboard.TargetProperty="(UIElement.RenderTransform).(TransformGroup.Children)[2].(RotateTransform.Angle)"
         Storyboard.TargetName="PART_Tank">
         <EasingDoubleKeyFrame KeyTime="0:0:0.5" Value="-90"/>
     </DoubleAnimationUsingKeyFrames>
     <DoubleAnimation Storyboard.TargetProperty="(Canvas.Left)"
                      Storyboard.TargetName="PART_Tank"
                      Duration="0:0:1" By="-30"/>
 </Storyboard>
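One possible way to hook these storyboards up to the /* some actions */ placeholder in the handler above is to name each storyboard after its command and look it up by the recognized text. This is only a sketch, assuming the storyboards are defined as window resources with the keys UP, DOWN, LEFT, RIGHT (as above) and that System.Windows.Media.Animation is imported:

 // look up the storyboard whose key matches the recognized command and play it
 handler = () =>
 {
     var storyboard = (Storyboard)this.FindResource(e.Result.Text.ToUpperInvariant());
     storyboard.Begin();
 };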


Recognition in action:


You will find the source code and the compiled version at the end of the article. Please note: to run the compiled example on a machine without the Speech SDK, you need to install the Microsoft Speech Platform Runtime and the English recognition engine MSSpeech_SR_en-US_TELE.msi.

Summing up, I will say that the Microsoft Speech Platform is a really big and interesting product, and I have touched only a small part of it. I would advise those interested to look at the speech samples in the Kinect SDK; I think they are a good starting point.

In conclusion, I would like to thank VIAcode for providing the sensor for the experiments.

Build an example without Kinect
Build the Kinect Example
Sample source code

Source: https://habr.com/ru/post/142677/
