Using the Google Speech API to manage your computer

Good day to all Habr readers.

Several articles have already been published on Habr about using the Google Speech API, including its use in building a smart home.

In this article I want to show how to write a small program for controlling a computer by voice.
If you are interested, read on.

For development I use Embarcadero RAD Studio XE and several free supporting components (JEDI Core, JEDI VCL, New Audio Components for Delphi, Synapse, uJSON, CoolTrayIcon).

The article “Using Google Voice Search in your .NET application” has already described how the Google Speech API works and what its subtleties are.

I will describe the algorithm of my program and some nuances of using the auxiliary components.

1. Record sound in FLAC format

For this I use New Audio Components for Delphi. The sound is recorded in FLAC format at a sampling rate of 8 kHz and saved to a file.

The DXAudioIn1 VCL component is responsible for recording; the recording settings are specified in it (1 channel, 8 kHz sampling rate).

Next, the data from DXAudioIn1 goes to FastGainIndicator1, whose OnGainData handler processes the signal level: if the level has fallen below the stop threshold (red pointer) N times, recording stops and the data is sent to Google.
I also made it possible to start recording automatically when the level exceeds a start threshold (blue pointer) M times.

Of course, this algorithm is not very reliable, but it eliminates the need to press the start and stop buttons. With suitable level thresholds and trigger counts, the program detects when the microphone signal contains something useful, that is, speech.
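
The start/stop logic described above can be sketched as a small state machine. This is only an illustration, not the code from the program: the names TVoiceGate and Feed are hypothetical, the level is taken to be a plain integer, and the N and M triggers are assumed to be counted consecutively (the text above does not say whether they are). In the real program the level would come from the OnGainData handler of FastGainIndicator1.

```pascal
program VoiceGateSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TGateEvent = (geNone, geStart, geStop);

  // Hypothetical helper, not part of New Audio Components.
  TVoiceGate = record
    StartLevel, StopLevel: Integer; // thresholds (blue and red pointers)
    StartCount, StopCount: Integer; // M and N from the text
    Above, Below: Integer;          // consecutive hits counted so far
    Recording: Boolean;
  end;

// Feed one level reading and report whether recording should start or stop.
function Feed(var G: TVoiceGate; Level: Integer): TGateEvent;
begin
  Result := geNone;
  if not G.Recording then
  begin
    // Assumption: only consecutive readings above the threshold count.
    if Level > G.StartLevel then Inc(G.Above) else G.Above := 0;
    if G.Above >= G.StartCount then
    begin
      G.Recording := True;
      G.Below := 0;
      Result := geStart;
    end;
  end
  else
  begin
    if Level < G.StopLevel then Inc(G.Below) else G.Below := 0;
    if G.Below >= G.StopCount then
    begin
      G.Recording := False;
      G.Above := 0;
      Result := geStop;
    end;
  end;
end;

const
  // Made-up level readings: silence, a short burst of speech, silence again.
  Levels: array[0..9] of Integer = (2, 30, 35, 40, 38, 5, 4, 3, 2, 1);
var
  G: TVoiceGate;
  I: Integer;
begin
  G.StartLevel := 20; G.StartCount := 3;
  G.StopLevel := 10;  G.StopCount := 4;
  G.Above := 0; G.Below := 0; G.Recording := False;
  for I := Low(Levels) to High(Levels) do
    case Feed(G, Levels[I]) of
      geStart: WriteLn('reading ', I, ': start recording');
      geStop:  WriteLn('reading ', I, ': stop recording and send the file');
    end;
end.
```

Using two separate thresholds gives the gate some hysteresis, so a level hovering around a single value does not toggle the recording on and off.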

Finally, the data from FastGainIndicator1 goes to the FLACOut1 component, which writes it directly to a file in FLAC format.

The startRecord procedure is responsible for starting the recording.

2. Sending the file to Google for recognition and receiving the response

The recorded file is sent to Google for recognition using the Synapse library.

What are the subtleties of working with Synapse, given that the data must be sent over HTTPS?

a) The libeay32.dll and ssleay32.dll libraries must be available.
b) The SSL_OpenSSL unit must be added to the uses clause.

The HTTPPostFile function is responsible for sending the file.

It is called simply:
HTTPPostFile('https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=ru-RU', 'userfile', ExtractFileName(OutFileName), Stream, StrList);

where
Stream is a TFileStream from which the recorded FLAC file is read;
StrList is a TStringList that receives the response from Google.

The HTTPPostFile function itself is fairly simple, but there are subtleties in it:

function TMainForm.HTTPPostFile(const URL, FieldName, FileName: String;
  const Data: TStream; const ResultData: TStrings): Boolean;
const
  CRLF = #$0D + #$0A;
var
  HTTP: THTTPSend;
  // AnsiString rather than String: in Unicode Delphi (2009 and later)
  // Length(Str) then equals the number of bytes written to the stream.
  Bound, Str: AnsiString;
begin
  Bound := IntToHex(Random(MaxInt), 8) + '_Synapse_boundary';
  HTTP := THTTPSend.Create;
  try
    // Header of the part that carries the file
    Str := '--' + Bound + CRLF;
    Str := Str + 'content-disposition: form-data; name="' + FieldName + '";';
    Str := Str + ' filename="' + FileName + '"' + CRLF;
    Str := Str + 'Content-Type: audio/x-flac; rate=8000' + CRLF + CRLF;
    HTTP.Document.Write(Pointer(Str)^, Length(Str));
    // The FLAC data itself (a count of 0 copies the whole stream)
    HTTP.Document.CopyFrom(Data, 0);
    // Closing boundary
    Str := CRLF + '--' + Bound + '--' + CRLF;
    HTTP.Document.Write(Pointer(Str)^, Length(Str));
    // The subtlety: the MIME type must name the audio format and rate
    HTTP.MimeType := 'audio/x-flac; rate=8000, boundary=' + Bound;
    Result := HTTP.HTTPMethod('POST', URL);
    // After the request, Document holds the response body
    ResultData.LoadFromStream(HTTP.Document);
  finally
    HTTP.Free;
  end;
end;

3. Parsing the response from Google and executing the command

The response from Google arrives as JSON, for example:

{"status": 0, "id": "5e34348f2887c7a3cc27dc3695ab4575-1", "hypotheses": [{"utterance": "notepad", "confidence": 0.7581704}]}

For parsing I use the uJSON library.

What the response fields mean:
status = 0 - the recording was recognized successfully
status = 5 - the recording was not recognized
id - a unique identifier of the request
hypotheses - the recognition result; it has 2 subfields:
utterance - the recognized phrase
confidence - the confidence of the recognition
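
To make the field layout concrete, here is a minimal sketch that extracts the first hypothesis from such a response. Note the assumptions: it targets Free Pascal and uses its fpjson unit as a stand-in, because the uJSON calls are not shown in this article; the function name BestHypothesis is hypothetical; and the Reply constant simply repeats the example response with a lowercase status key.

```pascal
program ParseReplySketch;
{$MODE DELPHI}

uses
  SysUtils, fpjson, jsonparser;

const
  Reply = '{"status":0,"id":"5e34348f2887c7a3cc27dc3695ab4575-1",' +
          '"hypotheses":[{"utterance":"notepad","confidence":0.7581704}]}';

// Returns True and fills Phrase/Confidence when the response holds
// a successful status and at least one hypothesis.
function BestHypothesis(const Json: string; out Phrase: string;
  out Confidence: Double): Boolean;
var
  Data: TJSONData;
  Obj, Hyp: TJSONObject;
  Arr: TJSONArray;
begin
  Result := False;
  Phrase := '';
  Confidence := 0;
  Data := GetJSON(Json);
  try
    if not (Data is TJSONObject) then Exit;
    Obj := TJSONObject(Data);
    // status = 0 means the recording was recognized
    if (Obj.IndexOfName('status') < 0) or (Obj.Integers['status'] <> 0) then Exit;
    if Obj.IndexOfName('hypotheses') < 0 then Exit;
    Arr := Obj.Arrays['hypotheses'];
    if Arr.Count = 0 then Exit;
    // Take the first hypothesis in the list
    Hyp := Arr.Objects[0];
    Phrase := Hyp.Strings['utterance'];
    Confidence := Hyp.Floats['confidence'];
    Result := True;
  finally
    Data.Free;
  end;
end;

var
  P: string;
  C: Double;
begin
  if BestHypothesis(Reply, P, C) then
    WriteLn('phrase: ', P, ', confidence: ', C:0:2)
  else
    WriteLn('not recognized');
end.
```

Checking the status field before touching hypotheses keeps the unrecognized case (status = 5) on a single, explicit code path.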

I moved sending the file, parsing the response, and looking up and executing the command into a separate thread, JvThreadRecognize.

Command lists are stored in the MSpeechCommand.ini file. An example file:

;notepad.exe
;script\Show_Desktop.scf
;script\Lock_Workstation.cmd
;script\Halt_Workstation.cmd
;script\Reboot_Workstation.cmd
;script\Logoff_Workstation.cmd
qip;C:\Program Files\QIP Infium\infium.exe
;firefox.exe
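
A lookup over a file of this shape could be sketched as follows. This is an illustration under assumptions rather than the code from the program: each line is taken to have the form phrase;command, the function name FindCommand is hypothetical, and the phrase in the first sample line is made up for the example. Actually launching the command is left out.

```pascal
program CommandLookupSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Returns the command mapped to Phrase, or '' when nothing matches.
// Commands.NameValueSeparator must already be set to ';'.
function FindCommand(Commands: TStrings; const Phrase: string): string;
var
  I: Integer;
begin
  Result := '';
  if Phrase = '' then Exit; // never match lines that have no phrase
  for I := 0 to Commands.Count - 1 do
    if AnsiSameText(Commands.Names[I], Phrase) then
    begin
      Result := Commands.ValueFromIndex[I];
      Exit;
    end;
end;

var
  Commands: TStringList;
begin
  Commands := TStringList.Create;
  try
    Commands.NameValueSeparator := ';';
    // One possible way to load the real file:
    //   Commands.LoadFromFile('MSpeechCommand.ini');
    Commands.Add('notepad;notepad.exe'); // hypothetical phrase
    Commands.Add('qip;C:\Program Files\QIP Infium\infium.exe');
    WriteLn(FindCommand(Commands, 'Notepad'));
  finally
    Commands.Free;
  end;
end.
```

The comparison ignores case, since the recognized phrase may not match the capitalization used in the file.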

Results: This program does not claim to be a finished product; it is just an example of using the Google Speech API to execute commands on a computer (so far it only launches applications and executes system commands). Nothing prevents you from extending it and teaching it to move the mouse, type text in a text editor, and so on.

The final build of the program and source code (GPLv3) are available at code.google.com/p/mspeech

I will be glad to hear constructive criticism and suggestions. Thank you.

Source: https://habr.com/ru/post/144535/
