
"Smart home" with their own hands. Part 3. Synthesis and voice recognition using Google

In the previous article we managed to get images from our webcams as once-per-second snapshots. Now it's time to take on what was promised: speech recognition and synthesis.

A small digression


Starting with this article, I will begin describing my software, which coordinates all the subsystems of the smart home. I should note that it has already moved quite far beyond the code described in this article; newer and more functional versions are available via the trac link. It is distributed under the GNU GPLv3 license. If anyone wants to join the development, you are welcome ;)

A little information


Speech recognition

As I wrote in the first article, we will use Google services for speech synthesis and recognition. Many of you have probably come across Android devices with voice search. The same voice search was also added to the Google Chrome browser as an extra feature. It should be noted that the company has not announced an official API for this service yet, but since Chrome is open source, enthusiasts have figured out what gets sent and what comes back in response. It looks like this:

  1. Record a wav file with a 16000 Hz sampling rate, mono
  2. Re-encode the resulting file into the flac format
  3. Send the file to https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=ru-RU , presenting ourselves to Google as a Chrome client
  4. Receive the answer in JSON format


The answer is something like:

{"status":0,"id":"84e03bf4efe17fa7856333560d6faba4-1","hypotheses":[{"utterance":" ","confidence":0.85437811}]}

We are only interested in the last two fields of the answer: utterance and confidence . The first is the recognized word or phrase we want, the second is the recognition confidence. If confidence is greater than 0.5 , we can consider the recognition reliable.
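For illustration, here is a minimal recognition sketch built on LWP::UserAgent and JSON::XS rather than the WWW::Curl code used later in the daemon; it assumes the service accepts the FLAC data as the raw request body, and input.flac is just a placeholder name for a 16 kHz mono FLAC file:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON::XS;

my $url = 'https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=ru-RU';

# Slurp the FLAC file as raw bytes
open my $fh, '<:raw', 'input.flac' or die "Can't open input.flac: $!";
my $flac = do { local $/; <$fh> };
close $fh;

my $ua  = LWP::UserAgent->new;
my $res = $ua->post($url,
    'Content-Type' => 'audio/x-flac; rate=16000',
    Content        => $flac,
);
die 'HTTP error: ' . $res->status_line unless $res->is_success;

# Pull out the first hypothesis, if any
my $answer = JSON::XS->new->utf8->decode($res->content);
my $hypo   = $answer->{hypotheses}[0] || {};
my $text   = defined $hypo->{utterance}  ? $hypo->{utterance}  : '';
my $conf   = defined $hypo->{confidence} ? $hypo->{confidence} : 0;

print "Recognized: \"$text\" (confidence: $conf)\n";
print "Looks reliable\n" if $conf > 0.5;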

Speech synthesis

Speech synthesis will also be done through a Google service, and as far as I know, no official API has been announced for it either. To turn text into a spoken phrase, you need to perform a completely uncomplicated pair of steps:

  1. Send a request of the form http://translate.google.com/translate_tts?tl=ru&q=text , identifying ourselves in the headers as the Google Chrome browser
  2. Receive the response stream as MP3 data (a small sketch follows below)
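Before wiring this into the daemon, here is a tiny standalone sketch of these two steps; it simply saves the synthesized phrase to a file (the phrase and the output name are placeholders, and LWP::UserAgent plus URI::Escape are assumed to be installed):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use LWP::UserAgent;
use URI::Escape;

my $text = 'проверка связи';    # placeholder phrase to synthesize
my $url  = 'http://translate.google.com/translate_tts?tl=ru&q=' . uri_escape_utf8($text);

# Pretend to be Chrome, as described above
my $ua = LWP::UserAgent->new(
    agent => 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.872.0 Safari/535.2',
);

my $res = $ua->get($url);
die 'TTS request failed: ' . $res->status_line unless $res->is_success;

# $res->content is the raw MP3 stream; dump it into a file
open my $out, '>:raw', 'output.mp3' or die "Can't write output.mp3: $!";
print {$out} $res->content;
close $out;

print 'Saved ', length($res->content), " bytes to output.mp3\n";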


As you can see, none of this is difficult. Now let's implement it in code.

Some code


As I already wrote, our "smart home" will be centrally managed by a specially written Perl daemon. I ask you in advance not to be too hard on the quality of the code, for your humble servant is just a sysadmin :)
So, let's define the range of tasks this software should perform:

  1. Accept requests for recognition of sound files
  2. Determine the status of devices and give them commands
  3. Perform certain actions when a given command sequence is detected
  4. React in a specified way to data from sensors and cameras
  5. Keep statistics, records and logs
  6. Provide a convenient web interface for viewing status and cameras, giving commands, etc.


Perhaps I have forgotten or missed something, but these seem to me to be the main tasks of smart home software. Now let's start implementing all of this.

To create a TCP/IP daemon in Perl we will use the Net::Server::Fork module. I will assume that you already know Perl.
#!/usr/bin/perl -w

package iON;

use strict;
use utf8;
use base qw(Net::Server::Fork);

sub process_request {
    my $self = shift;
    while (<STDIN>) {
        if (/text (\d+)/) {
            toText($1);
            next;
        }
        if (/quit/i) {
            print "+OK - Bye-bye ;)\n\n";
            last;
        }
        print "-ERR - Command not found\n";
        logSystem(" : $_", 0);
    }
}

iON->run(port => 16000, background => undef, log_level => 4, host => 'localhost');

1;
Let's briefly walk through what is written here. We declare a module named iON based on Net::Server::Fork and start the server on port 16000 on localhost, with the most verbose logging level and without "daemon" mode. Next we override the process_request() function, which is responsible for handling the data received from a client. In our case, if the server sees a command of the form "text <number>", the toText() function is called with the number the client sent us as its parameter. The quit command, I think, is self-explanatory.
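Before attaching the microphone script, the daemon can be poked at by hand with a tiny test client like the sketch below; it assumes the daemon is already listening on localhost:16000 and that data/input-123.flac exists (123 is just an example number; with no such file the daemon will simply report an error):

#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;

my $sock = IO::Socket::INET->new(
    PeerAddr => 'localhost',
    PeerPort => 16000,
    Proto    => 'tcp',
) or die "Can't connect to the daemon: $!";

print $sock "text 123\n";   # ask for recognition of data/input-123.flac
print $sock "quit\n";       # then ask the daemon to close the connection
print while <$sock>;        # echo all of the daemon's +OK / -ERR replies
close $sock;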

What does the toText() function do? Speech recognition, of course!
sub toText {
    my $num = shift;
    print "+OK - Trying recognize text\n";

    my $curl = WWW::Curl::Easy->new;
    $curl->setopt(CURLOPT_HEADER, 1);
    $curl->setopt(CURLOPT_POST, 1);
    #$curl->setopt(CURLOPT_VERBOSE, 1);

    my @myheaders = ();
    $myheaders[0] = "Content-Type: audio/x-flac; rate=16000";
    $curl->setopt(CURLOPT_HTTPHEADER, \@myheaders);
    $curl->setopt(CURLOPT_URL, 'https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=ru-RU');

    my $curlf = WWW::Curl::Form->new;
    $curlf->formaddfile("data/input-$num.flac", 'myfile', "audio/x-flac");
    $curl->setopt(CURLOPT_HTTPPOST, $curlf);

    my $response_body;
    $curl->setopt(CURLOPT_WRITEDATA, \$response_body);

    # Starts the actual request
    my $retcode = $curl->perform;

    # Looking at the results...
    if ($retcode == 0) {
        $response_body =~ /\n\r\n(.*)/g;
        my $json = $1;
        my $json_xs = JSON::XS->new();
        $json_xs->utf8(1);
        my @hypo = $json_xs->decode($json)->{'hypotheses'};
        my $dost = $hypo[0][0]{'confidence'};
        my $text = $hypo[0][0]{'utterance'};
        $dost = 0.0 if !defined $dost;
        $text = "" if !defined $text;
        print "+OK - Text is: \"$text\", confidence is: $dost\n";
        if ($dost > 0.5) {
            checkcmd($text);
        } else {
            print "+ERR - Confidence is lower than 0.5\n";
            #sayText("  !");
        }
    } else {
        # Error code, type of error, error message
        print("+ERR - $retcode " . $curl->strerror($retcode) . " " . $curl->errbuf);
    }
    system("rm data/input-$num.flac");
}
I will not describe it in detail: these are exactly the actions needed for recognition. Google is fed a file from the data subdirectory named input-<number>.flac; how it gets there will be explained a bit later. Then the answer is read, and if its confidence is above 0.5, the recognized text is passed as a parameter to the checkcmd() function. At the very end the sound file is deleted. Note that you will need curl installed and a few more modules added to the beginning of our script:
use WWW::Curl::Easy;
use WWW::Curl::Form;
use JSON::XS;
Now for speech synthesis. It will be handled by a function called sayText(), which accepts as its parameter the text that needs to be spoken. But first, let's add the missing modules and a global variable:
require Encode;
use URI::Escape;
use LWP::UserAgent;

our $mp3_data;
Now the code itself:
sub sayText {
    my $text = shift;
    print "+OK - Speaking \"$text\"\n";

    my $url = "http://translate.google.com/translate_tts?tl=ru&q=" . uri_escape_utf8($text);
    my $ua = LWP::UserAgent->new(
        agent => "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.872.0 Safari/535.2");
    $ua->get($url, ':content_cb' => \&callback);

    open (MP3, "|padsp splay -M") or die "[err] Can't save: $!\n";
    print MP3 $mp3_data;
    close(MP3);
    $mp3_data = undef;

    print "+OK - Done!\n";
    return;
}

sub callback {
    my ($data, $response, $protocol) = @_;
    $mp3_data .= $data;
}
As you can see, the server's response is received as a stream by the callback() function, which appends the data to the $mp3_data variable. The data is then piped to the splay program, launched through padsp, which provides OSS emulation (OSS has been removed from Ubuntu). The -M switch makes splay play data from standard input.
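If padsp or splay is not available on your system, the same buffered data can be fed to any command-line MP3 player that reads from standard input. A sketch of an alternative playback helper using mpg123 (assuming it is installed) might look like this:

# Alternative to the "padsp splay -M" pipe: hand the buffered MP3 data to mpg123
# ("-q" keeps it quiet, "-" makes it read from standard input).
sub playMP3 {
    my $data = shift;
    open(my $player, '|-', 'mpg123 -q -') or die "[err] Can't start mpg123: $!\n";
    binmode $player;
    print {$player} $data;
    close $player;
    return;
}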

Now let's talk about where those mysterious flac files in the data directory come from. It's simple: a separate script takes care of this:
#!/usr/bin/perl

use strict;
use IO::Socket;

while (1) {
    my $rnd = int(rand(1000));
    `rec -q -c 1 -r 16000 ./data/input-$rnd.wav trim 0 4`;
    `flac -f -s ./data/input-$rnd.wav -o ./data/input-$rnd.flac`;
    `rm ./data/input-$rnd.wav`;

    my $sock = new IO::Socket::INET(
        PeerAddr => "localhost",
        PeerPort => 16000,
        Proto    => 'tcp') || next;
    print $sock "text " . $rnd;
    undef $rnd;
}
As we can see, recording and format conversion are handled by a couple of programs called from the script:

  1. rec (from the sox package)
  2. flac


The rec command makes short 4-second recordings with a random number in the name, which are then compressed by the flac program. After that, a connection is made to our main daemon and the command text <that_same_random_number> is sent. Why do I record in short 4-second chunks? It comes down to how the computer should capture our voice. There are two possible approaches:

  1. Record continuously
  2. Record a file only when a certain volume level is exceeded


The second option did not suit me for various reasons, poor microphones among them ;) Let's look at the first option, continuous recording, in more detail. We break the recording into many small pieces that are constantly sent to the Google server for recognition. So far, I have found that all of my commands fit into 3-4 seconds at most. If we run several copies of the script (say, 5) at 1-second intervals, we get continuous voice recognition. Let's add this functionality to our main program:
for (1..5) {
    system("perl mic.pl &>/dev/null");
    sleep 1;
}
Now we only need to implement the checkcmd() function in order to test the whole setup. We also want commands to be addressed to the system by name, to rule out false positives.
sub checkcmd {
    my $text = shift;
    if ($text =~ //) {
        sayText("  - $text");    # if $text eq "  ";
    }
}
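The regular expression and the quoted phrases in this listing were lost when the article was converted, so purely as an illustration, a keyword check could look like the sketch below; the wake word and the spoken reply are placeholders, not the original strings:

# Illustrative only: placeholder wake word and reply, not the original strings.
sub checkcmd {
    my $text = shift;
    my $keyword = 'система';                  # hypothetical wake word
    if ($text =~ /\Q$keyword\E/) {            # react only when the keyword is heard
        sayText("Ваша команда - $text");      # placeholder confirmation phrase
    }
    return;
}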
Now let's put it all together. We end up with two scripts, let's call them srv.pl and mic.pl , plus the data subdirectory for storing our sound files.

srv.pl

#!/usr/bin/perl -w

package iON;

use strict;
use utf8;
use WWW::Curl::Easy;
use WWW::Curl::Form;
use JSON::XS;
use URI::Escape;
use LWP::UserAgent;
require Encode;
use base qw(Net::Server::Fork);

##
################################
$| = 1;
our $parent = $$;
our $mp3_data;
################################

for (1..5) {
    system("perl mic.pl &>/dev/null");
    sleep 1;
}

##
###############################
iON->run(port => 16000, background => undef, log_level => 4, host => 'localhost');
################################

################################
sub DESTROY {
    if ($$ == $parent) {
        system("killall perl");
        system("rm data/*.flac && rm data/*.wav");
    }
}

##
################################
sub process_request {
    my $self = shift;
    while (<STDIN>) {
        if (/text (\d+)/) {
            toText($1);
            next;
        }
        if (/quit/i) {
            print "+OK - Bye-bye ;)\n\n";
            last;
        }
        print "-ERR - Command not found\n";
    }
}
###############################

###############################
sub toText {
    my $num = shift;
    print "+OK - Trying recognize text\n";

    my $curl = WWW::Curl::Easy->new;
    $curl->setopt(CURLOPT_HEADER, 1);
    $curl->setopt(CURLOPT_POST, 1);
    #$curl->setopt(CURLOPT_VERBOSE, 1);

    my @myheaders = ();
    $myheaders[0] = "Content-Type: audio/x-flac; rate=16000";
    $curl->setopt(CURLOPT_HTTPHEADER, \@myheaders);
    $curl->setopt(CURLOPT_URL, 'https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=ru-RU');

    my $curlf = WWW::Curl::Form->new;
    $curlf->formaddfile("data/input-$num.flac", 'myfile', "audio/x-flac");
    $curl->setopt(CURLOPT_HTTPPOST, $curlf);

    my $response_body;
    $curl->setopt(CURLOPT_WRITEDATA, \$response_body);

    # Starts the actual request
    my $retcode = $curl->perform;

    # Looking at the results...
    if ($retcode == 0) {
        $response_body =~ /\n\r\n(.*)/g;
        my $json = $1;
        my $json_xs = JSON::XS->new();
        $json_xs->utf8(1);
        my @hypo = $json_xs->decode($json)->{'hypotheses'};
        my $dost = $hypo[0][0]{'confidence'};
        my $text = $hypo[0][0]{'utterance'};
        $dost = 0.0 if !defined $dost;
        $text = "" if !defined $text;
        print "+OK - Text is: \"$text\", confidence is: $dost\n";
        if ($dost > 0.5) {
            checkcmd($text);
        } else {
            print "+ERR - Confidence is lower than 0.5\n";
        }
    } else {
        # Error code, type of error, error message
        print("+ERR - $retcode " . $curl->strerror($retcode) . " " . $curl->errbuf);
    }
    system("rm data/input-$num.flac");
}
###############################

##
###############################
sub checkcmd {
    my $text = shift;
    chomp $text;
    $text =~ s/ $//g;
    print "+OK - Got command \"$text\" (Length: " . length($text) . ")\n";
    if ($text =~ //) {
        sayText("  - $text");
    }
    return;
}

##
###############################
sub sayText {
    my $text = shift;
    print "+OK - Speaking \"$text\"\n";

    my $url = "http://translate.google.com/translate_tts?tl=ru&q=" . uri_escape_utf8($text);
    my $ua = LWP::UserAgent->new(
        agent => "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.872.0 Safari/535.2");
    $ua->get($url, ':content_cb' => \&callback);

    open (MP3, "|padsp splay -M") or die "[err] Can't save: $!\n";
    print MP3 $mp3_data;
    close(MP3);
    $mp3_data = undef;

    print "+OK - Done!\n";
    return;
}

sub callback {
    my ($data, $response, $protocol) = @_;
    $mp3_data .= $data;
}
########################################
########################################

1;

mic.pl

#!/usr/bin/perl

use strict;
use IO::Socket;

while (1) {
    my $rnd = int(rand(1000));
    `rec -q -c 1 -r 16000 ./data/input-$rnd.wav trim 0 3`;
    `flac -f -s ./data/input-$rnd.wav -o ./data/input-$rnd.flac`;
    `rm ./data/input-$rnd.wav`;

    my $sock = new IO::Socket::INET(
        PeerAddr => "localhost",
        PeerPort => 16000,
        Proto    => 'tcp') || next;
    print $sock "text " . $rnd;
    undef $rnd;
}

What happened


Make our scripts executable:

chmod 755 srv.pl mic.pl

Run the srv.pl script, wait for all the processes to start, and say, for example, the phrase: "System! One two three!" A few seconds later we hear: "Your command is one two three". Note that our command will end up in several sound files and will accordingly be executed several times. To avoid this, we need a check for the last executed command; we will add this functionality in the next part.
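Just to give an idea of one possible approach (the author adds his own check in the next part), the daemon could remember the last recognized text and ignore repeats that arrive within a few seconds; note that since Net::Server::Fork handles every connection in a separate child process, real code would have to keep this state somewhere shared, for example in a small file:

# Sketch of a duplicate check; in a forking server this state would need to be shared.
our ($last_cmd, $last_time) = ('', 0);

sub is_duplicate {
    my $text = shift;
    my $now  = time();
    my $dup  = ($text eq $last_cmd && $now - $last_time < 5);
    ($last_cmd, $last_time) = ($text, $now);
    return $dup;
}

# Inside checkcmd(): return if is_duplicate($text);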

Summary


In this article we have implemented the foundation of our software for managing the "smart home" system. So far it cannot do anything except speech recognition and synthesis, but that is only temporary ;)

In the next article I will show how to attach a web interface with a few nice extras and camera viewing to all of this.

UPD: Part 4

Source: https://habr.com/ru/post/129936/

