
Your own bot in a few hours, or talking to the machine about beer

The topic of improving interaction between machines and humans is more relevant than ever. The technical means now exist to move from the "100 clicks" model to the "say what you want" paradigm. Yes, I mean the various bots that all and sundry have been developing for several years now. Many large companies, not only in technology but also in retail, logistics, and banking, are currently running active R&D in this area.

A simple example: how does choosing goods in any online store actually work? A pile of lists and categories that I dig through to pick something. It sucks. Or take an online bank: I am met with assorted menus; if I want to make a transfer, I have to pick the right menu items and enter a bunch of data, and if I want to see the list of transactions, I again have to strain my brain and my index finger. It would be much easier and more convenient to open the page and just say: "I want to buy a liter of milk and half a liter of vodka", or simply ask the bank: "What's up with my money?"

Tellers, call-center operators, and many others are being added to the list of professions facing extinction in the fairly near future. With a simple example, which took me about 7 hours to implement, I will show how easy it is to integrate speech recognition and entity extraction, using the open Wit.AI API (integration with the Google Speech API is included as well).


There are many APIs for speech recognition; some are paid, some are open for building your own systems. For example, the Google Speech API offers 60 minutes of recognition per month for free; beyond that limit, a fee of $0.006 per minute is charged, billed in 15-second increments. Wit.AI, on the other hand, positions itself as an open API for developers, though whether the level of service will hold up if, say, the number of calls grows to hundreds of thousands or even millions per month remains an open question.

A couple of weeks ago we had seminars on Data Science and Artificial Intelligence in Tartu, and many speakers touched, one way or another, on the topic of human-machine interaction in a language people can understand. So on the weekend following the events, I decided to implement speech recognition using public services. And of course I wanted the bot to understand that I want to drink beer, and what kind of beer: dark or light, and ideally the variety as well.

In general, the task is to record on the client side what was said, send it to the server, do some transformations, call a third-party API, and get the result back.
At first I planned to go to the Google Speech API to get a transcript of the audio file, and then send the text string to Wit.AI to get a set of entities and intents.
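Had that two-step flow survived, the second call would have been a plain GET to Wit.AI's /message endpoint with the transcript as the q parameter. A minimal sketch, assuming the endpoint and parameters of Wit.AI's public HTTP API of that time (check the current documentation; the token is your own):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch only: send an already transcribed string to Wit.AI's /message endpoint
// and return the raw JSON with the recognized intent and entities.
public class WitAiTextClient {

    public static String understand(String text, String token, String version) throws Exception {
        String query = String.format("v=%s&q=%s",
                URLEncoder.encode(version, StandardCharsets.UTF_8.name()),
                URLEncoder.encode(text, StandardCharsets.UTF_8.name()));

        URLConnection connection = new URL("https://api.wit.ai/message?" + query).openConnection();
        connection.setRequestProperty("Authorization", "Bearer " + token);

        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line);
            }
        }
        return sb.toString();
    }
}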

The choice of backend was trivial for me: Spring Boot. Not only because the Java stack is native to me, but also because I wanted a small service that acts as an intermediary between third-party APIs and the client; additional functionality can then be added by introducing another service.

I posted the source code on GitHub and deployed a working application on Heroku: https://speechbotpoc.herokuapp.com/. Before using it, allow access to the microphone. Then click the microphone icon, say something, and click the icon again. What you said will be used against you, that is, sent for recognition, and after a while you will see the result in the panel on the left. Since this example may be used by several people at once, the choice of recognition language is disabled to avoid a race condition.



So, to start, we create an empty project using Spring Initializr or the Spring Boot CLI, whichever you prefer; I chose Spring Initializr. The required dependencies are Spring MVC, Lombok (if you don't want to write a lot of boilerplate code), and Log4j. We then import the generated project skeleton into a preferred IDE (does anyone still use Eclipse?).

First, we need to record the audio on the client side. HTML5 provides everything needed for this (the MediaRecorder interface), but there is an excellent implementation by Matt Diamond, distributed under the MIT license, and I decided to take it, because Matt also built a nice visualization for the client side. Most of the time, in fact, went not into the server part but into the client-side implementation. I did not use AngularJS or ReactJS, because I was interested in the integration as such, so my choice was jQuery: cheap and cheerful.

As for the server part, at first I wanted to use the Google Speech API for the initial transcription of the audio into text, and since that API requires the recorded speech to be encoded in Base64, on the client side, after receiving the audio data, I encoded it to Base64 and then sent it to the server.

In our project skeleton we create a controller that will receive the audio data.

@RestController
public class ReceiveAudioController {

    @Autowired
    @Setter
    private WitAiService service;

    private static final Logger logger = LogManager.getLogger(ReceiveAudioController.class);

    @RequestMapping(value = "/save", method = RequestMethod.POST)
    public @ResponseBody String saveAudio(@RequestBody Audio data, HttpServletRequest request) throws IOException {
        logger.info("Request from:" + request.getRemoteAddr());
        return service.processAudio(data);
    }
}

Everything is quite simple: we take the data and hand it over to the service, which does all the further work.

WitAiService is also quite simple.

@Data
@Component
@ConfigurationProperties(prefix = "witai")
public class WitAiService {

    private static final Logger logger = LogManager.getLogger(WitAiService.class);

    private String url;
    private String key;
    private String version;
    private String command;
    private String charset;

    public String processAudio(Audio audio) throws IOException {
        URLConnection connection = getUrlConnection();
        OutputStream outputStream = connection.getOutputStream();
        byte[] decoded = Base64.getDecoder().decode(audio.getContent());
        outputStream.write(decoded);

        BufferedReader response = new BufferedReader(new InputStreamReader(connection.getInputStream()));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = response.readLine()) != null) {
            sb.append(line);
        }
        logger.info("Received from Wit.ai: " + sb.toString());
        return sb.toString();
    }

    private URLConnection getUrlConnection() {
        String query = null;
        try {
            query = String.format("v=%s", URLEncoder.encode(version, charset));
            logger.info("Query string for wit.ai: " + query);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        URLConnection connection = null;
        try {
            connection = new URL(url + "?" + query).openConnection();
        } catch (IOException e) {
            e.printStackTrace();
        }
        connection.setRequestProperty("Authorization", "Bearer " + key);
        connection.setRequestProperty("Content-Type", "audio/wav");
        connection.setDoOutput(true);
        return connection;
    }
}

All the necessary parameters, such as the key and token for Wit.AI, are taken from the application.properties file (yes, I left the tokens of my application open). If you want to register your own application on Wit.AI, you will need to change the tokens and App ID in the settings files. Registration on Wit.AI is quite simple, and the token can be obtained in the Settings section.
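For reference, the witai block of application.properties looks roughly like this. The property names follow directly from the fields of WitAiService above; the values are placeholders (the URL assumes Wit.AI's speech endpoint of that period, and the version is just an example date):

# Sketch of the witai.* block; substitute your own token and the API version you target
witai.url=https://api.wit.ai/speech
witai.key=<your Wit.AI server access token>
witai.version=20170307
witai.command=speech
witai.charset=UTF-8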

Spring Boot pulls these settings in via annotations. The @Data and @Setter annotations come from Project Lombok and help avoid writing boilerplate code such as setters, getters, default constructors, and so on.

I don’t include the model files and the auxiliary converter here; all this can be viewed in the source code.
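Still, for orientation, here is a minimal sketch of what the Audio model might look like; the content field is inferred from the audio.getContent() call above and the payload described below, and the real class in the repository may carry more than this:

import lombok.Data;

// Minimal sketch of the request model: the client sends the recorded audio as a Base64 string.
@Data
public class Audio {
    private String content; // Base64-encoded WAV data from the client
}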

If you open the Developer Console in Chrome, you will see that after you finish recording what was said, a REST call is made to the service, with the Base64-encoded audio data transmitted to the server in the content field. I had some problems with Google billing (billing is tied to an old credit card that is closed, and I haven't issued a new one yet), so after writing the integration with the Google Speech API, I decided to send the recording for recognition directly to Wit.AI. But Wit.AI accepts the data in streaming form as-is, so the server converts from Base64 back to WAV format.

byte[] decoded = Base64.getDecoder().decode(audio.getContent());
outputStream.write(decoded);

A small note on recognition: pay attention to the microphone level, because if it is too sensitive you will get clipped audio, which negatively affects the quality of recognition.

Now that we can get a transcription of what was said, let's teach the application to understand what exactly we want. In the Understanding section of Wit.AI (to use it you need to register your application on Wit.AI, which is easy), we can define so-called intents. It is quite intuitive: we type a phrase, for example "Let's go drink beer", select the word we need, in this case "beer", click Add a new entity, then choose intent and create an intent called "Beer". Going further, we want to understand that we intend to drink the beer, so we create a new entity, "Drink". It is then recommended to enter a few more examples, such as "Maybe a beer today?" or "Let's drink a beer tomorrow", and the system will highlight the "Beer" intent more and more accurately.

Now suppose we want to understand what kind of beer we want to drink, light or dark. In the same way we enter a new phrase, "Let's drink dark beer tomorrow", and on the word "dark" we again click Add a new entity, but this time, instead of using an existing one, we create our own entity, which in my application is called beer_type. We then repeat this for light beer, simply selecting the already created beer_type entity. As a result, the system begins to understand both that I want to drink beer and what kind of beer specifically. All of the above can also be set up not manually but automatically, via the Wit.AI REST interface, so catalogue categories are easy to turn into entities in batch mode, as sketched below.
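As an illustration of that batch route, here is a hedged sketch of pushing a catalogue value into an entity over HTTP. The /entities/{entity}/values path reflects the Wit.AI HTTP API of that period; check the current documentation before relying on it:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: add a value (e.g. a catalogue category) to an existing entity such as beer_type.
public class WitAiEntityLoader {

    public static int addValue(String entity, String value, String token, String version) throws Exception {
        URL url = new URL("https://api.wit.ai/entities/" + entity + "/values?v=" + version);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("POST");
        connection.setRequestProperty("Authorization", "Bearer " + token);
        connection.setRequestProperty("Content-Type", "application/json");
        connection.setDoOutput(true);

        String body = String.format("{\"value\":\"%s\"}", value);
        try (OutputStream out = connection.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        return connection.getResponseCode(); // 200 on success
    }
}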

Try suggesting dark or light beer in the example, and you will see that the system returns an object containing both the Beer intent and a beer_type of dark or light.
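For illustration only, the returned JSON looks roughly like this; the field names such as _text and entities follow the Wit.AI responses of that period, the confidence values are made up, and the exact shape may differ in the current API:

{
  "_text": "let's drink dark beer tomorrow",
  "entities": {
    "intent":    [ { "confidence": 0.97, "value": "beer" } ],
    "beer_type": [ { "confidence": 0.93, "value": "dark" } ]
  }
}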

This example is a bit of a toy, BUT... Entities can be added through the REST interface, so you can quite easily transfer a product catalogue into bot entities, and Wit.AI has many contextual entities, such as time/date, location, and so on. That is, you can extract information from contextual words like today or tomorrow (date), here (place), etc. Detailed documentation is here.

The code itself is quite simple and self-explanatory. The logic of integration with other services is the same: after receiving the transcription, you can pass the string to other services, for example, write a service that adds the desired goods to a basket, or simply pass the string to a neural translation service ( http://xn--neurotlge-v7a.ee/# , for now from Estonian to English and back, but you can train the model on other languages). In other words, this example can serve as a small brick for building a more complex flow of interactions. For instance, I could order beer to my home by combining this example with a food-ordering service, using their token or cookie, or, conversely, send beer suggestions to friends over a messenger. There are a huge number of possible uses.
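To hand the result over to such a service, the JSON string returned by processAudio first has to be parsed. A minimal sketch using Jackson (already on the classpath with Spring Boot's web starter), assuming the response shape sketched earlier:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Minimal sketch: pull the first value of an entity (e.g. "beer_type") out of the Wit.AI response.
// Returns null if the entity was not recognized; assumes the response layout shown above.
public class WitAiResponseParser {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static String firstEntityValue(String witAiJson, String entityName) throws Exception {
        JsonNode entity = MAPPER.readTree(witAiJson).path("entities").path(entityName);
        if (entity.isArray() && entity.size() > 0) {
            return entity.get(0).path("value").asText();
        }
        return null;
    }
}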

Possible areas of application: online stores, online banking, money transfer systems, etc.

If you have any questions, feel free to contact me.

Source: https://habr.com/ru/post/328612/

