Processing voice requests in Telegram using Yandex SpeechKit Cloud

How it all began

This summer, I participated in the development of the Datatron bot, which provides access to open financial data from the Russian Federation. At some point, I wanted the bot to process voice requests and decided to use Yandex SpeechKit Cloud for this task.

After a long search for any useful information on this topic, I finally met the author of voiceru_bot, who helped me figure it out (his repository is linked under Sources). Now I want to share this knowledge with you.

From words to practice

Below are code fragments that are ready to use: you can copy them into your project almost unchanged.

Step 1. Where to start?

Get a Yandex account (if you don't have one). Then read the SpeechKit Cloud API terms of use. In short, use is free for non-commercial projects that make no more than 1000 requests per day. Then go to the Developer's Cabinet and order a key for the desired service. It is usually activated within 3 business days (although one of my keys was activated after a week). Finally, review the documentation.

Step 2. Save the sent voice recording

Before you send a request to the API, you need to get the voice message itself. The few lines below fetch an object that holds all the data of the voice message.

import requests

@bot.message_handler(content_types=['voice'])
def voice_processing(message):
    file_info = bot.get_file(message.voice.file_id)
    file = requests.get('https://api.telegram.org/file/bot{0}/{1}'.format(TELEGRAM_API_TOKEN, file_info.file_path))

The response is saved in the file variable. We are primarily interested in its content field, which holds the raw bytes of the sent voice message; we need them for the next steps.
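If you would rather not depend on requests for this one call, the same download can be sketched with the standard library. The helper names build_file_url and download_voice are my own, and the token and file path in the usage note are made up; only the URL pattern comes from the code above.

```python
import urllib.request

# URL pattern taken from the handler above.
TELEGRAM_FILE_URL = 'https://api.telegram.org/file/bot{token}/{path}'


def build_file_url(token, file_path):
    """Return the download URL for a file path reported by bot.get_file()."""
    return TELEGRAM_FILE_URL.format(token=token, path=file_path)


def download_voice(token, file_path, timeout=10):
    """Download the voice message and return its raw bytes.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a failed
    download does not silently return an error page as audio.
    """
    url = build_file_url(token, file_path)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

For example, build_file_url('123:ABC', 'voice/file_0.oga') gives https://api.telegram.org/file/bot123:ABC/voice/file_0.oga.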

Step 3. Recoding

Telegram stores voice messages in OGG format with the Opus audio codec, whereas SpeechKit handles OGG audio with the Speex audio codec. The file therefore has to be converted, preferably to PCM 16000 Hz 16 bit, since according to the documentation this format gives the best recognition quality. FFmpeg is perfect for this. Download it and save it to the project directory, leaving only the bin folder and its contents. The function below uses FFmpeg to re-encode a stream of bytes into the desired format.

import subprocess
import tempfile
import os

def convert_to_pcm16b16000r(in_filename=None, in_bytes=None):
    with tempfile.TemporaryFile() as temp_out_file:
        temp_in_file = None
        if in_bytes:
            # FFmpeg reads its input from a file, so dump the bytes to a temporary one
            temp_in_file = tempfile.NamedTemporaryFile(delete=False)
            temp_in_file.write(in_bytes)
            in_filename = temp_in_file.name
            temp_in_file.close()
        if not in_filename:
            raise Exception('Neither input file name nor input bytes is specified.')

        # Command line used to invoke FFmpeg
        command = [
            r'Project\ffmpeg\bin\ffmpeg.exe',  # path to ffmpeg.exe
            '-i', in_filename,
            '-f', 's16le',
            '-acodec', 'pcm_s16le',
            '-ar', '16000',
            '-'
        ]

        proc = subprocess.Popen(command, stdout=temp_out_file, stderr=subprocess.DEVNULL)
        proc.wait()

        if temp_in_file:
            os.remove(in_filename)

        temp_out_file.seek(0)
        return temp_out_file.read()
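A quick sanity check on the converted audio: raw PCM has no header, so its length in bytes maps directly to its duration. The sketch below assumes the output is mono, which the FFmpeg command above does not force, and the function name is my own.

```python
SAMPLE_RATE = 16000      # samples per second (the '-ar 16000' option)
BYTES_PER_SAMPLE = 2     # 16-bit samples (pcm_s16le)
CHANNELS = 1             # assumption: mono audio


def pcm_duration_seconds(num_bytes):
    """Return the duration of a raw PCM buffer of the given size in bytes."""
    bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
    return num_bytes / bytes_per_second
```

Under these assumptions one second of audio takes 32000 bytes, so a block of 1024 ** 2 bytes holds about 32.8 seconds of speech.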

Step 4. Audio transmission in parts

The SpeechKit Cloud API accepts a file of up to 1 MB, and its size must be specified separately (in the Content-Length header). It is better, however, to send the file in parts (each no larger than 1 MB) using the Transfer-Encoding: chunked header. This is safer, and recognition will be faster.

def read_chunks(chunk_size, bytes):
    while True:
        chunk = bytes[:chunk_size]
        bytes = bytes[chunk_size:]

        yield chunk

        if not bytes:
            break
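For reference, this is roughly what Transfer-Encoding: chunked looks like on the wire: each block is preceded by its length in hexadecimal and followed by CRLF, and a zero-length block ends the body. A minimal sketch, with helper names of my own (frame_chunk, encode_chunked):

```python
CRLF = b'\r\n'
LAST_CHUNK = b'0' + CRLF + CRLF   # zero-length chunk that terminates the body


def frame_chunk(chunk):
    """Wrap one block of bytes as a single HTTP chunk: hex size, CRLF, data, CRLF."""
    return b'%x' % len(chunk) + CRLF + chunk + CRLF


def encode_chunked(chunks):
    """Join framed chunks into a complete chunked body.

    Empty blocks are skipped, because a zero-length chunk would end the body early.
    """
    return b''.join(frame_chunk(c) for c in chunks if c) + LAST_CHUNK
```

Building the whole body in memory like this is only meant to show the framing; when talking to a server, it makes more sense to send each framed block as soon as it is ready.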

Step 5. Sending a request to the Yandex API and parsing the response

Finally, the last step is to write a single function that serves as the "API" of this module. It first calls the functions that convert the audio and split the bytes into blocks, then sends the request to SpeechKit Cloud and reads the response. By default, the topic of the request is notes and the language is Russian.

import xml.etree.ElementTree as XmlElementTree
import httplib2
import uuid
from config import YANDEX_API_KEY

YANDEX_ASR_HOST = 'asr.yandex.net'
YANDEX_ASR_PATH = '/asr_xml'
CHUNK_SIZE = 1024 ** 2

# Note: the default request_id is evaluated once, when the function is defined,
# so every call that omits it reuses the same identifier.
def speech_to_text(filename=None, bytes=None, request_id=uuid.uuid4().hex, topic='notes', lang='ru-RU',
                   key=YANDEX_API_KEY):
    # If a file name was passed, read the bytes from the file
    if filename:
        with open(filename, 'br') as file:
            bytes = file.read()
    if not bytes:
        raise Exception('Neither file name nor bytes provided.')

    # Convert to the required format
    bytes = convert_to_pcm16b16000r(in_bytes=bytes)

    # Build the URL of the request to the Yandex API
    url = YANDEX_ASR_PATH + '?uuid=%s&key=%s&topic=%s&lang=%s' % (
        request_id,
        key,
        topic,
        lang
    )

    # Split the bytes into blocks
    chunks = read_chunks(CHUNK_SIZE, bytes)

    # Open the connection and build the request
    connection = httplib2.HTTPConnectionWithTimeout(YANDEX_ASR_HOST)

    connection.connect()
    connection.putrequest('POST', url)
    connection.putheader('Transfer-Encoding', 'chunked')
    connection.putheader('Content-Type', 'audio/x-pcm;bit=16;rate=16000')
    connection.endheaders()

    # Send the bytes block by block
    for chunk in chunks:
        connection.send(('%s\r\n' % hex(len(chunk))[2:]).encode())
        connection.send(chunk)
        connection.send('\r\n'.encode())

    connection.send('0\r\n\r\n'.encode())
    response = connection.getresponse()

    # Process the server response
    if response.code == 200:
        response_text = response.read()
        xml = XmlElementTree.fromstring(response_text)

        if int(xml.attrib['success']) == 1:
            max_confidence = - float("inf")
            text = ''

            # Pick the variant with the highest confidence
            for child in xml:
                if float(child.attrib['confidence']) > max_confidence:
                    text = child.text
                    max_confidence = float(child.attrib['confidence'])

            if max_confidence != - float("inf"):
                return text
            else:
                # Raising your own exceptions for business-logic errors is good practice
                raise SpeechException('No text found.\n\nResponse:\n%s' % (response_text))
        else:
            raise SpeechException('No text found.\n\nResponse:\n%s' % (response_text))
    else:
        raise SpeechException('Unknown error.\nCode: %s\n\n%s' % (response.code, response.read()))

# Our own exception type
class SpeechException(Exception):
    pass
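The response handling can be tried out offline. The sketch below picks the variant with the highest confidence from a sample response. The sample XML is an assumption modeled on what the code above reads (a success attribute on the root element and a confidence attribute on each child); the element names, the recognized phrases and the best_variant helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; only the attributes read below matter here.
SAMPLE_RESPONSE = b'''<?xml version="1.0" encoding="utf-8"?>
<recognitionResults success="1">
    <variant confidence="0.69">hello world</variant>
    <variant confidence="0.7">hello word</variant>
</recognitionResults>'''


def best_variant(response_bytes):
    """Return the text of the most confident variant, or None if there is none."""
    root = ET.fromstring(response_bytes)
    if root.attrib.get('success') != '1':
        return None
    variants = list(root)
    if not variants:
        return None
    best = max(variants, key=lambda v: float(v.attrib['confidence']))
    return best.text


print(best_variant(SAMPLE_RESPONSE))
```

Returning None instead of raising keeps the sketch short; in the module itself, raising SpeechException tells the caller why nothing was recognized.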

Step 6. Using the written module

Now we add the main handler, which calls the speech_to_text function. The only addition is handling the case where the user sends a voice message that contains no sound or no recognizable text. Remember to import the speech_to_text function and the SpeechException class if necessary.

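@bot.message_handler(content_types=['voice'])
def voice_processing(message):
    file_info = bot.get_file(message.voice.file_id)
    file = requests.get(
        'https://api.telegram.org/file/bot{0}/{1}'.format(TELEGRAM_API_TOKEN, file_info.file_path))

    try:
        # Call our new module
        text = speech_to_text(bytes=file.content)
    except SpeechException:
        # Handle the case where recognition failed
        # (placeholder reply; replace it with your own handling)
        bot.send_message(message.chat.id, 'Could not recognize the voice message.')
    else:
        # Business logic
        # (placeholder: send the recognized text back to the user)
        bot.send_message(message.chat.id, text)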

That's all. Now you can easily implement voice processing in your projects, and not only in Telegram: the same approach works on other platforms too.

Sources:

» @Voiceru_bot: https://github.com/just806me/voiceru_bot
» The Telebot library was used to work with the Telegram API in Python

Source: https://habr.com/ru/post/311578/
