
Today we will talk about the prospects and opportunities that the VoiceFabric cloud service offers developers and users. The service voices any text with a synthesized voice in real time. Below the cut, we will describe our synthesis in detail: scenarios for using it (standard and not so standard), how to connect it to your projects, and what makes it unique.
Why do you need speech synthesis?
Over the history of the service, we have received hundreds of different ideas from customers on how this technology can be applied. Sometimes the task is adapting services and sites for people with visual impairments, but many people use synthesis simply for their own convenience (for example, to listen to books in the car). Speech synthesis can also be extremely effective at solving the business problems of large companies and startups.

If you classify all the requests, you end up with a fairly short list:
1. Voicing books and articles for personal use. You can also make audiobooks and offer them to others.
2. Voicing videos on YouTube and other video channels. Usually these are educational videos and lectures, or foreign videos and interviews that have Russian subtitles.
For example.
3. Creating audio podcasts from RSS and news feeds.
4. Voicing website content.
For example, the button in the header of our site.
5. Voicing any dynamic information in call-center IVR menus (telephony). Static messages work, too. Call the contact centers of Russian Railways, MegaFon, Rosselkhozbank, and others to hear it in action.
6. Social networks. For example, we have a joint project with VKontakte.
7. Mobile applications.
8. Informational messages over public address (PA) systems: announcements at railway stations and in transport, various auto-informers, auto-dialers, etc.
9. Voices for robots and virtual consultants, where the texts change all the time and voicing every variant with human announcers would be slow and inconvenient.
What kind of speech synthesis we have
Currently there are 9 different voices:
- 7 in Russian (2 male and 5 female);
- 1 in American English (Carol);
- 1 in Kazakh (Assel). According to our data, this is the only Kazakh speech synthesis in the world that is ready for industrial deployment; at least, we have not found any analogues. If you know of one, please share it in the comments.
All the voice samples can be heard here.
Each voice is available at 8000 Hz (for telephony) and at 22050 Hz.
Our Russian synthesis was developed by Russian scientists and engineers. It incorporates the rules, grammar, peculiarities, and abbreviations inherent in the Russian language. When creating the foreign voices, we brought in native speakers to account for the features and nuances of their languages.
To understand how our Russian synthesis differs from foreign analogues, listen to how it reads unprepared informational text: natural, conversational text that was originally written for people to read. Such texts are usually full of abbreviations and shorthand that are immediately clear to a human, but nobody who wrote them expected that a machine would ever read them aloud.
Try voicing a phrase like “University named after prof. Bonch-Bruyevich is located in St. Petersburg, Bolshevik Ave., 22” in Google TTS, or something similar, and then compare it with our synthesis. In large deployments we constantly run into such texts. A vivid example is a call-center knowledge base that was once written for human operators. Converting such a knowledge base into a form digestible by a machine is an expensive and lengthy exercise.
We also support Lipsync technology: animated lips move in time with what is being spoken, so you can create virtual characters whose lips move correctly when they say something.
And, of course, there is support for SSML (Speech Synthesis Markup Language).
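For readers who have not met SSML before, it is an XML-based markup that lets you control pauses, emphasis, and how fragments such as times or dates are read. The snippet below uses standard SSML elements purely as an illustration; which tags VoiceFabric actually supports, and in what form, is described in the service documentation.

```xml
<speak>
  Hello! <break time="500ms"/>
  The next word is <emphasis level="strong">important</emphasis>.
  The meeting starts at <say-as interpret-as="time">10:30</say-as>.
</speak>
```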
We also create unique voices to order. We have even had the experience of creating a synthesized voice of a person who has long been "no longer with us." The synthesis was trained on old recordings (even gramophone records), so it sounds accordingly. Nevertheless, it is a real synthesis and can read any modern text. You can listen to the result here.
A few words on how to embed synthesis into your project
We offer two ways to use VoiceFabric TTS:
1) An API key that is embedded in the web request.
The VoiceFabric API exchanges data with your application over HTTPS. Text of up to 4096 characters can be passed to the synthesizer in a GET request; text of up to 10 MB can be sent in a POST request.
The output audio format is codec = pcm, bit = 16, rate = 8000, raw (i.e., headerless PCM).
All requests must be formed according to the HTTP protocol: query string parameters are URL-encoded and separated with &, and so on.
All the details are in the integration documentation.
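To make the first option more concrete, here is a minimal Python sketch. The endpoint URL and the parameter names (apikey, voice, text) are assumptions made purely for illustration; the real request format is described in the integration documentation linked above. Since the response is headerless 16-bit, 8000 Hz PCM, the sketch also wraps it into a WAV container so any player can open it.

```python
# Minimal sketch of a GET request to a TTS endpoint.
# The URL and parameter names below are illustrative assumptions;
# consult the integration documentation for the real ones.
import wave
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://example.voicefabric.net/tts"  # hypothetical endpoint
API_KEY = "your-api-key"

def synthesize(text: str, voice: str) -> bytes:
    """Send text (up to 4096 characters for GET) and return raw PCM audio."""
    if len(text) > 4096:
        raise ValueError("Texts longer than 4096 characters must be sent via POST")
    # Query string parameters are URL-encoded and joined with '&'
    query = urlencode({"apikey": API_KEY, "voice": voice, "text": text})
    with urlopen(f"{API_URL}?{query}") as response:
        return response.read()  # codec=pcm, 16 bit, 8000 Hz, no header (raw)

def save_as_wav(pcm: bytes, path: str) -> None:
    """Wrap the headerless PCM stream in a WAV container for playback."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)     # mono
        wav.setsampwidth(2)     # 16 bit = 2 bytes per sample
        wav.setframerate(8000)  # telephony sample rate
        wav.writeframes(pcm)

if __name__ == "__main__":
    audio = synthesize("Привет, Хабр!", voice="voice-name-from-docs")
    save_as_wav(audio, "sample.wav")
```

For longer texts the same idea applies, except that the text travels in the body of a POST request (up to 10 MB) rather than in the query string.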
2) A web service where you can paste any text (Ctrl+C / Ctrl+V), select a voice, and get the voiced text back as an audio file.
Try it, take a look, and write comments. Feedback is very important to us.
P.S. On a personal note. I have been working on speech synthesis for a long time, and I no longer read many Habr articles on the site; I listen to them instead. I simply don't have time to read, whereas I can listen to interesting articles while doing other things, or turn an article into an MP3 and go outside.