A "Word" for human speech: what Descript can do

Descript has developed an audio editor that not only produces text transcripts of podcasts, speeches, and talks, but also lets you edit the audio recordings as easily as ordinary text in Word.

At the end of last year, Descript, the startup of Andrew Mason (founder of Groupon), raised $5 million from the venture fund a16z (Andreessen Horowitz).

In this article we take a closer look at the new product and touch on other developments in the field of transcription.

/ photo / Victorgrigas / CC

How did Descript come about and what problem did it solve?

Descript started three years ago as an internal tool for another application that Andrew Mason is working on. That application, Detour, is an audio guide that replaces a human tour guide with pre-recorded tours and stories about the sights.

The service now offers more than 150 audio tours in 17 of the world's largest cities. Recording and editing audio is a laborious process that takes time and the work of specialists. At the same time, the company's business model calls for fairly rapid scaling and for bringing in a large number of narrators who do not have the skills needed to edit recordings.

This is where Descript comes in: an audio editor with built-in transcription. It converts the narration into text and lets you edit the audio by editing that text. In this way the company streamlines the recording and processing of voice-overs.

For the past two and a half years, the Detour team has helped audio content producers work with Descript. The experience gained along the way allowed the company to refine the application and release it as a standalone product.

What this audio editor can do

The capabilities of Descript in its current state are as follows:

Works with recordings in .m4a, .mp3, .aiff, .aac, and .wav formats; you can upload several audio files for processing at once.
Transcribes with an accuracy of 93.3%, according to the company, which compares itself with the competitors Temi (88.3%), Trint (87.4%), and Happyscribe (86.6%) and publishes a comparison table of similar services with sample audio recordings.
Lets you add pauses and rearrange fragments; edits stay synchronized with the audio, which you can listen to immediately, on the WYSIWYG principle.
Exports projects to Apple Logic Pro X, AVID Pro Tools, and Adobe Audition, and supports commenting similar to the review mode in Word or Google Docs.
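
The text-driven editing in the list above can be illustrated with a minimal sketch. It assumes, hypothetically, that the transcription step returns a start and end time for every recognized word; the article does not describe Descript's internals, so the `Word` class and the `segments_to_keep` function are illustrative names, not part of any real API. Under that assumption, deleting words in the transcript reduces to computing which audio ranges remain:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One transcribed word with its position in the audio (hypothetical format)."""
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


def segments_to_keep(words, deleted):
    """Return the (start, end) audio ranges left after removing the words
    whose indices are in `deleted`. Runs of neighbouring kept words are
    merged into one range so that the natural pauses between them survive."""
    segments = []
    prev_kept = None
    for i, word in enumerate(words):
        if i in deleted:
            continue
        if prev_kept == i - 1 and segments:
            # The previous word was kept too: extend the current range.
            segments[-1] = (segments[-1][0], word.end)
        else:
            # A deletion (or the start of the file) precedes this word.
            segments.append((word.start, word.end))
        prev_kept = i
    return segments


transcript = [
    Word("hello", 0.0, 0.4),
    Word("um", 0.5, 0.7),
    Word("world", 0.8, 1.2),
]

# Deleting the filler word "um" in the text leaves two audio ranges.
print(segments_to_keep(transcript, {1}))  # [(0.0, 0.4), (0.8, 1.2)]
```

An audio tool would then cut the recording at these ranges and join them; rearranging fragments could work the same way, by reordering the ranges before joining.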

Comparable services use speech recognition APIs from IBM Watson, Speechmatics, Nuance, Microsoft, and Amazon. The Descript team chose Google's API.

According to the team, the main argument in its favor is access to the huge volumes of data needed to build models for accurate speech recognition. In Google's case, YouTube, for example, can serve as such a repository of speech samples.

Who else is doing something similar

In 2016, researchers at Princeton University developed another “Photoshop for audio”, VoCo (by the way, alizar has already written about it). Like Descript, this tool lets you edit audio recordings as text, but it can also synthesize new words or phrases in the speaker's voice, even if they did not appear in the original recording. This requires a 20-minute recording of the speaker. VoCo takes context into account and gives the new fragments appropriate intonation.

Such services can help more than just journalists, media companies, or entrepreneurs building startups around audio content. For people who, because of certain illnesses, can communicate only through speech synthesis systems, VoCo and similar tools could provide a less “robotic” voice. One of the best-known examples is the speech synthesis system that Intel developed specifically for Stephen Hawking (this system and its predecessors were covered on GT here and here).

/ photo / Intel Free Press / CC

The startup Lyrebird, introduced this year, followed in VoCo's footsteps. Compared with the Princeton University project, Lyrebird needs to analyze only 60 seconds of audio in order to synthesize speech.

The startup Voysis also announced itself this year; it targets niche uses of voice services like Siri and Alexa. Another project is the NowTranscribe service, which specializes in predicting fragments that can be used to supplement or correct the original audio recording. One more example is Trint, which can determine which speaker uttered each phrase in a recording. It works with 13 languages and is geared toward recorded conferences and negotiations.

Speech synthesis and the question of ethics

The emergence of Descript and similar services raises the question of the ethical use of speech synthesis systems. With such tools, anyone can fabricate a new audio recording from separate fragments of another person’s speech. This opens the door to various fraud schemes, social engineering attacks, and direct damage to speakers' reputations.

The developers of such projects are well aware of this. The website of the startup Lyrebird has a whole section devoted to the ethical side of the issue. And Andrew Mason, the head of Descript, stresses that in the near future the credibility of any audio material may fall, much as it has for photos, which can be altered with well-known graphics editors.

Source: https://habr.com/ru/post/374235/
