Speech recognition specialists have long lacked a large open corpus of spoken Russian, so only large companies could afford to work on this task, and they were in no hurry to share their experience.
We are eager to finally correct this long-standing situation.
So, we bring to your attention a dataset of 4,000 hours of annotated speech collected from various online sources.
Our work is not finished: we want to reach at least 10,000 hours of annotated speech.
After that, we plan to build both open and commercial speech recognition models using this dataset. We invite you to join in: help us improve the dataset and use it in your own tasks.
Why is our goal 10,000 hours?
There are various studies on the generalization of neural networks in speech recognition, but it is known that good generalization is not achieved on datasets of less than roughly 1,000 hours. A figure on the order of 10,000 hours is already considered acceptable in most cases; beyond that, it depends on the specific task.
What else can be done to improve recognition quality if there is still not enough data?
Often you can adapt the neural network to your speakers by having them read prepared texts. You can also adapt the system to the vocabulary of your subject area through the language model.
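To illustrate the language-model part, here is a toy, self-contained sketch (not tied to our models or to any particular recognizer) of the general idea: n-best hypotheses from an acoustic model are rescored with a simple language model built from domain text. All texts, scores, and the weight below are hypothetical.

import math
from collections import Counter

# Hypothetical domain texts (e.g. phrases from your subject area).
domain_texts = [
    "перевод средств на карту",
    "остаток по счету",
    "заблокировать карту",
]

# Toy unigram language model with add-one smoothing.
counts = Counter(w for line in domain_texts for w in line.split())
total = sum(counts.values())
vocab = len(counts) + 1

def lm_logprob(sentence):
    # Sum of smoothed unigram log-probabilities.
    return sum(math.log((counts[w] + 1) / (total + vocab)) for w in sentence.split())

# Hypothetical n-best output of an acoustic model: (hypothesis, acoustic log-score).
nbest = [
    ("остаток по счету", -12.1),
    ("осадок по счету", -11.9),  # acoustically slightly better, but wrong for this domain
]

# Shallow-fusion-style rescoring: combine acoustic and language-model scores.
lm_weight = 0.5
best = max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))
print(best[0])  # the domain LM flips the decision to "остаток по счету"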
How did we do this?
Found YouTube channels with high-quality subtitles and downloaded the audio and subtitles (a minimal download sketch is shown after this list).
Fed the audio to other speech recognition systems and used their output as transcriptions.
Had Russian addresses read aloud by robotic (synthesized) voices.
Found audiobooks and the corresponding book texts on the Internet, split the audio into fragments at pauses, and matched the fragments to the text (the so-called “alignment” task).
Added the small Russian datasets already available on the Internet.
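As an illustration of the YouTube step, a minimal sketch of downloading audio and Russian subtitles with youtube-dl might look like this; the video URL, output layout, and option values are placeholders, not our actual pipeline.

import youtube_dl  # pip install youtube-dl

# Illustrative options: best audio stream plus manually created Russian subtitles.
ydl_opts = {
    "format": "bestaudio/best",
    "writesubtitles": True,
    "subtitleslangs": ["ru"],
    "outtmpl": "raw/%(id)s.%(ext)s",  # hypothetical output layout
}

urls = ["https://www.youtube.com/watch?v=XXXXXXXXXXX"]  # placeholder video id

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(urls)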
After that, all files were converted into a single format (16-bit WAV, 16 kHz, mono) and arranged hierarchically on disk.
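Such a conversion can be done, for example, with ffmpeg; the sketch below shows one possible way to call it from Python (the paths are illustrative, and this is not the exact script we used).

import subprocess
from pathlib import Path

def to_dataset_format(src: Path, dst: Path) -> None:
    """Convert any input audio file to 16-bit PCM WAV, 16 kHz, mono."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-ac", "1",               # mono
         "-ar", "16000",           # 16 kHz sample rate
         "-acodec", "pcm_s16le",   # 16-bit PCM
         str(dst)],
        check=True,
    )

# Example (hypothetical paths):
# to_dataset_format(Path("raw/abc.mp3"), Path("wav/ab/abc.wav"))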
Metadata is stored in a separate file, manifest.csv.
How to use it:
File DB
The location of the files on disk is determined by their hashes; you can read them like this:
from utils.open_stt_utils import read_manifest
from scipy.io import wavfile
from pathlib import Path

manifest_df = read_manifest('path/to/manifest.csv')

for info in manifest_df.itertuples():
    # Each row points to an audio file, its transcription, and the duration in seconds.
    sample_rate, sound = wavfile.read(info.wav_path)
    text = Path(info.text_path).read_text()
    duration = info.duration
The manifest files contain triples: the path to the audio file, the path to the file with the text transcription, and the phrase duration in seconds.
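Assuming read_manifest returns a pandas DataFrame, as the loop above suggests, the manifest is also convenient for quick statistics and filtering; the threshold below is an arbitrary example.

from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')

# Total amount of speech in hours.
print(manifest_df.duration.sum() / 3600)

# Keep only short utterances, e.g. under 10 seconds (arbitrary threshold).
short_df = manifest_df[manifest_df.duration < 10.0]
print(len(short_df), 'utterances shorter than 10 s')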
What can you read or watch in Russian to get better acquainted with speech recognition?
Recently, as part of the Deep Learning course, we recorded a lecture that explains speech recognition (and a bit of speech synthesis) in simple terms. Perhaps it will be useful to you!
Licensing issues
We release the dataset under a dual license: for non-commercial purposes we offer the CC BY-NC 4.0 license; for commercial use, please contact us for an agreement.
As usual in such cases, all rights to the data included in the dataset remain with their owners; our rights apply to the dataset itself. Scientific and educational use is governed by separate rules; see the legislation of your country.