Speech recognition specialists have long lacked a large open corpus of spoken Russian, so only large companies could afford to work on this task, and they were in no hurry to share their experience.
We are eager to finally correct this long-standing situation.
So, we bring to your attention a dataset of 4,000 hours of annotated speech collected from various online sources.
Our work is not finished: we want to reach at least 10,000 hours of annotated speech.
After that, we plan to build both open and commercial speech recognition models using this dataset. We invite you to join in: help us improve the dataset and use it in your own tasks.
Why is our goal 10,000 hours?
There are various studies on the generalization of neural networks in speech recognition, but it is known that good generalization is not achieved on datasets of less than roughly 1,000 hours. A figure on the order of 10,000 hours is already considered acceptable in most cases; beyond that, it depends on the specific task.
What else can be done to improve recognition quality if there is still not enough data?
Often you can adapt the neural network to your speakers by having them read prepared texts. You can also adapt the system to the vocabulary of your subject area through the language model.
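To illustrate the language-model part, here is a toy, self-contained sketch (not tied to our models or to any particular recognizer) of the general idea: n-best hypotheses from an acoustic model are rescored with a simple language model built from domain text. All texts, scores, and the weight below are hypothetical.

import math
from collections import Counter

# Hypothetical domain texts (e.g. phrases from your subject area).
domain_texts = [
    "перевод средств на карту",
    "остаток по счету",
    "заблокировать карту",
]

# Toy unigram language model with add-one smoothing.
counts = Counter(w for line in domain_texts for w in line.split())
total = sum(counts.values())
vocab = len(counts) + 1

def lm_logprob(sentence):
    # Sum of smoothed unigram log-probabilities.
    return sum(math.log((counts[w] + 1) / (total + vocab)) for w in sentence.split())

# Hypothetical n-best output of an acoustic model: (hypothesis, acoustic log-score).
nbest = [
    ("остаток по счету", -12.1),
    ("осадок по счету", -11.9),  # acoustically slightly better, but wrong for this domain
]

# Shallow-fusion-style rescoring: combine acoustic and language-model scores.
lm_weight = 0.5
best = max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))
print(best[0])  # the domain LM flips the decision to "остаток по счету"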
How did we do this?
Found YouTube channels with high-quality subtitles and downloaded the audio and subtitles (a minimal download sketch is shown after this list).
Fed the audio to other speech recognition systems and used their output as transcriptions.
Had Russian addresses read aloud by robotic (synthesized) voices.
Found audiobooks and the corresponding book texts on the Internet, split the audio into fragments at pauses, and matched the fragments to the text (the so-called “alignment” task).
Added the small Russian datasets already available on the Internet.
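As an illustration of the YouTube step, a minimal sketch of downloading audio and Russian subtitles with youtube-dl might look like this; the video URL, output layout, and option values are placeholders, not our actual pipeline.

import youtube_dl  # pip install youtube-dl

# Illustrative options: best audio stream plus manually created Russian subtitles.
ydl_opts = {
    "format": "bestaudio/best",
    "writesubtitles": True,
    "subtitleslangs": ["ru"],
    "outtmpl": "raw/%(id)s.%(ext)s",  # hypothetical output layout
}

urls = ["https://www.youtube.com/watch?v=XXXXXXXXXXX"]  # placeholder video id

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(urls)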
After that, all files were converted into a single format (16-bit WAV, 16 kHz, mono) and arranged hierarchically on disk.
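Such a conversion can be done, for example, with ffmpeg; the sketch below shows one possible way to call it from Python (the paths are illustrative, and this is not the exact script we used).

import subprocess
from pathlib import Path

def to_dataset_format(src: Path, dst: Path) -> None:
    """Convert any input audio file to 16-bit PCM WAV, 16 kHz, mono."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-ac", "1",               # mono
         "-ar", "16000",           # 16 kHz sample rate
         "-acodec", "pcm_s16le",   # 16-bit PCM
         str(dst)],
        check=True,
    )

# Example (hypothetical paths):
# to_dataset_format(Path("raw/abc.mp3"), Path("wav/ab/abc.wav"))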
Metadata is stored in a separate file, manifest.csv.
How to use it:
File DB
The location of the files on disk is determined by their hashes; you can read them like this:
from utils.open_stt_utils import read_manifest
from scipy.io import wavfile
from pathlib import Path

manifest_df = read_manifest('path/to/manifest.csv')

for info in manifest_df.itertuples():
    # Each row points to an audio file, its transcription, and the duration in seconds.
    sample_rate, sound = wavfile.read(info.wav_path)
    text = Path(info.text_path).read_text()
    duration = info.duration
The manifest files contain triples: the path to the audio file, the path to the file with the text transcription, and the phrase duration in seconds.
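Assuming read_manifest returns a pandas DataFrame, as the loop above suggests, the manifest is also convenient for quick statistics and filtering; the threshold below is an arbitrary example.

from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')

# Total amount of speech in hours.
print(manifest_df.duration.sum() / 3600)

# Keep only short utterances, e.g. under 10 seconds (arbitrary threshold).
short_df = manifest_df[manifest_df.duration < 10.0]
print(len(short_df), 'utterances shorter than 10 s')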
What can you read or watch in Russian to get better acquainted with speech recognition?
Recently, as part of the Deep Learning course, we recorded a lecture that explains speech recognition (and a bit of speech synthesis) in simple terms. Perhaps it will be useful to you!
Licensing issues
We release the dataset under a dual license: for non-commercial purposes we offer the CC BY-NC 4.0 license; for commercial use, please contact us for an agreement.
As usual in such cases, all rights to the data included in the dataset remain with their owners; our rights apply to the dataset itself. Scientific and educational use is governed by separate rules; see the legislation of your country.