
A huge open dataset of Russian speech


Speech-recognition specialists have long lacked a large open corpus of spoken Russian, so only large companies could afford to work on this task, and they were in no hurry to share their experience.

We are hastening to correct this misunderstanding, which has lasted for years.
So, we bring to your attention a dataset of 4,000 hours of annotated speech, collected from various online sources.

Details under the cut.

Here is the data for the current version 0.3:

Data type                     | Annotation            | Quality         | Phrases | Hours | GB
Books                         | alignment             | 95% / clean     | 1.1M    | 1,511 | 166
Calls                         | ASR                   | 70% / noisy     | 837K    | 812   | 89
Generated (Russian addresses) | TTS                   | 100% / 4 voices | 1.7M    | 754   | 81
Speech from YouTube videos    | subtitles             | 95% / noisy     | 786K    | 724   | 78
Books                         | ASR                   | 70% / noisy     | 124K    | 116   | 13
Other datasets                | reading and alignment | 99% / clean     | 17K     | 43    | 5
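As a quick sanity check, summing the hours column of the table above comes out close to the advertised 4,000 hours:

```python
# Hours per subset, taken from the version 0.3 table above
hours = {
    'books (alignment)': 1511,
    'calls (ASR)': 812,
    'generated addresses (TTS)': 754,
    'YouTube (subtitles)': 724,
    'books (ASR)': 116,
    'other datasets': 43,
}

total_hours = sum(hours.values())
print(total_hours)  # 3960, i.e. roughly 4,000 hours
```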

And here, right away, is a link to the site of our corpus.

Will we develop the project further?


Our work is not finished: we want to collect at least 10,000 hours of annotated speech.

After that we plan to build open and commercial speech-recognition models on this dataset. We invite you to join in: help us improve the dataset, and use it in your own tasks.

Why is our goal 10,000 hours?


There are various studies of how neural networks generalize in speech recognition, but it is known that good generalization is not achieved on datasets of less than 1,000 hours. A figure on the order of 10,000 hours is already considered acceptable in most cases; beyond that, it depends on the specific task.

What else can be done to improve recognition quality if there is still not enough data?


Often, you can adapt a neural network to your speakers by having them read texts aloud and fine-tuning on the recordings.
You can also adapt the network to the vocabulary of your subject area via the language model.
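As a toy illustration of language-model adaptation (this is our own sketch, not the project's pipeline): counting n-grams over in-domain text yields a scorer that prefers recognizer hypotheses matching the domain. A minimal add-alpha-smoothed bigram scorer, with all names hypothetical:

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        tokens = ['<s>'] + s.split() + ['</s>']
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
    return uni, bi

def score(sentence, uni, bi, vocab_size, alpha=1.0):
    """Log-probability of a sentence under the bigram model, add-alpha smoothed."""
    tokens = ['<s>'] + sentence.split() + ['</s>']
    logp = 0.0
    for a, b in zip(tokens, tokens[1:]):
        logp += math.log((bi[(a, b)] + alpha) / (uni[a] + alpha * vocab_size))
    return logp

# A tiny in-domain corpus (e.g. street addresses) biases the scorer
uni, bi = train_bigram_counts(['улица ленина дом пять', 'улица мира дом три'])
V = len(uni)
# An in-domain word order scores higher than a scrambled one
print(score('улица ленина дом три', uni, bi, V) >
      score('три дом ленина улица', uni, bi, V))  # True
```

In practice one would interpolate such an in-domain model with a large general-purpose language model rather than use it alone.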

How did we do this?



How to use it:


File DB


File locations are determined by the hashes of their contents, like this:

 import hashlib
 from pathlib import Path

 target_format = 'wav'
 wavb = wav.tobytes()  # wav is a numpy array with the audio samples
 f_hash = hashlib.sha1(wavb).hexdigest()
 store_path = Path(root_folder,
                   f_hash[0],
                   f_hash[1:3],
                   f_hash[3:15] + '.' + target_format)
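Wrapped into a helper (the function name `store_path_for` is ours, not the project's), the scheme maps any byte string to a stable, sharded path; for example, the well-known SHA-1 of `b'abc'` (`a9993e36...`) lands under `a/99/`:

```python
import hashlib
from pathlib import Path

def store_path_for(wav_bytes, root_folder='data', target_format='wav'):
    """Derive a sharded storage path from the SHA-1 of the audio bytes."""
    f_hash = hashlib.sha1(wav_bytes).hexdigest()
    # First hex digit -> top-level dir, next two -> subdir,
    # next twelve -> file name
    return Path(root_folder, f_hash[0], f_hash[1:3],
                f_hash[3:15] + '.' + target_format)

print(store_path_for(b'abc'))  # data/a/99/93e364706816.wav (on POSIX)
```

Sharding by hash prefix keeps any single directory from accumulating millions of files, which matters at the scale of this corpus.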

Reading files


 from utils.open_stt_utils import read_manifest
 from scipy.io import wavfile
 from pathlib import Path

 manifest_df = read_manifest('path/to/manifest.csv')
 for info in manifest_df.itertuples():
     sample_rate, sound = wavfile.read(info.wav_path)
     text = Path(info.text_path).read_text()
     duration = info.duration

The manifest files contain triples: the path to the audio file, the path to the file with the text transcript, and the phrase duration in seconds.
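The project ships its own `read_manifest`; a hypothetical re-implementation of the same idea (assuming a headerless CSV with exactly those three columns) might look like:

```python
import io
import pandas as pd

def read_manifest(path_or_buffer):
    # Assumed layout: audio path, transcript path, duration in seconds,
    # one utterance per row, no header.
    return pd.read_csv(path_or_buffer,
                       names=['wav_path', 'text_path', 'duration'])

# Works on a file path or any file-like object
demo = io.StringIO('a.wav,a.txt,1.5\nb.wav,b.txt,0.8\n')
manifest_df = read_manifest(demo)
for info in manifest_df.itertuples():
    print(info.wav_path, info.duration)
```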

Filter only files of a certain length


 from utils.open_stt_utils import (plain_merge_manifests,
                                   check_files,
                                   save_manifest)

 train_manifests = [
     'path/to/manifest1.csv',
     'path/to/manifest2.csv',
 ]
 train_manifest = plain_merge_manifests(train_manifests,
                                        MIN_DURATION=0.1,
                                        MAX_DURATION=100)
 check_files(train_manifest)
 save_manifest(train_manifest, 'my_manifest.csv')
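For reference, here is a hedged sketch of what `plain_merge_manifests` could do under the hood (our guess at the semantics, not the project's code): concatenate the manifests and drop utterances outside the duration window.

```python
import io
import pandas as pd

def plain_merge_manifests(manifests, MIN_DURATION, MAX_DURATION):
    # manifests: file paths or file-like objects holding headerless CSV triples
    frames = [pd.read_csv(m, names=['wav_path', 'text_path', 'duration'])
              for m in manifests]
    merged = pd.concat(frames, ignore_index=True)
    # Keep only utterances whose duration lies inside the window (inclusive)
    keep = merged.duration.between(MIN_DURATION, MAX_DURATION)
    return merged[keep].reset_index(drop=True)

parts = [io.StringIO('a.wav,a.txt,0.05\n'),  # too short, filtered out
         io.StringIO('b.wav,b.txt,2.0\n')]
filtered = plain_merge_manifests(parts, MIN_DURATION=0.1, MAX_DURATION=100)
print(len(filtered))  # 1
```

Dropping very short clips removes empty or clipped utterances, while a maximum duration keeps GPU memory usage per batch predictable during training.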

What can you read or watch (in Russian) to get better acquainted with speech recognition?


Recently, as part of a Deep Learning course, we recorded a lecture explaining speech recognition (and a bit of speech synthesis) in simple terms. Perhaps it will be useful to you!


Licensing issues



Once again, here is the project site, for those who missed the link above.

Source: https://habr.com/ru/post/450760/

