Recognition and conversion of subtitles from VOB to SRT format

In this article I want to address a problem familiar to anyone who likes to watch films and other video in the original language: copying the subtitles from the original DVD. You will probably agree that, in most cases, even the best translation loses out to the original audio track.

As you know, subtitles on DVDs are stored as pre-rendered images, which makes it impossible to edit or translate them directly. The available utilities for automated conversion are geared mainly toward an English-speaking audience and, on top of that, do their job rather poorly: the recognized text contains a lot of errors. Having set out to solve this, I developed and successfully tested, in one evening, a simple method and a Perl script, which I present here.

We will need the following programs: FineReader, SubRip, and a Perl interpreter to run the script that assembles the subtitles from the text files recognized by FineReader. All of these programs are widely known, so Yandex or Google will tell you where to get them.

Let's begin.
1. Run the SubRip utility and open the VOB file of the desired video sequence, the one whose name ends in _0.VOB. Select the subtitle track you need, as shown in the screenshot below. Select the option “Save SubPictures as BMP”.

2. Click the Start button. Select the directory where SubRip will save the extracted subtitle images in BMP format, then specify the file prefix; SubRip adds the sequence number and the BMP extension automatically. Then select the subtitle rendering scheme as shown below. From my own experience, I recommend choosing a custom black-and-white scheme, resetting the values of the Color1 and Color3 parameters and setting the Color2 and Color4 parameters to their minimum values.

3. Wait while SubRip extracts the images from the VOB files and writes them as BMP files to the directory you selected earlier. Progress is displayed in a new window opened by the application.

4. When the process completes, save the timing file that SubRip generated; it is shown in the new window that opened while the BMP images were being generated. Select ASCII format.

5. That's it: we now have the subtitle images and a timing file. Time to open FineReader. Launch FineReader, select the recognition languages present in the subtitles (there may be more than one), choose the option “open PDF or images”, and press CTRL-A in the dialog to select all the images in our directory. Before opening the images, set the recognition options. The settings are shown in the two screenshots below.

To simplify the process, you can use only the built-in templates; if you want to control the recognition process with your own templates, select the second option.

6. After the text has been recognized and checked, save the result. In my experiments FineReader did not always detect the end of a paragraph in the subtitles correctly, so I chose to save each page to a separate file.
The type of file to save (plain text) is shown in the screenshot below:

When saving, select a directory, specify a prefix for the text files, choose “create a separate file for each page” from the drop-down menu, and then click the “options” button

and specify the save options as shown below.

7. As a result of all the steps above, we have a directory with many text files in UTF-8 encoding. Now we need to merge them into a single subtitle file. For this I wrote a small script that assembles the subtitles from the timing file saved in step 4 and the set of text files. Save the Perl script shown below, or download the compiled executable version of the script, and run it with two parameters:
--subtitles the full path and name of the directory with the text files
--timing the full path and name of the timing file.

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use File::Glob;
use utf8;

#perl2exe_include "unicore/Heavy.pl"
#perl2exe_include "overloading.pm"
#perl2exe_include "File/Glob.pm"
#------------------------------------------------------------------
my ($arg_subtitles, $arg_timing);
GetOptions("subtitles=s" => \$arg_subtitles, "timing=s" => \$arg_timing);
usage() if (!$arg_subtitles || !$arg_timing);

# Double every path separator so that glob() does not treat a backslash
# as an escape character (the script was written with Windows paths in mind).
$arg_subtitles =~ s#[/\\]#\\\\#g;
$arg_timing    =~ s#[/\\]#\\\\#g;

my @subs_array;

# Read every recognized text file; the number at the end of the file name
# is the subtitle index.
while (<$arg_subtitles/*.txt>) {
    my $fname = $_;
    # Skip files whose names do not end in a number.
    next unless $fname =~ /^.*?0{0,5}(\d{1,5})\.txt$/;
    my $sub_number = $1;
    local $/;    # slurp mode: read the whole file at once
    open(my $sub_fh, "<", $fname) or die "Can't read file $fname [$!]\n";
    my $buf = <$sub_fh>;
    close($sub_fh);
    $buf =~ s/\xEF\xBB\xBF//;    # strip the UTF-8 byte order mark
    $subs_array[$sub_number] = $buf;
}

# Walk through the timing file and print one SRT entry per matching line.
open(my $timing_fh, "<", $arg_timing) or die "Can't read file $arg_timing [$!]\n";
print "\xEF\xBB\xBF";    # start the output with a UTF-8 byte order mark
while (<$timing_fh>) {
    if (m/(\d{2,2}:\d{2,2}:\d{2,2}):(\d{2,2}) (\d{2,2}:\d{2,2}:\d{2,2}):(\d{2,2}) \S+(\d{5,5})\.\w{3,3}/) {
        my $start_hms  = $1;
        my $start_mls  = $2;
        my $end_hms    = $3;
        my $end_mls    = $4;
        my $sub_number = $5;
        $sub_number =~ s/^0{0,4}//;
        # The two-digit fraction is padded with "0" to fill the
        # three-digit millisecond field of SRT.
        print "$sub_number\n$start_hms,$start_mls"."0"." --> $end_hms,$end_mls"."0"."\n".$subs_array[$sub_number]."\n\n";
    }
}
close($timing_fh);

sub usage {
    die <<"EOT";
Usage: $0 --subtitles path_to_the_subs_folder --timing path_to_the_timing_file

path_to_the_subs_folder is the name of the folder where recognised subtitles are stored
while saving recognised subtitles from BMP images, choose text format and "store one file per page" options
EOT
}
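
The heart of the script above is the regular expression that turns one line of the timing file into an SRT entry number and time range. The sketch below isolates that step so it can be tried on a single line. The line layout (start time, end time, image file name) is inferred from the script's regular expression, not from SubRip documentation, and both the sample line and the function name timing_line_to_srt are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert one line of the timing file into an SRT entry number and time range.
# Assumed line layout (inferred from the script's regex):
#   HH:MM:SS:FF HH:MM:SS:FF prefixNNNNN.bmp
# where FF is a two-digit fraction of a second and NNNNN is the image number.
sub timing_line_to_srt {
    my ($line) = @_;
    return unless $line =~ m/(\d{2}:\d{2}:\d{2}):(\d{2}) (\d{2}:\d{2}:\d{2}):(\d{2}) \S+(\d{5})\.\w{3}/;
    my ($start_hms, $start_frac, $end_hms, $end_frac, $number) = ($1, $2, $3, $4, $5);
    $number += 0;    # numeric context drops the leading zeros
    # As in the script, a trailing zero pads the fraction to three digits.
    return ($number, "$start_hms,${start_frac}0 --> $end_hms,${end_frac}0");
}

# Hypothetical sample line, invented for this example.
my ($n, $range) = timing_line_to_srt("00:01:15:20 00:01:18:04 sub00012.bmp");
print "$n\n$range\n";
```

If a line of your own timing file does not match, the function returns nothing, which is a quick way to find out whether the file uses a different layout than the one assumed here.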

The script writes the generated UTF-8 subtitles to standard output, so you can redirect them to a file of your choice.
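
Before assembling the subtitles, it can help to confirm that every extracted image actually produced a text file, since a missing file would leave an SRT entry without text. The following is a minimal sketch of such a check, not part of the original method. It assumes the text files are numbered consecutively starting from 1 and end in up to five digits before ".txt"; the function name missing_numbers is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Given a list of file names such as "sub00001.txt", return the subtitle
# numbers missing between 1 and the highest number found.
sub missing_numbers {
    my @names = @_;
    my %seen;
    for my $name (@names) {
        # Up to five digits right before ".txt" are taken as the number.
        $seen{ $1 + 0 } = 1 if $name =~ /(\d{1,5})\.txt$/;
    }
    my ($max) = sort { $b <=> $a } keys %seen;
    return () unless defined $max;
    return grep { !$seen{$_} } 1 .. $max;
}

# Pass the directory with the recognized text files as the only argument.
if (@ARGV) {
    my $dir = shift @ARGV;
    opendir(my $dh, $dir) or die "Can't open directory $dir [$!]\n";
    my @missing = missing_numbers(readdir $dh);
    closedir $dh;
    print @missing ? "Missing: @missing\n" : "No gaps found\n";
}
```

Any number it reports points at an image that FineReader skipped or saved under a different name.
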
That's all, thank you for your attention.

Source: https://habr.com/ru/post/189804/
