Adaptation of subtitles based on vocabulary

Idea

I watch a lot of films. Once my English was good enough to watch them with subtitles in the original language (English in 99% of cases), I kept running into the same problem: a word or two would come up that I did not know, and I had to look it up in a dictionary or switch to Russian subtitles, which spoiled the viewing. At first I thought it would be a good idea to find out in advance which unknown words would come up and memorize them, but it turned out that there were about 200 such words per film, which made memorizing them impractical. So I came up with the idea of inserting the translation of the word directly into the subtitles while watching. It turned out something like this:

At first I simply took 2-3 translations for each part of speech, which worked quite badly for homonyms. Later a much clearer idea came to me: determine the word's part of speech from the sentence and substitute the matching translation. That is how I met nltk and came up with something like this:

Readability improved a lot. Text like this can be taken in at a glance, without pausing the film.

Implementation

Compiling an individual dictionary

I took the idea from here. First, let's look at a list of English words ordered by how often they occur in books: this is it.

The dictionary is built on the assumption that if you know, say, the word value, which is ranked 800th on this list, then you very likely also know the word battle, which is ranked 715th.

Let's make another list, this time of the unique words from a book you have read (in my case “His Last Bow” by Conan Doyle), and intersect it with the first list.
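The intersection described above can be sketched roughly as follows. This is only an illustration, not the code from the repository: the function names unique_words and ranks_in_book are hypothetical, and the word-splitting rule (lowercase Latin letters with an optional apostrophe part) is an assumption.

```python
import re


def unique_words(text):
    """Return the set of lowercase words found in the text.

    Assumption: a word is a run of Latin letters, optionally followed
    by an apostrophe part (as in "don't").
    """
    return set(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))


def ranks_in_book(frequency_list, book_text):
    """Map every book word that occurs in the frequency list to its rank.

    frequency_list is ordered from the most to the least frequent word,
    so the rank is simply the 1-based position in that list.
    """
    rank = {}
    for position, word in enumerate(frequency_list, start=1):
        # Keep the first (best) rank if a word is listed twice.
        rank.setdefault(word, position)
    return {word: rank[word] for word in unique_words(book_text) if word in rank}
```

Words from the book that are missing from the frequency list are simply dropped here; whether the original program treats them differently is not stated in the article.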

If we try to visualize this intersection, we get the following picture:

As you can see, to read this book without looking anything up in a dictionary, your vocabulary would need to be about 34 thousand words. If this article is to be believed, you will not manage that even if you are a 68-year-old man from Hampshire.

In the end, your vocabulary is taken to be the first N words of the first list, where N is just large enough to cover 70% of the words in the second list. The percentage could of course be raised, but as practice showed in my case, the fact that I know the 4% of this book's words that fall between ranks 7 and 8 thousand does not at all mean that I know that whole thousand. Beyond 70% there is too much variation.
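A minimal sketch of that cutoff rule, assuming the ranks of the book's unique words are already known. The names vocabulary_cutoff and personal_vocabulary are hypothetical, and the 70% threshold is passed as an integer percentage to avoid floating-point rounding.

```python
def vocabulary_cutoff(ranks, coverage_percent=70):
    """Return the smallest rank N such that words ranked <= N make up
    at least coverage_percent of the book's unique words.

    ranks is an iterable with the frequency rank of each unique book word.
    """
    ordered = sorted(ranks)
    if not ordered:
        return 0
    # Ceiling division: how many book words must be covered.
    needed = max(1, -(-len(ordered) * coverage_percent // 100))
    return ordered[needed - 1]


def personal_vocabulary(frequency_list, ranks, coverage_percent=70):
    """Assume the reader knows every word up to the cutoff rank."""
    cutoff = vocabulary_cutoff(ranks, coverage_percent)
    return set(frequency_list[:cutoff])
```

With ten book words ranked 1, 2, 3, 4, 5, 10, 20, 50, 100 and 34000, the 70% cutoff lands on rank 20, while 100% coverage would require the whole list up to rank 34000.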

Just imagine how many books you would have to read to understand straight away what this monologue is about:

The word cymbal is ranked 22015th.

Determining the part of speech from context

Everything here is easy, because nltk does the work for us. We check whether we know the word, determine what part of speech it is in this context, and translate it. The main thing is not to forget to reduce nouns to the singular and verbs to the infinitive. The accuracy of the tagging is of course not perfect, because this is spoken language and the subtitles are often of poor quality, but the share of correct results is still quite high. In fact I use two libraries, but the second one helps only when it disagrees with the first. In this example nltk decided that guys is used as a verb in this context, and this is where the pattern library saved us:
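The check, tag, lemmatize and translate steps might look roughly like this. It is a sketch under assumptions, not the author's code: coarse_pos, tag_sentence and unknown_word_hints are hypothetical names, the translation table keyed by (lemma, part of speech) is an assumed data layout, and the lemmatizer is passed in as a function. Only tag_sentence touches nltk, and it needs the nltk tokenizer and tagger data to be installed.

```python
def coarse_pos(penn_tag):
    """Collapse a Penn Treebank tag (the kind nltk.pos_tag returns)
    into a coarse part of speech, or None for tags we do not translate."""
    for prefix, name in (("NN", "noun"), ("VB", "verb"),
                         ("JJ", "adjective"), ("RB", "adverb")):
        if penn_tag.startswith(prefix):
            return name
    return None


def tag_sentence(sentence):
    """Tag one subtitle sentence with nltk (imported lazily)."""
    import nltk
    return nltk.pos_tag(nltk.word_tokenize(sentence))


def unknown_word_hints(tagged, known_words, translations, lemmatize):
    """Return (word, translation) pairs for the words the viewer does not know.

    tagged       -- list of (word, Penn tag) pairs
    known_words  -- set of lemmas the viewer is assumed to know
    translations -- dict mapping (lemma, coarse part of speech) to a translation
    lemmatize    -- function (word, coarse part of speech) -> lemma
    """
    hints = []
    for word, tag in tagged:
        pos = coarse_pos(tag)
        if pos is None:
            continue
        # Singular for nouns, infinitive for verbs.
        lemma = lemmatize(word.lower(), pos)
        if lemma in known_words:
            continue
        translation = translations.get((lemma, pos))
        if translation:
            hints.append((word, translation))
    return hints
```

Keying the translations by part of speech is what lets a homonym get a different translation as a noun than as a verb. With nltk, tag_sentence would supply the tagged argument, and a lemmatizer such as nltk's WordNetLemmatizer could back the lemmatize function.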

Here they went twice:

I should say right away that you will not learn the language this way; you will only spoil the film. But then again, the goal of this program is just to give you the word, nothing more. If your vocabulary is not large enough, it will look something like this:

Nothing useful, only suffering.

GitHub: installation and launch instructions are in the repository description.

Source: https://habr.com/ru/post/365979/
