
Improving your English: reinventing subtitles

1. Intro




- Tatyana Leonidovna, could we watch this movie with subtitles?
- No, young woodpeckers, we are training your listening comprehension, so you will watch movies without them! With subtitles you will only read the text and not listen.
- But Tatyana Leonidovna, without subtitles we don't understand more than half of it!
- And that is your problem.

The early 2000s, a conversation with a teacher at a specialized French school in St. Petersburg.



2. What is the problem?


TV shows and movies are a great way to improve your English. You already know the grammar and have a decent vocabulary. It is still too early to hold a fluent conversation with a native speaker, and tests and exercises have already become boring. So you start watching movies and TV shows.

You watch and watch. Everything seems clear, but then a rapid dialogue between two characters begins, and you catch only the prepositions. OK, you turn on the subs. And they solve the problem: you begin to understand what is happening.

However, after watching several videos with subs, people often notice two things.


Without subs, nothing is clear; with subs, progress in listening comprehension stalls and ... things are still unclear.

3. Now what?




There are 7 words in this frame from “South Park”. 6 of them are familiar to almost everyone learning English, and they are easy to recognize and understand even when spoken quickly and with an accent. One word remains that will (with high probability) cause problems: weary, meaning tired or worn out.



The remaining words can be thrown out. They are familiar to almost everyone and do not need to be shown on the screen at all. If we apply this logic to the rest of the scenes, we get subs in which only difficult words appear, and the rest we have to listen to and understand ourselves.

As it turned out, this idea is not new at all. A quick search showed that at least several bloggers have written articles with a similar idea, but they suggested adapting the subtitles manually. We geeks will adapt the subs automatically, in code!

4. Building a bike


The task boils down to finding difficult words in the text that need translation.

The basic idea is that you can analyze a whole lot of English texts, gather statistics on word usage, and see that some words are used much less often than others. These rare words are what we mean by a “difficult word”: you come across them rarely, so you do not know their translation or spelling.

I had already been doing all this as a hobby after work (by the way, here’s an article about how it all began). It grew into the Bamboo Ninja project, which lets you analyze books in English, find difficult words in them, insert translations, and put the book back together. Subtitles are text too, so I will take what I learned there and apply it to subtitles.

We open the subs, split them into pieces, then into individual words, and begin the analysis. For each word we need to solve a binary classification problem: pass the word through an algorithm that outputs 1 or 0, that is, whether the word is easy or difficult for an English learner. The classifier makes its decision based on statistics gathered from about 40 GB of text from different sources (really, the data should have come from very diverse sources: chat logs, news, song lyrics. But I was lazy and used mostly book texts. More on that later).
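The frequency-counting idea can be illustrated with a minimal sketch. This is not the project's actual code: the tokenizer, the function names `count_words` and `rare_words`, the toy corpus, and the `min_count` cutoff are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase Latin words, optionally with one apostrophe.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def count_words(texts):
    """Count how often each word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts

def rare_words(counts, min_count):
    """Return the words seen fewer than min_count times (hypothetical cutoff)."""
    return {word for word, n in counts.items() if n < min_count}

# Toy corpus standing in for the real text collection.
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the rug.",
    "The weary cat left the dog.",
]
counts = count_words(corpus)
rare = rare_words(counts, min_count=2)
print(sorted(rare))
```

On a real corpus a relative frequency (occurrences per million words) would likely be a more stable cutoff than a raw count, since it does not depend on corpus size.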
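As a rough sketch of this pipeline, assuming the subs are in SRT format and that the classifier is a plain frequency threshold (the real classifier is not described in detail here). The frequency table, the threshold value, the sample lines, and all function names are hypothetical; in this sketch 1 means "difficult".

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

# Hypothetical frequencies (occurrences per million words), for illustration only.
FREQ = {"i": 10000.0, "am": 3000.0, "of": 20000.0, "this": 5000.0,
        "me": 4000.0, "too": 1000.0, "weary": 2.0}

def is_difficult(word, freq, threshold=5.0):
    """Binary classifier: 1 if the word is rarer than the threshold, else 0."""
    return 1 if freq.get(word.lower(), 0.0) < threshold else 0

def parse_srt(text):
    """Split SRT text into (index, timing, text_lines) cues."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) >= 3 and "-->" in lines[1]:
            cues.append((lines[0], lines[1], lines[2:]))
    return cues

def adapt_srt(text, freq, threshold=5.0):
    """Keep only difficult words in each cue; drop cues that have none."""
    out = []
    for _, timing, lines in parse_srt(text):
        hard = [w for line in lines for w in WORD_RE.findall(line)
                if is_difficult(w, freq, threshold)]
        if hard:
            # Renumber the surviving cues consecutively.
            out.append(f"{len(out) + 1}\n{timing}\n{' '.join(hard)}\n")
    return "\n".join(out)

SAMPLE = """1
00:00:01,000 --> 00:00:03,000
I am weary of this

2
00:00:04,000 --> 00:00:05,000
Me too
"""

adapted = adapt_srt(SAMPLE, FREQ)
print(adapted)
```

Note that a word missing from the table gets frequency 0 and is therefore treated as difficult, which is a design choice of this sketch rather than something stated in the article.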

Then comes a certain amount of fiddling with the database and writing code, and the subs end up looking like


5. Riding the bike we built


I ran 3-4 dozen subtitle files through the program and looked at the metric values the analyzer reported. I tried watching movies with the result. I showed it to friends, acquaintances, and site visitors.

To evaluate the results, I used two classic metrics for machine learning tasks: precision and recall.

It turned out that the metric values tend to jump from movie to movie. On some films, recall and precision reached 85%-90%, and on others only around 55%. After digging into the problem, I found the reason: I had collected most of the data for statistical analysis from fiction written over the past 300 years, and some words occur more often there than they do in modern English. For example, the word bayonet was much more common in those days than it is now, so our classifier does not consider it particularly rare.

Although Colin, my friend from Britain, laughed for a long time and said that the expression “beef bayonet” is still very common among the military, we will not consider that case.
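For reference, here is a minimal sketch of how the two evaluation metrics could be computed for one movie, assuming they are the standard precision and recall over the set of words flagged as difficult. The function name and the word sets are made up for the example and are not the article's data.

```python
def precision_recall(predicted, actual):
    """Precision and recall for sets of words flagged as difficult.

    predicted: words the classifier flagged as difficult
    actual:    words a human reviewer marked as difficult
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical word sets for a single subtitle file.
predicted = {"weary", "bayonet", "rug"}
actual = {"weary", "bayonet", "gist", "dread"}

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision drops when easy words are shown on screen unnecessarily; recall drops when a genuinely difficult word is hidden.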

I decided to roll back to the old version of the classifier, which I had used several months earlier. It was built over the summer from only 500 large books, but the books in that sample were more diverse: Harry Potter, A Song of Ice and Fire, technical documentation for programmers, books on psychology, medicine, and much more. A classifier with a smaller but more varied dataset turned out to be an order of magnitude better than one built only on English fiction. The word recognition algorithm started making mistakes much less often.

The result is broadly in line with the goal, but the algorithm still produces subs suited to someone with solid experience in English. You need a certain level of listening skill and a substantial vocabulary of several thousand basic words. In that case, the subs will serve you well in improving your English.

I packaged all of these experiments into a service, attached it to my hobby site, and added a small library of subs there for those who want to try it out on the spot.

6. Outro


Turning TV watching into a learning process instead of mindless screen reading seems like a worthwhile task. And further improvements to the algorithm will make it possible to spend many more evenings productively.

Thanks, everyone! Enjoy good movies, and good luck with your English.

Source: https://habr.com/ru/post/390677/

