Hi, Habrahabr!
I have been learning English for a long time, and while I want to reach a near-native level, the process is slow. At the moment my English lets me understand spoken language fairly well, but for now I still watch movies with subtitles. Even without them, I'm sure a video may contain words I don't know, and even when the general meaning is clear, I still want to know what the word means.
So watching a movie turns into a procedure: pause, look up the unknown word, resume. It works, but it is tiring. I would rather watch the movie continuously, and if I knew for certain that every word would be familiar, I could drop the subtitles and train my listening at the same time. How I solved this problem, read on.
That is how the idea for a small project was born, and I implemented it. Since I mostly work with Python, the choice fell on it. After studying how the Lingualeo site works, I got down to business.
The first thought was simple: download my entire dictionary, extract all the text from the subtitles, compare the two, and output a batch of unknown words. But then I hit a wall. English has such a thing as a lemma, the base form of a word. Of course, not all the words in my Lingualeo dictionary were in that form, and certainly not all the words in subtitles are. Googling around, I found the nltk library. It looked like a way out, but to get a lemma you have to tell it whether the word is a verb or a noun. How do you figure out which is which? Either I reinvent a hefty wheel of my own, or...
Looking through the Lingualeo server responses, I noticed that the answer already contains a lemma. So I only had to compare the looked-up word with the lemma: if they matched, that was exactly what I needed; otherwise, make another request for the lemma. The solution was found and I could get to work, but there was one catch: there is also such a thing as stop words. These are words that are usually ignored (don't, a, the, etc.). Such lists can be googled, but they come in different sizes and it is not clear which to choose. The larger lists may, in theory, contain words that also need to be learned. And what if someone else wants to use my tool?
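The lemma check described above can be sketched roughly like this. The `get_translation` callable and the response layout are my assumptions for illustration; the real Lingualeo response contains more fields:

```python
def lookup(word, get_translation):
    """Ask the server for a word; if the returned lemma differs,
    re-query with the lemma so we always compare base forms."""
    response = get_translation(word)       # e.g. {"lemma": "run", ...}
    lemma = response.get("lemma", word)
    if lemma != word:
        # The word was an inflected form: ask again for the base form.
        response = get_translation(lemma)
    return response

# A fake "server" for illustration: "ran" points to the lemma "run".
fake_db = {
    "ran": {"lemma": "run"},
    "run": {"lemma": "run", "translation": "бежать"},
}
print(lookup("ran", fake_db.__getitem__)["translation"])  # бежать
```

With real traffic, the second request happens only for inflected forms, which keeps the number of round-trips down.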
So I decided that users should build such a list themselves, based on their personal preferences.
The first implementation was based on a pile of JSON files in which I stored the current user's dictionary, the words extracted from subtitles along with translation options, and the stop words. To add / review / ignore words I decided to make a web page, so I could see visually what was happening. The problem quickly became obvious: Flask, which I had chosen, could probably have been taught to paginate the list of new words (and a single film can produce a thousand of them), but it was not worth the effort, and everything else about this solution had more minuses than pluses. So with a clear conscience I switched to SQLite via SQLAlchemy.
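A minimal sketch of such storage, using the stdlib sqlite3 module here for brevity (the project itself goes through SQLAlchemy, and the table and column names below are my assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the project would use a file on disk

# One table for the user's words, one for the personal stop-word list.
conn.executescript("""
    CREATE TABLE words (
        word TEXT PRIMARY KEY,
        translation TEXT
    );
    CREATE TABLE stopwords (
        word TEXT PRIMARY KEY
    );
""")

conn.execute("INSERT INTO words VALUES (?, ?)", ("lemma", "лемма"))
conn.execute("INSERT INTO stopwords VALUES (?)", ("the",))
conn.commit()

# Pagination then becomes a simple LIMIT/OFFSET query
# instead of hand-slicing JSON files.
page = conn.execute(
    "SELECT word FROM words ORDER BY word LIMIT 50 OFFSET 0"
).fetchall()
print(page)  # [('lemma',)]
```

The pagination problem that made the JSON approach painful disappears: the database does the slicing.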
I will not quote the project code here; you can find all of it on GitHub (link at the end of the article). I will only describe the logic.
So, on first launch you need to provide your email and password; if they are correct, they are saved. Then we download / update the dictionary. In Lingualeo it is sorted by date, which is extremely convenient: for each new page I record how many words were added to the database. If that number is 0, all the words are probably already saved; to be sure, I download a couple more pages, and if they add nothing either, everything is up to date.
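The update loop can be sketched like this. `fetch_page` and `save_words` are hypothetical stand-ins for the real API call and the database insert:

```python
def update_dictionary(fetch_page, save_words, extra_pages=2):
    """Walk the date-sorted dictionary page by page; once a page
    adds 0 new words, check a couple more pages to be sure."""
    page_no = 1
    empty_streak = 0
    while empty_streak <= extra_pages:
        words = fetch_page(page_no)
        added = save_words(words)   # number of words newly saved
        if added == 0:
            empty_streak += 1       # maybe done; verify with extra pages
        else:
            empty_streak = 0        # new words found, keep going
        page_no += 1
    return page_no - 1              # how many pages were inspected

# Illustration: pages 1-3 contain new words, later pages add nothing.
pages = {1: ["lemma", "subtitle"], 2: ["fiasco"], 3: ["wall"]}
seen = set()

def fetch_page(n):
    return pages.get(n, [])

def save_words(words):
    new = [w for w in words if w not in seen]
    seen.update(new)
    return len(new)

print(update_dictionary(fetch_page, save_words))  # 6
```

Because the dictionary is date-sorted, new words always appear on the first pages, so the loop usually stops after just a few requests.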
Next, if a text file is specified, we parse it. The comparison against the stop-word list (the words the user ignores) happens twice: once while parsing the text file, and again after receiving the lemma, if the word has changed. This is done so as not to bother the server unnecessarily :)
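The double filtering might look like this. The function names are mine, and `get_lemma` stands in for the server round-trip:

```python
import re

STOPWORDS = {"a", "the", "don't", "is"}

def unknown_words(text, known, get_lemma):
    """Tokenize the text, drop stop words and known words right away,
    then drop them again after lemmatization, so that no server
    request is made for a word we would discard anyway."""
    tokens = set(re.findall(r"[a-zA-Z']+", text.lower()))
    candidates = tokens - STOPWORDS - known     # first pass: raw tokens
    result = set()
    for word in candidates:
        lemma = get_lemma(word)                 # the costly server call
        if lemma not in STOPWORDS and lemma not in known:
            result.add(lemma)                   # second pass: lemmas
    return result

# Toy lemma table instead of the real API.
lemmas = {"movies": "movie", "watching": "watch"}
found = unknown_words(
    "The movies don't need watching",
    known={"need"},
    get_lemma=lambda w: lemmas.get(w, w),
)
print(sorted(found))  # ['movie', 'watch']
```

The first pass removes words like "the" before any request is sent; the second pass catches cases where an unknown-looking form lemmatizes to something already known or ignored.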
By the way, the locally stored dictionary can be exported to CSV for import into Anki; if you want that, pass the --savecsv option.
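The CSV export can be sketched with the stdlib csv module; the two-column word/translation layout is my assumption about what Anki expects as front/back fields:

```python
import csv
import io

def save_csv(words, fileobj):
    """Write word/translation pairs, one per row; Anki imports
    such a two-column file as front/back fields."""
    writer = csv.writer(fileobj)
    for word, translation in words:
        writer.writerow([word, translation])

# Demonstration with an in-memory buffer instead of a real file.
buf = io.StringIO()
save_csv([("lemma", "лемма"), ("subtitle", "субтитр")], buf)
print(buf.getvalue())
```

In the real tool this would be written to a file on disk when --savecsv is given.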
To view the new words, run Flask:
python3 server.py
Opening localhost:5000 shows a page with the words. Each one can be ignored on the spot (added to the stop words) or added to the active dictionary on Lingualeo. I also added a link to save a CSV in case someone needs it; the use case: ignore the unnecessary words and export the needed ones to a file for import into the same Anki. The translation with the most votes is the one selected.
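Picking the most-voted translation is essentially a one-liner; the `votes` and `value` field names here are my assumptions about the response format:

```python
def best_translation(options):
    """Pick the translation variant with the maximum number of votes."""
    return max(options, key=lambda o: o["votes"])["value"]

# Example: two crowd-sourced variants with vote counts.
options = [
    {"value": "бежать", "votes": 120},
    {"value": "управлять", "votes": 45},
]
print(best_translation(options))  # бежать
```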
The source code is available on GitHub at the link; if you have any suggestions or wishes, you are welcome in the comments or via pull requests / issues.
Thanks for your attention :)
Source: https://habr.com/ru/post/339258/