In life or work, you sometimes have to deal with texts in a foreign language that you know far from perfectly. To read and understand them (and, at best, pick up a few new words), I usually used one of two options: translating the text in the browser, or translating each word separately with, for example, ABBYY Lingvo. But these methods have many drawbacks. First, the browser translates sentence by sentence, which means it can change the word order, and the translation may end up even harder to understand than the original text. Second, the browser offers no alternative translations or synonyms for words, so learning new words from it is problematic. Other variants and synonyms can be obtained by looking up each word in a translator, but that takes time, especially when there are many such words. Finally, while reading a text, I would like to know which words occur most frequently, so that I can memorize them and then use them in my own written or spoken language.
I thought it would be nice to have such a “translator” at hand, so I decided to implement one in Python. Anyone who is interested, read on.
Word count
When writing the program, I followed this logic. First, the whole text should be converted to lower case, stripped of unnecessary characters and punctuation (.?! etc.), and then we count how many times each word occurs in it. Inspired by Google's code, I did this without any difficulty, but I decided to store the results in a slightly different form, namely
{1: [words with frequency 1], 2: [words with frequency 2], ...}
This is convenient when sorting is required, including within each group of words, for example, if we want the words to appear in the same order as in the text. So I want a double sort: the most frequent words should come first, and words occurring with the same frequency should be ordered as in the source text. This idea is reflected in the following code.
import re

def word_count_dict(filename, dictList=de500):  # de500 (the exclusion list) is defined in a separate file, see below
    count = {}
    txt = re.sub('[,.!?":;()*]', '', open(filename, 'r').read().lower())
    words = txt.split()
    for word in words:
        if word not in count:
            count[word] = 1
        else:
            count[word] += 1
    # frequency -> words with that frequency, ordered by first occurrence in the text
    return {i: sorted([w for w in count
                       if count[w] == i and w not in dictList.values()],
                      key=lambda x: txt.index(x))
            for i in set(count.values())}
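For illustration, here is a quick check of the structure this returns, on a made-up one-sentence file (the exclusion list is passed as empty here, since the real one only appears later):

# demo.txt contains: Das Haus ist das beste Haus.
print(word_count_dict('demo.txt', dictList={}))
# -> {1: ['ist', 'beste'], 2: ['das', 'haus']}
# frequency -> words, each group ordered by first appearance in the text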
Well, everything works as it should, but there is a suspicion that the top of the list will be dominated by auxiliary words (such as the) and words whose translation is obvious (for example, you). We can get rid of them by creating a special list of the most frequently used words and excluding every word on that list when building the dictionary. Why is this convenient? Because once we have learned a word, we can add it to the list, and its translation will no longer be shown. Let's call this variable dictList and forget about it for a while.
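The code above only assumes that dictList is a dictionary whose values are the words to exclude (the check is w not in dictList.values()). A minimal hypothetical stub, just to show the expected shape; the actual list is discussed later:

# Hypothetical stub: rank -> word. The real de500 (the 500 most used
# German words) lives in a separate file, as described below.
de500 = {1: 'der', 2: 'die', 3: 'und', 4: 'in', 5: 'den'}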
Translation of words
After spending a few minutes searching for a convenient online translator, I decided to try Google and Yandex in action. Since Google shut down the Translate API exactly 3 years and 1 day ago, we will use the workaround suggested by WNeZRoS. In response to a request for a word, Google returns a translation, alternative translations, and their reverse translations (that is, synonyms). Yandex, as usual, requires obtaining a key, and its response contains not only a translation but also usage examples and probably more. In both cases the answer comes as JSON: quite simple in Google's case, somewhat more complicated in Yandex's. For this reason, and also because Google knows more languages (and often more words), I decided to settle on it.
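For reference, here is a trimmed sketch of what Google's answer contains for a request like text=Haus&sl=de&tl=ru. The field names and nesting match what the parsing code below reads; the values are only illustrative:

# An illustrative sketch of the JSON answer, written as a Python dict
data = {
    'sentences': [{'trans': 'дом'}],       # main translation
    'dict': [{
        'pos': 'имя существительное',      # part of speech, in Russian
        'terms': ['дом', 'домик'],         # alternative translations
        'entry': [{'reverse_translation':  # synonyms of the source word
                   ['Haus', 'Heim', 'Zuhause']}],
    }],
}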
We will send requests using the wonderful grab library and write the responses to an auxiliary text file (dict.txt). In it we will look for the main translation, alternative variants, and synonyms, and print them if found. Let's also make it possible to disable the last two. The corresponding code looks like this.
import json
from grab import Grab

def translate(word, key, lan1='de', lan2='ru', alt=True, syn=True):
    g = Grab(log_file='dict.txt')
    link = 'http://translate.google.ru/translate_a/t?client=x&text='\
           + word + '&sl=' + lan1 + '&tl=' + lan2
    g.go(link)
    data = json.load(open('dict.txt'))
    translation, noun, alternatives, synonims = 0, 0, 0, 0
    try:
        translation = data[u'sentences'][0][u'trans']
        noun = data[u'dict'][0][u'pos']
        alternatives = data['dict'][0]['terms']
        synonims = data['dict'][0]['entry'][0]['reverse_translation']
    except (KeyError, IndexError):
        pass
    # German nouns are capitalized; the part-of-speech label comes back in Russian
    if lan1 == 'de' and noun == u'имя существительное':
        word = word.title()
    if translation:
        print('[' + str(key) + ']', word, ': ', translation)
        if alt and alternatives:
            [print(i, end=', ') for i in alternatives]
            print('\r')
        if syn and synonims:
            [print(i.encode('cp866', errors='replace'), end=', ') for i in synonims]
            print('\n')
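A hypothetical call and roughly what it prints (the format follows the print statements above; the actual translations may of course differ):

translate('Haus', 3)
# [3] Haus :  дом
# дом, домик,
# Haus, Heim, Zuhause,

Note that the cp866 encoding of the synonyms is a workaround for the Windows console, so that last line may look different on other systems.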
As you can see, the default direction is from German to Russian. The key parameter is the frequency of the word in the text; it is passed in from another function, which calls the translation for each word.
Calling the translation function
Everything is simple here: I want to get groups of words with their frequencies as a dictionary (the word_count_dict function) and find the translation of each word (the translate function). I also want to show only the first n groups of the most used words.
def print_top(filename, n=100):
    mydict = word_count_dict(filename)
    mydict_keys = sorted(mydict, reverse=True)[0:n]
    [[translate(word, key) for word in mydict[key]] for key in mydict_keys]
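Putting it all together, a run over a saved article might look like this (the file name is hypothetical):

# translate the 10 most frequent groups of words from a saved article
print_top('article.txt', n=10)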
List of most used words
Well, the program is almost ready; all that remains is to compile a list of the most used words. Such lists are easy to find on the Internet, and I made lists of the 50, 100, and 500 most used German words and put them in a separate file so as not to clutter up the code.
If someone makes a similar list for English or another language, I would be grateful if they shared it so that I can add it to mine.
Preliminary results
By running the program, you get results that look roughly like this (the format follows the print statements in the translate function):
[frequency] word :  translation
alternative translations,
synonyms,
Well, the code is written and the program works, but how convenient and effective is it? To try to answer this question, I took a couple of texts in German for a check.
The first article, from Deutsche Welle, is devoted to Deutsche Bank's financing of coal mining in Australia. The article contains 498 words, of which the 15 most frequent (using the list of the 50 most used German words for exclusion) account for 16.87% of the whole text. Roughly speaking, this means that if we assume a person does not know these words, then after reading the translations of just 6.67% of the distinct words found in the text, their level of understanding will increase by almost 17% (if understanding is measured only by the share of familiar words in the text). At first glance, pretty good.
The second article, from Spiegel, tells how the German DAX stock index reacted to Poroshenko's victory in the presidential election in Ukraine (yes, it rose). The article contains 252 words, of which the 8 most common (6.06% of the distinct words) likewise cover 11.9% of the text.
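For anyone who wants to reproduce such figures, here is a rough sketch of how the coverage can be computed. This helper is my own illustration rather than part of the program, and the names article1.txt and de50 (the 50-word list, named by analogy with de500) are assumptions:

import re

def coverage(filename, n_words, stoplist):
    # share of all word occurrences in the text covered by the
    # n_words most frequent words that are not in the stoplist
    txt = re.sub('[,.!?":;()*]', '', open(filename, 'r').read().lower())
    words = txt.split()
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    freqs = sorted((c for w, c in counts.items()
                    if w not in stoplist.values()), reverse=True)
    return sum(freqs[:n_words]) / len(words)

# coverage('article1.txt', 15, de50) should give about 0.17 for the first article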
In addition, it is worth noting that when the translated text is short enough for each word to occur only once (for example, an e-mail message), following the proposed translations in the same order as the words appear in the text is very convenient.
Sounds nice (es klingt schön), but these are very rough tests, since I have made too many assumptions. I think the only way to find out whether this idea really makes working with texts in a foreign language easier is to use the program regularly, which, unfortunately, is not very convenient yet. To translate a text, you must first copy it into a .txt file, assign the file name to the filename variable, and then run the print_top function.
What is missing?
Instead of a conclusion, I would like to reflect on what is missing at this stage, and how this could be improved.
First, as just said, convenience. The code is inconvenient to use - you need to copy the text, + dependency on python and the grab library. What to do? Alternatively, write a browser extension so that you can select a specific element on the page (for example, similar to how it is implemented in
Reedy ) and receive its translation. Secondly, a list of words to exclude most used in other languages. Finally, various jambs with encodings are possible.
Most likely, I will not get around to the changes described above any time soon (now that the code is written, it is time to start learning the language itself more deeply!), so if someone wants to join in, I will be glad of the company and the help.
The entire code can be found under the spoiler, as well as on github.