
How to determine the language of a text?

I want to solve, once and for all, the problem of detecting the language of user input on a site. Imagine that I am building a multilingual Habrahabr :-) and do not want to ask users which language they are writing in. I think the computer should be able to handle this on its own.



Problem statement

The existing solutions are not suitable. The problem is that they were made by mathematicians and programmers. They mostly analyze a single parameter and output strange probabilities that the text is written in one language or another. I do not need probabilities. I need to know the language :-). The second problem is that statistical algorithms fall apart on texts that contain fragments in other languages.
I think you need to analyze many parameters, one after another.

As an experiment, I tried to identify the language of an unfamiliar text by eye. For example, I can easily tell Portuguese from German, although I know neither of them.
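One way to picture checking several parameters in sequence is the sketch below. It is only an illustration under stated assumptions, not a reconstruction of any existing detector: the `SCRIPTS` and `MARKERS` tables, the language codes, and the function names are hypothetical, and the letter lists are deliberately incomplete. The first check picks the dominant script, the second looks for letters that are typical of one candidate language, and an undecided result means a further check would be needed.

```javascript
// Hypothetical, incomplete tables: each check narrows the candidate set.
const SCRIPTS = [
  { name: "Cyrillic", re: /[\u0400-\u04FF]/g, langs: ["ru", "uk"] },
  { name: "Latin", re: /[a-z\u00C0-\u024F]/g, langs: ["pt", "de", "en"] },
];

// Letters typical of one candidate and rare or absent in the others.
const MARKERS = {
  pt: /[ãõç]/g,
  de: /[äöüß]/g,
  ru: /[ыэъё]/g,
  uk: /[\u0456\u0457\u0454\u0491]/g, // Cyrillic і, ї, є, ґ
};

function countMatches(text, re) {
  const found = text.match(re);
  return found ? found.length : 0;
}

function guessLanguage(input) {
  const text = input.toLowerCase();

  // Check 1: which script dominates?
  let script = null;
  let scriptCount = 0;
  for (const s of SCRIPTS) {
    const n = countMatches(text, s.re);
    if (n > scriptCount) {
      script = s;
      scriptCount = n;
    }
  }
  if (script === null) return null; // no letters of a known script

  // Check 2: among that script's candidates, count marker letters.
  let winner = null;
  let winnerScore = 0;
  for (const lang of script.langs) {
    const re = MARKERS[lang];
    if (!re) continue; // no marker letters known for this language
    const score = countMatches(text, re);
    if (score > winnerScore) {
      winner = lang;
      winnerScore = score;
    }
  }

  // No marker letter found: undecided, a further check would be needed.
  return winner;
}
```

A text without any marker letters, such as plain English, comes back as `null` here. That is intentional: it shows where the next parameter in the chain would have to take over.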

My procedure goes roughly like this:
Some examples of algorithms that do not hold up

vitali.at.tut.by: a statistical algorithm based on counting two-letter combinations in the text. I could not test it because the binary has been removed from the site.
A Perl module.
A live demo that decided this article is written in Turkish.
There is also an article about neural network classifiers and semantic analysis:
"In a polygram model with degree n and base M, the text is represented by the vector {f_i}, i = 1..M^n, where f_i is the frequency of occurrence of the i-th n-gram in the text ... of the form a_1 ... a_{n-1} a_n ..."
- I did not understand a thing.
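For what it is worth, the quoted definition can be read quite plainly: slide a window of n characters over the text, count each distinct window, and divide by the number of windows. With n = 2 this is the two-letter counting mentioned in the first example. The sketch below is a minimal illustration, not the code of any tool listed here. The function names and the use of cosine similarity to compare a text against reference profiles are my assumptions.

```javascript
// Relative frequency of every n-gram that occurs in the text.
// Only n-grams that actually occur are stored, so the M^n-sized
// vector from the definition stays sparse.
function ngramFrequencies(text, n) {
  const chars = Array.from(text.toLowerCase()); // split by code point
  const total = chars.length - n + 1;           // number of windows
  const counts = new Map();
  for (let i = 0; i < total; i++) {
    const gram = chars.slice(i, i + n).join("");
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  const freqs = new Map();
  for (const [gram, c] of counts) freqs.set(gram, c / total);
  return freqs;
}

// Cosine similarity between two sparse frequency vectors (0 if either is empty).
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [gram, v] of a) {
    normA += v * v;
    if (b.has(gram)) dot += v * b.get(gram);
  }
  for (const v of b.values()) normB += v * v;
  return normA > 0 && normB > 0 ? dot / Math.sqrt(normA * normB) : 0;
}

// Pick the reference profile closest to the text; lang is null if nothing overlaps.
function closestProfile(text, profiles, n = 2) {
  const f = ngramFrequencies(text, n);
  let best = null;
  let bestScore = 0;
  for (const [lang, profile] of Object.entries(profiles)) {
    const score = cosine(f, profile);
    if (score > bestScore) {
      best = lang;
      bestScore = score;
    }
  }
  return { lang: best, score: bestScore };
}

// Toy profiles built from single sentences; real ones would need far more text.
const toyProfiles = {
  en: ngramFrequencies("the quick brown fox jumps over the lazy dog and the cat", 2),
  de: ngramFrequencies("der schnelle braune fuchs springt über den faulen hund", 2),
};
const guess = closestProfile("the dog and the fox", toyProfiles);
```

Note that the result is still a similarity score, not a yes-or-no answer, and a text that mixes two languages blends both profiles. That is the weakness of purely statistical approaches described above.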

I did not search further.


An example of an algorithm that does hold up

Google Translate identifies the language perfectly from just a few words. Even "foreign" words embedded in a sentence do not confuse it.


PS

I also have a strange desire to do this on the client side, in JavaScript. I do not think that analyzing a few words should require a call to the Google Language API ...

PS 2

In the end, I used the Google Language API ... I suspect that in ambiguous cases they fall back on a dictionary lookup, and I cannot afford that on the client side.
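How Google actually resolves such cases is not known to me. Purely as an illustration of what a dictionary lookup could look like, here is a sketch in which every word votes for the languages whose word list contains it. The word lists, the language codes, and the function `voteByDictionary` are hypothetical. A real dictionary would be far larger, which is exactly the cost problem on the client side.

```javascript
// Tiny hypothetical word lists of very common words.
const COMMON_WORDS = {
  en: new Set(["the", "and", "of", "is", "to", "in", "that", "it"]),
  de: new Set(["der", "das", "und", "ist", "nicht", "ein", "zu", "mit"]),
  pt: new Set(["que", "não", "um", "uma", "para", "com", "os", "por"]),
};

function voteByDictionary(text) {
  // Split on anything that is not a letter (Unicode-aware).
  const words = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);

  // Each recognized word gives one vote to its language.
  const votes = {};
  for (const word of words) {
    for (const [lang, dict] of Object.entries(COMMON_WORDS)) {
      if (dict.has(word)) votes[lang] = (votes[lang] || 0) + 1;
    }
  }

  // Highest vote count wins; null if nothing was recognized or on a tie.
  let best = null;
  let bestVotes = 0;
  let tie = false;
  for (const [lang, v] of Object.entries(votes)) {
    if (v > bestVotes) {
      best = lang;
      bestVotes = v;
      tie = false;
    } else if (v === bestVotes) {
      tie = true;
    }
  }
  return tie ? null : best;
}
```

Because the decision is a majority vote over words, a few embedded foreign words do not flip the result: `voteByDictionary("Das Update ist nicht in the repository")` collects three German votes against two English ones.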

Source: https://habr.com/ru/post/52239/

