
I want to solve, once and for all, the problem of determining the language of user input on a site.
Imagine I am building a multilingual Habrahabr :-) and do not want to ask users what language they write in. I think the computer should handle this on its own.
Task statement
- the language must be determined from a fragment of text. The fragment is anywhere from a few words to a few dozen sentences; it may be coherent text or, for example, an article title;
- the text is in UTF-8;
- the set of languages of interest is fixed in advance: 5-10 languages. In a production application you could limit it even further to increase detection accuracy;
- the text may contain inclusions of other languages; the goal is to determine the dominant one (say, more than 60% of the words);
- for my task, perfect precision is not always required. For example, it is usually not necessary to distinguish Ukrainian from Russian;
- the author of the text knows which language they are writing in.
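The task statement above can be sketched as an interface. This is a minimal sketch under my own assumptions, not an existing library: `detectLanguage` takes a pluggable per-word classifier, tallies the votes, and returns the language that covers more than 60% of the words, or `null` if none dominates.

```javascript
// All names here are hypothetical. LANGUAGES is the hard-coded set of
// 5..10 languages from the task statement.
const LANGUAGES = ["ru", "en", "de", "fr", "es"];

function detectLanguage(text, classifyWord, threshold = 0.6) {
  // Split into words; \p{L} matches any Unicode letter (UTF-8 input).
  const words = text.toLowerCase().match(/[\p{L}']+/gu) || [];
  if (words.length === 0) return null;

  // Count a vote per word; classifyWord may return null for unknown words.
  const counts = {};
  for (const w of words) {
    const lang = classifyWord(w);
    if (lang) counts[lang] = (counts[lang] || 0) + 1;
  }

  // The dominant language must cover more than `threshold` of all words.
  let best = null;
  for (const lang of LANGUAGES) {
    if ((counts[lang] || 0) > words.length * threshold) best = lang;
  }
  return best;
}
```

The majority threshold is what makes the detector tolerant of "foreign" inclusions: a few embedded words in another language vote for the wrong side but do not flip the result.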
Existing solutions do not fit. The problem is that they were built by mathematicians and programmers: they mostly analyze a single parameter and output strange probabilities that the text is written in some language. I do not need probabilities; I need to determine the language :-). The second problem is that statistical algorithms break down on texts that contain inclusions of other languages.
I believe many parameters should be analyzed in sequence.
As an experiment, I tried to identify the language of a text visually, without knowing the language. For example, I can easily tell Portuguese from German, although I know neither.
My procedure goes roughly like this:
- I look at the set of characters that occur in words. Specifically per word, not across the whole text, since the text may contain "foreign" words. This step alone rules out two thirds of the languages.
- I look for characteristic conjunctions and prepositions; they can separate languages that share the same character set.
- I try to read the text on the assumption that it is written in the language identified so far. It is hard to say exactly what happens at this stage; I think it is a complex analysis of the language's grammar. I recognize the words I know and check the endings.
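The first two steps above can be sketched in code. The alphabets and function-word lists here are tiny made-up samples for two languages, not real linguistic data; the structure, not the data, is the point.

```javascript
// Hypothetical data: per-language character sets and characteristic
// function words (conjunctions, prepositions, articles).
const ALPHABETS = {
  de: /^[a-zäöüß]+$/,
  pt: /^[a-zàáâãçéêíóôõú]+$/,
};
const FUNCTION_WORDS = {
  de: ["und", "der", "die", "das", "mit"],
  pt: ["e", "de", "com", "para", "uma"],
};

function candidateLanguages(text) {
  const words = text.toLowerCase().match(/\p{L}+/gu) || [];

  // Step 1: a language survives if most words fit its character set
  // (per word, so a few foreign inclusions do not disqualify it).
  let survivors = Object.keys(ALPHABETS).filter(lang =>
    words.filter(w => ALPHABETS[lang].test(w)).length >= words.length / 2
  );
  if (survivors.length <= 1) return survivors;

  // Step 2: among survivors with the same character set, prefer the
  // language whose function words actually occur in the text.
  const scored = survivors.map(lang => [
    lang,
    words.filter(w => FUNCTION_WORDS[lang].includes(w)).length,
  ]);
  const max = Math.max(...scored.map(([, n]) => n));
  return scored.filter(([, n]) => n === max).map(([lang]) => lang);
}
```

The third step (actually "reading" the text) is the hard one and is left out; it would need real morphology, not a word list.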
Some examples of algorithms that fall short
vitali.at.tut.by is a statistical algorithm based on counting two-letter combinations (bigrams) in the text. I could not test it: the binary has been removed from the site.
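The bigram idea is easy to sketch. This is my own reconstruction of the general technique, not the code from vitali.at.tut.by: build a relative-frequency profile of two-letter combinations and compare it against per-language profiles built from training text.

```javascript
// Build a map from each two-letter combination (within words) to its
// relative frequency in the text.
function bigramProfile(text) {
  const profile = {};
  const words = text.toLowerCase().match(/\p{L}+/gu) || [];
  let total = 0;
  for (const w of words) {
    for (let i = 0; i + 1 < w.length; i++) {
      const bg = w.slice(i, i + 2);
      profile[bg] = (profile[bg] || 0) + 1;
      total++;
    }
  }
  for (const bg in profile) profile[bg] /= total; // normalize to frequencies
  return profile;
}

// A crude similarity: sum of overlapping bigram frequencies. Real
// implementations use fancier measures, e.g. rank-based "out-of-place".
function similarity(textProfile, refProfile) {
  let s = 0;
  for (const bg in textProfile) {
    if (refProfile[bg]) s += Math.min(textProfile[bg], refProfile[bg]);
  }
  return s;
}
```

The language whose reference profile scores highest wins. This is exactly the kind of single-parameter statistic the article complains about: foreign inclusions pollute the profile, which is why a word-by-word vote is more robust.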
The Barley module: a live example that claimed this very article was written in Turkish.
There is also an article about neural-network classifiers and semantic analysis:
In a polygram model of degree n over an alphabet of M characters, the text is represented by the vector {f_i}, i = 1..M^n, where f_i is the frequency of occurrence in the text of the i-th n-gram of the form a_1...a_(n-1)a_n.
I didn't understand a damn thing, so I did not search any further.
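The quoted formula is less scary than it looks. A sketch of it under my reading: every n-gram over an M-letter alphabet maps to a base-M index i in 0..M^n-1, and f_i is that n-gram's count divided by the total number of n-grams in the text.

```javascript
// Build the {f_i} vector of length M^n for a text over a given alphabet.
// (Illustrative only; for a real alphabet M^n grows very fast.)
function polygramVector(text, alphabet, n) {
  const M = alphabet.length;
  const f = new Array(Math.pow(M, n)).fill(0);
  const chars = [...text].filter(c => alphabet.includes(c));
  let total = 0;
  for (let i = 0; i + n <= chars.length; i++) {
    // Interpret the n-gram as a base-M number to get its index i.
    let idx = 0;
    for (let j = 0; j < n; j++) idx = idx * M + alphabet.indexOf(chars[i + j]);
    f[idx]++;
    total++;
  }
  return total ? f.map(c => c / total) : f;
}
```

For n = 2 over alphabet "ab" the vector has M^n = 4 entries, one per possible bigram; a classifier then compares this vector to per-language reference vectors.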
An example of an algorithm that works
Google Translate detects the language remarkably well from just a few words, and is not confused even by "foreign" words embedded in a sentence.
PS
I also have an odd desire to do this on the client side, in JavaScript. I don't think analyzing a few words should require calling the Google Language API...
PS 2
In the end I used the Google Language API after all... I suspect that in disputed cases they fall back to dictionary lookup, and I cannot afford that on the client side.