Yandex Linguistics API for .NET

After attending Yet another Conference 2013, I had the idea of writing a .NET API for all of the Yandex linguistics services. A quick search showed that no such library existed yet. Even though nobody may actually need it, I decided to implement it, if only to practice with RestSharp, testing, and various GitHub features (issues, releases, markdown, etc.). Along the way I also had to deal with an interesting string comparison algorithm, which I describe in this article.

First, the links to the source code and binaries on GitHub: Code, Binary

Implemented APIs

Yandex.Predictor. This service lets applications receive, as hints, the most likely continuation of a word or phrase. The predictor also takes typos in the original query into account. This simplifies text entry, especially on mobile devices.
Yandex.Dictionary. This service lets applications get detailed entries from Yandex machine dictionaries. Entries contain grouped translations, part-of-speech information, examples, and transcriptions for English words.
Yandex.Translate. Text translation for more than 30 languages.
Yandex.Speller. A spell checker that helps find and correct spelling errors. The service relies on a spelling dictionary. Speller currently checks texts in Russian, Ukrainian, and English.

RestSharp makes it very easy to write code for synchronous and asynchronous HTTP GET and POST requests, and to deserialize the XML or JSON responses into .NET objects (this project uses XML).

Extended Damerau-Levenshtein distance calculation function

While implementing the speller, I wanted to show the user not only the corrected version of the text but also the errors in it. The Levenshtein distance immediately came to mind. But:

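It does not treat a transposition of two adjacent characters as a single error, although such errors are common when typing (according to Wikipedia, Damerau found that more than 80% of misspellings are a single insertion, deletion, substitution, or transposition).
It returns only the distance, not the positions of the errors in the word.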

The first drawback is addressed by using the Damerau-Levenshtein distance, and the second by analyzing the matrix that the algorithm builds as it runs. The distance is the value in the last column of the last row of the matrix, so in my case, with the default weights, it equals the total number of errors returned.
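To make the role of the matrix concrete, here is a minimal sketch of the distance computation alone, with unit costs. It uses the restricted form of the Damerau-Levenshtein distance (often called optimal string alignment), which is an assumption about the variant; the class name OsaDistance and the method name Compute are hypothetical and not part of the library.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the library.
public static class OsaDistance
{
    // Restricted Damerau-Levenshtein (optimal string alignment) distance,
    // with every operation costing 1.
    public static int Compute(string word, string correctedWord)
    {
        int n = word.Length, m = correctedWord.Length;
        var d = new int[n + 1, m + 1];

        // Turning a prefix into the empty string (or back) costs its length.
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = word[i - 1] == correctedWord[j - 1] ? 0 : 1;
                int best = Math.Min(
                    Math.Min(d[i - 1, j] + 1,   // deletion
                             d[i, j - 1] + 1),  // insertion
                    d[i - 1, j - 1] + cost);    // match or substitution

                // Two adjacent characters swapped: one error instead of two.
                if (i > 1 && j > 1 &&
                    word[i - 1] == correctedWord[j - 2] &&
                    word[i - 2] == correctedWord[j - 1])
                {
                    best = Math.Min(best, d[i - 2, j - 2] + 1);
                }

                d[i, j] = best;
            }
        }

        // The distance is the bottom-right cell of the matrix.
        return d[n, m];
    }
}
```

With this sketch, OsaDistance.Compute("synchrophasorton", "synchrophasotron") returns 1, whereas the plain Levenshtein distance for the same pair is 2, because the swapped "rt" counts as two substitutions there.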

The resulting algorithm finds the following kinds of errors between the erroneous word (word) and the correct word (correctedWord):

Substitution. Example: synchrophasatron -> synchrophasotron
Insertion. Example: synchrophastron -> synchrophasotron
Deletion. Example: synchrophasottron -> synchrophasotron
Transposition. Example: synchrophasorton -> synchrophasotron

In addition, the weights of various errors can be adjusted (by default, all have the same weight, equal to one).

Code of the extended Damerau-Levenshtein distance calculation function

public static List<Mistake> DamerauLevenshteinDistance(
    string word, string correctedWord,
    bool transposition = true,
    int substitutionCost = 1,
    int insertionCost = 1,
    int deletionCost = 1,
    int transpositionCost = 1)
{
    int w_length = word.Length;
    int cw_length = correctedWord.Length;
    // Each cell stores the accumulated cost (Key) and the operation chosen for it (Value).
    var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
    var result = new List<Mistake>(Math.Max(w_length, cw_length));

    // Empty input: every character of the corrected word is an insertion.
    if (w_length == 0)
    {
        for (int i = 0; i < cw_length; i++)
            result.Add(new Mistake(i, CharMistakeType.Insertion));
        return result;
    }

    // First column and first row of the matrix.
    for (int i = 0; i <= w_length; i++)
        d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);

    for (int j = 0; j <= cw_length; j++)
        d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);

    // Fill the matrix, remembering which operation produced each minimum.
    for (int i = 1; i <= w_length; i++)
    {
        for (int j = 1; j <= cw_length; j++)
        {
            bool equal = correctedWord[j - 1] == word[i - 1];
            int delCost = d[i - 1, j].Key + deletionCost;
            int insCost = d[i, j - 1].Key + insertionCost;
            int subCost = d[i - 1, j - 1].Key;
            if (!equal)
                subCost += substitutionCost;
            int transCost = int.MaxValue;
            if (transposition && i > 1 && j > 1 &&
                word[i - 1] == correctedWord[j - 2] &&
                word[i - 2] == correctedWord[j - 1])
            {
                transCost = d[i - 2, j - 2].Key;
                if (!equal)
                    transCost += transpositionCost;
            }

            int min = delCost;
            CharMistakeType mistakeType = CharMistakeType.Deletion;
            if (insCost < min)
            {
                min = insCost;
                mistakeType = CharMistakeType.Insertion;
            }
            if (subCost < min)
            {
                min = subCost;
                mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
            }
            if (transCost < min)
            {
                min = transCost;
                mistakeType = CharMistakeType.Transposition;
            }

            d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
        }
    }

    // Walk back from the bottom-right cell, collecting the mistakes.
    int w_ind = w_length;
    int cw_ind = cw_length;
    while (w_ind >= 0 && cw_ind >= 0)
    {
        switch (d[w_ind, cw_ind].Value)
        {
            case CharMistakeType.None:
                w_ind--;
                cw_ind--;
                break;
            case CharMistakeType.Substitution:
                result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
                w_ind--;
                cw_ind--;
                break;
            case CharMistakeType.Deletion:
                result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
                w_ind--;
                break;
            case CharMistakeType.Insertion:
                result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
                cw_ind--;
                break;
            case CharMistakeType.Transposition:
                result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
                w_ind -= 2;
                cw_ind -= 2;
                break;
        }
    }

    // If fewer mistakes were collected than the computed distance,
    // pad the list with deletions at position 0.
    if (d[w_length, cw_length].Key > result.Count)
    {
        int delMistakesCount = d[w_length, cw_length].Key - result.Count;
        for (int i = 0; i < delMistakesCount; i++)
            result.Add(new Mistake(0, CharMistakeType.Deletion));
    }

    // Mistakes were collected from the end, so put them in left-to-right order.
    result.Reverse();
    return result;
}
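The function relies on two supporting types, Mistake and CharMistakeType, which the article does not list. The sketch below shows one way they might look. The enum members and the (int, CharMistakeType) constructor follow from how the function uses them; the property names Position and Type and the ToString format are assumptions made for illustration, and the library's actual definitions may differ.

```csharp
using System;

// Kinds of per-character mistakes referenced by the function above.
public enum CharMistakeType
{
    None,
    Substitution,
    Insertion,
    Deletion,
    Transposition
}

// Hypothetical shape of a single mistake: where it is and what kind it is.
public class Mistake
{
    // Index into the corrected word (assumed property name).
    public int Position { get; }

    // Kind of mistake found at that index (assumed property name).
    public CharMistakeType Type { get; }

    public Mistake(int position, CharMistakeType type)
    {
        Position = position;
        Type = type;
    }

    public override string ToString() => $"{Type} at {Position}";
}
```

With types like these, a caller could iterate over the returned list and highlight the character at each Position in the corrected text.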

Interface

The interface is implemented in WinForms in the hope that the application will also run on Mono, although this has not been tested.

The library can be used in any project, provided that attribution is given (Apache 2.0 license).

Source: https://habr.com/ru/post/204372/
