Microsoft's hackathon on Azure ML has literally just ended. One of the tasks required processing Russian-language text data, preferably inside the system itself. I spent a lot of time searching for a solution, so I want to share it in the hope that it saves someone else the time and the frustration.
As you know, Azure ML offers two languages for writing scripts inside the system: Python and R. Let's start with Python, where the problem is easiest to solve.
This is the standard function that is called to process the data:
def azureml_main(dataframe1 = None, dataframe2 = None):
    for index, row in dataframe1.iterrows():
        search = str(row['Search']).decode('utf-8')
Here we take the Search column from the dataset connected to the first input of the “Execute Python Script” block and decode it from UTF-8. After that, all string functions work correctly with the resulting string. If we need to return text data, we have to perform the reverse operation:
        # search is the unicode string decoded earlier; encode it back to UTF-8 bytes
        out_list.append(search.encode('utf-8'))
    return pandas.DataFrame(out_list)
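To see why the decoding step matters, here is a small standalone sketch. It is not Azure ML code: it assumes Python 3, where byte strings and text are separate types (unlike the Python 2 `str` used above), and the sample word is made up. It only shows how the same operations behave on raw UTF-8 bytes and on decoded text.

```python
# Hypothetical sample value: a Russian word as raw UTF-8 bytes,
# the way it might arrive before decoding.
raw = "поиск".encode("utf-8")

# Decode once on the way in, as the Search column is decoded above.
text = raw.decode("utf-8")

# On bytes, len() counts bytes; each Cyrillic letter here takes two bytes.
print(len(raw))      # 10
# On decoded text, len() counts characters.
print(len(text))     # 5

# Case conversion only affects ASCII letters on bytes, so raw is left unchanged,
# while the decoded text is converted as expected.
print(raw.upper() == raw)   # True
print(text.upper())         # ПОИСК

# Encode once on the way out: the round trip restores the original bytes.
restored = text.encode("utf-8")
print(restored == raw)      # True
```

The pattern is the same as in the block script: decode at the input boundary, do all string work on text, and encode again only when returning data.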
To use stemming, import the RussianStemmer class, create an object by calling the constructor with the argument False, and then use that object. Passing True, which loads the standard set of stop words, does not work at the moment: it raises an error.
from nltk.stem.snowball import RussianStemmer
stemmer = RussianStemmer(False)
stemmer.stem(word)
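As a rough illustration of how the stemmer object might be applied to several words at once, the sketch below stems a small list. The word list is made up for the example, and the code assumes a Python 3 interpreter with NLTK installed rather than the Azure ML block itself, so no decoding step appears.

```python
from nltk.stem.snowball import RussianStemmer

# False is the ignore_stopwords argument: no stop-word list is loaded.
stemmer = RussianStemmer(False)

# Hypothetical sample input: three case forms of the same noun, in lowercase.
words = ["книга", "книги", "книгу"]
stems = [stemmer.stem(word) for word in words]

# The different case endings should be stripped, leaving one common stem.
print(stems)
print(len(set(stems)) == 1)
```

Creating the stemmer once and reusing it for every word avoids rebuilding the object inside the row loop.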
The solution to the same problem in R looks so simple that it is hard to believe how much time I had to waste finding it.
search <- dataset1$Search
Encoding(search) <- 'UTF-8'
As before, we select the Search column and set its encoding explicitly. After this call, string functions start working with the text correctly; I checked this with the stringi library. No reverse conversion is required: everything works as is.
Stemming in R is implemented in a similar way. Note that the UTF-8 encoding must be set first, otherwise the stemmer will not do anything.
library(SnowballC)
stems <- wordStem(words, language = "russian")