
A Weka project for the task of sentiment (tonality) recognition

This is a translation of my publication originally written in English.

The Internet is full of articles, notes, blogs, and success stories about applying machine learning (ML) to practical problems. Some use it for good, and some simply to cheer people up, as in this image:

image
However, it is sometimes not so easy for someone who is not an expert in these areas to get started with the existing tools. There are certainly good and relatively quick paths into practical machine learning, for example the Python library scikit-learn. By the way, this project contains code written by the SkyNet team (the author happened to be one of its leading participants) that illustrates how simple it is to work with the library. If you are a Java developer, there are a couple of good tools: Weka and Apache Mahout. Both libraries are general-purpose in terms of applicability: from recommender systems to text classification. There are also tools more tailored to machine learning on text: Mallet and the Stanford NLP set of libraries. There are also lesser-known libraries such as Java-ML.

In this post we will focus on the Weka library and build a draft, or template, project for machine learning on text, using a specific example: the task of recognizing tonality, or sentiment (sentiment analysis, sentiment detection). That said, the project is fully operational, and even under a commercial-friendly license (Weka itself is under GPL 3.0), i.e. with a strong desire you can even apply the code in your own projects.

From the set of Weka algorithms generally suitable for the chosen task, we will use the Multinomial Naive Bayes algorithm. In this post I cite almost all the same links as in the English version. But since translation is a creative task, let me also give a link on the subject to a Russian-language machine learning resource.

In my opinion and experience with machine learning tools, a programmer using an ML library is usually looking to solve three problems: configuring an algorithm, training it, and I/O, i.e. saving a trained model to disk and loading it back into memory. Besides these purely practical aspects there is a theoretical one, perhaps the most important: assessing the quality of the model. We will touch on this as well.

So, in order.

Setting up the classification algorithm


Let's start with the task of recognizing sentiment into three classes.

public class ThreeWayMNBTrainer {
    private NaiveBayesMultinomialText classifier;
    private String modelFile;
    private Instances dataRaw;

    public ThreeWayMNBTrainer(String outputModel) {
        // create the classifier
        classifier = new NaiveBayesMultinomialText();
        // filename for outputting the trained model
        modelFile = outputModel;
        // listing class labels
        ArrayList<Attribute> atts = new ArrayList<Attribute>(2);
        ArrayList<String> classVal = new ArrayList<String>();
        classVal.add(SentimentClass.ThreeWayClazz.NEGATIVE.name());
        classVal.add(SentimentClass.ThreeWayClazz.POSITIVE.name());
        atts.add(new Attribute("content", (ArrayList<String>) null));
        atts.add(new Attribute("@@class@@", classVal));
        // create the instances data structure
        dataRaw = new Instances("TrainingInstances", atts, 10);
    }
}

In the above code, the following happens: a NaiveBayesMultinomialText classifier is created; the filename for saving the trained model is stored; the class labels (NEGATIVE, POSITIVE) are listed as values of the class attribute; and the Instances data structure for training examples is created.
Similarly, but with a larger number of output labels, a classifier is created for 5 classes:

public class FiveWayMNBTrainer {
    private NaiveBayesMultinomialText classifier;
    private String modelFile;
    private Instances dataRaw;

    public FiveWayMNBTrainer(String outputModel) {
        classifier = new NaiveBayesMultinomialText();
        classifier.setLowercaseTokens(true);
        classifier.setUseWordFrequencies(true);

        modelFile = outputModel;

        ArrayList<Attribute> atts = new ArrayList<Attribute>(2);
        ArrayList<String> classVal = new ArrayList<String>();
        classVal.add(SentimentClass.FiveWayClazz.NEGATIVE.name());
        classVal.add(SentimentClass.FiveWayClazz.SOMEWHAT_NEGATIVE.name());
        classVal.add(SentimentClass.FiveWayClazz.NEUTRAL.name());
        classVal.add(SentimentClass.FiveWayClazz.SOMEWHAT_POSITIVE.name());
        classVal.add(SentimentClass.FiveWayClazz.POSITIVE.name());
        atts.add(new Attribute("content", (ArrayList<String>) null));
        atts.add(new Attribute("@@class@@", classVal));

        dataRaw = new Instances("TrainingInstances", atts, 10);
    }
}


Classifier Training


Training the classification algorithm, or classifier, consists of feeding the algorithm examples as (object, label) pairs (x, y). An object is described by features, a set (or vector) of which allows distinguishing an object of one class from an object of another. For example, in a task of classifying fruit into two classes, oranges and apples, such features could be: size, color, the presence of pimples, the presence of a stem. In the context of the sentiment recognition task, the feature vector may consist of words (unigrams) or pairs of words (bigrams). The labels are the names (or ordinal numbers) of the sentiment classes: NEGATIVE, NEUTRAL, or POSITIVE. Given the examples, we expect the algorithm to learn and generalize to the point of predicting an unknown label y' from a feature vector x'.
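To make the feature-vector idea concrete, here is a minimal, library-free sketch of turning a sentence into unigram and bigram features (the class and method names are illustrative, not part of the project):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper: extracts unigram and bigram features
// from a sentence, joining bigram words with an underscore.
public class FeatureExtractor {
    public static List<String> unigramsAndBigrams(String sentence) {
        String[] tokens = sentence.toLowerCase().split("\\s+");
        List<String> features = new ArrayList<String>();
        for (String t : tokens) {
            features.add(t); // unigram
        }
        for (int i = 0; i + 1 < tokens.length; i++) {
            features.add(tokens[i] + "_" + tokens[i + 1]); // bigram
        }
        return features;
    }
}
```

For the sentence "I like this", this yields the unigrams i, like, this and the bigrams i_like, like_this.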

Let's implement the method that adds a pair (x, y) for the three-class sentiment classifier. We assume the feature vector is a list of words.

public void addTrainingInstance(SentimentClass.ThreeWayClazz threeWayClazz, String[] words) {
    double[] instanceValue = new double[dataRaw.numAttributes()];
    instanceValue[0] = dataRaw.attribute(0).addStringValue(Join.join(" ", words));
    instanceValue[1] = threeWayClazz.ordinal();
    dataRaw.add(new DenseInstance(1.0, instanceValue));
    dataRaw.setClassIndex(1);
}


In fact, we could pass the method a single string as the second parameter instead of an array of strings. But we deliberately work with an array of elements so that the code above can apply whatever filters we want. For sentiment analysis, a very relevant filter is gluing a negation word (a particle, etc.) to the word that follows it: do not like => do not_like. Thus, features like and not_like form bipolar entities. Without gluing, the word like would occur in both positive and negative contexts, and therefore would not carry the right signal (contrary to reality). In the next step, when the classifier is built, the string assembled from the array elements will be tokenized and turned into a vector.
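Such a gluing filter can be sketched in a few lines; the class name and the negation list below are illustrative assumptions, not part of the project:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A minimal sketch of the negation-gluing filter described above:
// a negation word is glued to the token that follows it.
public class NegationGluer {
    // Illustrative negation list; a real one would be larger.
    private static final Set<String> NEGATIONS =
            new HashSet<String>(Arrays.asList("not", "no", "never"));

    public static String[] glueNegations(String[] words) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < words.length; i++) {
            if (NEGATIONS.contains(words[i].toLowerCase()) && i + 1 < words.length) {
                out.add(words[i] + "_" + words[i + 1]); // "not like" -> "not_like"
                i++; // skip the word that was glued
            } else {
                out.add(words[i]);
            }
        }
        return out.toArray(new String[0]);
    }
}
```

A filter like this would be applied to the words array before it is handed to addTrainingInstance.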

Actually, the classifier training is implemented in one line:
public void trainModel() throws Exception {
    classifier.buildClassifier(dataRaw);
}


Simple!

I/O (saving and loading the model)


A very common scenario in machine learning is to train a classifier model in memory and then use it to recognize/classify new objects. However, to work as part of a product, the model must be saved to disk and loaded back into memory. Saving a trained model to disk and loading it into memory in Weka is very simple, because the classification algorithm classes implement, among many others, the Serializable interface.

Saving a trained model:

public void saveModel() throws Exception {
    weka.core.SerializationHelper.write(modelFile, classifier);
}


Loading a trained model:

public void loadModel(String _modelFile) throws Exception {
    NaiveBayesMultinomialText classifier =
        (NaiveBayesMultinomialText) weka.core.SerializationHelper.read(_modelFile);
    this.classifier = classifier;
}


After loading the model from disk, let's classify texts. For three-class prediction, we implement the following method:

public SentimentClass.ThreeWayClazz classify(String sentence) throws Exception {
    double[] instanceValue = new double[dataRaw.numAttributes()];
    instanceValue[0] = dataRaw.attribute(0).addStringValue(sentence);

    Instance toClassify = new DenseInstance(1.0, instanceValue);
    dataRaw.setClassIndex(1);
    toClassify.setDataset(dataRaw);

    double prediction = this.classifier.classifyInstance(toClassify);
    double[] distribution = this.classifier.distributionForInstance(toClassify);

    if (distribution[0] != distribution[1])
        return SentimentClass.ThreeWayClazz.values()[(int) prediction];
    else
        return SentimentClass.ThreeWayClazz.NEUTRAL;
}


Pay attention to the line if (distribution[0] != distribution[1]). As you remember, we defined the list of class labels for this case as {NEGATIVE, POSITIVE}, so in principle our classifier should be at most binary. But if the probability distribution over the two labels is equal (50% each), we can safely assume we are dealing with the neutral class. Thus, we get a three-class classifier.

If the classifier is built correctly, the following unit test should pass:

@org.junit.Test
public void testArbitraryTextPositive() throws Exception {
    threeWayMnbTrainer.loadModel(modelFile);
    Assert.assertEquals(SentimentClass.ThreeWayClazz.POSITIVE,
        threeWayMnbTrainer.classify("I like this weather"));
}


For completeness, we implement a wrapper class that builds and trains a classifier, saves the model to disk and tests the model for quality:

public class ThreeWayMNBTrainerRunner {
    public static void main(String[] args) throws Exception {
        KaggleCSVReader kaggleCSVReader = new KaggleCSVReader();
        kaggleCSVReader.readKaggleCSV("kaggle/train.tsv");
        KaggleCSVReader.CSVInstanceThreeWay csvInstanceThreeWay;

        String outputModel = "models/three-way-sentiment-mnb.model";
        ThreeWayMNBTrainer threeWayMNBTrainer = new ThreeWayMNBTrainer(outputModel);

        System.out.println("Adding training instances");
        int addedNum = 0;
        while ((csvInstanceThreeWay = kaggleCSVReader.next()) != null) {
            if (csvInstanceThreeWay.isValidInstance) {
                threeWayMNBTrainer.addTrainingInstance(csvInstanceThreeWay.sentiment,
                        csvInstanceThreeWay.phrase.split("\\s+"));
                addedNum++;
            }
        }
        kaggleCSVReader.close();
        System.out.println("Added " + addedNum + " instances");

        System.out.println("Training and saving Model");
        threeWayMNBTrainer.trainModel();
        threeWayMNBTrainer.saveModel();

        System.out.println("Testing model");
        threeWayMNBTrainer.testModel();
    }
}


Model quality


As you may have guessed, testing the quality of the model with Weka is also quite simple. Computing the model's quality metrics is necessary, for example, to check whether our model is overfit or underfit. Underfitting is intuitively clear: we did not find the optimal set of features for the classified objects, and the model turned out to be too simple. Overfitting means the model is over-adjusted to the examples, i.e. it does not generalize to the real world, being needlessly complex.

There are different ways to test a model. One of them is to hold out a test sample from the training set (say, one third) and run cross-validation. That is, at each new iteration we take a different one-third of the training set as the test sample and compute the quality metrics relevant to the problem, for example precision/recall/accuracy, etc. At the end of such a run, we average the metrics over all iterations. This gives a discounted estimate of the model's quality: in practice it will be lower than on the full training set, but closer to the quality seen in real life.
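The fold rotation described above can be sketched without any library (the class name is illustrative; in Weka itself the whole procedure is a single call to Evaluation.crossValidateModel):

```java
import java.util.ArrayList;
import java.util.List;

// A library-free sketch of the k-fold scheme described above:
// each iteration holds out a different fold as the test sample,
// and the per-fold scores are averaged at the end.
public class FoldRotation {
    // Indices of the held-out test fold number 'fold' (0-based)
    // out of 'k' folds over 'n' examples.
    public static List<Integer> testFoldIndices(int n, int k, int fold) {
        List<Integer> idx = new ArrayList<Integer>();
        for (int i = 0; i < n; i++) {
            if (i % k == fold) {
                idx.add(i);
            }
        }
        return idx;
    }

    // The averaged ("discounted") quality over all folds.
    public static double average(double[] foldScores) {
        double sum = 0.0;
        for (double s : foldScores) {
            sum += s;
        }
        return sum / foldScores.length;
    }
}
```

With Weka on the classpath, the equivalent is roughly: new Evaluation(dataRaw) followed by eval.crossValidateModel(classifier, dataRaw, 10, new Random(1)).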

However, for a quick look at the quality of the model, it is enough to compute accuracy, i.e. the ratio of correct answers to the total number of answers:

public void testModel() throws Exception {
    Evaluation eTest = new Evaluation(dataRaw);
    eTest.evaluateModel(classifier, dataRaw);
    String strSummary = eTest.toSummaryString();
    System.out.println(strSummary);
}


This method displays the following statistics:

Correctly Classified Instances       28625               83.3455 %
Incorrectly Classified Instances      5720               16.6545 %
Kappa statistic                          0.4643
Mean absolute error                      0.2354
Root mean squared error                  0.3555
Relative absolute error                 71.991  %
Root relative squared error             87.9228 %
Coverage of cases (0.95 level)          97.7697 %
Mean rel. region size (0.95 level)      83.3426 %
Total Number of Instances            34345


Thus, the accuracy of the model on the entire training set is 83.35%. The complete project with the code can be found on my github. The code uses data from kaggle, so if you decide to run the code (or even compete in the competition), you will need to accept the rules of participation and download the data. Implementing the complete code for classifying sentiment into 5 classes is left to the reader. Good luck!

Source: https://habr.com/ru/post/229779/

