
Fortune-telling with neural networks: did the author himself show up in the comments to the post?



I will share a story about a small project: how to find the author's replies among the comments without knowing in advance who the author of the post is.

I started the project with minimal knowledge of machine learning, and I think specialists will find nothing new here. In a sense this material is a compilation of various articles; I will tell you how I approached the problem, and in the code you can find useful trivia and techniques for natural language processing.

My initial data were as follows: a database containing 2.5M media items and 39.5M comments on them. For 1M posts the author of the material was known one way or another (this information was either present in the database or obtained by analyzing the data for circumstantial evidence). On this basis, a labeled dataset of 215K records was formed.
Initially, I applied an approach based on heuristics produced by natural intelligence and translated into SQL queries with full-text search or regular expressions. The simplest examples of texts to analyze: "thanks for the comment" or "thank you for the good grades" come from the author in 99.99% of cases, while "thanks for the work" or "Thank you! Send the material to my e-mail. Thank you!" are ordinary reader reviews. With this approach it was possible to filter out only explicit matches, missing cases of trivial typos or of the author carrying on a dialogue with commenters. Therefore, it was decided to use neural networks; this idea came not without the help of a friend.
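To make the idea concrete, here is a minimal sketch of this kind of heuristic; the phrases and the helper function are illustrative, not the original SQL rules:

import re

# Phrases typical of the author versus an ordinary reader
# (illustrative examples from the text above, not the full rule set).
AUTHOR_PATTERNS = [
    re.compile(r"\bthanks? for (the|your) comment\b", re.IGNORECASE),
    re.compile(r"\bthank you for the good grades\b", re.IGNORECASE),
]
READER_PATTERNS = [
    re.compile(r"\bthanks? for (the|your) work\b", re.IGNORECASE),
]

def looks_like_author(comment):
    # reader-style phrases veto the match
    if any(p.search(comment) for p in READER_PATTERNS):
        return False
    return any(p.search(comment) for p in AUTHOR_PATTERNS)

print(looks_like_author("Thanks for the comment, fixed it."))  # True
print(looks_like_author("Thanks for the work!"))               # False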

A typical sequence of comments. Which one is the author's?





The method for determining text sentiment was taken as a basis. The task is simple: we have two classes, the author and not the author. For training models I used the Google service that provides virtual machines with a GPU and a Jupyter notebook interface.

Examples of networks found on the Internet:

from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

embed_dim = 128

model = Sequential()
model.add(Embedding(max_features, embed_dim, input_length=X_train.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(196, dropout=0.5, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))  # one sigmoid unit for the binary author / not-author decision
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

On strings cleaned of HTML tags and special characters, they gave about 65-74% accuracy, which was not much better than a coin flip.

An interesting point: aligning the input sequences via pad_sequences(x_train, maxlen=max_len, padding='pre') made a significant difference in the results. In my case, the best results came with padding='post'.
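A minimal illustration of the difference (the numbers are toy token IDs):

from keras.preprocessing.sequence import pad_sequences

seqs = [[5, 8, 13], [2, 3]]

# padding='pre' (the default) puts the zeros before the tokens
print(pad_sequences(seqs, maxlen=5, padding='pre'))
# [[ 0  0  5  8 13]
#  [ 0  0  0  2  3]]

# padding='post' keeps the tokens at the start of the sequence,
# which worked better in this task
print(pad_sequences(seqs, maxlen=5, padding='post'))
# [[ 5  8 13  0  0]
#  [ 2  3  0  0  0]]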

The next step was lemmatization, which immediately raised the accuracy to 80%, and from there it was possible to keep working. The main problem now was cleaning the text correctly. For example, typos in the word "thank you" (the typos were chosen by frequency of use) were converted into a regular expression like the following (there were one and a half to two dozen similar expressions).

 re16 = re.compile(ur"(?:\b(?:спасибо|…)\b)", re.UNICODE)  # a long alternation of frequent Cyrillic misspellings of "спасибо" ("thank you"); the full pattern is in the repository

Here I would like to express special thanks to the overly polite people who consider it necessary to add this word to every one of their sentences.

Reducing the proportion of typos was necessary because typos come out of the lemmatizer as strange words, and we lose useful information.
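For illustration, a minimal lemmatization sketch; the post does not name the lemmatizer, so pymorphy2, a common choice for Russian text, is assumed here:

import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def lemmatize(text):
    # reduce every token to its normal (dictionary) form
    return " ".join(morph.parse(word)[0].normal_form for word in text.split())

print(lemmatize("спасибо за хорошие оценки"))
# -> "спасибо за хороший оценка"
# a misspelled token passes through as a strange word,
# which is exactly the information loss described above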

But every cloud has a silver lining: tired of struggling with typos and doing complex text cleaning, I applied the word2vec vector representation of words. The method made it possible to map all the typos, slips, and synonyms onto closely spaced vectors.



Words and their relationships in vector space.

The cleaning rules were significantly simplified (yeah, right), all messages and user names were split into sentences and written to a file. An important point: given how brief our commenters are, building quality vectors requires additional contextual information, for example from the forum and Wikipedia. Three models were trained on the resulting file: classic word2vec, GloVe, and FastText. After many experiments I finally settled on FastText, as the one that separated word clusters most cleanly in my case.
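A minimal gensim sketch of this training step; the file name and hyperparameters are illustrative, and the size argument assumes gensim < 4.0 (newer versions call it vector_size):

from gensim.models import FastText

# sentences.txt: one sentence per line, tokens separated by spaces
# (the comments plus extra context from the forum and Wikipedia)
with open('sentences.txt', encoding='utf-8') as f:
    sentences = [line.split() for line in f]

model = FastText(sentences, size=128, window=5, min_count=5, workers=4)

# thanks to character n-grams, misspellings land near the correct word
print(model.wv.most_similar('спасибо', topn=5))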



All these changes brought a stable 84-85 percent accuracy.

Examples of models:
from keras.models import Model
from keras.layers import (Embedding, Conv1D, Conv2D, MaxPooling1D, MaxPool2D,
                          LSTM, GRU, Dense, Dropout, LeakyReLU, Flatten,
                          Reshape, BatchNormalization, concatenate)
from keras import regularizers

# total_unique_words, DIM, max_words, maxSequenceLength and embedding_matrix
# are defined elsewhere in the notebook.

def model_conv_core(model_input, embd_size=128):
    # shared convolutional "stem" reused by other models
    num_filters = 128
    X = Embedding(total_unique_words, DIM, input_length=max_words,
                  weights=[embedding_matrix], trainable=False, name='Word2Vec')(model_input)
    X = Conv1D(num_filters, 3, activation='relu', padding='same')(X)
    X = Dropout(0.3)(X)
    X = MaxPooling1D(2)(X)
    X = Conv1D(num_filters, 5, activation='relu', padding='same')(X)
    return X

def model_conv1d(model_input, embd_size=128, num_filters=64, kernel_size=3):
    X = Embedding(total_unique_words, DIM, input_length=max_words,
                  weights=[embedding_matrix], trainable=False, name='Word2Vec')(model_input)
    X = Conv1D(num_filters, kernel_size, padding='same',
               activation='relu', strides=1)(X)
    # X = Dropout(0.1)(X)
    X = MaxPooling1D(pool_size=2)(X)
    X = LSTM(256, kernel_regularizer=regularizers.l2(0.004))(X)
    X = Dropout(0.3)(X)
    X = Dense(128, kernel_regularizer=regularizers.l2(0.0004))(X)
    X = LeakyReLU()(X)
    X = BatchNormalization()(X)
    X = Dense(1, activation="sigmoid")(X)
    model = Model(model_input, X, name='w2v_conv1d')
    return model

def model_gru(model_input, embd_size=128):
    X = model_conv_core(model_input, embd_size)
    X = MaxPooling1D(2)(X)
    X = Dropout(0.2)(X)
    X = GRU(256, activation='relu', return_sequences=True,
            kernel_regularizer=regularizers.l2(0.004))(X)
    X = Dropout(0.5)(X)
    X = GRU(128, activation='relu',
            kernel_regularizer=regularizers.l2(0.0004))(X)
    X = Dropout(0.5)(X)
    X = BatchNormalization()(X)
    X = Dense(1, activation="sigmoid")(X)
    model = Model(model_input, X, name='w2v_gru')
    return model

def model_conv2d(model_input, embd_size=128):
    # three parallel 2D convolutions over the embedding "image", one per
    # n-gram size, each max-pooled over time and then concatenated;
    # maxSequenceLength must equal max_words and embd_size must equal DIM
    # for the Reshape below to be valid
    num_filters = 256
    filter_sizes = [3, 5, 7]
    X = Embedding(total_unique_words, DIM, input_length=max_words,
                  weights=[embedding_matrix], trainable=False, name='Word2Vec')(model_input)
    reshape = Reshape((maxSequenceLength, embd_size, 1))(X)
    conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embd_size), padding='valid',
                    kernel_initializer='normal', activation='relu')(reshape)
    conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embd_size), padding='valid',
                    kernel_initializer='normal', activation='relu')(reshape)
    conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embd_size), padding='valid',
                    kernel_initializer='normal', activation='relu')(reshape)
    maxpool_0 = MaxPool2D(pool_size=(maxSequenceLength - filter_sizes[0] + 1, 1),
                          strides=(1, 1), padding='valid')(conv_0)
    maxpool_1 = MaxPool2D(pool_size=(maxSequenceLength - filter_sizes[1] + 1, 1),
                          strides=(1, 1), padding='valid')(conv_1)
    maxpool_2 = MaxPool2D(pool_size=(maxSequenceLength - filter_sizes[2] + 1, 1),
                          strides=(1, 1), padding='valid')(conv_2)
    X = concatenate([maxpool_0, maxpool_1, maxpool_2], axis=1)
    X = Dropout(0.2)(X)
    X = Flatten()(X)
    X = Dense(int(embd_size / 2.0), activation='relu',
              kernel_regularizer=regularizers.l2(0.004))(X)
    X = Dropout(0.5)(X)
    X = BatchNormalization()(X)
    X = Dense(1, activation="sigmoid")(X)
    model = Model(model_input, X, name='w2v_conv2d')
    return model
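A sketch of how one of these model builders could be wired up and trained; x_train / x_val are assumed to be the padded token matrices and y_train / y_val the author labels:

from keras.layers import Input

model_input = Input(shape=(max_words,), dtype='int32')
model = model_conv1d(model_input)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=10, batch_size=128)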


and 6 more models in the code. Some of the models were taken from the Internet, and some I designed myself.

It was noticed that different models picked out different comments, which prompted the idea of using ensembles of models. At first I assembled the ensemble manually, choosing the best pairs of models; then I wrote a generator. To optimize the brute-force search over model subsets, I took the Gray code as a basis:

def gray_code(n):
    """Return the list of all n-bit Gray-code strings."""
    def gray_code_recurse(g, n):
        k = len(g)
        if n <= 0:
            return
        # mirror the current list with '1' prefixes,
        # then prefix the original half with '0'
        for i in range(k - 1, -1, -1):
            g.append('1' + g[i])
        for i in range(k - 1, -1, -1):
            g[i] = '0' + g[i]
        gray_code_recurse(g, n - 1)

    g = ['0', '1']
    gray_code_recurse(g, n - 1)
    return g

def gen_list(m):
    """Enumerate all subsets of m with at least two elements,
    using each Gray-code string as a membership mask."""
    out = []
    for mask_str in gray_code(len(m)):
        v = [item for item, c in zip(m, mask_str) if c == '1']
        if len(v) > 1:
            out.append(v)
    return out
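A hypothetical driver for this generator: the trained-model names and the combination rule (plain averaging of the sigmoid outputs) are assumptions, since the post does not spell them out:

import numpy as np
from sklearn.metrics import f1_score

models = [model_conv1d_trained, model_gru_trained, model_conv2d_trained]  # hypothetical names

best_f1, best_subset = 0.0, None
for subset in gen_list(models):
    # average the predicted probabilities of every model in the subset
    preds = np.mean([m.predict(x_val) for m in subset], axis=0).ravel()
    score = f1_score(y_val, (preds > 0.5).astype(int))
    if score > best_f1:
        best_f1, best_subset = score, subset

print(best_f1, [m.name for m in best_subset])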

With the ensemble, "life became more fun," and the model's accuracy currently holds steady at 86-87%, mainly limited by the poor-quality labeling of some authors in the dataset.



Problems I encountered:

  1. Unbalanced dataset: the number of comments from authors was significantly smaller than from other commenters (one standard remedy is sketched after this list).
  2. Classes in the sample come in strict order: the beginning, middle, and end of the sample differ significantly in classification quality. This is clearly visible during training on the plot of the F1 measure.
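As a sketch of the remedy mentioned in item 1 (the post does not say which one was actually used): weight the rare "author" class more heavily during training.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
model.fit(x_train, y_train, epochs=10, batch_size=128,
          class_weight=dict(enumerate(weights)))  # e.g. {0: 0.55, 1: 5.2}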

To deal with this, I rolled my own bicycle for splitting the sample into training and validation parts, although in practice, in most cases, the train_test_split procedure from the sklearn library will suffice, as in the sketch below.
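A minimal sketch, assuming features x and labels y; shuffling with stratification breaks up the strict ordering and keeps the author / non-author ratio equal in both parts:

from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)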

Current working model graph:



As a result, I obtained a model that confidently identifies authors from short comments. Further improvement will involve cleaning up the classification results on real data and transferring them into the training dataset.

All the code, with additional explanations, is available in the repository.

As a postscript: if you need to classify large volumes of text, take a look at the VDCNN model ("Very Deep Convolutional Neural Network"; there is a Keras implementation); it is an analog of ResNet for texts.

Used materials:

• Overview of machine learning courses
• Sentiment analysis using convolutions
• Convolutional networks in NLP
• Metrics in machine learning
• https://ld86.imtqy.com/ml-slides/unbalanced.html
• A look inside the model

Source: https://habr.com/ru/post/441850/

