
Not-so-big data and text sentiment analysis

Every problem has a simple, clear, and wrong solution.
One such solution is what I will describe in this article.
Do not try to repeat these experiments at home.
And if you do, complaints about burned-out processors will not be accepted.



So, there is a dataset: www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
It consists of 1000 negative and 1000 positive reviews.
How do you get accuracy comparable to a convolutional neural network, word2vec, and other wonders of modern technology, using only the humble black bread of the xz archiver and a linear classifier?

Very simple.

1. We take 100 random texts (50 negative and 50 positive).
2. We remove them from the dataset.
3. For each of the 1900 remaining texts, we compute a generalized distance to each of the 100 held-out texts by the following method.
Let X and Y be the two files whose distance we need to calculate,
and let Z(N) be the length of file N after compressing it with the xz archiver.

Calculate the values:

x = Z(X)
y = Z(Y)
xy = Z(XY)
yx = Z(YX)

The last two values are the result of compressing the concatenation of files X and Y in the first case, and of Y and X in the second.

And now we apply the magic formula (taken from here):
attribute = (min(xy, yx) - min(x, y)) / max(x, y)
(A Python sketch of steps 1-4 is given right after this list.)

4. We get a 1900 x 100 matrix.
It now needs to be normalized to the range from 0 to 1.

5. Tadam:
$ svm-train -v 10 -s 0 -t 0 -c 10 rand100.norm.svm
Cross Validation Accuracy = 75.6316%
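
Here is a minimal Python sketch of steps 1-4, for illustration only. It assumes the usual layout of the unpacked dataset (txt_sentoken/pos/*.txt and txt_sentoken/neg/*.txt), uses Python's built-in lzma module as a stand-in for the external xz binary, and picks column-wise normalization and the libsvm output format on its own; none of those choices comes from the original pipeline.

# A sketch of steps 1-4. Assumes the unpacked dataset layout
# txt_sentoken/pos/*.txt and txt_sentoken/neg/*.txt, and uses Python's
# built-in lzma module as a stand-in for the external xz binary.
import lzma
import random
from pathlib import Path

def z(data: bytes) -> int:
    # Length of the compressed byte string, i.e. Z(N) from the text.
    return len(lzma.compress(data, preset=6))

def attribute(x: bytes, y: bytes) -> float:
    # The "magic formula": (min(Z(XY), Z(YX)) - min(Z(X), Z(Y))) / max(Z(X), Z(Y))
    zx, zy = z(x), z(y)
    return (min(z(x + y), z(y + x)) - min(zx, zy)) / max(zx, zy)

random.seed(0)
pos = sorted(Path("txt_sentoken/pos").glob("*.txt"))
neg = sorted(Path("txt_sentoken/neg").glob("*.txt"))

# Steps 1-2: hold out 50 random positive and 50 random negative reviews.
held_out = random.sample(pos, 50) + random.sample(neg, 50)
anchors = [p.read_bytes() for p in held_out]
held = set(held_out)
rest = [(p, +1) for p in pos if p not in held] + [(p, -1) for p in neg if p not in held]

# Step 3: the 1900 x 100 matrix of generalized distances (this is the slow part).
labels, matrix = [], []
for path, label in rest:
    text = path.read_bytes()
    labels.append(label)
    matrix.append([attribute(text, a) for a in anchors])

# Step 4: normalize to [0, 1] (column-wise here; the article does not specify)
# and write the result in libsvm format for svm-train.
lo = [min(col) for col in zip(*matrix)]
hi = [max(col) for col in zip(*matrix)]
with open("rand100.norm.svm", "w") as out:
    for label, row in zip(labels, matrix):
        feats = " ".join(f"{j + 1}:{(v - lo[j]) / ((hi[j] - lo[j]) or 1):.6f}"
                         for j, v in enumerate(row))
        out.write(f"{label} {feats}\n")

The resulting rand100.norm.svm file is then fed to svm-train exactly as in step 5.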

Why does this work?
The point is that the more sequences two texts have in common, the better their concatenation compresses, i.e. the smaller Z(XY) is relative to Z(X) and Z(Y).
In effect, the compressor itself picks out the groups of characters the texts share.
It is quite possible that 200 random reference files would work even better.
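
A toy way to see this effect (again with lzma standing in for xz; the sample strings are made up purely for illustration):

# Shared sequences make the concatenation compress better.
import lzma

def z(data: bytes) -> int:
    return len(lzma.compress(data, preset=6))

a = b"the plot is thin but the acting is wonderful and the pacing is great " * 20
b = b"the plot is thin and the acting is wonderful but the ending is weak " * 20
c = b"quarterly revenue grew while operating margins narrowed significantly " * 20

print(z(a + b) - z(a))  # increment for a related text: expected to be small
print(z(a + c) - z(a))  # increment for an unrelated text: expected to be noticeably larger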

But remember the warning not to repeat this at home?
Computing the matrix on a home computer can take a day in a single thread.
Or it can burn out the processor in multi-threaded mode if the cooling is weak.
By the way, this is not a joke: I once burned out two servers in a data center on the other side of the planet, though that was with a different, even heavier algorithm.

And that is why this method is only of theoretical interest in the framework of applied phlometry.

PS
There will be no code beyond the trivial: I did everything with shell and Perl one-liners, and when a homo sapiens tries to read those, compression starts happening directly in the cerebral cortex. And neural networks, as you know, are poorly suited to that kind of load.

PPS
This experiment was inspired by this post.

UPDATE
Judging by the comments, I showed the practical part but did not explain why it is needed.
Indeed, this method is clearly inapplicable to real problems: the computational cost is far too high.

Distance computation by compression has been known for a long time and has a solid theoretical foundation.
To those who wish to get acquainted with it, I recommend this.
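
For reference, the normalized compression distance (NCD) from that literature, written in the notation used above, is:

NCD(X, Y) = (Z(XY) - min(Z(X), Z(Y))) / max(Z(X), Z(Y))

The attribute computed in step 3 is essentially this quantity, symmetrized by taking min(Z(XY), Z(YX)), since real compressors are not perfectly order-invariant.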

The point, however, is not this particular method.
Today's analysts build classifiers slowly, expensively, and by hand.
Many things are done manually, from feature engineering to designing the network architecture.

Here I showed a completely agnostic method that requires no knowledge of the subject area. None at all. We do not care what language the contents of the file are written in; the only restriction is that it must be a linear stream of bytes (which is why it does not work with images).

Yes, this example belongs to the toy category; in real-world tasks such algorithms are pointless.

But if agnostic methods exist for toy problems, then they can exist for big ones too. And the search for such methods is much more interesting than training autoencoders on the L2 metric.
Especially considering that an autoencoder will most likely fail to train on the NCD metric used here.

Source: https://habr.com/ru/post/256787/

