
Entropy and decision trees

Decision trees are a convenient tool when you need not only to classify data, but also to explain why a particular object was assigned to a particular class.

To complete the picture, let's first consider the nature of entropy and some of its properties. Then, on a simple example, we will see how the use of entropy helps in creating classifiers. Finally, we will formulate, in general terms, an algorithm for constructing a decision tree and discuss its features.

Combinatorial entropy


Consider a set of multi-colored balls: 2 red, 5 green and 3 yellow. Mix them and arrange them in a row. We will call this operation a permutation:


Let's count the number of different permutations, given that the balls of the same color are indistinguishable.

If each ball had a unique color, the number of permutations would be 10!, but if two balls of the same color are swapped, no new permutation is produced. Thus, we need to exclude the 5! permutations of the green balls among themselves (as well as the 3! permutations of the yellow and 2! of the red). Therefore, in this case, the answer is: W = 10! / (2! · 5! · 3!) = 2520.

The multinomial coefficient allows us to calculate the number of permutations in the general case of this problem: W = N! / (N1! · N2! · … · Nk!), where Ni is the number of identical balls of each color.

All permutations can be numbered from 0 to (W − 1). Therefore, a string of log2(W) bits uniquely encodes each of the permutations.

Since a permutation consists of N balls, the average number of bits per permutation element can be expressed as: H = log2(W) / N.

This value is called the combinatorial entropy: H = (1/N) · log2( N! / (N1! · N2! · … · Nk!) ).
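
A minimal Python sketch (not from the original article) that computes W and the combinatorial entropy for the ball example above:

    from math import factorial, log2

    def combinatorial_entropy(counts):
        """Average number of bits per element: log2(W) / N,
        where W = N! / (N1! * N2! * ... * Nk!)."""
        n = sum(counts)
        w = factorial(n)
        for c in counts:
            w //= factorial(c)   # divide out permutations of identical balls
        return log2(w) / n

    # 2 red, 5 green and 3 yellow balls: W = 2520, H ≈ 1.13 bits per ball
    print(combinatorial_entropy([2, 5, 3]))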


The more homogeneous the set (balls of a single color prevail), the lower its combinatorial entropy; conversely, the more diverse the elements of the set, the higher its entropy.

Shannon Entropy


Let's take a closer look at the expression for entropy described above: H = (1/N) · log2( N! / (N1! · N2! · … · Nk!) ).

Given the properties of logarithms, we transform the formula as follows: H = (1/N) · ( log2(N!) − Σ log2(Ni!) ).

Suppose that the number of balls is large enough to use the Stirling formula: ln(N!) ≈ N·ln(N) − N.

Applying the Stirling formula, we get: H ≈ (k/N) · ( N·ln(N) − N − Σ ( Ni·ln(Ni) − Ni ) )

(where k = log2(e) is the factor for converting natural logarithms to base-2 logarithms)

Considering that Σ Ni = N, the expression can be transformed: H ≈ (k/N) · ( N·ln(N) − Σ Ni·ln(Ni) ) = −k · Σ (Ni/N) · ln(Ni/N).


Since the total number of balls is N, and the number of balls of the i-th color is Ni, the probability that a randomly selected ball is of that particular color is pi = Ni / N. Based on this, the formula for entropy takes the form: H = −Σ pi · log2(pi).

This expression is the Shannon entropy.

With a more careful derivation, it can be shown that the Shannon entropy is the limit of the combinatorial entropy; therefore, its value is always somewhat greater than the value of the combinatorial entropy.

A comparison of the two entropies is presented in the following figure, calculated for sets containing two types of objects, A and B (the total number of elements in each set is 100):
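
The comparison can also be reproduced numerically; here is a small Python sketch (not from the original article) that computes both entropies for such two-class sets:

    from math import factorial, log2

    def combinatorial_entropy(counts):
        n = sum(counts)
        w = factorial(n)
        for c in counts:
            w //= factorial(c)
        return log2(w) / n

    def shannon_entropy(counts):
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)

    # sets of 100 objects of two types, A and B, with different proportions
    for n_a in (0, 10, 25, 50, 75, 90, 100):
        counts = [n_a, 100 - n_a]
        print(n_a, round(combinatorial_entropy(counts), 3),
              round(shannon_entropy(counts), 3))

For mixed sets the Shannon value comes out slightly above the combinatorial one, and both drop to zero for homogeneous sets.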



Thermodynamics


Similar expressions for entropy can be obtained in thermodynamics, where the statistical entropy of a macrostate is proportional to the logarithm of the number of its microstates W: S = k · ln(W).

The concept of entropy plays a fundamental role in thermodynamics, appearing in the formulation of the Second Law of Thermodynamics: if an isolated macroscopic system is in a non-equilibrium state, then its spontaneous transition to a state with a higher value of entropy is the most likely outcome:



Maxwell's Demon


To emphasize the statistical nature of the Second Law of Thermodynamics, James Maxwell proposed a thought experiment in 1867: "Imagine a vessel filled with gas at a certain temperature, divided by a partition with a small gate, which a demon opens to let fast particles pass in one direction and slow particles in the other. Consequently, after some time, the fast particles will concentrate in one part of the vessel and the slow ones in the other. Thus, contrary to the Second Law of Thermodynamics, Maxwell's demon can reduce the entropy of a closed system":


Later, Leo Szilard resolved the paradox, but this discussion is somewhat beyond the scope of this article.

Maxwell's Demon == Classifier


If, instead of "fast" and "slow" particles, we consider objects belonging to different classes, then Maxwell's demon can be viewed as a kind of classifier.

The very formulation of the paradox suggests a learning algorithm: we need to find rules (predicates) that split the training data set in such a way that the average entropy decreases. The process of dividing a data set into parts so that entropy decreases can be viewed as the production of information.

Having split the initial data set into two parts using some predicate, we can calculate the entropy of each subset and then the average entropy (weighted by subset size); if it turns out to be smaller than the entropy of the original set, the predicate contains some generalizing information about the data.
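
As a minimal sketch (Python, not from the original article; the data and the threshold predicate are purely illustrative), the entropy decrease ∆S produced by a split can be computed like this:

    from math import log2
    from collections import Counter

    def shannon_entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(data, predicate, label_of):
        """∆S: entropy of the whole set minus the weighted average
        entropy of the two subsets produced by the predicate."""
        left = [x for x in data if predicate(x)]
        right = [x for x in data if not predicate(x)]
        s0 = shannon_entropy([label_of(x) for x in data])
        avg = sum(len(part) / len(data) * shannon_entropy([label_of(x) for x in part])
                  for part in (left, right) if part)
        return s0 - avg

    # illustrative balls described by (x, color); split by the predicate x <= 12
    balls = [(3, 'green'), (5, 'green'), (8, 'yellow'), (10, 'green'),
             (13, 'yellow'), (15, 'yellow'), (18, 'yellow'), (20, 'yellow')]
    print(information_gain(balls, lambda b: b[0] <= 12, lambda b: b[1]))  # positive ∆S

A positive result means the predicate carries information about the data; a value of zero means the split is useless.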

For example, consider a set of two-color balls, in which the color of a ball depends only on its x coordinate:
(for practical purposes, it is convenient to use the Shannon entropy in the calculations)



It can be seen from the figure that if the set is divided into two parts, one containing all elements with coordinate x ≤ 12 and the other all elements with x > 12, then the average entropy will be less than the original by ∆S. This means that the predicate captures some information about the data (it is easy to notice that for x > 12 almost all the balls are yellow).

If we use relatively simple predicates ("greater than", "less than", "equal to", etc.), then one rule will most likely not be enough to create a full-fledged classifier. But the procedure of searching for predicates can be repeated recursively for each subset, with a zero (or very small) entropy value as the stopping criterion. The result is a tree of conditions called a decision tree:


The leaves of the decision tree are classes. To classify an object using the tree, we go down it sequentially, choosing a direction at each node based on the value of the predicate applied to the object being classified. The path from the root of the tree to a leaf can be interpreted as an explanation of why the object was assigned to that class.

In the above example, for simplicity, all objects are characterized by only one attribute — the x coordinate, but the exact same approach can be applied to objects with multiple attributes.

Also, no restrictions are imposed on the values of the objects' attributes: they can be categorical, numerical or boolean. It is only necessary to define predicates that can handle the attribute values correctly (for example, there is hardly any point in using the predicates "greater than" or "less than" for boolean attributes).
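
A minimal Python sketch (not from the original article; the function and attribute names are illustrative) of how simple single-attribute predicates could be generated depending on the attribute type:

    def candidate_predicates(objects, attribute):
        """Generate simple predicates over one attribute, choosing the
        comparison by the attribute's type."""
        predicates = []
        for value in {obj[attribute] for obj in objects}:
            if isinstance(value, (bool, str)):
                # categorical / boolean attribute: equality test
                predicates.append((f"{attribute} == {value!r}",
                                   lambda obj, a=attribute, v=value: obj[a] == v))
            else:
                # numerical attribute: threshold test
                predicates.append((f"{attribute} >= {value}",
                                   lambda obj, a=attribute, v=value: obj[a] >= v))
        return predicates

    # illustrative objects with mixed attributes
    balls = [{'x': 3, 'color': 'green'}, {'x': 15, 'color': 'yellow'}]
    for name, pred in candidate_predicates(balls, 'x'):
        print(name, [pred(b) for b in balls])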

Algorithm for building a decision tree


In general, the algorithm for building a decision tree can be described as follows:
(in my opinion, an algorithm described in "plain language" is easier to grasp)

    compute the entropy s0 of the initial set
    if s0 == 0:
        all elements of the set belong to the same class, stop
    if s0 != 0:
        find a predicate that splits the set so that the average entropy of the resulting parts decreases the most
        split the set using this predicate and repeat the procedure recursively for each part

What does it mean to "look for a predicate"?
One option is to assume that a predicate splitting the set into two parts can be constructed from each element of the original set. The algorithm can then be reformulated:

    compute the entropy s0 of the initial set
    if s0 == 0:
        all elements of the set belong to the same class, stop
    if s0 != 0:
        for each element of the set:
            construct a predicate based on this element, split the set with it and compute ∆S
        choose the predicate with the maximum ∆S
        split the set using this predicate and repeat the procedure recursively for each part

How can we "generate a predicate based on each element of the set"?
In the simplest case, we can use predicates that refer to the value of a single attribute (for example, "x ≥ 12", "color == yellow", etc.). The algorithm then takes the form:

    compute the entropy s0 of the initial set
    if s0 == 0:
        all elements of the set belong to the same class, stop
    if s0 != 0:
        for each element of the set:
            for each attribute of the element:
                generate a predicate based on the attribute value, split the set with it and compute ∆S
        choose the predicate with the maximum ∆S
        split the set using this predicate and repeat the procedure recursively for each part
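
Putting the pieces together, here is a compact Python sketch of this procedure (not the code from the original article; objects are assumed to be dictionaries of attributes paired with a class label, and all names are illustrative):

    from math import log2
    from collections import Counter

    def entropy(rows):
        n = len(rows)
        return -sum((c / n) * log2(c / n)
                    for c in Counter(label for _, label in rows).values())

    def build_tree(rows, min_gain=0.0):
        """rows: list of (attributes_dict, class_label) pairs."""
        s0 = entropy(rows)
        if s0 == 0:                          # all elements belong to one class
            return {'class': rows[0][1]}

        best = None                          # (∆S, description, predicate, left, right)
        for attrs, _ in rows:                # for each element of the set...
            for attr, value in attrs.items():             # ...and each of its attributes
                if isinstance(value, (bool, str)):
                    pred = lambda a, k=attr, v=value: a[k] == v
                    desc = f"{attr} == {value!r}"
                else:
                    pred = lambda a, k=attr, v=value: a[k] >= v
                    desc = f"{attr} >= {value}"
                left = [r for r in rows if pred(r[0])]
                right = [r for r in rows if not pred(r[0])]
                if not left or not right:
                    continue
                avg = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
                gain = s0 - avg              # ∆S for this predicate
                if best is None or gain > best[0]:
                    best = (gain, desc, pred, left, right)

        if best is None or best[0] <= min_gain:           # no useful split: majority leaf
            return {'class': Counter(l for _, l in rows).most_common(1)[0][0]}

        gain, desc, pred, left, right = best
        return {'predicate': desc, 'test': pred,
                'true': build_tree(left, min_gain),
                'false': build_tree(right, min_gain)}

    def classify(tree, attrs):
        while 'class' not in tree:
            tree = tree['true'] if tree['test'](attrs) else tree['false']
        return tree['class']

    # illustrative data: the color depends only on the x coordinate
    data = [({'x': i}, 'green' if i <= 12 else 'yellow') for i in range(1, 21)]
    tree = build_tree(data)
    print(classify(tree, {'x': 5}), classify(tree, {'x': 17}))   # green yellow

The min_gain parameter corresponds to the ∆S-based stopping criterion discussed below: with the default value of 0, the tree is grown until the subsets are pure (or until no split reduces the entropy).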


In fact, if we consider the classified objects as points in a multidimensional space, we can see that the predicates dividing the data set into subsets are hyperplanes, and the classifier training procedure is a search for bounding volumes (as, in general, for any other type of classifier).

The main advantage is the resulting tree-like structure of predicates, which makes the classification results interpretable (although, due to its greedy nature, the described algorithm does not always guarantee the optimality of the tree as a whole).

One of the cornerstones of the described algorithm is the stopping criterion used when building the tree. In the pseudocode above, tree construction stops only upon reaching a set in which all elements belong to the same class (entropy == 0). This approach fits the decision tree exactly to the training sample, but it is not always effective from a practical point of view (the resulting tree is overfitted).

One possible stopping criterion is a small value of ∆S. With this approach, however, it is impossible to give universal advice on the values of ∆S at which tree construction should be stopped.

Random forest


To avoid worrying about the stopping criterion at all, we can do the following: select random subsets of the training sample and build a separate decision tree for each of them (in principle, it does not even matter much which stopping criterion is used):



The resulting ensemble of trees (a simplified version of Random Forest) can be used for classification by running the classified object through all the trees. Each tree, in effect, "votes" for the object belonging to a certain class. Thus, based on the fraction of trees that voted for one class or another, we can estimate the probability with which the object belongs to each class.
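
A rough Python sketch of this voting scheme (not from the original article; it reuses the build_tree and classify functions and the data list from the sketch above, and all names are illustrative):

    import random
    from collections import Counter

    def build_forest(data, n_trees=25, sample_fraction=0.5):
        """Build each tree on a random subset of the training sample."""
        sample_size = max(1, int(len(data) * sample_fraction))
        return [build_tree(random.sample(data, sample_size)) for _ in range(n_trees)]

    def predict_proba(forest, attrs):
        """Each tree votes; the fraction of votes per class is taken as the
        probability of the object belonging to that class."""
        votes = Counter(classify(tree, attrs) for tree in forest)
        return {label: count / len(forest) for label, count in votes.items()}

    forest = build_forest(data)
    print(predict_proba(forest, {'x': 11}))   # class -> fraction of trees that voted for it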

This method also handles the boundary regions of the data adequately:



It can be seen that a single decision tree describes a region that completely contains red dots, while an ensemble of trees describes a shape that is closer to a circle.

If you want to experiment


I created a small application for comparing a decision tree and a random forest. Each time the application is started, it creates a random data set corresponding to a red circle on a green background, and as a result it produces a picture like the one shown above.


Sources are on github.

Instead of a conclusion


Decision trees are a good alternative in cases when you do not feel like tuning abstract weights and coefficients in other classification algorithms, or when you have to process data with mixed (categorical and numerical) attributes.

Additional Information


  1. Yatsimirsky V.K. Physical chemistry (the concept of entropy is pretty well described here, and some philosophical aspects of this phenomenon are also considered)
  2. An interesting thread about compression and entropy on compression.ru
  3. One more article about decision trees on Habré
  4. Toby Segaran, Programming Collective Intelligence (in this book there is a chapter devoted to decision trees, and in general, if you have not read this book yet, I advise you to look there :-)
  5. Libraries such as Weka and Apache Mahout contain implementations of decision trees.


Source: https://habr.com/ru/post/171759/

