
Top 10 data mining algorithms in simple language



Translator's note: We often write about algorithmic trading (see, for example, this list of references on the topic and the related training courses); today we will look directly at algorithms that can be used to analyze various kinds of data, including financial market data. This material is an adapted translation of an article by the American developer and analyst Ray Li.

Today I will try to explain in simple terms how the 10 most effective data mining algorithms, described in this report, work.

Once you know what they are, how they work, what they are good at and where they are used, I hope you will use this article as a starting point for further exploring data mining.

1. C4.5


What does it do? C4.5 builds a classifier in the form of a decision tree. To do this, it needs to be given a set of already classified data.

Wait, what is a classifier? A classifier is a data mining tool that takes a set of data whose classes are already known and, based on it, tries to predict which class new data should belong to.

What does an example of using the algorithm look like? Suppose we have a data set — this is data about a group of patients. We know the different parameters of each patient: age, pulse, blood pressure, maximum oxygen consumption, family history, and so on. These parameters are called attributes.

Now:

Based on these attributes, we want to predict whether a patient will get cancer. The patient can fall into one of 2 classes: will get cancer or will not get cancer. C4.5 is told the class of each patient.

That is the point:

Using a set of patient attributes and the corresponding class, C4.5 builds a decision tree that can predict the class for new patients based on their attributes.

Cool, but what is a decision tree? Classification by the decision tree method creates a kind of flowchart for the distribution of new data. If you go back to the patient example, the flowchart branch might look like this:


In this way:

At each point in the flowchart a question is asked about the value of some attribute, and depending on the answers the patient ends up in a particular class. You can find many examples of building decision trees here .

Is this supervised or unsupervised learning? This is supervised learning: the training data set is labeled with classes. Returning to the patient example, C4.5 does not decide on its own whether a patient will develop cancer. As we said, it builds a decision tree that is then used for making that decision.

Here is how C4.5 differs from other decision-tree systems:


Why use C4.5? Probably the biggest advantage of decision trees is that they are easy to interpret. They are also quite fast, and their output is human-readable.

Where is it used? A popular open-source Java implementation can be found at OpenTox. Orange , an open-source set of tools for data mining analysis and visualization, uses C4.5 in its decision tree classifier.
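To make this concrete, here is a minimal sketch of training a decision tree on a toy "patient" table. One hedge: scikit-learn's DecisionTreeClassifier implements an optimized version of CART rather than C4.5 proper, so it is only a stand-in, and the handful of records below are invented for illustration.

```python
# A minimal decision-tree sketch in the spirit of C4.5.
# Note: scikit-learn's DecisionTreeClassifier is an optimized CART, not C4.5;
# it is used here only as a stand-in, and the tiny "patient" data is invented.
from sklearn.tree import DecisionTreeClassifier, export_text

# Attributes: [age, resting pulse, systolic blood pressure]
X = [
    [45, 70, 120],
    [62, 85, 145],
    [50, 72, 130],
    [70, 90, 160],
    [38, 65, 118],
    [55, 88, 150],
]
# Classes: 1 = "will develop the disease", 0 = "will not"
y = [0, 1, 0, 1, 0, 1]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=2)  # entropy ~ information gain
clf.fit(X, y)

# The tree is human-readable, which is the main selling point of this family of methods.
print(export_text(clf, feature_names=["age", "pulse", "blood_pressure"]))

# Predict the class of a new patient from their attributes.
print(clf.predict([[58, 86, 148]]))
```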
Classifiers are a great thing, but here’s another algorithm that is directly related to clustering ...

2. k-means


What does it do? The k-means method creates k groups from a set of objects in such a way that the members of each group are as homogeneous as possible. It is a popular cluster analysis technique for exploring a data set.

Wait a minute, what is cluster analysis? Cluster analysis is a family of algorithms designed to form groups in such a way that the group members are most similar to each other and not like elements that do not belong to the group. Cluster and group are synonyms in the world of cluster analysis.

Is there an example? Definitely. Suppose we have patient data. In cluster analysis these are called observations. We know something about each patient, such as age, pulse, blood pressure, maximal oxygen consumption, cholesterol, and so on. This gives us a vector representing each patient.

See:

You can think of this vector as a list of numbers that can be interpreted as coordinates of a multidimensional space. Pulse in one dimension, blood pressure in another, and so on.

There may be a question:

How do we group patients together by age, pulse, pressure using these vectors?
Want to know the good news?

You tell the k-means method how many clusters you need, and it does the rest.

How does this happen? The k-means method has many options for working with different types of data.

In general, they all do something like the following:

  1. The k-means method selects points in the multidimensional space to represent each of the k clusters. These points are called centroids.
  2. Each patient is assigned to the centroid it is closest to. Hopefully not all of them end up at the same centroid, so several clusters form.
  3. Now we have k clusters, and each patient is a member of exactly one of them.
  4. The k-means method then finds the center of each of the k clusters, based on its current members (this is where the patient vectors are used!).
  5. This calculated center becomes the new centroid of the cluster.
  6. Since the centroids have moved, patients may now be closer to other centroids. In other words, they may change cluster membership.
  7. Steps 2-6 are repeated until the centroids stop moving and the memberships stabilize. This is called convergence.
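Here is a minimal NumPy sketch of the loop described in steps 1-7. The toy patient vectors (pulse, blood pressure) are invented; a real project would more likely use a library implementation such as scikit-learn's KMeans.

```python
# A minimal NumPy sketch of the k-means loop described above.
# The toy "patient" vectors (pulse, blood pressure) are invented for illustration.
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k of the points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 2-3: assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Steps 4-5: the mean of each cluster becomes its new centroid.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Steps 6-7: stop once the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

patients = np.array([[62, 115], [65, 120], [60, 118], [95, 160], [98, 165], [92, 158]], float)
labels, centroids = kmeans(patients, k=2)
print(labels)     # cluster membership of each patient
print(centroids)  # final cluster centers
```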

Is this supervised or unsupervised? It depends, but most people consider k-means unsupervised. Other than specifying the number of clusters, k-means "learns" the clusters on its own, without any information about which cluster an observation belongs to. k-means can also be semi-supervised .

Why use the k-means method? I do not think that many will argue:

The main advantage of the algorithm is its simplicity. Simplicity usually means faster execution and efficiency compared to other algorithms, especially when working with large data sets.

Further better:

The k-means method can be used to quickly pre-partition a large data set into groups, after which a more powerful cluster analysis is run on the sub-clusters. k-means can also be used to estimate the number of clusters and to check for overlooked patterns and relationships in the data set.

But not everything is so smooth:

The two main disadvantages of k-means are its sensitivity to outliers and to the initial choice of centroids. You also need to remember that k-means is designed for continuous values, so you have to apply a few tricks to make the algorithm work with discrete data.

Where is it used? A huge number of implementations of the k-means method are available online:


If decision trees and clustering did not impress you, then you just have to like the following algorithm ...

3. Support vector machine


What does it do? The support vector machine (SVM) uses a hyperplane to classify data into 2 classes. At a high level, SVM performs the same task as C4.5, with one difference: SVM does not use decision trees at all.

Hold on, a hyper-what? A hyperplane is a function, like the equation of a line, y = mx + b. In fact, for a simple classification problem with just 2 features, the hyperplane can be a line.

As it turned out…

SVM allows you to project your data into a space of higher dimension. When data is projected ...

... SVM defines the best hyperplane, which divides the data into 2 classes.

Can I see an example? Of course. The simplest example I have found is one with a bunch of red and blue balls on a table. If the balls are not hopelessly mixed up, you can take a stick and, without moving the balls, separate the two colours with it.

You see:

When a new ball is added to a table, you can predict its color, knowing which part of the table it has hit.

What are the balls, the table and the wand? Balls are data, and red and blue are two classes. The stick is the simplest hyperplane, that is, the line.

What about the coolest part?

SVM independently determines the function of the hyperplane.

What if everything is much more complicated? True, it often happens. If the balls are mixed with each other, then a simple stick will not help here.

Here is what you need to do:

Quickly raise the table - the balls will fly into the air. While the balls are in the air and in the correct positions, you can separate the red and blue ones with a large piece of paper.

You might think this is cheating:

Nope. Lifting the table into the air is the equivalent of mapping the data into a higher-dimensional space. In this case we move from the flat surface of the table to the three-dimensional positions of the balls in the air.

How does SVM do this? Using kernels gives us an elegant tool for working in higher-dimensional spaces. The large sheet of paper is still called a hyperplane, but it is now the equation of a plane rather than a line. As Yuval Merhav put it, once we move to the third dimension, the hyperplane has to become a plane.

I believe that this visualization helps to understand the principle of how SVM works:



On Reddit there are 2 excellent threads on this topic, in the ELI5 and ML subreddits.

How do balls on a table or in the air relate to real data? A ball on the table has a location that can be described by coordinates. For example, a ball may be 20 cm from the left edge of the table and 50 cm from the bottom edge. In other words, the (x, y) coordinates of the ball are (20, 50). Together, x and y form a two-dimensional space.

This is what happens:

If we have a patient data set, then each patient can be described by various parameters, such as pulse, cholesterol level, pressure, and so on. Each of these parameters is a measurement.

As a result:

SVM maps these parameters to a higher dimension, and then finds a hyperplane to separate the classes.

The term margin is often associated with SVM. What is it? The hyperplane margin is the distance between the hyperplane and the 2 nearest data points of each class. In the ball and table example, the margin is the distance between the stick and the nearest red and blue ball.

The bottom line is that SVM tries to maximize the margin so that the hyperplane is approximately the same distance from the red and blue balls - this reduces the chance of classification errors.

Where did SVM get its name? The hyperplane is equidistant from the red and blue balls. These balls are data points called support vectors because they support the hyperplane.

Is this supervised or unsupervised? This is supervised learning. A labeled data set is used to show SVM the classes; only after that is it able to classify new data.
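As an illustration, here is a short scikit-learn sketch of the "balls on a table" story. The 2-D points and their red/blue labels are made up, and the RBF kernel stands in for "lifting the table" into a higher dimension.

```python
# A short scikit-learn sketch of the balls-on-a-table example.
# The RBF kernel plays the role of lifting the table into a higher dimension;
# the 2-D points and their red/blue labels below are invented.
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9], [5, 5]]   # ball positions on the table
y = ["blue", "blue", "blue", "red", "red", "red", "red"]        # ball colours (the 2 classes)

clf = SVC(kernel="rbf", C=1.0)   # the kernel trick: an implicit map to a higher dimension
clf.fit(X, y)

print(clf.predict([[2, 2], [7, 9]]))   # which side of the hyperplane do new balls land on?
print(clf.support_vectors_)            # the "support vectors" that hold up the hyperplane
```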

Why use SVM? SVM and C4.5 are the two classifiers to try first. But, according to the No Free Lunch theorem, no single classification method is universally best. I should add that the weak points of this method are the need to choose a kernel and its poor interpretability.

Where is it used? There are many SVM implementations. The most popular ones are scikit-learn , MATLAB and of course libsvm .

The following algorithm is one of my favorites ...

4. Algorithm Apriori


What does it do? The Apriori algorithm looks for association rules and is applied to databases containing huge numbers of transactions.

What are association rules? Association rule learning is a data mining technique for discovering relationships and co-occurrences between variables in a database.

What does an example of using Apriori look like? Suppose we have a database of supermarket transactions. You can think of the database as a huge table in which each row is a transaction and each column represents a different grocery item.



Good news:

By applying the Apriori algorithm, we can determine which items are purchased together — that is, establish association rules.

What does this give us:

We can identify products that are often bought together, and the main goal of marketing is to get customers to buy more. Items that are frequently bought together are called itemsets.

For example:

You may notice that chips, dip and soda often sit on nearby shelves. A pair of items bought together, such as chips and dip, is called a 2-itemset. Once the database gets large enough, it becomes much harder to "see" such relationships, especially when you are dealing with 3-itemsets or larger. That is exactly what Apriori was created for.

How does the Apriori algorithm work? Before you get to the essence of the algorithm, you need to define 3 parameters:

  1. First, set the size of the itemset: do you want to find 2-itemsets, 3-itemsets, or something larger?
  2. Second, define support — the number of transactions containing the itemset divided by the total number of transactions. An itemset that meets the support threshold is called a frequent itemset.
  3. Third, define confidence — the conditional probability of an item appearing in a basket given that the other items of the set are already in it. Example: the chips in your itemset have a 67% confidence of being in a basket that also contains soda.

The basic Apriori algorithm consists of three steps:

  1. Join. Scan the database and count how often each individual item occurs.
  2. Prune. The itemsets that satisfy the support and confidence thresholds move on to the next iteration, which deals with 2-itemsets.
  3. Repeat. The previous two steps are repeated for each itemset size until the previously defined size is reached.
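To make the support and confidence ideas above concrete, here is a small pure-Python sketch that counts 2-itemsets over a handful of invented transactions. It deliberately skips the candidate-generation and pruning machinery of the full algorithm.

```python
# A pure-Python sketch of the idea behind Apriori: count the support of 2-itemsets
# and the confidence of the rules they generate. The toy transactions are invented,
# and the candidate-pruning machinery of the full algorithm is omitted for brevity.
from itertools import combinations
from collections import Counter

transactions = [
    {"chips", "dip", "soda"},
    {"chips", "dip"},
    {"chips", "soda"},
    {"bread", "milk"},
    {"chips", "dip", "beer"},
]

min_support = 0.4   # the itemset must appear in at least 40% of transactions
item_counts = Counter()
pair_counts = Counter()
for t in transactions:
    for item in t:
        item_counts[item] += 1
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

n = len(transactions)
frequent_pairs = {p: c / n for p, c in pair_counts.items() if c / n >= min_support}
for (a, b), support in frequent_pairs.items():
    confidence = pair_counts[(a, b)] / item_counts[a]   # P(b in basket | a in basket)
    print(f"{{{a}, {b}}}  support={support:.2f}  confidence({a} -> {b})={confidence:.2f}")
```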

Is this supervised or unsupervised? Apriori is generally considered an unsupervised approach, since it is often used to discover interesting patterns and relationships.

Something else…

There is a modification of the Apriori algorithm that can classify labeled data.

Why choose Apriori? It is simple, clear, easy to implement and has many modifications .

On the other hand…

In the process, the algorithm can be quite resource-intensive; calculations can take a lot of time.

Where is it used? There is a huge number of implementations of Apriori. Some of the most popular are ARtool , Weka and Orange .

The following algorithm was the most difficult for me to understand, let's take a look at it ...

5. EM algorithm


What does it do? In data mining, the expectation-maximization (EM) algorithm is usually used as a clustering algorithm (like k-means) for knowledge discovery.

In mathematical statistics, the EM algorithm is an iterative method for finding maximum likelihood estimates of the parameters of a statistical model that depends on hidden (latent) variables.

Let me explain ...

I am not a statistician, so I hope that my simplification will be correct and will help you understand everything.

Here are a few concepts that will make everything easier ...

What is a statistical model? I see the model as something that describes the known data. For example, exam grades may correspond to a normal distribution, so the assumption that grades are generated according to a normal distribution is a model.

Wait a second, what is the distribution? The distribution represents the probabilities of occurrence of all measurable results. For example, exam grades may correspond to a normal distribution. This normal distribution represents all the probabilities of obtaining one estimate or another.

In other words, the distribution helps to understand how many people who take the exam will receive this or that grade.

Cool, and what are the parameters of a model? A parameter describes a distribution that is part of the model. For example, a normal distribution is described by its mean and variance.

In the exam example, the distribution of grades (measurable outcomes) fits into the normal distribution. The arithmetic average is 85, and the variance is 100.

In order to describe the normal distribution, you need only two parameters:

  1. The mean
  2. The variance
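To see what it means for two numbers to describe a whole distribution, here is a tiny snippet that evaluates the normal density with the mean of 85 and variance of 100 from the exam example above.

```python
# A tiny illustration of how two parameters describe the whole distribution:
# with mean 85 and variance 100 (the exam example above) we can compute how
# probable any particular grade is.
import math

def normal_pdf(x, mean=85.0, variance=100.0):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

for grade in (65, 75, 85, 95):
    print(grade, round(normal_pdf(grade), 4))   # grades near the mean are the most probable
```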

And what is likelihood? Let's return to the normal distribution example. Suppose we have a large set of exam grades, but we only know some of them.

That is the point:

We do not know the mean or variance of all the grades, but we can estimate them from the sample we have. Likelihood is the probability that the normal distribution curve with the estimated mean and variance describes the exam results accurately.

In other words, given a set of measurable outcomes, we estimate the parameters of the model. Based on these estimated parameters, the hypothetical probability of observing those outcomes is computed; this is called likelihood.
Remember that this is the hypothetical probability of the grades that have already been observed, not the probability of grades we might see in the future.

You are probably wondering: then what is probability?

Suppose we know the mean and variance. The grades follow a normal distribution. The chance of observing certain grades with certain frequencies in the future is called probability.

To put it in simple terms, we estimate the possible outcomes based on parameters.

Fine! And what is the difference between observed data and hidden data? Observed data is data that you have actually observed or recorded. Hidden data is data that is missing, and there are many reasons why it may be missing (not recorded, ignored, and so on).

Here's the catch:

In data mining and clustering, the important point is that the class of a data point is treated as missing data. We do not know the classes, so interpreting the missing data correctly is crucial when applying EM to a clustering problem.

To repeat: the EM algorithm is iterative and is used to find maximum likelihood estimates for the parameters of a probabilistic model when the model depends on hidden variables. I hope this is clearer now.

And now the good news:

By evaluating maximum likelihood, the EM algorithm creates an excellent model that assigns class labels to data points — just like clustering.

How does EM help with clustering? The EM algorithm starts with an initial guess of the model parameters.

Then follows an iterative three-step process:

  1. E-step: Based on the current model parameters, compute the probability that each data point belongs to each cluster.
  2. M-step: Update the model parameters according to the cluster assignments obtained in the E-step.
  3. The previous two steps are repeated until the model parameters and the cluster assignments stabilize (converge).
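Here is a compact NumPy sketch of this E-step / M-step loop for a mixture of two 1-D Gaussians. The "grades" are simulated, and in practice you would probably reach for a ready-made implementation such as scikit-learn's GaussianMixture.

```python
# A compact NumPy sketch of EM for a mixture of two 1-D Gaussians,
# mirroring the E-step / M-step loop above. The "grades" are simulated;
# real projects would use an existing implementation such as
# scikit-learn's GaussianMixture.
import numpy as np

rng = np.random.default_rng(0)
grades = np.concatenate([rng.normal(65, 5, 100), rng.normal(88, 4, 100)])  # two hidden groups

# Initial guesses for the model parameters.
mu = np.array([60.0, 90.0])
var = np.array([10.0, 10.0])
weight = np.array([0.5, 0.5])

def normal_pdf(x, mean, variance):
    return np.exp(-(x - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)

for _ in range(50):
    # E-step: probability that each grade belongs to each cluster.
    resp = weight * normal_pdf(grades[:, None], mu, var)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: update the parameters using those cluster probabilities.
    nk = resp.sum(axis=0)
    mu = (resp * grades[:, None]).sum(axis=0) / nk
    var = (resp * (grades[:, None] - mu) ** 2).sum(axis=0) / nk
    weight = nk / len(grades)

print(mu, var, weight)   # estimated means, variances and mixing weights
```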


Is this supervised or unsupervised? Since we do not provide the algorithm with labeled data, this is unsupervised learning.

Why use EM? The main advantage of the EM algorithm is that it is simple to implement. On top of that, it can optimize not only the model parameters but also make reasonable guesses about the values of missing data.

This makes EM a great method for clustering and creating models with parameters. Knowing the clusters and the parameters of the model, we can assume what the cluster contains and where it is worth taking the new data.

Although the EM algorithm has its drawbacks:

  1. As the number of iterations grows, the performance of the algorithm decreases.
  2. EM does not always find the optimal parameters and can get stuck in a local optimum without finding a global one.

Where is it used? The EM algorithm is implemented in Weka . R has an implementation in the mclust package. In scikit-learn there is an EM implementation in the gmm module.

What data mining method does Google use? Let's take a look.

6. PageRank


What does it do? PageRank is a link analysis algorithm designed to determine the relative importance of an object that is linked within a network of objects.

Sorry, link analysis? It is a type of network analysis that examines the associations (in other words, the links) between objects.

Here is an example: The most famous example of PageRank is the Google search engine. Although their search engine does not fully rely on PageRank, yet it is one of the methods that Google uses to determine the importance of a web page.

Let me explain:

Web pages on the Internet link to one another. If rayli.net links to CNN, CNN receives a vote, since rayli.net considers the CNN site relevant.

But that is not all…

The weight of the vote cast by rayli.net depends on the importance and relevance of rayli.net itself.
In other words, any web page that links to rayli.net increases rayli.net's own relevance.

The bottom line?

This concept of votes and relevance is PageRank. rayli.net's vote for CNN increases CNN's PageRank, and how much it increases depends on the influence and importance of rayli.net.

What do PageRank values of 0, 1, 2, 3 and so on mean? Although Google does not disclose exactly what the PageRank number corresponds to, we can get a general idea of it.

And here's how:



Do you see?

It all looks like a popularity contest. We all have some sense of which sites are relevant and popular. PageRank simply turns that sense into numbers.

How else is PageRank applied? PageRank has been specifically designed for the worldwide web.

Just think:

At its core, PageRank is simply a highly effective way to do link analysis, and the linked objects do not have to be web pages.

Here are 3 innovative PageRank applications:

  1. Dr. Stefano Allesina of the University of Chicago applied PageRank in ecology to determine which species are critical for sustaining an ecosystem.
  2. Twitter developed WTF (Who-to-Follow), a personalized, PageRank-based recommendation engine that shows a list of accounts worth following.
  3. Bin Jiang of Hong Kong Polytechnic University used a variant of PageRank to predict human movement based on topological metrics of London streets.

Is this supervised or unsupervised? PageRank is usually regarded as an unsupervised method, since it is most often used simply to determine the relative importance of web pages.

Why choose PageRank? Its main advantage is robustness, which comes from how hard it is to fake a relevant inbound link.

A simple note:

If you have a graph or network and you want to determine the relative importance, priority, ranking or relevance of its nodes, give PageRank a try.
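For a feel of how the votes propagate, here is a small power-iteration sketch over an invented four-page link graph. The pages and links are made up; for real graphs you would normally use a library implementation such as networkx's pagerank.

```python
# A small power-iteration sketch of PageRank on an invented four-page link graph.
# Each page spreads its current score evenly across its outgoing links, and the
# damping factor d models a surfer who occasionally jumps to a random page.
links = {            # page -> pages it links to (a made-up mini web)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
n = len(pages)
d = 0.85                               # damping factor
rank = {p: 1.0 / n for p in pages}     # start with equal ranks

for _ in range(50):                    # iterate until the ranks settle down
    new_rank = {p: (1 - d) / n for p in pages}
    for page, outgoing in links.items():
        share = d * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share  # each outgoing link is a weighted "vote"
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))       # "C" should come out as the most important page
```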

Where is it used? The PageRank trademark is owned by Google, while the PageRank algorithm itself is patented by Stanford University.

If you have a question about whether you can use PageRank:
I am not a lawyer, so it’s better to consult with knowledgeable people, but you can probably use the algorithm as much as you like, until it starts to bring you financial benefits.

Here are 3 examples of the implementation of PageRank:


Next, go to ...

7. AdaBoost


What does it do? AdaBoost is a boosting algorithm that builds a classifier.

As you remember, a classifier takes data whose classes are already known and tries to predict which class new data will belong to.

What is boosting? Boosting is an ensemble learning approach that takes several learning algorithms, for example decision trees, and combines them. The goal is to take a set or group of weak classifiers and combine them into one strong classifier.

What is the difference between a strong and a weak classifier? A weak classifier classifies with an accuracy only slightly better than chance. A popular example of a weak classifier is the so-called decision stump — a one-level decision tree.

A strong classifier, on the contrary, has much greater accuracy. The most common example of a strong classifier is SVM.

What does an example of AdaBoost look like? Let's start with three weak learners, which we will train over 10 rounds on a patient data set containing details from the patients' medical records.

The question is ...

How will AdaBoost build a single strong classifier out of them?

Here is how AdaBoost goes about it:

Round 1: AdaBoost takes a sample of the training data and checks how accurate each weak learner is. The best learner is kept and added to the ensemble; the rest are set aside for now.

In addition, the patients that the chosen learner misclassified are given a heavier weight, so that the next round pays more attention to them (the kept learner also receives a weight that reflects its accuracy).

Round 2: AdaBoost again looks for the best learner, but this time the training sample is skewed toward the records that were misclassified in round 1.

In other words:

The previously misclassified patients get another chance to be classified correctly, while the patients that were already handled well receive less attention.

Why?

Because the whole idea of boosting is to concentrate each new round on the mistakes of the previous one. As in round 1, the best learner is kept, weighted by its accuracy, and the misclassified records are re-weighted.

This select-weight-reweight cycle is repeated round after round.

Round 10: after the final round we are left with an ensemble of 10 weighted weak learners, repeatedly retrained on the data that earlier rounds got wrong. Their weighted vote is the final strong classifier.
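Here is a minimal scikit-learn sketch of these boosting rounds. The tiny patient table is invented; by default, AdaBoostClassifier already uses one-level decision trees (decision stumps) as its weak learners.

```python
# A minimal scikit-learn sketch of the boosting rounds described above.
# The tiny patient table is invented; by default AdaBoostClassifier uses
# one-level decision trees ("decision stumps") as its weak learners.
from sklearn.ensemble import AdaBoostClassifier

X = [[45, 120], [62, 145], [50, 130], [70, 160], [38, 118], [55, 150]]  # [age, blood pressure]
y = [0, 1, 0, 1, 0, 1]                                                  # invented class labels

model = AdaBoostClassifier(n_estimators=10)  # 10 boosting rounds over decision stumps
model.fit(X, y)

print(model.predict([[58, 148]]))    # the weighted vote of the 10 trained stumps
print(model.estimator_weights_)      # how much say each round's learner received
```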

Is this supervised or unsupervised? This is supervised learning, since each round trains the weak learners on a labeled data set.

Why use AdaBoost? AdaBoost is simple: the algorithm is relatively straightforward to program. It is also fast, because weak learners are usually much simpler than strong ones.

On top of that, it is an elegant way to tune a classifier automatically, since each successive round refines the weights of the best learners; all you have to specify is the number of rounds.

And one more thing ...

The method is flexible and versatile: AdaBoost can incorporate almost any learning algorithm and work with a wide variety of data.

On the other hand, AdaBoost is sensitive to noisy data and outliers, so it is worth cleaning the data beforehand.

Where is it used? AdaBoost has plenty of implementations, including scikit-learn and Weka.

Now let's move on to your nearest neighbors ...

8. k-nearest neighbors (kNN)


What does it do? kNN, or k-nearest neighbors, is a classification algorithm. However, unlike the classifiers described earlier, it is a lazy learner.

What is a lazy learner? A lazy learner does not do much during training beyond storing the training data. It only starts to work when new, unlabeled data needs to be classified.

An eager learner, by contrast, builds a classification model during training. When new unlabeled data arrives, it is simply fed into this ready-made model.

And what about C4.5, SVM and AdaBoost? Unlike kNN, they are all eager learners:

C4.5 builds a decision tree during training, SVM builds a hyperplane, and AdaBoost builds an ensemble of weighted classifiers.

So what does kNN do? kNN builds no such model. Instead, it simply stores the labeled training data.

When new unlabeled data arrives, kNN performs 2 basic steps:
First, it looks at the k labeled points that are closest to the new data point — in other words, its k nearest neighbors.

Second, using the classes of those neighbors, kNN decides how the new data point should be classified.

You may be wondering ...

How does kNN figure out what is closer? For continuous data, kNN uses a distance metric, for example Euclidean distance. The choice of metric depends largely on the data, and some even suggest learning a distance metric from the training data; there are plenty of details and papers on kNN distance metrics.

For discrete data, the idea is to turn it into continuous data. Two examples: using Hamming distance as a measure of the "closeness" of two strings, and converting discrete features into binary ones.

These 2 Stack Overflow threads have more suggestions on dealing with discrete data:

How does kNN classify new data when the neighbors disagree? kNN has an easy time when all the neighbors belong to the same class: intuitively, if all the neighbors agree, the new data point most likely belongs to that class too.

I bet you can guess where things get tricky ...

How does kNN decide on the class when the neighbors do not all share the same class?

Two common techniques are:

  1. Take a simple majority vote among the neighbors: whichever class gets the most votes becomes the class of the new data point.
  2. Take a similar vote, but give heavier weight to the closer neighbors. A simple way to do this is to use the reciprocal of the distance: if a neighbor is 5 units away, its vote gets a weight of 1/5. The farther the neighbor, the smaller the reciprocal distance — which is exactly what we want.
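Here is a tiny NumPy sketch of kNN using the distance-weighted vote from technique 2 above. The labeled "patients" and the query point are invented.

```python
# A tiny NumPy sketch of kNN with a distance-weighted vote (technique 2 above).
# The labeled "patients" and the query point are invented.
import numpy as np
from collections import defaultdict

train_X = np.array([[62, 115], [65, 120], [60, 118], [95, 160], [98, 165]], float)
train_y = ["healthy", "healthy", "healthy", "at_risk", "at_risk"]

def knn_predict(x, k=3):
    # Step 1: find the k nearest labeled points (Euclidean distance).
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Step 2: let the neighbors vote, weighting each vote by 1/distance.
    votes = defaultdict(float)
    for i in nearest:
        votes[train_y[i]] += 1.0 / (dists[i] + 1e-9)   # avoid division by zero
    return max(votes, key=votes.get)

print(knn_predict(np.array([90.0, 150.0])))   # -> "at_risk"
```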

Is this supervised or unsupervised? This is supervised learning, since kNN is given a labeled training data set.

Why use kNN? Ease of understanding and implementation are two of the main reasons to use kNN, and depending on the distance metric it can also be quite accurate.

But that is only part of the story ...

Here are 5 things to watch out for:

  1. kNN can become very expensive computationally when determining the nearest neighbors in a large data set.
  2. Noisy data can throw off kNN classifications.
  3. Features with a larger range of values can dominate the distance metric over features with a smaller range, so feature scaling matters.
  4. Since all the work is deferred to classification time, kNN generally needs more storage than an eager classifier.
  5. Choosing a sensible distance metric is crucial to kNN's accuracy.

Where is it used? There are many kNN implementations, for example in MATLAB, scikit-learn and R.

Spam or not spam? The next algorithm deals with questions like that ...

9. Naive Bayes


What does it do? Naive Bayes is not a single algorithm but a family of classification algorithms that share one common assumption:
every feature of the data being classified is independent of all the other features, given the class.

What does "independent" mean? Two features are independent when the value of one has no effect on the value of the other.

For example:

Suppose we have a patient data set with features such as pulse, cholesterol level, weight, height and zip code. All features would be independent if the value of any one of them had no effect on any other. For this data set it is reasonable to assume that a patient's height and zip code are independent, since height has little to do with where someone lives.

But are the other features independent too?

Sadly, no. Here are 3 relationships between features that are clearly not independent: if height increases, weight is likely to increase; if cholesterol increases, weight is likely to increase; and if cholesterol increases, pulse is likely to increase as well.

In my experience, the features of a data set are rarely all independent.

…

Why is it called naive? The assumption that all the features of a data set are independent is precisely what makes it naive — in real data sets this is almost never the case.

Who is Bayes? Thomas Bayes was an English statistician after whom Bayes' theorem is named.

In essence, the theorem lets us predict a class from a set of features using probabilities.

For a classification problem with two features, the simplified equation looks like this:

P(Class A | Feature 1, Feature 2) = P(Feature 1 | Class A) x P(Feature 2 | Class A) x P(Class A) / P(Feature 1, Feature 2)

What does it mean? The equation gives the probability of class A given features 1 and 2: in other words, if we observe features 1 and 2, this is the probability that the data belongs to class A.

In this equation, for features 1 and 2:

  1. The numerator is the probability of feature 1 given class A, multiplied by the probability of feature 2 given class A, multiplied by the probability of class A.
  2. The denominator is the probability of feature 1 and feature 2 occurring together.

What does an example of Naive Bayes look like? Below is a nice example taken from a Stack Overflow thread.

Here is the deal:

  1. We have a training data set of 1,000 fruits.
  2. A fruit can be a banana, an orange or some other fruit (these are the classes).
  3. A fruit can be long, sweet and/or yellow (these are the features).



What do we see in this training data set?

Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow; none of the 300 oranges are long; and of the remaining 200 fruits, 100 are long, 150 are sweet and 50 are yellow.

Now suppose we are told only that a fruit is long, sweet and yellow, and we want to predict its class.

Here is how we calculate the probabilities in 4 steps:

Step 1: To compute the probability that the fruit is a banana, first notice that this looks familiar — it is the probability of the class "Banana" given the features "Long", "Sweet" and "Yellow", or, more compactly:

P(Banana|Long, Sweet, Yellow)

This is exactly the equation discussed earlier.

Step 2: Start with the numerator and plug in all the values:

P(Long|Banana) = 0.8, P(Sweet|Banana) = 0.7, P(Yellow|Banana) = 0.9, P(Banana) = 0.5.

Multiplying them together (as in the equation), we get:

0.8 x 0.7 x 0.9 x 0.5 = 0.252

Step 3: Ignore the denominator, since it will be the same for every other class.

Step 4: Run a similar calculation for the other classes:

P(Orange|Long, Sweet, Yellow) = 0, since no orange in the training set is long.
P(Other|Long, Sweet, Yellow) = 0.01875.

Since 0.252 is greater than 0.01875, Naive Bayes classifies this long, sweet and yellow fruit as a banana.
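The whole calculation fits in a few lines of Python. The banana numbers come from the text above; the conditional probabilities for oranges and "other fruit" are assumed values, chosen only to be consistent with the 0 and 0.01875 results.

```python
# The fruit example as a few lines of arithmetic. The banana numbers come from the
# text above; the orange and "other fruit" conditional probabilities are assumed
# values, chosen to be consistent with the 0 and 0.01875 results.
priors = {"banana": 0.5, "orange": 0.3, "other": 0.2}
cond = {   # P(feature | class)
    "banana": {"long": 0.8, "sweet": 0.7, "yellow": 0.9},
    "orange": {"long": 0.0, "sweet": 0.5, "yellow": 1.0},
    "other":  {"long": 0.5, "sweet": 0.75, "yellow": 0.25},
}

def naive_bayes_score(fruit_class, features):
    score = priors[fruit_class]
    for f in features:
        score *= cond[fruit_class][f]   # the "naive" step: multiply as if independent
    return score

features = ["long", "sweet", "yellow"]
scores = {c: naive_bayes_score(c, features) for c in priors}
print(scores)                      # {'banana': 0.252, 'orange': 0.0, 'other': 0.01875}
print(max(scores, key=scores.get)) # -> 'banana'
```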

Is this supervised or unsupervised? This is supervised learning, since Naive Bayes needs a labeled training data set to build the frequency tables.

Why use Naive Bayes? As the example above shows, Naive Bayes boils down to simple arithmetic: counting, multiplying and dividing.

Once the frequency tables are computed, classifying an unknown fruit just means calculating the probabilities for all the classes and picking the highest one.

Despite its simplicity, Naive Bayes is often surprisingly accurate; for example, it has proven very effective for spam filtering.

Where is it used? Implementations of Naive Bayes can be found in Orange , scikit-learn , Weka and R .

And finally, let's take a look at the 10th algorithm ...

10. CART


What does it do? CART stands for classification and regression trees. It is a decision tree learning technique that outputs either a classification or a regression tree. Like C4.5, CART is a classifier.

Is a classification tree like a decision tree? A classification tree is one kind of decision tree. Its output is a class.

For example, given a patient data set, you might try to predict whether a patient will get cancer; the class would be either "will get cancer" or "won't get cancer".

And what is a regression tree? Unlike a classification tree, which predicts a class, a regression tree predicts a numeric value, for example the length of a patient's hospital stay or the price of a smartphone.

An easy way to remember it ...

Classification trees output classes, regression trees output numbers.
Since we have already seen how decision trees classify data, let's go straight to the comparison:

How does CART differ from C4.5?

C4.5 splits nodes using information gain (an entropy-based measure), while CART uses the Gini index (there is a good discussion of the difference on Stack Overflow). C4.5 can produce multiway splits, i.e. a decision node can have more than two branches, whereas CART trees are strictly binary. And unlike C4.5, CART builds not only classification trees ("which class?") but also regression trees ("what number?").

Is this supervised or unsupervised? CART is a supervised learning technique, since it needs a labeled training data set to build the classification or regression tree.

Why use CART? Many of the reasons for using C4.5 apply to CART as well, since both are decision tree learning techniques: the results are easy to interpret and explain, and the trees are built quickly.

Like C4.5, CART is also very popular, and its output is human-readable.

Where is it used? scikit-learn implements CART in its decision tree classifier. The tree package in R also has an implementation of CART, and implementations exist in Weka and MATLAB as well.
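Since scikit-learn's decision trees are based on an optimized version of CART, a regression tree takes only a few lines. The toy age-to-length-of-stay data below is invented for illustration.

```python
# scikit-learn's decision trees are based on an optimized version of CART,
# so a regression tree is only a few lines. The tiny data set (patient age ->
# length of hospital stay, in days) is invented.
from sklearn.tree import DecisionTreeRegressor

X = [[25], [32], [41], [55], [63], [72]]   # attribute: age
y = [2.0, 2.5, 3.0, 5.5, 6.0, 7.5]         # numeric target: length of stay (days)

reg = DecisionTreeRegressor(max_depth=2)   # a small regression tree
reg.fit(X, y)

print(reg.predict([[48], [70]]))           # the leaves output numbers, not classes
```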

Finally, Salford Systems has the only implementation of the original CART code, based on the theory of the statisticians who invented the algorithm.

Source: https://habr.com/ru/post/262155/

