
Machine learning: what you need to know about creating exchange trading strategies. Part IV



On Habr and in the analytical section of our site, we write a lot about financial market trends and continue our series of materials on creating exchange trading strategies, based on articles by the author of the Financial Hacker blog. In previous parts we talked about exploiting market inefficiencies, using the story of the Swiss franc price cap as an example, examined important factors that affect a strategy's effectiveness, and discussed general principles for developing model-based trading systems.

Today we will talk about using data mining and machine learning for these purposes.
In 1996, the Deep Blue computer defeated the reigning world chess champion for the first time. It took another 20 years before the AlphaGo program won a match against one of the world's strongest Go players, losing only one game. Deep Blue was a model-based system with a rigid set of chess rules. AlphaGo uses data mining: it is a neural network trained on thousands of example Go games. Unlike chess, the number of possible moves in Go is so huge that brute force does not help. The breakthrough therefore came not from better hardware, but from new software.

Today we will look at an approach that uses data mining to develop trading strategies without an in-depth analysis of market mechanisms. Instead, it mines information from the price curve and other data sources in search of predictable anomalies. Machine learning, or "artificial intelligence", is not always an indispensable part of such a strategy. In practice, the most popular and most successful application of this method works without any fancy neural networks or support vector machines.

Principles of machine learning


At the core of any learning algorithm is the concept of training patterns, usually built from historical price data. Each pattern consists of n variables x1 ... xn, usually called prediction markers (predictors) or simply features. Such predictors can be the price returns of the last n bars, a set of classical indicators, or any other functions of the price curve. Each pattern also includes a target value y, for example the profit of the next trade after the pattern appears, or the next price movement. During training, the algorithm learns to derive the target value from the predictors. This knowledge is stored in a data structure that we call a model, and which is specific to each algorithm. The model can be a C function describing the prediction rules developed during training, or it can be a set of connection weights in a neural network.

Training: x1 ... xn, y => model
Prediction: x1 ... xn , model => y
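
As a minimal illustration of these two lines (our own sketch, not part of Zorro), a single training pattern with its predictors and target could be represented in C like this:

 #include <stdio.h>

 #define N_PREDICTORS 4   /* hypothetical number of markers x1..xn */

 /* One training pattern: predictor values plus the target value y */
 typedef struct {
     double x[N_PREDICTORS];  /* prediction markers, e.g. recent price returns */
     double y;                /* target, e.g. profit of the next trade */
 } Pattern;

 int main(void) {
     Pattern p = { { 0.12, -0.03, 0.08, -0.01 }, 0.25 };
     printf("target y = %.2f, first predictor x1 = %.2f\n", p.y, p.x[0]);
     return 0;
 }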

The predictors must carry the information needed to predict the target value. They must also meet two formal conditions. First, all marker values must be in the same range (for example, -1 ... +1 for algorithms in R or -100 ... +100 for algorithms in the Zorro language), which means they need to be normalized before being fed to the trading "engine". Second, the patterns must be balanced, that is, evenly distributed over the values of the target: there should be about as many patterns describing winning trades as losing ones.
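
A simple sketch of such a normalization step (illustrative code with hypothetical names, not a Zorro built-in): min-max rescaling of a raw predictor series into the -1 ... +1 range.

 #include <stdio.h>

 /* Rescale n raw predictor values into the -1 ... +1 range (min-max normalization). */
 void normalize(double *x, int n) {
     double lo = x[0], hi = x[0];
     int i;
     for (i = 1; i < n; i++) {
         if (x[i] < lo) lo = x[i];
         if (x[i] > hi) hi = x[i];
     }
     for (i = 0; i < n; i++)
         x[i] = (hi > lo) ? 2.0 * (x[i] - lo) / (hi - lo) - 1.0 : 0.0;
 }

 int main(void) {
     double returns[5] = { 0.4, -1.2, 0.9, 0.1, -0.5 };  /* raw price returns */
     normalize(returns, 5);
     for (int i = 0; i < 5; i++) printf("%.3f ", returns[i]);
     printf("\n");
     return 0;
 }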

Regression algorithms predict numeric values, such as the magnitude of the next price change. Classification algorithms predict the class a pattern belongss to, for example profit or loss. Some algorithms, such as neural networks or support vector machines, can run in both modes. A few algorithms do not need a target value at all to divide patterns into classes; this is so-called unsupervised learning, as opposed to the usual supervised learning.

Whatever signals we use as prediction markers, most of them will contain a lot of noise and little useful information. Financial forecasting is therefore one of the hardest tasks in machine learning, and more complex algorithms do not always give better results. For final success, the choice of predictors is critical. For this reason, predictive strategies often include a preliminary selection algorithm that picks a few useful prediction markers out of a multitude of candidates, based on their correlation, their significance, or simply on whether they pass a test.

Below we discuss the most popular data mining methods used in the world of finance.

Trial and error method


Most of the trading systems that the author of the Financial Hacker blog develops for clients are not based on a financial model at all. The customer wants trade signals based on specific technical indicators, filtered with more technical indicators, combined with yet more technical indicators. Usually nobody can really answer how this mess of indicators can be a working strategy. The answer is usually: "Just believe me. I have been trading this way by hand for many years, and it works."

And indeed it does, at least in some cases. Although many of these systems would not pass a walk-forward analysis (some would not even pass a basic backtest), most cope well with their task. The client systematically experiments with technical indicators until he finds a combination that works in the real market with the selected assets. Trial and error is a classic data mining approach; it is simply performed by a human rather than a machine. Sometimes it gives a good result.

Candlestick patterns


It makes no sense to dwell on outdated techniques such as Japanese candlestick patterns, which were popular 200 years ago. The modern equivalent of candlestick patterns is indicator-free price action analysis. Here traders still try to find a pattern that predicts the price movement, but they analyze the current price curve itself. There is a set of special programs for this purpose: they select suitable patterns according to user-defined criteria and use them to build a detection function. On the Zorro system this might look like this:

 int detect(double* sig)
 {
   if(sig[1]<sig[2] && sig[4]<sig[0] && sig[0]<sig[5] && sig[5]<sig[3]
     && sig[10]<sig[11] && sig[11]<sig[7] && sig[7]<sig[8] && sig[8]<sig[9] && sig[9]<sig[6])
       return 1;
   if(sig[4]<sig[1] && sig[1]<sig[2] && sig[2]<sig[5] && sig[5]<sig[3] && sig[3]<sig[0]
     && sig[7]<sig[8] && sig[10]<sig[6] && sig[6]<sig[11] && sig[11]<sig[9])
       return 1;
   if(sig[1]<sig[4] && eqF(sig[4]-sig[5]) && sig[5]<sig[2] && sig[2]<sig[3] && sig[3]<sig[0]
     && sig[10]<sig[7] && sig[8]<sig[6] && sig[6]<sig[11] && sig[11]<sig[9])
       return 1;
   if(sig[1]<sig[4] && sig[4]<sig[5] && sig[5]<sig[2] && sig[2]<sig[0] && sig[0]<sig[3]
     && sig[7]<sig[8] && sig[10]<sig[11] && sig[11]<sig[9] && sig[9]<sig[6])
       return 1;
   if(sig[1]<sig[2] && sig[4]<sig[5] && sig[5]<sig[3] && sig[3]<sig[0]
     && sig[10]<sig[7] && sig[7]<sig[8] && sig[8]<sig[6] && sig[6]<sig[11] && sig[11]<sig[9])
       return 1;
   // ... more patterns
   return 0;
 }

This C function returns 1 when the signals match one of the patterns, and 0 otherwise. Looking at the code, you can see that this is not the fastest way to check for patterns; a better alternative is to sort the signals by value once and then check the resulting sort order, as in the sketch below.
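
A rough sketch of that faster variant (hypothetical helper code, not taken from Zorro): sort the signal indices by value once and compare the resulting order with a stored pattern order.

 #include <stdio.h>
 #include <stdlib.h>

 /* Index/value pair used to sort the signals once (supports up to 16 signals) */
 typedef struct { int idx; double val; } SigRank;

 static int cmp(const void *a, const void *b) {
     double d = ((const SigRank*)a)->val - ((const SigRank*)b)->val;
     return (d > 0) - (d < 0);
 }

 /* Return 1 if the signals, sorted ascending, appear in the stored pattern order */
 int match_pattern(const double *sig, const int *pattern, int n) {
     SigRank r[16];
     int i;
     for (i = 0; i < n; i++) { r[i].idx = i; r[i].val = sig[i]; }
     qsort(r, n, sizeof(SigRank), cmp);
     for (i = 0; i < n; i++)
         if (r[i].idx != pattern[i]) return 0;
     return 1;
 }

 int main(void) {
     double sig[4] = { 0.7, 0.2, 0.9, 0.4 };
     int pattern[4] = { 1, 3, 0, 2 };   /* expected ascending order of the signals */
     printf("match = %d\n", match_pattern(sig, pattern, 4));
     return 0;
 }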

Even though it uses data mining techniques, this kind of price action trading should have some rational basis. One can imagine that certain sequences of price movements cause a certain reaction of market participants, and that this constitutes a predictive pattern. The number of such patterns is always limited if you only look at sequences of adjacent candles. The next step is to compare candles that lie some distance apart, chosen arbitrarily over a sufficiently long period. In that case the number of patterns is practically limitless, but it is also easy to lose any rational ground: it is hard to imagine that today's price movement can be predicted by a candlestick pattern from a week ago. In general, the task of finding candlestick patterns is extremely complex and fraught with errors.

Linear regression


The basic idea behind many complex machine learning algorithms is simple: predict the target value y with a linear combination of the predictors x1 ... xn:

y = a0 + a1*x1 + a2*x2 + ... + an*xn

The coefficients an are calculated so as to minimize the sum of squared differences between the true target values y of the training patterns and the values of y predicted by this formula:

Sum over all training patterns: (y - y_predicted)^2 -> minimum

For normally distributed patterns, the minimization can be done with some matrix arithmetic, so no iterations are required. In the case of n = 1, with a single predictor variable x, the regression formula simplifies to:

y = a + b*x

This is simple linear regression, available on most trading platforms. If y = price and x = time, it is often used as an alternative to a moving average. There is also polynomial regression, which still has a single predictor x but also includes its square and higher powers, so that the n-th predictor is xn = x^n:

y = a0 + a1*x + a2*x^2 + ... + an*x^n
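
For illustration, here is a small sketch (our own code, not a platform function) that fits the simple one-predictor regression y = a + b*x by ordinary least squares:

 #include <stdio.h>

 /* Fit y = a + b*x by ordinary least squares over n training patterns. */
 void fit_linear(const double *x, const double *y, int n, double *a, double *b) {
     double sx = 0, sy = 0, sxx = 0, sxy = 0;
     int i;
     for (i = 0; i < n; i++) {
         sx  += x[i];
         sy  += y[i];
         sxx += x[i] * x[i];
         sxy += x[i] * y[i];
     }
     *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  /* slope */
     *a = (sy - *b * sx) / n;                         /* intercept */
 }

 int main(void) {
     double x[5] = { 1, 2, 3, 4, 5 };             /* e.g. time */
     double y[5] = { 1.1, 1.9, 3.2, 3.9, 5.1 };   /* e.g. price */
     double a, b;
     fit_linear(x, y, 5, &a, &b);
     printf("y = %.3f + %.3f * x\n", a, b);
     return 0;
 }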

Perceptron


This method is often described as a neural network with a single neuron. In fact, a perceptron is the same regression function discussed above, but with a binary output, which is why it is also called logistic regression, although in essence it is a classification algorithm. The advise(PERCEPTRON, ...) function in Zorro generates C code that returns 100 or -100, depending on whether the predicted result is above or below a given threshold:

 int predict(double* sig)
 {
   if(-27.99*sig[0] + 1.24*sig[1] - 3.54*sig[2] > -21.50)
     return 100;
   else
     return -100;
 }

This fragment shows that the sig array corresponds to our prediction markers xn in the regression formula, and the numerical factors are the coefficients an.

Neural networks


Linear or logistic regression can only solve linear problems, and many problems simply do not fall into this category. An artificial neural network (ANN) is designed to solve non-linear problems. It is a bundle of perceptrons connected in a set of layers; each perceptron is one neuron of the network. Here are the input and output of such a network:

[Figure: feed-forward neural network with input neurons, hidden layers, and an output neuron]

The neural network is trained by determining the coefficients (weights) that minimize the difference between the prediction and the target. This now requires an approximation process, usually backpropagation of the error from the output back to the inputs, adjusting the weights along the way.

This process imposes two limitations. First, the neuron outputs must now be continuously differentiable functions instead of simple perceptron thresholds. Second, the network must not be too deep: too many hidden layers between inputs and outputs should be avoided. This, of course, limits the complexity of the problems a simple neural network can solve.

If you use a neural network for trading, there are many parameters to vary and tune, and careless handling of them easily produces distorted results:


The activation function emulates the perceptron threshold. For backpropagation of the error, a continuously differentiable function is needed that produces a "soft" step at a certain value of x; the sigmoid, tanh, or softmax functions are commonly used. Such a network can also be used for regression, predicting numeric values instead of a binary output.
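
As an illustration, the usual "soft step" activations could be written in C as follows (a sketch for comparison with the hard perceptron threshold, not library code):

 #include <stdio.h>
 #include <math.h>

 /* Sigmoid: "soft" step from 0 to 1, continuously differentiable */
 double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

 /* tanh: soft step from -1 to +1, also differentiable everywhere */
 double tanh_act(double x) { return tanh(x); }

 /* Hard perceptron threshold for comparison, not differentiable at 0 */
 double threshold(double x) { return x > 0 ? 1.0 : 0.0; }

 int main(void) {
     for (double x = -2; x <= 2; x += 1)
         printf("x=%5.1f  sigmoid=%.3f  tanh=%.3f  step=%.0f\n",
                x, sigmoid(x), tanh_act(x), threshold(x));
     return 0;
 }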

Deep learning


If we are talking about many hidden layers and thousands of neurons, this is deep learning, and standard backpropagation no longer works well. Over the past few years, several popular methods have appeared for training such huge networks. They usually include a pre-training stage for the hidden layers. One option is the restricted Boltzmann machine, an unsupervised classification algorithm with a special network structure in which there are no connections between hidden neurons. Another option is the sparse autoencoder, which uses a standard network structure and pre-trains the hidden layers by reproducing the input signals at the layer outputs with as few active connections as possible. Such methods already make it possible to solve serious problems, for example to beat the best Go player in the world.

Below is an example of an R script that uses an autoencoder with three hidden layers to generate trading signals through Zorro's neural() interface functions:

 library('deepnet', quietly = T)
 library('caret', quietly = T)

 # called by Zorro for training
 neural.train = function(model,XY)
 {
   XY <- as.matrix(XY)
   X <- XY[,-ncol(XY)]    # predictors
   Y <- XY[,ncol(XY)]     # target
   Y <- ifelse(Y > 0,1,0) # convert -1..1 to 0..1
   Models[[model]] <<- sae.dnn.train(X,Y,
       hidden = c(20,20,20),
       activationfun = "tanh",
       learningrate = 0.5,
       momentum = 0.5,
       learningrate_scale = 1.0,
       output = "sigm",
       sae_output = "linear",
       numepochs = 100,
       batchsize = 100,
       hidden_dropout = 0,
       visible_dropout = 0)
 }

 # called by Zorro for prediction
 neural.predict = function(model,X)
 {
   if(is.vector(X)) X <- t(X) # transpose horizontal vector
   return(nn.predict(Models[[model]],X))
 }

 # called by Zorro for saving the models
 neural.save = function(name)
 {
   save(Models,file=name)  # save trained models
 }

 # called by Zorro for initialization
 neural.init = function()
 {
   set.seed(365)
   Models <<- vector("list")
 }

 # quick OOS test for experimenting with the settings
 Test = function()
 {
   neural.init()
   XY <<- read.csv('C:/Project/Zorro/Data/signals0.csv',header = F)
   splits <- nrow(XY)*0.8
   XY.tr <<- head(XY,splits)   # training set
   XY.ts <<- tail(XY,-splits)  # test set
   neural.train(1,XY.tr)
   X <<- XY.ts[,-ncol(XY.ts)]
   Y <<- XY.ts[,ncol(XY.ts)]
   Y.ob <<- ifelse(Y > 0,1,0)
   Y <<- neural.predict(1,X)
   Y.pr <<- ifelse(Y > 0.5,1,0)
   confusionMatrix(Y.pr,Y.ob) # display prediction accuracy
 }

Support Vector Machine


Like the neural network, the support vector machine (SVM) is another extension of linear regression. Take a look at this formula again:

y = a0 + a1*x1 + a2*x2 + ... + an*xn

The prediction markers xn can be viewed as coordinates in an n-dimensional space. Setting the target value y to a fixed value defines a plane in that space, or rather a hyperplane, which separates the patterns with y > 0 from the patterns with y < 0. The coefficients an can be calculated so that the distance from the plane to the nearest patterns, called the support vectors, is maximal. In this way we obtain a binary classifier that optimally separates winning and losing patterns.

There is one small problem: usually these patterns are not linearly separable; they are scattered irregularly in our marker space, and no flat plane can be squeezed between them. (If it could, there would be a simpler way to compute that plane: linear discriminant analysis.) In most cases we have to use the trick that gives the support vector machine its power: adding more dimensions to the space. The algorithm produces these additional markers with a kernel function that combines any two predictors into a new feature. The more dimensions are added, the easier it is to separate the patterns with a flat hyperplane. That plane is then transformed back into the original n-dimensional space, getting wrinkled and folded along the way. Thanks to the kernel trick, the whole process can be carried out without ever actually computing the transformation.
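
To illustrate what adding dimensions means, here is a sketch (our own simplification, not an SVM library routine) of a degree-2 polynomial feature expansion, where every pair of predictors is combined into a new marker. A real SVM performs this step implicitly through the kernel function, without ever computing the expanded features:

 #include <stdio.h>

 /* Degree-2 polynomial feature expansion: every pair of predictors (xi, xj)
    is combined into a new marker xi*xj, lifting the patterns into a
    higher-dimensional space where a separating hyperplane is easier to find. */
 int expand_features(const double *x, int n, double *out, int maxout) {
     int i, j, k = 0;
     for (i = 0; i < n; i++)
         for (j = i; j < n; j++) {
             if (k >= maxout) return k;
             out[k++] = x[i] * x[j];
         }
     return k;  /* number of new markers produced */
 }

 int main(void) {
     double x[3] = { 0.5, -0.2, 0.8 };   /* original markers */
     double feat[16];
     int m = expand_features(x, 3, feat, 16);
     for (int i = 0; i < m; i++) printf("%.3f ", feat[i]);
     printf("\n(%d new markers)\n", m);
     return 0;
 }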

The support vector method can be used not only for classification but also for regression. It also offers a number of parameters for tuning the prediction process, starting with the choice of the kernel function.


The k-nearest neighbors method


Compared to support vector machines, this is a fairly simple algorithm with one unique property: it requires no training. The patterns themselves are the model. In a trading system this allows continuous learning by simply adding more patterns. The nearest neighbors algorithm computes the distances from the current marker values to the k nearest patterns. In a space with n dimensions, this distance is calculated just as in two dimensions:

d = sqrt((x1 - p1)^2 + (x2 - p2)^2 + ... + (xn - pn)^2), where p1 ... pn are the marker values of a stored pattern

The algorithm simply predicts the target from the average of the target values of the k nearest patterns, weighted by their inverse distances. It can be used for both classification and regression. Data structures borrowed from computer graphics, such as an adaptive binary tree (ABT), help find the nearest neighbors quickly; such structures were often used in game programming when enemy AI had to learn on its own. For our purposes you can use the knn function in R or write your own version in C, as sketched below.
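
A minimal C sketch of such a nearest-neighbor prediction (brute force, without the tree acceleration mentioned above; the stored patterns here are made up for illustration):

 #include <stdio.h>
 #include <math.h>

 #define NP 5   /* number of stored patterns */
 #define ND 2   /* number of markers per pattern */

 double X[NP][ND] = { {0.1,0.2}, {0.4,0.1}, {0.9,0.8}, {0.8,0.9}, {0.5,0.5} };
 double Y[NP]     = {  -1.0,      -1.0,      1.0,       1.0,       0.2     };

 /* Predict the target for point q as the inverse-distance weighted
    average of the targets of the k nearest stored patterns. */
 double knn_predict(const double *q, int k) {
     double dist[NP];
     int used[NP] = {0};
     for (int i = 0; i < NP; i++) {
         double d = 0;
         for (int j = 0; j < ND; j++) d += (q[j]-X[i][j])*(q[j]-X[i][j]);
         dist[i] = sqrt(d);
     }
     double wsum = 0, ysum = 0;
     for (int m = 0; m < k; m++) {          /* pick the k smallest distances */
         int best = -1;
         for (int i = 0; i < NP; i++)
             if (!used[i] && (best < 0 || dist[i] < dist[best])) best = i;
         used[best] = 1;
         double w = 1.0 / (dist[best] + 1e-9);  /* inverse-distance weight */
         wsum += w; ysum += w * Y[best];
     }
     return ysum / wsum;
 }

 int main(void) {
     double q[ND] = { 0.85, 0.85 };
     printf("prediction = %.3f\n", knn_predict(q, 3));
     return 0;
 }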

The k-means method


The k-means method is an approximation algorithm for unsupervised classification, and in some ways it is similar to the previous method. To separate the patterns, the algorithm first places k random points in the marker space. Then each pattern is assigned to the point closest to it. In the next step, each point is moved to the mean of the patterns assigned to it. This produces a new assignment, since some patterns are now closer to a different point. The process is repeated until the assignment no longer changes, that is, until each point sits exactly at the mean of its nearest patterns. We then have k classes of patterns, each lying in the vicinity of one of the k points. This simple algorithm can produce surprisingly good results.
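
A minimal one-dimensional sketch of the k-means iteration just described (illustrative code with made-up data, not a library implementation):

 #include <stdio.h>
 #include <math.h>

 #define NP 8   /* number of patterns (1-dimensional for brevity) */
 #define K  2   /* number of cluster points */

 int main(void) {
     double x[NP] = { 0.1, 0.2, 0.15, 0.9, 1.0, 0.95, 0.05, 1.05 };
     double c[K]  = { 0.0, 0.5 };      /* initial point positions */
     int assign[NP];

     for (int iter = 0; iter < 100; iter++) {
         /* 1. assign every pattern to its nearest point */
         for (int i = 0; i < NP; i++) {
             int best = 0;
             for (int k = 1; k < K; k++)
                 if (fabs(x[i]-c[k]) < fabs(x[i]-c[best])) best = k;
             assign[i] = best;
         }
         /* 2. move every point to the mean of its assigned patterns */
         double moved = 0;
         for (int k = 0; k < K; k++) {
             double sum = 0; int cnt = 0;
             for (int i = 0; i < NP; i++)
                 if (assign[i] == k) { sum += x[i]; cnt++; }
             if (cnt) { moved += fabs(c[k] - sum/cnt); c[k] = sum/cnt; }
         }
         if (moved < 1e-9) break;   /* assignment no longer changes */
     }
     printf("cluster centers: %.3f  %.3f\n", c[0], c[1]);
     return 0;
 }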

Naive Bayes Classifier


The next algorithm uses Bayes' theorem to classify patterns by non-numeric features (events), much like the candlestick pattern method discussed above. Suppose we have an event X that appears in 80% of all winning patterns. What can we learn from this? What is the probability that a pattern containing X wins? It is not 0.8, as one might assume. The probability can be calculated with Bayes' theorem:

P(Y|X) = P(X|Y) * P(Y) / P(X)

P(Y|X) is the probability that event Y (a win) occurs in patterns containing event X. According to the formula, it equals the probability of X occurring in all winning patterns (here 0.8), multiplied by the probability of Y in all patterns (0.5, if you followed the advice above on balancing the patterns), and divided by the probability of X in all available patterns.

If we are moderately naive and assume that all events X are independent of each other, we can calculate the overall probability that a particular pattern is profitable by multiplying the individual probabilities with a scaling factor s:

P(Y | X1 ... Xn) = s * P(X1|Y) * P(X2|Y) * ... * P(Xn|Y) * P(Y)

For the formula to work, the prediction markers should be chosen so that they are as independent of each other as possible, and that is the main obstacle to applying Bayes' theorem in trading: in most cases the two events depend on each other in one way or another.
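
A small sketch of the naive Bayes calculation under these assumptions (the event probabilities below are made-up illustrations, not real statistics); the normalization at the end plays the role of the scaling factor s:

 #include <stdio.h>

 /* Probabilities estimated from a (hypothetical) balanced set of patterns:
    p_x_win[i]  = probability that event Xi occurs in winning patterns,
    p_x_loss[i] = probability that event Xi occurs in losing patterns.  */
 #define NEV 3
 double p_x_win[NEV]  = { 0.80, 0.60, 0.55 };
 double p_x_loss[NEV] = { 0.30, 0.50, 0.45 };

 /* Naive Bayes: P(win | X1..Xn), assuming the events are independent. */
 double p_win_given_events(void) {
     double p_win = 0.5, p_loss = 0.5;     /* balanced patterns */
     double num = p_win, den_loss = p_loss;
     for (int i = 0; i < NEV; i++) {
         num      *= p_x_win[i];
         den_loss *= p_x_loss[i];
     }
     return num / (num + den_loss);        /* normalization replaces the factor s */
 }

 int main(void) {
     printf("P(win | X1..X%d) = %.3f\n", NEV, p_win_given_events());
     return 0;
 }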

The Bayesian classifier method is available in the e1071 package for R.

Decision Tree and Regression Tree


Both types of tree predict an output value based on a series of yes/no decisions. Each decision depends either on the presence or absence of an event (in the case of non-numeric features) or on a comparison of a prediction marker with a fixed threshold. A typical tree function generated by Zorro looks like this:

 int tree(double* sig)
 {
   if(sig[1] <= 12.938) {
     if(sig[0] <= 0.953) return -70;
     else {
       if(sig[2] <= 43) return 25;
       else {
         if(sig[3] <= 0.962) return -67;
         else return 15;
       }
     }
   }
   else {
     if(sig[3] <= 0.732) return -71;
     else {
       if(sig[1] > 30.61) return 27;
       else {
         if(sig[2] > 46) return 80;
         else return -62;
       }
     }
   }
 }

How is such a tree derived from a set of patterns? There are several methods; Zorro uses Shannon information entropy. First, one of the markers is checked, say x1, and a hyperplane is placed according to the formula x1 = t. This plane separates the patterns with x1 > t from the patterns with x1 < t. The separation threshold t is chosen so that the information gain (the information entropy of the whole space minus the combined information entropy of the two separated subspaces) is maximal. That is the case when the patterns within each subspace are more similar to each other than the patterns in the space as a whole.

The process is then repeated with the next marker x2, with two hyperplanes splitting the two subspaces; each decision is again a comparison of a marker with a threshold. We soon get a highly branched tree with thousands of comparisons. Then the process runs in the opposite direction: the tree is pruned by removing all decisions that do not lead to a significant gain in information. In the end we get a relatively small tree, as in the code example above.
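
For illustration, here is a sketch of the entropy and information-gain calculation for a single split (using the standard weighted form of the gain; our own code, not Zorro's):

 #include <stdio.h>
 #include <math.h>

 /* Shannon entropy of a two-class subset: nw winning and nl losing patterns. */
 double entropy(int nw, int nl) {
     double n = nw + nl, h = 0;
     if (nw) h -= (nw/n) * log2(nw/n);
     if (nl) h -= (nl/n) * log2(nl/n);
     return h;
 }

 /* Information gain of splitting the patterns at marker threshold t:
    entropy of the whole set minus the weighted entropies of the two subsets. */
 double info_gain(const double *x, const int *win, int n, double t) {
     int lw=0, ll=0, rw=0, rl=0;
     for (int i = 0; i < n; i++) {
         if (x[i] <= t) { if (win[i]) lw++; else ll++; }
         else           { if (win[i]) rw++; else rl++; }
     }
     int nw = lw+rw, nl = ll+rl;
     double nL = lw+ll, nR = rw+rl, N = n;
     return entropy(nw,nl) - (nL/N)*entropy(lw,ll) - (nR/N)*entropy(rw,rl);
 }

 int main(void) {
     double x[6]   = { 1.0, 2.0, 3.0, 10.0, 11.0, 12.0 };  /* one marker */
     int    win[6] = {   0,   0,   0,    1,    1,    1 };  /* win/loss label */
     printf("gain at t=6.0: %.3f\n", info_gain(x, win, 6, 6.0));
     return 0;
 }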

The decision tree can be applied in many ways, but it cannot solve every problem, since its dividing planes are always parallel to the axes of the marker space. This limits prediction accuracy. It can also be used for regression, for example to return the proportion of winning patterns associated with a particular branch of the tree; the Zorro tree is a regression tree. The best-known tree classification algorithm is C5.0, available in the C50 package for R.

Conclusion


Today traders have access to many different data mining methods. But which is more effective: model-based strategies or machine learning strategies? The latter undoubtedly have a lot going for them. There is no need to worry about market microstructure, trader psychology, and other fuzzy things that cannot be expressed in numbers; you can concentrate on pure mathematics. Machine learning is a more elegant, more attractive way to create a trading system. Everything speaks in its favor, except for one thing: despite the enthusiastic reports on the forums, in live trading all of this turns out to be strangely ineffective.

Every week a new article about machine learning trading methods appears in specialized publications. Their conclusions should be treated with skepticism. Many of them promise fantastic win rates of 70-85%. If all of this were true, the number of billionaires among mathematicians would be off the charts. In reality, successful strategies based on machine learning are frustratingly few.



Source: https://habr.com/ru/post/281603/

