Identification of cointegrated pairs of shares in stock markets
The purpose of this article is to share the results of a study that identifies cointegrated pairs of stocks listed on the Moscow and New York stock exchanges using the Engle-Granger test.
If we take two stocks whose prices have stationary increments and find a linear combination of them (a spread) that is stationary, the two price series are called cointegrated. Cointegration gives us the opportunity to hedge one stock with the other and build a market-neutral strategy. Why is this possible?
The principle on which profit is built
We all know that the price of a stock, regarded as a time series, can change quite significantly. If we open a position in a single security, in most cases it will be a very risky game, since we take on all the risk associated with its volatility. However, there are stocks that, taken as a pair, can be expected not to move too far from each other. This concept is called long-term dynamic equilibrium.
In the context of stationarity, long-term dynamic equilibrium takes a more precise form. A stationary spread built between two cointegrated securities is mean-reverting: after any deviation from its equilibrium level it tends to return. The market-neutral strategy is based on this principle.
How do we find pairs in the stock markets that are linked by a long-term dynamic equilibrium?
Correlation
The first thought that comes to mind is to calculate the correlation between two securities and trade the pairs with a strong correlation. This approach fails for two reasons.
First, if the price series of two stocks were perfectly correlated, that is, if they always moved in the same direction and in the same proportion, the difference between the series would be zero and we could not earn anything, because neither stock would ever be too expensive or too cheap relative to the other.
Second, correlation does not give us enough information about the long-term relationship between two stocks. For example, take a large, diversified portfolio of stocks. Let these stocks also be included in a stock index, and let their weights in the portfolio be determined by their weights in the index. Although the portfolio should follow the index in the long run, there will be periods when stocks that are in the index but not in the portfolio show unusual price movements. Consequently, the empirical correlation between the portfolio and the index can be rather low for some time. Judging by correlation alone, we would discard such a portfolio and lose the opportunity to earn. It follows that correlation is not a good way to identify pairs.
It is better to use cointegration to identify pairs.
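To illustrate the second point, here is a small MATLAB sketch on simulated data (not on real quotes). The series, the coefficient 2, and the noise scale are all assumptions chosen for the illustration. The pair is tied together in the long run, yet the correlation of its daily increments is weak.

```matlab
rng(1);                               % fixed seed so the run is repeatable
n = 1000;
x = cumsum(randn(n,1));               % random walk, an I(1) series
noise = 5*randn(n,1);                 % stationary deviation from equilibrium
y = 2*x + noise;                      % y is tied to x in the long run

R = corrcoef(diff(x), diff(y));       % correlation of daily increments
incrCorr = R(1,2);

spread = y - 2*x;                     % by construction equals the stationary noise
fprintf('Correlation of increments: %.2f\n', incrCorr);
fprintf('Std of spread: %.2f, std of y: %.2f\n', std(spread), std(y));
```

A filter based on correlation alone would probably reject this pair, although its spread never drifts away.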
Cointegration
Often, to make an economic series stationary, we take its differences. This leads to the following definition of integration.
A time series is called integrated of order $k$, denoted $x_t \sim I(k)$, if the series and its differences up to order $k-1$ inclusive are non-stationary, while its difference of order $k$ is stationary.
For practical results we only need the values $k=0$ and $k=1$. If $k=0$, the series itself is stationary, and for brevity I will denote such series $I(0)$. For $k=1$ the series is non-stationary with stationary increments (first-order differences), and for brevity I will denote such series $I(1)$.
Suppose we have two $I(1)$ series, $x_t$ and $y_t$. Suppose, moreover, that their linear combination $y_t - \beta x_t$ is $I(0)$. In this case the series $x_t$ and $y_t$ are called cointegrated:
$$\varepsilon_t = y_t - \beta x_t \sim I(0).$$
In essence, cointegration is a regression between non-stationary series. It means that if $\varepsilon_t$ has zero mean, this series will rarely deviate far from zero and will often cross the zero level. In other words, from time to time an exact equilibrium, or a state close to it, will be reached.
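A minimal sketch of the two cases on simulated data (an assumption, not market quotes): white noise plays the role of an $I(0)$ series, its cumulative sum is an $I(1)$ random walk, and taking the first difference brings us back to the stationary increments.

```matlab
rng(2);
n = 500;
e = randn(n,1);            % white noise, an I(0) series
x = cumsum(e);             % random walk built from it, an I(1) series
dx = diff(x);              % first-order differences of the random walk

% The differences recover the original stationary increments e(2:end)
maxErr = max(abs(dx - e(2:end)));
fprintf('Largest deviation between diff(x) and e: %.2e\n', maxErr);
```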
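The following sketch shows this behaviour on a simulated cointegrated pair. The true coefficient 1.5 and the noise are assumptions made for the example. It estimates $\beta$ by least squares and counts how often the spread crosses zero compared with the random walk itself.

```matlab
rng(3);
n = 1000;
x = cumsum(randn(n,1));               % I(1) series
y = 1.5*x + randn(n,1);               % cointegrated with x, true beta = 1.5

betaHat = x \ y;                      % least-squares estimate without a constant
spread = y - betaHat*x;               % estimated epsilon_t

% A sign change between neighbouring observations is a zero crossing
crossSpread = sum(spread(1:end-1).*spread(2:end) < 0);
crossWalk   = sum(x(1:end-1).*x(2:end) < 0);
fprintf('beta = %.3f, crossings: spread %d, random walk %d\n', ...
        betaHat, crossSpread, crossWalk);
```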
Cointegration of logarithms of prices
We can consider cointegration not only between prices but also between their logarithms. Admittedly, cointegration between the log prices of two stocks is less obvious and intuitive than cointegration between the prices themselves. So why is cointegration possible for logarithms?
This is explained by the "efficient market hypothesis", the option pricing model, and Ito's lemma. The efficient market hypothesis has no strict formalization. It suggests that in a liquid market, where the price of an asset results from balanced, freely formed supply and demand, the current price accurately reflects all the information available to market participants. Future price changes can only result from "news", which by definition is unpredictable, so the best forecast of the price at any future date is simply today's price. In other words, today's price is yesterday's price plus a random term.
The efficient market hypothesis is connected with the basic option pricing model. The fundamental assumption of this model is that the price of the underlying asset $S$ follows a geometric Brownian motion (GBM):
$$\frac{dS}{S} = \mu\,dt + \sigma\,dW,$$
where $\mu$ and $\sigma$ are constants representing, respectively, the drift of the asset price and the volatility of its returns, and $W$ is a Wiener process, that is, its increments $dW$ are independent and normally distributed with zero mean and variance $dt$.
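To get a feel for the GBM equation, here is a simple Euler discretization of it in MATLAB. The drift, volatility, starting price, and the daily step of 1/252 of a year are assumptions chosen for illustration only.

```matlab
rng(4);
mu = 0.08;  sigma = 0.2;              % assumed annual drift and volatility
dt = 1/252; n = 252;                  % one year of daily steps
S0 = 100;                             % assumed starting price

dW = sqrt(dt)*randn(n,1);             % Wiener increments: mean 0, variance dt
S = zeros(n+1,1);
S(1) = S0;
for t = 1:n
    % Euler step of dS = S*(mu*dt + sigma*dW)
    S(t+1) = S(t) + S(t)*(mu*dt + sigma*dW(t));
end
plot(0:n, S); xlabel('Trading day'); ylabel('Price');
```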
To see how the GBM equation relates to the efficient market hypothesis, we need to apply Ito's lemma to it. What is it? Assume that a variable $x$ obeys the stochastic differential equation (SDE)
$$dx = \mu\,dt + \sigma\,dW,$$
where $W$ is a Wiener process, and $\mu$ and $\sigma$ are functions of $x$ and $t$. Assume also that a function $f$ depends on $x$ and $t$ and has derivatives $\frac{\partial f}{\partial t}$, $\frac{\partial f}{\partial x}$, $\frac{\partial^2 f}{\partial x^2}$. Ito's lemma states that this function obeys the equation
$$df = \left(\frac{\partial f}{\partial t} + \mu\frac{\partial f}{\partial x} + \frac{\sigma^2}{2}\frac{\partial^2 f}{\partial x^2}\right)dt + \sigma\frac{\partial f}{\partial x}\,dW.$$
Applying the lemma to $f = \ln S$, where by the GBM equation $dS = \mu S\,dt + \sigma S\,dW$ (so that $\mu S$ and $\sigma S$ take the place of $\mu$ and $\sigma$ in the lemma), we get
$$d\ln S = \left(\mu - \frac{\sigma^2}{2}\right)dt + \sigma\,dW,$$
or, over a unit time step,
$$\Delta \ln S_t = c + \varepsilon_t,$$
where $c = \mu - \sigma^2/2$ and $\varepsilon_t \sim NID(0, \sigma^2)$, so the process $\varepsilon_t$ is not just stationary but white noise. The concept of a stationary process is broader than white noise: a stationary process has a constant expectation, but it does not have to be zero, as it is for white noise.
The discrete version of the equation above can, in turn, be written as:
$$\ln S_t = c + \ln S_{t-1} + \varepsilon_t.$$
This equation is a random walk (RW) model, which is commonly used to model log prices in efficient financial markets and is an example of an $I(1)$ process. Thus, cointegration can also refer to the logarithms of stock prices.
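The random walk for log prices is easy to simulate. In the sketch below the per-day values of $\mu$ and $\sigma$ and the starting price of 100 are assumptions. The log price is built by accumulating $c + \varepsilon_t$, and the first differences of the result return exactly these stationary increments, as expected for an $I(1)$ series.

```matlab
rng(5);
mu = 0.0005;  sigma = 0.01;           % assumed per-day drift and volatility
c = mu - sigma^2/2;                   % constant of the log-price equation
n = 2000;
S0 = 100;                             % assumed starting price

epsT = sigma*randn(n,1);              % white noise, NID(0, sigma^2)
logS = log(S0) + cumsum(c + epsT);    % ln S_t = c + ln S_{t-1} + eps_t
S = exp(logS);                        % back to prices, always positive

incr = diff([log(S0); logS]);         % first differences of the log price
fprintf('Mean log increment: %.5f (c = %.5f)\n', mean(incr), c);
```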
Some skeptics (myself included) may doubt that the GBM equation describes stock prices adequately and, therefore, that cointegration between log prices is possible, but the empirical data dispel this skepticism. I checked: if the prices are cointegrated, then their logarithms are cointegrated too.
Cointegration testing
The first method of testing for cointegration was devised by Robert Engle and Clive Granger. In 2003 they shared the Nobel Prize in Economics, with Granger cited for cointegration methods of analyzing time series. They described the method 16 years before the prize, in 1987, in the article "Co-integration and Error Correction: Representation, Estimation, and Testing".
Conceptually, to determine from the available observations whether the time series $x_t$ and $y_t$ are cointegrated, we need to test the null hypothesis $H_0: \varepsilon_t \sim I(1)$ of no cointegration between $x_t$ and $y_t$ against the alternative hypothesis $H_1: \varepsilon_t \sim I(0)$. If the null hypothesis is rejected, cointegration is accepted.
The original test for cointegration is called the Engle-Granger test in honor of its authors. It is a two-step procedure preceded by a check that $x_t$ and $y_t$ are integrated of order one, $x_t \sim I(1)$ and $y_t \sim I(1)$. We discussed this in detail in the article on stationary increments. In fact, that article describes all the preparatory work that has to be done before proceeding to the Engle-Granger test itself. Let's say we have done it.
The series $x_t$ and $y_t$ are cointegrated if their spread $y_t - \beta x_t \sim I(0)$, that is, if it is stationary. The first step of the Engle-Granger test is to obtain a consistent estimate $\hat\beta$. This is done by OLS (ordinary least squares) applied to the linear regression $y_t = \beta x_t + \varepsilon_t$. The second step is to check the stationarity of the residuals $\varepsilon_t$ obtained from the OLS estimate of the cointegrating equation.
Usually we test stationarity with the Dickey-Fuller test. However, in 1990 Phillips and Ouliaris, in the article "Asymptotic Properties of Residual Based Tests for Cointegration", showed that the Dickey-Fuller test cannot be applied directly to the residual series $\varepsilon_t$.
The point is that OLS "chooses" the residuals so that they have the smallest possible variance, so even if the variables are not cointegrated, OLS makes the residuals "look like" stationary ones. Because of this, with the standard Dickey-Fuller critical values the hypothesis of non-stationarity is rejected too often and, accordingly, the hypothesis of cointegration is mistakenly accepted.
If we study the authors' article, we will see that the appendix gives tables of critical values, but they turned out to be rather inaccurate. Later, in 1991, Engle and Granger published the book "Long-Run Economic Relationships". In its 13th chapter, entitled "Critical Values for Cointegration Tests", MacKinnon gave refined asymptotic critical values of the $t$-statistic, which were obtained by simulation and are suitable for this case.
In 1993 MacKinnon, together with Davidson, published the book "Estimation and Inference in Econometrics", which also gives updated critical values. Thus, if $\varepsilon_t \sim I(0)$ (the residuals are stationary), then $y_t - \beta x_t \sim I(0)$ (the spread is also stationary), which means that $x_t$ and $y_t$ are cointegrated.
In general, the Engle-Granger method reduces to:
estimating $\beta$ by OLS;
computing the spread $\varepsilon_t = y_t - \beta x_t$ and testing $\varepsilon_t$ for stationarity using the critical values mentioned above.
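This effect can be seen in a small Monte Carlo experiment. The sketch below is my own illustration, not the procedure from the Phillips and Ouliaris article. It simulates pairs of independent random walks, which are not cointegrated by construction, and applies the simplest Dickey-Fuller $t$-statistic (no constant, no lags) both to a raw random walk and to the OLS residuals. The critical value -1.95 is the approximate standard 5% Dickey-Fuller value for this case and is an assumption of the example, as are the sample size and the number of simulations.

```matlab
rng(6);
nSim = 500;  n = 200;
critDF = -1.95;        % approx. 5% Dickey-Fuller value, no constant (assumed)

% t-statistic of rho in the regression diff(e) = rho*e(t-1) + error
rhoOf = @(z,d) (z'*d)/(z'*z);
tOf   = @(z,d) rhoOf(z,d) / sqrt(sum((d - rhoOf(z,d)*z).^2)/(numel(d)-1)/(z'*z));
dfT   = @(e) tOf(e(1:end-1), diff(e));

rejWalk = 0;  rejResid = 0;
for s = 1:nSim
    x = cumsum(randn(n,1));
    y = cumsum(randn(n,1));           % independent of x: no cointegration
    X = [ones(n,1) x];
    res = y - X*(X\y);                % OLS residuals of y on a constant and x
    rejWalk  = rejWalk  + (dfT(y)   < critDF);
    rejResid = rejResid + (dfT(res) < critDF);
end
rateWalk  = rejWalk/nSim;             % should stay near the nominal 5%
rateResid = rejResid/nSim;            % false "cointegration" findings
fprintf('Rejection rate: raw walk %.2f, OLS residuals %.2f\n', rateWalk, rateResid);
```

On the raw random walk the rejection rate should stay near the nominal level, while on the residuals it should be noticeably higher, which is why special critical values are needed.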
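The two steps above can be sketched by hand in MATLAB on a simulated pair. This is a simplified illustration and not a replacement for a library implementation: the data are artificial, the regression includes a constant, the Dickey-Fuller regression has no lags, and the critical value -3.34 is only an approximate asymptotic 5% value of the MacKinnon type for two variables with a constant (an assumption of the example).

```matlab
rng(7);
n = 250;
x = cumsum(randn(n,1));               % simulated I(1) series
y = 5 + 2*x + randn(n,1);             % cointegrated with x by construction

% Step 1: OLS estimate of the cointegrating regression y = alpha + beta*x + eps
X = [ones(n,1) x];
coef = X \ y;
alphaHat = coef(1);
betaHat  = coef(2);

% Step 2: Dickey-Fuller t-statistic on the residuals (no constant, no lags)
e = y - X*coef;
z = e(1:end-1);
d = diff(e);
rho = (z'*d)/(z'*z);
se  = sqrt(sum((d - rho*z).^2)/(numel(d)-1)/(z'*z));
tCalc = rho/se;

tCrit = -3.34;                        % approximate 5% critical value (assumed)
isCointegrated = tCalc < tCrit;
fprintf('beta = %.3f, t_calc = %.2f, cointegrated: %d\n', betaHat, tCalc, isCointegrated);
```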
Standard packages such as MATLAB already implement this test, so let's use it.
MATLAB cointegration testing
So, we have two stock price series, $x_t$ and $y_t$. We want $x_t$ and $y_t$ to be cointegrated, that is, we want the spread $\varepsilon_t = y_t - \beta x_t$ to be stationary. If we want a stationary series with zero mean, we can include a constant in the equation, so the spread becomes $\varepsilon_t = y_t - \beta x_t - \alpha$.
Let's start with the results for the Moscow Exchange that I described in the article on stationary increments. There I found five $I(1)$ series. We will form all possible pairs from them and check each for cointegration with the Engle-Granger test.
First, from the Microsoft SQL Server database in which I store stock prices, we select the Moscow Exchange prices for the securities we need and import them as an array:
conn = database.ODBCConnection('uXXXXXX.mssql.masterhost.ru', 'uXXXXXX', 'XXXXXXXXXX');
curs = exec(conn, 'SELECT ALL PriceId, StockId, Date, Price FROM StockPrices WHERE StockId IN (52, 55, 67, 75, 162) AND Date >= ''2016-01-01 00:00:00.000'' AND Date < ''2017-01-01 00:00:00.000''');
curs = fetch(curs);
data = curs.Data
sqlquery = 'SELECT ALL StockId, ShortName, Code FROM Stocks WHERE StockId IN (52, 55, 67, 75, 162)';
curs = exec(conn, sqlquery);
curs = fetch(curs);
names = curs.Data
close(conn);
In this array, four of the five stocks have data starting in January, for 252 trading days. For one stock, however, trading began only in February, so there is data for only 215 trading days. It is critically important for us that all stocks have price arrays of the same length, so in such situations we have two options.
The first option is to exclude the stock with the short price array from the experiment and use the maximum number of price observations to get more accurate results. The second option is to sacrifice part of the data and include all the stocks for the sake of practicality. I ran both experiments, and in this case the results did not differ, so let's simply cut off the earliest data:
dates = unique(datetime(data(:,3)));
% Cut dates array until price of stock with StockId=67 is not empty.
dates(1:37,:) = [];
prices = zeros(length(dates),length(names));
for i = 1:length(names)
    % Indexes with current stock's data
    indexes = find(cell2mat(data(:,2)) == cell2mat(names(i,1)));
    if length(indexes) == 252
        indexes(1:37,:) = [];
    end
    for j=1:length(dates)
        % Fill prices according to date
        prices(j,i) = cell2mat(data(indexes(j),4));
    end
end
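The code above relies on the known offset of 37 trading days. When the offset is not known in advance, one possible alternative is to align the series on their common dates. The sketch below uses two tiny hypothetical series (the dates and prices are made up) and the standard intersect function.

```matlab
% Hypothetical data: stock B starts trading three days later than stock A
datesA = (datetime(2016,1,1):days(1):datetime(2016,1,10))';
datesB = datesA(4:end);
priceA = (100:109)';
priceB = (50:56)';

% Keep only the dates on which both stocks have a price
[commonDates, ia, ib] = intersect(datesA, datesB);
prices = [priceA(ia) priceB(ib)];     % equal-length columns, aligned by date
disp(size(prices));
```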
The Engle-Granger test is performed with the egcitest function, which takes an array of time series as input, in this case of size $n \times 2$, where $n$ is the number of trading days. The function returns a logical 1 if the null hypothesis is rejected in favor of the alternative, and 0 otherwise.
The next question is which stock to take as $x_t$ and which as $y_t$. Ideally, one should try both orderings and then compare the test statistics. In most cases cointegration will show up in both the direct and the reverse regression. Let's start with the case $x_t < y_t$.
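A minimal call might look like the sketch below. It requires the Econometrics Toolbox and uses simulated series instead of real prices (an assumption of the example). The first column of the input array is treated as the response $y_t$ and the second as the regressor $x_t$.

```matlab
rng(8);
n = 250;
x = cumsum(randn(n,1));               % simulated I(1) series
y = 1 + 0.5*x + randn(n,1);           % cointegrated with x by construction

Y = [y x];                            % n-by-2 array, response in the first column
[h, pValue, stat, cValue] = egcitest(Y);
fprintf('h = %d, p = %.4f, stat = %.4f, critical value = %.4f\n', ...
        h, pValue, stat, cValue);
```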
We form all possible pairs from the five identified $I(1)$ series and run the Engle-Granger test both for the regression with a constant term (the default) and without it (set by the 'creg' argument with the value 'nc'):
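Such a loop over pairs might be sketched as follows. The prices matrix here is simulated (five hypothetical series, two of which share a common trend), so the output does not reproduce the results for real tickers, and the column order inside each pair is my assumption. The Econometrics Toolbox is required.

```matlab
rng(9);
n = 215;  k = 5;
common = cumsum(randn(n,1));                   % shared stochastic trend
prices = cumsum(randn(n,k));                   % independent random walks
prices(:,2) = 10 + 2*common + randn(n,1);      % columns 2 and 4 share the trend
prices(:,4) = 3 + common + randn(n,1);

pairs = nchoosek(1:k, 2);                      % all pairs (i, j) with i < j
hC  = false(size(pairs,1), 1);                 % results with a constant term
hNC = false(size(pairs,1), 1);                 % results without it
for p = 1:size(pairs,1)
    Y = prices(:, [pairs(p,2) pairs(p,1)]);    % series j regressed on series i
    hC(p)  = egcitest(Y);                      % 'creg' defaults to 'c'
    hNC(p) = egcitest(Y, 'creg', 'nc');
end
disp('Cointegrated pairs (with constant):');    disp(pairs(hC,:));
disp('Cointegrated pairs (without constant):'); disp(pairs(hNC,:));
```

Swapping the two column indices inside the loop gives the reverse regression.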
For the regression with a constant term, the program rejects the null hypothesis in favor of the alternative twice, identifying the cointegrated pairs of stocks with tickers (NKHP, VTRS) and (NKHP, ZHIV). For the regression without a constant term, the program rejects the null hypothesis once, identifying the cointegrated pair (VSYDP, NKHP).
For the reverse regression ($y_t < x_t$) with a constant term, the program rejects the null hypothesis in favor of the alternative twice, identifying the cointegrated pairs (VTRS, NKHP) and (ZHIV, NKHP). For the regression without a constant term, the program rejects the null hypothesis four times, identifying the cointegrated pairs (GRNT, VTRS), (GRNT, VSYDP), (GRNT, ZHIV), (GRNT, NKHP).
Let's estimate the values of $\beta$ and $\alpha$, which egcitest can return among its output values, and plot the spread.
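One way to do this is sketched below on simulated data. The series and the coefficients 200 and 37 are assumptions that only mimic the scale of the real example. The fifth output of egcitest is a structure with the statistics of the cointegrating regression, and for a regression with a constant its coeff field holds the constant first and the slope second.

```matlab
rng(10);
n = 215;
x = cumsum(randn(n,1));                        % simulated I(1) series
y = 200 + 37*x + 5*randn(n,1);                 % cointegrated with x

[h, ~, ~, ~, reg] = egcitest([y x]);           % regression with a constant
alphaHat = reg.coeff(1);                       % estimated constant
betaHat  = reg.coeff(2);                       % estimated slope

spread = y - betaHat*x - alphaHat;             % zero-mean spread
figure;
plot(spread);
title(sprintf('Spread, beta = %.3f, alpha = %.3f', betaHat, alphaHat));
```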
For stocks with tickers NKHP and VTRS, we obtain a spread with coefficients $\beta = 37.552$ and $\alpha = 197.439$.
For the reverse regression, we get a "mirror" spread with coefficients $\beta = 0.085$ and $\alpha = -3.0064$.
For stocks with tickers NKHP and ZHIV, we obtain a spread with coefficients $\beta = 3.352$ and $\alpha = 239.347$.
For the reverse regression, we obtain a spread with coefficients $\beta = 0.2194$ and $\alpha = -49.6077$.
For stocks with tickers VSYDP and NKHP, we obtain a spread with the coefficient $\beta = 35.652$.
Similar experiments were carried out for stocks on the New York Stock Exchange (NYSE). For the direct regression, 158 cointegrated pairs were found with a constant term and 130 without it. For the reverse regression, 170 cointegrated pairs were found with a constant term and 144 without it.
Regression Statistics
Let's look at the statistics of the cointegrating regression for the pair (NKHP, VTRS).
| Statistic | Direct regression | Inverse regression |
|---|---|---|
| Coefficients | $\beta = 37.552$, $\alpha = 197.439$ | $\beta = 0.085$, $\alpha = -3.0064$ |
| Test statistic | $t_{calc} = -3.7562$, $t_{crit} = -3.3654$ | $t_{calc} = -3.5906$, $t_{crit} = -3.3654$ |
| $t$-statistics | $t_\beta = 21.975$, $t_\alpha = 53.3845$ | $t_\beta = 21.975$, $t_\alpha = -12.8953$ |
| $F$-statistic | 482.9196 | 482.9196 |
| Durbin-Watson statistic | 0.2548 | 0.2203 |
| Coefficient of determination | 0.6939 | 0.6939 |
| Adjusted coefficient of determination | 0.6925 | 0.6925 |
| Akaike information criterion | 1726.5 | 88.8336 |
| Schwarz Bayesian information criterion | 1733.2 | 95.5748 |
| Hannan-Quinn information criterion | 1729.2 | 91.5574 |
The test statistic in both the direct and the inverse regression tells us that the variable $\beta$ is insignificant in this case ($t_{calc} < t_{crit}$). This means that the price may be weakly exogenous even though the variables are cointegrated.
To apply Student's $t$-test and Fisher's $F$-test, the statistics must be normally distributed. In our case the statistics follow a distribution similar to the one established by Dickey and Fuller (I also wrote about it in the article on stationary increments), so the computed values of these statistics come out quite large and tell us nothing meaningful.
The Durbin-Watson statistic is acceptable (with positive autocorrelation the statistic tends to zero). For the reverse regression it is slightly better than for the direct one.
The coefficient of determination is acceptable (an acceptable model is usually expected to have a coefficient of determination of at least 50%). By this criterion there is no difference between the direct and the reverse regression.
Judging by the information criteria, the inverse regression greatly outperforms the direct one (the model with the lowest criterion value is considered the best).
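For reference, several of the statistics in the table can be computed by hand from the OLS residuals. The sketch below uses simulated data (an assumption, so the numbers will not match the table) and the textbook formulas for the Durbin-Watson statistic, the coefficient of determination, and its adjusted version.

```matlab
rng(11);
n = 215;
x = cumsum(randn(n,1));                        % simulated I(1) regressor
y = 200 + 37*x + 20*randn(n,1);                % simulated response

X = [ones(n,1) x];
coef = X \ y;                                  % OLS with a constant term
e = y - X*coef;                                % residuals

dw = sum(diff(e).^2) / sum(e.^2);              % Durbin-Watson statistic, in [0, 4]

sse = sum(e.^2);                               % residual sum of squares
sst = sum((y - mean(y)).^2);                   % total sum of squares
r2 = 1 - sse/sst;                              % coefficient of determination
k = size(X,2);                                 % number of estimated coefficients
adjR2 = 1 - (1 - r2)*(n - 1)/(n - k);          % adjusted coefficient of determination

fprintf('DW = %.4f, R^2 = %.4f, adjusted R^2 = %.4f\n', dw, r2, adjR2);
```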
Now let's look at the statistics of the cointegrating regression for the pair (NKHP, ZHIV).
| Statistic | Direct regression | Inverse regression |
|---|---|---|
| Coefficients | $\beta = 3.352$, $\alpha = 239.347$ | $\beta = 0.2194$, $\alpha = -49.6077$ |
| Test statistic | $t_{calc} = -3.4762$, $t_{crit} = -3.3654$ | $t_{calc} = -3.3878$, $t_{crit} = -3.3654$ |
| $t$-statistics | $t_\beta = 24.344$, $t_\alpha = 137.97$ | $t_\beta = 24.344$, $t_\alpha = -19.8524$ |
| $F$-statistic | 592.652 | 592.652 |
| Durbin-Watson statistic | 0.2614 | 0.2104 |
| Coefficient of determination | 0.7356 | 0.7356 |
| Adjusted coefficient of determination | 0.7344 | 0.7344 |
| Akaike information criterion | 1695 | 1108.8 |
| Schwarz Bayesian information criterion | 1701.7 | 1115.5 |
| Hannan-Quinn information criterion | 1697.7 | 1111.5 |
The test statistic in both the direct and the inverse regression tells us that the variable $\beta$ is insignificant in this case. The Durbin-Watson statistic is acceptable, and for the reverse regression it is slightly better than for the direct one. The coefficient of determination is acceptable, with no difference between the direct and the reverse regression. According to the information criteria, the inverse regression is slightly better than the direct one.
Statistics of the cointegrating regression for the pair (VSYDP, NKHP).
| Statistic | Direct regression |
|---|---|
| Coefficients | $\beta = 35.652$ |
| Test statistic | $t_{calc} = -2.8339$, $t_{crit} = -2.7761$ |
| $t$-statistic | 82.5035 |
| $F$-statistic | $\infty$ |
| Durbin-Watson statistic | 0.1305 |
| Coefficient of determination | 0.1928 |
| Adjusted coefficient of determination | 0.1928 |
| Akaike information criterion | 3823.8 |
| Schwarz Bayesian information criterion | 3827.1 |
| Hannan-Quinn information criterion | 3825.1 |
Judging by the test statistic, the variable $\beta$ is again insignificant. The $F$-statistic has gone off to infinity. The Durbin-Watson statistic is acceptable. The coefficient of determination is small, so the model is considered poor.
Conclusions
Stock markets contain a sufficient number of cointegrated pairs of stocks, that is, pairs whose spread is a stationary process. The presence of such pairs provides a basis for further research and for a stable profit, but we will talk about specific strategies next time.
What to read on the topic?
Robert F. Engle, C.W.J. Granger. Co-integration and error correction: representation, estimation, and testing // Applied Econometrics. - 2015. - 39 (3). - p. 107-135.
This is a translation of the authors' original 1987 article; it describes the definition of cointegration in more detail. You can also continue reading Magnus, whom I recommended in the article on stationary increments; that book also has a section on cointegration.