Machine learning for tennis prediction: part 1

Mathematical modeling of tennis is rapidly gaining popularity. Every year new analytical models and services appear, competing on how accurately they predict the outcomes of tennis matches. The driver is the rapidly growing online sports betting market: the total amount bet on a single professional tennis match can reach millions of dollars.

In this review I look at the basic mathematical methods for predicting tennis matches, namely hierarchical Markov models and machine learning algorithms, and analyze the cases of IBM, Microsoft, and one Russian service that use machine learning to predict the results of tennis matches.


Contents
Part 1
Introduction to the Tennis Prediction Problem
Data for tennis
Sports betting

Statistical models

Part 2
Machine learning in tennis
Machine learning models

Logistic regression
Neural networks
Support Vector Machine
Other ML algorithms

Machine learning problems
ML case studies in tennis prediction

IBM
Microsoft
OhMyBet!

Introduction to the Tennis Prediction Problem

Tennis is a great spectacle and big money. The Association of Tennis Professionals (ATP) holds more than 60 professional tournaments in 30 countries every year. The broadcast of Andy Murray's match against Milos Raonic in the 2016 Wimbledon final was watched by over 13.3 million people in the UK alone. Tennis betting is catching up with football in popularity. On Betfair, the world's largest online betting exchange, the total amount bet on the Murray-Djokovic match in the 2013 Wimbledon final was 63 million dollars. Potential profits and scientific interest have led to a surge in research on algorithms for accurately predicting tennis matches.

The scoring system in tennis is hierarchical: a match consists of sets, which consist of games, which consist of individual points. Most modern approaches to tennis prediction use this structure to derive hierarchical expressions, based on Markov chains, for the probability of a player winning a match. If we assume that points in tennis are independent and identically distributed (IID) [1], then to obtain such an expression we only need to know the probability of each player winning a point on serve. From this basic statistic, which is easy to obtain from historical data on the Internet, we can calculate the probability that each player wins a game, then a set, and finally the match.

For all its elegance, this approach is not ideal. By representing a player's quality with a single parameter (points won on serve), it cannot account for subtler factors that also affect the outcome of a match. For example, a player's commitment to a particular strategy, time since injury, and accumulated fatigue from previous matches affect a hierarchical-model forecast only indirectly. Moreover, the characteristics of the match itself, such as court surface, location, and weather, are not considered at all in such a forecast.

Given the huge amount of historical tennis data, an alternative approach to predicting tennis matches is machine learning. The parameters of the players and of the match, together with the match result, can form a training set. A supervised machine learning algorithm can use this set to build a function that predicts the results of new matches.

Although machine learning is a natural fit for tennis prediction, until recently this approach attracted much less attention from researchers than stochastic hierarchical methods. Most studies applying machine learning to tennis use logistic regression and neural networks. The ROI of the most accurate model described in the scientific literature is 4.35%, which, according to its author, is 75% better than modern stochastic models [2].

Most online tennis forecasting services (tipsters aside) use stochastic models: they offer users each player's probability of winning along with accompanying statistics, which users are expected to analyze themselves. I will consider more interesting cases, in which machine learning algorithms analyze not only the probabilities of winning a point on serve but also historical statistics on the players and the parameters of the match. These are the cases of giants such as IBM and Microsoft, as well as the Russian service OhMyBet!, all of which predict tennis using machine learning algorithms.

But first things first.

Data for tennis

Historical data on tennis matches is widely available on the Internet. Official tournament websites, for example www.atpworldtour.com, provide information about players and match results, as well as each player's statistics for every match. Some sources, such as www.tennis-data.co.uk, provide historical data in structured form (CSV or Excel files). Paid databases are also available; they are more detailed, cover longer periods, and are more accurate, for example the OnCourt database.

The most relevant data that can be obtained from such databases are presented in the table below.

Player details	Name
	Date of birth
	Country
	Prize money
	Ranking points
	Overall ATP or WTA ranking
Match details	Tournament name
	Type of tournament (e.g. Grand Slam)
	Court surface
	Location (country, coordinates)
	Date
	Result (set score)
	Prize money
	Odds (from Pinnacle)
Match statistics for both players	First serve percentage
	Aces
	Double faults
	Unforced errors
	Percentage of points won on first serve
	Percentage of points won on second serve
	Percentage of points won on return
	Winners
	Break points (won, total)
	Net approaches (won, total)
	Total points won
	Fastest serve
	Average serve speed
	Average second serve speed
	Odds (from Pinnacle)

Data such as set and point statistics for each player can be important for simulating a match. It can be obtained by scraping sites such as flashscore.com. It is worth noting that for many tournaments Hawk-Eye ball-tracking technology yields higher-quality and more detailed data, for example the position of the ball and of the players at any moment of the match. However, the ATP, which owns this data, does not license it to third parties.

Sports betting

There are two main categories of tennis bets, pre-match and live (in-play), which differ in the level of the odds. In addition, you can bet not only on the winner of the match but also on many other outcomes, for example the score in individual sets or the total number of games. Most predictive models focus on pre-match bets on the match winner, because this type of bet has the most historical odds data available, which allows the most complete assessment of a predictive model's effectiveness.

Bets on tennis matches can be placed either with bookmakers (online and offline) or on betting exchanges. Traditional bookmakers (for example, Pinnacle) set odds for the different outcomes of a match, and the client (bettor) plays against the bookmaker. On betting exchanges (for example, Betfair), customers bet against odds set by other bettors. The exchange matches its clients' bets and earns a commission on each settled bet.

Odds, implied probability and ROI

The odds express the payout that the bettor receives for correctly guessing the outcome of an event. For example, if the bettor correctly predicts the victory of a player whose odds are 3.00, he receives $2 of profit for each dollar staked (in addition to the stake itself, which is returned). If the prediction is wrong, he loses only his stake, regardless of the odds. There are different systems for writing odds, the most popular being decimal, or European (1.50, 2.00, 2.50, etc.), and fractional, or British (1/2, 1/1, 6/4, etc.).

Odds express the implied probability of the match outcome, that is, the bookmaker's estimate of the true probability. In the example above, with odds of 3.00, the implied probability p of the player winning the match is 1/3.00 ≈ 33%.

The table below shows the different systems for writing odds and the corresponding implied probabilities.

Decimal (Europe)	Fractional (UK)	American (US)	Hong Kong	Indonesia	Malaysia	Implied probability
1.50	1/2	-200	0.50	-2.00	0.50	1/1.50 ≈ 67%
2.00	1/1 (evens)	+100	1.00	1.00	1.00	1/2.00 = 50%
2.50	6/4	+150	1.50	1.50	-0.67	1/2.50 = 40%
3.00	2/1	+200	2.00	2.00	-0.50	1/3.00 ≈ 33%

Conversion formulas

From (x)	To	Operation
Decimal	Fractional	x - 1, then convert to a fraction
Decimal	American	100 * (x - 1) if x >= 2; -100 / (x - 1) if x < 2
Fractional	Decimal	divide out the fraction, then x + 1
Fractional	American	divide out the fraction, then 100 * x if x >= 1; -100 / x if x < 1
American	Decimal	(x / 100) + 1 if x > 0; (-100 / x) + 1 if x < 0
American	Fractional	x / 100 if x > 0; -100 / x if x < 0
Decimal	Hong Kong	x - 1
Hong Kong	Indonesia	x if x >= 1; -1 / x if x < 1
Hong Kong	Malaysia	x if x <= 1; -1 / x if x > 1

Source: Wikipedia
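The conversion rules in the table are easy to check in code. Below is a minimal Python sketch of a few of them; the function names are my own, chosen only for illustration, and decimal odds of exactly 2.00 are treated as +100 in the American system.

```python
from fractions import Fraction


def _check_decimal(x):
    # Decimal odds include the returned stake, so they must exceed 1.
    if x <= 1:
        raise ValueError("decimal odds must be greater than 1")


def implied_probability(x):
    """Implied probability of an outcome offered at decimal odds x."""
    _check_decimal(x)
    return 1 / x


def decimal_to_fractional(x):
    """Decimal -> fractional: subtract 1, then express as a fraction."""
    _check_decimal(x)
    return Fraction(str(x)) - 1


def decimal_to_american(x):
    """Decimal -> American (moneyline) odds."""
    _check_decimal(x)
    return 100 * (x - 1) if x >= 2 else -100 / (x - 1)


def american_to_decimal(x):
    """American (moneyline) -> decimal odds."""
    return x / 100 + 1 if x > 0 else -100 / x + 1


def decimal_to_hong_kong(x):
    """Decimal -> Hong Kong odds (net profit per unit staked)."""
    _check_decimal(x)
    return x - 1


def hong_kong_to_indonesian(x):
    """Hong Kong -> Indonesian odds."""
    return x if x >= 1 else -1 / x


def hong_kong_to_malay(x):
    """Hong Kong -> Malay odds."""
    return x if x <= 1 else -1 / x


if __name__ == "__main__":
    for odds in (1.5, 2.0, 2.5, 3.0):
        hk = decimal_to_hong_kong(odds)
        print(odds, decimal_to_fractional(odds), decimal_to_american(odds),
              hk, hong_kong_to_indonesian(hk), hong_kong_to_malay(hk),
              round(implied_probability(odds), 2))
```

Running the loop should reproduce the rows of the odds table above, up to rounding and the fact that `Fraction` reduces 6/4 to 3/2.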

Profit relative to the amount staked over a period of time is called return on investment (ROI). In sports betting, ROI is the percentage won per bet, averaged over a series of bets. For a fixed bet size, the simplified ROI formula looks like this:
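Using the symbols explained in the next sentence, the fixed-stake formula can be written as total profit divided by total amount staked. This is my reconstruction from those definitions, assuming ROI is expressed as a percentage:

```latex
\mathrm{ROI} = \frac{P_n}{s \cdot n} \times 100\%
```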

where P_n is the total profit over the series, s is the size of one bet, and n is the number of bets in the series. ROI is the main indicator of a bettor's success and, accordingly, the target measure of a predictive model's effectiveness.

Measuring a model's effectiveness by the ROI calculated against historical market odds is the generally accepted approach in this research area (including [2], [4], [7]). If instead the model's accuracy (the percentage of correct predictions) is chosen as the target, then by trivially filtering for matches with low odds (1.01-1.3) you can reach an accuracy of 90% or more, but the ROI will be negative, since at such odds the small profit on each win typically does not cover the occasional lost stake.
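To make the difference between accuracy and ROI concrete, here is a small Python sketch of fixed-stake ROI, computed as total profit divided by total amount staked. The function name `roi` and the sample bets are invented for illustration and are not taken from any of the cited studies.

```python
def roi(bets, stake=1.0):
    """Fixed-stake ROI in percent.

    bets: iterable of (decimal_odds, won) pairs, one per bet placed.
    """
    profit = 0.0
    n = 0
    for decimal_odds, won in bets:
        # A winning bet earns stake * (odds - 1); a losing bet costs the stake.
        profit += stake * (decimal_odds - 1) if won else -stake
        n += 1
    if n == 0:
        raise ValueError("at least one bet is required")
    return 100 * profit / (stake * n)


if __name__ == "__main__":
    # Hypothetical series: ten bets on heavy favourites at odds 1.10,
    # nine of which win (90% accuracy).
    favourites = [(1.10, True)] * 9 + [(1.10, False)]
    print(f"accuracy 90%, ROI {roi(favourites):.1f}%")
```

In this made-up series nine predictions out of ten are correct, yet the ROI is about -1%: the nine small wins add up to 0.9 units, while the single loss costs a full unit.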

Betting strategies

Knowing the odds and the implied probability of a match outcome, you can make different decisions about how much to bet and whether to bet at all. Different strategies naturally end up with different ROI. As a rule, three basic strategies are used to assess the effectiveness of a predictive model. Let
s_i = the size of the bet on player i
p_i^bettor = the bettor's estimate of the probability that player i wins
b_i = the net odds on player i, calculated as x - 1 for decimal odds x, or as x / y for fractional odds x/y
p_i^implied = the implied probability that player i wins, calculated as 1 / x for decimal odds x, or as y / (y + x) for fractional odds x/y

1. Bet on the predicted winner

In the simplest strategy, the bettor always stakes a fixed amount q on the predicted winner:
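In the notation introduced above, this rule can be written as follows. This is a reconstruction from the definitions, taking "predicted winner" to mean the player whose estimated probability exceeds 0.5:

```latex
s_i =
\begin{cases}
q, & \text{if } p_i^{\text{bettor}} > 0.5 \\
0, & \text{otherwise}
\end{cases}
```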

2. Bet on the predicted winner at favorable odds

The bettor can increase profits by placing the fixed bet q only on matches where he has an edge over the bookmaker, that is, where the bettor's estimate of the probability of player i winning is higher than the probability implied by the bookmaker's odds. In other words, this strategy avoids betting on the predicted winner unless the odds adequately compensate for the risk of the bet.
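Read literally, the condition in this paragraph gives the following rule. It is my reconstruction; whether the bettor's estimate must also exceed 0.5 is not stated here, so I leave that out:

```latex
s_i =
\begin{cases}
q, & \text{if } p_i^{\text{bettor}} > p_i^{\text{implied}} \\
0, & \text{otherwise}
\end{cases}
```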

3. Bet on the predicted winner by the Kelly criterion

In the previous strategy, the bettor stakes a fixed amount whenever, by his estimate, he has an edge over the bookmaker, regardless of the size of that edge. The Kelly criterion, described by John Kelly in 1956 [3], can be used to determine the optimal bet size from the bettor's estimated edge and the size of his bankroll. It has been proven that in the long run the Kelly criterion outperforms all other strategies.

The bettor stakes a fraction of the maximum bet q on the predicted winner if, by his estimate, he has an edge:
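The standard Kelly fraction for net odds b and win probability p is (p(b + 1) - 1) / b. Applied to the notation above, and assuming the same edge condition as in the second strategy, the stake can be reconstructed as:

```latex
s_i =
\begin{cases}
q \cdot \dfrac{p_i^{\text{bettor}}\,(b_i + 1) - 1}{b_i}, & \text{if } p_i^{\text{bettor}} > p_i^{\text{implied}} \\
0, & \text{otherwise}
\end{cases}
```

The numerator is positive exactly when the bettor's estimate exceeds the implied probability 1 / (b_i + 1), so the stake is never negative.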

Strictly speaking, the maximum bet size q is a percentage of the bettor's bankroll, which changes over time depending on the success of previous bets. When evaluating predictive models, q is often taken as a constant, so that all bets have the same weight in the resulting ROI.

Note that none of the three strategies ever bets on both players. Also, while the first strategy bets on every match for which the model gives a prediction (provided the estimated probability is never exactly 0.5), the second and third strategies skip some matches.
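The three staking rules can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are invented, odds are given in decimal form, q is a constant, and the Kelly stake uses the standard formula q * (p(b + 1) - 1) / b with b the net odds.

```python
def _implied(decimal_odds):
    # Probability implied by decimal odds; odds must exceed 1.
    if decimal_odds <= 1:
        raise ValueError("decimal odds must be greater than 1")
    return 1 / decimal_odds


def stake_predicted_winner(p_bettor, q=1.0):
    """Strategy 1: fixed stake q whenever the player is the predicted winner."""
    return q if p_bettor > 0.5 else 0.0


def stake_with_edge(p_bettor, decimal_odds, q=1.0):
    """Strategy 2: fixed stake q only if the bettor's estimate beats the implied probability."""
    return q if p_bettor > _implied(decimal_odds) else 0.0


def stake_kelly(p_bettor, decimal_odds, q=1.0):
    """Strategy 3: Kelly fraction of the maximum stake q, zero without an edge."""
    if p_bettor <= _implied(decimal_odds):
        return 0.0
    b = decimal_odds - 1  # net odds
    return q * (p_bettor * (b + 1) - 1) / b


if __name__ == "__main__":
    # Hypothetical match: the model gives the player a 60% chance.
    for odds in (1.5, 2.0, 3.0):
        print(odds,
              stake_predicted_winner(0.6),
              stake_with_edge(0.6, odds),
              round(stake_kelly(0.6, odds), 3))
```

At odds of 1.5 the implied probability (about 67%) is higher than the model's 60%, so only the first strategy bets; at longer odds the Kelly stake grows with the size of the edge.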

Statistical models

Most modern tennis prediction models use hierarchical stochastic expressions based on Markov chains. An overview of the underlying concepts follows.

Markov models

Klaassen and Magnus [1] challenged the IID hypothesis, showing that points in tennis are neither independent nor identically distributed. However, they also showed that the deviations from IID are so small that this assumption often gives a good approximation. We can therefore assume that the outcome of each point in a match does not depend on the preceding points. Suppose further that we know the probability of each player winning a point on serve. Let p be the probability that player A wins a point on his serve, and q the probability that player B wins a point on his serve. Using the IID assumption and these point-winning probabilities, we can construct a Markov chain describing the probability of a player winning a game.

Formally, a Markov chain is a system of transitions between states in a state space. An important property of the system is memorylessness: the next state depends only on the current state and not on the sequence of states that preceded it. If we take the score in a game as the state space, and the probabilities that player A wins or loses a point as the transition probabilities, we get a Markov chain that reflects the stochastic progression of the score in a game. The figure below shows the chain for a single game on player A's serve. Denoting by p the probability of winning a point on serve and accepting the IID assumption, all transitions corresponding to a point won by player A have the same probability p, and all transitions corresponding to a point lost have probability 1 - p.

Markov chain for a game in which player A serves [2].

Because of the hierarchical structure of a tennis match, further Markov chains are built to model the progression of the score in tie-breaks, sets, and the match. For example, in the match model each non-terminal state has two outgoing transitions, labeled with the probabilities of the player winning and losing an individual set. Diagrams of such models can be found in [4].
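The match-level chain is the simplest one to write down. The Python sketch below walks it recursively, under the simplifying assumption that the player wins every set with the same probability `p_set`; the function name and parameters are hypothetical, and the models in [4] are more detailed.

```python
from functools import lru_cache


def match_win_probability(p_set, best_of=3):
    """Probability of winning a match, given a constant probability of winning a set."""
    if best_of % 2 == 0 or best_of < 1:
        raise ValueError("best_of must be a positive odd number")
    sets_to_win = best_of // 2 + 1

    @lru_cache(maxsize=None)
    def win(a, b):
        # State (a, b): sets won so far by the player and by the opponent.
        if a == sets_to_win:
            return 1.0  # terminal state: match won
        if b == sets_to_win:
            return 0.0  # terminal state: match lost
        # Two outgoing transitions from every non-terminal state.
        return p_set * win(a + 1, b) + (1 - p_set) * win(a, b + 1)

    return win(0, 0)


if __name__ == "__main__":
    print(match_win_probability(0.6, best_of=3))
    print(match_win_probability(0.6, best_of=5))
```

A small edge per set is amplified at match level, and more so in the longer format, which is one reason best-of-five matches tend to favour the stronger player.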

Hierarchical expressions

Building on the idea of modeling tennis matches with Markov chains, Barnett and Clarke [5] and O'Malley [6] developed hierarchical expressions for the probability of a given player winning the entire match.

Barnett and Clarke describe the probability P_game of player A winning a game on his own serve using the following recursive definition:
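Consistent with the Markov chain described above, where each point is won with probability p and lost with probability 1 - p, the recursion can be written as follows (my reconstruction; see [5] for the authors' exact formulation):

```latex
P_{\text{game}}(x, y) = p \, P_{\text{game}}(x + 1, y) + (1 - p) \, P_{\text{game}}(x, y + 1)
```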

The boundary values are as follows:
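As I understand the model in [5], the boundary values cover a game won outright, a game lost outright, and deuce, with points counted as 0, 1, 2, 3, 4 rather than 0, 15, 30, 40:

```latex
P_{\text{game}}(x, y) = 1 \quad \text{if } x = 4,\ y \le 2
\\
P_{\text{game}}(x, y) = 0 \quad \text{if } y = 4,\ x \le 2
\\
P_{\text{game}}(3, 3) = \frac{p^2}{p^2 + (1 - p)^2}
```

The deuce value follows from a short argument: from deuce the server either wins two points in a row (probability p^2), loses two in a row (probability (1 - p)^2), or returns to deuce, so the probability D of eventually winning satisfies D = p^2 + 2p(1 - p)D.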

In these expressions, p is the probability of player A winning a point on serve, and x and y are the numbers of points won by players A and B, respectively. This expression corresponds exactly to the Markov chain in the figure above.

Barnett and Clarke define a similar expression for the probability of winning a set, based on the probabilities of winning individual games and tie-breaks (which in turn depend on the probabilities of winning a point on serve). Finally, the probability of winning the match is calculated from the previously defined expressions. As a result, the final expression for the probability of winning the match depends only on each player's probability of winning a point on serve.
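The game-level recursion is short enough to evaluate directly. The Python sketch below is my own illustration rather than code from [5]: it counts points as 0 to 4, treats 3-3 as deuce with the closed-form value p^2 / (p^2 + (1 - p)^2), and uses a hypothetical function name.

```python
from functools import lru_cache


def game_win_probability(p):
    """Probability that the server wins a game, given probability p of winning a point on serve."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")

    @lru_cache(maxsize=None)
    def prob(x, y):
        # x, y: points won so far by the server and by the receiver.
        if x == 4 and y <= 2:
            return 1.0  # server has won the game
        if y == 4 and x <= 2:
            return 0.0  # receiver has won the game
        if x == 3 and y == 3:
            # Deuce: win two points in a row before the opponent does.
            return p * p / (p * p + (1 - p) * (1 - p))
        return p * prob(x + 1, y) + (1 - p) * prob(x, y + 1)

    return prob(0, 0)


if __name__ == "__main__":
    for p in (0.5, 0.55, 0.6, 0.65, 0.7):
        print(p, round(game_win_probability(p), 4))
```

Even a modest advantage per point becomes a large advantage per game: a server who wins 60% of points on serve wins roughly 74% of service games under this model.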

Estimating the probability of winning a point on serve

The question remains how to estimate these probabilities of winning a point on serve for matches not yet played. Barnett and Clarke give a method for estimating them from historical player statistics:

where
f_i - percentage of points won on serve by player i
g_i - percentage of points won on return by player i
a_i - first serve percentage of player i
a_av - average first serve percentage across all players
b_i - percentage of points won on first serve by player i
c_i - percentage of points won on second serve by player i
d_i - percentage of points won returning first serve by player i
e_i - percentage of points won returning second serve by player i
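As I recall the method from [5], serve points are a weighted average over first and second serves, and return points are weighted by the average first serve percentage of all players. The symbols are the ones listed below; treat the exact form as a reconstruction and check it against the paper:

```latex
f_i = a_i b_i + (1 - a_i) c_i
\\
g_i = a_{av} d_i + (1 - a_{av}) e_i
```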

Then, for a match between players A and B, we can estimate the probabilities of winning a point on serve for players A and B as f_AB and f_BA, respectively, using the following equation:

where
f_t - average percentage of points won on serve in the tournament
f_av - average percentage of points won on serve across all players
g_av - average percentage of points won on return across all players
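As I recall it from [5], the equation starts from the tournament average and adjusts it by the server's excess serving ability and the receiver's excess returning ability. This is a reconstruction that uses the symbols listed below:

```latex
f_{AB} = f_t + (f_A - f_{av}) - (g_B - g_{av})
\\
f_{BA} = f_t + (f_B - f_{av}) - (g_A - g_{av})
```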
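The estimation procedure can be sketched in Python as follows. The formulas are written as I understand them from Barnett and Clarke [5]: f_i = a_i b_i + (1 - a_i) c_i, g_i = a_av d_i + (1 - a_av) e_i, and f_AB = f_t + (f_A - f_av) - (g_B - g_av). The function names and all the numbers in the example are invented for illustration, with percentages expressed as fractions between 0 and 1.

```python
def serve_points_won(a, b, c):
    """f_i: share of points won on serve.

    a: first serve percentage, b: points won on first serve,
    c: points won on second serve.
    """
    return a * b + (1 - a) * c


def return_points_won(a_av, d, e):
    """g_i: share of points won on return.

    a_av: average first serve percentage of all players,
    d: points won returning first serve, e: points won returning second serve.
    """
    return a_av * d + (1 - a_av) * e


def combined_serve_probability(f_t, f_server, f_av, g_receiver, g_av):
    """f_AB: probability that the server wins a point on serve against this receiver."""
    return f_t + (f_server - f_av) - (g_receiver - g_av)


if __name__ == "__main__":
    # Hypothetical statistics for players A and B.
    f_a = serve_points_won(a=0.60, b=0.75, c=0.50)
    f_b = serve_points_won(a=0.65, b=0.70, c=0.48)
    g_a = return_points_won(a_av=0.60, d=0.32, e=0.52)
    g_b = return_points_won(a_av=0.60, d=0.30, e=0.50)
    # Hypothetical tournament and tour-wide averages.
    f_t, f_av, g_av = 0.63, 0.62, 0.38
    f_ab = combined_serve_probability(f_t, f_a, f_av, g_b, g_av)
    f_ba = combined_serve_probability(f_t, f_b, f_av, g_a, g_av)
    print(round(f_ab, 4), round(f_ba, 4))
```

The two resulting values play the role of p and q in the hierarchical model: they are the only inputs needed to compute game, set, and match probabilities.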

Modern models

Modern tennis forecasting models build on the hierarchical stochastic expressions described above. Knottenbelt [7] refined Barnett's model by calculating the probability of winning a point on serve only from matches against the players' common opponents, rather than against all past opponents. This approach reduces the error that arises because the players have faced opponents of different levels in the past.

Madurska [4] further extended Knottenbelt's common-opponent model by using different probabilities of winning a point on serve for different sets. The author thus dropped the IID assumption, and her model reflects the accumulation of a player's physical fatigue during the match.

Knottenbelt's common-opponent model and Madurska's set-by-set model are the most up-to-date statistical models. The authors report ROIs of 6.8% and 19.6%, respectively, against the betting market on 2011 WTA Grand Slam matches. The common-opponent model was also tested on a larger and more diverse sample of 2173 ATP matches played in 2011 and showed an ROI of 3.8%.

To be continued

Bibliography

1. F. J. G. M. Klaassen and J. R. Magnus. Are Points in Tennis Independent and Identically Distributed? Evidence From a Dynamic Binary Panel Data Model. Journal of the American Statistical Association, 96: 500–509, 2001.
2. M. Sipko. Machine Learning for the Prediction of Professional Tennis Matches. Technical report, Imperial College London, London, 2015.
3. J. Kelly. A new interpretation of information rate. IRE Transactions on Information Theory, 2 (3): 917–926, 1956.
4. A. M. Madurska. A Set-By-Set Analysis Method for Predicting the Outcome of Professional Singles Tennis Matches. Technical report, Imperial College London, London, 2012.
5. T. Barnett and S. R. Clarke. Combining player statistics to predict outcomes of tennis matches. IMA Journal of Management Mathematics, 16: 113–120, 2005.
6. J. A. O'Malley. Probability Formulas and Statistical Analysis in Tennis. Journal of Quantitative Analysis in Sports, 4 (2), 2008.
7. W. J. Knottenbelt, D. Spanias, and A. M. Madurska. A common-opponent stochastic model for predicting the outcome of professional tennis matches. Computers and Mathematics with Applications, 64: 3820–3827, 2012.

Source: https://habr.com/ru/post/306944/
