Building an animated linear graph of a moving average in R. Retrieving data through the NBA API

Continue to analyze basketball data with R.

In contrast to the previous article , which was purely entertaining in nature, the graphics that will be built in this article may be interesting from the point of view of the team’s analysis of the game for the season.

And we will build graphs of the moving average for the three types of NBA team ratings: attacker, defensive and net-rating (i.e. the difference between the first two). In a nutshell about them. The attacker and defensive ratings are the number of points scored / missed by the team for 100 possessions. NET rating - this is also their difference per one hundred possessions. Anyone interested in learning more about them can read the glossary on the basketball-reference site. There is a calculation formula, which I also implemented with the help of R, but so far I have not published an article about it.

Also I will explain why I will build the graph of the moving average. In each individual match, the proportion of randomness is too large, the indicators jump from 70 to 150, which makes data analysis useless, and the graph itself is more like a cardiogram. If we take the cumulative average, the other extreme is obtained: the schedule looks like damped oscillations, and games at the end of the season, when they are added to 70-75 matches already played, have little effect on the overall figure. Roughly speaking, their "not visible." The moving average in this case is the way out of the stalemate. On the one hand, the influence of chance decreases, on the other hand, there is no excessive accumulation of results. Basketball statistics usually make a 10-match moving average.

Used libraries

library(httr) library(jsonlite) library(tidyverse) library(lubridate) library(zoo) library(ggthemes) library(gganimate)

Getting data using the NBA API

Last time I received data using the NBA Data Retriever extension. This time I will use the NBA API to directly load the necessary data into R.

First, find out where this data comes from. To do this, open the page we need on stats.nba.com and go to the developer tools. Then open Network -> XHR and press F5. In the list that appears, we find a file with a name similar to the name of the page. We need him. After we make sure that we have chosen the correct file, we copy its address in R. In the pictures it looks like this.

open the desired file

the file should look like this

we copy in the R address

Now go to work in R Studio . To get the information we need, use the GET function of the http package. However, in order for the request to be executed correctly (this can be checked by the status_code function, it must be 200), you need to add headers to define the operating parameters of the HTTP transaction.

 ##Adding headers request_headers <- c( "accept-encoding" = "gzip, deflate, sdch", "accept-language" = "en-US,en;q=0.8", "cache-control" = "no-cache", "connection" = "keep-alive", "host" = "stats.nba.com", "pragma" = "no-cache", "upgrade-insecure-requests" = "1", "user-agent" = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9" ) #Getting a response request <- GET(adv_box_team, add_headers(request_headers))

We get an answer like this:

But while the data we need is not visible. To get them, we first extract the content the request into the json file with the content function, and then convert it to the list with a function from the jsonlite package with the saying name fromJSON

 boxscore_data <- fromJSON(content(request, as = "text"))

As a result, we get a list that already contains all the information we need and then we just bring it into the form that is needed for work.

Data preparation

To do this, we will do a data table instead of a list, and then add column headings.

 #Convert to tibble data and assigning column names table <- tbl_df(data.frame(boxscore_data$resultSets$rowSet[[1]], stringsAsFactors = FALSE)) names(table) <- toupper(boxscore_data$resultSets$headers[[1]])

toupper is a function that replaces all characters with uppercase. After that, we should have a table with 2460 rows and 46 columns. In principle, it is possible to work with the table in this form, but it is better to exclude unnecessary information for more convenient and quick work.

 ##Select the columns you want to analyze rating <- table %>% select(TEAM_ID, TEAM_ABBREVIATION, TEAM_NAME, GAME_ID, GAME_DATE, MATCHUP, WL, E_OFF_RATING, E_DEF_RATING, E_NET_RATING)

If you look at the source table, you can see two types of the same rating: “normal” and with the prefix E. Without going into details, the E-rating takes into account the pace of the game, therefore it is more accurate. And take it.

Then I want to simplify the names of the ratings. They will have to be entered into the function arguments and it is better to use more familiar to a wide range of users notation: ORTG, DRTG, NRTG. Here you can "get confused" with writing a regular expression and replacing with str_replace , but writing them is still a pleasure and here we can perfectly do without them. We just need to extract the 3, 7, 9, and 12 characters of the current titles, merge them and replace the column names with the resulting character vector. We do all this using the functions of the stringr package: str_sub and str_c (an analogue of the base paste0 ).

 ## Renaming columns with E_OFF_RATING on ORTG rating1 <- rating %>% rename_at(vars(starts_with("E_")), list(~str_c(str_sub(., start = 3, end = 3), str_sub(., start = 7, end = 7), str_sub(., start = 9, end = 9), str_sub(., start = 12, end = 12))))

at in dplyr package dplyr has the same property as the dt[, lapply(.SD, func), .SDols = col1] in the data.table package: the action is applied to several columns simultaneously. Here we select all columns whose names begin with "E_".

As a result, we have the following table, with which we will work in the future:

TEAM_ID	TEAM_ABBREVIATION	TEAM_NAME	GAME_ID	GAME_DATE	MATCHUP	WL	ORTG	DRTG	NRTG
1610612749	MIL	Milwaukee bucks	0021801226	2019-04-10T00: 00: 00	MIL vs. OKC	L	102.4	116.8	-14.4
1610612766	CHA	Charlotte hornets	0021801222	2019-04-10T00: 00: 00	CHA vs. Orl	L	121.4	130.1	-8.6
1610612758	SAC	Sacramento Kings	0021801230	2019-04-10T00: 00: 00	SAC @ POR	L	129.7	136.4	-6.8
1610612748	MIA	Miami heat	0021801221	2019-04-10T00: 00: 00	MIA @ BKN	L	84.2	103.6	-19.4
1610612750	MIN	Minnesota Timberwolves	0021801228	2019-04-10T00: 00: 00	MIN @ DEN	L	98.3	103.7	-5.4

The rolling_offnet_rating_nba function for plotting and animating a moving average.

Again, like last time, we will create a function to make minimal changes to the calculations.

The rolling_offnet_rating_nba function has the following form:

 rolling_offnet_rating_nba <- function(table, name, variable, col1 = col1, col2 = col2)

table is the name of the data table
name is the abbreviation of the command for which graphics will be made ("BOS", "LAL", etc.).
variable - rating that will be calculated (here are two options, ORTG or NRTG, I made a separate function for the protective rating)
col1 and col2 - line color when the value is above / below the average.

Most dplyr functions use non-standard evaluation (NSE ). This is a general term meaning that their evaluation differs from the usual evaluation in R. This makes it easier to write code and work with SQL databases, but the downside is that we cannot replace the value with an equivalent object defined elsewhere.

Dplyr uses Tidy evaluation . Therefore, it is necessary to use special tools (citation functions, operator !!) to solve the problems that arise while programming. You can read more about this here and see here .

The following code takes the name of the function argument and writes the expression that was presented to it. (To understand the operation of the functions enquo and others like it, it is useful to print the output of this function)

 ##Return the entered value in the function argument in the type quosure quo_rating <- enquo(variable) quo_col1 <- enquo(col1) quo_col2 <- enquo(col2)

Next we change the data format of some columns: GAME_DATE from the character column is made in the Date format, and the rating columns are made numeric. Because we apply the as.numeric function to three columns, then instead of mutate use mutate_at . And we sort everything in ascending order of date.

 ##Changing the data type of multiple columns test1 <- table %>% mutate(GAME_DATE = as.Date(ymd_hms(GAME_DATE))) %>% mutate_at(vars(ORTG:NRTG), list(~as.numeric)) %>% arrange(GAME_DATE)

And then we calculate the 10-match moving average of the team we need. For this we use the rollmeanr function from the zoo package. r at the end of the name means that the result should be right aligned. For the first nine games of the season, a moving 10-match average is simply impossible to calculate, so we leave these fields without values, filling them with NA using the fill argument. na.omit removes from the table the rows in which these NAs occur.

 ##The calculation of the moving average team <- test1 %>% filter(TEAM_ABBREVIATION == "DAL") %>% mutate(RATING = rollmeanr(ORTG, k = 10, fill= NA)) %>% na.omit(test1)

The team table looks like this:

TEAM_ID	TEAM_ABBREVIATION	TEAM_NAME	GAME_ID	GAME_DATE	MATCHUP	WL	ORTG	DRTG	NRTG	RATING
1610612742	Dal	Dallas mavericks	0021800150	2018-11-06	DAL vs. Wass	W	116.8	99.2	17.6	105.51
1610612742	Dal	Dallas mavericks	0021800160	2018-11-07	DAL @ UTA	L	98.5	112.0	-13.6	104.92
1610612742	Dal	Dallas mavericks	0021800181	2018-11-10	DAL vs. OKC	W	115.0	101.1	13.9	104.13
1610612742	Dal	Dallas mavericks	0021800193	2018-11-12	DAL @ CHI	W	98.3	91.0	7.3	103.03
1610612742	Dal	Dallas mavericks	0021800210	2018-11-14	DAL vs. Uta	W	117.3	65.8	51.6	105.34

In principle, we have already received the necessary information. You can create a line graph using two lines of code. But the black line on a white background is of little interest from both aesthetic and informative points of view. A further part of the "function body" corrects this.

To begin, we will add data on the average, 10th and 21st (tenth below) rating values, as well as the date of the team’s 10th match (that is, the first for which the moving average is calculated and which remained in the team table after deleting the rows from the NA) .

 ##The average, 10 and 21 ratings in the entire League. average <- league %>% mutate(average = mean(!! quo_rating)) %>% select(average) %>% unique() %>% .$average top10 <- league %>% arrange(desc(!! quo_rating)) %>% select(!! quo_rating) %>% slice(10) top10 <- top10[[1]] bottom10 <- league %>% arrange(desc(!! quo_rating)) %>% select(!! quo_rating) %>% slice(21) bottom10 <- bottom10[[1]] ##Getting the date of the first rollaverage data <- team %>% select(GAME_DATE) %>% arrange(GAME_DATE) data <- data[[1,1]]

Of the previously unused functions, the slice function appears here, which selects the lines by their ordinal number.

Next we choose 2 colors and their name. The data, as well as last time, we take from the table table_color . The name will be used in the header of the graph, to explain which of the colors corresponds to values below the average and which one is higher.

 ##Getting color and color_name selected color color1 <- table_color %>% filter(TEAM_ABBREVIATION == name) %>% select(!! quo_col1) color1 <- color1[[1]] color2 <- table_color %>% filter(TEAM_ABBREVIATION == name) %>% select(!! quo_col2) color2 <- color2[[1]] name1 <- paste0("name_", quo_name(quo_col1)) name2 <- paste0("name_", quo_name(quo_col2)) name_color1 <- table_color %>% filter(TEAM_ABBREVIATION == name) %>% select(name1) name_color1 <- name_color1[[1]] name_color2 <- table_color %>% filter(TEAM_ABBREVIATION == name) %>% select(name2) name_color2 <- name_color2[[1]]

The arguments of the function default values are col1 and col2, this is the first and second colors of commands. In most cases (more precisely in 26), these values do not need to be changed, however, for the four teams the following color should be used in their color palette. In Dallas and Minnesota, the first and second colors are too similar, while Milwaukee and Brooklyn are not visible on a white background. Both that, and another makes it difficult to read the graph, so for them it is worth using the argument col2 = col3.

Next we get the maximum value of the rating for the team. We will need this value to position the text with the rating value on the chart. I want to draw attention to the last line of the code. It so happened that the functions perfectly built the graphs in 89 of 90 cases, but when building a protective rating, Milwaukee gave an error. It turned out that the maximum rating value in Milwaukee is reached twice and ggplot2 naturally begins to swear that aesthetic should be, in our case, either 1 or 73. Therefore, we need the only maximum rating value.

 ##The maximum value of the rating max <- team %>% filter(RATING == max(RATING)) %>% select(RATING) max <- max[[1]]

Build static graphics in `ggplot2`

 ##Building and save a static chart Sys.setlocale("LC_ALL", "C") gg <- ggplot(team, aes(GAME_DATE, RATING)) + geom_hline(yintercept = c(top10, bottom10), col = c("red", "blue")) + annotate(geom = "text", x = as.Date(data) + 2, y = top10 - 0.2, label = "TOP 10", col = "red") + annotate(geom = "text", x = as.Date(data) + 2, y = bottom10 + 0.2, label = "BOTTOM 10", col = "blue") + geom_line(size = 2, col = if_else(team$RATING > average, color1, color2)) + theme_tufte() + labs(title = paste0(team$TEAM_NAME, " 10-Game Rolling ", quo_name(quo_rating)), subtitle = paste0(paste0(name_color1, " - above average ", quo_name(quo_rating)), "\n", paste0(name_color2, " - below average ", quo_name(quo_rating))), caption = "Source: BBall Index Data & Tools\nTelegram: @NBAatlantic, twitter: @vshufinskiy") theme(plot.title = element_text(size = 12, hjust = 0.5), plot.caption = element_text(size = 10), plot.subtitle = element_text(size = 9)) ggsave(paste0(unique(team$TEAM_NAME), quo_name(quo_rating), ".jpeg"), gg, width = 8, units = "in")

From the new here is the use of the if_else function to change the color of the line depending on whether the average rating of the League is higher or lower, as well as the first line changing the locale. This was done to ensure that the abbreviations of the names of the months on the X axis were written in English.

Building a 10-match moving average animation.

In the construction of the animation, I added a few lotions that are impossible in the static version. First, the changing date (in the same way as in the last article changed the year), as well as the value of the rating at a particular point in time. It also changes color depending on whether it is above or below the average.

 ##Building animations anim <- gg + theme(plot.title = element_text(hjust = 0.5, size = 25), plot.subtitle = element_text(size = 15), plot.caption = element_text(size = 15), axis.text = element_text(size = 15), axis.title = element_text(size = 18)) + geom_text(aes(x = as.Date(data), y = max + 0.5), label = paste0(quo_name(quo_rating)," ", round(team$RATING, digits = 1)), size = 6, col = if_else(team$RATING > average, color1, color2)) + transition_reveal(GAME_DATE) + labs(title = paste0(team$TEAM_NAME, " 10-Game Rolling ", quo_name(quo_rating)), subtitle = paste0(paste0(name_color1, " - above average ",quo_name(quo_rating)), "\n", paste0(name_color2, " - below average ",quo_name(quo_rating)), "\n", "Date: {frame_along}"), caption = paste0("Source: stats.nba.com\nTelegram: @NBAatlantic, twitter: @vshufinskiy"))

Result

The graph is pretty obvious that Dallas sagged in the second half of February-March. The explanation for this is very simple: it was at this moment of the season that the Mavericks exchanged 4 out of 5 players of their starting five, and the main asset that arrived, Latvian Kristaps Porzingis, did not play a single minute due to the rupture of the cruciate ligaments.

Here I will not delve into the sports component, so if anyone is interested to see the remaining 89 schedules of the season 2018-19, then you are welcome to my blog on sports.ru , where I plan to write an article reviewing the most interesting of them or my Telegram nba channel where i'm going to lay them all out.

GitHub Repository

Source: https://habr.com/ru/post/459142/

All Articles