For a start, a little introductory information. My name is Vladislav and my acquaintance with R took place in August last year. I decided to study a programming language because of its applied nature. From childhood I liked to keep sports statistics. With age, this passion transformed into a desire to somehow analyze these figures and, based on data analysis, to produce, if possible, intelligent thoughts. The problem is that sport in recent years has swept a wave of data, dozens of companies are competing among themselves, trying to count, describe and shove into the neuron any action of a football player, basketball player, baseball player on the court. And Excel for analysis is not suitable categorically. So I decided to study R so that the simplest analysis would not take half a day. Already in the course of the study, interest was added to programming as such, but this is already lyrics.
I want to immediately notice that much of what I write in the future already been in the simpsons it was on Habré in the article Create animated histograms with R. This article, in turn, is a translation of the article Create Trending Animated Bar Charts using R with Medium. Therefore, in order to somehow differ from the above articles, I will try to more fully describe what I am doing, as well as those moments that are not in the original article. For example, I used the colors of the NBA commands to fill the columns, not the standard ggplot2
palette, and the data.table
package rather than dplyr
in data processing. The whole thing I have done in the form of a function, so now it is enough just to write the name of the team and the years for which you need to count the number of victories.
For plotting, I used the data on the number of victories of each of the 30 NBA teams in the last 15 seasons. They were collected from stats.nba.com using the NBA Data Retriever extension, which, through the use of the NBA API, produces csv-files with the necessary statistics. Here is the complete data from my project on Github .
library(data.table) library(tidyverse) library(gganimate)
For data processing, I use data.table
(simply because I have become acquainted with this package before). I also load the tidyverse
package tidyverse
, and not a separate ggplot2
so as not to worry if any idea suddenly ggplot2
during the analysis that requires additional loading of the package from this set. In this particular case, you can get by with ggplot2
, the other ggplot2
packages do not participate. Well, gganimate
sets the graphics in motion.
First you need to bring the data in order. In principle, to build graphs, we need 2 of the 79 columns of the table with “raw” data. You can first select the necessary columns, you can first replace some values. I went the second way.
The table in data.table
has the form dt[i, j, by]
, where by is "responsible" for the grouping of elements. I will group by the TeamName column. And there is a snag. This column displays the names of the teams: Lakers, Celtics, Heat, etc. But during the period under review (from the 2004/05 season) several teams changed their names: New Orleans Hornets became New Orleans Pelicans, Charlotte Bobcats returned the historical name Charlotte Hornets, and Seattle Supersonics became Oklahoma City Thunder. This can cause confusion. The following conversions help to avoid this:
table1 <- table[TeamCity == "New Orleans" & TeamName == "Hornets", TeamName := "Pelicans"][ TeamCity == "New Orleans/Oklahoma City" & TeamName == "Hornets", TeamName := "Pelicans"][ TeamName == "Bobcats", TeamName := "Hornets"][ TeamName == "SuperSonics", TeamName := "Thunder"]
For this time period, the changes are minimal, but if you expand it, then by TeamName it will be very difficult to group and you will need to use a more reliable column. In this data, this is TeamID.
To get started, we get rid of the "extra" information, leaving only those columns that we need to work:
table1 <- table1[ , .(TeamName, WINS)]
The data.table
construct. () Replaces the list
function. A more "classic" choice of columns is table1 <- table1[, c("TeamName", "WINS")]
. After that, the table takes the following form:
Teamname | WINS |
---|---|
Suns | 62 |
Heat | 59 |
Spurs | 59 |
Pistons | 54 |
For animation for each season separately, this is enough, but to calculate the total number of victories over the selected period, you need to calculate the cumulative amount of victories.
table1 <- table1[, CumWins := cumsum(WINS), by = "TeamName"]
Using the cumsum
function cumsum
we get the numbers we need. Use: = instead of = allows you to add a new column to the table, I do not overwrite it with a single CumWins column. by = "TeamName"
groups data by team name and the cumulative amount is by = "TeamName"
for each of the 30 teams separately.
Next, I add a column with the year when each season began. The season in the NBA is from October to May, so it falls on two calendar years. In the designation of the season, the year of its beginning, i.e. Season: 2018 on the graph is the season 2018/19 in reality.
The original table has this data. The SeasonID column shows the figure as 2 (year of the beginning of the season), for example, 22004. You can remove the first two using the stringr
package or the basic functions R, but I went a little different way. It turned out that I first use this column to specify the necessary seasons, then delete and create a column with dates again. Extra action.
I did this as follows:
table1 <- table1[,year := rep(seq(2004, 2018), each = length(unique(table1$TeamName)))]
I was "lucky" that for the selected time period the number of teams in the NBA did not change, so I just repeated the numbers from 2004 to 2018 30 times. Again, if you go down in history, this method will be inconvenient due to the fact that the number of teams in each season will be different, therefore it is preferable to use the option to clear the SeasonID column.
Then add the cumrank column.
table1 <- table1[, cumrank := frank(-CumWins, ties.method = "random"), by = "year"]
It represents the ranking of teams in each season by the number of victories and will be used as the values of the X axis. frank
a faster data.table
analog of the base rank
, minus means ranking in descending order (this can also be done with the help of the argument decreasing = TRUE
. What order will teams with the same number of wins go, therefore ties.method = "random"
. Well, all this is grouped within one year.
And the last conversion of the table is the addition of the value_rel
column.
table1 <- table1[, value_rel := CumWins/CumWins[cumrank==1], by = "year"]
This column is the ratio of the number of victories of each team to the highest figure for the year. The best team has this indicator equal to 1, the rest less, depending on the success of the season.
After all additions, the table has the following form:
Teamname | WINS | Cumwins | year | cumrank | value_rel |
---|---|---|---|---|---|
Spurs | 59 | 59 | 2004 | 3 | 0.9516129 |
Spurs | 63 | 122 | 2005 | one | 1.0000000 |
Spurs | 58 | 180 | 2006 | 2 | 0.9729730 |
Spurs | 56 | 236 | 2007 | one | 1.0000000 |
The table shows only one team to clearly show cumulativeness. All these actions are done, as in the change of names, a cascade of square brackets
table1 <- table1[ ,.(TeamName, WINS)][ , CumWins := cumsum(WINS), by = "TeamName"][ ,year := rep(seq(2004, 2018), each = length(unique(table1$TeamName)))][ , cumrank := frank(-CumWins, ties.method = "random"), by = "year"][ , value_rel := CumWins/CumWins[cumrank==1], by = "year"]
You can immediately go to the construction of graphs, but there is still, I think, one important point: the color of the columns on the graph. You can leave the standard ggplot2
palette, but this is a bad option. Firstly, it seems to me that she is not beautiful. And secondly, it makes it difficult to find a team on a chart. For NBA fans, each team is associated with a specific color: Boston is green, Chicago is red, Sacramento is purple, and so on. Therefore, using the color of a command in a column fill helps to identify it faster, despite the abundance of blue and red.
To do this, create a table table_color
with the name of the team and its main color. Colors are taken from teamcolorcodes.com .
Teamname | TEAM_color |
---|---|
Hawks | # E03A3E |
Celtics | # 007A33 |
Nets | # 000000 |
With the color table you need to do another manipulation. Because When constructing a graph, factors are used, then the order of commands changes. First on the list will be Philadelphia 76, as the only owner of the "digital" name, and then according to the alphabet. So we need to arrange the colors in the same order, and then extract the vector containing them from the table. I did this as follows:
table_color <- table_color[order(TeamName)] cols <- table_color[, "TEAM_color"]
We really build only one graph that contains all 450 (15 seasons * 30 teams) victory indicators, and then we “divide” it by the necessary variable (in our case by year) using the functions from the gganimate
package.
gg <- ggplot(table1, aes(cumrank, group = TeamName, fill = as.factor(TeamName), color = as.factor(TeamName))) + geom_tile(aes(y = CumWins/2, height = CumWins, width = 0.7), color = NA, alpha = 0.8)
First we create a graphic object using the ggplot
function. In the argument aes
specify how the variables from the table will be displayed on the graph. We group them by TeamName, fill
and color
will be responsible for the color of the columns.
True columns to call it is not entirely true. With geom_tile
we "divide" the data on the graph into rectangles. Here is an example of this type of diagram:
You can see how the graph is "divided" into squares (they are obtained from rectangles using the coord_equal()
layer), three in each column. But thanks to the width
argument less than one, our tile takes the form of bars.
geom_text(aes(y = 0, label = paste(TeamName, " ")), vjust = 0.2, hjust = 1, size = 6) + geom_text(aes(y = CumWins, label = paste0(" ",round(CumWins))), hjust = 0, size = 7) + coord_flip(clip = "off", expand = FALSE) + scale_fill_manual(values = cols) + scale_color_manual(values = cols) + scale_y_continuous(labels = scales::comma) + scale_x_reverse() + guides(color = FALSE, fill = FALSE) +
Next, I add two signatures using geom_text
: team name and number of wins. coord_flip
swaps the axes, scale_fill_manual
and scale_color_manual
changes the color of the columns, scale_x_reverse
"expands" the scale_x_reverse
axis. Notice that we take that color from the previously created vector cols
.
The theme
layer contains parameters for customizing the display of the graph. Here it is indicated how the headers and signatures of the axles should be displayed (in any way, what element_blank
tells us in the right-hand side of the equation). We remove the legend, the background, the frame, the grid lines along the Y axis. plot.title
arguments plot.title
, plot.subtitle
, plot.caption
we set the display options for the title, subtitle and caption of the graph. For more details on the value of all parameters, see gglot2
theme(axis.line=element_blank(), axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks=element_blank(), axis.title.x=element_blank(), axis.title.y=element_blank(), legend.position="none", panel.background=element_blank(), panel.border=element_blank(), panel.grid.major=element_blank(), panel.grid.minor=element_blank(), panel.grid.major.x = element_line( size=.1, color="grey" ), panel.grid.minor.x = element_line( size=.1, color="grey" ), plot.title=element_text(size=25, hjust=0.5, face="bold", colour="black", vjust=-1), plot.subtitle = element_text(size = 15), plot.caption =element_text(size=15, hjust=0.5, color="black"), plot.background=element_blank(), plot.margin = margin(2,2, 2, 4, "cm"))
I will not dwell on the use of the transition_states
function, this part is identical to my earlier publication on Habré. As for labs
it creates a headline, subtitle, and a schedule signature. Using {closest_state}
allows you to display on the chart each specific year, the columns from which we are currently seeing.
anim <- gg + transition_states(year, transition_length = 4, state_length = 1) + view_follow(fixed_x = TRUE) + labs(title = "Cumulative Wins by teams in seasons", subtitle = "Season: {closest_state}", caption = "Telegram: @NBAatlantic, Twitter: @vshufiskiy\n Data sourse: stats.nba.com")
nba_cumulative_wins
function for creating graphs.Writing functions simplifies and speeds up the process of getting the result if you need to use the code more than once. Usually, a function in R has the following form:
_ <- function( ) { _ }
First of all, it is worth understanding what parameters you want to change with the help of a function, its arguments will depend on it. The first argument is the name of the data table that is being input. This allows you to rename it if such a desire arises, without changing anything in the function itself. I also want the chart to display any number of commands: from one (which is meaningless) to 30 (no more). I also want to be able to consider any temporary periods within the 15 years for which I have data. All this is implemented in this form of function:
nba_cumulative_wins <- function(table, elements, first_season, last_season){ ... }
where table
is the name of the table with input dataelements
- the names of the commands that should be displayed on the chartfirst_season
- the first season that will be displayed on the chartlast_season
- the last season to be displayed on the chart.
If the argument is very often used with some particular value, then you can set it by default. Then, if it is omitted among the function arguments, this value will be substituted. For example, if you register
nba_cumulative_wins <- function(table, elements, first_season, last_season = 2018)
then the graphics will be built up to the season 2018/19, unless otherwise indicated.
elements
, first_season
, last_season
With the elements
argument, we can specify the name of those commands that we want to see on the chart. This is very convenient when there are 2 or 3 such teams, but if we want to display the entire league, we will have to write elements = c()
and in brackets the names of all 30 teams.
So I decided to "split" the input values for the elements
argument into several groups.
The nba_cumulative_wins
function can build graphs for individual teams, divisions, conferences, or the NBA as a whole. For this, I used the following construction:
select_teams <- unique(table1$TeamName) select_div <- unique(table1$Division) select_conf <- unique(table1$Conference) select_nba <- "NBA" table1 <- if(elements %in% select_teams){ table1[TeamName %in% elements] } else if (elements %in% select_div){ table1[Division %in% elements] } else if(elements %in% select_conf){ table1[Conference %in% elements] } else if(elements == "NBA"){ table1 } else { NULL }
The select_
character vectors contain the names of all 30 teams, 6 divisions, 2 conferences and the NBA, and the unique
function leaves only one unique name, instead of 15 (by the number of years in the data).
Next, using the if...else
, the elements
argument entered is checked to one of the classes ( %in%
used to determine whether the element belongs to the vector), and the table with the data is modified accordingly. Now, if I want to see the results of teams playing in the Southwest Division instead of
elements = c("Mavericks", "Spurs", "Rockets", "Grillies", "Pelicans")
it is enough for me to enter
elements = "Southwest"
, which is much faster and more convenient.
Due to the possibility of choosing the seasons, work with dates also changes. At the very beginning the line is added:
table1 <- table1[SeasonID >= as.numeric(paste(2, first_season, sep = "")) & SeasonID <= as.numeric(paste(2, last_season, sep = ""))]
So I leave in the table only those rows that fall into our chosen time interval. The code for creating the year
column is also changed. Now it looks like this:
table1 <- table1[ ,year := rep(seq(first_season, last_season), each = length(unique(table1$TeamName)))]
In connection with the grouping of elements, the procedure for obtaining the desired colors is complicated. The fact is that in the table table_color
only the names of the teams. Therefore, we need to "expand" our cuts back. To do this, use the if...else
construct again.
elements1 <- if (elements == "NBA"){ c("Hawks", "Celtics", "Nets", "Hornets", "Bulls", "Cavaliers", "Mavericks", "Nuggets", "Pistons", "Warriors", "Rockets", "Pacers", "Clippers", "Lakers", "Grizzlies", "Heat", "Bucks", "Timberwolves", "Pelicans", "Knicks", "Thunder", "Magic", "76ers", "Suns", "Trail Blazers","Kings", "Spurs", "Raptors", "Jazz", "Wizards") } else if (elements == "West") { c("Mavericks","Nuggets", "Warriors", "Rockets", "Clippers", "Lakers", "Grizzlies","Timberwolves", "Pelicans", "Thunder", "Suns", "Trail Blazers","Kings", "Spurs", "Jazz") } else if (elements == "East") { c("Hawks", "Celtics", "Nets", "Hornets", "Bulls", "Cavaliers","Pistons", "Pacers", "Heat", "Bucks", "Knicks", "Magic", "76ers", "Raptors", "Wizards") } else if (elements == "Pacific") { c("Warriors", "Clippers", "Lakers", "Suns", "Kings") } else if (elements == "Southeast") { c("Magic", "Hornets", "Heat", "Hawks", "Wizards") } else if (elements == "Southwest") { c("Mavericks", "Grizzlies", "Pelicans", "Rockets", "Spurs") } else if (elements == "Central") { c("Bucks", "Pacers", "Pistons", "Bulls", "Cavaliers") } else if (elements == "Atlantic") { c("Knicks", "Nets", "Celtics", "Raptors", "76ers") } else if (elements == "Northwest") { c("Nuggets", "Trail Blazers", "Jazz", "Thunder", "Suns") } else { elements }
Next we create a table with the names of the commands that we need, we connect this table with table_color
using the inner_join
function from the dplyr
package. inner_join
includes only observations that are the same in both tables.
table_elements1 <- data.table(TeamName = elements1) table_color <- table_color[order(TeamName)] inner_table_color <- inner_join(table_color, table_elements1) cols <- inner_table_color[, "TEAM_color"]
The function changes the spelling of the title and subtitle. They make this look:
anim <- gg + transition_states(year, transition_length = 4, state_length = 1) + view_follow(fixed_x = TRUE) + labs(title = paste("Cumulative Wins by teams in seasons", first_season, "-", last_season, sep = " "), subtitle = paste(if (elements %in% select_div ){ paste(elements, "Division", sep = " ") } else if (elements %in% select_conf ){ paste("Conference", elements, sep = " ") }, "Season: {closest_state}", sep = " "), caption = "Telegram: @NBAatlantic, Twitter: @vshufiskiy\nData sourse: stats.nba.com")
Further, all this is visualized.
animate(anim, nframes = (last_season - first_season + 1) * (length(unique(table1$TeamName)) + 20), fps = 20, width = 1200, height = 1000, renderer = gifski_renderer(paste(elements[1], "cumwins.gif", sep = "_")))
I picked up the number in nframes
empirically so that the speed increases / decreases depending on the number of selected commands.
I hope my post was interesting. Project ID on Github .
If you are interested in the sports component of these visualizations, you can visit my blog on sports.ru "On both sides of the Atlantic"
Source: https://habr.com/ru/post/458904/
All Articles