First, I build a data frame for each season I want to scrape, containing the game info for every game in that season. The resulting columns for this data frame are as follows:
Date,Time,Away,Score_Away,Home,Score_Home,GameID,OTs,Attend,Notes,Start_A1,Start_A2,Start_A3,Start_A4,Start_A5,Start_H1,Start_H2,Start_H3,Start_H4,Start_H5,Season
Here's my code for this part of my process:
Code: Select all
# Script to webscrape all NBA GameIDs from basketball-reference.com
# RainmanTrail
# 11/8/2020
library(rvest)
library(RSelenium)
library(stringr)
library(dplyr)
library(tidyverse)
# Calendar of seasons and months to scrape.
# Season end-years 1997 through 2020; month names in lowercase to match
# basketball-reference's URL scheme (".../NBA_<year>_games-<month>.html").
years <- 1997:2020
months <- tolower(month.name)
# Launch a Selenium-driven Firefox so we can scrape content rendered by
# JavaScript (plain rvest GETs miss the dynamically served tables).
# `rD$client` is the remote-driver handle used for all navigation below.
# Use TRUE/FALSE literals, not T/F (T/F are reassignable variables in R).
rD <- rsDriver(browser = "firefox", port = 4444L, verbose = FALSE)
remDr <- rD$client
# Scrape GameIDs (and starting lineups) one season at a time, writing one
# CSV per season.
#
# BUG FIX vs. original: when a month had no schedule table, the original
# only printed a message but then fell through and re-processed the STALE
# `game_tbl` from the previous month (re-scraping every box score and
# appending duplicate rows via rbind) — and errored outright if the first
# month of a season was empty, since `game_tbl` did not exist yet. We now
# `next` past empty months. Monthly tables are collected in a list and
# bound once per season instead of growing a data frame with rbind.
for (i in years) {
  season_tbls <- list() # one entry per month with games; bound at season end
  for (j in months) {
    url <- paste0("https://www.basketball-reference.com/leagues/NBA_",
                  i, "_games-", j, ".html")
    remDr$navigate(url)                    # Navigate to the month's schedule page
    raw_html <- remDr$getPageSource()[[1]] # Rendered HTML as a single string
    page <- read_html(raw_html)

    # Skip months whose page has no schedule table (presumably off-season
    # months — TODO confirm).
    # NOTE: This will crap out for Sept 2019 season games. Could add an exception.
    if (!str_detect(raw_html, "div_schedule")) {
      print("No games this month")
      next
    }

    # GameIDs come from the box-score link hrefs, e.g.
    # "/boxscores/199611010CHI.html" -> "199611010CHI".
    game_list <- page %>%
      html_node("#schedule") %>%
      html_nodes("[data-stat='box_score_text']") %>%
      html_nodes("a") %>%
      html_attr("href")
    game_list <- gsub("/boxscores/(.*)\\.html", "\\1", game_list)

    game_tbl <- page %>%
      html_node("#schedule") %>%
      html_table()
    # Fix number of columns for years < 2001 since 'Time' column is missing:
    # insert an empty Time column and reorder it into second position.
    if (i < 2001) {
      game_tbl$Time <- ""
      game_tbl <- game_tbl[, c(1, 10, 2:9)]
    }
    names(game_tbl) <- c("Date", "Time", "Away", "Score_Away", "Home",
                         "Score_Home", "GameID", "OTs", "Attend", "Notes")
    # Drop repeated header rows / "Playoffs" divider rows embedded in the body,
    # so rows align 1:1 with the box-score links scraped above.
    game_tbl <- game_tbl[!game_tbl$Date %in% c("Date", "Playoffs"), ]
    game_tbl$GameID <- game_list

    # OPTIONAL: Scrape the starting lineups for each game and add them to the
    # data frame. The first and third sortable stats tables on a box-score
    # page are assumed to be the away and home basic box scores, whose first
    # five player rows are the starters — TODO confirm against the site.
    for (k in seq_along(game_tbl$GameID)) {
      g <- game_tbl$GameID[k]
      url <- paste0("https://www.basketball-reference.com/boxscores/", g, ".html")
      remDr$navigate(url)
      raw_html <- remDr$getPageSource()[[1]]
      page <- read_html(raw_html)
      box <- page %>% html_nodes("[class='sortable stats_table now_sortable']")
      start_away <- box[1] %>% html_nodes("[data-stat='player']") %>%
        html_nodes("a") %>% html_attr("href")
      # "/players/j/jordami01.html" -> "jordami01"
      start_away <- gsub("/players/[a-z]/(\\w+)\\.html", "\\1", start_away[1:5])
      start_home <- box[3] %>% html_nodes("[data-stat='player']") %>%
        html_nodes("a") %>% html_attr("href")
      start_home <- gsub("/players/[a-z]/(\\w+)\\.html", "\\1", start_home[1:5])
      game_tbl[k, c("Start_A1", "Start_A2", "Start_A3", "Start_A4", "Start_A5")] <- start_away
      game_tbl[k, c("Start_H1", "Start_H2", "Start_H3", "Start_H4", "Start_H5")] <- start_home
    }
    season_tbls[[j]] <- game_tbl
  }
  games <- bind_rows(season_tbls)
  # Tag every row with the season end-year (the "Season" column promised in
  # the output schema, which the original script never actually added).
  games$Season <- i
  # Write the season's results to file, e.g. "NBA_Game_Info_1997.csv".
  write.csv(games, file = paste0("NBA_Game_Info_", i, ".csv"), row.names = FALSE)
}
# Close remote driver and stop Selenium server
# (both are required: closing the client session alone leaves the Selenium
# server process — started by rsDriver above — holding port 4444).
remDr$close()
rD$server$stop()