Extracting NBA data from ESPN

I've been wanting to play with some sports data for a while. Today I decided to stop procrastinating and do it. The problem was that after searching for a while (15 minutes) I was unable to find the data I wanted, even in the Basketball Database (I'm not really sure I understand the site).

A friend showed me the ESPN stats and asked me if I knew how to scrape data from a website. I lied and told him yes. But I know Python and its magic powers, so after 15 minutes of reading I knew how to do it.

I used requests and BeautifulSoup to download and scrape the data from the ESPN site, then pandas to order, slice, and save the data into simple CSV files. I also used IPython notebooks to develop the code faster, and a little bit of my copper project for its data analysis project structure.

Get the teams

First I needed to get all the team names and URLs. It is as simple as requesting http://espn.go.com/nba/teams and 32 lines of Python.

import copper
import pandas as pd
import requests
from bs4 import BeautifulSoup
copper.project.path = '../../'

url = 'http://espn.go.com/nba/teams'
r = requests.get(url)

soup = BeautifulSoup(r.text)
tables = soup.find_all('ul', class_='medium-logos')

teams = []
prefix_1 = []
prefix_2 = []
teams_urls = []
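for table in tables:
    lis = table.find_all('li')
    for li in lis:
        info = li.h5.a
        teams.append(info.text)
        url = info['href']
        teams_urls.append(url)
        # The team URL ends in .../name/<prefix_1>/<prefix_2>
        prefix_1.append(url.split('/')[-2])
        prefix_2.append(url.split('/')[-1])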

dic = {'url': teams_urls, 'prefix_2': prefix_2, 'prefix_1': prefix_1}
teams = pd.DataFrame(dic, index=teams)
teams.index.name = 'team'
copper.save(teams, 'teams')

This saves a teams.csv file with the 30 teams in this format:

Boston Celtics,bos,boston-celtics,http://espn.go.com/nba/team/_/name/bos/boston-celtics
Brooklyn Nets,bkn,brooklyn-nets,http://espn.go.com/nba/team/_/name/bkn/brooklyn-nets
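
The two prefix columns in these rows are simply the last two path segments of each team URL, and they are what the schedule URL in the next step is built from. Here is a minimal standard-library sketch of that split, using one of the sample rows. The helper name `split_team_url` is made up for illustration; the URL template is copied from the games script.

```python
from urllib.parse import urlparse

# Template copied from the games script; the helper name is hypothetical.
SCHEDULE_URL = 'http://espn.go.com/nba/team/schedule/_/name/{0}/year/{1}/{2}'


def split_team_url(url):
    """Return (prefix_1, prefix_2): the last two path segments of a team URL."""
    segments = urlparse(url).path.rstrip('/').split('/')
    return segments[-2], segments[-1]


url = 'http://espn.go.com/nba/team/_/name/bos/boston-celtics'
prefix_1, prefix_2 = split_team_url(url)
schedule = SCHEDULE_URL.format(prefix_1, 2013, prefix_2)
print(prefix_1, prefix_2)
print(schedule)
```

Keeping both prefixes in teams.csv means the later scripts never have to parse the URL again.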

Get games

Then I needed the game information. For this it is necessary to read the previous CSV file and, for each team, make a request and parse the data. In this case 60 lines of Python produced a 1084-row CSV file with all the games of the current (2013) season. And it is as simple as changing a variable to download other seasons' information.

import copper
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date
copper.project.path = '../../'

year = 2013
teams = copper.read_csv('teams.csv')
BASE_URL = 'http://espn.go.com/nba/team/schedule/_/name/{0}/year/{1}/{2}'

match_id = []
dates = []
home_team = []
home_team_score = []
visit_team = []
visit_team_score = []

for index, row in teams.iterrows():
    _team, url = row['team'], row['url']
    r = requests.get(BASE_URL.format(row['prefix_1'], year, row['prefix_2']))
    table = BeautifulSoup(r.text).table
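    for row in table.find_all('tr')[1:]: # Remove header
        columns = row.find_all('td')
        try:
            _home = True if columns[1].li.text == 'vs' else False
            _other_team = columns[1].find_all('a')[1].text
            _score = columns[2].a.text.split(' ')[0].split('-')
            _won = True if columns[2].span.text == 'W' else False
            # Assumption: the score link's href ends in '?id=<game id>'
            _id = columns[2].a['href'].split('?id=')[1]
            _first, _second = _score[0], _score[1]
            d = datetime.strptime(columns[0].text, '%a, %b %d')

            # Append only after every field parsed, so the lists stay aligned
            match_id.append(_id)
            home_team.append(_team if _home else _other_team)
            visit_team.append(_team if not _home else _other_team)
            dates.append(date(year, d.month, d.day))

            # Assumption: the score is printed winner first, so the home
            # team gets the first number when it is the winner
            if _home == _won:
                home_team_score.append(_first)
                visit_team_score.append(_second)
            else:
                home_team_score.append(_second)
                visit_team_score.append(_first)
        except Exception as e:
            pass # Not every row is a game, which is OK
            # print(e)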

dic = {'id': match_id, 'date': dates, 'home_team': home_team, 'visit_team': visit_team,
        'home_team_score': home_team_score, 'visit_team_score': visit_team_score}

games = pd.DataFrame(dic).drop_duplicates(cols='id').set_index('id')
copper.save(games, 'games')
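
Two pieces of the per-row logic can be checked without touching the network: placing each game in the right calendar year, and turning a score string into home and visitor points. This sketch rests on two assumptions the post does not confirm. The first is that the `year` in the schedule URL is the year in which the season ends, so October to December games fall in the previous calendar year. The second is that ESPN prints the winner's score first. The function names and the July cut-off are made up for illustration.

```python
from datetime import date, datetime


def game_date(text, season_year, first_month=7):
    """Turn 'Tue, Oct 30' into a date, assuming the season ends in season_year."""
    month_name, day = text.split(', ')[1].split()
    month = datetime.strptime(month_name, '%b').month
    # Assumption: months from first_month onward belong to the previous year.
    year = season_year - 1 if month >= first_month else season_year
    return date(year, month, int(day))


def split_score(score_text, home, won):
    """Return (home_points, visit_points) from a 'winner-loser' score string."""
    winner, loser = (int(x) for x in score_text.split(' ')[0].split('-'))
    # The scraped team scored the winner's points when won is True.
    team, other = (winner, loser) if won else (loser, winner)
    # The scraped team is the home team when home is True.
    return (team, other) if home else (other, team)


# Illustrative inputs, not scraped data.
print(game_date('Tue, Oct 30', 2013))
print(game_date('Wed, Apr 17', 2013))
print(split_score('120-107', home=True, won=True))
print(split_score('120-107 OT', home=False, won=True))
```

If the first assumption holds, the script above files the early-season games under the wrong year, because it passes `year` straight to `date`. Worth checking against a few known games before relying on the dates.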

Get player stats

All the previous data is good, but to do some analysis I need player stats. So for each game I download the stats of all the players in that game. The result was 55 lines of Python (imports included) that generate 27645 rows full of stats.

import copper
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date
copper.project.path = '../..'

games = copper.read_csv('games.csv').set_index('id')
BASE_URL = 'http://espn.go.com/nba/boxscore?gameId={0}'

request = requests.get(BASE_URL.format(games.index[0]))

table = BeautifulSoup(request.text).find('table', class_='mod-data')
heads = table.find_all('thead')
headers = heads[0].find_all('tr')[1].find_all('th')[1:]
headers = [th.text for th in headers]
columns = ['id', 'team', 'player'] + headers

players = pd.DataFrame(columns=columns)

def get_players(players, team_name):
    array = np.zeros((len(players), len(headers)+1), dtype=object)
    array[:] = np.nan
    for i, player in enumerate(players):
        cols = player.find_all('td')
        array[i, 0] = cols[0].text.split(',')[0]
        for j in range(1, len(headers) + 1):
            if not cols[1].text.startswith('DNP'):
                array[i, j] = cols[j].text

    frame = pd.DataFrame(columns=columns)
    for x in array:
        line = np.concatenate(([index, team_name], x)).reshape(1,len(columns))
        new = pd.DataFrame(line, columns=frame.columns)
        frame = frame.append(new)
    return frame

for index, row in games.iterrows():
    request = requests.get(BASE_URL.format(index))
    table = BeautifulSoup(request.text).find('table', class_='mod-data')
    heads = table.find_all('thead')
    bodies = table.find_all('tbody')

    team_1 = heads[0].th.text
    team_1_players = bodies[0].find_all('tr') + bodies[1].find_all('tr')
    team_1_players = get_players(team_1_players, team_1)
    players = players.append(team_1_players)

    team_2 = heads[3].th.text
    team_2_players = bodies[3].find_all('tr') + bodies[4].find_all('tr')
    team_2_players = get_players(team_2_players, team_2)
    players = players.append(team_2_players)

players = players.set_index('id')
copper.save(players, 'players')
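
The row-building part of get_players can be separated from the HTML parsing, which makes the DNP handling easier to check. This is a minimal sketch, assuming each table row has already been reduced to a list of cell texts. The function name `player_row`, the shortened stat list and the example cells are all made up for illustration, and `None` stands in for the NaN used above.

```python
def player_row(game_id, team, cells, n_stats):
    """Build one flat row: id, team, player name, then n_stats stat values."""
    # Keep only the text before the first comma, as the script above does.
    name = cells[0].split(',')[0]
    if len(cells) > 1 and cells[1].startswith('DNP'):
        stats = [None] * n_stats  # the player did not play
    else:
        # Pad short rows so every row has the same width.
        stats = (cells[1:n_stats + 1] + [None] * n_stats)[:n_stats]
    return [game_id, team, name] + stats


# Illustrative cells, not scraped data.
rows = [
    player_row(400277722, 'Boston Celtics', ['Paul Pierce, SF', '41', '6-15', '23'], 3),
    player_row(400277722, 'Boston Celtics', ['Example Player, G', "DNP COACH'S DECISION"], 3),
]
for row in rows:
    print(row)
```

Collecting plain lists like this and building the DataFrame once at the end should also be faster than appending one-row frames in a loop.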

The file looks like this:

0,400277722,Boston Celtics,Brandon Bass,28,6-11,0-0,3-4,6,5,11,1,0,0,1,2,-8,15
0,400277722,Boston Celtics,Paul Pierce,41,6-15,2-4,9-9,0,5,5,5,2,0,0,3,-17,23
0,400277722,Miami Heat,Shane Battier,29,2-4,2-3,0-0,0,2,2,1,1,0,0,3,+12,6
0,400277722,Miami Heat,LeBron James,29,10-16,2-4,4-5,1,9,10,3,2,0,0,2,+12,26
0,400277722,Miami Heat,Chris Bosh,37,8-15,0-1,3-4,2,8,10,1,0,3,1,3,+15,19
..... A LOT OF DATA .....


I love Python even more now.

A lot of work is still needed to make sense of all that data. But at least now I have some data.
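
As a first taste of that work, the made-attempted fields such as 6-11 have to be split into numbers before anything can be aggregated. The sketch below does this with the standard library on the sample rows shown earlier. The column names are an assumption based on ESPN's box score layout, since the real header row is not shown here, and `row` is just a placeholder name for the leading 0 column.

```python
import csv
import io
from collections import defaultdict

# Assumption: column names follow ESPN's box score layout.
COLUMNS = ['row', 'id', 'team', 'player', 'MIN', 'FGM-A', '3PM-A', 'FTM-A',
           'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PF', '+/-', 'PTS']

# Rows copied from the sample output of the players file.
SAMPLE = """\
0,400277722,Boston Celtics,Brandon Bass,28,6-11,0-0,3-4,6,5,11,1,0,0,1,2,-8,15
0,400277722,Boston Celtics,Paul Pierce,41,6-15,2-4,9-9,0,5,5,5,2,0,0,3,-17,23
0,400277722,Miami Heat,Shane Battier,29,2-4,2-3,0-0,0,2,2,1,1,0,0,3,+12,6
0,400277722,Miami Heat,LeBron James,29,10-16,2-4,4-5,1,9,10,3,2,0,0,2,+12,26
0,400277722,Miami Heat,Chris Bosh,37,8-15,0-1,3-4,2,8,10,1,0,3,1,3,+15,19
"""


def made_attempted(text):
    """Split a 'made-attempted' field such as '6-11' into two integers."""
    made, attempted = text.split('-')
    return int(made), int(attempted)


points = defaultdict(int)
field_goals = defaultdict(lambda: [0, 0])  # team -> [made, attempted]
for record in csv.DictReader(io.StringIO(SAMPLE), fieldnames=COLUMNS):
    made, attempted = made_attempted(record['FGM-A'])
    field_goals[record['team']][0] += made
    field_goals[record['team']][1] += attempted
    points[record['team']] += int(record['PTS'])

for team in points:
    made, attempted = field_goals[team]
    print(team, points[team], '{:.3f}'.format(made / attempted))
```

These totals cover only the five sample rows, so they are not the real team totals for that game.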

The next step is probably to insert the data into a Postgres database. Or just be crazy and do some machine learning on it as it is.

The data and code are on GitHub (nba).