Currently, I’m working on rebuilding my database and models. In this series I will take you through the process. The aim is to be as transparent and precise as possible. I will share all my code and all my files for everyone to use. Hopefully, this can be helpful in getting more people into hockey and hockey analytics.
I like to think that one of my biggest strengths is my ability to explain complex matters in an easy-to-understand way. The goal is to write a series that everyone can follow – even without any prior coding experience.
Getting data from the NHL API
In this first article, we will learn how to extract data directly from the NHL API. To call the NHL API, you need to know the API endpoints, and Zmalski’s GitHub is a great resource for exactly that.
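To give you a feel for the pattern before we dive in, here’s a minimal example of calling one of the endpoints we’ll use later in this article (the team list) – every call in this post follows the same request-and-flatten approach:

import requests
import pandas as pd

#CALL THE ENDPOINT AND FLATTEN THE JSON RESPONSE INTO A TABLE
response = requests.get('https://api.nhle.com/stats/rest/en/team')
teams = pd.json_normalize(response.json(), 'data')
print(teams.head())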
I’ve built two NHL “scrapers” – one in Power Query and one in Python. Neither is particularly difficult to build from scratch.
The Power Query version
I know it’s uncommon to use Power Query for web scraping and API calls, but I honestly think it’s a very underrated tool for these kinds of tasks.
The approach is to create a column with all the URL endpoints, then write a function that calls a single endpoint, and finally invoke that function on the column. It’s quite simple once you get the hang of it.
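For comparison, here’s a minimal sketch of that same pattern in pandas, using the club-schedule endpoint from the Python script further down (the three team codes are just examples):

import requests
import pandas as pd

#A COLUMN WITH ALL THE URL ENDPOINTS (IN POWER QUERY THIS WOULD BE A CUSTOM COLUMN)
teams = pd.DataFrame({'triCode': ['TOR', 'EDM', 'NYR']})
teams['url'] = 'https://api-web.nhle.com/v1/club-schedule-season/' + teams['triCode'] + '/20212022'

#A FUNCTION THAT CALLS A SINGLE ENDPOINT
def get_games(url):
    return pd.json_normalize(requests.get(url).json(), 'games')

#INVOKE THE FUNCTION ON EVERY ROW OF THE URL COLUMN
schedule = pd.concat([get_games(u) for u in teams['url']], ignore_index=True)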
I’ve made a few YouTube videos on my channel where I build different NHL scrapers.
Video 1: Getting NHL data based on a single date selection:
In the first video we build a tool to get play-by-play and shift data from the NHL API based on a date selection. This is useful if the goal is to do daily updates.
Video 2: Getting NHL shift data from the HTML reports:
In the second video we scrape and clean shift data directly from the HTML reports. This is relevant if you want live shift data – the shift data in the NHL API isn’t updated until after the game.
There’s also some shift data missing from the API from the beginning of the 2024/2025 season, so unfortunately the HTML reports are the only source for those games.
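For reference, the HTML reports live at URLs you can build from a game ID. Here’s a rough sketch of the pattern as I understand it – TH for the home team’s shift report, TV for the visiting team’s – with the actual parsing left to the video:

import requests

#BUILD THE REPORT URL FROM A GAME ID (E.G. 2024020001)
game_id = 2024020001
season = '20242025'              #THE REPORT PATH USES THE FULL SEASON STRING
game_number = str(game_id)[4:]   #THE LAST SIX DIGITS OF THE GAME ID

#TH = HOME SHIFT REPORT, TV = VISITOR SHIFT REPORT (URL PATTERN AS I UNDERSTAND IT)
url = f'https://www.nhl.com/scores/htmlreports/{season}/TH{game_number}.HTM'
html = requests.get(url).text    #RAW HTML, READY TO BE PARSED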
Video 3: Getting NHL data based on season selection:
In the third video we build a tool to get data for a full season at a time. This is relevant if you want to build a database with historical data.
That’s how you can build a scraper in Power Query. The tool built in video 3 is the same tool I used to build my NHL database. It’s also the tool you can find on my website: https://hockey-statistics.com/scrapers/
The Python version
Before you can get started with Python, you need to install it. I personally followed @AlexTheAnalyst’s video on this.
Once you have Jupyter Notebook installed and open, you can start programming. Here’s my code for the Python scraper:
import requests as req
import pandas as pd
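#SELECT THE SEASON TO SCRAPE - THE FORMAT IS START YEAR + END YEAR (E.G. '20212022')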
season = '20212022'
#GET ALL THE SKATERS FOR THE SELECTED SEASON
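#limit=-1 TELLS THE STATS API TO RETURN ALL ROWS INSTEAD OF A PAGED RESULT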
df = req.get(f'https://api.nhle.com/stats/rest/en/skater/bios?limit=-1&start=0&cayenneExp=seasonId={season}')
df = df.json()
Skaters = pd.json_normalize(df, "data")
#RENAME COLUMN 'skaterFullName' TO 'Name' AND SELECT/ORDER COLUMNS
Skaters = Skaters.rename(columns={"skaterFullName":"Name"})
Skaters = Skaters[['birthDate','draftOverall','draftYear','height','nationalityCode','playerId','shootsCatches','weight','positionCode','Name']]
#GET ALL THE GOALIES FOR THE SELECTED SEASON
df = req.get(f'https://api.nhle.com/stats/rest/en/goalie/bios?limit=-1&start=0&cayenneExp=seasonId={season}')
df = df.json()
Goalies = pd.json_normalize(df, "data")
#RENAME COLUMN 'goalieFullName' TO 'Name', ADD COLUMN 'positionCode' AND SELECT/ORDER COLUMNS
Goalies = Goalies.rename(columns={"goalieFullName":"Name"})
Goalies['positionCode'] = 'G'
Goalies = Goalies[['birthDate','draftOverall','draftYear','height','nationalityCode','playerId','shootsCatches','weight','positionCode','Name']]
#COMBINE THE TWO DATAFRAMES 'Skaters' AND 'Goalies' - ADD THE COLUMN 'Season'
Players = pd.concat([Skaters,Goalies])
Players['Season'] = season
#RE-ORDER THE COLUMNS TO MATCH THE SQL SCRIPT AND CONVERT THE DATATYPES TO AVOID FLOATS
Players = Players[['Season','birthDate','draftOverall','draftYear','height','nationalityCode','playerId','shootsCatches','weight','positionCode','Name']]
Players = Players.convert_dtypes()
#DROP DUPLICATE PLAYERS
Players = Players.drop_duplicates(subset=['playerId'])
#GET ALL THE TEAMS FOR ALL SEASONS AND CONVERT THE DATATYPES TO AVOID FLOATS
df = req.get('https://api.nhle.com/stats/rest/en/team')
df = df.json()
Teams = pd.json_normalize(df, "data")
Teams = Teams.convert_dtypes()
#CREATE A LIST FOR SCHEDULE
schedule = []
#LOOP THROUGH EACH TEAM TO GET THE SCHEDULE FOR EVERY TEAM
for triCode in Teams['triCode']:
    url = f'https://api-web.nhle.com/v1/club-schedule-season/{triCode}/{season}'
    response = req.get(url)
    data = response.json()
    data = pd.json_normalize(data, "games")
    schedule.append(data)
schedule = pd.concat(schedule, ignore_index=True)
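#EACH GAME APPEARS IN BOTH TEAMS' SCHEDULES, SO DROP THE DUPLICATES BASED ON THE GAME ID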
schedule = schedule.drop_duplicates(subset=['id'])
#SELECT/ORDER COLUMNS AND CONVERT THE DATATYPES TO AVOID FLOATS
schedule = schedule[['id','season','gameType','gameDate','startTimeUTC','gameState','awayTeam.abbrev','awayTeam.score',
'homeTeam.abbrev','homeTeam.score','gameOutcome.lastPeriodType']]
schedule = schedule.convert_dtypes()
#ONLY INCLUDE REGULAR AND PLAYOFF GAMES AND SORT THE DATA
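#gameType: 1 = PRESEASON, 2 = REGULAR SEASON, 3 = PLAYOFFS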
no_preseason = schedule['gameType'].isin([2,3])
schedule = schedule[no_preseason]
schedule = schedule.sort_values(by=['id'])
#CREATE A LIST FOR PBP
pbp = []
#LOOP THROUGH EACH GAME TO GET THE PLAY-BY-PLAY FOR THE FULL SEASON
for Game in schedule['id']:
    url = f'https://api-web.nhle.com/v1/gamecenter/{Game}/play-by-play'
    response = req.get(url)
    data = response.json()
    data = pd.json_normalize(data, "plays")
    data['GameID'] = Game
    pbp.append(data)
pbp = pd.concat(pbp, ignore_index=True)
#ADD COLUMNS THAT MIGHT NOT EXIST IN THE OLDER SEASONS
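#df.get(COLUMN, '') RETURNS THE EXISTING COLUMN IF IT EXISTS, OTHERWISE THE SCALAR '' WHICH FILLS A NEW COLUMN WITH EMPTY VALUES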
missing_cols = ['LINK_PBP','situationCode','homeTeamDefendingSide','details.losingPlayerId','details.winningPlayerId',
'details.xCoord','details.yCoord','details.zoneCode','details.reason','details.hittingPlayerId','details.hitteePlayerId',
'details.playerId','details.shotType','details.shootingPlayerId','details.awaySOG','details.homeSOG',
'details.blockingPlayerId','details.assist2PlayerId','details.assist2PlayerTotal','details.secondaryReason',
'details.typeCode','details.drawnByPlayerId','details.servedByPlayerId']
for col in missing_cols:
    pbp[col] = pbp.get(col, '')
#SELECT/ORDER COLUMNS AND CONVERT THE DATATYPES TO AVOID FLOATS
pbp = pbp[['GameID','LINK_PBP','eventId','periodDescriptor.number','timeInPeriod','situationCode','homeTeamDefendingSide','typeCode',
'typeDescKey','sortOrder','details.eventOwnerTeamId','details.losingPlayerId','details.winningPlayerId','details.xCoord','details.yCoord',
'details.zoneCode','details.reason','details.hittingPlayerId','details.hitteePlayerId','details.playerId','details.shotType',
'details.shootingPlayerId','details.goalieInNetId','details.awaySOG','details.homeSOG','details.blockingPlayerId',
'details.scoringPlayerId','details.scoringPlayerTotal','details.assist1PlayerId','details.assist1PlayerTotal',
'details.assist2PlayerId','details.assist2PlayerTotal','details.awayScore','details.homeScore','details.secondaryReason',
'details.typeCode','details.descKey','details.duration','details.committedByPlayerId','details.drawnByPlayerId','details.servedByPlayerId']]
pbp = pbp.convert_dtypes()
#CREATE A LIST FOR SHIFTS
shifts = []
#LOOP THROUGH EACH GAME TO GET THE SHIFTS FOR THE FULL SEASON
for Game in schedule['id']:
    url = f'https://api.nhle.com/stats/rest/en/shiftcharts?cayenneExp=gameId={Game}'
    response = req.get(url)
    data = response.json()
    data = pd.json_normalize(data, "data")
    shifts.append(data)
shifts = pd.concat(shifts, ignore_index=True)
#ADD COLUMNS THAT MIGHT NOT EXIST IN THE OLDER SEASONS
for col in ['gameId','endTime','period','playerId','shiftNumber','startTime','teamAbbrev']:
    shifts[col] = shifts.get(col, '')
#SELECT/ORDER COLUMNS AND CONVERT THE DATATYPES TO AVOID FLOATS
shifts = shifts[['gameId','endTime','period','playerId','shiftNumber','startTime','teamAbbrev']]
shifts = shifts.convert_dtypes()
#WRITE THE DATA TO CSV FILES
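#NOTE: THE FOLDER C:\Public\NHL MUST EXIST BEFORE RUNNING - to_csv WILL NOT CREATE IT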
Players.to_csv(r'C:\Public\NHL\Players.csv',index=False, header=True)
Teams.to_csv(r'C:\Public\NHL\Teams.csv',index=False, header=True)
schedule.to_csv(r'C:\Public\NHL\Schedule.csv',index=False, header=True)
pbp.to_csv(r'C:\Public\NHL\pbp.csv',index=False, header=True)
shifts.to_csv(r'C:\Public\NHL\shift.csv',index=False, header=True)
Power Query vs. Python
For the Power Query tool to work, you need access to Excel, and even then the tool won’t work on a Mac. On the other hand, Excel is much less intimidating than coding for most people, so hopefully this illustrates that you don’t need to code to get started with hockey analytics.
Most analysts use Python, so it’s probably worth learning – especially when it comes to statistical modelling (we will get to that later).
As I said earlier, I think Power Query is a very underrated scraping tool, and the Power Query tool is a fair bit faster than the Python script (even when the Python timing excludes writing the CSV files). On my computer, a full season takes around 12 minutes in Power Query and around 20 minutes in Python. Not a game-changer, but still worth noting.
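If the Python run time bothers you, one small tweak worth trying (not reflected in the timings above) is to reuse a single HTTP connection across the hundreds of requests with requests.Session. A minimal sketch of the idea:

import requests

#REUSING ONE SESSION KEEPS THE CONNECTION ALIVE BETWEEN CALLS TO THE SAME HOST,
#WHICH USUALLY SPEEDS UP LONG REQUEST LOOPS LIKE THE PLAY-BY-PLAY ONE ABOVE
session = requests.Session()
for game_id in [2021020001, 2021020002]:  #IN THE REAL SCRIPT: schedule['id']
    url = f'https://api-web.nhle.com/v1/gamecenter/{game_id}/play-by-play'
    data = session.get(url).json()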
Next article – Cleaning and transforming the NHL data in MySQL
In the next article, we will use MySQL to clean and transform the data scraped with the Power Query tool. I will be doing things slightly differently from the videos on my channel, but I will make sure to share all the SQL code.
Next Article: https://hockey-statistics.com/2025/05/15/hockey-analytics-cleaning-and-transforming-nhl-data-in-mysql/
Hockey Analytics Competition
I still plan to host a hockey analytics competition, but it likely won’t be until after the summer – maybe in September to mark the start of the season.
Rebuilding my database is partly about having data ready for an upcoming competition.
Contact
Please reach out if you have any comments/questions or if you want to share your own work – maybe you’ve built an NHL scraper in RStudio. I will gladly post it on my platforms.
You can contact me at hockeystatistics.com@gmail.com.