Indications that shot location data is flawed – Depends on where games are being played

Abstract

The goal of this article is to determine whether xG data is affected by where games are played. Do certain arenas skew the shot location data one way or the other?

All data for this article is 5v5 data from www.evolving-hockey.com

Henrik Lundqvist vs. Tuukka Rask

A big part of the inspiration for this article came from looking at Henrik Lundqvist’s GSAx numbers. Either Lundqvist is superhuman, or there’s something off about the expected goals numbers for the New York Rangers. Here’s a list of the top 10 goalies in terms of 5v5 GSAx from Evolving-Hockey:

Player               GP     FA     xGA  xGA/FA    Sv%  dFSv%    GSAA    GSAx
Henrik Lundqvist    764  23900  1465.8  0.0613  92.72   0.92   92.26  219.81
Jaroslav Halak      504  15439   891.8  0.0578  92.56   0.45   41.31   68.82
Corey Crawford      486  14564   855.6  0.0587  92.75   0.42   57.93   61.58
Braden Holtby       468  14614   829.5  0.0568  92.60   0.37   39.46   54.48
Sergei Bobrovsky    507  15854   900.9  0.0568  92.67   0.33   54.57   51.92
John Gibson         287   9305   535.9  0.0576  92.57   0.53   26.08   48.93
Jonas Hiller        404  12179   683.5  0.0561  92.54   0.40   29.57   48.46
Roberto Luongo      626  18623  1029.2  0.0553  92.80   0.25   82.06   47.18
Carey Price         682  21019  1166.1  0.0555  92.75   0.22   91.40   46.06
Cam Ward            611  19294  1139.1  0.0590  92.02   0.23  -24.70   44.09

If we instead look at the top 10 goalies by GSAA, we see a different picture. Lundqvist is still really good, but nowhere near superhuman.

Player               GP     FA     xGA  xGA/FA    Sv%  dFSv%    GSAA    GSAx
Tuukka Rask         536  16011   830.9  0.0519  93.06   0.17   97.28   27.93
Pekka Rinne         657  19922  1056.8  0.0530  92.85   0.07   95.83   14.84
Henrik Lundqvist    764  23900  1465.8  0.0613  92.72   0.92   92.26  219.81
Carey Price         682  21019  1166.1  0.0555  92.75   0.22   91.40   46.06
Roberto Luongo      626  18623  1029.2  0.0553  92.80   0.25   82.06   47.18
Tim Thomas          317   9530   510.3  0.0535  93.16   0.20   73.36   19.28
Tomas Vokoun        316  10062   530.5  0.0527  93.03   0.11   68.74   11.48
Corey Crawford      486  14564   855.6  0.0587  92.75   0.42   57.93   61.58
Sergei Bobrovsky    507  15854   900.9  0.0568  92.67   0.33   54.57   51.92
Ryan Miller         650  20260  1113.7  0.0550  92.54   0.07   51.80   13.72

I’ve also looked at the shot quality faced, and I have defined Shot Quality (SQ) as expected goals per fenwick (a fenwick being an unblocked shot attempt). This is a metric I will use repeatedly in this article. Among goalies with at least 200 games played, Lundqvist is at the very top. So, according to Evolving-Hockey’s xG model, Lundqvist has faced the hardest shots: 6.1 percent of all unblocked shots against are expected to go in.
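As a small sketch of the SQ definition for goalies, the xGA/FA column in the tables can be recomputed from the xGA and FA columns. The function name `shot_quality` and the dictionary layout are illustrative choices, not something taken from Evolving-Hockey; the numbers are the xGA and FA values listed in this article.

```python
def shot_quality(xga: float, fa: int) -> float:
    """Expected goals against per unblocked shot (fenwick) against."""
    if fa <= 0:
        raise ValueError("fa must be positive")
    return xga / fa


# (xGA, FA) at 5v5, copied from the goalie tables in this article.
goalies = {
    "Henrik Lundqvist": (1465.8, 23900),
    "Tuukka Rask": (830.9, 16011),
    "Niklas Backstrom": (534.2, 10956),
}

# Round to four decimals, the precision used in the xGA/FA column.
sq_by_goalie = {
    name: round(shot_quality(xga, fa), 4) for name, (xga, fa) in goalies.items()
}

for name, sq in sq_by_goalie.items():
    print(f"{name}: {sq:.4f}")
```

The rounded values should line up with the xGA/FA column for the same goalies.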

Here’s the top 10 in SQ (GP>200):

Player               GP     FA     xGA  xGA/FA    Sv%  dFSv%    GSAA    GSAx
Henrik Lundqvist    764  23900  1465.8  0.0613  92.72   0.92   92.26  219.81
Ondrej Pavelec      398  12142   729.2  0.0601  91.86   0.10  -31.25   12.24
Nikolai Khabibulin  212   6069   363.4  0.0599  91.73  -0.16  -17.82   -9.61
Cam Ward            611  19294  1139.1  0.0590  92.02   0.23  -24.70   44.09
Cam Talbot          314  10210   601.3  0.0589  92.35   0.35    6.58   35.25
Corey Crawford      486  14564   855.6  0.0587  92.75   0.42   57.93   61.58
Evgeni Nabokov      344   9766   572.1  0.0586  92.00   0.15   -7.37   15.09
Petr Mrazek         263   7661   445.3  0.0581  92.26   0.17    1.53   13.32
Martin Jones        327   9675   562.3  0.0581  91.23  -0.39  -67.89  -37.73
Chris Mason         233   6150   356.2  0.0579  91.45  -0.37  -27.93  -22.83

And here’s the other end of the spectrum – the bottom 10 in SQ (GP>200):

Player               GP     FA     xGA  xGA/FA    Sv%  dFSv%    GSAA    GSAx
Niklas Backstrom    371  10956   534.2  0.0488  92.20  -0.69    5.45  -75.78
Devan Dubnyk        520  16285   841.2  0.0517  92.42  -0.23   23.32  -36.85
Andrei Vasilevskiy  260   8215   425.5  0.0518  92.79  -0.15   38.49  -12.49
Miikka Kiprusoff    390  11316   587.2  0.0519  91.95  -0.58  -13.08  -65.79
Tuukka Rask         536  16011   830.9  0.0519  93.06   0.17   97.28   27.93
Darcy Kuemper       215   6806   353.4  0.0519  92.41  -0.19   12.15  -12.64
Peter Budaj         277   7721   406.5  0.0526  91.78  -0.50  -21.97  -38.53
Tomas Vokoun        316  10062   530.5  0.0527  93.03   0.11   68.74   11.48
Pekka Rinne         657  19922  1056.8  0.0530  92.85   0.07   95.83   14.84
Robin Lehner        301   9817   522.7  0.0532  92.49  -0.25   21.32  -24.29

Some teams obviously play a higher-risk game and therefore allow more dangerous chances, so we could just accept the data. However, if we look at the numbers Home vs. Away, we see some interesting trends.

The table below shows the shot quality faced being much higher at home than it is on the road for Henrik Lundqvist. This would suggest that NYR plays a much more high-risk game at home, but then we would expect a much lower save percentage as well. We don’t see that, and so the GSAx at home is much higher than the GSAx on the road. The numbers show a great Lundqvist on the road, and a completely unreal Lundqvist at home. Maybe the home ice data is inflated.

Lundqvist   GP     FA     xGA  xGA/FA    Sv%  dFSv%    GSAx
Away       357  11713   663.8  0.0566  92.89   0.55   64.83
Home       407  12187   802.1  0.0668  92.56   1.27  155.13
Total      764  23900  1466.0  0.0620  92.72   0.92  219.96

If we do the same analysis for Tuukka Rask, we see a totally different picture. His numbers on the road are very comparable to Lundqvist’s road numbers, but at home he’s way, way worse. In fact, he’s below average in terms of GSAx at home.

Rask        GP     FA     xGA  xGA/FA    Sv%  dFSv%    GSAx
Away       256   7839   425.8  0.0537  92.87   0.39   30.81
Home       280   8172   405.2  0.0502  93.23  -0.03   -2.80
Total      536  16011   831.0  0.0518  93.06   0.17   28.01

These examples could indicate that there’s something wrong with the xG data, namely that shot location tracking differs from arena to arena. However, this is just anecdotal evidence, so I will now take a more general approach.

Arena Effect

You would expect some correlation between Shot quality and goalscoring, so if the Shot quality is higher at certain arenas, you would expect the goalscoring to be higher as well. To test this, I’ve defined Shot result (SR) as goals per fenwick.

So, Shot quality is simply expected goals per fenwick and Shot result is actual goals per fenwick. How well do SQ and SR correlate? I’ve looked at 5v5 team data since the 2007/2008 season. The graph below shows how Shot quality correlates with Shot result for each team in every season:

I honestly expected a greater correlation, but obviously goaltending and shooting ability play an integral part in goalscoring as well.

Let’s now look at the correlation at home versus on the road. I’m using overall Shot quality and overall Shot result, so it’s:

SQ = (xGF+xGA) / (FF+FA)

SR = (GF+GA) / (FF+FA)

This means that both teams’ shooting is accounted for. If there’s no Arena effect, then we should see a similar correlation at home and away.
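The two formulas above can be sketched in a few lines. The `Totals` class and its field names are hypothetical, and the numbers in the example are placeholders rather than real team data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Totals:
    """5v5 totals for one team in one split (for example, all home games)."""

    xgf: float  # expected goals for
    xga: float  # expected goals against
    gf: int     # goals for
    ga: int     # goals against
    ff: int     # fenwick (unblocked shot attempts) for
    fa: int     # fenwick against

    def shot_quality(self) -> float:
        """SQ = (xGF + xGA) / (FF + FA)."""
        return (self.xgf + self.xga) / (self.ff + self.fa)

    def shot_result(self) -> float:
        """SR = (GF + GA) / (FF + FA)."""
        return (self.gf + self.ga) / (self.ff + self.fa)


# Placeholder numbers, not real team data.
example = Totals(xgf=50.0, xga=60.0, gf=45, ga=55, ff=1000, fa=1000)
print(f"SQ = {example.shot_quality():.4f}, SR = {example.shot_result():.4f}")
```

Because both the "for" and "against" columns go into the numerator and the denominator, the result describes all shots taken in those games, not just one team's.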

The correlation Away is much greater than it is at home. This indicates that there is an Arena effect on xG and therefore on the shot location data.

I’ve also looked at the correlation between Shot quality and Shot result when the data from all seasons is combined. In other words, it’s how each team has performed from 2007 to 2020. Atlanta, Winnipeg and Vegas are included even though their datasets are smaller.

Now, we see the Arena effect much more clearly. With a sample size this large, the noise from goaltending and shooting ability becomes much smaller.
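For readers who want to repeat the comparison, here is a minimal sketch of the calculation, assuming the plain Pearson correlation coefficient is the measure of interest (the article does not say which coefficient the graphs use). The `pearson` helper is written out by hand, and the input is only a handful of rows from the all-seasons table in this article, so the printed values merely show how the call works and are not the article's results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# A few rows from the all-seasons table: team -> (SR home, SR away, SQ home, SQ away).
rows = {
    "NYR": (0.0557, 0.0545, 0.0668, 0.0561),
    "NYI": (0.0595, 0.0558, 0.0647, 0.0563),
    "CHI": (0.0600, 0.0584, 0.0590, 0.0569),
    "BOS": (0.0536, 0.0540, 0.0514, 0.0545),
    "MIN": (0.0562, 0.0557, 0.0476, 0.0546),
    "T.B": (0.0601, 0.0601, 0.0501, 0.0579),
}

# Correlate SQ with SR separately for the home split and the away split.
r_home = pearson([r[2] for r in rows.values()], [r[0] for r in rows.values()])
r_away = pearson([r[3] for r in rows.values()], [r[1] for r in rows.values()])
print(f"home r = {r_home:.2f}, away r = {r_away:.2f}")
```

Feeding in every team (or every team-season) instead of six rows would give the home and away correlations being compared here.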

How can we then define this Arena effect? I’ve simply defined Arena effect as the Shot quality at home minus the Shot quality away:

Arena effect = SQ(Home) – SQ(Away)

Or

Arena effect = (xGF(H)+xGA(H)) / (FF(H)+FA(H)) – (xGF(A)+xGA(A)) / (FF(A)+FA(A))

The thought process is that the tracking differences accumulate at home but even out on the road, where a team plays in many different arenas. The Away data can therefore be seen as a baseline.
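The definition can be illustrated with a short sketch. The function name `arena_effect` and the dictionary layout are illustrative; the SQ values are copied from the team-season tables in this article. Since those values are rounded to four decimals, the differences can deviate slightly in the last digit from the Arena Effect column.

```python
def arena_effect(sq_home: float, sq_away: float) -> float:
    """Arena effect = SQ(Home) - SQ(Away)."""
    return sq_home - sq_away


# (team, season) -> (SQ home, SQ away), copied from the tables in this article.
splits = {
    ("NYR", "09/10"): (0.0763, 0.0540),
    ("NYI", "14/15"): (0.0719, 0.0565),
    ("T.B", "12/13"): (0.0442, 0.0600),
    ("MIN", "07/08"): (0.0425, 0.0517),
}

effects = {key: arena_effect(home, away) for key, (home, away) in splits.items()}

# Rank from the largest positive effect to the largest negative effect.
ranked = sorted(effects.items(), key=lambda item: item[1], reverse=True)
for (team, season), effect in ranked:
    print(f"{team} {season}: {effect:+.4f}")
```

A positive value means shots in that team's home games are rated as more dangerous than shots in its road games; a negative value means the opposite.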

With this definition of Arena effect, we can compare teams. Here are the top 10 teams in terms of Arena effect.

Top 10 – Arena Effect:

Team  Season  SR Home  SR Away  SQ Home  SQ Away  Arena Effect
NYR   09/10    0.0563   0.0534   0.0763   0.0540       0.02228
NYR   11/12    0.0540   0.0583   0.0739   0.0551       0.01879
NYR   10/11    0.0597   0.0504   0.0759   0.0575       0.01846
NYR   08/09    0.0477   0.0559   0.0717   0.0560       0.01563
NYI   14/15    0.0581   0.0597   0.0719   0.0565       0.01546
NYI   13/14    0.0573   0.0588   0.0695   0.0544       0.01504
NYI   12/13    0.0598   0.0617   0.0695   0.0555       0.01392
NYR   07/08    0.0497   0.0545   0.0660   0.0532       0.01289
WPG   14/15    0.0603   0.0474   0.0660   0.0532       0.01275
NYR   12/13    0.0546   0.0458   0.0666   0.0545       0.01214

This list is dominated by early NYR teams, giving a plausible explanation for Henrik Lundqvist’s superhuman GSAx stats.

Bottom 10 – Arena Effect:

Team  Season  SR Home  SR Away  SQ Home  SQ Away  Arena Effect
T.B   12/13    0.0626   0.0668   0.0442   0.0600      -0.01577
T.B   14/15    0.0592   0.0648   0.0494   0.0602      -0.01079
PIT   09/10    0.0636   0.0624   0.0510   0.0608      -0.00980
ARI   12/13    0.0578   0.0453   0.0487   0.0585      -0.00977
T.B   13/14    0.0546   0.0511   0.0477   0.0574      -0.00970
TOR   12/13    0.0631   0.0622   0.0481   0.0578      -0.00969
PIT   12/13    0.0584   0.0599   0.0492   0.0588      -0.00964
BUF   16/17    0.0474   0.0520   0.0469   0.0562      -0.00932
BUF   07/08    0.0743   0.0571   0.0497   0.0590      -0.00928
MIN   07/08    0.0544   0.0615   0.0425   0.0517      -0.00923

This list is not quite as dominated by one particular team, but T.B appears numerous times. Here is the overall result if we look at all seasons combined:

Arena Effect – All seasons:

Team  SR Home  SR Away  SQ Home  SQ Away  Arena Effect
NYR    0.0557   0.0545   0.0668   0.0561       0.01063
NYI    0.0595   0.0558   0.0647   0.0563       0.00838
S.J    0.0544   0.0561   0.0593   0.0547       0.00465
WPG    0.0578   0.0573   0.0599   0.0553       0.00461
ATL    0.0626   0.0606   0.0612   0.0568       0.00436
EDM    0.0572   0.0604   0.0611   0.0569       0.00416
CAR    0.0528   0.0571   0.0609   0.0569       0.00407
CHI    0.0600   0.0584   0.0590   0.0569       0.00218
PHI    0.0561   0.0582   0.0580   0.0560       0.00204
ANA    0.0533   0.0552   0.0571   0.0555       0.00165
DAL    0.0547   0.0582   0.0560   0.0552       0.00085
MTL    0.0545   0.0553   0.0562   0.0555       0.00070
L.A    0.0508   0.0531   0.0550   0.0545       0.00055
VAN    0.0581   0.0542   0.0559   0.0556       0.00025
DET    0.0570   0.0558   0.0552   0.0551       0.00013
CGY    0.0607   0.0555   0.0553   0.0552       0.00010
OTT    0.0576   0.0578   0.0554   0.0555      -0.00013
N.J    0.0583   0.0531   0.0556   0.0559      -0.00030
WSH    0.0567   0.0589   0.0561   0.0565      -0.00035
STL    0.0576   0.0551   0.0551   0.0555      -0.00045
VGK    0.0586   0.0591   0.0579   0.0584      -0.00050
NSH    0.0555   0.0564   0.0539   0.0546      -0.00072
CBJ    0.0578   0.0577   0.0547   0.0558      -0.00109
PIT    0.0595   0.0582   0.0545   0.0574      -0.00286
BOS    0.0536   0.0540   0.0514   0.0545      -0.00309
COL    0.0592   0.0553   0.0522   0.0558      -0.00355
FLA    0.0559   0.0567   0.0525   0.0561      -0.00355
ARI    0.0512   0.0528   0.0515   0.0553      -0.00381
TOR    0.0555   0.0604   0.0536   0.0577      -0.00416
BUF    0.0560   0.0550   0.0497   0.0564      -0.00667
MIN    0.0562   0.0557   0.0476   0.0546      -0.00701
T.B    0.0601   0.0601   0.0501   0.0579      -0.00780

The two New York teams top the list, whereas T.B, MIN and BUF have the lowest Arena effect.

If we look at the top and bottom teams from the first two tables, we won’t find any recent seasons. This could indicate that the shot tracking has become more streamlined and that it’s all a problem of the past.

I’ve therefore looked at the average Arena effect (positive or negative) over time to see if the effect is trending downwards:

The effect is definitely smaller now compared to earlier, but it’s still pretty significant. And something crazy happened in the shortened 2012/2013 season. Obviously, the sample was smaller, but I don’t think that alone can explain such a spike in Arena effect.
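One way to read "average Arena effect (positive or negative)" is as the mean of the absolute team values per season, so that positive and negative arenas do not cancel each other out. That reading is an assumption, and the sketch below uses made-up placeholder values, not the article's data; the name `mean_abs_effect` is hypothetical.

```python
def mean_abs_effect(effects_by_season):
    """Mean absolute Arena effect per season; seasons with no teams are skipped."""
    return {
        season: sum(abs(e) for e in effects) / len(effects)
        for season, effects in effects_by_season.items()
        if effects
    }


# Made-up placeholder values (one number per team), NOT real data.
effects_by_season = {
    "season A": [0.010, -0.020, 0.000],
    "season B": [0.005, -0.005],
}

trend = mean_abs_effect(effects_by_season)
for season, value in trend.items():
    print(f"{season}: {value:.4f}")
```

Taking the plain mean instead would tend towards zero even when individual arenas are far off, which is why the absolute value is used in this sketch.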

Arena Effect and GSAx

Previously in the article I looked at specific goalies, and how Shot quality affected their numbers. Now I will take a more general approach, and look at how Arena effect correlates with a team’s home ice GSAx:

There’s a pretty good correlation, and if we look at data from all seasons combined instead of single-season data, the correlation is even greater:

Overall, there seems to be a really good correlation between home ice GSAx and Arena effect, so people should be very cautious when using GSAx as their preferred goalie metric.

Arena Effect and GAR

Clearly these Arena effects don’t just impact goaltending, so now I will turn my attention towards GAR, more specifically towards the even strength components of GAR (offense and defense).

Unfortunately, I can’t isolate GAR numbers in terms of home ice, so I will have to look at both home and away data. Here’s the correlation between GAR_EVO (offense) and Arena effect on the team level:

So, there’s no correlation between the two at the team level. Theoretically, Arena effects could still impact the GAR_EVO of certain types of players, but at the team level there appears to be no impact.

And, here’s the correlation between Arena effect and GAR_EVD (defense):

Now we see some inverse correlation, meaning a high Arena effect decreases the team GAR_EVD, whereas a low Arena effect increases the GAR_EVD. This isn’t particularly surprising, since GAR_EVD relies heavily on xGA.

I’ve done the same for the xGAR model with similar results, although not as clear:

Finally, I did the same analysis of my own model (sGAA), which you can read about here. The results were pretty much the same:

Discussion

The findings in this article raise quite a few questions. First of all, why do we see these differences in Arena effect? The rink dimensions are exactly the same in every NHL arena, so you really shouldn’t see such big differences in the shot location data. The only viable explanation I can come up with lies in the tracking process. Every game is tracked manually, with a specific team of trackers associated with each arena. From a scientific standpoint, this isn’t a great way to track data, since even small tracking differences accumulate onto specific teams. Instead, the tracking teams should be randomized so that the differences even out.

Another question raised by these findings is how to use and interpret this newfound information. I’ve already shown that there’s a correlation between Arena effect and GSAx, and an inverse correlation between Arena effect and the defensive GAR components.

So, how does the Arena effect affect xGF%? A high Arena effect increases the xG totals, whereas a low Arena effect decreases the xG totals. The Arena effect should therefore primarily impact the extremes. A high Arena effect pushes players further away from the average (xGF% = 50), and a low Arena effect pulls players towards the league average.

On a team like NYR with a high Arena effect, this means that a player like Artemi Panarin has an xGF% that seems better than it really is, whereas a player like Kaapo Kakko seems worse than he really is.

One analytical approach to all of this is to only look at road data. This way the tracking differences should even out. It’s easy to do on www.naturalstattrick.com, but unfortunately home/away is not a sorting criterion on www.Evolving-hockey.com.

Perspective

The positive in all of this is that the Arena effects seem to be fairly consistent from year to year, so it’s possible to adjust for them. I’ve already tried factoring in Arena effects when looking at GSAx, and it does increase the repeatability of GSAx. Goaltending is still unpredictable, but at least this helps ever so slightly.
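The article does not spell out how the adjustment was done, so the following is only one possible sketch, not the method used here. It assumes that GSAx is expected goals against minus actual goals against, and that a team's Arena effect inflates the xG of every unblocked shot in its home games by the same amount. The function name `adjust_home_gsax` is hypothetical. The inputs are Lundqvist's home numbers and the all-seasons NYR Arena effect from the tables in this article; those two samples do not cover exactly the same games, so the result is rough at best.

```python
def adjust_home_gsax(gsax_home: float, fa_home: int, arena_effect: float) -> float:
    """Home GSAx with the assumed per-shot arena inflation removed.

    Assumes GSAx = xGA - GA, so lowering xGA by some amount lowers GSAx
    by the same amount, and that each home fenwick against is inflated
    by `arena_effect` expected goals.
    """
    inflation = fa_home * arena_effect  # assumed excess xGA at home
    return gsax_home - inflation


# Lundqvist at home: GSAx 155.13 on 12187 fenwick against.
# NYR all-seasons Arena effect: 0.01063.
adjusted = adjust_home_gsax(gsax_home=155.13, fa_home=12187, arena_effect=0.01063)
print(f"adjusted home GSAx: {adjusted:.2f}")
```

Under these assumptions most of the home/road gap in Lundqvist's GSAx disappears, which is consistent with the idea that the home data is inflated. A more careful version would use season-by-season Arena effects matched to the games actually played.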

In the near future shot tracking will become automated and all of this will be obsolete, but until then adjustments to the current xG models are needed.

Besides using data from www.evolving-hockey.com, I’ve also calculated the Arena effect using xG models from www.naturalstattrick.com and www.moneypuck.com.

The findings from NaturalStatTrick were very similar, but MoneyPuck already accounts for the arena differences. However, the approach used on MoneyPuck is very different, so I still think some adjustment is needed. For goalie evaluations I would definitely recommend using MoneyPuck or only looking at road data.

All data in this article is from www.Evolving-hockey.com.

Also thanks to www.naturalstattrick.com and www.Moneypuck.com.
