Game projection model – In-season Model (Part II)

In this article, I will start building my game projection model. I'd recommend reading the first two articles in the series before you continue with this post.

The model construction

The idea is to build a game projection model that actually consists of two separate models:

  1. A pre-season model
  2. An in-season model

The goal of the pre-season model is to give every player a starting value when the season starts. The goal of the in-season model is to adjust the player values solely based on in-season information.

In the previous articles, I let every player start at the exact same value and then considered how different in-season information would affect the game predictions. In other words, I laid the groundwork for building the in-season model.

For me, it's very important that the in-season model is easy to update. It must be something I can update daily within 5-10 minutes. The pre-season model only needs to be updated once a year, so it can theoretically be much more complex.

In this article I will focus on building the in-season model.

Building the in-season model

In the first article, I already looked at the potential variables to include in my in-season model. The next step is to combine these variables in the smartest possible way.

There are three parameters I can modify when building the in-season model:

  1. Which variables to include.
  2. How many games to gather information from.
  3. How to weigh recent games vs. older games.

In the previous articles, I always used information from the previous 50 in-season games and weighted all games equally. Theoretically, you could use a different number of games and/or weight games based on recency.
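As a minimal sketch of that equal-weight, 50-game window (assuming a long-format pandas DataFrame with one row per player-game; the file and column names here are hypothetical, not the author's actual setup):

```python
import pandas as pd

# Hypothetical long-format data: one row per player per game,
# sorted chronologically within each player's season.
df = pd.read_csv("player_games.csv")
df = df.sort_values(["player", "season", "date"])

# Player value entering a game = equal-weight mean of the metric over
# his previous 50 in-season games. shift(1) excludes the current game,
# so the value only uses information available before puck drop.
df["ev_xg_value"] = (
    df.groupby(["player", "season"])["ev_xg_diff"]
      .transform(lambda s: s.shift(1).rolling(window=50, min_periods=1).mean())
)
```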

Metric combinations

To start, I'm simply using the data collected in the previous articles. From this data I want to test three different metric combinations:

  1. EV xG+/-, EV G+/-, PP G+/-, SH G+/-, Individual Points, GSAx
  2. EV xG+/-, PP G+/-, SH G+/-, GAx, GSAx
  3. EV xGA, EV GF, PP G+/-, SH G+/-, GSAx

Combination 1 is somewhat similar to Dom Luszczyszyn's Game Score model, whereas combinations 2 and 3 are similar to (although simpler than) Evolving-Hockey's xGAR and GAR models, respectively.

I combined the metrics and fit the data to find the weight of each metric. The results can be found in this table:

Model           Log loss
Combination 1   0.6748
Combination 2   0.6752
Combination 3   0.6759
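The post doesn't spell out the fitting procedure, but one plausible sketch is a logistic regression on game outcomes, evaluated with log loss (setting aside proper out-of-sample evaluation; the file and column names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Hypothetical game-level data: each feature is the home-minus-away
# difference in the teams' aggregated player values for one metric.
games = pd.read_csv("games.csv")
features = ["ev_xg_diff", "ev_g_diff", "pp_g_diff",
            "sh_g_diff", "ipoints_diff", "gsax_diff"]   # combination 1
X, y = games[features], games["home_win"]               # 1 = home team won

model = LogisticRegression().fit(X, y)
p = model.predict_proba(X)[:, 1]

# Log loss = -mean(y*log(p) + (1-y)*log(1-p)); lower is better.
print("log loss:", log_loss(y, p))
print(dict(zip(features, model.coef_[0])))              # fitted metric weights
```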

This is the visualization we get if we look at the log loss as a function of game number. I've only included the trendlines, to make the graph easier to interpret:

We see that combination 1 is slightly better than combinations 2 and 3. We also see that the log loss of the three model combinations is close to the market log loss at the tail end of the seasons.

This is what we get if we zoom in on the last part of the seasons – you could say we give the models 70 games to learn:

Now the performance of the models is comparable to the market's. We also see that the log loss has stabilized, so you could say the models have stopped learning at this point. This isn't surprising, since most players will have played more than 50 games by then.

Conclusion

If you combine different variables, you can build a relatively simple in-season model that can compete with the market. The model just needs to be fed enough information.

Combination 1 showed the best results, so this is the combination I will build on. I think you could create a good model based on any of the combinations, but the plan all along was to build a Game Score-like metric, so I'm pleased that combination 1 showed the most promising results.

Number of games to include

The next step is to change the number of games included in the predictions. So far, I've based a player's value on his performance over his previous 50 in-season games. Changing the number of games requires a fair bit of calculation, though. For that reason, I'm only testing two alternatives: using all in-season games, and using the last 25 in-season games.

Here's how changing the number of games affects each variable:

Variable    Log loss (all)   Log loss (50)   Log loss (25)
EV xG+/-    0.6803           0.6794          0.6793
EV G+/-     0.6790           0.6790          0.6808
PP G+/-     0.6841           0.6840          0.6860
SH G+/-     0.6862           0.6861          0.6872
GAx         0.6857           0.6854          0.6857
GSAx        0.6862           0.6863          0.6869
iPoints     0.6793           0.6794          0.6809

Decreasing the number of games to 25 generally makes the predictions worse. Increasing the number of games mostly has little or no effect, but it does decrease the predictive power of EV xG+/- quite significantly.

You could of course try to optimize the number of games for each variable, but I think I prefer the simplicity and interpretability of keeping the number of games at 50 for all the variables.
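For completeness, here's a sketch of how the three window lengths could be compared for a single variable, continuing the hypothetical setup from the earlier sketch:

```python
# Compute the player value with three different window lengths and
# re-evaluate the model each time. window=None stands for "all prior
# in-season games" (an expanding window).
for window in [None, 50, 25]:
    def prior_mean(s, w=window):
        shifted = s.shift(1)                 # exclude the current game
        if w is None:
            return shifted.expanding(min_periods=1).mean()
        return shifted.rolling(window=w, min_periods=1).mean()

    df["value"] = df.groupby(["player", "season"])["ev_xg_diff"].transform(prior_mean)
    # ...then rebuild the game-level features from df["value"], re-fit the
    # model as above, and record the resulting log loss for this window.
```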

Increasing the weight on recent games

The last thing we can do is increase the weight put on recent games. Theoretically, you would like to put more weight on the most recent games. The problem is that it's difficult to do from a calculation standpoint. The dataset consists of 300K+ rows, and it would require a lot of time and/or computing power to differentiate the weights based on recency (at least with my relatively limited coding/math skills).

I know how to do it, but I'm not sure it's worth the time and effort required. In the end, you could probably build a slightly better prediction model, but I doubt the difference would be worth the effort. I might look into this later on, but that feels more like an offseason project.
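For illustration only, one way such recency weighting could look is an exponentially weighted mean over prior games; the halflife here is an arbitrary assumption, not a tuned value:

```python
# One possible recency weighting: an exponentially weighted mean over
# prior games. halflife=25 (games) is purely illustrative and would need
# tuning against log loss like every other parameter in the model.
df["ev_xg_value_recent"] = (
    df.groupby(["player", "season"])["ev_xg_diff"]
      .transform(lambda s: s.shift(1).ewm(halflife=25, min_periods=1).mean())
)
```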

Summary

The in-season model is player-based, meaning every player has a value, and that value changes based on his performance in his previous 50 in-season games. In building the model, I let every player start each season at the exact same value (a value of 0). This way I can isolate the effect of the in-season model.

I’m going with combination 1, so the model depends on the following variables:

  • On-ice EV xG+/-
  • On-ice EV G+/-
  • On-ice PP G+/- above average
  • On-ice SH G+/- above average
  • GSAx
  • Individual points above average (depending on position and role)

Here’s the prediction performance compared to the market (closing line):

At the beginning of each season, the market is obviously doing significantly better than the in-season model – expecting every player to start the season at the exact same level is not the best approach!

This leads us to the next step: building the pre-season model, so that we can determine a starting value for every player.

From prediction to description

The main goal of the in-season model is to predict the very next game, but what if we used the same model to describe performance?

In other words, what if we calculate a score based on the variables mentioned above, weighting each variable the same way the in-season model does?
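As a sketch, that descriptive score is just the model's weighted sum accumulated per player-season, continuing the hypothetical setup from earlier (the weights below are placeholders, not the actual fitted coefficients):

```python
# Descriptive score: the same weighted sum the in-season model uses,
# summed over a player's season. The weights are illustrative placeholders
# standing in for the coefficients fitted in the prediction model.
weights = {"ev_xg_diff": 0.40, "ev_g_diff": 0.20, "pp_g_diff": 0.10,
           "sh_g_diff": 0.10, "gsax": 0.10, "ipoints": 0.10}

df["score"] = sum(df[col] * w for col, w in weights.items())
top_seasons = (df.groupby(["player", "season"])["score"]
                 .sum()
                 .nlargest(20))
print(top_seasons)
```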

Here are the top 20 seasonal performances (from 14/15 to 19/20) if we do that:

Player              Season    Score
CONNOR.MCDAVID      2016-17   0.04281
JOE.THORNTON        2015-16   0.04239
BRAYDEN.POINT       2018-19   0.03967
VICTOR.HEDMAN       2018-19   0.03845
CLAUDE.GIROUX       2017-18   0.03796
RYAN.SUTER          2016-17   0.03767
PATRICE.BERGERON    2016-17   0.03713
BRAD.MARCHAND       2016-17   0.03694
MARK.GIORDANO       2018-19   0.03666
JOE.PAVELSKI        2015-16   0.03639
MAX.PACIORETTY      2019-20   0.03564
JOHN.KLINGBERG      2017-18   0.03406
BRAD.MARCHAND       2018-19   0.03404
BRAD.MARCHAND       2019-20   0.03362
KEVIN.SHATTENKIRK   2014-15   0.03330
MARK.STONE          2019-20   0.03320
ALEX.OVECHKIN       2015-16   0.03268
RYAN.O’REILLY       2018-19   0.03241
JUSTIN.SCHULTZ      2016-17   0.03241
PATRICE.BERGERON    2018-19   0.03241

Perhaps not the perfect top 20 list, but also not the worst list you could come up with. The model isn’t built to be a player evaluation tool, but it does seem to pass the smell test.

These are the top 20 players overall across the entire timeframe:

Player               Score
PATRICE.BERGERON     0.1610
NIKITA.KUCHEROV      0.1545
BRAD.MARCHAND        0.1516
SIDNEY.CROSBY        0.1496
VICTOR.HEDMAN        0.1472
PATRIC.HORNQVIST     0.1344
JOE.PAVELSKI         0.1338
KRIS.LETANG          0.1256
RYAN.SUTER           0.1253
EVGENI.MALKIN        0.1199
TOREY.KRUG           0.1183
BRENT.BURNS          0.1124
DAVID.PASTRNAK       0.1079
JOHN.KLINGBERG       0.1049
VLADIMIR.TARASENKO   0.1039
JARED.SPURGEON       0.1028
MARK.STONE           0.1020
STEVEN.STAMKOS       0.0971
BRAYDEN.POINT        0.0953
CONNOR.MCDAVID       0.0943

Here are the top 20 players per 60 minutes (minimum 2,000 minutes of TOI):

Player               Score/60
PATRICE.BERGERON     0.00126
PATRIC.HORNQVIST     0.00124
BRAD.MARCHAND        0.00113
NIKITA.KUCHEROV      0.00113
BRAYDEN.POINT        0.00108
SIDNEY.CROSBY        0.00107
ANDREI.SVECHNIKOV    0.00102
EVGENI.MALKIN        0.00101
DAVID.PASTRNAK       0.00100
PAVEL.DATSYUK        0.00097
JOE.PAVELSKI         0.00094
ANTHONY.CIRELLI      0.00092
VICTOR.HEDMAN        0.00090
VLADIMIR.TARASENKO   0.00087
JAKE.DEBRUSK         0.00083
STEVEN.STAMKOS       0.00082
TOREY.KRUG           0.00080
KRIS.LETANG          0.00079
AUSTON.MATTHEWS      0.00079
CONNOR.MCDAVID       0.00079

Clearly, some players are underrated here (Matthews, McDavid, MacKinnon) while others are overrated (Hornqvist)… But the model is relatively simple, so this is to be expected. Even in much more complex player evaluation models (like GAR and xGAR), you'll find plenty of player rankings to disagree with.

If we sum the player scores for each team and compare the totals to each team's goal differential, we get this correlation:
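For reference, a sketch of that team-level comparison, continuing the hypothetical setup from the earlier sketches (the standings file and its columns are assumptions):

```python
import pandas as pd

# team_goal_diff is assumed to be a Series indexed by (team, season)
# holding each team's goal differential, e.g. loaded from a standings file.
team_goal_diff = pd.read_csv("standings.csv",
                             index_col=["team", "season"])["goal_diff"]

# Sum the player scores within each team-season and correlate the totals
# with goal differential (Series.corr aligns the two on their index).
team_scores = df.groupby(["team", "season"])["score"].sum()
print(team_scores.corr(team_goal_diff))
```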

Perspective

As I mentioned earlier, the next step will be to build the pre-season model and then combine the two models.

My approach to building the pre-season model will be similar to the way I built the in-season model. I will isolate the performance of the pre-season model by not changing the player values as the season progresses.

So, when I built the in-season model, I let every player start at the same value and then changed that value based on in-season performance. When I build the pre-season model, I will find a starting value for each player and then leave that value unchanged throughout the season. This way, the two models should be relatively independent when I combine them in the end.

Data from http://www.Evolving-Hockey.com
