Predicting the Premier League top scorers

Fredrik Olsson

Data Scientist

Feb 23, 2024

Introduction

With the current Premier League season 2023/2024 rapidly approaching its final phase, we are going to take a small step back in time and focus on the last three seasons. While there is of course more to the sport than scoring goals, it is arguably the most enticing part of the game. So, in this blog post, we will outline an approach to predicting the top scorers in the Premier League season 2022/2023, using machine learning techniques and player data from previous seasons.

Problem

More specifically, we want to build a model that - given a set of information on the player - can predict the number of goals that player will score during the Premier League season. With such a model, we can make predictions on all the Premier League players and thus get a list of predicted top scorers and their respective goal counts.

While this is a regression problem, it doesn't quite fit the standard Linear Regression setup, and the reason lies with the target variable: the number of goals. In standard Linear Regression, the target variable is assumed to follow a normal distribution, i.e. $Y_i \sim N(\mu_i, \sigma^2)$, and we model the expected value as a linear function of our features:

$$\mu_i = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p = X_i \beta$$

This however assumes that the target can take on any real number, which in our case is not true. The number of goals scored is count data, i.e. non-negative integers only. Therefore, we instead assume that the number of goals follows a Poisson distribution, i.e. $Y_i \sim \mathrm{Poi}(\mu_i)$, and we model the logarithm of the expected value as a linear function of our features:

$$\ln(\mu_i) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p = X_i \beta$$

This is what is called a Poisson Regression problem.
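
As a minimal illustration of the difference, here is a sketch of a Poisson GLM fitted with scikit-learn's PoissonRegressor on synthetic count data (the data and coefficients are made up for the example; this is not the blog's actual model):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)

# Synthetic count data whose log-mean is linear in the features,
# matching ln(mu) = beta_0 + X beta above.
X = rng.normal(size=(500, 2))
true_beta = np.array([0.4, -0.3])
mu = np.exp(0.5 + X @ true_beta)
y = rng.poisson(mu)

# PoissonRegressor fits exactly this GLM: a log link with
# Poisson deviance as the loss.
model = PoissonRegressor(alpha=0.0)  # no regularisation
model.fit(X, y)

print(model.intercept_, model.coef_)  # recovered beta_0 and beta
```

With enough data, the fitted intercept and coefficients land close to the true values used to generate the counts, which is the behaviour a plain least-squares fit on the raw counts would not give you.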

Datasets

Alright, so let's dive into the datasets we are going to use to solve this Poisson Regression problem. We have three different sets: one from 2022/2023 and two from the last two Premier League seasons before that. All data comes from transfermarkt.com. For the previous two seasons we have the target variable (number of goals scored) available to us, together with features such as position, age, market_value, placement, games, subs_on, subs_off and assists.

We will use the dataset from the 2020/2021 season as our training data and the one from the 2021/2022 season as the validation set, when training and evaluating a model that we can use to make predictions on the dataset for the 2022/2023 season. The observant reader will notice that there are some features in the train and validation sets that we don't have available before the season starts (which is when we want to make our predictions for the 2022/2023 season), namely:

  • games

  • subs_on

  • subs_off

  • assists

We don't actually have placement either, but here we will use the value from the previous season. This is more about giving the model an understanding of the level of the team the player plays for than about the exact position the team finishes in. But when it comes to the number of games (and substitutions on and off) and the assists, we will have to deal with not having those features available when we make our predictions for the 2022/2023 season.

Note that goalkeepers and defenders have been excluded from the datasets, so we are only working with midfielders and attackers. Players who did not score a single goal during the previous seasons have also been removed from the training and validation datasets.
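
A sketch of this filtering step, on a made-up toy dataset (the real transfermarkt data isn't shown here, so the names and values are purely illustrative):

```python
import pandas as pd

# Toy stand-in for the real dataset; column names follow the blog post.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "position": ["Goalkeeper", "Defender", "Attacking Midfield", "Winger", "Forward"],
    "goals": [0, 2, 5, 0, 11],
})

# Keep only midfielders and attackers ...
outfield = df[~df["position"].isin(["Goalkeeper", "Defender"])]

# ... and drop players without a single goal (training/validation data only).
train_like = outfield[outfield["goals"] > 0]

print(train_like["name"].tolist())
```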

Modelling - basic features

Let's start to build some models! We'll start simple and use only the relevant features that are available to us in all datasets:

  • position

  • age

  • market_value

  • placement

Model types

Since we have simple tabular data, we will make use of the tree-based models:

  • Random Forest Regressor

  • XGBoost Regressor

that often perform best on this type of data. Note that we use Poisson deviance as the split criterion for the Random Forest Regressor, and the Poisson objective function in the XGBoost Regressor, to handle this Poisson Regression problem properly.
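
A hedged sketch of how a tree-based model can be configured for a Poisson target. The runnable part uses scikit-learn's `criterion="poisson"` on synthetic data; the corresponding XGBoost setting, `objective="count:poisson"`, is only noted in a comment to keep the example dependency-free:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = rng.poisson(np.exp(0.3 + 0.5 * X[:, 0]))  # non-negative count target

# criterion="poisson" makes each split minimise Poisson deviance instead
# of squared error (it requires y >= 0). The XGBoost counterpart would be
# XGBRegressor(objective="count:poisson").
rf = RandomForestRegressor(n_estimators=100, criterion="poisson", random_state=1)
rf.fit(X, y)
preds = rf.predict(X)
print(preds[:5])
```

Since the forest averages leaf means of non-negative counts, the predictions themselves are guaranteed to be non-negative, which already fits the goal-count setting better than plain squared-error trees.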

Eval metrics

We will also need an evaluation metric to find the best model, both in terms of model types and hyper-parameters. A natural choice in the case of Poisson Regression is to use Mean Poisson Deviance:

$$D(y, \hat{y}) = \frac{1}{n} \sum_{i=0}^{n-1} 2 \left( y_i \ln(y_i / \hat{y}_i) + \hat{y}_i - y_i \right)$$

As with many loss functions, the value we get is quite hard to interpret beyond the fact that lower is better. We will here instead use a metric that has a natural interpretation in our case and captures what we want our model to achieve, namely Mean Absolute Error:

$$\mathrm{MAE}(y, \hat{y}) = \frac{1}{n} \sum_{i=0}^{n-1} \left| y_i - \hat{y}_i \right|$$

The MAE tells us by how many goals our predictions on average differ from the actual number of goals scored by the players. This aligns well with wanting a model that can predict the number of goals a player will score, and the value is easy to interpret.
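
Both metrics are available in scikit-learn; a small sketch with made-up goal counts and predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_poisson_deviance

y_true = np.array([0, 1, 3, 7, 12])            # actual goals (made up)
y_pred = np.array([0.5, 1.2, 2.0, 6.5, 10.0])  # model predictions (made up)

# Mean Poisson deviance: the natural loss for counts, but hard to
# interpret. Note it requires strictly positive predictions.
print(mean_poisson_deviance(y_true, y_pred))

# MAE: "on average, how many goals off are we?"
print(mean_absolute_error(y_true, y_pred))
```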

Best model

After training and hyper-parameter tuning the models on the training (2020/2021) and validation (2021/2022) datasets, the best model (according to validation MAE) was an XGBoost Regressor:

| Features | Model type | Training MAE | Validation MAE |
| --- | --- | --- | --- |
| Basic | XGBoost | 1.4265 | 2.1397 |

So, on average, this model's goal prediction for a player differs by a little more than 2 goals from the actual number of goals scored.

Modelling - fully extended features

Okay, so we have our first model. Now, let's try to improve the results! We are going to ignore the fact that we have some features that are not available to us in the dataset for the season 2022/2023, and let the model use them anyway. In other words, let's include:

  • games

  • subs_on

  • subs_off

  • assists

in the model as well. Hopefully that will improve the model performance, and we will deal with the issues of these features not being available in all datasets later on.

Best model

The inclusion of highly relevant extra information, such as the number of games played, helped the models reach better performance. Again the XGBoost model (with a different set of hyper-parameters) was the best performer:

| Features | Model type | Training MAE | Validation MAE |
| --- | --- | --- | --- |
| Basic | XGBoost | 1.4265 | 2.1397 |
| Fully extended | XGBoost | 1.1163 | 1.7314 |

By adding these extra features to the model, we managed to get our predictions ~0.4 goals closer to the actual value on average.

Feature importances

Judging by the results, we would obviously like to include these extra features, but the problem still stands that they are missing from the 2022/2023 dataset we want to make predictions on, since we make our predictions at the start of the season, before this data is known to us. There are solutions for that, which we will look at later, but it's definitely easier to deal with one missing feature than with four of them at once. Therefore, we will look at the feature importances of the model to see if one of these four is more important than the others, and try to use just that one.

We will use the feature importance technique called Permutation Feature Importance. The idea is that - one feature at a time - we randomly shuffle the values of that feature between the rows in the dataset, and thus break the relationship between that feature and the target variable. We then compare the model score (in this case validation MAE) on the original data with the score on the distorted data containing the shuffled feature. This difference gives us an indication of how much the model depends on that feature. Due to the random nature of the procedure, we repeat it several times for every feature, and end up with a box-plot of the feature importances (on both the train and validation datasets).

We can clearly see that the games feature, together with position and market_value, prove to be the most important features for the model. The subs_on feature also seems to be pretty useful, but let's stick with adding only one missing feature.
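
The permutation procedure described above can be sketched with scikit-learn's `permutation_importance`. This runs on synthetic data with one deliberately strong feature and one pure-noise feature, not the blog's actual dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
n = 400
games = rng.integers(0, 38, n).astype(float)  # strong signal
age = rng.integers(17, 38, n).astype(float)   # weak signal
noise = rng.normal(size=n)                    # no signal at all
X = np.column_stack([games, age, noise])
y = rng.poisson(0.2 * games + 0.01 * age)

model = RandomForestRegressor(criterion="poisson", random_state=2).fit(X, y)

# Shuffle one feature at a time, n_repeats times, and measure how much
# the score drops; the scoring mirrors the MAE used for validation here.
result = permutation_importance(
    model, X, y, scoring="neg_mean_absolute_error", n_repeats=10, random_state=2
)
for name, imp in zip(["games", "age", "noise"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

The `games` column, which actually drives the target, gets a clearly larger importance than the noise column, which is exactly the kind of separation the box-plot above is meant to show.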

Modelling - extended features

Before we try to deal with the games feature being missing from the 2022/2023 dataset, let's train and validate the model using the first set of features (see "Basic features") extended with only the games feature - leaving out the other ones added for the previous model - and see how well it performs in the setting where that feature is available to us.

Best model

The XGBoost model still performs best among the chosen model types, and comparing the validation MAE between this one ("Extended features") and the previous model ("Fully extended features"), we are not that far off from the performance we got when also including the other extra features:

| Features | Model type | Training MAE | Validation MAE |
| --- | --- | --- | --- |
| Basic | XGBoost | 1.4265 | 2.1397 |
| Fully extended | XGBoost | 1.1163 | 1.7314 |
| Extended | XGBoost | 1.2427 | 1.7747 |

Data imputation

At last, we are going to deal with the games feature being missing from the 2022/2023 dataset, which is the one we want to make predictions on. To simulate this setting, we will remove the values for this feature in the validation dataset, and then try to impute reasonable values.

The method we will use for this is called k-Nearest Neighbours (k-NN) Imputation. Here we make use of the fact that the games feature is available in the training data. The idea is that, for every data point in the validation dataset, we find the k (5 in our case) most similar data points in the training dataset, based on the features we do have available, and then impute the average of those data points' games values. So, we are basically modelling the missing feature using the available ones.
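
A small sketch of this imputation with scikit-learn's `KNNImputer`, on made-up rows (and with only a handful of training rows, so k=2 here instead of the k=5 used in the post):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Training rows: [age, market_value, games] - games is known here.
X_train = np.array([
    [24, 40.0, 34],
    [29, 55.0, 36],
    [21, 10.0, 12],
    [33, 25.0, 28],
    [26, 70.0, 35],
    [19,  5.0,  8],
])

# Validation rows: games is unknown, marked as NaN.
X_val = np.array([
    [25, 45.0, np.nan],
    [20,  8.0, np.nan],
])

# For each validation row, find the k=2 nearest training rows by the
# features that are present, and impute the mean of their games values.
imputer = KNNImputer(n_neighbors=2)
imputer.fit(X_train)
print(imputer.transform(X_val))
```

Note that `KNNImputer` uses an unscaled (NaN-aware) Euclidean distance, so in a real pipeline the features would typically be standardised first to keep one large-valued column from dominating the neighbour search.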

Best model with data imputation

For our final model evaluation, we will evaluate the models using the same features as in the previous case ("Extended features"), but now with the games feature being imputed in the validation set - which is the setting we will have when predicting the 2022/2023 season's top scorers. In this final case too, the XGBoost model came out on top, with the following evaluation scores:

| Features | Model type | Training MAE | Validation MAE |
| --- | --- | --- | --- |
| Basic | XGBoost | 1.4265 | 2.1397 |
| Fully extended | XGBoost | 1.1163 | 1.7314 |
| Extended | XGBoost | 1.2427 | 1.7747 |
| Extended - with data imputation | XGBoost | 1.2796 | 2.0896 |

As we can see, we unfortunately get a performance decrease compared to the previous model, where the games feature was available in the validation dataset. When we looked at feature importances, games proved to be a very important feature for the model, and the data imputation doesn't quite seem to fully replace the quality of the actual data.

This model is however an improvement on the first one we made, and since these two are the only ones we can actually use on the dataset we want to predict, it is our best model for the problem at hand!

Top scorers 2022/2023

Using our best usable model, the "Extended features - with data imputation" variant of the XGBoost Regressor, we get the following predicted top 15 goal scorers for the Premier League season 2022/2023:

| Name | Position | Team | Goal prediction |
| --- | --- | --- | --- |
| Harry Kane | Forward | Tottenham | 23 |
| Mohamed Salah | Winger | Liverpool | 22 |
| Kevin De Bruyne | Attacking Midfield | Manchester City | 19 |
| Heung-min Son | Winger | Tottenham | 17 |
| Erling Haaland | Forward | Manchester City | 17 |
| Jarrod Bowen | Winger | West Ham | 12 |
| Bernardo Silva | Attacking Midfield | Manchester City | 12 |
| Bruno Fernandes | Attacking Midfield | Manchester United | 11 |
| Cristiano Ronaldo | Forward | Manchester United | 11 |
| Jack Grealish | Winger | Manchester City | 10 |
| James Maddison | Attacking Midfield | Leicester City | 10 |
| Gabriel Jesus | Forward | Arsenal | 10 |
| Richarlison | Forward | Tottenham | 9 |
| Raheem Sterling | Winger | Chelsea | 9 |
| Dominic Calvert-Lewin | Forward | Everton | 9 |

Below follows the actual top 15 goal scorers:

| Name | Position | Team | Actual goals |
| --- | --- | --- | --- |
| Erling Haaland | Forward | Manchester City | 36 |
| Harry Kane | Forward | Tottenham | 30 |
| Ivan Toney | Forward | Brentford | 20 |
| Mohamed Salah | Winger | Liverpool | 19 |
| Callum Wilson | Forward | Newcastle | 18 |
| Marcus Rashford | Winger | Manchester United | 17 |
| Martin Ødegaard | Attacking Midfield | Arsenal | 15 |
| Ollie Watkins | Forward | Aston Villa | 15 |
| Gabriel Martinelli | Forward | Arsenal | 15 |
| Bukayo Saka | Winger | Arsenal | 14 |
| Aleksandar Mitrović | Forward | Fulham | 14 |
| Harvey Barnes | Winger | Leicester City | 13 |
| Rodrigo | Forward | Leeds United | 13 |
| Gabriel Jesus | Forward | Arsenal | 11 |
| Miguel Almirón | Winger | Newcastle | 11 |

Conclusion

Comparing the predicted and actual lists of top 15 goal scorers in the Premier League season 2022/2023, this clearly proved to be a very tricky problem. Only four players show up in both the actual and predicted tables: Harry Kane, Mohamed Salah, Erling Haaland and Gabriel Jesus.

I believe the main thing making this problem really hard is that some very important factors are involved that we unfortunately cannot capture with our feature set:

  • Team form: teams performing exceptionally well compared to previous seasons, like Arsenal playing at their highest level in many years, with players like Martin Ødegaard, Gabriel Martinelli and Bukayo Saka making it into the list of top scorers.

  • Player form: players either finding their goal-scoring form compared to previous seasons, like Marcus Rashford, or losing theirs, like Jarrod Bowen.

  • New players like Erling Haaland on whom we do not have any historical data.

  • Injuries and January transfers causing players (player form aside) to play fewer games than expected - for example Cristiano Ronaldo, who left the Premier League in the January transfer window.

Apart from the lack of features, we are also working with very small datasets with high variance, which is not an optimal setting for a successful machine learning solution.

To mitigate some of these issues, and to try and improve the model performance in general, a few ideas to make the model better could be:

  • Increase the amount of data:

    • Include more historical data from previous Premier League seasons. We would probably have to deal with the time aspect in the modelling, i.e. that goal-scoring behaviour tends to change over time.

    • Include data from other leagues. Other leagues can of course differ from the Premier League in terms of goal scoring, so that would have to be taken care of in the modelling phase.

  • Explore the possibilities of finding additional useful player data that can be added to the feature set.

  • While it contradicts the effort of increasing the amount of data, we could set a higher threshold than at least 1 goal for inclusion in the training and validation data. The idea is to model only players that have a chance of making the list of top scorers, and to avoid diluting the dataset with players that only score a few goals.

  • We could try a more robust modelling of the missing games feature than simple k-NN imputation.

  • We could do some more feature engineering on the data we have, in order to find useful combinations of features that the model does not discover on its own during training.

All in all, a very fun project with plenty of room for improvements!

Published on February 23rd 2024

Last updated on February 23rd 2024, 15:34
