Predicting Football with Poisson Models

Football is arguably the most unpredictable of the major sports. A team can dominate possession, create twice as many chances, hit the woodwork three times, and still lose 1–0. With an average of roughly 2.7 goals per Premier League match, every single goal carries enormous weight. One lucky deflection can separate a title winner from a runner-up.

That unpredictability is what makes the sport compelling. It is also what makes modelling it so difficult—and so interesting. This project set out to answer a deceptively simple question: given only the identity of the two teams and who is playing at home, how well can we predict the outcome of every match in a Premier League season?

380 Matches per Season

1.53 Avg Home Goals

1.14 Avg Away Goals

Why Poisson?

Goals in football are rare, discrete events. A team scores 0, 1, 2, maybe 3 goals in a game—almost never more. This makes the Poisson distribution a natural candidate. It describes the probability of a given number of events occurring in a fixed interval when those events happen independently at a constant average rate. If a team scores at a rate of 1.75 goals per game, the Poisson distribution tells us the probability of them scoring 0, 1, 2, 3, or more goals in any specific match.
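As a quick illustration of the 1.75 goals-per-game example, the sketch below evaluates the Poisson probability mass function directly. The helper name `poisson_pmf` is an illustrative choice, not something from the original project.

```python
import math

def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events when events occur at the given average rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)

rate = 1.75  # goals per game, the example rate used in the text
probs = [poisson_pmf(k, rate) for k in range(4)]
p_four_plus = 1.0 - sum(probs)  # everything not covered by 0, 1, 2, 3 goals

for k, p in enumerate(probs):
    print(f"P({k} goals) = {p:.3f}")
print(f"P(4+ goals) = {p_four_plus:.3f}")
```

At this rate the team fails to score about 17% of the time, scores exactly once about 30% of the time, and scores four or more about 10% of the time.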

A key property of the Poisson distribution is equidispersion: the mean equals the variance. Premier League data from 2017–18 show only minor departures from this property. When each team's mean goals per match is plotted against the variance, the points cluster tightly around the diagonal, which is consistent with the Poisson assumption.

If the mean roughly equals the variance, the Poisson distribution is doing its job. And in the Premier League, it does.
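One simple way to run this check is to compute the variance-to-mean ratio of a team's goals per match. The sketch below uses a made-up list of scores purely for illustration; it is not the 2017–18 data, and `dispersion_ratio` is a hypothetical helper name.

```python
from statistics import mean, variance

def dispersion_ratio(goals):
    """Sample variance divided by sample mean; close to 1 under a Poisson model."""
    return variance(goals) / mean(goals)

# Hypothetical goals per match for one team (illustrative values only)
example_goals = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0, 2, 1]
ratio = dispersion_ratio(example_goals)
print(f"mean = {mean(example_goals):.2f}, variance/mean = {ratio:.2f}")
```

A ratio well above 1 would suggest overdispersion, and a ratio well below 1 underdispersion. With only 38 matches per team, some scatter around 1 is expected even when the Poisson model is adequate.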

Of course, a raw Poisson model that treats every team the same would be useless. Manchester City and Huddersfield do not score at the same rate. The trick is to let each team have its own scoring rate, determined by its offensive strength and its opponent's defensive weakness. That is where regression comes in.

Three Models, One Idea

The project tested three progressively refined models on the full 2017–18 Premier League season. All share the same core concept: model each team's expected goals as a Poisson variable whose rate depends on team-specific attack and defence parameters plus a home advantage term.

Double Poisson

The baseline. Each team's goals follow an independent Poisson distribution. The expected number of goals for the home team in match i is driven by the home team's attacking strength, the away team's defensive weakness, a league-wide intercept, and a home advantage bonus. The away team's expected goals are modelled symmetrically, minus the home effect.

Double Poisson Model
log(λ_home) = μ + γ + att_home + def_away
log(λ_away) = μ + att_away + def_home

Each team has an attack parameter and a defence parameter. The home effect γ gives a blanket bonus to whoever is playing at their own ground.
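To make the two equations concrete, here is a minimal sketch that turns a set of parameters into expected goals and then into home win, draw, and away win probabilities under the independence assumption. All parameter values except γ = 0.29 are hypothetical, and the function names are illustrative.

```python
import math

def poisson_pmf(k: int, rate: float) -> float:
    return math.exp(-rate) * rate**k / math.factorial(k)

def expected_goals(mu, gamma, att_home, def_home, att_away, def_away):
    """Return (lambda_home, lambda_away) from the two log-linear equations."""
    lam_home = math.exp(mu + gamma + att_home + def_away)
    lam_away = math.exp(mu + att_away + def_home)
    return lam_home, lam_away

def outcome_probabilities(lam_home, lam_away, max_goals=12):
    """Sum independent Poisson scoreline probabilities into (home win, draw, away win).

    The grid is truncated at max_goals, which loses a negligible amount of mass
    for football-sized scoring rates.
    """
    home = draw = away = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = poisson_pmf(x, lam_home) * poisson_pmf(y, lam_away)
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    return home, draw, away

# Hypothetical parameters: a fairly strong home side against a weaker visitor
lam_home, lam_away = expected_goals(
    mu=0.1, gamma=0.29, att_home=0.2, def_home=-0.1, att_away=-0.1, def_away=0.15
)
home_win, draw, away_win = outcome_probabilities(lam_home, lam_away)
print(f"expected goals: {lam_home:.2f} vs {lam_away:.2f}")
print(f"home {home_win:.1%}, draw {draw:.1%}, away {away_win:.1%}")
```

Because the defence term is added to the opponent's log rate, a larger `def_away` means the away side concedes more. In practice the attack and defence parameters also need an identifiability constraint (for example, summing to zero across teams); the source does not say which one was used.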

The limitation is the independence assumption: in reality, a team conceding a goal often pushes harder to equalise, introducing a subtle correlation between the two scorelines.

Bivariate Poisson

This model extends the Double Poisson by allowing positive correlation between the two teams' goals through a shared component λ₃. If λ₃ = 0, it collapses back to the Double Poisson. In the 2017–18 data, however, the observed correlation was slightly negative (−0.13). Since the Bivariate Poisson only permits positive correlation, the model estimated λ₃ as zero—making it identical to the Double Poisson in practice.
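The shared-component construction can be sketched as follows, assuming the standard formulation in which home goals are X = W₁ + W₃ and away goals are Y = W₂ + W₃, with W₁, W₂, W₃ independent Poisson variables. The function names and the rates used are illustrative.

```python
import math

def poisson_pmf(k: int, rate: float) -> float:
    return math.exp(-rate) * rate**k / math.factorial(k)

def bivariate_poisson_pmf(x, y, lam1, lam2, lam3):
    """P(X = x, Y = y) for X = W1 + W3 and Y = W2 + W3.

    Sums over k, the number of goals contributed by the shared component W3.
    """
    return sum(
        poisson_pmf(x - k, lam1) * poisson_pmf(y - k, lam2) * poisson_pmf(k, lam3)
        for k in range(min(x, y) + 1)
    )

# With lam3 = 0 the joint probability equals the product of two independent Poissons
independent = poisson_pmf(1, 1.5) * poisson_pmf(1, 1.1)
collapsed = bivariate_poisson_pmf(1, 1, 1.5, 1.1, 0.0)
print(f"independent: {independent:.4f}, bivariate with lam3=0: {collapsed:.4f}")
```

In this construction the covariance between the two scores equals λ₃, which cannot be negative. That is why the model has no way to represent the slightly negative correlation observed in the data.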

Dixon & Coles

The most refined of the three. Dixon and Coles (1997) observed that standard Poisson models systematically mispredict low-scoring outcomes: 0–0 and 1–1 draws occur more often than the independent model expects, while 1–0 and 0–1 results occur less often. Their fix is a correction factor τ that adjusts the joint probability only for scorelines where both teams score one goal or fewer, leaving everything else untouched.

This correction introduces a dependence parameter ρ that, unlike the Bivariate Poisson, can be negative—exactly what the data called for.
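A minimal sketch of the correction, following the form of τ given in Dixon and Coles (1997). To avoid a clash with the intercept μ used earlier, the two scoring rates are named `lam_home` and `lam_away` here. The rates and the value of ρ are hypothetical, not the fitted values.

```python
import math

def poisson_pmf(k: int, rate: float) -> float:
    return math.exp(-rate) * rate**k / math.factorial(k)

def tau(x, y, lam_home, lam_away, rho):
    """Dixon-Coles adjustment; x is home goals, y is away goals."""
    if x == 0 and y == 0:
        return 1.0 - lam_home * lam_away * rho
    if x == 0 and y == 1:
        return 1.0 + lam_home * rho
    if x == 1 and y == 0:
        return 1.0 + lam_away * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0  # all other scorelines are left untouched

def dixon_coles_pmf(x, y, lam_home, lam_away, rho):
    """Joint scoreline probability: tau times the independent Poisson product."""
    return tau(x, y, lam_home, lam_away, rho) * poisson_pmf(x, lam_home) * poisson_pmf(y, lam_away)

lam_home, lam_away, rho = 1.5, 1.1, -0.1  # hypothetical values
for score in [(0, 0), (1, 1), (1, 0), (0, 1)]:
    base = poisson_pmf(score[0], lam_home) * poisson_pmf(score[1], lam_away)
    adjusted = dixon_coles_pmf(*score, lam_home, lam_away, rho)
    print(f"{score[0]}-{score[1]}: independent {base:.4f}, adjusted {adjusted:.4f}")
```

With a negative ρ the adjustment moves probability from 1–0 and 0–1 onto 0–0 and 1–1. The four adjustments cancel exactly, so the probabilities still sum to one.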

Key result: The Dixon & Coles model achieved the best fit on the 2017–18 data, with the lowest AIC (2183.6) compared to the Double Poisson (2184.8) and Bivariate Poisson (2186.8). The improvement is modest but consistent.
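For readers unfamiliar with AIC, it is defined as 2k − 2 log L, where k is the number of parameters and L the maximised likelihood; lower is better. The sketch below backs out the implied change in log-likelihood from the reported AIC values, under the assumption (not stated in the source) that each extension adds exactly one parameter to the Double Poisson.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_likelihood

def implied_loglik_gain(aic_base: float, aic_extended: float, extra_params: int = 1) -> float:
    """Log-likelihood improvement of the extended model implied by two AIC values."""
    return (aic_base - aic_extended) / 2 + extra_params

# AIC values reported in the text
aic_double, aic_bivariate, aic_dixon_coles = 2184.8, 2186.8, 2183.6

# Assumption: one extra parameter each (lambda_3 and rho respectively)
print(f"Bivariate gain:   {implied_loglik_gain(aic_double, aic_bivariate):.2f}")
print(f"Dixon-Coles gain: {implied_loglik_gain(aic_double, aic_dixon_coles):.2f}")
```

Under that assumption the Bivariate Poisson gains nothing in log-likelihood, as expected when λ₃ is estimated at zero, and simply pays the 2-point penalty for the extra parameter. The Dixon & Coles model gains about 1.6, enough to outweigh its own penalty.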

What the Parameters Reveal

Once fitted, the model's parameters tell a clear story about each team's quality. The attack parameter captures how much more a team scores than the league average, on the log scale. The defence parameter captures how much it concedes, so a lower value means a stronger defence.

Manchester City topped the attacking rankings by a wide margin—no surprise for a team that scored 106 goals and won the title with 100 points. Liverpool sat second in attack. On the defensive side, Manchester United led with the best defence parameter, followed by Manchester City and Tottenham. An interesting outlier was Swansea: the weakest attack in the league, but a surprisingly solid defence.

The home advantage was estimated at 0.29 on the log scale. In practical terms, if two teams of equal strength met, the home side's win probability would jump from 37% to 44%—a substantial and statistically significant effect, with a 95% confidence interval of (0.17, 0.41).
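Because the model is log-linear, the home effect translates into a multiplier on expected goals once exponentiated. A short worked calculation using the estimate and interval quoted above:

```python
import math

gamma = 0.29                 # home effect on the log scale, from the text
ci_low, ci_high = 0.17, 0.41  # 95% confidence interval, from the text

multiplier = math.exp(gamma)
multiplier_low, multiplier_high = math.exp(ci_low), math.exp(ci_high)

print(f"home goal multiplier: {multiplier:.2f}")
print(f"95% interval: ({multiplier_low:.2f}, {multiplier_high:.2f})")
```

In other words, playing at home multiplies a team's expected goals by roughly 1.34, with an interval of about 1.19 to 1.51. The interval excludes 1, which corresponds to no home advantage.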

0.29 Home Effect (log)

44% Home Win Prob (equal teams)

−0.13 Goal Correlation

Simulating the League

Fitting a model is one thing; testing whether it can reproduce reality is another. Using the Dixon & Coles parameters, the entire 380-match season was simulated 10,000 times. For each simulated season, every match was generated from the model's predicted goal distributions, points were tallied, and the table was ranked.
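The simulation loop can be sketched as below. This is a simplified illustration, not the project's code: it uses a hypothetical four-team league with made-up parameters, samples goals from independent Poissons (omitting the Dixon & Coles τ correction), and breaks ties for first place at random rather than by goal difference.

```python
import math
import random

def sample_poisson(rate, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def simulate_season(teams, attack, defence, mu, gamma, rng):
    """Play every home/away pairing once and return each team's points total."""
    points = {team: 0 for team in teams}
    for home in teams:
        for away in teams:
            if home == away:
                continue
            lam_home = math.exp(mu + gamma + attack[home] + defence[away])
            lam_away = math.exp(mu + attack[away] + defence[home])
            home_goals = sample_poisson(lam_home, rng)
            away_goals = sample_poisson(lam_away, rng)
            if home_goals > away_goals:
                points[home] += 3
            elif home_goals < away_goals:
                points[away] += 3
            else:
                points[home] += 1
                points[away] += 1
    return points

def title_probabilities(teams, attack, defence, mu, gamma, n_sims=2000, seed=0):
    """Share of simulated seasons in which each team finishes top."""
    rng = random.Random(seed)
    titles = {team: 0 for team in teams}
    for _ in range(n_sims):
        points = simulate_season(teams, attack, defence, mu, gamma, rng)
        best = max(points.values())
        leaders = [team for team in teams if points[team] == best]
        titles[rng.choice(leaders)] += 1  # random tie-break (a simplification)
    return {team: titles[team] / n_sims for team in teams}

# Hypothetical league: team names and parameter values are illustrative only
teams = ["A", "B", "C", "D"]
attack = {"A": 0.4, "B": 0.1, "C": -0.1, "D": -0.4}
defence = {"A": -0.3, "B": -0.1, "C": 0.1, "D": 0.3}  # lower = concedes less
probs = title_probabilities(teams, attack, defence, mu=0.1, gamma=0.29)
print(probs)
```

The same counting approach extends to top-four and relegation probabilities: rank each simulated table and tally how often a team lands in the relevant positions.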

The results were strikingly close to reality.

Team	Actual Pts	Expected Pts	Diff (Actual − Expected)	P(Title)	P(Top 4)	P(Releg.)
Man City	100	93.9	+6.1	88%	100%	0%
Man United	81	77.7	+3.3	3%	83%	0%
Tottenham	77	76.3	+0.7	3%	78%	0%
Liverpool	75	79.1	−4.1	5%	87%	0%
Chelsea	70	67.7	+2.3	0%	28%	0%
Arsenal	63	66.4	−3.4	0%	22%	0%
Burnley	54	49.3	+4.7	0%	0%	2%
Huddersfield	37	32.7	+4.3	0%	0%	59%
Swansea	33	33.7	−0.7	0%	0%	52%
Stoke	33	33.3	−0.3	0%	0%	55%
West Brom	31	35.9	−4.9	0%	0%	41%

Manchester City outperformed their expected points by about six—impressive even by the model's already high estimation. The model gave them an 88% chance of winning the title, reflecting their dominance but also acknowledging that even dominant teams face variance across 38 matches.

Liverpool finished fourth with 75 points, but the model expected 79. According to 10,000 simulations, they had a better chance of finishing second than fourth.

At the bottom, the story is just as revealing. West Brom were relegated with 31 points, but the model expected them to earn closer to 36, potentially enough to survive. Their relegation probability was only 41%, far from certain, while Huddersfield (59%) and Stoke (55%) were the teams most likely to go down. The margins between survival and relegation are razor-thin, and the final table reflects one of many possible outcomes rather than an inevitable one.

Prediction: Reading the Season at Halftime

The final test was purely predictive. Using only the first 25 matchdays of data (roughly the end of January 2018), the model was re-estimated and then asked to simulate the remaining 13 rounds.

At that stage, Manchester City had 68 points and the model gave them a 100% chance of winning the title; the championship was essentially decided. The battle for Champions League places was the real drama: Chelsea (50 pts), Liverpool (50 pts), and Tottenham (48 pts) were separated by just two points. The model gave Liverpool the best odds of a top-four finish at 84%, Chelsea 66%, and Tottenham only 50%. In reality, Tottenham outperformed expectations in the final stretch and finished third, Liverpool took fourth, and Chelsea missed out in fifth.

Newcastle, sitting on just 24 points at matchday 25, had a 26% probability of relegation. They finished with 44 points—nearly eight more than the model expected—and survived comfortably. That gap between prediction and reality highlights both the model's value and its limits: it captures the structural dynamics of the league, but football will always produce surprises.

Limitations & What Could Come Next

These models are intentionally simple. They know nothing about individual players, injuries, fatigue, tactical setups, or motivation. They use only team identities and home/away status. The fact that they still produce reasonable predictions says something about how much of football's outcome is driven by underlying team quality.

More sophisticated versions could incorporate recent form (weighting recent matches more heavily using a time-decay function), player availability, or even bookmaker odds as prior information. The data used here was limited to scorelines; modern tracking data—recording player and ball positions 20 times per second—opens up entirely different modelling approaches, from expected goals (xG) to expected threat (xT).
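One common way to implement the time-decay idea is to weight each match's contribution to the log-likelihood by exp(−ξ·t), where t is the time since the match was played. The sketch below assumes that exponential form; the decay rate ξ and the match tuples are hypothetical, and the function names are illustrative.

```python
import math

def time_decay_weight(days_ago: float, xi: float) -> float:
    """Exponential down-weighting: 1 for today's match, smaller for older ones."""
    return math.exp(-xi * days_ago)

def poisson_log_pmf(k: int, rate: float) -> float:
    return -rate + k * math.log(rate) - math.lgamma(k + 1)

def weighted_log_likelihood(matches, xi):
    """Sum of time-weighted double Poisson log-likelihood terms.

    Each match is (home_goals, away_goals, lam_home, lam_away, days_ago).
    """
    total = 0.0
    for home_goals, away_goals, lam_home, lam_away, days_ago in matches:
        weight = time_decay_weight(days_ago, xi)
        total += weight * (
            poisson_log_pmf(home_goals, lam_home) + poisson_log_pmf(away_goals, lam_away)
        )
    return total

xi = 0.003  # hypothetical decay rate per day
half_life_days = math.log(2) / xi
# Hypothetical matches: one recent, one a year old
matches = [(2, 1, 1.6, 1.0, 7), (0, 0, 1.4, 1.2, 365)]
print(f"half-life: {half_life_days:.0f} days")
print(f"weighted log-likelihood: {weighted_log_likelihood(matches, xi):.3f}")
```

Setting ξ = 0 recovers the unweighted model. Larger values make the fit track recent form more closely, at the cost of noisier estimates.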

Perhaps the most interesting extension would be team-specific home effects. This project assumed every team benefits equally from playing at home, but anyone who has watched Anfield under the floodlights or visited Turf Moor on a rainy Tuesday would suspect otherwise.

The Bigger Picture

Football resists prediction. It is low-scoring, chaotic, and unforgiving of small mistakes. But that does not mean it is purely random. The Poisson-based framework captures a surprising amount of the game's structure with remarkably few inputs: just attack strength, defence strength, and home advantage.

The gap between expected and actual points is where the interesting questions live. When a team consistently outperforms its expected points, is that mental strength, tactical genius, or just luck? When a team underperforms, is the manager to blame, or were the margins always going to catch up? These models do not answer those questions directly—but they give us the right baseline to start asking them.