WC 2026 · Forecasting · Oxford Football Forecasting

§ Models · the methodology hub

A ladder of models, not a single oracle

Six models in escalating sophistication — each one a published, validated rung the next must clear — combined into one ensemble and played out 1.1 million times. This is the whole method, in the open: the equations, the data each model eats, its out-of-sample score, and a clear verdict on what it can and cannot do.

The headline is deliberately modest. The production ensemble scores an out-of-sample RPS of 0.1891, a hair under the de-vigged market's 0.1905 (−0.0014) — but at an effective n = 3 tournaments the confidence intervals overlap heavily. The supported claim is that the model matches the market; it does not beat it. What it earns is a blend weight w* = 0.66 (CI excludes 0): the features carry marginal information the market does not fully price.
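For readers unfamiliar with the metric, the sketch below computes the ranked probability score for a single match. It assumes the common three-outcome ordering (home win, draw, away win) and the usual normalisation by the number of categories minus one; the function name `rps` is illustrative and not taken from the project's code.

```python
from typing import Sequence


def rps(probs: Sequence[float], outcome: int) -> float:
    """Ranked probability score for one match with ordered outcomes.

    probs   -- forecast probabilities, ordered (home win, draw, away win)
    outcome -- index of the observed result (0, 1 or 2)
    Lower is better; 0.0 is a perfect forecast.
    """
    k = len(probs)
    cum_forecast = 0.0
    total = 0.0
    # Compare cumulative forecast and cumulative outcome over the first
    # k - 1 categories (the final cumulative gap is always zero).
    for i in range(k - 1):
        cum_forecast += probs[i]
        cum_observed = 1.0 if outcome <= i else 0.0
        total += (cum_forecast - cum_observed) ** 2
    return total / (k - 1)
```

A page-level figure such as 0.1891 would then be the mean of this per-match score over the out-of-sample matches.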

6

Models in the ladder

+ the de-vigged market as the ceiling benchmark

0.1891

Ensemble OOS RPS (expanding)

vs market 0.1905 — matches, does not beat

152

Out-of-sample matches

across 3 tournaments with market odds (Copa América 2024, Euro 2024, World Cup 2022)

1.1M

Tournament simulations

of the official 48-team bracket, full FIFA tiebreakers

Fig. M1 Six rungs · escalation order · skill meter to the ceiling

The ladder — Elo floor to simulator, market as ceiling

Read top-down: the simulator wraps the ensemble, which pools the three goal-model kernels (Dixon-Coles, Bayesian, LightGBM) that all sit on the Elo state. The skill meter shows how close each rung's out-of-sample RPS gets to the best achievable, with the market ceiling marked.

The kernels are tightly bunched and the ensemble closes the last sliver to the ceiling — gains come from combining diverse-but-comparable models, not from one dominant model.

Source · Oxford Football Forecasting model · Bookmaker consensus (de-vigged closing odds) · out-of-sample RPS over 152 matches · 3 tournaments
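The page does not say which de-vigging method produces the market benchmark. As an illustration only, the sketch below uses proportional normalisation, the simplest of several standard choices; the function name and the example odds are hypothetical.

```python
from typing import List, Sequence


def devig_proportional(decimal_odds: Sequence[float]) -> List[float]:
    """Turn bookmaker decimal odds into probabilities that sum to 1.

    Assumption: the bookmaker margin is removed by proportional
    normalisation. Other methods (for example power or Shin corrections)
    distribute the margin differently and give slightly different numbers.
    """
    implied = [1.0 / o for o in decimal_odds]  # raw implied probabilities
    overround = sum(implied)                   # exceeds 1 by the margin
    return [p / overround for p in implied]


# Hypothetical closing odds for (home win, draw, away win).
probs = devig_proportional([1.90, 3.50, 4.20])
```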

Fig. V11 Expanding window · lower is better · whisker = conservative 95% CI

Out-of-sample RPS with CIs — the ensemble edges, but within noise

The ensemble (0.1891) and the market (0.1905) lead a tight pack, but their intervals — 0.1607–0.2164 and 0.1660–0.2151 — overlap almost entirely. The Elo floor (0.1938) is the bar every model clears.

Legend · ensemble (our production model) · anchors (floor & ceiling) · whisker = conservative 95% CI

At n = 3 the ranking is suggestive, not decisive: the ensemble's lead over the market is well inside the combined uncertainty. The order is reported as measured, and the blend-weight test carries the inferential load.

Source · Oxford Football Forecasting model · Bookmaker consensus (de-vigged closing odds) · CI = the wider of tournament-block bootstrap and leave-one-tournament jackknife
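The "wider of the two" rule in the source line can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes per-match RPS values grouped by tournament, a percentile bootstrap that resamples whole tournaments, and a normal-approximation jackknife interval (z = 1.96). All function names and the sample data are made up.

```python
import random
from statistics import fmean


def _pooled_mean(blocks):
    return fmean([s for block in blocks for s in block])


def block_bootstrap_ci(blocks, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI from resampling whole tournaments with replacement."""
    rng = random.Random(seed)
    means = sorted(
        _pooled_mean([rng.choice(blocks) for _ in blocks])
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def jackknife_ci(blocks, z=1.96):
    """Leave-one-tournament-out jackknife CI (normal approximation)."""
    n = len(blocks)
    theta = _pooled_mean(blocks)
    loo = [_pooled_mean(blocks[:i] + blocks[i + 1:]) for i in range(n)]
    loo_mean = fmean(loo)
    se = ((n - 1) / n * sum((t - loo_mean) ** 2 for t in loo)) ** 0.5
    return theta - z * se, theta + z * se


def conservative_ci(blocks):
    """Return whichever of the two intervals is wider."""
    boot, jack = block_bootstrap_ci(blocks), jackknife_ci(blocks)
    return boot if (boot[1] - boot[0]) >= (jack[1] - jack[0]) else jack


# Synthetic per-match RPS values for three tournaments (illustrative only).
example_blocks = [
    [0.12, 0.25, 0.19, 0.30],
    [0.22, 0.15, 0.18],
    [0.10, 0.28, 0.21, 0.17, 0.24],
]
ci = conservative_ci(example_blocks)
```

With only three blocks the bootstrap has just 27 distinct resamples, which is one reason a page like this would report the wider interval rather than trust either method alone.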

Finding 01 · the ceiling

The ensemble matches the market — it does not beat it

On out-of-sample RPS the ensemble (0.1891) edges the de-vigged market (0.1905), but the bootstrap interval on the gap (−0.0014) includes zero, so the defensible claim is parity. Where the model earns its keep is the convex blend: w* = 0.66 with a CI of [0.20, 1.11] that excludes 0, which indicates that the features add marginal information beyond the closing line.
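One way to obtain such a blend weight is sketched below. It assumes w is the weight on the model and 1 − w the weight on the market, and that w* is the value minimising mean RPS of the blended forecast; the page does not state the estimator, so treat both as assumptions. Because RPS is quadratic in w, the unconstrained minimiser has a closed form, which is also why an interval can extend past 1.

```python
def _cumulative(probs):
    """Cumulative probabilities over all but the last category."""
    out, running = [], 0.0
    for p in probs[:-1]:
        running += p
        out.append(running)
    return out


def optimal_blend_weight(model_probs, market_probs, outcomes):
    """Unconstrained w minimising RPS of w * model + (1 - w) * market.

    model_probs, market_probs -- lists of per-match probability vectors
    outcomes                  -- observed outcome index per match
    """
    num = 0.0
    den = 0.0
    for pm, pk, y in zip(model_probs, market_probs, outcomes):
        for i, (cm, ck) in enumerate(zip(_cumulative(pm), _cumulative(pk))):
            observed = 1.0 if y <= i else 0.0
            d = cm - ck        # model minus market (cumulative)
            e = ck - observed  # market error (cumulative)
            num += d * e
            den += d * d
    if den == 0.0:
        raise ValueError("model and market forecasts are identical")
    # Setting the derivative of sum((e + w * d) ** 2) to zero gives this.
    return -num / den
```

A value of 0 means the market alone is optimal and 1 means the model alone is; an interval for w* could then come from resampling tournaments and recomputing this weight.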

See the full validation →

Finding 02 · global club coverage

Extending club coverage worldwide flipped the GBM

The squad signal was built from top-5-Europe club data, which is sparse for non-European squads: a coverage bias that suppressed the signal it was meant to carry. Swapping to a genuinely global club panel lifted average coverage from 48% to 85% and moved the LightGBM model from worst to best among the non-market models. The gains land off-UEFA and on low-coverage teams, in the direction predicted, though they are not significant at n = 3.

How the de-biasing works →

Finding 03 · the n = 3 ceiling

Simple beat fancy — and simple is what shipped

A stacked meta-learner scored worse out-of-sample than a plain equal-weight average (+0.0021 RPS, inside the ±0.0022 bootstrap SE), so the production combiner is the simple average — the forecast-combination puzzle, confirmed here. With only three out-of-sample tournaments, every interval is wide; two confirmatory tests were pre-registered, everything else is multiplicity-controlled, and the expanding-window number (never the optimistic LOTO) is quoted as expected skill.
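The equal-weight combiner is simple enough to state in a few lines. The sketch assumes each kernel outputs a probability vector over (home win, draw, away win) and that the average is taken over those probabilities; whether the production ensemble averages probabilities or an upstream quantity such as goal rates is not stated here. The example numbers are invented.

```python
from typing import List, Sequence


def equal_weight_ensemble(kernel_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the kernels' probability vectors with equal weights."""
    n_kernels = len(kernel_probs)
    n_outcomes = len(kernel_probs[0])
    return [
        sum(p[i] for p in kernel_probs) / n_kernels
        for i in range(n_outcomes)
    ]


# Hypothetical forecasts from three kernels for one match.
blended = equal_weight_ensemble([
    [0.50, 0.27, 0.23],  # e.g. Dixon-Coles
    [0.46, 0.30, 0.24],  # e.g. Bayesian
    [0.54, 0.24, 0.22],  # e.g. LightGBM
])
```

There are no weights to fit, so nothing can overfit the three tournament folds, which is the usual explanation for why the plain average is hard to beat at small n.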

Inside the ensemble →

Why a ladder, and how the rungs are scored

The six models are not competitors to be selected down to one winner on three tournament folds — that is the multiple-testing trap. They are an ensemble basis with controlled disagreement: each is tuned on the match-level backtest (tens of thousands of internationals, large n), then combined and validated once on the tournaments. Dixon-Coles and the Bayesian model disagree precisely on draw-heavy knockout football; the LightGBM model adds non-linear feature interactions; pooling diverse-but-comparable kernels is what sharpens the forecast.

Every RPS on this page is the expanding-window out-of-sample number (train on tournaments before the test fold, never after) over 152 matches in 3 tournaments — the realism protocol. The leave-one-tournament-out (LOTO) number, which leaks future tournaments into past training, is reported on each model page as an optimistic ceiling, never as the headline. The Elo floor is the bar every model must clear before its layer is added.
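The difference between the two protocols comes down to which tournaments are allowed into training for each test fold. A minimal sketch, using placeholder labels T1 to T3 for three tournaments in chronological order (the labels and function names are hypothetical):

```python
def expanding_window_folds(tournaments):
    """Train only on tournaments strictly before the test tournament."""
    return [
        (tournaments[:i], tournaments[i])
        for i in range(len(tournaments))
    ]


def loto_folds(tournaments):
    """Leave one tournament out: train on all others, including later ones."""
    return [
        ([t for t in tournaments if t != test], test)
        for test in tournaments
    ]


ordered = ["T1", "T2", "T3"]  # chronological order, oldest first
expanding = expanding_window_folds(ordered)
loto = loto_folds(ordered)
# expanding: ([], T1), ([T1], T2), ([T1, T2], T3)
# loto:      ([T2, T3], T1), ([T1, T3], T2), ([T1, T2], T3)
```

The empty training list for T1 refers only to tournament folds; match-level data from before T1 would still be available to the models. The LOTO rows for T1 and T2 contain later tournaments, which is the leakage that makes that number optimistic.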