WC 2026 · Forecasting Oxford Football Forecasting

§ About · the colophon

What this is, how it was made, and what it cannot claim

This page is the project's reference: the method in plain English, every technical term defined, the data layer described, the pipeline summarised so it can be re-run, and the scope and limitations stated plainly. The one-line summary is the same one that sits under every chart on the site: the ensemble matches the de-vigged market on out-of-sample accuracy (RPS 0.1891 vs 0.1905, lower is better); it does not significantly beat it, and with only three out-of-sample tournaments the margin sits inside the noise.

Every forecast on this site rests on a single idea: a national team's chances come from two things that do not always agree. The first is history — the long record of who has beaten whom, distilled into a self-updating strength rating that stretches back to the nineteenth century. The second is the squad on the plane today — the clubs its players turn out for, the minutes they are getting, the market that prices their talent. A team can be richer than its results, or be winning more than its price tag says it should. The model's job is to hold both pictures at once.

Those two pictures are turned into match odds by a ladder of models, each a little more ambitious than the last. At the bottom is Elo: cheap, transparent, reproducible from the results file alone, and the floor everything else must clear. Above it sit a goals model (Dixon–Coles) that predicts scorelines, a Bayesian model that deliberately shrinks its opinion of teams it has barely seen, and a gradient-boosting learner that reads 36 engineered features per team. No single one of these is trusted on its own. The three above the floor are pooled, as an equal-weight geometric average of their probabilities, into one ensemble, which out of sample was both more accurate than any individual learner and steadier than a cleverer combination rule we tried and rejected.

The ensemble only gives the odds for a single match. To get a champion, the real 48-team bracket is simulated 1,100,000 times — every group, every tie-breaker (down to the head-to-head rule we had to correct by hand), the eight best third-placed teams, extra time and shoot-outs calibrated to hundreds of real ones — and the share of simulations a team wins becomes its title probability. Because that share is an average over simulations, it carries a small, reported margin of its own.
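The margin on a simulated share can be sketched with the usual binomial standard error. This is a minimal illustration, not the project's code: the helper name `mc_standard_error` and the 15% title share are assumptions chosen for the example.

```python
import math


def mc_standard_error(p_hat: float, n_sims: int) -> float:
    """Standard error of a share p_hat estimated from n_sims independent runs."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n_sims)


# Illustrative only: a 15% title share is an assumed value, not a site figure.
se = mc_standard_error(0.15, 1_100_000)
print(f"SE = {100 * se:.3f} pp, 95% half-width = {100 * 1.96 * se:.3f} pp")
```

Because the error shrinks as 1/√N, quadrupling the number of simulations only halves the margin.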

The headline: the model matches the market. It does not beat it.

That last point matters most. Measured the hard way — trained only on matches before each test tournament, never on the answers — the ensemble's accuracy (0.1891) edges the de-vigged bookmaker consensus (0.1905) and clears the Elo floor (0.1938). But the margin over the market is a fraction of its own uncertainty, and the test set is only three out-of-sample tournaments. So the claim we stand behind is parity, not conquest: a model that has earned a seat next to the market, carries genuinely independent information (a 66%-weight blend improves on either alone), and is transparent enough that you can check every step below.

Fig. A1 Each rung, its role, and its out-of-sample RPS (lower is better)

The model ladder — floor to ceiling

Seven anchors between an Elo floor and the de-vigged market ceiling. The locked forecast is the ensemble of three of them — Dixon-Coles, the global gradient booster and the Bayesian hierarchical model — run through the Monte-Carlo simulator.

  1. floor
    Elo

    Self-updating strength from results; margin-of-victory and importance weighted.

    0.1938 RPS
  2. baseline
    Dixon-Coles

    Bivariate-Poisson goals with low-score correction and host gating.

    0.1926 RPS
  3. shrinkage
    Bayesian hierarchical

    Partial pooling so newcomers borrow strength; NUTS/MAP, horseshoe priors.

    0.1927 RPS
  4. features
    LightGBM-Poisson (global)

    Gradient boosting on 36 features with monotone constraints; the de-biasing win.

    0.1921 RPS
  5. combination
    Ensemble

    Simple-average log-opinion pool; beat stacking out-of-sample.

    0.1891 RPS
  6. tournament
    Monte-Carlo simulator

    1.1M simulations of the real 48-team bracket with full FIFA tiebreakers.

  7. ceiling
    Market consensus

    De-vigged bookmaker probabilities; the market benchmark we match, not beat.

    0.1905 RPS

The simulator turns match odds into a tournament; it is not itself scored on match RPS, so it carries no RPS figure. The market is a benchmark, not a model we fit.

Read top to bottom as increasing ambition. The ensemble is the only rung that clears both the Elo floor and (just) the market rail — but inside the noise, which is why the operative verb is “matches”.

Source · Oxford Football Forecasting model — expanding-window RPS over 152 held-out matches.

The kernels, for the record

The exact equations behind the ladder, rendered from the same strings the model pages read. Full derivations live on each model page; these are here so the method is on the table, not behind a link.

Elo update
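
For reference, a textbook form of a margin- and importance-weighted Elo update is sketched below. The symbols K (match-importance weight), G (margin-of-victory multiplier) and the 400-point scale are assumptions borrowed from the common World Football Elo convention; the project's exact parameterisation is not stated on this page.

```latex
R_i' = R_i + K \, G \, (S_i - E_i),
\qquad
E_i = \frac{1}{1 + 10^{-(R_i - R_j)/400}}
```

Here R_i and R_j are the pre-match ratings of the two teams, S_i is the result for team i (1 for a win, 0.5 for a draw, 0 for a loss) and E_i is its expected result, so the update scales with the surprise S_i − E_i.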

Dixon–Coles goals
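
The standard Dixon–Coles scoreline probability is sketched below as an assumption about the general shape of the goals model. It takes a home scoring rate λ, an away scoring rate μ and a low-score dependence parameter ρ; how the project builds λ and μ from attack, defence and host terms is not shown here.

```latex
P(X = x, Y = y) = \tau_{\lambda,\mu}(x, y)\,
  \frac{\lambda^{x} e^{-\lambda}}{x!}\,
  \frac{\mu^{y} e^{-\mu}}{y!},
\qquad
\tau_{\lambda,\mu}(x, y) =
\begin{cases}
1 - \lambda\mu\rho & (x, y) = (0, 0) \\
1 + \lambda\rho    & (x, y) = (0, 1) \\
1 + \mu\rho        & (x, y) = (1, 0) \\
1 - \rho           & (x, y) = (1, 1) \\
1                  & \text{otherwise}
\end{cases}
```

The factor τ only touches the four lowest scorelines, which is where independent Poisson goals fit real results worst.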

Log-opinion pool (the ensemble)
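
A log-opinion pool over M models and outcomes k (home win, draw, away win) can be written as below. The general weighted form is shown; the equal weights w_m = 1/M with M = 3 follow from the page's description of a simple-average pool of three models.

```latex
p_{\text{pool}}(k) =
  \frac{\prod_{m=1}^{M} p_m(k)^{w_m}}
       {\sum_{k'} \prod_{m=1}^{M} p_m(k')^{w_m}},
\qquad
w_m \ge 0, \quad \sum_{m=1}^{M} w_m = 1
```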

Champion probability (the simulator)
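
The title probability described on this page is a simulated share, which can be written with its Monte-Carlo standard error as follows. N is the number of simulated tournaments (1,100,000 here); the binomial form of the standard error assumes the simulations are independent draws.

```latex
\hat{p}_i = \frac{1}{N} \sum_{s=1}^{N}
  \mathbf{1}\{\text{team } i \text{ wins simulation } s\},
\qquad
\mathrm{SE}(\hat{p}_i) = \sqrt{\frac{\hat{p}_i \,(1 - \hat{p}_i)}{N}}
```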

Why “matches, not beats” is the headline

Hypothesis H1, that the ensemble beats the Elo floor, gives ΔRPS −0.0046 with a conservative 95% interval of [−0.0122, +0.0061]. The interval includes zero, so H1 is not supported. The blend-weight test H2 is the one that clears the bar: the optimal model weight w* = 0.656 sits in a 95% interval [0.20, 1.11] that excludes zero, so the model adds information to the market even though it does not, on its own, significantly beat it. With n = 3 tournaments, the effect and its uncertainty are reported in the same breath.
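
One way to estimate a blend weight like w* is sketched below. It assumes a linear blend w·model + (1 − w)·market scored by RPS, which makes the optimum a one-parameter least-squares problem. The page does not say whether the project blends linearly or in log space, so the function `optimal_blend_weight` and the toy arrays are hypothetical.

```python
import numpy as np


def optimal_blend_weight(model, market, outcome) -> float:
    """Unconstrained w minimising the RPS of w*model + (1 - w)*market.

    Inputs have shape (n_matches, 3), ordered home/draw/away; `outcome` is one-hot.
    """
    # RPS compares cumulative distributions; the last cumulative term is always 1,
    # so only the first two columns carry information.
    cum_market = np.cumsum(market, axis=1)[:, :-1]
    d = np.cumsum(model, axis=1)[:, :-1] - cum_market
    target = np.cumsum(outcome, axis=1)[:, :-1] - cum_market
    denom = float(np.sum(d * d))
    if denom == 0.0:
        raise ValueError("model and market are identical; w is not identified")
    return float(np.sum(d * target) / denom)


# Toy data, invented for illustration only.
model = np.array([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]])
market = np.array([[0.4, 0.3, 0.3], [0.3, 0.3, 0.4]])
outcome = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
w_star = optimal_blend_weight(model, market, outcome)
```

Leaving w unconstrained is a design choice that allows an interval to extend above 1, as the reported one does. An interval for w* would come from resampling, which is not shown here.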

Scoring a probabilistic forecast

RPS (Ranked Probability Score)
The headline accuracy metric for ordered three-way outcomes (home win / draw / away win). It rewards probability placed near the truth and penalises confident misses, summed over the cumulative distribution — lower is better. Every model on this site is ranked by out-of-sample RPS over 152 held-out matches; the ensemble scores 0.1891.
Brier score
The mean squared error of a probability forecast against the realised 0/1 outcome. RPS is its order-aware cousin for ranked categories; Brier is the special case for a single binary event. We report RPS as the primary metric because football outcomes are ordered (a draw sits between the two wins).
Calibration & ECE (Expected Calibration Error)
Calibration asks whether events predicted at 30% actually happen about 30% of the time. ECE is the average gap between predicted and observed frequency across probability bins — small ECE means the numbers can be read as fair odds, not just rankings.
MC-SE (Monte-Carlo standard error)
Every champion and stage probability is an average over simulated tournaments, so it carries sampling noise of its own. The MC-SE is the standard error of that average (it shrinks as 1/√sims); at 1,100,000 simulations the leaders' champion odds are pinned to roughly ±0.1 of a percentage point. It is shown as a whisker on the charts.
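
The RPS definition above can be sketched for a single match as follows. The division by the number of outcomes minus one is the normalisation commonly used in football forecasting and is an assumption here; the page does not state which normalisation it uses. The function name `rps` is hypothetical.

```python
import numpy as np


def rps(probs, outcome_index: int) -> float:
    """Ranked Probability Score for one match with ordered outcomes."""
    probs = np.asarray(probs, dtype=float)
    observed = np.zeros_like(probs)
    observed[outcome_index] = 1.0
    # Compare cumulative forecast with cumulative outcome; the final term is always 0.
    diff = np.cumsum(probs)[:-1] - np.cumsum(observed)[:-1]
    return float(np.sum(diff ** 2) / (len(probs) - 1))


forecast = [0.5, 0.3, 0.2]         # home win, draw, away win
near_miss = rps(forecast, 0)       # home win happens: about 0.145
far_miss = rps(forecast, 2)        # away win happens: about 0.445
binary = rps([0.7, 0.3], 0)        # two outcomes: squared error on one event
```

The same forecast is punished more when the result lands two categories away than when it lands on the favoured one, which is the order-awareness that a plain Brier score lacks. With two outcomes the score collapses to the squared error on a single event.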

The models

Elo
A self-updating team-strength rating: each result nudges both teams up or down by an amount that depends on the surprise, the margin of victory and the match importance. It is the reproducible floor of the ladder — cheap, transparent, and surprisingly hard to beat.
Dixon–Coles
The classic football goals model: each side scores as a Poisson process driven by its attack and the opponent's defence, with a low-score correction that fixes the well-known under-prediction of 0–0, 1–0, 0–1 and 1–1. Host advantage enters as an explicit bump.
Bivariate Poisson
A goals model in which the two teams' scores are not assumed independent — a shared component lets a tight, cagey match depress both totals together. It is the structure underneath the scoreline grids you see on the match pages.
Partial pooling / shrinkage
Hierarchical estimation that pulls each team's parameters part-way toward the global average, by an amount set by how much data that team has. Debutants and short-history sides are shrunk hardest — a principled way to say "we are less sure about you" rather than over-fitting a thin record.
Horseshoe prior
A heavy-tailed Bayesian prior used in the hierarchical model: it shrinks most coefficients hard toward zero while still allowing a few genuinely large effects to escape. It keeps the squad-quality signal sparse and disciplined instead of letting every feature claim a little credit.
Log-opinion pool
The rule that combines the models: multiply their probability distributions (a weighted geometric mean), then renormalise. The locked forecast uses a simple-average pool of Dixon-Coles, the global gradient booster and the Bayesian hierarchical model — which beat a learned stacking weight out of sample (+0.0021 RPS, inside the noise), so the simpler rule was kept.
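
The pooling rule above can be sketched in a few lines. This is an illustration under stated assumptions: the function name `log_opinion_pool`, the clipping constant and the three toy forecasts are all hypothetical.

```python
import numpy as np


def log_opinion_pool(probs, weights=None) -> np.ndarray:
    """Weighted geometric mean of probability vectors, renormalised to sum to 1.

    probs has shape (n_models, n_outcomes); weights default to equal.
    """
    probs = np.asarray(probs, dtype=float)
    n_models = probs.shape[0]
    if weights is None:
        w = np.full(n_models, 1.0 / n_models)
    else:
        w = np.asarray(weights, dtype=float)
    # Work in log space for numerical stability; the clip guards against log(0).
    log_pooled = w @ np.log(np.clip(probs, 1e-12, 1.0))
    pooled = np.exp(log_pooled - log_pooled.max())
    return pooled / pooled.sum()


# Toy forecasts from three models, invented for illustration.
three_models = [[0.50, 0.28, 0.22],
                [0.46, 0.30, 0.24],
                [0.55, 0.25, 0.20]]
pooled = log_opinion_pool(three_models)
```

Compared with a plain arithmetic average, the geometric mean lets a model that assigns an outcome a very low probability pull the pooled value down sharply.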

The market benchmark

De-vig / Shin
Bookmaker odds embed a margin (the "vig" or overround) that makes the implied probabilities sum to more than one. De-vigging strips it out; the Shin method does so while accounting for informed money, yielding a fair-probability consensus. That de-vigged consensus is the ceiling this project measures itself against.
Blend weight w*
When the model and the de-vigged market disagree, how much should you trust the model? The optimal out-of-sample blend put roughly 66% weight on the model (w* = 0.66, 95% CI excluding zero) — evidence the model carries real, independent information, even though on its own it only matches the market.
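
A sketch of Shin-style de-vigging for one match follows. It uses the closed form for the fair probabilities given the insider share z, and finds z by bisection so that the probabilities sum to one. The decimal odds, the function name `shin_probabilities` and the bisection bracket are assumptions for illustration, and the sketch presumes a book with a positive overround.

```python
import math


def shin_probabilities(decimal_odds, tol: float = 1e-12):
    """De-vig decimal odds with Shin's method; returns (probabilities, z)."""
    implied = [1.0 / o for o in decimal_odds]
    booksum = sum(implied)  # exceeds 1 by the overround

    def probs_for(z: float):
        return [
            (math.sqrt(z * z + 4.0 * (1.0 - z) * pi * pi / booksum) - z)
            / (2.0 * (1.0 - z))
            for pi in implied
        ]

    # The probabilities sum to more than 1 at z = 0 and the sum falls as z rises,
    # so bisect on z until the sum reaches 1.
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(probs_for(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    z = 0.5 * (lo + hi)
    return probs_for(z), z


# Hypothetical closing odds for home win, draw, away win.
fair, z = shin_probabilities([2.0, 3.4, 3.8])
```

Unlike a proportional rescale, this removes relatively more of the margin from the longer-priced outcomes.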

Uncertainty & validation

Conformal prediction
A distribution-free way to attach a guaranteed coverage level to predictions. Calibrated for 90% coverage on the W/D/L outcome, the held-out sets actually contained the truth 90.8% of the time; the sets are wider for short-history teams, which is the model admitting where it knows less.
Draw-luck
The gap between a team's raw strength ("Power") and its fixture-aware odds ("Reality") — i.e. how kind or cruel the bracket it was handed is. Positive means a softer path than strength alone implies; across WC2026 it is small for almost everyone, so the draw is close to fair.
Decoupling g
The projection residual between a team's current squad value and what its recent results would predict — a single number for "better squad than form, or better form than squad". Regressed on out-of-sample success the slope is positive but its interval spans zero (b = +0.03, 95% CI [−0.31, +0.55]), so g is reported as a description, not a law.
LOTO (leave-one-tournament-out)
A backtest protocol that holds out an entire past tournament, trains on the rest, and predicts the held-out one — repeated for each. It is the optimistic cousin of the strictly forward-looking expanding-window protocol; the (more conservative) expanding number is quoted as expected skill.
BH-FDR (Benjamini–Hochberg false-discovery rate)
A multiplicity correction for when you test many subgroups at once: it controls the expected share of false "wins" among the claimed ones. Applied to the subgroup audit, zero of the nine strata survive at n = 3 — the verdict that none of the apparent edge cases is established.
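
The Benjamini–Hochberg step-up rule can be sketched as below. The p-values are invented for illustration and are not the project's subgroup results; the function name `benjamini_hochberg` is hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha: float = 0.05) -> np.ndarray:
    """Boolean mask of hypotheses rejected at false-discovery rate alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Step-up rule: find the largest rank k with p_(k) <= k * alpha / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        k = int(np.max(np.nonzero(passed)[0]))
        rejected[order[: k + 1]] = True
    return rejected


# Nine illustrative strata with weak evidence: none survives the correction.
weak = benjamini_hochberg([0.21, 0.34, 0.08, 0.47, 0.62, 0.12, 0.91, 0.29, 0.55])
```

With nine tests, even the smallest p-value must fall below 0.05/9 before anything is rejected, which is why a small test set tends to leave no strata standing.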

  • 8 curated data tables: forecasts, rankings, features, validation.
  • 2,401 rows across the tables: every figure on the site traces to one of them.
  • 352 documented fields: a field dictionary spans all 13 tables.
  • 12 database tables: 4,269 rows in one queryable database.

The site reads from 8 curated tables — the locked forecast itself, the group and knockout probabilities, the two rankings (Power and Reality), the match-level head-to-heads, the engineered team features and the full validation results — 2,401 rows in all. A field dictionary documents every table and every field with its type and a plain-English note, so nothing on the site depends on an undocumented column.

The same layer is packed into a single queryable database of 12 core entities — teams, players, coaches, matches, features, model results, the forecast and the rankings (4,269 rows) — which powers the in-browser SQL console on the Data page. Each table carries a SHA-256 checksum, and the locked forecast's own hash — reproduced in §4 below — stamps the exact vintage of every number on the site.
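
Checking a table against a published SHA-256 checksum needs only the standard library. The helper name `sha256_of_file` and the file name in the usage note are hypothetical; the page does not name the files.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A reader would compare the returned string with the published hash, for example `sha256_of_file("locked_forecast.parquet") == published_hash`, where both names are placeholders.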

Provenance & re-use

The data layer consists of derived outputs of the project (forecasts, engineered features, validation results). It is not a redistribution of the licensed upstream feeds it was built from — see the credits below for the original sources and their own terms. The locked-forecast hash identifies the vintage unambiguously.

The pipeline

  1. Raw data: 20 sources

    49,445 international results (1872→2026), Elo histories, a global club-stats feed of 107 leagues across 68 countries, squads, the odds feed, environment and structure tables.

  2. A 61-step pipeline

    A numbered, ordered chain that cleans the raw feeds, engineers 35 processed tables including the 36-feature team panel, fits the 7 models, runs the backtest, simulates the bracket and exports the site's data layer.

  3. A seeded, repeatable simulation

    The Monte-Carlo simulator runs from a fixed seed and uses common random numbers across teams, so the 1,100,000 simulations are bit-for-bit repeatable and re-draws are compared on the same randomness.

  4. Strict fit-cut at 2026-06-07

    Models train only on results dated on or before 2026-06-07 — strictly pre-kickoff — and never on the 152 odds-tournament matches or any tournament stage labels. No refit after kickoff.

  5. A single source of truth

    Everything the site shows is reconciled to one locked forecast file. The site's data layer is regenerated from it by the pipeline's export step — no number is ever edited by hand.
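
The seeded, common-random-numbers idea from the simulation step can be illustrated with a toy one-team simulation. The seed value, the function `simulate_title_share` and the win probabilities are assumptions for the example, not the project's engine or settings.

```python
import numpy as np


def simulate_title_share(win_prob: float, n_sims: int, seed: int) -> float:
    """Share of simulations a toy team wins, drawn from a seeded generator."""
    rng = np.random.default_rng(seed)
    uniforms = rng.random(n_sims)  # the shared random stream
    return float(np.mean(uniforms < win_prob))


SEED = 2026  # hypothetical seed; the project's actual seed is not given in the text

# Same seed and same inputs give an identical result on every run.
first = simulate_title_share(0.15, 100_000, SEED)
second = simulate_title_share(0.15, 100_000, SEED)

# Common random numbers: a changed input is evaluated on the same uniforms,
# so the difference reflects the input rather than fresh sampling noise.
bumped = simulate_title_share(0.16, 100_000, SEED)
```

Reusing one random stream across scenarios means that comparisons between a base run and a re-draw are not blurred by independent noise in each.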

What “out-of-sample” actually firewalls

Two protocols guard against leakage. The expanding window trains on everything before each held-out tournament and predicts forward — the number we quote as expected skill. Leave-one-tournament-out holds a whole tournament out and trains on the rest; it is more optimistic, so it is reported alongside but not headlined. In neither case does any model ever see a test-tournament match, a closing odds line, or a stage label during training. The 152-match test set stays untouched throughout.
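
The difference between the two protocols can be sketched as split generators. This is a simplification: the labels are hypothetical, and in practice the training side would also include every non-tournament match available under each protocol.

```python
def expanding_window_splits(tournaments):
    """Yield (train, test) pairs that only ever train on earlier tournaments.

    `tournaments` must be in chronological order.
    """
    for i in range(1, len(tournaments)):
        yield tournaments[:i], tournaments[i]


def leave_one_tournament_out_splits(tournaments):
    """Yield (train, test) pairs that hold out one tournament and train on the rest."""
    for i, held_out in enumerate(tournaments):
        yield tournaments[:i] + tournaments[i + 1:], held_out


# Hypothetical labels; the source does not name its test tournaments.
names = ["T1", "T2", "T3", "T4"]
forward = list(expanding_window_splits(names))
loto = list(leave_one_tournament_out_splits(names))
```

In the leave-one-out splits the model may train on tournaments played after the one it predicts, which is why that protocol reads as more optimistic than the forward-only one.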

01

Three tournaments is a small test set

The out-of-sample evidence is 152 matches across three odds tournaments (plus five for the stage models). At that resolution confidence intervals are wide by construction — the de-biasing gain, the decoupling slope and the draw-luck advantages all have intervals that include zero. We treat this as the ceiling on every claim, not a footnote.

02

It matches the market; it does not beat it

On the hard, forward-looking test the ensemble (0.1891) edges the de-vigged market (0.1905) by −0.0014 RPS, a margin far smaller than its own uncertainty: even the larger gap over the Elo floor tested in H1 has an interval that includes zero. The defensible claim is parity. Anyone promising a market-beating edge from three tournaments is over-reading the data.

03

Squad coverage is fixed but not perfect

Extending squad coverage worldwide was the project's central data contribution: a global feed of 68 countries lifted average squad coverage from roughly a half to 85%. It is still uneven: the residual gaps are concentrated in the lower-coverage confederations — AFC above all — so odds for some Asian and Oceanian sides rest on a thinner club-data base than those for European ones. The conformal sets widen there to say so.

04

Built on free and licensed public data

The inputs are public results, a free-tier global club-stats API, and public Elo, market-value and odds feeds. That makes the project reproducible and open, but it inherits those feeds' limits: coverage holes, occasional staleness, and value and odds series that are themselves estimates. The published bundle is the derived data, not the licensed raw feeds.

05

The newcomer mechanism can't be tested out-of-sample

The hierarchical model shrinks debutants toward the field and lets their squad value speak where their international record cannot. It is a principled mechanism — but with so few true newcomers in the held-out tournaments, it is the one part of the model that cannot be properly validated out of sample. For genuine first-timers, read the odds as the model's best-structured guess, with the widest uncertainty on the site.

06

A pre-tournament snapshot

This is a forecast locked on 2026-06-09, before a ball is kicked. It does not update for injuries, form swings or in-tournament results after the lock; the whole point of the checksum is that it is a fixed, falsifiable prediction. Late squad changes after the fit-cut are not reflected.

Data sources

  • International results — the public match-results corpus, 49,445 games (1872→2026).
  • Elo ratings — the World Football Elo histories.
  • Club & player stats — API-Football, 107 leagues across 68 countries (the de-biasing feed).
  • Market values — Transfermarkt squad valuations.
  • Bookmaker odds — the-odds-api closing lines, de-vigged (Shin).
  • Structure & context — FIFA bracket and draw, venues/environment, coaches, squads.

Each feed is used under its own terms; this site redistributes only the derived tables in §3, not the licensed raw data.

Tools & methods

  • Modelling — NumPyro / JAX (Bayesian NUTS), LightGBM (gradient boosting), NumPy & pandas.
  • Simulation — a bespoke Python Monte-Carlo bracket engine with full FIFA tie-breakers.
  • Data layer — DuckDB & Parquet; DuckDB-WASM for in-browser queries.
  • Site — Astro (static), ECharts (lazy islands), KaTeX (build-time math).
  • Design — Fraunces, Archivo, Newsreader & IBM Plex Mono.

Open-source throughout; the analytical choices, and any errors, are the project's own.

This build

Design system: Oxford Football Forecasting
Forecast locked: 2026-06-09
Data vintage: pre-tournament (WC2026 not yet played)
Models: 7 + simulator
Simulations: 1,100,000
Pipeline: 61 steps · fixed seed

Locked forecast SHA-256 bdc9589096…7dc0a7

If you read nothing else: every number carries its uncertainty, and the model matches the market without beating it. Everything above is here so you can check that for yourself.