WC 2026 · Forecasting Oxford Football Forecasting

§ Validation · the credibility layer

Validation

A forecast is only as good as its out-of-sample record. Every model is scored against the same 152 held-out matches across 3 past tournaments, with the uncertainty that sample size forces. The headline is parity, not conquest — the ensemble matches the market and clears the Elo floor, but with only three out-of-sample tournaments almost nothing separates cleanly. The intervals, the calibration, the coverage and the multiplicity correction are all reported below.

0.1891

Ensemble out-of-sample RPS

expanding window · edges the market’s 0.1905

90.8%

Conformal coverage observed

against a 90.0% nominal target — calibrated uncertainty

0 / 8

Subgroups beating Elo (BH-FDR)

zero survive multiplicity control at n = 3

n = 3

Out-of-sample tournaments

152 matches — the ceiling on every claim on this page

Source · Oxford Football Forecasting model — the 152-match out-of-sample backtest (ensemble RPS 0.18912, market 0.19053, conformal coverage 90.789%).

Fig. V11 Lower is better · whisker = conservative 95% CI · floor & ceiling rails

RPS with CIs — the ensemble edges, inside the noise

Under the expanding window the ensemble (0.1891) beats the market (0.1905) by 0.0014 RPS and clears the Elo floor (0.1938), but the intervals are wide and overlapping. The dashed rails mark the floor (Elo) and ceiling (market); read each model against them.

  1. Ensemble 0.1891
  2. Market consensus (ceiling) 0.1905
  3. LightGBM (global) 0.1921
  4. Dixon-Coles 0.1926
  5. Bayesian hier. 0.1927
  6. LightGBM (top-5) 0.1937
  7. Elo-only (floor) 0.1938

Legend: ensemble (production model) · anchors (floor & ceiling) · conservative 95% CI

At n = 3 tournaments the order is suggestive, not decisive — every contender's interval straddles the market rail. The blend-weight test below is where the signal is real.

Source · Oxford Football Forecasting model — RPS over 152 matches / 3 tournaments; CI = the wider of a tournament-block bootstrap and a leave-one-tournament jackknife
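For readers unfamiliar with the metric, the ranked probability score can be sketched as follows. This is a minimal illustration, not the production scorer: it assumes the common convention of ordering outcomes home / draw / away and dividing by the number of categories minus one, and the two example forecasts are invented.

```python
def rps(probs, outcome):
    """Ranked probability score for one match (lower is better).

    probs   -- forecast probabilities over ordered outcomes (home, draw, away)
    outcome -- index of the observed outcome (0, 1 or 2)
    """
    k = len(probs)
    cum_forecast = 0.0
    cum_observed = 0.0
    total = 0.0
    # Squared gap between cumulative forecast and cumulative outcome over the
    # first k - 1 categories (the final cumulative gap is always zero).
    for i in range(k - 1):
        cum_forecast += probs[i]
        cum_observed += 1.0 if i == outcome else 0.0
        total += (cum_forecast - cum_observed) ** 2
    return total / (k - 1)  # assumed normalisation


def mean_rps(forecasts, outcomes):
    """Average RPS over a set of matches."""
    return sum(rps(p, o) for p, o in zip(forecasts, outcomes)) / len(outcomes)


# Invented forecasts, for illustration only.
forecasts = [(0.5, 0.3, 0.2), (0.2, 0.3, 0.5)]
outcomes = [0, 1]  # home win, then draw
score = mean_rps(forecasts, outcomes)
```

Because the score works on cumulative probabilities, a forecast that puts weight on the draw when the home side wins is penalised less than one that puts the same weight on the away win.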

H1 · confirmatory — does the model beat the floor?

Ensemble vs Elo: −0.0046 RPS

H1 NOT supported (CI includes 0)

The ensemble's RPS is 0.0046 below the Elo floor's, which is better, but the conservative 95% interval on the difference [−0.0122, +0.0061] includes zero. With three tournaments the improvement over a pure ratings model cannot be distinguished from chance.

H2 · confirmatory — do the features add anything?

Optimal blend weight w* = 0.66

H2 SUPPORTED (CI excludes 0)

The convex blend that minimises out-of-sample RPS puts weight 0.66 on our model with a CI of [0.20, 1.11] that excludes zero. So the model carries information beyond the closing line — even though it does not, on its own, beat it outright.
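One way to picture the blend-weight estimate is the sketch below. It assumes a linear blend w · model + (1 − w) · market and leaves w unconstrained; the pooling rule actually used for this test is not stated on this page, so treat both choices, and the function names, as assumptions. Under a linear blend the RPS is quadratic in w, so the minimiser has a closed form.

```python
def cumulative(probs):
    """Cumulative probabilities over all but the last ordered outcome."""
    out, running = [], 0.0
    for p in probs[:-1]:
        running += p
        out.append(running)
    return out


def optimal_blend_weight(model, market, outcomes):
    """Weight w on the model that minimises mean RPS of the linear blend.

    Assumes the two forecasts are not identical everywhere (otherwise the
    denominator is zero and the weight is undefined).
    """
    num = den = 0.0
    for m, k, o in zip(model, market, outcomes):
        cm, ck = cumulative(m), cumulative(k)
        for i in range(len(cm)):
            observed = 1.0 if o <= i else 0.0  # cumulative outcome indicator
            d = cm[i] - ck[i]                  # model minus market
            r = observed - ck[i]               # what the market got wrong
            num += d * r
            den += d * d
    return num / den


# Invented check: a model that is always exactly right should get weight 1.
model = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
market = [(0.5, 0.3, 0.2), (0.3, 0.3, 0.4)]
outcomes = [0, 2]
w_star = optimal_blend_weight(model, market, outcomes)
```

A weight near zero would mean the model adds nothing once the market is known; a weight near one would mean the market adds nothing to the model.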

Why two protocols — and which number we trust

Expanding window trains only on tournaments strictly before the one being scored, mirroring how the model is actually used pre-tournament; it is the number we quote as expected skill. Leave-one-tournament-out (LOTO) lets the model see future tournaments when scoring a past one, so it is optimistic and we report it only as an upper bound on skill. The ensemble's LOTO RPS (0.1830) is better than its expanding-window RPS (0.1891), exactly the direction of gap you would expect; we never headline the LOTO figure.
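The difference between the two protocols comes down to which events are allowed into the training set. A minimal sketch, using hypothetical event labels (the page does not name the tournaments):

```python
def expanding_window_splits(timeline, scored):
    """Train on every event strictly before the scored tournament."""
    for test in scored:
        cut = timeline.index(test)
        yield timeline[:cut], test


def loto_splits(timeline, scored):
    """Train on every event except the scored one, future events included."""
    for test in scored:
        yield [t for t in timeline if t != test], test


# Hypothetical chronological labels: two earlier events used only for
# training, followed by the three held-out tournaments.
timeline = ["past_1", "past_2", "oos_1", "oos_2", "oos_3"]
scored = ["oos_1", "oos_2", "oos_3"]

expanding = list(expanding_window_splits(timeline, scored))
loto = list(loto_splits(timeline, scored))
```

For the first held-out tournament the expanding split trains on `past_1` and `past_2` only, while the LOTO split also trains on `oos_2` and `oos_3`, which is where its optimism comes from.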

The intervals are deliberately conservative: for each model we take the wider of a tournament-block bootstrap (resampling whole tournaments, B = 3000) and a leave-one-tournament jackknife. Blocking by tournament keeps matches from the same event together, so the CI reflects the true unit of replication — the tournament, of which there are three.
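The "wider of the two" rule can be sketched like this. It is an illustration under assumptions: the statistic is the pooled mean of per-match scores, the bootstrap interval is a plain percentile interval, and the jackknife interval uses a Student-t critical value with 2 degrees of freedom (4.303); none of those details are confirmed by this page, and the per-match scores are invented.

```python
import math
import random


def pooled_mean(blocks):
    """Mean over all matches in the given tournament blocks."""
    values = [v for block in blocks for v in block]
    return sum(values) / len(values)


def block_bootstrap_ci(blocks, b=3000, alpha=0.05, seed=0):
    """Percentile CI from resampling whole tournaments with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        pooled_mean([rng.choice(blocks) for _ in blocks]) for _ in range(b)
    )
    lo = stats[int((alpha / 2) * (b - 1))]
    hi = stats[int((1 - alpha / 2) * (b - 1))]
    return lo, hi


def jackknife_ci(blocks, crit=4.303):
    """Leave-one-tournament-out jackknife CI (crit is an assumed t value)."""
    n = len(blocks)
    theta = pooled_mean(blocks)
    loo = [pooled_mean(blocks[:i] + blocks[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    se = math.sqrt((n - 1) / n * sum((t - mean_loo) ** 2 for t in loo))
    return theta - crit * se, theta + crit * se


def conservative_ci(blocks):
    """Return whichever of the two intervals is wider."""
    candidates = [block_bootstrap_ci(blocks), jackknife_ci(blocks)]
    return max(candidates, key=lambda ci: ci[1] - ci[0])


# Invented per-match scores grouped into three tournament blocks.
blocks = [[0.1, 0.2, 0.3], [0.2, 0.2], [0.15, 0.25, 0.2, 0.3]]
ci = conservative_ci(blocks)
```

With only three blocks the bootstrap can draw just a handful of distinct resamples, so its interval can be artificially narrow; taking the wider of the two guards against that.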

Fig. V12 Pooled H / D / A outcome events · 10 equal-width bins · point area ∝ events in bin

Reliability curves — predicted vs observed, against the diagonal

Each line tracks one model's observed outcome frequency against its predicted probability; the grey 45° line is perfect calibration. The market hugs the diagonal most tightly (ECE 0.017), with the global LightGBM next; the ensemble sits mid-pack (ECE 0.048), tracking the diagonal well through the dense mid-range where ~90% of outcomes fall.

Axes: predicted probability (%) against observed frequency (%); the diagonal is perfect calibration. Points per model, listed as predicted → observed (events in bin):

Ensemble: 7% → 13% (23) · 15% → 12% (75) · 26% → 28% (156) · 34% → 35% (72) · 45% → 35% (40) · 55% → 61% (36) · 64% → 55% (33) · 74% → 88% (17) · 82% → 50% (4)
Market consensus: 7% → 9% (32) · 15% → 15% (85) · 25% → 27% (136) · 34% → 33% (75) · 45% → 45% (31) · 55% → 53% (32) · 65% → 64% (33) · 74% → 68% (25) · 84% → 71% (7)
LightGBM (global): 7% → 14% (28) · 15% → 13% (70) · 26% → 25% (139) · 33% → 33% (90) · 44% → 49% (45) · 55% → 55% (31) · 65% → 59% (34) · 75% → 85% (13) · 84% → 67% (6)
Dixon-Coles: 7% → 11% (35) · 16% → 16% (70) · 25% → 25% (138) · 35% → 39% (85) · 45% → 41% (41) · 56% → 43% (37) · 65% → 62% (21) · 74% → 83% (23) · 86% → 67% (6)
Bayesian hier.: 8% → 20% (20) · 15% → 10% (79) · 25% → 30% (172) · 34% → 31% (49) · 45% → 36% (44) · 55% → 61% (36) · 65% → 53% (36) · 74% → 86% (14) · 81% → 67% (6)
Elo-only: 8% → 13% (15) · 16% → 19% (101) · 25% → 27% (132) · 33% → 31% (84) · 45% → 43% (37) · 56% → 55% (29) · 65% → 55% (29) · 75% → 81% (27) · 84% → 0% (2)

Expected calibration error

  1. Market consensus 0.017
  2. LightGBM (global) 0.024
  3. Elo-only 0.032
  4. Dixon-Coles 0.036
  5. Ensemble 0.048
  6. Bayesian hier. 0.065

Lower ECE = better calibrated. The market leads, then the global LightGBM; the ensemble lands mid-pack — log-pooling sharpens probabilities, which costs a little calibration in the thin tails. Bold lines on the chart are the ensemble and the market; the pack is drawn faint to keep them legible.

Every model is broadly calibrated where the data is thick. The ensemble's single largest miss is one sparse high-probability bin (predicted ≈82%, observed 50% on just 4 events), the kind of tail wobble n = 3 cannot pin down. Because ECE is event-weighted, though, that bin carries little of the total: gaps of 9 to 14 points in the 45%, 64% and 74% bins (40, 33 and 17 events) each contribute more.

Source · Oxford Football Forecasting model — reliability bins per model on the held-out matches; ECE is the event-weighted mean |predicted − observed| over non-empty bins. Lower is better
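The ECE definition in the source line can be checked against the ensemble's bins as printed on the chart. The sketch below uses those rounded chart values, so it lands near, not exactly on, the published figure: about 0.044 against the reported 0.048, a gap that rounding of the displayed percentages would plausibly explain.

```python
# (predicted %, observed %, events) for the ensemble, as printed on Fig. V12.
ensemble_bins = [
    (7, 13, 23), (15, 12, 75), (26, 28, 156), (34, 35, 72), (45, 35, 40),
    (55, 61, 36), (64, 55, 33), (74, 88, 17), (82, 50, 4),
]


def expected_calibration_error(bins):
    """Event-weighted mean |predicted - observed| over non-empty bins."""
    total_events = sum(n for _, _, n in bins)
    weighted_gap = sum(abs(p - o) * n for p, o, n in bins)
    return weighted_gap / total_events / 100.0  # percentage points -> fraction


total_events = sum(n for _, _, n in ensemble_bins)
ece = expected_calibration_error(ensemble_bins)
```

The bins sum to 456 events, which is 152 matches times the three pooled H / D / A outcomes per match.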

Fig. V13 Nominal 90.0% · observed over 152 matches

Coverage hits its 90% target — observed 90.8%

Across all 152 held-out matches the 90%-target sets contained the actual result 90.8% of the time — within a whisker of nominal, and erring slightly conservative. Per-tournament it ranges from 84% to 100%, the spread you expect from three small slices.


Mean prediction-set size by subgroup — “we know less” shows up as a wider set

Subgroup · n · mean set size · coverage
Record above value (over-achievers), proxy · n = 97 · 2.67 · 89%
Low-coverage squads · n = 92 · 2.64 · 91%
Hosts (USA · CAN · MEX) · n = 11 · 2.64 · 100%
Star-dependent, proxy · n = 71 · 2.63 · 90%
Value above record (under-achievers), proxy · n = 87 · 2.62 · 90%
Powerhouses · n = 34 · 2.62 · 85%
Heat / altitude / travel, proxy · n = 52 · 2.56 · 85%
Short-history sides · n = 32 · 2.53 · 88%

set size out of 3 outcomes · the trailing % is that subgroup’s coverage

The uncertainty quantification holds: stated 90% sets really do cover ~90% of outcomes, so the probabilities mean what they say.

Source · Oxford Football Forecasting model — split-conformal prediction sets over W/D/L, 90% target; coverage and mean set size overall, per tournament and per stratum

What the set size is telling you

A conformal set lists every outcome the model cannot confidently rule out at the 90% level. When the model is sure (a clear favourite vs a minnow) the set collapses toward a single outcome; when it is unsure the set widens toward all three. Mean set size is therefore a direct read of confidence — and it is largest for exactly the strata where the evidence is thinnest. The “proxy” tag marks strata defined with a 2026-squad characteristic projected back onto the historical backtest (e.g. star-dependence, the decoupling sign), so those rows are indicative rather than clean out-of-sample.
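How a set collapses or widens can be sketched with a basic split-conformal recipe. The nonconformity score used here (one minus the probability given to the true outcome) is an assumption, as are the calibration forecasts, which are invented; the page only states that the sets are split-conformal over W/D/L with a 90% target.

```python
import math


def conformal_threshold(cal_probs, cal_outcomes, alpha=0.10):
    """Score threshold from a held-out calibration set.

    Score = 1 - probability assigned to the outcome that occurred (assumed).
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_outcomes))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return 1.0  # too few calibration points: keep every outcome
    return scores[rank - 1]


def prediction_set(probs, threshold, labels=("W", "D", "L")):
    """Every outcome whose score does not exceed the threshold."""
    return [lab for lab, p in zip(labels, probs) if 1.0 - p <= threshold]


# Invented calibration forecasts with the index of the observed outcome.
calibration = [
    ((0.70, 0.20, 0.10), 0), ((0.60, 0.25, 0.15), 0), ((0.50, 0.30, 0.20), 0),
    ((0.20, 0.30, 0.50), 2), ((0.30, 0.45, 0.25), 1), ((0.40, 0.35, 0.25), 0),
    ((0.35, 0.25, 0.40), 2), ((0.40, 0.35, 0.25), 1), ((0.45, 0.30, 0.25), 1),
    ((0.55, 0.25, 0.20), 2),
]
cal_probs = [p for p, _ in calibration]
cal_outcomes = [y for _, y in calibration]
threshold = conformal_threshold(cal_probs, cal_outcomes)

clear_favourite = prediction_set((0.80, 0.14, 0.06), threshold)
even_match = prediction_set((0.38, 0.30, 0.32), threshold)
```

In this toy setup the clear favourite yields a single-outcome set and the evenly matched game yields all three outcomes, which is the behaviour the mean set size summarises per subgroup.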

Fig. V14 Negative (left, green) = ensemble beats Elo · whisker = conservative 95% CI

ΔRPS vs the Elo floor by stratum — encouraging, but none survive BH-FDR

The point estimates lean negative in 7 of 8 strata: the ensemble out-scores Elo almost everywhere, with the single largest edge on powerhouses, and the low-coverage group that the de-biasing targeted also tilts its way. 1 of 8 strata has a raw interval that excludes zero, but after Benjamini–Hochberg control for 8 comparisons, none survives.

Axis: ← ensemble better · worse than Elo →

Powerhouses (n = 34): ΔRPS −0.0081 vs Elo, 95% CI [−0.0225, −0.0006], CI excludes 0. Does not survive BH-FDR.
Short-history sides (n = 32): ΔRPS +0.0002 vs Elo, 95% CI [−0.0031, +0.0044], CI includes 0. Does not survive BH-FDR.
Hosts (n = 11): ΔRPS −0.0071 vs Elo, CI not estimable (too few tournament blocks). Does not survive BH-FDR.
Low-coverage squads (n = 92): ΔRPS −0.0048 vs Elo, 95% CI [−0.0171, +0.0076], CI includes 0. Does not survive BH-FDR.
Star-dependent (n = 71): ΔRPS −0.0031 vs Elo, 95% CI [−0.0227, +0.0165], CI includes 0. Does not survive BH-FDR.
Record above value (over-achievers) (n = 97): ΔRPS −0.0035 vs Elo, 95% CI [−0.0146, +0.0083], CI includes 0. Does not survive BH-FDR.
Value above record (under-achievers) (n = 87): ΔRPS −0.0063 vs Elo, 95% CI [−0.0197, +0.0178], CI includes 0. Does not survive BH-FDR.
Heat / altitude / travel (n = 52): ΔRPS −0.0028 vs Elo, 95% CI [−0.0249, +0.0043], CI includes 0. Does not survive BH-FDR.

Legend: ensemble beats Elo (ΔRPS < 0) · Elo beats ensemble · BH-FDR survivors: 0 of 8

The de-biasing points the right way, but n = 3 tournaments cannot certify it. The directional pattern and the null multiplicity result are reported together — neither alone.

Source · Oxford Football Forecasting model — ΔRPS = ensemble − Elo per stratum, with tournament-block bootstrap / jackknife CIs and a Benjamini–Hochberg FDR flag across the strata

Why “0 survive” is the right reading

Running 8 subgroup comparisons multiplies the chance that one looks “significant” by luck alone. Benjamini–Hochberg controls the false-discovery rate across the whole family; here it rejects nothing, so no subgroup-specific superiority claim is made. The hosts stratum has no interval (“CI not estimable”) because, across three tournaments, host matches fall in too few blocks for the tournament-block resampler to form one, which is another face of the same n = 3 limit. In summary: the ensemble's per-stratum edges over Elo are consistent in direction but uncertain in magnitude.
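The Benjamini–Hochberg step-up procedure itself is short. The sketch below is a generic implementation; the p-values are hypothetical (this page reports intervals, not p-values) and the FDR level of 0.05 is an assumption. It shows how a single raw p-value under 0.05 can fail to survive once eight comparisons are accounted for.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a reject flag per hypothesis under BH control at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is at most (k / m) * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected


# Hypothetical p-values for eight strata: one is below 0.05 on its own,
# but its BH threshold is (1 / 8) * 0.05 = 0.00625.
hypothetical_p = [0.04, 0.62, 0.30, 0.45, 0.75, 0.55, 0.50, 0.20]
survivors = benjamini_hochberg(hypothetical_p)
```

Here `sum(survivors)` is 0, matching the pattern described above: one stratum looks significant in isolation and none does after correction.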

Temporal firewall

Every model is fit only on data dated on or before 2026-06-07, strictly before kick-off; the backtest scores tournaments the expanding window had not yet seen. No future information — results, odds, squads or form — reaches a prediction it is later graded on.

Tournament-block resampling

All uncertainty is computed by resampling whole tournaments, never individual matches. Matches from one event are correlated; blocking keeps them together so every interval reflects the real unit of replication — and makes explicit that there are only 3 of them.

Pre-registered tests

The two confirmatory hypotheses — H1 (beat the Elo floor) and H2 (the blend weight excludes zero) — were fixed in advance. Everything else, including the 8-stratum audit, is treated as exploratory and corrected for multiplicity, so we cannot cherry-pick a flattering slice after the fact.

Conservative protocol, paired comparisons

The headline number is always the expanding-window RPS, never the optimistic LOTO. Model comparisons are run on identical simulation draws, so differences reflect the models — not Monte-Carlo noise. The locked forecast is SHA-256 stamped in the footer.

The standing limit

Everything on this page is bounded by one number: n = 3. Three out-of-sample tournaments (152 matches) is enough to demonstrate parity with the market and calibrated uncertainty, and to point the de-biasing in the right direction — but not enough to certify a significant edge over a strong ratings baseline. That is why every figure here carries a wide interval, why we quote the conservative protocol, and why the defensible headline is the modest one: the model matches the market; it does not beat it.
