Validation · Oxford Football Forecasting

0.1891

Ensemble out-of-sample RPS

expanding window · edges the market’s 0.1905

90.8%

Conformal coverage observed

against a 90.0% nominal target — calibrated uncertainty

0 / 8

Subgroups beating Elo (BH-FDR)

zero survive multiplicity control at n = 3

n = 3

Out-of-sample tournaments

152 matches — the ceiling on every claim on this page

Source · Oxford Football Forecasting model — the 152-match out-of-sample backtest (ensemble RPS 0.18912, market 0.19053, conformal coverage 90.789%).

§ 01

Out-of-sample performance

Ranked-probability score (RPS) on the held-out matches; lower is better. Seven models sit between an Elo floor and the de-vigged market ceiling, each with its conservative 95% interval (the wider of a tournament-block bootstrap and a leave-one-tournament jackknife). The figure can be read under either protocol: the expanding window (the number we quote as expected skill) or the optimistic leave-one-tournament-out.
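For readers who want the metric spelled out, here is a minimal sketch of the standard ranked-probability score for one three-outcome match. The function names and the home / draw / away ordering are illustrative assumptions, not the project's own code.

```python
from itertools import accumulate

def rps(probs, outcome_index):
    """Ranked probability score for one match; lower is better.

    probs: forecast probabilities over ordered outcomes (home, draw, away).
    outcome_index: index of the outcome that actually occurred.
    """
    k = len(probs)
    observed = [1.0 if i == outcome_index else 0.0 for i in range(k)]
    cum_p = list(accumulate(probs))
    cum_o = list(accumulate(observed))
    # Squared gaps between cumulative forecast and cumulative outcome.
    # The final cumulative term is always 1 vs 1, so it is dropped.
    gaps = [(p - o) ** 2 for p, o in zip(cum_p[:-1], cum_o[:-1])]
    return sum(gaps) / (k - 1)

def mean_rps(forecasts, outcomes):
    """Average RPS over a list of matches."""
    return sum(rps(p, y) for p, y in zip(forecasts, outcomes)) / len(forecasts)
```

A perfect forecast scores 0 and a forecast that puts all its mass on the opposite end of the ordering scores 1, which is why small differences in the third decimal place matter here.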

Fig. V11 Lower is better · whisker = conservative 95% CI · floor & ceiling rails

RPS with CIs — the ensemble edges, inside the noise

Under the expanding window the ensemble (0.1891) leads the market (0.1905) by 0.0014 RPS and clears the Elo floor (0.1938), but the intervals are wide and overlapping. The dashed rails mark the floor (Elo) and ceiling (market); read each model against them.

Legend · ensemble (production model) · anchors (floor & ceiling) · conservative 95% CI

At n = 3 tournaments the order is suggestive, not decisive — every contender's interval straddles the market rail. The blend-weight test below is where the signal is real.

Source · Oxford Football Forecasting model — RPS over 152 matches / 3 tournaments; CI = the wider of a tournament-block bootstrap and a leave-one-tournament jackknife

H1 · confirmatory — does the model beat the floor?

Ensemble vs Elo: −0.0046 RPS

H1 NOT supported (CI includes 0)

The ensemble's RPS is 0.0046 lower than the Elo floor's (ΔRPS = −0.0046), which is better, but the conservative 95% interval [−0.0122, 0.0061] includes zero. With three tournaments the improvement over a pure ratings model cannot be distinguished from chance.

H2 · confirmatory — do the features add anything?

Optimal blend weight w* = 0.66

H2 SUPPORTED (CI excludes 0)

The convex blend that minimises out-of-sample RPS puts weight 0.66 on our model with a CI of [0.20, 1.11] that excludes zero. So the model carries information beyond the closing line — even though it does not, on its own, beat it outright.
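One way to see how a blend weight can be estimated is the sketch below. It assumes a linear blend of the two probability vectors, w × model + (1 − w) × market, and leaves w unconstrained; the source does not say which blending rule or optimiser was used, so treat both as assumptions. Under that assumption RPS is quadratic in w and the minimiser has a closed form.

```python
from itertools import accumulate

def _cum_gap(probs, outcome_index):
    """Cumulative forecast minus cumulative outcome, final term dropped."""
    observed = [1.0 if i == outcome_index else 0.0 for i in range(len(probs))]
    gaps = [p - o for p, o in zip(accumulate(probs), accumulate(observed))]
    return gaps[:-1]

def optimal_blend_weight(model_probs, market_probs, outcomes):
    """Weight w on the model minimising RPS of w*model + (1-w)*market.

    With a = model gap and b = market gap, the blended gap is b + w*(a - b),
    so the summed squared gap is minimised at w = -sum(b*d) / sum(d*d),
    where d = a - b.
    """
    num = 0.0
    den = 0.0
    for pm, pk, y in zip(model_probs, market_probs, outcomes):
        for a, b in zip(_cum_gap(pm, y), _cum_gap(pk, y)):
            d = a - b
            num += -b * d
            den += d * d
    if den == 0.0:
        raise ValueError("model and market forecasts are identical")
    return num / den
```

A weight near 0 would mean the model adds nothing to the market, and a weight near 1 would mean the market adds nothing to the model; an interval for w would come from recomputing it on resampled tournaments.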

Why two protocols — and which number we trust

Expanding window trains only on tournaments strictly before the one being scored, mirroring how the model is actually used pre-tournament; it is the number we quote as expected skill. Leave-one-tournament-out (LOTO) lets the model see future tournaments when scoring a past one, so it is optimistic and we report it only as an upper bound on skill. The ensemble's LOTO RPS (0.1830) is lower, and therefore better, than its expanding-window RPS (0.1891), which is the direction of gap you would expect; we never headline the LOTO figure.
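The difference between the two protocols is easiest to see as train/test splits. The sketch below uses hypothetical tournament labels in chronological order; the generator names are illustrative.

```python
def expanding_window_splits(tournaments):
    """Yield (train, test): train holds only strictly earlier tournaments."""
    for i in range(1, len(tournaments)):
        yield tournaments[:i], tournaments[i]

def loto_splits(tournaments):
    """Yield (train, test): train holds every other tournament, past or future."""
    for i, held_out in enumerate(tournaments):
        yield tournaments[:i] + tournaments[i + 1:], held_out

# Hypothetical labels, oldest first.
labels = ["T1", "T2", "T3"]
for train, test in expanding_window_splits(labels):
    print("expanding", train, "->", test)
for train, test in loto_splits(labels):
    print("LOTO     ", train, "->", test)
```

In the LOTO splits the earliest tournament is scored by a model trained on later ones, which is the source of the optimism described above.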

The intervals are deliberately conservative: for each model we take the wider of a tournament-block bootstrap (resampling whole tournaments, B = 3000) and a leave-one-tournament jackknife. Blocking by tournament keeps matches from the same event together, so the CI reflects the true unit of replication — the tournament, of which there are three.
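A rough sketch of the "wider of two intervals" rule follows. It assumes the statistic is the pooled mean of per-match values (for example RPS differences), a percentile bootstrap, and a Student-t critical value of 4.303 (two-sided 95%, 2 degrees of freedom) for the jackknife; only B = 3000 and the tournament blocking come from the text, the rest are assumptions.

```python
import math
import random
import statistics

def block_bootstrap_ci(blocks, n_boot=3000, alpha=0.05, seed=0):
    """Percentile CI for the pooled mean, resampling whole tournaments."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(blocks) for _ in blocks]
        stats.append(statistics.fmean(v for block in sample for v in block))
    stats.sort()
    lo = stats[int((alpha / 2) * (n_boot - 1))]
    hi = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi

def jackknife_ci(blocks, t_crit):
    """Leave-one-tournament-out jackknife CI around the pooled mean."""
    n = len(blocks)
    pooled = statistics.fmean(v for block in blocks for v in block)
    loo = [
        statistics.fmean(v for j, block in enumerate(blocks) if j != i for v in block)
        for i in range(n)
    ]
    loo_mean = statistics.fmean(loo)
    se = math.sqrt((n - 1) / n * sum((x - loo_mean) ** 2 for x in loo))
    return pooled - t_crit * se, pooled + t_crit * se

def conservative_ci(blocks, t_crit=4.303):
    """Return whichever of the two intervals is wider."""
    candidates = [block_bootstrap_ci(blocks), jackknife_ci(blocks, t_crit)]
    return max(candidates, key=lambda ci: ci[1] - ci[0])
```

With only three blocks there are just 27 distinct bootstrap resamples, so the bootstrap interval is coarse; taking the wider of the two is a guard against either method understating the uncertainty.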

§ 02

Calibration — do the probabilities mean what they say?

A model is calibrated if, when it says 30%, the thing happens about 30% of the time. We pool all three outcomes of every match (home / draw / away: 456 outcome-probabilities over 152 matches), bin by predicted probability, and plot the observed frequency against it. Points on the diagonal are perfectly calibrated; the expected calibration error (ECE) is the event-weighted average vertical gap between the points and the diagonal.
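The pooling and binning can be sketched as follows, using 10 equal-width bins and an event-weighted mean absolute gap over non-empty bins. The function names are illustrative, and the choice to compare each bin's mean predicted probability (rather than its midpoint) with the observed frequency is an assumption.

```python
def pool_outcomes(forecasts, outcomes):
    """Flatten each match into one (probability, hit) event per outcome."""
    probs, hits = [], []
    for forecast, result in zip(forecasts, outcomes):
        for k, p in enumerate(forecast):
            probs.append(p)
            hits.append(1 if k == result else 0)
    return probs, hits

def expected_calibration_error(probs, hits, n_bins=10):
    """Event-weighted mean |predicted - observed| over non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for p, h in zip(probs, hits):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 lands in the top bin
        bins[idx].append((p, h))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(mean_pred - observed)
    return ece
```

Because each bin is weighted by its share of events, a sparse bin moves the total only in proportion to how few events it holds.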

Fig. V12 Pooled H / D / A outcome events · 10 equal-width bins · point area ∝ events in bin

Reliability curves — predicted vs observed, against the diagonal

Each line tracks one model's observed outcome frequency against its predicted probability; the grey 45° line is perfect calibration. The market hugs the diagonal most tightly (ECE 0.017), with the global LightGBM next; the ensemble sits mid-pack (ECE 0.048), tracking the diagonal well through the dense mid-range where ~90% of outcomes fall.

Expected calibration error

Lower ECE = better calibrated. The market leads, then the global LightGBM; the ensemble lands mid-pack — log-pooling sharpens probabilities, which costs a little calibration in the thin tails. Bold lines on the chart are the ensemble and the market; the pack is drawn faint to keep them legible.

Every model is broadly calibrated where the data is thick. The ensemble's larger ECE is not a body-of-the-distribution problem — it comes almost entirely from one sparse high-probability bin (predicted ≈82%, observed 50% on just 4 events), the kind of tail wobble n = 3 cannot pin down.

Source · Oxford Football Forecasting model — reliability bins per model on the held-out matches; ECE is the event-weighted mean |predicted − observed| over non-empty bins. Lower is better

§ 03

Conformal coverage

Beyond a single most-likely outcome, the model emits a prediction set (some combination of home / draw / away) calibrated to contain the truth 90% of the time. The test: does it? By design, the set is allowed to be larger when the model genuinely knows less — so a wider set for a debutant is a feature, not a failure.

Fig. V13 Nominal 90.0% · observed over 152 matches

Coverage hits its 90% target — observed 90.8%

Across all 152 held-out matches the 90%-target sets contained the actual result 90.8% of the time — within a whisker of nominal, and erring slightly conservative. Per-tournament it ranges from 84% to 100%, the spread you expect from three small slices.

Legend · per tournament · overall (90.8%) · 90% nominal target

Mean prediction-set size by subgroup — “we know less” shows up as a wider set

| Subgroup | Tag | n | Mean set size (out of 3 outcomes) | Subgroup coverage |
|---|---|---|---|---|
| Record above value (over-achievers) | proxy | 97 | 2.67 | 89% |
| Low-coverage squads | | 92 | 2.64 | 91% |
| Hosts (USA · CAN · MEX) | | 11 | 2.64 | 100% |
| Star-dependent | proxy | 71 | 2.63 | 90% |
| Value above record (under-achievers) | proxy | 87 | 2.62 | 90% |
| Powerhouses | | 34 | 2.62 | 85% |
| Heat / altitude / travel | proxy | 52 | 2.56 | 85% |
| Short-history sides | | 32 | 2.53 | 88% |

The uncertainty quantification holds: stated 90% sets really do cover ~90% of outcomes, so the probabilities mean what they say.

Source · Oxford Football Forecasting model — split-conformal prediction sets over W/D/L, 90% target; coverage and mean set size overall, per tournament and per stratum

What the set size is telling you

A conformal set lists every outcome the model cannot confidently rule out at the 90% level. When the model is sure (a clear favourite vs a minnow) the set collapses toward a single outcome; when it is unsure the set widens toward all three. Mean set size is therefore a direct read of confidence, and by design it should be largest where the evidence is thinnest. In this backtest the subgroup means span a narrow range (2.53 to 2.67 out of 3), so differences between strata are small. The “proxy” tag marks strata defined with a 2026-squad characteristic projected back onto the historical backtest (e.g. star-dependence, the decoupling sign), so those rows are indicative rather than clean out-of-sample.
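As an illustration of how such sets can be built, here is a minimal split-conformal sketch. It assumes the common nonconformity score of one minus the probability assigned to the true outcome and a separate calibration split; the source states only that the sets are split-conformal over W/D/L with a 90% target, so the score and the function names are assumptions.

```python
import math

def conformal_threshold(cal_probs, cal_outcomes, alpha=0.10):
    """Score threshold learned on a calibration split.

    Nonconformity score = 1 - probability assigned to the true outcome.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_outcomes))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return 1.0  # too few calibration points: every outcome is kept
    return scores[rank - 1]

def prediction_set(probs, threshold, labels=("home", "draw", "away")):
    """Every outcome whose score does not exceed the threshold."""
    return [label for label, p in zip(labels, probs) if 1.0 - p <= threshold]
```

A confident forecast leaves only one outcome under the threshold, while a flat forecast keeps two or three, which is how set size tracks uncertainty.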

§ 04

The subgroup audit — where (if anywhere) the model wins

If the global club panel adds information where top-5 European feeds see least, the gains should land off-UEFA and on low-coverage squads. ΔRPS against the Elo floor is tested in 8 strata. The directions are encouraging — the ensemble beats Elo in most strata — but the decisive test is multiplicity: with 8 simultaneous comparisons at n = 3, do any survive a Benjamini–Hochberg false-discovery correction?

Fig. V14 Negative (left, green) = ensemble beats Elo · whisker = conservative 95% CI

ΔRPS vs the Elo floor by stratum — encouraging, but none survive BH-FDR

The point estimates lean negative in 7 of 8 strata: the ensemble out-scores Elo almost everywhere, with the single largest edge on powerhouses, and the low-coverage group that the de-biasing targeted also tilts its way. 1 of 8 strata has a raw interval that excludes zero, but after Benjamini–Hochberg control for 8 comparisons, none does.

Legend · ensemble beats Elo (ΔRPS < 0) · Elo beats ensemble · BH-FDR survivors: 0 of 8

The de-biasing points the right way, but n = 3 tournaments cannot certify it. The directional pattern and the null multiplicity result are reported together — neither alone.

Source · Oxford Football Forecasting model — ΔRPS = ensemble − Elo per stratum, with tournament-block bootstrap / jackknife CIs and a Benjamini–Hochberg FDR flag across the strata

Why “0 survive” is the right reading

Running 8 subgroup comparisons multiplies the chance that one looks “significant” by luck alone. Benjamini–Hochberg controls the false-discovery rate across the whole family; here it rejects nothing, so no subgroup-specific superiority claim is made. The hosts stratum has no interval (“CI not estimable”) because, across three tournaments, host matches fall in too few blocks for the tournament-block resampler to form one: another face of the same n = 3 limit. In summary, the ensemble's per-stratum edges over Elo are mostly consistent in direction (7 of 8 point estimates favour it) but uncertain in magnitude.
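The Benjamini–Hochberg step-up rule itself is short enough to sketch. The p-values and the FDR level q = 0.05 in the usage lines are hypothetical, chosen only to show how one raw "hit" among 8 tests can fail to survive; the source does not report its p-values or q.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True where it is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected

# Hypothetical p-values: one is below 0.05 on its own, none survives BH.
hypothetical = [0.04, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
print(sum(benjamini_hochberg(hypothetical)))
```

With 8 tests the smallest p-value has to clear q / 8, not q, which is why a single raw interval excluding zero is not enough.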

§ 05

Leakage & method guardrails

Out-of-sample numbers are only meaningful if the test set was genuinely unseen. Here is how the train/test boundary was firewalled — and the standing limits that remain.

Temporal firewall

Every model is fit only on data dated on or before 2026-06-07, strictly before kick-off; the backtest scores tournaments the expanding window had not yet seen. No future information — results, odds, squads or form — reaches a prediction it is later graded on.
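A firewall of this kind can be enforced with a simple guard at training time. The sketch below is illustrative only: the row layout (a dict with a "date" key) and the function name are assumptions, and the kick-off date is passed in rather than assumed.

```python
from datetime import date

def rows_before_cutoff(rows, cutoff, kickoff):
    """Keep rows dated on or before the cutoff; the cutoff must precede kick-off."""
    if cutoff >= kickoff:
        raise ValueError("cutoff must be strictly before kick-off")
    return [row for row in rows if row["date"] <= cutoff]

# The cutoff quoted on this page.
CUTOFF = date(2026, 6, 7)
```

Failing loudly when the cutoff is not strictly before kick-off turns a silent leak into an error that stops the run.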

Tournament-block resampling

All uncertainty is computed by resampling whole tournaments, never individual matches. Matches from one event are correlated; blocking keeps them together so every interval reflects the real unit of replication — and makes explicit that there are only 3 of them.

Pre-registered tests

The two confirmatory hypotheses — H1 (beat the Elo floor) and H2 (the blend weight excludes zero) — were fixed in advance. Everything else, including the 8-stratum audit, is treated as exploratory and corrected for multiplicity, so we cannot cherry-pick a flattering slice after the fact.

Conservative protocol, paired comparisons

The headline number is always the expanding-window RPS, never the optimistic LOTO. Model comparisons are run on identical simulation draws, so differences reflect the models — not Monte-Carlo noise. The locked forecast is SHA-256 stamped in the footer.
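For context, one plausible way to produce a SHA-256 stamp for a locked forecast is to hash a canonical serialisation of it. The JSON canonicalisation and the function name below are assumptions; the page does not describe how its stamp is computed.

```python
import hashlib
import json

def forecast_digest(forecast):
    """SHA-256 hex digest of a forecast dict, independent of key order."""
    canonical = json.dumps(forecast, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys before hashing means the same probabilities always produce the same digest, and any later change to a single value produces a different one.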

The standing limit

Everything on this page is bounded by one number: n = 3. Three out-of-sample tournaments (152 matches) is enough to demonstrate parity with the market and calibrated uncertainty, and to point the de-biasing in the right direction — but not enough to certify a significant edge over a strong ratings baseline. That is why every figure here carries a wide interval, why we quote the conservative protocol, and why the defensible headline is the modest one: the model matches the market; it does not beat it.

Keep reading

Models → The seven-model ladder and the log-pool ensemble these scores grade.

Data → The de-biasing coverage maps behind the subgroup audit; query it yourself.

Forecast → The locked champion probabilities this validation underwrites.