The production forecast. It combines the goal-model kernels through a weighted log-opinion pool of their scoreline grids: coherent across totals and handicap markets, and sharper than any single kernel while remaining calibrated when the kernels' errors disagree.
0.1891
OOS RPS · expanding
the headline skill (realism protocol)
0.1830
OOS RPS · LOTO
optimistic ceiling (leaks future folds)
#1
Leaderboard rank
of 7 · CI 0.1607–0.2164
§ 01
The intuition
In plain English, before any mathematics.
The ensemble combines the kernels by a log-opinion pool: each model's scoreline grid is raised to its weight, the grids are multiplied together, and the product is renormalised. Pooling whole grids keeps totals and Asian-handicap prices coherent, and the result is sharper than any single kernel while remaining calibrated when the kernels' errors disagree. An equal-weight pool beat stacking out-of-sample: the forecast-combination puzzle, confirmed here at n = 3.
§ 02
Mathematical specification
How three scoreline distributions become one — and how the pooled forecast is blended with the market. The combination happens at the grid level, so every derived price stays coherent.
The log-opinion pool (geometric, weights regularised toward equal)
01 The log-opinion pool
The production forecast is a weighted geometric mean of the Dixon–Coles, Bayesian and LightGBM scoreline grids, renormalised over the whole grid. Pooling full grids — not just win/draw/loss — keeps totals, handicaps and outcome probabilities mutually consistent.
The pooled scoreline grid — a renormalised geometric mean
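The caption above refers to a formula that is not reproduced in the text. A reconstruction consistent with the log-space form given in step 02 would be the following, where the symbols are assumptions carried over from that step: P_k is kernel k's scoreline grid, w_k its weight, and Z the normalising constant summed over every scoreline (x, y).

```latex
P_{\text{ens}}(x,y) \;=\; \frac{1}{Z}\prod_{k} P_k(x,y)^{\,w_k},
\qquad
Z \;=\; \sum_{x',y'} \prod_{k} P_k(x',y')^{\,w_k},
\qquad
w_k \ge 0,\;\; \sum_k w_k = 1 .
```

Taking logarithms of both sides gives the linear blend of log-probabilities used in the next step.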
02 In log space
Equivalently, the pool is a linear blend of log-probabilities. A log pool is sharper than any single kernel yet stays calibrated when the kernels’ errors disagree — and Dixon–Coles and the Bayesian model disagree precisely on draw-heavy knockout football.
$$\log P_{\text{ens}}(x,y) \;=\; \sum_{k} w_k \log P_k(x,y) \;-\; \log Z$$
The same pool, computed stably in log space
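The log-space computation can be sketched in a few lines of plain Python. This is a minimal illustration, not the site's implementation: the function name `log_pool`, the nested-list grid layout, and the small probability floor `eps` are all assumptions made for the example.

```python
import math


def log_pool(grids, weights, eps=1e-300):
    """Weighted log-opinion pool of scoreline grids (illustrative sketch).

    grids   : list of K grids; grids[k][x][y] = P_k(home scores x, away scores y)
    weights : K non-negative weights summing to 1
    eps     : hypothetical floor that avoids log(0); not taken from the source
    """
    n_rows, n_cols = len(grids[0]), len(grids[0][0])

    # Weighted sum of log-probabilities for every scoreline cell.
    log_mix = [
        [
            sum(w * math.log(max(g[x][y], eps)) for g, w in zip(grids, weights))
            for y in range(n_cols)
        ]
        for x in range(n_rows)
    ]

    # log Z by log-sum-exp: subtract the maximum before exponentiating.
    flat = [v for row in log_mix for v in row]
    peak = max(flat)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in flat))

    # Back to probabilities; the grid now sums to 1.
    return [[math.exp(v - log_z) for v in row] for row in log_mix]


# Toy 2x2 grids (made-up numbers, for illustration only).
grid_a = [[0.40, 0.10], [0.20, 0.30]]
grid_b = [[0.25, 0.25], [0.25, 0.25]]
pooled = log_pool([grid_a, grid_b], [0.5, 0.5])
```

Because the pool returns a full grid, any derived price (outcome, total, handicap) can be read off the same object by summing the relevant cells.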
03 The weights
Weights live on the simplex and are regularised toward equal. A stacked combiner fitted this way scored worse out-of-sample than the plain equal-weight pool — the forecast-combination puzzle — so the production weights are exactly one third each. The numbers are in the section below.
Simplex weights, regularised toward equal — equal weights won
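The source does not state the exact form of the regulariser. One common way to write "simplex weights shrunk toward equal" is a penalised objective like the one below; the quadratic penalty, the strength λ, and the use of mean RPS as the loss are assumptions for illustration only. K is the number of kernels (three here) and N the number of training matches.

```latex
\hat{w} \;=\; \arg\min_{w \in \Delta^{K-1}}
\;\frac{1}{N}\sum_{i=1}^{N} \mathrm{RPS}_i\!\big(P_{\text{ens}}(\,\cdot\,; w)\big)
\;+\; \lambda \sum_{k=1}^{K}\Big(w_k - \tfrac{1}{K}\Big)^{2},
\qquad
\Delta^{K-1} = \Big\{ w : w_k \ge 0,\ \sum_k w_k = 1 \Big\}.
```

As λ grows the solution approaches w_k = 1/K, which with K = 3 is the one-third-each production setting described above.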
04 The score being minimised
RPS — the ranked probability score over the ordered outcomes home, draw, away — penalises probability mass by how far it sits from what happened: a near-miss draw forecast beats a confident wrong winner. It is the primary score everywhere on this site; lower is better.
The ranked probability score on the ordered W/D/A outcome
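A short sketch of the score for a single match, using the standard definition: the mean squared difference between the cumulative forecast and the cumulative outcome over the ordered categories. Dividing by the number of categories minus one is an assumption about the site's convention, although it is consistent with scores near 0.19. The function name `rps` and the example forecasts are illustrative.

```python
def rps(probs, outcome):
    """Ranked probability score for one match (lower is better).

    probs   : forecast over the ordered outcomes (home, draw, away), summing to 1
    outcome : index of what happened (0 = home, 1 = draw, 2 = away)
    """
    cum_forecast = 0.0
    cum_observed = 0.0
    total = 0.0
    # The last cumulative term is always 1 - 1 = 0, so it is skipped.
    for i in range(len(probs) - 1):
        cum_forecast += probs[i]
        cum_observed += 1.0 if i == outcome else 0.0
        total += (cum_forecast - cum_observed) ** 2
    return total / (len(probs) - 1)


AWAY = 2
# The away side wins. A draw-heavy forecast is a "near miss" ...
near_miss = rps((0.1, 0.8, 0.1), AWAY)      # approximately 0.41
# ... while a confident home-win forecast is two categories away.
wrong_winner = rps((0.8, 0.1, 0.1), AWAY)   # approximately 0.725
```

Both forecasts put the same 0.1 on the actual result, yet the draw-heavy one scores better because its misplaced mass sits closer to the observed outcome.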
05 The market blend
The de-vigged bookmaker consensus is treated as one more expert. The convex blend weight is learned on the log-odds scale, fitted strictly out-of-sample — chosen on two odds-carrying tournaments, tested on the third. A weight above zero with a confidence interval excluding zero is the strongest available evidence that the model carries information the market does not fully price; the fitted value is reported below.
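The blend can be sketched as follows. This is one plausible reading of "a convex weight on the log scale", not a confirmed description of the site's method: proportional de-vigging, the renormalised geometric blend, and the names `devig` and `blend` are all assumptions, and the odds and model probabilities are made-up numbers.

```python
import math


def devig(decimal_odds):
    """Proportional de-vig (assumed method): rescale 1/odds to sum to 1."""
    implied = [1.0 / o for o in decimal_odds]
    overround = sum(implied)
    return [p / overround for p in implied]


def blend(model_probs, market_probs, w):
    """Blend two forecasts linearly in log-probability, then renormalise.

    w is the weight on the model and (1 - w) the weight on the market.
    Both inputs are assumed to be strictly positive.
    """
    logs = [
        w * math.log(pm) + (1.0 - w) * math.log(pk)
        for pm, pk in zip(model_probs, market_probs)
    ]
    peak = max(logs)
    unnorm = [math.exp(v - peak) for v in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Illustrative inputs only: home/draw/away decimal odds and a model forecast.
market = devig([2.10, 3.40, 3.60])
model = [0.50, 0.27, 0.23]
blended = blend(model, market, w=0.66)
```

Setting `w = 1` returns the model forecast and `w = 0` returns the market, so a fitted weight strictly between the two indicates that each source contributes something the other lacks.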
OOS RPS — expanding (headline) and LOTO (optimistic)
On the headline expanding window this model scores 0.1891: 0.0046 below the Elo floor (0.1938) and 0.0014 below the market ceiling (0.1905). Lower is better.
[Bar chart: OOS RPS on the floor–ceiling axis · Elo floor 0.1938 · market 0.1905 · Expanding 0.1891 · LOTO 0.1830]
Bar fills to the model’s RPS on the floor–ceiling axis; the whisker on the expanding bar is the conservative 95% CI (0.1607–0.2164). Lower (left) is better.
It clears the Elo floor; the gap to the market is small and — at n = 3 — inside the bootstrap interval.
Source · Oxford Football Forecasting model · Bookmaker consensus (de-vigged closing odds) · 152 matches · 3 tournaments
§ 06
Why simple beat fancy
The forecast-combination puzzle, and the market-blend the model genuinely earns.
A stacked meta-learner (a regularised, non-negative, sum-to-one combiner fit by
nested cross-validation) was run against a plain equal-weight average. The stack scored
worse out-of-sample: +0.0021 RPS, just
inside the ±0.0022 bootstrap standard error. So the production
combiner is the simple average: the well-known forecast-combination puzzle, reported as a
clean negative result rather than re-tuned until something won on three folds.
Where the ensemble does add value is against the market. Treating the de-vigged
closing line as one more expert and learning the convex blend gives a weight
w* = 0.66 on the model, with a CI of
[0.20, 1.11] that excludes zero. The model
carries marginal information beyond the market — the strongest available evidence the
features are real — even though it does not beat the market outright.
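A comparison of this kind can be sketched with a paired bootstrap over per-match scores. The source does not say how its bootstrap resamples, so resampling individual matches is an assumption (resampling whole tournaments would give a wider interval), and the function name and the score lists below are synthetic placeholders rather than the site's data.

```python
import random


def paired_bootstrap_se(scores_a, scores_b, n_boot=2000, seed=0):
    """Mean paired difference (a - b) and its bootstrap standard error.

    scores_a, scores_b : per-match scores of two combiners on the same matches
    Resamples matches with replacement; an assumption, not the source's scheme.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n

    # Recompute the mean difference on each resampled set of matches.
    boot_means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)

    centre = sum(boot_means) / n_boot
    variance = sum((m - centre) ** 2 for m in boot_means) / (n_boot - 1)
    return observed, variance ** 0.5


# Synthetic per-match RPS values, for illustration only.
stack_scores = [0.20, 0.30, 0.10, 0.40]
simple_scores = [0.10, 0.30, 0.20, 0.30]
diff, se = paired_bootstrap_se(stack_scores, simple_scores)
```

A difference smaller than its standard error, as with the +0.0021 against ±0.0022 reported above, gives no grounds for preferring the more complex combiner.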
0.1891
Simple average · OOS RPS
the production combiner (adopted)
0.1912
Stacking · OOS RPS
+0.0021 vs simple — inside ±0.0022 SE
0.66
Market-blend weight w*
CI [0.20, 1.11] excludes 0
§ 07
Strengths & limits
What this model is good for — and where it is weak.
Strengths
Headline OOS RPS 0.1891 (expanding), level with the market (0.1905) within the bootstrap interval