WC 2026 · Forecasting Oxford Football Forecasting

Combination · the production model

Ensemble — a log-opinion pool of the kernels

The production forecast. It combines the goal-model kernels through a weighted log-opinion pool of their scoreline grids. Pooling at grid level keeps totals and handicap markets coherent, and when the kernels' errors disagree the pool is sharper than any single kernel while staying calibrated.

  • OOS RPS · expanding: 0.1891, the headline skill (realism protocol)
  • OOS RPS · LOTO: 0.1830, an optimistic ceiling (leaks future folds)
  • Leaderboard rank: #1 of 7 · CI 0.1607–0.2164

Combines the kernels by a log-opinion pool: each model's renormalised scoreline grid is raised to its weight, the results are multiplied cell by cell, and the product is renormalised. This keeps totals and Asian-handicap prices coherent, and the pool is sharper than any single kernel while staying calibrated when the kernels' errors disagree. A simple average (equal weights) beat stacking out-of-sample: the forecast-combination puzzle, confirmed here on n = 3 tournaments.

The log-opinion pool (geometric, weight-regularized to equal)

01 The log-opinion pool

The production forecast is a weighted geometric mean of the Dixon–Coles, Bayesian and LightGBM scoreline grids, renormalised over the whole grid. Pooling full grids — not just win/draw/loss — keeps totals, handicaps and outcome probabilities mutually consistent.

The pooled scoreline grid — a renormalised geometric mean
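
Written out in the notation of the symbol key, the pool might read as follows. The symbols P_k(x,y) and w_k appear in the schematic; the names \bar{P}, K and Z are assumed here for illustration only.

```latex
\bar{P}(x,y) \;=\; \frac{1}{Z}\,\prod_{k=1}^{K} P_k(x,y)^{\,w_k},
\qquad
Z \;=\; \sum_{x',\,y'} \prod_{k=1}^{K} P_k(x',y')^{\,w_k},
\qquad
w_k \ge 0,\quad \sum_{k=1}^{K} w_k = 1 .
```

With K = 3 and equal weights w_k = 1/3, this reduces to the plain geometric mean of the three grids, divided by Z so that the cells sum to one.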

02 In log space

Equivalently, the pool is a linear blend of log-probabilities. A log pool is sharper than any single kernel yet stays calibrated when the kernels’ errors disagree — and Dixon–Coles and the Bayesian model disagree precisely on draw-heavy knockout football.

The same pool, computed stably in log space
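
A minimal sketch of the log-space computation, assuming each kernel supplies a square grid indexed [home goals, away goals]. The function names and the independent-Poisson toy grids are hypothetical stand-ins for the real kernels, not the site's implementation.

```python
from math import factorial

import numpy as np


def log_opinion_pool(grids, weights, eps=1e-300):
    """Pool K scoreline grids (each summing to one) as a weighted sum of logs."""
    grids = np.asarray(grids, dtype=float)        # shape (K, G, G)
    w = np.asarray(weights, dtype=float)          # shape (K,)
    log_p = np.log(np.clip(grids, eps, None))     # clip so log(0) never occurs
    blended = np.tensordot(w, log_p, axes=1)      # sum_k w_k * log P_k, shape (G, G)
    blended -= blended.max()                      # shift before exp for stability
    pooled = np.exp(blended)
    return pooled / pooled.sum()                  # renormalise over the whole grid


def outcome_probs(grid):
    """Collapse one grid to (home, draw, away); rows are home goals."""
    home = np.tril(grid, -1).sum()                # cells with home goals > away goals
    draw = np.trace(grid)
    away = np.triu(grid, 1).sum()
    return np.array([home, draw, away])


def poisson_grid(rate_home, rate_away, max_goals=10):
    """Toy kernel (assumption): independent Poisson goals, truncated and renormalised."""
    goals = np.arange(max_goals + 1)
    fact = np.array([factorial(int(g)) for g in goals], dtype=float)
    home = np.exp(-rate_home) * rate_home ** goals / fact
    away = np.exp(-rate_away) * rate_away ** goals / fact
    grid = np.outer(home, away)
    return grid / grid.sum()


grids = [poisson_grid(1.6, 1.0), poisson_grid(1.4, 1.1), poisson_grid(1.8, 0.9)]
pooled = log_opinion_pool(grids, [1 / 3, 1 / 3, 1 / 3])

# Every market is read off the same pooled grid, so they cannot contradict each other.
home, draw, away = outcome_probs(pooled)
over_2_5 = sum(pooled[x, y] for x in range(11) for y in range(11) if x + y > 2)
```

Because the outcome probabilities and the over/under total are both sums over cells of one grid, coherence comes for free; pooling only the three outcome probabilities would not give that guarantee.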

03 The weights

Weights live on the simplex and are regularised toward equal. A stacked combiner fitted this way scored worse out-of-sample than the plain equal-weight pool — the forecast-combination puzzle — so the production weights are exactly one third each. The numbers are in the section below.

Simplex weights, regularised toward equal — equal weights won
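
One way such a fit could look, as a sketch only. It assumes a squared-distance penalty toward equal weights and a coarse grid search over the simplex; the source does not state the penalty form or the optimiser, and the nested cross-validation loop is omitted. All data below are synthetic.

```python
from itertools import product

import numpy as np


def pool(log_grids, w):
    """Log-opinion pool. log_grids: (K, N, G, G) log-probabilities; returns (N, G, G)."""
    z = np.tensordot(w, log_grids, axes=1)             # sum_k w_k * log P_k
    z = z - z.max(axis=(1, 2), keepdims=True)          # stabilise before exp
    p = np.exp(z)
    return p / p.sum(axis=(1, 2), keepdims=True)


def outcome_probs(grids):
    """Collapse (N, G, G) grids, indexed [home goals, away goals], to (N, 3)."""
    home = np.tril(grids, -1).sum(axis=(1, 2))
    draw = np.trace(grids, axis1=1, axis2=2)
    away = np.triu(grids, 1).sum(axis=(1, 2))
    return np.stack([home, draw, away], axis=1)


def mean_rps(probs, outcomes):
    """Mean ranked probability score; outcomes are 0 = home, 1 = draw, 2 = away."""
    onehot = np.eye(3)[outcomes]
    diff = np.cumsum(probs - onehot, axis=1)[:, :2]
    return float((diff ** 2).sum(axis=1).mean() / 2)


def fit_weights(log_grids, outcomes, lam, steps=21):
    """Search the simplex for weights minimising RPS + lam * ||w - 1/K||^2 (assumed penalty)."""
    K = log_grids.shape[0]
    candidates = [np.array(c) / steps
                  for c in product(range(steps + 1), repeat=K) if sum(c) == steps]

    def loss(w):
        penalty = lam * np.sum((w - 1.0 / K) ** 2)
        return mean_rps(outcome_probs(pool(log_grids, w)), outcomes) + penalty

    return min(candidates, key=loss)


# Toy out-of-fold data: 3 kernels, 40 matches, 6x6 grids (all synthetic).
rng = np.random.default_rng(0)
raw = rng.random((3, 40, 6, 6)) + 0.05
log_grids = np.log(raw / raw.sum(axis=(2, 3), keepdims=True))
outcomes = rng.integers(0, 3, size=40)

w_loose = fit_weights(log_grids, outcomes, lam=0.0)    # unregularised stack
w_tight = fit_weights(log_grids, outcomes, lam=1e3)    # strong pull toward equal
```

With a large penalty the search returns exactly one third each, which is the configuration the production pool adopts; with no penalty it is free to chase in-sample noise, which is the behaviour the out-of-sample comparison guards against.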

04 The score being minimised

RPS — the ranked probability score over the ordered outcomes home, draw, away — penalises probability mass by how far it sits from what happened: a near-miss draw forecast beats a confident wrong winner. It is the primary score everywhere on this site; lower is better.

The ranked probability score on the ordered home/draw/away outcome
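
A short sketch of the score for a single match. The division by the number of categories minus one is the common normalised convention and is assumed here; the function name is illustrative.

```python
import numpy as np


def rps(probs, outcome):
    """Ranked probability score for one match.

    probs   -- (p_home, p_draw, p_away), summing to one
    outcome -- 0 for a home win, 1 for a draw, 2 for an away win
    """
    probs = np.asarray(probs, dtype=float)
    observed = np.zeros(len(probs))
    observed[outcome] = 1.0
    # Compare cumulative forecast and cumulative outcome; the last term is always 0.
    cum_diff = np.cumsum(probs - observed)[:-1]
    return float(np.sum(cum_diff ** 2) / (len(probs) - 1))


# Home win happened. Both forecasts give it 0.2, but place the rest differently.
near_miss = rps([0.2, 0.7, 0.1], outcome=0)    # mass on the adjacent outcome (draw)
far_miss = rps([0.2, 0.1, 0.7], outcome=0)     # mass on the opposite outcome (away)
```

Here `near_miss` comes to 0.325 and `far_miss` to 0.565: the two forecasts give the true result the same probability, yet the one that leaned toward the draw is penalised less because its mass sits closer to what happened.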

05 The market blend

The de-vigged bookmaker consensus is treated as one more expert. The convex blend weight is learned on the log-odds scale, fitted strictly out-of-sample — chosen on two odds-carrying tournaments, tested on the third. A weight above zero with a confidence interval excluding zero is the strongest available evidence that the model carries information the market does not fully price; the fitted value is reported below.

The convex model–market blend and its fitted weight
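
A sketch of the blend itself, under two labelled assumptions: the overround is removed by proportional normalisation of implied probabilities, and "on the log-odds scale" is read as a log-linear mix of the two probability vectors. Neither choice is confirmed by the source, the odds and model probabilities are made up, and the out-of-sample fitting of the weight is not shown.

```python
import numpy as np


def devig(decimal_odds):
    """Remove the bookmaker margin by proportional normalisation (assumed method)."""
    implied = 1.0 / np.asarray(decimal_odds, dtype=float)
    return implied / implied.sum(axis=-1, keepdims=True)


def blend(p_model, p_market, w):
    """Log-linear blend: weight w on the model, 1 - w on the market, then renormalise."""
    mixed = w * np.log(p_model) + (1.0 - w) * np.log(p_market)
    mixed = mixed - mixed.max(axis=-1, keepdims=True)   # stabilise before exp
    p = np.exp(mixed)
    return p / p.sum(axis=-1, keepdims=True)


# Hypothetical home/draw/away prices and model probabilities for one match.
p_market = devig([2.10, 3.40, 3.60])
p_model = np.array([0.48, 0.27, 0.25])

p_blend = blend(p_model, p_market, w=0.66)
```

At w = 0 the blend returns the market and at w = 1 it returns the model, so a fitted weight whose interval stays above zero means the model shifts the blended forecast in a direction that improved the score on held-out matches.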

Symbol key

kernel k’s renormalized scoreline-grid probability
the number of kernels pooled (3: Dixon–Coles, Bayesian, LightGBM)
kernel k’s weight on the simplex — the production pool is equal-weight
the pooled grid — a weighted geometric mean, renormalized to sum to one
the normalising constant of the pool
the probability simplex the weights live on
the regularisation strength pulling weights toward equal
the outcome indicator — 1 for the result that happened, 0 otherwise
the convex model–market blend weight, fit out-of-sample
  • Renormalized scoreline grids of the Dixon-Coles, Bayesian and LightGBM kernels
  • Out-of-fold RPS to fit the simplex weights (regularized toward equal)

Fig. M · Ensemble · Conceptual schematic

Ensemble — wired end to end

Schematic: the Dixon-Coles, Bayesian and LightGBM grids Pₖ(x,y) feed the log-opinion pool ∏ₖ Pₖ(x,y)^wₖ with Σwₖ = 1, which outputs one coherent grid. Weights are regularized toward equal (the simple average won out-of-sample).
Source · Oxford Football Forecasting model · structural diagram, not a data plot

Fig. V11 Lower is better · floor = Elo-only · ceiling = de-vigged market

OOS RPS — expanding (headline) and LOTO (optimistic)

On the headline expanding window this model scores 0.1891: 0.0046 below the Elo floor (0.1938) and 0.0014 below the market ceiling (0.1905).

Expanding 0.1891
LOTO 0.1830

Bar fills to the model’s RPS on the floor–ceiling axis; the whisker on the expanding bar is the conservative 95% CI (0.1607–0.2164). Lower (left) is better.

It clears the Elo floor; the gap to the market is small and — at n = 3 — inside the bootstrap interval.

Source · Oxford Football Forecasting model · Bookmaker consensus (de-vigged closing odds) · 152 matches · 3 tournaments

A stacked meta-learner (a regularized, non-negative, sum-to-one combiner fit by nested cross-validation) was run against a plain equal-weight average. The stack scored worse out-of-sample: +0.0021 RPS, a gap just inside the ±0.0022 bootstrap standard error. So the production combiner is the simple average. This is the well-known forecast-combination puzzle, reported as a clean negative rather than re-tuned until something won on three folds.

Where the ensemble does add value is against the market. Treating the de-vigged closing line as one more expert and learning the convex blend gives a weight w* = 0.66 on the model, with a CI of [0.20, 1.11] that excludes zero. The model carries marginal information beyond the market — the strongest available evidence the features are real — even though it does not beat the market outright.

  • Simple average · OOS RPS: 0.1891, the production combiner (adopted)
  • Stacking · OOS RPS: 0.1912, +0.0021 vs simple, inside ±0.0022 SE
  • Market-blend weight w*: 0.66, CI [0.20, 1.11] excludes 0

Strengths

  • Headline OOS RPS 0.1891 (expanding), statistically level with the market (0.1905)
  • Coherent totals/handicap (grid-level pooling)
  • Simple-average robustness beat stacking out-of-sample

Limits

  • Matches the market but does not beat it (CI vs market includes 0)
  • n=3 tournaments — every CI is wide by design
  • Stacking not adopted (no out-of-sample gain beyond bootstrap SE)