Combination · the production model

Ensemble — a log-opinion pool of the kernels

The production forecast. It combines the goal-model kernels through a weighted log-opinion pool of their scoreline grids. The pooled grid stays coherent across totals and handicap markets, and it is sharper than any single kernel while remaining calibrated when the kernels' errors disagree.

0.1891

OOS RPS · expanding

the headline skill (realism protocol)

0.1830

OOS RPS · LOTO

optimistic ceiling (leaks future folds)

Leaderboard rank

of 7 · CI 0.1607–0.2164

§ 01

The intuition

In plain English, before any mathematics.

The ensemble combines the kernels with a log-opinion pool: multiply together each model's renormalized scoreline grid raised to its weight, then renormalize the product. Because the pooling happens on the full grid, totals and Asian-handicap prices stay coherent, and the result is sharper than any single kernel while remaining calibrated when the kernels' errors disagree. A simple average (equal weights) beat stacking out-of-sample: the forecast-combination puzzle, confirmed here at n=3.

§ 02

Mathematical specification

How three scoreline distributions become one — and how the pooled forecast is blended with the market. The combination happens at the grid level, so every derived price stays coherent.

P_{ens}(x, y) \propto \prod_{k} P_{k}(x, y)^{w_{k}}, \quad \sum_{k} w_{k} = 1, \quad w_{k} \to \text{equal (regularized)}

The log-opinion pool (geometric, weight-regularized to equal)

01 The log-opinion pool

The production forecast is a weighted geometric mean of the Dixon–Coles, Bayesian and LightGBM scoreline grids, renormalised over the whole grid. Pooling full grids — not just win/draw/loss — keeps totals, handicaps and outcome probabilities mutually consistent.

P_{ens}(x, y) = \frac{\prod_{k=1}^{K} P_{k}(x, y)^{w_{k}}}{\sum_{x', y'} \prod_{k=1}^{K} P_{k}(x', y')^{w_{k}}}, \quad K = 3

The pooled scoreline grid — a renormalised geometric mean
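As a minimal sketch of why working at the grid level keeps prices coherent, the Python below derives several markets from one scoreline grid, so every price is a sum over the same cells. The function name `derive_markets`, the 3×3 toy grid, and the 2.5-goal and 1.5-goal lines are illustrative assumptions, not the production code or real forecasts; a real grid would cover more goals.

```python
def derive_markets(grid, total_line=2.5, handicap_line=1.5):
    """Derive outcome, totals and handicap probabilities from one scoreline grid.

    grid[x][y] is the probability of the scoreline home x, away y.
    The grid is assumed to sum to one.
    """
    cells = [(x, y, p) for x, row in enumerate(grid) for y, p in enumerate(row)]
    return {
        "home": sum(p for x, y, p in cells if x > y),
        "draw": sum(p for x, y, p in cells if x == y),
        "away": sum(p for x, y, p in cells if x < y),
        # Over the goal line: total goals strictly above total_line.
        "over": sum(p for x, y, p in cells if x + y > total_line),
        # Home covers the handicap: winning margin strictly above handicap_line.
        "home_covers": sum(p for x, y, p in cells if x - y > handicap_line),
    }


# Hypothetical toy grid (rows: home goals 0-2, columns: away goals 0-2).
toy_grid = [
    [0.10, 0.08, 0.04],
    [0.15, 0.14, 0.06],
    [0.18, 0.15, 0.10],
]
markets = derive_markets(toy_grid)
```

Because all five numbers come from the same grid, the home, draw and away probabilities sum to one and the handicap probability can never exceed the home-win probability. Pooling only the win/draw/loss triple would not give that guarantee for totals or handicaps.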

02 In log space

Equivalently, the pool is a linear blend of log-probabilities. A log pool is sharper than any single kernel yet stays calibrated when the kernels’ errors disagree — and Dixon–Coles and the Bayesian model disagree precisely on draw-heavy knockout football.

\log P_{ens}(x, y) = \sum_{k} w_{k} \log P_{k}(x, y) - \log Z

The same pool, computed stably in log space
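The log-space form can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the production implementation: the function name `log_pool`, the nested-list grid layout, and the 2×2 toy grids are all hypothetical, and the normalising constant is computed with a max-shifted log-sum-exp.

```python
import math


def log_pool(grids, weights):
    """Weighted log-opinion pool of scoreline grids, computed in log space.

    grids   : list of K grids; grids[k][x][y] is kernel k's probability of
              the scoreline home x, away y.
    weights : K non-negative weights that sum to one.
    Assumes at least one cell has positive probability under every kernel.
    """
    n_rows, n_cols = len(grids[0]), len(grids[0][0])
    # Linear blend of log-probabilities. A zero under any weighted kernel
    # sends the pooled cell to -inf, i.e. probability zero.
    log_unnorm = [
        [
            sum(
                w * (math.log(g[x][y]) if g[x][y] > 0 else -math.inf)
                for g, w in zip(grids, weights)
                if w > 0
            )
            for y in range(n_cols)
        ]
        for x in range(n_rows)
    ]
    # log Z by log-sum-exp, shifted by the maximum for numerical stability.
    flat = [v for row in log_unnorm for v in row]
    shift = max(flat)
    log_z = shift + math.log(sum(math.exp(v - shift) for v in flat))
    return [[math.exp(v - log_z) for v in row] for row in log_unnorm]


# Hypothetical 2x2 toy grids from two kernels, pooled with equal weights.
g1 = [[0.4, 0.1], [0.3, 0.2]]
g2 = [[0.2, 0.3], [0.1, 0.4]]
pooled = log_pool([g1, g2], [0.5, 0.5])
```

With equal weights each pooled cell is proportional to the geometric mean of the kernels' cells, and the grid sums to one after dividing by Z. Note the veto property: a kernel that assigns zero to a scoreline forces the pooled probability to zero as well.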

03 The weights

Weights live on the simplex and are regularised toward equal. A stacked combiner fitted this way scored worse out-of-sample than the plain equal-weight pool (the forecast-combination puzzle), so the production weights are exactly one third each. The numbers are reported in § 06.

w^{⋆} = \arg\min_{w \in Δ^{2}} \; RPS_{\text{out-of-fold}}(P_{ens}(w)) + γ \left\lVert w - \tfrac{1}{3}\mathbf{1} \right\rVert_{2}^{2}

Simplex weights, regularised toward equal — equal weights won
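The regularised objective can be illustrated with a coarse grid search over the simplex. This is a sketch under assumptions: the names `penalised_objective` and `fit_weights`, the grid resolution, and especially the toy loss `toy_oof_rps` are hypothetical stand-ins. A real fit would replace the toy loss with the out-of-fold RPS of the pooled forecast, and the source does not say which optimiser was used.

```python
def penalised_objective(w, oof_rps, gamma):
    """Out-of-fold loss plus gamma times the squared distance to equal weights."""
    k = len(w)
    penalty = sum((wi - 1.0 / k) ** 2 for wi in w)
    return oof_rps(w) + gamma * penalty


def fit_weights(oof_rps, gamma, n_steps=30):
    """Search a regular grid on the 3-kernel simplex for the best weights."""
    best_w, best_val = None, float("inf")
    for i in range(n_steps + 1):
        for j in range(n_steps + 1 - i):
            # Every candidate is non-negative and sums to one by construction.
            w = (i / n_steps, j / n_steps, (n_steps - i - j) / n_steps)
            val = penalised_objective(w, oof_rps, gamma)
            if val < best_val:
                best_w, best_val = w, val
    return best_w


def toy_oof_rps(w):
    """Hypothetical loss surface, NOT real data: mildly prefers w[0] = 0.6."""
    return 0.19 + 0.01 * (w[0] - 0.6) ** 2


w_unregularised = fit_weights(toy_oof_rps, gamma=0.0)
w_regularised = fit_weights(toy_oof_rps, gamma=10.0)
```

On this toy surface the unregularised search follows the loss to `w[0] = 0.6`, while a large γ pulls the solution back to one third each. The penalty is exactly zero at equal weights, so a stack only moves away from them when the out-of-fold gain outweighs the penalty.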

04 The score being minimised

RPS — the ranked probability score over the ordered outcomes home, draw, away — penalises probability mass by how far it sits from what happened: a near-miss draw forecast beats a confident wrong winner. It is the primary score everywhere on this site; lower is better.

RPS = \frac{1}{2} \sum_{k=1}^{2} \left( \sum_{j \leq k} (p_{j} - o_{j}) \right)^{2}, \quad (p_{1}, p_{2}, p_{3}) = (p_{H}, p_{D}, p_{A})

The ranked probability score on the ordered H/D/A outcome
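A short worked example makes the ordering penalty concrete. The sketch below implements the score for a single match with outcomes ordered home, draw, away; the function name `rps` and the two example forecasts are illustrative assumptions, not values from the backtest.

```python
def rps(probs, outcome_index):
    """Ranked probability score for one match.

    probs         : forecast probabilities in the order (home, draw, away).
    outcome_index : 0 for a home win, 1 for a draw, 2 for an away win.
    """
    cum_forecast = 0.0
    cum_outcome = 0.0
    total = 0.0
    # Compare cumulative forecast and cumulative outcome at each of the
    # first two thresholds; the last one is always 1 versus 1.
    for j in range(len(probs) - 1):
        cum_forecast += probs[j]
        cum_outcome += 1.0 if j == outcome_index else 0.0
        total += (cum_forecast - cum_outcome) ** 2
    return total / (len(probs) - 1)


# The match ends in a home win (index 0). Both forecasts give it 0.2.
near_miss_draw = rps((0.2, 0.6, 0.2), 0)    # mass sits on the adjacent outcome
confident_wrong = rps((0.2, 0.2, 0.6), 0)   # mass sits on the far outcome
perfect = rps((1.0, 0.0, 0.0), 0)
```

Both forecasts put the same 0.2 on the actual result, yet the draw-heavy one scores 0.34 against 0.50 for the away-heavy one, because its misplaced mass sits one step from the truth rather than two. A perfect forecast scores 0.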

05 The market blend

The de-vigged bookmaker consensus is treated as one more expert. The convex blend weight is learned on the log-odds scale, fitted strictly out-of-sample — chosen on two odds-carrying tournaments, tested on the third. A weight above zero with a confidence interval excluding zero is the strongest available evidence that the model carries information the market does not fully price; the fitted value is reported below.

p_{blend}(w) = w \, p_{model} + (1 - w) \, p_{market} \quad \text{(log-odds scale)}, \qquad w^{*} = \arg\min_{w} RPS(p_{blend}, \text{actual})

The convex model–market blend and its fitted weight
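One way to read a convex blend on the log-odds scale for three outcomes is to mix log-probabilities and renormalise, which is equivalent to mixing log-odds against any reference outcome. The sketch below takes that reading. It is an assumption, as are the proportional de-vigging rule, the function names `devig` and `blend`, and the example odds and model probabilities; the source does not specify these details.

```python
import math


def devig(decimal_odds):
    """Strip the bookmaker margin by proportional normalisation (assumed method)."""
    implied = [1.0 / o for o in decimal_odds]
    overround = sum(implied)
    return [p / overround for p in implied]


def blend(p_model, p_market, w):
    """Blend two H/D/A forecasts on the log scale with model weight w."""
    logits = [
        w * math.log(pm) + (1.0 - w) * math.log(pk)
        for pm, pk in zip(p_model, p_market)
    ]
    # Softmax with a max shift turns blended log-values back into probabilities.
    shift = max(logits)
    unnorm = [math.exp(v - shift) for v in logits]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical inputs: bookmaker decimal odds and a model forecast.
p_market = devig([2.10, 3.40, 3.60])
p_model = [0.50, 0.27, 0.23]
p_blend = blend(p_model, p_market, w=0.66)
```

Setting `w = 1` returns the model forecast and `w = 0` returns the de-vigged market, so the fitted weight measures how far the best blend sits from the market alone. Fitting it would be a one-dimensional search for the `w` that minimises mean RPS on the training tournaments, evaluated on the held-out one.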

Symbol key

$P_{k} (x, y)$: kernel k’s renormalized scoreline-grid probability
$K$: the number of kernels pooled (3: Dixon–Coles, Bayesian, LightGBM)
$w_{k}$: kernel k’s weight on the simplex — the production pool is equal-weight
$P_{ens} (x, y)$: the pooled grid — a weighted geometric mean, renormalized to sum to one
$Z$: the normalising constant of the pool
$Δ^{2}$: the probability simplex the weights live on
$γ$: the regularisation strength pulling weights toward equal
$o_{j}$: the outcome indicator — 1 for the result that happened, 0 otherwise
$w^{*}$: the convex model–market blend weight, fit out-of-sample

§ 03

What data it uses

The inputs this model reads — and only these.

Renormalized scoreline grids of the Dixon-Coles, Bayesian and LightGBM kernels
Out-of-fold RPS to fit the simplex weights (regularized toward equal)

§ 04

How it works

A schematic of the model wired end to end.

Fig. M·Ensemble Conceptual schematic

Ensemble — wired end to end

Source · Oxford Football Forecasting model · structural diagram, not a data plot

§ 05

Out-of-sample skill

Where this model lands between the Elo floor and the market ceiling, on both backtest protocols.

Fig. V11 Lower is better · floor = Elo-only · ceiling = de-vigged market

OOS RPS — expanding (headline) and LOTO (optimistic)

On the headline expanding window this model scores 0.1891: 0.0046 below the Elo floor (0.1938) and 0.0014 below the market ceiling (0.1905).

Expanding 0.1891

LOTO 0.1830

Bar fills to the model’s RPS on the floor–ceiling axis; the whisker on the expanding bar is the conservative 95% CI (0.1607–0.2164). Lower (left) is better.

It clears the Elo floor; the gap to the market is small and — at n = 3 — inside the bootstrap interval.
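The difference between the two backtest protocols can be sketched as fold generators. The block labels (`history`, `T1`, `T2`, `T3`), the function names, and the assumption that pre-tournament history is available as a training block are all hypothetical; the sketch only illustrates why LOTO is optimistic.

```python
def expanding_folds(blocks, test_blocks):
    """Expanding window: train only on blocks strictly earlier than the test block.

    blocks is a chronologically ordered list of data blocks.
    """
    return [(blocks[: blocks.index(t)], t) for t in test_blocks]


def loto_folds(blocks, test_blocks):
    """Leave-one-tournament-out: train on every other block, earlier or later."""
    return [([b for b in blocks if b != t], t) for t in test_blocks]


def leaks_future(blocks, train, test):
    """True if any training block comes after the test block in time."""
    return any(blocks.index(b) > blocks.index(test) for b in train)


# Hypothetical chronology: earlier match history, then three tournaments.
blocks = ["history", "T1", "T2", "T3"]
tournaments = ["T1", "T2", "T3"]

expanding = expanding_folds(blocks, tournaments)
loto = loto_folds(blocks, tournaments)
```

Every expanding fold trains on the past only, which is what a forecaster could have done at the time. The LOTO folds for `T1` and `T2` include later tournaments in training, which is the leak that makes the LOTO score an optimistic ceiling rather than the headline number.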

Source · Oxford Football Forecasting model · Bookmaker consensus (de-vigged closing odds) · 152 matches · 3 tournaments

§ 06

Why simple beat fancy

The forecast-combination puzzle, and the market-blend the model genuinely earns.

A stacked meta-learner (a regularized, non-negative, sum-to-one combiner fit by nested cross-validation) was run against a plain equal-weight average. The stack scored worse out-of-sample by +0.0021 RPS, a gap just inside the ±0.0022 bootstrap standard error and so indistinguishable from noise. The production combiner is therefore the simple average: the well-known forecast-combination puzzle, reported as a clean negative rather than re-tuned until something won on three folds.
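The comparison against the bootstrap standard error can be sketched as follows. This is a hedged illustration: the source does not say what unit was resampled, so resampling per-match score differences is an assumption, and the function names and the `toy_diffs` values are hypothetical. Only the 0.0021 and 0.0022 figures come from the text.

```python
import random
import statistics


def paired_bootstrap_se(diffs, n_resamples=2000, seed=0):
    """Bootstrap standard error of the mean of paired score differences.

    diffs holds one value per match: RPS(stacking) minus RPS(simple average).
    """
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample matches with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.stdev(means)


def within_noise(mean_diff, se):
    """True when the observed gap is no larger than one standard error."""
    return abs(mean_diff) <= se


# Hypothetical per-match differences, for illustration only.
toy_diffs = [0.03, -0.02, 0.01, -0.01, 0.02, -0.03, 0.00, 0.01]
toy_se = paired_bootstrap_se(toy_diffs)

# The reported figures: a +0.0021 gap against a 0.0022 standard error.
reported_gap_is_noise = within_noise(0.0021, 0.0022)
```

With the reported figures the check returns `True`: the gap is smaller than one standard error, so the data cannot distinguish the stack from the simple average, and the simpler combiner is the defensible choice.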

Where the ensemble does add value is against the market. Treating the de-vigged closing line as one more expert and learning the convex blend gives a weight w* = 0.66 on the model, with a CI of [0.20, 1.11] that excludes zero. The model carries marginal information beyond the market — the strongest available evidence the features are real — even though it does not beat the market outright.

0.1891

Simple average · OOS RPS

the production combiner (adopted)

0.1912

Stacking · OOS RPS

+0.0021 vs simple — inside ±0.0022 SE

0.66

Market-blend weight w*

CI [0.20, 1.11] excludes 0

§ 07

Strengths & limits

What this model is good for — and where it is weak.

Strengths

Headline OOS RPS 0.1891 (expanding) — matches the market
Coherent totals/handicap (grid-level pooling)
Simple-average robustness beat stacking out-of-sample

Limits

Matches the market, does not beat it (CI vs market includes 0)
n=3 tournaments — every CI is wide by design
Stacking not adopted (no out-of-sample gain beyond bootstrap SE)