Shrinkage · partial pooling for newcomers

Bayesian hierarchical — newcomers borrow strength

The same Poisson goal core, but each team’s attack and defence is latent, with a prior centred on what its history and squad value imply. A short-history qualifier is pulled toward that informed prior with calibrated uncertainty instead of an overconfident guess.

0.1927

OOS RPS · expanding

the headline skill (realism protocol)

0.1905

OOS RPS · LOTO

optimistic ceiling (leaks future folds)

Leaderboard rank

of 7 · CI 0.1805–0.2050

§ 01

The intuition

In plain English, before any mathematics.

Same Poisson goal core, but each team's attack/defence is a latent quantity with a prior centred on what its Elo (history) and squad value imply. A short-history newcomer with a near-flat likelihood is pulled toward that covariate-informed prior with calibrated uncertainty (James-Stein shrinkage) instead of an overconfident guess. It also emits the decoupling g as a posterior quantity with a credible interval.

§ 02

Mathematical specification

The same Poisson goal kernel, with attack and defence promoted to latent quantities under a covariate-informed hierarchical prior. Every prior is stated; the amount of pooling is learned, not hand-tuned.

att_{i} \sim N (a_{0} + κ_{a} \tilde{h}_{i} + β_{a}^{⊤} \tilde{s}_{i} + δ_{a}^{⊤} \tilde{c}_{i}, σ_{att}^{2}), Y^{home} \sim Poisson (λ), Y^{away} \sim Poisson (λ^{'})

The hierarchical prior on latent strength, over the Poisson match likelihood

01 The match likelihood

The emission is the Dixon–Coles kernel — two log-linear Poisson rates with the low-score correction — fitted at match resolution on the 49,445-match international record. What changes is where the strengths come from.

Y^{home} \sim Poisson (λ), Y^{away} \sim Poisson (λ^{'}), \log λ = μ + h_{adv} 1_{host} + att_{i} - def_{j}

The Poisson match likelihood under the latent strengths
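
As a rough illustration of the likelihood above, the following Python sketch turns latent strengths into the two Poisson rates and then into home/draw/away probabilities. The function names, the goal cap, and the mirrored form of the away rate (log λ' = μ + att_j − def_i) are assumptions made for this example, and the low-score correction is deliberately left out.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k goals under a Poisson rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def match_rates(mu, h_adv, att_i, def_i, att_j, def_j, is_host):
    """Return (home_rate, away_rate) from the log-linear form.

    Home rate: log(lambda) = mu + h_adv * 1_host + att_i - def_j.
    Away rate: ASSUMED to mirror it without the host term.
    """
    log_home = mu + (h_adv if is_host else 0.0) + att_i - def_j
    log_away = mu + att_j - def_i
    return math.exp(log_home), math.exp(log_away)


def outcome_probs(home_rate, away_rate, max_goals=12):
    """Sum independent Poisson score probabilities into (home, draw, away).

    Omits the Dixon-Coles low-score correction and renormalises the
    truncated score grid so the three probabilities sum to one.
    """
    home = draw = away = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = poisson_pmf(x, home_rate) * poisson_pmf(y, away_rate)
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    total = home + draw + away
    return home / total, draw / total, away / total


# Two evenly matched teams; mu and h_adv are set to their prior means.
rates = match_rates(math.log(1.35), 0.25, 0.0, 0.0, 0.0, 0.0, is_host=True)
probs = outcome_probs(*rates)
```

With equal strengths, the only asymmetry between the two rates is the host term, so the home-win probability exceeds the away-win probability.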

02 The hierarchical prior

History is the prior mean; squad covariates shift strength off that mean; context must earn its way in. A team’s attack is drawn from a Normal centred on what its Elo and squad imply, with a learned pooling scale — the structural reason newcomers are handled by construction.

att_{i} \sim N (a_{0} + κ_{a} \tilde{h}_{i} + β_{a}^{⊤} \tilde{s}_{i} + δ_{a}^{⊤} \tilde{c}_{i}, σ_{att}^{2}), def_{i} \sim N (d_{0} + κ_{d} \tilde{h}_{i} + β_{d}^{⊤} \tilde{s}_{i} + δ_{d}^{⊤} \tilde{c}_{i}, σ_{def}^{2})

Latent attack and defence under the covariate-informed prior
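
A minimal sketch of how the prior mean in the formula above is assembled and how one latent strength could be drawn from it. All numeric values, and the names `prior_mean` and `draw_latent_strength`, are hypothetical and chosen only to make the arithmetic visible; in the model itself the loadings are inferred, not fixed.

```python
import random


def prior_mean(intercept, kappa, h, beta, s, delta, c):
    """Covariate-informed prior mean: intercept + kappa*h + beta.s + delta.c."""
    squad_term = sum(b * x for b, x in zip(beta, s))
    context_term = sum(d * x for d, x in zip(delta, c))
    return intercept + kappa * h + squad_term + context_term


def draw_latent_strength(mean, sigma, rng):
    """One draw of a latent strength from Normal(mean, sigma^2)."""
    return rng.gauss(mean, sigma)


# Hypothetical standardised covariates for one team (illustrative only):
# below-average history, above-average squad value, context switched off.
m_att = prior_mean(
    intercept=0.0, kappa=0.4, h=-0.5,
    beta=[0.2, 0.0, 0.0], s=[1.5, 0.3, -0.2],
    delta=[0.0, 0.0], c=[0.5, -1.0],
)
rng = random.Random(0)
draws = [draw_latent_strength(m_att, 0.3, rng) for _ in range(5)]
```

Here the history term contributes −0.2 and the squad term +0.3, so the prior mean is 0.1 before any match data are seen.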

03 The stated priors

Every hyper-prior is declared. The history loading κ has a ridge (zero-mean Normal) prior wide enough for history to matter, since history is the prior mean. The context loading δ has a tight ridge: off unless the data insist. The pooling scales are half-Normal, and the dependence parameter ρ is clipped to ±0.2, matching the Dixon–Coles clip.

μ \sim N (\log 1.35, 0.25), h_{adv} \sim N (0.25, 0.1), a_{0}, d_{0} \sim N (0, 0.5), κ \sim N (0, 0.5), δ \sim N (0, 0.1), σ_{att}, σ_{def} \sim HalfNormal (0.3), ρ \sim N (0, 0.1) clipped to \pm 0.2

The declared hyper-priors of the hierarchy

04 The horseshoe on squad covariates

The prior expectation is that only a handful of the squad covariates genuinely matter. The regularised horseshoe encodes exactly that: local scales let a few loadings escape to full size while the rest are pinned near zero, and the slab bounds the survivors. The global scale is set small (τ₀ = 0.05) to match a prior guess of three to five active features.

β_{k} ∣ λ_{k}, τ, c \sim N (0, τ^{2} \tilde{λ}_{k}^{2}), \tilde{λ}_{k}^{2} = \frac{c^{2} λ_{k}^{2}}{c^{2} + τ^{2} λ_{k}^{2}}, λ_{k} \sim C^{+} (0, 1), τ \sim C^{+} (0, 0.05)

The regularised horseshoe prior on the squad loadings β
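
The effect of the regularised local scale can be checked numerically. The sketch below evaluates the prior standard deviation of a loading, τ·λ̃_k, for a typical and a very large local scale. The slab scale c = 1.0 is an assumption, since its value is not stated here, and the function names are illustrative.

```python
import math


def regularised_local_scale_sq(lam, tau, c):
    """lambda-tilde squared = c^2 lam^2 / (c^2 + tau^2 lam^2)."""
    return (c ** 2 * lam ** 2) / (c ** 2 + tau ** 2 * lam ** 2)


def loading_prior_sd(lam, tau, c):
    """Prior standard deviation of beta_k, i.e. tau * lambda-tilde."""
    return tau * math.sqrt(regularised_local_scale_sq(lam, tau, c))


TAU = 0.05   # global scale, as given in the text
SLAB = 1.0   # slab scale c: ASSUMED for illustration only

pinned = loading_prior_sd(1.0, TAU, SLAB)      # typical local scale
escaped = loading_prior_sd(1.0e4, TAU, SLAB)   # very large local scale
```

For a typical local scale the prior standard deviation stays close to τ, so the loading is pinned near zero; for a very large local scale it approaches the slab scale c from below and never exceeds it.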

05 Coverage-aware measurement error

Club-form coverage is itself a strength signal — a naive model reads “little data” as “weak team”, which is circular. The fix is a measurement-error layer: observed squad form is the true signal plus noise whose variance grows as coverage falls, so thinly-observed teams shrink automatically toward their value-implied prior.

s_{i}^{form,obs} = s_{i}^{form} + ζ_{i}, ζ_{i} \sim N (0, σ_{form}^{2} / cov_{i}), σ_{form} \sim HalfNormal (0.2)

Squad form enters with coverage-scaled error
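
To see how coverage changes the weight given to observed form, the sketch below computes the noise standard deviation σ_form / √cov_i and the resulting reliability in a simple Normal-Normal setting. The coverage range (0, 1], the true-signal scale `signal_sd`, and the plug-in value σ_form = 0.2 are assumptions for illustration, not values reported by the model.

```python
import math


def form_noise_sd(sigma_form, coverage):
    """Standard deviation of zeta_i, whose variance is sigma_form^2 / cov_i."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage is assumed to lie in (0, 1]")
    return sigma_form / math.sqrt(coverage)


def form_reliability(signal_sd, sigma_form, coverage):
    """Share of observed-form variance that is true signal (Normal-Normal case)."""
    noise_var = form_noise_sd(sigma_form, coverage) ** 2
    return signal_sd ** 2 / (signal_sd ** 2 + noise_var)


# Same hypothetical team signal, observed at full and at 10% coverage.
full = form_reliability(signal_sd=0.4, sigma_form=0.2, coverage=1.0)
thin = form_reliability(signal_sd=0.4, sigma_form=0.2, coverage=0.1)
```

Under these made-up numbers the reliability drops from 0.8 at full coverage to roughly 0.29 at 10% coverage, which is the mechanism by which thinly observed teams lean on their prior.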

06 What shrinkage does

In the conjugate Normal approximation the posterior is a precision-weighted compromise between a team’s own record and its prior mean. Data-rich powerhouses keep their maximum-likelihood estimate — the model nests Dixon–Coles — while a short-history qualifier, whose likelihood is nearly flat, is pulled to what its Elo and squad value imply. That is the James–Stein argument: shrinkage strictly dominates the noisy per-team estimate in total risk.

E [att_{i} ∣ data] \approx ω_{i} att_{i}^{ML} + (1 - ω_{i}) m_{i}, ω_{i} = \frac{n_{i}}{n_{i} + σ_{ε}^{2} / σ_{att}^{2}}

Partial pooling — the posterior as a precision-weighted average
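
The pooling weight is easy to evaluate by hand. The sketch below compares a newcomer with 4 matches against a team with 200, using hypothetical values σ_ε = 1.0 and σ_att = 0.3; neither is a fitted quantity, and the function names are illustrative.

```python
def pooling_weight(n_matches, sigma_eps, sigma_att):
    """omega_i = n_i / (n_i + sigma_eps^2 / sigma_att^2)."""
    return n_matches / (n_matches + sigma_eps ** 2 / sigma_att ** 2)


def shrunk_estimate(att_ml, prior_mean, weight):
    """Precision-weighted compromise between the MLE and the prior mean."""
    return weight * att_ml + (1.0 - weight) * prior_mean


# Hypothetical scales: sigma_eps = 1.0, sigma_att = 0.3.
w_new = pooling_weight(4, 1.0, 0.3)      # short-history newcomer
w_rich = pooling_weight(200, 1.0, 0.3)   # data-rich team

# A noisy newcomer MLE of 0.9 against a prior mean of 0.1.
att_new = shrunk_estimate(0.9, 0.1, w_new)
```

With these numbers the newcomer's weight is about 0.26, so its estimate lands near 0.31, much closer to the prior mean than to its own noisy record, while the data-rich team's weight is about 0.95.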

07 The decoupling residual

History and squad value correlate at 0.82, so a structural “history X%, squad Y%” split is not identified. The decoupling g is therefore defined as a projection residual — squad strength minus what history predicts — which stays well-conditioned even when the individual loadings are not, and ships with a posterior standard deviation.

g_{i} = \tilde{s}_{i} - E [\tilde{s}_{i} ∣ \tilde{h}_{i}]

g — squad quality above or below what history implies
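
A point-estimate version of the residual can be sketched with an ordinary least-squares projection. Treating E[s | h] as a straight line is an assumption of this example, the five data points are made up, and the model-based g described above is a posterior quantity rather than this plug-in calculation.

```python
def decoupling_residuals(history, squad):
    """Residual of squad strength after a least-squares projection on history.

    A straight-line fit stands in for E[s | h]; that linearity is an
    assumption of this sketch.
    """
    n = len(history)
    mean_h = sum(history) / n
    mean_s = sum(squad) / n
    var_h = sum((h - mean_h) ** 2 for h in history)
    cov_hs = sum((h - mean_h) * (s - mean_s) for h, s in zip(history, squad))
    slope = cov_hs / var_h
    intercept = mean_s - slope * mean_h
    return [s - (intercept + slope * h) for h, s in zip(history, squad)]


# Made-up standardised history and squad indices for five teams.
h_tilde = [-1.2, -0.4, 0.0, 0.5, 1.1]
s_tilde = [-0.6, -0.7, 0.2, 0.3, 0.8]
g = decoupling_residuals(h_tilde, s_tilde)
```

By construction the residuals sum to zero and are uncorrelated with history, which is why g stays well-conditioned even when the separate loadings on history and squad are not.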

Symbol key

$att_{i}, def_{i}$: team i’s latent attack and defence — the quantities with a prior
$a_{0}, d_{0}$: the global attack and defence intercepts
$κ \tilde{h}_{i}$: history loading × the standardized history index (Elo + national-team form)
$β^{⊤} \tilde{s}_{i}$: squad covariates (value, club form, age structure) under the horseshoe prior
$δ^{⊤} \tilde{c}_{i}$: context covariates (travel, heat, fitness) under a tight ridge
$σ_{att}, σ_{def}$: the pooling scales — how far teams may stray from the prior (learned)
$λ_{k}, τ$: horseshoe local and global shrinkage scales — a few covariates escape, the rest pin to zero
$c^{2}$: the horseshoe slab variance — bounds the covariates that do escape
$ζ_{i}$: measurement error on observed club form, inflated where data coverage is thin
$ω_{i}$: the pooling weight — how much team i’s posterior trusts its own matches vs the prior
$g_{i}$: the decoupling residual — squad strength above or below what history predicts

§ 03

What data it uses

The inputs this model reads — and only these.

49,445 international results (the match likelihood)
History index h (Elo + national-team form, one combined component)
Squad covariates s (value, club form, age) under a horseshoe prior
Coverage-scaled measurement error on squad form

§ 04

How it works

A schematic of the model wired end to end.

Fig. M·Bayesian Conceptual schematic

Bayesian hierarchical — wired end to end

Source · Oxford Football Forecasting model · structural diagram, not a data plot

§ 05

Out-of-sample skill

Where this model lands between the Elo floor and the market ceiling, on both backtest protocols.

Fig. V11 Lower is better · floor = Elo-only · ceiling = de-vigged market

OOS RPS — expanding (headline) and LOTO (optimistic)

On the headline expanding window this model scores 0.1927: −0.0010 versus the Elo floor (0.1938) and +0.0022 versus the market ceiling (0.1905).

Expanding 0.1927

LOTO 0.1905

Bar fills to the model’s RPS on the floor–ceiling axis; the whisker on the expanding bar is the conservative 95% CI (0.1805–0.2050). Lower (left) is better.

It clears the Elo floor; the gap to the market is small and, with only n = 3 tournaments, sits inside the bootstrap interval.

Source · Oxford Football Forecasting model · Bookmaker consensus (de-vigged closing odds) · 152 matches · 3 tournaments
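
For readers unfamiliar with the metric, the sketch below computes the ranked probability score for a single match using the standard definition over the ordered outcomes (home, draw, away), and averages it across matches. Whether the backtest uses exactly this normalisation is an assumption; the function names are illustrative.

```python
def ranked_probability_score(probs, outcome_index):
    """RPS for one match over ordered outcomes (home, draw, away).

    Mean squared difference between cumulative forecast and cumulative
    observed outcome, taken over the first r - 1 categories.
    """
    n_outcomes = len(probs)
    cum_forecast = 0.0
    cum_observed = 0.0
    total = 0.0
    for k in range(n_outcomes - 1):
        cum_forecast += probs[k]
        cum_observed += 1.0 if k == outcome_index else 0.0
        total += (cum_forecast - cum_observed) ** 2
    return total / (n_outcomes - 1)


def mean_rps(forecasts, outcomes):
    """Average RPS over a list of matches; lower is better."""
    scores = [ranked_probability_score(p, o) for p, o in zip(forecasts, outcomes)]
    return sum(scores) / len(scores)


# Forecast 50/30/20 and the home side wins (index 0).
example = ranked_probability_score([0.5, 0.3, 0.2], 0)
```

For this forecast the cumulative errors are 0.5 and 0.2, so the score is (0.25 + 0.04) / 2 = 0.145. A perfect forecast scores 0 and a maximally wrong one scores 1.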

§ 06

Partial pooling, and the decoupling g

Why a 48-team field with debutants needs shrinkage — and what the model says about history vs squad.

A World Cup with 48 teams includes nations the international record has barely seen. A per-team maximum-likelihood estimate of their attack and defence is unbiased but wildly noisy, because the likelihood over a handful of matches is nearly flat. The hierarchical prior fixes this the way James–Stein shrinkage does: it pulls each team’s strength toward what its Elo and squad value imply (the prior mean), by an amount learned from the data. On data-rich powerhouses the posterior all but equals the MLE, so the model nests Dixon–Coles and cannot do meaningfully worse; on newcomers the shrunk estimate strictly dominates the noisy one in total risk.

The same machinery emits the project’s scientific target, the decoupling g: how far a team’s squad quality sits above or below what its history predicts. Because history and squad value are badly collinear (they correlate at 0.82), g is reported as a well-conditioned projection residual with a credible interval, not as a fragile “history X%, squad Y%” split, which the data cannot identify.

+0.031

Decoupling slope b — squad-above-record g on stage reached

tournament-clustered SE 0.173 · n = 118 team-tournaments

includes 0

95% CI on the slope

[−0.31, +0.55] — not significant at n = 3

0.23

g vs pre-baked gap (sanity corr)

the model-based g aligns with the engineered residual

Reading g

The slope is positive, meaning teams whose squad value runs ahead of their history tend to over-perform their Elo-implied stage, but the 95% interval [−0.31, +0.55] includes zero. With only 3 tournaments of out-of-sample history, the direction is suggestive and the effect is not statistically resolved. It is surfaced as a measured posterior quantity with its uncertainty, never as a confident structural decomposition.

§ 07

Strengths & limits

What this model is good for — and where it is weak.

Strengths

Handles 48-team newcomers by construction (partial pooling)
Quantifies the decoupling g with a credible interval
Pooling strength is learned, not hand-tuned

Limits

History h and squad value s are weakly identified (corr 0.82)
Posterior corr(κ, β) reported, not a structural % split
Heavier to fit (NUTS / MAP fallback)