Features · the de-biasing win

LightGBM-Poisson — flexible features, globally fair

A gradient-boosted Poisson goal model that learns non-linear effects of 36 feature differences, with monotone constraints encoding football priors. This is where the de-biasing lands: a genuinely global squad signal moved it from worst to best-non-market.

0.1921

OOS RPS · expanding

the headline skill (realism protocol)

0.1913

OOS RPS · LOTO

optimistic ceiling (leaks future folds)

Leaderboard rank

of 7 · CI 0.1569–0.2230

§ 01

The intuition

In plain English, before any mathematics.

A gradient-boosted Poisson goal model that learns flexible, non-linear effects of the 36 engineered feature differences (x_i - x_j), with monotone constraints encoding football priors (more squad value should not lower expected goals). This is the model where the de-biasing win lands: switching the squad signal from Understat top-5 to the global API-Football panel moved it from worst to best-non-market.

§ 02

Mathematical specification

A boosted-tree stand-in for the linear goal rate: the same Poisson target, flexible feature effects, and football priors enforced as monotone constraints.

$\log \mathbb{E}[\text{goals}_{ij}] = f(x_{i} - x_{j}, \text{context}), \quad f = \text{boosted trees}, \quad \text{objective} = \text{poisson}$

A boosted-tree log-goal-rate on the feature differences

01 The model

One booster, applied twice per fixture: reading the features from i’s perspective gives i’s goal rate, reading them from j’s gives the reply. Features enter as team differences; venue context — the heat-by-climate-gap interaction and the host flag — enters at the reference-team level, because heat hurts the unacclimatised side in absolute terms.

$\log \mathbb{E}[Y_{i \to j}] = F(z_{ij}), \quad F = \sum_{t=1}^{T} ν f_{t}, \quad λ_{ij} = e^{F(z_{ij})}, \quad μ_{ij} = e^{F(z_{ji})}$

A Siamese boosted-tree log goal rate, applied in both directions
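The two-direction application can be sketched as follows. This is a minimal illustration, not the model's code: `log_rate` is a hypothetical stand-in for the trained booster F, and the feature names and weights are invented for the example. Only the structure follows the description above: differences flip sign when the perspective flips, while venue context is read for the reference team.

```python
import math

# Hypothetical stand-in for the trained booster F: any function from an
# input dict z to a log goal rate. The weights are illustrative only.
def log_rate(z):
    return 0.2 + 0.4 * z["elo_diff"] + 0.3 * z["value_diff"] - 0.5 * z["heat_gap"]

def build_input(ref, opp):
    """Assemble z for the reference team: differences plus its own venue context."""
    return {
        "elo_diff": ref["elo"] - opp["elo"],
        "value_diff": ref["value"] - opp["value"],
        # Venue context is read for the reference team, not as a difference.
        "heat_gap": ref["heat_gap"],
    }

def goal_rates(team_i, team_j):
    """Apply the same scorer in both directions and return (lambda_ij, mu_ij)."""
    lam = math.exp(log_rate(build_input(team_i, team_j)))
    mu = math.exp(log_rate(build_input(team_j, team_i)))
    return lam, mu

# Illustrative standardised inputs (not data from the backtest).
team_i = {"elo": 0.5, "value": 0.8, "heat_gap": 0.0}
team_j = {"elo": -0.2, "value": 0.1, "heat_gap": 1.0}
lam, mu = goal_rates(team_i, team_j)
```

Because a single function serves both directions, every fixture contributes two training rows, one per perspective.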

02 The objective

Trees are grown on the Poisson likelihood itself — the natural loss for goal counts — never on win/draw/loss labels, which would discard the count information the simulator needs. Each boosting round fits the gradient of the deviance with its curvature as weights.

$L = \sum_{m} \left( e^{F(z_{m})} - y_{m} F(z_{m}) \right), \quad g_{m} = e^{F(z_{m})} - y_{m}, \quad h_{m} = e^{F(z_{m})}$

The Poisson loss (the deviance up to a factor of 2 and terms constant in F), with its gradient and curvature
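The gradient and curvature above can be checked numerically. The sketch below is a plain-Python illustration with made-up scores and goal counts, not the library's internal implementation; it compares the analytic gradient with a finite difference of the loss.

```python
import math

def poisson_loss(F, y):
    """Negative Poisson log-likelihood in the raw score F, dropping the log(y!) term."""
    return sum(math.exp(f) - yi * f for f, yi in zip(F, y))

def poisson_grad_hess(F, y):
    """Per-observation gradient g_m and curvature h_m with respect to F(z_m)."""
    grad = [math.exp(f) - yi for f, yi in zip(F, y)]
    hess = [math.exp(f) for f in F]
    return grad, hess

# Illustrative raw scores and goal counts (not data from the backtest).
F = [0.0, 0.5, -0.3]
y = [1, 2, 0]
grad, hess = poisson_grad_hess(F, y)

# Central finite-difference check of the first gradient entry.
eps = 1e-6
F_plus = [F[0] + eps] + F[1:]
F_minus = [F[0] - eps] + F[1:]
numeric_g0 = (poisson_loss(F_plus, y) - poisson_loss(F_minus, y)) / (2 * eps)
```

The curvature is the predicted rate itself, so it is always positive and the loss is convex in F.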

03 Monotone constraints

The cheapest, highest-leverage regulariser at this sample size: directions football already knows are enforced, not learned. Raising the Elo, national-team form, squad-value, club-form or fitness advantage can only raise expected goals; a hotter venue for the less-acclimatised side can only lower them. The two decoupling gaps — form-versus-value and Elo-versus-value — are deliberately left unconstrained: their shape is the research question, and constraining them would assume the answer.

$z_{k} \uparrow \;\Rightarrow\; F(z) \uparrow \ (k \in M^{+}), \qquad z_{k} \uparrow \;\Rightarrow\; F(z) \downarrow \ (k \in M^{-})$

Sign constraints on the tree splits, feature by feature
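One way the constraint sets could be passed to LightGBM is sketched below. The feature names and their order are hypothetical (the real model has 36 engineered differences), and the `learning_rate` and `num_leaves` values are assumptions for illustration. The parameter keys themselves (`objective`, `monotone_constraints`, `learning_rate`, `num_leaves`) are standard LightGBM parameters.

```python
# Hypothetical feature order; the real model uses 36 engineered differences.
FEATURES = [
    "elo_diff", "nt_form_diff", "squad_value_diff", "club_form_diff",
    "fitness_diff", "heat_climate_gap", "form_vs_value_gap", "elo_vs_value_gap",
]

# M+ : raising the advantage may not lower the log goal rate.
INCREASING = {"elo_diff", "nt_form_diff", "squad_value_diff",
              "club_form_diff", "fitness_diff"}
# M- : a hotter venue for the less-acclimatised side may not raise it.
DECREASING = {"heat_climate_gap"}

def monotone_vector(features):
    """Map each feature to +1 (increasing), -1 (decreasing) or 0 (unconstrained)."""
    return [1 if f in INCREASING else -1 if f in DECREASING else 0
            for f in features]

params = {
    "objective": "poisson",
    "monotone_constraints": monotone_vector(FEATURES),
    "learning_rate": 0.05,  # assumed value, not from the source
    "num_leaves": 8,        # assumed value, not from the source
}
```

The two decoupling gaps map to 0, so the trees are free to learn their shape. A dictionary like this would be passed as the first argument to `lightgbm.train` together with a training dataset.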

04 The grid, with a fixed dependence

The booster learns strength only. The low-score dependence is one global nuisance, fixed at ρ₀ = −0.10 and applied to the assembled grid exactly as in Dixon–Coles — a clean separation of strength from dependence that a small sample cannot be trusted to learn jointly. Missing inputs route natively: a squad with no club-form data falls back to Elo and value at each split.

$P(x, y) = τ_{λ, μ}^{(ρ_{0})}(x, y) \, \frac{e^{-λ} λ^{x}}{x!} \, \frac{e^{-μ} μ^{y}}{y!}, \quad ρ_{0} = -0.10$

The boosted rates feed the same corrected scoreline grid
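The corrected grid can be sketched as below, using the standard Dixon–Coles form of τ, which adjusts only the four scorelines 0-0, 0-1, 1-0 and 1-1. The truncation at 10 goals, the renormalisation step, and the example rates are assumptions made for the illustration, not details taken from the source.

```python
import math

RHO_0 = -0.10  # fixed global low-score dependence, as stated above

def poisson_pmf(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)

def tau(x, y, lam, mu, rho):
    """Dixon-Coles low-score correction; equals 1 outside the four low scores."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0

def score_grid(lam, mu, rho=RHO_0, max_goals=10):
    """Scoreline probabilities P(x, y) on a truncated grid, renormalised to sum to 1."""
    grid = [[tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
             for y in range(max_goals + 1)]
            for x in range(max_goals + 1)]
    total = sum(sum(row) for row in grid)
    return [[p / total for p in row] for row in grid]

def outcome_probs(grid):
    """Collapse the grid to (win, draw, loss) for the team scoring x."""
    n = len(grid)
    win = sum(grid[x][y] for x in range(n) for y in range(n) if x > y)
    draw = sum(grid[x][x] for x in range(n))
    loss = sum(grid[x][y] for x in range(n) for y in range(n) if x < y)
    return win, draw, loss

# Illustrative goal rates (not fitted values).
grid = score_grid(1.6, 1.1)
win, draw, loss = outcome_probs(grid)
```

With a negative ρ₀ the 0-0 and 1-1 cells are inflated and the 1-0 and 0-1 cells deflated, so the draw probability rises relative to independent Poisson rates.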

Symbol key

$z_{ij}$: the model input: engineered feature differences x_i − x_j plus venue context
$F$: the boosted-tree ensemble — the log goal rate
$f_{t}, ν, T$: a single tree, the learning rate, and the number of boosting rounds
$λ_{ij}, μ_{ij}$: the two goal rates — one booster applied in both directions
$g_{m}, h_{m}$: the Poisson gradient and curvature each boosting round fits
$M^{+}, M^{-}$: the feature sets with increasing / decreasing monotone constraints
$ρ_{0}$: the fixed global low-score dependence (−0.10) — applied to the grid, not learned

§ 03

What data it uses

The inputs this model reads — and only these.

36 engineered team features (global squad panel), entered as differences
Monotone constraints on the Elo / value / form differences
Native missing-data routing + coverage flags

§ 04

How it works

A schematic of the model wired end to end.

Fig. M·LightGBM Conceptual schematic

LightGBM-Poisson (global) — wired end to end

Source · Oxford Football Forecasting model · structural diagram, not a data plot

§ 05

Out-of-sample skill

Where this model lands between the Elo floor and the market ceiling, on both backtest protocols.

Fig. V11 Lower is better · floor = Elo-only · ceiling = de-vigged market

OOS RPS — expanding (headline) and LOTO (optimistic)

On the headline expanding window this model scores 0.1921: 0.0017 below the Elo floor (0.1938) and 0.0015 above the market ceiling (0.1905). Lower is better, so it beats the floor and trails the market.

Expanding 0.1921

LOTO 0.1913

Bar fills to the model’s RPS on the floor–ceiling axis; the whisker on the expanding bar is the conservative 95% CI (0.1569–0.2230). Lower (left) is better.

It clears the Elo floor; the gap to the market is small and, with only n = 3 tournaments, inside the bootstrap interval.
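For reference, the score behind these figures can be sketched as follows. This assumes the standard ranked probability score over the ordered outcomes win, draw, loss; the source does not spell out its exact implementation, and the example forecasts are illustrative.

```python
def rps(probs, outcome_index):
    """Ranked probability score for one match.

    probs: forecast probabilities over ordered outcomes (win, draw, loss).
    outcome_index: position of the outcome that actually occurred.
    """
    r = len(probs)
    cum_p = 0.0   # cumulative forecast probability
    cum_o = 0.0   # cumulative observed indicator (0 until the outcome, then 1)
    total = 0.0
    for k in range(r - 1):
        cum_p += probs[k]
        cum_o += 1.0 if k == outcome_index else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (r - 1)

def mean_rps(forecasts, outcomes):
    """Average RPS over a set of matches."""
    return sum(rps(p, o) for p, o in zip(forecasts, outcomes)) / len(forecasts)

uniform = [1 / 3, 1 / 3, 1 / 3]
perfect = [1.0, 0.0, 0.0]
```

A perfect forecast scores 0; a uniform forecast scores 5/18 when the first outcome occurs and 1/9 when the draw occurs, because RPS penalises probability placed further from the observed outcome more heavily.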

Source · Oxford Football Forecasting model · Bookmaker consensus (de-vigged closing odds) · 152 matches · 3 tournaments

§ 06

The de-biasing — worst to best-non-market

The squad signal was Eurocentric. Making it global is what moved this model.

The squad-quality features were first built from top-5-European club data, which is sparse for non-European squads. That is a coverage confound: a naive model reads “low coverage” as “weak team,” which is circular and suppresses the squad signal exactly where it should add the most. Switching to a genuinely global club panel lifted average squad-form coverage from 47.8% to 85.0%, and moved this LightGBM model from worst to best-non-market on out-of-sample RPS.

Fig. V10 Weakest-first · Understat top-5 → global API-Football

Squad-form coverage, before → after, by confederation

The uplift is largest exactly where the bias bit hardest: OFC 15%→92%, AFC 17%→71%, while UEFA, already well covered, moves least (73%→92%).

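Confederation (squads) · before → after · uplift (pp)
OFC (n=1) · 15% → 92% · +77
AFC (n=9) · 17% → 71% · +54
CONCACAF (n=6) · 30% → 81% · +50
CAF (n=10) · 45% → 88% · +44
CONMEBOL (n=6) · 54% → 85% · +30
UEFA (n=16) · 73% → 92% · +19

before = top-5 only · after = global panel · last column = percentage-point uplift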

The fix is biggest off-UEFA and on low-coverage squads — directionally exactly as predicted. The GBM gains land there too, though they are not significant at n = 3.

Source · Understat · API-Football (global club coverage) · computed coverage fractions per squad
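As a consistency check, the per-confederation figures can be combined into an overall average. The sketch below assumes the headline averages are squad-count-weighted means of the rounded figures in the chart; the source does not state the weighting, so this is an assumption.

```python
# (confederation, squads n, coverage % before, coverage % after), from the figure.
ROWS = [
    ("OFC", 1, 15, 92),
    ("AFC", 9, 17, 71),
    ("CONCACAF", 6, 30, 81),
    ("CAF", 10, 45, 88),
    ("CONMEBOL", 6, 54, 85),
    ("UEFA", 16, 73, 92),
]

def weighted_mean(rows, column):
    """Squad-count-weighted mean of one coverage column (2 = before, 3 = after)."""
    total_squads = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_squads

before = weighted_mean(ROWS, 2)
after = weighted_mean(ROWS, 3)
```

Under that assumption the weighted means come out at about 47.7% and 85.0%, which matches the reported 47.8% → 85.0% to within the rounding of the per-confederation inputs.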

Was it coverage, or league-weighting? (the placebo)

The gain could have come from either seeing more players or re-weighting leagues. A placebo test that re-weighted leagues without adding coverage did not reproduce the lift — so coverage, not league-weighting, drove the result. Native missing-data routing means the trees still fall back to Elo and value when a squad’s club form is genuinely unobserved (it is missing for roughly 56% of player-rows overall).

§ 07

Strengths & limits

What this model is good for — and where it is weak.

Strengths

Captures non-linear feature interactions
Native missing-data routing (squad form missing ~56%)
The de-biasing A/B is cleanest here (placebo-controlled)

Limits

Gains off-UEFA / low-coverage are not significant at n=3
Less interpretable than the Bayesian prior structure
Needs careful monotone constraints to stay football-sane