WC 2026 · Forecasting Oxford Football Forecasting

Features · the de-biasing win

LightGBM-Poisson — flexible features, globally fair

A gradient-boosted Poisson goal model that learns non-linear effects of 36 feature differences, with monotone constraints encoding football priors. This is where the de-biasing lands: a genuinely global squad signal moved it from worst to best-non-market.

0.1921

OOS RPS · expanding

the headline skill (realism protocol)

0.1913

OOS RPS · LOTO

optimistic ceiling (leaks future folds)

#3

Leaderboard rank

of 7 · CI 0.1569–0.2230
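
The two protocols behind these scores differ only in which tournaments may be used for training. The sketch below is a minimal illustration, assuming three held-out tournaments with placeholder labels `T1` to `T3`; the labels and helper names are hypothetical and not taken from the project's code. The first expanding fold has no earlier tournament here; whatever earlier data the real pipeline trains on is outside this sketch.

```python
# Placeholder tournament labels in chronological order (hypothetical).
TOURNAMENTS = ["T1", "T2", "T3"]


def expanding_folds(tournaments):
    """Train only on tournaments strictly earlier than the test tournament."""
    return [(tournaments[:k], test) for k, test in enumerate(tournaments)]


def loto_folds(tournaments):
    """Leave-one-tournament-out: train on every other tournament,
    including later ones, which is why the score is an optimistic ceiling."""
    return [([t for t in tournaments if t != test], test) for test in tournaments]


for train, test in expanding_folds(TOURNAMENTS):
    print("expanding", train, "->", test)
for train, test in loto_folds(TOURNAMENTS):
    print("LOTO     ", train, "->", test)
```

Under LOTO the fold that tests on `T1` trains on `T2` and `T3`, which is the future leakage the label warns about.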

A gradient-boosted Poisson goal model that learns flexible, non-linear effects of the 36 engineered feature differences (x_i − x_j), with monotone constraints encoding football priors (for example, more squad value should not lower expected goals). This is the model where the de-biasing win lands: switching the squad signal from the Understat top-5 panel to the global API-Football panel moved it from the worst model to the best non-market model.

A boosted-tree log-goal-rate on the feature differences

01 The model

One booster, applied twice per fixture: reading the features from i’s perspective gives i’s goal rate, reading them from j’s gives the reply. Features enter as team differences; venue context — the heat-by-climate-gap interaction and the host flag — enters at the reference-team level, because heat hurts the unacclimatised side in absolute terms.

A Siamese boosted-tree log goal rate, applied in both directions
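
The "one booster, applied twice" idea can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the three team features, the venue values, and especially `log_rate`, which is a toy linear stand-in for the fitted tree ensemble rather than the model itself.

```python
import math

# Hypothetical per-team features; the real model uses 36 of them.
TEAM = {
    "i": {"elo": 1900.0, "squad_value": 8.5, "club_form": 0.62},
    "j": {"elo": 1750.0, "squad_value": 7.1, "club_form": 0.55},
}
# Venue context is attached to the reference team, not differenced.
VENUE = {
    "i": {"heat_x_climate_gap": 0.0, "is_host": 1.0},
    "j": {"heat_x_climate_gap": 4.0, "is_host": 0.0},
}


def model_input(ref, opp):
    """Differences x_ref - x_opp plus the reference team's venue context."""
    row = {k + "_diff": TEAM[ref][k] - TEAM[opp][k] for k in TEAM[ref]}
    row.update(VENUE[ref])
    return row


def log_rate(row):
    """Stand-in for the booster: a toy linear score, NOT the fitted trees."""
    return (0.2 + 0.002 * row["elo_diff"] + 0.1 * row["squad_value_diff"]
            + 0.5 * row["club_form_diff"] - 0.03 * row["heat_x_climate_gap"]
            + 0.1 * row["is_host"])


lam_i = math.exp(log_rate(model_input("i", "j")))  # i's goal rate
lam_j = math.exp(log_rate(model_input("j", "i")))  # the reply: j's goal rate
print(round(lam_i, 3), round(lam_j, 3))
```

Swapping the reference team flips the sign of every difference feature, while the venue terms are read from whichever side is the reference.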

02 The objective

Trees are grown on the Poisson likelihood itself — the natural loss for goal counts — never on win/draw/loss labels, which would discard the count information the simulator needs. Each boosting round fits the gradient of the deviance with its curvature as weights.

The Poisson deviance with its gradient and curvature
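
As a worked illustration, write f for the log goal rate. The per-match Poisson loss is then exp(f) − y·f up to a term that does not depend on f, its gradient is exp(f) − y, and its curvature is exp(f). The sketch below uses this textbook form on toy goal counts; library implementations may add numerical safeguards, and the function names are hypothetical.

```python
import math


def poisson_grad_hess(y, f):
    """Gradient and curvature of exp(f) - y*f with respect to f = log(rate)."""
    rate = math.exp(f)
    return rate - y, rate


def newton_leaf_value(goals, rounds=20):
    """Fit one constant log-rate by repeated Newton steps, as a single leaf would."""
    f = 0.0
    for _ in range(rounds):
        grads, hesss = zip(*(poisson_grad_hess(y, f) for y in goals))
        f -= sum(grads) / sum(hesss)  # curvature acts as the weight
    return f


goals = [0, 1, 2, 1, 3]  # toy goal counts, not project data
f_hat = newton_leaf_value(goals)
print(math.exp(f_hat))  # approaches the sample mean of the goal counts
```

With a single constant leaf the fitted rate converges to the mean goal count, which is the Poisson maximum-likelihood estimate; a real booster does the same thing leaf by leaf on the residual gradients.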

03 Monotone constraints

The cheapest, highest-leverage regulariser at this sample size: directions football already knows are enforced, not learned. Raising the Elo, national-team form, squad-value, club-form or fitness advantage can only raise expected goals; a hotter venue for the less-acclimatised side can only lower them. The two decoupling gaps — form-versus-value and Elo-versus-value — are deliberately left unconstrained: their shape is the research question, and constraining them would assume the answer.

Sign constraints on the tree splits, feature by feature
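
In LightGBM these directions are passed as one integer per feature through the `monotone_constraints` parameter, alongside `objective="poisson"`. The sketch below only builds that parameter dictionary; the feature names and their order are hypothetical, and the real model has 36 inputs rather than the eight shown.

```python
# Hypothetical feature names; the real model has 36 inputs.
FEATURES = [
    "elo_diff", "nt_form_diff", "squad_value_diff", "club_form_diff",
    "fitness_diff", "heat_x_climate_gap",
    "form_minus_value_gap", "elo_minus_value_gap",
]
INCREASING = {"elo_diff", "nt_form_diff", "squad_value_diff",
              "club_form_diff", "fitness_diff"}
DECREASING = {"heat_x_climate_gap"}


def constraint_vector(features):
    """+1 = can only raise the rate, -1 = can only lower it, 0 = unconstrained."""
    return [1 if f in INCREASING else -1 if f in DECREASING else 0
            for f in features]


params = {
    "objective": "poisson",
    "monotone_constraints": constraint_vector(FEATURES),
}
print(params["monotone_constraints"])  # [1, 1, 1, 1, 1, -1, 0, 0]
```

The two decoupling gaps receive 0, so the trees remain free to learn their shape in either direction.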

04 The grid, with a fixed dependence

The booster learns strength only. The low-score dependence is one global nuisance, fixed at ρ₀ = −0.10 and applied to the assembled grid exactly as in Dixon–Coles — a clean separation of strength from dependence that a small sample cannot be trusted to learn jointly. Missing inputs route natively: a squad with no club-form data falls back to Elo and value at each split.

The boosted rates feed the same corrected scoreline grid
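
A minimal sketch of the grid step, assuming the standard Dixon–Coles low-score adjustment with the dependence fixed at −0.10 as stated above. The two goal rates (1.6 and 1.1), the truncation at 10 goals, and all function names are illustrative assumptions.

```python
import math

RHO_0 = -0.10  # fixed global low-score dependence, from the text


def poisson_pmf(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)


def tau(x, y, lam_i, lam_j, rho):
    """Dixon-Coles adjustment: only the four low-score cells are reweighted."""
    if x == 0 and y == 0:
        return 1.0 - lam_i * lam_j * rho
    if x == 0 and y == 1:
        return 1.0 + lam_i * rho
    if x == 1 and y == 0:
        return 1.0 + lam_j * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def score_grid(lam_i, lam_j, rho=RHO_0, max_goals=10):
    """Corrected scoreline grid, renormalised after truncation at max_goals."""
    grid = [[tau(x, y, lam_i, lam_j, rho)
             * poisson_pmf(x, lam_i) * poisson_pmf(y, lam_j)
             for y in range(max_goals + 1)] for x in range(max_goals + 1)]
    total = sum(map(sum, grid))
    return [[p / total for p in row] for row in grid]


def outcome_probs(grid):
    """Collapse the grid to (win, draw, loss) for the reference team i."""
    n = len(grid)
    win = sum(grid[x][y] for x in range(n) for y in range(n) if x > y)
    draw = sum(grid[x][x] for x in range(n))
    return win, draw, 1.0 - win - draw


print(outcome_probs(score_grid(1.6, 1.1)))           # with rho = -0.10
print(outcome_probs(score_grid(1.6, 1.1, rho=0.0)))  # independent Poisson
```

With a negative dependence the 0–0 and 1–1 cells gain mass at the expense of 1–0 and 0–1, so the draw probability rises relative to the independent grid while the rates themselves are untouched.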

Symbol key

  • the model input: engineered feature differences x_i − x_j plus venue context
  • the boosted-tree ensemble, which outputs the log goal rate
  • a single tree, the learning rate, and the number of boosting rounds
  • the two goal rates: one booster applied in both directions
  • the Poisson gradient and curvature each boosting round fits
  • the feature sets with increasing / decreasing monotone constraints
  • the fixed global low-score dependence (ρ₀ = −0.10), applied to the grid, not learned

  • 36 engineered team features (global squad panel), entered as differences
  • Monotone constraints on the Elo / value / form differences
  • Native missing-data routing + coverage flags

Fig. M·LightGBM Conceptual schematic

LightGBM-Poisson (global) — wired end to end

xᵢ − xⱼ (36 feature diffs, native NaN routing) → boosted trees (objective = poisson; monotone: ↑value, ↑Elo ⇒ ↑λ, never ↓) → log E[goals] → grid
Source · Oxford Football Forecasting model · structural diagram, not a data plot

Fig. V11 Lower is better · floor = Elo-only · ceiling = de-vigged market

OOS RPS — expanding (headline) and LOTO (optimistic)

On the headline expanding window this model scores 0.1921: 0.0017 better than the Elo floor (0.1938) and 0.0015 worse than the market ceiling (0.1905).

Expanding 0.1921
LOTO 0.1913

Bar fills to the model’s RPS on the floor–ceiling axis; the whisker on the expanding bar is the conservative 95% CI (0.1569–0.2230). Lower (left) is better.

It clears the Elo floor; the gap to the market is small and, with only n = 3 tournaments, inside the bootstrap interval.

Source · Oxford Football Forecasting model · Bookmaker consensus (de-vigged closing odds) · 152 matches · 3 tournaments
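
For readers unfamiliar with the metric, the sketch below computes the ranked probability score for a single match, assuming the standard definition over three ordered outcomes (home win, draw, away win). The source does not spell out its formula, so treat this as the conventional form; the forecasts are toy values, and a reported score would be the mean over all evaluated matches.

```python
def rps(probs, outcome):
    """Ranked probability score for ordered outcomes (home win, draw, away win).

    probs: forecast probabilities in outcome order; outcome: index of the result.
    """
    cum_p = cum_o = 0.0
    total = 0.0
    for k, p in enumerate(probs[:-1]):  # the last cumulative term is always 0
        cum_p += p
        cum_o += 1.0 if k == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)


# Toy forecasts, not project data.
print(rps([0.5, 0.3, 0.2], 0))  # favourite wins: low score
print(rps([0.5, 0.3, 0.2], 2))  # favourite loses: high score
```

Because the score works on cumulative probabilities, a forecast that misses by one category (win versus draw) is penalised less than one that misses by two (win versus loss).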

The squad-quality features were first built from top-5-European club data, which is sparse for non-European squads. That is a coverage confound: a naive model reads “low coverage” as “weak team,” which is circular and suppresses the squad signal exactly where it should add the most. Switching to a genuinely global club panel lifted average squad-form coverage from 47.8% to 85.0%, and moved this LightGBM model from worst to best-non-market on out-of-sample RPS.

Fig. V10 Weakest-first · Understat top-5 → global API-Football

Squad-form coverage, before → after, by confederation

The uplift is largest exactly where the bias bit hardest: OFC 15%→92%, AFC 17%→71%, while UEFA, already the best covered, moves least (73%→92%).

Confederation   n    Before   After   Uplift (pp)
OFC             1    15%      92%     +77
AFC             9    17%      71%     +54
CONCACAF        6    30%      81%     +50
CAF             10   45%      88%     +44
CONMEBOL        6    54%      85%     +30
UEFA            16   73%      92%     +19
Before = top-5 only; after = global panel; uplift in percentage points.

The fix is biggest off-UEFA and on low-coverage squads — directionally exactly as predicted. The GBM gains land there too, though they are not significant at n = 3.

Source · Understat · API-Football (global club coverage) · computed coverage fractions per squad
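
As a consistency check, the per-confederation figures can be rolled up into an overall average. The sketch assumes that n counts squads and that the headline average is a squad-weighted mean of the rounded percentages shown in the figure; both are assumptions, since the source does not state how the average was formed.

```python
# (confederation, squads, coverage before %, coverage after %) from the figure.
ROWS = [
    ("OFC", 1, 15, 92), ("AFC", 9, 17, 71), ("CONCACAF", 6, 30, 81),
    ("CAF", 10, 45, 88), ("CONMEBOL", 6, 54, 85), ("UEFA", 16, 73, 92),
]


def weighted_mean(rows, col):
    """Squad-weighted mean of one coverage column (col 2 = before, 3 = after)."""
    squads = sum(r[1] for r in rows)
    return sum(r[1] * r[col] for r in rows) / squads


before = weighted_mean(ROWS, 2)
after = weighted_mean(ROWS, 3)
print(f"{before:.1f}% -> {after:.1f}%")
```

This gives roughly 47.7% before and 85.0% after, in line with the reported 47.8% and 85.0% once rounding of the per-confederation inputs is allowed for.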

Was it coverage, or league-weighting? (the placebo)

The gain could have come from either seeing more players or re-weighting leagues. A placebo test that re-weighted leagues without adding coverage did not reproduce the lift — so coverage, not league-weighting, drove the result. Native missing-data routing means the trees still fall back to Elo and value when a squad’s club form is genuinely unobserved (it is missing for roughly 56% of player-rows overall).

Strengths

  • Captures non-linear feature interactions
  • Native missing-data routing (squad form missing ~56%)
  • The de-biasing A/B is cleanest here (placebo-controlled)

Limits

  • Gains off-UEFA / low-coverage are not significant at n=3
  • Less interpretable than the Bayesian prior structure
  • Needs careful monotone constraints to stay football-sane