The six models are not competitors to be whittled down to one winner
on three tournament folds; picking the best of six on so few folds is the
multiple-testing trap. They are an
ensemble basis with controlled disagreement: each is tuned on the match-level
backtest (tens of thousands of internationals, large n), then combined and validated
once on the tournaments. Dixon-Coles and the Bayesian model disagree precisely
on draw-heavy knockout football; the LightGBM model adds non-linear feature
interactions; pooling diverse-but-comparable models is what sharpens the forecast.
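
As a minimal sketch of what pooling could look like, assuming a linear opinion pool
with equal weights by default (the combination rule is not stated here), the function
below averages per-model (home, draw, away) probabilities. The name `pool_forecasts`
and the nested-list input layout are hypothetical.

```python
def pool_forecasts(model_probs, weights=None):
    """Linear opinion pool over per-model match forecasts.

    model_probs: model_probs[k][m] is model k's (home, draw, away)
                 probabilities for match m (assumed layout).
    weights: optional non-negative weight per model; equal if omitted
             (an assumption, not a documented choice).
    """
    n_models = len(model_probs)
    if weights is None:
        weights = [1.0] * n_models
    total_w = sum(weights)
    n_matches = len(model_probs[0])

    pooled = []
    for m in range(n_matches):
        # Weighted average of each outcome's probability across models.
        row = [
            sum(weights[k] * model_probs[k][m][j] for k in range(n_models)) / total_w
            for j in range(3)
        ]
        # Renormalise to guard against rounding drift in the inputs.
        s = sum(row)
        pooled.append([p / s for p in row])
    return pooled


# Two hypothetical models that disagree on one match.
pooled = pool_forecasts([[[0.5, 0.3, 0.2]], [[0.3, 0.3, 0.4]]])
```

With equal weights the pooled forecast sits between the two inputs; where the models
disagree, the pool is less extreme than either one.
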
Every ranked probability score (RPS) on this page is the expanding-window out-of-sample
number (train on tournaments before the test fold, never after) over 152 matches in 3
tournaments. This is the realism protocol. The leave-one-tournament-out (LOTO) number, which
leaks future tournaments into past training, is reported on each model page as an
optimistic ceiling, never as the headline. The Elo floor is the bar every model must
clear before its layer is added.
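
The protocol above can be sketched as follows, assuming the standard ranked probability
score (RPS) over three outcomes ordered home, draw, away, and tournaments identified by
placeholder labels. The helper names `rps`, `expanding_window_folds`, and `loto_folds`
are illustrative and not taken from the project's code.

```python
def rps(probs, outcome):
    """Ranked probability score for one match.

    probs: probabilities ordered (home, draw, away), summing to 1.
    outcome: index of the observed result (0 home, 1 draw, 2 away).
    Lower is better; 0 is a perfect forecast.
    """
    cum_p = cum_o = 0.0
    total = 0.0
    for i in range(len(probs) - 1):
        cum_p += probs[i]                        # cumulative forecast
        cum_o += 1.0 if i == outcome else 0.0    # cumulative observation
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)


def expanding_window_folds(tournaments):
    """Yield (train, test) pairs; train holds only earlier tournaments."""
    for i, test in enumerate(tournaments):
        yield tournaments[:i], test


def loto_folds(tournaments):
    """Yield (train, test) pairs; train holds every other tournament,
    including later ones, which is the leak described above."""
    for i, test in enumerate(tournaments):
        yield tournaments[:i] + tournaments[i + 1:], test


# Placeholder labels in chronological order (hypothetical, not the real folds).
# Note: the first expanding-window fold has an empty train list in this sketch.
tournaments = ["T1", "T2", "T3"]
expanding = list(expanding_window_folds(tournaments))
loto = list(loto_folds(tournaments))
```

For the middle fold, the expanding window trains on T1 only, while LOTO trains on T1
and T3, so the LOTO score has already seen a tournament played after the test fold.
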