Modelling rationale

Methods

How the study is built, and why. It all follows from one idea: promotion is a clean case of domain shift, and the clubs we care about are a small group on which a good average can quietly be wrong. For the sources and the data dictionary, see Data.

01 Why domain shift

A top division holds about twenty clubs, and only three of them are newly promoted in any season. So a model fitted to the whole league is shaped by the established majority. Its average accuracy comes mostly from those clubs, and there is no reason to assume the chances it hands the three newcomers are just as trustworthy. This is what statisticians call dataset shift: the group we care about is drawn from a slightly different world than the bulk of the training data, yet the same rule scores them all. Promoted clubs are a clean example. They are clearly labelled, they come round every year, and they sit inside a league that is otherwise the same, which makes the boundary a tidy place to ask how far a pooled model really travels.

02 A multi-task target

We model first-season performance three ways: survival as a yes or no, final points as a number, and finishing position as a rank. Doing all three serves one purpose, which is to cross-check. An effect that turns up on one measure but not the others is probably a fluke of that measure; an effect that holds across all three is telling us something about the data. The three targets sit on the same club-season row and are judged the same way.

03 The model ladder

The ladder is kept short on purpose, and we only add complication where it pays for itself. The survival baseline is a regularised logistic model,

$$ \Pr(\text{survive}=1 \mid x) \;=\; \sigma\!\big(\beta_0 + \beta^{\top} x\big), \qquad \sigma(z)=\frac{1}{1+e^{-z}}. $$

A gradient-boosting model (HistGradientBoosting) sits alongside it as a more flexible check, with matching versions for points and for finishing position. We judge how well clubs are told apart by the area under the ROC curve (AUROC), where 0.5 is a coin toss; calibration by the reliability curve and the Brier score; points by the average error, set against the error from simply guessing the group mean; and finishing position by rank correlation. The chances themselves are not the point. What matters every time is the comparison between the two groups, promoted and established, not the headline average.
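As a rough illustration of that group-wise comparison, the sketch below computes the area under the ROC curve and the Brier score separately for each group, using only the Python standard library. The function names, the row layout and the toy numbers are assumptions made for the example; they are not the study's code or data.

```python
from itertools import product


def auroc(y_true, y_prob):
    """Chance that a random survivor is scored above a random non-survivor (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in product(pos, neg))
    return wins / (len(pos) * len(neg))


def brier(y_true, y_prob):
    """Mean squared difference between the predicted chance and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def by_group(rows):
    """rows: list of (group, survived, predicted_chance). Returns metrics per group."""
    out = {}
    for group in sorted({r[0] for r in rows}):
        ys = [r[1] for r in rows if r[0] == group]
        ps = [r[2] for r in rows if r[0] == group]
        out[group] = {"auroc": auroc(ys, ps), "brier": brier(ys, ps), "n": len(ys)}
    return out


# Made-up toy rows, purely illustrative; not study data.
rows = [
    ("established", 1, 0.9), ("established", 1, 0.8),
    ("established", 0, 0.3), ("established", 1, 0.7),
    ("promoted", 1, 0.6), ("promoted", 0, 0.6),
    ("promoted", 0, 0.5), ("promoted", 1, 0.5),
]
report = by_group(rows)
for group, metrics in report.items():
    print(group, metrics)
```

Reporting the two groups side by side, rather than one pooled figure, is what keeps a strong overall score from hiding a weak one in the smaller group.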

04 The common feature set

The main feature set \(x\) is held to things that both groups have: form in the division below, pre-season Elo, the Elo gap to the field, and how the club came up. That limit is a requirement, not a shortcut. A fair test of whether a model built on established clubs carries over to promoted ones can only use features both groups share. The richest features, such as expected goals, squad make-up and learned style, exist for the top flight but not for the second tier a promoted club is leaving. That gap in what we can even measure is itself part of the shift, and we look at it separately, as a robustness layer, on the seasons from 2014 onward. A lean model also makes statistical sense: there are only 439 promoted club-seasons, and a heavier model would simply overfit them.
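One way to picture the "Elo gap to the field" feature is sketched below. The exact definition is not given here, so this assumes the gap is a club's pre-season rating minus the mean rating of the other clubs in the same league-season; the function name and the ratings are hypothetical.

```python
def elo_gap_to_field(elos):
    """elos: dict of club -> pre-season Elo for one league-season.

    Returns club -> own rating minus the mean rating of the other clubs.
    ASSUMPTION: this definition is illustrative; the study may define the gap differently.
    """
    n = len(elos)
    if n < 2:
        raise ValueError("need at least two clubs to define a field")
    total = sum(elos.values())
    return {club: rating - (total - rating) / (n - 1) for club, rating in elos.items()}


# Hypothetical ratings for a four-club league.
league = {"A": 1800.0, "B": 1700.0, "C": 1600.0, "D": 1500.0}
gaps = elo_gap_to_field(league)
print(gaps)
```

Because the gap is measured relative to the rest of the league, it can be computed for promoted and established clubs alike, which is the property the common feature set requires.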

05 Temporal validation

Every figure on this site is worked out strictly out of time. To test the model, we train it on all the seasons before a target year and apply it to that year, then roll forward through the data,

$$ \text{train}:\{\,s:\operatorname{year}(s)\lt Y\,\}\;\longrightarrow\;\text{test}:\{\,s:\operatorname{year}(s)=Y\,\}, \quad Y=2005,\dots,2025. $$

A shuffled split will not do here. Promotion and relegation change who is in the league every year, so a random split could put a club's later seasons in the training set and then ask the model to predict its earlier ones. That hands it information no forecaster could have had at the time.
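The rolling scheme can be sketched as a small generator. This is a minimal illustration, assuming each club-season is a dict with a `year` key; that layout, the function name and the toy rows are assumptions, while the default year range follows the formula above.

```python
def rolling_origin_splits(rows, first_year=2005, last_year=2025):
    """Yield (Y, train, test) with train strictly before year Y and test exactly year Y."""
    for target in range(first_year, last_year + 1):
        train = [r for r in rows if r["year"] < target]
        test = [r for r in rows if r["year"] == target]
        if train and test:  # skip years with nothing to fit on or nothing to score
            yield target, train, test


# Toy club-seasons, purely illustrative.
rows = [{"year": y, "club": f"club{i}"} for y in range(2003, 2008) for i in range(3)]
folds = list(rolling_origin_splits(rows, first_year=2005, last_year=2007))
for target, train, test in folds:
    print(target, len(train), len(test))
```

The strict `<` in the training filter is the whole point: no row from the target year, or any later one, can reach the model being tested on that year.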

That rolling scheme is for testing, where each season has to be predicted without the model seeing it. The forecast for 2026/27 is the same idea taken one step into the future: a model fitted on the whole history up to and including the 2025/26 season just gone, latest results and all, then pointed at the clubs promoted for 2026/27. The only time we deliberately hold a season back is in the back-test, which leaves out 2025/26 so that its known outcomes can serve as a fresh check.

06 The generalisation test

The new idea here is the question, not the algorithm. We train one model on the whole of Europe, then test it three ways: on promoted clubs, on established clubs, and by training on four leagues and predicting the fifth. The thing to watch is whether

$$ \text{AUROC}_{\text{promoted}} \;\ll\; \text{AUROC}_{\text{established}}. $$

If it holds, then promoted clubs are not just a smaller version of the usual problem. They are a separate, harder one, where a strong average tells you nothing about the newcomers. In the data, AUROC is 0.57 for promoted clubs against 0.80 for established ones; the gap shows up in every league, and the richer structure and style features do not close it.
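The third test, training on four leagues and predicting the fifth, amounts to a leave-one-league-out split. The sketch below assumes each club-season is a dict with a `league` key; the league labels are placeholders, and time ordering is left out for brevity.

```python
def leave_one_league_out(rows):
    """Yield (held_out, train, test): train on every other league, test on the held-out one."""
    for held_out in sorted({r["league"] for r in rows}):
        train = [r for r in rows if r["league"] != held_out]
        test = [r for r in rows if r["league"] == held_out]
        yield held_out, train, test


# Placeholder league labels and clubs, purely illustrative.
rows = [
    {"league": league, "club": f"{league}-club{i}"}
    for league in ["L1", "L2", "L3", "L4", "L5"]
    for i in range(2)
]
league_folds = list(leave_one_league_out(rows))
for held_out, train, test in league_folds:
    print(held_out, len(train), len(test))
```

Scoring promoted and established clubs separately inside each held-out league would then show whether the gap is a feature of one competition or of promotion itself.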

07 Mechanism

Permutation importance shows where the gap between promoted and established clubs comes from. For the league as a whole, survival rests almost entirely on one thing, the Elo gap to the field, which tells strong established clubs from weak ones. Promoted clubs are squeezed into the bottom of that scale: nearly nine in ten sit below the middle of their league, on average 76 rating points down. So the one feature that does the work has almost nothing to push against inside this group, and nothing else steps in. The model does not fall apart; there is simply no usable signal left when every club is, in relative terms, a small one.
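For readers unfamiliar with permutation importance, here is a minimal sketch of the idea: shuffle one feature at a time and record how far the AUROC falls. The `predict` function is a stand-in that simply scores by Elo gap, not the study's fitted model, and the feature names and toy values are assumptions.

```python
import random
from itertools import product


def auroc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in product(pos, neg))
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Return (baseline AUROC, mean AUROC drop per feature when that feature is shuffled)."""
    rng = random.Random(seed)
    base = auroc(y, [predict(row) for row in X])
    drops = {}
    for name in X[0]:
        total = 0.0
        for _ in range(n_repeats):
            column = [row[name] for row in X]
            rng.shuffle(column)  # break the link between this feature and the outcome
            shuffled = [{**row, name: value} for row, value in zip(X, column)]
            total += base - auroc(y, [predict(row) for row in shuffled])
        drops[name] = total / n_repeats
    return base, drops


# Toy data, purely illustrative: survival follows the Elo gap, "noise" is unrelated.
pairs = [(-120, 3), (-80, 1), (-40, 4), (-10, 2), (15, 4), (60, 1), (90, 3), (140, 2)]
X = [{"elo_gap": gap, "noise": noise} for gap, noise in pairs]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def predict(row):
    # Stand-in scorer: rank clubs by Elo gap alone.
    return row["elo_gap"]


base, drops = permutation_importance(predict, X, y)
print(base, drops)
```

A feature the scorer never uses shows a drop of exactly zero, while the feature carrying the signal shows a clear fall. Run on a subgroup where that feature barely varies, the same procedure would have little left to measure, which is the situation described above.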

08 Scope

The claims are kept within their limits. The promoted group is small, so its figures come with wide error bars. The point is that, on public pre-season data, the model's ranking of promoted clubs cannot be told apart from a coin toss, not that the data points the wrong way. The richest features only exist for the top flight, which keeps the pre-promotion set lean, and that, once more, is part of the story rather than a flaw in it. Promotion is not handed out at random, so the comparisons between promoted and established clubs describe a pattern rather than prove a cause. Within those limits the conclusion is plain: how a newly promoted club does in its first season is close to a base-rate outcome once you know it has gone up, and more public data does not help.