The dataset
The dataset is a long run of results with richer modern data layered on top, covering five leagues and their top two divisions from 1993/94 to 2025/26. Every source is public, and each is used only over the years it genuinely covers.
| Source | Provides | Window |
|---|---|---|
| Football-Data.co.uk | results, shots, bookmaker odds (5 leagues, both tiers) | 1993/94– |
| ClubElo | preseason strength ratings (20,585 snapshots) | 1994–2026 |
| Understat | expected goals (xG), top flights | 2014/15– |
| FBref | advanced team/player stats, squad age, minutes | 2010– |
| Transfermarkt | squad age, managers, transfer fees | 2014– |
| SoFIFA | FIFA ratings, transfer budget, club worth | 2014–2026 |
| Wyscout | raw event data for playing style (3M events) | 2017/18 |
A row of data is one club's season in the top flight. The long run of results, Elo and odds covers the whole history. The richer layers, such as expected goals, squad make-up and transfers, only start in 2014 and only cover the top divisions.
The study pools all 33 seasons from 1993/94 to 2025/26, 3,204 club-seasons in total, rather than relying on any single one. 2025/26 is mentioned often only because it is the most recent finished season: it brings the data up to the present and gives the model a fresh test, since we now know how the clubs promoted for 2025/26 got on. The 2026/27 forecast is trained on the whole history through 2025/26 and then applied to the clubs promoted for 2026/27, using their 2025/26 form in the division below. The only place 2025/26 is held back is the back-test, which trains on earlier seasons so that the known 2025/26 outcomes can serve as a clean check. In short: every season feeds the history and the models, 2025/26 is both the newest training season and the back-test target, and 2026/27 is the season the forecast looks ahead to.
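The two uses of 2025/26 can be sketched as a single season-based split. This is a minimal illustration, not the study's own code: the function names and the toy rows are hypothetical, and it assumes each row carries a `season` label in the `YYYY/YY` form used in the text.

```python
def season_start(season):
    """Return the starting year of a label such as '2025/26'."""
    return int(season.split("/")[0])


def split_for_target(rows, target_season):
    """Train on every season before the target; hold out the target itself."""
    cutoff = season_start(target_season)
    train = [r for r in rows if season_start(r["season"]) < cutoff]
    held_out = [r for r in rows if season_start(r["season"]) == cutoff]
    return train, held_out


# Toy rows, invented for illustration only.
rows = [
    {"season": "2023/24", "club": "A"},
    {"season": "2024/25", "club": "B"},
    {"season": "2025/26", "club": "C"},
]

# Back-test: 2025/26 is held back and checked against its known outcomes.
backtest_train, backtest_check = split_for_target(rows, "2025/26")

# Forecast: 2025/26 joins the training history; 2026/27 has no outcomes yet.
forecast_train, _ = split_for_target(rows, "2026/27")
```

Splitting on the season's starting year keeps the rule the same in both modes: only the target season changes.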
The features come in four layers. The first two make up the lean, balanced set that every model uses. The other two are a robustness layer, available from 2014 onward.
Each row is one club's season in the top flight, and every feature describes what the club brought into that season. For a promoted club, the form we use is its last season in the division below; for an established club, it is its previous top-flight season. Pre-season Elo is the rating as it stood around 1 August. Nothing from the season being predicted goes into the features. The strongest link between any single feature and the outcome is about 0.36, which is reassuringly modest and tells us the model is not quietly reading the answer.
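A leakage check of this kind can be sketched as follows. The text does not say which measure of association is behind the 0.36 figure, so this example assumes a plain Pearson correlation; the helper names and the toy rows are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def strongest_link(rows, feature_names, target_name):
    """Return the feature with the largest absolute correlation to the target."""
    target = [r[target_name] for r in rows]
    scores = {
        name: abs(pearson([r[name] for r in rows], target))
        for name in feature_names
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy rows, invented for illustration only (not taken from the panel).
rows = [
    {"prior_ppg": 1.0, "elo_preseason": 1500, "target_survived": 0},
    {"prior_ppg": 1.5, "elo_preseason": 1600, "target_survived": 1},
    {"prior_ppg": 2.0, "elo_preseason": 1550, "target_survived": 1},
    {"prior_ppg": 1.2, "elo_preseason": 1700, "target_survived": 0},
]
best_feature, best_score = strongest_link(
    rows, ["prior_ppg", "elo_preseason"], "target_survived"
)
```

A value close to 1 for any single feature would be a warning sign that information from the predicted season had slipped into the inputs.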
We work out whether a club is promoted from its record in the tier below, rather than trusting a separate flag. That alone fixed 55 wrongly labelled early Spanish seasons, and it leaves 439 promoted and 2,632 established club-seasons.
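One way to derive the label from prior-tier records is sketched below. It is an assumption about the rule, not the study's confirmed logic: a club counts as promoted if it played in the second tier the season before, as incumbent if it played in the top flight, and is left unlabelled when no prior record exists. The function name and the toy club sets are hypothetical.

```python
def label_status(top_flight, second_tier):
    """Label each top-flight club-season from the previous season's records.

    Both arguments map a season's starting year to the set of clubs
    that played in that tier of one league during that season.
    """
    labels = {}
    for year, clubs in top_flight.items():
        previous_top = top_flight.get(year - 1, set())
        previous_second = second_tier.get(year - 1, set())
        for club in clubs:
            if club in previous_second:
                labels[(year, club)] = "promoted"
            elif club in previous_top:
                labels[(year, club)] = "incumbent"
            else:
                # No prior record available, so no label is assigned.
                labels[(year, club)] = None
    return labels


# Toy league, invented for illustration only.
top_flight = {2000: {"A", "B", "C"}, 2001: {"A", "B", "D"}}
second_tier = {2000: {"D", "E"}}
labels = label_status(top_flight, second_tier)
```

Because the label comes straight from where the club actually played, a wrong flag in the source data cannot carry through.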
| Column | Meaning |
|---|---|
| promoted / incumbent | entered from the second / first tier (prior-tier record) |
| prior_ppg, prior_gd_pg | prior-division points and goal difference per game |
| prior_shot_diff_pg | prior-division shot difference per game |
| elo_preseason | ClubElo rating on ~1 August (pre-season) |
| elo_gap_to_median | club Elo − median Elo of that season's top flight |
| PromotionRoute | Automatic · Play-off/Other · Incumbent |
| target_survived | 1 = remained in the top flight · 0 = relegated |
| target_points / target_band | final points · finishing band (1 top … 5 bottom) |
| transfer_spend / squad_age | summer fees in € · mean squad age (2014+) |
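The per-game and Elo-gap columns above can be computed along these lines. This is a sketch under stated assumptions: the median is taken within each league and season (the table says only "that season's top flight"), and the `league` key, the function names and the toy rows are hypothetical.

```python
from statistics import median


def per_game(total, games):
    """Convert a season total (points, goal difference, shots) to a per-game rate."""
    return total / games


def add_elo_gap(rows):
    """Add elo_gap_to_median: club Elo minus the median Elo of its league-season."""
    ratings = {}
    for r in rows:
        ratings.setdefault((r["league"], r["season"]), []).append(r["elo_preseason"])
    medians = {key: median(values) for key, values in ratings.items()}
    return [
        {**r, "elo_gap_to_median": r["elo_preseason"] - medians[(r["league"], r["season"])]}
        for r in rows
    ]


# Toy rows, invented for illustration only.
rows = [
    {"league": "X", "season": "2024/25", "club": "A", "elo_preseason": 1500},
    {"league": "X", "season": "2024/25", "club": "B", "elo_preseason": 1600},
    {"league": "X", "season": "2024/25", "club": "C", "elo_preseason": 1700},
]
with_gap = add_elo_gap(rows)
prior_ppg = per_game(76, 38)
```

Expressing Elo as a gap to the median keeps the feature comparable across leagues and eras whose average ratings differ.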
| File | Contents | Format |
|---|---|---|
| panel_primary.csv | 3,204 club-seasons · primary features + targets | CSV |
| panel_enriched.csv | + squad age, continuity, spend, prior xG | CSV |
| club_season_targets.csv | survival / points / finishing-band labels | CSV |
| style_features.csv | per-club style vectors and components | CSV |
| forecast_2627.json | 2026/27 forecast, bands, back-test | JSON |
| panel.parquet | the panel as Parquet | Parquet |
Derived from public sources; licences and attribution respected.