Data

The dataset

A long run of results, with richer modern data layered on top, covering five leagues and both divisions from 1993/94 to 2025/26. Every source is public, and each is used only over the years it genuinely covers.

01 Sources

Source	Provides	Window
Football-Data.co.uk	results, shots, bookmaker odds (5 leagues, both tiers)	1993/94–
ClubElo	preseason strength ratings (20,585 snapshots)	1994–2026
Understat	expected goals (xG), top flights	2014/15–
FBref	advanced team/player stats, squad age, minutes	2010–
Transfermarkt	squad age, managers, transfer fees	2014–
SoFIFA	FIFA ratings, transfer budget, club worth	2014–2026
Wyscout	raw event data for playing style (3M events)	2017/18

02 Coverage

A row of data is one club's season in the top flight. The long run of results, Elo and odds covers the whole history. The richer layers, such as expected goals, squad make-up and transfers, only start in 2014 and only cover the top divisions.

Which seasons the study uses

The study uses every season from 1993/94 to 2025/26, all 33 of them, rather than a single season. All 3,204 club-seasons are pooled together. 2025/26 is mentioned often only because it is the most recent finished season: it brings the data up to the present and gives the model a fresh test, since we now know how the clubs promoted for 2025/26 fared. The 2026/27 forecast is trained on the whole history through 2025/26, that latest finished season included, and then applied to the clubs promoted for 2026/27, using their 2025/26 form in the division below. The only place 2025/26 is held back is the back-test, which trains on earlier seasons so that the known 2025/26 outcomes can serve as a clean check. In short: every season feeds the history and the forecast model, 2025/26 is both the newest training season and the back-test target, and 2026/27 is the season the forecast looks ahead to.
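The two uses of 2025/26 described above can be sketched as a pair of season splits. This is a minimal illustration, not the study's own code: the function names and the season-label format are assumptions made for the example.

```python
def season_label(start_year: int) -> str:
    """Format a season by its starting year, e.g. 1993 -> '1993/94'."""
    return f"{start_year}/{(start_year + 1) % 100:02d}"


# 1993/94 through 2025/26: 33 seasons in total.
ALL_SEASONS = [season_label(y) for y in range(1993, 2026)]


def backtest_split(seasons, target):
    """Train on seasons strictly before `target`; hold `target` out as the check."""
    idx = seasons.index(target)
    return seasons[:idx], [seasons[idx]]


def forecast_training(seasons):
    """The forecast trains on the whole history, latest finished season included."""
    return list(seasons)


# Back-test: 2025/26 is held out. Forecast: 2025/26 is part of training.
train, test = backtest_split(ALL_SEASONS, "2025/26")
forecast_train = forecast_training(ALL_SEASONS)
```

The only difference between the two splits is whether the final season sits on the training side or the test side.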

Figure	Value	Note
Leagues	England, Spain, Germany, Italy, France	
Club-seasons	3,204	top flight, both eras
Promotions	439	the subgroup of interest
Matches	122k	results + odds backbone

03 Feature layers

The features come in four layers. The first two make up the lean, balanced set that every model uses. The other two are a robustness layer, available from 2014 onward.

L1: How dominant they were below. Points, goal and shot difference, and finishing position in the division below. Source: Football-Data

L2: Strength and setting. Pre-season Elo, the gap to the rest of the top flight, and bookmaker odds. Source: ClubElo

L3: The shape of the club. Squad age, how much of the squad stayed on, how long the manager has been there, and transfer spend. Sources: Transfermarkt, SoFIFA, FBref

L4: Playing style. A picture learned straight from on-the-ball event data. Sources: Wyscout, Understat
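As a rough sketch of how the layers might be selected per season, the snippet below returns the lean set for every season and adds robustness columns from 2014 onward. The column names come from the data dictionary, but the lists are partial: finishing position, bookmaker odds, squad continuity, manager tenure and the style features are left out because their column names are not given here. The function name and the use of the season's starting year as the cut-off are assumptions.

```python
# Lean set (layers L1 and L2), used by every model.
LEAN_FEATURES = [
    "prior_ppg", "prior_gd_pg", "prior_shot_diff_pg",  # L1: form in the division below
    "elo_preseason", "elo_gap_to_median",              # L2: strength and setting
]

# Partial robustness set (layer L3 only); assumed available from 2014 onward.
ROBUSTNESS_FEATURES = ["squad_age", "transfer_spend"]
ROBUSTNESS_FROM = 2014  # assumed to mean the season starting in 2014


def features_for(season_start_year: int) -> list[str]:
    """Return the feature columns usable for a season starting in the given year."""
    cols = list(LEAN_FEATURES)
    if season_start_year >= ROBUSTNESS_FROM:
        cols += ROBUSTNESS_FEATURES
    return cols
```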

04 The panel

Each row is one club's season in the top flight, and every feature describes what the club brought into that season. For a promoted club, the form we use is its last season in the division below; for an established club, it is its previous top-flight season. Pre-season Elo is the rating as it stood around 1 August. Nothing from the season being predicted goes into the features. The strongest link between any single feature and the outcome is about 0.36, which is reassuringly modest and suggests that no feature is quietly leaking the answer into the model.
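A check of this kind can be sketched as follows. The text does not say how the "link" is measured, so this example assumes it is the absolute Pearson correlation between each feature and the target. The helper names and the toy rows are made up for illustration and are not taken from the panel.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(vx * vy)


def strongest_link(rows, feature_names, target):
    """Return (feature, |r|) for the feature most correlated with the target."""
    ys = [r[target] for r in rows]
    scores = {f: abs(pearson([r[f] for r in rows], ys)) for f in feature_names}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy rows with made-up values, purely to exercise the function.
toy_rows = [
    {"prior_ppg": 1.8, "elo_gap_to_median": -30, "target_points": 30},
    {"prior_ppg": 2.0, "elo_gap_to_median": -50, "target_points": 40},
    {"prior_ppg": 2.2, "elo_gap_to_median": -40, "target_points": 50},
]
best_feature, best_score = strongest_link(
    toy_rows, ["prior_ppg", "elo_gap_to_median"], "target_points"
)
```

In the toy rows one feature tracks the target perfectly, which is the pattern a leakage check is meant to flag; a maximum near 0.36, as reported for the real panel, would not raise that concern.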

We work out whether a club is promoted from its record in the tier below, rather than trusting a separate flag. That alone fixed 55 wrongly labelled early-Spanish seasons, and it leaves 439 promoted and 2,632 established club-seasons.
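The idea of deriving the label from the prior-tier record can be sketched like this. It is an illustration under assumptions, not the study's pipeline: the record layout `(club, season_start_year, tier)`, the function name and the club names are hypothetical.

```python
def label_promoted(records):
    """Label each top-flight club-season from the club's tier one season earlier.

    `records` is an iterable of (club, season_start_year, tier) tuples,
    with tier 1 for the top flight and tier 2 for the division below.
    Seasons with no record for the previous year are left unlabelled.
    """
    tier_of = {(club, year): tier for club, year, tier in records}
    labels = {}
    for (club, year), tier in tier_of.items():
        if tier != 1:
            continue  # only top-flight seasons become panel rows
        previous = tier_of.get((club, year - 1))
        if previous == 2:
            labels[(club, year)] = "promoted"
        elif previous == 1:
            labels[(club, year)] = "incumbent"
    return labels


# Hypothetical clubs: Club A comes up in 2001, Club B stays in the top flight.
toy_records = [
    ("Club A", 2000, 2), ("Club A", 2001, 1), ("Club A", 2002, 1),
    ("Club B", 2000, 1), ("Club B", 2001, 1),
]
toy_labels = label_promoted(toy_records)
```

Because the label depends only on where the club actually played the season before, a wrong or missing flag in a source file cannot change it.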

05 Data dictionary

Column	Meaning
promoted / incumbent	previous season in the second / first tier (derived from the prior-tier record)
prior_ppg, prior_gd_pg	prior-division points and goal difference per game
prior_shot_diff_pg	prior-division shot difference per game
elo_preseason	ClubElo rating on ~1 August (pre-season)
elo_gap_to_median	club Elo − median Elo of that season's top flight
PromotionRoute	Automatic · Play-off/Other · Incumbent
target_survived	1 = remained in the top flight · 0 = relegated
target_points / target_band	final points · finishing band (1 top … 5 bottom)
transfer_spend / squad_age	summer fees in € · mean squad age (2014+)
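Two of the derived columns follow directly from their definitions above and can be sketched in a few lines. The `season` key, the function names and the toy Elo values are assumptions for the example; only `elo_preseason`, `elo_gap_to_median` and `prior_ppg` are names from the dictionary.

```python
from collections import defaultdict
from statistics import median


def per_game(total, games):
    """Per-game rate, e.g. prior_ppg = prior-division points / games played."""
    return total / games


def add_elo_gap(rows):
    """Add elo_gap_to_median: club Elo minus the median Elo of that season's top flight."""
    by_season = defaultdict(list)
    for r in rows:
        by_season[r["season"]].append(r["elo_preseason"])
    medians = {season: median(values) for season, values in by_season.items()}
    return [
        {**r, "elo_gap_to_median": r["elo_preseason"] - medians[r["season"]]}
        for r in rows
    ]


# Toy rows with made-up ratings for two seasons.
toy = [
    {"season": "2024/25", "elo_preseason": 1500},
    {"season": "2024/25", "elo_preseason": 1600},
    {"season": "2024/25", "elo_preseason": 1700},
    {"season": "2025/26", "elo_preseason": 1550},
]
with_gap = add_elo_gap(toy)
```

The median is taken within each season, so the gap compares a club only with the top flight it is about to play in.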

06 Downloads

File	Contents	Format
panel_primary.csv	3,204 club-seasons · primary features + targets	CSV
panel_enriched.csv	+ squad age, continuity, spend, prior xG	CSV
club_season_targets.csv	survival / points / finishing-band labels	CSV
style_features.csv	per-club style vectors and components	CSV
forecast_2627.json	2026/27 forecast, bands, back-test	JSON
panel.parquet	the panel as Parquet	Parquet

Derived from public sources; licences and attribution respected.