A machine learning model that predicts FIFA World Cup 2026 match outcomes using gradient-boosted trees trained on historical tournament data (1930–2022), FIFA world rankings, and engineered features from 28,982 competitive international matches. Predictions are published before the tournament opens and tracked live across all 104 matches.
Can historical FIFA data and current team metrics predict match outcomes with meaningful accuracy across 104 World Cup 2026 games?
Data Collection
├── Fjelstul World Cup Database (1930–2022, 27 tables, SQLite)
├── International Football Results martj42/Kaggle (28,982 competitive matches)
└── FIFA World Rankings cashncarry/Kaggle (April 2026 snapshot)
↓
Feature Engineering (SQL + pandas)
├── fifa_ranking_differential — home rank minus away rank
├── recent_form_last10 — rolling win rate, last 10 competitive matches
├── head_to_head_win_pct — all prior competitive matchups
├── wc_historical_win_pct — World Cup win rate (1930–2022)
├── host_advantage — US / Canada / Mexico flag
├── qualifying_goal_difference — avg goal diff in qualifying campaign
└── home_advantage — home vs neutral venue flag
↓
Model Training
└── XGBoost classifier
Train: 1930–2018 World Cup data
Test: 2022 World Cup (temporal validation — no leakage)
Target: Win / Draw / Loss (3-class)
Output: Probability per outcome
Explainability: SHAP feature importance
↓
Predictions + Live Tracking
├── Group stage — published before June 11
├── Knockout stage — updated as bracket forms
├── Accuracy tracker — updated after each matchday
└── Final accuracy report — July 20
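The "Probability per outcome" step in the pipeline above can be sketched as follows. This is a minimal illustration in plain Python, not the project's code: the probabilities are made up, the helper names `pick_outcome` and `log_loss` are hypothetical, and log loss is shown only as one possible probability-aware complement to accuracy (the project itself only states that it tracks accuracy).

```python
import math

# Outcome labels, read from the first-listed team's perspective (assumption).
OUTCOMES = ("win", "draw", "loss")

def pick_outcome(probs):
    """Return the label with the highest predicted probability."""
    return max(OUTCOMES, key=lambda label: probs[label])

def log_loss(prob_rows, actuals, eps=1e-15):
    """Mean negative log-probability assigned to the outcomes that occurred."""
    total = 0.0
    for probs, actual in zip(prob_rows, actuals):
        p = min(max(probs[actual], eps), 1.0)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(actuals)

# Made-up probabilities for two hypothetical fixtures (not model output).
predictions = [
    {"win": 0.55, "draw": 0.25, "loss": 0.20},
    {"win": 0.30, "draw": 0.30, "loss": 0.40},
]
actuals = ["win", "draw"]

picks = [pick_outcome(p) for p in predictions]
accuracy = sum(p == a for p, a in zip(picks, actuals)) / len(actuals)
score = log_loss(predictions, actuals)
```

Accuracy only checks the top pick, while log loss also penalizes confident misses, which is why the two can disagree on close matches.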
| Stage | Matches | Accuracy | Notes |
|---|---|---|---|
| Group Stage | 72 | TBD | Published May 31 |
| Round of 32 | 16 | TBD | Updated June 27 |
| Knockout (Round of 16 to Final) | 16 | TBD | Updated July 1 |
| Final | — | TBD | July 20 report |
Predictions made before the tournament. Results logged after each matchday.
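A running tracker of this kind could be computed along these lines. This is a sketch under assumptions: the row keys (`stage`, `predicted`, `actual`) and the function name `accuracy_by_stage` are hypothetical, and the sample rows are invented rather than taken from `accuracy_tracker.csv`.

```python
from collections import defaultdict

def accuracy_by_stage(rows):
    """Share of correct picks per stage.

    rows: iterable of dicts with 'stage', 'predicted', and 'actual' keys.
    Matches with no logged result yet (actual is None) are skipped.
    """
    hits = defaultdict(int)
    played = defaultdict(int)
    for row in rows:
        if row["actual"] is None:
            continue
        played[row["stage"]] += 1
        hits[row["stage"]] += row["predicted"] == row["actual"]
    return {stage: hits[stage] / played[stage] for stage in played}

# Invented rows for illustration only.
rows = [
    {"stage": "Group Stage", "predicted": "win", "actual": "win"},
    {"stage": "Group Stage", "predicted": "draw", "actual": "loss"},
    {"stage": "Group Stage", "predicted": "loss", "actual": None},  # not played yet
    {"stage": "Round of 32", "predicted": "win", "actual": "win"},
]
tracker = accuracy_by_stage(rows)
```

Skipping unplayed matches keeps the running figure honest during the tournament: accuracy is reported only over results that have actually been logged.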
# Phase 1: core model (build now)
features = [
'fifa_ranking_differential', # strongest signal
'recent_form_last10', # rolling win rate, last 10 competitive matches
'head_to_head_win_pct', # historical matchup record
'wc_historical_win_pct', # World Cup performance 1930–2022
'host_advantage', # US / Canada / Mexico boost
'qualifying_goal_difference', # attacking strength proxy
'home_advantage', # home vs neutral venue flag
]
# Phase 2: squad-level features (after baseline model)
phase_2 = [
'avg_wc_experience', # prior World Cup appearances per squad member
'squad_continuity', # % players returning from last WC squad
'star_player_present', # players with 5+ World Cup goals on roster
'knockout_win_rate', # knockout-stage-only historical win rate
'shootout_win_rate', # penalty shootout record
]

wc2026-match-predictor/
├── data/
│ ├── raw/ # gitignored; see data/raw/README.md to download
│ │ ├── worldcup.db # Fjelstul WC database (1930–2022)
│ │ ├── fifa_ranking/ # FIFA rankings CSV (cashncarry/Kaggle)
│ │ └── ifr/ # International results CSVs (martj42/Kaggle)
│ ├── processed/ # Feature-engineered datasets (committed)
│ │ ├── competitive_results.csv # 28,982 filtered competitive matches
│ │ └── wc2026_fixtures.csv # 72 WC2026 match fixtures
│ └── live/ # Updated during tournament (committed)
│ ├── wc2026_results.csv # Match results logged after each game
│ └── accuracy_tracker.csv # Running model accuracy by matchday
├── notebooks/
│ ├── 01_data_exploration.ipynb # EDA across all data sources
│ ├── 02_feature_engineering.ipynb # SQL + pandas feature pipeline
│ ├── 03_model_training.ipynb # XGBoost + temporal validation
│ ├── 04_predictions.ipynb # Group stage + knockout predictions
│ └── 05_update_predictions.ipynb # Run after each matchday during tournament
├── models/
│ └── xgb_model.pkl # Trained model (gitignored)
├── app/
│ ├── app.py # Streamlit app: pick teams, get prediction
│ └── utils.py # Feature engineering helpers
├── outputs/
│ ├── group_stage/ # Group stage prediction charts
│ ├── knockout/ # Knockout bracket predictions
│ ├── shap/ # SHAP feature importance plots
│ └── accuracy/ # Running accuracy charts
├── requirements.txt
└── README.md
Fjelstul World Cup Database: Fjelstul, Joshua C. "The Fjelstul World Cup Database v.1.2.0." July 19, 2023. https://github.com/jfjelstul/worldcup (License: CC-BY-SA 4.0)
International Football Results (1872–2026): Kaggle, martj42/international-football-results-from-1872-to-2017. https://kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017
FIFA World Rankings: Kaggle, cashncarry/fifaworldranking. https://kaggle.com/datasets/cashncarry/fifaworldranking
git clone https://github.com/ChristianLG2/WorldCup2026-Match-Predictor.git
cd WorldCup2026-Match-Predictor
pip install -r requirements.txt

Download the raw data files as described in data/raw/README.md, then run the notebooks in order (01 → 05).
To launch the Streamlit app locally:
streamlit run app/app.py

Why SQLite over PostgreSQL? The dataset is ~1,200 WC matches and 28,982 competitive results, small enough that SQLite handles all joins and aggregations instantly. SQL is used for feature engineering (window functions, CTEs, GROUP BY) and pandas for modeling and visualization.
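As an illustration of the SQL side, here is a minimal sketch of how a rolling feature such as `recent_form_last10` might be written with a CTE and a window function, using Python's built-in `sqlite3` module. The table `team_matches` and its columns are hypothetical, not the project's schema, and the rows are invented. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical long-format table: one row per team per match.
conn.execute("CREATE TABLE team_matches (team TEXT, match_date TEXT, won INTEGER)")
conn.executemany(
    "INSERT INTO team_matches VALUES (?, ?, ?)",
    [
        ("A", "2025-01-01", 1),
        ("A", "2025-02-01", 0),
        ("A", "2025-03-01", 1),
        ("B", "2025-01-15", 0),
        ("B", "2025-02-15", 0),
    ],
)

# Win rate over the previous 10 matches, excluding the current one, so the
# feature only uses information available before kickoff. Draws count as 0.
query = """
WITH form AS (
    SELECT
        team,
        match_date,
        AVG(won) OVER (
            PARTITION BY team
            ORDER BY match_date
            ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING
        ) AS recent_form_last10
    FROM team_matches
)
SELECT team, match_date, recent_form_last10
FROM form
ORDER BY team, match_date
"""
rows = conn.execute(query).fetchall()
conn.close()
```

A team's first match has no prior history, so its value comes back as NULL (`None` in Python); how to fill that gap is a modeling choice left open here.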
Why temporal validation? The model trains on 1930–2018 and tests on 2022 to simulate real prediction conditions. A random train/test split on match data would leak future information: some 2022 matches would land in the training set, so a team's 2022 form would bleed into a model that is then evaluated on that same tournament.
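A temporal split of this kind can be sketched in a few lines. The function name `temporal_split`, the record layout, and the sample rows are assumptions for illustration; the only detail taken from the project is the cut itself (train before 2022, test on 2022).

```python
def temporal_split(matches, test_year):
    """Train on every tournament before test_year; test on test_year only."""
    train = [m for m in matches if m["year"] < test_year]
    test = [m for m in matches if m["year"] == test_year]
    return train, test

# Invented records standing in for World Cup matches.
matches = [
    {"year": 2014, "home": "X", "away": "Y", "outcome": "win"},
    {"year": 2018, "home": "Y", "away": "Z", "outcome": "draw"},
    {"year": 2022, "home": "X", "away": "Z", "outcome": "loss"},
    {"year": 2022, "home": "Z", "away": "Y", "outcome": "win"},
]

train, test = temporal_split(matches, test_year=2022)

# Every training match predates every test match, so nothing from the
# held-out tournament can reach the model during fitting.
latest_train_year = max(m["year"] for m in train)
earliest_test_year = min(m["year"] for m in test)
```

Splitting on the tournament year rather than on row index means the guarantee holds no matter how the input rows are ordered.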
Why not in-match features? Goals, bookings, and substitutions are excluded because they are unknown before kickoff; using them would be data leakage regardless of how they are aggregated.
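One way to enforce this rule is a small guard that rejects any feature list containing in-match columns. This is a sketch only: the column names in `IN_MATCH_COLUMNS` and the function `check_pre_match_only` are hypothetical and would need to match whatever the real tables call these fields.

```python
# Hypothetical names for columns that are only known after kickoff.
IN_MATCH_COLUMNS = {"goals", "bookings", "substitutions"}

def check_pre_match_only(feature_names):
    """Raise ValueError if any in-match column appears in the feature list."""
    leaked = sorted(set(feature_names) & IN_MATCH_COLUMNS)
    if leaked:
        raise ValueError(f"in-match columns in feature list: {leaked}")
    return list(feature_names)

# Feature names taken from the Phase 1 list.
features = check_pre_match_only([
    "fifa_ranking_differential",
    "recent_form_last10",
    "head_to_head_win_pct",
])

# A list that includes an in-match column is rejected.
try:
    check_pre_match_only(["recent_form_last10", "goals"])
    rejected = False
except ValueError:
    rejected = True
```

Running a check like this before training turns an easy-to-miss leakage mistake into an immediate error.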
Christian Lira Gonzalez · Analytics Engineer · Machine Learning · Founder @ Orpheus Analytics