FIFA World Cup 2026 Match Outcome Predictor

Christian Lira Gonzalez · XGBoost · Python · scikit-learn · SHAP

A machine learning model predicting FIFA World Cup 2026 match outcomes using gradient boosted trees trained on historical tournament data (1930–2022), FIFA world rankings, and engineered features from 28,982 competitive international matches. Predictions are published before the tournament opens and tracked live across all 104 matches.

Tech Stack

Business Question

Can historical FIFA data and current team metrics predict match outcomes with meaningful accuracy across 104 World Cup 2026 games?

Model Architecture

Data Collection
├── Fjelstul World Cup Database (1930–2022, 27 tables, SQLite)
├── International Football Results martj42/Kaggle (28,982 competitive matches)
└── FIFA World Rankings cashncarry/Kaggle (April 2026 snapshot)
        ↓
Feature Engineering (SQL + pandas)
├── fifa_ranking_differential     — home rank minus away rank
├── recent_form_last10            — rolling win rate, last 10 competitive matches
├── head_to_head_win_pct          — all prior competitive matchups
├── wc_historical_win_pct         — World Cup win rate (1930–2022)
├── host_advantage                — US / Canada / Mexico flag
├── qualifying_goal_difference    — avg goal diff in qualifying campaign
└── home_advantage                — home vs neutral venue flag
        ↓
Model Training
└── XGBoost classifier
    Train:  1930–2018 World Cup data
    Test:   2022 World Cup (temporal validation — no leakage)
    Target: Win / Draw / Loss (3-class)
    Output: Probability per outcome
    Explainability: SHAP feature importance
        ↓
Predictions + Live Tracking
├── Group stage — published before June 11
├── Knockout stage — updated as bracket forms
├── Accuracy tracker — updated after each matchday
└── Final accuracy report — July 20
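
The 3-class target and the per-outcome probability output in the diagram above can be illustrated with a small sketch. The class order, the function names, and the example probabilities are assumptions made for illustration; the project's notebooks may encode labels differently.

```python
from typing import Sequence

# Class order is an assumption for this sketch, seen from the
# first-listed (home) team's perspective.
CLASSES = ("win", "draw", "loss")


def label_outcome(home_goals: int, away_goals: int) -> int:
    """Encode a final score as an index into CLASSES."""
    if home_goals > away_goals:
        return 0  # win
    if home_goals == away_goals:
        return 1  # draw
    return 2  # loss


def predicted_class(probs: Sequence[float]) -> str:
    """Return the most likely outcome from one row of class probabilities."""
    if len(probs) != len(CLASSES):
        raise ValueError("expected one probability per class")
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best]


# Illustrative probabilities only, not model output.
example_probs = [0.52, 0.27, 0.21]
print(label_outcome(2, 1), predicted_class(example_probs))
```

Keeping the full probability row alongside the predicted class makes it possible to report confidence per match, not only a single pick.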

Live Results

Stage	Matches	Accuracy	Notes
Group Stage	72	TBD	Published May 31
Round of 32	16	TBD	Updated June 27
Knockout (Round of 16 to Final)	16	TBD	Updated July 1
Final	—	TBD	July 20 report

Predictions made before the tournament. Results logged after each matchday.
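
One way the per-stage accuracy column could be filled in after each matchday is sketched below. The row layout (`stage`, `predicted`, `actual`) is a hypothetical schema, not necessarily the columns of `accuracy_tracker.csv`, and the rows are placeholders rather than real results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_stage(rows: Iterable[Dict[str, str]]) -> Dict[str, Tuple[int, int, float]]:
    """Return {stage: (correct, total, accuracy)} for logged predictions."""
    counts = defaultdict(lambda: [0, 0])  # stage -> [correct, total]
    for row in rows:
        tally = counts[row["stage"]]
        tally[0] += int(row["predicted"] == row["actual"])
        tally[1] += 1
    return {stage: (c, n, c / n) for stage, (c, n) in counts.items()}


# Placeholder rows for illustration only.
logged = [
    {"stage": "Group Stage", "predicted": "win", "actual": "win"},
    {"stage": "Group Stage", "predicted": "draw", "actual": "loss"},
    {"stage": "Round of 32", "predicted": "loss", "actual": "loss"},
]
summary = accuracy_by_stage(logged)
for stage, (correct, total, acc) in summary.items():
    print(f"{stage}: {correct}/{total} = {acc:.2f}")
```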

Features

# Phase 1: core model (build now)
features = [
    'fifa_ranking_differential',   # strongest signal
    'recent_form_last10',          # rolling win rate, last 10 competitive matches
    'head_to_head_win_pct',        # historical matchup record
    'wc_historical_win_pct',       # World Cup performance 1930–2022
    'host_advantage',              # US / Canada / Mexico boost
    'qualifying_goal_difference',  # attacking strength proxy
    'home_advantage',              # home vs neutral venue flag
]

# Phase 2: squad-level features (after baseline model)
phase_2 = [
    'avg_wc_experience',    # prior World Cup appearances per squad member
    'squad_continuity',     # % players returning from last WC squad
    'star_player_present',  # players with 5+ World Cup goals on roster
    'knockout_win_rate',    # knockout-stage-only historical win rate
    'shootout_win_rate',    # penalty shootout record
]
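
As a rough sketch of how one fixture could be turned into a model input using the Phase 1 list, the helpers below compute the ranking differential (home rank minus away rank) and a host flag, then emit values in a fixed column order. The helper names, the team-name spellings, and the example values are assumptions for illustration.

```python
from typing import Dict, List

FEATURES = [
    "fifa_ranking_differential",
    "recent_form_last10",
    "head_to_head_win_pct",
    "wc_historical_win_pct",
    "host_advantage",
    "qualifying_goal_difference",
    "home_advantage",
]

# Team-name spellings are an assumption; match them to the fixtures file.
HOSTS = {"United States", "Canada", "Mexico"}


def ranking_differential(home_rank: int, away_rank: int) -> int:
    """Home rank minus away rank; negative means the home side is ranked better."""
    return home_rank - away_rank


def host_flag(team: str) -> int:
    """1 if the team is one of the three host nations, else 0."""
    return int(team in HOSTS)


def to_feature_vector(row: Dict[str, float]) -> List[float]:
    """Order a feature dict by FEATURES, failing loudly on missing columns."""
    missing = [name for name in FEATURES if name not in row]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(row[name]) for name in FEATURES]


# Placeholder values for illustration only.
row = {
    "fifa_ranking_differential": ranking_differential(12, 30),
    "recent_form_last10": 0.6,
    "head_to_head_win_pct": 0.5,
    "wc_historical_win_pct": 0.4,
    "host_advantage": host_flag("Mexico"),
    "qualifying_goal_difference": 1.2,
    "home_advantage": 1,
}
vector = to_feature_vector(row)
print(vector)
```

Fixing the column order in one place helps keep training and prediction inputs aligned.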

Project Structure

wc2026-match-predictor/
├── data/
│   ├── raw/                         # gitignored; see data/raw/README.md to download
│   │   ├── worldcup.db              # Fjelstul WC database (1930–2022)
│   │   ├── fifa_ranking/            # FIFA rankings CSV (cashncarry/Kaggle)
│   │   └── ifr/                     # International results CSVs (martj42/Kaggle)
│   ├── processed/                   # Feature-engineered datasets (committed)
│   │   ├── competitive_results.csv  # 28,982 filtered competitive matches
│   │   └── wc2026_fixtures.csv      # 72 WC2026 group-stage fixtures
│   └── live/                        # Updated during tournament (committed)
│       ├── wc2026_results.csv       # Match results logged after each game
│       └── accuracy_tracker.csv     # Running model accuracy by matchday
├── notebooks/
│   ├── 01_data_exploration.ipynb    # EDA across all data sources
│   ├── 02_feature_engineering.ipynb # SQL + pandas feature pipeline
│   ├── 03_model_training.ipynb      # XGBoost + temporal validation
│   ├── 04_predictions.ipynb         # Group stage + knockout predictions
│   └── 05_update_predictions.ipynb  # Run after each matchday during tournament
├── models/
│   └── xgb_model.pkl               # Trained model (gitignored)
├── app/
│   ├── app.py                       # Streamlit app: pick teams, get prediction
│   └── utils.py                     # Feature engineering helpers
├── outputs/
│   ├── group_stage/                 # Group stage prediction charts
│   ├── knockout/                    # Knockout bracket predictions
│   ├── shap/                        # SHAP feature importance plots
│   └── accuracy/                    # Running accuracy charts
├── requirements.txt
└── README.md

Data Sources

Fjelstul World Cup Database: Fjelstul, Joshua C. "The Fjelstul World Cup Database v.1.2.0." July 19, 2023. https://github.com/jfjelstul/worldcup (License: CC-BY-SA 4.0)

International Football Results (1872–2026): Kaggle dataset martj42/international-football-results-from-1872-to-2017, https://kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017

FIFA World Rankings: Kaggle dataset cashncarry/fifaworldranking, https://kaggle.com/datasets/cashncarry/fifaworldranking

Setup

git clone https://github.com/ChristianLG2/WorldCup2026-Match-Predictor.git
cd WorldCup2026-Match-Predictor
pip install -r requirements.txt

Download raw data files per data/raw/README.md, then run notebooks in order (01 → 05).

To launch the Streamlit app locally:

streamlit run app/app.py

Design Decisions

Why SQLite over PostgreSQL? The dataset is ~1,200 WC matches and 28,982 competitive results, small enough that SQLite handles all joins and aggregations instantly. SQL is used for feature engineering (window functions, CTEs, GROUP BY), and pandas for modeling and visualization.
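
A minimal sketch of the kind of window-function query this refers to, computing a rolling win rate in the spirit of `recent_form_last10`. The table `team_matches(team, match_date, won)` and its rows are hypothetical, and the project's actual schema may differ. The window frame ends one row before the current match, so a match never contributes to its own feature. Window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per team per match, won is 0 or 1.
conn.execute("CREATE TABLE team_matches (team TEXT, match_date TEXT, won INTEGER)")
conn.executemany(
    "INSERT INTO team_matches VALUES (?, ?, ?)",
    [
        ("A", "2025-01-01", 1),
        ("A", "2025-02-01", 0),
        ("A", "2025-03-01", 1),
        ("B", "2025-01-15", 0),
        ("B", "2025-02-15", 0),
    ],
)

QUERY = """
WITH form AS (
    SELECT team,
           match_date,
           AVG(won) OVER (
               PARTITION BY team
               ORDER BY match_date
               ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING
           ) AS recent_form_last10
    FROM team_matches
)
SELECT team, match_date, recent_form_last10
FROM form
ORDER BY team, match_date
"""

form_rows = conn.execute(QUERY).fetchall()
for team, match_date, form in form_rows:
    # form is None for a team's first match: there is no prior history.
    print(team, match_date, form)
conn.close()
```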

Why temporal validation? The model trains on 1930–2018 and tests on 2022 to simulate real prediction conditions. A random train/test split on match data leaks future information: 2022 matches could land in the training set while earlier matches are held out, so the model would be scored on games that predate what it learned from.
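
The split itself can be sketched in a few lines of plain Python. The `year` and `match_id` keys and the rows are placeholders for illustration; in the notebooks the same filter would more likely be applied to a pandas DataFrame.

```python
from typing import Dict, List, Tuple

Match = Dict[str, int]


def temporal_split(matches: List[Match], test_year: int = 2022) -> Tuple[List[Match], List[Match]]:
    """Train on everything before test_year, test on test_year only."""
    train = [m for m in matches if m["year"] < test_year]
    test = [m for m in matches if m["year"] == test_year]
    return train, test


# Placeholder rows for illustration only.
matches = [
    {"match_id": 1, "year": 2014},
    {"match_id": 2, "year": 2018},
    {"match_id": 3, "year": 2018},
    {"match_id": 4, "year": 2022},
    {"match_id": 5, "year": 2022},
]
train, test = temporal_split(matches)

# Every training match predates every test match, so nothing from the
# test tournament can reach the model during fitting.
latest_train = max(m["year"] for m in train)
earliest_test = min(m["year"] for m in test)
print(len(train), len(test), latest_train < earliest_test)
```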

Why not in-match features? Goals, bookings, and substitutions from the match being predicted are excluded because they're unknown before kickoff. Using them would be data leakage regardless of how they're aggregated.
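
A simple guard along these lines could catch in-match columns before they reach training. The column names in `IN_MATCH_COLUMNS` are hypothetical and would need to match the actual dataset.

```python
from typing import Iterable, List

# Hypothetical names for columns only known after kickoff.
IN_MATCH_COLUMNS = {"home_goals", "away_goals", "bookings", "substitutions"}


def check_pre_kickoff(feature_names: Iterable[str]) -> List[str]:
    """Return the feature list unchanged, or raise if it has in-match columns."""
    names = list(feature_names)
    leaked = sorted(set(names) & IN_MATCH_COLUMNS)
    if leaked:
        raise ValueError(f"in-match columns used as features: {leaked}")
    return names


safe = check_pre_kickoff(["fifa_ranking_differential", "recent_form_last10"])
print(safe)
```

Pre-tournament aggregates such as `qualifying_goal_difference` pass this check, since they are known before kickoff.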

Author

Christian Lira Gonzalez · Analytics Engineer · Machine Learning · Founder @ Orpheus Analytics
