ChristianLG2/WorldCup2026-Match-Predictor

FIFA World Cup 2026 Match Outcome Predictor

Christian Lira Gonzalez · XGBoost · Python · scikit-learn · SHAP

A machine learning model that predicts FIFA World Cup 2026 match outcomes using gradient-boosted trees trained on historical tournament data (1930–2022), FIFA world rankings, and features engineered from 28,982 competitive international matches. Predictions are published before the tournament opens and tracked live across all 104 matches.

Tech Stack

Python · XGBoost · scikit-learn · pandas · matplotlib · SHAP · SQL · SQLite · Streamlit · Quarto


Business Question

Can historical FIFA data and current team metrics predict match outcomes with meaningful accuracy across 104 World Cup 2026 games?


Model Architecture

Data Collection
├── Fjelstul World Cup Database (1930–2022, 27 tables, SQLite)
├── International Football Results martj42/Kaggle (28,982 competitive matches)
└── FIFA World Rankings cashncarry/Kaggle (April 2026 snapshot)
        ↓
Feature Engineering (SQL + pandas)
├── fifa_ranking_differential     — home rank minus away rank
├── recent_form_last10            — rolling win rate, last 10 competitive matches
├── head_to_head_win_pct          — all prior competitive matchups
├── wc_historical_win_pct         — World Cup win rate (1930–2022)
├── host_advantage                — US / Canada / Mexico flag
├── qualifying_goal_difference    — avg goal diff in qualifying campaign
└── home_advantage                — home vs neutral venue flag
        ↓
Model Training
└── XGBoost classifier
    Train:  1930–2018 World Cup data
    Test:   2022 World Cup (temporal validation — no leakage)
    Target: Win / Draw / Loss (3-class)
    Output: Probability per outcome
    Explainability: SHAP feature importance
        ↓
Predictions + Live Tracking
├── Group stage — published before June 11
├── Knockout stage — updated as bracket forms
├── Accuracy tracker — updated after each matchday
└── Final accuracy report — July 20

Live Results

| Stage | Matches | Accuracy | Notes |
|---|---|---|---|
| Group Stage | 72 | TBD | Published May 31 |
| Round of 32 | 16 | TBD | Updated June 27 |
| Knockout (Round of 16 onward) | 15 | TBD | Updated July 1 |
| Final | 1 | TBD | July 20 report |

Predictions made before the tournament. Results logged after each matchday.
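
One way the per-stage accuracy figure could be computed once results are logged is sketched below. The record layout and the `accuracy_by_stage` helper are hypothetical, and the probabilities are made-up values rather than model output.

```python
from collections import defaultdict

# Hypothetical records: predicted probabilities plus the logged result.
records = [
    {"stage": "Group Stage", "probs": {"win": 0.6, "draw": 0.25, "loss": 0.15}, "actual": "win"},
    {"stage": "Group Stage", "probs": {"win": 0.3, "draw": 0.3, "loss": 0.4}, "actual": "draw"},
    {"stage": "Round of 32", "probs": {"win": 0.2, "draw": 0.3, "loss": 0.5}, "actual": "loss"},
]

def accuracy_by_stage(rows):
    """Share of matches per stage where the most probable outcome occurred."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        # The prediction is the outcome with the highest probability.
        predicted = max(row["probs"], key=row["probs"].get)
        totals[row["stage"]] += 1
        hits[row["stage"]] += predicted == row["actual"]
    return {stage: hits[stage] / totals[stage] for stage in totals}

tracker = accuracy_by_stage(records)
print(tracker)  # {'Group Stage': 0.5, 'Round of 32': 1.0}
```

Scoring only the most probable outcome discards the probabilities themselves, so a probability-aware metric such as log loss may be worth tracking alongside it.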


Features

# Phase 1: core model (build now)
features = [
    'fifa_ranking_differential',   # strongest signal
    'recent_form_last10',          # rolling win rate, last 10 competitive matches
    'head_to_head_win_pct',        # historical matchup record
    'wc_historical_win_pct',       # World Cup performance 1930–2022
    'host_advantage',              # US / Canada / Mexico boost
    'qualifying_goal_difference',  # attacking strength proxy
    'home_advantage',              # home vs neutral venue flag
]

# Phase 2: squad-level features (after baseline model)
phase_2 = [
    'avg_wc_experience',    # prior World Cup appearances per squad member
    'squad_continuity',     # % players returning from last WC squad
    'star_player_present',  # players with 5+ World Cup goals on roster
    'knockout_win_rate',    # knockout-stage-only historical win rate
    'shootout_win_rate',    # penalty shootout record
]
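
As an illustration of how a feature like `recent_form_last10` might be built in pandas, the sketch below computes a rolling win rate per team. The long-format table, its column names, and the `add_recent_form` helper are assumptions for this example, not the repository's actual schema. The `shift(1)` keeps each row's own result out of its feature, so the value reflects only matches played earlier.

```python
import pandas as pd

# Hypothetical long-format results: one row per team per match.
matches = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01",
        "2024-01-15", "2024-02-15",
    ]),
    "team": ["Mexico", "Mexico", "Mexico", "Mexico", "Canada", "Canada"],
    "won": [1, 0, 1, 1, 0, 1],  # 1 = win, 0 = draw or loss
})

def add_recent_form(df: pd.DataFrame, window: int = 10) -> pd.DataFrame:
    """Add a rolling win rate that uses only matches played before each row."""
    out = df.sort_values(["team", "date"]).copy()
    out["recent_form_last10"] = (
        out.groupby("team")["won"]
        # shift(1) drops the current match; rolling() averages up to `window` prior ones.
        .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )
    return out

form = add_recent_form(matches)
print(form)
```

A team's first match has no history, so its value is NaN and needs an explicit fill rule or a model that tolerates missing values.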

Project Structure

wc2026-match-predictor/
├── data/
│   ├── raw/                         # gitignored; see data/raw/README.md to download
│   │   ├── worldcup.db              # Fjelstul WC database (1930–2022)
│   │   ├── fifa_ranking/            # FIFA rankings CSV (cashncarry/Kaggle)
│   │   └── ifr/                     # International results CSVs (martj42/Kaggle)
│   ├── processed/                   # Feature-engineered datasets (committed)
│   │   ├── competitive_results.csv  # 28,982 filtered competitive matches
│   │   └── wc2026_fixtures.csv      # 72 WC2026 match fixtures
│   └── live/                        # Updated during tournament (committed)
│       ├── wc2026_results.csv       # Match results logged after each game
│       └── accuracy_tracker.csv     # Running model accuracy by matchday
├── notebooks/
│   ├── 01_data_exploration.ipynb    # EDA across all data sources
│   ├── 02_feature_engineering.ipynb # SQL + pandas feature pipeline
│   ├── 03_model_training.ipynb      # XGBoost + temporal validation
│   ├── 04_predictions.ipynb         # Group stage + knockout predictions
│   └── 05_update_predictions.ipynb  # Run after each matchday during tournament
├── models/
│   └── xgb_model.pkl               # Trained model (gitignored)
├── app/
│   ├── app.py                       # Streamlit app: pick teams, get prediction
│   └── utils.py                     # Feature engineering helpers
├── outputs/
│   ├── group_stage/                 # Group stage prediction charts
│   ├── knockout/                    # Knockout bracket predictions
│   ├── shap/                        # SHAP feature importance plots
│   └── accuracy/                    # Running accuracy charts
├── requirements.txt
└── README.md

Data Sources

Fjelstul World Cup Database: Fjelstul, Joshua C. "The Fjelstul World Cup Database v.1.2.0." July 19, 2023. https://github.com/jfjelstul/worldcup (License: CC-BY-SA 4.0)

International Football Results (1872–2026): Kaggle, martj42/international-football-results-from-1872-to-2017. https://kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017

FIFA World Rankings: Kaggle, cashncarry/fifaworldranking. https://kaggle.com/datasets/cashncarry/fifaworldranking


Setup

git clone https://github.com/ChristianLG2/WorldCup2026-Match-Predictor.git
cd WorldCup2026-Match-Predictor
pip install -r requirements.txt

Download raw data files per data/raw/README.md, then run notebooks in order (01 → 05).

To launch the Streamlit app locally:

streamlit run app/app.py

Design Decisions

Why SQLite over PostgreSQL? The dataset is ~1,200 WC matches and 28,982 competitive results, small enough that SQLite handles all joins and aggregations instantly. SQL is used for feature engineering (window functions, CTEs, GROUP BY) and pandas for modeling and visualization.
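
To show the kind of window-function query this refers to, here is a small sketch that runs in an in-memory SQLite database. The `team_matches` table and its columns are hypothetical, and the query assumes SQLite 3.25 or newer, which is the first version with window-function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per team per match.
conn.execute("CREATE TABLE team_matches (team TEXT, match_date TEXT, won INTEGER)")
conn.executemany(
    "INSERT INTO team_matches VALUES (?, ?, ?)",
    [
        ("Mexico", "2024-01-01", 1),
        ("Mexico", "2024-02-01", 0),
        ("Mexico", "2024-03-01", 1),
        ("Canada", "2024-01-15", 0),
        ("Canada", "2024-02-15", 1),
    ],
)

# Average the previous (up to) 10 results per team, excluding the current match.
query = """
SELECT
    team,
    match_date,
    AVG(won) OVER (
        PARTITION BY team
        ORDER BY match_date
        ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING
    ) AS recent_form_last10
FROM team_matches
ORDER BY team, match_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The frame ends at `1 PRECEDING`, so a team's first match yields NULL (`None` in Python) instead of a value computed from its own result.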

Why temporal validation? The model trains on 1930–2018 and tests on 2022 to simulate real prediction conditions. A random train/test split on match data leaks future information: 2022 matches would land in the training set, so the model would be scored on earlier matches after already seeing later results.
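
A temporal split of this kind takes only a few lines of pandas. The table below is a hypothetical stand-in with made-up values, and the `year` and `outcome` column names are assumptions.

```python
import pandas as pd

# Hypothetical match table; real rows would carry all engineered features.
wc = pd.DataFrame({
    "year": [2010, 2014, 2018, 2018, 2022, 2022],
    "fifa_ranking_differential": [-5, 12, -20, 3, 7, -1],
    "outcome": ["win", "loss", "win", "draw", "draw", "loss"],
})

TEST_YEAR = 2022
train = wc[wc["year"] < TEST_YEAR]   # 1930–2018 in the real dataset
test = wc[wc["year"] == TEST_YEAR]   # held-out 2022 tournament

# Guard against leakage: every training match must predate every test match.
assert train["year"].max() < test["year"].min()
print(len(train), "training rows,", len(test), "test rows")
```

The assertion is a cheap check that stays valid if the cutoff year or the underlying data changes later.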

Why not in-match features? Goals, bookings, and substitutions from the match being predicted are excluded because they are unknown before kickoff. Using them in any aggregated form would be data leakage.


Author

Christian Lira Gonzalez · Analytics Engineer · Machine Learning · Founder @ Orpheus Analytics

LinkedIn · Portfolio · Orpheus

About

XGBoost match outcome predictor for FIFA World Cup 2026: Win/Draw/Loss probabilities across all 104 matches using FIFA rankings, recent form, H2H records, and squad features. Predictions tracked live June 11 – July 19.
