This repository builds a custom dataset of NBA play-by-play (pbp) logs and finetunes T5-Gemma small model to generate concise but engaging quarter summaries of an NBA game. The pipeline covers scraping, preprocessing, and supervised fine-tuning.
NBA_pbp_scraper.ipynb
Scrapes official NBA play-by-play logs and stores actions (shots, fouls, turnovers, rebounds, etc.) in structured form.pbp_preprocessing.ipynb
Cleans and formats scraped data, splits games by quarter, and pairs quarter pbp text with a ground-truth summary in a kaggle datasetfinetuning_nba.ipynb
Finetunesgoogle/t5-gemma-smallusing Hugging Face training utilities with experiment tracking via Weights & Biases
- Source: NBA Quarter Play Summaries (custom-built)
- Granularity: Per quarter
- Link: https://www.kaggle.com/datasets/shrishtiroy6/nba-quarter-play-summaries
- Format example:
- Input:
Time TIMBERWOLVES Score Lead Warriors 12:00 Start of Period (9:16 PM) 12:00 Possession: Timberwolves 30-23 +7 11:44 MISS N.Alexander-Walker 25' 3PT Jump Shot 11:41 Q.Post REBOUND 11:34 30-25 +5 J.ButlerIII Driving Layup 11:34 N.Reid S.FOUL (P1, T1) (S.Wright) 11:34 30-26 +4 J.ButlerIII Free Throw 1 of 1 - Output:
The first quarter was tightly contested, with the Warriors leading thanks to strong three-point shooting from Curry and Hield.
- Input:
- Base model:
google/t5-gemma-small(60M params) - Uses LoRA finetuning and 4-bit quantization to compress model weights
- Typical settings used:
- Per-device batch size: 1 (with gradient accumulation)
- Optimizer:
adamw_torch_8bit - Mixed precision:
bf16 - Logging: Weights & Biases
- GPU: Tesla P100
- Development goal: Overfit small examples to validate training loop, then scale to full dataset and evaluate with ROUGE/BLEU.