Advanced crime forecasting system using Apache Spark ML Pipelines with comprehensive geospatial and fairness analysis
This project implements a production-ready machine learning system for beat-level violent crime prediction in Chicago, processing over 8 million crime events (2001-present). The system combines advanced time-series forecasting with fairness-aware algorithms and multi-scale geospatial analysis.
- Beat-level crime prediction using Gradient Boosted Trees (GBT) regressor
- Advanced time-series feature engineering (lag features, rolling averages, seasonal indicators)
- Multi-scale geospatial analysis (Block → Beat → District hierarchy)
- Fairness-aware ML with bias detection and mitigation strategies
- Comprehensive temporal & distribution analysis
- Production-ready Spark ML Pipelines
Chicago Crime Data (8M+ events)
↓
[Apache Spark]
↓
┌────────────────────────────┐
│ Data Processing Pipeline │
└────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 1. Temporal Analysis │
│ 2. Geospatial Analysis │
│ 3. Arrest Pattern Analysis │
│ 4. Crime Prediction Model │
│ 5. Bias/Fairness Analysis │
└─────────────────────────────────────────┘
↓
[Insights & Predictions]
chicago-crime-prediction/
│
├── temporal_analysis.py # Comprehensive temporal & distribution analysis
├── geospatial_analysis.py # Multi-scale spatial analysis (Block/Beat/District)
├── arrest_pattern_analysis.py # Arrest patterns by time/location
├── crime_prediction_model.py # Beat-level violent crime forecasting
├── bias_fairness_analysis.py # Fairness metrics & bias detection
│
├── README.md # This file
├── requirements.txt # Python dependencies
└── outputs/ # Generated visualizations & results
- Apache Spark 3.x
- Python 3.8+
- HDFS access (or modify paths for local filesystem)
# Clone the repository
git clone https://github.com/yourusername/chicago-crime-prediction.git
cd chicago-crime-prediction
# Install dependencies
pip install pyspark pandas matplotlib seaborn numpy
# 1. Temporal and distribution analysis
spark-submit temporal_analysis.py
# 2. Geospatial analysis (block/beat/district)
spark-submit geospatial_analysis.py
# 3. Arrest pattern analysis
spark-submit arrest_pattern_analysis.py
# 4. Crime prediction model
spark-submit crime_prediction_model.py
# 5. Bias and fairness analysis
spark-submit bias_fairness_analysis.py
Comprehensive exploratory data analysis of crime patterns over time.
Features:
- Monthly, hourly, daily, and yearly crime trends
- Distribution by district and crime type
- Violent vs non-violent crime patterns
- Statistical summaries and peak detection
Outputs:
- Distribution visualizations (district, crime type)
- Temporal pattern charts (month/hour/day/year)
- Violent vs non-violent cross-analysis
- Statistical summary report
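As an illustration of the hourly and monthly trends and the peak detection listed above, the following plain-Python sketch counts events per hour and per month and picks the busiest hour. The sample timestamps, the `DATE_FORMAT` string, and the helper name `count_by` are assumptions made for illustration; the project scripts themselves run on Spark.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout (MM/DD/YYYY hh:mm:ss AM/PM); adjust to the real data.
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Hypothetical sample timestamps, not real records.
sample_dates = [
    "01/15/2019 11:30:00 PM",
    "01/16/2019 11:05:00 PM",
    "02/01/2019 09:00:00 AM",
    "07/04/2019 11:45:00 PM",
]


def count_by(dates, attribute):
    """Count events grouped by a datetime attribute such as 'hour' or 'month'."""
    return Counter(
        getattr(datetime.strptime(d, DATE_FORMAT), attribute) for d in dates
    )


hourly = count_by(sample_dates, "hour")
monthly = count_by(sample_dates, "month")

# Peak detection here is simply the most frequent bucket.
peak_hour, peak_count = hourly.most_common(1)[0]
print(f"Peak hour: {peak_hour}:00 with {peak_count} events")
```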
Multi-scale spatial analysis across Chicago's geographic hierarchy.
Features:
- Block-level analysis: Top crime hotspots at street level
- Beat correlation: Spatial relationships between adjacent police beats
- District comparison: Crime patterns across police districts by mayoral administration
Outputs:
- Top 10 crime blocks (2019-present)
- Adjacent beat crime correlations
- Daley vs Emanuel administration crime comparison (t-statistic)
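The administration comparison reports a t-statistic. The text does not say which variant is used, so the sketch below assumes Welch's unequal-variance form. The function name `welch_t` and the monthly counts are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    standard_error = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical monthly crime counts for one district under each administration.
daley_counts = [120, 130, 125, 135]
emanuel_counts = [100, 110, 105, 115]

t_stat = welch_t(daley_counts, emanuel_counts)
print(f"t-statistic: {t_stat:.3f}")
```

A large absolute value suggests the two periods differ by more than sampling noise would explain, although a significance claim would also need the degrees of freedom and a p-value.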
Temporal patterns of crime arrests across different time dimensions.
Features:
- Arrest rates by month, hour of day, and day of week
- Visualization of arrest patterns
- Seasonal arrest trends
Outputs:
- Monthly arrest pattern chart
- Hourly arrest pattern chart
- Day-of-week arrest pattern chart
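The arrest rate behind each chart can be read as arrests divided by total incidents within a time bucket. A minimal plain-Python sketch of that grouping follows. The record layout and the helper name `arrest_rate_by` are assumptions, and the sample records are invented.

```python
from collections import defaultdict


def arrest_rate_by(records, key):
    """Return {group: arrests / total incidents} for the given grouping key."""
    totals = defaultdict(int)
    arrests = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        if record["arrest"]:
            arrests[record[key]] += 1
    return {group: arrests[group] / totals[group] for group in totals}


# Hypothetical incidents: hour of day plus the Arrest flag.
records = [
    {"hour": 22, "arrest": True},
    {"hour": 22, "arrest": False},
    {"hour": 22, "arrest": False},
    {"hour": 9, "arrest": True},
]

rates = arrest_rate_by(records, "hour")
print(rates)
```

The same helper works for month or day of week by passing a different key.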
Production-ready ML pipeline for next-week violent crime forecasting at beat level.
Features:
- Time-series features:
  - Lag features (1-3 weeks)
  - Rolling averages (3-4 week windows)
  - Seasonal indicators (summer, winter, holidays)
  - Crime trend (rate of change)
- Model: Gradient Boosted Trees (GBT) Regressor
- Spatial granularity: Police beat level
- Prediction horizon: 1 week ahead
Model Architecture:
Input Features (21 total):
├── Categorical: Beat, Year, Week (one-hot encoded)
├── Aggregated: TotalCrimes, ArrestRate, DomesticRate, percent_violent_crimes
├── Lag Features: total_last_week, total_2weeks_ago, total_3weeks_ago,
│ violent_last_week, violent_2weeks_ago
├── Rolling Avg: total_crimes_rolling_avg_3wk, violent_crimes_rolling_avg_3wk,
│ total_crimes_rolling_avg_4wk
├── Seasonal: is_summer, is_winter, is_holiday_season
└── Trend: crime_trend
↓
[GBT Regressor]
↓
violent_crimes_next_week (prediction)
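To make the lag, rolling-average, and trend features concrete, here is a plain-Python sketch for a single beat's weekly totals. Two details are assumptions rather than facts from this project: the rolling average covers the three weeks before the current one, and `crime_trend` is the relative change between the two most recent prior weeks. The function name `add_lag_and_rolling` is hypothetical.

```python
def add_lag_and_rolling(weekly_counts):
    """Build lag, rolling-average, and trend features for one beat.

    weekly_counts is ordered oldest to newest. Weeks without three full
    weeks of history are skipped, since their lag values are undefined.
    """
    rows = []
    for i in range(3, len(weekly_counts)):
        last = weekly_counts[i - 1]
        two_ago = weekly_counts[i - 2]
        three_ago = weekly_counts[i - 3]
        rows.append({
            "total_crimes": weekly_counts[i],
            "total_last_week": last,
            "total_2weeks_ago": two_ago,
            "total_3weeks_ago": three_ago,
            # Assumed window: the three weeks preceding the current one.
            "total_crimes_rolling_avg_3wk": (last + two_ago + three_ago) / 3,
            # Assumed definition: relative week-over-week change.
            "crime_trend": (last - two_ago) / two_ago if two_ago else 0.0,
        })
    return rows


# Hypothetical weekly totals for one beat.
rows = add_lag_and_rolling([10, 12, 14, 16, 20])
for row in rows:
    print(row)
```

Using only past weeks in every feature keeps information from the target week out of the inputs.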
Performance Metrics:
- RMSE (Root Mean Squared Error)
- MAE (Mean Absolute Error)
- R² Score
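For reference, the three metrics can be computed from actual and predicted values as in the sketch below. This is a plain-Python illustration with invented numbers, not the project's evaluation code, and the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Compute RMSE, MAE, and R² for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical weekly violent-crime counts and predictions.
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(metrics)
```

In a Spark pipeline the same quantities are available from `pyspark.ml.evaluation.RegressionEvaluator` with `metricName` set to `"rmse"`, `"mae"`, or `"r2"`.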
Fairness analysis that checks whether predictions and arrest patterns are equitable across geographic groups, using District as a demographic proxy.
Features:
- Protected attribute identification (District as demographic proxy)
- Bias detection in arrest rates and crime patterns
- Fairness metrics computation
- Prediction error analysis across geographic groups
Fairness Metrics:
- Disparate Impact Ratio
- Mean Absolute Error by group
- Prediction error distributions
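The sketch below shows one way the first two metrics could be computed. It assumes the disparate impact ratio is a group's rate divided by a reference group's rate, and that prediction rows carry a group label. The function names, group labels, and numbers are all hypothetical.

```python
from collections import defaultdict


def disparate_impact(rate_group, rate_reference):
    """Ratio of an outcome rate in one group to the rate in a reference group."""
    return rate_group / rate_reference


def mae_by_group(rows):
    """rows: iterable of (group, actual, predicted). Returns {group: MAE}."""
    abs_errors = defaultdict(list)
    for group, actual, predicted in rows:
        abs_errors[group].append(abs(actual - predicted))
    return {group: sum(errs) / len(errs) for group, errs in abs_errors.items()}


# Hypothetical predictions for two district groups.
rows = [
    ("group_a", 10, 12),
    ("group_a", 8, 7),
    ("group_b", 5, 5),
    ("group_b", 6, 7),
]
group_mae = mae_by_group(rows)

# Hypothetical arrest rates for the two groups.
di = disparate_impact(0.30, 0.40)
print(group_mae, round(di, 2))
```

A ratio far from 1.0, or a large gap in per-group MAE, flags a disparity worth investigating. A commonly cited rule of thumb treats ratios below 0.8 as a warning sign.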
Mitigation Strategies:
- Sample re-weighting
- Fairness-constrained optimization
- Post-processing calibration
- Regular fairness audits
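Of the strategies above, sample re-weighting is the simplest to illustrate. The sketch assumes a balancing scheme in which every group contributes the same total weight, computed as `n / (k * n_group)`. This scheme and the function name `balanced_weights` are illustrative choices, not necessarily what the project uses.

```python
from collections import Counter


def balanced_weights(groups):
    """Per-sample weights so that each group has equal total weight.

    Weight for a sample in group g is n / (k * n_g), where n is the number
    of samples, k the number of groups, and n_g the size of group g.
    """
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[group]) for group in groups]


# Hypothetical group labels: three samples from one group, one from another.
weights = balanced_weights(["north", "north", "north", "south"])
print(weights)
```

The weights sum to the number of samples, so the overall scale of the loss is unchanged while under-represented groups count for more.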
Outputs:
- Fairness metrics by district group
- Prediction error visualizations
- Bias analysis report
- Data Loading: 8M+ crime records from HDFS
- Preprocessing: Date parsing, feature engineering, aggregation
- Feature Engineering: 21 features including temporal, spatial, and statistical
- Model Training: 80/20 train-test split with time-based ordering
- Evaluation: Multiple metrics (RMSE, MAE, R²) with fairness analysis
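The time-ordered 80/20 split in the training step can be sketched as follows. The row layout, the `(year, week)` sort key, and the function name `time_based_split` are assumptions for illustration.

```python
def time_based_split(rows, train_fraction=0.8):
    """Split rows chronologically: the earliest rows train, the latest test."""
    ordered = sorted(rows, key=lambda row: (row["year"], row["week"]))
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical beat-week rows, deliberately supplied out of order.
rows = [{"year": 2020, "week": w, "violent": w % 5} for w in range(10, 0, -1)]
train, test = time_based_split(rows)
print(len(train), len(test))
```

Unlike a random split, this keeps every test row later than every training row, which mirrors how a forecast is used. With many beats per week, cutting at a week boundary rather than a row index avoids splitting one week across both sets.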
- Apache Spark: Distributed data processing
- PySpark ML: Machine learning pipelines
- Python: Data analysis and visualization
- Matplotlib/Seaborn: Visualization
- HDFS: Distributed file storage
Source: Chicago Police Department - Crimes (2001-present)
Size: 8M+ records
Key Fields:
- ID, Date, Block, Beat, District, Ward
- Primary Type, Description, Arrest, Domestic
- Coordinates (X, Y, Latitude, Longitude)
Geographic Hierarchy:
Block (Street-level)
↓ aggregates to
Beat (Police patrol zones - ~280 beats)
↓ aggregates to
District (Police districts - 25 districts)
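Because each record carries its Block, Beat, and District, the same incidents can be counted at any level of the hierarchy. A minimal sketch with invented records follows. The block strings, beat and district codes, and the helper name `rollup` are hypothetical.

```python
from collections import Counter

# Hypothetical incidents; values are illustrative, not real records.
incidents = [
    {"block": "001XX N STATE ST", "beat": "0111", "district": "001"},
    {"block": "001XX N STATE ST", "beat": "0111", "district": "001"},
    {"block": "002XX W LAKE ST", "beat": "0112", "district": "001"},
    {"block": "050XX S ASHLAND AVE", "beat": "0933", "district": "009"},
]


def rollup(records, level):
    """Count incidents at one level of the hierarchy: block, beat, or district."""
    return Counter(record[level] for record in records)


by_block = rollup(incidents, "block")
by_beat = rollup(incidents, "beat")
by_district = rollup(incidents, "district")

# The busiest block is the top entry of the block-level counts.
top_block, top_count = by_block.most_common(1)[0]
print(top_block, top_count, dict(by_district))
```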
1. Temporal Features
- Extract: Year, Month, Week, Hour, Day of Week
- Lag: 1-3 week historical crime counts
- Rolling: 3-4 week moving averages
- Seasonal: Binary indicators for summer, winter, holidays
- Trend: Week-over-week crime rate changes
2. Aggregation Features
- Total crimes per beat-week
- Violent crime counts and percentages
- Arrest rates
- Domestic incident rates
3. Categorical Encoding
- One-hot encoding for Beat, Year, Week
- Handles high cardinality (280+ beats)
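The seasonal indicators and the categorical encoding can be sketched in a few lines of plain Python. The month ranges chosen for each flag are assumptions (the text does not define them), as are the function names and the beat codes.

```python
from datetime import date


def seasonal_flags(day):
    """Binary seasonal indicators. The month ranges are assumed definitions."""
    return {
        "is_summer": int(day.month in (6, 7, 8)),
        "is_winter": int(day.month in (12, 1, 2)),
        "is_holiday_season": int(day.month in (11, 12)),
    }


def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 vector over a fixed vocabulary."""
    return [1 if value == item else 0 for item in vocabulary]


# Hypothetical beat vocabulary; the real one has roughly 280 entries.
beats = ["0111", "0112", "0113"]

flags = seasonal_flags(date(2020, 7, 4))
encoded = one_hot("0112", beats)
print(flags, encoded)
```

With about 280 beats the one-hot vector is mostly zeros, which is why sparse vector representations matter for this encoding.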
Protected Attributes: District (proxy for socioeconomic/demographic status)
Analysis Approach:
- Identify potential bias sources (policing patterns, reporting rates)
- Measure prediction disparities across groups
- Quantify fairness metrics
- Recommend mitigation strategies
- Chicago Police Department for providing open crime data
- Northwestern University for academic support
- Apache Spark community for excellent documentation