NYC Rain Crash Risk Dashboard 🌧️🚗

An interactive Streamlit dashboard visualizing heterogeneous treatment effects of rain on NYC crash risk using H3 spatial indexing and causal inference.

YouTube Demo:

Run it live: https://causalaccidents.streamlit.app/

🎯 Overview

This project analyzes how rain affects crash probability across different locations in NYC using:

Geospatial Analysis: H3 hexagonal indexing (resolution 8, ~600m cells)
Causal Inference: T-Learner with Gradient Boosting to estimate Conditional Average Treatment Effects (CATE)
Interactive Visualization: Flat Mercator map showing "kill zones" over NYC streets

For detailed information:

Complete pipeline documentation → documents/PIPELINE_SUMMARY.md

Full analysis results & business insights → documents/HETEROGENEITY_RESULTS.md

Key Findings

Rain increases crash probability by 0.10pp on average (mean CATE of 0.1013pp)
The effect in the most vulnerable zone (1.38pp) is 13.6x the citywide average
High-traffic zones show roughly 3x stronger rain sensitivity than low-traffic zones (0.19pp vs. 0.07pp); these zones carry 18x the average traffic
Top vulnerable cell: 882a100d69fffff (885 Lexington Avenue, Upper East Side)

🚀 Quick Start

Prerequisites

Python 3.12+

Running the App

cd CausalAccidents

# Create & activate virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Start the app
streamlit run src/app.py

The app will open at http://localhost:8501

📊 Features

The dashboard has two main views accessible via sidebar:

🗺️ Map View (Default)

Interactive NYC map with all 1,135 H3 hexagons

Clean Visualization: All hexagons display with black borders over a street map
Selective Highlighting: Top N highest-risk zones (adjustable 10-200) filled with translucent gray
Google Maps-like Controls: Flat Mercator projection, scroll to zoom, drag to pan
Hover Tooltips: See CATE, traffic, baseline risk, and crash count for any cell
Top 5 Quick Reference: Geocoded addresses for the 5 most vulnerable zones
- Example: #1 - 885 Lexington Ave (CATE: 1.378%)
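
The Top N highlighting amounts to ranking cells by CATE and keeping the first N. The following is a minimal sketch of that selection, not the dashboard's actual code. The column names `h3_cell` and `cate_mean`, the helper `top_n_cells`, and the data values are all assumptions made for illustration; the real columns in cate_by_h3_cells.csv may differ.

```python
import pandas as pd

def top_n_cells(cells: pd.DataFrame, n: int, cate_col: str = "cate_mean") -> pd.DataFrame:
    """Return the n highest-CATE rows, ranked from 1 (most vulnerable)."""
    n = max(10, min(200, n))  # mirror the 10-200 range of the Top N control
    top = cells.nlargest(n, cate_col).reset_index(drop=True)
    top["rank"] = top.index + 1
    return top

# Illustrative data only: 300 fake cells with increasing CATE values.
cells = pd.DataFrame({
    "h3_cell": [f"cell_{i:03d}" for i in range(300)],
    "cate_mean": [i * 1e-5 for i in range(300)],
})

top = top_n_cells(cells, n=25)
cutoff = top["cate_mean"].min()  # smallest CATE that still makes the Top N
```

The `cutoff` value is the CATE of the last cell that still qualifies, which is one natural way to define a Top N threshold.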

📈 Analysis & Charts View

Comprehensive visual analytics with 4 key charts

1. Risk Distribution Histogram

Shows how CATE is distributed across all 1,135 cells
Red dashed line marks the Top N cutoff
Reveals a highly skewed distribution, which justifies targeted interventions

2. Traffic vs. CATE Scatter Plot

Validates 3x traffic amplification effect
Top 10 zones highlighted as red stars
Color-coded by risk rank

3. Top 20 Kill Zones Table

Detailed breakdown with geocoded addresses
Ranks, CATE values, traffic, baseline risk
Shows multiplier vs. average (e.g., 13.6x for top cell)

4. Summary Statistics Dashboard

Three-panel overview: Targeting, Geography, Impact
Key metrics and percentiles
Methodology notes

📁 Project Structure

CausalAccidents/
├── data/
│   ├── cate_by_h3_cells.csv            # Main data (1,135 H3 cells)
│   ├── top_20_geocoded.csv             # Top 20 cells with street addresses
│   ├── nyc_crash_data.csv              # Raw crash data
│   ├── nyc_weather_hourly.csv          # Weather data
│   └── ...                             # Other pipeline outputs
│
├── documents/
│   ├── HETEROGENEITY_RESULTS.md        # Analysis results (must read)
│   └── PIPELINE_SUMMARY.md             # Full pipeline documentation (must read)
│
├── notebooks/
│   ├── 01_data_cleaning.ipynb          # Data cleaning and Feature Engineering (Collisions Data)
│   ├── 02_a_h3_construction.ipynb      # H3 initial construction
│   ├── 02_b_h3_full_construction.ipynb # H3 full panel construction (h3_cell, date, hour)
│   ├── 03_weather.ipynb                # Weather integration
│   ├── 04_causal.ipynb                 # Initial causal analysis
│   ├── 04.5_TLC_data_cleaning.ipynb    # Data cleaning and Feature Engineering (TLC Data)
│   ├── 05_causal_validation.ipynb      # Traffic-adjusted analysis
│   └── 06_CATE.ipynb                   # T-Learner CATE estimation
│
├── src/
│   ├── app.py                          # Main Streamlit dashboard (Map + Analysis views)
│   ├── geocode_top_20.py               # Geocoding script for top 20 H3 cells
│   └── test_map_visual.py              # Test map visualization
│  
├── requirements.txt                    # Python dependencies
└── README.md                           # Project documentation

🔧 Technical Details

Data Pipeline

Crash Data Cleaning (01_data_cleaning.ipynb)
- NYPD Motor Vehicle Collisions (2022-01-01 to 2025-10-31)
- Parse timestamps, filter invalid lat/lon and geographic outliers
- Extract time features: day_of_week, is_weekend, month, is_rush_hour
- Output: crashes_cleaned.csv (~92% retention rate)
Weather Data (03_weather.ipynb)
- Open-Meteo Historical Weather API (NYC coordinates)
- Hourly precipitation and visibility data
- Binary rain flag: rain_flag = (precipitation > 0.1mm)
- Output: nyc_weather_hourly.csv (~33k hours, 8% rain prevalence)
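
The binary rain flag can be sketched in a few lines of pandas. This is an illustration under assumptions, not the notebook's code: the `time` column name and the sample readings are made up, while `precipitation`, `rain_flag`, and the 0.1mm threshold come from the description above.

```python
import pandas as pd

RAIN_THRESHOLD_MM = 0.1  # threshold stated in the pipeline description

# Illustrative hourly readings; real data would come from nyc_weather_hourly.csv.
weather = pd.DataFrame({
    "time": pd.to_datetime([
        "2022-01-01 00:00", "2022-01-01 01:00",
        "2022-01-01 02:00", "2022-01-01 03:00",
    ]),
    "precipitation": [0.0, 0.1, 0.2, 1.5],
})

# Strictly greater than the threshold, so a reading of exactly 0.1mm counts as dry.
weather["rain_flag"] = (weather["precipitation"] > RAIN_THRESHOLD_MM).astype(int)

# Share of hours flagged as rainy (about 8% in the real data, per the text above).
rain_prevalence = weather["rain_flag"].mean()
```
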
Traffic Data (04.5_TLC_data_cleaning.ipynb)
- NYC TLC Trip Record Data (Yellow taxi pickups)
- Polyfill taxi zones to H3 cells with distributed allocation
- DuckDB optimization for 46 months of parquet files
- Output: traffic_h3_2022_2025_polyfill.parquet (~1,500 cells)
Initial H3 Panel (02_a_h3_construction.ipynb)
- Apply H3 resolution 8 indexing to crashes (~600m hexagons)
- Aggregate to (h3_cell, date, hour) with crash counts
- Dense panel: cells with ≥1 historical crash
- Output: h3_panel_res8.csv (~1,500 unique cells)
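
The aggregation to (h3_cell, date, hour) might look like the sketch below. It assumes each crash has already been assigned an H3 cell ID; the `crash_time` column name and the cell IDs are placeholders invented for the example.

```python
import pandas as pd

# Illustrative crash records, already tagged with an H3 cell ID (values are made up).
crashes = pd.DataFrame({
    "h3_cell": ["cell_a", "cell_a", "cell_b", "cell_a"],
    "crash_time": pd.to_datetime([
        "2022-01-01 08:05", "2022-01-01 08:40",
        "2022-01-01 08:15", "2022-01-02 17:30",
    ]),
})
crashes["date"] = crashes["crash_time"].dt.date
crashes["hour"] = crashes["crash_time"].dt.hour

# One row per (cell, date, hour) that saw at least one crash.
panel = (
    crashes.groupby(["h3_cell", "date", "hour"])
    .size()
    .reset_index(name="crash_count")
)
```
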
Full Panel Construction (02_b_h3_full_construction.ipynb)
- Cartesian product: H3_cells × Dates × Hours (avoid selection bias)
- Merge crash counts (fill zeros), weather, and compute Baseline_Risk
- Rolling 30-hour average crash rate per cell (lag-adjusted)
- Output: h3_full_panel_res8.csv (~40M observations)
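
The Cartesian product step matters because a panel built only from crash records contains no zero-crash hours, so every row would already be conditioned on the outcome. A minimal sketch of the expansion, using tiny made-up inputs and assumed column names:

```python
import pandas as pd

# Sparse crash counts: only (cell, date, hour) combinations with at least one crash.
sparse = pd.DataFrame({
    "h3_cell": ["cell_a", "cell_b"],
    "date": ["2022-01-01", "2022-01-02"],
    "hour": [8, 17],
    "crash_count": [2, 1],
})

cells = ["cell_a", "cell_b"]
dates = ["2022-01-01", "2022-01-02"]
hours = range(24)

# Every cell x date x hour combination, whether or not a crash occurred.
full_index = pd.MultiIndex.from_product(
    [cells, dates, hours], names=["h3_cell", "date", "hour"]
)

# Reindexing onto the full grid fills the missing combinations with zero crashes.
full_panel = (
    sparse.set_index(["h3_cell", "date", "hour"])["crash_count"]
    .reindex(full_index, fill_value=0)
    .reset_index()
)
```

With 2 cells, 2 dates, and 24 hours the result has 96 rows, 94 of which are zero-crash observations. The same multiplication at full scale is what produces a panel in the tens of millions of rows.
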
Causal Inference - Initial ATE (04_causal.ipynb)
- DoWhy framework with backdoor adjustment
- Propensity score weighting for average treatment effect
- Confounders: time features, Baseline_Risk, Traffic_Proxy
- Result: ATE ≈ 0.091pp, robustness checks passed
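
To make propensity score weighting concrete, the sketch below hand-rolls a normalized inverse-probability-weighted ATE on synthetic data. It deliberately does not use DoWhy and is not the notebook's code; the single confounder `x`, the data-generating process, and the true effect of 2.0 are all assumptions chosen so the result can be checked.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# Synthetic data: x confounds both treatment assignment and the outcome.
x = rng.normal(size=n)
treated = rng.random(n) < 1 / (1 + np.exp(-0.5 * x))
y = 2.0 * treated + x + rng.normal(size=n)  # true ATE is 2.0 by construction

# Step 1: estimate propensity scores P(treated | x).
features = x.reshape(-1, 1)
propensity = LogisticRegression().fit(features, treated).predict_proba(features)[:, 1]

# Step 2: weight each unit by the inverse probability of the arm it landed in.
w1 = treated / propensity
w0 = (~treated) / (1 - propensity)

# Normalized (Hajek) inverse-probability-weighted estimate of the ATE.
ate_ipw = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Unadjusted difference in means, biased upward here because x raises both.
ate_naive = y[treated].mean() - y[~treated].mean()
```

In the project's setting the treatment would be `rain_flag` and the confounders would be the time features, Baseline_Risk, and the traffic measure.
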
Causal Validation - Real Traffic (05_causal_validation.ipynb)
- Replace Traffic_Proxy with actual TLC traffic_count
- Re-estimate ATE with real exposure data
- Result: ATE ≈ 0.095pp (validates initial estimate)
Heterogeneous Effects (06_CATE.ipynb)
- T-Learner with GradientBoostingRegressor (100 trees, depth=5)
- Stratified sampling: 1M observations (500k rain / 500k no-rain)
- Features: log_traffic, Baseline_Risk, temporal covariates
- Training time: ~56 seconds, spatial aggregation to 1,135 cells
- Output: cate_by_h3_cells.csv (mean, median, std per cell)
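
The final spatial aggregation reduces observation-level CATE predictions to one row per cell. A minimal sketch with made-up values, assuming columns named `h3_cell` and `cate`:

```python
import pandas as pd

# Illustrative observation-level CATE predictions (values are made up).
predictions = pd.DataFrame({
    "h3_cell": ["cell_a", "cell_a", "cell_a", "cell_b", "cell_b"],
    "cate": [0.001, 0.002, 0.006, 0.0005, 0.0015],
})

# Mean, median, and standard deviation per cell, most vulnerable cell first.
cate_by_cell = (
    predictions.groupby("h3_cell")["cate"]
    .agg(["mean", "median", "std"])
    .sort_values("mean", ascending=False)
    .reset_index()
)
```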

Model Details

Algorithm: T-Learner (two separate models for treatment/control)
Base Learner: GradientBoostingRegressor (100 estimators, max_depth=5)
Features: log_traffic, baseline_risk, day_of_week, is_weekend, month, is_rush_hour
Training Time: ~56 seconds total
Sample Size: 1M stratified sample
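
A T-Learner fits one outcome model on treated rows and another on control rows, then takes the difference of their predictions as the CATE estimate. The sketch below shows that structure on synthetic data with the hyperparameters listed above. The two generic features, the data-generating process, and the helper `fit_outcome_model` are assumptions for illustration, not the contents of 06_CATE.ipynb.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2_000

# Synthetic data: the treatment effect grows with the second feature.
X = rng.uniform(size=(n, 2))
rain = rng.random(n) < 0.5
true_cate = 0.5 + X[:, 1]
y = X[:, 0] + rain * true_cate + rng.normal(scale=0.1, size=n)

def fit_outcome_model(features, outcome):
    """Fit one outcome model for a single arm (100 trees, depth 5)."""
    model = GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=0)
    return model.fit(features, outcome)

model_rain = fit_outcome_model(X[rain], y[rain])    # trained on treated rows only
model_dry = fit_outcome_model(X[~rain], y[~rain])   # trained on control rows only

# CATE estimate: difference between the two models' predictions for every row.
cate_hat = model_rain.predict(X) - model_dry.predict(X)
```

Because each model sees only one arm, the estimate for any row always relies on one counterfactual prediction, which is why balanced arms (such as the 500k / 500k stratified sample) help keep both models comparably well fitted.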

📈 Results Summary

From HETEROGENEITY_RESULTS.md:

| Metric | Value |
| --- | --- |
| Mean CATE | 0.001013 (0.10pp) |
| Std CATE | 0.000732 |
| Max CATE | 0.013775 (1.38pp) |
| Total H3 Cells | 1,135 |
| High-traffic effect | 0.19pp (~3x the low-traffic effect) |
| Low-traffic effect | 0.07pp |
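
The percentage-point figures and multipliers follow directly from the raw values. A short arithmetic check, using only numbers from the table (the helper `to_pp` is introduced here for illustration):

```python
mean_cate = 0.001013   # probability scale, from the results table
max_cate = 0.013775
high_traffic_pp = 0.19
low_traffic_pp = 0.07

def to_pp(probability: float) -> float:
    """Convert a probability difference to percentage points."""
    return probability * 100

mean_pp = to_pp(mean_cate)                       # about 0.10pp
max_pp = to_pp(max_cate)                         # about 1.38pp
max_vs_mean = max_cate / mean_cate               # about 13.6x
high_vs_low = high_traffic_pp / low_traffic_pp   # about 2.7x
```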

Recommendations

Target high-CATE zones (top 10%) with 2x surge pricing + safety alerts
Sunday rain events: Pre-position drivers, proactive warnings (1.6x higher risk)
Low-CATE zones: Standard operations (maintain competitiveness)

🚧 Future Enhancements

~~Geocoding integration (H3 → street addresses)~~ DONE
Real-time CATE scoring API
Historical back-testing dashboard
Multi-city support (Chicago, LA, SF)
Driver positioning optimization algorithm
Live weather forecast integration
A/B test deployment framework

📚 References

H3 Spatial Indexing: Uber H3
Causal Inference: DoWhy
Data Sources: NYPD Motor Vehicle Collisions, Open-Meteo Historical Weather API, NYC TLC Trip Record Data
