- Team Members: Ethan Cao, Leah Kim, Eden Mulugeta
Infant mortality remains a critical public health concern, with approximately 20,000 deaths occurring in the United States annually among infants under one year of age. This project applies machine learning techniques to predict infant mortality risk using comprehensive birth and health data from the National Vital Statistics System (NVSS). By identifying high-risk cases early, this work aims to support targeted interventions that can improve neonatal outcomes and reduce preventable infant deaths.
The analysis leverages a diverse set of maternal, prenatal, and neonatal factors including demographic information, prenatal care patterns, maternal health conditions, gestational characteristics, birth outcomes, and congenital anomalies. Multiple predictive models are developed and compared to identify the most effective approach for infant mortality prediction.
This project uses the NVSS Linked Birth-Infant Death dataset (2013), which combines birth certificate data with infant death records for comprehensive mortality analysis. The dataset includes:
- Over 3.9 million birth records
- Maternal demographics and health history
- Prenatal care utilization patterns
- Birth outcomes and neonatal interventions
- Congenital anomaly indicators
- Linked mortality outcomes for infants who died before age one
The data has been preprocessed to remove potential label leakage, handle missing values, and create balanced train/validation/test splits suitable for machine learning.
.
├── README.md # Project overview and documentation
├── Final_Report.pdf # Final project report
├── Data/ # Data files and preprocessing
│ ├── linkco2013us_den.csv # Original NVSS linked birth-death dataset
│ ├── Train_Test_Validaiton_Split.ipynb # Data preprocessing and split generation
│ ├── nvss_train.csv # Training set (60% of data)
│ ├── nvss_val.csv # Validation set (20% of data)
│ ├── nvss_test.csv # Test set (20% of data)
│ └── nvss_aggregated.csv # Complete preprocessed dataset
│
├── NOTEBOOKS/ # Machine learning model development
│ ├── Descriptive_Analysis.ipynb # Exploratory data analysis and visualization
│ ├── SVM.ipynb # Support Vector Machine implementation
│ ├── Penalized_Logistics_Regression.ipynb # Lasso, Ridge, and Elastic Net models
│ ├── Neural_Network_Prediction.ipynb # Neural network models
│ ├── Neural_Network_Prediction_Optimized.ipynb # Optimized neural network architecture
│ ├── Ensemble.ipynb # XGBoost and ensemble methods
│ └── Final_Test_Evaluation.ipynb # Final model evaluation on test set
│
├── output/ # Trained models and evaluation results
│ ├── Infant_mortality_distribution.png # Distribution of infant mortality (0 vs 1)
│ ├── Maternal_Demographics.png # Visualization of maternal demographic variables
│ ├── Birth_Characterstics.png # Birth characteristics visualization
│ ├── Prenatal_Pregnancy_Factors.png # Prenatal and pregnancy-related factors
│ ├── Medical_Risk_Factors.png # Medical risk factor distributions
│ ├── Congential_Anomalies.png # Distribution of congenital anomalies
│ ├── Interventions.png # Medical interventions visualized
│ ├── Correlation_Heatmap_of_Top_Infant_Mortality_Predictors.png # Heatmap of top predictors
│ ├── best_mlp_infant_mortality.pt # Best performing neural network model
│ ├── svm_best_model.joblib # Optimized SVM model
│ ├── svm_gridsearch.joblib # SVM grid search results
│ ├── lasso_model.joblib # Lasso regression model
│ ├── ridge_model.joblib # Ridge regression model
│ ├── elastic_model.joblib # Elastic Net model
│ ├── xgb_best_model.joblib # Best XGBoost model
│ ├── xgb_randomsearch.joblib # XGBoost random search results
│ ├── overall_accuracy.png # Overall model accuracy comparison
│ ├── Positive_case_accuracy.png # Positive case detection accuracy
│ ├── neural_network_loss.png # Neural network training loss curves
│ └── overfitting.jpg # Overfitting analysis visualization
│
└── env/ # Python virtual environment
The models utilize carefully selected features from the NVSS dataset:
Maternal Demographics: Age, education, race, ethnicity, marital status
Obstetric History: Live birth order, previous pregnancies, prior cesarean deliveries, history of preterm birth or poor outcomes
Prenatal Care: Month care began, number of prenatal visits, gestational weight gain
Maternal Health Conditions: Diabetes, chronic hypertension, pregnancy-associated hypertension, eclampsia
Maternal Risk Behaviors: Tobacco use during pregnancy, cigarettes per day
Infant Characteristics: Sex, gestational age, birth weight, 5-minute Apgar score, plurality (singleton/twin/etc.)
Neonatal Interventions: Assisted ventilation, NICU admission, extended ventilation
Congenital Anomalies: Neural tube defects, congenital heart disease, chromosomal disorders, and other major birth defects
Python Version: 3.10 or higher
Required Packages:
- pandas
- numpy
- scikit-learn
- torch (PyTorch)
- xgboost
- matplotlib
- seaborn
- jupyter
- joblib
- Data Preprocessing: Start with
Data/Train_Test_Validaiton_Split.ipynbto understand the data preparation process - Exploratory Analysis: Review
NOTEBOOKS/Descriptive_Analysis.ipynbfor data insights and visualizations - Model Training: Explore individual model notebooks (SVM, Neural Networks, Penalized Regression, Ensemble)
- Model Comparison: Check
output/directory for performance metrics and visualizations - Final Evaluation: See
NOTEBOOKS/Final_Test_Evaluation.ipynbfor comprehensive test set results
Support Vector Machine (SVM): Grid search optimization with various kernels and hyperparameters
Penalized Logistic Regression: Lasso (L1), Ridge (L2), and Elastic Net regularization
Neural Networks: Multi-layer perceptron with optimized architecture and training procedures
Ensemble Methods: XGBoost with random search hyperparameter tuning
Each model is trained on the same preprocessed dataset and evaluated using consistent metrics including accuracy, precision, recall, F1-score, and ROC-AUC. Trained models are saved in the output/ directory for reproducibility.
- Develop accurate predictive models for infant mortality risk
- Identify the most important risk factors contributing to infant mortality
- Provide interpretable results that can inform clinical decision-making
- Compare multiple machine learning approaches to determine optimal methodology
This project was developed as part of DS4021 focused on applying machine learning to real-world healthcare challenges.