Stroke is a critical health condition and a leading cause of death and long-term disability worldwide. This project focuses on predicting the risk of stroke using machine learning models based on medical and demographic features.
Two models were implemented and evaluated:
- Decision Tree Classifier
- Random Forest Classifier (🏆 Best Performing)
🔹 Dataset (CSV File): Download healthcare-dataset-stroke-data.csv
🔹 Project Code (.ipynb): View Jupyter Notebook
🔹 Project Report (DOCX File): Download Report
🔹 Presentation (PPTX File): Download PPT
- Total Records: 5,110
- Target Variable: `stroke` (1 = Stroke occurred, 0 = No stroke)
- Features:
  - Demographic: Age, Gender, Marital Status, Residence Type
  - Clinical: Hypertension, Heart Disease, Glucose Level, BMI
  - Lifestyle: Smoking Status, Work Type
📝 Class Imbalance Notice: Only ~5% of the data points represent stroke cases, making class balancing necessary.
✔️ Cleaning & Transformation:
- Dropped non-informative `id` column
- Handled missing values in `bmi` using median imputation
- Encoded categorical variables using `LabelEncoder`
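The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the notebook's actual code; the column names `gender` and `smoking_status` and the sample values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny hypothetical sample that mimics the dataset's layout (not real records).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Male"],
    "bmi": [28.0, np.nan, 22.0, 31.0],
    "smoking_status": ["smokes", "never smoked", "Unknown", "smokes"],
    "stroke": [0, 0, 1, 0],
})

# Drop the identifier column, which carries no predictive signal.
df = df.drop(columns=["id"])

# Fill missing BMI values with the median of the observed values.
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# Label-encode the text columns (assumed list; integers follow sorted label order).
categorical_cols = ["gender", "smoking_status"]
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```

A separate `LabelEncoder` is fitted per column here so that each column gets its own label-to-integer mapping.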
✔️ Outlier Removal:
- Used IQR method to eliminate outliers in `bmi` and `avg_glucose_level`
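One common way to apply the IQR rule is shown below as a sketch. The helper name `remove_iqr_outliers` and the 1.5 multiplier are assumptions (1.5 is the conventional choice; the project does not state which value it used), and the data is made up.

```python
import pandas as pd

def remove_iqr_outliers(frame, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]

# Hypothetical values: the 95.0 BMI is far outside the bulk of the data.
df = pd.DataFrame({
    "bmi": [22.0, 24.0, 26.0, 28.0, 30.0, 95.0],
    "avg_glucose_level": [80.0, 85.0, 90.0, 95.0, 100.0, 105.0],
})

filtered = remove_iqr_outliers(df, ["bmi", "avg_glucose_level"])
print(len(df), "rows before,", len(filtered), "rows after")
```

Bounds are computed from the unfiltered frame for every column, so the result does not depend on the order in which the columns are listed.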
✔️ Feature Scaling:
- Applied `StandardScaler` to numerical columns
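As an illustration of the scaling step, the sketch below standardizes three numeric columns and leaves a binary flag untouched. Which columns the project treated as numerical is not stated, so the `numeric_cols` list is an assumption, and the values are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values; only the continuous columns are scaled.
df = pd.DataFrame({
    "age": [20.0, 40.0, 60.0, 80.0],
    "avg_glucose_level": [70.0, 90.0, 110.0, 130.0],
    "bmi": [20.0, 25.0, 30.0, 35.0],
    "hypertension": [0, 0, 1, 1],
})

numeric_cols = ["age", "avg_glucose_level", "bmi"]  # assumed selection

# StandardScaler subtracts each column's mean and divides by its standard deviation.
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(df.round(3))
```

If the scaler is fitted after the train-test split, fitting it on the training rows only and reusing it with `scaler.transform` on the test rows keeps test-set statistics out of training.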
✔️ Train-Test Split:
- 80% Training, 20% Testing using Stratified Sampling
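A stratified 80/20 split might look like the sketch below. The data is synthetic with a 5% positive rate to mirror the imbalance described earlier, and `random_state=42` is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix: 1,000 rows, 5% positive labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 50 + [0] * 950)

# stratify=y keeps the class ratio (about) the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```

Without `stratify`, a random split of such rare positives could leave the test set with very few stroke cases, which makes the evaluation metrics unstable.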
✔️ Imbalance Handling:
- Used SMOTE (Synthetic Minority Over-sampling Technique) on training set
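To show what SMOTE does to the training set, here is a simplified sketch of its core idea: each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbours. The function `smote_sketch` is hypothetical and written for illustration only; it is not the implementation the project used, and in practice a library such as imbalanced-learn (`imblearn.over_sampling.SMOTE`) would normally be applied instead.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X, y, minority_label=1, k=5, random_state=0):
    """Oversample the minority class of a binary problem until both classes are equal in size."""
    rng = np.random.default_rng(random_state)
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))
    if n_new <= 0:
        return X, y

    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per row
    gap = rng.random((n_new, 1))                                # interpolation factor in [0, 1)
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])

    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_new, minority_label, dtype=y.dtype)])
    return X_out, y_out

# Synthetic training data: 190 majority rows and 10 minority rows.
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, (190, 2)), rng.normal(3, 1, (10, 2))])
y_train = np.array([0] * 190 + [1] * 10)

X_res, y_res = smote_sketch(X_train, y_train)
print("before:", np.bincount(y_train), "after:", np.bincount(y_res))
```

Resampling only the training set, as the project does, keeps the test set at its natural class ratio so that the reported metrics reflect real-world prevalence.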
- Visualizations: Histograms, KDE plots, bar charts, and heatmaps
- Findings:
  - Higher age and glucose levels correlate positively with stroke risk
  - Hypertension and heart disease increase likelihood of stroke
  - Gender and residence type have minimal impact
✔️ Transparent and interpretable
❌ Slightly lower accuracy on imbalanced data
✅ High accuracy and better generalization
✅ Handles the class imbalance well when trained on SMOTE-resampled data
✅ Offers feature importance insights
| Model | Accuracy | Precision | F1 Score |
|---|---|---|---|
| Decision Tree | 89.5% | 68.2% | 74.4% |
| Random Forest | 93.7% | 75.9% | 81.1% |
📌 Random Forest outperformed Decision Tree on all three reported metrics.
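A comparison of this kind can be produced along the lines of the sketch below. It runs on synthetic data from `make_classification`, so its numbers will not match the table; the hyperparameters and `random_state` values are assumptions. The table does not say whether its precision and F1 are positive-class or averaged scores; this sketch uses scikit-learn's default positive-class scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the preprocessed stroke data.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }
    print(name, {metric: round(value, 3) for metric, value in results[name].items()})
```

On data with about 5% positives, accuracy alone can look high even for a model that rarely predicts a stroke, which is why precision and F1 are reported alongside it.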
The top predictors identified by the Random Forest model:
- Age 🥇
- Average Glucose Level
- BMI
- Hypertension
- Heart Disease
- Smoking Status
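A ranking like the one above is typically read from the fitted model's `feature_importances_` attribute. The sketch below shows the mechanics on synthetic data; the column names are placeholders attached to random features, so the ordering it prints says nothing about real stroke predictors.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names (assumed) attached to synthetic features for readability.
feature_names = [
    "age", "avg_glucose_level", "bmi",
    "hypertension", "heart_disease", "smoking_status",
]
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favour continuous features with many distinct values, so a check with `sklearn.inspection.permutation_importance` on held-out data can be a useful complement.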
🔹 Incorporate more clinical features (blood pressure, cholesterol, medication history)
🔹 Apply deep learning methods (e.g., Neural Networks, LSTM)
🔹 Develop a real-time decision support system for healthcare providers
This project showcases the complete data science workflow for stroke risk prediction, including:
✔️ Data Preprocessing & Cleaning
✔️ Exploratory Data Analysis and Feature Engineering
✔️ Model Development and Evaluation (Decision Tree, Random Forest)
✔️ Report, Visualization, and Documentation