Customer segmentation app built with Streamlit. Takes the classic Mall Customers dataset (200 rows), runs K-Means and DBSCAN clustering, and lets you explore the results interactively.
```shell
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```
Opens at http://localhost:8501. The bundled Mall_Customers.csv loads automatically, or upload your own CSV through the sidebar.
```
customer_seg/
├── app.py                  # Entry point — page config, sidebar, tab routing
├── config.py               # Constants: colours, Plotly theme, personas, defaults
├── data_loader.py          # CSV ingestion and column normalisation
├── preprocessing.py        # StandardScaler, LabelEncoder, alignment checks
├── clustering.py           # K-Means, DBSCAN, elbow, PCA, summaries
├── components.py           # Reusable UI pieces (metric cards, banners, etc.)
├── styles.css              # All custom CSS lives here
├── tabs/
│   ├── eda.py              # Distributions, scatter plots, correlation, box plots
│   ├── preprocessing_tab.py # Pipeline overview, null check, descriptive stats
│   ├── models.py           # Elbow/silhouette, PCA visualisation, radar chart
│   ├── predict.py          # Predict cluster for a new customer
│   └── metrics.py          # Silhouette, DBI, inertia, model comparison
├── Mall_Customers.csv      # 200-row dataset (Kaggle)
├── report.md               # Maths, metrics breakdown, finetuning notes
└── README.md
```
Mall Customer Segmentation from Kaggle — 200 customers with age, gender, annual income (k$), and a mall-assigned spending score (1–100).
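The Kaggle CSV ships with verbose headers, so ingestion normalises them first. A hypothetical sketch of that step (the rename map and snake_case targets are assumptions for illustration, not lifted from data_loader.py):

```python
import pandas as pd

# Map the Kaggle headers to simple snake_case names.
# In the app this would be pd.read_csv("Mall_Customers.csv");
# a two-row stand-in is used here.
RENAME = {
    "CustomerID": "customer_id",
    "Gender": "gender",
    "Age": "age",
    "Annual Income (k$)": "annual_income",
    "Spending Score (1-100)": "spending_score",
}

df = pd.DataFrame(
    {
        "CustomerID": [1, 2],
        "Gender": ["Male", "Female"],
        "Age": [19, 21],
        "Annual Income (k$)": [15, 15],
        "Spending Score (1-100)": [39, 81],
    }
).rename(columns=RENAME)

print(list(df.columns))
```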
- EDA — KPI cards, histograms, gender split, correlation heatmap, income-vs-spending scatter, age group analysis, box plots by gender.
- Preprocessing — Shows the pipeline step by step: load → null check → label encoding → feature selection → StandardScaler.
- Models — Elbow method + silhouette sweep for picking K. PCA-projected cluster plots for both K-Means and DBSCAN. Cluster profile table and radar chart.
- Predict — Slide in age/income/score for a hypothetical customer and see which segment they land in under each model.
- Metrics — Silhouette, Davies-Bouldin, inertia side by side for both models. Cluster size distribution.
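The Models, Predict, and Metrics tabs boil down to a small scikit-learn pipeline: standardise, sweep K by silhouette, fit K-Means, then scale and classify a new customer. A minimal sketch, using synthetic blobs in place of Mall_Customers.csv (the blob centres and the new-customer values are illustrative, not from the app):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Five well-separated blobs roughly mimicking the income x spending segments
centers = np.array([[25, 20], [25, 80], [55, 50], [90, 15], [90, 85]])
X = np.vstack([c + rng.normal(0, 6, size=(40, 2)) for c in centers])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Silhouette sweep: pick the K with the highest score
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
best_k = max(scores, key=scores.get)

kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X_scaled)

# Segment for a hypothetical new customer: income 60 k$, spending score 50
new_customer = scaler.transform([[60, 50]])
segment = kmeans.predict(new_customer)[0]
print(best_k, segment)
```

The same scaler must transform the new customer; predicting on raw values would silently misplace them.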
- The 5 income×spending clusters are cleanest when you use just Annual Income + Spending Score as features. Adding Age blurs boundaries and drops silhouette.
- DBSCAN with the default ε=0.5 on standardised data is too strict, leaving many points labelled as noise; raise ε to 0.6–0.8 and drop min_samples to 3–4 to reduce noise points.
- Check report.md for the full maths and a "was this successful?" checklist.
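The DBSCAN note above can be demonstrated on synthetic data (blob centres and outlier counts are made up; the real dataset behaves differently in detail). Loosening ε while lowering min_samples can only shrink the noise set, since every core point stays core and every reachable point stays reachable:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
centers = np.array([[25, 20], [25, 80], [55, 50], [90, 15], [90, 85]])
blobs = np.vstack([c + rng.normal(0, 6, size=(30, 2)) for c in centers])
background = rng.uniform([0, 0], [120, 100], size=(20, 2))  # scattered outliers
X = StandardScaler().fit_transform(np.vstack([blobs, background]))

tight = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # sklearn defaults
loose = DBSCAN(eps=0.7, min_samples=3).fit_predict(X)  # looser settings

noise_tight = int((tight == -1).sum())  # points labelled -1 are noise
noise_loose = int((loose == -1).sum())
print(noise_tight, noise_loose)
```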
Streamlit, pandas, numpy, scikit-learn, plotly, matplotlib, seaborn, joblib. All pinned in requirements.txt.