Online Retail Analysis & Customer Segmentation

A comprehensive data analysis project examining e-commerce transactions to uncover revenue trends, customer segments, and actionable business insights using machine learning and statistical methods.

Overview

This project analyzes the Online Retail II dataset from the UCI Machine Learning Repository, which contains over 1 million transaction records from a UK-based online retailer (2009-2011). Using RFM analysis and K-means clustering, we segment customers into actionable personas and identify strategic opportunities for business growth.

Key Deliverables:

Customer segmentation using RFM analysis (4 distinct clusters)
Revenue trend analysis and seasonality patterns
Geographic market distribution insights
Product performance analytics
90-day revenue forecasting using ARIMA
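
The RFM step listed above can be sketched roughly as follows. This is a minimal illustration, not the project's `calculate_rfm` implementation: the column names (`Customer ID`, `Invoice`, `InvoiceDate`, `Quantity`, `Price`), the helper name `rfm_table`, and the snapshot-date convention are all assumptions, and the toy rows are made up.

```python
import pandas as pd


def rfm_table(df: pd.DataFrame) -> pd.DataFrame:
    """Build a Recency/Frequency/Monetary table, one row per customer.

    Column names are assumed for illustration, not taken from the project code.
    """
    df = df.copy()
    df["Revenue"] = df["Quantity"] * df["Price"]
    # Assumed convention: measure recency from the day after the last invoice.
    snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)
    return df.groupby("Customer ID").agg(
        Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
        Frequency=("Invoice", "nunique"),
        Monetary=("Revenue", "sum"),
    )


# Tiny made-up example (not real rows from the dataset).
toy = pd.DataFrame({
    "Customer ID": [1, 1, 2],
    "Invoice": ["A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(["2011-01-01", "2011-01-10", "2011-01-05"]),
    "Quantity": [2, 1, 5],
    "Price": [10.0, 4.0, 3.0],
})
rfm = rfm_table(toy)
print(rfm)
```

Recency counts days since the customer's last purchase, Frequency counts distinct invoices, and Monetary sums line-item revenue. A table of this shape is the usual input to a clustering step.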

Key Findings

Business Metrics

Total Revenue: £17,743,429
Customer Base: 5,878 unique customers
Transaction Volume: 1,067,371 records
Geographic Reach: 38 countries

Customer Segmentation Results

Segment	Count	% of Base	Avg Spend	Total Revenue	Revenue %
Ultra VIP	4	0.1%	£437K	£1.7M	9.9%
High-Value	35	0.6%	£83K	£2.9M	16.4%
Active Regular	3,841	65.3%	£3K	£11.6M	65.4%
At-Risk	1,998	34.0%	£765	£1.5M	8.5%

Critical Insight: 39 customers (0.7% of base) generate £4.6M (26.3% of total revenue)
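
The concentration figure can be re-derived from the segmentation table. The short check below uses only the counts and revenue percentages reported above; the dictionary layout is just one convenient way to hold them.

```python
# Figures copied from the segmentation table above.
segments = {
    "Ultra VIP": {"count": 4, "revenue_pct": 9.9},
    "High-Value": {"count": 35, "revenue_pct": 16.4},
    "Active Regular": {"count": 3841, "revenue_pct": 65.4},
    "At-Risk": {"count": 1998, "revenue_pct": 8.5},
}

total_customers = sum(s["count"] for s in segments.values())
top = ["Ultra VIP", "High-Value"]
top_count = sum(segments[name]["count"] for name in top)
top_share_of_base = round(100 * top_count / total_customers, 1)
top_share_of_revenue = round(sum(segments[name]["revenue_pct"] for name in top), 1)

print(total_customers, top_count, top_share_of_base, top_share_of_revenue)
# prints: 5878 39 0.7 26.3
```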

Market Concentration

UK: 83.0% of revenue (£14.7M) - extreme geographic concentration
Australia: 1.0% (£170K) - largest non-UK market
International markets: Only 17% combined, indicating expansion opportunity
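
Shares like these come from a group-by over the transaction lines. The sketch below shows one way to compute them with pandas; the column names (`Country`, `Quantity`, `Price`) and the helper name `revenue_share_by_country` are assumptions, and the toy rows are invented so that the totals are easy to check by hand.

```python
import pandas as pd


def revenue_share_by_country(df: pd.DataFrame) -> pd.Series:
    """Return each country's share of total revenue in percent, largest first."""
    revenue = (df["Quantity"] * df["Price"]).groupby(df["Country"]).sum()
    return (100 * revenue / revenue.sum()).sort_values(ascending=False)


# Made-up rows: UK revenue 80, Australia 15, Germany 5 (total 100).
toy = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "Australia", "Germany"],
    "Quantity": [10, 6, 3, 1],
    "Price": [5.0, 5.0, 5.0, 5.0],
})
shares = revenue_share_by_country(toy)
print(shares)
```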

Product Analysis

Top 10 products: 8.5% of revenue (healthy diversification)
Leading product: REGENCY CAKESTAND 3 TIER (£286K)
Category dominance: Home décor and gift items

Visualizations

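Monthly Revenue Trend (plots/monthly_revenue.png)

Customer Segmentation (plots/rfm_clusters.png)

Top Products (plots/top_products.png)

Clustering Validation (plots/elbow_silhouette.png)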

Technology Stack

Python 3.10+
pandas, numpy - Data manipulation
scikit-learn - Machine learning (K-Means clustering)
matplotlib, seaborn - Visualizations
plotly - Interactive maps
pmdarima - Time series forecasting
Jupyter Notebook - Interactive analysis

Getting Started

Prerequisites

Python 3.10 or higher
pip package manager
Kaggle account (for dataset access)

Installation

Clone the repository

git clone https://github.com/itsaryanchauhan/online-retail-analysis.git
cd online-retail-analysis

Install dependencies
```
pip install -r requirements.txt
```

Configure Kaggle credentials

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
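
If the dataset download fails, it may help to confirm that the credentials file is in place. The helper below is a hypothetical convenience check, not part of this repository. It assumes the standard `kaggle.json` layout with `username` and `key` fields and verifies that group and other users have no access, which is what `chmod 600` sets.

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path: Path) -> list[str]:
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    # chmod 600 leaves no permission bits set for group or others.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    for field in ("username", "key"):
        if field not in data:
            problems.append(f"missing field: {field}")
    return problems


if __name__ == "__main__":
    for problem in check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json"):
        print(problem)
```

The script prints nothing when the file looks fine. The permission check is meaningful on Linux and macOS only.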

Reproducibility

This analysis is designed for reproducible results:

K-Means clustering uses random_state=42 for consistent cluster assignments
Data processing pipeline is deterministic (no random sampling)
Same dataset and preprocessing steps will yield identical customer segments
ARIMA forecasting may show minor variations due to optimization algorithms

To ensure identical results:

import numpy as np
np.random.seed(42)  # Set before running analysis
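
The effect of a fixed `random_state` can be checked directly. The sketch below runs K-Means twice on the same synthetic matrix and compares the labels. The synthetic data, the log-transform and standardization steps, and the `cluster` helper are assumptions for illustration; the project's actual preprocessing may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an RFM matrix (columns: recency, frequency, monetary).
rng = np.random.default_rng(0)
rfm_values = rng.lognormal(mean=3.0, sigma=1.0, size=(200, 3))

# Assumed preprocessing: log-transform the skewed values, then standardize.
features = StandardScaler().fit_transform(np.log1p(rfm_values))


def cluster(seed: int) -> np.ndarray:
    # n_init is set explicitly because its default changed across scikit-learn versions.
    model = KMeans(n_clusters=4, n_init=10, random_state=seed)
    return model.fit_predict(features)


labels_a = cluster(42)
labels_b = cluster(42)
same = np.array_equal(labels_a, labels_b)
print("identical assignments:", same)
```

Because an integer `random_state` is passed to `KMeans` itself, the cluster assignments here do not depend on NumPy's global seed; the global seed only matters for code that draws from `np.random` directly.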

Usage

Run complete analysis pipeline:

python3 run_analysis.py

Or from the scripts directory:

cd scripts
python3 main.py

Interactive Jupyter Notebook:

jupyter notebook retail_analysis.ipynb

Use individual modules:

import sys
sys.path.append('scripts')

from data_loader import load_retail_data
from rfm_analysis import calculate_rfm, segment_customers

df = load_retail_data()
rfm = calculate_rfm(df)
rfm = segment_customers(rfm, n_clusters=4)

Project Structure

online-retail-analysis/
├── retail_analysis.ipynb    # Complete analysis notebook
├── run_analysis.py          # Runs the complete analysis pipeline
├── scripts/
│   ├── data_loader.py       # Data loading utilities
│   ├── data_cleaning.py     # Preprocessing & feature engineering
│   ├── rfm_analysis.py      # Customer segmentation
│   ├── visualizations.py    # Chart generation
│   └── main.py              # Pipeline orchestrator
├── plots/                   # Generated visualizations
├── requirements.txt         # Dependencies
└── README.md                # Documentation

Business Recommendations

High Priority (0-3 months)

Protect high-value customers: 39 customers generate 26.3% of revenue
Launch marketing campaigns ahead of the September-November peak season
Deploy win-back campaigns for the 1,998 At-Risk customers

Medium Term (3-6 months)

Reduce UK dependency from 83% through international expansion
Implement tiered loyalty program for active customers
Expand product portfolio beyond home décor

Long Term (6-12 months)

Build predictive churn models
Develop subscription-based revenue streams
Enter the Australian, German, and Dutch markets systematically

Projected Impact: £2.5M-£3.5M additional revenue (14-20% growth)

Resources

Jupyter Notebook (Google Colab): Open in Colab
Dataset: Online Retail II UCI
Kaggle Notebook: View on Kaggle
GitHub Repository: Source Code

Results & Outputs

Running the analysis generates:

Visualizations (plots/)

monthly_revenue.png
top_products.png
rfm_clusters.png
elbow_silhouette.png
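
For context on what an elbow/silhouette plot summarizes, the sketch below computes both quantities for a range of cluster counts. It runs on synthetic blobs rather than the retail data, and the range of `k` and all parameter values are illustrative assumptions, not the settings used to produce `elbow_silhouette.png`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with four groups, standing in for scaled RFM features.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = {
        # Within-cluster sum of squares, the quantity plotted on an elbow curve.
        "inertia": model.inertia_,
        # Mean silhouette coefficient in [-1, 1]; higher means better separation.
        "silhouette": silhouette_score(X, model.labels_),
    }

for k, scores in results.items():
    print(f"k={k}  inertia={scores['inertia']:.1f}  silhouette={scores['silhouette']:.3f}")
```

Inertia tends to fall as `k` grows, so the elbow is read as the point where the drop flattens, while the silhouette score gives a second, independent signal for the same choice.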

Author

Aryan Chauhan
GitHub | LinkedIn

Data Analyst specializing in customer analytics and business intelligence.

Contributing

Contributions are welcome. Please fork the repository and submit a pull request with your improvements.

Potential enhancements:

Cohort analysis and retention metrics
Advanced forecasting models (Prophet, LSTM)
Interactive dashboard (Streamlit/Dash)
A/B testing framework

License

This project is available under the MIT License.

Acknowledgments

UCI Machine Learning Repository for the dataset
Kaggle community for data science resources
scikit-learn, pandas, and visualization library contributors
