Laurel Technology Solutions Ltd. is a UK-based small-to-medium enterprise (SME) that has recently experienced significant growth and a rapid influx of new customers. Like many growing organisations, Laurel currently relies on off-the-shelf IT systems to support individual business functions such as Finance and Human Resources.
While these systems function well in isolation, they store customer information in separate silos, making it difficult to compare, analyse, or exploit data across departments. For example, financial data is stored exclusively in finance systems, while employment information is maintained only within HR systems. As a result, Laurel does not currently possess a single, unified customer record.
This project addresses that challenge by designing and implementing a Python-based ETL (Extract, Transform, Load) solution that consolidates multiple heterogeneous data sources into a central organisational database.
Following a highly successful financial year, Laurel Technology Solutions Ltd. is preparing for international expansion.
Key expansion goals include:
- Establishing a new regional office in Seoul, South Korea
- Supporting customers across East Asian markets
- Maintaining the UK headquarters as the primary operational hub
- Enabling regular data exchange between the UK and South Korea offices
- Hiring local employees in South Korea while retaining UK staff
- Supporting executives who will travel frequently between offices
To support this growth, Laurel aims to modernise its core data infrastructure by unifying its customer data streams, enabling deeper analysis, improved decision-making, and future data exploitation.
The primary goal of this project is to build a robust ETL pipeline that:
- Extracts customer data from multiple structured and semi-structured sources
- Transforms and cleans the data, resolving inconsistencies and duplicates
- Unifies all data into a single, coherent customer record
- Loads the unified data into a central MySQL database
This central data store can then be used as a foundation for future analytics, reporting, and expansion-related operations.
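The four stages listed above can be sketched as small, composable functions. This is a minimal illustration, not the project's actual code: the function names, the in-memory `store` list standing in for the database, and the choice of normalised first and last name as the matching key are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def extract(sources: Iterable[Iterable[Record]]) -> List[Record]:
    """Flatten the rows from every source into one list."""
    return [dict(row) for source in sources for row in source]


def customer_key(record: Record) -> Tuple[str, str]:
    # Hypothetical matching key: normalised first and last name.
    return (
        record.get("first_name", "").strip().lower(),
        record.get("last_name", "").strip().lower(),
    )


def transform(rows: List[Record]) -> List[Record]:
    """Merge rows that share a key into one unified record."""
    unified: Dict[Tuple[str, str], Record] = {}
    for row in rows:
        key = customer_key(row)
        if not all(key):
            continue  # skip rows that cannot be matched to a customer
        merged = unified.setdefault(key, {})
        for field, value in row.items():
            # Keep the first non-empty value seen for each field.
            if value not in (None, "") and field not in merged:
                merged[field] = value
    return list(unified.values())


def load(records: List[Record], store: List[Record]) -> int:
    """Append unified records to the store (a list stands in for the database)."""
    store.extend(records)
    return len(records)


def run_pipeline(sources: Iterable[Iterable[Record]], store: List[Record]) -> int:
    return load(transform(extract(sources)), store)
```

Passing two sources that describe the same person with different casing or stray whitespace yields a single merged record, which is the behaviour the unification step aims for.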
The ETL pipeline processes the following data formats:
- CSV – Demographic and vehicle information
- JSON – Financial and billing details
- XML – HR-related data such as salary, pension, and employment attributes
- TXT – Unstructured business rules and data corrections
Each source represents data generated by different organisational systems, reflecting real-world enterprise data fragmentation.
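As a rough sketch of the extraction step for the three structured formats, the standard library alone can turn each into a list of dictionaries. The field names, the `user` element name, and the assumption that the XML stores values as attributes are illustrative only; the real source files may be laid out differently. The TXT source is omitted because its unstructured content has no general-purpose parser.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from typing import Dict, List


def read_csv(text: str) -> List[Dict[str, str]]:
    """Parse CSV text with a header row into one dict per row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_json(text: str) -> List[dict]:
    """Parse JSON text; a single object is wrapped in a list."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def read_xml(text: str, record_tag: str = "user") -> List[Dict[str, str]]:
    """Collect the attributes of every <record_tag> element (assumed layout)."""
    root = ET.fromstring(text)
    return [dict(elem.attrib) for elem in root.iter(record_tag)]


# Hypothetical sample inputs, one per format.
CSV_SAMPLE = "first_name,last_name,vehicle_make\nAda,Lovelace,Ford\n"
JSON_SAMPLE = '[{"first_name": "Ada", "last_name": "Lovelace", "iban": "GB00TEST"}]'
XML_SAMPLE = '<users><user first_name="Ada" last_name="Lovelace" salary="42000"/></users>'
```

Returning the same shape (a list of dictionaries) from every reader keeps the later transform stage independent of the source format.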
- Programming Language: Python
- Database: MySQL (via USBWebserver)
- ORM: PonyORM
- Design Approach:
  - Modular ETL stages (Extract → Transform → Load)
  - Defensive programming to handle missing or inconsistent data
  - De-duplication to ensure one unified record per customer
  - Reusable and maintainable code structure
- Combines multiple heterogeneous data sources into a single schema
- Safely handles missing fields and inconsistent records
- Prevents duplicate customer entries during data unification
- Automatically creates database tables using ORM mapping
- Designed to be rerunnable without corrupting existing data
- Portable and environment-agnostic database setup using USBWebserver
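One way to make a load step rerunnable is to key the table on the customer's identity and replace rows on conflict, so a second run overwrites rather than duplicates. The sketch below uses Python's built-in `sqlite3` purely as a stand-in so it runs without a server; the project itself targets MySQL through PonyORM, and the table layout, column names, and composite key here are assumptions. `INSERT OR REPLACE` is SQLite syntax; MySQL offers `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE` for a similar effect.

```python
import sqlite3
from typing import Dict, List, Optional


def load_customers(conn: sqlite3.Connection, records: List[Dict[str, Optional[str]]]) -> int:
    """Create the table if needed, upsert the records, and return the row count."""
    rows = [
        {
            "first_name": r["first_name"],
            "last_name": r["last_name"],
            # Missing optional fields are stored as NULL instead of failing.
            "salary": r.get("salary"),
            "iban": r.get("iban"),
        }
        for r in records
    ]
    with conn:  # commit on success, roll back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS customer ("
            "first_name TEXT NOT NULL, last_name TEXT NOT NULL, "
            "salary REAL, iban TEXT, "
            "PRIMARY KEY (first_name, last_name))"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO customer (first_name, last_name, salary, iban) "
            "VALUES (:first_name, :last_name, :salary, :iban)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Calling `load_customers` twice with the same records leaves one row per customer, which is the property the rerunnable design depends on.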
- A centralised MySQL database containing unified customer records
- A clean and consistent dataset suitable for:
  - Business intelligence
  - Cross-departmental analysis
  - International expansion planning
- Exportable CSV output for reporting and assessment submission
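The CSV export could be produced along the following lines. This is a sketch under assumptions: the function name is hypothetical, records are taken to be plain dictionaries, and the column order (alphabetical) is an arbitrary choice for the example rather than the project's actual output layout.

```python
import csv
import io
from typing import Dict, List, TextIO


def export_csv(records: List[Dict[str, str]], stream: TextIO) -> List[str]:
    """Write records as CSV with a header row and return the column names."""
    # Use the union of all field names so sparse records still fit one header.
    fieldnames = sorted({field for record in records for field in record})
    writer = csv.DictWriter(stream, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)  # fields missing from a record are left empty
    return fieldnames


if __name__ == "__main__":
    buffer = io.StringIO()
    export_csv(
        [
            {"first_name": "Ada", "last_name": "Lovelace"},
            {"first_name": "Alan", "last_name": "Turing", "iban": "GB00TEST"},
        ],
        buffer,
    )
    print(buffer.getvalue())
```

When writing to a real file instead of a `StringIO` buffer, open it with `newline=""` as the `csv` module documentation recommends.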
In addition to the technical implementation, this project includes a critical report that:
- Reflects on the design and implementation of the ETL solution
- Evaluates the suitability of chosen technologies for international expansion
- Discusses scalability, data governance, and infrastructure considerations
- Provides recommendations for future improvements and growth
Potential extensions to this project include:
- Introducing unique customer IDs for stronger entity resolution
- Splitting the unified schema into multiple relational tables
- Adding validation rules for financial and personal data
- Implementing role-based access control for international teams
- Integrating analytics or visualisation tools for data insights
Simon Ugochukwu Awaogu
MSc Cybersecurity / Computing
University of Sunderland
This project is developed as part of CETM_50 Coursework.