This project aims to detect fraudulent credit card transactions using various machine learning algorithms. The goal is to create a model that can identify suspicious activities and prevent losses for cardholders, merchants, and financial institutions.
Data preprocessing involved transforming the raw data into a suitable format for machine learning:
- Transaction Characteristics: Log transformation for transaction amounts and extraction of timing features (transaction hour, day of week).
- Cumulative Metrics: Added transaction count and 7-day spending to track user activity.
- Fraud Indicators: Added a flag marking fraud within the past 7 days, and used transaction frequency and spending velocity as additional fraud signals.
- Demographic Factors: Used age and job type to analyze fraud patterns across demographics.
- Distance and Location: Calculated distances between the cardholder's home and the merchant location, and between consecutive transactions, to detect location anomalies.
- Data Cleaning & Balancing: Handled missing values and outliers, and balanced the dataset by oversampling the fraud class with SMOTE.
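The balancing step can be illustrated with a minimal plain-Python sketch of the interpolation idea behind SMOTE. It is not the project's code: the function name `smote_sketch`, the choice of `k`, and the sample rows are assumptions made for illustration, and a real pipeline would more likely rely on a library implementation.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` inside the minority class, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical fraud rows as (log amount, distance in km) pairs.
fraud_rows = [(5.1, 120.0), (5.4, 300.0), (6.0, 80.0), (5.8, 950.0)]
new_rows = smote_sketch(fraud_rows, n_new=6)
```

Because every synthetic row is interpolated between two real fraud rows, it stays inside the range the fraud class already covers. Oversampling of this kind is normally applied to the training split only, so that evaluation data keeps the original class ratio.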
Key features were engineered to improve model performance:
- amt: Log-transformed transaction amount to reduce skewness and normalize data.
- city_pop: Population of the transaction city, indicating possible location-based patterns.
- transaction_hour: Extracted from transaction timestamp to capture time-based patterns.
- transaction_count: Total transaction count for each cardholder, helping to detect abnormal spikes in activity.
- age: Cardholder's age, which may correlate with specific spending behaviors and fraud patterns.
- trans_7d_count: Number of transactions in the past 7 days, used to identify sudden activity increases.
- prev_trans_count: Previous transaction count to capture historical transaction behavior.
- spending_velocity: Rate of spending over time, used to identify sudden increases in spending.
- distance: Distance between consecutive transaction locations, detecting irregular travel patterns.
- fraud_7d_flag: A binary flag indicating fraud within the last 7 days, used to capture ongoing fraud patterns.
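As a rough illustration of how several of these features could be derived, the sketch below computes them for one cardholder in plain Python. The record layout `(timestamp, amount, lat, lon)`, the use of `log1p`, the haversine formula for `distance`, and the definition of `spending_velocity` as average daily spend over the trailing 7 days are all assumptions; the notebook may define them differently.

```python
import math
from datetime import datetime, timedelta


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def build_features(transactions):
    """transactions: one cardholder's records sorted by time,
    each a (timestamp, amount, lat, lon) tuple (assumed layout)."""
    rows = []
    for i, (ts, amount, lat, lon) in enumerate(transactions):
        window_start = ts - timedelta(days=7)
        # earlier transactions that fall inside the trailing 7-day window
        recent = [t for t in transactions[:i] if t[0] >= window_start]
        if i > 0:
            _, _, prev_lat, prev_lon = transactions[i - 1]
            distance = haversine_km(prev_lat, prev_lon, lat, lon)
        else:
            distance = 0.0  # no previous transaction to compare against
        rows.append({
            "amt": math.log1p(amount),            # log(1 + amount), safe for zero
            "transaction_hour": ts.hour,
            "prev_trans_count": i,                # transactions seen before this one
            "trans_7d_count": len(recent),
            "spending_velocity": sum(t[1] for t in recent) / 7.0,  # assumed definition
            "distance": distance,
        })
    return rows


# Hypothetical transactions: two in New York, then one in Los Angeles.
txns = [
    (datetime(2019, 1, 1, 9, 30), 20.0, 40.7128, -74.0060),
    (datetime(2019, 1, 3, 23, 5), 100.0, 40.7128, -74.0060),
    (datetime(2019, 1, 20, 2, 15), 900.0, 34.0522, -118.2437),
]
rows = build_features(txns)
```

In this example the third transaction has a `distance` of several thousand kilometres from the previous one, which is the kind of jump the location features are meant to surface.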
In this project, multiple machine learning models were evaluated to detect fraudulent transactions. Each model has its strengths in identifying complex patterns and outliers in data:
Decision Tree
- Purpose: A non-linear classification algorithm that splits the data into subsets based on feature values, recursively creating a tree structure.
- Why Used: Decision trees are simple, interpretable models that handle both numerical and categorical features and capture non-linear feature interactions without requiring feature scaling.
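To make the splitting step concrete, here is a small plain-Python sketch that scores candidate thresholds on a single feature by weighted Gini impurity. Gini is assumed here as the split criterion; the helper names and the sample values are hypothetical, and the sketch covers one split only, not a full tree.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p) in the binary case."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def split_impurity(values, labels, threshold):
    """Weighted impurity of the two groups produced by `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try the midpoints between consecutive distinct values, keep the purest split."""
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(midpoints, key=lambda t: split_impurity(values, labels, t))


# Hypothetical `distance` values (km) with fraud labels.
distances = [1, 3, 5, 400, 900, 1200]
is_fraud = [0, 0, 0, 1, 1, 1]
threshold = best_threshold(distances, is_fraud)
```

A tree repeats this search over every feature at every node, then recurses on the two resulting subsets.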
Random Forest
- Purpose: An ensemble model that builds multiple decision trees and combines their predictions.
- Why Used: Random forests handle complex datasets with interactions between features, and averaging many trees makes them less prone to overfitting than a single decision tree.
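The two ingredients of a random forest, bootstrap sampling and vote aggregation, can be sketched in a few lines of plain Python. The three "trees" below are hypothetical hand-written rules standing in for trained trees, and the thresholds are made up for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as is done for each tree's training set."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine per-tree class predictions into a single label."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, tx):
    """Ask every tree for a label and return the most common one."""
    return majority_vote([tree(tx) for tree in trees])


# Hypothetical stand-ins for trained trees (each maps a transaction to 0 or 1).
trees = [
    lambda tx: int(tx["distance"] > 500),
    lambda tx: int(tx["amt"] > 6.0),
    lambda tx: int(tx["transaction_hour"] < 5),
]

suspicious = {"distance": 800, "amt": 4.0, "transaction_hour": 3}
ordinary = {"distance": 10, "amt": 7.0, "transaction_hour": 14}
labels = [forest_predict(trees, suspicious), forest_predict(trees, ordinary)]
```

A single tree that fires on `amt` alone would flag the second transaction; the vote across trees overrules it, which is the variance-reducing effect of the ensemble.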
XGBoost
- Purpose: A gradient boosting algorithm that builds an ensemble of decision trees sequentially, correcting errors made by previous trees.
- Why Used: XGBoost is efficient and performs well on tabular classification tasks, and its scale_pos_weight parameter can reweight the rare fraud class in imbalanced datasets.
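One common way to handle class imbalance in XGBoost is its `scale_pos_weight` parameter, usually set near the ratio of negative to positive examples. The sketch below only computes that ratio and places it in a parameter dictionary; the label counts are made up, and whether this project used the parameter (rather than relying on SMOTE alone) is not stated.

```python
def scale_pos_weight(labels):
    """Ratio of legitimate (0) to fraudulent (1) examples."""
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    if positives == 0:
        raise ValueError("no positive (fraud) examples in labels")
    return negatives / positives


# Hypothetical training labels: 995 legitimate, 5 fraudulent.
labels = [0] * 995 + [1] * 5
weight = scale_pos_weight(labels)

# Parameters that could be passed to an XGBoost classifier or training call.
params = {"objective": "binary:logistic", "scale_pos_weight": weight}
```

If the training set has already been balanced with SMOTE, the ratio is close to 1 and the parameter has little effect, so the two techniques are usually treated as alternatives.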
Logistic Regression
- Purpose: A linear model used for binary classification to predict whether a transaction is fraudulent (1) or legitimate (0).
- Why Used: Logistic regression provides an interpretable baseline: the sign and magnitude of each coefficient show how a feature shifts the predicted odds of fraud.
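The prediction rule is short enough to write out by hand. In the sketch below the coefficients and bias are invented for illustration and are not fitted values from this project.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of fraud: sigmoid of the weighted sum of the features."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)


# Hypothetical coefficients, for illustration only.
weights = {"amt": 0.8, "fraud_7d_flag": 2.0, "transaction_hour": -0.05}
bias = -6.0

tx = {"amt": 5.0, "fraud_7d_flag": 1, "transaction_hour": 2}
p_fraud = predict_proba(tx, weights, bias)

# exp(coefficient) is the multiplicative change in the odds of fraud
# when that feature increases by one unit, holding the others fixed.
odds_ratio_flag = math.exp(weights["fraud_7d_flag"])
```

Reading `odds_ratio_flag` this way is what makes the model useful as an interpretable baseline: with these made-up numbers, a recent-fraud flag multiplies the odds of fraud by about 7.4.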
Neural Networks
- Purpose: A deep learning model that consists of multiple layers of neurons to detect complex patterns in large datasets.
- Why Used: Neural networks are well suited to capturing non-linear relationships in large, complex datasets such as transaction records.
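For intuition, the forward pass of a very small network (one hidden layer with ReLU, a sigmoid output) can be written in plain Python. The layer sizes and weights below are arbitrary placeholders, not the architecture used in the notebook, which would normally be built with a library such as Keras.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b), row by row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Arbitrary placeholder weights: 2 inputs -> 2 hidden units -> 1 output.
hidden_w = [[0.5, -1.0], [1.0, 1.0]]
hidden_b = [0.0, -1.0]
out_w = [[1.0, -2.0]]
out_b = [0.5]


def forward(x):
    """Return the predicted fraud probability for a 2-feature input."""
    hidden = dense(x, hidden_w, hidden_b, relu)
    return dense(hidden, out_w, out_b, sigmoid)[0]


score = forward([2.0, 0.5])
```

The non-linearity between layers is what lets the network represent patterns that a single linear model such as logistic regression cannot.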
- Imbalanced Dataset
  - Very few fraudulent transactions in the dataset
  - High risk of model overfitting
- One-Hot Encoding
  - Encoding categorical features created excessive columns
  - Increased dimensionality
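The dimensionality issue with one-hot encoding is easy to see in a small plain-Python sketch: every distinct category value becomes its own column. The category values below are hypothetical examples.

```python
def one_hot(values):
    """Return (column names, encoded rows) with one 0/1 column per distinct value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical `category` values for four transactions.
category = ["grocery_pos", "gas_transport", "grocery_pos", "travel"]
columns, encoded = one_hot(category)
n_columns = len(columns)
```

Three distinct values already produce three columns here. A feature with hundreds of distinct values, such as a merchant name, would add hundreds of mostly-zero columns, which is the growth described above.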
The data used in this project is publicly sourced from Kaggle. It is a simulated credit card transaction data set containing legitimate and fraudulent transactions from January 1, 2019 to December 31, 2020. It covers the credit cards of 1,000 customers transacting with a pool of 800 merchants. The data consists of two files: fraudTrain.csv (what we worked on during our project time frame) and fraudTest.csv (ideally what we would use to further test our models).
- Source: Link to the data set
- Details: Features include transaction merchant, category, amount, and date/time.
To use this project, follow these steps:
Make sure you have the following installed on your machine:
- Python (version 3.6 or later)
- Jupyter Notebook
- pip (Python package installer)
Clone the repository using the following command:
git clone https://github.com/Anas10202/mastercard_fraud_detection.git
cd mastercard_fraud_detection
Manually install the necessary packages with the following command:
pip install jupyter pandas numpy scikit-learn matplotlib seaborn
To launch the Jupyter Notebook interface, run the following command in your terminal:
jupyter notebook
This will open the Jupyter Notebook interface in your web browser.
Navigate to the directory where you cloned the repository and open the relevant .ipynb file. For example:
cd path/to/your/notebook
jupyter notebook Mastercard1_CreditCardFraudDetection.ipynb
- Cell Execution: Execute cells in the notebook sequentially. Run an individual cell by clicking it and typing Shift + Enter.
- Modify Parameters: If desired, parameters and variables may be changed in their corresponding cells. Run the cell or the entire notebook to verify changes in the results.
- View Results: As each notebook cell runs, outputs such as plots, tables and textual results will be displayed below each cell, unless purposely not displayed.
Save work frequently by clicking the save icon or typing Ctrl + S.
If you are working inside a virtual environment, deactivate it when finished:
deactivate
We welcome contributions from the community to improve this project!
- Fork the Repository: Click the "Fork" button at the top right of the repository page. This will create a copy of the repository in your GitHub account.
- Clone the Repository: Clone the forked repository to your local machine using the following command:
git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name
- Create a new branch for a new contribution using the following command:
git checkout -b contribution/your-contribution
- Ensure all libraries that are listed in the notebook are installed.
- Make changes, but try to follow the existing style and conventions of the project.
- Commit changes using a descriptive commit message:
git add .
git commit -m "Add contribution: your-contribution"
- Push to the forked repository:
git push origin contribution/your-contribution
- Create a Pull Request: Open a pull request (PR) to the main repository. Include a clear title and description of your changes. Make sure to reference any related issues or PRs, if applicable.
If we receive contributions, we will try our best to respond with feedback or comments. Please make any necessary adjustments and commit them to your branch, if applicable.
Thank you for your contribution! Together, we can learn more and make this project better!
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
- Team Members: Anas Ahmed, Ashley Nguyen, Tarina Priti
- Mastercard Challenge Advisors: Vikas Bishnoi, Dhivya Jayaraman
- Cornell Tech Course Support/TA: Dev Ashar
- Libraries: NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn, Keras, XGBoost
- Applications: Google Colab, Jira, Slack
- Data Set Source: Kaggle