This project is an interactive Shiny web application for dataset upload, preprocessing, feature engineering, and exploratory data analysis.
The application follows a 4-step data analysis pipeline:
- Dataset Upload & Preview
- Data Cleaning
- Feature Engineering
- Exploratory Data Analysis (EDA)
- Choose a data source:
  - Upload a file, or
  - Use a sample dataset (iris / mtcars)
- If uploading:
  - Click Upload a dataset
  - Supported formats: .csv, .xlsx / .xls, .rds, .json
- Dataset structure:
  - Number of rows and columns
  - Column names and types
  - Missing values per column
- Data preview:
  - First 10 rows of the dataset
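The upload and preview behaviour listed above could be sketched roughly as follows. This is a minimal illustration, not the app's actual code: `read_dataset()` and `describe_dataset()` are hypothetical helper names, and the sketch assumes the `readxl` and `jsonlite` packages are used for Excel and JSON files.

```r
# Hypothetical helper (not the app's code): choose a reader from the file extension.
read_dataset <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(
    ext,
    csv  = read.csv(path, stringsAsFactors = FALSE),
    xlsx = ,                                          # falls through to the xls reader
    xls  = as.data.frame(readxl::read_excel(path)),   # assumes readxl is installed
    rds  = readRDS(path),
    json = as.data.frame(jsonlite::fromJSON(path)),   # assumes jsonlite is installed
    stop("Unsupported file format: .", ext)
  )
}

# Hypothetical helper: one row per column with its type and missing-value count.
describe_dataset <- function(df) {
  data.frame(
    column    = names(df),
    type      = vapply(df, function(col) class(col)[1], character(1)),
    missing   = vapply(df, function(col) sum(is.na(col)), integer(1)),
    row.names = NULL
  )
}

# Example with a sample dataset that has a few values blanked out
df <- iris
df$Sepal.Length[1:3] <- NA

dim(df)               # number of rows and columns
describe_dataset(df)  # column names, types, missing values per column
head(df, 10)          # first 10 rows, as in the data preview
```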
- Remove Duplicate Rows
  - Removes exact duplicate observations
- Handle Missing Values
  - Choose one method from the Missing value method drop-down menu:
    - Remove rows with missing values
    - Mean imputation (numeric only)
- Handle Outliers
  - Choose one method from the Outlier method drop-down menu:
    - Cap outliers using IQR
    - Remove rows with outliers
- Scale Numeric Variables (Optional)
  - Choose one method from the Scaling method drop-down menu:
    - Standardization (z-score)
    - Min-Max scaling
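For reference, the cleaning options above correspond to fairly standard base R operations. The sketch below is illustrative only: all function names are hypothetical, and the IQR multiplier `k = 1.5` is the conventional default, assumed here rather than taken from the app.

```r
# Remove exact duplicate rows
remove_duplicates <- function(df) df[!duplicated(df), , drop = FALSE]

# Missing values, option 1: drop any row containing an NA
drop_missing_rows <- function(df) df[complete.cases(df), , drop = FALSE]

# Missing values, option 2: replace NAs in numeric columns with the column mean
impute_mean <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# IQR fences: Q1 - k * IQR and Q3 + k * IQR (k = 1.5 is an assumption)
iqr_bounds <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Outliers, option 1: cap values at the fences
cap_outliers <- function(x, k = 1.5) {
  b <- iqr_bounds(x, k)
  pmin(pmax(x, b[["lower"]]), b[["upper"]])
}

# Outliers, option 2: drop rows with a value outside the fences in any numeric column
remove_outlier_rows <- function(df, k = 1.5) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      b <- iqr_bounds(df[[col]], k)
      inside <- df[[col]] >= b[["lower"]] & df[[col]] <= b[["upper"]]
      keep <- keep & (is.na(inside) | inside)  # NAs are left for the missing-value step
    }
  }
  df[keep, , drop = FALSE]
}

# Scaling (both return NaN for a constant column, since the denominator is zero)
standardize <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
min_max <- function(x) {
  r <- range(x, na.rm = TRUE)
  (x - r[1]) / (r[2] - r[1])
}

x <- c(1, 2, 3, 4, 100)
cap_outliers(x)  # 1 2 3 4 7  (Q1 = 2, Q3 = 4, fences at -1 and 7)
```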
- Select cleaning options in the sidebar
- Click Apply Data Cleaning
- Go to the Data Cleaning tab panel
- Cleaned dataset preview
- Cleaning log (step-by-step actions)
The cleaned dataset and the cleaning log can be downloaded as CSV files by clicking Download Cleaned Dataset and Download Cleaning Log in the sidebar:
- cleaned_dataset.csv
- cleaning_log.csv
Choose one of the following feature engineering types:
- Single-variable Transformations
  - Choose the transformation type from the Choose transformation drop-down menu:
    - Log: log(x + 1)
    - Square: x²
    - Square root: √x
- Two-variable Interaction Features
  - Choose the operation type from the Choose operation drop-down menu:
    - Multiply: x * y
    - Divide: x / y (safe division)
    - Add: x + y
    - Subtract: x - y
- Binning
  - Select the number of bins in the Number of bins input
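The three feature engineering types above can be illustrated with a short base R sketch. The function names are hypothetical, and two details are assumptions rather than facts about the app: "safe division" is taken to mean returning `NA` where the divisor is zero, and binning is taken to mean equal-width bins.

```r
# Single-variable transformations
transform_single <- function(x, type = c("log", "square", "sqrt")) {
  type <- match.arg(type)
  switch(
    type,
    log    = log1p(x),  # log(x + 1)
    square = x^2,
    sqrt   = sqrt(x)    # yields NaN (with a warning) for negative values
  )
}

# Assumption: "safe" division returns NA instead of Inf/NaN when y is zero
safe_divide <- function(x, y) ifelse(y == 0, NA_real_, x / y)

# Two-variable interaction features
combine_pair <- function(x, y, op = c("multiply", "divide", "add", "subtract")) {
  op <- match.arg(op)
  switch(
    op,
    multiply = x * y,
    divide   = safe_divide(x, y),
    add      = x + y,
    subtract = x - y
  )
}

# Binning: a single number passed as `breaks` makes cut() build equal-width intervals
bin_numeric <- function(x, n_bins) cut(x, breaks = n_bins, include.lowest = TRUE)

transform_single(c(0, 3, 8), "square")         # 0 9 64
combine_pair(c(1, 2), c(0, 2), "divide")       # NA 1
table(bin_numeric(mtcars$mpg, 4))              # counts per bin
```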
- Select feature engineering type
- Choose variables
- Click Apply Feature Engineering
- Check the results in the Feature Engineering tab panel
- Only numeric variables are selectable
- Prevents invalid operations (e.g., division by zero)
- Automatically generates unique feature names
- Updates dataset instantly
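One way unique feature names could be generated is with base R's `make.unique()`. This is only a sketch: the helper names and the `_log` / `_1` naming scheme are hypothetical, and the app may use a different convention.

```r
# Hypothetical helper: return `candidate`, or a suffixed variant if it is already taken
unique_feature_name <- function(existing, candidate) {
  tail(make.unique(c(existing, candidate), sep = "_"), 1)
}

# Hypothetical helper: append a new column without overwriting an existing one
add_feature <- function(df, values, name) {
  name <- unique_feature_name(names(df), name)
  df[[name]] <- values
  df
}

df <- add_feature(mtcars, log1p(mtcars$mpg), "mpg_log")
df <- add_feature(df, log1p(mtcars$mpg), "mpg_log")  # same name requested again
tail(names(df), 2)  # "mpg_log" "mpg_log_1"
```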
- Updated dataset with new features
- Feature engineering log
The engineered dataset and the feature log can be downloaded as CSV files by clicking Download Engineered Dataset and Download Feature Log in the sidebar:
- engineered_dataset.csv
- feature_log.csv
In the Exploratory Data Analysis tab panel, under EDA Controls, choose which dataset to analyze:
- Current engineered dataset
- Cleaned dataset
- Original raw dataset

Also under EDA Controls, choose the plot type to generate from the Choose plot type drop-down menu:
- Histogram
  - Select the numeric variable from the Select numeric variable drop-down menu
- Boxplot
  - Select the numeric variable from the Select numeric variable drop-down menu
  - Select the categorical variable from the Select categorical variable drop-down menu
- Bar Chart
  - Select the categorical variable from the Select categorical variable drop-down menu
- Scatter Plot
  - Select the numeric variable from the Select numeric variable drop-down menu
  - Select the X variable from the Select X variable drop-down menu
  - Select the Y variable from the Select Y variable drop-down menu
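As a rough illustration of how the four plot types map onto variable selections, the sketch below uses base R graphics. The function `draw_plot()` and its arguments are hypothetical, and the app itself may well use a different plotting library.

```r
# Hypothetical dispatcher: `x` is the primary variable, `y` the optional secondary one
draw_plot <- function(df, type, x, y = NULL) {
  switch(
    type,
    histogram = hist(df[[x]], main = paste("Histogram of", x), xlab = x),
    boxplot = if (is.null(y)) {
      boxplot(df[[x]], main = paste("Boxplot of", x), ylab = x)
    } else {
      # one box per level of the categorical variable
      boxplot(split(df[[x]], df[[y]]),
              main = paste(x, "by", y), xlab = y, ylab = x)
    },
    bar = barplot(table(df[[x]]),
                  main = paste("Counts of", x), xlab = x, ylab = "Count"),
    scatter = plot(df[[x]], df[[y]],
                   main = paste(y, "vs", x), xlab = x, ylab = y),
    stop("Unknown plot type: ", type)
  )
  invisible(NULL)
}
```

For example, `draw_plot(iris, "boxplot", "Sepal.Length", "Species")` draws one box per species, and `draw_plot(mtcars, "scatter", "wt", "mpg")` plots mpg against wt.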
- Select dataset
- Choose plot type
- Choose primary variable
- (Optional) Choose secondary variable
- View plot and summary on the right side of the tab panel
- Dataset Summary
- EDA Plot
- Missing Values by Column
- Numeric Summary
- Correlation Analysis
- Correlation Matrix
- Categorical Summary
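The summary outputs listed above can be approximated in base R as follows. This is a sketch under assumptions: `summarize_dataset()` is a hypothetical name, and pairwise-complete observations are assumed for the correlation matrix so that missing values do not produce `NA` entries.

```r
# Hypothetical helper: collect the EDA summaries into one list
summarize_dataset <- function(df) {
  is_num   <- vapply(df, is.numeric, logical(1))
  num      <- df[, is_num, drop = FALSE]
  cat_cols <- df[, !is_num, drop = FALSE]
  list(
    dims        = dim(df),                 # dataset summary: rows and columns
    missing     = colSums(is.na(df)),      # missing values by column
    numeric     = if (ncol(num) > 0) summary(num) else NULL,
    correlation = if (ncol(num) >= 2) {
      cor(num, use = "pairwise.complete.obs")  # assumption: pairwise handling of NAs
    } else {
      NULL
    },
    categorical = lapply(cat_cols, function(col) table(col, useNA = "ifany"))
  )
}

s <- summarize_dataset(iris)
s$dims               # 150 5
s$missing            # all zero for iris
round(s$correlation, 2)
s$categorical$Species
```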
- Carrie Yan Yin Feng – Data upload and preview
- Shuzhi Yang – Data cleaning
- Haoyun Tong – Feature engineering
- Yolanda He – EDA