
AI vs Human Text Classification Pipeline

CS 6220 - Data Mining Project

This project implements a complete end-to-end workflow for detecting whether a given piece of text was written by a human or generated by an AI system. The pipeline focuses on scalable preprocessing, meaningful feature engineering, robust model training, and comprehensive evaluation, providing a reproducible framework for large-scale text classification.


Overview

The goal of this project is to distinguish AI-generated text from human-written text using a combination of traditional machine-learning algorithms and deep learning models.
A unified dataset was prepared by combining multiple sources containing labeled human and AI text. After preprocessing, vectorization, and structured model training, we evaluated multiple models to understand which approaches generalize best.

This project demonstrates:

  • Clean and modular preprocessing for large text datasets
  • Feature engineering based on linguistic and structural patterns
  • Handling imbalance through class-weight computation
  • Training and evaluation of multiple ML and transformer-based models
  • Analysis of failure modes and pattern-based insights

Pipeline Workflow

The notebook implements the following core stages:

1. Data Loading & Preprocessing

  • Lowercasing, whitespace normalization
  • Expansion of contractions (e.g., can't → cannot)
  • Duplicate removal
  • Label validation
  • Lemmatization for standardized token representation
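The stages above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the helper names (`clean_text`, `preprocess`) and the tiny contraction map are assumptions, and the real pipeline would use a proper lemmatizer (e.g., NLTK or spaCy) rather than the simple string cleanup shown here.

```python
import re

# Tiny illustrative contraction map; a real pipeline would use a fuller list.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "n't": " not"}

def clean_text(text: str) -> str:
    """Lowercase, expand contractions, and normalize whitespace."""
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(records: list[dict]) -> list[dict]:
    """Validate labels, clean text, and drop duplicates."""
    seen, out = set(), []
    for rec in records:
        if rec["label"] not in {"human", "ai"}:  # label validation
            continue
        cleaned = clean_text(rec["text"])
        if cleaned in seen:                      # duplicate removal
            continue
        seen.add(cleaned)
        out.append({"text": cleaned, "label": rec["label"]})
    return out
```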

2. Structural Feature Engineering

The pipeline extracts additional numeric features that help distinguish writing styles:

  • Character, word, and estimated sentence counts
  • Punctuation distributions
  • Average word length
  • Lexical diversity
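A sketch of how these structural features might be computed (the function name and exact definitions, such as estimating sentence count from terminal punctuation, are illustrative assumptions):

```python
import string

def structural_features(text: str) -> dict:
    """Numeric style features for one document."""
    words = text.split()
    n_words = max(1, len(words))
    return {
        "char_count": len(text),
        "word_count": len(words),
        # Rough sentence estimate from terminal punctuation
        "sentence_count": max(1, sum(text.count(p) for p in ".!?")),
        "punct_ratio": sum(ch in string.punctuation for ch in text) / max(1, len(text)),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        # Unique-token ratio as a simple lexical-diversity proxy
        "lexical_diversity": len({w.lower() for w in words}) / n_words,
    }
```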

3. Vectorization & Scaling

The cleaned text is transformed using:

  • TF-IDF (word-level)
  • Bag-of-Words representation

Numeric text features are standardized to ensure compatibility with ML models.
Class weights are computed to balance the learning signal across human/AI labels.
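With scikit-learn, these steps look roughly like the following (the toy data and default vectorizer parameters are assumptions; the notebook may configure n-gram ranges, vocabulary limits, etc.):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

texts = ["ai wrote this text", "a human wrote this", "another human essay"]
labels = np.array([1, 0, 0])  # illustrative: 1 = AI, 0 = human

# Word-level TF-IDF and Bag-of-Words representations
tfidf = TfidfVectorizer().fit_transform(texts)
bow = CountVectorizer().fit_transform(texts)

# Standardize numeric structural features (e.g., word/char counts)
numeric = np.array([[4, 19.0], [4, 18.0], [3, 19.0]])
scaled = StandardScaler().fit_transform(numeric)

# Balanced class weights counter label imbalance during training
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=labels)
```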

4. Dataset Splitting

A stratified split ensures equal representation of both classes across train, validation, and test sets.
This supports fair comparison and prevents hidden bias.
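One common way to produce such a split is two stratified calls to `train_test_split` (the 60/20/20 ratio and random seed below are assumptions, not taken from the notebook):

```python
from sklearn.model_selection import train_test_split

texts = [f"doc {i}" for i in range(10)]
labels = [0, 1] * 5

# First carve out the test set, then split the remainder into train/val.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)
```

Stratifying both calls keeps the human/AI ratio identical in all three subsets.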

5. Model Training & Evaluation

Models trained include:

  • Logistic Regression
  • Random Forest
  • Transformer-based BERT fine-tuning

Evaluation focuses on:

  • Accuracy
  • Precision / Recall for both classes
  • F1-score

Additionally, hyperparameter tuning was applied to all models to achieve the best performance.
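For the classical models, the train-tune-evaluate loop can be sketched as below. The toy corpus, the grid values, and the cross-validation settings are illustrative assumptions; the notebook's actual grids and folds may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

texts = ["ai generated passage"] * 5 + ["human written note"] * 5
labels = [1] * 5 + [0] * 5  # 1 = AI, 0 = human

X = TfidfVectorizer().fit_transform(texts)

# Small illustrative hyperparameter grid with class-weighted training
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=2, scoring="f1")
search.fit(X, labels)

preds = search.predict(X)
print(classification_report(labels, preds, target_names=["human", "ai"]))
```

The same pattern applies to Random Forest by swapping in `RandomForestClassifier` and a grid over tree depth and count.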


Key Results

Traditional Machine Learning Models

  • Logistic Regression improved from ~84% to ~89% accuracy after tuning.
  • Random Forest improved from ~85% to ~91% accuracy after systematic hyperparameter exploration.

BERT-Based Deep Learning Model

  • Achieved 93% accuracy
  • AI-recall: 93% (very strong at catching AI text)
  • Human-recall: 92%
  • F1-score: 0.93

This shows that both classical and deep learning models can effectively detect writing patterns, with transformers providing the strongest generalization.


Future Directions

This project provides a scalable pipeline and analytical foundation that can be extended toward:

  • Multimodal detection
  • Adversarial-resistant classifiers
  • Domain-specific detectors

Usage

To run the project:

  1. Clone the repository

     ```shell
     git clone https://github.com/Rohith-Kumar-S/AIDataHunter.git
     cd AIDataHunter
     ```

  2. Install dependencies using the provided environment file or requirements

     ```shell
     pip install -r requirements.txt
     ```

  3. Open the notebook and execute cells sequentially to reproduce preprocessing, feature generation, model training, and evaluation

About

End-to-end, scalable pipeline for detecting AI-generated versus human-written text with robust preprocessing, training, and evaluation.
