
AI vs Human Text Classification Pipeline

CS 6220 - Data Mining Project

This project implements a complete end-to-end workflow for detecting whether a given piece of text was written by a human or generated by an AI system. The pipeline focuses on scalable preprocessing, meaningful feature engineering, robust model training, and comprehensive evaluation, providing a reproducible framework for large-scale text classification.


Overview

The goal of this project is to distinguish AI-generated text from human-written text using a combination of traditional machine-learning algorithms and deep learning models.
A unified dataset was prepared by combining multiple sources containing labeled human and AI text. After preprocessing, vectorization, and structured model training, we evaluated multiple models to understand which approaches generalize best.

This project demonstrates:

  • Clean and modular preprocessing for large text datasets
  • Feature engineering based on linguistic and structural patterns
  • Handling imbalance through class-weight computation
  • Training and evaluation of multiple ML and transformer-based models
  • Analysis of failure modes and pattern-based insights

Pipeline Workflow

The notebook implements the following core stages:

1. Data Loading & Preprocessing

  • Lowercasing, whitespace normalization
  • Expansion of contractions (e.g., can't → cannot)
  • Duplicate removal
  • Label validation
  • Lemmatization for standardized token representation
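The stages above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the helper names (`clean_text`, `preprocess`) and the tiny contraction map are assumptions, and the real pipeline would use a proper lemmatizer (e.g., NLTK or spaCy) rather than the simple string cleanup shown here.

```python
import re

# Tiny illustrative contraction map; a real pipeline would use a fuller list.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "n't": " not"}

def clean_text(text: str) -> str:
    """Lowercase, expand contractions, and normalize whitespace."""
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(records: list[dict]) -> list[dict]:
    """Validate labels, clean text, and drop duplicates."""
    seen, out = set(), []
    for rec in records:
        if rec["label"] not in {"human", "ai"}:  # label validation
            continue
        cleaned = clean_text(rec["text"])
        if cleaned in seen:                      # duplicate removal
            continue
        seen.add(cleaned)
        out.append({"text": cleaned, "label": rec["label"]})
    return out
```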

2. Structural Feature Engineering

The pipeline extracts additional numeric features that help distinguish writing styles:

  • Character, word, and estimated sentence counts
  • Punctuation distributions
  • Average word length
  • Lexical diversity
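A sketch of how these structural features might be computed (the function name and exact definitions, such as estimating sentence count from terminal punctuation, are illustrative assumptions):

```python
import string

def structural_features(text: str) -> dict:
    """Numeric style features for one document."""
    words = text.split()
    n_words = max(1, len(words))
    return {
        "char_count": len(text),
        "word_count": len(words),
        # Rough sentence estimate from terminal punctuation
        "sentence_count": max(1, sum(text.count(p) for p in ".!?")),
        "punct_ratio": sum(ch in string.punctuation for ch in text) / max(1, len(text)),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        # Unique-token ratio as a simple lexical-diversity proxy
        "lexical_diversity": len({w.lower() for w in words}) / n_words,
    }
```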

3. Vectorization & Scaling

The cleaned text is transformed using:

  • TF-IDF (word-level)
  • Bag-of-Words representation

Numeric text features are standardized to ensure compatibility with ML models.
Class weights are computed to balance the learning signal across human/AI labels.
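With scikit-learn, these steps look roughly like the following (the toy data and default vectorizer parameters are assumptions; the notebook may configure n-gram ranges, vocabulary limits, etc.):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

texts = ["ai wrote this text", "a human wrote this", "another human essay"]
labels = np.array([1, 0, 0])  # illustrative: 1 = AI, 0 = human

# Word-level TF-IDF and Bag-of-Words representations
tfidf = TfidfVectorizer().fit_transform(texts)
bow = CountVectorizer().fit_transform(texts)

# Standardize numeric structural features (e.g., word/char counts)
numeric = np.array([[4, 19.0], [4, 18.0], [3, 19.0]])
scaled = StandardScaler().fit_transform(numeric)

# Balanced class weights counter label imbalance during training
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=labels)
```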

4. Dataset Splitting

A stratified split ensures equal representation of both classes across train, validation, and test sets.
This supports fair comparison and prevents hidden bias.
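One common way to produce such a split is two stratified calls to `train_test_split` (the 60/20/20 ratio and random seed below are assumptions, not taken from the notebook):

```python
from sklearn.model_selection import train_test_split

texts = [f"doc {i}" for i in range(10)]
labels = [0, 1] * 5

# First carve out the test set, then split the remainder into train/val.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)
```

Stratifying both calls keeps the human/AI ratio identical in all three subsets.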

5. Model Training & Evaluation

Models trained include:

  • Logistic Regression
  • Random Forest
  • Transformer-based BERT fine-tuning

Evaluation focuses on:

  • Accuracy
  • Precision / Recall for both classes
  • F1-score

Additionally, hyperparameter tuning was applied to all models to achieve the best performance.
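For the classical models, the train-tune-evaluate loop can be sketched as below. The toy corpus, the grid values, and the cross-validation settings are illustrative assumptions; the notebook's actual grids and folds may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

texts = ["ai generated passage"] * 5 + ["human written note"] * 5
labels = [1] * 5 + [0] * 5  # 1 = AI, 0 = human

X = TfidfVectorizer().fit_transform(texts)

# Small illustrative hyperparameter grid with class-weighted training
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=2, scoring="f1")
search.fit(X, labels)

preds = search.predict(X)
print(classification_report(labels, preds, target_names=["human", "ai"]))
```

The same pattern applies to Random Forest by swapping in `RandomForestClassifier` and a grid over tree depth and count.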


Key Results

Traditional Machine Learning Models

  • Logistic Regression improved from ~84% to ~89% accuracy after tuning.
  • Random Forest improved from ~85% to ~91% accuracy after systematic hyperparameter exploration.

BERT-Based Deep Learning Model

  • Achieved 93% accuracy
  • AI-recall: 93% (very strong at catching AI text)
  • Human-recall: 92%
  • F1-score: 0.93

This shows that both classical and deep learning models can effectively detect writing patterns, with transformers providing the strongest generalization.


Future Directions

This project provides a scalable pipeline and analytical foundation that can be extended toward:

  • Multimodal detection
  • Adversarial-resistant classifiers
  • Domain-specific detectors

Usage

To run the project:

  1. Clone the repository

     ```shell
     git clone https://github.com/Rohith-Kumar-S/AIDataHunter.git
     cd AIDataHunter
     ```

  2. Install dependencies using the provided environment file or requirements

     ```shell
     pip install -r requirements.txt
     ```

  3. Open the notebook and execute cells sequentially to reproduce preprocessing, feature generation, model training, and evaluation

About

End-to-end, scalable pipeline for detecting AI-generated versus human-written text with robust preprocessing, training, and evaluation.
