
# 💬 Sentiment Analysis Using NLP & Machine Learning

IMDb Reviews · TF-IDF · Logistic Regression · SVM · Random Forest · Naive Bayes

A complete end-to-end NLP pipeline that classifies movie reviews as Positive or Negative, covering raw text cleaning, TF-IDF feature engineering, multi-model benchmarking, and comprehensive evaluation metrics including ROC-AUC.




πŸ” Overview

This project presents a machine learning pipeline for binary sentiment analysis on textual data, trained and evaluated on the IMDb Movie Review Dataset: 50,000 movie reviews, each labelled either Positive or Negative.

The pipeline covers every stage of an NLP workflow: raw text ingestion, exploratory data analysis, text cleaning with NLTK, TF-IDF and CountVectorizer feature engineering, training and comparing four classical ML classifiers, and a thorough evaluation using accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrices.

The best-performing model, Logistic Regression with TF-IDF, achieves ~90-91% accuracy with inference lightweight enough for real-time applications.

💡 Why classical ML over deep learning? Transformer models like BERT achieve higher accuracy on this task but require significantly more compute and memory. Classical ML with TF-IDF is fast, interpretable, deployable on CPU, and still achieves strong results on binary sentiment tasks, making it a good fit for production environments with resource constraints.


## 🌍 Real-World Applications

| Domain | Application |
|---|---|
| 🎬 Entertainment | Classify user movie, show, or book reviews automatically |
| 🛒 E-commerce | Analyse product review sentiment at scale |
| 📱 Social Media | Monitor brand sentiment on Twitter, Reddit, or Instagram |
| 🏦 Finance | Detect positive/negative sentiment in earnings call transcripts |
| 🏥 Healthcare | Classify patient feedback and satisfaction survey responses |
| 🤝 Customer Support | Prioritize negative tickets for immediate escalation |
| 📰 Media Monitoring | Track public sentiment toward news stories or political events |

## 🧰 Tech Stack

| Technology | Version | Purpose |
|---|---|---|
| Python | 3.8+ | Core programming language |
| NLTK | 3.x | Tokenization, stopword removal, lemmatization |
| Scikit-learn | 1.x | TF-IDF, CountVectorizer, ML models, evaluation metrics |
| Pandas | 1.x | Data loading, cleaning, and EDA |
| NumPy | 1.x | Numerical operations and array processing |
| Matplotlib | 3.x | Training visualizations and plots |
| Seaborn | 0.x | Confusion matrix heatmaps and styled charts |
| Jupyter Notebook | - | Interactive development and reporting environment |

## 📊 Dataset

Name: IMDb Movie Review Dataset
Source: Stanford AI Lab (Andrew Maas et al.)
Loaded via: `tensorflow.keras.datasets.imdb` or direct download

| Attribute | Value |
|---|---|
| Total reviews | 50,000 |
| Positive reviews | 25,000 (50%) |
| Negative reviews | 25,000 (50%) |
| Train split | 25,000 reviews |
| Test split | 25,000 reviews |
| Class balance | Perfectly balanced (1:1) |
| Label encoding | Positive → 1, Negative → 0 |
| Language | English |

Review characteristics:

- Average review length: ~230 words
- Range: 10 words to 2,500+ words
- Contains HTML tags, special characters, and domain-specific vocabulary
- Highly polarized: only reviews scored ≤ 4/10 (negative) or ≥ 7/10 (positive) are included

πŸ— System Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        RAW INPUT                                β”‚
β”‚         IMDb Dataset β€” 50,000 labelled movie reviews            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  EXPLORATORY DATA ANALYSIS                      β”‚
β”‚  Null checks Β· Class distribution Β· Review length histograms    β”‚
β”‚  Word cloud visualizations (positive vs. negative)              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     TEXT CLEANING (NLTK)                        β”‚
β”‚  Lowercase β†’ Strip HTML tags β†’ Remove special chars             β”‚
β”‚  β†’ Remove stopwords β†’ Tokenize β†’ Lemmatize                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   FEATURE ENGINEERING                           β”‚
β”‚  TF-IDF Vectorization (primary)                                 β”‚
β”‚  CountVectorizer (comparison baseline)                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   MODEL TRAINING & SELECTION                    β”‚
β”‚  Logistic Regression  Β·  SVM  Β·  Random Forest  Β·  Naive Bayes β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       EVALUATION                                β”‚
β”‚  Accuracy Β· Precision Β· Recall Β· F1-Score                       β”‚
β”‚  ROC-AUC Score Β· Confusion Matrix Heatmap                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

## 🔬 NLP Pipeline

### Step 1: Data Loading & EDA

```python
import pandas as pd

df = pd.read_csv('imdb_reviews.csv')
print(df['sentiment'].value_counts())    # Confirm class balance
print(df.isnull().sum())                 # Check for missing values
df['review_length'] = df['review'].apply(lambda x: len(x.split()))
```

EDA includes:

- Class distribution bar chart (Positive vs. Negative)
- Review length histogram, which reveals a bimodal distribution between short and long reviews
- Word clouds for positive and for negative reviews

### Step 2: Label Encoding

```python
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
```

### Step 3: Text Cleaning with NLTK

A full cleaning function is applied to every review:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download(['stopwords', 'wordnet', 'punkt'])

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                                   # Lowercase
    text = re.sub(r'<.*?>', ' ', text)                    # Strip HTML tags (space avoids gluing words)
    text = re.sub(r'[^a-z\s]', '', text)                  # Remove special chars & digits
    tokens = nltk.word_tokenize(text)                     # Tokenize
    tokens = [t for t in tokens if t not in stop_words]   # Remove stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # Lemmatize
    return ' '.join(tokens)

df['clean_review'] = df['review'].apply(clean_text)
```
| Cleaning Step | Raw Example | After |
|---|---|---|
| Lowercase | `"The Film was GREAT"` | `"the film was great"` |
| Strip HTML | `"<br/>Great movie"` | `"great movie"` |
| Remove special chars | `"movie!!! 10/10"` | `"movie"` |
| Remove stopwords | `"this is a great film"` | `"great film"` |
| Lemmatization | `"movies"`, `"films"` | `"movie"`, `"film"` |
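As a quick sanity check of the cleaning steps, here is a dependency-light sketch of the same transformations (HTML stripping, character filtering, stopword removal; lemmatization omitted), using scikit-learn's built-in English stopword list instead of NLTK's so it runs without corpus downloads. The function name `clean_text_lite` is illustrative, not from the notebook:

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_text_lite(text):
    """Lowercase, strip HTML, drop non-letters and stopwords (no lemmatization)."""
    text = text.lower()
    text = re.sub(r'<.*?>', ' ', text)      # strip HTML tags
    text = re.sub(r'[^a-z\s]', '', text)    # remove special chars & digits
    tokens = [t for t in text.split() if t not in ENGLISH_STOP_WORDS]
    return ' '.join(tokens)

print(clean_text_lite("<br/>The Film was GREAT!!! 10/10"))   # → film great
print(clean_text_lite("this is a great film"))               # → great film
```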

### Step 4: Train / Test Split

```python
from sklearn.model_selection import train_test_split

X = df['clean_review']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
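A quick sanity check of the 80/20 split on a toy balanced frame; note that `stratify` is an optional extra (not in the notebook's call) that keeps the 1:1 class balance exact in both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned IMDb data
toy = pd.DataFrame({
    'clean_review': [f'review {i}' for i in range(100)],
    'sentiment': [i % 2 for i in range(100)],   # balanced 1:1 labels
})

X_train, X_test, y_train, y_test = train_test_split(
    toy['clean_review'], toy['sentiment'],
    test_size=0.2, random_state=42, stratify=toy['sentiment']
)

print(len(X_train), len(X_test))   # 80 20
print(y_test.mean())               # 0.5: balance preserved in the test split
```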

βš™οΈ Feature Engineering

TF-IDF Vectorization (Primary)

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf  = tfidf.transform(X_test)

Why TF-IDF over raw word counts?

| Method | Handles common words | Captures importance | Sparse representation |
|---|---|---|---|
| CountVectorizer | ❌ Over-weights frequent words | ❌ | ✅ |
| TF-IDF | ✅ Penalizes common words | ✅ Rewards unique, discriminative words | ✅ |

Configuration:

- `max_features=10000`: vocabulary capped at the top 10,000 terms by corpus frequency
- `ngram_range=(1, 2)`: captures both unigrams ("great") and bigrams ("not great") for negation handling
- Fit only on training data, then transform the test data, to prevent leakage

### CountVectorizer (Comparison Baseline)

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=10000)
X_train_cv = cv.fit_transform(X_train)
X_test_cv  = cv.transform(X_test)
```

Used as a baseline to confirm that TF-IDF outperforms raw counts on this dataset.


## 🤖 Models & Benchmarks

Four classical ML classifiers are trained and compared on the TF-IDF feature matrix:

### Logistic Regression ⭐ Best Model

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1.0, max_iter=1000, solver='lbfgs')
lr.fit(X_train_tfidf, y_train)
```

Logistic Regression is highly effective for high-dimensional sparse text features. Its linear decision boundary maps well to TF-IDF space and produces interpretable feature weights.

### Support Vector Machine (SVM)

```python
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1.0, max_iter=2000)
svm.fit(X_train_tfidf, y_train)
```

LinearSVC is chosen over kernel SVM for efficiency on high-dimensional text data. It maximises the margin between positive and negative review representations.

### Random Forest

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_tfidf, y_train)
```

An ensemble of 100 decision trees. Less well suited to high-dimensional sparse data, but benchmarked for completeness.

### Naive Bayes

```python
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=1.0)
nb.fit(X_train_tfidf, y_train)
```

Based on Bayes' theorem with a word-independence assumption. Extremely fast to train, and performs surprisingly well on text data.
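All four classifiers share the same scikit-learn fit/predict API, so the benchmark loop is only a few lines. A sketch on a tiny synthetic corpus (the real notebook runs this on the IMDb TF-IDF matrices; the accuracies here are not the reported results):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Trivially separable toy data standing in for the cleaned IMDb reviews
texts = ['good great fine nice'] * 20 + ['bad awful poor boring'] * 20
labels = [1] * 20 + [0] * 20

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.25,
                                          random_state=42, stratify=labels)
vec = TfidfVectorizer()
X_tr_t, X_te_t = vec.fit_transform(X_tr), vec.transform(X_te)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM (LinearSVC)': LinearSVC(C=1.0, max_iter=2000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Naive Bayes': MultinomialNB(alpha=1.0),
}
results = {name: accuracy_score(y_te, m.fit(X_tr_t, y_tr).predict(X_te_t))
           for name, m in models.items()}
for name, acc in results.items():
    print(f'{name:20s} {acc:.2f}')
```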


## 📂 Project Structure

```
NLP-Sentiment-Analysis-Model/
│
├── sentiment analysis.ipynb     # Main notebook: full NLP pipeline
│                                # ├─ Section 1: Data Loading & EDA
│                                # ├─ Section 2: Text Cleaning (NLTK)
│                                # ├─ Section 3: Feature Engineering (TF-IDF)
│                                # ├─ Section 4: Model Training (4 classifiers)
│                                # └─ Section 5: Evaluation & Visualizations
│
├── README.md                    # Project documentation (this file)
└── dataset/                     # (optional) Local IMDb dataset storage
```

## 📦 Installation & Setup

### Prerequisites

- Python 3.8 or higher
- pip or conda package manager

### Option A: Local Environment

```shell
# 1. Clone the repository
git clone https://github.com/ibtesaamaslam/NLP-Sentiment-Analysis-Model.git
cd NLP-Sentiment-Analysis-Model

# 2. Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate        # macOS / Linux
venv\Scripts\activate           # Windows

# 3. Install dependencies
pip install pandas numpy matplotlib seaborn nltk scikit-learn jupyter wordcloud

# 4. Download NLTK resources (run once)
python -c "import nltk; nltk.download(['stopwords','wordnet','punkt'])"

# 5. Launch Jupyter
jupyter notebook
```

### Option B: Quick pip install

```shell
pip install pandas numpy matplotlib seaborn nltk scikit-learn jupyter wordcloud
```

## ▶ How to Run

1. Open `sentiment analysis.ipynb` in Jupyter Notebook or JupyterLab.
2. Select Kernel → Restart & Run All.
3. The notebook loads the IMDb data, cleans the text, trains all four models, and displays all evaluation outputs automatically.

Dataset note: the IMDb dataset can be loaded via `keras.datasets.imdb`, or downloaded directly from Stanford AI Lab and placed in the `dataset/` folder.


## 📈 Results & Evaluation

### Model Comparison

| Model | Vectorizer | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|---|
| Logistic Regression ⭐ | TF-IDF | ~90-91% | ~90% | ~91% | ~90% | ~96% |
| SVM (LinearSVC) | TF-IDF | ~89-90% | ~89% | ~90% | ~89% | ~95% |
| Naive Bayes | TF-IDF | ~87-88% | ~87% | ~88% | ~87% | ~94% |
| Random Forest | TF-IDF | ~84-86% | ~85% | ~85% | ~85% | ~92% |
| Logistic Regression | CountVectorizer | ~88-89% | ~88% | ~89% | ~88% | ~94% |

Winner: Logistic Regression + TF-IDF, with the best accuracy, the highest ROC-AUC, and inference fast enough for real-time deployment.

### Why Logistic Regression Wins on Text

Text features (TF-IDF vectors) are inherently high-dimensional and sparse. Logistic Regression's linear decision boundary is well suited to this geometry: it learns which words are most predictive of each sentiment class and weights them accordingly. More complex models like Random Forest struggle because they split on individual features in a space where thousands of features each carry a weak signal.

### Evaluation Metrics Used

| Metric | Why It Matters |
|---|---|
| Accuracy | Overall correctness across all 10,000 test reviews |
| Precision | Of all reviews predicted positive, how many actually were? |
| Recall | Of all truly positive reviews, how many did the model catch? |
| F1-Score | Harmonic mean of precision and recall; a balanced metric |
| ROC-AUC | Area under the ROC curve: the model's ability to rank positive above negative |
| Confusion Matrix | Exact counts of true positives, false positives, true negatives, and false negatives |
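All of these come straight from `sklearn.metrics`. A small worked example with hand-made predictions (the numbers below are illustrative, not the notebook's results):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Illustrative labels and predictions (1 = positive, 0 = negative)
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred  = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_score = np.array([.9, .8, .7, .4, .2, .3, .6, .1])   # predicted P(positive)

print('accuracy :', accuracy_score(y_true, y_pred))    # 0.75
print('precision:', precision_score(y_true, y_pred))   # 0.75
print('recall   :', recall_score(y_true, y_pred))      # 0.75
print('f1       :', f1_score(y_true, y_pred))          # 0.75
print('roc_auc  :', roc_auc_score(y_true, y_score))    # 0.9375
print(confusion_matrix(y_true, y_pred))                # rows: [[TN FP], [FN TP]]
```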

## 📉 Visualizations

The notebook produces the following outputs automatically:

| Visualization | Description |
|---|---|
| Class distribution bar chart | Positive vs. negative review counts; confirms balance |
| Review length histogram | Distribution of word counts across training reviews |
| Word cloud (positive) | Most frequent terms in positive reviews |
| Word cloud (negative) | Most frequent terms in negative reviews |
| Training accuracy comparison | Bar chart comparing all 4 models |
| ROC curves | One curve per model, for visual AUC comparison |
| Confusion matrix heatmap | Seaborn heatmap for the best model (Logistic Regression) |
| Top TF-IDF features | Bar chart of the most predictive positive and negative words |

## 🗺 Roadmap & Future Improvements

- Streamlit web app: interactive UI to input any text and receive a live sentiment prediction
- BERT / RoBERTa fine-tuning: fine-tune a pre-trained transformer for higher accuracy (~93-95%)
- Multilingual support: add language detection and sentiment analysis in Arabic, French, Urdu, and more
- Extended data sources: Twitter/X tweets, YouTube comments, Amazon product reviews, Google Play Store feedback
- Aspect-level sentiment: go beyond document-level to identify sentiment toward specific aspects (e.g., "great acting, terrible plot")
- Model explainability: LIME or SHAP to explain individual predictions ("this review was classified negative because of: boring, disappointing, waste")
- REST API endpoint: FastAPI wrapper for integration with external applications
- Cross-validation: replace the single train/test split with k-fold cross-validation for more robust estimates
- Hyperparameter tuning: GridSearchCV on TF-IDF parameters and classifier regularization strength
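The last two roadmap items can be prototyped with a scikit-learn `Pipeline`, which also re-fits the vectorizer inside every fold so no test-fold vocabulary leaks into training. A sketch with toy data and an illustrative parameter grid:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the cleaned IMDb reviews
texts = ['good great fine', 'nice lovely good', 'superb great nice',
         'bad awful poor', 'boring bad dull', 'awful dull poor'] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# k-fold CV + grid search over vectorizer and regularization settings
grid = GridSearchCV(pipe, {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0, 10.0],
}, cv=5, scoring='accuracy')
grid.fit(texts, labels)
print(grid.best_params_, round(grid.best_score_, 2))
```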

## 🤝 Contributing

Contributions are welcome! Here's how to get involved:

```shell
# 1. Fork the repository on GitHub

# 2. Clone your fork
git clone https://github.com/YOUR-USERNAME/NLP-Sentiment-Analysis-Model.git

# 3. Create a feature branch
git checkout -b feature/add-streamlit-app

# 4. Make your changes and commit
git add .
git commit -m "feat: add Streamlit interactive demo for live sentiment prediction"

# 5. Push and open a Pull Request
git push origin feature/add-streamlit-app
```

Ideas for contributions: add a new model, improve text cleaning, add multilingual support, write unit tests, or build the Streamlit frontend.


## 👤 Author

Ibtesaam Aslam

GitHub

Machine Learning Engineer & NLP Enthusiast


## 📜 License

MIT License

Copyright (c) 2024 Ibtesaam Aslam

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

### License Permissions at a Glance

| Permission | Status |
|---|---|
| ✅ Commercial use | Allowed |
| ✅ Modification | Allowed |
| ✅ Distribution | Allowed |
| ✅ Private use | Allowed |
| ❌ Liability | No warranty provided |
| ❌ Trademark use | Not granted |

πŸ™ Acknowledgements

  • Andrew Maas et al., Stanford AI Lab β€” For creating and releasing the IMDb Large Movie Review Dataset that powers this project.
  • NLTK Team β€” For the comprehensive natural language processing toolkit including WordNetLemmatizer, stopwords corpus, and tokenizers.
  • Scikit-learn β€” For the consistent, well-documented ML API that makes model training, evaluation, and comparison straightforward.
  • The open-source Python data science community β€” For Pandas, NumPy, Matplotlib, Seaborn, and WordCloud.

⭐ If this project was useful to you, please consider starring it on GitHub!


Made with ❤️ by Ibtesaam Aslam
