A complete end-to-end NLP pipeline that classifies movie reviews as Positive or Negative, covering raw text cleaning, TF-IDF feature engineering, multi-model benchmarking, and comprehensive evaluation metrics including ROC-AUC.
- Overview
- Real-World Applications
- Tech Stack
- Dataset
- System Architecture
- NLP Pipeline
- Feature Engineering
- Models & Benchmarks
- Project Structure
- Installation & Setup
- How to Run
- Results & Evaluation
- Visualizations
- Roadmap & Future Improvements
- Contributing
- Author
- License
- Acknowledgements
This project presents a machine learning pipeline for binary sentiment analysis on textual data, trained and evaluated on the IMDb Movie Review Dataset: 50,000 movie reviews, each labelled as either Positive or Negative.
The pipeline covers every stage of an NLP workflow: raw text ingestion, exploratory data analysis, text cleaning with NLTK, TF-IDF and CountVectorizer feature engineering, training and comparing four classical ML classifiers, and a thorough evaluation using accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrices.
The best-performing model, Logistic Regression with TF-IDF, achieves ~90-91% accuracy with inference lightweight enough for real-time applications.
💡 Why classical ML over deep learning? Transformer models like BERT achieve higher accuracy on this task but require significantly more compute and memory. Classical ML with TF-IDF is fast, interpretable, deployable on CPU, and still achieves near state-of-the-art results on binary sentiment tasks, making it ideal for production environments with resource constraints.
| Domain | Application |
|---|---|
| Entertainment | Classify user movie, show, or book reviews automatically |
| E-commerce | Analyse product review sentiment at scale |
| Social Media | Monitor brand sentiment on Twitter, Reddit, or Instagram |
| Finance | Detect positive/negative sentiment in earnings call transcripts |
| Healthcare | Classify patient feedback and satisfaction survey responses |
| Customer Support | Prioritize negative tickets for immediate escalation |
| Media Monitoring | Track public sentiment toward news stories or political events |
| Technology | Version | Purpose |
|---|---|---|
| Python | 3.8+ | Core programming language |
| NLTK | 3.x | Tokenization, stopword removal, lemmatization |
| Scikit-learn | 1.x | TF-IDF, CountVectorizer, ML models, evaluation metrics |
| Pandas | 1.x | Data loading, cleaning, and EDA |
| NumPy | 1.x | Numerical operations and array processing |
| Matplotlib | 3.x | Training visualizations and plots |
| Seaborn | 0.x | Confusion matrix heatmaps and styled charts |
| Jupyter Notebook | - | Interactive development and reporting environment |
Name: IMDb Movie Review Dataset
Source: Stanford AI Lab (Andrew Maas et al.)
Loaded via: `tensorflow.keras.datasets.imdb` or direct download
| Attribute | Value |
|---|---|
| Total reviews | 50,000 |
| Positive reviews | 25,000 (50%) |
| Negative reviews | 25,000 (50%) |
| Train split | 25,000 reviews |
| Test split | 25,000 reviews |
| Class balance | Perfectly balanced (1:1) |
| Label encoding | Positive → 1, Negative → 0 |
| Language | English |
Review characteristics:
- Average review length: ~230 words
- Range: 10 words to 2,500+ words
- Contains HTML tags, special characters, and domain-specific vocabulary
- Highly polarized: only reviews scored ≤ 4/10 (negative) or ≥ 7/10 (positive) are included
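For the direct-download route, the Stanford archive unpacks into `aclImdb/train` and `aclImdb/test`, each containing `pos/` and `neg/` subfolders with one review per `.txt` file. A minimal loader sketch for that layout (the helper name `load_imdb_split` is ours, not part of the project):

```python
from pathlib import Path

import pandas as pd

def load_imdb_split(split_dir):
    """Read one split of the Stanford aclImdb layout (a folder with pos/
    and neg/ subfolders of *.txt files) into a labelled DataFrame."""
    rows = []
    for label, value in (("pos", 1), ("neg", 0)):   # Positive -> 1, Negative -> 0
        for path in sorted(Path(split_dir, label).glob("*.txt")):
            rows.append({"review": path.read_text(encoding="utf-8"),
                         "sentiment": value})
    return pd.DataFrame(rows)
```

Calling `load_imdb_split("aclImdb/train")` would then yield the 25,000-row training DataFrame used in the rest of the pipeline.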
```
┌────────────────────────────────────────────────────────────────┐
│                           RAW INPUT                            │
│          IMDb Dataset: 50,000 labelled movie reviews           │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                   EXPLORATORY DATA ANALYSIS                    │
│  Null checks · Class distribution · Review length histograms   │
│       Word cloud visualizations (positive vs. negative)        │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                      TEXT CLEANING (NLTK)                      │
│       Lowercase → Strip HTML tags → Remove special chars       │
│           → Tokenize → Remove stopwords → Lemmatize            │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                      FEATURE ENGINEERING                       │
│                 TF-IDF Vectorization (primary)                 │
│              CountVectorizer (comparison baseline)             │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                   MODEL TRAINING & SELECTION                   │
│    Logistic Regression · SVM · Random Forest · Naive Bayes     │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                           EVALUATION                           │
│            Accuracy · Precision · Recall · F1-Score            │
│            ROC-AUC Score · Confusion Matrix Heatmap            │
└────────────────────────────────────────────────────────────────┘
```
```python
import pandas as pd

df = pd.read_csv('imdb_reviews.csv')
print(df['sentiment'].value_counts())  # Confirm class balance
print(df.isnull().sum())               # Check for missing values
df['review_length'] = df['review'].apply(lambda x: len(x.split()))
```

EDA includes:
- Class distribution bar chart (Positive vs. Negative)
- Review length histogram, revealing a bimodal distribution between short and long reviews
- Word cloud for positive reviews vs. word cloud for negative reviews
```python
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
```

A full cleaning function is applied to every review:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download(['stopwords', 'wordnet', 'punkt'])

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                                  # Lowercase
    text = re.sub(r'<.*?>', '', text)                    # Strip HTML tags
    text = re.sub(r'[^a-z\s]', '', text)                 # Remove special chars & digits
    tokens = nltk.word_tokenize(text)                    # Tokenize
    tokens = [t for t in tokens if t not in stop_words]  # Remove stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # Lemmatize
    return ' '.join(tokens)

df['clean_review'] = df['review'].apply(clean_text)
```

| Cleaning Step | Raw Example | After |
|---|---|---|
| Lowercase | `"The Film was GREAT"` | `"the film was great"` |
| Strip HTML | `"<br/>Great movie"` | `"great movie"` |
| Remove special chars | `"movie!!! 10/10"` | `"movie"` |
| Remove stopwords | `"this is a great film"` | `"great film"` |
| Lemmatization | `"running", "runs", "ran"` | `"run"` |
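The steps in the table can be reproduced with a dependency-free sketch (no NLTK downloads; lemmatization is omitted because it needs NLTK's WordNet data, and the stopword list here is a small stand-in for NLTK's full English list):

```python
import re

# Small stand-in stopword list; NLTK's English list has ~180 entries.
STOPWORDS = {"the", "a", "an", "this", "is", "was", "and", "it", "of", "to"}

def clean_text_basic(text):
    """Dependency-free version of the cleaning steps above (no lemmatization)."""
    text = text.lower()                   # Lowercase
    text = re.sub(r'<.*?>', '', text)     # Strip HTML tags
    text = re.sub(r'[^a-z\s]', '', text)  # Remove special chars & digits
    tokens = text.split()                 # Whitespace tokenization
    return ' '.join(t for t in tokens if t not in STOPWORDS)

print(clean_text_basic("<br/>The Film was GREAT!!! 10/10"))  # -> "film great"
```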
```python
from sklearn.model_selection import train_test_split

X = df['clean_review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
```

Why TF-IDF over raw word counts?

| Method | Handles common words | Captures importance | Sparse representation |
|---|---|---|---|
| CountVectorizer | ❌ Over-weights frequent words | ❌ | ✅ |
| TF-IDF | ✅ Penalizes common words | ✅ Rewards unique, discriminative words | ✅ |

Configuration:
- `max_features=10000`: vocabulary capped at the top 10,000 terms by TF-IDF score
- `ngram_range=(1, 2)`: captures both unigrams ("great") and bigrams ("not great") for negation handling
- `fit` only on training data, `transform` on test data, to prevent leakage
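To make "penalizes common words" concrete: scikit-learn's default smoothed IDF is ln((1 + n) / (1 + df)) + 1, so a term appearing in nearly every review scores close to 1 while a rare, discriminative term scores several times higher. A toy calculation:

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """scikit-learn's default smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n = 1000
common = smoothed_idf(n, 990)  # a word in almost every review, e.g. "movie"
rare = smoothed_idf(n, 10)     # a rare discriminative word, e.g. "masterpiece"
print(common, rare)            # the rare word's weight is several times larger
```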
```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=10000)
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)
```

Used as a baseline to confirm TF-IDF's superiority on this dataset.
Four classical ML classifiers are trained and compared on the TF-IDF feature matrix:
```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1.0, max_iter=1000, solver='lbfgs')
lr.fit(X_train_tfidf, y_train)
```

Logistic Regression is highly effective for high-dimensional sparse text features. Its linear decision boundary maps well to TF-IDF space and produces interpretable feature weights.
```python
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1.0, max_iter=2000)
svm.fit(X_train_tfidf, y_train)
```

LinearSVC is chosen over kernel SVM for efficiency on high-dimensional text data. It maximises the margin between positive and negative review representations.
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_tfidf, y_train)
```

An ensemble of 100 decision trees. Less well suited to high-dimensional sparse data, but benchmarked for completeness.
```python
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=1.0)
nb.fit(X_train_tfidf, y_train)
```

Based on Bayes' theorem with a word-independence assumption. Extremely fast to train, and it performs surprisingly well on text data.
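One way to benchmark the four classifiers side by side is a small harness like the sketch below. It runs on a synthetic, clearly separable corpus so it executes anywhere; the scores it prints are illustrative only, not the IMDb numbers reported later.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Synthetic "reviews" built from disjoint sentiment vocabularies.
random.seed(42)
pos_words = ["great", "wonderful", "brilliant", "loved", "superb"]
neg_words = ["boring", "terrible", "awful", "dull", "waste"]
docs = ([" ".join(random.choices(pos_words, k=8)) for _ in range(100)]
        + [" ".join(random.choices(neg_words, k=8)) for _ in range(100)])
labels = [1] * 100 + [0] * 100

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=42, stratify=labels)
vec = TfidfVectorizer()
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (LinearSVC)": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Naive Bayes": MultinomialNB(),
}
scores = {}
for name, model in models.items():
    pred = model.fit(Xtr, y_train).predict(Xte)
    scores[name] = (accuracy_score(y_test, pred), f1_score(y_test, pred))
    print(f"{name:20s} acc={scores[name][0]:.3f} f1={scores[name][1]:.3f}")
```

On the real IMDb features, the same loop would simply be pointed at `X_train_tfidf` / `X_test_tfidf`.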
```
NLP-Sentiment-Analysis-Model/
│
├── sentiment analysis.ipynb      # Main notebook: full NLP pipeline
│                                 #  ├─ Section 1: Data Loading & EDA
│                                 #  ├─ Section 2: Text Cleaning (NLTK)
│                                 #  ├─ Section 3: Feature Engineering (TF-IDF)
│                                 #  ├─ Section 4: Model Training (4 classifiers)
│                                 #  └─ Section 5: Evaluation & Visualizations
│
├── README.md                     # Project documentation (this file)
└── dataset/                      # (optional) Local IMDb dataset storage
```
- Python 3.8 or higher
- pip or conda package manager
```bash
# 1. Clone the repository
git clone https://github.com/ibtesaamaslam/NLP-Sentiment-Analysis-Model.git
cd NLP-Sentiment-Analysis-Model

# 2. Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # macOS / Linux
venv\Scripts\activate      # Windows

# 3. Install dependencies
pip install pandas numpy matplotlib seaborn nltk scikit-learn jupyter wordcloud

# 4. Download NLTK resources (run once)
python -c "import nltk; nltk.download(['stopwords','wordnet','punkt'])"

# 5. Launch Jupyter
jupyter notebook
```

- Open `sentiment analysis.ipynb` in Jupyter Notebook or JupyterLab.
- Select Kernel → Restart & Run All.
- The notebook will load the IMDb data, clean the text, train all four models, and display all evaluation outputs automatically.

> Dataset note: The IMDb dataset can be loaded via `keras.datasets.imdb`, or downloaded directly from Stanford AI Lab and placed in the `dataset/` folder.
| Model | Vectorizer | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|---|
| Logistic Regression (best) | TF-IDF | ~90-91% | ~90% | ~91% | ~90% | ~96% |
| SVM (LinearSVC) | TF-IDF | ~89-90% | ~89% | ~90% | ~89% | ~95% |
| Naive Bayes | TF-IDF | ~87-88% | ~87% | ~88% | ~87% | ~94% |
| Random Forest | TF-IDF | ~84-86% | ~85% | ~85% | ~85% | ~92% |
| Logistic Regression | CountVectorizer | ~88-89% | ~88% | ~89% | ~88% | ~94% |

Winner: Logistic Regression + TF-IDF, with the best accuracy, the highest ROC-AUC, and the fastest inference speed for real-time deployment.
Text features (TF-IDF vectors) are inherently high-dimensional and sparse. Logistic Regression's linear decision boundary is well-suited to this geometry; it learns which words are most predictive of each sentiment class and weights them accordingly. More complex models like Random Forest struggle because they split on individual features in a space where thousands of features each carry weak signals.
| Metric | Why It Matters |
|---|---|
| Accuracy | Overall correctness across all 10,000 test reviews |
| Precision | Of all reviews predicted positive, how many actually were? |
| Recall | Of all truly positive reviews, how many did the model catch? |
| F1-Score | Harmonic mean of precision and recall, a balanced metric |
| ROC-AUC | Area under the ROC curve: the model's ability to rank positive above negative |
| Confusion Matrix | Exact count of true positives, false positives, true negatives, false negatives |
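All of these metrics are available from `sklearn.metrics`. The toy labels below are hand-checkable and are not model output; for `LinearSVC`, which lacks `predict_proba`, ROC-AUC is typically computed from `decision_function` scores instead.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 1, 1, 0, 0, 0]
y_pred  = [1, 1, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]   # predicted probabilities, for ROC-AUC

print(accuracy_score(y_true, y_pred))    # 4 of the 6 predictions are correct
print(precision_score(y_true, y_pred))   # 2 hits out of 3 predicted positives
print(recall_score(y_true, y_pred))      # 2 true positives caught out of 3
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))
print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted class
```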
The notebook produces the following outputs automatically:
| Visualization | Description |
|---|---|
| Class distribution bar chart | Positive vs. negative review counts, confirming balance |
| Review length histogram | Distribution of word counts across training reviews |
| Word cloud (positive) | Most frequent terms in positive reviews |
| Word cloud (negative) | Most frequent terms in negative reviews |
| Training accuracy comparison | Bar chart comparing all 4 models |
| ROC curves | One curve per model, for visual AUC comparison |
| Confusion matrix heatmap | Seaborn heatmap for the best model (Logistic Regression) |
| Top TF-IDF features | Bar chart of most predictive positive and negative words |
- Streamlit web app: interactive UI to input any text and receive a live sentiment prediction
- BERT / RoBERTa fine-tuning: fine-tune a pre-trained transformer for higher accuracy (~93-95%)
- Multilingual support: add language detection and sentiment analysis in Arabic, French, Urdu, and more
- Extended data sources: Twitter/X tweets, YouTube comments, Amazon product reviews, Google Play store feedback
- Aspect-level sentiment: go beyond document-level to identify sentiment toward specific aspects (e.g., "great acting, terrible plot")
- Model explainability: LIME or SHAP to explain individual predictions ("this review was classified negative because of: boring, disappointing, waste")
- REST API endpoint: FastAPI wrapper for integration with external applications
- Cross-validation: replace the single train/test split with k-fold cross-validation for more robust estimates
- Hyperparameter tuning: GridSearchCV on TF-IDF parameters and classifier regularization strength
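The last two roadmap items can be combined in a single step with a scikit-learn `Pipeline` wrapped in `GridSearchCV`. The sketch below runs on a tiny stand-in corpus so it executes instantly; on the real data it would be fitted on `df['clean_review']` and `df['sentiment']`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny stand-in corpus (1 = positive, 0 = negative).
docs = ["great film", "great movie", "wonderful film", "wonderful movie",
        "boring film", "boring movie", "awful film", "awful movie"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe,
                    param_grid={"tfidf__ngram_range": [(1, 1), (1, 2)],
                                "clf__C": [0.1, 1.0, 10.0]},
                    cv=4, scoring="f1")
grid.fit(docs, labels)
print(grid.best_params_, round(grid.best_score_, 3))
```

Putting the vectorizer inside the pipeline ensures that, within each cross-validation fold, TF-IDF is fitted only on that fold's training portion, so the tuning itself stays leakage-free.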
Contributions are welcome! Here's how to get involved:
```bash
# 1. Fork the repository on GitHub

# 2. Clone your fork
git clone https://github.com/YOUR-USERNAME/NLP-Sentiment-Analysis-Model.git

# 3. Create a feature branch
git checkout -b feature/add-streamlit-app

# 4. Make your changes and commit
git add .
git commit -m "feat: add Streamlit interactive demo for live sentiment prediction"

# 5. Push and open a Pull Request
git push origin feature/add-streamlit-app
```

Ideas for contributions: add a new model, improve text cleaning, add multilingual support, write unit tests, or build the Streamlit frontend.
MIT License
Copyright (c) 2024 Ibtesaam Aslam
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
| Permission | Status |
|---|---|
| ✅ Commercial use | Allowed |
| ✅ Modification | Allowed |
| ✅ Distribution | Allowed |
| ✅ Private use | Allowed |
| ❌ Liability | No warranty provided |
| ❌ Trademark use | Not granted |
- Andrew Maas et al., Stanford AI Lab: for creating and releasing the IMDb Large Movie Review Dataset that powers this project.
- NLTK Team: for the comprehensive natural language processing toolkit, including WordNetLemmatizer, the stopwords corpus, and tokenizers.
- Scikit-learn: for the consistent, well-documented ML API that makes model training, evaluation, and comparison straightforward.
- The open-source Python data science community: for Pandas, NumPy, Matplotlib, Seaborn, and WordCloud.
⭐ If this project was useful to you, please consider starring it on GitHub!

Made with ❤️ by Ibtesaam Aslam