Northwestern University — Text Analytics Final Project
Spring 2025

Overview

This project applies multimodal machine learning to classify movie genres using textual and visual data from the IMDB Multimodal Genre Dataset. The models use plot summaries and movie poster images to predict four genres: Action, Comedy, Horror, and Romance. The text pipeline is complete; poster-based image classification is still in progress.

We implemented a full pipeline covering:

- Data cleaning and preprocessing
- Text summarization
- Classical and deep learning models
- Model evaluation
- Word importance visualization
- Interactive genre prediction tool

Project Structure

text_analytics_project/
├── genre_app.py                    # Streamlit app for interactive predictions
├── text_project.ipynb              # Main Jupyter notebook with code and analysis
├── label_encoder.pkl               # Saved label encoder
├── tfidf_vectorizer.pkl            # TF-IDF vectorizer
├── naive_bayes_model.pkl           # Naive Bayes model
├── logistic_regression_model.pkl   # Logistic Regression model
├── lstm_model.keras                # Trained LSTM model
├── lstm_tokenizer.json             # Tokenizer used for LSTM
├── lstm_model_architecture.png     # LSTM model architecture plot
├── Top 20 Words/                   # CSVs and charts of top words per genre
├── wordclouds/                     # Word cloud visualizations
├── Project Discussion.html         # HTML export of team discussions
├── Movie Analysis Tool.pdf         # Interactive predictions result
└── README.md                       # This file

Tasks Completed

Data Preprocessing
- Cleaned and tokenized plot summaries
- Removed noise (punctuation, stopwords, etc.)
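
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_summary` and the tiny `STOPWORDS` set are hypothetical, and the project may rely on a library stopword list instead.

```python
import re

# Hypothetical stopword list for illustration only; the project's real list is not shown.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "his", "her"}

def clean_summary(text):
    """Lowercase, strip punctuation and digits, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    # Replace every character that is not a lowercase letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

print(clean_summary("A detective hunts the killer in Chicago, 1987!"))
# ['detective', 'hunts', 'killer', 'chicago']
```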
Summarization
- Built a chunking + summarizer pipeline using Hugging Face Transformers
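
One way a chunking + summarizer pipeline can be organized is shown below. It is a sketch under stated assumptions: `chunk_words`, `summarize_long_text`, and the 200-word chunk size are hypothetical, and a trivial stand-in summarizer is used so the example runs without downloading a model.

```python
def chunk_words(text, max_words=200):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long_text(text, summarize, max_words=200):
    """Summarize each chunk with the `summarize` callable and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in chunk_words(text, max_words))

# Stand-in summarizer (an assumption for this sketch): keep the first five words of a chunk.
def first_words(chunk):
    return " ".join(chunk.split()[:5])

long_plot = " ".join(f"w{i}" for i in range(450))
print(len(chunk_words(long_plot)))  # 3 chunks: 200 + 200 + 50 words
print(summarize_long_text(long_plot, first_words))
```

With Hugging Face Transformers, the `summarize` argument could wrap a `pipeline("summarization")` object, for example a function that returns `summarizer(chunk)[0]["summary_text"]`. Which checkpoint the project used is not stated here.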
Modeling
- Naive Bayes (baseline)
- Logistic Regression
- LSTM (Keras)
- DistilBERT via Hugging Face Transformers (optional/advanced)
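
To make the baseline concrete, here is a small pure-Python multinomial Naive Bayes with add-one smoothing. It is a stand-in for illustration, not the saved `naive_bayes_model.pkl`: the class name `TinyNaiveBayes` and the toy training data are invented, and it works on raw token counts rather than TF-IDF features.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes over token counts with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {tok for counts in self.word_counts.values() for tok in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                if tok in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, invented for this sketch.
train_docs = [
    ["explosion", "chase", "gun"],
    ["joke", "wedding", "laugh"],
    ["ghost", "haunted", "scream"],
    ["love", "kiss", "wedding"],
]
train_labels = ["Action", "Comedy", "Horror", "Romance"]

model = TinyNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["haunted", "ghost"]))  # Horror
```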
Evaluation
- Accuracy by genre and overall
- Plots for LSTM model accuracy vs. epochs
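
Overall and per-genre accuracy can be computed along these lines. This sketch assumes "accuracy by genre" means the fraction of movies of each true genre that were labeled correctly; the function name `accuracy_report` and the sample labels are hypothetical.

```python
from collections import Counter

def accuracy_report(y_true, y_pred):
    """Return (overall accuracy, {genre: accuracy among movies whose true genre is that genre})."""
    correct, total = Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    per_genre = {genre: correct[genre] / total[genre] for genre in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_genre

# Invented labels for illustration.
y_true = ["Action", "Action", "Comedy", "Horror"]
y_pred = ["Action", "Comedy", "Comedy", "Romance"]
overall, per_genre = accuracy_report(y_true, y_pred)
print(overall)    # 0.5
print(per_genre)  # {'Action': 0.5, 'Comedy': 1.0, 'Horror': 0.0}
```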
Explainability
- Extracted top N words per genre
- Created word clouds per model/genre
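
A simple way to extract top N words per genre is to count tokens within each genre, as sketched below. This is an assumption about the approach: the project may instead rank words by model weights, and `top_words_per_genre` is a hypothetical name. The default of 20 mirrors the `Top 20 Words/` folder.

```python
from collections import Counter, defaultdict

def top_words_per_genre(docs, genres, n=20):
    """Count tokens per genre and return the n most frequent (word, count) pairs for each."""
    counts = defaultdict(Counter)
    for tokens, genre in zip(docs, genres):
        counts[genre].update(tokens)
    return {genre: counter.most_common(n) for genre, counter in counts.items()}

# Invented, already-cleaned token lists for illustration.
docs = [["ghost", "ghost", "house"], ["ghost", "scream"], ["love", "kiss", "love"]]
genres = ["Horror", "Horror", "Romance"]
print(top_words_per_genre(docs, genres, n=1))
# {'Horror': [('ghost', 3)], 'Romance': [('love', 2)]}
```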
Error Analysis
- Highlighted cases where models disagreed
- Interpreted model decisions
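
Finding the cases where models disagreed can be done by comparing their predictions row by row. The sketch below assumes each model's predictions are stored as a list aligned with the summaries; `find_disagreements` and the sample data are hypothetical.

```python
def find_disagreements(summaries, predictions):
    """Return (summary, {model: genre}) for every summary where the models do not all agree.

    `predictions` maps a model name to a list of predicted genres aligned with `summaries`.
    """
    rows = []
    for i, summary in enumerate(summaries):
        votes = {name: preds[i] for name, preds in predictions.items()}
        if len(set(votes.values())) > 1:
            rows.append((summary, votes))
    return rows

# Invented predictions for illustration.
summaries = ["A haunted house story", "Two strangers fall in love"]
predictions = {
    "naive_bayes": ["Horror", "Romance"],
    "logistic_regression": ["Horror", "Comedy"],
    "lstm": ["Horror", "Romance"],
}
for summary, votes in find_disagreements(summaries, predictions):
    print(summary, votes)
```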
Interactive Tool
- Built with Streamlit to:
  - Input summaries
  - Clean text
  - Predict genre using all models
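
The flow the tool follows (input, clean, predict with every model) can be sketched without Streamlit as a plain function. Everything here is a hypothetical stand-in: `predict_all`, `simple_clean`, and the keyword-based "models" only illustrate the shape of the logic, not the contents of `genre_app.py`.

```python
def predict_all(summary, clean, models):
    """Clean the summary once, then collect each model's predicted genre by name."""
    tokens = clean(summary)
    return {name: predict(tokens) for name, predict in models.items()}

# Hypothetical cleaner: lowercase and split on whitespace.
def simple_clean(text):
    return text.lower().split()

# Hypothetical keyword rules standing in for the trained models.
models = {
    "keyword_horror": lambda tokens: "Horror" if "ghost" in tokens else "Comedy",
    "keyword_romance": lambda tokens: "Romance" if "love" in tokens else "Action",
}

print(predict_all("A ghost falls in love", simple_clean, models))
# {'keyword_horror': 'Horror', 'keyword_romance': 'Romance'}
```

In a Streamlit app, the summary would typically come from a text input widget and the resulting dictionary would be rendered back to the page.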
Bonus (Visual Classification)
- Paired posters with genres
- Optional CNN image classification (in progress)

Launch the Interactive Tool

streamlit run genre_app.py
