Northwestern University — Text Analytics Final Project
Spring 2025
This project applies multimodal machine learning techniques to classify movie genres using both textual and visual data from the IMDB Multimodal Genre Dataset. Our models analyze movie poster images and plot summaries to predict genres: Action, Comedy, Horror, and Romance.
We implemented a full pipeline covering:
- Data cleaning and preprocessing
- Text summarization
- Classical and deep learning models
- Model evaluation
- Word importance visualization
- Interactive genre prediction tool
text_analytics_project/
├── genre_app.py                    # Streamlit app for interactive predictions
├── text_project.ipynb              # Main Jupyter notebook with code and analysis
├── label_encoder.pkl               # Saved label encoder
├── tfidf_vectorizer.pkl            # TF-IDF vectorizer
├── naive_bayes_model.pkl           # Naive Bayes model
├── logistic_regression_model.pkl   # Logistic Regression model
├── lstm_model.keras                # Trained LSTM model
├── lstm_tokenizer.json             # Tokenizer used for LSTM
├── lstm_model_architecture.png     # LSTM model architecture plot
├── Top 20 Words/                   # CSVs and charts of top words per genre
├── wordclouds/                     # Word cloud visualizations
├── Project Discussion.html         # HTML export of team discussions
├── Movie Analysis Tool.pdf         # Interactive predictions result
└── README.md                       # This file
- Data Preprocessing
  - Cleaned and tokenized plot summaries
  - Removed noise (punctuation, stopwords, etc.)
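The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `clean_text` helper and the tiny `STOPWORDS` set are assumptions made for the example, and the project may well have used a fuller stopword list (for instance from NLTK).

```python
import string

# Small illustrative stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "his", "her", "with", "on"}

# Translation table that maps every punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace, drop stopwords."""
    lowered = text.lower().translate(PUNCT_TABLE)
    return [token for token in lowered.split() if token not in STOPWORDS]


tokens = clean_text("A retired hitman returns to the city, seeking revenge!")
print(tokens)
```

The returned token list can be joined back into a string before vectorization if a downstream step expects plain text.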
- Summarization
  - Built a chunking + summarizer pipeline using Hugging Face Transformers
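One plausible shape for a chunk-then-summarize pipeline is sketched below. The function names, the word-based chunk size, and the stand-in summarizer are all assumptions for illustration; the notebook's real chunking rule and model choice are not stated here. The summarizer is passed in as a plain callable so the sketch runs without downloading a model.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long_text(text: str, summarizer: Callable[[str], str], max_words: int = 200) -> str:
    """Summarize each chunk independently and join the partial summaries."""
    return " ".join(summarizer(chunk) for chunk in chunk_words(text, max_words))


def first_five_words(chunk: str) -> str:
    """Stand-in summarizer for the demo: keeps only the first five words of a chunk."""
    return " ".join(chunk.split()[:5])


demo_summary = summarize_long_text("word " * 12, first_five_words, max_words=6)
print(demo_summary)
```

With Hugging Face Transformers, the callable could wrap a summarization pipeline, for example `hf = pipeline("summarization")` and then `lambda chunk: hf(chunk)[0]["summary_text"]`. Chunking by word count only approximates the model's token limit, so a conservative chunk size is advisable.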
- Modeling
  - Naive Bayes (baseline)
  - Logistic Regression
  - LSTM (Keras)
  - BERT (DistilBERT from Hugging Face) (optional/advanced)
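As a rough sketch of the two classical models, the snippet below fits a TF-IDF vectorizer, a Naive Bayes classifier, and a Logistic Regression classifier with scikit-learn. The toy summaries are invented for the example, and the hyperparameters are library defaults (plus `max_iter=1000`), not the settings used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Toy data invented for illustration only; not taken from the project dataset.
summaries = [
    "a soldier fights terrorists to save the city",
    "an agent chases a bomber through the streets",
    "two friends start a silly business and chaos follows",
    "a clumsy waiter ruins a wedding with hilarious results",
    "a ghost haunts a family in an old house",
    "a cursed doll terrorizes a small town at night",
    "two strangers fall in love on a train",
    "a widow finds love again in a seaside village",
]
genres = ["Action", "Action", "Comedy", "Comedy", "Horror", "Horror", "Romance", "Romance"]

# Encode string labels as integers, mirroring the saved label encoder.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(genres)

# Turn summaries into TF-IDF feature vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(summaries)

models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for model in models.values():
    model.fit(X, y)

# Predict a genre for an unseen summary with each model.
new_X = vectorizer.transform(["a ghost haunts the old house"])
predictions = {
    name: label_encoder.inverse_transform(model.predict(new_X))[0]
    for name, model in models.items()
}
print(predictions)
```

Reusing the same fitted vectorizer and label encoder at prediction time is what makes the saved `.pkl` artifacts usable from a separate app.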
- Evaluation
  - Accuracy by genre and overall
  - Plots for LSTM model accuracy vs. epochs
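Per-genre and overall accuracy can be computed along these lines. The `accuracy_by_genre` helper and the sample labels are hypothetical. "Accuracy by genre" is interpreted here as the fraction of each true genre's examples that were predicted correctly (per-class recall), which is an assumption about how the project defined it.

```python
from collections import defaultdict


def accuracy_by_genre(y_true, y_pred):
    """Return ({genre: fraction of that genre predicted correctly}, overall accuracy)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for true_label, predicted_label in zip(y_true, y_pred):
        total[true_label] += 1
        if true_label == predicted_label:
            correct[true_label] += 1
    per_genre = {genre: correct[genre] / total[genre] for genre in total}
    overall = sum(correct.values()) / len(y_true)
    return per_genre, overall


# Made-up labels for demonstration.
y_true = ["Action", "Action", "Comedy", "Horror", "Romance", "Romance"]
y_pred = ["Action", "Comedy", "Comedy", "Horror", "Romance", "Comedy"]
per_genre, overall = accuracy_by_genre(y_true, y_pred)
print(per_genre, overall)
```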
- Explainability
  - Extracted top N words per genre
  - Created word clouds per model/genre
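A simple way to get top N words per genre is to count token frequencies within each genre, as sketched below. This is a swapped-in, frequency-based approach for illustration; the project may instead have ranked words by model weights or TF-IDF scores. The helper name and sample tokens are hypothetical.

```python
from collections import Counter, defaultdict


def top_words_per_genre(token_lists, genres, n=20):
    """Count tokens per genre and return the n most frequent (word, count) pairs for each."""
    counts = defaultdict(Counter)
    for tokens, genre in zip(token_lists, genres):
        counts[genre].update(tokens)
    return {genre: counter.most_common(n) for genre, counter in counts.items()}


# Made-up, already-cleaned token lists for demonstration.
token_lists = [
    ["ghost", "house", "ghost"],
    ["love", "wedding", "love"],
    ["ghost", "curse"],
]
sample_genres = ["Horror", "Romance", "Horror"]
top_words = top_words_per_genre(token_lists, sample_genres, n=1)
print(top_words)
```

The resulting word and count pairs are the kind of data that could feed both the per-genre CSVs and a word cloud.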
- Error Analysis
  - Highlighted cases where models disagreed
  - Interpreted model decisions
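Finding the cases where models disagree can be as simple as comparing their predictions row by row. The sketch below assumes each model's predictions are stored as a list in the same example order; the `find_disagreements` helper and the sample predictions are hypothetical.

```python
def find_disagreements(predictions):
    """Return (index, {model: label}) for every example where the models do not all agree.

    predictions maps a model name to its list of predicted labels, all in the same order.
    """
    names = list(predictions)
    n_examples = len(predictions[names[0]])
    rows = []
    for i in range(n_examples):
        labels = {name: predictions[name][i] for name in names}
        if len(set(labels.values())) > 1:
            rows.append((i, labels))
    return rows


# Made-up predictions for demonstration.
sample_predictions = {
    "naive_bayes": ["Action", "Comedy", "Horror"],
    "logistic_regression": ["Action", "Romance", "Horror"],
    "lstm": ["Action", "Comedy", "Comedy"],
}
disagreements = find_disagreements(sample_predictions)
for index, labels in disagreements:
    print(index, labels)
```

The returned indices can be used to pull up the corresponding summaries for manual inspection.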
- Interactive Tool
  - Built with Streamlit to:
    - Input summaries
    - Clean text
    - Predict genre using all models
- Bonus (Visual Classification)
  - Paired posters with genres
  - Optional CNN image classification (in progress)
To launch the interactive tool, run the following from the project directory:
streamlit run genre_app.py