abhayma1000/DS3010FinalProject

Yelp Dataset Analysis DS3010

Nick Tomasetti, Sam Nguyen, and Abhay Mathur

Reach out if you have any questions

Overview

This is our analysis of the Kaggle Yelp Dataset. We set out to solve two problems:

  1. Use review text to classify a business's categories (e.g., Seafood, Nail Salon)
  2. Predict whether a business will close, using review data and the business's metadata

We used WPI's ARC Turing Cluster and the full dataset to accomplish this.

Read our report: Report

Look at our slides: Slides

Code

Files for part 1:

  • task1_bert_preprocessing.py
    • Preprocess the raw Yelp review data into input and expected-output tensors
  • task1_bert_training.py
    • Fine-tune BERT model using the data
  • task1_bert_analysis.py
    • Evaluate the fine-tuned BERT model and report metrics
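
As an illustration of the "expected value" side of the part 1 preprocessing, here is a minimal sketch of multi-hot label encoding. It is not taken from task1_bert_preprocessing.py. It assumes each business's categories arrive as a single comma-separated string (as in the Yelp business file), and the helper names split_categories, build_vocab, and multi_hot are hypothetical.

```python
from collections import Counter


def split_categories(category_string):
    """Split a comma-separated category string into clean labels."""
    if not category_string:  # handles None and empty strings
        return []
    return [part.strip() for part in category_string.split(",") if part.strip()]


def build_vocab(category_strings, min_count=1):
    """Map each category seen at least min_count times to a column index."""
    counts = Counter()
    for s in category_strings:
        counts.update(split_categories(s))
    kept = sorted(c for c, n in counts.items() if n >= min_count)
    return {category: index for index, category in enumerate(kept)}


def multi_hot(category_string, vocab):
    """Encode one business's categories as a 0/1 vector over the vocabulary."""
    vector = [0.0] * len(vocab)
    for category in split_categories(category_string):
        index = vocab.get(category)
        if index is not None:  # categories outside the vocabulary are ignored
            vector[index] = 1.0
    return vector


# Hypothetical sample data, for illustration only
samples = ["Seafood, Restaurants", "Nail Salons, Beauty & Spas", None]
vocab = build_vocab(samples)
labels = [multi_hot(s, vocab) for s in samples]
```

The resulting lists can be turned into a float tensor and paired with tokenized review text for a multi-label classifier.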

Files for part 2:

  • part2preprocessing.ipynb
    • Preprocess the raw Yelp review and business data into input and expected-output tensors
  • part2model2.ipynb
    • Train the hybrid DNN/BERT model
    • Identify which factors drove the model's business-closure predictions
    • Evaluate the model
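
To make the part 2 inputs concrete, here is a minimal sketch of pairing each business's metadata with its review text and a closure label. It is not taken from the notebooks. It assumes the Yelp field names business_id, stars, review_count, is_open, and text, and it assumes the closure label is derived as 1 - is_open. The function name build_closure_rows is hypothetical.

```python
from collections import defaultdict


def build_closure_rows(businesses, reviews):
    """Join reviews to businesses and attach a closed/open label."""
    texts_by_business = defaultdict(list)
    for review in reviews:
        texts_by_business[review["business_id"]].append(review["text"])

    rows = []
    for business in businesses:
        business_id = business["business_id"]
        if business_id not in texts_by_business:
            continue  # skip businesses with no review text
        rows.append({
            "business_id": business_id,
            # Numeric metadata that a dense network could consume
            "features": [float(business["stars"]), float(business["review_count"])],
            # Raw review text that a BERT encoder could consume
            "texts": texts_by_business[business_id],
            # Assumed label: 1 means the business has closed
            "closed": 1 - int(business["is_open"]),
        })
    return rows


# Hypothetical sample records, for illustration only
businesses = [
    {"business_id": "a", "stars": 4.5, "review_count": 12, "is_open": 1},
    {"business_id": "b", "stars": 2.0, "review_count": 3, "is_open": 0},
    {"business_id": "c", "stars": 3.0, "review_count": 0, "is_open": 1},
]
reviews = [
    {"business_id": "a", "text": "Great oysters."},
    {"business_id": "b", "text": "Slow service."},
    {"business_id": "a", "text": "Fresh and quick."},
]
rows = build_closure_rows(businesses, reviews)
```

Keeping the numeric features and the review text in separate fields mirrors a hybrid design in which each part feeds a different branch of the model.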

Note: This repository is not set up to be run as-is; specific setup is required.

Technologies

Used technologies:

  • PyTorch
  • BERT/transformers
  • SHAP
  • NumPy
  • pandas
  • NLTK
  • spaCy
