Nowadays, a massive number of reviews is available online. Besides the written text, each review comes with a rating from 1 to 5 stars. The goal of this project is to predict the star rating associated with user reviews from Amazon Movie Reviews using the available features:
- ProductId - unique identifier for the product
- UserId - unique identifier for the user
- HelpfulnessNumerator - number of users who found the review helpful
- HelpfulnessDenominator - number of users who indicated whether they found the review helpful
- Score - rating between 1 and 5
- Time - timestamp for the review
- Summary - brief summary of the review
- Text - text of the review
- Id - a unique identifier associated with a review
Before applying models, I used common NLP techniques to clean the data and turn the reviews (natural language) into numeric matrices, including vectorizing the words and reducing each English word to its stem. I then applied four models to the dataset: KNN, Naive Bayes, SVM, and Logistic Regression, and adjusted and evaluated them based on their performance.
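The preprocessing idea described above (stem each word, then count terms into a numeric matrix) can be sketched as follows. This is a minimal, dependency-free illustration and not the code from the notebooks: `crude_stem`, `tokenize`, and `vectorize` are hypothetical helpers, and the suffix-stripping rule is a rough stand-in for a proper stemmer from an NLP library.

```python
import re
from collections import Counter

# Assumption: a tiny suffix list, used only to illustrate stemming.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def crude_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase the text, split it into words, and stem each word."""
    return [crude_stem(w) for w in re.findall(r"[a-z']+", text.lower())]

def vectorize(texts):
    """Return (vocab, matrix), where matrix[i][j] counts term j in text i."""
    counts = [Counter(tokenize(t)) for t in texts]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

reviews = ["Loved the acting, loved the ending", "Boring movies bored me"]
vocab, matrix = vectorize(reviews)
print(vocab)
print(matrix)
```

Each row of `matrix` corresponds to one review and each column to one stemmed term, which is the numeric form a classifier can consume.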
- predict-constant.py - predicts the same score for all rows in the test set
- predict-knn.py - predicts the score using KNN
- Naive Bayes.ipynb - predicts the score using Naive Bayes
- SVM.ipynb and SVM_COMBINE.ipynb - predict the score using an SVM model on either Text alone or the combination of Text and other features such as Summary
- Logistic-Regression.ipynb - predicts the score using Logistic Regression
In general, Naive Bayes is a good starting point for this prediction task, since it produces relatively accurate predictions in a short time. Logistic Regression performs best once the number of iterations is increased, but it takes a long time to produce results.
A more detailed comparison is in the Report_liyy.pdf file.
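One reason Naive Bayes is fast is that training is a single counting pass over the data, whereas Logistic Regression has to repeat an optimization step many times. The sketch below is a minimal multinomial Naive Bayes with Laplace smoothing, written in plain Python for illustration only. The function names `train_nb` and `predict_nb`, the smoothing value `alpha`, and the toy data are assumptions, not taken from the notebooks.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Single pass: count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(model, tokens, alpha=1.0):
    """Return the class with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in tokens:
            if w in vocab:  # skip words never seen during training
                numerator = word_counts[label][w] + alpha
                denominator = total_words + alpha * len(vocab)
                score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training data: tokenized reviews with star ratings.
docs = [["great", "film"], ["great", "acting"],
        ["awful", "plot"], ["awful", "boring", "film"]]
labels = [5, 5, 1, 1]
model = train_nb(docs, labels)
print(predict_nb(model, ["great", "plot", "great"]))
print(predict_nb(model, ["awful"]))
```

Because the model is just a set of counts, training cost grows linearly with the number of tokens, which is consistent with the short running time noted above.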
https://drive.google.com/drive/folders/1TGj8-EogM9MmPB8cMfFvmSsFqAeaJEae?usp=sharing

J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013.