A product manager wants to evaluate a NEW product in an EXISTING category/brand. Using historical performance data of similar products (same brand tier, category), predict how this new product will perform.
Example Use Case:
- Input: price=$25, category="Electronics/Accessories", brand="GenericBrand", description="Durable silicone case with kickstand"
- Output: "⚠️ 2.8/5 star rating expected - HIGH RISK product"
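
To make the example output concrete, here is a minimal sketch of how a predicted rating might be turned into that message. The function name `format_prediction` and the cutoff `high_risk_below=3.0` are illustrative assumptions, not values taken from this project; only the HIGH RISK message format comes from the example above.

```python
def format_prediction(predicted_rating, high_risk_below=3.0):
    """Turn a predicted star rating into a short report line.

    `high_risk_below` is a hypothetical cutoff chosen for illustration;
    the project may define risk differently.
    """
    rating = f"{predicted_rating:.1f}"
    if predicted_rating < high_risk_below:
        return f"⚠️ {rating}/5 star rating expected - HIGH RISK product"
    # The wording for non-risky products is also an assumption.
    return f"{rating}/5 star rating expected"


# Example input mirroring the use case above; the rating would come from a trained model.
new_product = {
    "price": 25.0,
    "category": "Electronics/Accessories",
    "brand": "GenericBrand",
    "description": "Durable silicone case with kickstand",
}
print(format_prediction(2.8))
```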
This project uses the Amazon Reviews 2023 dataset from McAuley Lab:
- Source: https://amazon-reviews-2023.github.io/
- Category: Electronics (43.9M reviews, 18.3M users, 1.6M items)
- Sample Size: 30,000 merged records for this project
Download these files from the website:
- Electronics.jsonl: Review data (ratings, text, user info)
- meta_Electronics.jsonl: Product metadata (title, price, features, etc.)
Run data_extraction.ipynb to merge the two files and save the result in CSV format with the following columns:
Columns from the product metadata:
- main_category: Product category
- product_title: Product name
- average_rating: Overall product rating
- rating_number: Number of ratings
- price: Product price in USD
- description: Product description (list format)
- parent_asin: Unique product ID
- details: Product details (contains brand, size, etc.)

Columns from the review data:
- rating: Individual review rating (target variable)
- review_title: Review title
- text: Review content
- helpful_vote: Review helpfulness votes
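
The extraction step can be sketched as follows, using only the standard library. This is not the contents of data_extraction.ipynb: it assumes the raw files use `title` for both the product name and the review title (renamed here to `product_title` and `review_title`), that records are joined on `parent_asin`, and that the sample is simply the first 30,000 matching reviews. The output file name `merged_sample.csv` is hypothetical.

```python
import csv
import json

# Raw metadata field -> output column name (the rename of "title" is an assumption).
META_FIELDS = {
    "main_category": "main_category",
    "title": "product_title",
    "average_rating": "average_rating",
    "rating_number": "rating_number",
    "price": "price",
    "description": "description",
    "parent_asin": "parent_asin",
    "details": "details",
}
# Raw review field -> output column name.
REVIEW_FIELDS = {
    "rating": "rating",
    "title": "review_title",
    "text": "text",
    "helpful_vote": "helpful_vote",
}
COLUMNS = list(META_FIELDS.values()) + list(REVIEW_FIELDS.values())


def load_metadata(meta_path):
    """Index product metadata by parent_asin, keeping only the needed fields."""
    products = {}
    with open(meta_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            item = json.loads(line)
            products[item["parent_asin"]] = {
                out: item.get(raw) for raw, out in META_FIELDS.items()
            }
    return products


def merge_reviews(review_path, products, limit=30_000):
    """Yield up to `limit` reviews joined with their product metadata."""
    emitted = 0
    with open(review_path, encoding="utf-8") as fh:
        for line in fh:
            if emitted >= limit:
                break
            if not line.strip():
                continue
            review = json.loads(line)
            product = products.get(review.get("parent_asin"))
            if product is None:
                continue  # skip reviews that have no matching metadata record
            row = dict(product)
            row.update({out: review.get(raw) for raw, out in REVIEW_FIELDS.items()})
            yield row
            emitted += 1


def write_csv(rows, out_path):
    """Write merged rows to CSV; list and dict values are stored as their string form."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


# Usage (file names as downloaded; the output name is hypothetical):
#   products = load_metadata("meta_Electronics.jsonl")
#   write_csv(merge_reviews("Electronics.jsonl", products), "merged_sample.csv")
```

Streaming the review file line by line avoids loading all 43.9M reviews into memory. Taking the first matching rows is not a random sample, so a different sampling strategy may be preferable if the file is ordered in some meaningful way.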
🛠️ Project Pipeline
- data_cleaning.ipynb
  - Handle missing values and outliers (e.g., drop products with missing price, filter unrealistic values).
  - Normalize text fields (lowercasing, removing special characters, etc.).
  - Save the cleaned and merged dataset to a CSV file for downstream use.
  - Output: cleaned_data.csv
- feature_engineering.ipynb
  This notebook transforms the cleaned dataset into machine-learning-ready features.
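
The cleaning rules listed for data_cleaning.ipynb could be sketched roughly as below. This is an illustration, not the notebook's code: the price bounds (`min_price`, `max_price`), the set of text columns, and the exact normalization regex are assumptions made for the example.

```python
import re

# Columns treated as free text (an assumption based on the column list).
TEXT_FIELDS = ("product_title", "description", "review_title", "text")


def normalize_text(value):
    """Lowercase, replace special characters with spaces, collapse whitespace."""
    text = str(value or "").lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def parse_price(value):
    """Return the price as a float, or None when it is missing or unparseable."""
    if value is None:
        return None
    try:
        return float(str(value).replace("$", "").replace(",", "").strip())
    except ValueError:
        return None


def clean_rows(rows, min_price=0.01, max_price=10_000.0):
    """Drop rows with missing or unrealistic prices and normalize text fields.

    The bounds are hypothetical; pick values that suit the category.
    """
    cleaned = []
    for row in rows:
        price = parse_price(row.get("price"))
        # NaN fails both comparisons, so it is dropped here as well.
        if price is None or not (min_price <= price <= max_price):
            continue
        out = dict(row)
        out["price"] = price
        for field in TEXT_FIELDS:
            out[field] = normalize_text(row.get(field))
        cleaned.append(out)
    return cleaned
```

The cleaned rows can then be written to cleaned_data.csv with `csv.DictWriter`.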