Your Go-To Library for Discovering Great Books
An intelligent book recommendation system that learns your taste and surrounds you with books you'll love.
BookRec is an optimized recommendation engine designed to be your personalized library companion. The system allows you to:
- 🔍 Search for books by title, author, or genre
- ❤️ Add favorites and build your reading list
- 🤖 Get personalized recommendations based on your taste
- 🎨 Discover books similar to ones you've enjoyed
- 📊 Explore a curated library tailored to your preferences
Unlike generic recommendation systems, BookRec creates a personalized reading environment where every suggestion aligns with your unique literary taste, making book discovery an enjoyable journey rather than an overwhelming search.
book-recommendation-system/
├── notebooks/ # Jupyter notebooks for ML pipeline
│ ├── 01_data_eda.ipynb # Exploratory Data Analysis
│ ├── 02_preprocessing.ipynb # Data cleaning & feature engineering
│ ├── 03_base_model.ipynb # Baseline collaborative filtering
│ ├── 04_model_optimization.ipynb # Hyperparameter tuning & optimization
│ └── 04_model_optimization_for_colab.ipynb # GPU-accelerated training
│
├── data/ # Dataset files
│ ├── books.csv # Book metadata (title, author, year)
│ ├── ratings.csv # User-book ratings
│ ├── tags.csv # User-generated tags
│ ├── processed_books.csv # Cleaned book data
│ ├── processed_ratings.csv # Filtered ratings
│ └── *.pkl # Pre-computed matrices (TF-IDF, cosine similarity)
│
├── models/ # Trained ML models
│ ├── base_svd.pkl # Baseline SVD model
│ ├── tuned_svd.pkl # Hyperparameter-tuned SVD
│ ├── base_mf.keras # TensorFlow Matrix Factorization
│ └── quantized_mf.tflite # Optimized deployment model (91% smaller!)
│
├── src/ # Python source code
│ ├── models/ # Model training scripts
│ └── utils/ # Helper functions
│
├── backend/ # API server (🚧 In Development)
├── frontend/ # Web UI (🚧 In Development)
├── requirements.txt # Python dependencies
└── README.md # You are here!
The system uses the Goodbooks-10k dataset:
- Books: 10,000 popular books with rich metadata
- Ratings: 6M+ ratings from 53,000+ users
- Tags: User-generated tags for content-based filtering
- Rating Scale: 1-5 stars
Key Statistics:
- Average ratings per book: ~600
- Average ratings per user: ~113
- Data sparsity: ~98.9% (about 6M ratings out of 53,000 × 10,000 possible user-book pairs; typical for recommendation systems)
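The statistics above can be cross-checked from the raw counts. The sketch below uses the rounded counts listed in this section (not the exact dataset totals), so treat the output as approximate; the exact sparsity depends on the precise counts and on any filtering applied before measuring. The function name `sparsity` is illustrative and not part of the project code.

```python
# Rounded counts taken from the dataset summary in this README.
n_books = 10_000
n_users = 53_000
n_ratings = 6_000_000


def sparsity(n_ratings: int, n_users: int, n_items: int) -> float:
    """Fraction of all possible user-item pairs that have no rating."""
    return 1.0 - n_ratings / (n_users * n_items)


ratings_per_book = n_ratings / n_books
ratings_per_user = n_ratings / n_users

print(f"ratings per book: {ratings_per_book:.0f}")
print(f"ratings per user: {ratings_per_user:.0f}")
print(f"sparsity: {sparsity(n_ratings, n_users, n_books):.1%}")
```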
1️⃣ Exploratory Data Analysis (01_data_eda.ipynb)
What it does:
- Analyzes rating distribution and user behavior
- Identifies popular books and active users
- Visualizes data sparsity and patterns
- Detects data quality issues
Key Findings:
- Most ratings are 4-5 stars (positive bias)
- Power users contribute disproportionately to ratings
- Long-tail distribution: few books are extremely popular
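A minimal sketch of how findings like these can be measured, assuming ratings are available as `(user_id, book_id, rating)` tuples. The toy rows and the helper names `rating_distribution` and `top_share` are hypothetical; the notebook itself may use pandas and different code.

```python
from collections import Counter

# Hypothetical toy ratings as (user_id, book_id, rating); not real data.
ratings = [
    (1, 10, 5), (1, 11, 4), (1, 12, 4), (1, 13, 3),
    (2, 10, 5), (2, 11, 2),
    (3, 10, 4),
    (4, 10, 5),
]


def rating_distribution(rows):
    """Share of ratings at each star value."""
    counts = Counter(r for _, _, r in rows)
    total = sum(counts.values())
    return {star: counts[star] / total for star in sorted(counts)}


def top_share(rows, key_index, top_n):
    """Share of all ratings that involve the top_n most frequent users or books.

    key_index=0 groups by user, key_index=1 groups by book.
    """
    counts = Counter(row[key_index] for row in rows)
    top = sum(count for _, count in counts.most_common(top_n))
    return top / len(rows)


dist = rating_distribution(ratings)
positive_share = dist.get(4, 0.0) + dist.get(5, 0.0)

print(f"share of 4-5 star ratings: {positive_share:.2f}")
print(f"share from the most active user: {top_share(ratings, 0, 1):.2f}")
print(f"share on the most rated book: {top_share(ratings, 1, 1):.2f}")
```

A high share of 4-5 star ratings indicates positive bias, while a large `top_share` for a small `top_n` indicates power users or a long-tail book distribution.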
2️⃣ Data Preprocessing (02_preprocessing.ipynb)
What it does:
- Cleans missing values and duplicates
- Filters low-activity users and obscure books
- Creates TF-IDF vectors from book tags
- Computes cosine similarity matrix for content-based filtering
- Generates processed datasets for modeling
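The de-duplication and activity-filtering steps above could look roughly like this. It is a sketch under assumptions: the thresholds (`min_user_ratings`, `min_book_ratings`) and function names are hypothetical, and the notebook's actual cutoffs are not stated in this README.

```python
from collections import Counter


def drop_duplicates(rows):
    """Remove exact duplicate rows while keeping the original order."""
    return list(dict.fromkeys(rows))


def filter_ratings(rows, min_user_ratings, min_book_ratings):
    """Keep rows whose user and book both meet the activity thresholds.

    Counts are computed once on the input, so this is a single pass,
    not an iterative k-core filter.
    """
    user_counts = Counter(u for u, _, _ in rows)
    book_counts = Counter(b for _, b, _ in rows)
    return [
        (u, b, r)
        for u, b, r in rows
        if user_counts[u] >= min_user_ratings
        and book_counts[b] >= min_book_ratings
    ]


# Hypothetical toy data: (user_id, book_id, rating), with one duplicate row.
raw = [(1, 10, 5), (1, 11, 4), (2, 10, 3), (2, 11, 4), (3, 12, 2), (1, 10, 5)]

clean = drop_duplicates(raw)
filtered = filter_ratings(clean, min_user_ratings=2, min_book_ratings=2)
print(f"{len(raw)} raw -> {len(clean)} deduplicated -> {len(filtered)} filtered")
```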
Outputs:
- processed_books.csv: Clean book metadata
- processed_ratings.csv: Filtered user-item interactions
- tfidf_matrix.pkl: Term frequency vectors
- cosine_sim.pkl: Pre-computed book similarities
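To illustrate the TF-IDF and cosine similarity steps, here is a dependency-free sketch. It assumes each book is represented by a list of tags and uses the plain formula `idf = ln(N / df) + 1`; a library vectorizer such as Scikit-learn's may apply smoothing and normalization differently, so the numbers will not match the notebook's matrices. The book names and tags are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tag lists per book; real tags come from tags.csv.
book_tags = {
    "book_a": ["sci-fi", "classic", "space"],
    "book_b": ["sci-fi", "classic", "space", "empire"],
    "book_c": ["romance", "classic"],
}


def tfidf_vectors(docs):
    """Return {doc: {term: tf * idf}} with idf = ln(N / df) + 1."""
    n_docs = len(docs)
    # Document frequency: in how many books each tag appears.
    df = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for name, terms in docs.items():
        tf = Counter(terms)
        vectors[name] = {
            term: (count / len(terms)) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in tf.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = tfidf_vectors(book_tags)
sim_ab = cosine(vectors["book_a"], vectors["book_b"])
sim_ac = cosine(vectors["book_a"], vectors["book_c"])
print(f"book_a vs book_b: {sim_ab:.3f}")
print(f"book_a vs book_c: {sim_ac:.3f}")
```

Books that share several distinctive tags score higher than books that share only a common tag, which is what makes the pre-computed similarity matrix useful for "similar books" lookups.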
3️⃣ Base Model Development (03_base_model.ipynb)
What it does:
- Implements Collaborative Filtering using SVD (Singular Value Decomposition)
- Builds Content-Based Filtering using TF-IDF + cosine similarity
- Creates Hybrid Recommendation System combining both approaches
- Evaluates baseline performance
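For intuition about the SVD-style collaborative filtering step, the sketch below trains a small biased matrix factorization with stochastic gradient descent in plain Python. It models a rating as `mu + b_u + b_i + p_u · q_i`, which is the general form used by SVD-style recommenders; whether the notebook uses exactly this setup is an assumption, and the hyperparameters, toy data, and function names here are hypothetical.

```python
import math
import random


def train_mf(ratings, n_factors=2, lr=0.01, reg=0.02, epochs=200, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i with SGD; return a predict function."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}                        # user biases
    bi = {i: 0.0 for i in items}                        # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict


def rmse_on(ratings, predict):
    """RMSE of a predict(u, i) function over the given ratings."""
    sq = sum((r - predict(u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(sq / len(ratings))


# Hypothetical toy ratings: (user_id, book_id, rating).
toy = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 2), (2, 0, 5), (2, 2, 1), (1, 2, 1)]

global_mean = sum(r for _, _, r in toy) / len(toy)
baseline_rmse = rmse_on(toy, lambda u, i: global_mean)
model = train_mf(toy)
trained_rmse = rmse_on(toy, model)
print(f"global-mean RMSE: {baseline_rmse:.3f}, trained RMSE: {trained_rmse:.3f}")
```

The RMSE here is measured on the training rows only, so it shows that the model fits, not that it generalizes; a held-out split is needed for a fair comparison.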
Models:
- SVD: Factorizes user-item matrix into latent factors
- Content-Based: Recommends books with similar tags/genres
- Hybrid: Weighted combination for robust recommendations
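One way to picture the weighted combination is shown below. This is a sketch, not the notebook's `hybrid_recommend()` function: the name `hybrid_score`, the weight `alpha`, and the candidate values are all hypothetical. It rescales the predicted rating from the 1-5 star scale to [0, 1] so it can be blended with a cosine similarity.

```python
def hybrid_score(cf_rating, content_sim, alpha=0.7, min_rating=1.0, max_rating=5.0):
    """Blend a collaborative-filtering rating with a content similarity.

    alpha is the weight on the collaborative part; (1 - alpha) goes to content.
    """
    cf_norm = (cf_rating - min_rating) / (max_rating - min_rating)
    cf_norm = min(1.0, max(0.0, cf_norm))  # clamp out-of-range predictions
    return alpha * cf_norm + (1.0 - alpha) * content_sim


# Hypothetical candidates with a predicted rating and a similarity to a liked book.
candidates = {
    "book_a": {"cf_rating": 4.6, "content_sim": 0.20},
    "book_b": {"cf_rating": 3.8, "content_sim": 0.90},
    "book_c": {"cf_rating": 2.5, "content_sim": 0.40},
}

ranked = sorted(
    candidates,
    key=lambda book: hybrid_score(**candidates[book]),
    reverse=True,
)
for book in ranked:
    print(book, round(hybrid_score(**candidates[book]), 3))
```

With `alpha=0.7`, `book_b` ranks first even though `book_a` has the higher predicted rating, because its content similarity is much stronger. Setting `alpha=1.0` reduces the blend to pure collaborative filtering.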
Baseline Performance:
- RMSE: 1.37 (Root Mean Squared Error)
- Model Size: 87.8 MB
- Inference Time: 3.0s (for 368K predictions)
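For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. A minimal sketch with hypothetical values (not taken from the project's evaluation):

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings on the 1-5 star scale.
actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]
print(f"RMSE: {rmse(actual, predicted):.3f}")
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.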
4️⃣ Model Optimization (04_model_optimization.ipynb)
What it does:
- Hyperparameter Tuning with Optuna (20 trials)
- Converts SVD to TensorFlow Matrix Factorization for deployment
- Applies Quantization (float32 → int8) for model compression
Optimization Techniques:
- Optuna-based tuning: Automated search for optimal hyperparameters
- Model conversion: SVD → TensorFlow for production scalability
- Post-training quantization: Reduces model size with minimal accuracy loss
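To show why float32 → int8 quantization shrinks a model, here is a simplified sketch of affine quantization: each weight is mapped to an 8-bit integer via a scale and zero point, then mapped back at inference time. This illustrates the idea only; TFLite's converter chooses ranges and granularity in its own way, and the function names and weights below are hypothetical.

```python
def quantize_int8(values):
    """Map floats to int8 using one scale and zero point for the whole list."""
    lo = min(min(values), 0.0)  # include 0.0 so it is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 if all values are 0
    zero_point = round(-128 - lo / scale)
    quantized = [
        max(-128, min(127, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from int8 codes."""
    return [(q - zero_point) * scale for q in quantized]


# Hypothetical embedding weights.
weights = [-0.52, -0.1, 0.0, 0.33, 0.75]

codes, scale, zero_point = quantize_int8(weights)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", codes)
print(f"scale={scale:.5f}, zero_point={zero_point}, max error={max_error:.5f}")
```

Each int8 code takes 1 byte instead of the 4 bytes of a float32, which is where a roughly 4x size reduction comes from, at the cost of a small rounding error per weight.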
Final Results:
| Model | RMSE | Size (MB) | Inference Time (s) | Size Reduction |
|---|---|---|---|---|
| Base SVD | 1.374 | 87.77 | 2.96 | - |
| Tuned SVD | 1.320 | 101.0 | N/A | -15.1% |
| Base MF (TensorFlow) | 1.418 | 30.33 | 20.66 | 65.4% |
| Quantized MF ⭐ | 1.417 | 7.59 | 3.61 | 91.4% |
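The Size Reduction column and the accuracy comparison can be reproduced from the table's own numbers. The sketch below copies the RMSE and size values from the table and computes the percentages relative to Base SVD; the helper names are illustrative.

```python
BASELINE = "Base SVD"

# Values copied from the results table.
models = {
    "Base SVD": {"rmse": 1.374, "size_mb": 87.77},
    "Tuned SVD": {"rmse": 1.320, "size_mb": 101.0},
    "Base MF (TensorFlow)": {"rmse": 1.418, "size_mb": 30.33},
    "Quantized MF": {"rmse": 1.417, "size_mb": 7.59},
}


def size_reduction_pct(size_mb, baseline_mb):
    """Positive when the model is smaller than the baseline."""
    return (1.0 - size_mb / baseline_mb) * 100.0


def rmse_improvement_pct(rmse, baseline_rmse):
    """Positive when the model has lower (better) RMSE than the baseline."""
    return (baseline_rmse - rmse) / baseline_rmse * 100.0


base = models[BASELINE]
for name, stats in models.items():
    size_change = size_reduction_pct(stats["size_mb"], base["size_mb"])
    rmse_change = rmse_improvement_pct(stats["rmse"], base["rmse"])
    print(f"{name}: size reduction {size_change:.1f}%, RMSE improvement {rmse_change:.1f}%")
```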
🎉 Key Achievements:
- ✅ Best accuracy: 1.32 RMSE (3.9% improvement over baseline)
- ✅ 91.4% smaller model: 88 MB → 7.6 MB
- ✅ Deployment-ready: TFLite format works on edge devices
- ✅ Minimal accuracy loss: Quantization left RMSE essentially unchanged (1.418 → 1.417); the quantized model trails Tuned SVD by about 0.1 RMSE
The Quantized MF model is recommended for production deployment due to its optimal balance of accuracy, size, and speed.
🚀 GPU-Accelerated Training (04_model_optimization_for_colab.ipynb)
What it does:
- Google Colab-compatible version with GPU support
- Skips expensive Optuna tuning (uses pre-computed hyperparameters)
- Faster training with T4/V100 GPUs
Why use this:
- Local training takes hours; Colab reduces it to ~20 minutes
- Free GPU access for model training
- Easy Google Drive integration for data/model storage
- Python 3.8+
- pip or conda
- (Optional) Google Colab account for GPU training
- Clone the repository
git clone https://github.com/yourusername/book-recommendation-system.git
cd book-recommendation-system
- Create virtual environment
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
- Install dependencies
pip install -r requirements.txt
- Download dataset
  - Place Goodbooks-10k dataset files in the data/ folder
  - Or download from: https://github.com/zygmuntz/goodbooks-10k
1. Data Exploration
jupyter notebook notebooks/01_data_eda.ipynb
Run all cells to understand the dataset.
2. Data Preprocessing
jupyter notebook notebooks/02_preprocessing.ipynb
Generates cleaned datasets in the data/ folder.
3. Train Base Model
jupyter notebook notebooks/03_base_model.ipynb
Creates base_svd.pkl in the models/ folder.
4. Optimize Model
Option A: Local (slower)
jupyter notebook notebooks/04_model_optimization.ipynb
Option B: Google Colab (faster) ⚡
- Upload 04_model_optimization_for_colab.ipynb to Colab
- Create folder: MyDrive/book-recommendation-system/
- Upload data/ and models/ folders to Drive
- Run notebook with GPU runtime
5. Results
- Check model performance in final comparison table
- Models saved in the models/ folder
import tensorflow as tf
import numpy as np
# Load quantized model
interpreter = tf.lite.Interpreter(model_path="models/quantized_mf.tflite")
interpreter.allocate_tensors()
# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Predict rating for user 123, book 456.
# Assumes input_details[0] is the user input and input_details[1] is the book input;
# the order is not guaranteed, so check input_details[i]['name'] for your model.
user_id = np.array([[123]], dtype=np.int32)
book_id = np.array([[456]], dtype=np.int32)
interpreter.set_tensor(input_details[0]['index'], user_id)
interpreter.set_tensor(input_details[1]['index'], book_id)
interpreter.invoke()
predicted_rating = interpreter.get_tensor(output_details[0]['index'])[0][0]
print(f"Predicted rating: {predicted_rating:.2f}")

import pickle
# Load SVD model
with open('models/tuned_svd.pkl', 'rb') as f:
svd_model = pickle.load(f)
# Load content similarity
with open('data/cosine_sim.pkl', 'rb') as f:
cosine_sim = pickle.load(f)
# Get recommendations (combine collaborative + content-based)
# See 03_base_model.ipynb for hybrid_recommend() function

- Data exploration and cleaning
- Collaborative filtering (SVD)
- Content-based filtering (TF-IDF)
- Hybrid recommendation system
- Hyperparameter optimization
- Model quantization and compression
- Backend API (FastAPI/Flask)
- REST endpoints for recommendations
- User authentication
- Model serving with TFLite
- Frontend Web App (React/Next.js)
- Book search and browsing
- User profiles and favorites
- Personalized recommendation dashboard
- Real-time model updates with new ratings
- A/B testing framework
- Cold-start problem handling (new users/books)
- Explainable recommendations
- Mobile app deployment
- Docker containerization
- CI/CD pipeline
Accuracy: Tuned SVD wins
- Winner: Tuned SVD (1.32 RMSE)
- Runner-up: Base SVD (1.37 RMSE); Quantized MF follows at 1.42 RMSE
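Speed: Base SVD is fastest among measured models; Quantized MF is close behind
- Winner: Base SVD (3.0s)
- Runner-up: Quantized MF (3.6s, about 5.7x faster than the unquantized Base MF at 20.7s)
Size: Quantized MF is about 11.6x smaller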
- Winner: Quantized MF (7.6 MB)
- Original baseline: 87.8 MB
Recommended for Production: Quantized MF
- Near-baseline speed (3.6s vs 3.0s) and accuracy (1.42 vs 1.37 RMSE) at a fraction of the size
- Tiny model size (mobile-friendly)
- Easy deployment with TFLite
Contributions are welcome! Areas of interest:
- Cold-start problem solutions
- Deep learning models (NCF, autoencoders)
- Frontend/backend development
- Performance optimizations
- Documentation improvements
This project is licensed under the MIT License.
- Dataset: Goodbooks-10k by Zygmunt Zając
- Libraries: Scikit-surprise, TensorFlow, Scikit-learn, Pandas
- Inspiration: Building a personalized reading experience for book lovers
For questions or collaboration:
- GitHub Issues: Open an issue
- Email: kckdeepak29@example.com
⭐ Star this repo if you find it useful!
Happy Reading! 📚