This project predicts the median value of owner-occupied homes in Boston suburbs. The dataset includes features such as crime rate, number of rooms, and property tax rate, which are used to train several machine learning models and compare their price predictions.
The dataset used in this project is stored in the file Real Estate Price Prediction.csv. It contains the following columns:
- CRIM: Per capita crime rate by town.
- ZN: Proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: Proportion of non-retail business acres per town.
- CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise).
- NOX: Nitrogen oxides concentration (parts per 10 million).
- RM: Average number of rooms per dwelling.
- AGE: Proportion of owner-occupied units built before 1940.
- DIS: Weighted distances to five Boston employment centers.
- RAD: Index of accessibility to radial highways.
- TAX: Full-value property tax rate per $10,000.
- PTRATIO: Pupil-teacher ratio by town.
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town.
- LSTAT: Percentage of the population of lower socioeconomic status.
- MEDV: Median value of owner-occupied homes in $1000s.
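As a minimal sketch of how the columns above might be loaded and split into features and target, assuming the CSV has a header row with exactly these column names (the helper `load_housing` and the two in-memory rows are hypothetical, with made-up values used only so the snippet runs without the real file):

```python
import io

import pandas as pd

# Column names as listed above; MEDV is the prediction target.
COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]


def load_housing(source):
    """Read the CSV and split it into features (X) and target (y).

    Hypothetical helper: `source` can be a file path such as
    "Real Estate Price Prediction.csv" or any file-like object.
    """
    df = pd.read_csv(source)
    missing = [c for c in COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    X = df.drop(columns="MEDV")
    y = df["MEDV"]
    return X, y


# Stand-in data with made-up values (not taken from the real dataset).
# The second row leaves RM empty to mimic a missing value.
sample_csv = io.StringIO(
    ",".join(COLUMNS) + "\n"
    "0.1,0.0,5.0,0,0.5,6.0,50.0,4.0,4,300,18.0,390.0,10.0,22.0\n"
    "0.2,0.0,7.0,0,0.5,,60.0,5.0,2,250,17.0,395.0,9.0,24.0\n"
)
X, y = load_housing(sample_csv)
print(X.shape, y.shape)
```

With the real file, the call would be `load_housing("Real Estate Price Prediction.csv")`, provided the header matches the assumption above.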
The project consists of the following key components:
- Model Usage.ipynb: Loads a pre-trained model (Dragon.joblib) and demonstrates how to make predictions using a sample feature array.
- Dragon Real Estates.ipynb: A comprehensive notebook that covers data loading, exploration, preprocessing, model training, evaluation, and saving/loading models.
The performance of each model is evaluated and stored in Real Estate Models Outputs.txt. The values below are the mean and standard deviation of each model's cross-validation RMSE scores (lower is better):
- Decision Tree:
  - Mean: 4.1895
  - Standard Deviation: 0.8481
- Linear Regression:
  - Mean: 4.2219
  - Standard Deviation: 0.7520
- Random Forest Regression:
  - Mean: 3.4947
  - Standard Deviation: 0.7620
The data preprocessing pipeline handles missing values and scales features. Key components include:
- Imputation: Using SimpleImputer to fill missing values.
- Scaling: Using StandardScaler to standardize features (zero mean, unit variance).
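The two preprocessing steps above can be chained in a scikit-learn `Pipeline`. The following is a minimal sketch, not the notebook's exact code: the `median` imputation strategy is an assumption (the strategy used in the project is not stated here), and the small feature array is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute first, then scale, so the scaler never sees NaN values.
preprocess = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # assumed strategy
    ("scaler", StandardScaler()),
])

# Made-up two-column feature array with one missing entry.
features = np.array([
    [6.0, 300.0],
    [np.nan, 250.0],
    [7.0, 350.0],
])

# fit_transform learns the medians, means, and variances, then applies them.
prepared = preprocess.fit_transform(features)
print(prepared)
```

Fitting the pipeline on the training split only and calling `preprocess.transform` on test data keeps test-set statistics from leaking into the imputation and scaling parameters.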
The following models are trained and evaluated:
- Linear Regression
- Decision Tree Regression
- Random Forest Regression

Evaluation metrics used:
- Root Mean Squared Error (RMSE)
- Cross-Validation Scores
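One way to compare the three model types with cross-validated RMSE is sketched below. It uses synthetic data from `make_regression` in place of the housing CSV, and the fold count (`cv=10`), `n_estimators=50`, and `random_state` values are assumptions for illustration, so the numbers it prints will not match those reported for this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 13 housing features and the MEDV target.
X, y = make_regression(n_samples=200, n_features=13, noise=10.0,
                       random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so MSE is returned negated.
    neg_mse = cross_val_score(model, X, y,
                              scoring="neg_mean_squared_error", cv=10)
    rmse = np.sqrt(-neg_mse)  # one RMSE value per fold
    results[name] = (rmse.mean(), rmse.std())
    print(f"{name}: mean={rmse.mean():.4f}, std={rmse.std():.4f}")
```

Reporting both the mean and the standard deviation across folds shows how stable each model is, not just how accurate it is on average.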
Models are saved using joblib and can be loaded for making predictions. An example of how to save and load models is included.
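A minimal sketch of the save/load round trip with joblib follows. The model, the random data, and the file name `example_model.joblib` are hypothetical stand-ins so the snippet runs without the repository's files.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Made-up training data with 13 features, matching the dataset's feature count.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 13))
y = X @ rng.normal(size=13)

model = LinearRegression().fit(X, y)

# Save to a temporary file (hypothetical name), then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_model.joblib")
    dump(model, path)
    restored = load(path)

# predict expects a 2D array: one row per sample, one column per feature.
sample = X[:1]
original_pred = model.predict(sample)
restored_pred = restored.predict(sample)
print(original_pred, restored_pred)
```

The pre-trained model in this repository would be loaded the same way with `load("Dragon.joblib")`. Any feature array passed to it presumably needs the same preprocessing that was applied during training.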
To use the notebooks and run the project:
- Clone the repository.
- Ensure all necessary libraries are installed (e.g., pandas, scikit-learn, joblib, matplotlib, numpy).
- Open the Jupyter notebooks and execute the cells to see the full workflow from data loading to model evaluation.
This project demonstrates the complete workflow for building and evaluating machine learning models to predict real estate prices. It includes data preprocessing, feature engineering, model training, evaluation, and deployment steps. The Random Forest Regression model performed best, with the lowest mean score of the three models (3.4947, versus 4.1895 for the decision tree and 4.2219 for linear regression).