This project predicts student placement outcomes based on academic and profile data using a trained Random Forest Classifier. It uses a complete machine learning pipeline with preprocessing, label encoding, and model persistence using joblib.
- placement.csv # Original dataset
- input.csv # Automatically generated test data (from StratifiedShuffleSplit)
- output.csv # Output file with predictions
- model.pkl # Trained Random Forest model
- preprocessing_pipeline.pkl # Preprocessing pipeline (numerical + categorical)
- label_encoder.pkl # Encoder to convert labels to/from numeric
- Loads `placement.csv` as the dataset (the CSV must be downloaded from Kaggle).
- Performs a StratifiedShuffleSplit so that the class distribution is preserved.
- Splits the data into training and test sets.
- Stores the test set as `input.csv` (used later for inference).
- Encodes the target labels using `LabelEncoder`.
- Separates numerical and categorical columns.
- Constructs a preprocessing pipeline using:
  - `SimpleImputer` and `StandardScaler` for numerical data.
  - `OneHotEncoder` for categorical data.
- Trains a `RandomForestClassifier` on the transformed training data.
- Saves the trained model, preprocessing pipeline, and label encoder using `joblib`.
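
The training steps above could look roughly like the following. This is a minimal sketch, not the project's actual script: the toy data, the column names (`CGPA`, `Internships`, `Branch`), the target name `Placement`, the `test_size`, the median imputation strategy, and the choice to drop the label column before writing `input.csv` are all assumptions made for illustration.

```python
import os

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

TARGET = "Placement"  # hypothetical target column name


def make_toy_data(n=40, seed=0):
    """Stand-in for placement.csv; every column name here is hypothetical."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "CGPA": rng.uniform(5.0, 10.0, n).round(2),
        "Internships": rng.integers(0, 4, n),
        "Branch": rng.choice(["CSE", "ECE", "ME"], n),
        TARGET: np.where(np.arange(n) % 4 == 0, "No", "Yes"),
    })


def train(df, out_dir="."):
    # Stratify on the target so both splits keep the same class proportions.
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, test_idx = next(splitter.split(df, df[TARGET]))
    train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

    # Assumption: the held-out rows are saved without the label column.
    test_df.drop(columns=[TARGET]).to_csv(
        os.path.join(out_dir, "input.csv"), index=False
    )

    X_train = train_df.drop(columns=[TARGET])
    label_encoder = LabelEncoder()
    y_train = label_encoder.fit_transform(train_df[TARGET])

    # Split feature columns by dtype.
    num_cols = X_train.select_dtypes(include="number").columns.tolist()
    cat_cols = X_train.select_dtypes(exclude="number").columns.tolist()

    num_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    preprocess = ColumnTransformer([
        ("num", num_pipe, num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])

    # Fit the preprocessing on training data only, then train the forest.
    model = RandomForestClassifier(random_state=42)
    model.fit(preprocess.fit_transform(X_train), y_train)

    joblib.dump(model, os.path.join(out_dir, "model.pkl"))
    joblib.dump(preprocess, os.path.join(out_dir, "preprocessing_pipeline.pkl"))
    joblib.dump(label_encoder, os.path.join(out_dir, "label_encoder.pkl"))
    return model, preprocess, label_encoder
```

Calling `train(make_toy_data())` writes the three `.pkl` files and `input.csv` to the current directory. With the real dataset, `pd.read_csv("placement.csv")` would replace `make_toy_data()`, and `TARGET` would need to match the actual label column.
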
On rerunning the script:
- If model files exist, the script loads the model and pipeline.
- Reads `input.csv`, applies the pipeline, and predicts outcomes.
- Saves the result to `output.csv` with an additional column `Placement_Prediction`.
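
The rerun path described above might be sketched as follows. The helper names `artifacts_exist` and `predict_file` are hypothetical, as is the `folder` parameter. Decoding predictions back to text with the label encoder before writing them is also an assumption about how the script fills `Placement_Prediction`.

```python
import os

import joblib
import pandas as pd

MODEL_FILE = "model.pkl"
PIPELINE_FILE = "preprocessing_pipeline.pkl"
ENCODER_FILE = "label_encoder.pkl"


def artifacts_exist(folder="."):
    """Return True when all three saved artifacts are present in `folder`."""
    return all(
        os.path.exists(os.path.join(folder, name))
        for name in (MODEL_FILE, PIPELINE_FILE, ENCODER_FILE)
    )


def predict_file(folder="."):
    """Load the saved artifacts, score input.csv, and write output.csv."""
    model = joblib.load(os.path.join(folder, MODEL_FILE))
    pipeline = joblib.load(os.path.join(folder, PIPELINE_FILE))
    label_encoder = joblib.load(os.path.join(folder, ENCODER_FILE))

    data = pd.read_csv(os.path.join(folder, "input.csv"))

    # Only transform here; the pipeline was already fitted during training.
    encoded_preds = model.predict(pipeline.transform(data))

    # Assumption: predictions are stored as the original text labels.
    data["Placement_Prediction"] = label_encoder.inverse_transform(encoded_preds)
    data.to_csv(os.path.join(folder, "output.csv"), index=False)
    return data
```

A script could branch on `artifacts_exist()` and call `predict_file()` when it returns `True`, falling back to training otherwise.
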
- pandas
- numpy
- scikit-learn
- joblib
- Stratified sampling keeps the class proportions the same in the training and test sets.
- Full pipeline for preprocessing both numerical and categorical features.
- Label encoding converts the text target labels to numbers for training and back to text afterwards.
- Model persistence using joblib ensures easy inference without retraining.
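
To illustrate the stratified sampling point in the list above, here is a small self-contained check on synthetic labels. The label names, the 75/25 class ratio, and the `test_size` are made up for the example and do not come from the dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels with a 75/25 class imbalance (not the real dataset).
y = np.array(["Placed"] * 75 + ["Not Placed"] * 25)
X = np.zeros((len(y), 1))  # feature values do not affect how the split is drawn

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))

# Share of the majority class in each split.
train_share = np.mean(y[train_idx] == "Placed")
test_share = np.mean(y[test_idx] == "Placed")
print(f"train: {train_share:.2f}, test: {test_share:.2f}")
```

Both shares match the overall class ratio, which is what stratification provides. It does not rebalance an imbalanced dataset.
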