Description
This project predicts whether a baseball hitter in Major League Baseball will swing, take, or whiff at a particular pitch, using data gathered from Major League Baseball’s own R&D Department. Characteristics of a given play include the pitcher’s and hitter’s tendencies, as well as the progress of the game so far. This project would be greatly beneficial for determining pitching strategies against certain hitters, or conversely, devising hitting strategies against certain pitchers in Major League Baseball. Other potential use cases include identifying the features most important to a successful batter, which would be extremely useful when scouting for potential players.
What I like about the project:
- I particularly like the choice of visualizations throughout the project; in the exploratory analysis section, the heat maps and class-based histograms display the data’s distribution well and also acknowledge the imbalance among classes. In the results analysis section, the prediction heat map also gives a good visualization of how the model performs and where its failure points are. Overall, the data analysis and visualization choices were very informative and easy to understand.
- I also like the care taken in fitting each model to prevent overfitting. The author justifies each step taken to fine-tune and evaluate the models. If I understand correctly, balanced accuracy is the average of per-class recall, which corrects for the imbalance among the classes — if so, the model seems to perform quite well, predicting the swing/take/whiff classifications significantly better than a baseline random guess, which should be 33.33% (since there are three classes).
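To make the distinction concrete, here is a minimal sketch (on toy labels, not the project's data) contrasting plain accuracy with balanced accuracy, which scikit-learn computes as the macro-average of per-class recall:

```python
# Toy example: with a heavy class imbalance, a model that always predicts
# the majority class gets high plain accuracy but low balanced accuracy.
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels, heavily skewed toward "take" (not the project's data)
y_true = ["take"] * 8 + ["swing"] + ["whiff"]
y_pred = ["take"] * 10  # degenerate model: always predicts "take"

print(accuracy_score(y_true, y_pred))           # 0.8 — inflated by imbalance
print(balanced_accuracy_score(y_true, y_pred))  # (1 + 0 + 0) / 3 ≈ 0.33
```

The balanced score averages recall over the three classes, so the degenerate majority-class model drops to roughly the random-guess baseline of 1/3.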
- Both models chosen were rather glass-box, interpretable models. This goes far in answering the main question of highlighting the characteristics important to a successful batter. It would be nice if you explained what features like byzone.swing and plate_z mean, though; but in general, the interpretability of the models is a great plus.
Avenues for future improvement:
- I believe the assignment asked for three techniques learned in class in addition to any outside material, but you only provided one. There’s definitely still a lot of room to explore! For instance, since you’re already trying tree-based ensemble models, AdaBoost and Gradient Boosting might give comparable performance. Furthermore, even among things learned in class, your SVM could be modified with different regularizers depending on how sparse the important features are.
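As a starting point, something like the following sketch could compare the two suggested boosting ensembles under the same balanced-accuracy metric. The synthetic `X` and `y` here are placeholders for the project's pitch features and swing/take/whiff labels:

```python
# Hypothetical comparison of two additional boosting ensembles.
# make_classification stands in for the project's actual feature matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)

for model in (AdaBoostClassifier(random_state=0),
              GradientBoostingClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=3, scoring="balanced_accuracy")
    print(type(model).__name__, round(scores.mean(), 3))
```

Cross-validating with `scoring="balanced_accuracy"` keeps the comparison consistent with the metric already used in the report.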
- You might want to include some discussions on fairness, and whether this model could be used as a weapon of math destruction, as mentioned in the handout.
- On the subject of fairness, I noticed that you used balanced accuracy as a metric, which I believe corrects for the imbalance between classes (correct me if I’m wrong). However, the class imbalance differs mostly by batting zone — hence it would make more sense to me to calculate a weighted accuracy within each zone. Looking at your final prediction accuracy heat map, it seems that your model performs less well on zones 1–9, mostly due to the distinction between the strike and non-strike zones. An alternative idea might even be to train separate models for balls inside and outside the strike zone.
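A per-zone breakdown could look like the sketch below. The `zone` column name and the tiny DataFrame are assumptions standing in for the project's actual prediction output:

```python
# Hypothetical per-zone accuracy report: group predictions by batting zone
# and compute the fraction correct within each zone separately.
import pandas as pd

results = pd.DataFrame({
    "zone":   [1, 1, 2, 2, 11, 11],          # placeholder zone labels
    "y_true": ["swing", "take", "whiff", "swing", "take", "take"],
    "y_pred": ["swing", "swing", "whiff", "swing", "take", "swing"],
})

per_zone = (results.assign(correct=results.y_true == results.y_pred)
                   .groupby("zone")["correct"].mean())
print(per_zone)  # mean accuracy within each zone
```

Reporting this alongside the overall balanced accuracy would show directly whether the in-zone pitches (zones 1–9) are the weak spot.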
- Lastly, there might be room for more explanation of your features — while you highlighted a few features in your exploratory analysis section, I noticed that there were 41 columns, and I wasn’t sure what the majority of the features were about, or how to interpret the feature importances of the XGBoost model.
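One lightweight way to make the 41 columns more interpretable would be to rank the fitted model's feature importances against the column names. This sketch uses a scikit-learn GradientBoostingClassifier with placeholder feature names as a stand-in for the project's XGBoost model:

```python
# Hypothetical sketch: tree ensembles expose feature importances rather than
# linear coefficients; pairing each importance with its column name makes a
# wide feature set easier to interpret.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

model = GradientBoostingClassifier(random_state=0).fit(X, y)
ranked = sorted(zip(names, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Even a short table like this in the report would let readers see at a glance which of the 41 features drive the swing/take/whiff prediction.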
Overall, there is definitely still quite some room to explore for this project, in terms of creating more models and writing up the report. But I think the implementations done so far tell a great story of the dataset and question.