Neural Networks for Predictive Analysis in Men's College Basketball

Predictive analysis for any sport is a notoriously difficult task. Prior research suggests that small neural networks (e.g., a three-layer network) [1] and ensemble models such as extra-trees classifiers and gradient boosting [2] can be effective in addressing these challenges. This project expands on that work by training an individual Multi-Layer Perceptron model for each of 15 basketball metrics (GP, GS, MIN/G, FG%, 3PT%, FT%, PPG, REB/G, OFF_REB/G, DEF_REB/G, PF/G, AST/G, TO/G, STL/G, BLK/G), predicting each for a player's subsequent season.

The project also attempts to understand how the level of play across Division I and Division II affects outcomes when a player switches between the two divisions or moves within one. For example, if a player transfers from Point Loma Nazarene (a higher-level Division II team) to UC San Diego (a lower-level Division I team), or from UC San Diego to UConn (a higher-level Division I team), some adjustment for level of play needs to be made. A novel approach was taken in which each conference was assigned a "perceived score" of competition from 0 to 2 (with 0 being the lowest and 2 the highest). To account for the large number of variables, careful emphasis was placed on finding similar players based on their stats, level of play, and progression, and incorporating this information into the model.

The project is divided into three parts: Scraping, Training, and Inference. The final aim is a versatile model that not only predicts future performance with high accuracy but also adapts to the nuances of different levels of competition, ultimately providing insight into how these changes can influence player performance.

Scraping

The goal was to obtain the last six years of data, starting with 2018-2019 and ending with 2023-2024 (though the code is set up to fetch the most current season, with March 18th as the cutoff date for a new season). There are a few sites to scrape this information from, including barttorvik.com and espn.com, but scraping each team's own website was by far the most reliable in terms of normalized data. The process looks something like this: given a list of Division I and Division II Men's Basketball schools, for each year from 2018-2019 to 2023-2024 (or the current year), two pages were scraped - the team's roster (e.g., https://ucsdtritons.com/sports/mens-basketball/roster) and the individual overall cumulative statistics table (e.g., https://ucsdtritons.com/sports/mens-basketball/stats#individual). The two tables were then merged, allowing the player's position to be added as a feature. The scraping used a mix of BeautifulSoup and Selenium, as team websites varied in their HTML structure - for example, some cumulative statistics tables were PDFs, while others were pure HTML. After repeating this for nearly 700 teams and 6+ years of data, the data was transformed to reflect the values we wanted to predict. Counting statistics were converted into per-game stats (e.g., PTS -> PPG = PTS / GP, where GP is games played), and columns such as 'Conference' (the conference of each player's team), 'Conference_Grade' (a score from 0-2 reflecting the level of competition in a team's conference), and 'Occurrence' (the player's year of eligibility - freshmen would generally be 1, sophomores 2, etc.) were added.
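As a rough illustration of that transformation, the per-game step might look something like the following pandas sketch. The raw column names (PTS, REB, AST, Season) and the conference-to-grade mapping are assumptions for the example, not the project's actual values:

```python
import pandas as pd

# Hypothetical 0-2 competition scores; the project's actual mapping differs.
CONFERENCE_GRADES = {"Big West": 1.0, "Big East": 2.0}

def to_per_game(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(["Player", "Season"]).copy()   # assumes a 'Season' column
    df["PPG"] = df["PTS"] / df["GP"]
    df["REB/G"] = df["REB"] / df["GP"]
    df["AST/G"] = df["AST"] / df["GP"]
    df["Conference_Grade"] = df["Conference"].map(CONFERENCE_GRADES)
    # 'Occurrence': year of eligibility, counted per player across seasons
    df["Occurrence"] = df.groupby("Player").cumcount() + 1
    return df
```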

While most schools adhered to two prevailing HTML structures, some take a more custom approach to displaying their data. UCLA is a good example. They display their individual overall cumulative statistics table within an iframe (https://uclabruins.com/sports/mens-basketball/stats/2023-24), meaning it generally can't be accessed through a different URL (for most schools, the individual overall tables could be reached by appending '#individual' to the stats page URL). Because the iframe was difficult to navigate programmatically, and no network requests were observed when interacting with the content, the workaround was to find the PDF version of the page (https://uclabruins.com/sports/mens-basketball/stats/2023-24). However, since the PDF is embedded as a 'PDFObject' - meaning it still wasn't directly accessible - the cloud endpoint (https://s3.us-east-2.amazonaws.com/sidearm.nextgen.sites/uclabruins.com/stats/mbball/2023/pdf/cume.pdf?t=1717719774215) was identified within the HTML and used to access the PDF directly. The PDF was then parsed and cleaned with pdfplumber and regular expressions.
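For reference, a minimal sketch of that PDF workaround (fetching the endpoint and parsing it with pdfplumber; the row-matching regex is only illustrative, not the project's exact cleaning logic) could look like:

```python
import re
from io import BytesIO

import pdfplumber
import requests

# S3 endpoint found in the page HTML for UCLA's cumulative stats PDF.
PDF_URL = ("https://s3.us-east-2.amazonaws.com/sidearm.nextgen.sites/"
           "uclabruins.com/stats/mbball/2023/pdf/cume.pdf?t=1717719774215")

resp = requests.get(PDF_URL, timeout=30)
with pdfplumber.open(BytesIO(resp.content)) as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# A player row typically starts with a jersey number followed by a name.
player_rows = [line for line in text.splitlines() if re.match(r"^\d{1,2}\s+\S", line)]
```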

By prioritizing flexibility, the scraping methodology of this project attempted to obtain the most comprehensive coverage possible, regardless of how a team structures its data.

Training

In the preprocessing stage, data was prepared according to the specified conditions - depending on whether a test loss was required (if so, the last year of data was set aside as the test set). Preprocessing ranged from basic data cleaning, such as dropping rows consisting mostly of zeros, to more involved steps such as standardizing numerical features and one-hot encoding categorical features. The transformed data was then loaded into a DataLoader to facilitate batch processing (batch size of 64) during model training.
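A minimal sketch of that preprocessing pipeline, using scikit-learn and a PyTorch DataLoader (the feature lists and target column name are assumptions for illustration, not the project's exact columns):

```python
import torch
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from torch.utils.data import DataLoader, TensorDataset

def make_loader(train_df, target_col):
    # Illustrative feature split; the real project uses 21 input features.
    numeric_cols = ["MIN/G", "PPG", "REB/G", "AST/G", "Conference_Grade", "Occurrence"]
    categorical_cols = ["Position", "Conference"]

    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
    ])

    X = preprocess.fit_transform(train_df).astype("float32")
    y = train_df[target_col].to_numpy(dtype="float32")

    dataset = TensorDataset(torch.from_numpy(X), torch.from_numpy(y).unsqueeze(1))
    return DataLoader(dataset, batch_size=64, shuffle=True)
```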

For each target column, a Multilayer Perceptron (MLP) model is instantiated. The MLP architecture is as follows:

  • Input Layer: Transforms the 21-dimensional input vector into a 128-dimensional vector.
  • Hidden Layers: Consists of fully connected layers that reduce the dimensions in sequence:
    • First hidden layer: 128 dimensions
    • Second hidden layer: 64 dimensions
    • Third hidden layer: 32 dimensions
    • ReLU activation functions are applied between these layers to introduce non-linearity.
  • Output Layer: Downsamples to an output size of 1, which corresponds to the predicted target variable's dimension.

The model uses Mean Squared Error (MSE) as the loss function and the Adam optimizer with a learning rate of 0.001 to update the network weights; training runs for 15 epochs.
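Under those assumptions, the per-metric model and training loop might look roughly like this in PyTorch (a sketch, not the repository's exact code):

```python
import torch
import torch.nn as nn

class MetricMLP(nn.Module):
    """MLP for one target metric: 21 -> 128 -> 64 -> 32 -> 1 with ReLU activations."""
    def __init__(self, in_dim: int = 21):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        return self.net(x)

def train(model, loader, epochs: int = 15, lr: float = 1e-3):
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for features, target in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), target)
            loss.backward()
            optimizer.step()
    return model
```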

Inference

At inference time, a few adjustments were made to optimize the model's performance. The neural network is a many-to-one mapping of inputs to output, wherein a metric like 'GP' (Games Played) is predicted only by knowing all the other features of a player (i.e. 'GS', 'MIN/G', 'FG%', '3PT%', etc. go into the input, with 'GP' being the output - this is the case for every predicted variable). At the inference stage, only five features are initially known: 'Player', 'Position', 'Team', 'Conference', and 'Conference Grade'. This presents a challenge in predicting the remaining features necessary for analysis. The devised solution involves a two-step process to estimate these unknown features:

  1. Identify the top 150 players most similar to the subject player from the previous year, using cosine similarity for comparison.
  2. From this subset, select players with available data for the subsequent year. Then, calculate the median value for each feature among these players' data for the following year.

This approach effectively generates a rough estimate of the subject player's otherwise unknown statistics, providing a viable dataset to feed into the model for accurate predictions.
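A sketch of that two-step estimate (array names and shapes are assumptions; in the project the vectors would come from the scraped, standardized features):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def estimate_unknown_features(subject_vec, prior_year_X, next_year_X, has_next_year, k=150):
    """Estimate a player's unknown stats from the k most similar prior-year players."""
    sims = cosine_similarity(subject_vec.reshape(1, -1), prior_year_X).ravel()
    top_idx = np.argsort(sims)[::-1][:k]                # k most similar players
    keep = [i for i in top_idx if has_next_year[i]]     # only players with a following season
    return np.median(next_year_X[keep], axis=0)         # rough per-feature estimate
```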

To enhance the predictive accuracy and contextual relevance of player performance forecasts within the model, two significant customizability features were introduced. These enhancements are grounded in the principle that a more nuanced representation of a player's potential role and conditions can substantially improve prediction quality. This approach aligns with feedback from coaches seeking greater adaptability in predictive analytics.

Dynamic Feature Adjustment

The first customizable option allows for the dynamic adjustment of specific player features, namely 'Games Played' (GP), 'Games Started' (GS), and 'Minutes per Game' (MIN/G). This enables users to tailor predictions to better reflect anticipated changes in a player's usage (e.g., changing 'MIN/G' from 18.4 to 24). Unlike adjustments that naively scale predictions, this mechanism recognizes that performance metrics are interrelated in complex ways. Adjusting for anticipated changes involves a refined process that filters out statistical outliers, re-ranks the pool of similar players by the Mean Squared Error on the changed metric(s), and uses a weighted median to re-evaluate the feature estimates. This ensures a more grounded and realistic forecast, acknowledging that a shift in one aspect of a player's profile does not uniformly translate across all performance indicators.
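One way to sketch that re-weighting (the inverse-squared-error weighting on the overridden metric is an illustrative choice, not the project's exact scheme):

```python
import numpy as np

def weighted_median(values, weights):
    """Value at which the cumulative weight first reaches half the total weight."""
    values, weights = np.asarray(values), np.asarray(weights)
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    return values[order][np.searchsorted(cum, 0.5 * cum[-1])]

def adjusted_estimate(similar_X, metric_idx, new_value):
    """Re-weight similar players by closeness to the user-overridden metric value."""
    errors = (similar_X[:, metric_idx] - new_value) ** 2
    weights = 1.0 / (errors + 1e-6)                      # closer players count more
    return np.array([weighted_median(similar_X[:, j], weights)
                     for j in range(similar_X.shape[1])])
```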

Adding Context with Similar Players

The second customizable option focuses on incorporating additional context by including user-specified similar players in the model's input, recognizing that a player's intended role or anticipated conditions can significantly influence predictive outcomes. By selecting comparable players who have previously filled similar roles (e.g., predicting the output of a player in the transfer portal who is known to be taking on a defined role at another school; an example from the 2023-2024 season could be Caleb Love going to Arizona, knowing he'll fill the role of a Bennedict Mathurin, Nico Mannion, or Allonzo Trier from years past), the model gains a richer dataset from which to draw parallels. This approach uses a weighted median calculation between the pool of similar players and the specified comparables to derive estimates, providing a more accurate and insightful prediction.
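A similarly hedged sketch of blending the similarity pool with user-specified comparables, reusing the weighted_median helper from the previous sketch (comp_weight is an assumed knob, not a documented parameter):

```python
import numpy as np

def blended_estimate(similar_X, comparable_X, comp_weight=3.0):
    """Weighted median across both pools, giving named comparables a heavier (assumed) weight."""
    values = np.vstack([similar_X, comparable_X])
    weights = np.concatenate([np.ones(len(similar_X)),
                              comp_weight * np.ones(len(comparable_X))])
    return np.array([weighted_median(values[:, j], weights)   # weighted_median: see previous sketch
                     for j in range(values.shape[1])])
```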

Together, these customizability features provide a predictive tool that not only leverages statistical analysis but also incorporates practical insight into player usage and team dynamics.

Citations

[1] J. Perricone, I. Shaw, & W. Święchowicz, "Predicting Results for Professional Basketball Using NBA API Data," Institute of Computational and Mathematical Engineering, Stanford University. Available: https://cs229.stanford.edu/proj2016/report/PerriconeShawSwiechowicz-PredictingResultsforProfessionalBasketballUsingNBAAPIData.pdf

[2] V. Saladi, "DeepShot: A Deep Learning Approach To Predicting Basketball Success," Department of Computer Science, Stanford University. Available: https://cs230.stanford.edu/projects_winter_2020/reports/32643049.pdf
