Break Through Tech Relationship Analysis Tool
Using S&P 500 stock price data, our project explores how uncovering inter-company relationships through clustering can enhance financial prediction models—particularly stock price forecasting. By leveraging XGBoost and hierarchical clustering, we aim to boost predictive accuracy for better investment insights.
- Scraped daily closing prices of S&P 500 companies from Yahoo Finance between 01/03/2022 and 12/31/2022.
- Converted data into daily return percentages to better reflect relative movement rather than raw price.
- Constructed a correlation matrix using daily returns to understand co-movements between stocks.
- Identified strong sectoral groupings (e.g., Energy, Financials) through patterns in return behavior.
- Dropped weeks that contain a market closure, since those weeks are missing at least one daily return.
- Final dataset includes 41 usable weeks per company with valid return data.
- Removed incomplete or anomalous entries to ensure consistent input for both clustering and modeling.
- Applied Hierarchical Clustering using Ward linkage and absolute correlation distance (`1 - |r|`) to measure similarity.
- Constructed a dendrogram and chose a cutoff that yielded 3 distinct clusters.
- Evaluated clustering quality using silhouette scores (achieved score: `0.2405`).
- Final distribution: `[392, 27, 79]` companies in each cluster respectively.
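
The clustering steps above can be sketched roughly as follows. This is a minimal illustration on synthetic prices, not the project's actual code: the helper names `correlation_distance` and `ward_clusters`, the synthetic tickers, and the use of SciPy and scikit-learn are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score


def correlation_distance(prices: pd.DataFrame) -> pd.DataFrame:
    """Return the 1 - |r| distance matrix between tickers.

    Assumes `prices` has one row per trading day and one column per ticker.
    """
    returns = (prices / prices.shift(1) - 1.0).dropna() * 100.0  # daily % returns
    dist = 1.0 - returns.corr().abs()
    values = np.clip(dist.to_numpy(copy=True), 0.0, None)  # guard against rounding
    values = (values + values.T) / 2.0                      # enforce exact symmetry
    np.fill_diagonal(values, 0.0)
    return pd.DataFrame(values, index=dist.index, columns=dist.columns)


def ward_clusters(dist: pd.DataFrame, n_clusters: int = 3) -> pd.Series:
    """Cut a Ward-linkage tree built on `dist` into at most `n_clusters` groups."""
    tree = linkage(squareform(dist.to_numpy(), checks=False), method="ward")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return pd.Series(labels, index=dist.index, name="cluster")


# Synthetic example (hypothetical data): three groups of four tickers,
# each group driven by its own common factor plus a little noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(60, 3))
daily = np.repeat(factors, 4, axis=1) + 0.1 * rng.normal(size=(60, 12))
tickers = [f"T{i:02d}" for i in range(12)]
prices = pd.DataFrame(100.0 * np.cumprod(1.0 + 0.01 * daily, axis=0), columns=tickers)

dist = correlation_distance(prices)
labels = ward_clusters(dist)
score = silhouette_score(dist.to_numpy(), labels.to_numpy(), metric="precomputed")
print(labels.value_counts().sort_index().tolist(), round(score, 4))
```

One caveat worth keeping in mind: Ward linkage is formally defined for Euclidean distances, and SciPy leaves it to the caller to decide whether a precomputed distance such as `1 - |r|` is appropriate. The sketch passes the distance through as-is, which may or may not match how the project handled it.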
- Built two sets of XGBoost regression models:
  - Baseline Model – trained without any clustering data.
  - Cluster-Enhanced Model – trained separately on each cluster to capture intra-group patterns.
- Input Features: Daily returns for 4 days (Mon–Thu) → Predict Friday return (Day 5).
- Each week per company forms a single training sample (41 samples per company).
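
As a rough illustration of how one company's daily returns could be turned into weekly samples (Mon–Thu as features, Friday as target), here is a small pandas sketch. The function name `weekly_samples`, the column names, and the toy data are hypothetical. The rule of keeping only weeks with all five trading days is an assumption about how the incomplete weeks were dropped.

```python
import pandas as pd


def weekly_samples(returns: pd.Series) -> pd.DataFrame:
    """Build one row per complete week: Mon-Thu features and a Friday target.

    Assumes `returns` is indexed by a DatetimeIndex of trading days.
    Weeks missing any weekday (for example, a market holiday) are dropped.
    """
    idx = returns.index
    frame = pd.DataFrame({
        # Date of the Monday that starts each row's week
        "week_start": idx - pd.to_timedelta(idx.dayofweek, unit="D"),
        "weekday": idx.dayofweek,  # Monday=0 ... Friday=4
        "ret": returns.to_numpy(),
    })
    wide = frame.pivot(index="week_start", columns="weekday", values="ret")
    wide = wide.reindex(columns=range(5)).dropna()  # keep complete weeks only
    wide.columns = ["mon", "tue", "wed", "thu", "fri"]
    return wide


# Toy data: three business weeks starting Monday 2022-01-03,
# with one Wednesday removed to mimic a market closure.
days = pd.bdate_range("2022-01-03", periods=15)
returns = pd.Series([float(i) for i in range(15)], index=days).drop(days[7])

samples = weekly_samples(returns)
X, y = samples[["mon", "tue", "wed", "thu"]], samples["fri"]
print(samples)
```

In this toy run the week with the missing Wednesday is discarded, so only two of the three weeks survive as training samples.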
Used grid search for each model with the following tuning parameters:
- `max_depth`: `[2, 3, 5, 7, 8, 10]`
- `n_estimators`: `[100, 200, 300, 400]`
- `learning_rate`: `[0.001, 0.005, 0.01, 0.05, 0.1]`
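
To give a sense of the size of this search, the sketch below enumerates the grid and runs a plain exhaustive search over it. It is only an illustration: `iter_grid`, `grid_search`, and the `make_model` factory are hypothetical names, and scoring on a single held-out validation split by MSE is an assumption, since the scoring setup is not specified here.

```python
from itertools import product

# Tuning grid taken from the list above
PARAM_GRID = {
    "max_depth": [2, 3, 5, 7, 8, 10],
    "n_estimators": [100, 200, 300, 400],
    "learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1],
}


def iter_grid(grid):
    """Yield every parameter combination in `grid` as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(make_model, X_train, y_train, X_val, y_val, grid=PARAM_GRID):
    """Fit one model per combination and return the (params, mse) with lowest MSE.

    `make_model` is any callable that accepts the grid keys as keyword
    arguments and returns an object with `fit` and `predict` methods.
    """
    best_params, best_mse = None, float("inf")
    for params in iter_grid(grid):
        model = make_model(**params)
        model.fit(X_train, y_train)
        preds = model.predict(X_val)
        mse = sum((p - t) ** 2 for p, t in zip(preds, y_val)) / len(y_val)
        if mse < best_mse:  # strict comparison keeps the first of any ties
            best_params, best_mse = params, mse
    return best_params, best_mse


n_combinations = sum(1 for _ in iter_grid(PARAM_GRID))
print(n_combinations)  # 6 * 4 * 5 combinations
```

With XGBoost installed, `xgboost.XGBRegressor` could be passed as `make_model`, since it accepts `max_depth`, `n_estimators`, and `learning_rate` as keyword arguments. For the cluster-enhanced setup, the same search would simply be repeated on each cluster's samples.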
- Evaluation Metrics: MSE, RMSE, and MAE.
- Cluster-specific models significantly outperformed the baseline in terms of error reduction.
- Visualizations showed improvement in prediction consistency after integrating cluster insights.
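
For reference, the three evaluation metrics can be computed with a few lines of NumPy. The helper name `regression_errors` and the sample numbers are made up for illustration and are not results from this project.

```python
import numpy as np


def regression_errors(y_true, y_pred):
    """Return MSE, RMSE, and MAE between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = float(np.mean(err ** 2))
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),        # same units as the returns
        "mae": float(np.mean(np.abs(err))),  # less sensitive to large misses
    }


# Made-up Friday returns (in percent) and predictions
metrics = regression_errors([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(metrics)
```

Comparing the baseline and cluster-enhanced models would then come down to calling this helper on each model's predictions for the same held-out weeks.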
- Outlier handling: Detect and evaluate companies or time periods with unusual behavior.
- Expand features: Include sentiment analysis from news and social media (Reddit, Twitter).
- Build an interactive dashboard to visualize company clusters and relationships.