You can install all the required packages using pip. Run the following command in your terminal:

```bash
pip install -r requirements.txt
```
It's recommended to use a virtual environment for your Python projects. This helps to manage dependencies and keeps your project isolated from other projects or system-wide Python settings. If you are not already using a virtual environment, you can set one up using `virtualenv` or `conda` (if you are using Anaconda).
Using `virtualenv`:

- Install virtualenv if you haven't already:

  ```bash
  pip install virtualenv
  ```

- Create a virtual environment in your project directory:

  ```bash
  virtualenv venv
  ```

- Activate the virtual environment:

  - On Windows, run the activation script in the `Scripts` folder of the virtual environment:

    ```bash
    .\venv\Scripts\activate
    ```

  - On Unix or macOS, use the `source` command:

    ```bash
    source venv/bin/activate
    ```

- Once the virtual environment is activated, run the `pip install` command shown above.
Using `conda`:

- Create a new conda environment, specifying the desired Python version:

  ```bash
  conda create -n myenv python=3.7
  ```

- Activate the conda environment using the `conda activate` command:

  ```bash
  conda activate myenv
  ```

- After activating the conda environment, follow the same `pip install` step as above.

Remember to replace the environment name and Python version with your preferences.
The `Wang_Chongrui_test_prob1.py` script demonstrates the functionality of `CustomCSVData`, a custom data loader for Backtrader, a Python library used for backtesting trading strategies.
- Purpose: To load and preprocess financial data from CSV files in various formats for backtesting.
- Class Inheritance: Inherits from `bt.feeds.GenericCSVData` (a minimal subclass sketch follows this list).
- Parameters:
  - `csvformat`: Specifies the format of the CSV file (`'type1'`, `'type2'`, `'type3'`).
  - Other common parameters, such as `dtformat` (date format), are also defined.
- Methods:
  - `__init__`: Initializes the data loader and configures it based on the CSV format.
  - `configure_loader_type1`: Configures the loader for the 'type1' CSV format with specific column mappings.
  - `configure_loader_type2`: Adjusts the loader for the 'type2' format with a different set of column mappings.
  - `configure_loader_type3`: Sets up the loader for the 'type3' format, potentially involving more complex preprocessing steps.
  - `start`: Reads the CSV data, performs any sorting or processing required by the CSV format, and saves the sorted data to a temporary file for further loading.
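For illustration, a minimal `bt.feeds.GenericCSVData` subclass following this pattern might look like the sketch below. The column indices, date format, and `csvformat` handling are assumptions for the example, not the script's actual mappings.

```python
import backtrader as bt

class CustomCSVData(bt.feeds.GenericCSVData):
    """Minimal, illustrative sketch of a format-aware CSV feed."""

    # 'csvformat' is the extra parameter described above; the remaining
    # entries are standard GenericCSVData column-mapping parameters.
    params = (
        ('csvformat', 'type1'),
        ('dtformat', '%Y-%m-%d'),
        ('datetime', 0),
        ('open', 1),
        ('high', 2),
        ('low', 3),
        ('close', 4),
        ('volume', 5),
        ('openinterest', -1),  # -1: column not present in the file
    )

    def __init__(self):
        # Hypothetical re-mapping for an alternate column layout.
        if self.p.csvformat == 'type2':
            self.p.close, self.p.volume = 5, 4
        super().__init__()
```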
- `test` function: Demonstrates how to use the `CustomCSVData` class with Backtrader.
  - Initializes a Backtrader `Cerebro` instance.
  - Adds data in different CSV formats to the Cerebro instance for backtesting.
  - Runs the backtesting simulation and prints confirmation messages.
To execute the script and test the custom data loader, use the command:

```bash
python Wang_Chongrui_test_prob1.py
```
This script is designed for running machine learning and deep learning models for stock return prediction.
- Custom Utility Modules: Modules for specific ML and DL tasks.
- Standard Modules: `argparse`, `json`, `os`, `pandas`.
- Purpose: Define and parse command-line arguments.
- Arguments (a minimal parser sketch follows this list):
  - `--config`: JSON configuration file path.
  - `--model_type`: Model type (`deep_learning`, `machine_learning`, or `None`).
  - `--function`: Function to run.
  - `--task`: Task type (`classification`, `regression`, `backtest`).
  - `--data_source`: Data source.
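A minimal parser matching these arguments might look like the following sketch (defaults and help strings are assumptions, not the script's exact definitions):

```python
import argparse

def parse_arguments():
    parser = argparse.ArgumentParser(description="Stock return prediction runner")
    parser.add_argument("--config", type=str, help="Path to the JSON configuration file")
    parser.add_argument("--model_type", type=str, default="None",
                        choices=["deep_learning", "machine_learning", "None"],
                        help="Family of models to run")
    parser.add_argument("--function", type=str, default="None",
                        help="Function to run, e.g. get_score_ML or get_DL_score")
    parser.add_argument("--task", type=str,
                        choices=["classification", "regression", "backtest"],
                        help="Task type")
    parser.add_argument("--data_source", type=str, default="None",
                        help="Data source identifier, e.g. 10stocks")
    return parser.parse_args()

if __name__ == "__main__":
    print(parse_arguments())
```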
- Purpose: Validate the parsed arguments.
- Checks: Configuration file existence, function compatibility, task type, data source.
- Functionality: Parses arguments, loads configurations, sets globals, and calls specific functions based on task and model type.
- Tasks: Handles `classification`, `regression`, and `backtest`.
- The script runs the `main()` function when executed as the main program.
`get_score_ML` is designed to apply machine learning models to predict stock returns, handling tasks such as training, testing, and evaluation.
- train_df, test_df: Training and testing datasets.
- save_dir: Directory for saving outputs.
- model_name: Name of the ML model.
- ignore_cols: Columns to exclude from the analysis.
- method: Task type ('classification' or 'regression').
- params, hyperparams: Model and hyperparameter settings.
Run the function using a command like:

```bash
python Wang_Chongrui_main.py --config Wang_Chongrui_LightGBM.json --model_type machine_learning --function get_score_ML --task classification --data_source 10stocks
```
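For reference, the same call could in principle be made directly from Python, assuming the parameter list above. The import path and data file names below are placeholders, not the repository's actual module layout:

```python
import json
import pandas as pd

# Hypothetical import path; adjust to the repository's module layout.
from ml_utils import get_score_ML

train_df = pd.read_csv("train.csv")   # placeholder file names
test_df = pd.read_csv("test.csv")

with open("Wang_Chongrui_LightGBM.json") as f:
    cfg = json.load(f)

get_score_ML(
    train_df, test_df,
    save_dir=cfg["save_dir"],
    model_name=cfg["model_name"],
    ignore_cols=cfg["ignore_cols"],
    method=cfg["classification"]["method"],
    params=cfg["classification"]["params"],
    hyperparams=cfg["classification"]["hyperparams"],
)
```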
The JSON configuration file specifies settings for running a machine learning model. Here is a detailed explanation of each section in the file:
- model_name: Specifies the machine learning model to be used, for example `"LightGBM"`.
- ignore_cols: A list of column names in the dataset that should be ignored during the model training and testing process.
- save_dir: Path to the directory where the model and other output files will be saved.
- classification: Configuration specific to classification tasks.
- method: Specifies the task type, here "classification".
- params: Model-specific parameters. Set to null if default parameters are to be used.
- hyperparams: Settings for hyperparameter tuning.
- uniform: Specifies continuous hyperparameters with their range.
- int: Specifies integer hyperparameters with their range.
- choice: For categorical hyperparameters (empty in this case).
- maximum: Optimization direction; set to true to maximize the scoring metric, or false to minimize it.
- scoring: Metric used for evaluating the model performance.
- regression: Configuration specific to regression tasks. Similar structure to classification, but with different settings tailored for regression.
```json
{
  "model_name": "LightGBM",
  "ignore_cols": ["date", "Open", "High", "Low", "Close", "Volume", "Dividends", "Stock Splits", "ticker", "return", "label"],
  "save_dir": ".",
  "classification": {
    "method": "classification",
    "params": null,
    "hyperparams": {
      "uniform": {
        "learning_rate": [0.01, 1],
        "subsample": [0.1, 1],
        "colsample_bytree": [0.1, 1],
        "reg_lambda": [0.01, 10],
        "reg_alpha": [0.01, 10]
      },
      "int": {
        "n_estimators": [100, 1000],
        "num_leaves": [20, 150],
        "max_depth": [3, 20]
      },
      "choice": {},
      "maximum": true,
      "scoring": "accuracy"
    }
  },
  "regression": {
    "method": "regression",
    "params": null,
    "hyperparams": {
      "uniform": {
        "learning_rate": [0.01, 1],
        "subsample": [0.1, 1],
        "colsample_bytree": [0.1, 1],
        "reg_lambda": [0.01, 10],
        "reg_alpha": [0.01, 10]
      },
      "int": {
        "n_estimators": [100, 1000],
        "num_leaves": [20, 150],
        "max_depth": [3, 20]
      },
      "choice": {},
      "maximum": false,
      "scoring": "neg_mean_squared_error"
    }
  }
}
```
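The tuner itself is not shown in this README, but the `uniform`/`int`/`choice` ranges map naturally onto a randomized search. The sketch below illustrates one such mapping with scikit-learn's `RandomizedSearchCV` and LightGBM; the project may well use a different optimizer, so treat this only as an interpretation of the config format:

```python
import json
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from lightgbm import LGBMClassifier

with open("Wang_Chongrui_LightGBM.json") as f:
    hp = json.load(f)["classification"]["hyperparams"]

# Turn the [low, high] ranges from the config into sampling distributions.
space = {name: uniform(lo, hi - lo) for name, (lo, hi) in hp["uniform"].items()}
space.update({name: randint(lo, hi + 1) for name, (lo, hi) in hp["int"].items()})
space.update(hp["choice"])  # categorical choices, empty in this config

search = RandomizedSearchCV(
    LGBMClassifier(),
    param_distributions=space,
    n_iter=20,
    scoring=hp["scoring"],   # "accuracy" per the config above
    cv=3,
    random_state=42,
)
# search.fit(X_train, y_train)  # X_train / y_train: feature matrix and labels
```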
The `get_score_ML_ensemble` function is designed for training ensemble machine learning models. It can handle both classification and regression tasks using different ensemble methods.
- Purpose: Apply ensemble machine learning techniques to stock return prediction.
- Parameters:
  - `train_df`, `test_df`: DataFrames for training and testing.
  - `save_dir`: Directory for saving the model and factors.
  - `model_names`: List of model names used in the ensemble.
  - `ignore_cols`: Columns to ignore in the dataset.
  - `method`: Task type, either 'classification' or 'regression'.
  - `ensemble_method`: Type of ensemble method ('vote' or 'bag').
  - `params`: Model-specific parameters.
- Voting: Combines predictions from multiple models. In 'classification', it uses a soft voting strategy, while in 'regression', it averages the predictions.
- Bagging: Trains the same model on different subsets of the dataset. Useful for reducing variance and improving robustness.
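As a rough illustration of the 'vote' option for classification, the sketch below builds a soft-voting ensemble of several LightGBM classifiers with scikit-learn; it is a generic example, not the repository's implementation:

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the engineered stock features.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Five LightGBM models with different seeds, combined by soft voting
# (averaging predicted probabilities), mirroring the 'vote' option above.
estimators = [(f"lgbm_{i}", LGBMClassifier(random_state=i)) for i in range(5)]
ensemble = VotingClassifier(estimators=estimators, voting="soft")
ensemble.fit(X_tr, y_tr)
print("Ensemble accuracy:", ensemble.score(X_te, y_te))
```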
To execute the `get_score_ML_ensemble` function, use the following command:

```bash
python Wang_Chongrui_main.py --config Wang_Chongrui_machine_learning_ensemble.json --model_type machine_learning --function get_score_ML_ensemble --task classification --data_source 10stocks
```
The JSON configuration file specifies settings for the ensemble machine learning models. Here's a breakdown of each parameter in the file:
- save_dir: The directory where the model and output files will be saved.
- model_names: A list of names of the machine learning models to be included in the ensemble.
- method: Specifies the type of task - either 'classification' or 'regression'.
- ensemble_method: The ensemble technique to be used. Options include 'vote' for voting ensemble and 'bag' for bagging ensemble.
- ignore_cols: Columns from the dataset to be ignored during model training and testing.
- params: Model-specific parameters. Set to `null` if default parameters are to be used.
```json
{
  "classification": {
    "save_dir": ".",
    "model_names": ["LightGBM", "LightGBM", "LightGBM", "LightGBM", "LightGBM"],
    "method": "classification",
    "ensemble_method": "vote",
    "ignore_cols": ["date", "Open", "High", "Low", "Close", "Volume", "Dividends", "Stock Splits", "ticker", "return", "label"],
    "params": null
  },
  "regression": {
    "save_dir": ".",
    "model_names": ["LightGBM", "LightGBM", "LightGBM", "LightGBM", "LightGBM"],
    "method": "regression",
    "ensemble_method": "vote",
    "ignore_cols": ["date", "Open", "High", "Low", "Close", "Volume", "Dividends", "Stock Splits", "ticker", "return", "label"],
    "params": null
  }
}
```
The JSON configuration file specifies settings for training deep learning models. Below is an explanation of each parameter in the file:
These settings apply to all models irrespective of the task (classification or regression):
- BATCH_SIZE: The size of each batch of data during training.
- SEQ_LEN: The length of the input sequences.
- LEARNING_RATE: The learning rate for the optimizer.
- MOMENTUM: The momentum factor in the optimizer.
- WEIGHT_DECAY: The weight decay (L2 penalty) used for regularization.
- MAX_EPOCH: The maximum number of epochs for training.
- TEST_SIZE: The proportion of the dataset to include in the test split.
- RANDOM_SEED: The seed used for random number generation.
- PATIENCE: Number of epochs with no improvement after which training will be stopped.
- save_dir: Directory to save the trained model and results.
- model_name: The name of the deep learning model, e.g., "LSTM".
- loss_name: The loss function used, e.g., "BCE" for Binary Cross-Entropy.
- method: The type of task, here "classification".
- ignore_cols: Columns in the dataset to be ignored during model training.
- model_name: Same as above, but for regression models.
- loss_name: The loss function for regression, e.g., "IC".
- method: Set to "regression" for regression tasks.
- ignore_cols: Columns to ignore, similar to classification.
The `get_DL_score` function performs the following steps (a training-loop sketch follows this list):
- Validate the method and create the necessary directories.
- Prepare data loaders for training, validation, and testing.
- Initialize and potentially load a pre-trained model.
- Train the model or load the best model state.
- Evaluate the model on the test data.
- Calculate accuracy and precision metrics.
- Save predictions and metrics.
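These steps correspond to a fairly standard PyTorch training loop with early stopping. The self-contained sketch below uses the global settings shown in the configuration further down, synthetic data, and a toy LSTM; the repository's actual data pipeline and model definitions will differ:

```python
import copy
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)                      # RANDOM_SEED
SEQ_LEN, BATCH_SIZE, PATIENCE, MAX_EPOCH = 30, 256, 5, 100

# Synthetic stand-in for the (samples, SEQ_LEN, features) training tensors.
X = torch.randn(2048, SEQ_LEN, 8)
y = (torch.rand(2048, 1) > 0.5).float()
train_dl = DataLoader(TensorDataset(X[:1600], y[:1600]), batch_size=BATCH_SIZE, shuffle=True)
val_dl = DataLoader(TensorDataset(X[1600:], y[1600:]), batch_size=BATCH_SIZE)

class LSTMClassifier(nn.Module):
    def __init__(self, n_features=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])       # last time step -> logit

model = LSTMClassifier()
criterion = nn.BCEWithLogitsLoss()         # "BCE" loss from the config
optim = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9, weight_decay=5e-5)

best_loss, best_state, bad_epochs = float("inf"), None, 0
for epoch in range(MAX_EPOCH):
    model.train()
    for xb, yb in train_dl:
        optim.zero_grad()
        criterion(model(xb), yb).backward()
        optim.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(xb), yb).item() for xb, yb in val_dl) / len(val_dl)
    if val_loss < best_loss:
        best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
    else:
        bad_epochs += 1
        if bad_epochs >= PATIENCE:         # early stopping
            break
model.load_state_dict(best_state)
```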
To execute the `get_DL_score` function for training deep learning models, use the following command:

```bash
python Wang_Chongrui_main.py --config Wang_Chongrui_Deep_Learning.json --model_type deep_learning --function get_DL_score --task classification --data_source 10stocks
```
The JSON file provides global settings and specific model configurations. The global settings include batch size, sequence length, learning rate, momentum, weight decay, maximum epochs, test size, random seed, and patience for early stopping.
```json
{
  "global_settings": {
    "BATCH_SIZE": 256,
    "SEQ_LEN": 30,
    "LEARNING_RATE": 1e-4,
    "MOMENTUM": 0.9,
    "WEIGHT_DECAY": 5e-5,
    "MAX_EPOCH": 100,
    "TEST_SIZE": 0.2,
    "RANDOM_SEED": 42,
    "PATIENCE": 5
  },
  "classification": {
    "save_dir": ".",
    "model_name": "LSTM",
    "loss_name": "BCE",
    "method": "classification",
    "ignore_cols": ["Open", "High", "Low", "Close", "Volume", "Dividends", "Stock Splits", "return", "label"]
  },
  "regression": {
    "save_dir": ".",
    "model_name": "LSTM",
    "loss_name": "IC",
    "method": "regression",
    "ignore_cols": ["Open", "High", "Low", "Close", "Volume", "Dividends", "Stock Splits", "return", "label"]
  }
}
```
The `get_DL_score_LTR` function is specifically designed for training deep learning models using a Learning to Rank (LTR) approach, suitable for regression tasks.
- Purpose: Implement deep learning models for regression tasks using LTR.
- Parameters:
  - `train_df`, `test_df`: DataFrames for training and testing.
  - `save_dir`: Directory to save the model and results.
  - `model_name`: Name of the deep learning model.
  - `loss_name`: Specifies the loss function, tailored for LTR.
  - `ignore_cols`: Columns to be ignored in the dataset.
  - `method`: Specifies the task type, set to 'regression' for LTR.
Key processing steps (a loss-function sketch follows this list):
- Validation of the method as 'regression'.
- Creation of necessary directories for storing models and results.
- Preparation of data loaders specifically for LTR tasks.
- Initialization and training of the deep learning model.
- Evaluation of the model on test data.
- Calculation of accuracy and precision metrics.
- Saving the predictions in a CSV file.
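One LTR loss named in the example configuration below is ListMLE. A common implementation is the Plackett-Luce negative log-likelihood of ranking items in the ground-truth order; the repository's version may differ in its details:

```python
import torch

def listmle_loss(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """ListMLE: negative log-likelihood of ranking items by `targets`.

    scores, targets: (batch, list_size) tensors; larger target = better rank.
    """
    # Arrange the predicted scores in the ground-truth order (best first).
    order = targets.argsort(dim=-1, descending=True)
    ordered = scores.gather(-1, order)
    # log-sum-exp over the remaining (suffix) items at each rank position.
    suffix_lse = torch.logcumsumexp(ordered.flip(-1), dim=-1).flip(-1)
    return (suffix_lse - ordered).sum(dim=-1).mean()

# Example: rank 10 stocks per day by predicted score against realized returns.
pred = torch.randn(4, 10, requires_grad=True)
true_returns = torch.randn(4, 10)
loss = listmle_loss(pred, true_returns)
loss.backward()
print(loss.item())
```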
Execute the `get_DL_score_LTR` function with the following command:

```bash
python Wang_Chongrui_main.py --config Wang_Chongrui_Deep_Learning_LTR.json --model_type deep_learning --function get_DL_score_LTR --task regression --data_source 10stocks
```
The JSON configuration file for the `get_DL_score_LTR` function sets up parameters for training deep learning models, specifically for regression tasks using the Learning to Rank (LTR) approach.
- BATCH_SIZE: Number of samples per gradient update.
- SEQ_LEN: Length of the sequence data.
- LEARNING_RATE: Step size at each iteration of the learning process.
- MOMENTUM: Parameter to accelerate SGD in the relevant direction.
- WEIGHT_DECAY: L2 penalty, a regularization technique.
- MAX_EPOCH: Maximum number of training epochs.
- TEST_SIZE: Proportion of the dataset to be used as the test set.
- RANDOM_SEED: Seed for random number generation to ensure reproducibility.
- PATIENCE: Number of epochs to wait for improvement before stopping.
- save_dir: Path to save the model and output files.
- model_name: Name of the deep learning model to be used.
- loss_name: Name of the loss function specific to the model.
- method: Type of machine learning task, set to 'regression' for LTR.
- ignore_cols: List of columns to be ignored during model training and testing.
Example:
```json
{
  "global_settings": {
    "BATCH_SIZE": 256,
    "SEQ_LEN": 30,
    "LEARNING_RATE": 1e-4,
    "MOMENTUM": 0.9,
    "WEIGHT_DECAY": 5e-5,
    "MAX_EPOCH": 100,
    "TEST_SIZE": 0.2,
    "RANDOM_SEED": 42,
    "PATIENCE": 5
  },
  "regression": {
    "save_dir": ".",
    "model_name": "LSTM",
    "loss_name": "ListMLE",
    "method": "regression",
    "ignore_cols": ["Open", "High", "Low", "Close", "Volume", "Dividends", "Stock Splits", "return", "label"]
  }
}
```
The `run_backtest` function performs backtesting on financial models and is particularly useful for evaluating the performance of trading strategies on historical data.
- Purpose: To conduct backtesting using historical price data and machine learning models.
- Parameters:
  - `model_name`: Name of the machine learning model used for predictions.
  - `method`: Specifies the type of machine learning task (e.g., 'classification').
  - `ticker`: The stock ticker for which the backtest is to be run.
  - `price_data_dir`: Directory containing the price data files.
Key processing steps (a condensed Backtrader sketch follows this list):
- Data Loading: Loads historical price data for the specified ticker from the given directory.
- Data Preprocessing: Renames columns, sets the index to date, and organizes data for backtesting.
- Backtrader Setup: Initializes the backtesting environment with the specified strategy and data.
- Running the Backtest: Executes the backtesting simulation and prints the portfolio values before and after the backtest.
- Result Processing: Analyzes the strategy's performance using metrics like returns and positions.
- Report Generation: Creates a detailed HTML report on the strategy's performance compared to a benchmark.
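A condensed Backtrader sketch of those steps, using synthetic price data and a trivial strategy, is shown below. The project's actual signal-driven strategy, analyzers, and HTML report generation are not reproduced here:

```python
import backtrader as bt
import numpy as np
import pandas as pd

# Synthetic daily OHLCV data standing in for a file from price_data_dir.
idx = pd.date_range("2022-01-03", periods=250, freq="B")
close = 100 + np.cumsum(np.random.randn(len(idx)))
df = pd.DataFrame({"open": close, "high": close + 1, "low": close - 1,
                   "close": close, "volume": 1_000_000}, index=idx)

class ToyStrategy(bt.Strategy):
    """Trivial strategy: enter once and hold (placeholder for real signals)."""
    def next(self):
        if not self.position:
            self.buy()

cerebro = bt.Cerebro()
cerebro.broker.setcash(100_000)
cerebro.adddata(bt.feeds.PandasData(dataname=df))
cerebro.addstrategy(ToyStrategy)
cerebro.addanalyzer(bt.analyzers.TimeReturn, _name="returns")

print("Start value:", cerebro.broker.getvalue())
results = cerebro.run()
print("Final value:", cerebro.broker.getvalue())

# Per-period returns from the analyzer, usable for report generation.
returns = pd.Series(results[0].analyzers.returns.get_analysis())
print(returns.tail())
```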
To execute the `run_backtest` function for backtesting, use the following command:

```bash
python Wang_Chongrui_main.py --config Wang_Chongrui_backtest.json --model_type None --function None --task backtest --data_source None
```
The JSON configuration file for the `run_backtest` function sets up parameters for backtesting:
- model_name: The machine learning model used for generating trading signals.
- method: The type of machine learning task, typically 'classification' or 'regression'.
- tickers: A list of stock tickers to run the backtest on.
- price_data_dir: Path to the directory where the price data and prediction files are stored.
```json
{
  "backtest": {
    "model_name": "LightGBM",
    "method": "classification",
    "tickers": ["FE"],
    "price_data_dir": "./Factor/"
  }
}
```