MLflow autologging issue #1618 (#2092)
@@ -25,6 +25,7 @@ We assume that you already know about covariates in Darts. If you're new to the
- [Callbacks](#callbacks)
- [Early Stopping](#example-with-early-stopping)
- [Custom Callback](#example-of-custom-callback-to-store-losses)
- [MLFlow: train, track and monitor](#example-with-mlflow-autologging)

4. [Performance optimisation section](#performance-recommendations) lists tricks to speed up the computation during training.
@@ -461,6 +462,109 @@ model.fit(...)
*Note*: The callback will record one extra element in `loss_logger.val_loss`, as the model trainer performs a validation sanity check before training begins.
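For reference, a loss-logging callback of the kind referenced in that note could look like the following minimal sketch. It is not part of this diff; it assumes PyTorch Lightning's `Callback` API and the `train_loss`/`val_loss` metric names that Darts' torch-based models log.

```python
from pytorch_lightning.callbacks import Callback


class LossLogger(Callback):
    """Collect per-epoch training and validation losses."""

    def __init__(self):
        self.train_loss = []
        self.val_loss = []

    def on_train_epoch_end(self, trainer, pl_module):
        self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

    def on_validation_epoch_end(self, trainer, pl_module):
        # this also fires during the pre-training sanity check, which is why
        # val_loss ends up with one more element than train_loss
        self.val_loss.append(float(trainer.callback_metrics["val_loss"]))


loss_logger = LossLogger()
# pass the callback to any Darts TorchForecastingModel, e.g.:
# model = NBEATSModel(..., pl_trainer_kwargs={"callbacks": [loss_logger]})
```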
#### Example with MLflow Autologging

This example shows how to use the MLflow user interface (UI) and autologging to track Darts' PyTorch-based models.
```python
import pandas as pd
import torchmetrics

from darts.dataprocessing.transformers import Scaler
from darts.datasets import AirPassengersDataset
from darts.models import NBEATSModel

# read data
series = AirPassengersDataset().load()

# create training and validation sets
train, val = series.split_after(pd.Timestamp(year=1957, month=12, day=1))

# normalize the time series
transformer = Scaler()
train = transformer.fit_transform(train)
val = transformer.transform(val)

# any TorchMetric or val_loss can be used as the monitor
torch_metrics = torchmetrics.regression.MeanAbsolutePercentageError()

# MLflow setup
# with the environment activated, run: mlflow ui --port xxxx (e.g. 5000, 5001, 5002)
# then copy and paste the URL printed on the command line into a web browser
import mlflow
import mlflow.pytorch
from mlflow.data.pandas_dataset import PandasDataset

mlflow.pytorch.autolog(
    log_every_n_epoch=1,
    log_every_n_step=None,
    log_models=True,
    log_datasets=True,
    disable=False,
    exclusive=False,
    disable_for_unsupported_versions=False,
    silent=False,
    registered_model_name=None,
    extra_tags=None,
)

with mlflow.start_run(nested=True) as run:

    dataset: PandasDataset = mlflow.data.from_pandas(
        series.pd_dataframe(), source="AirPassengersDataset"
    )

    # Log the dataset to the MLflow Run. Specify the "training" context to
    # indicate that the dataset is used for model training
    mlflow.log_input(dataset, context="training")

    # Define model hyperparameters to log
    params = {
        "model_type": "Darts_Pytorch_model",
        "input_chunk_length": 24,
        "output_chunk_length": 12,
        "n_epochs": 500,
        "model_name": "NBEATS_MLflow",
        "log_tensorboard": True,
        "torch_metrics": "torchmetrics.regression.MeanAbsolutePercentageError()",
        "nr_epochs_val_period": 1,
    }

    # Log hyperparameters
    mlflow.log_params(params)

    # create the model
    model = NBEATSModel(
        input_chunk_length=24,
        output_chunk_length=12,
        n_epochs=500,
        model_name="NBEATS_MLflow",
        log_tensorboard=True,
        torch_metrics=torch_metrics,
        nr_epochs_val_period=1,
    )

    # use the validation dataset during training
    model.fit(
        series=train,
        val_series=val,
    )

    # predict
    forecast = model.predict(len(val))

    # default conda environment used by the mlflow pytorch flavour (for reference)
    mlflow.pytorch.get_default_conda_env()

    # default pip requirements (for reference)
    mlflow.pytorch.get_default_pip_requirements()

    # set the tracking uri
    mlflow.set_tracking_uri("sqlite:///mlruns.db")

    # save the Darts model as a run artifact (this needs to be added in a new cell)
    mlflow.log_artifact("NBeatsModel.pickle")

    # register the model
    model_name = "NBEATS"
    model_uri = f"runs:/{run.info.run_id}/darts-NBEATS"
    mlflow.register_model(model_uri=model_uri, name=model_name)
```

Review discussion on saving, registering and loading the model:

- IMHO runs currently do not save models as artifacts because there is no call to …
- I will need to have a look at this; my understanding is that it saves it both when using `mlflow.<model_flavor>.log_model()` or `mlflow.register_model()`. Refer to https://mlflow.org/docs/latest/model-registry.html.
- As I understand it, you can only (kind of) promote an already saved model to a registered model.
- @cargecla1 Yes, that's a good idea to split these two things. If using the MLflow model registry for Darts completely fails, you could also mention the workaround I proposed in the issue, which was manually saving/loading the model as an artifact. Something like `dartsmodel.save("mymodel.pickle")` followed by `mlflow.log_artifact("mymodel.pickle")`, and later loading the artifact from MLflow and doing e.g. `dartsmodel = RNNModel.load("mymodel.pickle")`. Let me know if I can help with anything!
- Hi @cla-ra3426, sorry I missed your question. No idea, but it must have something to do with Darts itself and not MLflow, I suppose.
- Hello @turbotimon, @madtoinou, @dennisbader, have you had time to consider my proposal above? "Is it possible to split this issue as discussed above, so we can share how to use, train, track, monitor and save the models using MLflow, leaving loading the model for a later release?" The only solution I have right now is to train the model with MLflow to track and monitor it, and then retrain it with pure Darts to be able to save it and load it for prediction: Darts' saving method "fails" when run inside an MLflow run, and MLflow's log and save methods don't work with Darts either. Thank you in advance for your consideration!
- Hi @cargecla1, sorry for the delay. Loading the model is kind of part of the linked issue, so it would probably be better to also include it in this PR, but if it is too troublesome we can treat it separately. The error you're getting when trying to pickle the model is probably due to the (pytorch-lightning) callbacks. Can you try removing them prior to exporting the model? I will test this example more thoroughly when I have more time and see if I can come up with a solution for the loading.
- Hello @madtoinou, I will try this and see how it goes. Thanks for the suggestion, and for considering splitting the problem into two if the above doesn't work.
- Hello @madtoinou, I took your suggestion on board as per the above, but that didn't fix the issue with the loading and predicting steps. Refer to commit f798244.
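Based on the workaround discussed in the thread above, a hypothetical sketch of saving a trained Darts model as a plain run artifact and loading it back later could look like the following. The file names and the tiny throwaway model are illustrative only; `mlflow.artifacts.download_artifacts` is standard MLflow, but this approach is not part of the PR and is untested here.

```python
import os

import mlflow
from darts.datasets import AirPassengersDataset
from darts.models import NBEATSModel

series = AirPassengersDataset().load()

with mlflow.start_run() as run:
    # a tiny model just to illustrate the save/log round-trip
    model = NBEATSModel(input_chunk_length=24, output_chunk_length=12, n_epochs=1)
    model.fit(series)

    # if saving fails here, the thread above suggests removing custom
    # pytorch-lightning callbacks from the model before exporting it
    model.save("NBeatsModel.pickle")                # Darts' own save()
    mlflow.log_artifact("NBeatsModel.pickle")       # pickled model object
    mlflow.log_artifact("NBeatsModel.pickle.ckpt")  # checkpoint written alongside by torch-based models

# later, possibly in a fresh session: download the artifacts and load them with Darts
dst = mlflow.artifacts.download_artifacts(run_id=run.info.run_id, dst_path="restored")
model = NBEATSModel.load(os.path.join(dst, "NBeatsModel.pickle"))
forecast = model.predict(12)
```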
## Performance Recommendations | ||
This section recaps the main factors impacting the performance when | ||
training and using torch-based models. | ||
Review discussion on the new table-of-contents entry:

- I think that the indentation is off, can you shift it to the left once so that it's at the same level as "Callbacks"?
- Hello @madtoinou, I fixed this via commit b839c67.