HPC and non-python checkpointing #9245
PythonFZ started this conversation in New Features & Ideas
-
Sorry, I see that this was part of the discussion in #8625, but this still doesn't seem actionable yet, so I've converted it back to a discussion.
-
My workflow
I'm using DVC with CLI tools that provide their own checkpointing solutions (e.g. LAMMPS or CP2K). The main motivation for checkpointing is not to go back to an earlier model, as is often the case in ML, but to continue a simulation and append to existing output files (which might be > 100 GB). This is often required because simulations are carried out on HPC resources with strict time limits (hours to a few days). Reaching the time limit means that the job is killed immediately, so the DVC process won't have time to do anything. (Some clusters send a signal before stopping, but this is not guaranteed.)
My current solution to this problem is to run the simulation in a temporary directory and move the outputs to the correct location only after the process has finished. This way I can call `dvc repro` multiple times without DVC removing the created checkpoint files. The corresponding `dvc.yaml` file could look something like:
Unfortunately, this doesn't work with `dvc exp` and feels a bit hacky as well.

Potential Solutions
There are a few things that would help me (and potentially many others who haven't used DVC yet in the field of HPC):

- `dvc queue start --keep-failed`. This way, after the experiment was killed, one could do `dvc exp apply <id>` and start a new experiment from the checkpoint files that would otherwise be removed. The data would then only be removed using `dvc exp remove`.
- A `restart_if_killed` option per stage (see the sketch after this list). When a job is killed, DVC has no chance to clean up, but it should be possible to detect this afterwards (e.g., from a leftover `rwlock` if that happened). With this option, DVC would not remove all files on the next call, but would run `cmd` again in the workspace. Only if that fails with some exception would the command end.
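To make the second idea concrete, such an option could sit next to `cmd` in `dvc.yaml`. This is only a sketch of the proposal; `restart_if_killed` does not exist in DVC today, and the names are placeholders:

```yaml
stages:
  simulation:
    cmd: bash run_simulation.sh
    # Proposed (hypothetical) option: if the previous run was killed
    # (e.g., detected via a leftover rwlock), keep the existing outputs
    # and rerun cmd in the workspace, so the solver can continue from
    # its own checkpoint files instead of starting from scratch.
    restart_if_killed: true
    outs:
      - trajectory.dump
      - restart.chk
```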
Additional Thoughts

Some ideas that might also be really helpful to the HPC community. We have our own queuing system, so running multiple experiments in parallel by using something like #8121 would be really powerful.

I also saw that the DVC queue is written in a modular way and might in principle support more than celery. For HPC, the dask distributed package is often used because it supports many HPC platforms. I've written dask4dvc for some compatibility, but implementing that into DVC (maybe `pip install dvc[dask]` or `dvc-dask` as an extra dependency) could potentially also be really powerful (maybe even for parallelizing `dvc repro`). This could also be one possible solution for #755.
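To illustrate the idea, here is a rough sketch of what a dask-backed runner could do, similar in spirit to dask4dvc. The stage names and the local cluster are assumptions; on an HPC system one would typically swap in something like a `dask_jobqueue.SLURMCluster`:

```python
import subprocess

from dask.distributed import Client


def repro_stage(name: str, *upstream: str) -> str:
    # Reproduce a single DVC stage. The upstream arguments exist only
    # so that dask schedules this task after its dependencies finish.
    subprocess.check_call(["dvc", "repro", "--single-item", name])
    return name


if __name__ == "__main__":
    client = Client()  # local cluster here; a SLURMCluster on HPC
    # Hypothetical two-stage pipeline: 'analysis' depends on 'simulation'.
    sim = client.submit(repro_stage, "simulation")
    ana = client.submit(repro_stage, "analysis", sim)
    client.gather(ana)
```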