HPC and non-python checkpointing #9245
PythonFZ started this conversation in New Features & Ideas
-
Sorry, I see that this was part of the discussion in #8625, but this still doesn't seem actionable yet, so I've converted it back to a discussion.
-
My workflow
I'm using DVC with CLI tools that provide their own checkpointing solutions (e.g. LAMMPS or CP2K). The main motivation for checkpointing is not to go back to an earlier model, as is often the case in ML, but to continue a simulation and append to existing output files (which might be > 100 GB). This is often required because simulations are carried out on HPC resources with strict time limits (hours to a few days). Reaching the time limit means that the job is killed immediately, so the DVC process won't have time to do anything. (Some clusters send a signal before stopping, but this is not guaranteed.)
My current solution to this problem is to run the simulation in a temporary directory and move the outputs to the correct location only after the process has finished. This way I can call `dvc repro` multiple times without DVC removing the created checkpoint files. The corresponding `dvc.yaml` file could look something like:
Unfortunately, this doesn't work with `dvc exp` and feels a bit hacky as well.

Potential Solutions
There are a few things that would help me (and potentially many others who haven't used DVC yet in the field of HPC):

- `dvc queue start --keep-failed`. This way, after the experiment was killed, one could do `dvc exp apply <id>` and start a new experiment from the checkpoint files that would otherwise be removed. The data would then only be removed using `dvc exp remove`.
- A `restart_if_killed` option per stage (see the sketch after this list). When a job is killed, DVC has no chance to clean up, but it should be possible to detect this afterwards (e.g., from a leftover `rwlock` if that happened). With this option, DVC would not remove all files on the next call, but would run `cmd` again in the workspace. Only if that fails with some exception would the command end.
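To make the second idea concrete, such an option could sit next to `cmd` in `dvc.yaml`. This is only a sketch of the proposal; `restart_if_killed` does not exist in DVC today, and the names are placeholders:

```yaml
stages:
  simulation:
    cmd: bash run_simulation.sh
    # Proposed (hypothetical) option: if the previous run was killed
    # (e.g., detected via a leftover rwlock), keep the existing outputs
    # and rerun cmd in the workspace, so the solver can continue from
    # its own checkpoint files instead of starting from scratch.
    restart_if_killed: true
    outs:
      - trajectory.dump
      - restart.chk
```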
Additional Thoughts

Some ideas that might also be really helpful to the HPC community. We have our own queuing system, so running multiple experiments in parallel by using something like #8121 would be really powerful.

I also saw that the DVC queue is written in a modular way and might in principle support more than celery. For HPC, the dask distributed package is often used because it supports many HPC platforms. I've written dask4dvc for some compatibility, but implementing that into DVC (maybe `pip install dvc[dask]` or `dvc-dask` as an extra dependency) could potentially also be really powerful (maybe even for parallelizing `dvc repro`). This could also be one possible solution for #755.
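To illustrate the idea, here is a rough sketch of what a dask-backed runner could do, similar in spirit to dask4dvc. The stage names and the local cluster are assumptions; on an HPC system one would typically swap in something like a `dask_jobqueue.SLURMCluster`:

```python
import subprocess

from dask.distributed import Client


def repro_stage(name: str, *upstream: str) -> str:
    # Reproduce a single DVC stage. The upstream arguments exist only
    # so that dask schedules this task after its dependencies finish.
    subprocess.check_call(["dvc", "repro", "--single-item", name])
    return name


if __name__ == "__main__":
    client = Client()  # local cluster here; a SLURMCluster on HPC
    # Hypothetical two-stage pipeline: 'analysis' depends on 'simulation'.
    sim = client.submit(repro_stage, "simulation")
    ana = client.submit(repro_stage, "analysis", sim)
    client.gather(ana)
```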