Propagate failure/preemption and set timeout for multinode training (#278)

* Add parameters in multinode training to propagate failure/preemption and set timeout.
yizhongw authored Aug 20, 2024
1 parent a9c76a4 commit 5834277
Showing 3 changed files with 9 additions and 0 deletions.
3 changes: 3 additions & 0 deletions configs/beaker_configs/default_finetune_lora_multinode.yaml
@@ -5,6 +5,9 @@ tasks:
 replicas: 4
 leaderSelection: true
 hostNetworking: true
+propagateFailure: true
+propagatePreemption: true
+synchronizedStartTimeout: 15m
 image:
 beaker: Yizhongw03/open-instruct-multi-node
 command: [
3 changes: 3 additions & 0 deletions configs/beaker_configs/default_finetune_multinode.yaml
@@ -6,6 +6,9 @@ tasks:
 replicas: 4
 leaderSelection: true
 hostNetworking: true
+propagateFailure: true
+propagatePreemption: true
+synchronizedStartTimeout: 15m
 image:
 beaker: nathanl/open_instruct_auto
 command: [
3 changes: 3 additions & 0 deletions configs/beaker_configs/default_finetune_qlora_multinode.yaml
@@ -5,6 +5,9 @@ tasks:
 replicas: 4
 leaderSelection: true
 hostNetworking: true
+propagateFailure: true
+propagatePreemption: true
+synchronizedStartTimeout: 15m
 image:
 beaker: Yizhongw03/open-instruct-multi-node
 command: [
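
Since the same three settings are added to each multinode config, a small consistency check may help keep the files in sync. The following is only a sketch and is not part of this commit: the function name `find_setting_problems` and the sample config text (including its indentation and the `name` field) are hypothetical, and the check is a plain-text scan for `key: value` lines, not a real YAML parse.

```python
import re

# Settings this commit adds to every multinode task (names and values from the diff).
REQUIRED_SETTINGS = {
    "propagateFailure": "true",
    "propagatePreemption": "true",
    "synchronizedStartTimeout": "15m",
}


def find_setting_problems(config_text: str) -> list[str]:
    """Return one message per required setting that is missing or differs.

    Hypothetical helper: scans for `key: value` lines at any indentation,
    ignores trailing `#` comments, and keeps the first occurrence of a key.
    """
    found = {}
    for line in config_text.splitlines():
        match = re.match(r"^\s*([A-Za-z]+)\s*:\s*([^#]*?)\s*(?:#.*)?$", line)
        if match:
            found.setdefault(match.group(1), match.group(2))

    problems = []
    for key, expected in REQUIRED_SETTINGS.items():
        if key not in found:
            problems.append(f"{key}: missing")
        elif found[key] != expected:
            problems.append(f"{key}: expected {expected}, found {found[key]}")
    return problems


# Hypothetical config text; the layout is assumed for illustration only.
sample = """\
tasks:
  - name: example
    replicas: 4
    leaderSelection: true
    hostNetworking: true
    propagateFailure: true
    synchronizedStartTimeout: 30m
"""

for problem in find_setting_problems(sample):
    print(problem)
```

To check the real files, one could read each of the three configs under configs/beaker_configs/ and pass its text to the function; an empty list means all three settings are present with the values from this commit.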
