Commit 6c39d25

add script to train with ft
Summary: the script adds configuration options for running training locally with fault tolerance (ft) enabled.
1 parent 634d838 commit 6c39d25

1 file changed: +24 -0 lines changed

run_train_ft.sh

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+#!/bin/bash
+
+FT_REPLICA_ID="${FT_REPLICA_ID:-0}"
+FT_GROUP_SIZE="${FT_GROUP_SIZE:-1}"
+
+TORCH_SHARE_RDZV_TCP_STORE=1 LOGLEVEL=INFO NCCL_DEBUG_SUBSYS=ALL NCCL_DEBUG=INFO TORCH_CPP_LOG_LEVEL=INFO CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES="${FT_REPLICA_ID}" NGPU=1 ./run_train.sh \
+    --fault_tolerance.enable \
+    --fault_tolerance.group_size="${FT_GROUP_SIZE}" \
+    --fault_tolerance.replica_id="${FT_REPLICA_ID}" \
+    --training.local_batch_size=2 \
+    --fault_tolerance.sync_steps=10 \
+    --fault_tolerance.semi_sync_method=diloco \
+    --parallelism.data_parallel_shard_degree=1 \
+    --fault_tolerance.num_fragments=2 \
+    --experimental.custom_args_module=torchtitan.components.ft.config \
+    --profiling.enable_profiling \
+    --profiling.profile_freq=5 \
+    --profiling.profiler_active=5 \
+    --profiling.profiler_warmup=0 \
+    --training.steps=1000 \
+    --comm.train_timeout_seconds=1 \
+    --fault_tolerance.process_group=nccl \
+    --checkpoint.no_enable_ft_checkpointing \
+    --fault_tolerance.process_group_timeout_ms=1
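
Usage sketch (not part of the commit): the script reads `FT_REPLICA_ID` and `FT_GROUP_SIZE` from the environment, defaulting to `0` and `1`, and pins each replica to the GPU whose index equals its replica ID through `CUDA_VISIBLE_DEVICES`. Assuming one replica per GPU, the helper below prints the command line for each replica in a group instead of executing it. `ft_launch_cmds` is a hypothetical name introduced here for illustration. Any additional services the fault-tolerance setup may require are not covered by this commit and are not shown.

```shell
#!/bin/bash
# Hypothetical helper (not in the commit): print one launch command per replica.
# The argument plays the role of FT_GROUP_SIZE and uses the same default of 1.
ft_launch_cmds() {
  local group_size="${1:-1}"
  local replica_id
  for ((replica_id = 0; replica_id < group_size; replica_id++)); do
    # run_train_ft.sh sets CUDA_VISIBLE_DEVICES="${FT_REPLICA_ID}",
    # so replica N is expected to run on GPU N.
    echo "FT_REPLICA_ID=${replica_id} FT_GROUP_SIZE=${group_size} ./run_train_ft.sh"
  done
}

# Dry run for a two-replica group: prints the commands, runs nothing.
ft_launch_cmds 2
```

Each printed line would typically be run in its own terminal or as a background job. Running `./run_train_ft.sh` with neither variable set corresponds to the single-replica defaults.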
