Conversation
TaekyungHeo
left a comment
As a new argument (step) has been added, I think we need @srivatsankrishnan's approval. Other than that, all looks good to me. If you have already talked to Srivatsan, you can merge.
srivatsankrishnan
left a comment
This will break the interface for other ML algorithms. By design, step should take only actions. We need to figure out how to make the step count an instance variable of the gym environment.
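A minimal sketch of what this could look like: the environment tracks its own step count internally, so `step()` keeps the standard gym-style signature. All names here are illustrative stand-ins, not CloudAI's actual classes.

```python
# Hypothetical sketch: step count as an instance variable instead of a
# step() argument. CloudAIGymEnv is a stand-in name, not the real class.
class CloudAIGymEnv:
    def __init__(self):
        self.step_count = 0  # tracked internally, not passed by the caller

    def reset(self):
        self.step_count = 0
        return None  # placeholder for the initial observation

    def step(self, action):
        # Standard gym-style signature: the caller supplies only the action.
        self.step_count += 1
        observation, reward, done, info = None, 0.0, False, {}
        return observation, reward, done, info
```

This way, callers that follow the usual `env.step(action)` contract keep working, and the environment can still use `self.step_count` (e.g. for per-step output paths).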
Testing this fix and will confirm if it works for me.
Restored the API and kept the fix. Please check if that works for you.
srivatsankrishnan
left a comment
Thanks for the quick fix. Functionality-wise, it was tested in conjunction with this PR:
#466
Tested with DSE on four models (logs available here), and the observation/reward is passed correctly now:
https://drive.google.com/drive/folders/1XP1alXX80AkLnMdGeb1AHOG9tHQ_LOIy
Summary
Fixes an issue where output_path wasn't updated for the test run object. Thanks @srivatsankrishnan for reporting this issue.
Test Plan
```
cloudai dry-run --system-config conf/common/system/example_slurm_cluster.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nemo_run_llama3_8b.toml
```

but it is unrelated:

```
cloudai run --system-config ../cw.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nemo_run_llama3_8b.toml
```

on CW:

```
2025-04-11 04:47:52,204 - INFO - Running step 0 with action {'docker_image_url': 'nvcr.io/nvidia/nemo:24.12.rc3', 'task': 'pretrain', 'recipe_name': 'llama3_8b', 'num_layers': None, 'trainer.max_steps': 100, 'trainer.val_check_interval': 1000, 'trainer.num_nodes': 1, 'trainer.strategy.tensor_model_parallel_size': 1, 'trainer.strategy.pipeline_model_parallel_size': 1, 'trainer.strategy.context_parallel_size': 1, 'trainer.strategy.virtual_pipeline_model_parallel_size': None, 'trainer.plugins': None, 'trainer.callbacks': None, 'log.ckpt.save_on_train_epoch_end': False, 'log.ckpt.save_last': False, 'log.tensorboard': None, 'data.micro_batch_size': 1, 'data.global_batch_size': 128}
2025-04-11 04:47:52,250 - INFO - Starting test: dse_nemo_run_llama3_8b_1
2025-04-11 04:47:52,250 - INFO - Running test: dse_nemo_run_llama3_8b_1
...
2025-04-11 04:52:54,425 - DEBUG - All jobs finished successfully.
2025-04-11 04:52:54,430 - DEBUG - Getting metric default from /PATH/cloudai/results/nemo_run_llama3_8b_2025-04-11_04-47-52/dse_nemo_run_llama3_8b_1/0/1/stdout.txt
2025-04-11 04:52:54,440 - DEBUG - No train_step_timing found in results/nemo_run_llama3_8b_2025-04-11_04-47-52/dse_nemo_run_llama3_8b_1/0/1/stdout.txt
2025-04-11 04:52:54,441 - DEBUG - Writing trajectory into results/nemo_run_llama3_8b_2025-04-11_04-47-52/dse_nemo_run_llama3_8b_1/0/trajectory.csv (exists: False)
2025-04-11 04:52:54,447 - INFO - Step 1: Observation: [-1.0], Reward: -1.0
2025-04-11 04:52:54,448 - INFO - Running step 1 with action {'docker_image_url': 'nvcr.io/nvidia/nemo:24.12.rc3', 'task': 'pretrain', 'recipe_name': 'llama3_8b', 'num_layers': None, 'trainer.max_steps': 100, 'trainer.val_check_interval': 1000, 'trainer.num_nodes': 1, 'trainer.strategy.tensor_model_parallel_size': 1, 'trainer.strategy.pipeline_model_parallel_size': 1, 'trainer.strategy.context_parallel_size': 1, 'trainer.strategy.virtual_pipeline_model_parallel_size': None, 'trainer.plugins': None, 'trainer.callbacks': None, 'log.ckpt.save_on_train_epoch_end': False, 'log.ckpt.save_last': False, 'log.tensorboard': None, 'data.micro_batch_size': 1, 'data.global_batch_size': 256}
2025-04-11 04:52:54,546 - INFO - Starting test: dse_nemo_run_llama3_8b_1
2025-04-11 04:52:54,547 - INFO - Running test: dse_nemo_run_llama3_8b_1
...
```

`No train_step_timing found in` is real: the execution failed. I see some errors coming from NCCL:

```
cw-dfw-h100-004-104-026:3132825:3139333 [5] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
```

But the file is correct.

Additional Notes
—