
Update output path for DSE runs #473

Merged
amaslenn merged 4 commits into main from am/dse-upd-path
Apr 14, 2025

@amaslenn
Contributor

@amaslenn amaslenn commented Apr 11, 2025

Summary

Fix an issue where output_path wasn't updated for the test run object. Thanks @srivatsankrishnan for reporting this issue.

Test Plan

  1. CI (extended)
  2. Manual dry-run, got a failure for cloudai dry-run --system-config conf/common/system/example_slurm_cluster.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nemo_run_llama3_8b.toml, but it is unrelated:
    File "/Users/andreyma/workspace/nvidia/cloudai/src/cloudai/workloads/nemo_run/nemo_run.py", line 143, in constraint_check
        constraint3 = gbs % (mbs * dp) == 0
    ZeroDivisionError: integer division or modulo by zero
  3. Run cloudai run --system-config ../cw.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nemo_run_llama3_8b.toml on CW:
2025-04-11 04:47:52,204 - INFO - Running step 0 with action {'docker_image_url': 'nvcr.io/nvidia/nemo:24.12.rc3', 'task': 'pretrain', 'recipe_name': 'llama3_8b', 'num_layers': None, 'trainer.max_steps': 100, 'trainer.val_check_interval': 1000, 'trainer.num_nodes': 1, 'trainer.strategy.tensor_model_parallel_size': 1, 'trainer.strategy.pipeline_model_parallel_size': 1, 'trainer.strategy.context_parallel_size': 1, 'trainer.strategy.virtual_pipeline_model_parallel_size': None, 'trainer.plugins': None, 'trainer.callbacks': None, 'log.ckpt.save_on_train_epoch_end': False, 'log.ckpt.save_last': False, 'log.tensorboard': None, 'data.micro_batch_size': 1, 'data.global_batch_size': 128}
2025-04-11 04:47:52,250 - INFO - Starting test: dse_nemo_run_llama3_8b_1
2025-04-11 04:47:52,250 - INFO - Running test: dse_nemo_run_llama3_8b_1
...
2025-04-11 04:52:54,425 - DEBUG - All jobs finished successfully.
2025-04-11 04:52:54,430 - DEBUG - Getting metric default from /PATH/cloudai/results/nemo_run_llama3_8b_2025-04-11_04-47-52/dse_nemo_run_llama3_8b_1/0/1/stdout.txt
2025-04-11 04:52:54,440 - DEBUG - No train_step_timing found in results/nemo_run_llama3_8b_2025-04-11_04-47-52/dse_nemo_run_llama3_8b_1/0/1/stdout.txt
2025-04-11 04:52:54,441 - DEBUG - Writing trajectory into results/nemo_run_llama3_8b_2025-04-11_04-47-52/dse_nemo_run_llama3_8b_1/0/trajectory.csv (exists: False)
2025-04-11 04:52:54,447 - INFO - Step 1: Observation: [-1.0], Reward: -1.0
2025-04-11 04:52:54,448 - INFO - Running step 1 with action {'docker_image_url': 'nvcr.io/nvidia/nemo:24.12.rc3', 'task': 'pretrain', 'recipe_name': 'llama3_8b', 'num_layers': None, 'trainer.max_steps': 100, 'trainer.val_check_interval': 1000, 'trainer.num_nodes': 1, 'trainer.strategy.tensor_model_parallel_size': 1, 'trainer.strategy.pipeline_model_parallel_size': 1, 'trainer.strategy.context_parallel_size': 1, 'trainer.strategy.virtual_pipeline_model_parallel_size': None, 'trainer.plugins': None, 'trainer.callbacks': None, 'log.ckpt.save_on_train_epoch_end': False, 'log.ckpt.save_last': False, 'log.tensorboard': None, 'data.micro_batch_size': 1, 'data.global_batch_size': 256}
2025-04-11 04:52:54,546 - INFO - Starting test: dse_nemo_run_llama3_8b_1
2025-04-11 04:52:54,547 - INFO - Running test: dse_nemo_run_llama3_8b_1
...

The "No train_step_timing found in ..." message is real: execution failed, and I see some errors coming from NCCL: cw-dfw-h100-004-104-026:3132825:3139333 [5] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable. But the file path is correct.
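The ZeroDivisionError from the dry run in step 2 comes from evaluating gbs % (mbs * dp) when the product is zero. A minimal sketch of a guarded check follows. The function name and the choice to treat a non-positive denominator as a failed constraint are assumptions for illustration, not the code in nemo_run.py:

```python
def batch_constraint_ok(gbs: int, mbs: int, dp: int) -> bool:
    """Return True when the global batch size is divisible by mbs * dp.

    Hypothetical helper: a zero (or negative) denominator is reported as a
    failed constraint instead of raising ZeroDivisionError.
    """
    denom = mbs * dp
    if denom <= 0:
        # Assumption: an unset or zero data-parallel size means the
        # configuration cannot satisfy the constraint.
        return False
    return gbs % denom == 0
```

Whether a zero dp should count as "constraint not met" or be reported as a configuration error is a design choice this sketch does not settle.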

Additional Notes

@amaslenn amaslenn added the bug and enhancement labels Apr 11, 2025
TaekyungHeo
TaekyungHeo previously approved these changes Apr 11, 2025
Member

@TaekyungHeo TaekyungHeo left a comment

As a new argument (step) has been added, I think we need @srivatsankrishnan 's approval. Other than that, all look good to me. If you already talked to Srivatsan, you can merge.

Contributor

@srivatsankrishnan srivatsankrishnan left a comment

This will break the interface for other ML algorithms. step by design should have only actions. Need to figure out how to make step count as instance variable of gym.

Testing this fix and will confirm if it works for me.

@amaslenn
Contributor Author

> This will break the interface for other ML algorithms. step by design should have only actions. Need to figure out how to make step count as instance variable of gym.
>
> Testing this fix and will confirm if it works for me.

Restored the API, kept the fix. Please check if that works for you.
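For readers following the discussion, here is a rough sketch of the idea raised in the review: step keeps taking only the action, while the step count lives on the environment instance and drives the per-step output path. The class, attribute names, and the base/iteration/step layout are all hypothetical (the layout is only inferred from the .../0/1/stdout.txt paths in the log above); this is not the code merged in this PR.

```python
from pathlib import Path
from typing import Any


class SketchEnv:
    """Hypothetical gym-style environment; not the cloudai implementation."""

    def __init__(self, base_output_path: Path, iteration: int = 0) -> None:
        self.base_output_path = base_output_path
        self.iteration = iteration
        self.current_step = 0  # step count is instance state, not a step() argument
        self.output_path = base_output_path

    def step(self, action: dict[str, Any]) -> tuple[list[float], float, bool, dict[str, Any]]:
        """Advance one step; the signature takes only the action."""
        self.current_step += 1
        # Refresh the output path before running, so results for this step
        # land in (and are later read from) a step-specific directory.
        self.output_path = (
            self.base_output_path / str(self.iteration) / str(self.current_step)
        )
        observation = self._run(action)
        reward = observation[0]
        return observation, reward, False, {"output_path": self.output_path}

    def _run(self, action: dict[str, Any]) -> list[float]:
        # Placeholder for launching the workload and parsing its metric.
        return [-1.0]
```

Because the counter is internal, agents that call env.step(action) need no changes, which is the interface concern raised above.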

Contributor

@srivatsankrishnan srivatsankrishnan left a comment

Thanks for the quick fix. Functionality-wise, it was tested in conjunction with this PR:
#466

Tested with DSE on four models (logs available here) and able to pass the observation/reward correctly now:
https://drive.google.com/drive/folders/1XP1alXX80AkLnMdGeb1AHOG9tHQ_LOIy

@amaslenn amaslenn merged commit 18c2283 into main Apr 14, 2025
2 checks passed
@amaslenn amaslenn deleted the am/dse-upd-path branch April 14, 2025 14:22