Update output path for DSE runs by amaslenn · Pull Request #473 · NVIDIA/cloudai

amaslenn · 2025-04-11T11:12:44Z

Summary

Fix an issue where output_path wasn't updated for the test run object. Thanks @srivatsankrishnan for reporting this issue.

Test Plan

CI (extended)
Manual dry-run: cloudai dry-run --system-config conf/common/system/example_slurm_cluster.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nemo_run_llama3_8b.toml failed, but the failure is unrelated to this change:
```
File "/Users/andreyma/workspace/nvidia/cloudai/src/cloudai/workloads/nemo_run/nemo_run.py", line 143, in constraint_check
    constraint3 = gbs % (mbs * dp) == 0
ZeroDivisionError: integer division or modulo by zero
```
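For reference, the traceback shows the modulo failing because `mbs * dp` evaluates to zero. A minimal sketch of a guard is below; the function name `batch_constraint_ok` and its signature are hypothetical and are not taken from `nemo_run.py`. Only the expression `gbs % (mbs * dp) == 0` comes from the traceback.

```python
def batch_constraint_ok(gbs: int, mbs: int, dp: int) -> bool:
    """Check that the global batch size divides evenly by mbs * dp.

    Hypothetical helper for illustration; not the cloudai implementation.
    """
    denom = mbs * dp
    if denom <= 0:
        # A non-positive product means the configuration is invalid,
        # so report a failed constraint instead of raising ZeroDivisionError.
        return False
    return gbs % denom == 0
```

With a guard like this, a configuration whose product is zero would be rejected as a failed constraint rather than aborting the dry-run.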
Run cloudai run --system-config ../cw.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nemo_run_llama3_8b.toml on CW:

2025-04-11 04:47:52,204 - INFO - Running step 0 with action {'docker_image_url': 'nvcr.io/nvidia/nemo:24.12.rc3', 'task': 'pretrain', 'recipe_name': 'llama3_8b', 'num_layers': None, 'trainer.max_steps': 100, 'trainer.val_check_interval': 1000, 'trainer.num_nodes': 1, 'trainer.strategy.tensor_model_parallel_size': 1, 'trainer.strategy.pipeline_model_parallel_size': 1, 'trainer.strategy.context_parallel_size': 1, 'trainer.strategy.virtual_pipeline_model_parallel_size': None, 'trainer.plugins': None, 'trainer.callbacks': None, 'log.ckpt.save_on_train_epoch_end': False, 'log.ckpt.save_last': False, 'log.tensorboard': None, 'data.micro_batch_size': 1, 'data.global_batch_size': 128}
2025-04-11 04:47:52,250 - INFO - Starting test: dse_nemo_run_llama3_8b_1
2025-04-11 04:47:52,250 - INFO - Running test: dse_nemo_run_llama3_8b_1
...
2025-04-11 04:52:54,425 - DEBUG - All jobs finished successfully.
2025-04-11 04:52:54,430 - DEBUG - Getting metric default from /PATH/cloudai/results/nemo_run_llama3_8b_2025-04-11_04-47-52/dse_nemo_run_llama3_8b_1/0/1/stdout.txt
2025-04-11 04:52:54,440 - DEBUG - No train_step_timing found in results/nemo_run_llama3_8b_2025-04-11_04-47-52/dse_nemo_run_llama3_8b_1/0/1/stdout.txt
2025-04-11 04:52:54,441 - DEBUG - Writing trajectory into results/nemo_run_llama3_8b_2025-04-11_04-47-52/dse_nemo_run_llama3_8b_1/0/trajectory.csv (exists: False)
2025-04-11 04:52:54,447 - INFO - Step 1: Observation: [-1.0], Reward: -1.0
2025-04-11 04:52:54,448 - INFO - Running step 1 with action {'docker_image_url': 'nvcr.io/nvidia/nemo:24.12.rc3', 'task': 'pretrain', 'recipe_name': 'llama3_8b', 'num_layers': None, 'trainer.max_steps': 100, 'trainer.val_check_interval': 1000, 'trainer.num_nodes': 1, 'trainer.strategy.tensor_model_parallel_size': 1, 'trainer.strategy.pipeline_model_parallel_size': 1, 'trainer.strategy.context_parallel_size': 1, 'trainer.strategy.virtual_pipeline_model_parallel_size': None, 'trainer.plugins': None, 'trainer.callbacks': None, 'log.ckpt.save_on_train_epoch_end': False, 'log.ckpt.save_last': False, 'log.tensorboard': None, 'data.micro_batch_size': 1, 'data.global_batch_size': 256}
2025-04-11 04:52:54,546 - INFO - Starting test: dse_nemo_run_llama3_8b_1
2025-04-11 04:52:54,547 - INFO - Running test: dse_nemo_run_llama3_8b_1
...

The "No train_step_timing found" message is real: execution failed, and I see some errors coming from NCCL: Accept failed Resource temporarily unavailable (cw-dfw-h100-004-104-026:3132825:3139333 [5] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable). The file path is correct, though.

Additional Notes

—

TaekyungHeo

As a new argument (step) has been added, I think we need @srivatsankrishnan's approval. Other than that, it all looks good to me. If you have already talked to Srivatsan, you can merge.

srivatsankrishnan

This will break the interface for other ML algorithms. By design, step should take only actions. We need to figure out how to make the step count an instance variable of the gym.

Testing this fix and will confirm if it works for me.

amaslenn · 2025-04-11T16:17:02Z

> This will break the interface for other ML algorithms. step by design should have only actions. Need to figure out how to make step count as instance variable of gym.
>
> Testing this fix and will confirm if it works for me.

Restored the API, kept the fix. Please check if that works for you.
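To illustrate the idea discussed above (step takes only the action, while the step count lives on the environment instance), here is a minimal sketch. The class `SketchEnv`, its attributes, and the `<base>/<iteration>/<step>` directory layout are assumptions made for illustration, loosely modeled on the paths in the logs; this is not the cloudai code.

```python
from pathlib import Path
from typing import Any, Dict, List, Tuple


class SketchEnv:
    """Hypothetical gym-like environment; names are illustrative only."""

    def __init__(self, base_output: Path, iteration: int = 0) -> None:
        self.base_output = base_output
        self.iteration = iteration
        # The step count is instance state, so step() keeps its
        # action-only signature for other ML algorithms.
        self.current_step = 0

    def step_output_path(self) -> Path:
        # Assumed layout: <base_output>/<iteration>/<step>
        return self.base_output / str(self.iteration) / str(self.current_step)

    def step(
        self, action: Dict[str, Any]
    ) -> Tuple[List[float], float, bool, Dict[str, Any]]:
        # Advance the counter first so each step gets a fresh output path.
        self.current_step += 1
        output_path = self.step_output_path()
        # Placeholder observation and reward; a real environment would
        # run the workload and read metrics from output_path.
        observation = [-1.0]
        reward = -1.0
        info = {"output_path": output_path, "action": action}
        return observation, reward, False, info
```

Calling `step` twice on `SketchEnv(Path("results/test"))` would yield output paths ending in `0/1` and then `0/2`, without the caller ever passing a step number.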

srivatsankrishnan

Thanks for the quick fix. Functionality-wise, it was tested in conjunction with this PR:
#466

Tested with DSE on four models (logs available here); the observation/reward is now passed correctly:
https://drive.google.com/drive/folders/1XP1alXX80AkLnMdGeb1AHOG9tHQ_LOIy

Update output path for DSE runs

f048ac0

amaslenn requested review from TaekyungHeo, srinivas212 and srivatsankrishnan as code owners April 11, 2025 11:12

amaslenn added the bug and enhancement labels Apr 11, 2025

amaslenn added 2 commits April 11, 2025 04:59

Extend logging

a696392

Merge branch 'main' into am/dse-upd-path

e4f5c89

TaekyungHeo previously approved these changes Apr 11, 2025

srivatsankrishnan reviewed Apr 11, 2025

Restore gym API

626175f

amaslenn dismissed TaekyungHeo’s stale review via 626175f April 11, 2025 16:16

TaekyungHeo approved these changes Apr 11, 2025

srivatsankrishnan mentioned this pull request Apr 11, 2025

Nemo2.0 with Recipe for Complex CLI Features (Plan B) #466

Merged

srivatsankrishnan approved these changes Apr 14, 2025

amaslenn merged commit 18c2283 into main Apr 14, 2025
2 checks passed

amaslenn deleted the am/dse-upd-path branch April 14, 2025 14:22

amaslenn merged 4 commits into main from am/dse-upd-path

3 participants
