Conversation
hey @roclark, Hemil mentioned that you built the initial DGXC executor. Before diving too deep into the work, I wanted to prefetch your thoughts on this PR. Do you agree with the implementation design to fetch container logs?
Hey @ko3n1g, I haven't worked on the Run:ai pieces in a while, but I think this makes sense. My only question would be if the API server is something that is always exposed/available by default to users on installations or if that needs to be configured by an admin beforehand. |
Signed-off-by: oliver könig <okoenig@nvidia.com>
Force-pushed from f2b1046 to 0c6b7c9
nemo_run/core/execution/dgxcloud.py
  from enum import Enum
  from pathlib import Path
- from typing import Any, Optional, Type
+ from typing import Any, Dict, Iterable, Optional
Can we remove Dict and just use dict for the types?
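Since Python 3.9 (PEP 585), the built-in `dict` supports subscripting directly, so `typing.Dict` is redundant. A minimal sketch of the suggested style (the function and its names are illustrative, not code from this PR):

```python
from typing import Any, Iterable, Optional

# Built-in generics (PEP 585, Python 3.9+) replace typing.Dict/typing.List.
def parse_labels(raw: Optional[Iterable[str]]) -> dict[str, Any]:
    """Parse 'key=value' strings into a plain dict."""
    labels: dict[str, Any] = {}
    for item in raw or []:
        key, _, value = item.partition("=")
        labels[key] = value
    return labels

print(parse_labels(["app=nemo", "tier=gpu"]))
# → {'app': 'nemo', 'tier': 'gpu'}
```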
  role_name,
  replica_id,
  regex,
  None,
The torchx call takes 8 arguments, so prior to this change we were routing should_tail and streams into since and until. Since my code paths use streams, I ran into this issue.
This adds a log streamer to the DGXCExecutor and ties it to the frontend via the torchx scheduler.
Since the DGXC endpoint doesn't expose logs, we need to go through the kube-apiserver to download container pod logs. Since we need a token for that, I decided to go via the torchx scheduler, which allows instantiating the executor's state.
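The kube-apiserver route mentioned above can be sketched as follows. This is a hedged illustration, not the executor's actual code: `pod_log_url` is a hypothetical helper, the REST path is the standard Kubernetes endpoint for container logs, and the real GET request must also carry an `Authorization: Bearer <token>` header:

```python
from typing import Optional
from urllib.parse import quote, urlencode

def pod_log_url(api_server: str, namespace: str, pod: str,
                container: Optional[str] = None) -> str:
    """Build the standard kube-apiserver endpoint for reading pod logs.

    The actual request must send an 'Authorization: Bearer <token>'
    header; token acquisition is out of scope for this sketch.
    """
    path = f"/api/v1/namespaces/{quote(namespace)}/pods/{quote(pod)}/log"
    query = urlencode({"container": container}) if container else ""
    return api_server.rstrip("/") + path + (f"?{query}" if query else "")
```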
I still need to: