Conversation
hey @roclark, Hemil mentioned that you built the initial DGXC executor. Before diving too deep into the work, I wanted to prefetch your thoughts on this PR. Do you agree with the implementation design to fetch container logs?
Hey @ko3n1g, I haven't worked on the Run:ai pieces in a while, but I think this makes sense. My only question would be if the API server is something that is always exposed/available by default to users on installations or if that needs to be configured by an admin beforehand. |
Signed-off-by: oliver könig <okoenig@nvidia.com>
Force-pushed from f2b1046 to 0c6b7c9
nemo_run/core/execution/dgxcloud.py
  from enum import Enum
  from pathlib import Path
- from typing import Any, Optional, Type
+ from typing import Any, Dict, Iterable, Optional
Can we remove Dict and just use dict for the types?
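Since Python 3.9 (PEP 585), the built-in `dict` supports subscripting directly, so `typing.Dict` is redundant. A minimal sketch of the suggested style (the function and its names are illustrative, not code from this PR):

```python
from typing import Any, Iterable, Optional

# Built-in generics (PEP 585, Python 3.9+) replace typing.Dict/typing.List.
def parse_labels(raw: Optional[Iterable[str]]) -> dict[str, Any]:
    """Parse 'key=value' strings into a plain dict."""
    labels: dict[str, Any] = {}
    for item in raw or []:
        key, _, value = item.partition("=")
        labels[key] = value
    return labels

print(parse_labels(["app=nemo", "tier=gpu"]))
# → {'app': 'nemo', 'tier': 'gpu'}
```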
  role_name,
  replica_id,
  regex,
  None,
The torchx call takes 8 arguments, so prior to this change we were routing should_tail and streams into since and until. Since my code paths use streams, I ran into this issue.
This adds a log streamer to the DGXCExecutor and ties it to the frontend via the torchx scheduler.
Since the DGXC endpoint doesn't expose logs, we need to go through the kube-apiserver to download container pod logs. Since we need a token for that, I decided to go via the torchx scheduler, which allows instantiating the executor's state.
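The kube-apiserver route mentioned above can be sketched as follows. This is a hedged illustration, not the executor's actual code: `pod_log_url` is a hypothetical helper, the REST path is the standard Kubernetes endpoint for container logs, and the real GET request must also carry an `Authorization: Bearer <token>` header:

```python
from typing import Optional
from urllib.parse import quote, urlencode

def pod_log_url(api_server: str, namespace: str, pod: str,
                container: Optional[str] = None) -> str:
    """Build the standard kube-apiserver endpoint for reading pod logs.

    The actual request must send an 'Authorization: Bearer <token>'
    header; token acquisition is out of scope for this sketch.
    """
    path = f"/api/v1/namespaces/{quote(namespace)}/pods/{quote(pod)}/log"
    query = urlencode({"container": container}) if container else ""
    return api_server.rstrip("/") + path + (f"?{query}" if query else "")
```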
I still need to: