diff --git a/deployments/charts/router/README.md b/deployments/charts/router/README.md
index f43474f54..d18b5c80a 100644
--- a/deployments/charts/router/README.md
+++ b/deployments/charts/router/README.md
@@ -123,7 +123,7 @@ helm upgrade my-router ./router -f my-values.yaml
 | Parameter | Description | Default |
 |-----------|-------------|---------|
-| `targetSchema` | pgroll schema version for search_path (e.g., `public_v6_2_0`). Leave empty to use the default `public` schema. | `""` |
+| `targetSchema` | Database schema for search_path. Leave empty to use the default `public` schema. | `""` |
 | `services.postgres.serviceName` | PostgreSQL service name | `postgres` |
 | `services.postgres.port` | PostgreSQL service port | `5432` |
 | `services.postgres.db` | PostgreSQL database name | `osmo` |
diff --git a/deployments/charts/service/README.md b/deployments/charts/service/README.md
index ed61b331a..e93538b04 100644
--- a/deployments/charts/service/README.md
+++ b/deployments/charts/service/README.md
@@ -52,7 +52,7 @@ This Helm chart deploys the OSMO platform with its core services and required si
 | Parameter | Description | Default |
 |-----------|-------------|---------|
 | `services.migration.enabled` | Enable the pgroll migration Job (Helm pre-upgrade hook) | `false` |
-| `services.migration.targetSchema` | Target pgroll schema version. Convention: `public_v{MAJOR}_{MINOR}_{PATCH}`. Updated per chart release. Set to `public_v{MAJOR}_{MINOR}_{PATCH}` to use versioned schemas. Versioned schemas ensure zero downtime between pod roll over. Setting to `public` will cause temporary disruption to existing pods as there could be database operations that are incompatible between versions. | `public` |
+| `services.migration.targetSchema` | Target pgroll schema. Use `public` (the default). | `public` |
 | `services.migration.image` | Container image for the migration Job | `postgres:15-alpine` |
 | `services.migration.pgrollVersion` | pgroll release version to download | `v0.16.1` |
 | `services.migration.serviceAccountName` | Service account name (defaults to global if empty) | `""` |
@@ -68,15 +68,6 @@ This Helm chart deploys the OSMO platform with its core services and required si
 To add new migrations for future releases, drop JSON files into the chart's `migrations/` directory. They are automatically included via `.Files.Glob`.
 
-#### Choosing `targetSchema`
-
-| Scenario | `targetSchema` value | What happens |
-|----------|---------------------|--------------|
-| **Zero-downtime upgrades with multiple versions coexisting** | `public_v6_2_0` (default) | Creates a versioned schema with views. Old pods use `public`, new pods use versioned views. Both run simultaneously during gradual rollout. Re-deploys of the same version are instant no-ops (schema already exists). |
-| **Simple migration without versioned schemas** | `public` | Applies migrations directly to `public`. No views created. Simpler but old and new pods cannot coexist safely if schema changes are breaking. |
-
-When using the versioned schema (`public_v6_2_0`), also set `targetSchema` in the router chart and `OSMO_SCHEMA_VERSION` will be automatically injected into all service pods when `migration.enabled` is true.
-
 
 ### PostgreSQL Settings
 
 | Parameter | Description | Default |
diff --git a/deployments/upgrades/6_0_to_6_2_upgrade.md b/deployments/upgrades/6_0_to_6_2_upgrade.md
index 10d85c73d..1edbb20b4 100644
--- a/deployments/upgrades/6_0_to_6_2_upgrade.md
+++ b/deployments/upgrades/6_0_to_6_2_upgrade.md
@@ -22,7 +22,7 @@ SPDX-License-Identifier: Apache-2.0
 
 - **New authentication architecture** — oauth2Proxy sidecar + authz sidecar replace the old Envoy-native oauth2Filter
 - **RBAC system** — new database tables for users, roles, and role mappings managed by the authz sidecar
-- **pgroll database migrations** — zero-downtime schema changes via versioned schemas
+- **pgroll database migrations** — automated schema changes
 - **Backend operator tokens must be recreated** — the RBAC migration deletes old `SERVICE` type access tokens; new tokens must be created before upgrading backend deployment charts
 
 ## Before you start
@@ -40,7 +40,7 @@ Depending on your deployment, follow the relevant sections:
 
 ### How pgroll works
 
-OSMO 6.2 uses [pgroll](https://github.com/xataio/pgroll) for zero-downtime database schema migrations. pgroll applies migrations to the `public` schema and optionally creates a versioned schema (e.g., `public_v6_2_0`) containing views over all tables. Services set their PostgreSQL `search_path` to this versioned schema, allowing old and new versions to coexist during a rolling upgrade.
+OSMO 6.2 uses [pgroll](https://github.com/xataio/pgroll) for database schema migrations. pgroll applies migrations directly to the `public` schema.
 
 ### Running migrations
 
@@ -50,7 +50,6 @@ Enable the migration job in the service chart values:
 
 ```yaml
 services:
   migration:
     enabled: true
-    targetSchema: public_v6_2_0
 ```
 
 The migration runs as a Helm pre-upgrade hook before pods are updated. For ArgoCD, add:
 
@@ -65,14 +64,6 @@ services:
 
 The database password is read from `OSMO_POSTGRES_PASSWORD` env var, or from the `postgres_password:` field in the file at `OSMO_CONFIG_FILE`.
-### Choosing your upgrade path
-
-**Direct upgrade (simpler, requires downtime):**
-Set `targetSchema: public`. Migrations apply directly to the `public` schema. All services must be on 6.2 after the upgrade.
-
-**Versioned schema (zero-downtime):**
-Set `targetSchema: public_v6_2_0`. Both 6.0 and 6.2 services can run simultaneously. The router chart also needs `targetSchema: public_v6_2_0` set at the top level.
-
 The migration script is idempotent — safe to run multiple times.
 
 ### Schema changes in 6.2
diff --git a/docs/user_guide/getting_started/install/index.rst b/docs/user_guide/getting_started/install/index.rst
index 046734824..71e607b90 100644
--- a/docs/user_guide/getting_started/install/index.rst
+++ b/docs/user_guide/getting_started/install/index.rst
@@ -71,3 +71,25 @@ After successful authentication, you are logged in. Welcome to OSMO.
    :class: no-copybutton
 
    Successfully logged in. Welcome .
+
+Agent Skill
+-----------
+
+OSMO provides an agent skill that enables AI agents to interact with the OSMO CLI on your behalf.
+Once installed, agents in tools such as Claude Code, Cursor, and Codex can check GPU resources,
+generate and submit workflows, monitor progress, diagnose failures, and orchestrate end-to-end
+Physical AI workloads through natural language.
+
+The skill follows the `Agent Skills `_ open standard and is compatible with
+`30+ agent tools `_.
+
+To install:
+
+.. code-block:: bash
+
+   $ npx skills add NVIDIA/osmo
+
+.. seealso::
+
+   See the `skills/README `_ for detailed
+   installation options and usage examples.
diff --git a/skills/README.md b/skills/README.md
new file mode 100644
index 000000000..26043910e
--- /dev/null
+++ b/skills/README.md
@@ -0,0 +1,82 @@
+
+
+# Agent Skills
+
+Agent skills for the OSMO platform, built on the [Agent Skills](https://agentskills.io) open standard. Enables AI
+agents to check GPU resources, generate and submit workflows, monitor progress, diagnose failures, and orchestrate
+end-to-end Physical AI workloads.
+
+Compatible with Claude Code, Cursor, Codex, GitHub Copilot, Gemini CLI, and [30+ other agent tools](https://skills.sh/).
+
+## Prerequisites
+
+The OSMO CLI must be installed and authenticated before using the skill. See the [Getting Started](https://nvidia.github.io/OSMO/main/user_guide/getting_started/install/index.html) guide for instructions.
+
+## Installation
+
+To install:
+
+```bash
+npx skills add NVIDIA/osmo
+```
+
+To update an existing installation:
+
+```bash
+npx skills update
+```
+
+To uninstall:
+
+```bash
+npx skills remove osmo-agent
+```
+
+## Usage
+
+Once installed, the skill activates automatically when the agent detects relevant requests. Example prompts:
+
+| Category | Example |
+|----------|---------|
+| Resource availability | "What GPUs are available?" |
+| Workflow submission | "Submit workflow.yaml to available pool" |
+| Monitoring | "What's the status of my last workflow?" |
+| Failure diagnosis | "My workflow failed — figure out why and resubmit" |
+| End-to-end orchestration | "Create an SDG workflow with Isaac Sim, submit and monitor it, and download results when done" |
+
+For complex workflows, the skill spawns specialized sub-agents to handle resource selection, YAML generation, submission, monitoring, log fetching, failure diagnosis, and retries autonomously.
+
+## Skill Contents
+
+```
+skills/osmo-agent/
+├── SKILL.md                # Main skill instructions
+├── LICENSE                 # Apache-2.0
+├── agents/
+│   ├── workflow-expert.md  # Sub-agent: workflow creation, submission, diagnosis
+│   └── logs-reader.md      # Sub-agent: log fetching and summarization
+└── references/
+    ├── cookbook.md           # 40+ real-world workflow templates
+    ├── workflow-patterns.md  # Multi-task, parallel, data dependency patterns
+    └── advanced-patterns.md  # Checkpointing, retry logic, node exclusion
+```
+
+## License
+
+Apache-2.0 — see [osmo-agent/LICENSE](osmo-agent/LICENSE).
diff --git a/skills/osmo-agent/SKILL.md b/skills/osmo-agent/SKILL.md
index 35f193d55..6284331e0 100644
--- a/skills/osmo-agent/SKILL.md
+++ b/skills/osmo-agent/SKILL.md
@@ -1,5 +1,5 @@
 ---
-name: osmo
+name: osmo-agent
 description: >
   How to use the OSMO CLI to manage cloud compute resources for robotics development.
   Use this skill whenever the user asks about available resources, nodes, pools, GPUs,
@@ -9,6 +9,12 @@ description: >
   check the status or logs of a running/completed workflow, list or browse recent
   workflow submissions, want to understand what a specific workflow does or is
   configured to do, or want to create an OSMO app from a workflow.
+license: Apache-2.0
+compatibility: >
+  Requires osmo CLI installed and authenticated (osmo login).
+metadata:
+  author: nvidia
+  version: "1.0.0"
 ---
 
 # OSMO CLI Use Cases
diff --git a/src/cli/BUILD b/src/cli/BUILD
index 15b90d9e7..5d77052a7 100755
--- a/src/cli/BUILD
+++ b/src/cli/BUILD
@@ -93,7 +93,6 @@ osmo_py_binary(
     srcs = ["cli_builder.py"],
     deps = [
         ":cli_lib",
-        requirement("backports-tarfile"),
        requirement("pyinstaller"),
        requirement("shtab"),
    ],
diff --git a/src/cli/login.py b/src/cli/login.py
index 745bbe595..b407934d0 100644
--- a/src/cli/login.py
+++ b/src/cli/login.py
@@ -127,8 +127,8 @@ class UrlValidator(pydantic.BaseModel):
            token = args.token
        else:
            raise osmo_errors.OSMOUserError('Must provide token file with --token_file or --token')
-        refresh_url = login.construct_token_refresh_url(url, token)
-        service_client.login_manager.token_login(url, refresh_url)
+        refresh_url = login.construct_token_refresh_url(url)
+        service_client.login_manager.token_login(url, refresh_url, token)
 
     # For developers, simply send username as a header
     else:
diff --git a/src/lib/utils/client.py b/src/lib/utils/client.py
index c5344e753..c2d5952b2 100644
--- a/src/lib/utils/client.py
+++ b/src/lib/utils/client.py
@@ -226,8 +226,8 @@ def dev_login(self, url: str, username: str):
self._save_login_info(self._login_storage, welcome=True) - def token_login(self, url: str, access_token: str): - self._login_storage = login.token_login(url, access_token, self.user_agent) + def token_login(self, url: str, refresh_url: str, refresh_token: str): + self._login_storage = login.token_login(url, refresh_url, refresh_token, self.user_agent) self._save_login_info(self._login_storage, welcome=True) def logout(self): @@ -274,7 +274,7 @@ def get_access_token(self) -> str | None: raise osmo_errors.OSMOUserError('Must login first with "login" command') if self._login_storage.token_login is None: raise osmo_errors.OSMOUserError('Must login first with token') - return login.fetch_token_from_refresh_url(self._login_storage.token_login.refresh_url or '') + return self._login_storage.token_login.refresh_token class ServiceClient(): diff --git a/src/lib/utils/login.py b/src/lib/utils/login.py index bab15de77..ddd9af7bc 100644 --- a/src/lib/utils/login.py +++ b/src/lib/utils/login.py @@ -21,7 +21,6 @@ import os import time from typing import List, Literal -from urllib.parse import urlencode, urlparse import pydantic import requests # type: ignore @@ -208,17 +207,19 @@ def owner_password_login(config: LoginConfig, ) -def construct_token_refresh_url(url: str, token: str) -> str: - return os.path.join(url, f'api/auth/jwt/access_token?{urlencode({"access_token": token})}') +def construct_token_refresh_url(url: str) -> str: + return os.path.join(url, 'api/auth/jwt/access_token') def token_login(url: str, refresh_url: str, - user_agent: str| None) -> LoginStorage: + refresh_token: str, + user_agent: str | None) -> LoginStorage: headers = {} if user_agent: headers['User-Agent'] = user_agent - result = requests.get(refresh_url, timeout=TIMEOUT, headers=headers) + result = requests.post(refresh_url, json={'token': refresh_token}, + timeout=TIMEOUT, headers=headers) if result.status_code >= 300: raise osmo_errors.OSMOServerError('Unable to refresh login token (status code ' \ 
f'{result.status_code}): {result.text}\n' \ @@ -228,7 +229,8 @@ def token_login(url: str, url=url, token_login=TokenLoginStorage( id_token=result['token'], - refresh_url=refresh_url + refresh_url=refresh_url, + refresh_token=refresh_token ), osmo_token=True ) @@ -258,7 +260,9 @@ def refresh_id_token(config: LoginConfig, user_agent: str | None, headers['User-Agent'] = user_agent if osmo_token: - result = requests.get(token_login_storage.refresh_url, timeout=TIMEOUT, headers=headers) + result = requests.post(token_login_storage.refresh_url, + json={'token': token_login_storage.refresh_token}, + timeout=TIMEOUT, headers=headers) else: result = requests.post(token_endpoint, data={ 'grant_type': 'refresh_token', @@ -289,7 +293,3 @@ def parse_allowed_pools(allowed_pools_header: str | None) -> List[str]: return [pool.strip() for pool in allowed_pools_header.split(',') if pool.strip()] -def fetch_token_from_refresh_url(refresh_url: str) -> str | None: - parsed = urlparse(refresh_url) - query_params = dict(param.split('=') for param in parsed.query.split('&')) - return query_params.get('access_token', None) diff --git a/src/locked_requirements.txt b/src/locked_requirements.txt index f90f3cea0..577fc39e3 100644 --- a/src/locked_requirements.txt +++ b/src/locked_requirements.txt @@ -52,10 +52,6 @@ azure-storage-blob==12.26.0 \ --hash=sha256:5dd7d7824224f7de00bfeb032753601c982655173061e242f13be6e26d78d71f \ --hash=sha256:8c5631b8b22b4f53ec5fff2f3bededf34cfef111e2af613ad42c9e6de00a77fe # via -r requirements.txt -backports-tarfile==1.2.0 \ - --hash=sha256:77e284d754527b01fb1e6fa8a1afe577858ebe4e9dad8919e34c862cb399bc34 \ - --hash=sha256:d75e02c268746e1b8144c278978b6e98e85de6ad16f8e4b0844a154557eca991 - # via -r requirements.txt boto3==1.38.0 \ --hash=sha256:8b6544eca17e31d1bfd538e5d152b96a68d6c92950352a0cd9679f89d217d53a \ --hash=sha256:96898facb164b47859d40a4271007824a0a791c3811a7079ce52459d753d4474 @@ -602,7 +598,9 @@ lark==1.1.5 \ macholib==1.16.3 \ 
--hash=sha256:07ae9e15e8e4cd9a788013d81f5908b3609aa76f9b1421bae9c4d7606ec86a30 \ --hash=sha256:0e315d7583d38b8c77e815b1ecbdbf504a8258d8b3e17b61165c6feb60d18f2c - # via -r requirements.txt + # via + # -r requirements.txt + # pyinstaller markupsafe==3.0.3 \ --hash=sha256:0303439a41979d9e74d18ff5e2dd8c43ed6c6001fd40e5bf2e43f7bd9bbc523f \ --hash=sha256:068f375c472b3e7acbe2d5318dea141359e6900156b5b2ba06a30b169086b91a \ @@ -937,19 +935,19 @@ pydantic==1.10.26 \ # via # -r requirements.txt # fastapi -pyinstaller==6.12.0 \ - --hash=sha256:0c271896a3a168f4f91827145702543db9c5427f4c7372a6df8c75925a3ac18a \ - --hash=sha256:0e62d3906309248409f215b386f33afec845214e69cc0f296b93222b26a88f43 \ - --hash=sha256:138856a5a503bb69c066377e0a22671b0db063e9cc14d5cf5c798a53561200d3 \ - --hash=sha256:1834797be48ce1b26015af68bdeb3c61a6c7500136f04e0fc65e468115dec777 \ - --hash=sha256:68f1e4cecf88a6272063977fa2a2c69ad37cf568e5901769d7206d0314c74f47 \ - --hash=sha256:83c7f3bde9871b4a6aa71c66a96e8ba5c21668ce711ed97f510b9382d10aac6c \ - --hash=sha256:8e92e9873a616547bbabbb5a3a9843d5f2ab40c3d8b26810acdf0fe257bee4cf \ - --hash=sha256:a2abf5fde31a8b38b6df7939bcef8ac1d0c51e97e25317ce3555cd675259750f \ - --hash=sha256:a69818815c6e0711c727edc30680cb1f81c691b59de35db81a2d9e0ae26a9ef1 \ - --hash=sha256:aefe502d55c9cf6aeaed7feba80b5f8491ce43f8f2b5fe2d9aadca3ee5a05bc4 \ - --hash=sha256:dac8a27988dbc33cdc34f2046803258bc3f6829de24de52745a5daa22bdba0f1 \ - --hash=sha256:fea76fc9b55ffa730fcf90beb897cce4399938460b0b6f40507fbebfc752c753 +pyinstaller==6.19.0 \ + --hash=sha256:1ec54ef967996ca61dacba676227e2b23219878ccce5ee9d6f3aada7b8ed8abf \ + --hash=sha256:3c5c251054fe4cfaa04c34a363dcfbf811545438cb7198304cd444756bc2edd2 \ + --hash=sha256:4190e76b74f0c4b5c5f11ac360928cd2e36ec8e3194d437bf6b8648c7bc0c134 \ + --hash=sha256:481a909c8e60c8692fc60fcb1344d984b44b943f8bc9682f2fcdae305ad297e6 \ + --hash=sha256:4ab2bb52e58448e14ddf9450601bdedd66800465043501c1d8f1cab87b60b122 \ + 
--hash=sha256:8bd68abd812d8a6ba33b9f1810e91fee0f325969733721b78151f0065319ca11 \ + --hash=sha256:a0fc5f6b3c55aa54353f0c74ffa59b1115433c1850c6f655d62b461a2ed6cbbe \ + --hash=sha256:b5bb6536c6560330d364d91522250f254b107cf69129d9cbcd0e6727c570be33 \ + --hash=sha256:c2d5a539b0bfe6159d5522c8c70e1c0e487f22c2badae0f97d45246223b798ea \ + --hash=sha256:da6d5c6391ccefe73554b9fa29b86001c8e378e0f20c2a4004f836ba537eff63 \ + --hash=sha256:e649ba6bd1b0b89b210ad92adb5fbdc8a42dd2c5ca4f72ef3a0bfec83a424b83 \ + --hash=sha256:ec73aeb8bd9b7f2f1240d328a4542e90b3c6e6fbc106014778431c616592a865 # via -r requirements.txt pyinstaller-hooks-contrib==2026.1 \ --hash=sha256:66ad4888ba67de6f3cfd7ef554f9dd1a4389e2eb19f84d7129a5a6818e3f2180 \ diff --git a/src/operator/utils/login.py b/src/operator/utils/login.py index aaedc93a6..89854b450 100644 --- a/src/operator/utils/login.py +++ b/src/operator/utils/login.py @@ -66,7 +66,8 @@ def get_login_info( raise osmo_errors.OSMOUserError('Must provide token') return login.token_login( config.service_url, - login.construct_token_refresh_url(config.service_url, token), + login.construct_token_refresh_url(config.service_url), + token, user_agent=user_agent, ) else: diff --git a/src/requirements.txt b/src/requirements.txt index 5f4ac55b0..eda1eae9a 100644 --- a/src/requirements.txt +++ b/src/requirements.txt @@ -58,9 +58,7 @@ cryptography==46.0.5 jwcrypto==1.5.6 # Pyinstaller -pyinstaller==6.12.0 -# backports-tarfile: required by jaraco.context (vendored in setuptools 82+) on Python < 3.12 -backports-tarfile==1.2.0 +pyinstaller==6.19.0 # Yaml pyyaml==6.0.3 diff --git a/src/runtime/cmd/ctrl/ctrl.go b/src/runtime/cmd/ctrl/ctrl.go index be2176ae3..b0bc529ce 100644 --- a/src/runtime/cmd/ctrl/ctrl.go +++ b/src/runtime/cmd/ctrl/ctrl.go @@ -1,5 +1,5 @@ /* -SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -20,6 +20,7 @@ package main import ( "bufio" + "bytes" "crypto/tls" "encoding/binary" "encoding/json" @@ -140,18 +141,22 @@ func refreshJWTToken(cmdArgs args.CtrlArgs) error { panic(fmt.Sprintf("Parsing refreshUrl failed: %v\n%s", cmdArgs.RefreshTokenUrl, err)) } - // Query parameters + // Query parameters (token goes in the body, not the URL) params := url.Values{} - // Get refresh token - params.Add("refresh_token", string(refreshToken)) params.Add("workflow_id", cmdArgs.Workflow) params.Add("group_name", cmdArgs.GroupName) params.Add("task_name", cmdArgs.LogSource) params.Add("retry_id", cmdArgs.RetryId) - // Encode query parameters and append to the base URL u.RawQuery = params.Encode() - resp, err := http.Get(u.String()) + + // Send token in request body as JSON + requestBody, err := json.Marshal(map[string]string{"token": string(refreshToken)}) + if err != nil { + osmo_errors.SetExitCode(osmo_errors.TOKEN_INVALID_CODE) + panic(fmt.Sprintf("Error marshaling token request body: %s\n", err)) + } + resp, err := http.Post(u.String(), "application/json", bytes.NewBuffer(requestBody)) if err != nil { return &DialWebsocketError{ ErrorType: string(FetchFailureError), @@ -216,10 +221,7 @@ func dialWebsocket(url string, conn **websocket.Conn, cmdArgs args.CtrlArgs, ret if isRefresh { err := refreshJWTToken(cmdArgs) if err != nil { - // Exponential backoff - exponent := common.Min(retryCount, 5) - delay := time.Duration(math.Pow(2, float64(exponent))) * time.Second - time.Sleep(delay) + time.Sleep(data.ExponentialBackoffWithJitter(retryCount)) return err } } @@ -245,10 +247,7 @@ func dialWebsocket(url string, conn **websocket.Conn, cmdArgs args.CtrlArgs, ret } } if !data.WebsocketConnection.ReachedTimeout() { - // Exponential backoff - exponent := common.Min(retryCount, 5) - delay := time.Duration(math.Pow(2, float64(exponent))) * 
time.Second - time.Sleep(delay) + time.Sleep(data.ExponentialBackoffWithJitter(retryCount)) return err } diff --git a/src/runtime/pkg/data/data.go b/src/runtime/pkg/data/data.go index 76b2912d8..22a6f0e95 100644 --- a/src/runtime/pkg/data/data.go +++ b/src/runtime/pkg/data/data.go @@ -1,5 +1,5 @@ /* -SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -25,7 +25,7 @@ import ( "fmt" "log" "math" - "math/rand" + "math/rand/v2" "net" "os" "os/exec" @@ -151,6 +151,20 @@ func (f WebsocketConnectionInfo) ReachedTimeout() bool { return time.Since(f.DisconnectStartTime) >= f.Timeout } +// ExponentialBackoffWithJitter returns a randomized delay using "equal jitter": +// uniformly distributed in [backoff/2, backoff) where backoff = 2^min(retryCount,5) seconds. +// The guaranteed minimum avoids near-zero sleeps while the jitter decorrelates +// concurrent clients to prevent thundering herd. +func ExponentialBackoffWithJitter(retryCount int) time.Duration { + exponent := common.Min(retryCount, 5) + maxDelay := time.Duration(math.Pow(2, float64(exponent))) * time.Second + if maxDelay <= 0 { + return 0 + } + halfDelay := maxDelay / 2 + return halfDelay + time.Duration(rand.Int64N(int64(halfDelay))) +} + func (f WebsocketConnectionInfo) TimeLeft() time.Duration { return f.Timeout - time.Since(f.DisconnectStartTime) } @@ -268,8 +282,7 @@ func RunOSMOCommandStreamingWithRetry(command []string, retryCommand []string, osmoChan <- "Rate limited by service. Waiting before retrying..." 
firstError = true } - maxSleep := math.Pow(2, float64(math.Min(float64(backoffCount), 5))) - sleepTime = time.Second * time.Duration(1+rand.Float64()*(maxSleep-1)) + sleepTime = ExponentialBackoffWithJitter(backoffCount) backoffCount++ continueLoop = true } @@ -346,8 +359,7 @@ func RunOSMOCommandWithRetry(commandArgs []string, retryCount int, osmoChan <- "Rate limited by service. Waiting before retrying..." firstError = true } - maxSleep := math.Pow(2, float64(math.Min(float64(backoffCount), 5))) - sleepTime = time.Second * time.Duration(1+rand.Float64()*(maxSleep-1)) + sleepTime = ExponentialBackoffWithJitter(backoffCount) backoffCount++ continueLoop = true } diff --git a/src/scripts/export_status_metadata.py b/src/scripts/export_status_metadata.py index fbf9cdedb..3e16f9d3a 100644 --- a/src/scripts/export_status_metadata.py +++ b/src/scripts/export_status_metadata.py @@ -31,9 +31,10 @@ """ import argparse -import json import sys +from collections.abc import Mapping from typing import Literal + from typing_extensions import assert_never from src.utils.job.task import TaskGroupStatus @@ -113,14 +114,33 @@ def get_workflow_status_category(status: WorkflowStatus) -> StatusCategory: return 'completed' case WorkflowStatus.RUNNING: return 'running' - case WorkflowStatus.PENDING: - return 'pending' - case WorkflowStatus.WAITING: + case WorkflowStatus.PENDING | WorkflowStatus.WAITING: return 'waiting' case _ as unreachable: assert_never(unreachable) +def _ts_value(value: object) -> str: + """Convert a Python value to its TypeScript literal representation.""" + if isinstance(value, bool): + return 'true' if value else 'false' + elif isinstance(value, str): + return f'"{value}"' + else: + return str(value) + + +def format_metadata_entries(metadata: Mapping[str, Mapping[str, object]]) -> str: + """Format a metadata dict as TypeScript object entries with Prettier-style formatting.""" + lines: list[str] = [] + for status_name, fields in metadata.items(): + lines.append(f' 
{status_name}: {{') + for key, value in fields.items(): + lines.append(f' {key}: {_ts_value(value)},') + lines.append(' },') + return '\n'.join(lines) + + def generate_typescript() -> str: """Generate TypeScript code from Python enum metadata.""" # Build TaskGroupStatus metadata @@ -148,9 +168,8 @@ def generate_typescript() -> str: 'isFailed': workflow_status.failed(), } - # Format JSON with proper indentation for TypeScript - task_json = json.dumps(task_metadata, indent=2) - workflow_json = json.dumps(workflow_metadata, indent=2) + task_entries = format_metadata_entries(task_metadata) + workflow_entries = format_metadata_entries(workflow_metadata) # pylint: disable=line-too-long return f'''// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION. All rights reserved. @@ -206,9 +225,13 @@ def generate_typescript() -> str: // Generated Metadata // ============================================================================= -export const TASK_STATUS_METADATA: Record = {task_json} as const; +export const TASK_STATUS_METADATA: Record = {{ +{task_entries} +}} as const; -export const WORKFLOW_STATUS_METADATA: Record = {workflow_json} as const; +export const WORKFLOW_STATUS_METADATA: Record = {{ +{workflow_entries} +}} as const; // ============================================================================= // Helper Functions (O(1) lookups) @@ -257,8 +280,7 @@ def generate_typescript() -> str: /** Check if a workflow status is a failure */ export function isWorkflowFailed(status: WorkflowStatus): boolean {{ return WORKFLOW_STATUS_METADATA[status]?.isFailed ?? 
false; -}} -''' +}}''' def main(): diff --git a/src/service/core/auth/auth_service.py b/src/service/core/auth/auth_service.py index bfb827acc..97879ceeb 100644 --- a/src/service/core/auth/auth_service.py +++ b/src/service/core/auth/auth_service.py @@ -56,7 +56,25 @@ def get_new_jwt_token(refresh_token: str, workflow_id: str, group_name: str, task_name: str, retry_id: int = 0): """ API to fetch for a new access token using a refresh token. + + Deprecated: Use POST /api/auth/jwt/refresh_token instead. + """ + return _create_jwt_from_refresh_token(refresh_token, workflow_id, + group_name, task_name, retry_id) + + +@router.post('/api/auth/jwt/refresh_token') +def post_new_jwt_token(request: objects.TokenRequest, workflow_id: str, + group_name: str, task_name: str, retry_id: int = 0): """ + API to fetch for a new access token using a refresh token. + """ + return _create_jwt_from_refresh_token(request.token, workflow_id, + group_name, task_name, retry_id) + + +def _create_jwt_from_refresh_token(refresh_token: str, workflow_id: str, + group_name: str, task_name: str, retry_id: int = 0): if len(refresh_token) not in task_lib.VALID_TOKEN_LENGTHS: raise osmo_errors.OSMOUserError( f'Refresh token has invalid length {len(refresh_token)}') @@ -117,7 +135,21 @@ def get_new_jwt_token(refresh_token: str, workflow_id: str, def get_jwt_token_from_access_token(access_token: str): """ API to create a new jwt token from an access token. + + Deprecated: Use POST /api/auth/jwt/access_token instead. + """ + return _create_jwt_from_access_token(access_token) + + +@router.post('/api/auth/jwt/access_token') +def post_jwt_token_from_access_token(request: objects.TokenRequest): """ + API to create a new jwt token from an access token. 
+ """ + return _create_jwt_from_access_token(request.token) + + +def _create_jwt_from_access_token(access_token: str): if len(access_token) not in task_lib.VALID_TOKEN_LENGTHS: raise osmo_errors.OSMOUserError( f'Access token has invalid length {len(access_token)}') diff --git a/src/service/core/auth/objects.py b/src/service/core/auth/objects.py index 4a6756a2e..b5428746b 100644 --- a/src/service/core/auth/objects.py +++ b/src/service/core/auth/objects.py @@ -249,6 +249,11 @@ class UserWithRoles(User): roles: List[UserRole] = [] +class TokenRequest(pydantic.BaseModel): + """Request body containing a token for JWT generation.""" + token: str + + class CreateUserRequest(pydantic.BaseModel): """Request to create a new user.""" id: str diff --git a/src/service/core/data/data_service.py b/src/service/core/data/data_service.py index 71e1614fe..8f68b3577 100755 --- a/src/service/core/data/data_service.py +++ b/src/service/core/data/data_service.py @@ -1004,8 +1004,7 @@ def list_dataset_from_bucket(name: objects.DatasetPattern | None = None, """ This api returns the list of datasets/colections.""" postgres = connectors.PostgresConnector.get_instance() fetch_cmd = ''' - SELECT DISTINCT dataset.*, dv.created_date as dv_created_date, - dv.version_id as dv_version_id, + SELECT dataset.*, dv.created_date as dv_created_date, dv.version_id as dv_version_id, COALESCE(dv.created_date, dataset.created_date) as combined_date FROM dataset LEFT JOIN (SELECT dataset_version.* FROM dataset_version @@ -1050,7 +1049,8 @@ def list_dataset_from_bucket(name: objects.DatasetPattern | None = None, fetch_cmd += ' AND name LIKE %s' fetch_input.append('%' + name + '%') - fetch_cmd += ' ORDER BY combined_date DESC LIMIT %s' + fetch_cmd += \ + ' GROUP BY dataset.id, dv.created_date, dv.version_id ORDER BY combined_date DESC LIMIT %s' fetch_input.append(min(count, 1000)) fetch_cmd = f'SELECT * FROM ({fetch_cmd}) as ds' diff --git a/src/ui/src/components/data-table/utils/column-constants.ts 
b/src/ui/src/components/data-table/utils/column-constants.ts index c65896a2b..532dbe61d 100644 --- a/src/ui/src/components/data-table/utils/column-constants.ts +++ b/src/ui/src/components/data-table/utils/column-constants.ts @@ -39,6 +39,9 @@ export const COLUMN_MIN_WIDTHS_REM = { /** Text that truncates with ellipsis (names, descriptions) */ TEXT_TRUNCATE: 8.75, + /** Medium text labels */ + TEXT_MEDIUM: 7.5, + /** Short text labels (status, type) */ TEXT_SHORT: 6, @@ -82,6 +85,9 @@ export const COLUMN_PREFERRED_WIDTHS_REM = { /** Text that truncates - comfortable reading width */ TEXT_TRUNCATE: 16, + /** Medium text labels */ + TEXT_MEDIUM: 12, + /** Short text labels (status, type) - badge + text */ TEXT_SHORT: 8, diff --git a/src/ui/src/components/event-viewer/event-details-panel.tsx b/src/ui/src/components/event-viewer/event-details-panel.tsx index 6209fac3e..44356e299 100644 --- a/src/ui/src/components/event-viewer/event-details-panel.tsx +++ b/src/ui/src/components/event-viewer/event-details-panel.tsx @@ -94,7 +94,9 @@ export const EventDetailsPanel = memo(function EventDetailsPanel({ {/* Message */} -
{event.message}
+
+ {event.message} +
); })} diff --git a/src/ui/src/components/event-viewer/event-viewer-container.tsx b/src/ui/src/components/event-viewer/event-viewer-container.tsx index 215808ac5..2d9df0440 100644 --- a/src/ui/src/components/event-viewer/event-viewer-container.tsx +++ b/src/ui/src/components/event-viewer/event-viewer-container.tsx @@ -17,14 +17,17 @@ "use client"; import { useState, useMemo, useCallback, startTransition, useDeferredValue } from "react"; -import { ChevronsDownUp, ChevronsUpDown, Loader2, Radio } from "lucide-react"; +import { ChevronsDownUp, ChevronsUpDown, ExternalLink, Loader2, Radio } from "lucide-react"; import { cn } from "@/lib/utils"; import { useEventStream } from "@/lib/api/adapter/events/use-event-stream"; import { groupEventsByTask, calculateDuration } from "@/lib/api/adapter/events/events-grouping"; import { EventViewerTable } from "@/components/event-viewer/event-viewer-table"; import { EventViewerProvider } from "@/components/event-viewer/event-viewer-context"; import type { TaskGroupStatus } from "@/lib/api/generated"; +import { Button } from "@/components/shadcn/button"; +import { Tooltip, TooltipContent, TooltipTrigger } from "@/components/shadcn/tooltip"; import { useTick } from "@/hooks/use-tick"; +import { getBasePathUrl } from "@/lib/config"; import { FilterBar } from "@/components/filter-bar/filter-bar"; import { useUrlChips } from "@/components/filter-bar/hooks/use-url-chips"; import { EVENT_SEARCH_FIELDS, EVENT_PRESETS } from "@/components/event-viewer/lib/event-search-fields"; @@ -79,6 +82,7 @@ export function EventViewerContainer({ taskTimings, }: EventViewerContainerProps) { const isTaskScope = scope === "task"; + const openUrl = url.startsWith("http://") || url.startsWith("https://") ? 
url : getBasePathUrl(url); // URL-synced filter chips (only in workflow scope) const { searchChips, setSearchChips } = useUrlChips({ paramName: "ef" }); @@ -262,6 +266,52 @@ export function EventViewerContainer({ Collapse All + + {/* Open raw event stream in new tab */} + + + + + Open event stream in new tab + + + )} + + {/* Task scope: minimal toolbar with open-in-new-tab button */} + {isTaskScope && ( +
+ + + + + Open event stream in new tab +
)} diff --git a/src/ui/src/components/inline-progress.tsx b/src/ui/src/components/inline-progress.tsx index 08c755ea3..e970f5a13 100644 --- a/src/ui/src/components/inline-progress.tsx +++ b/src/ui/src/components/inline-progress.tsx @@ -18,27 +18,17 @@ import { memo } from "react"; import { cn } from "@/lib/utils"; -import type { DisplayMode } from "@/stores/shared-preferences-store"; import { ProgressBar } from "@/components/progress-bar"; -// Re-export for consumers that import from here -export type { DisplayMode }; - export interface InlineProgressProps { /** Current usage value */ used: number; /** Total/maximum value */ total: number; - /** Free/available value (from API) */ - free: number; - /** Display mode: show "used/total" or "free" */ - displayMode?: DisplayMode; /** Compact mode: hide progress bar, show only text */ compact?: boolean; /** Width of the progress bar */ barWidth?: string; - /** Label for free display (e.g., "free", "idle", "available") */ - freeLabel?: string; /** Additional content to render after the label (e.g., icons) */ children?: React.ReactNode; /** Additional className for the container */ @@ -52,54 +42,34 @@ export interface InlineProgressProps { /** * InlineProgress - Horizontal progress display for table cells. * - * Renders a progress bar with value label in a horizontal layout, - * suitable for table cells and inline contexts. - * - * Composes from ProgressBar primitive. + * Renders a progress bar with a "{used}/{total}" fraction label. + * Suitable for table cells showing utilization. 
* * @example * ```tsx - * // Basic usage - * - * - * // Free display mode - * - * - * // Compact mode (no bar) - * - * - * // With trailing content (e.g., icon) - * - * - * + * + * * ``` */ export const InlineProgress = memo(function InlineProgress({ used, total, - free, - displayMode = "used", compact = false, barWidth = "w-16", - freeLabel = "free", children, className, }: InlineProgressProps) { - // Format display label based on mode - const displayFree = Math.max(0, free); - const displayLabel = displayMode === "used" ? `${used}/${total}` : `${displayFree} ${freeLabel}`; + const label = `${used}/${total}`; if (compact) { return (
- {displayLabel} + {label} {children}
); } - // Convert width class to max-width for capped growth - // e.g., "w-16" -> "max-w-16", bar grows to fill but caps at this size const maxBarWidth = barWidth.replace(/^w-/, "max-w-"); return ( @@ -109,9 +79,10 @@ export const InlineProgress = memo(function InlineProgress({ value={used} max={total} size="md" + thresholdColors /> - {displayLabel} + {label} {children} ); diff --git a/src/ui/src/components/shell/lib/shell-cache.ts b/src/ui/src/components/shell/lib/shell-cache.ts index cfa23945c..51d44bc51 100644 --- a/src/ui/src/components/shell/lib/shell-cache.ts +++ b/src/ui/src/components/shell/lib/shell-cache.ts @@ -63,8 +63,10 @@ function getSnapshot(): CachedSession[] { return cachedSnapshot; } -function getServerSnapshot(): CachedSession[] { - return []; +const SERVER_SNAPSHOT: readonly CachedSession[] = []; + +function getServerSnapshot(): readonly CachedSession[] { + return SERVER_SNAPSHOT; } export function useShellSessions(): readonly CachedSession[] { diff --git a/src/ui/src/components/submit-workflow/detect-localpath.test.ts b/src/ui/src/components/submit-workflow/detect-localpath.test.ts new file mode 100644 index 000000000..4a23c2d02 --- /dev/null +++ b/src/ui/src/components/submit-workflow/detect-localpath.test.ts @@ -0,0 +1,261 @@ +// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION. All rights reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+// +// SPDX-License-Identifier: Apache-2.0 + +import { describe, it, expect } from "vitest"; +import { detectLocalpathUsage } from "@/components/submit-workflow/detect-localpath"; + +describe("detectLocalpathUsage", () => { + // ── Early exit ─────────────────────────────────────────────────────────── + + describe("early exit", () => { + it("returns both false for empty string", () => { + expect(detectLocalpathUsage("")).toEqual({ + hasFileLocalpath: false, + hasDatasetLocalpath: false, + }); + }); + + it("returns both false when localpath: is absent", () => { + const spec = [ + "workflow:", + " name: test", + " tasks:", + " - name: task1", + " files:", + " - path: /tmp/a.sh", + " contents: hello", + ].join("\n"); + expect(detectLocalpathUsage(spec)).toEqual({ + hasFileLocalpath: false, + hasDatasetLocalpath: false, + }); + }); + + it("returns both false when localpath appears without colon", () => { + const spec = [" files:", " - path: /tmp/localpath_example"].join("\n"); + expect(detectLocalpathUsage(spec)).toEqual({ + hasFileLocalpath: false, + hasDatasetLocalpath: false, + }); + }); + }); + + // ── hasFileLocalpath ───────────────────────────────────────────────────── + + describe("hasFileLocalpath", () => { + describe("detects localpath: inside files block", () => { + it("first key in list item", () => { + const spec = [" files:", " - localpath: /home/user/data"].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(true); + }); + + it("subsequent key after path: in same list item", () => { + const spec = [" files:", " - path: /tmp/a.sh", " localpath: /home/user/a.sh"].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(true); + }); + + it("multiple intermediate lines between files: and localpath:", () => { + const spec = [ + " files:", + " - path: /tmp/a.sh", + " contents: |", + " #!/bin/bash", + " echo hello", + " - localpath: /home/user/b.sh", + ].join("\n"); + 
expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(true); + }); + + it("blank lines between files: and localpath:", () => { + const spec = [" files:", "", " - localpath: /path"].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(true); + }); + + it("whitespace-only blank lines between files: and localpath:", () => { + const spec = [" files:", " ", " - localpath: /path"].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(true); + }); + + it("deeply indented files block (6+ spaces)", () => { + const spec = [" files:", " - localpath: /path"].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(true); + }); + + it("tab indentation", () => { + const spec = ["\tfiles:", "\t- localpath: /path"].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(true); + }); + + it("mixed tab and space indentation", () => { + const spec = ["\t files:", "\t - localpath: /path"].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(true); + }); + }); + + describe("ignores localpath: outside files block", () => { + it("files: at column 0 (not indented)", () => { + const spec = ["files:", " - localpath: /path"].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(false); + }); + + it("localpath: in a value string", () => { + const spec = [" files:", " - path: /tmp/localpath:test"].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(false); + }); + + it("localpath: after a non-list key exits the files block", () => { + const spec = [ + " files:", + " - path: /tmp/a.sh", + " command: [bash]", + " localpath: /should/not/match", + ].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(false); + }); + + it("files: with inline value", () => { + const spec = [" files: []", " localpath: /path"].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(false); + }); + + it("localpath: before files: in spec", () => { + const spec = [" 
localpath: /path", " files:", " - path: /tmp/a.sh"].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(false); + }); + }); + }); + + // ── hasDatasetLocalpath ────────────────────────────────────────────────── + + describe("hasDatasetLocalpath", () => { + describe("detects localpath: inside dataset block", () => { + it("directly under dataset:", () => { + const spec = [" - dataset:", " localpath: /data"].join("\n"); + expect(detectLocalpathUsage(spec).hasDatasetLocalpath).toBe(true); + }); + + it("after sibling key name: under dataset:", () => { + const spec = [" - dataset:", " name: my-ds", " localpath: /data"].join("\n"); + expect(detectLocalpathUsage(spec).hasDatasetLocalpath).toBe(true); + }); + + it("dataset: without list dash", () => { + const spec = [" dataset:", " localpath: /data"].join("\n"); + expect(detectLocalpathUsage(spec).hasDatasetLocalpath).toBe(true); + }); + + it("dataset: at column 0", () => { + const spec = ["dataset:", " localpath: /data"].join("\n"); + expect(detectLocalpathUsage(spec).hasDatasetLocalpath).toBe(true); + }); + }); + + describe("ignores localpath: outside dataset block", () => { + it("dataset: with no localpath: child", () => { + const spec = [" - dataset:", " name: my-ds"].join("\n"); + expect(detectLocalpathUsage(spec)).toEqual({ + hasFileLocalpath: false, + hasDatasetLocalpath: false, + }); + }); + + it("localpath: in unrelated block", () => { + const spec = [" - dataset:", " name: my-ds", " other:", " localpath: /should/not/match"].join("\n"); + expect(detectLocalpathUsage(spec).hasDatasetLocalpath).toBe(false); + }); + }); + }); + + // ── Context tracking ───────────────────────────────────────────────────── + + describe("context tracking", () => { + it("non-list key at same indent as files: exits context", () => { + const spec = [ + " files:", + " - path: /tmp/a.sh", + " command: [bash]", + " localpath: /should/not/match", + ].join("\n"); + 
expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(false); + }); + + it("line at lesser indent exits context", () => { + const spec = [" files:", " - path: /tmp/a.sh", " image: ubuntu", " localpath: /should/not/match"].join( + "\n", + ); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(false); + }); + + it("list item at same indent stays in context", () => { + const spec = [" files:", " - path: /tmp/a.sh", " - localpath: /should/match"].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(true); + }); + + it("new files: block replaces previous context", () => { + const spec = [ + " files:", + " - path: /tmp/a.sh", + " command: [bash]", + " files:", + " - localpath: /path", + ].join("\n"); + expect(detectLocalpathUsage(spec).hasFileLocalpath).toBe(true); + }); + + it("detects both warnings in same spec", () => { + const spec = [" files:", " - localpath: /file", " inputs:", " - dataset:", " localpath: /data"].join( + "\n", + ); + expect(detectLocalpathUsage(spec)).toEqual({ + hasFileLocalpath: true, + hasDatasetLocalpath: true, + }); + }); + + it("dataset context replaces files context", () => { + const spec = [" files:", " - path: /tmp/a.sh", " - dataset:", " localpath: /data"].join("\n"); + const result = detectLocalpathUsage(spec); + expect(result.hasFileLocalpath).toBe(false); + expect(result.hasDatasetLocalpath).toBe(true); + }); + }); + + // ── Performance ────────────────────────────────────────────────────────── + + describe("performance", () => { + it("10,000-task spec without localpath: completes in under 50ms", () => { + const lines = ["workflow:", " name: test", " tasks:"]; + for (let i = 0; i < 10_000; i++) { + lines.push(` - name: task-${i}`); + lines.push(" image: ubuntu"); + lines.push(" files:"); + lines.push(" - path: /tmp/test.sh"); + lines.push(" contents: |"); + lines.push(" #!/bin/bash"); + lines.push(" echo hello"); + lines.push(" "); + } + const spec = lines.join("\n"); + + const start = performance.now(); + 
const result = detectLocalpathUsage(spec); + const elapsed = performance.now() - start; + + expect(result).toEqual({ hasFileLocalpath: false, hasDatasetLocalpath: false }); + expect(elapsed).toBeLessThan(50); + }); + }); +}); diff --git a/src/ui/src/components/submit-workflow/detect-localpath.ts b/src/ui/src/components/submit-workflow/detect-localpath.ts new file mode 100644 index 000000000..c9d196e9e --- /dev/null +++ b/src/ui/src/components/submit-workflow/detect-localpath.ts @@ -0,0 +1,112 @@ +// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION. All rights reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +// +// SPDX-License-Identifier: Apache-2.0 + +export interface LocalpathWarnings { + hasFileLocalpath: boolean; + hasDatasetLocalpath: boolean; +} + +interface ParsedKey { + name: string; + isBlock: boolean; +} + +/** Count leading whitespace characters (spaces and tabs). Returns `line.length` for blank lines. */ +function leadingWhitespace(line: string): number { + for (let i = 0; i < line.length; i++) { + if (line[i] !== " " && line[i] !== "\t") return i; + } + return line.length; +} + +function isOnlyWhitespaceAfter(line: string, from: number): boolean { + for (let i = from; i < line.length; i++) { + if (line[i] !== " " && line[i] !== "\t") return false; + } + return true; +} + +/** + * Lines deeper than the block are inside. Lines at the same level are inside + * only if they are list continuations (start with "-"). 
+ */ +function isInsideBlock(lineIndent: number, blockIndent: number, line: string): boolean { + if (lineIndent > blockIndent) return true; + return lineIndent === blockIndent && line[lineIndent] === "-"; +} + +/** Strips an optional leading list marker ("- ") before extracting the key. */ +function parseYamlKey(line: string, indent: number): ParsedKey | null { + let keyStart = indent; + + if (line[keyStart] === "-" && keyStart + 1 < line.length && line[keyStart + 1] === " ") { + for (keyStart += 2; keyStart < line.length && line[keyStart] === " "; keyStart++) {} + } + + const colonPos = line.indexOf(":", keyStart); + if (colonPos === -1) return null; + + return { + name: line.substring(keyStart, colonPos), + isBlock: isOnlyWhitespaceAfter(line, colonPos + 1), + }; +} + +/** + * Detect `localpath:` usage in a YAML workflow spec. + * + * - `hasFileLocalpath`: `localpath:` inside a `files:` block (browser cannot read local files). + * - `hasDatasetLocalpath`: `localpath:` inside a `dataset:` block (browser cannot rsync). + */ +export function detectLocalpathUsage(spec: string): LocalpathWarnings { + if (!spec.includes("localpath:")) { + return { hasFileLocalpath: false, hasDatasetLocalpath: false }; + } + + let hasFileLocalpath = false; + let hasDatasetLocalpath = false; + let context: "files" | "dataset" | null = null; + let contextIndent = 0; + + for (const line of spec.split("\n")) { + const indent = leadingWhitespace(line); + if (indent === line.length) continue; + + if (context !== null && !isInsideBlock(indent, contextIndent, line)) { + context = null; + } + + if (line.indexOf(":", indent) === -1) continue; // fast path: no key on this line + const parsed = parseYamlKey(line, indent); + if (parsed === null) continue; + + // files: must be nested (indent > 0) to exclude top-level `files:` keys that are + // not task file lists. dataset: is valid at any indent level, including root. 
+ if (parsed.name === "files" && indent > 0 && parsed.isBlock) { + context = "files"; + contextIndent = indent; + } else if (parsed.name === "dataset" && parsed.isBlock) { + context = "dataset"; + contextIndent = indent; + } else if (parsed.name === "localpath" && context !== null) { + if (context === "files") hasFileLocalpath = true; + else hasDatasetLocalpath = true; + if (hasFileLocalpath && hasDatasetLocalpath) break; + } + } + + return { hasFileLocalpath, hasDatasetLocalpath }; +} diff --git a/src/ui/src/components/submit-workflow/submit-workflow-config-panel.tsx b/src/ui/src/components/submit-workflow/submit-workflow-config-panel.tsx index 9bbd6071d..0b72cd903 100644 --- a/src/ui/src/components/submit-workflow/submit-workflow-config-panel.tsx +++ b/src/ui/src/components/submit-workflow/submit-workflow-config-panel.tsx @@ -31,7 +31,7 @@ import { Tooltip, TooltipContent, TooltipTrigger } from "@/components/shadcn/too import { CollapsibleSection } from "@/components/workflow/collapsible-section"; import { PoolPicker } from "@/components/workflow/pool-picker"; import { PriorityPicker, PRIORITY_LABELS } from "@/components/workflow/priority-picker"; -import type { LocalpathWarnings } from "@/components/submit-workflow/use-submit-workflow-form"; +import type { LocalpathWarnings } from "@/components/submit-workflow/detect-localpath"; /** Inline code token styled to stand out against the red error banner background. 
*/ function Token({ children }: { children: ReactNode }) { diff --git a/src/ui/src/components/submit-workflow/use-submit-workflow-form.ts b/src/ui/src/components/submit-workflow/use-submit-workflow-form.ts index d912a8cd9..54b4b5cd9 100644 --- a/src/ui/src/components/submit-workflow/use-submit-workflow-form.ts +++ b/src/ui/src/components/submit-workflow/use-submit-workflow-form.ts @@ -31,29 +31,7 @@ import { WorkflowPriority, useSubmitWorkflowApiPoolPoolNameWorkflowPost } from " import { useSubmitWorkflowStore } from "@/stores/submit-workflow-store"; import { useProfile } from "@/lib/api/adapter/hooks"; import { usePoolSelection } from "@/components/workflow/use-pool-selection"; - -/** - * Detect `localpath:` usage in the YAML spec. - * - * - hasFileLocalpath: `files[].localpath` — browser cannot read local files. - * - hasDatasetLocalpath: `dataset.localpath` — browser cannot rsync. - */ -function detectLocalpathUsage(spec: string): { - hasFileLocalpath: boolean; - hasDatasetLocalpath: boolean; -} { - // files[].localpath — per docs, localpath: appears as a key inside a files: list item. - // Handles both the first-key form ( - localpath:) and subsequent-key form ( localpath:). - // The (?:[ \t]+[^\n]*\n)*? intermediary only matches indented lines, so it cannot - // skip past a new top-level key and produce a false positive. - const hasFileLocalpath = /^\s+files:\s*\n(?:[ \t]+[^\n]*\n)*?[ \t]+(?:-[ \t]+)?localpath:/m.test(spec); - - // inputs[].dataset.localpath — per docs, localpath: appears as a child key of dataset:, - // optionally preceded by sibling keys such as name:. - const hasDatasetLocalpath = /(?:^|[ \t])-?[ \t]*dataset:\s*\n(?:[ \t]+[^\n]+\n)*?[ \t]+localpath:/m.test(spec); - - return { hasFileLocalpath, hasDatasetLocalpath }; -} +import { detectLocalpathUsage, type LocalpathWarnings } from "@/components/submit-workflow/detect-localpath"; /** Extract a human-readable error message from various error shapes. 
*/ function extractErrorMessage(err: unknown): string { @@ -77,11 +55,6 @@ function extractErrorMessage(err: unknown): string { return String(err); } -export interface LocalpathWarnings { - hasFileLocalpath: boolean; - hasDatasetLocalpath: boolean; -} - /** Validation result tied to the spec that was validated for freshness detection. */ interface ValidationState { spec: string; diff --git a/src/ui/src/features/pools/components/pool-gpu-summary.tsx b/src/ui/src/features/pools/components/pool-gpu-summary.tsx new file mode 100644 index 000000000..8068a9871 --- /dev/null +++ b/src/ui/src/features/pools/components/pool-gpu-summary.tsx @@ -0,0 +1,117 @@ +//SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION. All rights reserved. + +//Licensed under the Apache License, Version 2.0 (the "License"); +//you may not use this file except in compliance with the License. +//You may obtain a copy of the License at + +//http://www.apache.org/licenses/LICENSE-2.0 + +//Unless required by applicable law or agreed to in writing, software +//distributed under the License is distributed on an "AS IS" BASIS, +//WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +//See the License for the specific language governing permissions and +//limitations under the License. 
+ +//SPDX-License-Identifier: Apache-2.0 + +"use client"; + +import { memo } from "react"; +import { type LucideIcon, Server, Zap } from "lucide-react"; +import { ProgressBar } from "@/components/progress-bar"; +import { Skeleton } from "@/components/shadcn/skeleton"; +import { formatCompact } from "@/lib/utils"; +import type { Quota } from "@/lib/api/adapter/types"; + +interface PoolGpuSummaryProps { + summary: Quota; + isLoading?: boolean; +} + +function getUtilizationColor(percent: number): string { + if (percent < 65) return "bg-emerald-500"; + if (percent < 85) return "bg-amber-500"; + return "bg-red-500"; +} + +interface PoolGpuSummaryCardProps { + label: string; + icon: LucideIcon; + used: number; + free: number; + total: number; +} + +const PoolGpuSummaryCard = memo(function PoolGpuSummaryCard({ + label, + icon: Icon, + used, + free, + total, +}: PoolGpuSummaryCardProps) { + const percent = total > 0 ? (used / total) * 100 : 0; + + return ( +
+
+ + {label} + + {Math.round(percent)}% + +
+ + + +
+
+ {formatCompact(used)} + / {formatCompact(total)} + used +
+
+ {formatCompact(free)} + free +
+
+
+ ); +}); + +export const PoolGpuSummary = memo(function PoolGpuSummary({ summary, isLoading = false }: PoolGpuSummaryProps) { + return ( +
+
+ {isLoading ? ( + <> + + + + ) : ( + <> + + + + )} +
+
+ ); +}); diff --git a/src/ui/src/features/pools/components/pools-page-content.tsx b/src/ui/src/features/pools/components/pools-page-content.tsx index 34b0b34ad..155619034 100644 --- a/src/ui/src/features/pools/components/pools-page-content.tsx +++ b/src/ui/src/features/pools/components/pools-page-content.tsx @@ -48,6 +48,7 @@ import { usePoolsData } from "@/features/pools/hooks/use-pools-data"; import { usePoolsTableStore } from "@/features/pools/stores/pools-table-store"; import { usePoolsAutoRefresh } from "@/features/pools/hooks/use-pools-auto-refresh"; import { useProfile } from "@/lib/api/adapter/hooks"; +import { PoolGpuSummary } from "@/features/pools/components/pool-gpu-summary"; // ============================================================================= // Client Component @@ -91,12 +92,22 @@ export function PoolsPageContent() { // TanStack Query will refetch in the background if data is stale. // ========================================================================== - const { pools, allPools, sharingGroups, isLoading, error, refetch, total, filteredTotal, hasActiveFilters } = - usePoolsData({ - searchChips: effectiveChips, - accessiblePoolNames, - refetchInterval: autoRefresh.effectiveInterval, - }); + const { + pools, + allPools, + sharingGroups, + gpuSummary, + isLoading, + error, + refetch, + total, + filteredTotal, + hasActiveFilters, + } = usePoolsData({ + searchChips: effectiveChips, + accessiblePoolNames, + refetchInterval: autoRefresh.effectiveInterval, + }); // ========================================================================== // Pool Panel State - URL state controls both selection and mounting @@ -166,6 +177,19 @@ export function PoolsPageContent() { + {/* GPU utilization summary */} +
+ + + +
+ {/* Main pools table - receives pre-filtered data */}
- - + /> ); }); diff --git a/src/ui/src/features/pools/components/table/gpu-progress-cell.tsx b/src/ui/src/features/pools/components/table/gpu-progress-cell.tsx deleted file mode 100644 index c46244c54..000000000 --- a/src/ui/src/features/pools/components/table/gpu-progress-cell.tsx +++ /dev/null @@ -1,156 +0,0 @@ -// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION. All rights reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. -// -// SPDX-License-Identifier: Apache-2.0 - -"use client"; - -import { memo, useCallback } from "react"; -import { CirclePile } from "lucide-react"; -import { cn } from "@/lib/utils"; -import { Tooltip, TooltipContent, TooltipTrigger } from "@/components/shadcn/tooltip"; -import { InlineProgress, type DisplayMode } from "@/components/inline-progress"; -import type { Quota } from "@/lib/api/adapter/types"; - -// ============================================================================= -// Types -// ============================================================================= - -export interface GpuProgressCellProps { - /** Pool quota data */ - quota: Quota; - /** Which quota type to display */ - type: "quota" | "capacity"; - /** Display used/total or free count */ - displayMode: DisplayMode; - /** Compact mode: text only, no progress bar */ - compact?: boolean; - /** Whether this pool shares capacity with others */ - isShared?: boolean; - /** Callback when share icon is clicked - filters to show only pools in 
the same sharing group */ - onFilterBySharedPools?: () => void; -} - -// ============================================================================= -// Share Icon Component -// ============================================================================= - -interface ShareIconProps { - compact: boolean; - interactive: boolean; - onClick?: (e: React.MouseEvent | React.KeyboardEvent) => void; -} - -const ShareIcon = memo(function ShareIcon({ compact, interactive, onClick }: ShareIconProps) { - const iconSize = compact ? "h-3 w-3" : "h-3.5 w-3.5"; - - if (interactive && onClick) { - const handleKeyDown = (e: React.KeyboardEvent) => { - if (e.key === "Enter" || e.key === " ") { - e.preventDefault(); - onClick(e); - } - }; - - return ( - - - - - Show shared pools - - ); - } - - return ( - - - - - - - Show shared pools - - ); -}); - -// ============================================================================= -// Component -// ============================================================================= - -/** - * GpuProgressCell - Pool-specific progress cell for quota/capacity columns. - * - * Composes from InlineProgress and adds pool-specific share icon. - * - * @example - * ```tsx - * - * - * ``` - */ -export const GpuProgressCell = memo(function GpuProgressCell({ - quota, - type, - displayMode, - compact = false, - isShared = false, - onFilterBySharedPools, -}: GpuProgressCellProps) { - const used = type === "quota" ? quota.used : quota.totalUsage; - const total = type === "quota" ? quota.limit : quota.totalCapacity; - const free = type === "quota" ? quota.free : quota.totalFree; - const freeLabel = type === "quota" ? 
"free" : "idle"; - - const handleShareClick = useCallback( - (e: React.MouseEvent | React.KeyboardEvent) => { - e.stopPropagation(); - onFilterBySharedPools?.(); - }, - [onFilterBySharedPools], - ); - - return ( - - {isShared && ( - - )} - - ); -}); diff --git a/src/ui/src/features/pools/components/table/pool-column-defs.tsx b/src/ui/src/features/pools/components/table/pool-column-defs.tsx index aee8f2579..424e13964 100644 --- a/src/ui/src/features/pools/components/table/pool-column-defs.tsx +++ b/src/ui/src/features/pools/components/table/pool-column-defs.tsx @@ -23,11 +23,11 @@ import type { ColumnDef } from "@tanstack/react-table"; import type { Pool } from "@/lib/api/adapter/types"; -import type { DisplayMode } from "@/stores/shared-preferences-store"; -import { CheckCircle2, Wrench, XCircle } from "lucide-react"; +import { CheckCircle2, CirclePile, Wrench, XCircle } from "lucide-react"; import { cn } from "@/lib/utils"; +import { Tooltip, TooltipContent, TooltipTrigger } from "@/components/shadcn/tooltip"; import { remToPx } from "@/components/data-table/utils/column-sizing"; -import { GpuProgressCell } from "@/features/pools/components/table/gpu-progress-cell"; +import { InlineProgress } from "@/components/inline-progress"; import { PlatformPills } from "@/components/platform-pills"; import { POOL_COLUMN_SIZE_CONFIG, COLUMN_LABELS, type PoolColumnId } from "@/features/pools/lib/pool-columns"; import { getStatusDisplay, STATUS_STYLES, type StatusCategory } from "@/lib/pool-status"; @@ -44,8 +44,6 @@ const STATUS_ICONS = { // ============================================================================= export interface CreatePoolColumnsOptions { - /** Display mode for quota/capacity columns */ - displayMode: DisplayMode; /** Whether to show compact cells */ compact?: boolean; /** Map of pool names to whether they are shared */ @@ -71,19 +69,14 @@ function getMinSize(id: PoolColumnId): number { /** * Create TanStack Table column definitions for pools. 
* - * Uses plain object notation (not helper.accessor) for correct type inference. - * - * @param options - Display options and callbacks - * @returns Array of column definitions compatible with DataTable + * GPU columns are split into used (bar + fraction) and free (emerald number) + * pairs for clarity. */ export function createPoolColumns({ - displayMode, compact = false, sharingMap, filterBySharedPoolsMap, }: CreatePoolColumnsOptions): ColumnDef[] { - // TanStack handles initial sizing (defaults to 150px per column) - // We only specify minSize to prevent columns from getting too small return [ { id: "name", @@ -91,9 +84,47 @@ export function createPoolColumns({ header: COLUMN_LABELS.name, minSize: getMinSize("name"), enableSorting: true, - cell: ({ getValue }) => ( - {getValue() as string} - ), + cell: ({ row }) => { + const pool = row.original; + const isShared = sharingMap?.has(pool.name) ?? false; + const onFilterBySharedPools = filterBySharedPoolsMap?.get(pool.name); + + return ( +
+ {pool.name} + {isShared && ( + + + {onFilterBySharedPools ? ( + + ) : ( + + + + )} + + Show shared pools + + )} +
+ ); + }, }, { id: "status", @@ -135,36 +166,50 @@ export function createPoolColumns({ minSize: getMinSize("quota"), enableSorting: true, cell: ({ row }) => ( - ), }, + { + id: "quotaFree", + accessorFn: (row) => row.quota.free, + header: COLUMN_LABELS.quotaFree, + minSize: getMinSize("quotaFree"), + enableSorting: true, + cell: ({ row }) => ( + + {Math.max(0, row.original.quota.free)} + + ), + }, { id: "capacity", accessorFn: (row) => row.quota.totalUsage, header: COLUMN_LABELS.capacity, minSize: getMinSize("capacity"), enableSorting: true, - cell: ({ row }) => { - const pool = row.original; - const isShared = sharingMap?.has(pool.name) ?? false; - const onFilterBySharedPools = filterBySharedPoolsMap?.get(pool.name); - - return ( - - ); - }, + cell: ({ row }) => ( + + ), + }, + { + id: "capacityFree", + accessorFn: (row) => row.quota.totalFree, + header: COLUMN_LABELS.capacityFree, + minSize: getMinSize("capacityFree"), + enableSorting: true, + cell: ({ row }) => ( + + {Math.max(0, row.original.quota.totalFree)} + + ), }, { id: "platforms", diff --git a/src/ui/src/features/pools/components/table/pools-data-table.tsx b/src/ui/src/features/pools/components/table/pools-data-table.tsx index 88b13dec3..ab1fc583e 100644 --- a/src/ui/src/features/pools/components/table/pools-data-table.tsx +++ b/src/ui/src/features/pools/components/table/pools-data-table.tsx @@ -33,7 +33,7 @@ import { TableEmptyState } from "@/components/data-table/table-empty-state"; import { TableLoadingSkeleton, TableErrorState } from "@/components/data-table/table-states"; import { useColumnVisibility } from "@/components/data-table/hooks/use-column-visibility"; import type { SortState, ColumnSizingPreference } from "@/components/data-table/types"; -import { useDisplayMode, useCompactMode } from "@/hooks/shared-preferences-hooks"; +import { useCompactMode } from "@/hooks/shared-preferences-hooks"; import type { Pool } from "@/lib/api/adapter/types"; import type { SearchChip } from "@/stores/types"; 
import { MANDATORY_COLUMN_IDS, asPoolColumnIds, POOL_COLUMN_SIZE_CONFIG } from "@/features/pools/lib/pool-columns"; @@ -97,7 +97,6 @@ export const PoolsDataTable = memo(function PoolsDataTable({ } as const); // Shared preferences (hydration-safe) - const displayMode = useDisplayMode(); const compactMode = useCompactMode(); // Table store state @@ -116,7 +115,6 @@ export const PoolsDataTable = memo(function PoolsDataTable({ pools, sort: sortState, sharingGroups, - displayMode, }); const columnVisibility = useColumnVisibility(columnOrder, storeVisibleColumnIds); @@ -148,12 +146,11 @@ export const PoolsDataTable = memo(function PoolsDataTable({ const columns = useMemo( () => createPoolColumns({ - displayMode, compact: compactMode, sharingMap, filterBySharedPoolsMap, }), - [displayMode, compactMode, sharingMap, filterBySharedPoolsMap], + [compactMode, sharingMap, filterBySharedPoolsMap], ); // Fixed columns (not draggable) diff --git a/src/ui/src/features/pools/hooks/use-pools-data.ts b/src/ui/src/features/pools/hooks/use-pools-data.ts index b0252737e..76dcad3fa 100644 --- a/src/ui/src/features/pools/hooks/use-pools-data.ts +++ b/src/ui/src/features/pools/hooks/use-pools-data.ts @@ -34,11 +34,12 @@ import { useMemo } from "react"; import { useFilteredPools, type PoolFilterParams, type PoolMetadata } from "@/lib/api/adapter/hooks"; -import type { Pool } from "@/lib/api/adapter/types"; +import type { Pool, Quota } from "@/lib/api/adapter/types"; import type { SearchChip } from "@/stores/types"; import { chipsToParams, filterChipsByFields, type ChipMappingConfig } from "@/lib/api/chip-filter-utils"; import { filterByChips } from "@/components/filter-bar/lib/filter"; import { createPoolSearchFields } from "@/features/pools/lib/pool-search-fields"; +import { computePoolGpuSummary } from "@/features/pools/lib/pool-gpu-summary"; // ============================================================================= // Types @@ -61,6 +62,8 @@ interface UsePoolsDataReturn { 
sharingGroups: string[][]; /** Metadata for filter options (status counts, platforms, backends) */ metadata: PoolMetadata | null; + /** GPU summary for currently visible pools (deduplicates shared capacity) */ + gpuSummary: Quota; /** Whether any filters are active */ hasActiveFilters: boolean; /** Total pools before filtering */ @@ -165,11 +168,14 @@ export function usePoolsData({ // The my/all pools toggle changes scope silently (consistent with workflows/datasets). const hasActiveFilters = hasActiveChipFilters || clientOnlyChips.length > 0; + const gpuSummary = useMemo(() => computePoolGpuSummary(pools, sharingGroups), [pools, sharingGroups]); + return { pools, allPools, sharingGroups, metadata, + gpuSummary, hasActiveFilters, total, filteredTotal, diff --git a/src/ui/src/features/pools/hooks/use-sorted-pools.ts b/src/ui/src/features/pools/hooks/use-sorted-pools.ts index eb04b6b83..ff829b61d 100644 --- a/src/ui/src/features/pools/hooks/use-sorted-pools.ts +++ b/src/ui/src/features/pools/hooks/use-sorted-pools.ts @@ -33,7 +33,7 @@ import { naturalCompare } from "@/lib/utils"; // Sorting // ============================================================================= -function sortPools(pools: Pool[], sort: SortState | null, displayMode: "used" | "free"): Pool[] { +function sortPools(pools: Pool[], sort: SortState | null): Pool[] { if (!sort?.column) return pools; return [...pools].sort((a, b) => { @@ -49,14 +49,17 @@ function sortPools(pools: Pool[], sort: SortState | null, displayMode: " cmp = naturalCompare(a.backend, b.backend); break; case "quota": - // Sort by available (free) or used based on displayMode - cmp = displayMode === "free" ? a.quota.free - b.quota.free : a.quota.used - b.quota.used; + cmp = a.quota.used - b.quota.used; + break; + case "quotaFree": + cmp = a.quota.free - b.quota.free; break; case "capacity": - // Sort by total available (totalFree) or totalUsage based on displayMode - cmp = displayMode === "free" ? 
a.quota.totalFree - b.quota.totalFree : a.quota.totalUsage - b.quota.totalUsage; + cmp = a.quota.totalUsage - b.quota.totalUsage; + break; + case "capacityFree": + cmp = a.quota.totalFree - b.quota.totalFree; break; - // "platforms" and "description" are not sortable - no case needed } return sort.direction === "asc" ? cmp : -cmp; }); @@ -73,8 +76,6 @@ interface UseSortedPoolsOptions { sort: SortState | null; /** Sharing groups for building sharing map */ sharingGroups: string[][]; - /** Display mode for quota/capacity sorting */ - displayMode: "used" | "free"; } interface UseSortedPoolsResult { @@ -84,14 +85,9 @@ interface UseSortedPoolsResult { sharingMap: Map; } -export function useSortedPools({ - pools, - sort, - sharingGroups, - displayMode, -}: UseSortedPoolsOptions): UseSortedPoolsResult { +export function useSortedPools({ pools, sort, sharingGroups }: UseSortedPoolsOptions): UseSortedPoolsResult { // Sort pools - const sortedPools = useMemo(() => sortPools(pools, sort, displayMode), [pools, sort, displayMode]); + const sortedPools = useMemo(() => sortPools(pools, sort), [pools, sort]); // Build map of pools that are shared (for UI indicators) const sharingMap = useMemo(() => { diff --git a/src/ui/src/features/pools/lib/pool-columns.ts b/src/ui/src/features/pools/lib/pool-columns.ts index 0e7c73b82..278a3a7ee 100644 --- a/src/ui/src/features/pools/lib/pool-columns.ts +++ b/src/ui/src/features/pools/lib/pool-columns.ts @@ -21,26 +21,57 @@ import { COLUMN_MIN_WIDTHS_REM, COLUMN_PREFERRED_WIDTHS_REM } from "@/components // Column IDs // ============================================================================= -export type PoolColumnId = "name" | "status" | "description" | "quota" | "capacity" | "platforms" | "backend"; +export type PoolColumnId = + | "name" + | "status" + | "description" + | "quota" + | "quotaFree" + | "capacity" + | "capacityFree" + | "platforms" + | "backend"; // 
============================================================================= // Column Configuration (via factory) // ============================================================================= const poolColumnConfig = createColumnConfig({ - columns: ["name", "status", "description", "quota", "capacity", "platforms", "backend"] as const, + columns: [ + "name", + "status", + "description", + "quota", + "quotaFree", + "capacity", + "capacityFree", + "platforms", + "backend", + ] as const, labels: { name: "Pool", status: "Status", description: "Description", - quota: "Quota (GPU)", - capacity: "Capacity (GPU)", + quota: "Quota Used", + quotaFree: "Quota Free", + capacity: "Capacity Used", + capacityFree: "Capacity Free", platforms: "Platforms", backend: "Backend", }, mandatory: ["name"], - defaultVisible: ["name", "status", "description", "quota", "capacity", "platforms"], - defaultOrder: ["name", "status", "description", "quota", "capacity", "platforms", "backend"], + defaultVisible: ["name", "status", "quota", "quotaFree", "capacity", "capacityFree", "platforms"], + defaultOrder: [ + "name", + "status", + "description", + "quota", + "quotaFree", + "capacity", + "capacityFree", + "platforms", + "backend", + ], sizeConfig: [ { id: "name", @@ -62,11 +93,21 @@ const poolColumnConfig = createColumnConfig({ minWidthRem: COLUMN_MIN_WIDTHS_REM.NUMBER_WITH_PROGRESS_BAR, preferredWidthRem: COLUMN_PREFERRED_WIDTHS_REM.PROGRESS_BAR, }, + { + id: "quotaFree", + minWidthRem: COLUMN_MIN_WIDTHS_REM.TEXT_MEDIUM, + preferredWidthRem: COLUMN_PREFERRED_WIDTHS_REM.TEXT_MEDIUM, + }, { id: "capacity", - minWidthRem: COLUMN_MIN_WIDTHS_REM.NUMBER_WITH_PROGRESS_BAR + COLUMN_MIN_WIDTHS_REM.ACTIONS_ICON, + minWidthRem: COLUMN_MIN_WIDTHS_REM.NUMBER_WITH_PROGRESS_BAR, preferredWidthRem: COLUMN_PREFERRED_WIDTHS_REM.PROGRESS_BAR, }, + { + id: "capacityFree", + minWidthRem: COLUMN_MIN_WIDTHS_REM.TEXT_MEDIUM, + preferredWidthRem: COLUMN_PREFERRED_WIDTHS_REM.TEXT_MEDIUM, + }, { id: "platforms", 
minWidthRem: COLUMN_MIN_WIDTHS_REM.TEXT_TRUNCATE, diff --git a/src/ui/src/features/pools/lib/pool-gpu-summary.ts b/src/ui/src/features/pools/lib/pool-gpu-summary.ts new file mode 100644 index 000000000..a4b154708 --- /dev/null +++ b/src/ui/src/features/pools/lib/pool-gpu-summary.ts @@ -0,0 +1,58 @@ +//SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION. All rights reserved. + +//Licensed under the Apache License, Version 2.0 (the "License"); +//you may not use this file except in compliance with the License. +//You may obtain a copy of the License at + +//http://www.apache.org/licenses/LICENSE-2.0 + +//Unless required by applicable law or agreed to in writing, software +//distributed under the License is distributed on an "AS IS" BASIS, +//WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +//See the License for the specific language governing permissions and +//limitations under the License. + +//SPDX-License-Identifier: Apache-2.0 + +import type { Pool, Quota } from "@/lib/api/adapter/types"; + +/** + * Compute GPU summary for a filtered subset of pools, correctly handling + * shared hardware deduplication. + * + * - Quota fields (used/free/limit) and totalUsage: per-pool values, summed directly. + * - totalCapacity/totalFree: per-node-set values shared across pools in a + * sharing group — counted once per group if any pool in the group is visible. 
+ */ +export function computePoolGpuSummary(pools: Pool[], sharingGroups: string[][]): Quota { + let quotaUsed = 0; + let quotaFree = 0; + let quotaLimit = 0; + let totalUsage = 0; + let totalCapacity = 0; + let totalFree = 0; + + const countedGroupIndices = new Set(); + + for (const pool of pools) { + quotaUsed += pool.quota.used; + quotaFree += pool.quota.free; + quotaLimit += pool.quota.limit; + totalUsage += pool.quota.totalUsage; + + const groupIndex = sharingGroups.findIndex((g) => g.includes(pool.name)); + const isUngrouped = groupIndex === -1; + const isFirstInGroup = !isUngrouped && !countedGroupIndices.has(groupIndex); + + if (!isUngrouped) { + countedGroupIndices.add(groupIndex); + } + + if (isUngrouped || isFirstInGroup) { + totalCapacity += pool.quota.totalCapacity; + totalFree += pool.quota.totalFree; + } + } + + return { used: quotaUsed, free: quotaFree, limit: quotaLimit, totalUsage, totalCapacity, totalFree }; +} diff --git a/src/ui/src/features/workflows/list/lib/workflow-search-fields.test.ts b/src/ui/src/features/workflows/list/lib/workflow-search-fields.test.ts index 87aa55e00..00e9851e8 100644 --- a/src/ui/src/features/workflows/list/lib/workflow-search-fields.test.ts +++ b/src/ui/src/features/workflows/list/lib/workflow-search-fields.test.ts @@ -189,11 +189,8 @@ describe("STATUS_PRESETS", () => { expect(STATUS_PRESETS.running).toContain("RUNNING"); }); - it("pending preset contains PENDING", () => { - expect(STATUS_PRESETS.pending).toContain("PENDING"); - }); - - it("waiting preset contains WAITING", () => { + it("waiting preset contains PENDING and WAITING", () => { + expect(STATUS_PRESETS.waiting).toContain("PENDING"); expect(STATUS_PRESETS.waiting).toContain("WAITING"); }); @@ -218,20 +215,20 @@ describe("createPresetChips", () => { expect(chips[0].value).toBe("RUNNING"); }); - it("creates chips for pending preset", () => { + it("creates chips for pending preset (empty — PENDING is in waiting)", () => { const chips = 
createPresetChips("pending"); - expect(chips).toHaveLength(1); - expect(chips[0].field).toBe("status"); - expect(chips[0].value).toBe("PENDING"); + expect(chips).toHaveLength(0); }); - it("creates chips for waiting preset", () => { + it("creates chips for waiting preset (includes PENDING and WAITING)", () => { const chips = createPresetChips("waiting"); - expect(chips).toHaveLength(1); - expect(chips[0].field).toBe("status"); - expect(chips[0].value).toBe("WAITING"); + expect(chips).toHaveLength(2); + expect(chips.every((c) => c.field === "status")).toBe(true); + const values = chips.map((c) => c.value); + expect(values).toContain("PENDING"); + expect(values).toContain("WAITING"); }); it("creates chips for failed preset with all failure statuses", () => { diff --git a/src/ui/src/lib/api/adapter/hooks.ts b/src/ui/src/lib/api/adapter/hooks.ts index a01f57165..8d933dbd0 100644 --- a/src/ui/src/lib/api/adapter/hooks.ts +++ b/src/ui/src/lib/api/adapter/hooks.ts @@ -54,11 +54,12 @@ import { transformCredential, } from "@/lib/api/adapter/transforms"; -import type { - PoolResourcesResponse, - AllResourcesResponse, - ProfileUpdate, - CredentialCreate, +import { + EMPTY_QUOTA, + type PoolResourcesResponse, + type AllResourcesResponse, + type ProfileUpdate, + type CredentialCreate, } from "@/lib/api/adapter/types"; import { fetchPaginatedResources, @@ -84,7 +85,7 @@ export function usePools(enabled = true) { query: { enabled, select: useCallback((rawData: getPoolQuotasApiPoolQuotaGetResponse) => { - if (!rawData.data) return { pools: [], sharingGroups: [] }; + if (!rawData.data) return { pools: [], sharingGroups: [], gpuSummary: EMPTY_QUOTA }; return transformPoolsResponse(rawData.data); }, []), }, @@ -94,6 +95,7 @@ export function usePools(enabled = true) { return { pools: data?.pools ?? [], sharingGroups: data?.sharingGroups ?? [], + gpuSummary: data?.gpuSummary ?? 
EMPTY_QUOTA, isLoading, error, refetch, @@ -535,7 +537,7 @@ export function usePoolNames(enabled: boolean = true) { enabled, staleTime: QUERY_STALE_TIME_EXPENSIVE_MS, select: useCallback((rawData: getPoolQuotasApiPoolQuotaGetResponse) => { - if (!rawData.data) return { pools: [], sharingGroups: [] }; + if (!rawData.data) return { pools: [], sharingGroups: [], gpuSummary: EMPTY_QUOTA }; return transformPoolsResponse(rawData.data); }, []), }, diff --git a/src/ui/src/lib/api/adapter/transforms.test.ts b/src/ui/src/lib/api/adapter/transforms.test.ts index 944178f12..acb950220 100644 --- a/src/ui/src/lib/api/adapter/transforms.test.ts +++ b/src/ui/src/lib/api/adapter/transforms.test.ts @@ -23,6 +23,7 @@ import { transformVersionResponse, } from "@/lib/api/adapter/transforms"; import { PoolStatus, BackendResourceType } from "@/lib/api/generated"; +import { EMPTY_QUOTA } from "@/lib/api/adapter/types"; // ============================================================================= // Test fixtures - minimal data to verify transforms @@ -79,6 +80,14 @@ const mockPoolResponse = { ], }, ], + resource_sum: { + quota_used: "15", + quota_free: "105", + quota_limit: "120", + total_usage: "25", + total_capacity: "200", + total_free: "175", + }, }; const mockResourceResponse = { @@ -169,9 +178,9 @@ const mockAllResourcesResponse = { describe("transformPoolsResponse", () => { it("transforms empty response", () => { - expect(transformPoolsResponse(null)).toEqual({ pools: [], sharingGroups: [] }); - expect(transformPoolsResponse(undefined)).toEqual({ pools: [], sharingGroups: [] }); - expect(transformPoolsResponse({})).toEqual({ pools: [], sharingGroups: [] }); + expect(transformPoolsResponse(null)).toEqual({ pools: [], sharingGroups: [], gpuSummary: EMPTY_QUOTA }); + expect(transformPoolsResponse(undefined)).toEqual({ pools: [], sharingGroups: [], gpuSummary: EMPTY_QUOTA }); + expect(transformPoolsResponse({})).toEqual({ pools: [], sharingGroups: [], gpuSummary: EMPTY_QUOTA }); }); 
it("transforms pools from node_sets", () => { @@ -247,6 +256,18 @@ describe("transformPoolsResponse", () => { const result = transformPoolsResponse(mockPoolResponse); expect(result.pools[1].defaultExitActions).toEqual({}); }); + + it("preserves resource_sum as gpuSummary", () => { + const result = transformPoolsResponse(mockPoolResponse); + expect(result.gpuSummary).toEqual({ + used: 15, + free: 105, + limit: 120, + totalUsage: 25, + totalCapacity: 200, + totalFree: 175, + }); + }); }); describe("transformPoolDetail", () => { diff --git a/src/ui/src/lib/api/adapter/transforms.ts b/src/ui/src/lib/api/adapter/transforms.ts index 4caaa9c70..5059be4a5 100644 --- a/src/ui/src/lib/api/adapter/transforms.ts +++ b/src/ui/src/lib/api/adapter/transforms.ts @@ -41,21 +41,22 @@ import { type ResourcesEntry, } from "@/lib/api/generated"; -import type { - Pool, - PoolsResponse, - Quota, - PlatformConfig, - GpuResources, - TimeoutConfig, - Resource, - PoolResourcesResponse, - AllResourcesResponse, - ResourceCapacity, - PoolMembership, - Version, - UserProfile, - Credential, +import { + EMPTY_QUOTA, + type Pool, + type PoolsResponse, + type Quota, + type PlatformConfig, + type GpuResources, + type TimeoutConfig, + type Resource, + type PoolResourcesResponse, + type AllResourcesResponse, + type ResourceCapacity, + type PoolMembership, + type Version, + type UserProfile, + type Credential, } from "@/lib/api/adapter/types"; import { naturalCompare } from "@/lib/utils"; @@ -222,7 +223,7 @@ export function transformPoolsResponse(rawResponse: unknown): PoolsResponse { const response = rawResponse as PoolResponse | undefined; if (!response?.node_sets) { - return { pools: [], sharingGroups: [] }; + return { pools: [], sharingGroups: [], gpuSummary: EMPTY_QUOTA }; } const pools: Pool[] = []; @@ -242,7 +243,7 @@ export function transformPoolsResponse(rawResponse: unknown): PoolsResponse { } } - return { pools, sharingGroups }; + return { pools, sharingGroups, gpuSummary: 
transformQuota(response.resource_sum) }; } /** diff --git a/src/ui/src/lib/api/adapter/types.ts b/src/ui/src/lib/api/adapter/types.ts index afe85e0f2..e28f728f4 100644 --- a/src/ui/src/lib/api/adapter/types.ts +++ b/src/ui/src/lib/api/adapter/types.ts @@ -55,6 +55,8 @@ export interface Quota { totalFree: number; } +export const EMPTY_QUOTA: Quota = { used: 0, free: 0, limit: 0, totalUsage: 0, totalCapacity: 0, totalFree: 0 }; + /** * Platform configuration within a pool. * Contains task configuration settings. @@ -127,6 +129,11 @@ export interface PoolsResponse { * Example: [["pool-a", "pool-b"], ["pool-c", "pool-d"]] */ sharingGroups: string[][]; + /** + * Aggregate GPU metrics across all pools (from backend's resource_sum). + * Quota fields sum per-pool; capacity fields are deduplicated per node_set. + */ + gpuSummary: Quota; } // ============================================================================= diff --git a/src/ui/src/lib/api/status-metadata.generated.ts b/src/ui/src/lib/api/status-metadata.generated.ts index ce68fc32d..87de47ebf 100644 --- a/src/ui/src/lib/api/status-metadata.generated.ts +++ b/src/ui/src/lib/api/status-metadata.generated.ts @@ -196,7 +196,7 @@ export const TASK_STATUS_METADATA: Record = export const WORKFLOW_STATUS_METADATA: Record = { PENDING: { - category: "pending", + category: "waiting", isTerminal: false, isOngoing: false, isFailed: false, diff --git a/src/ui/src/mocks/generators/event-generator.ts b/src/ui/src/mocks/generators/event-generator.ts index 30c86bd3a..7a0e636ee 100644 --- a/src/ui/src/mocks/generators/event-generator.ts +++ b/src/ui/src/mocks/generators/event-generator.ts @@ -222,7 +222,7 @@ export class EventGenerator { "Warning", "ErrImagePull", taskName, - "Failed to pull image: manifest not found", + `Failed to pull image "nvcr.io/nvidia/invalid:latest": rpc error: code=Unknown desc=failed to pull and unpack image "nvcr.io/nvidia/invalid:latest": failed to resolve reference "nvcr.io/nvidia/invalid:latest": 
failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://nvcr.io/proxy_auth?scope=repository%3Anvidia%2Finvalid%3Apull&service=nvcr.io: 401 Unauthorized`, ), ); currentTime += faker.number.int({ min: 10000, max: 20000 }); @@ -232,7 +232,7 @@ export class EventGenerator { "Warning", "ImagePullBackOff", taskName, - "Back-off pulling image: manifest not found", + "Back-off pulling image: ErrImagePull:sha256:a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2/nvcr.io/nvidia/invalid:latest:manifest_unknown:manifest_unknown_to_registry", ), ); return events; @@ -270,7 +270,7 @@ export class EventGenerator { "Warning", "Evicted", taskName, - "Pod evicted due to node memory pressure", + `The node ${node} was under DiskPressure condition; pod ${taskName} (UID: ${faker.string.uuid()}) was evicted because the node's ephemeral-storage usage exceeded the eviction threshold. Usage: 92.4Gi of 100Gi limit. Container training was using 48.2Gi of local ephemeral storage for checkpoint files and model weights`, ), ); } else if (fullStatus.toString().includes("OOM")) { @@ -280,7 +280,7 @@ export class EventGenerator { "Warning", "OOMKilled", taskName, - "Container training exceeded memory limit (32Gi)", + `Container training in pod ${taskName} exceeded memory limit: the container was using 33.8Gi against a limit of 32Gi. The kernel OOM killer terminated process pid=4821 (python3) with signal SIGKILL(9). Current memory usage breakdown: RSS=32.1Gi, Cache=1.7Gi, Swap=0B. 
Peak memory usage recorded at container_memory_working_set_bytes=${faker.number.int({ min: 33000000000, max: 35000000000 })}`, ), ); currentTime += faker.number.int({ min: 1000, max: 3000 }); @@ -290,7 +290,7 @@ export class EventGenerator { "Warning", "BackOff", taskName, - `Back-off restarting failed container training in pod ${taskName}`, + `Back-off restarting failed container training in pod ${taskName}: restart_count=3 last_exit_code=137 reason=OOMKilled back-off_delay=40s container_id=containerd://a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2`, ), ); } else if (fullStatus === TaskGroupStatus.FAILED_START_ERROR) { @@ -300,7 +300,7 @@ export class EventGenerator { "Warning", "BackOff", taskName, - "Container exited with code 1 (error)", + `Error from container runtime: OCI runtime create failed: runc create failed: unable to start container process: exec: "/usr/local/bin/entrypoint.sh": permission denied: unknown. Container_id=containerd://sha256:f9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e4`, ), ); currentTime += faker.number.int({ min: 5000, max: 10000 }); @@ -310,18 +310,19 @@ export class EventGenerator { "Warning", "CrashLoopBackOff", taskName, - "Container is in crash loop, back-off restarting", + `Back-off restarting container training in pod ${taskName}: the container has crashed 5 times consecutively with exit code 1 over the last 240 seconds. Back-off delay increasing exponentially: 10s, 20s, 40s, 80s, 160s. 
Last known container state: terminated at ${new Date(currentTime).toISOString()} with reason=Error`, ), ); } else { // Generic failure + const exitCode = faker.helpers.arrayElement([1, 137, 139]); events.push( this.createTaskEvent( new Date(currentTime), "Warning", "Failed", taskName, - `Container terminated with exit code ${faker.helpers.arrayElement([1, 137, 139])}`, + `Container terminated with exit code ${exitCode}: the main process (pid 1) in container training received signal ${exitCode === 137 ? "SIGKILL(9)" : exitCode === 139 ? "SIGSEGV(11)" : "EXIT(1)"} after running for ${faker.number.int({ min: 30, max: 600 })}s. Last 512 bytes of stderr: RuntimeError:CUDA_error:an_illegal_memory_access_was_encountered_at_/opt/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1261:block=[256,1,1],thread=[128,0,0]_Assertion_srcIndex 0 + ON CONFLICT (role_name, external_role) DO NOTHING RETURNING 1 ) - SELECT policies, immutable, is_new_role FROM role_result; + SELECT policies, immutable, is_new_role FROM role_upsert; ''' result = database.execute_fetch_command( insert_cmd, ( - # existing_role params - self.name, - # role_insert params + # role_upsert params self.name, self.description, [json.dumps(policy.to_dict()) for policy in self.policies], False, self.sync_mode.value, - # role_update params - self.description, - [json.dumps(policy.to_dict()) for policy in self.policies], - self.sync_mode.value, - self.name, # sync_config params - external_roles_provided, - external_roles_provided, - external_roles_list, - self.name, + external_roles_provided, # first %s in sync_config (should_sync) + external_roles_provided, # WHEN %s in CASE + external_roles_list, # THEN %s::text[] + self.name, # ELSE ARRAY[%s] (default mapping) # delete_mappings params - self.name, + self.name, # WHERE role_name = %s # insert_mappings params - self.name, + self.name, # SELECT %s, unnest(...) 
), True ) diff --git a/src/utils/job/task.py b/src/utils/job/task.py index 76d66213d..07c33535d 100644 --- a/src/utils/job/task.py +++ b/src/utils/job/task.py @@ -78,12 +78,14 @@ def create_login_dict(user: str, url: str, token: str | None = None, - refresh_endpoint: str | None = None) -> Dict: + refresh_endpoint: str | None = None, + refresh_token: str | None = None) -> Dict: if token: return { 'token_login': { 'id_token': token, - 'refresh_url': refresh_endpoint + 'refresh_url': refresh_endpoint, + 'refresh_token': refresh_token }, 'url': url, 'osmo_token': True, @@ -2424,10 +2426,10 @@ def convert_to_pod_spec( query = urlencode({'workflow_id': self.workflow_id, 'group_name': self.name, 'task_name': task_spec.name, - 'retry_id': task_obj.retry_id, - 'refresh_token': refresh_token}) + 'retry_id': task_obj.retry_id}) refresh_url = f'{service_url}/api/auth/jwt/refresh_token?{query}' - login_yaml = create_login_dict(user, service_url, token, refresh_url) + login_yaml = create_login_dict(user, service_url, token, refresh_url, + refresh_token=refresh_token) user_config_yaml = create_config_dict(data_endpoints)
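The core logic this PR introduces in `pool-gpu-summary.ts` — summing per-pool quota fields directly while counting shared node-set capacity only once per sharing group — can be sketched standalone. This is a hypothetical re-derivation for illustration, not the real module: the `Quota`/`Pool` shapes are trimmed to the fields used here, and `computeSummary` mirrors (but is not) the exported `computePoolGpuSummary`.

```typescript
// Minimal sketch of shared-capacity deduplication: quota fields are per-pool
// and sum directly; totalCapacity/totalFree describe the underlying node set,
// which several pools may share, so each sharing group is counted once.
interface Quota {
  used: number;
  free: number;
  limit: number;
  totalUsage: number;
  totalCapacity: number;
  totalFree: number;
}

interface Pool {
  name: string;
  quota: Quota;
}

function computeSummary(pools: Pool[], sharingGroups: string[][]): Quota {
  const sum: Quota = { used: 0, free: 0, limit: 0, totalUsage: 0, totalCapacity: 0, totalFree: 0 };
  const countedGroups = new Set<number>();

  for (const pool of pools) {
    // Per-pool metrics: always summed.
    sum.used += pool.quota.used;
    sum.free += pool.quota.free;
    sum.limit += pool.quota.limit;
    sum.totalUsage += pool.quota.totalUsage;

    // Capacity metrics: count a sharing group's hardware only on the first
    // visible pool from that group; ungrouped pools are counted directly.
    const groupIndex = sharingGroups.findIndex((g) => g.includes(pool.name));
    if (groupIndex === -1 || !countedGroups.has(groupIndex)) {
      if (groupIndex !== -1) countedGroups.add(groupIndex);
      sum.totalCapacity += pool.quota.totalCapacity;
      sum.totalFree += pool.quota.totalFree;
    }
  }
  return sum;
}

// Two pools sharing one 100-GPU node set: quotas sum across both pools,
// but the shared capacity is counted once.
const q = (used: number): Quota => ({
  used,
  free: 10 - used,
  limit: 10,
  totalUsage: used,
  totalCapacity: 100,
  totalFree: 100 - used,
});
const summary = computeSummary(
  [{ name: "a", quota: q(3) }, { name: "b", quota: q(4) }],
  [["a", "b"]],
);
console.log(summary.limit, summary.totalCapacity); // 20 100
```

Note that without the `countedGroups` guard the shared node set would be double-counted (200 instead of 100), which is exactly the filtered-view inaccuracy the deduplication avoids.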