Merged
92 changes: 0 additions & 92 deletions .claude/skills/osmo-skill/agents/workflow-expert.md

This file was deleted.

6 changes: 3 additions & 3 deletions docs/deployment_guide/appendix/keycloak_setup.rst
@@ -18,7 +18,7 @@
.. _keycloak_setup:

================================================
Keycloak as an Identity Provider for OSMO
Keycloak as a sample IdP
================================================

This guide describes how to deploy `Keycloak <https://www.keycloak.org/>`_ and configure it as the identity provider (IdP) for OSMO. Keycloak acts as an authentication broker, allowing OSMO to authenticate users through various identity providers (LDAP, SAML, social logins) while providing centralized group and role management.
@@ -396,7 +396,7 @@ The typical workflow for setting up access control is:
2. Create groups in Keycloak
3. Assign roles to groups
4. Add users to groups (manually or via identity provider mappings)
5. Create matching pools in OSMO
5. Create matching pools
6. Verify access

.. _keycloak_create_roles:
@@ -551,7 +551,7 @@ User Cannot Access Pool

**Solutions**:

1. **Verify Role Policy in OSMO**: Ensure the corresponding role has been created in OSMO. Follow the steps in :ref:`troubleshooting_roles_policies`.
1. **Verify Role Policy**: Ensure the corresponding role has been created. Follow the steps in :ref:`troubleshooting_roles_policies`.

2. **Verify Role Names**: Pool access roles must start with ``osmo-`` prefix (see :ref:`role_naming_for_pools`). Pool names must match the role suffix. Example: Role ``osmo-team1`` will make pools named ``team1*`` visible.

File renamed without changes.
155 changes: 94 additions & 61 deletions .claude/skills/osmo-skill/SKILL.md → skills/osmo-agent/SKILL.md
@@ -20,8 +20,8 @@ common OSMO CLI use cases.

The `agents/` directory contains instructions for specialized subagents. Read them when you need to spawn the relevant subagent.

- `agents/workflow-expert.md` — expert for workflow generation, resource check, submission, failure diagnosis
- `agents/logs-reader.md` - expert for fetching and reading logs, extracting important information for monitoring and failure diagnosis.
- `agents/workflow-expert.md` — workflow generation, resource check, submission, failure diagnosis
- `agents/logs-reader.md` — log fetching and summarization for monitoring and failure diagnosis

The `references/` directory has additional documentation:

@@ -31,6 +31,18 @@ The `references/` directory has additional documentation:

---

## Intent Routing

- Asks about resources, pools, GPUs, or quota → Check Available Resources
- Wants to submit a job (simple, no monitoring) → Generate and Submit a Workflow
- Wants to submit + monitor + handle failures → Orchestrate a Workflow End-to-End
- Asks about a workflow's status or logs → Check Workflow Status
- Wants to list recent workflows → List Workflows
- Asks what a workflow does → Explain What a Workflow Does
- Wants to publish a workflow as an app → Create an App

---

## Use Case: Check Available Resources

**When to use:** The user asks what resources, nodes, GPUs, or pools are available
@@ -115,7 +127,8 @@ Derive GPU type from pool names when possible:
**When to use:** The user wants to submit a job to run on OSMO (e.g. "submit a workflow
to run SDG", "run RL training for me", "submit this yaml to OSMO").

Evaluate the complexity of the user's request: if user also wants monitoring, debugging workflows, reporting results, or the workflow complexity is too high, refer to `Orchestrate a Workflow End-to-End` use case to delegate this to a sub-agent instead.
If the user also wants monitoring, debugging, or reporting results, use the
"Orchestrate a Workflow End-to-End" use case instead.

### Steps

@@ -234,7 +247,8 @@
## Use Case: Check Workflow Status

**When to use:** The user asks about the status or logs of a workflow (e.g. "what's the
status of workflow abc-123?", "is my workflow done?", "show me the logs for xyz").
status of workflow abc-123?", "is my workflow done?", "show me the logs for xyz",
"show me the resource usage for my workflow", "give me the Kubernetes dashboard link").
Also used as the polling step when monitoring a workflow during end-to-end orchestration.

### Steps
@@ -243,6 +257,9 @@
```
osmo workflow query <workflow name> --format-type json
```
**Cache the JSON result for the rest of the conversation.** If you have already queried
this workflow with `osmo workflow query` earlier in the conversation, reuse that JSON
— do not query again just to extract a field.
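A minimal sketch of this caching pattern, assuming a POSIX shell and `python3` for JSON field extraction. The heredoc payload and cache path are hypothetical stand-ins for real `osmo workflow query` output:

```shell
# Hypothetical sketch: query once, cache the JSON to a file, reuse it later.
# The heredoc simulates `osmo workflow query <name> --format-type json` output.
CACHE="${TMPDIR:-/tmp}/osmo-wf-cache.json"
cat > "$CACHE" <<'EOF'
{"status": "RUNNING", "grafana_url": "https://grafana.example.com/d/abc", "kubernetes_dashboard": ""}
EOF

# Later in the conversation, read fields from the cache instead of re-querying:
status=$(python3 -c "import json, sys; print(json.load(open(sys.argv[1]))['status'])" "$CACHE")
echo "Cached status: $status"
```

The same cache serves every later field lookup (`grafana_url`, `kubernetes_dashboard`, output datasets), so the CLI is hit once per status check at most.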

2. **Get recent logs** — Choose the log-fetching method based on task count
(this rule applies everywhere logs are needed — monitoring, failure diagnosis, etc.):
@@ -256,82 +273,98 @@
   - Concisely summarize what the logs show — what stage the job is at, any errors,
     or whether it completed successfully
- If the workflow failed, highlight the error and suggest next steps if possible
- **If the workflow is COMPLETED and has output datasets, you MUST ask this
explicit question before ending your response:**
`Would you like me to download the output dataset now?`
Also ask whether they want a specific output folder (default to `~/` if not).
Then run the download yourself:
- **Resource usage / Grafana link:** If the user asks about resource usage, GPU
utilization, or metrics for this workflow, extract `grafana_url` from the query
JSON. If present, render it as a clickable link:
`[View resource usage in Grafana](<grafana_url>)`
If the field is empty or null, tell the user: "The Grafana resource usage link is
not available for this workflow."
- **Kubernetes dashboard link:** If the user asks for the Kubernetes dashboard,
pod details, or a k8s link, extract `kubernetes_dashboard` from the query JSON.
If present, render it as a clickable link:
`[Open Kubernetes dashboard](<kubernetes_dashboard>)`
If the field is empty or null, tell the user: "The Kubernetes dashboard link is
not available for this workflow."
- Proactively include both links in any detailed status report (e.g. when the
workflow is RUNNING or has just COMPLETED) — users often want them without
explicitly asking. If a field is empty or null, note it as not available rather
than silently omitting it.
- **If PENDING** (or the user asks why it isn't scheduling), run:
```
osmo dataset download <dataset_name> <path>
osmo workflow events <workflow name>
```
Use `~/` as the output path if the user doesn't specify one.

- **After the dataset download question above**, if the workflow is COMPLETED,
also ask if the user would like to create an
OSMO app for it. Suggest a name derived from the workflow name (e.g. workflow
`sdg-run-42` → app name `sdg-run-42`) and generate a one-sentence description
based on what the workflow does. If the user agrees (or provides their own name),
follow the "Create an App" use case below.
- **When monitoring multiple workflows** that all complete from the same spec, offer
app creation once (not per workflow) after all workflows reach a terminal state.
Since they share the same YAML, a single app covers all runs. Do not skip this
offer just because you were in a batch monitoring loop.

**If the workflow is PENDING** (or the user asks why it isn't scheduling), run:
```
osmo workflow events <workflow name>
```
These are Kubernetes pod conditions and cluster events — translate them into plain
language without Kubernetes jargon (e.g. "there aren't enough free GPUs in the pool
to schedule your job" rather than "Insufficient nvidia.com/gpu"). Also direct the
user to check resource availability in the pool their workflow is waiting in:
Translate Kubernetes events into plain language (e.g. "there aren't enough free
GPUs in the pool" rather than "Insufficient nvidia.com/gpu"). Also check:
```
osmo resource list -p <pool>
```
- If COMPLETED, proceed to Step 4.
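The Grafana and Kubernetes dashboard link-rendering rule can be sketched as a small helper. This is illustrative only — in practice the URL values come from the cached query JSON, and the example URL is hypothetical:

```shell
# Hypothetical helper: emit a markdown link when the field has a value,
# otherwise report the link as not available (never silently omit it).
render_link() {
  label="$1"; url="$2"
  if [ -n "$url" ] && [ "$url" != "null" ]; then
    printf '[%s](%s)\n' "$label" "$url"
  else
    printf 'The %s link is not available for this workflow.\n' "$label"
  fi
}

render_link "View resource usage in Grafana" "https://grafana.example.com/d/abc"
render_link "Open Kubernetes dashboard" ""
```

Treating empty and `null` values the same way keeps the fallback message consistent regardless of how the backend serializes a missing link.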

4. **Handle completed workflows:**

Offer the output dataset for download:
`Would you like me to download the output dataset now?`
Ask whether they want a specific output folder (default to `~/`). Then run:
```
osmo resource list -p <pool>
osmo dataset download <dataset_name> <path>
```

Also offer to create an OSMO app. Suggest a name derived from the workflow name
(e.g. `sdg-run-42` → app name `sdg-run-42`) and generate a one-sentence description.
If the user agrees, follow the "Create an App" use case.

When monitoring multiple workflows from the same spec, offer app creation once
(not per workflow) after all reach a terminal state. Do not skip this offer
just because you were in a batch monitoring loop.

---

## Use Case: Orchestrate a Workflow End-to-End

**When to use:** The user wants to create workflow, submit and monitor it to completion,
or requests an autonomous workflow cycle (e.g. "train GR00T on my data", "create a SDG workflow and run it",
"submit and monitor my workflow", "run end-to-end training", "submit this and
tell me when it's done").

### Phase-Split Pattern
**When to use:** The user wants to create a workflow, submit it, and monitor it to
completion (e.g. "train GR00T on my data", "submit and monitor my workflow",
"run end-to-end training", "submit this and tell me when it's done").

The lifecycle is split between the `/agents/workflow-expert.md` subagent (workflow generation creation, resource check, submission, failure diagnosis) and **you** (live monitoring so the user sees real-time updates). Follow these steps exactly:
### Steps

#### Step 1: Spawn a `/agents/workflow-expert.md` subagent for setup and submission
The lifecycle is split between the `workflow-expert` subagent (workflow generation,
resource check, submission, failure diagnosis) and **you** (live monitoring so the
user sees real-time updates).

Spawn the `/agents/workflow-expert.md` subagent. Ask it to **write workflow YAML if needed, check resources and submit the workflow only**. Do NOT ask it to monitor, poll status, or report results — that is your job.
1. **Spawn the workflow-expert subagent for setup and submission.**

Example prompt:
> Create a workflow based on user's request, if any. Check resources first, then submit the workflow to an available resource pool. Return the workflow ID when done.
Ask it to **write workflow YAML if needed, check resources, and submit only**.
Do NOT ask it to monitor, poll status, or report results — that is your job.

The subagent returns: workflow ID, pool name, and OSMO Web link.
Example prompt:
> Create a workflow based on user's request, if any. Check resources first,
> then submit the workflow to an available resource pool. Return the workflow
> ID when done.

#### Step 2: Monitor the workflow inline (you do this — user sees live updates)
The subagent returns: workflow ID, pool name, and OSMO Web link.

After getting the workflow ID, use the "Check Workflow Status" use case to
poll and report. Repeat until a terminal state is reached.
2. **Monitor the workflow inline (you do this — user sees live updates).**

Report each state transition to the user:
- `Status: SCHEDULING (queued 15s)`
- `Workflow transitioned: SCHEDULING → RUNNING`
- `Status: RUNNING (task "train" active, 2m elapsed)`
Use the "Check Workflow Status" use case to poll and report. Repeat until a
terminal state is reached. Adjust the polling interval based on how long you
expect the workflow to take — poll more frequently for short jobs (every 10-15s)
and less frequently for long training runs (every 30-60s). Report each state
transition to the user:
- `Status: SCHEDULING (queued 15s)`
- `Workflow transitioned: SCHEDULING → RUNNING`
- `Status: RUNNING (task "train" active, 2m elapsed)`
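   The polling loop above can be sketched as follows; `fetch_status` is a stub standing in for `osmo workflow query` plus field extraction, so the sketch runs without a live cluster:

   ```shell
   # Hypothetical monitoring loop. fetch_status stubs the real status query;
   # it walks a canned SCHEDULING -> RUNNING -> COMPLETED sequence.
   i=0
   fetch_status() {
     i=$((i + 1))
     case "$i" in
       1) STATUS="SCHEDULING" ;;
       2) STATUS="RUNNING" ;;
       *) STATUS="COMPLETED" ;;
     esac
   }

   prev=""
   while :; do
     fetch_status
     if [ -n "$prev" ] && [ "$STATUS" != "$prev" ]; then
       echo "Workflow transitioned: $prev -> $STATUS"
     else
       echo "Status: $STATUS"
     fi
     prev="$STATUS"
     case "$STATUS" in COMPLETED|FAILED) break ;; esac
     # A real loop would `sleep 15` for short jobs or `sleep 60` for long runs.
   done
   ```

   Note that `fetch_status` sets a variable rather than echoing inside `$(...)`, so the call counter survives between polls.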

#### Step 3: Handle the outcome
3. **Handle the outcome.**

**If COMPLETED:** Report results — workflow ID, OSMO Web link, output datasets.
In the same completion message, ask: `Would you like me to download the output dataset now?`
Then follow the COMPLETED handling in "Check Workflow Status".
**If COMPLETED:** Report results — workflow ID, OSMO Web link, output datasets.
Then follow Step 4 of "Check Workflow Status" (download offer + app creation).

**If FAILED:** First, fetch logs using the log-fetching rule from "Check Workflow Status"
Step 2 (1 task = inline, 2+ tasks = delegate to logs-reader subagents). Then resume the
`workflow-expert` subagent (use the `resume` parameter with the agent ID from Step 1)
and pass the logs summary: "Workflow <id> FAILED. Here is the logs summary: <summary>.
Diagnose and fix." It returns a new workflow ID. Resume monitoring from Step 2. Max 3
retries before asking the user for guidance.
**If FAILED:** First, fetch logs using the log-fetching rule from "Check Workflow
Status" Step 2 (1 task = inline, 2+ tasks = delegate to logs-reader subagents).
Then resume the `workflow-expert` subagent (use the `resume` parameter with the
agent ID from Step 1) and pass the logs summary: "Workflow <id> FAILED. Here is
the logs summary: <summary>. Diagnose and fix." It returns a new workflow ID.
Resume monitoring from Step 2. Max 3 retries before asking the user for guidance.
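   The retry cap can be sketched like this; `diagnose_and_resubmit` is a hypothetical stub for resuming the workflow-expert subagent, wired to keep failing so the cap is exercised:

   ```shell
   # Hypothetical retry-cap sketch: at most 3 automatic resubmissions,
   # then hand control back to the user.
   MAX_RETRIES=3
   retries=0
   result="FAILED"

   diagnose_and_resubmit() {
     # Stub: a real run would resume the workflow-expert subagent, receive a
     # new workflow ID, and re-monitor it. Here it always fails.
     result="FAILED"
   }

   while [ "$result" = "FAILED" ] && [ "$retries" -lt "$MAX_RETRIES" ]; do
     retries=$((retries + 1))
     echo "Retry $retries of $MAX_RETRIES: diagnosing and resubmitting..."
     diagnose_and_resubmit
   done

   if [ "$result" = "FAILED" ]; then
     echo "Still failing after $MAX_RETRIES retries; asking the user for guidance."
   fi
   ```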

---

@@ -1,5 +1,7 @@
# OSMO Logs Reader Agent

> Spawn a general-purpose subagent and pass these instructions as the prompt.

You are a subagent invoked by the main OSMO agent. Your sole job is to fetch
and summarize logs for a specific workflow, then return a concise digest that
the main agent can use without holding large raw logs in context.