Commit 34f261d

Merge pull request #265 from NVIDIA/release/0.4
Forward-merge release/0.4 into main
2 parents b8cff5c + 5ac870c commit 34f261d

3 files changed

Lines changed: 51 additions & 5 deletions

docs/observability-plugin/atif.mdx

Lines changed: 3 additions & 2 deletions
@@ -174,8 +174,9 @@ local→remote transition keeps object names stable. A trailing `/` is added to
 
 If an upload fails for a given destination, that destination is recorded as
 unhealthy and skipped on later trajectories. The other destinations continue
-to receive writes. All recorded sink failures are joined into the dispatcher's
-last-error result on plugin teardown.
+to receive writes. Fatal dispatcher failures, such as trajectory serialization
+failures, stop later ATIF observation and are reported during plugin teardown;
+per-destination sink failures are isolated to the failed sink.
 
 ## Expected Output
 
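The distinction this hunk documents, between isolated sink failures and fatal dispatcher failures, can be sketched roughly as follows. This is a minimal illustration only: `SinkDispatcher`, its methods, and the sink names are hypothetical and are not taken from the NeMo Relay source.

```python
import json


class SinkDispatcher:
    """Hypothetical dispatcher, not the NeMo Relay implementation."""

    def __init__(self, sinks):
        self.sinks = dict(sinks)   # name -> callable(payload: str)
        self.unhealthy = set()     # sinks that failed once and are now skipped
        self.fatal_error = None    # set when the dispatcher itself fails

    def dispatch(self, trajectory):
        if self.fatal_error is not None:
            return  # a fatal failure stops later observation
        try:
            payload = json.dumps(trajectory)
        except (TypeError, ValueError) as exc:
            self.fatal_error = exc  # serialization failure is fatal
            return
        for name, write in self.sinks.items():
            if name in self.unhealthy:
                continue  # skip sinks that already failed
            try:
                write(payload)
            except Exception:
                self.unhealthy.add(name)  # isolate the failure to this sink

    def teardown(self):
        # Only the fatal error is reported here; sink failures stay isolated.
        return self.fatal_error


received = []


def failing_sink(payload):
    raise OSError("upload failed")


dispatcher = SinkDispatcher({"remote": failing_sink, "local": received.append})
dispatcher.dispatch({"step": 1})        # "remote" fails, "local" succeeds
dispatcher.dispatch({"step": 2})        # "remote" is skipped
dispatcher.dispatch({"bad": object()})  # not JSON-serializable: fatal
dispatcher.dispatch({"step": 3})        # ignored after the fatal error
```

In this sketch the healthy sink keeps receiving writes until the serialization failure, after which nothing further is dispatched and `teardown()` returns the stored error.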

docs/observability-plugin/configuration.mdx

Lines changed: 33 additions & 0 deletions
@@ -125,6 +125,39 @@ Include only the sections you want to configure. In layered `plugins.toml`
 files, omission inherits lower-precedence values; write `enabled = false` to
 disable an inherited section.
 
+## Failure Behavior
+
+NeMo Relay treats exporter configuration and exporter delivery failures
+differently. Configuration and activation failures are fail-closed for the
+observability setup: validation or initialization returns an error before the
+new plugin configuration becomes active. Runtime delivery failures are
+fail-open for application work: the tool, LLM, or agent run continues while the
+affected exporter records, logs, or reports the delivery problem.
+
+| Failure | Behavior |
+|---|---|
+| Invalid `plugins.toml`, duplicate component kinds, malformed component shapes, unsupported values, unavailable exporter features, ATOF file-open failures, invalid ATOF endpoint config, unavailable ATOF streaming support, or ATOF endpoint worker startup failures | Validation or initialization fails. If a previous plugin configuration was active, NeMo Relay attempts to restore it after a failed replacement. |
+| ATOF event serialization or file write/flush failure after activation | Application work continues. The exporter stores the failure, stops accepting later events for that file, and returns the stored error from `force_flush()` or `shutdown()`. |
+| ATOF streaming endpoint connection or send failure after activation | File output and other already-started endpoints continue. Endpoint failures are logged with the endpoint index; endpoint flush and close timeouts are logged instead of blocking shutdown indefinitely. |
+| ATIF local file write, HTTP storage, or S3-compatible storage failure | Application work continues. The failed sink is recorded as unhealthy and skipped for later trajectories. Other configured sinks continue to receive writes. |
+| ATIF dispatcher serialization or subscriber-management failure | The ATIF dispatcher records a fatal exporter error and stops observing later ATIF events. Other observability sections continue to run. |
+| OpenTelemetry or OpenInference construction failure | Plugin initialization fails before the subscriber is registered. |
+| OpenTelemetry or OpenInference export failure after registration | Application work continues. The OTLP exporter reports failures through its runtime logging and flush or shutdown path. |
+
+Missing or delayed telemetry is represented as absence of exporter output, not
+as synthetic success or failure events. NeMo Relay does not backfill events for
+subscribers that register late. If the plugin is cleared while an agent scope
+is still open, the ATIF dispatcher writes the partial trajectory it has already
+observed.
+
+Use `nemo-relay doctor` to validate local ATOF and ATIF output directories,
+OTLP HTTP endpoints, and ATOF streaming endpoints. Validate ATIF remote storage
+destinations, including HTTP storage and S3-compatible storage, with exporter
+logs and backend-side checks because `nemo-relay doctor` does not probe
+`atif.storage`. For local artifact paths, verify that the running process can
+create the output directory and that teardown calls `plugin.clear()` or
+`clear_plugin_configuration()` before the process exits.
+
 ## Per-Language Plugin Configuration
 
 <Tabs>
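The ATOF file row in the table added above (store the failure, stop accepting later events, return the stored error from `force_flush()` or `shutdown()`) might look roughly like the sketch below. Only the two method names come from the documentation; `FileExporter`, `FlakyStream`, and `export()` are hypothetical names invented for illustration.

```python
class FlakyStream:
    """Hypothetical stream that fails on the second write."""

    def __init__(self):
        self.lines = []

    def write(self, text):
        if len(self.lines) >= 1:
            raise OSError("disk full")
        self.lines.append(text)


class FileExporter:
    """Hypothetical fail-open exporter, not the NeMo Relay implementation."""

    def __init__(self, stream):
        self._stream = stream
        self._error = None  # first delivery failure, if any

    def export(self, event):
        # Never raises: application work continues regardless of delivery.
        if self._error is not None:
            return  # stop accepting later events for this file
        try:
            self._stream.write(event + "\n")
        except OSError as exc:
            self._error = exc  # store the failure instead of raising

    def force_flush(self):
        return self._error  # stored error is surfaced to the caller

    def shutdown(self):
        return self._error


stream = FlakyStream()
exporter = FileExporter(stream)
exporter.export("event-1")  # written
exporter.export("event-2")  # write fails; error is stored
exporter.export("event-3")  # dropped; exporter no longer accepts events
```

The design choice illustrated here is that the failure surfaces only when the caller explicitly flushes or shuts down, so the instrumented code path never sees an exception.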

integrations/coding-agents/README.md

Lines changed: 15 additions & 3 deletions
@@ -163,6 +163,14 @@ enabled = true
 endpoint = "http://127.0.0.1:4318/v1/traces"
 ```
 
+During setup or launch, invalid shared TOML, malformed plugin config, unsupported exporter settings,
+or unavailable exporter features will fail closed. The
+wrapper does not start the coding agent against a configuration that cannot be
+parsed, validated, or activated. Once the gateway and agent are running,
+exporter delivery failures follow the observability plugin policy: application
+work continues while the failing ATOF, ATIF, OpenTelemetry, or OpenInference
+destination records, logs, or reports the failure.
+
 ## Hook Forwarding
 
 The transparent wrapper hooks call `nemo-relay hook-forward <agent>` with the
@@ -174,11 +182,15 @@ Claude Code and Codex plugin hooks call `nemo-relay plugin-shim hook <agent>`.
 The plugin shim ensures the local sidecar is reachable, then forwards the hook
 payload to the plugin sidecar endpoint.
 
-Since hook forwarding fails open by default, observability outages do not block the
-coding agent. For wrapper-generated `hook-forward` commands, add
+Since hook forwarding fails open by default, gateway or sidecar outages do not
+block the coding agent. The hook command exits successfully after logging the
+forwarding problem, so the host agent can continue even though that hook
+payload may be missing from telemetry. For wrapper-generated `hook-forward`
+commands, add
 `--fail-closed` when policy requires hook delivery to block the agent. For
 plugin shim hooks, set `NEMO_RELAY_FAIL_CLOSED=1` in the hook execution
-environment.
+environment. In that mode, forwarding failures return a non-zero hook command
+status to the host.
 
 Useful wrapper options:
 
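The fail-open and fail-closed exit behavior described in the second README hunk can be sketched as below. The function names are hypothetical, the exit status `1` is an assumed non-zero value, and treating only the literal value `1` of `NEMO_RELAY_FAIL_CLOSED` as enabling the mode is also an assumption; the diff itself states only that forwarding failures return a non-zero status in that mode.

```python
import os
import sys


def forward_hook(send, payload, fail_closed=False, log=sys.stderr):
    """Hypothetical hook forwarder; returns the hook command exit status."""
    try:
        send(payload)
    except Exception as exc:
        print(f"hook forwarding failed: {exc}", file=log)
        # Fail-open (default): exit 0 so the host agent continues.
        # Fail-closed: assumed non-zero status 1 so the host can block.
        return 1 if fail_closed else 0
    return 0


def fail_closed_from_env(environ=os.environ):
    # Assumption: only the value "1" enables fail-closed mode.
    return environ.get("NEMO_RELAY_FAIL_CLOSED") == "1"


def unreachable_sidecar(payload):
    raise ConnectionError("sidecar unreachable")


open_status = forward_hook(unreachable_sidecar, {"event": "example"})
closed_status = forward_hook(
    unreachable_sidecar, {"event": "example"}, fail_closed=True
)
```

With the unreachable sidecar above, the default call logs the problem and reports success, while the fail-closed call reports failure to the host.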
