docs/observability-plugin/configuration.mdx
33 additions & 0 deletions
@@ -125,6 +125,39 @@
Include only the sections you want to configure. In layered `plugins.toml`
files, omission inherits lower-precedence values; write `enabled = false` to
disable an inherited section.

## Failure Behavior

NeMo Relay treats exporter configuration and exporter delivery failures
differently. Configuration and activation failures are fail-closed for the
observability setup: validation or initialization returns an error before the
new plugin configuration becomes active. Runtime delivery failures are
fail-open for application work: the tool, LLM, or agent run continues while the
affected exporter records, logs, or reports the delivery problem.
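
The split between the two failure classes can be sketched in a few lines of Python. This is an illustration only, not NeMo Relay code: `SketchRegistry`, `SketchExporter`, `ConfigError`, `run_tool`, and the `path` and `bad` keys are hypothetical names chosen for the example.

```python
class ConfigError(Exception):
    """Raised when the (hypothetical) exporter configuration is invalid."""


class SketchExporter:
    def __init__(self, config):
        # Fail-closed: reject bad configuration before anything becomes active.
        if "path" not in config:
            raise ConfigError("missing 'path'")
        self.path = config["path"]
        self.errors = []

    def export(self, event):
        # Simulated runtime delivery problem.
        if event.get("bad"):
            raise OSError("delivery failed")


class SketchRegistry:
    def __init__(self):
        self.active = None

    def replace(self, config):
        # Build the new exporter first; swap only if construction succeeded.
        candidate = SketchExporter(config)
        self.active = candidate


def run_tool(exporter, event):
    # Fail-open: a delivery error is recorded, and the tool result is
    # still returned to the caller.
    try:
        exporter.export(event)
    except OSError as exc:
        exporter.errors.append(str(exc))
    return "tool result"
```

In this sketch a failed `replace()` raises before `active` is reassigned, so the previous exporter stays in place, while a failed `export()` never reaches the caller of `run_tool()`.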

| Failure | Behavior |
|---|---|
| Invalid `plugins.toml`, duplicate component kinds, malformed component shapes, unsupported values, unavailable exporter features, ATOF file-open failures, invalid ATOF endpoint config, unavailable ATOF streaming support, or ATOF endpoint worker startup failures | Validation or initialization fails. If a previous plugin configuration was active, NeMo Relay attempts to restore it after a failed replacement. |
| ATOF event serialization or file write/flush failure after activation | Application work continues. The exporter stores the failure, stops accepting later events for that file, and returns the stored error from `force_flush()` or `shutdown()`. |
| ATOF streaming endpoint connection or send failure after activation | File output and other already-started endpoints continue. Endpoint failures are logged with the endpoint index; endpoint flush and close timeouts are logged instead of blocking shutdown indefinitely. |
| ATIF local file write, HTTP storage, or S3-compatible storage failure | Application work continues. The failed sink is recorded as unhealthy and skipped for later trajectories. Other configured sinks continue to receive writes. |
| ATIF dispatcher serialization or subscriber-management failure | The ATIF dispatcher records a fatal exporter error and stops observing later ATIF events. Other observability sections continue to run. |
| OpenTelemetry or OpenInference construction failure | Plugin initialization fails before the subscriber is registered. |
| OpenTelemetry or OpenInference export failure after registration | Application work continues. The OTLP exporter reports failures through its runtime logging and flush or shutdown path. |
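
The "store the failure, stop accepting later events, return the stored error" pattern described for ATOF file output might look roughly like the following. This is a minimal sketch under stated assumptions: `SketchFileExporter` and `on_event` are hypothetical, and only the method names `force_flush()` and `shutdown()` come from the table above. Whether the real exporter returns or raises the stored error is not specified here; the sketch returns it.

```python
import io


class SketchFileExporter:
    """Hypothetical stand-in for a file exporter; not the NeMo Relay class."""

    def __init__(self, stream):
        self._stream = stream
        self._error = None

    def on_event(self, line):
        if self._error is not None:
            # Already failed: later events for this file are dropped.
            return
        try:
            self._stream.write(line + "\n")
            self._stream.flush()
        except (OSError, ValueError) as exc:
            # Store the failure instead of raising into application work.
            self._error = exc

    def force_flush(self):
        # Surface the stored failure, or None if every write succeeded.
        return self._error

    def shutdown(self):
        return self._error
```

The caller that emits events never sees an exception; the failure becomes visible only when something asks the exporter to flush or shut down.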

Missing or delayed telemetry is represented as absence of exporter output, not
as synthetic success or failure events. NeMo Relay does not backfill events for
subscribers that register late. If the plugin is cleared while an agent scope
is still open, the ATIF dispatcher writes the partial trajectory it has already
observed.
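
As a rough illustration of "no backfill" and "partial trajectory on clear", consider the toy dispatcher below. Everything in it is an assumption made for the example: `SketchDispatcher`, its methods, and the `partial` and `steps` keys are hypothetical and do not describe the ATIF dispatcher's actual interface or output format.

```python
class SketchDispatcher:
    """Toy event dispatcher; names and output shape are hypothetical."""

    def __init__(self):
        self._subscribers = []
        self._open_steps = []
        self.written = []

    def subscribe(self, callback):
        # No replay: the callback only sees events emitted after this call.
        self._subscribers.append(callback)

    def emit(self, event):
        self._open_steps.append(event)
        for callback in self._subscribers:
            callback(event)

    def clear(self):
        # Scope still open: write what was observed so far, without
        # inventing success or failure events to close it.
        if self._open_steps:
            self.written.append(
                {"partial": True, "steps": list(self._open_steps)}
            )
        self._open_steps.clear()
        self._subscribers.clear()
```

A subscriber added after the first event receives only the later ones, and `clear()` writes exactly the steps that were observed.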

Use `nemo-relay doctor` to validate local ATOF and ATIF output directories,