feat(http)!: require host_config.providers; fail loudly on misconfig by manojp99 · Pull Request #69 · microsoft/amplifier-agent

manojp99 · 2026-06-22T06:46:03Z

Why

During a recent DTU exploration, we surfaced a set of backlog items (#4 #5 #6 #7 #8) around cryptic startup behavior in amplifier-agent serve chat-completions:

The server iterated a hardcoded KNOWN_PROVIDERS list and silently skipped providers whose credentials were missing or modules were not installed.
When nothing loaded, it fell back to a useless placeholder "amplifier" model entry.
Requests with an unknown model were silently routed to whichever provider happened to load first, then failed with an upstream not_found_error embedded in delta.content 4 seconds later.
stream: false was silently ignored.
Upstream errors raised during session initialization were embedded in SSE delta.content at HTTP 200, making them impossible to distinguish from successful responses at the client level.

This PR resolves all five issues with a single clean redesign: explicit per-provider config + fail loudly on any boot-time issue.

It also bundles the operational lifecycle commands that close the story — without them, users have to pkill -f "amplifier-agent serve" every iteration, and can't tell whether the server is running, hung, or dead. Same audience, complementary change, single review.

Changed

Breaking (server mode only): amplifier-agent serve chat-completions now requires host_config.providers to be a non-empty dict. Any provider declared there that cannot initialize (missing credentials, module not installed, list_models() raises, returns 0 models) causes the server to exit 2 with a structured error listing every problem. The previous behavior — iterating a hardcoded KNOWN_PROVIDERS list, silently skipping unreachable providers, and falling back to an unusable placeholder model — is gone. Single-turn mode (amplifier-agent run) is unaffected; the provider (singular) block continues to work for it.
POST /v1/chat/completions now validates model against the served registry. Requests with an unknown model return HTTP 400 {"error": {"code": "unknown_model", ...}} immediately, instead of being silently routed to whichever provider loaded first and failing 4 seconds later with an upstream not_found_error embedded in delta.content.
stream: false is now honored. Requests with that flag return a single JSON body; only stream: true (or absent) uses SSE.
Upstream errors raised before any content chunks are emitted now surface as HTTP 502 with a structured OpenAI-shape error envelope, instead of being embedded inside delta.content of a 200 SSE response.
/v1/models no longer falls back to a placeholder {"id": "amplifier", ...} entry. The lifespan now guarantees served_models_registry is non-empty (or the server exits at boot), so the fallback was unreachable in practice.

Added

host_config.providers (plural) registry — declares which providers the server-mode lifespan loads and how to instantiate each. Schema: providers: {<provider_id>: {module?: str, config?: dict}}. The module defaults to the provider_id when omitted. Each provider's config is passed through as the extra_config arg to list_provider_models() and then to the provider module's constructor.

Lifecycle commands

Adds serve status / stop / restart operational commands. State persisted at ~/.amplifier-agent/state/serve.json (mode 0600, parent dir 0700; api_key is sensitive — never logged). Status reports running/stale/not-running plus served-model counts by provider; stop honors graceful shutdown with optional --force and --timeout; restart identity-restarts from the stored launch args (host, port, api-key, workspace, host_config).

`amplifier-agent serve status`

Reports whether the server is running, probes GET /v1/models over the wire, and self-cleans stale state files (PID gone). Exits 0 on healthy, 1 when unreachable.

`amplifier-agent serve stop [--force] [--timeout SECONDS]`

Sends SIGTERM with a 5s graceful-exit window. Escalates to SIGKILL on timeout or when --force is given.

`amplifier-agent serve restart`

Identity restart — captures host, port, api-key, workspace, and host_config_path from the state file, stops the old server, and re-launches as a detached background process.

Internal

New _validate_providers_registry() in amplifier_agent_lib/config/loader.py enforces the closed schema for the new block.
HTTP-face tests introduced from scratch under tests/http/ covering lifespan boot scenarios and chat-completions validation.
New src/amplifier_agent_cli/admin/serve_lifecycle.py — single owner of state file path and schema. Callers go through read_state_file, write_state_file, remove_state_file.
ServerConfig gains host and port fields so the lifespan can write them to the state file without re-parsing CLI flags.

Migration

For server-mode users on <= 0.8.0: add a providers block to your host_config.json. Minimum to keep working with just Anthropic:

{
  "providers": {
    "anthropic": {}
  }
}

Multi-provider example:

{
  "providers": {
    "anthropic": {},
    "openai":    {"config": {"base_url": "https://api.openai.com/v1"}}
  }
}

If you don't pass host_config.providers, the server will exit at boot with a clear error message rather than running in a broken half-state.

Test Evidence

uv run pytest tests/config/ -q → 47 passed (no regressions)
uv run pytest tests/http/ -q  → 13 passed (all new tests green)
uv run pytest tests/cli/test_serve_lifecycle.py -q → 18 passed (all new)
ruff check + ruff format + pyright on all modified source files: clean

## Summary This is a breaking change to `serve chat-completions` boot semantics. The server now requires an explicit `host_config.providers` dict; the old behavior (iterate a hardcoded KNOWN_PROVIDERS list, skip silently, fall back to a placeholder model) is removed. ## Changed ### loader.py - Added 'providers' to _VALID_TOP_LEVEL_KEYS. - Added _validate_providers_registry() with closed schema for the new block: each entry may have optional 'module' (defaults to the key, must be a known provider) and optional 'config' (must be a dict). Unknown keys raise config_unknown_key; wrong types raise config_invalid_type. - Called from load_config() after the existing _validate_provider_module call. ### merger.py - Added an explanatory comment: 'providers' is server-mode only and deliberately not merged into bundle_modules by this function. ### app.py (lifespan) - Removed KNOWN_PROVIDERS import and the loop that iterated it. - Added import sys. - New logic: reads app.state.host_config.get('providers'), exits 2 if the block is missing or empty, collects one error per failing provider, exits 2 with the full error list if any provider fails. Passes per-provider 'config' as extra_config to list_provider_models(). ### routes/models.py - Removed the placeholder fallback (lines 107-121). The route now returns the served list directly; the lifespan guarantees it is non-empty or the server has already exited. ### routes/chat_completions.py (three edits) Edit A — model validation: - Replaced served_registry.get(payload.model, 'anthropic') fallback with a hard 400 / unknown_model when the model is not in the registry. Edit B — honor stream: false: - Changed ChatCompletionRequest.stream type from bool = False to bool | None = None so 'field absent' is distinguishable from 'explicit False'. - Added _collect_completion() helper that consumes the SSE generator and returns a single chat.completion JSON object. - In chat_completions(): if payload.stream is False, call _collect_completion and return JSONResponse; otherwise return StreamingResponse. Edit C — upstream errors as HTTP 502: - Moved event_queue / display / approval / host_tool_yield_state creation and turn_task start UP into chat_completions() (the route handler), before StreamingResponse is returned, so Starlette has not yet committed the 200 status line. - Added a 50 ms asyncio.wait pre-flight check: if turn_task completes within the window (immediately-failing mock or provider that raises before its first IO await), the handler raises HTTPException(502) with a structured OpenAI-shape error envelope. - _stream_chat_completion() now accepts pre-started components (turn_task, signal_task, event_queue, display, host_tool_yield_state) instead of creating them internally. - Mid-stream errors (after the role chunk is yielded) continue to be embedded in delta.content (no other option once SSE has started). ### _wire.py - ChatCompletionRequest.stream: bool = False → bool | None = None ### tests/http/ (new) - tests/http/__init__.py (empty, marks the package) - tests/http/test_lifespan_providers.py: 8 tests for lifespan boot semantics covering all exit-2 scenarios and two happy-path scenarios. - tests/http/test_chat_completions_validation.py: 5 tests for Edit A/B/C using a no-op lifespan fixture so TestClient does not trigger bundle loading. ### CHANGELOG.md - Added [Unreleased] section above [0.8.0] documenting all breaking changes, additions, internal changes, and migration steps. ## Migration Server-mode users on <= 0.8.0: add providers to host_config.json. Minimum (Anthropic only): {"providers": {"anthropic": {}}} Multi-provider: {"providers": {"anthropic": {}, "openai": {"config": {"base_url": "https://api.openai.com/v1"}}}} ## Tests - uv run pytest tests/config/ → 47 passed (unchanged) - uv run pytest tests/http/ → 13 passed (new) - ruff check + ruff format + pyright on all modified source files: clean 🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier) Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>

Add operational lifecycle management for the chat-completions HTTP server. ## What changed ### New: src/amplifier_agent_cli/admin/serve_lifecycle.py Single-responsibility module owning the state file at ~/.amplifier-agent/state/serve.json (mode 0600, parent dir 0700). State file helpers: - write_state_file(): atomic write via tempfile + os.replace; enforces mode 0600 on the tempfile BEFORE rename so api_key is never readable at a permissive mode even momentarily. - read_state_file(): returns None when absent; raises ClickException on unknown schema_version with a remediation hint. - remove_state_file(): idempotent (missing_ok=True). - is_pid_alive(pid): uses os.kill(pid, 0) — PermissionError treated as 'alive' (process exists, we don't own it). - wait_for_exit(pid, timeout): 100 ms poll loop. Three click commands: - serve status: reads state file, validates PID is alive, probes GET /v1/models, reports running/stale/not-running with provider counts. Self-cleans stale state files (PID gone). Exits 0 on healthy, 1 on unreachable. - serve stop: SIGTERM → graceful wait (--timeout, default 5s) → SIGKILL escalation on expiry. --force skips straight to SIGKILL. Exits 1 when no state file (nothing to stop). - serve restart: captures launch args from state file, stops old server, re-launches as detached subprocess (start_new_session=True), polls for new state file (30s window) before reporting success. ### Modified: src/amplifier_agent_cli/admin/serve.py - Import and register status_command, stop_command, restart_command on serve_group via add_command(). - Set AMPLIFIER_AGENT_HTTP_BIND and AMPLIFIER_AGENT_HTTP_PORT env vars before uvicorn.run() so load_config() can stash them in the state file. - Install belt-and-suspenders SIGTERM/SIGINT handlers that call remove_state_file(); lifespan finally-block is the primary cleanup path. ### Modified: src/amplifier_agent_http/_config.py - Add host: str and port: int fields to ServerConfig. - load_config() reads AMPLIFIER_AGENT_HTTP_BIND (default 127.0.0.1) and AMPLIFIER_AGENT_HTTP_PORT (default 9099). ### Modified: src/amplifier_agent_http/app.py - After all providers load (and before yield), compute providers_summary {provider_id: model_count} and call write_state_file() with pid, started_at, host, port, api_key, workspace, host_config_path, providers_summary. - In finally block, call remove_state_file() for graceful shutdown cleanup. ### Modified: tests/http/test_lifespan_providers.py - Add host='127.0.0.1' and port=9099 to _server_config() helper (new required ServerConfig fields). - Patch write_state_file and remove_state_file in base_mocks fixture so tests never touch the real state file. ### New: tests/cli/test_serve_lifecycle.py 18 tests covering: - write_state_file: mode bits (0600/0700), content round-trip, atomicity. - read_state_file: missing → None, unknown schema_version → ClickException. - remove_state_file: idempotent. - is_pid_alive: live subprocess, dead subprocess, PID 1. - serve status: not running, stale PID self-clean, running + reachable. - serve stop: no state file → exit 1, graceful SIGTERM, --force SIGKILL, timeout escalation. - serve restart: no state file → exit 1, cmd assembly from stored args. ### Modified: CHANGELOG.md - Added lifecycle commands bullet to [Unreleased] ### Added section. ## Test results - tests/cli/test_serve_lifecycle.py: 18/18 passed - tests/http/: 13/13 passed (no regression) - tests/config/: 47/47 passed (no regression) 🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier) Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>

Manoj Prabhakar Paidiparthy and others added 2 commits June 21, 2026 23:45

manojp99 merged commit f6e1a6e into main Jun 22, 2026
2 of 3 checks passed

manojp99 mentioned this pull request Jun 22, 2026

fix(serve): install bundle modules on cold start; declare all 4 providers #70

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(http)!: require host_config.providers; fail loudly on misconfig#69

feat(http)!: require host_config.providers; fail loudly on misconfig#69
manojp99 merged 2 commits into
mainfrom
feat/explicit-providers-fail-loud

manojp99 commented Jun 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

manojp99 commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

Changed

Added

Lifecycle commands

amplifier-agent serve status

amplifier-agent serve stop [--force] [--timeout SECONDS]

amplifier-agent serve restart

Internal

Migration

Test Evidence

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

manojp99 commented Jun 22, 2026 •

edited

Loading

`amplifier-agent serve status`

`amplifier-agent serve stop [--force] [--timeout SECONDS]`

`amplifier-agent serve restart`