Skip to content

feat(http)!: require host_config.providers; fail loudly on misconfig#69

Merged
manojp99 merged 2 commits into
mainfrom
feat/explicit-providers-fail-loud
Jun 22, 2026
Merged

feat(http)!: require host_config.providers; fail loudly on misconfig#69
manojp99 merged 2 commits into
mainfrom
feat/explicit-providers-fail-loud

Conversation

@manojp99

@manojp99 manojp99 commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Why

During a recent DTU exploration, we surfaced a set of backlog items (#4 #5 #6 #7 #8) around cryptic startup behavior in amplifier-agent serve chat-completions:

  • The server iterated a hardcoded KNOWN_PROVIDERS list and silently skipped providers whose credentials were missing or modules were not installed.
  • When nothing loaded, it fell back to a useless placeholder "amplifier" model entry.
  • Requests with an unknown model were silently routed to whichever provider happened to load first, then failed with an upstream not_found_error embedded in delta.content 4 seconds later.
  • stream: false was silently ignored.
  • Upstream errors raised during session initialization were embedded in SSE delta.content at HTTP 200, making them impossible to distinguish from successful responses at the client level.

This PR resolves all five issues with a single clean redesign: explicit per-provider config + fail loudly on any boot-time issue.

It also bundles the operational lifecycle commands that close the story — without them, users have to pkill -f "amplifier-agent serve" every iteration, and can't tell whether the server is running, hung, or dead. Same audience, complementary change, single review.


Changed

  • Breaking (server mode only): amplifier-agent serve chat-completions now requires host_config.providers to be a non-empty dict. Any provider declared there that cannot initialize (missing credentials, module not installed, list_models() raises, returns 0 models) causes the server to exit 2 with a structured error listing every problem. The previous behavior — iterating a hardcoded KNOWN_PROVIDERS list, silently skipping unreachable providers, and falling back to an unusable placeholder model — is gone. Single-turn mode (amplifier-agent run) is unaffected; the provider (singular) block continues to work for it.

  • POST /v1/chat/completions now validates model against the served registry. Requests with an unknown model return HTTP 400 {"error": {"code": "unknown_model", ...}} immediately, instead of being silently routed to whichever provider loaded first and failing 4 seconds later with an upstream not_found_error embedded in delta.content.

  • stream: false is now honored. Requests with that flag return a single JSON body; only stream: true (or absent) uses SSE.

  • Upstream errors raised before any content chunks are emitted now surface as HTTP 502 with a structured OpenAI-shape error envelope, instead of being embedded inside delta.content of a 200 SSE response.

  • /v1/models no longer falls back to a placeholder {"id": "amplifier", ...} entry. The lifespan now guarantees served_models_registry is non-empty (or the server exits at boot), so the fallback was unreachable in practice.

Added

  • host_config.providers (plural) registry — declares which providers the server-mode lifespan loads and how to instantiate each. Schema: providers: {<provider_id>: {module?: str, config?: dict}}. The module defaults to the provider_id when omitted. Each provider's config is passed through as the extra_config arg to list_provider_models() and then to the provider module's constructor.

Lifecycle commands

Adds serve status / stop / restart operational commands. State persisted at ~/.amplifier-agent/state/serve.json (mode 0600, parent dir 0700; api_key is sensitive — never logged). Status reports running/stale/not-running plus served-model counts by provider; stop honors graceful shutdown with optional --force and --timeout; restart identity-restarts from the stored launch args (host, port, api-key, workspace, host_config).

amplifier-agent serve status

Reports whether the server is running, probes GET /v1/models over the wire, and self-cleans stale state files (PID gone). Exits 0 on healthy, 1 when unreachable.

amplifier-agent serve stop [--force] [--timeout SECONDS]

Sends SIGTERM with a 5s graceful-exit window. Escalates to SIGKILL on timeout or when --force is given.

amplifier-agent serve restart

Identity restart — captures host, port, api-key, workspace, and host_config_path from the state file, stops the old server, and re-launches as a detached background process.

Internal

  • New _validate_providers_registry() in amplifier_agent_lib/config/loader.py enforces the closed schema for the new block.
  • HTTP-face tests introduced from scratch under tests/http/ covering lifespan boot scenarios and chat-completions validation.
  • New src/amplifier_agent_cli/admin/serve_lifecycle.py — single owner of state file path and schema. Callers go through read_state_file, write_state_file, remove_state_file.
  • ServerConfig gains host and port fields so the lifespan can write them to the state file without re-parsing CLI flags.

Migration

For server-mode users on <= 0.8.0: add a providers block to your host_config.json. Minimum to keep working with just Anthropic:

{
  "providers": {
    "anthropic": {}
  }
}

Multi-provider example:

{
  "providers": {
    "anthropic": {},
    "openai":    {"config": {"base_url": "https://api.openai.com/v1"}}
  }
}

If you don't pass host_config.providers, the server will exit at boot with a clear error message rather than running in a broken half-state.


Test Evidence

uv run pytest tests/config/ -q → 47 passed (no regressions)
uv run pytest tests/http/ -q  → 13 passed (all new tests green)
uv run pytest tests/cli/test_serve_lifecycle.py -q → 18 passed (all new)
ruff check + ruff format + pyright on all modified source files: clean

Manoj Prabhakar Paidiparthy and others added 2 commits June 21, 2026 23:45
## Summary

This is a breaking change to `serve chat-completions` boot semantics.
The server now requires an explicit `host_config.providers` dict; the old
behavior (iterate a hardcoded KNOWN_PROVIDERS list, skip silently, fall back
to a placeholder model) is removed.

## Changed

### loader.py

- Added 'providers' to _VALID_TOP_LEVEL_KEYS.
- Added _validate_providers_registry() with closed schema for the new block:
  each entry may have optional 'module' (defaults to the key, must be a known
  provider) and optional 'config' (must be a dict). Unknown keys raise
  config_unknown_key; wrong types raise config_invalid_type.
- Called from load_config() after the existing _validate_provider_module call.

### merger.py

- Added an explanatory comment: 'providers' is server-mode only and
  deliberately not merged into bundle_modules by this function.

### app.py (lifespan)

- Removed KNOWN_PROVIDERS import and the loop that iterated it.
- Added import sys.
- New logic: reads app.state.host_config.get('providers'), exits 2 if the
  block is missing or empty, collects one error per failing provider, exits 2
  with the full error list if any provider fails. Passes per-provider 'config'
  as extra_config to list_provider_models().

### routes/models.py

- Removed the placeholder fallback (lines 107-121). The route now returns
  the served list directly; the lifespan guarantees it is non-empty or the
  server has already exited.

### routes/chat_completions.py (three edits)

Edit A — model validation:
- Replaced served_registry.get(payload.model, 'anthropic') fallback with a
  hard 400 / unknown_model when the model is not in the registry.

Edit B — honor stream: false:
- Changed ChatCompletionRequest.stream type from bool = False to
  bool | None = None so 'field absent' is distinguishable from
  'explicit False'.
- Added _collect_completion() helper that consumes the SSE generator and
  returns a single chat.completion JSON object.
- In chat_completions(): if payload.stream is False, call _collect_completion
  and return JSONResponse; otherwise return StreamingResponse.

Edit C — upstream errors as HTTP 502:
- Moved event_queue / display / approval / host_tool_yield_state creation and
  turn_task start UP into chat_completions() (the route handler), before
  StreamingResponse is returned, so Starlette has not yet committed the 200
  status line.
- Added a 50 ms asyncio.wait pre-flight check: if turn_task completes within
  the window (immediately-failing mock or provider that raises before its
  first IO await), the handler raises HTTPException(502) with a structured
  OpenAI-shape error envelope.
- _stream_chat_completion() now accepts pre-started components (turn_task,
  signal_task, event_queue, display, host_tool_yield_state) instead of
  creating them internally.
- Mid-stream errors (after the role chunk is yielded) continue to be embedded
  in delta.content (no other option once SSE has started).

### _wire.py

- ChatCompletionRequest.stream: bool = False → bool | None = None

### tests/http/ (new)

- tests/http/__init__.py (empty, marks the package)
- tests/http/test_lifespan_providers.py: 8 tests for lifespan boot semantics
  covering all exit-2 scenarios and two happy-path scenarios.
- tests/http/test_chat_completions_validation.py: 5 tests for Edit A/B/C
  using a no-op lifespan fixture so TestClient does not trigger bundle loading.

### CHANGELOG.md

- Added [Unreleased] section above [0.8.0] documenting all breaking changes,
  additions, internal changes, and migration steps.

## Migration

Server-mode users on <= 0.8.0: add providers to host_config.json.

Minimum (Anthropic only):
  {"providers": {"anthropic": {}}}

Multi-provider:
  {"providers": {"anthropic": {}, "openai": {"config": {"base_url": "https://api.openai.com/v1"}}}}

## Tests

- uv run pytest tests/config/ → 47 passed (unchanged)
- uv run pytest tests/http/ → 13 passed (new)
- ruff check + ruff format + pyright on all modified source files: clean

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
Add operational lifecycle management for the chat-completions HTTP server.

## What changed

### New: src/amplifier_agent_cli/admin/serve_lifecycle.py
Single-responsibility module owning the state file at
~/.amplifier-agent/state/serve.json (mode 0600, parent dir 0700).

State file helpers:
- write_state_file(): atomic write via tempfile + os.replace; enforces
  mode 0600 on the tempfile BEFORE rename so api_key is never readable
  at a permissive mode even momentarily.
- read_state_file(): returns None when absent; raises ClickException on
  unknown schema_version with a remediation hint.
- remove_state_file(): idempotent (missing_ok=True).
- is_pid_alive(pid): uses os.kill(pid, 0) — PermissionError treated as
  'alive' (process exists, we don't own it).
- wait_for_exit(pid, timeout): 100 ms poll loop.

Three click commands:
- serve status: reads state file, validates PID is alive, probes
  GET /v1/models, reports running/stale/not-running with provider
  counts. Self-cleans stale state files (PID gone). Exits 0 on healthy,
  1 on unreachable.
- serve stop: SIGTERM → graceful wait (--timeout, default 5s) →
  SIGKILL escalation on expiry. --force skips straight to SIGKILL.
  Exits 1 when no state file (nothing to stop).
- serve restart: captures launch args from state file, stops old server,
  re-launches as detached subprocess (start_new_session=True), polls
  for new state file (30s window) before reporting success.

### Modified: src/amplifier_agent_cli/admin/serve.py
- Import and register status_command, stop_command, restart_command on
  serve_group via add_command().
- Set AMPLIFIER_AGENT_HTTP_BIND and AMPLIFIER_AGENT_HTTP_PORT env vars
  before uvicorn.run() so load_config() can stash them in the state file.
- Install belt-and-suspenders SIGTERM/SIGINT handlers that call
  remove_state_file(); lifespan finally-block is the primary cleanup path.

### Modified: src/amplifier_agent_http/_config.py
- Add host: str and port: int fields to ServerConfig.
- load_config() reads AMPLIFIER_AGENT_HTTP_BIND (default 127.0.0.1)
  and AMPLIFIER_AGENT_HTTP_PORT (default 9099).

### Modified: src/amplifier_agent_http/app.py
- After all providers load (and before yield), compute providers_summary
  {provider_id: model_count} and call write_state_file() with pid,
  started_at, host, port, api_key, workspace, host_config_path,
  providers_summary.
- In finally block, call remove_state_file() for graceful shutdown
  cleanup.

### Modified: tests/http/test_lifespan_providers.py
- Add host='127.0.0.1' and port=9099 to _server_config() helper (new
  required ServerConfig fields).
- Patch write_state_file and remove_state_file in base_mocks fixture
  so tests never touch the real state file.

### New: tests/cli/test_serve_lifecycle.py
18 tests covering:
- write_state_file: mode bits (0600/0700), content round-trip, atomicity.
- read_state_file: missing → None, unknown schema_version → ClickException.
- remove_state_file: idempotent.
- is_pid_alive: live subprocess, dead subprocess, PID 1.
- serve status: not running, stale PID self-clean, running + reachable.
- serve stop: no state file → exit 1, graceful SIGTERM, --force SIGKILL,
  timeout escalation.
- serve restart: no state file → exit 1, cmd assembly from stored args.

### Modified: CHANGELOG.md
- Added lifecycle commands bullet to [Unreleased] ### Added section.

## Test results
- tests/cli/test_serve_lifecycle.py: 18/18 passed
- tests/http/: 13/13 passed (no regression)
- tests/config/: 47/47 passed (no regression)

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
@manojp99 manojp99 merged commit f6e1a6e into main Jun 22, 2026
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant