
fix(ha): enforce request-deadline + response-size cap on HA HTTP client (closes #173) #174

Merged
ai-hpc merged 1 commit into GeniePod:main from galuis116:fix/ha-client-request-timeout
May 26, 2026

Conversation

@galuis116
Contributor

Summary

Enforce a request-lifecycle timeout and a response-size cap on HaClient's
raw-TCP HTTP client. Pre-fix only TcpStream::connect had a timeout (5 s);
every subsequent I/O (write_all, status-line read_line, header loop,
body read_exact / read_to_string / chunked decoder) was unbounded, so a
hung Home Assistant — Python GC pause, Supervisor self-update, integration
reload, slow custom add-on, or dropped packet after handshake — blocked the
calling chat/voice/dashboard task forever. The read-to-EOF fallback was
also unbounded in memory.

Same "the assistant froze" symptom class as #109 / PR #118 (voice cycle
stalls) and #124 / PR #125 (LLM stream not cancelled on disconnect), in a
different subsystem.

Closes #173.

Changes

  • crates/genie-core/src/ha/client.rs:
    • Surface the magic-5 connect timeout as DEFAULT_CONNECT_TIMEOUT = 5s,
      add DEFAULT_REQUEST_TIMEOUT = 30s, add
      DEFAULT_MAX_RESPONSE_BYTES = 8 MiB. Doc-commented to explain the
      failure mode each one prevents.
    • HaClient carries the three values as fields; HaClient::new defaults
      them from the constants — no behaviour change for existing callers.
    • #[cfg(test)] pub(crate) fn with_test_limits(...) builder so the new
      regression tests can run at millisecond scale.
    • http_request:
      • Connect path uses self.connect_timeout; Elapsed maps to
        anyhow!("Home Assistant connect {method} {path} timed out after Ns").
      • Post-connect (write + read_http_response) is wrapped in
        tokio::time::timeout(self.request_timeout, ...); Elapsed maps to
        anyhow!("Home Assistant {method} {path} timed out after Ns").
    • read_http_response takes max_response_bytes:
      • Reject up front when advertised Content-Length exceeds the cap.
      • Read-to-EOF fallback (no Content-Length, no chunked) uses
        take(cap + 1) and surfaces a clear "exceeded N bytes" error rather
        than consuming unbounded RSS.
    • read_chunked_body takes max_bytes and bails before any chunk whose
      cumulative size would exceed the cap.
  • 5 new #[tokio::test] regressions backed by ephemeral local TcpListeners
    (spawn_listener + test_client helpers, ~50 LOC of scaffolding shared
    across the suite):
    • hung_server_after_connect_times_out_cleanly — drains the request then
      sleeps 60 s without writing a response; asserts Err("…timed out…")
      inside the 300 ms test budget. Pre-fix this hung forever.
    • slow_server_within_budget_succeeds — 50 ms delay then a valid 200
      with []; asserts get_states() returns an empty Vec. Locks in
      "healthy HA still works."
    • oversize_content_length_is_rejected — advertises Content-Length: 1048576 against a 16 KiB test cap; asserts the client bails before
      allocating.
    • read_to_eof_response_is_size_capped — streams 64 KiB with no
      Content-Length and no chunked encoding; asserts the client bails with
      "exceeded N bytes" instead of consuming memory.
    • chunked_body_aggregate_size_is_capped — 8 × 4 KiB chunks (32 KiB) vs
      16 KiB cap; asserts the chunked decoder bails partway through.
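
The two non-chunked size checks described above can be sketched with the standard library alone. This is a minimal synchronous illustration, not the code from this PR (which runs on tokio): the helper names `check_content_length` and `read_to_eof_capped` and the exact error wording are assumptions made for the example.

```rust
use std::io::{self, Read};

/// Hypothetical helper: reject an advertised Content-Length above the cap
/// before any body buffer is allocated.
fn check_content_length(advertised: u64, cap: u64) -> io::Result<()> {
    if advertised > cap {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("Content-Length {advertised} exceeds cap of {cap} bytes"),
        ));
    }
    Ok(())
}

/// Hypothetical helper: read until EOF without buffering more than `cap + 1`
/// bytes. The extra byte allowed through by `take` is what distinguishes a
/// body of exactly `cap` bytes (accepted) from a longer one (rejected).
fn read_to_eof_capped<R: Read>(reader: R, cap: u64) -> io::Result<Vec<u8>> {
    let mut body = Vec::new();
    reader.take(cap.saturating_add(1)).read_to_end(&mut body)?;
    if body.len() as u64 > cap {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("response exceeded {cap} bytes"),
        ));
    }
    Ok(body)
}
```

Reading through `take(cap + 1)` rather than `take(cap)` matters: with a limit of exactly `cap`, an oversize body would be silently truncated and look like a valid response.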

No config schema change, no public API change. The defaults are
intentionally conservative (30 s request, 8 MiB body) — well above any
realistic HA response. Existing callers see byte-identical behaviour
against a healthy HA.

Real Behavior Proof

  • Built and ran the affected code locally.
  • NOT verified on Jetson hardware. The fix is in pure tokio I/O
    scheduling — no audio, voice, ALSA, CUDA, LLM-backend, or systemd
    surface — exercised end-to-end by 5 #[tokio::test] regressions that
    spin up a real local TcpListener and drive each failure mode (hung
    server, slow but OK, oversize Content-Length, read-to-EOF, oversize
    chunked aggregate).

What I ran

Environment: x86_64 Linux dev host (Ubuntu 22.04, Rust 1.95.0). No Jetson
available.

cargo fmt --all -- --check                                                # clean
cargo clippy --workspace --all-targets -- -D warnings                    # clean
cargo clippy --workspace --all-targets --no-default-features -- -D warnings  # clean
cargo test -p genie-core --lib ha::client::                               # 8 / 0
cargo test                                                                # 610 / 0 / 3
cargo test --workspace --no-default-features                             # 505 / 0

What I observed

  1. Hung server is bounded. hung_server_after_connect_times_out_cleanly
    accepts the TCP connection, drains the request, then sleeps for 60 s
    (tokio::time::sleep) without replying.
    With the fix, HaClient::test_connection() returns
    Err("Home Assistant GET /api/ timed out after 300ms") in well under
    2 seconds. Pre-fix the same call hangs for the full 60 seconds.
  2. Healthy slow HA still works. slow_server_within_budget_succeeds
    sleeps 50 ms before writing a minimal 200 response. The client returns
    Ok([]) — no regression on the happy path.
  3. Oversize body rejected up front. oversize_content_length_is_rejected
    advertises 1 MiB against a 16 KiB cap; client bails with "exceeds cap"
    before allocating.
  4. Read-to-EOF is bounded. read_to_eof_response_is_size_capped streams
    64 KiB; client bails with "exceeded 16384 bytes" instead of consuming RSS.
  5. Chunked aggregate is bounded. chunked_body_aggregate_size_is_capped
    emits 32 KiB across 8 chunks; client bails with "chunked response
    exceeded 16384 bytes" partway through.
  6. No direct-path regression. The 3 pre-existing parse_http_url_*
    tests still pass. Full cargo test matches the main baseline:
    610 / 0 / 3 (default features), 505 / 0 (--no-default-features).
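
The chunked-aggregate behaviour in observation 5 can be illustrated with a small decoder that checks the running total before buffering each chunk. This is a synchronous, std-only sketch under stated assumptions: `read_chunked_capped` and `invalid` are hypothetical names, and the real `read_chunked_body` in this PR is async and may differ in detail.

```rust
use std::io::{self, BufRead, Read};

/// Hypothetical helper for building InvalidData errors.
fn invalid(msg: String) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Hypothetical sketch: decode a chunked body, bailing before any chunk
/// whose cumulative size would exceed `max_bytes`.
/// Note: the chunk-size line itself is not length-limited in this sketch.
fn read_chunked_capped<R: BufRead>(reader: &mut R, max_bytes: usize) -> io::Result<Vec<u8>> {
    let mut body = Vec::new();
    loop {
        // Chunk-size line: hex size, optionally followed by ";extensions".
        let mut line = String::new();
        if reader.read_line(&mut line)? == 0 {
            return Err(invalid("unexpected EOF before chunk size".to_string()));
        }
        let size_field = line.trim().split(';').next().unwrap_or("").trim();
        let size = usize::from_str_radix(size_field, 16)
            .map_err(|e| invalid(format!("bad chunk size {size_field:?}: {e}")))?;
        if size == 0 {
            // Last chunk: skip optional trailers up to the blank line.
            loop {
                let mut trailer = String::new();
                if reader.read_line(&mut trailer)? == 0 || trailer.trim().is_empty() {
                    return Ok(body);
                }
            }
        }
        // Check the running total first, so an oversize chunk is never buffered.
        let total = body
            .len()
            .checked_add(size)
            .filter(|t| *t <= max_bytes)
            .ok_or_else(|| invalid(format!("chunked response exceeded {max_bytes} bytes")))?;
        let start = body.len();
        body.resize(total, 0);
        reader.read_exact(&mut body[start..])?;
        // Each chunk's data is followed by CRLF.
        let mut crlf = [0u8; 2];
        reader.read_exact(&mut crlf)?;
        if crlf != *b"\r\n" {
            return Err(invalid("missing CRLF after chunk".to_string()));
        }
    }
}
```

With 8 chunks of 4 KiB against a 16 KiB cap, the first four chunks are accepted and the fifth chunk-size line triggers the error, which matches the "partway through" behaviour the test asserts.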

Test plan

A reviewer can re-verify on any Rust 1.85+ host (no Jetson, no Home
Assistant, no audio needed):

  • cargo test -p genie-core --lib ha::client:: — 8 tests (3 existing
    parse_http_url_* + 5 new), all green in <1 s.
  • Optional manual proof against a real HA:
    1. Start genie-core with a working HA configured.
    2. Pause HA: sudo kill -STOP $(pgrep -f homeassistant) (or
      supervisorctl stop homeassistant while a request is mid-flight).
    3. From a separate shell:
      curl -m 120 -X POST http://127.0.0.1:3000/api/chat \
        -H 'Content-Type: application/json' \
        -d '{"message":"is the kitchen light on?"}'
    4. Pre-fix: genie-core hangs forever (curl gives up after 120 s).
      With the fix: the request returns within ~30 s with the chat reply
      containing the tool dispatch's
      Err("Home Assistant GET /api/states/... timed out after 30s").
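
For readers who want to see the request-lifecycle deadline in isolation, here is a blocking, std-only analogue. The PR itself wraps the post-connect I/O in tokio::time::timeout; this sketch instead tracks one absolute deadline and shrinks the socket timeout before every read, which gives a similar "whole exchange is bounded" guarantee. The function name `exchange_with_deadline` and the error strings are assumptions made for the example.

```rust
use std::io::{self, Read, Write};
use std::net::TcpStream;
use std::time::{Duration, Instant};

/// Hypothetical sketch: send `request` and read the reply to EOF, giving up
/// once `budget` has elapsed for the exchange as a whole (not per read).
fn exchange_with_deadline(
    mut stream: TcpStream,
    request: &[u8],
    budget: Duration,
) -> io::Result<Vec<u8>> {
    let deadline = Instant::now() + budget;
    // Time left until the deadline, or a TimedOut error once it has passed.
    // A zero timeout is rejected by set_read_timeout, hence the filter.
    let remaining = |what: &str| -> io::Result<Duration> {
        deadline
            .checked_duration_since(Instant::now())
            .filter(|d| !d.is_zero())
            .ok_or_else(|| {
                io::Error::new(
                    io::ErrorKind::TimedOut,
                    format!("{what} timed out after {budget:?}"),
                )
            })
    };

    stream.set_write_timeout(Some(remaining("write")?))?;
    stream.write_all(request)?;

    let mut reply = Vec::new();
    let mut buf = [0u8; 4096];
    loop {
        // Re-arm the socket timeout with whatever budget is left.
        stream.set_read_timeout(Some(remaining("read")?))?;
        match stream.read(&mut buf) {
            Ok(0) => return Ok(reply),
            Ok(n) => reply.extend_from_slice(&buf[..n]),
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            // Socket timeouts surface as WouldBlock or TimedOut depending on platform.
            Err(e) if matches!(e.kind(), io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut) => {
                return Err(io::Error::new(
                    io::ErrorKind::TimedOut,
                    format!("request timed out after {budget:?}"),
                ));
            }
            Err(e) => return Err(e),
        }
    }
}
```

A per-read timeout alone would not be enough: a server that trickles one byte just inside each read timeout could keep the call alive indefinitely, which is why the sketch measures against a single absolute deadline. This sketch does not cap the reply size; that concern is separate from the deadline.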

Notes for reviewers

  • No merge-conflict surface with my open PRs. The file
    crates/genie-core/src/ha/client.rs is not touched by PR #169
    (conversation.rs) or PR #171 (calc.rs).
  • Why 30 s / 8 MiB defaults. 30 s is comfortably above the slowest
    realistic HA template render (~3-8 s on Orin Nano with a complex
    template). 8 MiB comfortably fits the largest realistic /api/states
    dump (a household with hundreds of entities still serializes well under
    1 MiB). Both are tunable later via the HaClient struct fields if
    operators want.
  • Why with_test_limits is #[cfg(test)] pub(crate) rather than pub.
    Production callers shouldn't reach in and tighten timeouts — that's a
    config-layer concern that belongs in genie-common/src/config.rs if/when
    needed. Keeping the test helper crate-private avoids putting an unstable
    shape into the public surface.
  • Why a local helper instead of a workspace-wide util::http. The same
    unbounded-read pattern exists in crates/genie-core/src/tools/weather.rs
    and arguably the OpenAI-compat client too, but each speaks a different
    HTTP dialect (Weather is a one-shot Open-Meteo GET; the LLM client
    streams SSE; HA needs cookies/auth/template-render specifics). Lifting
    to a shared util makes sense as a follow-up but expands review surface
    beyond the bug this PR closes. Kept out of scope.

ai-hpc merged commit eedeb8b into GeniePod:main on May 26, 2026
6 checks passed
@ai-hpc
Member

ai-hpc commented May 26, 2026

Reviewed and merged at eedeb8b.

Summary: adds HA client request deadlines and response-size caps so stalled or oversized Home Assistant responses cannot hang tool dispatch.

Thanks @galuis116.

Development

Successfully merging this pull request may close these issues.

[bug] ha/client: only TcpStream::connect has a timeout — any hung HA reply blocks chat/voice/dashboard tool calls indefinitely