fix(ha): enforce request-deadline + response-size cap on HA HTTP client (closes #173) by galuis116 · Pull Request #174 · GeniePod/genie-claw

galuis116 · 2026-05-24T13:43:16Z

Summary

Enforce a request-lifecycle timeout and a response-size cap on HaClient's
raw-TCP HTTP client. Pre-fix only TcpStream::connect had a timeout (5 s);
every subsequent I/O (write_all, status-line read_line, header loop,
body read_exact / read_to_string / chunked decoder) was unbounded, so a
hung Home Assistant — Python GC pause, Supervisor self-update, integration
reload, slow custom add-on, or dropped packet after handshake — blocked the
calling chat/voice/dashboard task forever. The read-to-EOF fallback was
also unbounded in memory.

Same "the assistant froze" symptom class as #109 / PR #118 (voice cycle
stalls) and #124 / PR #125 (LLM stream not cancelled on disconnect), in a
different subsystem.

Closes #173.

Changes

crates/genie-core/src/ha/client.rs:
- Surface the magic-5 connect timeout as DEFAULT_CONNECT_TIMEOUT = 5s,
  add DEFAULT_REQUEST_TIMEOUT = 30s, add
  DEFAULT_MAX_RESPONSE_BYTES = 8 MiB. Doc-commented to explain the
  failure mode each one prevents.
- HaClient carries the three values as fields; HaClient::new defaults
  them from the constants — no behaviour change for existing callers.
- #[cfg(test)] pub(crate) fn with_test_limits(...) builder so the new
  regression tests can run in millisecond-scale.
- http_request:
  - Connect path uses self.connect_timeout; Elapsed maps to
    anyhow!("Home Assistant connect {method} {path} timed out after Ns").
  - Post-connect (write + read_http_response) is wrapped in
    tokio::time::timeout(self.request_timeout, ...); Elapsed maps to
    anyhow!("Home Assistant {method} {path} timed out after Ns").
- read_http_response takes max_response_bytes:
  - Reject up front when advertised Content-Length exceeds the cap.
  - Read-to-EOF fallback (no Content-Length, no chunked) uses
    take(cap + 1) and surfaces a clear "exceeded N bytes" error rather
    than consuming unbounded RSS.
- read_chunked_body takes max_bytes and bails before any chunk whose
  cumulative size would exceed the cap.
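
The http_request change above gives the connect phase and the post-connect phase separate budgets. The PR does this with tokio::time::timeout; as a rough illustration of the same two-budget idea without an async runtime, the following blocking, std-only sketch re-arms the socket timeout against a single deadline. The names request_with_deadline and remaining are hypothetical and are not taken from the PR.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

/// Hypothetical helper (not from the PR): send `request` and read the reply
/// to EOF, bounding the connect phase and the post-connect phase separately.
fn request_with_deadline(
    addr: SocketAddr,
    request: &[u8],
    connect_timeout: Duration,
    request_timeout: Duration,
) -> io::Result<Vec<u8>> {
    // Phase 1: connect has its own, shorter budget.
    let mut stream = TcpStream::connect_timeout(&addr, connect_timeout)?;

    // Phase 2: one deadline covers the write and every subsequent read.
    let deadline = Instant::now() + request_timeout;
    stream.set_write_timeout(Some(remaining(deadline)?))?;
    stream.write_all(request)?;

    let mut out = Vec::new();
    let mut buf = [0u8; 4096];
    loop {
        // Re-arm the read timeout with whatever budget is left, so many
        // small reads cannot add up to more than `request_timeout`.
        stream.set_read_timeout(Some(remaining(deadline)?))?;
        match stream.read(&mut buf) {
            Ok(0) => return Ok(out), // peer closed: response complete
            Ok(n) => out.extend_from_slice(&buf[..n]),
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            // An expired socket timeout surfaces as WouldBlock or TimedOut
            // depending on the platform; normalise both to TimedOut.
            Err(e) if matches!(e.kind(), io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut) => {
                return Err(io::Error::new(io::ErrorKind::TimedOut, "request timed out"));
            }
            Err(e) => return Err(e),
        }
    }
}

/// Time left until `deadline`, or a TimedOut error once it has passed.
/// (A zero Duration is rejected by set_read_timeout, so it is handled here.)
fn remaining(deadline: Instant) -> io::Result<Duration> {
    let left = deadline.saturating_duration_since(Instant::now());
    if left.is_zero() {
        Err(io::Error::new(io::ErrorKind::TimedOut, "request timed out"))
    } else {
        Ok(left)
    }
}
```

Against a listener that accepts the connection and then never writes, this returns a TimedOut error shortly after request_timeout elapses. Unlike the async version, nothing is cancelled; the caller simply stops waiting.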

Tests: 5 new #[tokio::test] regressions backed by ephemeral local TcpListeners
(spawn_listener + test_client helpers, ~50 LOC of scaffolding shared
across the suite):
- hung_server_after_connect_times_out_cleanly — drains the request then
  sleeps 60 s without writing a response; asserts Err("…timed out…")
  inside the 300 ms test budget. Pre-fix this hung forever.
- slow_server_within_budget_succeeds — 50 ms delay then a valid 200
  with []; asserts get_states() returns an empty Vec. Locks in
  "healthy HA still works."
- oversize_content_length_is_rejected — advertises Content-Length: 1048576 against a 16 KiB test cap; asserts the client bails before
  allocating.
- read_to_eof_response_is_size_capped — streams 64 KiB with no
  Content-Length and no chunked encoding; asserts the client bails with
  "exceeded N bytes" instead of consuming memory.
- chunked_body_aggregate_size_is_capped — 8 × 4 KiB chunks (32 KiB) vs
  16 KiB cap; asserts the chunked decoder bails partway through.
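
The two non-chunked size checks exercised by these tests (reject an oversize Content-Length up front, and cap the read-to-EOF fallback with take(cap + 1)) can be sketched with the blocking std::io::Read trait, whose take adapter is the counterpart of the async one. This is a minimal illustration under that assumption; read_body_capped and its error strings are hypothetical, not the PR's code.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not from the PR): read a response body from `reader`
/// without ever buffering more than `cap` bytes (plus one sentinel byte).
/// `content_length` is the already-parsed header value, if present.
fn read_body_capped<R: Read>(
    reader: &mut R,
    content_length: Option<u64>,
    cap: u64,
) -> io::Result<Vec<u8>> {
    match content_length {
        // Advertised size is known: reject before allocating anything.
        Some(len) if len > cap => Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("Content-Length {len} exceeds cap of {cap} bytes"),
        )),
        // Advertised size fits: allocate exactly that much and fill it.
        Some(len) => {
            let mut body = vec![0u8; len as usize];
            reader.read_exact(&mut body)?;
            Ok(body)
        }
        // No length advertised: read to EOF through `take(cap + 1)`.
        // Receiving more than `cap` bytes proves the body is oversized
        // while buffering at most one byte past the limit.
        None => {
            let mut body = Vec::new();
            reader
                .by_ref()
                .take(cap.saturating_add(1))
                .read_to_end(&mut body)?;
            if body.len() as u64 > cap {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!("response exceeded {cap} bytes"),
                ));
            }
            Ok(body)
        }
    }
}
```

Reading one byte past the cap is what distinguishes a body of exactly cap bytes, which is accepted, from one that is larger.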

No config schema change, no public API change. The defaults are
intentionally conservative (30 s request, 8 MiB body), well above any
realistic HA response, so existing callers see identical behaviour
against a healthy HA.
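
For orientation, the three defaults and a constructor that falls back to them might look roughly like the sketch below. The constant names and values are the ones stated in this description; the Limits struct, its field types (usize for the byte cap in particular), and with_limits are assumptions made for illustration, not the layout of HaClient.

```rust
use std::time::Duration;

// Names and values as stated in the PR description.
const DEFAULT_CONNECT_TIMEOUT: Duration = Duration::from_secs(5);
const DEFAULT_REQUEST_TIMEOUT: Duration = Duration::from_secs(30);
const DEFAULT_MAX_RESPONSE_BYTES: usize = 8 * 1024 * 1024; // 8 MiB; the type is assumed

/// Hypothetical stand-in for the three limit fields the client carries.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Limits {
    connect_timeout: Duration,
    request_timeout: Duration,
    max_response_bytes: usize,
}

impl Default for Limits {
    /// Existing callers get the constants, so their behaviour is unchanged.
    fn default() -> Self {
        Self {
            connect_timeout: DEFAULT_CONNECT_TIMEOUT,
            request_timeout: DEFAULT_REQUEST_TIMEOUT,
            max_response_bytes: DEFAULT_MAX_RESPONSE_BYTES,
        }
    }
}

impl Limits {
    /// Override all three limits. In the PR the equivalent builder is
    /// test-only (#[cfg(test)] pub(crate)); it is left ungated here so the
    /// sketch compiles on its own.
    fn with_limits(
        connect_timeout: Duration,
        request_timeout: Duration,
        max_response_bytes: usize,
    ) -> Self {
        Self { connect_timeout, request_timeout, max_response_bytes }
    }
}
```

Tests can then shrink the limits to millisecond and kilobyte scale while production code keeps the defaults.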

Real Behavior Proof

Built and ran the affected code locally.
NOT verified on Jetson hardware. The fix is in pure tokio I/O
scheduling — no audio, voice, ALSA, CUDA, LLM-backend, or systemd
surface — exercised end-to-end by 5 #[tokio::test] regressions that
spin up a real local TcpListener and drive each failure mode (hung
server, slow but OK, oversize Content-Length, read-to-EOF, oversize
chunked aggregate).

What I ran

Environment: x86_64 Linux dev host (Ubuntu 22.04, Rust 1.95.0). No Jetson
available.

cargo fmt --all -- --check                                                # clean
cargo clippy --workspace --all-targets -- -D warnings                    # clean
cargo clippy --workspace --all-targets --no-default-features -- -D warnings  # clean
cargo test -p genie-core --lib ha::client::                               # 8 / 0
cargo test                                                                # 610 / 0 / 3
cargo test --workspace --no-default-features                             # 505 / 0

What I observed

- Hung server is bounded. hung_server_after_connect_times_out_cleanly
  accepts the TCP connection, drains the request, then calls
  tokio::time::sleep for 60 s. With the fix, HaClient::test_connection()
  returns Err("Home Assistant GET /api/ timed out after 300ms") in well
  under 2 seconds. Pre-fix the same call hangs for the full 60 seconds.
- Healthy slow HA still works. slow_server_within_budget_succeeds
  sleeps 50 ms before writing a minimal 200 response. The client returns
  Ok([]), so there is no regression on the happy path.
- Oversize body rejected up front. oversize_content_length_is_rejected
  advertises 1 MiB against a 16 KiB cap; client bails with "exceeds cap"
  before allocating.
- Read-to-EOF is bounded. read_to_eof_response_is_size_capped streams
  64 KiB; client bails with "exceeded 16384 bytes" instead of consuming RSS.
- Chunked aggregate is bounded. chunked_body_aggregate_size_is_capped
  emits 32 KiB across 8 chunks; client bails with "chunked response
  exceeded 16384 bytes" partway through.
- No direct-path regression. The 3 pre-existing parse_http_url_*
  tests still pass. Full cargo test matches the main baseline:
  610 / 0 / 3 (default features), 505 / 0 (--no-default-features).
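
The chunked-aggregate cap described above, where the decoder bails before reading any chunk whose cumulative size would exceed the limit, can be sketched as follows. This is a simplified blocking decoder over std::io::BufRead: it ignores trailers, does not bound the length of a chunk-size line, and read_chunked_capped is a hypothetical name, not the PR's read_chunked_body.

```rust
use std::io::{self, BufRead};

/// Hypothetical decoder (not from the PR): reassemble an HTTP/1.1 chunked
/// body, refusing any chunk that would push the total past `max_bytes`.
fn read_chunked_capped<R: BufRead>(reader: &mut R, max_bytes: usize) -> io::Result<Vec<u8>> {
    let mut body = Vec::new();
    loop {
        // Chunk header: hexadecimal size, optionally followed by ";extension".
        let mut line = String::new();
        if reader.read_line(&mut line)? == 0 {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "truncated chunked body",
            ));
        }
        let size_hex = line.split(';').next().unwrap_or("").trim();
        let size = usize::from_str_radix(size_hex, 16)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        if size == 0 {
            return Ok(body); // terminating zero-length chunk
        }
        // Check the cumulative size BEFORE growing the buffer, so an
        // oversize chunk is never allocated or read.
        let fits = body
            .len()
            .checked_add(size)
            .map_or(false, |total| total <= max_bytes);
        if !fits {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                format!("chunked response exceeded {max_bytes} bytes"),
            ));
        }
        let start = body.len();
        body.resize(start + size, 0);
        reader.read_exact(&mut body[start..])?;
        // Each chunk's data is followed by CRLF.
        let mut crlf = [0u8; 2];
        reader.read_exact(&mut crlf)?;
    }
}
```

With eight 4 KiB chunks and a 16 KiB cap, the first four chunks are accepted and the fifth chunk header triggers the error, which matches the "partway through" behaviour described for the test.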

Test plan

A reviewer can re-verify on any Rust 1.85+ host (no Jetson, no Home
Assistant, no audio needed):

cargo test -p genie-core --lib ha::client:: — 8 tests (3 existing
parse_http_url_* + 5 new), all green in <1 s.
Optional manual proof against a real HA:
1. Start genie-core with a working HA configured.
2. Pause HA: sudo kill -STOP $(pgrep -f homeassistant) (or
  supervisorctl stop homeassistant while a request is mid-flight).
3. From a separate shell:
  curl -m 120 -X POST http://127.0.0.1:3000/api/chat \
    -H 'Content-Type: application/json' \
    -d '{"message":"is the kitchen light on?"}'
4. Pre-fix: genie-core hangs forever (curl gives up after 120 s).
  With the fix: the request returns within ~30 s with the chat reply
  containing the tool dispatch's
  Err("Home Assistant GET /api/states/... timed out after 30s").

Notes for reviewers

- No merge-conflict surface with my open PRs. The file
  crates/genie-core/src/ha/client.rs is not touched by PR #169
  (conversation.rs) or PR #171 (calc.rs).
- Why 30 s / 8 MiB defaults. 30 s is comfortably above the slowest
  realistic HA template render (~3-8 s on Orin Nano with a complex
  template). 8 MiB comfortably fits the largest realistic /api/states
  dump (a household with hundreds of entities still serializes well under
  1 MiB). Both are tunable later via the HaClient struct fields if
  operators want.
- Why with_test_limits is #[cfg(test)] pub(crate) rather than pub.
  Production callers shouldn't reach in and tighten timeouts; that's a
  config-layer concern that belongs in genie-common/src/config.rs if/when
  needed. Keeping the test helper crate-private avoids putting an unstable
  shape into the public surface.
- Why a local helper instead of a workspace-wide util::http. The same
  unbounded-read pattern exists in crates/genie-core/src/tools/weather.rs
  and arguably the OpenAI-compat client too, but each speaks a different
  HTTP dialect (Weather is a one-shot Open-Meteo GET; the LLM client
  streams SSE; HA needs cookies/auth/template-render specifics). Lifting
  to a shared util makes sense as a follow-up but expands review surface
  beyond the bug this PR closes. Kept out of scope.


ai-hpc · 2026-05-26T21:40:17Z

Reviewed and merged at eedeb8b.

Summary: adds HA client request deadlines and response-size caps so stalled or oversized Home Assistant responses cannot hang tool dispatch.

Thanks @galuis116.

Commit 1cbeedc: fix(ha): enforce request-deadline + response-size cap on HA HTTP client (closes GeniePod#173)

ai-hpc merged commit eedeb8b into GeniePod:main May 26, 2026
6 checks passed

This was referenced May 26, 2026

[bug] tools/weather: unbounded HTTP read + silent chunked corruption + bare-spaces URL encoding lose chat or return wrong location #203

Closed

fix(weather): enforce read deadline + size cap + percent-encode geocode (closes #203) #205

Merged

ai-hpc merged 1 commit into GeniePod:main from galuis116:fix/ha-client-request-timeout
