fix(ha): enforce request-deadline + response-size cap on HA HTTP client (closes #173)#174
Merged
Merged
Conversation
Member
|
Reviewed and merged at eedeb8b (eedeb8b). Summary: adds HA client request deadlines and response-size caps so stalled or oversized Home Assistant responses cannot hang tool dispatch. Thanks @galuis116. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Enforce a request-lifecycle timeout and a response-size cap on
HaClient'sraw-TCP HTTP client. Pre-fix only
TcpStream::connecthad a timeout (5 s);every subsequent I/O (
write_all, status-lineread_line, header loop,body
read_exact/read_to_string/ chunked decoder) was unbounded, so ahung Home Assistant — Python GC pause, Supervisor self-update, integration
reload, slow custom add-on, or dropped packet after handshake — blocked the
calling chat/voice/dashboard task forever. The read-to-EOF fallback was
also unbounded in memory.
Same "the assistant froze" symptom class as #109 / PR #118 (voice cycle
stalls) and #124 / PR #125 (LLM stream not cancelled on disconnect), in a
different subsystem.
Closes #173.
Changes
crates/genie-core/src/ha/client.rs:DEFAULT_CONNECT_TIMEOUT = 5s,add
DEFAULT_REQUEST_TIMEOUT = 30s, addDEFAULT_MAX_RESPONSE_BYTES = 8 MiB. Doc-commented to explain thefailure mode each one prevents.
HaClientcarries the three values as fields;HaClient::newdefaultsthem from the constants — no behaviour change for existing callers.
#[cfg(test)] pub(crate) fn with_test_limits(...)builder so the newregression tests can run in millisecond-scale.
http_request:self.connect_timeout;Elapsedmaps toanyhow!("Home Assistant connect {method} {path} timed out after Ns").tokio::time::timeout(self.request_timeout, ...);Elapsedmaps toanyhow!("Home Assistant {method} {path} timed out after Ns").read_http_responsetakesmax_response_bytes:Content-Lengthexceeds the cap.take(cap + 1)and surfaces a clear "exceeded N bytes" error ratherthan consuming unbounded RSS.
read_chunked_bodytakesmax_bytesand bails before any chunk whosecumulative size would exceed the cap.
#[tokio::test]regressions backed by ephemeral localTcpListeners(
spawn_listener+test_clienthelpers, ~50 LOC of scaffolding sharedacross the suite):
hung_server_after_connect_times_out_cleanly— drains the request thensleeps 60 s without writing a response; asserts
Err("…timed out…")inside the 300 ms test budget. Pre-fix this hung forever.
slow_server_within_budget_succeeds— 50 ms delay then a valid 200with
[]; assertsget_states()returns an emptyVec. Locks in"healthy HA still works."
oversize_content_length_is_rejected— advertisesContent-Length: 1048576against a 16 KiB test cap; asserts the client bails beforeallocating.
read_to_eof_response_is_size_capped— streams 64 KiB with noContent-Lengthand no chunked encoding; asserts the client bails with"exceeded N bytes" instead of consuming memory.
chunked_body_aggregate_size_is_capped— 8 × 4 KiB chunks (32 KiB) vs16 KiB cap; asserts the chunked decoder bails partway through.
No config schema change, no public API change. The defaults are
intentionally conservative (30 s request, 8 MiB body) — well above any
realistic HA response. ASCII-only households see byte-identical behaviour
against a healthy HA.
Real Behavior Proof
scheduling — no audio, voice, ALSA, CUDA, LLM-backend, or systemd
surface — exercised end-to-end by 5
#[tokio::test]regressions thatspin up a real local
TcpListenerand drive each failure mode (hungserver, slow but OK, oversize Content-Length, read-to-EOF, oversize
chunked aggregate).
What I ran
Environment: x86_64 Linux dev host (Ubuntu 22.04, Rust 1.95.0). No Jetson
available.
What I observed
hung_server_after_connect_times_out_cleanlyaccepts the TCP connection, drains the request, then
tokio::sleep(60s)s.With the fix,
HaClient::test_connection()returnsErr("Home Assistant GET /api/ timed out after 300ms")in well under2 seconds. Pre-fix the same call hangs for the full 60 seconds.
slow_server_within_budget_succeedssleeps 50 ms before writing a minimal 200 response. The client returns
Ok([])— no regression on the happy path.oversize_content_length_is_rejectedadvertises 1 MiB against a 16 KiB cap; client bails with "exceeds cap"
before allocating.
read_to_eof_response_is_size_cappedstreams64 KiB; client bails with "exceeded 16384 bytes" instead of consuming RSS.
chunked_body_aggregate_size_is_cappedemits 32 KiB across 8 chunks; client bails with "chunked response
exceeded 16384 bytes" partway through.
parse_http_url_*tests still pass. Full
cargo testmatches themainbaseline:610 / 0 / 3 (default features), 505 / 0 (
--no-default-features).Test plan
A reviewer can re-verify on any Rust 1.85+ host (no Jetson, no Home
Assistant, no audio needed):
cargo test -p genie-core --lib ha::client::— 8 tests (3 existingparse_http_url_*+ 5 new), all green in <1 s.genie-corewith a working HA configured.sudo kill -STOP $(pgrep -f homeassistant)(orsupervisorctl stop homeassistantwhile a request is mid-flight).curl -m 120 -X POST http://127.0.0.1:3000/api/chat \ -H 'Content-Type: application/json' \ -d '{"message":"is the kitchen light on?"}'.With the fix: the request returns within ~30 s with the chat reply
containing the tool dispatch's
Err("Home Assistant GET /api/states/... timed out after 30s").Notes for reviewers
crates/genie-core/src/ha/client.rsis not touched by PR fix(conversation): clip first-message title at UTF-8 char boundary (closes #168) #169(conversation.rs) or PR fix(calc): cap paren nesting depth to prevent stack-overflow abort (closes #170) #171 (calc.rs).
realistic HA template render (~3-8 s on Orin Nano with a complex
template). 8 MiB comfortably fits the largest realistic
/api/statesdump (a household with hundreds of entities still serializes well under
1 MiB). Both are tunable later via the
HaClientstruct fields ifoperators want.
with_test_limitsis#[cfg(test)] pub(crate)rather thanpub.Production callers shouldn't reach in and tighten timeouts — that's a
config-layer concern that belongs in
genie-common/src/config.rsif/whenneeded. Keeping the test helper crate-private avoids putting an unstable
shape into the public surface.
util::http. The sameunbounded-read pattern exists in
crates/genie-core/src/tools/weather.rsand arguably the OpenAI-compat client too, but each speaks a different
HTTP dialect (Weather is a one-shot Open-Meteo GET; the LLM client
streams SSE; HA needs cookies/auth/template-render specifics). Lifting
to a shared util makes sense as a follow-up but expands review surface
beyond the bug this PR closes. Kept out of scope.