Add fleetnode command artifact transfer foundation #596
🔐 Codex Security Review
Review Summary
Overall Risk: NONE
Findings
No high-impact security, correctness, or reliability findings were identified in the scoped diff.
Notes
The new artifact RPCs are fleet-node session authenticated, request/response bodies are suppressed from streaming logs, transfers are bound to in-flight command expectations, artifact IDs are server generated/canonicalized, uploaded bytes are size and SHA-256 checked, filenames are sanitized, and concurrent upload/download streams are capped per fleet node. I did not find changes affecting mining pool URLs, wallet/payout addresses, shell command construction, SQL query construction, frontend rendering, Docker/Nginx config, Rust, or Python plugin code in this PR diff.
Generated by Codex Security Review
- Remove fleetd's global ReadTimeout so long-lived fleet node streams are not capped by the HTTP server
- Reserve the command artifact metadata sidecar filename during upload sanitization
- Add finite upload header, chunk, and total-transfer deadlines that release per-node slots
- Make duplicate upload test tolerant of transport EOF timing
- Bind artifact transfer slot releases to acquired leases across ControlStream reconnects
- Add per-node download slots plus bounded download send and total deadlines
- Cover reconnect upload accounting and blocked download sends
- Treat corrupt artifact directories and metadata size mismatches as internal storage errors
- Map canceled upload receives distinctly from deadlines and no-active-stream admissions explicitly
- Replace per-chunk upload receive goroutines with a single receive loop per RPC
- Store completed upload refs in in-flight artifact expectations
- Return matching completed refs for duplicate upload headers while the command remains active
- Cover idempotent upload retry at the registry and gateway layers
- Reject upload chunks larger than the 1 MiB command artifact chunk size
- Leave successful downloads retryable while the command remains in flight
- Cover oversized chunks and repeated in-flight downloads
- Read and verify retry chunks before returning a stored completed upload ref
- Cover duplicate upload retry using the normal chunk-sending client flow
- add HTTP/2 stalled-write timeout coverage for artifact downloads
- charge completed upload retries against the attempt cap
- bind completed upload retry drains to the command lifetime
- add a scoped Connect read limit for UploadCommandArtifact
- cover oversized upload chunks and retry reinstatement
- fold redundant HTTP/2 timeout assertion into the behavioral test
- share artifact upload test setup and message builders
Reviewable diff: +1299/-43 across 15 files (excludes generated, test, and story files).
Summary
This adds the fleetnode command artifact transfer foundation for large command-scoped files. Fleet nodes can now stream command artifacts, such as miner log bundles, to the server and stream server-stored artifacts back down without pushing large payloads through ControlAck. This PR wires the transport, storage, validation, lifecycle cleanup, and command-expectation gates; miner log download and firmware update command consumers remain follow-up work.
How it works
A server-issued control command can register artifact expectations in the control registry with a command ID, direction, purpose, optional artifact ID, and optional device identifier. The fleet node opens UploadCommandArtifact or DownloadCommandArtifact with its normal fleetnode gateway auth token, and the gateway admits the stream only when the command is still in flight and the transfer matches a registered expectation.
For uploads, the gateway reserves a lease-bound per-fleetnode upload slot before reading the first stream message, bounds the header wait plus chunk progress and total transfer duration, then admits the transfer after the header matches an in-flight command expectation. The files service streams bytes through staging, validates declared size and SHA-256, promotes the artifact under a server-generated ID, and writes a metadata sidecar used for later opens. Failed uploads reinstate the expectation until the command-level attempt cap is reached; if the bytes were saved but the response was lost, matching duplicate upload retries drain and verify the retry chunks before returning the stored artifact ref while the command remains in flight.
For downloads, the gateway reserves a per-fleetnode download slot, marks the matching expectation in progress, opens the artifact and validates stored metadata against the issued reference, then streams a header and chunks back to the fleet node with finite send and total-transfer deadlines. Downloads reinstate the expectation after EOF so the same in-flight command can retry until its attempt cap; command completion removes the expectation. The fleetnode helper verifies received size and checksum before returning the reference.
Finalized artifacts are retained for 7 days by default and swept periodically. The sweep is intentionally TTL-based and does not pin artifacts for unusually long-lived in-flight commands; downstream command consumers should keep command lifetimes shorter than the retention window or issue fresh artifact references.
Diagrams
Areas of the code involved
- proto/fleetnodegateway/v1
- server/generated/grpc/..., client/src/protoFleet/api/generated/...
- server/internal/domain/fleetnode/control
- server/internal/handlers/fleetnode/gateway
- server/internal/infrastructure/files
- server/cmd/fleetd
- server/cmd/fleetnode
- server/internal/handlers/interceptors/config.go
Key technical decisions & trade-offs
- ControlAck would keep the API smaller but hit existing payload limits and retry semantics
Testing & validation
- buf lint
- git diff --check
- source ./bin/activate-hermit && just gen
- source ./bin/activate-hermit && go test ./server/internal/infrastructure/files ./server/internal/domain/fleetnode/control ./server/internal/handlers/fleetnode/gateway -run 'TestCommandArtifact|TestRunCommandArtifactDownloadSend|TestContextConnectError|TestMapArtifactAdmissionError|TestAdmitCommandArtifact|TestReinstateCommandArtifact|TestCompletedCommandArtifactUpload|TestCommandArtifactUpload|TestCommandArtifactUploadReader|TestSaveCommandArtifact|TestOpenCommandArtifact|TestDeleteCommandArtifact|TestSweepExpiredCommandArtifacts' ./server/cmd/fleetnode ./server/cmd/fleetd
- source ./bin/activate-hermit && go test ./server/internal/handlers/fleetnode/gateway -run TestCommandArtifactUploadAndDownloadRequireInFlightExpectation -count=50
- source ./bin/activate-hermit && go test ./server/cmd/fleetnode ./server/cmd/fleetd
- cd server && source ../bin/activate-hermit && golangci-lint run -c .golangci.yaml
- source ./bin/activate-hermit && go test ./server/cmd/fleetd && (cd server && golangci-lint run -c .golangci.yaml ./cmd/fleetd)
- source ./bin/activate-hermit && just lint
- server/internal/handlers/middleware -run TestRPCContract_EveryRegisteredProcedureIsClassified.
The full DB-backed gateway suite was not run locally because Postgres rejected the configured fleet user credentials. The targeted non-DB gateway artifact test and middleware RPC contract test passed.
Post-Deploy Monitoring & Validation
- UploadCommandArtifact / DownloadCommandArtifact
- FailedPrecondition cases
- ResourceExhausted, Internal, checksum, or metadata mismatch errors
- artifact transfer not expected, size mismatch, SHA mismatch, or cleanup sweep errors
- command-artifacts
This PR does not yet wire a miner-log command consumer, so rollback for this foundation is low risk: disable or avoid enabling downstream consumers, then revert the transfer foundation if gateway error rates move unexpectedly.