Skip to content

Add fleetnode command artifact transfer foundation#596

Open
ankitgoswami wants to merge 20 commits into
mainfrom
ankitg/fleetnode-command-artifacts
Open

Add fleetnode command artifact transfer foundation#596
ankitgoswami wants to merge 20 commits into
mainfrom
ankitg/fleetnode-command-artifacts

Conversation

@ankitgoswami

@ankitgoswami ankitgoswami commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Reviewable diff: +1299/-43 across 15 files (excludes generated, test, and story files).

Summary

This adds the fleetnode command artifact transfer foundation for large command-scoped files. Fleet nodes can now stream command artifacts, such as miner log bundles, to the server and stream server-stored artifacts back down without pushing large payloads through ControlAck. This PR wires the transport, storage, validation, lifecycle cleanup, and command-expectation gates; miner log download and firmware update command consumers remain follow-up work.

How it works

A server-issued control command can register artifact expectations in the control registry with a command ID, direction, purpose, optional artifact ID, and optional device identifier. The fleet node opens UploadCommandArtifact or DownloadCommandArtifact with its normal fleetnode gateway auth token, and the gateway admits the stream only when the command is still in flight and the transfer matches a registered expectation.

For uploads, the gateway reserves a lease-bound per-fleetnode upload slot before reading the first stream message, bounds the header wait plus chunk progress and total transfer duration, then admits the transfer after the header matches an in-flight command expectation. The files service streams bytes through staging, validates declared size and SHA-256, promotes the artifact under a server-generated ID, and writes a metadata sidecar used for later opens. Failed uploads reinstate the expectation until the command-level attempt cap is reached; if the bytes were saved but the response was lost, matching duplicate upload retries drain and verify the retry chunks before returning the stored artifact ref while the command remains in flight.

For downloads, the gateway reserves a per-fleetnode download slot, marks the matching expectation in progress, opens the artifact and validates stored metadata against the issued reference, then streams a header and chunks back to the fleet node with finite send and total-transfer deadlines. Downloads reinstate the expectation after EOF so the same in-flight command can retry until its attempt cap; command completion removes the expectation. The fleetnode helper verifies received size and checksum before returning the reference.

Finalized artifacts are retained for 7 days by default and swept periodically. The sweep is intentionally TTL-based and does not pin artifacts for unusually long-lived in-flight commands; downstream command consumers should keep command lifetimes shorter than the retention window or issue fresh artifact references.

Diagrams

flowchart LR
  CommandPath["Operator command path"] --> Registry["Control registry"]
  Registry --> ControlCommand["ControlCommand with command_id"]
  ControlCommand --> FleetNode["Fleet node agent"]
  FleetNode --> UploadRPC["UploadCommandArtifact stream"]
  UploadRPC --> Gateway["Fleetnode gateway"]
  Gateway --> Store["Command artifact store"]
  Store --> ArtifactRef["CommandArtifactRef"]
  ArtifactRef --> FleetNode
Loading
sequenceDiagram
  participant Server as Server command path
  participant Registry as Control registry
  participant Node as Fleet node
  participant Gateway as Gateway RPC handler
  participant Files as Files service

  Server->>Registry: Register command artifact expectation
  Server->>Node: Deliver control command
  Node->>Gateway: Upload or download artifact stream
  Gateway->>Registry: Admit transfer for command_id and purpose
  Registry-->>Gateway: Accepted only while command is in flight
  Gateway->>Files: Save or open artifact bytes
  Files-->>Gateway: Validated metadata and stream
  Gateway-->>Node: Artifact ref or streamed bytes
Loading

Areas of the code involved

Area / package / file What changed Why it matters for review
proto/fleetnodegateway/v1 Adds artifact purpose/reference messages plus upload/download streaming RPCs Defines the cross-process API contract
server/generated/grpc/..., client/src/protoFleet/api/generated/... Regenerated gateway bindings Generated -- skip except for sanity checks
server/internal/domain/fleetnode/control Adds artifact expectations, in-progress/completed transfer state, lease-bound upload/download slot limits, and attempt caps Ensures transfers are tied to in-flight commands and bounded while retries remain possible
server/internal/handlers/fleetnode/gateway Implements upload and download stream handlers with finite upload receive, single-loop upload reads, and download send deadlines Main trust boundary for fleetnode artifact movement
server/internal/infrastructure/files Adds command artifact storage, reserved metadata sidecar handling, retention sweep, deletion, and size limits Persists uploaded bytes, avoids checksum recompute on every download, and bounds disk growth
server/cmd/fleetd Keeps only the header read timeout and starts the command artifact cleanup loop Preserves long-lived fleet node streams while expiring finalized artifacts
server/cmd/fleetnode Adds client-side upload/download stream helpers Gives fleetnode command consumers a small verified transfer API
server/internal/handlers/interceptors/config.go Classifies new RPCs as authenticated and sensitive-body procedures Keeps auth and request logging policy aligned with the new streams
Tests Adds storage, registry, gateway stream, and config coverage Exercises duplicate rejection, retry reinstatement, lifecycle cleanup, and byte verification

Key technical decisions & trade-offs

Decision Alternative considered
Stream artifacts out-of-band from control acknowledgements Embedding logs or firmware-sized payloads in ControlAck would keep the API smaller but hit existing payload limits and retry semantics
Gate every transfer through an in-flight command expectation Allowing nodes to upload or download by artifact ID alone would be simpler but would create a broader file access surface
Track artifact expectations as in-progress/completed and retain completed upload refs for idempotent retry A single consumed boolean is simpler but burns retries on transient pre-completion failures or lost upload responses
Keep downloads retryable until command completion, with send deadlines and attempt caps Completing the expectation at EOF limits reads but can burn retries when trailers are lost before the node observes success
Cap per-command artifact transfer attempts and per-node upload/download concurrency Unlimited reinstatement and unbounded streams are friendlier to flaky nodes but let bad peers pin server resources
Persist validated artifact metadata in a reserved sidecar filename Recomputing SHA-256 on every open is simpler but doubles disk I/O for downloads
Retain finalized artifacts by TTL with explicit delete helper Persistent references/quota tracking would give tighter ownership controls but needs org/node metadata not present in this foundation
Store artifacts under opaque server-generated IDs Reusing client filenames as identifiers would aid debugging but increases collision and path-safety risk

Testing & validation

  • buf lint
  • git diff --check
  • source ./bin/activate-hermit && just gen
  • source ./bin/activate-hermit && go test ./server/internal/infrastructure/files ./server/internal/domain/fleetnode/control ./server/internal/handlers/fleetnode/gateway -run 'TestCommandArtifact|TestRunCommandArtifactDownloadSend|TestContextConnectError|TestMapArtifactAdmissionError|TestAdmitCommandArtifact|TestReinstateCommandArtifact|TestCompletedCommandArtifactUpload|TestCommandArtifactUpload|TestCommandArtifactUploadReader|TestSaveCommandArtifact|TestOpenCommandArtifact|TestDeleteCommandArtifact|TestSweepExpiredCommandArtifacts' ./server/cmd/fleetnode ./server/cmd/fleetd
  • source ./bin/activate-hermit && go test ./server/internal/handlers/fleetnode/gateway -run TestCommandArtifactUploadAndDownloadRequireInFlightExpectation -count=50
  • source ./bin/activate-hermit && go test ./server/cmd/fleetnode ./server/cmd/fleetd
  • cd server && source ../bin/activate-hermit && golangci-lint run -c .golangci.yaml
  • source ./bin/activate-hermit && go test ./server/cmd/fleetd && (cd server && golangci-lint run -c .golangci.yaml ./cmd/fleetd)
  • source ./bin/activate-hermit && just lint
  • Earlier focused validation also covered server/internal/handlers/middleware -run TestRPCContract_EveryRegisteredProcedureIsClassified.

The full DB-backed gateway suite was not run locally because Postgres rejected the configured fleet user credentials. The targeted non-DB gateway artifact test and middleware RPC contract test passed.

Post-Deploy Monitoring & Validation

Signal Healthy Investigate
Gateway RPC errors for UploadCommandArtifact / DownloadCommandArtifact Low error rate, mostly stale-command FailedPrecondition cases Spikes in ResourceExhausted, Internal, checksum, or metadata mismatch errors
Fleetd logs around command artifacts Expected transfers only when command consumers start using this path Repeated artifact transfer not expected, size mismatch, SHA mismatch, or cleanup sweep errors
Artifact storage growth under command-artifacts Growth only during artifact-consuming command workflows, followed by TTL cleanup Unexpected disk growth before consumers are enabled, or artifacts older than the retention TTL
Fleetnode command success rates No regression for existing commands; artifact consumers retry only transient failures Commands exhausting artifact transfer attempts or hitting upload concurrency limits

This PR does not yet wire a miner-log command consumer, so rollback for this foundation is low risk: disable or avoid enabling downstream consumers, then revert the transfer foundation if gateway error rates move unexpectedly.

@github-actions github-actions Bot added javascript Pull requests that update javascript code client server shared labels Jun 25, 2026
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (70ace71b898e835144381e4d19aedb25450672fa...6e72e9edfa2c28553a8641845b5c3254708852ee, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: NONE

Findings

No high-impact security, correctness, or reliability findings were identified in the scoped diff.

Notes

Reviewed .git/codex-review.diff only, covering the new command artifact upload/download RPCs, registry admission state, disk storage helpers, HTTP/2 write timeout, auth/interceptor configuration, protobuf changes, and generated code.

The new artifact RPCs are fleet-node session authenticated, request/response bodies are suppressed from streaming logs, transfers are bound to in-flight command expectations, artifact IDs are server generated/canonicalized, uploaded bytes are size and SHA-256 checked, filenames are sanitized, and concurrent upload/download streams are capped per fleet node.

I did not find changes affecting mining pool URLs, wallet/payout addresses, shell command construction, SQL query construction, frontend rendering, Docker/Nginx config, Rust, or Python plugin code in this PR diff.


Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

- Remove fleetd's global ReadTimeout so long-lived fleet node streams are not capped by the HTTP server
- Reserve the command artifact metadata sidecar filename during upload sanitization

- Add finite upload header, chunk, and total-transfer deadlines that release per-node slots

- Make duplicate upload test tolerant of transport EOF timing
@ankitgoswami ankitgoswami marked this pull request as ready for review June 25, 2026 23:29
@ankitgoswami ankitgoswami requested a review from a team as a code owner June 25, 2026 23:29
Copilot AI review requested due to automatic review settings June 25, 2026 23:29

This comment was marked as low quality.

- Bind artifact transfer slot releases to acquired leases across ControlStream reconnects

- Add per-node download slots plus bounded download send and total deadlines

- Cover reconnect upload accounting and blocked download sends
chatgpt-codex-connector[bot]

This comment was marked as outdated.

- Treat corrupt artifact directories and metadata size mismatches as internal storage errors

- Map canceled upload receives distinctly from deadlines and no-active-stream admissions explicitly

- Replace per-chunk upload receive goroutines with a single receive loop per RPC
@ankitgoswami

This comment was marked as outdated.

chatgpt-codex-connector[bot]

This comment was marked as outdated.

- Store completed upload refs in in-flight artifact expectations

- Return matching completed refs for duplicate upload headers while the command remains active

- Cover idempotent upload retry at the registry and gateway layers
- Reject upload chunks larger than the 1 MiB command artifact chunk size

- Leave successful downloads retryable while the command remains in flight

- Cover oversized chunks and repeated in-flight downloads
chatgpt-codex-connector[bot]

This comment was marked as outdated.

- Read and verify retry chunks before returning a stored completed upload ref

- Cover duplicate upload retry using the normal chunk-sending client flow
chatgpt-codex-connector[bot]

This comment was marked as outdated.

chatgpt-codex-connector[bot]

This comment was marked as outdated.

chatgpt-codex-connector[bot]

This comment was marked as outdated.

- add HTTP/2 stalled-write timeout coverage for artifact downloads

- charge completed upload retries against the attempt cap

- bind completed upload retry drains to the command lifetime
chatgpt-codex-connector[bot]

This comment was marked as outdated.

- add a scoped Connect read limit for UploadCommandArtifact

- cover oversized upload chunks and retry reinstatement
- fold redundant HTTP/2 timeout assertion into the behavioral test

- share artifact upload test setup and message builders
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client javascript Pull requests that update javascript code server shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants