Skip to content

fix(device-upload): async worker reads shared hash store, not cross-container /tmp — v2.72.1#31

Merged
BigBill1418 merged 1 commit into
mainfrom
fix/async-device-upload-cross-container-tmp
Jun 25, 2026
Merged

fix(device-upload): async worker reads shared hash store, not cross-container /tmp — v2.72.1#31
BigBill1418 merged 1 commit into
mainfrom
fix/async-device-upload-cross-container-tmp

Conversation

@BigBill1418

Copy link
Copy Markdown
Owner

Problem

The shipped ADR-0023 async device-upload path failed for every real upload. Field-reported on a DJI Mavic 4 Pro flight log (2026-06-24 21:43 PDT): the DroneOpsSync client saw the full path succeed — SAF scan OK → POST .../device-upload/async 202 → poll complete / done / 100 — then a per-file error:

async per-file error: FlightRecord_..._.txt: [Errno 2] No such file or directory: '/tmp/flight_upload_bj_i2gq7'

Root cause

The async route spools the upload to a local /tmp file (_spool_uploadtempfile.mkstemp(prefix="flight_upload_")) and passed that path string to parse_device_flight_task.delay(...). The Celery worker runs in a different container than the backend (separate services in docker-compose.yml). /tmp is per-container ephemeral — not a shared volume — so the worker's open(tmp_path) raised FileNotFoundError. The task caught it, marked the file state=error, and completed the batch (the §2.4 complete-with-error split) → client shows complete/100 and a red row.

Not Mavic-4-Pro-specific — it was the first async upload to cross the API→worker container boundary at all. The DJI v13+ AES decryption dependency is downstream; the file never reached the parser. Secondary latent defect: the backend's /tmp spool leaked forever (the worker's unlink ran in the wrong container).

Why tests missed it: _run_task created a real local temp file in the same process — the cross-container boundary was never exercised.

Fix

The original bytes are already persisted at spool time to the hash store /data/uploads/flight_logs/{hash}{ext} on the shared app_data:/data volume both containers mount.

  • Worker resolves the file via flight_library._get_stored_file_path(file_hash); falls back to tmp_path only when it exists; clear diagnosable error if neither exists (never a bare ENOENT). The canonical stored original is never unlinked.
  • Route spooled.close() immediately after enqueue — releases the redundant /tmp spool instead of leaking it. tmp_path still passed for signature stability + same-container fallback.
  • Regression tests: test_task_reads_shared_store_when_tmp_spool_absent + test_task_errors_clearly_when_artifact_missing_everywhere; harness gained tmp_exists / use_shared_store knobs modeling the real file topology.

Verification

  • tests/test_device_upload_async.py: 12 passed (10 original + 2 new). New tests verified RED against the unpatched worker first (one reproduces the exact prod ENOENT string).
  • Full backend suite: 433 passed, 3 skipped.

Failover & Resilience Guard

Reads from the persistent app_data volume instead of ephemeral /tmp — strictly more recreation-safe. No change to PostgreSQL replication, port bindings, the blue-green swap, or the failover engine. ✅

Client

DroneOpsSync needs no change — it behaved correctly end to end.

Docs

CHANGELOG (2026-06-24), PROGRESS, ADR-0023 §6 amendment. Version bumped 2.72.0 → 2.72.1 (README, main.py, package.json, AppShell.tsx).

Follow-up (non-blocking, tracked in PROGRESS)

  • Consider hard-failing the async spool (no 202) if the shared-store write fails, so a 202 is never returned for a file the worker can't read.
  • Once live on BOS-HQ: verify end-to-end on the controller with a real M4P upload (the ADR-0023 §5 operator gate never reached because the first real upload hit this bug), then confirm the DJI dji_api_key decryption path now that files actually reach the parser.

🤖 Generated with Claude Code

…ontainer /tmp — v2.72.1

The shipped ADR-0023 async device-upload path failed for every real upload.
The API container spooled the upload to its private /tmp and handed the path
to parse_device_flight_task; the Celery worker runs in a SEPARATE container
and cannot see that /tmp, so open(tmp_path) raised FileNotFoundError. The task
caught it, marked the file state=error, and completed the batch — so the client
saw 202 + poll complete/100 then "[Errno 2] No such file or directory:
'/tmp/flight_upload_*'". Field-reported on a DJI Mavic 4 Pro log 2026-06-24;
not M4P-specific — it was the first async upload to cross the container boundary.

Fix:
- Worker resolves the file from the SHARED hash store
  (/data/uploads/flight_logs/{hash}, on the app_data:/data volume both
  containers mount) via _get_stored_file_path(file_hash), where the original
  bytes were already persisted at spool time. Falls back to tmp_path only when
  present; clear diagnosable error if neither exists (never a bare ENOENT). The
  canonical stored original is never unlinked.
- Route closes the redundant /tmp spool immediately after enqueue (it was also
  leaking on the backend, since the worker's unlink ran in the wrong container).
- Two regression tests reproduce the cross-container topology the old hermetic
  harness mocked away. Full backend suite green (433 passed, 3 skipped).

Reads from the persistent app_data volume instead of ephemeral /tmp — strictly
more recreation-safe. No DB/replication/blue-green/failover impact. DroneOpsSync
client needs no change. Docs: CHANGELOG 2026-06-24, PROGRESS, ADR-0023 §6.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@BigBill1418 BigBill1418 merged commit 1b9f2ab into main Jun 25, 2026
0 of 2 checks passed
@BigBill1418 BigBill1418 deleted the fix/async-device-upload-cross-container-tmp branch June 25, 2026 05:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant