fix(device-upload): async worker reads shared hash store, not cross-container /tmp — v2.72.1#31
Merged
The shipped ADR-0023 async device-upload path failed for every real upload.
The API container spooled the upload to its private /tmp and handed the path
to parse_device_flight_task; the Celery worker runs in a SEPARATE container
and cannot see that /tmp, so open(tmp_path) raised FileNotFoundError. The task
caught it, marked the file state=error, and completed the batch — so the client
saw 202 + poll complete/100 then "[Errno 2] No such file or directory:
'/tmp/flight_upload_*'". Field-reported on a DJI Mavic 4 Pro log 2026-06-24;
not M4P-specific — it was the first async upload to cross the container boundary.
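The failure mode described above can be reproduced without Docker. The following is a minimal sketch, assuming each container's private /tmp is modeled as a separate temporary directory; `spool_upload`, `worker_open`, and `reproduce` are hypothetical names for illustration, not the project's functions.

```python
import os
import tempfile


def spool_upload(api_tmp: str, payload: bytes) -> str:
    """API side: spool the upload into the API container's private tmp dir."""
    fd, path = tempfile.mkstemp(prefix="flight_upload_", dir=api_tmp)
    with os.fdopen(fd, "wb") as fh:
        fh.write(payload)
    return path


def worker_open(worker_tmp: str, api_path: str) -> bytes:
    """Worker side: the same file name, looked up in the worker's own tmp dir."""
    # The worker only sees its own filesystem, so the path handed over by the
    # API resolves against a directory that never received the file.
    local_view = os.path.join(worker_tmp, os.path.basename(api_path))
    with open(local_view, "rb") as fh:
        return fh.read()


def reproduce() -> str:
    """Return the name of the exception the worker hits, or 'ok' if none."""
    with tempfile.TemporaryDirectory() as api_tmp, \
            tempfile.TemporaryDirectory() as worker_tmp:
        path = spool_upload(api_tmp, b"flight log bytes")
        try:
            worker_open(worker_tmp, path)
        except FileNotFoundError as exc:
            return type(exc).__name__
    return "ok"
```

Passing a path string across a process boundary only works when both sides resolve it against the same filesystem, which is the assumption the original task made implicitly.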
Fix:
- Worker resolves the file from the SHARED hash store
(/data/uploads/flight_logs/{hash}, on the app_data:/data volume both
containers mount) via _get_stored_file_path(file_hash), where the original
bytes were already persisted at spool time. Falls back to tmp_path only when
present; clear diagnosable error if neither exists (never a bare ENOENT). The
canonical stored original is never unlinked.
- Route closes the redundant /tmp spool immediately after enqueue (it was also
leaking on the backend, since the worker's unlink ran in the wrong container).
- Two regression tests reproduce the cross-container topology the old hermetic
harness mocked away. Full backend suite green (433 passed, 3 skipped).
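The resolution order in the first bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped worker code: `stored_path` stands in for whatever the hash-store lookup returns, and `ArtifactMissingError`, `resolve_upload_path`, and `cleanup_after_parse` are hypothetical names.

```python
import os
from typing import Optional


class ArtifactMissingError(RuntimeError):
    """Hypothetical error: neither the shared store nor the tmp spool has the file."""


def resolve_upload_path(stored_path: Optional[str], tmp_path: Optional[str]) -> str:
    """Pick the file to parse: shared store first, tmp spool only if it exists."""
    if stored_path and os.path.isfile(stored_path):
        return stored_path
    if tmp_path and os.path.isfile(tmp_path):
        return tmp_path
    # Fail with a diagnosable message instead of letting open() raise a bare ENOENT.
    raise ArtifactMissingError(
        f"upload artifact not found in shared store ({stored_path!r}) "
        f"or tmp spool ({tmp_path!r}); check that both containers mount the same volume"
    )


def cleanup_after_parse(resolved: str, tmp_path: Optional[str]) -> None:
    """Remove only the disposable tmp spool; the stored original is never unlinked."""
    if tmp_path and resolved == tmp_path and os.path.exists(tmp_path):
        os.unlink(tmp_path)
```

Keeping the cleanup conditional on `resolved == tmp_path` is one way to guarantee that the canonical copy in the hash store survives every task run.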
Reads from the persistent app_data volume instead of ephemeral /tmp — strictly
more recreation-safe. No DB/replication/blue-green/failover impact. DroneOpsSync
client needs no change. Docs: CHANGELOG 2026-06-24, PROGRESS, ADR-0023 §6.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
The shipped ADR-0023 async device-upload path failed for every real upload. Field-reported on a DJI Mavic 4 Pro flight log (2026-06-24 21:43 PDT): the DroneOpsSync client saw the full path succeed (SAF scan OK → `POST .../device-upload/async` `202` → poll `complete / done / 100`), then a per-file error: `[Errno 2] No such file or directory: '/tmp/flight_upload_*'`

Root cause
The async route spooled the upload to a local `/tmp` file (`_spool_upload` → `tempfile.mkstemp(prefix="flight_upload_")`) and passed that path string to `parse_device_flight_task.delay(...)`. The Celery `worker` runs in a different container than the `backend` (separate services in `docker-compose.yml`). `/tmp` is per-container and ephemeral, not a shared volume, so the worker's `open(tmp_path)` raised `FileNotFoundError`. The task caught it, marked the file `state=error`, and completed the batch (the §2.4 complete-with-error split), so the client showed `complete/100` and a red row.

Not Mavic-4-Pro-specific: it was the first async upload to cross the API→worker container boundary at all. The DJI v13+ AES decryption dependency is downstream; the file never reached the parser.

Secondary latent defect: the backend's `/tmp` spool leaked forever (the worker's `unlink` ran in the wrong container).

Why tests missed it: `_run_task` created a real local temp file in the same process, so the cross-container boundary was never exercised.

Fix
The original bytes are already persisted at spool time to the hash store `/data/uploads/flight_logs/{hash}{ext}` on the shared `app_data:/data` volume both containers mount.

- Worker: resolves the file via `flight_library._get_stored_file_path(file_hash)`; falls back to `tmp_path` only when it exists; clear diagnosable error if neither exists (never a bare ENOENT). The canonical stored original is never unlinked.
- Route: `spooled.close()` immediately after enqueue, which releases the redundant `/tmp` spool instead of leaking it. `tmp_path` is still passed for signature stability and same-container fallback.
- Tests: `test_task_reads_shared_store_when_tmp_spool_absent` + `test_task_errors_clearly_when_artifact_missing_everywhere`; harness gained `tmp_exists` / `use_shared_store` knobs modeling the real file topology.

Verification
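As a rough illustration of the file topology the two new tests model, the sketch below builds a throwaway "shared store" and "tmp spool" and toggles each one on or off. It is not the project's harness: `run_task_sketch`, the directory layout, and the result dictionary are assumptions made for the example; only the knob names `tmp_exists` and `use_shared_store` come from the description above.

```python
import os
import tempfile


def run_task_sketch(tmp_exists: bool, use_shared_store: bool) -> dict:
    """Hypothetical stand-in for the harness: build a topology, then 'parse'."""
    payload = b"flight log bytes"
    with tempfile.TemporaryDirectory() as root:
        store_path = os.path.join(root, "store", "abc123.txt")
        tmp_path = os.path.join(root, "tmp", "flight_upload_x")
        os.makedirs(os.path.dirname(store_path))
        os.makedirs(os.path.dirname(tmp_path))
        # Each knob decides whether that location actually holds the bytes.
        if use_shared_store:
            with open(store_path, "wb") as fh:
                fh.write(payload)
        if tmp_exists:
            with open(tmp_path, "wb") as fh:
                fh.write(payload)
        # Shared store first, tmp spool second, explicit error otherwise.
        for candidate in (store_path, tmp_path):
            if os.path.isfile(candidate):
                with open(candidate, "rb") as fh:
                    return {"state": "done", "bytes": len(fh.read())}
        return {
            "state": "error",
            "error": "upload artifact missing from shared store and tmp spool",
        }


def test_reads_shared_store_when_tmp_spool_absent() -> None:
    # Cross-container case: the worker has no tmp spool, only the shared volume.
    result = run_task_sketch(tmp_exists=False, use_shared_store=True)
    assert result == {"state": "done", "bytes": 16}


def test_errors_clearly_when_artifact_missing_everywhere() -> None:
    result = run_task_sketch(tmp_exists=False, use_shared_store=False)
    assert result["state"] == "error"
    assert "No such file or directory" not in result["error"]
```

Setting `tmp_exists=False` is what makes the test resemble production: a harness that always creates the temp file in-process can never observe the boundary.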
`tests/test_device_upload_async.py`: 12 passed (10 original + 2 new). New tests verified RED against the unpatched worker first (one reproduces the exact prod ENOENT string).

Failover & Resilience Guard
Reads from the persistent `app_data` volume instead of ephemeral `/tmp`, which is strictly more recreation-safe. No change to PostgreSQL replication, port bindings, the blue-green swap, or the failover engine. ✅

Client
DroneOpsSync needs no change — it behaved correctly end to end.
Docs
CHANGELOG (2026-06-24), PROGRESS, ADR-0023 §6 amendment. Version bumped 2.72.0 → 2.72.1 (README, `main.py`, `package.json`, `AppShell.tsx`).

Follow-up (non-blocking, tracked in PROGRESS)
`dji_api_key` decryption path now that files actually reach the parser.

🤖 Generated with Claude Code