diff --git a/AGENTS.md b/AGENTS.md
index d8549be..780738c 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -31,9 +31,9 @@ GCS staging bucket ← gs://your-bucket/
  ▼
 Cloud Run Job: dmaf-scan ← Docker image from Cloud Build
  │ scans each file, face recognition against known_people/
- │ dedup via Firestore (never re-processes a file)
+ │ two-layer dedup via Firestore (path + content SHA-256)
  ▼
-Google Photos ← matched faces only, organised into album
+Google Photos ← matched faces only, organised into named album
 ```
 
 **Key constraint**: OpenClaw's self-chat protection means your OWN sent photos never reach
@@ -176,6 +176,14 @@ When a GCS file is downloaded to `/tmp/dmaf_gcs_xxxx.jpg`, the dedup key must be
 original `gs://bucket/file.jpg`, not the local path. Firestore docs are keyed by
 `sha256(gcs_uri)[:32]`. Using the temp path creates a separate doc → mark_uploaded 404.
 
+**1b. Two-layer dedup: path first, then content**
+`seen(path)` is checked before downloading (cheap). After downloading, `seen_by_sha256(hash)`
+catches the same photo arriving via two different GCS paths (e.g. forwarded across groups).
+Both `Database` (SQLite) and `FirestoreDatabase` implement `seen_by_sha256`. The content
+check happens in `_process_image_file` and `_process_video_file` before face recognition runs.
+Note: WA strips all EXIF on iOS — content SHA-256 works because WA compresses once and the
+same compressed bytes are served to all recipients.
+
 **2. `mark_uploaded()` uses `set(merge=True)`, not `update()`**
 `update()` raises 404 if the doc doesn't exist. `set(merge=True)` is idempotent.
 This was a real bug — don't revert it.
@@ -264,6 +272,8 @@ Tests live in `tests/test_mcp_server.py` — all tools mocked via `patch("subpro
 |----------|-----------|
 | GCS as first-class watch source | Pipeline is cloud-native; local dir support for dev only |
 | Firestore for dedup (cloud) | Survives container restarts; no SQLite in Cloud Run |
+| Two-layer dedup (path + SHA-256) | Path dedup is O(1) and catches restarts; content SHA-256 catches the same photo forwarded across multiple WA groups (same WA compression = same bytes) |
+| `google_photos_album_name` recommended | Native iOS backup + DMAF would both upload the same photo (WA strips EXIF so bytes differ); named album keeps DMAF uploads visually separated |
 | Cloud Run Job, not Service | Batch workload — runs, exits, scales to zero |
 | `set(merge=True)` for `mark_uploaded` | `update()` raises 404 on missing doc; `set+merge` is idempotent |
 | `iter_frames` generator + early exit | Large videos; stop decoding after first match |
diff --git a/README.md b/README.md
index 48b5632..e2c4b4c 100644
--- a/README.md
+++ b/README.md
@@ -110,7 +110,7 @@ Your agent will walk through the full setup: GCP project, service account, GCS b
 ### ☁️ Google Photos Integration
 
 - **Automatic uploads**: Photos and full video clips backed up seamlessly
-- **Album organization**: Optionally organize into a named album
+- **Album organization**: Upload to a named album (recommended — keeps face-matched photos separate from your native camera-roll backup)
 - **OAuth2 authentication**: Secure, offline token-based access
 - **Cloud staging support**: Delete source files after upload (ideal for GCS pipelines)
 
@@ -121,7 +121,7 @@ Your agent will walk through the full setup: GCP project, service account, GCS b
 ### ⚡ Efficient & Token-Free
 
 - **Zero LLM tokens after setup**: The entire pipeline — sync cron, face recognition, upload — runs without any AI calls
-- **SHA256 deduplication**: Never process the same file twice — survives container restarts via Firestore
+- **Two-layer deduplication**: Path-based dedup (fast Firestore lookup) + content SHA-256 dedup — the same photo arriving via multiple WhatsApp groups is only processed and uploaded once; survives container restarts
 - **Video early exit**: Sampling stops the moment a known face is found — no wasted compute
 - **Intelligent retry logic**: Exponential backoff for network resilience
 - **Scale-to-zero**: Cloud Run Job — no cost when idle, GCP free tier eligible
@@ -226,7 +226,7 @@ graph LR
     D -->|No match| F[⏭️ Skip]
     E --> G[🗄️ Firestore Dedup]
     F --> G
-    G -->|SHA256| H[🚫 Never Reprocess]
+    G -->|path + content SHA256| H[🚫 Never Reprocess]
 ```
 
 1. **Capture** — OpenClaw intercepts WhatsApp group media and saves it locally; a system cron (zero LLM tokens) uploads it to GCS every 30 min
@@ -234,7 +234,7 @@ graph LR
 3. **Load** — Reference photos downloaded from GCS bucket at job startup
 4. **Detect** — Each file is scanned: images once, videos sampled at 1–2fps with early exit on first match
 5. **Upload** — Matched photos and full video clips are uploaded to Google Photos
-6. **Deduplicate** — SHA256 hash stored in Firestore; the same file is never processed twice
+6. **Deduplicate** — Two-layer check: (1) path-based Firestore lookup catches already-seen GCS paths; (2) content SHA-256 check catches the same photo arriving via multiple groups or sync paths — face recognition is skipped entirely for known content
 
 ---
 
@@ -252,7 +252,7 @@ recognition:
   tolerance: 0.5 # 0.0 (strictest) → 1.0 (loosest)
   min_face_size_pixels: 20
 
-google_photos_album_name: "Family — Auto WhatsApp"
+google_photos_album_name: "Family Faces" # recommended: keeps DMAF uploads separate from camera-roll backup
 
 alerting:
   enabled: true
diff --git a/deploy/setup-secrets.md b/deploy/setup-secrets.md
index 2e1460b..c1ac17a 100644
--- a/deploy/setup-secrets.md
+++ b/deploy/setup-secrets.md
@@ -187,9 +187,17 @@ recognition:
 
 # ── Google Photos ───────────────────────────────────────────────────────────
 google_photos_token_secret: "dmaf-photos-token" # Secret Manager secret name
-google_photos_album_name: "DMAF Auto-Import" # Leave empty to skip album
+google_photos_album_name: "Family Faces"        # Recommended: keeps DMAF uploads separate
+                                                # from native camera-roll backup in Google Photos.
+                                                # Without this, the same photo may appear twice:
+                                                # once from iOS backup (original) and once from
+                                                # DMAF (WA-compressed). Set null to upload to root.
 
 # ── Deduplication ──────────────────────────────────────────────────────────
+# DMAF uses two-layer dedup to avoid processing the same content twice:
+#   1. Path-based: Firestore doc per GCS path (fast, O(1) lookup)
+#   2. Content-based: SHA-256 of file bytes — catches the same photo forwarded
+#      across multiple WhatsApp groups (same compression = same hash)
 dedup:
   backend: firestore # firestore (cloud) | sqlite (local dev)
   firestore_project: dmaf-production
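
Reviewer note: the two-layer dedup described in this diff (path check before download, content SHA-256 check after) can be sketched roughly as below. This is a minimal illustration, not the project's actual code. Only the method names `seen` and `seen_by_sha256` and the `sha256(gcs_uri)[:32]` key format come from the diff; `InMemoryDedup`, `add`, `should_process`, and the `download` callback are hypothetical names introduced for the example, and the real `Database` / `FirestoreDatabase` classes may differ.

```python
import hashlib
from typing import Callable


class InMemoryDedup:
    """Hypothetical in-memory stand-in for Database / FirestoreDatabase."""

    def __init__(self) -> None:
        self._by_path_key: dict[str, str] = {}  # path key -> content hash
        self._content_hashes: set[str] = set()

    @staticmethod
    def path_key(path: str) -> str:
        # The diff states Firestore docs are keyed by sha256(gcs_uri)[:32].
        return hashlib.sha256(path.encode("utf-8")).hexdigest()[:32]

    def seen(self, path: str) -> bool:
        # Layer 1: cheap lookup keyed on the original path, no download needed.
        return self.path_key(path) in self._by_path_key

    def seen_by_sha256(self, content_hash: str) -> bool:
        # Layer 2: lookup keyed on the SHA-256 of the file bytes.
        return content_hash in self._content_hashes

    def add(self, path: str, content_hash: str) -> None:
        # Hypothetical helper: record both the path and the content hash.
        self._by_path_key[self.path_key(path)] = content_hash
        self._content_hashes.add(content_hash)


def should_process(
    db: InMemoryDedup, path: str, download: Callable[[str], bytes]
) -> bool:
    """Return True only if face recognition should run for this file."""
    if db.seen(path):
        return False  # known path: skip before downloading anything
    data = download(path)
    content_hash = hashlib.sha256(data).hexdigest()
    duplicate = db.seen_by_sha256(content_hash)
    db.add(path, content_hash)  # remember the path even for duplicates
    return not duplicate
```

In this sketch a second path carrying identical bytes is downloaded once, recognised as a duplicate by its hash, and recorded so that later runs skip it at the cheaper path check.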