docs: document two-layer dedup and google_photos_album_name recommendation #19
@@ -31,9 +31,9 @@ GCS staging bucket ← gs://your-bucket/
       ▼
 Cloud Run Job: dmaf-scan        ← Docker image from Cloud Build
   │  scans each file, face recognition against known_people/
-  │  dedup via Firestore (never re-processes a file)
+  │  two-layer dedup via Firestore (path + content SHA-256)
       ▼
-Google Photos                   ← matched faces only, organised into album
+Google Photos                   ← matched faces only, organised into named album
 ```

 **Key constraint**: OpenClaw's self-chat protection means your OWN sent photos never reach
@@ -176,6 +176,14 @@ When a GCS file is downloaded to `/tmp/dmaf_gcs_xxxx.jpg`, the dedup key must be
 original `gs://bucket/file.jpg`, not the local path. Firestore docs are keyed by
 `sha256(gcs_uri)[:32]`. Using the temp path creates a separate doc → mark_uploaded 404.
+
+**1b. Two-layer dedup: path first, then content**
+`seen(path)` is checked before downloading (cheap). After downloading, `seen_by_sha256(hash)`
+catches the same photo arriving via two different GCS paths (e.g. forwarded across groups).
+Both `Database` (SQLite) and `FirestoreDatabase` implement `seen_by_sha256`. The content
+check happens in `_process_image_file` and `_process_video_file` before face recognition runs.
+Note: WA strips all EXIF on iOS — content SHA-256 works because WA compresses once and the
+same compressed bytes are served to all recipients.
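A minimal sketch of the check order this hunk documents, using hypothetical helper names modelled on the `seen` / `seen_by_sha256` identifiers from the diff; the in-memory store stands in for the Firestore/SQLite backends:

```python
import hashlib


class InMemoryDedupStore:
    """Stand-in for the Database / FirestoreDatabase interface named above."""

    def __init__(self):
        self._paths = set()
        self._hashes = set()

    def seen(self, path):              # layer 1: path dedup
        return path in self._paths

    def seen_by_sha256(self, digest):  # layer 2: content dedup
        return digest in self._hashes

    def mark_seen(self, path, digest):
        self._paths.add(path)
        self._hashes.add(digest)


def doc_id(gcs_uri):
    # Firestore docs are keyed by sha256(gcs_uri)[:32], per the note above.
    return hashlib.sha256(gcs_uri.encode()).hexdigest()[:32]


def should_process(db, gcs_uri, download):
    """Return file bytes if the file is new, else None (skip)."""
    if db.seen(gcs_uri):               # cheap path check, before any download
        return None
    data = download(gcs_uri)
    digest = hashlib.sha256(data).hexdigest()
    if db.seen_by_sha256(digest):      # same bytes under a different path
        db.mark_seen(gcs_uri, digest)  # remember the new path as well
        return None
    db.mark_seen(gcs_uri, digest)
    return data
```

With this ordering, the same WA-compressed photo forwarded to a second group is still downloaded once more (its path is new) but is skipped before face recognition runs.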
 **2. `mark_uploaded()` uses `set(merge=True)`, not `update()`**
 `update()` raises 404 if the doc doesn't exist. `set(merge=True)` is idempotent.
 This was a real bug — don't revert it.
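The difference can be illustrated with a toy document object (this is an illustration of the semantics only, not the real `google-cloud-firestore` client):

```python
class FakeDocRef:
    """Toy model of update() vs set(merge=True) as described above."""

    def __init__(self):
        self._data = None  # None means "document does not exist"

    def update(self, fields):
        # Like Firestore's update(): fails when the document is missing.
        if self._data is None:
            raise LookupError("404: no document to update")
        self._data.update(fields)
        return self._data

    def set(self, fields, merge=False):
        # Like set(merge=True): create-or-merge, so retries are idempotent.
        if self._data is None or not merge:
            self._data = dict(fields)
        else:
            self._data.update(fields)
        return self._data
```

Because `set(merge=True)` creates the document when it is missing, calling `mark_uploaded()` for a file whose dedup doc was never written (or was keyed differently) succeeds instead of raising.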
@@ -264,6 +272,8 @@ Tests live in `tests/test_mcp_server.py` — all tools mocked via `patch("subpro
 |----------|-----------|
 | GCS as first-class watch source | Pipeline is cloud-native; local dir support for dev only |
 | Firestore for dedup (cloud) | Survives container restarts; no SQLite in Cloud Run |
+| Two-layer dedup (path + SHA-256) | Path dedup is O(1) and catches restarts; content SHA-256 catches the same photo forwarded across multiple WA groups (same WA compression = same bytes) |
+| `google_photos_album_name` recommended | Native iOS backup + DMAF would both upload the same photo (WA strips EXIF so bytes differ); named album keeps DMAF uploads visually separated |
 | Cloud Run Job, not Service | Batch workload — runs, exits, scales to zero |
 | `set(merge=True)` for `mark_uploaded` | `update()` raises 404 on missing doc; `set+merge` is idempotent |
 | `iter_frames` generator + early exit | Large videos; stop decoding after first match |
@@ -110,7 +110,7 @@ Your agent will walk through the full setup: GCP project, service account, GCS b
 ### ☁️ Google Photos Integration
 - **Automatic uploads**: Photos and full video clips backed up seamlessly
-- **Album organization**: Optionally organize into a named album
+- **Album organization**: Upload to a named album (recommended — keeps face-matched photos separate from your native camera-roll backup)
 - **OAuth2 authentication**: Secure, offline token-based access
 - **Cloud staging support**: Delete source files after upload (ideal for GCS pipelines)
@@ -121,7 +121,7 @@ Your agent will walk through the full setup: GCP project, service account, GCS b
 ### ⚡ Efficient & Token-Free
 - **Zero LLM tokens after setup**: The entire pipeline — sync cron, face recognition, upload — runs without any AI calls
-- **SHA256 deduplication**: Never process the same file twice — survives container restarts via Firestore
+- **Two-layer deduplication**: Path-based dedup (fast Firestore lookup) + content SHA-256 dedup — the same photo arriving via multiple WhatsApp groups is only processed and uploaded once; survives container restarts
 - **Video early exit**: Sampling stops the moment a known face is found — no wasted compute
 - **Intelligent retry logic**: Exponential backoff for network resilience
 - **Scale-to-zero**: Cloud Run Job — no cost when idle, GCP free tier eligible
@@ -226,15 +226,15 @@ graph LR
 D -->|No match| F[⏭️ Skip]
 E --> G[🗄️ Firestore Dedup]
 F --> G
-G -->|SHA256| H[🚫 Never Reprocess]
+G -->|path + content SHA256| H[🚫 Never Reprocess]
 ```

 1. **Capture** — OpenClaw intercepts WhatsApp group media and saves it locally; a system cron (zero LLM tokens) uploads it to GCS every 30 min
 2. **Schedule** — Cloud Scheduler triggers the Cloud Run job hourly — no agent, no AI cost
 3. **Load** — Reference photos downloaded from GCS bucket at job startup
 4. **Detect** — Each file is scanned: images once, videos sampled at 1–2fps with early exit on first match
 5. **Upload** — Matched photos and full video clips are uploaded to Google Photos
-6. **Deduplicate** — SHA256 hash stored in Firestore; the same file is never processed twice
+6. **Deduplicate** — Two-layer check: (1) path-based Firestore lookup catches already-seen GCS paths; (2) content SHA-256 check catches the same photo arriving via multiple groups or sync paths — face recognition is skipped entirely for known content
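Step 4's early exit can be sketched with plain generators. This is a simplified, hypothetical take on the `iter_frames` idea; the real implementation decodes video frames rather than consuming an arbitrary iterable:

```python
def iter_frames(frames, native_fps=30, sample_fps=2):
    """Yield roughly sample_fps frames per second from a decoded-frame stream."""
    step = max(1, native_fps // sample_fps)
    for i, frame in enumerate(frames):
        if i % step == 0:
            yield frame


def has_known_face(frames, is_match, native_fps=30, sample_fps=2):
    # any() short-circuits: frame consumption stops at the first match,
    # so a face appearing early in a long video costs almost nothing.
    return any(is_match(f) for f in iter_frames(frames, native_fps, sample_fps))
```

Because both the sampler and `any()` are lazy, a no-match video is still only decoded once, while a matching video stops decoding at the matching frame.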

 ---
@@ -252,7 +252,7 @@ recognition:
   tolerance: 0.5            # 0.0 (strictest) → 1.0 (loosest)
   min_face_size_pixels: 20

-google_photos_album_name: "Family — Auto WhatsApp"
+google_photos_album_name: "Family Faces"  # recommended: keeps DMAF uploads separate from camera-roll backup

 alerting:
   enabled: true
@@ -187,9 +187,17 @@ recognition:

 # ── Google Photos ───────────────────────────────────────────────────────────
 google_photos_token_secret: "dmaf-photos-token"   # Secret Manager secret name
-google_photos_album_name: "DMAF Auto-Import"      # Leave empty to skip album
+google_photos_album_name: "Family Faces"          # Recommended: keeps DMAF uploads separate
+                                                  # from native camera-roll backup in Google Photos.
+                                                  # Without this, the same photo may appear twice:
+                                                  # once from iOS backup (original) and once from
+                                                  # DMAF (WA-compressed). Set null to upload to root.
+
+# ── Deduplication ──────────────────────────────────────────────────────────
+# DMAF uses two-layer dedup to avoid processing the same content twice:
+#   1. Path-based: Firestore doc per GCS path (fast, O(1) lookup)
+#   2. Content-based: SHA-256 of file bytes — catches the same photo forwarded
+#      across multiple WhatsApp groups (same compression = same hash)
 dedup:
   backend: firestore        # firestore (cloud) | sqlite (local dev)
   firestore_project: dmaf-production
> **Review comment:** This architecture diagram claims "two-layer dedup via Firestore (path + content SHA-256)", but the actual implementation only performs path-based deduplication. The content SHA-256 is stored but never used as a deduplication check.