docs: document two-layer dedup and google_photos_album_name recommendation #19
@@ -31,9 +31,9 @@ GCS staging bucket ← gs://your-bucket/
       ▼
 Cloud Run Job: dmaf-scan        ← Docker image from Cloud Build
   │  scans each file, face recognition against known_people/
-  │  dedup via Firestore (never re-processes a file)
+  │  two-layer dedup via Firestore (path + content SHA-256)
       ▼
-Google Photos                   ← matched faces only, organised into album
+Google Photos                   ← matched faces only, organised into named album
 ```

 **Key constraint**: OpenClaw's self-chat protection means your OWN sent photos never reach
@@ -176,6 +176,14 @@ When a GCS file is downloaded to `/tmp/dmaf_gcs_xxxx.jpg`, the dedup key must be
 original `gs://bucket/file.jpg`, not the local path. Firestore docs are keyed by
 `sha256(gcs_uri)[:32]`. Using the temp path creates a separate doc → mark_uploaded 404.
+
+**1b. Two-layer dedup: path first, then content**
+`seen(path)` is checked before downloading (cheap). After downloading, `seen_by_sha256(hash)`
+catches the same photo arriving via two different GCS paths (e.g. forwarded across groups).
+Both `Database` (SQLite) and `FirestoreDatabase` implement `seen_by_sha256`. The content
+check happens in `_process_image_file` and `_process_video_file` before face recognition runs.
+Note: WA strips all EXIF on iOS — content SHA-256 works because WA compresses once and the
+same compressed bytes are served to all recipients.
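A minimal sketch of the check order this hunk documents, using hypothetical helper names modelled on the `seen` / `seen_by_sha256` identifiers from the diff; the in-memory store stands in for the Firestore/SQLite backends:

```python
import hashlib


class InMemoryDedupStore:
    """Stand-in for the Database / FirestoreDatabase interface named above."""

    def __init__(self):
        self._paths = set()
        self._hashes = set()

    def seen(self, path):              # layer 1: path dedup
        return path in self._paths

    def seen_by_sha256(self, digest):  # layer 2: content dedup
        return digest in self._hashes

    def mark_seen(self, path, digest):
        self._paths.add(path)
        self._hashes.add(digest)


def doc_id(gcs_uri):
    # Firestore docs are keyed by sha256(gcs_uri)[:32], per the note above.
    return hashlib.sha256(gcs_uri.encode()).hexdigest()[:32]


def should_process(db, gcs_uri, download):
    """Return file bytes if the file is new, else None (skip)."""
    if db.seen(gcs_uri):               # cheap path check, before any download
        return None
    data = download(gcs_uri)
    digest = hashlib.sha256(data).hexdigest()
    if db.seen_by_sha256(digest):      # same bytes under a different path
        db.mark_seen(gcs_uri, digest)  # remember the new path as well
        return None
    db.mark_seen(gcs_uri, digest)
    return data
```

With this ordering, the same WA-compressed photo forwarded to a second group is still downloaded once more (its path is new) but is skipped before face recognition runs.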
 **2. `mark_uploaded()` uses `set(merge=True)`, not `update()`**
 `update()` raises 404 if the doc doesn't exist. `set(merge=True)` is idempotent.
 This was a real bug — don't revert it.
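The difference can be illustrated with a toy document object (this is an illustration of the semantics only, not the real `google-cloud-firestore` client):

```python
class FakeDocRef:
    """Toy model of update() vs set(merge=True) as described above."""

    def __init__(self):
        self._data = None  # None means "document does not exist"

    def update(self, fields):
        # Like Firestore's update(): fails when the document is missing.
        if self._data is None:
            raise LookupError("404: no document to update")
        self._data.update(fields)
        return self._data

    def set(self, fields, merge=False):
        # Like set(merge=True): create-or-merge, so retries are idempotent.
        if self._data is None or not merge:
            self._data = dict(fields)
        else:
            self._data.update(fields)
        return self._data
```

Because `set(merge=True)` creates the document when it is missing, calling `mark_uploaded()` for a file whose dedup doc was never written (or was keyed differently) succeeds instead of raising.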
@@ -264,6 +272,8 @@ Tests live in `tests/test_mcp_server.py` — all tools mocked via `patch("subpro
 |----------|-----------|
 | GCS as first-class watch source | Pipeline is cloud-native; local dir support for dev only |
 | Firestore for dedup (cloud) | Survives container restarts; no SQLite in Cloud Run |
+| Two-layer dedup (path + SHA-256) | Path dedup is O(1) and catches restarts; content SHA-256 catches the same photo forwarded across multiple WA groups (same WA compression = same bytes) |
+| `google_photos_album_name` recommended | Native iOS backup + DMAF would both upload the same photo (WA strips EXIF so bytes differ); named album keeps DMAF uploads visually separated |
 | Cloud Run Job, not Service | Batch workload — runs, exits, scales to zero |
 | `set(merge=True)` for `mark_uploaded` | `update()` raises 404 on missing doc; `set+merge` is idempotent |
 | `iter_frames` generator + early exit | Large videos; stop decoding after first match |
@@ -110,7 +110,7 @@ Your agent will walk through the full setup: GCP project, service account, GCS b
 ### ☁️ Google Photos Integration
 - **Automatic uploads**: Photos and full video clips backed up seamlessly
-- **Album organization**: Optionally organize into a named album
+- **Album organization**: Upload to a named album (recommended — keeps face-matched photos separate from your native camera-roll backup)
 - **OAuth2 authentication**: Secure, offline token-based access
 - **Cloud staging support**: Delete source files after upload (ideal for GCS pipelines)
@@ -121,7 +121,7 @@ Your agent will walk through the full setup: GCP project, service account, GCS b
 ### ⚡ Efficient & Token-Free
 - **Zero LLM tokens after setup**: The entire pipeline — sync cron, face recognition, upload — runs without any AI calls
-- **SHA256 deduplication**: Never process the same file twice — survives container restarts via Firestore
+- **Two-layer deduplication**: Path-based dedup (fast Firestore lookup) + content SHA-256 dedup — the same photo arriving via multiple WhatsApp groups is only processed and uploaded once; survives container restarts
 - **Video early exit**: Sampling stops the moment a known face is found — no wasted compute
 - **Intelligent retry logic**: Exponential backoff for network resilience
 - **Scale-to-zero**: Cloud Run Job — no cost when idle, GCP free tier eligible
@@ -226,15 +226,15 @@ graph LR
 D -->|No match| F[⏭️ Skip]
 E --> G[🗄️ Firestore Dedup]
 F --> G
-G -->|SHA256| H[🚫 Never Reprocess]
+G -->|path + content SHA256| H[🚫 Never Reprocess]
 ```

 1. **Capture** — OpenClaw intercepts WhatsApp group media and saves it locally; a system cron (zero LLM tokens) uploads it to GCS every 30 min
 2. **Schedule** — Cloud Scheduler triggers the Cloud Run job hourly — no agent, no AI cost
 3. **Load** — Reference photos downloaded from GCS bucket at job startup
 4. **Detect** — Each file is scanned: images once, videos sampled at 1–2fps with early exit on first match
 5. **Upload** — Matched photos and full video clips are uploaded to Google Photos
-6. **Deduplicate** — SHA256 hash stored in Firestore; the same file is never processed twice
+6. **Deduplicate** — Two-layer check: (1) path-based Firestore lookup catches already-seen GCS paths; (2) content SHA-256 check catches the same photo arriving via multiple groups or sync paths — face recognition is skipped entirely for known content
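Step 4's early exit can be sketched with plain generators. This is a simplified, hypothetical take on the `iter_frames` idea; the real implementation decodes video frames rather than consuming an arbitrary iterable:

```python
def iter_frames(frames, native_fps=30, sample_fps=2):
    """Yield roughly sample_fps frames per second from a decoded-frame stream."""
    step = max(1, native_fps // sample_fps)
    for i, frame in enumerate(frames):
        if i % step == 0:
            yield frame


def has_known_face(frames, is_match, native_fps=30, sample_fps=2):
    # any() short-circuits: frame consumption stops at the first match,
    # so a face appearing early in a long video costs almost nothing.
    return any(is_match(f) for f in iter_frames(frames, native_fps, sample_fps))
```

Because both the sampler and `any()` are lazy, a no-match video is still only decoded once, while a matching video stops decoding at the matching frame.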

 ---
@@ -252,7 +252,7 @@ recognition:
   tolerance: 0.5            # 0.0 (strictest) → 1.0 (loosest)
   min_face_size_pixels: 20

-google_photos_album_name: "Family — Auto WhatsApp"
+google_photos_album_name: "Family Faces"  # recommended: keeps DMAF uploads separate from camera-roll backup

 alerting:
   enabled: true
@@ -187,9 +187,17 @@ recognition:

 # ── Google Photos ───────────────────────────────────────────────────────────
 google_photos_token_secret: "dmaf-photos-token"   # Secret Manager secret name
-google_photos_album_name: "DMAF Auto-Import"      # Leave empty to skip album
+google_photos_album_name: "Family Faces"          # Recommended: keeps DMAF uploads separate
+                                                  # from native camera-roll backup in Google Photos.
+                                                  # Without this, the same photo may appear twice:
+                                                  # once from iOS backup (original) and once from
+                                                  # DMAF (WA-compressed). Set null to upload to root.
+
+# ── Deduplication ──────────────────────────────────────────────────────────
+# DMAF uses two-layer dedup to avoid processing the same content twice:
+#   1. Path-based: Firestore doc per GCS path (fast, O(1) lookup)
+#   2. Content-based: SHA-256 of file bytes — catches the same photo forwarded
+#      across multiple WhatsApp groups (same compression = same hash)
 dedup:
   backend: firestore        # firestore (cloud) | sqlite (local dev)
   firestore_project: dmaf-production
> **Review comment:** This architecture diagram claims "two-layer dedup via Firestore (path + content SHA-256)", but the actual implementation only performs path-based deduplication. The content SHA-256 is stored but never used as a deduplication check.