Merged
14 changes: 12 additions & 2 deletions AGENTS.md
@@ -31,9 +31,9 @@ GCS staging bucket ← gs://your-bucket/
Cloud Run Job: dmaf-scan ← Docker image from Cloud Build
│ scans each file, face recognition against known_people/
│ dedup via Firestore (never re-processes a file)
two-layer dedup via Firestore (path + content SHA-256)
Copilot AI Feb 26, 2026

This architecture diagram claims "two-layer dedup via Firestore (path + content SHA-256)" but the actual implementation only performs path-based deduplication. The content SHA-256 is stored but never used as a deduplication check.

Copilot uses AI. Check for mistakes.
Google Photos ← matched faces only, organised into album
Google Photos ← matched faces only, organised into named album
```

**Key constraint**: OpenClaw's self-chat protection means your OWN sent photos never reach
@@ -176,6 +176,14 @@ When a GCS file is downloaded to `/tmp/dmaf_gcs_xxxx.jpg`, the dedup key must be
original `gs://bucket/file.jpg`, not the local path. Firestore docs are keyed by
`sha256(gcs_uri)[:32]`. Using the temp path creates a separate doc → mark_uploaded 404.

**1b. Two-layer dedup: path first, then content**
`seen(path)` is checked before downloading (cheap). After downloading, `seen_by_sha256(hash)`
catches the same photo arriving via two different GCS paths (e.g. forwarded across groups).
Both `Database` (SQLite) and `FirestoreDatabase` implement `seen_by_sha256`. The content
check happens in `_process_image_file` and `_process_video_file` before face recognition runs.
Note: WA strips all EXIF on iOS — content SHA-256 works because WA compresses once and the
same compressed bytes are served to all recipients.
Comment on lines +179 to +185
Copilot AI Feb 26, 2026

This section documents a `seen_by_sha256` method and content-based deduplication that don't exist in the codebase. I verified that:

  1. No `seen_by_sha256` method exists in either the `Database` or `FirestoreDatabase` class
  2. The code only performs path-based deduplication via `seen(path)`
  3. SHA-256 hashes are computed and stored but never queried for deduplication
  4. No tests for content deduplication exist (`TestContentDedup` class not found)

The actual implementation only has single-layer (path-based) deduplication, not two-layer deduplication. Either this documentation describes a feature that wasn't actually implemented in PR #17, or PR #17 was never merged.
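
For comparison, the flow that the documentation describes could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's code: `InMemoryDedup`, `doc_id`, `should_process`, and the `fetch` callback are hypothetical names, the store is a plain in-memory stand-in for `Database` / `FirestoreDatabase`, and the key is assumed to be the first 32 hex characters of the SHA-256 digest of the original URI.

```python
import hashlib
from typing import Callable


def doc_id(gcs_uri: str) -> str:
    # Assumed key format: first 32 hex chars of SHA-256 over the original URI,
    # never over a local temp path.
    return hashlib.sha256(gcs_uri.encode("utf-8")).hexdigest()[:32]


class InMemoryDedup:
    """Hypothetical in-memory stand-in for the real dedup backends."""

    def __init__(self):
        self._paths = set()    # doc ids of paths already handled
        self._hashes = set()   # content hashes already handled

    def seen(self, path: str) -> bool:
        return doc_id(path) in self._paths

    def seen_by_sha256(self, content_hash: str) -> bool:
        return content_hash in self._hashes

    def add(self, path: str, content_hash: str) -> None:
        self._paths.add(doc_id(path))
        self._hashes.add(content_hash)


def should_process(db: InMemoryDedup, path: str, fetch: Callable[[], bytes]) -> bool:
    """Return True only when neither the path nor the bytes were seen before."""
    if db.seen(path):                  # layer 1: cheap check, no download needed
        return False
    data = fetch()                     # download happens only after layer 1 passes
    content_hash = hashlib.sha256(data).hexdigest()
    duplicate = db.seen_by_sha256(content_hash)   # layer 2: same bytes, other path
    db.add(path, content_hash)         # record the path so layer 1 catches it next run
    return not duplicate
```

With a store like this, the same bytes arriving under a second path would be skipped before face recognition, which is the behaviour the documentation claims and the current code lacks.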

**2. `mark_uploaded()` uses `set(merge=True)`, not `update()`**
`update()` raises 404 if the doc doesn't exist. `set(merge=True)` is idempotent.
This was a real bug — don't revert it.
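
The difference can be illustrated with a small stand-in. This is a hedged sketch, not the project's `mark_uploaded`: `FakeDocument` and the local `NotFound` class are hypothetical and only mimic the behaviour described above (`update()` fails on a missing document, `set(..., merge=True)` creates or merges), and the `uploaded` field name is an assumption.

```python
class NotFound(Exception):
    """Stand-in for the not-found error a real update() would raise."""


class FakeDocument:
    """Hypothetical in-memory document mimicking update() and set(merge=True)."""

    def __init__(self):
        self.data = None  # None means the document does not exist yet

    def update(self, fields):
        if self.data is None:
            raise NotFound("cannot update a missing document")
        self.data.update(fields)

    def set(self, fields, merge=False):
        if merge and self.data is not None:
            self.data.update(fields)   # keep existing fields, overwrite given ones
        else:
            self.data = dict(fields)   # create, or replace wholesale without merge


def mark_uploaded(doc):
    # Safe whether or not the document exists, and safe to call twice.
    doc.set({"uploaded": True}, merge=True)
```

Calling `mark_uploaded` on a missing document creates it, and calling it again leaves the stored fields unchanged, which is why reverting to `update()` would reintroduce the 404.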
@@ -264,6 +272,8 @@ Tests live in `tests/test_mcp_server.py` — all tools mocked via `patch("subpro
|----------|-----------|
| GCS as first-class watch source | Pipeline is cloud-native; local dir support for dev only |
| Firestore for dedup (cloud) | Survives container restarts; no SQLite in Cloud Run |
| Two-layer dedup (path + SHA-256) | Path dedup is O(1) and catches restarts; content SHA-256 catches the same photo forwarded across multiple WA groups (same WA compression = same bytes) |
Copilot AI Feb 26, 2026

This design decision documents a "two-layer dedup" feature that doesn't exist in the codebase. The actual implementation only has path-based deduplication via the `seen(path)` method. There is no content SHA-256 deduplication check: the SHA-256 hash is computed and stored but never queried to prevent duplicate processing. The claim "same WA compression = same bytes" may be true, but the feature to leverage this doesn't exist.

| `google_photos_album_name` recommended | Native iOS backup + DMAF would both upload the same photo (WA strips EXIF so bytes differ); named album keeps DMAF uploads visually separated |
| Cloud Run Job, not Service | Batch workload — runs, exits, scales to zero |
| `set(merge=True)` for `mark_uploaded` | `update()` raises 404 on missing doc; `set+merge` is idempotent |
| `iter_frames` generator + early exit | Large videos; stop decoding after first match |
10 changes: 5 additions & 5 deletions README.md
@@ -110,7 +110,7 @@ Your agent will walk through the full setup: GCP project, service account, GCS b

### ☁️ Google Photos Integration
- **Automatic uploads**: Photos and full video clips backed up seamlessly
- **Album organization**: Optionally organize into a named album
- **Album organization**: Upload to a named album (recommended — keeps face-matched photos separate from your native camera-roll backup)
- **OAuth2 authentication**: Secure, offline token-based access
- **Cloud staging support**: Delete source files after upload (ideal for GCS pipelines)

@@ -121,7 +121,7 @@ Your agent will walk through the full setup: GCP project, service account, GCS b

### ⚡ Efficient & Token-Free
- **Zero LLM tokens after setup**: The entire pipeline — sync cron, face recognition, upload — runs without any AI calls
- **SHA256 deduplication**: Never process the same file twice — survives container restarts via Firestore
- **Two-layer deduplication**: Path-based dedup (fast Firestore lookup) + content SHA-256 dedup — the same photo arriving via multiple WhatsApp groups is only processed and uploaded once; survives container restarts
Copilot AI Feb 26, 2026

This feature card claims "Two-layer deduplication: Path-based dedup (fast Firestore lookup) + content SHA-256 dedup" but the actual implementation only has path-based deduplication. Content SHA-256 hashes are computed and stored but never queried to prevent duplicate processing.

- **Video early exit**: Sampling stops the moment a known face is found — no wasted compute
- **Intelligent retry logic**: Exponential backoff for network resilience
- **Scale-to-zero**: Cloud Run Job — no cost when idle, GCP free tier eligible
@@ -226,15 +226,15 @@ graph LR
D -->|No match| F[⏭️ Skip]
E --> G[🗄️ Firestore Dedup]
F --> G
G -->|SHA256| H[🚫 Never Reprocess]
G -->|path + content SHA256| H[🚫 Never Reprocess]
Copilot AI Feb 26, 2026

This Mermaid diagram node label claims "path + content SHA256" deduplication, but the actual implementation only performs path-based deduplication. The content SHA-256 is not used as a deduplication mechanism.

```

1. **Capture** — OpenClaw intercepts WhatsApp group media and saves it locally; a system cron (zero LLM tokens) uploads it to GCS every 30 min
2. **Schedule** — Cloud Scheduler triggers the Cloud Run job hourly — no agent, no AI cost
3. **Load** — Reference photos downloaded from GCS bucket at job startup
4. **Detect** — Each file is scanned: images once, videos sampled at 1–2fps with early exit on first match
5. **Upload** — Matched photos and full video clips are uploaded to Google Photos
6. **Deduplicate** — SHA256 hash stored in Firestore; the same file is never processed twice
6. **Deduplicate** — Two-layer check: (1) path-based Firestore lookup catches already-seen GCS paths; (2) content SHA-256 check catches the same photo arriving via multiple groups or sync paths — face recognition is skipped entirely for known content
Copilot AI Feb 26, 2026

This "How It Works" step describes a two-layer deduplication system with content SHA-256 checking that doesn't exist in the codebase. The actual implementation only performs path-based deduplication. Face recognition is not "skipped entirely for known content" based on SHA-256 - it's only skipped for previously-seen paths.

---

@@ -252,7 +252,7 @@ recognition:
tolerance: 0.5 # 0.0 (strictest) → 1.0 (loosest)
min_face_size_pixels: 20

google_photos_album_name: "Family — Auto WhatsApp"
google_photos_album_name: "Family Faces" # recommended: keeps DMAF uploads separate from camera-roll backup

alerting:
enabled: true
10 changes: 9 additions & 1 deletion deploy/setup-secrets.md
@@ -187,9 +187,17 @@ recognition:

# ── Google Photos ───────────────────────────────────────────────────────────
google_photos_token_secret: "dmaf-photos-token" # Secret Manager secret name
google_photos_album_name: "DMAF Auto-Import" # Leave empty to skip album
google_photos_album_name: "Family Faces" # Recommended: keeps DMAF uploads separate
# from native camera-roll backup in Google Photos.
# Without this, the same photo may appear twice:
# once from iOS backup (original) and once from
# DMAF (WA-compressed). Set null to upload to root.

# ── Deduplication ──────────────────────────────────────────────────────────
# DMAF uses two-layer dedup to avoid processing the same content twice:
# 1. Path-based: Firestore doc per GCS path (fast, O(1) lookup)
# 2. Content-based: SHA-256 of file bytes — catches the same photo forwarded
# across multiple WhatsApp groups (same compression = same hash)
Comment on lines +197 to +200
Copilot AI Feb 26, 2026

This comment block describes two-layer deduplication with content-based SHA-256 checking that doesn't exist in the codebase. The actual implementation only performs path-based deduplication via Firestore document lookups. The claim "catches the same photo forwarded across multiple WhatsApp groups" is not implemented.

dedup:
backend: firestore # firestore (cloud) | sqlite (local dev)
firestore_project: dmaf-production