docs: document two-layer dedup and google_photos_album_name recommendation #19
- README: update feature bullet, How It Works step 6, Mermaid diagram, and config example to reflect two-layer dedup and album recommendation
- AGENTS.md: update architecture diagram, dev notes (`seen_by_sha256` pattern, WA EXIF-stripping behaviour), and design decisions table
- deploy/setup-secrets.md: expand album config comment explaining why it is recommended (native backup vs DMAF double-upload); add dedup layer comments
Pull request overview
This documentation PR claims to document a "two-layer deduplication" feature (path-based + content SHA-256) that was supposedly added in PR #17 and explains the rationale for recommending google_photos_album_name from PR #18. However, the actual codebase does not contain the content-based deduplication functionality described throughout the documentation.
Changes:
- Documents a two-layer deduplication system with a `seen_by_sha256` method that doesn't exist in the codebase
- Updates configuration examples to recommend `google_photos_album_name: "Family Faces"`
- Adds architectural notes explaining content SHA-256 deduplication that isn't implemented
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| README.md | Updates feature card, How It Works section, and Mermaid diagram to describe two-layer deduplication; changes album example to "Family Faces" |
| AGENTS.md | Updates architecture diagram, adds dev notes section 1b about two-layer dedup, adds design decision table rows |
| deploy/setup-secrets.md | Expands album config comments with rationale; adds deduplication section explaining two layers |
```diff
  4. **Detect** — Each file is scanned: images once, videos sampled at 1–2fps with early exit on first match
  5. **Upload** — Matched photos and full video clips are uploaded to Google Photos
- 6. **Deduplicate** — SHA256 hash stored in Firestore; the same file is never processed twice
+ 6. **Deduplicate** — Two-layer check: (1) path-based Firestore lookup catches already-seen GCS paths; (2) content SHA-256 check catches the same photo arriving via multiple groups or sync paths — face recognition is skipped entirely for known content
```
This "How It Works" step describes a two-layer deduplication system with content SHA-256 checking that doesn't exist in the codebase. The actual implementation only performs path-based deduplication. Face recognition is not "skipped entirely for known content" based on SHA-256 - it's only skipped for previously-seen paths.
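For reference, the two-layer check the documentation describes would look roughly like this — a hypothetical sketch, since `seen_by_sha256` and this flow do not exist in the codebase (the `FakeDatabase` stand-in and `should_process` helper are invented for illustration):

```python
import hashlib

class FakeDatabase:
    """Hypothetical in-memory stand-in for the Firestore/SQLite dedup store."""
    def __init__(self):
        self.paths = set()
        self.hashes = set()

    def seen(self, path):              # layer 1: path-based, O(1)
        return path in self.paths

    def seen_by_sha256(self, digest):  # layer 2: content-based (not implemented)
        return digest in self.hashes

    def mark(self, path, digest):
        self.paths.add(path)
        self.hashes.add(digest)

def should_process(db, path, data):
    if db.seen(path):                  # cheap check, before downloading
        return False
    digest = hashlib.sha256(data).hexdigest()
    if db.seen_by_sha256(digest):      # same bytes arrived via another GCS path
        db.mark(path, digest)          # remember the new path too
        return False
    db.mark(path, digest)
    return True
```

Under this sketch, the second layer is what would make "skipped entirely for known content" true; with only layer 1, a forwarded copy at a new path is reprocessed.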
```diff
+ # DMAF uses two-layer dedup to avoid processing the same content twice:
+ #   1. Path-based: Firestore doc per GCS path (fast, O(1) lookup)
+ #   2. Content-based: SHA-256 of file bytes — catches the same photo forwarded
+ #      across multiple WhatsApp groups (same compression = same hash)
```
This comment block describes two-layer deduplication with content-based SHA-256 checking that doesn't exist in the codebase. The actual implementation only performs path-based deduplication via Firestore document lookups. The claim "catches the same photo forwarded across multiple WhatsApp groups" is not implemented.
```diff
+ **1b. Two-layer dedup: path first, then content**
+ `seen(path)` is checked before downloading (cheap). After downloading, `seen_by_sha256(hash)`
+ catches the same photo arriving via two different GCS paths (e.g. forwarded across groups).
+ Both `Database` (SQLite) and `FirestoreDatabase` implement `seen_by_sha256`. The content
+ check happens in `_process_image_file` and `_process_video_file` before face recognition runs.
+ Note: WA strips all EXIF on iOS — content SHA-256 works because WA compresses once and the
+ same compressed bytes are served to all recipients.
```
This section documents a `seen_by_sha256` method and content-based deduplication that don't exist in the codebase. I verified that:

- No `seen_by_sha256` method exists in either the `Database` or `FirestoreDatabase` class
- The code only performs path-based deduplication via `seen(path)`
- SHA-256 hashes are computed and stored but never queried for deduplication
- No tests for content deduplication exist (`TestContentDedup` class not found)

The actual implementation has only single-layer (path-based) deduplication, not two-layer deduplication. Either this documentation describes a feature that wasn't actually implemented in PR #17, or PR #17 was never merged.
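For completeness, a minimal SQLite-backed store exposing the missing methods might look like the sketch below — this is hypothetical code illustrating what the reviewer searched for, not code from the repository:

```python
import hashlib
import sqlite3

class Database:
    """Hypothetical minimal dedup store with both a path check and the
    content-hash check (`seen_by_sha256`) the review found missing."""
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (path TEXT PRIMARY KEY, sha256 TEXT)"
        )

    def mark(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.conn.execute(
            "INSERT OR REPLACE INTO seen VALUES (?, ?)", (path, digest)
        )

    def seen(self, path):
        row = self.conn.execute(
            "SELECT 1 FROM seen WHERE path = ?", (path,)
        ).fetchone()
        return row is not None

    def seen_by_sha256(self, data):
        digest = hashlib.sha256(data).hexdigest()
        row = self.conn.execute(
            "SELECT 1 FROM seen WHERE sha256 = ?", (digest,)
        ).fetchone()
        return row is not None
```

Without the `seen_by_sha256` query, the stored hash column is write-only — which matches the reviewer's finding that hashes are "computed and stored but never queried".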
```diff
  |----------|-----------|
  | GCS as first-class watch source | Pipeline is cloud-native; local dir support for dev only |
  | Firestore for dedup (cloud) | Survives container restarts; no SQLite in Cloud Run |
+ | Two-layer dedup (path + SHA-256) | Path dedup is O(1) and catches restarts; content SHA-256 catches the same photo forwarded across multiple WA groups (same WA compression = same bytes) |
```
This design decision documents a "two-layer dedup" feature that doesn't exist in the codebase. The actual implementation only has path-based deduplication via the seen(path) method. There is no content SHA-256 deduplication check - the SHA-256 hash is computed and stored but never queried to prevent duplicate processing. The claim "same WA compression = same bytes" may be true, but the feature to leverage this doesn't exist.
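The "same compression = same bytes" reasoning itself is easy to demonstrate with `hashlib` (the byte strings below are stand-ins, not real WhatsApp output):

```python
import hashlib

# Per the docs' claim, WhatsApp compresses a photo once and serves identical
# bytes to every recipient, so every forwarded copy hashes to the same digest:
blob = b"\xff\xd8\xff\xe0" + b"stand-in jpeg payload"   # fake compressed image
digest_group_a = hashlib.sha256(blob).hexdigest()
digest_group_b = hashlib.sha256(blob).hexdigest()
assert digest_group_a == digest_group_b

# But any re-compression (save-and-resend, screenshot) changes the bytes,
# so even a real content-hash check would miss that kind of duplicate:
recompressed = blob + b"\x00"
assert hashlib.sha256(recompressed).hexdigest() != digest_group_a
```

So the premise in the table row is plausible; the reviewer's objection is only that no code leverages it.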
```diff
  Cloud Run Job: dmaf-scan ← Docker image from Cloud Build
  │ scans each file, face recognition against known_people/
- │ dedup via Firestore (never re-processes a file)
+ │ two-layer dedup via Firestore (path + content SHA-256)
```
This architecture diagram claims "two-layer dedup via Firestore (path + content SHA-256)" but the actual implementation only performs path-based deduplication. The content SHA-256 is stored but never used as a deduplication check.
```diff
  ### ⚡ Efficient & Token-Free
  - **Zero LLM tokens after setup**: The entire pipeline — sync cron, face recognition, upload — runs without any AI calls
- - **SHA256 deduplication**: Never process the same file twice — survives container restarts via Firestore
+ - **Two-layer deduplication**: Path-based dedup (fast Firestore lookup) + content SHA-256 dedup — the same photo arriving via multiple WhatsApp groups is only processed and uploaded once; survives container restarts
```
This feature card claims "Two-layer deduplication: Path-based dedup (fast Firestore lookup) + content SHA-256 dedup" but the actual implementation only has path-based deduplication. Content SHA-256 hashes are computed and stored but never queried to prevent duplicate processing.
```diff
  E --> G[🗄️ Firestore Dedup]
  F --> G
- G -->|SHA256| H[🚫 Never Reprocess]
+ G -->|path + content SHA256| H[🚫 Never Reprocess]
```
This Mermaid diagram node label claims "path + content SHA256" deduplication, but the actual implementation only performs path-based deduplication. The content SHA-256 is not used as a deduplication mechanism.
Follows PR #17 (SHA-256 content dedup) and PR #18 (album config example).
Changes

- **README.md** — `google_photos_album_name: "Family Faces"` with inline note
- **AGENTS.md** — `seen_by_sha256`, where it's called, and why WA same-compression = same-hash holds
- **deploy/setup-secrets.md**