docs: document two-layer dedup and google_photos_album_name recommendation #19
- README: update feature bullet, How It Works step 6, Mermaid diagram, and config example to reflect two-layer dedup and album recommendation
- AGENTS.md: update architecture diagram, dev notes (`seen_by_sha256` pattern, WA EXIF-stripping behaviour), and design decisions table
- deploy/setup-secrets.md: expand album config comment explaining why it is recommended (native backup vs DMAF double-upload); add dedup layer comments
Pull request overview
This documentation PR claims to document a "two-layer deduplication" feature (path-based + content SHA-256) that was supposedly added in PR #17 and explains the rationale for recommending google_photos_album_name from PR #18. However, the actual codebase does not contain the content-based deduplication functionality described throughout the documentation.
Changes:
- Documents a two-layer deduplication system with a `seen_by_sha256` method that doesn't exist in the codebase
- Updates configuration examples to recommend `google_photos_album_name: "Family Faces"`
- Adds architectural notes explaining content SHA-256 deduplication that isn't implemented
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| README.md | Updates feature card, How It Works section, and Mermaid diagram to describe two-layer deduplication; changes album example to "Family Faces" |
| AGENTS.md | Updates architecture diagram, adds dev notes section 1b about two-layer dedup, adds design decision table rows |
| deploy/setup-secrets.md | Expands album config comments with rationale; adds deduplication section explaining two layers |
```diff
  4. **Detect** — Each file is scanned: images once, videos sampled at 1–2fps with early exit on first match
  5. **Upload** — Matched photos and full video clips are uploaded to Google Photos
- 6. **Deduplicate** — SHA256 hash stored in Firestore; the same file is never processed twice
+ 6. **Deduplicate** — Two-layer check: (1) path-based Firestore lookup catches already-seen GCS paths; (2) content SHA-256 check catches the same photo arriving via multiple groups or sync paths — face recognition is skipped entirely for known content
```
This "How It Works" step describes a two-layer deduplication system with content SHA-256 checking that doesn't exist in the codebase. The actual implementation only performs path-based deduplication. Face recognition is not "skipped entirely for known content" based on SHA-256 - it's only skipped for previously-seen paths.
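For reference, the two-layer check the documentation describes would look roughly like this — a hypothetical sketch, since `seen_by_sha256` and this flow do not exist in the codebase (the `FakeDatabase` stand-in and `should_process` helper are invented for illustration):

```python
import hashlib

class FakeDatabase:
    """Hypothetical in-memory stand-in for the Firestore/SQLite dedup store."""
    def __init__(self):
        self.paths = set()
        self.hashes = set()

    def seen(self, path):              # layer 1: path-based, O(1)
        return path in self.paths

    def seen_by_sha256(self, digest):  # layer 2: content-based (not implemented)
        return digest in self.hashes

    def mark(self, path, digest):
        self.paths.add(path)
        self.hashes.add(digest)

def should_process(db, path, data):
    if db.seen(path):                  # cheap check, before downloading
        return False
    digest = hashlib.sha256(data).hexdigest()
    if db.seen_by_sha256(digest):      # same bytes arrived via another GCS path
        db.mark(path, digest)          # remember the new path too
        return False
    db.mark(path, digest)
    return True
```

Under this sketch, the second layer is what would make "skipped entirely for known content" true; with only layer 1, a forwarded copy at a new path is reprocessed.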
```diff
+ # DMAF uses two-layer dedup to avoid processing the same content twice:
+ #   1. Path-based: Firestore doc per GCS path (fast, O(1) lookup)
+ #   2. Content-based: SHA-256 of file bytes — catches the same photo forwarded
+ #      across multiple WhatsApp groups (same compression = same hash)
```
This comment block describes two-layer deduplication with content-based SHA-256 checking that doesn't exist in the codebase. The actual implementation only performs path-based deduplication via Firestore document lookups. The claim "catches the same photo forwarded across multiple WhatsApp groups" is not implemented.
```diff
+ **1b. Two-layer dedup: path first, then content**
+ `seen(path)` is checked before downloading (cheap). After downloading, `seen_by_sha256(hash)`
+ catches the same photo arriving via two different GCS paths (e.g. forwarded across groups).
+ Both `Database` (SQLite) and `FirestoreDatabase` implement `seen_by_sha256`. The content
+ check happens in `_process_image_file` and `_process_video_file` before face recognition runs.
+ Note: WA strips all EXIF on iOS — content SHA-256 works because WA compresses once and the
+ same compressed bytes are served to all recipients.
```
This section documents a `seen_by_sha256` method and content-based deduplication that don't exist in the codebase. I verified that:

- No `seen_by_sha256` method exists in either the `Database` or `FirestoreDatabase` class
- The code only performs path-based deduplication via `seen(path)`
- SHA-256 hashes are computed and stored but never queried for deduplication
- No tests for content deduplication exist (`TestContentDedup` class not found)

The actual implementation has only single-layer (path-based) deduplication, not two-layer deduplication. Either this documentation describes a feature that wasn't actually implemented in PR #17, or PR #17 was never merged.
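For completeness, a minimal SQLite-backed store exposing the missing methods might look like the sketch below — this is hypothetical code illustrating what the reviewer searched for, not code from the repository:

```python
import hashlib
import sqlite3

class Database:
    """Hypothetical minimal dedup store with both a path check and the
    content-hash check (`seen_by_sha256`) the review found missing."""
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (path TEXT PRIMARY KEY, sha256 TEXT)"
        )

    def mark(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.conn.execute(
            "INSERT OR REPLACE INTO seen VALUES (?, ?)", (path, digest)
        )

    def seen(self, path):
        row = self.conn.execute(
            "SELECT 1 FROM seen WHERE path = ?", (path,)
        ).fetchone()
        return row is not None

    def seen_by_sha256(self, data):
        digest = hashlib.sha256(data).hexdigest()
        row = self.conn.execute(
            "SELECT 1 FROM seen WHERE sha256 = ?", (digest,)
        ).fetchone()
        return row is not None
```

Without the `seen_by_sha256` query, the stored hash column is write-only — which matches the reviewer's finding that hashes are "computed and stored but never queried".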
```diff
  |----------|-----------|
  | GCS as first-class watch source | Pipeline is cloud-native; local dir support for dev only |
  | Firestore for dedup (cloud) | Survives container restarts; no SQLite in Cloud Run |
+ | Two-layer dedup (path + SHA-256) | Path dedup is O(1) and catches restarts; content SHA-256 catches the same photo forwarded across multiple WA groups (same WA compression = same bytes) |
```
This design decision documents a "two-layer dedup" feature that doesn't exist in the codebase. The actual implementation only has path-based deduplication via the seen(path) method. There is no content SHA-256 deduplication check - the SHA-256 hash is computed and stored but never queried to prevent duplicate processing. The claim "same WA compression = same bytes" may be true, but the feature to leverage this doesn't exist.
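The "same compression = same bytes" reasoning itself is easy to demonstrate with `hashlib` (the byte strings below are stand-ins, not real WhatsApp output):

```python
import hashlib

# Per the docs' claim, WhatsApp compresses a photo once and serves identical
# bytes to every recipient, so every forwarded copy hashes to the same digest:
blob = b"\xff\xd8\xff\xe0" + b"stand-in jpeg payload"   # fake compressed image
digest_group_a = hashlib.sha256(blob).hexdigest()
digest_group_b = hashlib.sha256(blob).hexdigest()
assert digest_group_a == digest_group_b

# But any re-compression (save-and-resend, screenshot) changes the bytes,
# so even a real content-hash check would miss that kind of duplicate:
recompressed = blob + b"\x00"
assert hashlib.sha256(recompressed).hexdigest() != digest_group_a
```

So the premise in the table row is plausible; the reviewer's objection is only that no code leverages it.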
```diff
  Cloud Run Job: dmaf-scan ← Docker image from Cloud Build
  │ scans each file, face recognition against known_people/
- │ dedup via Firestore (never re-processes a file)
+ │ two-layer dedup via Firestore (path + content SHA-256)
```
This architecture diagram claims "two-layer dedup via Firestore (path + content SHA-256)" but the actual implementation only performs path-based deduplication. The content SHA-256 is stored but never used as a deduplication check.
```diff
  ### ⚡ Efficient & Token-Free
  - **Zero LLM tokens after setup**: The entire pipeline — sync cron, face recognition, upload — runs without any AI calls
- - **SHA256 deduplication**: Never process the same file twice — survives container restarts via Firestore
+ - **Two-layer deduplication**: Path-based dedup (fast Firestore lookup) + content SHA-256 dedup — the same photo arriving via multiple WhatsApp groups is only processed and uploaded once; survives container restarts
```
This feature card claims "Two-layer deduplication: Path-based dedup (fast Firestore lookup) + content SHA-256 dedup" but the actual implementation only has path-based deduplication. Content SHA-256 hashes are computed and stored but never queried to prevent duplicate processing.
```diff
  E --> G[🗄️ Firestore Dedup]
  F --> G
- G -->|SHA256| H[🚫 Never Reprocess]
+ G -->|path + content SHA256| H[🚫 Never Reprocess]
```
This Mermaid diagram node label claims "path + content SHA256" deduplication, but the actual implementation only performs path-based deduplication. The content SHA-256 is not used as a deduplication mechanism.
Follows PR #17 (SHA-256 content dedup) and PR #18 (album config example).
Changes

- **README.md** — `google_photos_album_name: "Family Faces"` with inline note
- **AGENTS.md** — `seen_by_sha256`, where it's called, and why WA same-compression = same-hash holds
- **deploy/setup-secrets.md**