docs: document two-layer dedup and google_photos_album_name recommendation#19

Merged: yhyatt merged 2 commits into main from docs/dedup-and-album-updates, Feb 26, 2026

Conversation

yhyatt (Owner) commented Feb 26, 2026

Follows PR #17 (SHA-256 content dedup) and PR #18 (album config example).

Changes

README.md

  • Feature card: "Two-layer deduplication" — path-based + content SHA-256
  • How It Works step 6: explains both dedup layers
  • Mermaid diagram: updated dedup node label
  • Config example: google_photos_album_name: "Family Faces" with inline note

AGENTS.md

  • Architecture diagram: "two-layer dedup via Firestore"
  • Dev notes: new section 1b explaining seen_by_sha256, where it's called, and why WA same-compression = same-hash holds
  • Design decisions table: two new rows for two-layer dedup and album recommendation

deploy/setup-secrets.md

  • Album config: expanded comment explaining the native-backup + DMAF double-upload problem
  • Dedup section: comment block explaining the two layers

- README: update feature bullet, How It Works step 6, mermaid diagram,
  and config example to reflect two-layer dedup and album recommendation
- AGENTS.md: update architecture diagram, dev notes (seen_by_sha256 pattern,
  WA EXIF stripping behaviour), and design decisions table
- deploy/setup-secrets.md: expand album config comment explaining why it is
  recommended (native backup vs DMAF double-upload); add dedup layer comments
Copilot AI review requested due to automatic review settings February 26, 2026 17:00
Copilot AI (Contributor) left a comment

Pull request overview

This documentation PR claims to document a "two-layer deduplication" feature (path-based + content SHA-256) that was supposedly added in PR #17 and explains the rationale for recommending google_photos_album_name from PR #18. However, the actual codebase does not contain the content-based deduplication functionality described throughout the documentation.

Changes:

  • Documents a two-layer deduplication system with seen_by_sha256 method that doesn't exist in the codebase
  • Updates configuration examples to recommend google_photos_album_name: "Family Faces"
  • Adds architectural notes explaining content SHA-256 deduplication that isn't implemented

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File Description
README.md Updates feature card, How It Works section, and Mermaid diagram to describe two-layer deduplication; changes album example to "Family Faces"
AGENTS.md Updates architecture diagram, adds dev notes section 1b about two-layer dedup, adds design decision table rows
deploy/setup-secrets.md Expands album config comments with rationale; adds deduplication section explaining two layers


  4. **Detect** — Each file is scanned: images once, videos sampled at 1–2fps with early exit on first match
  5. **Upload** — Matched photos and full video clips are uploaded to Google Photos
- 6. **Deduplicate** — SHA256 hash stored in Firestore; the same file is never processed twice
+ 6. **Deduplicate** — Two-layer check: (1) path-based Firestore lookup catches already-seen GCS paths; (2) content SHA-256 check catches the same photo arriving via multiple groups or sync paths — face recognition is skipped entirely for known content
Copilot AI commented Feb 26, 2026

This "How It Works" step describes a two-layer deduplication system with content SHA-256 checking that doesn't exist in the codebase. The actual implementation only performs path-based deduplication. Face recognition is not "skipped entirely for known content" based on SHA-256 - it's only skipped for previously-seen paths.

Comment on lines +197 to +200
+ # DMAF uses two-layer dedup to avoid processing the same content twice:
+ # 1. Path-based: Firestore doc per GCS path (fast, O(1) lookup)
+ # 2. Content-based: SHA-256 of file bytes — catches the same photo forwarded
+ #    across multiple WhatsApp groups (same compression = same hash)
Copilot AI commented Feb 26, 2026

This comment block describes two-layer deduplication with content-based SHA-256 checking that doesn't exist in the codebase. The actual implementation only performs path-based deduplication via Firestore document lookups. The claim "catches the same photo forwarded across multiple WhatsApp groups" is not implemented.

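The content-hash layer that comment block describes hinges on hashing the exact file bytes. A minimal sketch of such a hash function is below — `sha256_of_file` is an illustrative name, not a DMAF function, and note that Copilot's review disputes whether the content layer exists in the codebase at all:

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in 1 MiB chunks so large videos never load fully into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Because WhatsApp serves the same compressed bytes to every recipient, identical forwards would produce identical digests — the property the comment block relies on.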
Comment on lines +179 to +185
+ **1b. Two-layer dedup: path first, then content**
+ `seen(path)` is checked before downloading (cheap). After downloading, `seen_by_sha256(hash)`
+ catches the same photo arriving via two different GCS paths (e.g. forwarded across groups).
+ Both `Database` (SQLite) and `FirestoreDatabase` implement `seen_by_sha256`. The content
+ check happens in `_process_image_file` and `_process_video_file` before face recognition runs.
+ Note: WA strips all EXIF on iOS — content SHA-256 works because WA compresses once and the
+ same compressed bytes are served to all recipients.
Copilot AI commented Feb 26, 2026

This section documents a seen_by_sha256 method and content-based deduplication that doesn't exist in the codebase. I verified that:

  1. No seen_by_sha256 method exists in either Database or FirestoreDatabase classes
  2. The code only performs path-based deduplication via seen(path)
  3. SHA-256 hashes are computed and stored but never queried for deduplication
  4. No tests for content deduplication exist (TestContentDedup class not found)

The actual implementation only has single-layer (path-based) deduplication, not two-layer deduplication. Either this documentation describes a feature that wasn't actually implemented in PR #17, or PR #17 was never merged.

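For reference, the path-then-content ordering the dev notes describe could look like the following sketch. The names (`seen`, `seen_by_sha256`, `mark`, `should_process`) mirror the documented design but are illustrative; the backing store here is an in-memory toy, not the project's `Database`/`FirestoreDatabase` classes — and per the review above, the content layer may not exist in the code at all:

```python
import hashlib

class InMemoryDedup:
    """Toy stand-in for the Firestore/SQLite dedup store (illustrative only)."""
    def __init__(self) -> None:
        self._paths: set[str] = set()
        self._hashes: set[str] = set()

    def seen(self, path: str) -> bool:
        return path in self._paths

    def seen_by_sha256(self, digest: str) -> bool:
        return digest in self._hashes

    def mark(self, path: str, digest: str) -> None:
        self._paths.add(path)
        self._hashes.add(digest)

def should_process(db: InMemoryDedup, gcs_path: str, file_bytes: bytes) -> bool:
    # Layer 1: path lookup — cheap, and in the real pipeline it runs
    # before any download happens.
    if db.seen(gcs_path):
        return False
    # Layer 2: content hash — catches the same bytes arriving at a new path.
    digest = hashlib.sha256(file_bytes).hexdigest()
    if db.seen_by_sha256(digest):
        db.mark(gcs_path, digest)  # remember the new path; skip face recognition
        return False
    db.mark(gcs_path, digest)
    return True
```

The ordering matters: the path check costs one lookup and avoids downloading entirely, while the hash check requires the bytes but collapses cross-group forwards to a single processing run.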
  |----------|-----------|
  | GCS as first-class watch source | Pipeline is cloud-native; local dir support for dev only |
  | Firestore for dedup (cloud) | Survives container restarts; no SQLite in Cloud Run |
+ | Two-layer dedup (path + SHA-256) | Path dedup is O(1) and catches restarts; content SHA-256 catches the same photo forwarded across multiple WA groups (same WA compression = same bytes) |
Copilot AI commented Feb 26, 2026

This design decision documents a "two-layer dedup" feature that doesn't exist in the codebase. The actual implementation only has path-based deduplication via the seen(path) method. There is no content SHA-256 deduplication check - the SHA-256 hash is computed and stored but never queried to prevent duplicate processing. The claim "same WA compression = same bytes" may be true, but the feature to leverage this doesn't exist.

  Cloud Run Job: dmaf-scan ← Docker image from Cloud Build
  │ scans each file, face recognition against known_people/
- │ dedup via Firestore (never re-processes a file)
+ │ two-layer dedup via Firestore (path + content SHA-256)
Copilot AI commented Feb 26, 2026

This architecture diagram claims "two-layer dedup via Firestore (path + content SHA-256)" but the actual implementation only performs path-based deduplication. The content SHA-256 is stored but never used as a deduplication check.

  ### ⚡ Efficient & Token-Free
  - **Zero LLM tokens after setup**: The entire pipeline — sync cron, face recognition, upload — runs without any AI calls
- - **SHA256 deduplication**: Never process the same file twice — survives container restarts via Firestore
+ - **Two-layer deduplication**: Path-based dedup (fast Firestore lookup) + content SHA-256 dedup — the same photo arriving via multiple WhatsApp groups is only processed and uploaded once; survives container restarts
Copilot AI commented Feb 26, 2026

This feature card claims "Two-layer deduplication: Path-based dedup (fast Firestore lookup) + content SHA-256 dedup" but the actual implementation only has path-based deduplication. Content SHA-256 hashes are computed and stored but never queried to prevent duplicate processing.

  E --> G[🗄️ Firestore Dedup]
  F --> G
- G -->|SHA256| H[🚫 Never Reprocess]
+ G -->|path + content SHA256| H[🚫 Never Reprocess]
Copilot AI commented Feb 26, 2026

This Mermaid diagram node label claims "path + content SHA256" deduplication, but the actual implementation only performs path-based deduplication. The content SHA-256 is not used as a deduplication mechanism.

@yhyatt yhyatt merged commit f4b58eb into main Feb 26, 2026
6 checks passed
@yhyatt yhyatt deleted the docs/dedup-and-album-updates branch February 26, 2026 18:52