fix: content-based SHA-256 dedup to prevent duplicate Google Photos uploads #17
Same photo sent to multiple WhatsApp groups, or synced from both camera roll and a group, arrives at different GCS paths but with identical bytes. The previous path-only dedup missed these cases.

Changes:
- Add `seen_by_sha256(hash)` to `Database` (SQLite) and `FirestoreDatabase`
  - SQLite: `SELECT 1 FROM files WHERE sha256=? LIMIT 1`
  - Firestore: single-field `where` query on the `sha256` field (auto-indexed)
- In `_process_image_file`: check `seen_by_sha256` after computing the hash, before running face recognition; record the path as seen and return early
- In `_process_video_file`: same check, with the `sha256_of_file` call moved before `find_face_in_video` to avoid expensive frame extraction on duplicates
- Add 3 tests in `TestContentDedup` covering image skip, video skip, and unique content still processed normally
- Add `seen_by_sha256.return_value = False` to all existing mock `db_conn` setups affected by the new call
Pull request overview
This PR implements content-based deduplication using SHA-256 hashes to prevent duplicate uploads when the same photo arrives via multiple GCS sync paths (e.g., camera roll and WhatsApp). The solution adds a seen_by_sha256() method to both database backends and checks for duplicate content before expensive face recognition processing.
Changes:
- Added `seen_by_sha256()` method to `Database` (SQLite) and `FirestoreDatabase` classes for content hash lookups
- Modified `_process_image_file` and `_process_video_file` to check the content hash before face recognition
- Moved SHA-256 computation in `_process_video_file` to occur before face recognition to enable early deduplication
- Updated 15 existing test mocks to include `seen_by_sha256` behavior and added 3 new integration tests
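The image-side check described above can be sketched as follows. This is a simplified reading of the PR, not its actual code: `seen_by_sha256`, `sha256_of_file`, and `add_file_with_score` come from the PR description, while the chunk size, the `run_face_recognition` callback, the `(matched, had_error)` return shape, and the exact `add_file_with_score` arguments are assumptions.

```python
import hashlib


def sha256_of_file(path: str) -> str:
    """Stream the file in 1 MiB chunks so large photos are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def process_image_file(path, db_conn, run_face_recognition):
    """Returns (matched, had_error), mirroring the early-skip behavior."""
    digest = sha256_of_file(path)
    if db_conn.seen_by_sha256(digest):
        # Same bytes already processed under another GCS path: record this
        # path too (so the path-based check catches it on the next scan)
        # and bail out before any face recognition work.
        db_conn.add_file_with_score(path, digest)
        return False, False
    return run_face_recognition(path, digest)
```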
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/dmaf/database.py | Added seen_by_sha256() query methods to both Database and FirestoreDatabase classes for content-based deduplication |
| src/dmaf/watcher.py | Implemented content hash checks before face recognition in batch processing functions; moved SHA-256 computation earlier for videos |
| tests/test_watcher.py | Updated all test mocks with seen_by_sha256 behavior and added 3 new tests for duplicate/unique content scenarios |
```python
assert matched is False
assert had_error is False
process_fn.assert_not_called()
handler.on_match_video.assert_not_called()
```
The test should also verify that add_file_with_score was called to record the duplicate path. This is crucial because the implementation at line 270 in watcher.py calls add_file_with_score to record the path as seen, preventing redundant checks on subsequent scans. Without this assertion, the test doesn't fully validate the deduplication behavior.
Suggested change:

```python
handler.on_match_video.assert_not_called()
db_conn.add_file_with_score.assert_called_once()
```
```python
def seen_by_sha256(self, sha256: str) -> bool:
    """Check if a file with this content hash has already been processed.

    Used for content-based deduplication: the same photo shared via multiple
    WhatsApp groups or synced from both camera roll and a group will have
    different GCS paths but identical bytes, and thus the same SHA-256.
    """
    conn = self._get_conn()
    cur = conn.execute("SELECT 1 FROM files WHERE sha256=? LIMIT 1", (sha256,))
    return cur.fetchone() is not None
```
The new seen_by_sha256 method in both Database classes lacks test coverage. While existing tests mock this method (lines 175, 200, 257, etc.), there are no direct unit tests validating the actual database queries for SQLite or Firestore. Tests should verify: 1) returns False for unknown hashes, 2) returns True after a file with that hash is added, and 3) handles NULL sha256 values gracefully. This is important because the method is a critical part of the new deduplication logic.
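The three cases the reviewer lists can be sketched against an in-memory SQLite table shaped like the `files` table the query assumes; the real `Database` fixture, schema, and insert API may differ, so this is illustrative only.

```python
import sqlite3

# Minimal stand-in for the files table queried by seen_by_sha256.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, sha256 TEXT)")


def seen_by_sha256(sha256: str) -> bool:
    cur = conn.execute("SELECT 1 FROM files WHERE sha256=? LIMIT 1", (sha256,))
    return cur.fetchone() is not None


# 1) Unknown hash -> False.
assert seen_by_sha256("deadbeef") is False

# 2) Known hash after a file with it is added -> True.
conn.execute("INSERT INTO files VALUES (?, ?)", ("gs://bucket/a.jpg", "deadbeef"))
assert seen_by_sha256("deadbeef") is True

# 3) Rows with a NULL sha256 must not match any lookup.
conn.execute("INSERT INTO files VALUES (?, ?)", ("gs://bucket/b.jpg", None))
assert seen_by_sha256("cafebabe") is False
```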
```python
def seen_by_sha256(self, sha256: str) -> bool:
    """Check if a file with this content hash has already been processed.

    Used for content-based deduplication: the same photo shared via multiple
    WhatsApp groups or synced from both camera roll and a group will have
    different GCS paths but identical bytes, and thus the same SHA-256.

    Note: requires the ``sha256`` field to be queryable in Firestore (it is
    indexed automatically as a single-field index on all collections).
    """
    docs = self.collection.where("sha256", "==", sha256).limit(1).stream()
    return any(True for _ in docs)
```
The new seen_by_sha256 method in FirestoreDatabase lacks test coverage. While the SQLite Database class has comprehensive tests for similar methods (e.g., test_seen_returns_false_for_new_file, test_seen_returns_true_after_add), there are no tests for FirestoreDatabase methods. Tests should validate the Firestore query behavior and confirm that the single-field index on the sha256 field works as expected.
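Since real Firestore is typically unavailable in unit tests, one way to cover this is mocking the query chain (`where(...).limit(1).stream()`). The class below is a stand-in that reproduces only the query from the diff; the real `FirestoreDatabase` constructor is assumed to differ.

```python
from unittest.mock import MagicMock


class FirestoreDatabase:
    """Stand-in holding only the method under test (mirrors the PR's query)."""

    def __init__(self, collection):
        self.collection = collection

    def seen_by_sha256(self, sha256: str) -> bool:
        docs = self.collection.where("sha256", "==", sha256).limit(1).stream()
        return any(True for _ in docs)


collection = MagicMock()
query = collection.where.return_value.limit.return_value
db = FirestoreDatabase(collection)

# No matching documents -> False, and the query targeted the sha256 field.
query.stream.return_value = iter([])
assert db.seen_by_sha256("deadbeef") is False
collection.where.assert_called_with("sha256", "==", "deadbeef")

# One matching document -> True.
query.stream.return_value = iter([object()])
assert db.seen_by_sha256("deadbeef") is True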
Problem
When Yonatan takes a photo and it gets synced to GCS via two different paths (e.g. camera roll + WhatsApp group, or the same photo shared to multiple groups), the pipeline was uploading it to Google Photos twice.
Root cause: Deduplication was path-based only. The Firestore document ID is `sha256(gcs_path_string)`, so `gs://bucket/camera/IMG_001.jpg` and `gs://bucket/whatsapp/IMG_001.jpg` were treated as entirely different files, even though the bytes were identical. The content SHA-256 was already being computed and stored, but never queried for dedup.
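The root cause can be shown in a few lines: hashing the path string yields two distinct document IDs for identical bytes, while hashing the content yields one value (the byte literal and bucket paths here are illustrative).

```python
import hashlib

# The same JPEG bytes land behind two different GCS paths.
content = b"\xff\xd8\xff\xe0 identical jpeg bytes"
path_a = "gs://bucket/camera/IMG_001.jpg"
path_b = "gs://bucket/whatsapp/IMG_001.jpg"

# Path-based dedup: the doc ID is the hash of the path string,
# so the two copies get distinct IDs and both are processed.
doc_id_a = hashlib.sha256(path_a.encode()).hexdigest()
doc_id_b = hashlib.sha256(path_b.encode()).hexdigest()
assert doc_id_a != doc_id_b

# Content-based dedup: the bytes hash to a single value regardless of
# path, which is what seen_by_sha256 now queries.
content_hash = hashlib.sha256(content).hexdigest()
```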
Fix
Add content-based deduplication as a second check, after the path-based check:
- `seen_by_sha256(hash)` added to both `Database` (SQLite) and `FirestoreDatabase`
- `_process_image_file`: after computing SHA-256, check if any record with that content hash already exists → skip face recognition, record path as seen, return early
- `_process_video_file`: same check, with SHA-256 computation moved before `find_face_in_video` to avoid expensive frame extraction on duplicate videos

Dedup flow (after this fix)
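That two-stage flow can be sketched as follows; `seen_by_sha256` and `add_file_with_score` come from the PR, while the `should_process` wrapper, its signature, and the name of the path-based `seen` check are illustrative.

```python
def should_process(path: str, digest: str, db_conn) -> bool:
    """Two-stage dedup: exact path first, then identical content."""
    # Stage 1 (pre-existing): the Firestore doc ID is sha256 of the GCS
    # path string, so an already-seen path is skipped outright.
    if db_conn.seen(path):
        return False
    # Stage 2 (this PR): identical bytes that arrived under a new path.
    if db_conn.seen_by_sha256(digest):
        db_conn.add_file_with_score(path, digest)  # record path as seen
        return False
    return True  # unique content: proceed to face recognition
```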
Tests
- `TestContentDedup::test_duplicate_content_skipped_for_image`: face recognition not called, path recorded as seen
- `TestContentDedup::test_duplicate_video_content_skipped`: frame extraction not called
- `TestContentDedup::test_unique_content_not_skipped`: new content still processes normally
- Existing `db_conn` mock setups updated with `seen_by_sha256.return_value = False`