fix: content-based SHA-256 dedup to prevent duplicate Google Photos uploads #17
Same photo sent to multiple WhatsApp groups, or synced from both camera roll and a group, arrives at different GCS paths but with identical bytes. The previous path-only dedup missed these cases.

Changes:
- Add `seen_by_sha256(hash)` to `Database` (SQLite) and `FirestoreDatabase`
  - SQLite: `SELECT 1 FROM files WHERE sha256=? LIMIT 1`
  - Firestore: single-field `where` query on the `sha256` field (auto-indexed)
- In `_process_image_file`: check `seen_by_sha256` after computing the hash, before running face recognition; record the path as seen and return early
- In `_process_video_file`: same check, with the `sha256_of_file` call moved before `find_face_in_video` to avoid expensive frame extraction on duplicates
- Add 3 tests in `TestContentDedup` covering image skip, video skip, and unique content still processed normally
- Add `seen_by_sha256.return_value = False` to all existing mock `db_conn` setups affected by the new call
Pull request overview
This PR implements content-based deduplication using SHA-256 hashes to prevent duplicate uploads when the same photo arrives via multiple GCS sync paths (e.g., camera roll and WhatsApp). The solution adds a seen_by_sha256() method to both database backends and checks for duplicate content before expensive face recognition processing.
Changes:
- Added `seen_by_sha256()` method to `Database` (SQLite) and `FirestoreDatabase` classes for content hash lookups
- Modified `_process_image_file` and `_process_video_file` to check the content hash before face recognition
- Moved SHA-256 computation in `_process_video_file` to occur before face recognition to enable early deduplication
- Updated 15 existing test mocks to include `seen_by_sha256` behavior and added 3 new integration tests
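The image-side check described above can be sketched as follows. This is a simplified reading of the PR, not its actual code: `seen_by_sha256`, `sha256_of_file`, and `add_file_with_score` come from the PR description, while the chunk size, the `run_face_recognition` callback, the `(matched, had_error)` return shape, and the exact `add_file_with_score` arguments are assumptions.

```python
import hashlib


def sha256_of_file(path: str) -> str:
    """Stream the file in 1 MiB chunks so large photos are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def process_image_file(path, db_conn, run_face_recognition):
    """Returns (matched, had_error), mirroring the early-skip behavior."""
    digest = sha256_of_file(path)
    if db_conn.seen_by_sha256(digest):
        # Same bytes already processed under another GCS path: record this
        # path too (so the path-based check catches it on the next scan)
        # and bail out before any face recognition work.
        db_conn.add_file_with_score(path, digest)
        return False, False
    return run_face_recognition(path, digest)
```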
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/dmaf/database.py | Added seen_by_sha256() query methods to both Database and FirestoreDatabase classes for content-based deduplication |
| src/dmaf/watcher.py | Implemented content hash checks before face recognition in batch processing functions; moved SHA-256 computation earlier for videos |
| tests/test_watcher.py | Updated all test mocks with seen_by_sha256 behavior and added 3 new tests for duplicate/unique content scenarios |
```python
assert matched is False
assert had_error is False
process_fn.assert_not_called()
handler.on_match_video.assert_not_called()
```
The test should also verify that add_file_with_score was called to record the duplicate path. This is crucial because the implementation at line 270 in watcher.py calls add_file_with_score to record the path as seen, preventing redundant checks on subsequent scans. Without this assertion, the test doesn't fully validate the deduplication behavior.
Suggested change:

```python
handler.on_match_video.assert_not_called()
db_conn.add_file_with_score.assert_called_once()
```
```python
def seen_by_sha256(self, sha256: str) -> bool:
    """Check if a file with this content hash has already been processed.

    Used for content-based deduplication: the same photo shared via multiple
    WhatsApp groups or synced from both camera roll and a group will have
    different GCS paths but identical bytes, and thus the same SHA-256.
    """
    conn = self._get_conn()
    cur = conn.execute("SELECT 1 FROM files WHERE sha256=? LIMIT 1", (sha256,))
    return cur.fetchone() is not None
```
The new seen_by_sha256 method in both Database classes lacks test coverage. While existing tests mock this method (lines 175, 200, 257, etc.), there are no direct unit tests validating the actual database queries for SQLite or Firestore. Tests should verify: 1) returns False for unknown hashes, 2) returns True after a file with that hash is added, and 3) handles NULL sha256 values gracefully. This is important because the method is a critical part of the new deduplication logic.
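The three cases the reviewer lists can be sketched against an in-memory SQLite table shaped like the `files` table the query assumes; the real `Database` fixture, schema, and insert API may differ, so this is illustrative only.

```python
import sqlite3

# Minimal stand-in for the files table queried by seen_by_sha256.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, sha256 TEXT)")


def seen_by_sha256(sha256: str) -> bool:
    cur = conn.execute("SELECT 1 FROM files WHERE sha256=? LIMIT 1", (sha256,))
    return cur.fetchone() is not None


# 1) Unknown hash -> False.
assert seen_by_sha256("deadbeef") is False

# 2) Known hash after a file with it is added -> True.
conn.execute("INSERT INTO files VALUES (?, ?)", ("gs://bucket/a.jpg", "deadbeef"))
assert seen_by_sha256("deadbeef") is True

# 3) Rows with a NULL sha256 must not match any lookup.
conn.execute("INSERT INTO files VALUES (?, ?)", ("gs://bucket/b.jpg", None))
assert seen_by_sha256("cafebabe") is False
```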
```python
def seen_by_sha256(self, sha256: str) -> bool:
    """Check if a file with this content hash has already been processed.

    Used for content-based deduplication: the same photo shared via multiple
    WhatsApp groups or synced from both camera roll and a group will have
    different GCS paths but identical bytes, and thus the same SHA-256.

    Note: requires the ``sha256`` field to be queryable in Firestore (it is
    indexed automatically as a single-field index on all collections).
    """
    docs = self.collection.where("sha256", "==", sha256).limit(1).stream()
    return any(True for _ in docs)
```
The new seen_by_sha256 method in FirestoreDatabase lacks test coverage. While the SQLite Database class has comprehensive tests for similar methods (e.g., test_seen_returns_false_for_new_file, test_seen_returns_true_after_add), there are no tests for FirestoreDatabase methods. Tests should validate the Firestore query behavior and confirm that the single-field index on the sha256 field works as expected.
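Since real Firestore is typically unavailable in unit tests, one way to cover this is mocking the query chain (`where(...).limit(1).stream()`). The class below is a stand-in that reproduces only the query from the diff; the real `FirestoreDatabase` constructor is assumed to differ.

```python
from unittest.mock import MagicMock


class FirestoreDatabase:
    """Stand-in holding only the method under test (mirrors the PR's query)."""

    def __init__(self, collection):
        self.collection = collection

    def seen_by_sha256(self, sha256: str) -> bool:
        docs = self.collection.where("sha256", "==", sha256).limit(1).stream()
        return any(True for _ in docs)


collection = MagicMock()
query = collection.where.return_value.limit.return_value
db = FirestoreDatabase(collection)

# No matching documents -> False, and the query targeted the sha256 field.
query.stream.return_value = iter([])
assert db.seen_by_sha256("deadbeef") is False
collection.where.assert_called_with("sha256", "==", "deadbeef")

# One matching document -> True.
query.stream.return_value = iter([object()])
assert db.seen_by_sha256("deadbeef") is True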
Problem
When Yonatan takes a photo and it gets synced to GCS via two different paths (e.g. camera roll + WhatsApp group, or the same photo shared to multiple groups), the pipeline was uploading it to Google Photos twice.
Root cause: Deduplication was path-based only. The Firestore document ID is `sha256(gcs_path_string)`, so `gs://bucket/camera/IMG_001.jpg` and `gs://bucket/whatsapp/IMG_001.jpg` were treated as entirely different files, even though the bytes were identical. The content SHA-256 was already being computed and stored, but never queried for dedup.
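The root cause can be shown in a few lines: hashing the path string yields two distinct document IDs for identical bytes, while hashing the content yields one value (the byte literal and bucket paths here are illustrative).

```python
import hashlib

# The same JPEG bytes land behind two different GCS paths.
content = b"\xff\xd8\xff\xe0 identical jpeg bytes"
path_a = "gs://bucket/camera/IMG_001.jpg"
path_b = "gs://bucket/whatsapp/IMG_001.jpg"

# Path-based dedup: the doc ID is the hash of the path string,
# so the two copies get distinct IDs and both are processed.
doc_id_a = hashlib.sha256(path_a.encode()).hexdigest()
doc_id_b = hashlib.sha256(path_b.encode()).hexdigest()
assert doc_id_a != doc_id_b

# Content-based dedup: the bytes hash to a single value regardless of
# path, which is what seen_by_sha256 now queries.
content_hash = hashlib.sha256(content).hexdigest()
```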
Fix
Add content-based deduplication as a second check, after the path-based check:
- `seen_by_sha256(hash)` added to both `Database` (SQLite) and `FirestoreDatabase`
- `_process_image_file`: after computing SHA-256, check if any record with that content hash already exists → skip face recognition, record path as seen, return early
- `_process_video_file`: same check, with SHA-256 computation moved before `find_face_in_video` to avoid expensive frame extraction on duplicate videos

Dedup flow (after this fix)
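That two-stage flow can be sketched as follows; `seen_by_sha256` and `add_file_with_score` come from the PR, while the `should_process` wrapper, its signature, and the name of the path-based `seen` check are illustrative.

```python
def should_process(path: str, digest: str, db_conn) -> bool:
    """Two-stage dedup: exact path first, then identical content."""
    # Stage 1 (pre-existing): the Firestore doc ID is sha256 of the GCS
    # path string, so an already-seen path is skipped outright.
    if db_conn.seen(path):
        return False
    # Stage 2 (this PR): identical bytes that arrived under a new path.
    if db_conn.seen_by_sha256(digest):
        db_conn.add_file_with_score(path, digest)  # record path as seen
        return False
    return True  # unique content: proceed to face recognition
```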
Tests
- `TestContentDedup::test_duplicate_content_skipped_for_image`: face recognition not called, path recorded as seen
- `TestContentDedup::test_duplicate_video_content_skipped`: frame extraction not called
- `TestContentDedup::test_unique_content_not_skipped`: new content still processes normally
- Existing `db_conn` mock setups updated with `seen_by_sha256.return_value = False`