Skip to content

fix: content-based SHA-256 dedup to prevent duplicate Google Photos uploads#17

Merged
yhyatt merged 1 commit intomainfrom
fix/content-dedup-sha256
Feb 26, 2026
Merged

fix: content-based SHA-256 dedup to prevent duplicate Google Photos uploads#17
yhyatt merged 1 commit intomainfrom
fix/content-dedup-sha256

Conversation

@yhyatt
Copy link
Owner

@yhyatt yhyatt commented Feb 26, 2026

Problem

When Yonatan takes a photo and it gets synced to GCS via two different paths (e.g. camera roll + WhatsApp group, or the same photo shared to multiple groups), the pipeline was uploading it to Google Photos twice.

Root cause: Deduplication was path-based only. The Firestore document ID is sha256(gcs_path_string), so gs://bucket/camera/IMG_001.jpg and gs://bucket/whatsapp/IMG_001.jpg were treated as entirely different files — even though the bytes were identical.

The content SHA-256 was already being computed and stored, but never queried for dedup.

Fix

Add content-based deduplication as a second check, after the path-based check:

  1. seen_by_sha256(hash) added to both Database (SQLite) and FirestoreDatabase
  2. In _process_image_file: after computing SHA-256, check if any record with that content hash already exists → skip face recognition, record path as seen, return early
  3. In _process_video_file: same check, moved SHA-256 computation before find_face_in_video to avoid expensive frame extraction on duplicate videos

Dedup flow (after this fix)

1. seen(path)?       → skip immediately (cheap Firestore lookup by doc ID)
2. download file
3. compute sha256
4. seen_by_sha256()?  → skip if duplicate content (new), record path as seen
5. face recognition
6. upload to Google Photos

Tests

  • TestContentDedup::test_duplicate_content_skipped_for_image — face recognition not called, path recorded as seen
  • TestContentDedup::test_duplicate_video_content_skipped — frame extraction not called
  • TestContentDedup::test_unique_content_not_skipped — new content still processes normally
  • All 15 existing db_conn mock setups updated with seen_by_sha256.return_value = False

Same photo sent to multiple WhatsApp groups, or synced from both camera
roll and a group, arrives at different GCS paths but with identical bytes.
The previous path-only dedup missed these cases.

Changes:
- Add seen_by_sha256(hash) to Database (SQLite) and FirestoreDatabase
- SQLite: SELECT 1 FROM files WHERE sha256=? LIMIT 1
- Firestore: single-field where query on the 'sha256' field (auto-indexed)
- In _process_image_file: check seen_by_sha256 after computing hash,
  before running face recognition; record path as seen and return early
- In _process_video_file: same check, moved sha256_of_file call before
  find_face_in_video to avoid expensive frame extraction on duplicates
- Add 3 tests in TestContentDedup covering image skip, video skip, and
  unique content still processed normally
- Add seen_by_sha256.return_value = False to all existing mock db_conn
  setups that were affected by the new call
Copilot AI review requested due to automatic review settings February 26, 2026 16:32
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements content-based deduplication using SHA-256 hashes to prevent duplicate uploads when the same photo arrives via multiple GCS sync paths (e.g., camera roll and WhatsApp). The solution adds a seen_by_sha256() method to both database backends and checks for duplicate content before expensive face recognition processing.

Changes:

  • Added seen_by_sha256() method to Database (SQLite) and FirestoreDatabase classes for content hash lookups
  • Modified _process_image_file and _process_video_file to check content hash before face recognition
  • Moved SHA-256 computation in _process_video_file to occur before face recognition to enable early deduplication
  • Updated 15 existing test mocks to include seen_by_sha256 behavior and added 3 new integration tests

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/dmaf/database.py Added seen_by_sha256() query methods to both Database and FirestoreDatabase classes for content-based deduplication
src/dmaf/watcher.py Implemented content hash checks before face recognition in batch processing functions; moved SHA-256 computation earlier for videos
tests/test_watcher.py Updated all test mocks with seen_by_sha256 behavior and added 3 new tests for duplicate/unique content scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

assert matched is False
assert had_error is False
process_fn.assert_not_called()
handler.on_match_video.assert_not_called()
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test should also verify that add_file_with_score was called to record the duplicate path. This is crucial because the implementation at line 270 in watcher.py calls add_file_with_score to record the path as seen, preventing redundant checks on subsequent scans. Without this assertion, the test doesn't fully validate the deduplication behavior.

Suggested change
handler.on_match_video.assert_not_called()
handler.on_match_video.assert_not_called()
db_conn.add_file_with_score.assert_called_once()

Copilot uses AI. Check for mistakes.
Comment on lines +137 to +147
def seen_by_sha256(self, sha256: str) -> bool:
"""Check if a file with this content hash has already been processed.

Used for content-based deduplication: the same photo shared via multiple
WhatsApp groups or synced from both camera roll and a group will have
different GCS paths but identical bytes, and thus the same SHA-256.
"""
conn = self._get_conn()
cur = conn.execute("SELECT 1 FROM files WHERE sha256=? LIMIT 1", (sha256,))
return cur.fetchone() is not None

Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new seen_by_sha256 method in both Database classes lacks test coverage. While existing tests mock this method (lines 175, 200, 257, etc.), there are no direct unit tests validating the actual database queries for SQLite or Firestore. Tests should verify: 1) returns False for unknown hashes, 2) returns True after a file with that hash is added, and 3) handles NULL sha256 values gracefully. This is important because the method is a critical part of the new deduplication logic.

Copilot uses AI. Check for mistakes.
Comment on lines +586 to +598
def seen_by_sha256(self, sha256: str) -> bool:
"""Check if a file with this content hash has already been processed.

Used for content-based deduplication: the same photo shared via multiple
WhatsApp groups or synced from both camera roll and a group will have
different GCS paths but identical bytes, and thus the same SHA-256.

Note: requires the ``sha256`` field to be queryable in Firestore (it is
indexed automatically as a single-field index on all collections).
"""
docs = self.collection.where("sha256", "==", sha256).limit(1).stream()
return any(True for _ in docs)

Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new seen_by_sha256 method in FirestoreDatabase lacks test coverage. While the SQLite Database class has comprehensive tests for similar methods (e.g., test_seen_returns_false_for_new_file, test_seen_returns_true_after_add), there are no tests for FirestoreDatabase methods. Tests should validate the Firestore query behavior and confirm that the single-field index on the sha256 field works as expected.

Copilot uses AI. Check for mistakes.
@yhyatt yhyatt merged commit a48b141 into main Feb 26, 2026
10 checks passed
@yhyatt yhyatt deleted the fix/content-dedup-sha256 branch February 26, 2026 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants