feat: add native GCS watch directory support #5
Conversation
- Add GCSWatchSource in gcs_watcher.py for listing/downloading GCS blobs
- Modify scan_and_process_once() to detect gs:// URIs and process them
- Dedup key = gs://bucket/path (stable across runs, not temp file path)
- Add 'gcs' optional dependency: pip install dmaf[gcs]
- Temp files cleaned up after processing (finally block)
- Clear ImportError if google-cloud-storage not installed
- Local directory scanning unchanged
- Add tests with mocked GCS client
Pull request overview
Adds support for scanning Google Cloud Storage (GCS) buckets/prefixes as “watch directories” during batch scans, so images can be listed/downloaded from gs://... sources and deduplicated using stable GCS paths rather than local temp file paths.
Changes:
- Introduces `dmaf.gcs_watcher` to parse `gs://` URIs, list image blobs, download blobs to temp files, and clean up temps.
- Refactors `scan_and_process_once()` to detect `gs://` inputs and process GCS blobs while using `gs://bucket/object` as the dedup key.
- Adds tests for URI parsing/cleanup and an integration-style scan test using a mocked GCS client; updates extras and cloud config example.
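A minimal sketch of the URI parsing the overview describes. This is an illustrative stand-in, not the actual `dmaf.gcs_watcher.parse_gcs_uri` implementation, which may handle edge cases differently:

```python
from urllib.parse import urlparse

def parse_gcs_uri(uri: str) -> tuple[str, str]:
    """Split a gs://bucket/prefix URI into (bucket, object-or-prefix).

    Sketch only; the real dmaf.gcs_watcher helper may differ.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a valid GCS URI: {uri}")
    # Strip the leading slash so the result is a bare object key/prefix.
    return parsed.netloc, parsed.path.lstrip("/")
```

For example, `parse_gcs_uri("gs://my-bucket/photos/cat.jpg")` yields `("my-bucket", "photos/cat.jpg")`, which the watcher can reassemble into the stable `gs://bucket/object` dedup key.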
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/test_gcs_watcher.py | Adds unit/integration tests for GCS URI helpers and scan-once dedup behavior. |
| src/dmaf/watcher.py | Adds shared image-processing helper and extends scan-once to support gs:// sources and stable dedup keys. |
| src/dmaf/gcs_watcher.py | New module for GCS listing/downloading and temp file cleanup. |
| pyproject.toml | Adds gcs optional dependency and includes it in all. |
| config.cloud.example.yaml | Documents use of gs:// watch dirs and dedup behavior for cloud runs. |
```python
client = _get_storage_client()
bucket_name, prefix = parse_gcs_uri(uri)
bucket = client.bucket(bucket_name)

gcs_paths = []
for blob in bucket.list_blobs(prefix=prefix):
    # Skip "directory" markers
    if blob.name.endswith("/"):
        continue
    suffix = Path(blob.name).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        gcs_paths.append(f"gs://{bucket_name}/{blob.name}")
return gcs_paths
```

```python
def download_gcs_blob(gcs_path: str) -> Path:
    """
    Download a GCS blob to a temporary file.

    Args:
        gcs_path: Full GCS path like 'gs://bucket/path/to/image.jpg'

    Returns:
        Path to the downloaded temporary file. Caller must clean up with cleanup_temp_file().
    """
    client = _get_storage_client()
    bucket_name, blob_name = parse_gcs_uri(gcs_path)
    # blob_name from parse_gcs_uri is the prefix, but for a full path it's the object key
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
```
_get_storage_client() is called inside both list_gcs_images() and download_gcs_blob(), and scan_and_process_once() calls download_gcs_blob() per blob. This will create a new GCS client for every object download, which is expensive and can become a bottleneck. Consider caching the client (e.g., module-level singleton or @functools.lru_cache) and/or passing a client/bucket object through to avoid repeated initialization.
```python
logger.info(f"Match {Path(dedup_key).name} -> {who}")
try:
    handler.on_match(image_path, who)
    if handler.cfg.delete_source_after_upload:
        try:
            image_path.unlink()
            logger.info(f"Deleted source: {image_path.name}")
        except Exception as e:
            logger.warning(f"Failed to delete {image_path.name}: {e}")
except Exception as e:
    logger.error(f"Upload failed for {Path(dedup_key).name}: {e}")
    had_error = True
    if handler.alert_manager:
        handler.alert_manager.record_error("upload", str(e), dedup_key)
```
For GCS processing, dedup_key is the gs:// path but on_match() is still called with image_path (the downloaded temp file). Existing on_match implementations (e.g., in main.py) call db_conn.mark_uploaded(str(p)), which will use the temp path rather than the gs:// dedup key; for the Firestore backend this can raise (document not found) and cause uploads to be treated as failures. Consider passing dedup_key into on_match (or moving mark_uploaded(dedup_key) into _process_image_file after a successful upload) so uploaded status is recorded against the same key that was added to the DB.
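One way to apply the suggestion, sketched with a hypothetical `FakeDB` stand-in for the real database backend: mark the upload against the dedup key (the `gs://` path that was added to the DB), not the temp file path:

```python
from pathlib import Path

class FakeDB:
    """Hypothetical stand-in for the Firestore/SQL backend in dmaf."""
    def __init__(self):
        self.uploaded = []
    def mark_uploaded(self, key: str) -> None:
        self.uploaded.append(key)

def record_upload(db: FakeDB, image_path: Path, dedup_key: str) -> None:
    # After a successful upload, record status against the *dedup key*
    # (gs://bucket/object), which is the key stored in the DB --
    # not image_path, which is a throwaway temp file.
    db.mark_uploaded(dedup_key)

db = FakeDB()
record_upload(db, Path("/tmp/tmpabc.jpg"), "gs://bucket/photos/cat.jpg")
assert db.uploaded == ["gs://bucket/photos/cat.jpg"]
```

Keeping the `mark_uploaded()` call inside the shared processing helper (rather than inside each `on_match` implementation) guarantees the same key is used on both the insert and the update path.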
```python
    Supports both local directories and GCS URIs (gs://bucket/prefix/).
    For GCS URIs, the dedup key is the full gs:// path (not the temp file path).

    Args:
        dirs: List of directory paths or GCS URIs to scan
```
scan_and_process_once() now supports gs:// URIs, but run_watch() (continuous watcher mode) still treats every entry as a local filesystem path (Path(d).mkdir / watchdog schedule). If a user configures watch_dirs with a gs:// entry and runs without --scan-once, this will likely create invalid local directories or error. Consider rejecting/ignoring gs:// entries in run_watch() (or documenting that GCS is scan-once only).
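A hedged sketch of the guard this comment asks for. `split_watch_dirs` is a hypothetical helper, not an existing `dmaf` function; `run_watch()` would schedule watchdog observers only for the local entries and log (or reject) the GCS ones:

```python
def split_watch_dirs(dirs: list[str]) -> tuple[list[str], list[str]]:
    """Separate local paths from gs:// URIs so run_watch() never tries
    to mkdir or schedule a filesystem watcher on a GCS URI."""
    local, gcs = [], []
    for d in dirs:
        (gcs if str(d).startswith("gs://") else local).append(d)
    return local, gcs

local, gcs = split_watch_dirs(["/data/inbox", "gs://bucket/prefix/"])
assert local == ["/data/inbox"]
assert gcs == ["gs://bucket/prefix/"]
```

Whichever option is chosen, the `gs://` entries should produce a clear log message ("GCS sources are scan-once only") rather than a silent `mkdir` of a literal `gs:` directory on disk.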
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Fix PR #5 CI failures in GCS temp-file handling and watcher batch-mode tests