Add azure support#749

Open
shrek wants to merge 25 commits into NVIDIA:main from shrek:add-azure-support

Conversation

@shrek
Collaborator

@shrek shrek commented Mar 13, 2026

Earth2Studio Pull Request

Description

This PR adds the functionality required to run inference on Azure, integrating with Azure Blob Storage for inference results and with the Planetary Computer GeoCatalog for ingestion of those results.

  • Object storage functionality is extended to support Azure Blob Storage so that inference results can be uploaded there. For this, multi-storage-client is updated to a more recent version that supports Azure default identity.

  • A GeoCatalog client is added with Python utilities that interface with the Planetary Computer GeoCatalog API.

  • The inference pipeline is updated with config knobs enabling the full chain:
    API -> inference -> upload results to Azure Blob -> trigger ingestion into the GeoCatalog

  • Two Foundry inference workflows are added, along with JSON metadata files used by the GeoCatalog ingestion APIs.

Misc enhancements unrelated to azure

  • Range support is added for downloading inference result files from the server, which helps when fetching large files directly from the server.

  • A configuration option, EXPOSED_WORKFLOWS, limits which workflows the API exposes, allowing only a subset of the available workflows to be served.

  • CPU workers were consuming GPU memory; this is fixed by not exposing any GPUs to them.

Tests

The above functionality is tested end-to-end on Azure as an online endpoint. This includes uploading inference results to Azure Blob Storage and GeoCatalog ingestion.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.
  • Assess and address Greptile feedback (AI code review bot for guidance; use discretion, addressing all feedback is not required).

Dependencies

@shrek shrek marked this pull request as ready for review March 16, 2026 19:03
@greptile-apps
Contributor

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR adds Azure Blob Storage support for inference result uploads, a GeoCatalog (Planetary Computer) STAC ingestion client, two new foundry inference workflows (foundry_fcn3, foundry_fcn3_stormscope_goes), workflow exposure filtering, and RFC 9110-compliant range-request support for file downloads. The changes integrate a new geocatalog_ingestion pipeline stage between object-storage upload and metadata finalization.

The implementation is generally well-structured and has good test coverage. Several issues flagged in earlier review rounds (missing HTTP status checks in _create_element, polling error handling, end >= file_size RFC 9110 violation, ValueError from int() on malformed Range headers) appear to have been addressed. Key remaining concerns:

  • Geocatalog workers are unconditionally started and verified at server startup, even when AZURE_GEOCATALOG_URL is not configured. The startup script will hard-fail (exit 1) if geocatalog workers don't start, and check_admission_control always monitors the geocatalog queue depth — meaning a stalled geocatalog queue can block inference requests in non-Azure deployments.
  • Azure connection strings are written to the global process environment (os.environ["AZURE_CONNECTION_STRING"]), which exposes credentials to child processes; this may be unavoidable given MSC's env-var expansion model but should be documented.
  • test_planetary_computer.py GeoCatalog tests lack an importorskip guard for azure-identity, which will produce confusing ImportErrors instead of clean skips in environments without the serve extras.
  • A typo in foundry_fcn3_stormscope_goes.py: "preceed" should be "precede".

Confidence Score: 2/5

  • Needs attention before merging — unconditional geocatalog worker requirement will break non-Azure server startups if workers fail, and several issues from earlier rounds remain open.
  • The core Azure integration logic is sound and well-tested, but the geocatalog worker startup is now a mandatory dependency for all deployments (the script exits with code 1 if no geocatalog workers are running), even when Azure geocatalog is not configured. Combined with the unaddressed wildcard SAS URL issue from previous rounds and the timezone-naive sentinel in validate_start_time/validate_start_times, the PR carries meaningful risk for both Azure and non-Azure deployments.
  • serve/server/scripts/start_api_server.sh (mandatory geocatalog worker check), earth2studio/serve/server/object_storage.py (wildcard SAS URL from prior thread, connection string in env), serve/server/example_workflows/foundry_fcn3_stormscope_goes.py (open issues from prior threads).

Important Files Changed

Filename Overview
earth2studio/data/planetary_computer.py: Adds GeoCatalogClient for STAC ingestion into Planetary Computer GeoCatalog. Token refresh happens once per create_feature call; the polling loop may run up to 5 minutes but tokens generally outlive that. Prior review threads addressed missing status checks and polling error handling — those issues appear resolved in this version.
earth2studio/serve/server/cpu_worker.py: Adds process_geocatalog_ingestion pipeline stage and significant refactoring for Azure storage support. Key concern: geocatalog workers are unconditionally included in admission control checks regardless of whether AZURE_GEOCATALOG_URL is configured, which could cause unnecessary queue-depth-related request blocking in non-Azure deployments.
earth2studio/serve/server/object_storage.py: Adds Azure Blob Storage support to MSCObjectStorage. The connection string (which may contain AccountKey) is written to the global process environment via os.environ. The wildcard SAS URL issue previously flagged in review threads remains unaddressed in this iteration.
earth2studio/serve/server/utils.py: Adds RFC 9110-compliant parse_range_header and create_file_stream utilities for partial content delivery. Implementation looks correct; end >= file_size is now clamped rather than rejected per §14.1.2.
earth2studio/serve/server/workflow.py: Adds is_workflow_exposed and updates list_workflows to support filtering. Warmup workflows are intentionally accessible via API endpoints but excluded from the public listing — design is clearly documented and tests cover both behaviors.
serve/server/scripts/start_api_server.sh: Adds geocatalog worker startup and CUDA_VISIBLE_DEVICES="" isolation for all CPU workers. Geocatalog workers are always started (and verified to have started) regardless of whether AZURE_GEOCATALOG_URL is set, consuming a process slot unconditionally.
serve/server/example_workflows/foundry_fcn3_stormscope_goes.py: New FCN3+StormScopeGOES workflow. Contains if not seeds check (flagged in previous thread) and timezone-naive sentinel in validate_start_times (flagged in previous thread). Also contains typo "preceed" → "precede" in error message.

Comments Outside Diff (2)

  1. serve/server/scripts/start_api_server.sh, line 208-215 (link)

    P1 Geocatalog workers always required, even without Azure

    The script unconditionally starts NUM_GEOCATALOG_WORKERS geocatalog workers and then hard-fails (exit 1) if none are found running. This means every deployment — including those that never set AZURE_GEOCATALOG_URL — must have geocatalog workers running or the server won't start.

    Similarly, check_admission_control() in main.py always monitors the geocatalog_ingestion queue depth. If the geocatalog queue fills up for any reason (e.g., stale jobs, worker restart lag), it will block new inference requests even in non-Azure deployments.

    Consider making both the worker startup and the admission-control check conditional on AZURE_GEOCATALOG_URL being configured:

    # Only start geocatalog workers when geocatalog is enabled
    if [ -n "$AZURE_GEOCATALOG_URL" ]; then
        echo "Starting $NUM_GEOCATALOG_WORKERS geocatalog ingestion workers..."
        GEOCATALOG_WORKER_PIDS=()
        for i in $(seq 1 $NUM_GEOCATALOG_WORKERS); do
            CUDA_VISIBLE_DEVICES="" rq worker -w rq.worker.SimpleWorker geocatalog_ingestion &
            GEOCATALOG_WORKER_PIDS+=($!)
            echo "Started geocatalog ingestion worker $i (PID: $!)"
        done
    fi
  2. earth2studio/serve/server/object_storage.py, line 1437-1442 (link)

    P2 Connection string written to global process environment

    os.environ["AZURE_CONNECTION_STRING"] = azure_connection_string persists the full connection string (which typically includes the storage account key) into the process environment. This value is visible to all child processes spawned after this call and is readable from /proc/self/environ on Linux.

    While MSC requires the env-var reference for the ${AZURE_CONNECTION_STRING} substitution in the profile config, it's worth checking whether MSC supports passing the value inline in the profile dict rather than via env-var expansion. If inline values are supported, that would avoid leaking credentials into the process environment. Otherwise, please add a comment explaining why the env var must be set here so future readers understand the trade-off.

Last reviewed commit: 056b714

storage_info["remote_path"] = (
    f"azure://{container_name}/{remote_prefix}"
)
# Build HTTPS blob URL for primary netcdf file (for GeoCatalog ingestion)
Collaborator


Will this only work for single NetCDF4 files or also Zarr archives?

Collaborator Author


Thanks, good catch! This works only for NetCDF4; I am fixing it so it works for Zarr as well.

)


class GeoCatalogClient:
Collaborator


Will let @NickGeneva comment whether this is the right place to put the client that starts the data ingestion into Microsoft Planetary Computer from Azure Blob Storage. It is not a data source, more like an IO utility so we may want to put it somewhere else.

"Install with the serve extra or pip install azure-identity."
) from e
self._DefaultAzureCredential = _DefaultAzureCredential
self._workflow_name = workflow_name
Collaborator Author


Create filenames with the workflow name as a prefix, and keep the workflow name consistent throughout. The parameter mapping is awkward; fix that too.

