Greptile Summary

This PR adds Azure Blob Storage support for inference result uploads, a GeoCatalog (Planetary Computer) STAC ingestion client, two new foundry inference workflows, and supporting server changes. The implementation is generally well-structured and has good test coverage. Several issues flagged in earlier review rounds (missing HTTP status checks and polling error handling in the GeoCatalog client) appear resolved in this version, while others (the wildcard SAS URL handling and the unconditional geocatalog workers) remain open.
| Filename | Overview |
|---|---|
| earth2studio/data/planetary_computer.py | Adds GeoCatalogClient for STAC ingestion into Planetary Computer GeoCatalog. Token refresh happens once per create_feature call; polling loop may run up to 5 minutes but tokens generally outlive that. Prior review threads addressed missing status checks and polling error handling — those issues appear resolved in this version. |
| earth2studio/serve/server/cpu_worker.py | Adds process_geocatalog_ingestion pipeline stage and significant refactoring for Azure storage support. Key concern: geocatalog workers are unconditionally included in admission control checks regardless of whether AZURE_GEOCATALOG_URL is configured, which could cause unnecessary queue-depth-related request blocking in non-Azure deployments. |
| earth2studio/serve/server/object_storage.py | Adds Azure Blob Storage support to MSCObjectStorage. The connection string (which may contain AccountKey) is written to the global process environment via os.environ. The wildcard SAS URL issue previously flagged in review threads remains unaddressed in this iteration. |
| earth2studio/serve/server/utils.py | Adds RFC 9110-compliant parse_range_header and create_file_stream utilities for partial content delivery. Implementation looks correct; end >= file_size is now clamped rather than rejected per §14.1.2. |
| earth2studio/serve/server/workflow.py | Adds is_workflow_exposed and updates list_workflows to support filtering. Warmup workflows are intentionally accessible via API endpoints but excluded from the public listing — design is clearly documented and tests cover both behaviors. |
| serve/server/scripts/start_api_server.sh | Adds geocatalog worker startup and CUDA_VISIBLE_DEVICES="" isolation for all CPU workers. Geocatalog workers are always started (and verified to have started) regardless of whether AZURE_GEOCATALOG_URL is set, consuming a process slot unconditionally. |
| serve/server/example_workflows/foundry_fcn3_stormscope_goes.py | New FCN3+StormScopeGOES workflow. Contains if not seeds check (flagged in previous thread) and timezone-naive sentinel in validate_start_times (flagged in previous thread). Also contains typo "preceed" → "precede" in error message. |
Comments Outside Diff (2)
- serve/server/scripts/start_api_server.sh, lines 208-215: Geocatalog workers always required, even without Azure

  The script unconditionally starts `NUM_GEOCATALOG_WORKERS` geocatalog workers and then hard-fails (`exit 1`) if none are found running. This means every deployment, including those that never set `AZURE_GEOCATALOG_URL`, must have geocatalog workers running or the server won't start.

  Similarly, `check_admission_control()` in `main.py` always monitors the `geocatalog_ingestion` queue depth. If the geocatalog queue fills up for any reason (e.g., stale jobs, worker restart lag), it will block new inference requests even in non-Azure deployments.

  Consider making both the worker startup and the admission-control check conditional on `AZURE_GEOCATALOG_URL` being configured:

  ```shell
  # Only start geocatalog workers when geocatalog is enabled
  if [ -n "$AZURE_GEOCATALOG_URL" ]; then
      echo "Starting $NUM_GEOCATALOG_WORKERS geocatalog ingestion workers..."
      GEOCATALOG_WORKER_PIDS=()
      for i in $(seq 1 $NUM_GEOCATALOG_WORKERS); do
          CUDA_VISIBLE_DEVICES="" rq worker -w rq.worker.SimpleWorker geocatalog_ingestion &
          GEOCATALOG_WORKER_PIDS+=($!)
          echo "Started geocatalog ingestion worker $i (PID: $!)"
      done
  fi
  ```
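On the Python side, the admission-control change could be as simple as guarding the geocatalog queue check on the same variable. This is a sketch: `check_admission_control`, the queue names, the depth threshold, and the `queue_depth` helper are assumptions about `main.py`, not its actual code.

```python
import os

GEOCATALOG_QUEUE = "geocatalog_ingestion"
MAX_QUEUE_DEPTH = 100  # assumed threshold

def queue_depth(name: str) -> int:
    """Placeholder for the real RQ queue-depth lookup."""
    return 0

def check_admission_control() -> bool:
    """Return True if a new inference request may be admitted."""
    queues = ["inference"]
    # Only count the geocatalog queue when geocatalog ingestion is enabled,
    # so a backed-up geocatalog queue cannot block non-Azure deployments.
    if os.environ.get("AZURE_GEOCATALOG_URL"):
        queues.append(GEOCATALOG_QUEUE)
    return all(queue_depth(q) < MAX_QUEUE_DEPTH for q in queues)
```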
- earth2studio/serve/server/object_storage.py, lines 1437-1442: Connection string written to global process environment

  `os.environ["AZURE_CONNECTION_STRING"] = azure_connection_string` persists the full connection string (which typically includes the storage account key) into the process environment. This value is visible to all child processes spawned after this call and is readable from `/proc/self/environ` on Linux.

  While MSC requires the env-var reference for the `${AZURE_CONNECTION_STRING}` substitution in the profile config, it's worth checking whether MSC supports passing the value inline in the profile dict rather than via env-var expansion. If inline values are supported, that would avoid leaking credentials into the process environment. Otherwise, please add a comment explaining why the env var must be set here so future readers understand the trade-off.
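The exposure this comment describes is easy to demonstrate: anything placed in `os.environ` is inherited by every child process spawned afterwards. The snippet below is a self-contained illustration with a harmless dummy value, not the PR's code.

```python
import os
import subprocess
import sys

# Simulate the flagged pattern with a dummy secret.
os.environ["AZURE_CONNECTION_STRING"] = "AccountName=demo;AccountKey=dummy"

# Any child process started after the assignment can read the value.
out = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['AZURE_CONNECTION_STRING'])"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(out)  # the child sees the full connection string
```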
Last reviewed commit: 056b714
```python
storage_info["remote_path"] = (
    f"azure://{container_name}/{remote_prefix}"
)
# Build HTTPS blob URL for primary netcdf file (for GeoCatalog ingestion)
```
Will this only work for single NetCDF4 files or also Zarr archives?
Thanks, good catch! This works only for NetCDF4; I am fixing it so it works for Zarr as well.
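One way to make the URL construction format-agnostic is to treat a Zarr archive as a blob prefix rather than a single blob. This is a sketch under assumed parameter names, not the fix the author ends up making.

```python
def blob_https_url(account: str, container: str, remote_path: str) -> str:
    """Build the HTTPS URL for an uploaded inference result.

    A single NetCDF4 file maps to one blob; a Zarr archive is a directory
    of many blobs, so it is referenced by its store prefix instead of a
    single file name.
    """
    base = f"https://{account}.blob.core.windows.net/{container}/{remote_path}"
    # Normalize Zarr stores with a trailing slash to mark them as prefixes.
    if remote_path.endswith(".zarr"):
        return base + "/"
    return base
```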
| ) | ||
|
|
||
|
|
||
| class GeoCatalogClient: |
Will let @NickGeneva comment whether this is the right place to put the client that starts the data ingestion into Microsoft Planetary Computer from Azure Blob Storage. It is not a data source, more like an IO utility so we may want to put it somewhere else.
```python
            "Install with the serve extra or pip install azure-identity."
        ) from e
    self._DefaultAzureCredential = _DefaultAzureCredential
    self._workflow_name = workflow_name
```
Create filenames with the workflow name as a prefix, and keep the workflow name consistent throughout. The parameter mapping is awkward as well; please fix that too.
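A minimal version of what this comment asks for, with the workflow name threaded through a single filename helper so it stays consistent everywhere. Names and the timestamp format are illustrative assumptions, not the PR's API.

```python
from datetime import datetime, timezone

def result_filename(workflow_name: str, start_time: datetime,
                    suffix: str = "nc") -> str:
    """Build an inference-result filename prefixed with the workflow name.

    Centralizing the prefix in one helper keeps the workflow name
    consistent across every generated filename.
    """
    stamp = start_time.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{workflow_name}_{stamp}.{suffix}"
```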
Earth2Studio Pull Request
Description
This PR adds the functionality required for running inference on Azure, with integrations with Azure Blob Storage for inference results and the Planetary Computer GeoCatalog for ingestion of those results.
Object storage functionality is enhanced to support Azure Blob Storage so that inference results can be uploaded there. For this, multi-storage-client is updated to a more recent version that supports Azure default identity.
A GeoCatalog client is added with Python utilities to interface with the Planetary Computer GeoCatalog API.
The inference pipeline is updated with config knobs to perform:
API -> inference -> upload results to azure-blob -> trigger ingestion into geocatalog
Two foundry inference workflows are added. For these two workflows, JSON metadata files are added for the GeoCatalog ingestion APIs.
Misc enhancements unrelated to Azure
HTTP Range support is added for inference-result file downloads from the server. This helps when downloading large files directly from the server.
Configuration is added to limit the EXPOSED_WORKFLOWS in the API. This allows only a subset of the available workflows to be exposed in the API.
CPU workers were consuming GPU memory. This is fixed by not exposing any GPUs to them.
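The EXPOSED_WORKFLOWS filtering described above could look like the following sketch; the variable format and function name are assumptions, not the server's actual code.

```python
import os

def exposed_workflows(all_workflows: list[str]) -> list[str]:
    """Return the workflows to list in the API.

    EXPOSED_WORKFLOWS is assumed to be a comma-separated allowlist; when
    unset or empty, every workflow is exposed.
    """
    raw = os.environ.get("EXPOSED_WORKFLOWS", "").strip()
    if not raw:
        return list(all_workflows)
    allow = {name.strip() for name in raw.split(",")}
    return [w for w in all_workflows if w in allow]
```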
Tests
The above functionality is tested end-to-end on Azure as an online endpoint. This includes uploading inference results to Azure Blob Storage and GeoCatalog ingestion.
Checklist
Dependencies