Greptile Summary

This PR adds a self-contained TC tracking recipe built around TempestExtremes. Key observations:
| Filename | Overview |
|---|---|
| recipes/tc_tracking/src/tempest_extremes.py | Core implementation of TempestExtremes integration — provides synchronous TempestExtremes and asynchronous AsyncTempestExtremes classes; AsyncTempestExtremes.__call__ correctly submits tracking to a background thread pool enabling GPU/CPU overlap, but cleanup()/wait_for_completion() abort on the first task failure and silently abandon remaining failing tasks. |
| recipes/tc_tracking/src/modes/generate_tc_hunt_ensembles.py | Main inference loop orchestrating ensemble generation, stability checking, and cyclone tracking; logic is sound and correctly uses the async TempestExtremes API; previously flagged debug comments have been cleaned up. |
| recipes/tc_tracking/pyproject.toml | Package metadata and dependencies; contains a placeholder description ("no, i won't") and unpinned git sources for both earth2studio and torch-harmonics, which reduce build reproducibility. |
| recipes/tc_tracking/Dockerfile | Docker build environment that compiles TempestExtremes from source; clones TempestExtremes at HEAD without a pinned tag/commit, which makes image builds non-reproducible. |
| recipes/tc_tracking/tc_hunt.py | Entry-point script with Hydra configuration; still contains an informal print("finished **yaaayyyy**") celebration message (previously flagged). |
Last reviewed commit: "TE workers"
```python
for ic, mems, seed in ic_mems:
    mini_batch_size = len(mems)

    data_source = data_source_mngr.select_data_source(ic)

    # if new IC, fetch data, create iterator
    if ic != ic_prev:
        if cfg.store_type == "netcdf":
            store = initialise_netcdf_output(cfg, out_coords, ic, ic_mems)
        x0, coords0 = fetch_data(
            data_source,
            time=[np.datetime64(ic)],
            lead_time=model.input_coords()["lead_time"],
            variable=model.input_coords()["variable"],
            device=dist.device,
        )
        ic_prev = ic

    coords = {"ensemble": np.array(mems)} | coords0.copy()
    xx = x0.unsqueeze(0).repeat(mini_batch_size, *([1] * x0.ndim))

    if stability_check:
        stability_check.reset(deepcopy(coords))
        # print(stability_check.input_coords)
        # exit()

    # set random state or apply perturbation
    if ("model" not in cfg) or (cfg.model == "fcn3"):
        model.set_rng(seed=seed)
    elif (
        cfg.model[:4] == "aifs"
    ):  # no need for perturbation, but also cannot set internal noise state
        pass
    else:
        sg = SphericalGaussian(noise_amplitude=0.0005)
        xx, coords = sg(xx, coords)

    iterator = model.create_iterator(xx, coords)

    # roll out the model and record data as desired
    for _, (xx, coords) in tqdm(
        zip(range(cfg.n_steps + 1), iterator), total=cfg.n_steps + 1
    ):
        write_to_store(store, xx, coords, out_coords)
        if cyclone_tracking:
            cyclone_tracking.record_state(xx, coords)

        if stability_check:
            yy, coy = map_coords(xx, coords, stability_check.input_coords)
            stab, _ = stability_check(yy, coy)
            if not stab.all():
                ic_mems.append((ic, mems, seed + 1))
                print(
                    f"CAUTION: one of members {mems} became unstable. will re-create with new seed."
                )
                break
```
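The rollout loop caps a potentially unbounded model iterator by zipping it with a finite range. A minimal standalone sketch of the same idiom, where `itertools.count` stands in for the recipe's `model.create_iterator`:

```python
import itertools

n_steps = 3
iterator = itertools.count(start=0)  # stand-in for an unbounded model iterator

recorded = []
for _, state in zip(range(n_steps + 1), iterator):
    # zip stops as soon as range is exhausted, so exactly n_steps + 1
    # states (initial condition + n_steps forecast steps) are consumed
    recorded.append(state)

print(recorded)
# → [0, 1, 2, 3]
```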
Unbounded retry loop for unstable members
When a member is detected as unstable (line 260), it is re-appended to ic_mems with seed + 1. Because Python's for loop over a list processes newly-appended items, this creates an unbounded retry cycle — there is no guard on how many times any given (ic, mems) combination can be re-queued.
If a particular initial condition consistently produces unstable trajectories (e.g., a known degenerate edge case), the job will never terminate. A maximum-retry counter should be tracked per (ic, seed) pair, and members that exceed the limit should be skipped with a warning rather than being re-queued indefinitely.
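A hedged sketch of the suggested guard; `MAX_RETRIES`, the `retry_count` bookkeeping, and the always-failing stability stand-in are illustrative, not code from the PR. It also shows why the loop is currently unbounded: a Python `for` loop over a list visits items appended during iteration.

```python
from collections import defaultdict

MAX_RETRIES = 3  # illustrative limit, not part of the PR

retry_count: dict = defaultdict(int)
ic_mems = [("ic0", (0,), 1)]  # (initial condition, members, seed)
processed = []

for ic, mems, seed in ic_mems:
    processed.append((ic, mems, seed))
    unstable = True  # stand-in for a stability check that always fails
    if unstable:
        key = (ic, mems)
        retry_count[key] += 1
        if retry_count[key] <= MAX_RETRIES:
            # re-queue with a bumped seed, as the recipe does today
            ic_mems.append((ic, mems, seed + 1))
        else:
            # without this branch the loop would re-queue forever
            print(f"WARNING: members {mems} unstable after {MAX_RETRIES} retries; skipping.")
```

With the guard, a persistently unstable member is attempted `MAX_RETRIES + 1` times and then dropped with a warning instead of spinning until the scheduler kills the job.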
In practice, such jobs will be killed by the system after exceeding their allocated time.
In a future version I want to update the scheduling anyway to something smarter, as individual ensemble members might then not always take roughly the same time to execute, as they do now.
```python
def cleanup(self, timeout_per_task: int | None = None) -> None:
    """Explicitly clean up and wait for all background tasks to complete.

    This method should be called before the object is destroyed or the program exits
    to ensure all cyclone tracking tasks complete successfully.

    Parameters
    ----------
    timeout_per_task : int | None, optional
        Timeout in seconds for each task. If None, uses self.timeout.

    Raises
    ------
    ChildProcessError
        If any background task failed
    Exception
        If any task failed with other exceptions
    """
    if self._cleanup_done:
        return

    if timeout_per_task is None:
        timeout_per_task = self.timeout

    try:
        # Wait for all instance tasks to complete
        if hasattr(self, "_instance_tasks") and hasattr(self, "_instance_lock"):
            with self._instance_lock:
                tasks_to_wait = list(self._instance_tasks)

            if tasks_to_wait:
                print(
                    f"AsyncTempestExtremes: waiting for {len(tasks_to_wait)} background tasks to complete..."
                )

                for i, future in enumerate(tasks_to_wait):
                    try:
                        print(f"  Waiting for task {i+1}/{len(tasks_to_wait)}...")
                        future.result(timeout=timeout_per_task)
                        print(
                            f"  Task {i+1}/{len(tasks_to_wait)} completed successfully"
                        )
                    except ChildProcessError as e:
                        print(
                            f"  Task {i+1}/{len(tasks_to_wait)} failed with ChildProcessError: {e}"
                        )
                        raise  # Re-raise to propagate the error
                    except Exception as e:
                        print(f"  Task {i+1}/{len(tasks_to_wait)} failed: {e}")
                        raise  # Re-raise to propagate the error

                print(
                    f"All {len(tasks_to_wait)} background tasks completed successfully"
                )

        self._cleanup_done = True

    except Exception as _:
        self._cleanup_done = True  # Mark as done even on failure to avoid retry
        raise
```
wait_for_completion and cleanup abort on first failure, silently abandoning later tasks
Both wait_for_completion() and cleanup() iterate over pending tasks and raise immediately on the first error. Any subsequent task failures are never collected — their exceptions are silently swallowed by the background threads and never surfaced to the caller. In a scenario where multiple members fail concurrently, only the first error reaches the user and the remaining failed tasks are abandoned.
Consider collecting all failures before re-raising, similar to the pattern used in _run_te_and_cleanup:

```python
errors = []
for i, future in enumerate(tasks_to_wait):
    try:
        future.result(timeout=timeout_per_task)
    except Exception as e:
        print(f"  Task {i+1}/{len(tasks_to_wait)} failed: {e}")
        errors.append(e)
if errors:
    raise ChildProcessError(
        f"{len(errors)} background task(s) failed: {errors}"
    )
```

The same change should be applied to wait_for_completion().
Earth2Studio Pull Request