feat: add async generator migration with symmetric bridging and statefulness by andreatgretel · Pull Request #378 · NVIDIA-NeMo/DataDesigner

andreatgretel · 2026-03-09T13:28:18Z

Summary

PR 2 of 4 in the async engine migration (#346).

Full plan: plans/346/async-generators-and-task-queue.md

PR stack:

PR 1 - ExecutionGraph, CompletionTracker, and Task model (#356, merged)
PR 2 (this) - Async generator migration with symmetric bridging
PR 3 - AsyncTaskScheduler and RowGroupBufferManager
PR 4 - Wire async scheduler into ColumnWiseDatasetBuilder

This PR makes all column generators async-capable without breaking existing sync usage. Generators can implement either generate() or agenerate() (or both), and the base class bridges between them automatically.
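
A minimal sketch of the bridging idea, for readers who have not opened base.py yet. `BridgedGenerator`, `SyncOnly`, and `AsyncOnly` are hypothetical names used only for illustration, and the sync direction uses plain `asyncio.run` here (so it assumes no event loop is already running); the PR's actual `ColumnGenerator` may differ in detail.

```python
import asyncio


class BridgedGenerator:
    """Hypothetical stand-in for ColumnGenerator; not the PR's actual code."""

    def _is_overridden(self, name: str) -> bool:
        # A subclass overrides `name` if its class attribute is not the very
        # same function object that is defined on the base class.
        return getattr(type(self), name) is not getattr(BridgedGenerator, name)

    def generate(self, data: dict) -> dict:
        if self._is_overridden("agenerate"):
            # Sync caller, async-only subclass: drive the coroutine to completion.
            # Assumption: no event loop is running in this thread.
            return asyncio.run(self.agenerate(data))
        raise NotImplementedError("implement generate() or agenerate()")

    async def agenerate(self, data: dict) -> dict:
        if self._is_overridden("generate"):
            # Async caller, sync-only subclass: run in a worker thread on a copy.
            return await asyncio.to_thread(self.generate, data.copy())
        raise NotImplementedError("implement generate() or agenerate()")


class SyncOnly(BridgedGenerator):
    def generate(self, data: dict) -> dict:
        return {**data, "mode": "sync"}


class AsyncOnly(BridgedGenerator):
    async def agenerate(self, data: dict) -> dict:
        return {**data, "mode": "async"}
```

The override check on both sides is what keeps a class that implements neither method from recursing between the two defaults; it fails with `NotImplementedError` instead.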

Changes

Added

Symmetric generate/agenerate bridging in ColumnGenerator - implement one, get the other for free
_run_coroutine_sync helper for running coroutines from sync contexts (including notebooks with running event loops)
_is_overridden helper for symmetric override detection in both generate() and agenerate()
is_order_dependent property on ColumnGenerator (default False); SeedDatasetColumnGenerator declares True
Defensive data.copy() in base agenerate to prevent caller mutation across threads
FromScratchColumnGenerator.agenerate_from_scratch async wrapper
Native async paths for ImageCellGenerator and EmbeddingCellGenerator with extracted _prepare_*_inputs methods
CustomColumnGenerator.agenerate with full validation parity (required_columns, output shape, error wrapping)
_postprocess_result extracted for shared sync/async output validation
22 tests covering bridging, order-dependence, async custom generators, image/embedding async paths, and error-path parity

Changed

Updated plan document with revised scope and status
Sync bridge timeout re-raises as builtin TimeoutError for Python 3.10 compat

Fixed

Sync bridge timeout now releases the caller immediately via shutdown(wait=False) instead of blocking on pool cleanup

Attention Areas

Reviewers, please pay special attention to:

base.py - the symmetric bridging logic, _is_overridden helper, and _run_coroutine_sync are the foundation for the rest of the stack
custom.py - async branch now runs the same validation as sync; _postprocess_result is shared between both paths
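
To illustrate what "same validation on both paths" means, here is a simplified sketch. `CustomCellSketch` and its checks are hypothetical; only the name `_postprocess_result` comes from the PR, and the real validation (required_columns, output shape, error wrapping) is richer than this.

```python
import asyncio
import inspect
from typing import Any, Callable


class CustomCellSketch:
    """Hypothetical, simplified stand-in for a custom cell-by-cell generator."""

    def __init__(self, fn: Callable[[dict], Any], output_column: str) -> None:
        self.fn = fn
        self.output_column = output_column

    def _postprocess_result(self, result: Any) -> dict:
        # Single place for output validation, used by both paths.
        if not isinstance(result, dict):
            raise TypeError(f"expected dict, got {type(result).__name__}")
        if self.output_column not in result:
            raise ValueError(f"missing output column {self.output_column!r}")
        return result

    def generate(self, data: dict) -> dict:
        return self._postprocess_result(self.fn(data))

    async def agenerate(self, data: dict) -> dict:
        if inspect.iscoroutinefunction(self.fn):
            result = await self.fn(data)  # native async user function
        else:
            result = await asyncio.to_thread(self.fn, data)  # sync user function
        return self._postprocess_result(result)
```

Because both methods end in the same helper, a malformed result raises the same exception type regardless of which entry point the caller used.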

Closes #381


…fulness

- Symmetric generate/agenerate bridging in base ColumnGenerator
- is_stateful property; SeedDatasetColumnGenerator declares True
- Async wrappers for FromScratchColumnGenerator and ColumnGeneratorFullColumn
- Native async paths for ImageCellGenerator and EmbeddingCellGenerator
- CustomColumnGenerator.agenerate with full validation parity
- Extract _postprocess_result for shared sync/async output validation

greptile-apps · 2026-03-09T13:36:32Z

Greptile Summary

This PR (2 of 4 in the async engine migration) makes all column generators async-capable without breaking existing sync usage by adding symmetric generate/agenerate bridging to ColumnGenerator, a _run_coroutine_sync helper for running coroutines from sync contexts (including notebooks with live event loops), and native async paths for ImageCellGenerator and EmbeddingCellGenerator. Previous review issues around pool lifecycle, TimeoutError aliasing, duplicate agenerate overrides, and duplicated validation logic have all been addressed.

base.py: _run_coroutine_sync correctly uses try/finally with a timed_out flag to handle success, timeout, and domain-exception exit paths; _is_overridden correctly checks against ColumnGenerator via identity comparison so both bridging directions work reliably across the class hierarchy.
custom.py: agenerate branches on strategy and coroutine-ness; _ainvoke_generator_function correctly awaits the sync wrapper's return value (a coroutine object) for async user functions; _postprocess_result is shared between both paths for validation parity.
image.py / embedding.py: Shared _prepare_*_inputs helpers eliminate the previously duplicated validation logic; disk I/O in ImageCellGenerator.agenerate is correctly offloaded to a thread.
seed_dataset.py: is_order_dependent declared True to signal row-group ordering constraints to the upcoming scheduler (PR 3).
Tests: 19 tests cover bridging symmetry, error-path parity, and async variants of each generator type; stub_resource_provider is correctly resolved from tests/engine/conftest.py.

Confidence Score: 4/5

This PR is safe to merge; all critical pool-lifecycle and exception-handling issues from previous rounds are addressed and the bridging logic is sound.
The implementation is well-structured, all prior round issues (pool leaks, TimeoutError aliasing, duplicate overrides, duplicated validation) have been resolved, and 19 tests provide solid coverage. The only remaining items are a style note on shallow-copy semantics for dict inputs in agenerate and a missing test for the non-timeout exception propagation path through _run_coroutine_sync — neither of which blocks merging.
No files require special attention; base.py carries the most complex new logic but the bridging and pool-lifecycle handling are correct.
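
For context on the shallow-copy note: `dict.copy()` creates a new top-level dict but shares any nested mutable values with the original. The snippet below shows only that standard Python behavior; the variable names are illustrative and it is not taken from the PR.

```python
row = {"name": "a", "tags": ["x"]}
snapshot = row.copy()           # shallow: new dict, same nested objects

snapshot["name"] = "b"          # rebinding a key does not affect `row`
snapshot["tags"].append("y")    # mutating a nested list is visible in both

print(row["name"])   # a
print(row["tags"])   # ['x', 'y']
```

So the defensive copy protects against key-level mutation across threads, but not against in-place mutation of nested values.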

Important Files Changed

Filename	Overview
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py	Adds `_run_coroutine_sync` helper and symmetric `generate`/`agenerate` bridging; try/finally pool lifecycle and builtin TimeoutError wrapping look correct after previous fixes.
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py	Adds `agenerate` with full-column thread delegation and native async cell-by-cell path; `_postprocess_result` and `_ainvoke_generator_function` are clean shared helpers; async full-column path delegates to sync `generate` (pre-existing limitation acknowledged by authors).
packages/data-designer-engine/tests/engine/column_generators/generators/test_async_generators.py	19 tests covering bridging symmetry, statefulness, async custom generators, and error-path parity; fixture dependencies resolved via parent conftest.py; `object.__new__` pattern correctly avoids `.fget` brittleness.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant ColumnGenerator
    participant _run_coroutine_sync
    participant ThreadPoolExecutor
    participant asyncio

    note over Caller,asyncio: Sync caller → async-only generator
    Caller->>ColumnGenerator: generate(data)
    ColumnGenerator->>ColumnGenerator: _is_overridden("agenerate") → True
    ColumnGenerator->>_run_coroutine_sync: _run_coroutine_sync(self.agenerate(data))
    _run_coroutine_sync->>asyncio: get_running_loop() raises RuntimeError (no loop)
    _run_coroutine_sync->>asyncio: asyncio.run(coro)
    asyncio-->>_run_coroutine_sync: result
    _run_coroutine_sync-->>ColumnGenerator: result
    ColumnGenerator-->>Caller: result

    note over Caller,asyncio: Async caller (e.g. notebook) → sync-only generator
    Caller->>ColumnGenerator: await agenerate(data)
    ColumnGenerator->>ColumnGenerator: _is_overridden("generate") → True
    ColumnGenerator->>asyncio: asyncio.to_thread(self.generate, data.copy())
    asyncio->>ThreadPoolExecutor: run generate(data_copy) in thread
    ThreadPoolExecutor-->>asyncio: result
    asyncio-->>ColumnGenerator: result
    ColumnGenerator-->>Caller: result

    note over Caller,asyncio: Async caller with running loop → async-only generator (bridge path)
    Caller->>ColumnGenerator: generate(data)
    ColumnGenerator->>_run_coroutine_sync: _run_coroutine_sync(self.agenerate(data))
    _run_coroutine_sync->>asyncio: get_running_loop() → loop exists
    _run_coroutine_sync->>ThreadPoolExecutor: pool.submit(asyncio.run, coro)
    ThreadPoolExecutor-->>_run_coroutine_sync: future.result(timeout=300s)
    _run_coroutine_sync->>ThreadPoolExecutor: pool.shutdown(wait=True)
    _run_coroutine_sync-->>ColumnGenerator: result
    ColumnGenerator-->>Caller: result

Last reviewed commit: 1928b32

Use explicit pool lifecycle instead of context manager so that a TimeoutError releases the caller immediately via shutdown(wait=False) rather than blocking on pool.__exit__.

@overload

Add @overload declarations so the base agenerate accepts both dict and pd.DataFrame, mirroring the existing generate pattern.
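
A sketch of the overload pattern this commit describes. `OverloadedGenerator` is a hypothetical class and the body is a placeholder; only the dict/pd.DataFrame pairing is taken from the thread. pandas is imported for type checking only, so the snippet runs without it installed.

```python
import asyncio
from typing import TYPE_CHECKING, Union, overload

if TYPE_CHECKING:  # only needed by the type checker, not at runtime
    import pandas as pd


class OverloadedGenerator:
    """Hypothetical base class illustrating the signature pattern only."""

    @overload
    async def agenerate(self, data: dict) -> dict: ...
    @overload
    async def agenerate(self, data: "pd.DataFrame") -> "pd.DataFrame": ...

    async def agenerate(
        self, data: Union[dict, "pd.DataFrame"]
    ) -> Union[dict, "pd.DataFrame"]:
        # dict.copy() and DataFrame.copy() both exist, so one body serves both.
        return data.copy()
```

With the two overloads in place, a type checker infers `dict` for dict inputs and `pd.DataFrame` for frame inputs, so subclasses that accept frames no longer widen the base signature.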

andreatgretel · 2026-03-09T14:38:32Z

Base class agenerate signature (data: dict) -> dict is narrower than overrides in ColumnGeneratorFullColumn and FromScratchColumnGenerator [...] This violates Liskov Substitution

also addressed the agenerate type signature — added @overload declarations to match the existing generate pattern (dict -> dict | pd.DataFrame -> pd.DataFrame). fixed in 8b6e8b8.

The else clause after return was unreachable, leaking the ThreadPoolExecutor on every successful call. Capture the result first, shut down the pool, then return.

Ensures ThreadPoolExecutor is shut down on all exit paths, including non-TimeoutError exceptions from the coroutine.
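
The pool-lifecycle fixes described in these commits can be sketched roughly as follows. This is an approximation written from the thread, not the code in base.py: the function name `run_coroutine_sync`, its signature, and the error message are assumptions, and the 300-second default mirrors the value shown in the sequence diagram.

```python
import asyncio
import concurrent.futures
from typing import Any, Coroutine


def run_coroutine_sync(coro: Coroutine[Any, Any, Any], timeout: float = 300.0) -> Any:
    """Hypothetical sketch; the PR's _run_coroutine_sync may differ in detail."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread: the simple path.
        return asyncio.run(coro)

    # A loop is already running (e.g. a notebook), so asyncio.run() would raise
    # here. Run the coroutine on a fresh loop in a worker thread instead.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    timed_out = False
    try:
        future = pool.submit(asyncio.run, coro)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError as exc:
            timed_out = True
            # Before Python 3.11 this class is distinct from builtin TimeoutError.
            raise TimeoutError(f"coroutine timed out after {timeout}s") from exc
    finally:
        # On timeout, do not wait for the stuck worker; otherwise clean up fully.
        pool.shutdown(wait=not timed_out)
```

The `finally` block runs on success, on timeout, and when the coroutine raises its own exception, which is the "all exit paths" property the commits are after.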

Move duplicated input validation and prompt rendering into _prepare_image_inputs, shared by generate and agenerate.

nabinchha · 2026-03-10T17:16:12Z

+    def is_stateful(self) -> bool:
+        """Whether this generator maintains state across calls.
+
+        Stateful generators are serialized per-instance by the async scheduler
+        (row group N must complete before N+1 starts for that generator).
+        """
+        return False


nit: is_stateful could read as vague; lots of things have state. Maybe requires_sequential_execution?

I'm still trying to fully understand this property. I think we want the "stateful" part in there because this is for columns like the seed column, which needs to remember where it is at in the generation process – is that right? I think the second part of the docstring is a bit hard to follow (might be just me, though).

@nabinchha renamed to is_order_dependent - captures the key semantic (output depends on call order) without being as vague as is_stateful. docstring updated with a concrete example.

@johnnygreco yeah exactly - the seed column needs to remember where it is in the dataset. renamed to is_order_dependent to make the intent clearer at a glance.
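
A small sketch of the renamed property as discussed above. `GeneratorBase`, `SeedLikeGenerator`, and `next_batch` are hypothetical names; only `is_order_dependent`, its default of False, and the seed-dataset motivation come from the thread.

```python
class GeneratorBase:
    """Hypothetical stand-in for ColumnGenerator."""

    @property
    def is_order_dependent(self) -> bool:
        # Default: output does not depend on the order of calls.
        return False


class SeedLikeGenerator(GeneratorBase):
    """Hypothetical generator that reads rows from a dataset sequentially."""

    def __init__(self, rows: list) -> None:
        self._rows = rows
        self._cursor = 0  # position carried across calls

    @property
    def is_order_dependent(self) -> bool:
        return True

    def next_batch(self, size: int) -> list:
        batch = self._rows[self._cursor : self._cursor + size]
        self._cursor += size
        return batch
```

Because each call advances the cursor, running two batches out of order would hand the wrong rows to each row group, which is why a scheduler would need to serialize calls to such a generator.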

johnnygreco · 2026-03-11T01:49:53Z

+        # The @custom_column_generator decorator wraps the user function in a sync
+        # wrapper, so we must unwrap to detect async functions.


lol meant fancy stuff here 🙃

yeah the decorator wrapping forces our hand here - inspect.unwrap is the cleanest way to peek through to the original async function.
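
For anyone following along, the unwrap trick looks roughly like this. `sync_wrapping_decorator` is a hypothetical stand-in for @custom_column_generator and assumes the real decorator uses `functools.wraps` (or otherwise sets `__wrapped__`), which is what `inspect.unwrap` relies on.

```python
import functools
import inspect


def sync_wrapping_decorator(fn):
    """Hypothetical decorator; stands in for @custom_column_generator."""

    @functools.wraps(fn)  # sets wrapper.__wrapped__ = fn
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)  # returns a coroutine object if fn is async

    return wrapper


@sync_wrapping_decorator
async def user_fn(row: dict) -> dict:
    return row


print(inspect.iscoroutinefunction(user_fn))                  # False
print(inspect.iscoroutinefunction(inspect.unwrap(user_fn)))  # True
```

The wrapper itself is a plain `def`, so checking it directly misses the async function underneath; unwrapping first recovers the original.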

- add _is_overridden helper for symmetric generate/agenerate guards
- move defensive .copy() into base agenerate, remove subclass overrides
- re-raise as builtin TimeoutError for Python 3.10 compat
- rename is_stateful to is_order_dependent with improved docstring
- replace brittle .fget test with object.__new__
- add async tests for ImageCellGenerator and EmbeddingCellGenerator

nabinchha

Add "Async All the Way Down" dev note covering the async task-queue scheduler built across PRs #356, #378, #404, #429, #456. Includes benchmark results, architecture diagrams, and DAG shape illustrations.

* fix: address review feedback on async engine dev note
  - Fix wall-clock claim: 41% -> 22% to match benchmark table
  - Fix dual-model speedup rounding: 1.7x -> 1.6x (10.0/6.1 = 1.64)
  - Fix run_config API: use dd.set_run_config() instead of passing to create()
* docs: add async engine dev note
  Add "Async All the Way Down" dev note covering the async task-queue scheduler built across PRs #356, #378, #404, #429, #456. Includes benchmark results, architecture diagrams, and DAG shape illustrations.
* feat: add docs preview workflow for PRs
  Build MkDocs site on PRs that touch docs and deploy to Cloudflare Pages. Each PR gets a browseable preview URL posted as a comment. Notebook tutorials use placeholder stubs since they require API keys to execute. Requires CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID repo secrets.
* fix: update speedup chart alt text from 1.7x to 1.6x
* docs: improve timeline figure context and labeling
  Add DAG subtitle to sync-vs-async timeline figure and bridge the surrounding text to explain which workload shape is being shown.
* edits+additions to async-all-the-way-down dev notes
* clarify two semaphore dance
* remove dead link
* replace hero image
* docs: update scale figures with nginx-accurate data and adjust sizing
  Regenerate scale-model-timeline and scale-boxplot from nginx access logs (column_progress.csv, sync/summary.json) instead of buffered execution logs. Optimize both PNGs to palette mode. Adjust figure widths and update model timeline commentary.
* add link from owning-the-model-stack to async-dev-node
* docs: address review feedback on async blog post
  - Tighten intro to a concise abstract, move pipeline narrative into "The Bottleneck Was Structural" section
  - Remove multi-column generators / seed readers paragraph (TMI)
  - Clarify sync engine ran columns sequentially within each batch

---------

Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com>

andreatgretel requested a review from a team as a code owner March 9, 2026 13:28