
feat: add ExecutionGraph, CompletionTracker, and Task model for async scheduler#356

Merged
andreatgretel merged 21 commits into main from
andreatgretel/feat/async-generators-and-task-queue-foundation
Mar 6, 2026

Conversation


@andreatgretel andreatgretel commented Feb 26, 2026

Summary

PR 1 of 4 in the async generators & task-queue builder plan. Adds the foundational data structures — ExecutionGraph, CompletionTracker, and Task/TaskResult/TaskTrace — that the async scheduler (PR 3) will consume. No existing behavior changes; all new modules under engine/dataset_builders/utils/.

Changes

Added

  • execution_graph.py — Column-level DAG built from config dependencies. Supports topological ordering (Kahn's, cached), longest dependency chain, cell-level dependency resolution, side-effect column mapping, Mermaid visualization, upfront task count estimation, cached upstream_by_strategy, and a create() factory classmethod.
  • completion_tracker.py — Tracks per-cell and per-batch completion state across row groups. Uses an event-driven frontier — readiness is computed incrementally on mark_cell_complete/mark_row_range_complete/drop_row via _enqueue_downstream, so get_ready_tasks returns in O(frontier) instead of scanning all columns × rows × row groups per tick. Enforces strategy-safe completion (cell API rejects non-CELL_BY_CELL columns, batch API rejects CELL_BY_CELL columns). Guards against re-enqueueing already-completed tasks.
  • task_model.py — Frozen dataclasses for Task (hashable work unit), TaskResult (outcome), TaskTrace (timing trace), and CellRef (named tuple for cell coordinates).
  • test_execution_graph.py (438 lines) — Tests for graph construction, topological order, longest dependency chain, cell dependencies, side-effects, Mermaid output, cycle detection, task counts, immutability guarantees.
  • test_completion_tracker.py (348 lines) — Tests for mark/query, batch completion, row drops, frontier-based readiness resolution, multi-row-group scenarios, strategy validation, re-enqueue regression tests.
  • test_task_model.py (87 lines) — Tests for equality, hashing, set membership, defaults.
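For orientation, the topological ordering the description mentions (Kahn's algorithm over column dependencies) can be sketched standalone. This is an illustrative toy over a column-name DAG, not the ExecutionGraph implementation — the function name and input shape here are assumptions for the sketch:

```python
from collections import deque

def kahn_topological_order(upstream: dict[str, set[str]]) -> list[str]:
    """Return a topological order of columns, raising on cycles.

    `upstream` maps each column to the set of columns it depends on.
    Illustrative sketch only; not the actual ExecutionGraph code.
    """
    indegree = {col: len(ups) for col, ups in upstream.items()}
    downstream: dict[str, set[str]] = {col: set() for col in upstream}
    for col, ups in upstream.items():
        for up in ups:
            downstream[up].add(col)
    # Start from root columns (no upstream dependencies).
    queue = deque(col for col, deg in indegree.items() if deg == 0)
    order: list[str] = []
    while queue:
        col = queue.popleft()
        order.append(col)
        for down in downstream[col]:
            indegree[down] -= 1
            if indegree[down] == 0:
                queue.append(down)
    # Any column never reaching indegree 0 is part of a cycle.
    if len(order) != len(upstream):
        raise ValueError("cycle detected in column dependencies")
    return order
```

Since the graph is immutable after construction, the result of a sort like this can be computed once and cached, which is what the PR does.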

Changed

Total: +1,434 / -18 lines across 7 files (6 new, 1 modified). ~60% of added lines are tests (873 test / 543 source).

Attention Areas

Reviewers: Please pay special attention to the following:

  • completion_tracker.py — Event-driven frontier logic in _enqueue_downstream and _reevaluate_batch_tasks. Key invariants: strategy validation prevents mismatched mark calls, completed tasks are guarded against re-enqueueing, and _batch_complete tracks true batch completion separately from key-presence in _completed.
  • execution_graph.py — Core DAG logic. All public accessors (columns, get_upstream_columns, get_downstream_columns, topological_order) return defensive copies. The cell_dependencies method resolves side-effect columns and maps generation strategy to readiness granularity. This is the contract that PR 3's scheduler will rely on.

Test plan

  • All new tests pass — 58 tests across 3 files
  • make check-all passes (lint + format)
  • Existing test suite unaffected — no imports from these modules yet

Description updated with AI

… scheduler

Add the foundational data structures for the async task-queue dataset
builder (plan #346, PR 1/4):

- ExecutionGraph: column-level static DAG with topological ordering,
  critical path, task counts, cell-dependency resolution, Mermaid output,
  and side-effect column mapping (__trace, __reasoning_content).
- CompletionTracker: lightweight (column, row_group, row_index) completion
  state with row dropping and ready-task enumeration.
- Task/TaskResult/TaskTrace: frozen hashable task dataclass, result
  container, and opt-in tracing record.

All three are pure data structures with no side effects on the existing
codebase. They live in new modules under engine/dataset_builders/utils/
and are only imported by code introduced in later PRs.

56 unit tests covering graph construction, validation, dependency
resolution, completion tracking, row drops, and task model semantics.

Refs #346
Add `is_ready` and `is_batch_ready` methods to CompletionTracker to
simplify `ready_tasks`. Cache topological order in ExecutionGraph since
the graph is immutable after construction. Move DatasetBuilderColumnConfigT
type alias to multi_column_configs. Fix license header years.
@andreatgretel andreatgretel requested a review from a team as a code owner February 26, 2026 21:59

greptile-apps bot commented Feb 26, 2026

Greptile Summary

This PR introduces the foundational data structures for the async scheduler — ExecutionGraph (column-level DAG with topological ordering and task-count estimation), CompletionTracker (event-driven frontier for O(frontier) readiness queries), and Task/TaskResult/TaskTrace/SliceRef models — with ~873 lines of tests covering the full surface area. All previously raised review issues have been addressed, including the _batch_complete tracking fix, topological_order() defensive copy, buffer_size <= 0 guard, duplicate-column guard, empty-graph early return, size-mismatch check, and re-enqueue guards in both _enqueue_downstream and _reevaluate_batch_tasks.

One issue remains:

  • SliceRef.order=True raises TypeError on mixed-strategy dependency lists — compute_cell_dependencies can return both SliceRef(..., None) (FULL_COLUMN upstream) and SliceRef(..., int) (CELL_BY_CELL upstream) in the same list, which causes a crash when sorted. The existing tests don't cover this case because each tested column's upstreams happen to be a single strategy type.

Confidence Score: 4/5

  • Safe to merge after resolving the SliceRef ordering issue; no existing behavior is changed and all new modules are self-contained.
  • The PR is well-structured with all previously identified correctness and invariant issues resolved, and test coverage is thorough (~60% of added lines). The single remaining issue — SliceRef.order=True failing for mixed-strategy upstream lists — is a latent bug that will surface in PR 3 once a column with both FULL_COLUMN and CELL_BY_CELL upstreams appears. It is a one-line fix (remove order=True or add a custom comparator).
  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/task_model.py — SliceRef order=True ordering bug with mixed None/int row_index values.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/task_model.py Frozen dataclasses for Task, TaskResult, TaskTrace, and SliceRef are well-structured and properly immutable. SliceRef's order=True causes TypeError when sorting mixed None/int row_index values that can arise from compute_cell_dependencies on columns with both FULL_COLUMN and CELL_BY_CELL upstreams.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py Core DAG implementation with Kahn's topological sort, defensive copies on all public accessors, duplicate-column guard, buffer_size guard, and empty-graph guard; logic is sound and all previously flagged issues are addressed.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/completion_tracker.py Event-driven frontier tracker with well-implemented strategy validation, size-mismatch guard, _batch_complete tracking, and re-enqueue guards in both _enqueue_downstream and _reevaluate_batch_tasks; all previously identified invariant gaps are resolved.
packages/data-designer-engine/tests/engine/dataset_builders/utils/test_execution_graph.py Comprehensive test coverage for graph construction, topological sort, critical path, cell deps, mermaid, cycle detection, task counts, and immutability; parametrized buffer_size guard and duplicate-column tests added per prior review.
packages/data-designer-engine/tests/engine/dataset_builders/utils/test_completion_tracker.py Good frontier/strategy/re-enqueue regression coverage, including the new drop_row-unblocks-full-column test; late-upstream re-enqueue regression test was correctly rewritten to actually fire the late event after downstream completion.
packages/data-designer-engine/tests/engine/dataset_builders/utils/test_task_model.py Covers frozen/hashable Task, all task_type literals, TaskResult defaults/error path, TaskTrace mutability and from_task factory; clean and complete.

Sequence Diagram

sequenceDiagram
    participant S as Scheduler (PR 3)
    participant EG as ExecutionGraph
    participant CT as CompletionTracker

    S->>EG: create(column_configs, strategies)
    EG-->>S: graph (DAG, topological order validated)

    S->>CT: with_graph(graph, row_groups)
    CT->>CT: _seed_frontier() — enqueue root column tasks
    CT-->>S: tracker

    loop Scheduler event loop
        S->>CT: get_ready_tasks(dispatched)
        CT-->>S: list[Task] from frontier

        S->>S: dispatch Task(column, row_group, row_index)

        alt Cell task completes
            S->>CT: mark_cell_complete(column, row_group, row_index)
            CT->>CT: _completed[rg][col].add(row_index)
            CT->>CT: _enqueue_downstream() — add newly-ready cell/batch tasks
        else Batch task completes
            S->>CT: mark_row_range_complete(column, row_group, size)
            CT->>CT: _completed[rg][col] = range(size), _batch_complete[rg].add(col)
            CT->>CT: _enqueue_downstream() — add newly-ready downstream tasks
        else Row dropped
            S->>CT: drop_row(row_group, row_index)
            CT->>CT: discard cell tasks for dropped row
            CT->>CT: _reevaluate_batch_tasks() — unblock FULL_COLUMN tasks if all rows done/dropped
        end

        S->>CT: is_all_complete(deps)
        CT-->>S: bool
    end

Last reviewed commit: 7dd6f89

- Rename all_complete → is_all_complete for boolean method convention
- Add ColumnName, RowGroup, RowIndex type aliases for readability
- Add public mutation API to ExecutionGraph (add_column, add_edge,
  set_side_effect, resolve_side_effect) and rewrite build_execution_graph
  to use it instead of private attributes
- Change TaskTrace.from_task from @staticmethod to @classmethod
- Rename RowGroup type alias to RowGroupIndex for consistency
- Convert ExecutionGraph from dataclass to plain class
- Move build_execution_graph logic to ExecutionGraph.create() classmethod

@nabinchha nabinchha left a comment


@andreatgretel a few more comments related to perf!

Optimization Review

High Impact

1. get_ready_tasks is O(C × R × G) on every scheduler tick

This scans every column × every row × every row group on each call. With 10 columns, 10k records, buffer_size=100, that's ~100k iterations per tick, each triggering cell_dependencies() + is_all_complete().

Two suggestions:

  • Early skip for completed column×row_group pairs in the cell-by-cell branch. Before the inner row loop, a quick check like len(completed.get(col, set())) + len(dropped) >= rg_size would let you skip entire blocks.
  • Incremental/event-driven readiness (future PR): maintain a frontier set updated on mark_complete instead of full-scanning. This turns the scheduler from poll-based to event-driven.

2. cell_dependencies allocates a new list + tuples every call

Called per-cell inside the hot loop. For a 100-row batch with 3 upstream columns: 100 list allocations + 300 tuple allocations per column per row group per tick. Since the graph is immutable, the dependency pattern for a given column is always the same — only (row_group, row_index) varies. A cached descriptor that is_all_complete interprets directly could avoid most allocations.

3. is_batch_ready builds full dep list then filters it

deps = graph.cell_dependencies(column, row_group, None, row_group_size)
deps = [(c, rg, ri) for c, rg, ri in deps if ri is None or not self.is_dropped(rg, ri)]

For a full-column downstream of a 1000-row cell-by-cell column, this builds 1000 tuples then creates a second filtered list. Consider checking dropped rows inline or passing the dropped set into the dependency resolution.

Low Impact (fine to defer)

4. topological_order() and columns copy on every access — topological_order() does return list(cache) and is called once per column per row group in get_ready_tasks. Since the graph is immutable and callers don't mutate the result, an internal _topological_order that returns the cached list directly (skipping the copy) would help in the hot path. Same for the columns property.

5. is_all_complete repeated dict lookups — Each (col, rg, ri) tuple triggers self._completed.get(rg, {}).get(col, set()) with temporary empty dict/set allocations on misses. Hoisting the row-group lookup outside the per-cell loop would reduce overhead.

6. _upstream/_downstream are defaultdict but accessors use .get(key, set()) — Allocates a fresh empty set on every miss. Minor, but switching to plain dict would make the no-side-effect intent explicit and avoid the allocation.

Summary

The two highest-impact changes are (1) early-skip logic in get_ready_tasks and (2) reducing per-cell allocations in cell_dependencies. Everything else is micro-optimization that can wait until profiling confirms it matters. Great foundation overall.

@andreatgretel

@nabinchha update on the optimization review after the event-driven frontier refactor:

1. get_ready_tasks O(C × R × G) per tick — addressed. get_ready_tasks is now [t for t in self._frontier if t not in dispatched]. Readiness is computed incrementally in _enqueue_downstream on each mark_complete/mark_batch_complete, so cost is O(downstream_fan_out) per completion instead of O(C × R × G) per tick.

2. cell_dependencies allocations per call — no longer in the hot path. The frontier logic uses graph.upstream_by_strategy (cached) directly. No per-cell list/tuple allocations on each tick.

3. is_batch_ready builds full dep list then filters — removed. Batch readiness is checked inline by _are_cell_ups_complete inside _enqueue_downstream and _reevaluate_batch_tasks, no intermediate list construction.

4–6 (topological_order copies, is_all_complete lookups, defaultdict) — already addressed in previous commits or no longer in the hot path.
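The event-driven frontier pattern described above can be sketched standalone. The names below echo the PR's vocabulary (frontier, get_ready_tasks, mark_complete, re-enqueue guard), but this is a simplified toy that tracks whole columns rather than (column, row_group, row_index) cells — not the actual CompletionTracker:

```python
class FrontierTracker:
    """Toy event-driven readiness tracker (illustrative sketch only)."""

    def __init__(self, downstream: dict[str, set[str]], upstream: dict[str, set[str]]) -> None:
        self._downstream = downstream
        self._upstream = upstream
        self._completed: set[str] = set()
        # Seed the frontier with root nodes (no upstream dependencies).
        self._frontier = {node for node, ups in upstream.items() if not ups}

    def get_ready_tasks(self, dispatched: set[str]) -> list[str]:
        # O(frontier) per tick, not O(all nodes x all rows).
        return [t for t in self._frontier if t not in dispatched]

    def mark_complete(self, node: str) -> None:
        self._completed.add(node)
        self._frontier.discard(node)
        # Only the completed node's fan-out is re-examined: O(downstream_fan_out).
        for down in self._downstream.get(node, set()):
            if down in self._completed:
                continue  # guard against re-enqueueing finished work
            if self._upstream[down] <= self._completed:
                self._frontier.add(down)
```

The key trade: readiness computation moves from the poll site (get_ready_tasks) to the completion event (mark_complete), so idle ticks cost nothing.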

Replace the poll-based get_ready_tasks (O(C × R × G) per tick) with an
event-driven frontier maintained on mark_complete/mark_batch_complete/
drop_row. get_ready_tasks now returns O(frontier) instead of scanning
all columns × rows × row groups.
@andreatgretel andreatgretel requested a review from nabinchha March 2, 2026 20:13
- Add ReadyTasksFixture dataclass and ready_ctx pytest fixture to
  deduplicate graph/tracker/dispatched setup across get_ready_tasks tests
- Align test with ExecutionGraph.create API rename
- Remove redundant inline comments
- CompletionTracker now raises ValueError when graph/row_groups
  are provided without each other
- resolve_side_effect prefers real columns over aliases when a
  name collision exists

@greptile-apps greptile-apps bot left a comment


7 files reviewed, 7 comments

Edit Code Review Agent Settings | Greptile


greptile-apps bot commented Mar 3, 2026

Additional Comments (3)

packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/completion_tracker.py, line 72
self._row_group_sizes[row_group] (accessed at lines 79 and 143) will raise KeyError if mark_complete, mark_batch_complete, or drop_row is ever called with a row_group value that was not included in the row_groups argument at construction. In the async scheduler context, a late-arriving completion event from an unregistered row group would crash the event loop silently.

Add a defensive guard in _enqueue_downstream and _reevaluate_batch_tasks:

rg_size = self._row_group_sizes.get(row_group)
if rg_size is None:
    return

packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/completion_tracker.py, line 86
The condition if any(up not in rg_completed for up in batch_ups) checks only whether a batch-upstream column key exists in the completion dict, not whether all rows are complete. A single call to mark_complete(up_col, rg, 0) on a FULL_COLUMN column creates the column key even though only one row is marked complete.

If the scheduler calls mark_complete on a FULL_COLUMN column (due to strategy mismatch or other paths), downstream FULL_COLUMN tasks with no CELL_BY_CELL upstreams will be enqueued prematurely. The downstream check _are_cell_ups_complete([], ...) returns true for an empty list, bypassing actual completion validation.

Consider tracking batch-level completions separately to distinguish between partial and complete batches, or validate that all rows in batch upstreams are complete before enqueuing downstream tasks.


packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py, line 143
critical_path() calls max(order, ...) on line 136 without checking if order is empty. When no columns have been registered, topological_order() returns [], and max() on an empty sequence raises ValueError: max() arg is an empty sequence.

Add an early return for the empty case:

def critical_path(self) -> list[str]:
    order = self.topological_order()
    if not order:
        return []
    ...

- Fix critical_path() crash on empty graph (early return)
- Fix is_all_complete batch semantics via _batch_complete tracking set
- Add row-group size mismatch validation in mark_row_range_complete
- Add unknown row_group validation in mark_cell_complete
- Rename methods for verb-prefix convention:
  upstream → get_upstream_columns, downstream → get_downstream_columns,
  critical_path → get_longest_dependency_chain,
  mark_complete → mark_cell_complete,
  mark_batch_complete → mark_row_range_complete
- Introduce CellRef NamedTuple, remove ColumnName/RowGroupIndex/RowIndex aliases
- Delete deprecated build_execution_graph() wrapper
- Return defensive copy from topological_order()
- Add regression tests for fixed bugs
Skip adding downstream tasks to the frontier when they are already
marked complete, avoiding redundant work in CompletionTracker.
- Enforce strategy-safe completion: mark_cell_complete rejects
  non-CELL_BY_CELL columns, mark_row_range_complete rejects
  CELL_BY_CELL columns (ValueError in graph mode)
- Return defensive copies from ExecutionGraph public API
  (columns, get_upstream/downstream_columns)
- Add re-enqueue regression tests for cell and batch paths
- Add immutability tests for ExecutionGraph collections
- Reject duplicate column names in add_column with ValueError
- Validate buffer_size > 0 in task_count
- Use _batch_complete for batch upstream readiness checks
- Remove duplicate section header in test file
andreatgretel and others added 2 commits March 5, 2026 18:05
- Add `from __future__ import annotations` to 5 files missing it
- Rename ExecutionGraph methods to start with action verbs
  (strategy → get_strategy, topological_order → get_topological_order,
  upstream_by_strategy → split_upstream_by_strategy,
  task_count → compute_task_count, cell_dependencies → compute_cell_dependencies)
- Reorder methods in CompletionTracker and ExecutionGraph:
  __init__ → properties → classmethods → public → private
…lete API

- Convert CellRef from NamedTuple to frozen dataclass
- Change is_complete to accept CellRef instead of 3 positional args
- Unify batch done-guards in _enqueue_downstream and _reevaluate_batch_tasks
  to use rg_batch_complete instead of rg_completed
self._row_group_sizes = {rg_id: size for rg_id, size in row_groups}
self._seed_frontier()

def mark_cell_complete(self, column: str, row_group: int, row_index: int) -> None:

@nabinchha nabinchha Mar 5, 2026


Suggestion: add cell_ref property to Task and accept CellRef in CompletionTracker methods
Task and CellRef share the same (column, row_group, row_index) coordinates — a Task is essentially a CellRef plus a task_type. Adding a property to make that relationship explicit:

@dataclass(frozen=True)
class Task:
    # ... existing fields ...
    @property
    def cell_ref(self) -> CellRef:
        return CellRef(self.column, self.row_group, self.row_index)

Then mark_cell_complete, is_complete, and drop_row could accept a CellRef instead of flat args:

# Before
tracker.mark_cell_complete(task.column, task.row_group, task.row_index)
# After
tracker.mark_cell_complete(task.cell_ref)

mark_row_range_complete would keep its current signature since it takes row_group_size instead of row_index — the different shape justifies a different signature.

Benefits:

  • Makes the Task/CellRef relationship explicit rather than having overlapping-but-unrelated fields
  • Reduces risk of getting argument order wrong at call sites
  • Cleaner scheduler code in PR 3

Not blocking — fine to defer to a later PR if you'd rather keep this one focused on the current scope.

Contributor Author

makes sense but we don't have real callers yet. deferring to PR 3 where the scheduler will validate the ergonomics.

Comment on lines +44 to +46
column: str
row_group: int
row_index: int | None
Contributor

similar question as above, can this collapse into a CellRef?

Contributor Author

is mutable and closer to Task than CellRef (it also has task_type). could hold a CellRef internally but then you'd do trace.cell_ref.column for the most common fields. deferring for now.

- Split CompletionTracker into __init__() + with_graph() classmethod
- Replace assert with RuntimeError in private methods
- Add get_root_columns() to ExecutionGraph
- Remove "no locks needed" from docstring
- Fix re-enqueue regression test to exercise the actual scenario
- Remove unused ready_ctx fixture parameter

@dataclass(frozen=True, order=True)
class CellRef:
    """Reference to a cell (or batch when row_index is None) in the dataset grid."""
Contributor

Not a blocker, but "batch when row_index is None" feels a bit weird to me – i.e., an object called CellRef representing a batch. What if we had a BatchRef object or RowGroupRef or something that doesn't have a row_index?

Contributor Author

good point. renamed CellRef to SliceRef -- a slice can naturally be one element (a cell) or a whole row group, so the name does not imply single-cell anymore.

A slice naturally represents both a single cell and a full row group,
removing the semantic mismatch of CellRef representing batches.

@johnnygreco johnnygreco left a comment


Fantastic job wrangling the agents and you picky human teammates 🙃

Awesome work @andreatgretel!

Comment on lines +52 to +54
"""
graph = cls()

Contributor

Raw KeyError when column is absent from strategies

strategies[name] raises a bare KeyError if a column name present in column_configs is missing from the caller-supplied strategies dict. Since create() is the public entry-point for constructing the graph, a raw KeyError: 'my_column' will be difficult to diagnose — callers must already know that the lookup is a dict access rather than an explicit validation step.

A small guard makes the contract explicit and the error actionable:

                for sub in sub_configs:
                    name = sub.name
                    if name not in strategies:
                        raise ValueError(
                            f"No strategy provided for column '{name}'. "
                            "Ensure every column has an entry in the 'strategies' dict."
                        )
                    graph.add_column(name, strategies[name])

This is consistent with the pattern already used elsewhere in create() (e.g. the "not a known producer" guard in the second pass).


@andreatgretel

Fantastic job wrangling the agents and you picky human teammates 🙃

Nits were all great though, agents made quite a few poor design decisions 😅 Gotta keep steering them!

@andreatgretel andreatgretel requested a review from nabinchha March 6, 2026 16:03
Comment on lines +10 to +16
@dataclass(frozen=True, order=True)
class SliceRef:
    """Reference to a slice of the execution grid: a single cell or a full row group."""

    column: str
    row_group: int
    row_index: int | None = None
Contributor

SliceRef ordering breaks for mixed-strategy dependency lists

order=True generates comparison operators that compare fields lexicographically. row_index is typed int | None, so comparing two SliceRef objects where one has row_index=None (from a FULL_COLUMN upstream) and another has row_index=2 (from a CELL_BY_CELL upstream) raises TypeError: '<' not supported between instances of 'NoneType' and 'int' in Python 3.

compute_cell_dependencies in execution_graph.py returns a mixed list when a column depends on both a FULL_COLUMN upstream (producing SliceRef(..., None)) and a CELL_BY_CELL upstream (producing SliceRef(..., int)). The test at test_execution_graph.py:312 calls sorted(deps) on the return value and would crash for such a column.

The fix depends on intent:

  • If sorting SliceRefs is not a requirement for internal correctness (no production code paths depend on a sorted deps list), simply remove order=True.
  • If sorted output is needed, override __lt__ to place None before any integer (batch refs sort before cell refs) and supply order=False to avoid the auto-generated broken operators.


@nabinchha nabinchha left a comment


🚢

@andreatgretel andreatgretel merged commit 9889dc1 into main Mar 6, 2026
47 checks passed