
refactor: unify duplicate DAG construction (dag.py + ExecutionGraph)#511

Open
przemekboruta wants to merge 2 commits into NVIDIA-NeMo:main from
przemekboruta:refactor/unify-dag-construction

Conversation

@przemekboruta
Contributor

Summary

Closes #510

  • Deletes dag.py and moves topologically_sort_column_configs into execution_graph.py as a module-level function
  • Replaces networkx.topological_sort with an inline Kahn's algorithm, consistent with ExecutionGraph.get_topological_order
  • Side-effect resolution is now an O(1) dict lookup via side_effect_map; the previous implementation did a linear scan over sum(side_effect_dict.values(), []), which was O(n²)
  • Updates imports in config_compiler.py and test_dag.py
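
For readers unfamiliar with Kahn's algorithm, the sketch below shows the general technique the summary refers to. It is not the code from this PR: the function name, the `upstream` mapping shape, and the `CircularDependencyError` class are assumptions made for illustration, and it assumes every dependency also appears as a key in `upstream`.

```python
from collections import deque


class CircularDependencyError(Exception):
    """Hypothetical stand-in for the project's DAGCircularDependencyError."""


def kahn_topological_sort(upstream: dict[str, set[str]]) -> list[str]:
    """Order nodes so that each one appears after all of its upstream nodes.

    `upstream` maps a node name to the set of names it depends on.
    """
    # Number of dependencies each node is still waiting on.
    in_degree = {node: len(deps) for node, deps in upstream.items()}

    # Invert the mapping: downstream[d] lists the nodes that depend on d.
    downstream: dict[str, list[str]] = {node: [] for node in upstream}
    for node, deps in upstream.items():
        for dep in deps:
            downstream[dep].append(node)

    # Seed the queue in dict insertion order rather than from a set,
    # so ties are broken the same way on every run.
    ready = deque(node for node, degree in in_degree.items() if degree == 0)
    order: list[str] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in downstream[node]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                ready.append(child)

    # Nodes that never reached in-degree 0 are part of a cycle.
    if len(order) != len(upstream):
        stuck = sorted(set(upstream) - set(order))
        raise CircularDependencyError("cycle detected among: " + ", ".join(stuck))
    return order
```

Seeding the queue from a dict rather than a set is one way to keep the output stable when several nodes become ready at the same time.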

Design note

The function is intentionally a module-level function, not a @classmethod on ExecutionGraph. ExecutionGraph is an execution abstraction (requires strategies, manages task scheduling); this function is a compilation step that works on raw ColumnConfigT without strategies. Mixing the two responsibilities would require either dummy strategies or a significant signature change to add_column.

Test plan

  • Existing test_dag.py tests (test_dag_construction, test_circular_dependencies) cover the migrated function with updated import path
  • ruff check passes on all modified files

@przemekboruta przemekboruta requested a review from a team as a code owner April 8, 2026 16:47
@greptile-apps
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR consolidates two overlapping DAG implementations by deleting dag.py and moving topologically_sort_column_configs into execution_graph.py as a module-level function. The migration replaces the old networkx.topological_sort dependency with an inline Kahn's algorithm (consistent with ExecutionGraph.get_topological_order) and improves side-effect resolution from O(n²) linear scan to O(1) dict lookup. The Kahn's algorithm is correctly implemented, import paths in config_compiler.py and test_dag.py are updated, and no logic changes were introduced.

Confidence Score: 5/5

Safe to merge — clean mechanical refactor with no logic changes and correct Kahn's algorithm implementation.

All changes are a straightforward migration of a function between modules with an equivalent algorithm substitution. The Kahn's algorithm is correct, side-effect resolution is logically equivalent and more efficient, imports are updated in all call sites, and existing tests cover the migrated code. No P0 or P1 findings.

No files require special attention.

Vulnerabilities

No security concerns identified.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py | Adds topologically_sort_column_configs as a module-level function using a correct inline Kahn's algorithm; the existing ExecutionGraph class is unchanged. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/config_compiler.py | Import updated from the deleted dag module to execution_graph; no logic changes. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dag.py | File deleted as part of the consolidation; functionality migrated to execution_graph.py. |
| packages/data-designer-engine/tests/engine/dataset_builders/utils/test_dag.py | Import path updated to execution_graph; test logic is identical to the pre-PR version. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["config_compiler.py\ncompile_dataset_builder_column_configs()"]
    B["execution_graph.py\ntopologically_sort_column_configs()"]
    C["ExecutionGraph.create()"]
    D["ExecutionGraph.get_topological_order()\n(Kahn's algorithm)"]
    E["DAGCircularDependencyError"]

    A -- "calls" --> B
    B -- "builds upstream/downstream\ndicts + Kahn's sort" --> D
    C -- "calls after building edges" --> D
    D -- "raises on cycle" --> E
    B -- "raises on cycle" --> E

    style B fill:#d4edda,stroke:#28a745
    style E fill:#f8d7da,stroke:#dc3545

Reviews (3): Last reviewed commit: "test: relax non-deterministic ordering a..."

@przemekboruta przemekboruta force-pushed the refactor/unify-dag-construction branch from 89857f7 to b25ebf6 Compare April 8, 2026 16:50
…ution_graph.py

Eliminates dag.py and its networkx dependency by moving
topologically_sort_column_configs into execution_graph.py as a
module-level function. Side-effect resolution is now O(1) via a
side_effect_map dict (previously O(n²) linear scan). Kahn's algorithm
is reused in-place rather than leaning on networkx.topological_sort.

Closes NVIDIA-NeMo#510

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
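
The side-effect lookup described in this commit message can be illustrated with a small sketch. The PR names a `side_effect_map` dict and a `side_effect_dict` whose values are lists; everything else here (the helper names, the `column_names` parameter, and the exact shape of the data) is an assumption for illustration rather than the project's actual code.

```python
from typing import Optional


def build_side_effect_map(side_effect_dict: dict[str, list[str]]) -> dict[str, str]:
    """Invert {producer: [side_effect, ...]} into {side_effect: producer}.

    Built once up front, so each later lookup is a single dict access
    instead of a scan over every producer's side-effect list.
    """
    side_effect_map: dict[str, str] = {}
    for producer, side_effects in side_effect_dict.items():
        for name in side_effects:
            side_effect_map[name] = producer
    return side_effect_map


def resolve_producer(
    name: str,
    column_names: set[str],
    side_effect_map: dict[str, str],
) -> Optional[str]:
    """Return the column that must run before `name` can be referenced.

    A plain column resolves to itself, a side-effect column resolves to
    the column that produces it, and an unknown name resolves to None.
    """
    if name in column_names:
        return name
    return side_effect_map.get(name)
```

A possible usage, with made-up column names: `resolve_producer("code__trace", {"code"}, build_side_effect_map({"code": ["code__trace"]}))` returns `"code"`, so an edge from `code` to the referencing column can be added without rescanning the side-effect lists.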
@przemekboruta przemekboruta force-pushed the refactor/unify-dag-construction branch from b25ebf6 to 775f147 Compare April 8, 2026 16:52
test_judge and test_code_and_depends_on_validation_reasoning_traces have
no mutual dependency and reach in-degree 0 simultaneously in Kahn's
algorithm. Set iteration order varies with PYTHONHASHSEED, making the
strict list assertion flaky. Assert only the topological invariants.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
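
Asserting "only the topological invariants" might look something like the helper below. This is a sketch, not the test from the PR: the helper name and the `upstream` mapping are hypothetical, and the real test works on column config objects rather than plain strings.

```python
def assert_topological_order(order: list[str], upstream: dict[str, set[str]]) -> None:
    """Check that `order` is a valid topological order for `upstream`.

    `upstream` maps each column name to the names it depends on. The check
    accepts any valid order, so it does not depend on how ties between
    independent columns happen to be broken.
    """
    position = {name: index for index, name in enumerate(order)}
    assert len(position) == len(order), "order contains duplicate entries"
    assert set(order) == set(upstream), "order must contain every column exactly once"
    for name, deps in upstream.items():
        for dep in deps:
            assert position[dep] < position[name], f"{dep} must precede {name}"


# Two independent columns that both depend on "text": either relative
# order between them is acceptable.
upstream = {"text": set(), "test_judge": {"text"}, "test_code": {"text"}}
assert_topological_order(["text", "test_judge", "test_code"], upstream)
assert_topological_order(["text", "test_code", "test_judge"], upstream)
```

Compared with a strict list equality, this kind of check stays green regardless of `PYTHONHASHSEED`, while still failing if a column is emitted before one of its dependencies.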
@nabinchha
Contributor

@przemekboruta there's a larger PR in flight that touches these DAG abstractions. Let's wait until that merges, then rebase and merge this one!

