feat(results): add export() method and --output-format CLI flag #540

Open
przemekboruta wants to merge 3 commits into NVIDIA-NeMo:main from przemekboruta:feat/dataset-export

@przemekboruta
Contributor

Summary

  • Adds DatasetCreationResults.export(path, format=) supporting jsonl, csv, and parquet formats
  • Adds --output-format / -f flag to the data-designer create CLI command; writes dataset.<format> alongside the parquet batch files
  • Default format is jsonl; the parameter is optional in both the Python API and CLI

Usage

Python API:

results = data_designer.create(config, num_records=1000)

results.export("output.jsonl")                    # default: jsonl
results.export("output.csv", format="csv")
results.export("output.parquet", format="parquet")
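
For orientation, here is a minimal standalone sketch of an export helper with the semantics described in this PR. It is not the PR's code: `export_dataframe` is a hypothetical free function that takes a DataFrame directly, whereas the PR adds a method on `DatasetCreationResults`. The names `ExportFormat` and `SUPPORTED_EXPORT_FORMATS` and the `to_json(..., orient="records", lines=True)` call come from the review summary further down; the CSV and parquet branches and the error message are assumptions.

```python
from __future__ import annotations

from pathlib import Path
from typing import Literal, get_args

import pandas as pd

ExportFormat = Literal["jsonl", "csv", "parquet"]
SUPPORTED_EXPORT_FORMATS: tuple[str, ...] = get_args(ExportFormat)


def export_dataframe(
    df: pd.DataFrame, path: str | Path, format: ExportFormat = "jsonl"
) -> Path:
    """Hypothetical helper: write df to path in the given format, return a Path."""
    # Runtime check, because a Literal annotation alone is not enforced.
    if format not in SUPPORTED_EXPORT_FORMATS:
        raise ValueError(
            f"Unsupported export format {format!r}; "
            f"expected one of {SUPPORTED_EXPORT_FORMATS}"
        )
    out = Path(path)  # accept str input, always return a Path
    if format == "jsonl":
        df.to_json(out, orient="records", lines=True)  # one JSON object per line
    elif format == "csv":
        df.to_csv(out, index=False)  # assumption: the index is not exported
    else:
        df.to_parquet(out, index=False)  # requires pyarrow or fastparquet
    return out
```

Note that the format is taken from the `format` argument, not inferred from the file extension, which matches the usage above where `"output.csv"` is still paired with an explicit `format="csv"`.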

CLI:

data-designer create config.yaml --output-format jsonl
data-designer create config.yaml -n 500 -f csv

Test plan

  • test_export_writes_file — parametrized over all 3 formats
  • test_export_jsonl_content — each line is valid JSON
  • test_export_csv_content — header + data round-trip
  • test_export_parquet_content — DataFrame round-trip
  • test_export_default_format_is_jsonl
  • test_export_unsupported_format_raises — raises ValueError
  • test_export_returns_path_object — returns Path for str input
  • Existing CLI delegation tests updated for new output_format parameter
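
The content checks in this plan can be sketched roughly as follows. This is an illustration of the two round-trip ideas (every JSONL line parses, CSV reads back equal), not the tests in the PR; the helper names `jsonl_lines_are_valid` and `csv_round_trips` are hypothetical, and the helpers call pandas directly instead of `DatasetCreationResults.export()`.

```python
import json
from pathlib import Path

import pandas as pd


def jsonl_lines_are_valid(path: Path) -> bool:
    """Return True if every non-empty line of the file parses as a JSON object."""
    with path.open(encoding="utf-8") as fh:
        return all(isinstance(json.loads(line), dict) for line in fh if line.strip())


def csv_round_trips(df: pd.DataFrame, path: Path) -> bool:
    """Write df as CSV without the index, read it back, and compare to the input."""
    df.to_csv(path, index=False)
    return pd.read_csv(path).equals(df)
```

In a pytest suite these would typically be driven with the built-in `tmp_path` fixture, with one parametrized case per format.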

Adds DatasetCreationResults.export(path, format=) supporting jsonl,
csv, and parquet. The CLI create command gains --output-format / -f
which writes dataset.<format> alongside the parquet batch files.
@przemekboruta requested a review from a team as a code owner on April 13, 2026 19:26
@greptile-apps
Contributor

greptile-apps bot commented Apr 13, 2026

Greptile Summary

This PR adds a DatasetCreationResults.export() method supporting jsonl, csv, and parquet formats, and a corresponding --output-format / -f CLI flag for the data-designer create command. The implementation is clean: format validation fires at the top of run_create() before any generation work begins, and all three export paths delegate correctly to pandas serialisation methods on the loaded DataFrame.

Confidence Score: 5/5

Safe to merge — no P0/P1 issues found; validation, export logic, and test coverage are all correct.

All three export paths are correctly implemented, format validation fires before generation begins (addressing the prior thread concern), tests cover all formats plus error paths, and the parameter threading from CLI through controller to results is consistent. No logic errors or correctness issues found.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer/src/data_designer/interface/results.py | Adds ExportFormat Literal, SUPPORTED_EXPORT_FORMATS tuple, and export() method with correct per-format DataFrame serialisation and runtime validation. |
| packages/data-designer/src/data_designer/cli/controllers/generation_controller.py | Adds output_format param with early validation (before generation) and post-generation export call; correctly threads the new parameter through the controller. |
| packages/data-designer/src/data_designer/cli/commands/create.py | Adds --output-format / -f typer option, forwarded unchanged to the controller; no flag conflicts with existing -n / -d / -o options. |
| packages/data-designer/tests/interface/test_results.py | Adds 7 parametrised and targeted tests covering all formats, default behaviour, unsupported format error, and Path return type. |
| packages/data-designer/tests/cli/commands/test_create_command.py | Existing delegation tests updated to pass output_format=None; new test verifies --output-format is forwarded to the controller. |
| packages/data-designer/tests/cli/test_main.py | Minimal update: adds output_format=None to the expected call assertion in the existing dispatch test. |

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as create_command (CLI)
    participant Ctrl as GenerationController
    participant DD as DataDesigner
    participant Results as DatasetCreationResults

    User->>CLI: data-designer create config.yaml -f jsonl
    CLI->>Ctrl: run_create(..., output_format="jsonl")
    Ctrl->>Ctrl: validate output_format in SUPPORTED_EXPORT_FORMATS
    Ctrl->>DD: create(config_builder, num_records, dataset_name)
    DD-->>Ctrl: DatasetCreationResults
    Ctrl->>Results: load_dataset()
    Results-->>Ctrl: pd.DataFrame
    Ctrl->>Results: export(artifact_path/dataset.jsonl, format="jsonl")
    Results->>Results: df.to_json(path, orient="records", lines=True)
    Results-->>Ctrl: Path("...dataset.jsonl")
    Ctrl-->>User: print "Exported to: ..."
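
The ordering in this diagram (validate the format, then generate, then export) can be sketched in plain Python. This is a hedged illustration, not the controller from the PR: `run_create_sketch` and its `generate` / `export` callables are hypothetical stand-ins for `GenerationController.run_create()`, `DataDesigner.create()`, and `DatasetCreationResults.export()`, and skipping the export when no format is given is an assumption.

```python
from pathlib import Path
from typing import Callable, Optional

SUPPORTED_EXPORT_FORMATS = ("jsonl", "csv", "parquet")


def run_create_sketch(
    generate: Callable[[], list],
    export: Callable[[list, Path, str], Path],
    artifact_path: Path,
    output_format: Optional[str] = None,
) -> Optional[Path]:
    """Hypothetical controller flow: fail fast on a bad format, then generate."""
    # Validate first so a typo in the flag fails before any generation work runs.
    if output_format is not None and output_format not in SUPPORTED_EXPORT_FORMATS:
        raise ValueError(
            f"Unsupported output format {output_format!r}; "
            f"expected one of {SUPPORTED_EXPORT_FORMATS}"
        )
    records = generate()
    if output_format is None:
        return None  # assumption: no extra file is written when the flag is omitted
    # The export lands next to the batch files as dataset.<format>.
    return export(records, artifact_path / f"dataset.{output_format}", output_format)
```

Validating before calling `generate` is the point of the design: an unsupported value costs nothing, instead of surfacing only after a long generation run has finished.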

Reviews (3): Last reviewed commit: "fix(cli): remove top-level results impor..."
