Skip to content

fix: raise clear error when all records are dropped during generation#383

Merged
nabinchha merged 6 commits intomainfrom
nmulepati/fix/382-misleading-profiler-error
Mar 10, 2026
Merged

fix: raise clear error when all records are dropped during generation#383
nabinchha merged 6 commits intomainfrom
nmulepati/fix/382-misleading-profiler-error

Conversation

@nabinchha
Copy link
Contributor

@nabinchha nabinchha commented Mar 9, 2026

Summary

  • When every record fails during generation (e.g. LLM or image model errors), the empty dataset previously reached the profiler, which raised a misleading DatasetProfilerConfigurationError about missing columns.
  • Both preview() and create() now check for an empty dataset before profiling and raise a DataDesignerGenerationError with an actionable message instead.
  • Added tests for both code paths to verify the correct error is raised.

Closes #382

Test plan

  • test_create_raises_generation_error_when_dataset_is_empty — verifies create() raises DataDesignerGenerationError when all records are dropped
  • test_preview_raises_generation_error_when_dataset_is_empty — verifies preview() raises DataDesignerGenerationError when all records are dropped
  • Existing tests pass (make test)

closes #382

…#382)

When every record fails during generation (e.g. LLM or image model errors),
the empty dataset previously reached the profiler, which raised a misleading
DatasetProfilerConfigurationError about missing columns. Now both preview()
and create() check for an empty dataset before profiling and raise a
DataDesignerGenerationError with an actionable message instead.

Made-with: Cursor
@nabinchha nabinchha requested a review from a team as a code owner March 9, 2026 16:48
@nabinchha nabinchha self-assigned this Mar 9, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR fixes a poor user experience where all records being dropped during generation (due to LLM or image model errors) caused an empty dataset to propagate to the profiler, which then raised a misleading DatasetProfilerConfigurationError about missing columns. The fix adds explicit empty-dataset checks in both create() and preview(), surfacing a clear DataDesignerGenerationError instead.

Key changes:

  • create(): A new try/except block wraps load_dataset_with_dropped_columns() to catch storage-level exceptions (e.g. no parquet files written) as DataDesignerGenerationError; a subsequent len == 0 guard catches the defensive case of an empty-but-loadable dataset.
  • preview(): An early len(processed_dataset) == 0 guard after process_preview() raises DataDesignerGenerationError before reaching the profiler; the message correctly attributes the cause to "generation or processing failures" to cover both row-filter processors and LLM failures.
  • The now-redundant len(processed_dataset) > 0 condition is removed from the success-log check in preview(), since that point is only reachable with a non-empty dataset.
  • Three new tests cover the create() load-failure, create() empty-DataFrame, and preview() empty-DataFrame paths.

Confidence Score: 4/5

  • This PR is safe to merge; it improves error clarity without changing any happy-path logic.
  • The changes are focused and targeted — they add early-exit guards without touching the generation or profiling pipelines. The except Exception wrapper for load_dataset_with_dropped_columns() is intentionally broad (to catch any storage-level failure) and follows the same pattern used elsewhere in the file. The three new tests correctly exercise the added code paths. The stub_sampler_only_config_builder fixture does not invoke LLM calls, so the tests that run the real build pipeline with num_records=1 are stable. No regressions are expected in the existing test suite.
  • No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer/src/data_designer/interface/data_designer.py Adds early empty-dataset guards in both create() and preview() with dedicated try/except for load_dataset_with_dropped_columns(); removes now-redundant len(processed_dataset) > 0 guard from the success-log condition.
packages/data-designer/tests/interface/test_data_designer.py Adds three new error-path tests: empty DataFrame from load_dataset_with_dropped_columns, FileNotFoundError from same, and empty DataFrame from process_preview; tests for create() run the real sampler build pipeline before hitting the patch.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[create / preview called] --> B[builder.build / build_preview + process_preview]
    B -- Exception --> C[raise DataDesignerGenerationError\n'Error generating dataset']
    B -- Success --> D{create or preview?}

    D -- create --> E[load_dataset_with_dropped_columns]
    E -- Exception --> F[raise DataDesignerGenerationError\n'Failed to load generated dataset']
    E -- Empty DataFrame --> G[raise DataDesignerGenerationError\n'Dataset is empty']
    E -- Non-empty DataFrame --> H[profiler.profile_dataset]

    D -- preview --> I{len processed_dataset == 0?}
    I -- Yes --> J[raise DataDesignerGenerationError\n'Dataset is empty']
    I -- No --> H

    H -- Exception --> K[raise DataDesignerProfilingError]
    H -- Success --> L[Return Results]
Loading

Last reviewed commit: e686dbb

nabinchha and others added 2 commits March 9, 2026 11:44
Moves the call back inside the try/except block so ArtifactStorageError
is caught and re-raised as DataDesignerProfilingError, preserving the
documented API contract. Also reduces test num_records to 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nabinchha nabinchha requested a review from andreatgretel March 10, 2026 15:15
@andreatgretel
Copy link
Contributor

packages/data-designer/src/data_designer/interface/data_designer.py:298

if (
    len(processed_dataset) > 0
    and isinstance(analysis, DatasetProfilerResults)
    and len(analysis.column_statistics) > 0
):

nit: I think len(processed_dataset) > 0 is always true here now that the guard at line 275 raises on empty datasets. Might be nice to simplify the condition?

andreatgretel
andreatgretel previously approved these changes Mar 10, 2026
Copy link
Contributor

@andreatgretel andreatgretel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of nits, nothing blocking. Thanks for addressing the create() case too!

@nabinchha
Copy link
Contributor Author

Just a couple of nits, nothing blocking. Thanks for addressing the create() case too!

cleaned up in e686dbb

@nabinchha nabinchha requested a review from andreatgretel March 10, 2026 15:47
@nabinchha nabinchha merged commit 340087f into main Mar 10, 2026
47 checks passed
@nabinchha nabinchha deleted the nmulepati/fix/382-misleading-profiler-error branch March 10, 2026 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Misleading DatasetProfilerConfigurationError when all records fail during generation

2 participants