
fix: handle discriminated unions in oneOf pruning validator #376

Merged
andreatgretel merged 4 commits into main from andreatgretel/fix/discriminated-union-pruning on Mar 6, 2026

@andreatgretel
Contributor

📋 Summary

Fixes validation failures when using Pydantic discriminated unions (e.g. AlphaItem | BetaItem) with LLMStructuredColumnConfig. When the LLM leaks a property across variants, the pruning validator corrupts the instance by stripping properties during failed variant checks, causing all variants to fail.

Fixes #375

🐛 Fixed

  • The pruning-extended jsonschema validator modifies instances in-place during oneOf validation. When trying a wrong variant first, it strips properties that belong to the correct variant — by the time the correct variant is checked, its required fields are gone and validation fails with zero matches.
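
The failure mode described above can be reproduced with a toy model. This is only a sketch, not the project's validator: `prune_and_check`, `naive_one_of`, and the field names `kind`, `alpha_value`, and `beta_value` are hypothetical stand-ins, and each variant is reduced to a set of known properties plus a set of required ones.

```python
# Toy variant descriptions (hypothetical; real variants are JSON Schemas).
ALPHA = {"kind": "alpha", "properties": {"kind", "alpha_value"}, "required": {"kind", "alpha_value"}}
BETA = {"kind": "beta", "properties": {"kind", "beta_value"}, "required": {"kind", "beta_value"}}


def prune_and_check(instance: dict, variant: dict) -> bool:
    """Mimic a pruning validator: strip unknown keys in place, then validate."""
    for key in list(instance):
        if key not in variant["properties"]:
            del instance[key]  # the mutation persists even if this variant fails
    has_required = variant["required"].issubset(instance)
    kind_matches = instance.get("kind") == variant["kind"]
    return has_required and kind_matches


def naive_one_of(instance: dict, variants: list[dict]) -> int:
    """Try every variant in order on the same object; return the match count."""
    return sum(prune_and_check(instance, variant) for variant in variants)


# A beta record with one leaked alpha property.
record = {"kind": "beta", "beta_value": 2, "alpha_value": 1}
matches = naive_one_of(record, [ALPHA, BETA])
print(matches, record)
```

In this model the alpha check removes `beta_value` before failing on `kind`, so the later beta check finds its required field missing: `matches` ends up as 0 and `record` is left holding only `kind`.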

🔄 Changes

🔧 Changed

  • validators.py — Added _validate_one_of_with_discriminator, a oneOf validator that reads the discriminator mapping to select the correct variant directly instead of trying all variants. Registered alongside prune_additional_properties in extend_jsonschema_validator_with_pruning. Falls back to standard oneOf for schemas without a discriminator.
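
The routing idea can be sketched without the jsonschema library. The block below is an illustration under assumptions, not the code added in `validators.py`: `resolve_local_ref`, `prune_and_check`, `plain_one_of`, the string error messages, and the `kind`/`alpha_value`/`beta_value` fields are all hypothetical, and only local `#/...` references are supported.

```python
from collections.abc import Callable, Iterator

# Hand-written schema in the shape Pydantic emits for a discriminated union.
SCHEMA = {
    "$defs": {
        "AlphaItem": {"properties": {"kind": {}, "alpha_value": {}}, "required": ["kind", "alpha_value"]},
        "BetaItem": {"properties": {"kind": {}, "beta_value": {}}, "required": ["kind", "beta_value"]},
    },
    "discriminator": {
        "propertyName": "kind",
        "mapping": {"alpha": "#/$defs/AlphaItem", "beta": "#/$defs/BetaItem"},
    },
    "oneOf": [{"$ref": "#/$defs/AlphaItem"}, {"$ref": "#/$defs/BetaItem"}],
}


def resolve_local_ref(root: dict, ref: str) -> dict:
    """Follow a local '#/a/b' pointer inside the root schema."""
    node = root
    for part in ref.removeprefix("#/").split("/"):
        node = node[part]
    return node


def prune_and_check(instance: dict, variant: dict) -> Iterator[str]:
    """Prune unknown keys in place, then report missing required fields."""
    for key in list(instance):
        if key not in variant.get("properties", {}):
            del instance[key]
    for name in variant.get("required", []):
        if name not in instance:
            yield f"{name!r} is a required property"


def plain_one_of(schema: dict, instance: object) -> Iterator[str]:
    """Stand-in for the stock oneOf check used as the fallback."""
    return iter(())


def one_of_with_discriminator(
    root: dict, schema: dict, instance: object,
    fallback: Callable[[dict, object], Iterator[str]],
) -> Iterator[str]:
    discriminator = schema.get("discriminator")
    if not discriminator or not isinstance(instance, dict):
        yield from fallback(schema, instance)
        return
    prop_name = discriminator.get("propertyName")
    mapping = discriminator.get("mapping") or {}
    if prop_name not in instance or not mapping:
        yield from fallback(schema, instance)
        return
    matched_ref = mapping.get(instance[prop_name])
    if matched_ref is None:
        yield f"{instance[prop_name]!r} is not a valid discriminator value"
        return
    # Only the selected variant ever sees (and prunes) the instance.
    yield from prune_and_check(instance, resolve_local_ref(root, matched_ref))


record = {"kind": "beta", "beta_value": 2, "alpha_value": 1}
errors = list(one_of_with_discriminator(SCHEMA, SCHEMA, record, plain_one_of))
bad = list(one_of_with_discriminator(SCHEMA, SCHEMA, {"kind": "gamma"}, plain_one_of))
```

Here `errors` is empty and `record` keeps `kind` and `beta_value` while the leaked `alpha_value` is pruned; `bad` carries a single message about the unknown discriminator value.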

🧪 Tests

  • test_validators.py — Added discriminated union schema fixture, parametrized test for leaked properties in both directions (alpha→beta, beta→alpha), and test for invalid discriminator values.

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

  • _validate_one_of_with_discriminator — This assumes discriminator mappings use $ref values (which Pydantic always generates). Non-discriminated oneOf schemas fall back to the original jsonschema oneOf validator unchanged.
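
For reviewers who want to probe that assumption, a small guard like the following could be used. It is a hypothetical helper (the name `mapping_targets_declared_variants` and the example schemas are not from the PR) that checks whether every discriminator mapping value is a `$ref` string that also appears among the `oneOf` variants.

```python
def mapping_targets_declared_variants(schema: dict) -> bool:
    """True when each mapping value is a $ref that is also listed in oneOf."""
    mapping = schema.get("discriminator", {}).get("mapping", {})
    declared = {
        variant.get("$ref")
        for variant in schema.get("oneOf", [])
        if isinstance(variant, dict)
    }
    return bool(mapping) and all(
        isinstance(ref, str) and ref in declared for ref in mapping.values()
    )


ONE_OF = [{"$ref": "#/$defs/AlphaItem"}, {"$ref": "#/$defs/BetaItem"}]

# Shape Pydantic generates: mapping values are $ref strings.
PYDANTIC_STYLE = {
    "discriminator": {
        "propertyName": "kind",
        "mapping": {"alpha": "#/$defs/AlphaItem", "beta": "#/$defs/BetaItem"},
    },
    "oneOf": ONE_OF,
}

# Hand-written schema using bare schema names instead of $ref values.
BARE_NAMES = {
    "discriminator": {
        "propertyName": "kind",
        "mapping": {"alpha": "AlphaItem", "beta": "BetaItem"},
    },
    "oneOf": ONE_OF,
}

print(mapping_targets_declared_variants(PYDANTIC_STYLE))
print(mapping_targets_declared_variants(BARE_NAMES))
```

The Pydantic-style schema passes the check, while the bare-name variant does not, which is the kind of schema the assumption above would not cover.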

🤖 Generated with AI

The pruning validator modifies instances in-place during oneOf
validation. When trying a wrong variant, it strips properties needed
by the correct variant, causing all variants to fail.

Add a discriminator-aware oneOf validator that reads the discriminator
mapping to select the correct variant directly, skipping the
try-all-variants loop that causes the corruption.

Fixes #375
@andreatgretel requested a review from a team as a code owner on March 6, 2026, 15:50
@greptile-apps
Contributor

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR fixes a bug where Pydantic discriminated union validation (AlphaItem | BetaItem) would fail when the LLM leaked cross-variant properties: the pruning-extended oneOf validator was trying every subschema in order, and the in-place pruning from the wrong variant would strip fields required by the correct variant before it was ever tried. The fix adds _validate_one_of_with_discriminator, which reads the JSON Schema discriminator.mapping to identify and validate against the correct variant directly, avoiding the cross-variant mutation entirely. Non-discriminated oneOf schemas fall through to the original validator unchanged.

Key changes:

  • validators.py: New _validate_one_of_with_discriminator generator registered as the oneOf handler in extend_jsonschema_validator_with_pruning; the discriminator path skips multi-variant iteration entirely, while schemas without a discriminator key fall back to Draft202012Validator.VALIDATORS["oneOf"]
  • test_validators.py: Four new tests covering both leak directions (alpha→beta, beta→alpha), invalid discriminator value rejection, and plain oneOf fallback

The implementation correctly assumes Pydantic-generated discriminator schemas where the mapping always points to declared oneOf variants.

Confidence Score: 4/5

  • Safe to merge; core fix is correct and well-tested, with only a minor type annotation issue.
  • The fix correctly addresses the discriminated union pruning bug by routing validation through the discriminator mapping, avoiding cross-variant corruption entirely. The new tests cover both leak directions and edge cases (invalid discriminator values, non-discriminated oneOf fallback). One small issue prevents a perfect score: the return type annotation on _validate_one_of_with_discriminator uses Any instead of Iterator[ValidationError], which should be corrected for proper type-checker support. This is a style/documentation issue with no impact on runtime correctness.
  • packages/data-designer-engine/src/data_designer/engine/processing/gsonschema/validators.py (return type annotation on lines 62-64)

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-engine/src/data_designer/engine/processing/gsonschema/validators.py | Adds _validate_one_of_with_discriminator to route oneOf validation through the discriminator mapping, preventing in-place pruning corruption. The function correctly implements the discriminator-based routing for Pydantic schemas. One minor issue: the return type annotation uses Any instead of Iterator[ValidationError], which should be fixed for proper type-checker support. |
| packages/data-designer-engine/tests/engine/processing/gsonschema/test_validators.py | Adds four new tests covering: discriminated union pruning with leaked fields in both directions, invalid discriminator value rejection, and non-discriminated oneOf fallback. Coverage is comprehensive for the changed logic. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["_validate_one_of_with_discriminator(validator, one_of, instance, schema)"] --> B{Has discriminator?\ninstance is dict?}
    B -- No --> C["Fallback: Draft202012Validator.VALIDATORS['oneOf']\n(tries all variants — pruning bug still possible)"]
    B -- Yes --> D{prop_name in instance\nAND mapping non-empty?}
    D -- No --> C
    D -- Yes --> E["matched_ref = mapping.get(instance[prop_name])"]
    E --> F{matched_ref is None?}
    F -- Yes --> G["yield ValidationError\n'X is not a valid discriminator value'"]
    F -- No --> H["matched_schema = {'$ref': matched_ref}"]
    H --> I["errs = list(validator.descend(instance, matched_schema))"]
    I --> J["prune_additional_properties fires\nin-place on instance\n(only removes non-variant fields)"]
    J --> K["yield from errs\n(empty = valid, non-empty = invalid)"]

Last reviewed commit: 7d7a1a8

@andreatgretel andreatgretel merged commit 3f8d735 into main Mar 6, 2026
47 checks passed
Comment on lines +62 to +64
def _validate_one_of_with_discriminator(
    validator: Any, one_of: list[JSONSchemaT], instance: DataObjectT, schema: JSONSchemaT
) -> Any:

Return type annotation too broad for a generator

_validate_one_of_with_discriminator is a generator function (uses yield/yield from), so its return type should reflect that rather than Any. This improves type-checker support and documents the generator contract.

Suggested change
- def _validate_one_of_with_discriminator(
-     validator: Any, one_of: list[JSONSchemaT], instance: DataObjectT, schema: JSONSchemaT
- ) -> Any:
+ def _validate_one_of_with_discriminator(
+     validator: Any, one_of: list[JSONSchemaT], instance: DataObjectT, schema: JSONSchemaT
+ ) -> Iterator[lazy.jsonschema.ValidationError]:

You'll also need to add Iterator to the typing import at the top of the file.
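
As a small illustration of the annotation being suggested, the sketch below types a generator as `Iterator[...]`. It is a standalone example, not the project's code: `ValidationError` here is a local stand-in class and `find_missing` is a hypothetical function. Note that `Iterator` can be imported from `collections.abc` as well as from `typing`; since Python 3.9 the `typing` name is a deprecated alias of the `collections.abc` one.

```python
from collections.abc import Iterator


class ValidationError(Exception):
    """Local stand-in for a validation error type (not the jsonschema class)."""


def find_missing(instance: dict, required: list[str]) -> Iterator[ValidationError]:
    """Generator: yields one error per missing required key, nothing if valid."""
    for name in required:
        if name not in instance:
            yield ValidationError(f"{name!r} is a required property")


errors = list(find_missing({"kind": "beta"}, ["kind", "beta_value"]))
print([str(error) for error in errors])
```

With this annotation a type checker can tell that callers receive an iterator of errors, so `list(...)`, `for` loops, and `yield from` over the result are checked against the element type instead of being treated as `Any`.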

Development

Successfully merging this pull request may close these issues.

Discriminated unions with LLM-structured output cause record-level validation failure

2 participants