Skip to content

fix: support nested field access in schema transform templates#435

Merged
andreatgretel merged 5 commits intomainfrom
andreatgretel/feat/nested-schema-transform
Mar 18, 2026
Merged

fix: support nested field access in schema transform templates#435
andreatgretel merged 5 commits intomainfrom
andreatgretel/feat/nested-schema-transform

Conversation

@andreatgretel
Copy link
Copy Markdown
Contributor

Summary

Schema transform templates now support nested dot notation on deserialized JSON columns:

template = {
    "score": "{{ result.quality.score }}",
    "label": "{{ result.quality.label }}",
}

Previously, {{ result.quality.score }} failed with 'str object' has no attribute 'quality'.

Why this needed a wrapper (TemplateValue) instead of a simpler fix

Schema transform is unlike other Jinja2 consumers in DataDesigner (prompts, expressions) because its output must be valid JSON. This creates a tension:

  1. Nested access needs values to be dicts - Jinja2 resolves {{ x.y }} by looking up y on x
  2. JSON output needs values to render as escaped strings - He said "hello" must become He said \"hello\" inside a JSON string

The old code (_json_escape_record) chose option 2: flatten all dicts to escaped strings before Jinja2 sees them. Safe JSON, but no nested access.

We can't just pass plain dicts and escape only the leaves either. When Jinja2 converts a whole dict to string (for {{ result }}), it uses Python repr: {'key': 'val'} with single quotes - invalid JSON. The escaping needs to happen dynamically at render time depending on whether Jinja2 is drilling into a value or interpolating it.

TemplateValue solves this by deferring the decision:

  • Dot access ({{ result.quality }}) triggers __getattr__, which returns a new TemplateValue wrapping the nested value
  • String interpolation ({{ result }}) triggers __str__, which applies a caller-provided escape function

The escape function (_escape_value_for_json) is specific to schema transform. Other Jinja2 consumers (prompt templates, expression columns) don't need any of this - Jinja2 natively handles dot access on plain dicts via getattr-to-getitem fallback, and plain str() is fine for text output.

Changes

  • ginja/record.py - new TemplateValue class + wrap_record helper
  • ginja/environment.py - prepare_jinja2_template_renderer accepts optional record_str_fn; safe_render accepts skip_record_sanitization (same pattern as existing skip_template_validation)
  • processors/schema_transform.py - replaced _json_escape_record with _escape_value_for_json passed as record_str_fn
  • test_schema_transform.py - regression tests for nested dot access, mixed nested+flat, and list indexing

Test plan

  • Existing schema transform tests pass (special chars, JSON serialized values, preview mode)
  • New parametrized tests cover nested dot access, mixed nested+flat, list indexing
  • All 150 processing tests + 191 column generator tests pass (backwards compatible)

Enable {{ result.quality.score }} style dot notation in schema
transform Jinja2 templates, where result is a deserialized JSON column.

Previously, _json_escape_record flattened all dict values to escaped
JSON strings before Jinja2 saw them. This made the rendered output
valid JSON but prevented nested access since Jinja2 only saw strings.

The fix introduces TemplateValue, a wrapper that defers the choice
between "drill into nested dict" and "render as escaped string" to
template evaluation time. Jinja2 resolves dot notation via __getattr__
(returning a new TemplateValue for the nested value), and converts to
string via __str__ (delegating to a caller-provided str_fn). This is
necessary because plain dicts render as Python repr ({'key': 'val'})
which is invalid JSON - we need to control __str__ to produce properly
escaped JSON, and that requires a wrapper object.

Other Jinja2 consumers (prompt templates, expression columns) don't
need this - Jinja2 natively supports dot access on plain dicts via
getattr-to-getitem fallback, and plain str() is fine for text output.
Schema transform is unique because its output must be valid JSON.
@andreatgretel andreatgretel requested a review from a team as a code owner March 18, 2026 12:59
@andreatgretel andreatgretel changed the title feat: support nested field access in schema transform templates fix: support nested field access in schema transform templates Mar 18, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR fixes a bug where schema transform templates could not perform nested dot-notation access on deserialized JSON columns (e.g. {{ result.quality.score }} would fail because the old _json_escape_record helper pre-flattened all dicts to escaped strings before Jinja2 saw them, leaving result as a plain str). The solution is clean and well-motivated: instead of pre-processing the record into strings, the raw deserialized dicts are now passed directly to Jinja2, and a finalize hook (_escape_value_for_json) handles JSON escaping at the point of interpolation. A companion prefer_dict_key_access override on getattr ensures that dict keys like "items" or "values" take priority over Python built-in methods (which ImmutableSandboxedEnvironment would otherwise return as unsafe_undefined, silently blocking the fallback to key lookup).

Key changes:

  • _json_escape_record (record-level pre-escaping) removed; replaced with _escape_value_for_json (value-level finalize function) in schema_transform.py
  • UserTemplateSandboxEnvironment gains an opt-in prefer_dict_key_access mode that prioritises dict key lookup in getattr, fixing the items/values/keys shadowing problem
  • prepare_jinja2_template_renderer accepts an optional record_str_fn; when set, both finalize and prefer_dict_key_access are enabled together
  • Note: the PR description's "Changes" section lists ginja/record.py as modified (to add a TemplateValue class), but the final implementation replaced that approach with the Jinja2 finalize hook (commit 5f90cbf); the description is slightly out-of-date
  • New parametrized regression tests cover nested dot access, mixed nested+flat rendering, and list indexing with a shadowed "items" key

Confidence Score: 4/5

  • Safe to merge; logic is correct and all previous review concerns have been addressed.
  • The core fix (deferring JSON escaping to the Jinja2 finalize hook instead of pre-flattening dicts) is sound and well-tested. The prefer_dict_key_access override is safe because sanitize_record guarantees all dict values are basic JSON types before rendering, so bypassing the sandbox's unsafe-attribute check for dict-key access carries no security risk. The two minor deductions are: (1) _escape_value_for_json has no explicit guard for non-finite floats (nan/inf), relying on sanitize_record to catch them upstream; and (2) prefer_dict_key_access and record_str_fn are implicitly coupled in the public API, which may limit future flexibility.
  • No files require special attention; the environment change is the most architecturally significant but is well-contained.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py Adds prefer_dict_key_access flag to UserTemplateSandboxEnvironment that overrides getattr to prefer dict key lookup over Python attribute/method access; extends prepare_jinja2_template_renderer to accept an optional record_str_fn which is wired up as Jinja2's finalize hook and automatically enables prefer_dict_key_access. Logically correct and safe after JSON-sanitization guarantees basic types, but the two concerns are coupled in the public API.
packages/data-designer-engine/src/data_designer/engine/processing/processors/schema_transform.py Replaces _json_escape_record (which pre-flattened all dicts to escaped strings, preventing nested access) with _escape_value_for_json (a single-value escape function passed as record_str_fn). The deserialized record is now passed directly to render_template, with escaping deferred to Jinja2's finalize hook. Clean and correct.
packages/data-designer-engine/tests/engine/processing/processors/test_schema_transform.py Adds well-structured parametrized tests covering nested dot access, mixed nested+flat rendering, and list indexing. Uses clear pytest.param with id= labels. No fragile branching heuristics. The items key test case specifically validates the prefer_dict_key_access behaviour.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[SchemaTransformProcessor] --> B[prepare renderer\nwith escape finalize fn]
    B --> C[UserTemplateSandbox\nfinalize=escape fn\nprefer dict key access]
    A --> D[Loop over DataFrame rows]
    D --> E[deserialize JSON values\nstrings become Python dicts]
    E --> F[render template]
    F --> G[sanitize record\nJSON round-trip]
    G --> H{Expression type}
    H -->|dot access: result.quality.score| I[getattr override\ndict key takes priority\nover Python method]
    H -->|whole dict: result| J[return full dict]
    I --> K[finalize: escape for JSON\nstr for numbers\njson.dumps for strings\ndouble-encode for dicts]
    J --> K
    K --> L[Rendered JSON string]
    L --> M[json.loads to output DataFrame]
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py
Line: 449-452

Comment:
**`prefer_dict_key_access` and `finalize` are always coupled together**

`finalize` (JSON escaping) and `prefer_dict_key_access` (nested-dict attribute resolution) are logically independent concerns, but the current implementation always sets them together whenever `record_str_fn is not None`. A future caller who wants one without the other—e.g., nested dot access on a non-JSON output, or a custom finalizer without the dict-key priority—would have to bypass `prepare_jinja2_template_renderer` and construct `UserTemplateSandboxEnvironment` directly.

Both parameters exist independently on `UserTemplateSandboxEnvironment.__init__`, so consider exposing `prefer_dict_key_access` as its own optional argument here to make the two axes of configuration independently reachable:

```python
env_kwargs: dict[str, Any] = {}
if record_str_fn is not None:
    env_kwargs["finalize"] = record_str_fn
if record_str_fn is not None or prefer_dict_key_access:
    env_kwargs["prefer_dict_key_access"] = True
```

Or at minimum, a brief docstring note clarifying that the flag is currently an implicit side-effect of providing `record_str_fn` would help future readers understand the intentional coupling.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/processing/processors/schema_transform.py
Line: 33-44

Comment:
**No explicit guard for `float` special values (`nan` / `inf`)**

For numeric types that are not `bool`, `str`, `dict`, `list`, or `None`, the function falls through to `return str(value)`. This is correct for ordinary integers and floats, but `str(float('nan'))``"nan"` and `str(float('inf'))``"inf"`, which are not valid JSON tokens. If either value were to reach `_escape_value_for_json` they would produce a string that survives rendering and then fail silently (or raise) only at the `json.loads(rendered)` call in `_transform`, which makes the error harder to trace.

In practice `sanitize_record` (which runs a JSON round-trip via `serialize_data`) should catch or normalize NaN/Inf before rendering. But if `serialize_data` happens to use a lenient serializer (e.g. `allow_nan=True`), those values pass through. Adding an explicit guard makes the contract clear:

```python
if isinstance(value, float) and not (value == value):  # nan check
    return "null"
```

or simply document the assumption that callers guarantee `value` has already passed a strict JSON serialization step.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "docs: add missing pa..."

@andreatgretel
Copy link
Copy Markdown
Contributor Author

(AR) Suggestion: Multi-template path not extended with record_str_fn support

packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py:455

What: prepare_jinja2_multi_template_renderer and render_multi_template do not support record_str_fn, creating API asymmetry.

Why: The two rendering paths are now architecturally divergent. Future callers needing nested access in multi-template scenarios have no path.

Suggestion:

At minimum add a comment noting the gap. Ideally extract wrapping logic into a shared helper.

@andreatgretel
Copy link
Copy Markdown
Contributor Author

(AR) Suggestion: No test coverage for error paths in nested access

packages/data-designer-engine/tests/engine/processing/processors/test_schema_transform.py:210

What: Tests cover happy paths but not missing keys, None intermediates, or type errors in nested access.

Why: TemplateValue introduces new error surfaces (AttributeError on missing keys) that are untested.

Suggestion:

Add parametrized error cases for missing nested keys and None intermediate values.

@andreatgretel
Copy link
Copy Markdown
Contributor Author

(AR) This PR introduces TemplateValue, a wrapper enabling Jinja2 dot notation on deserialized JSON columns in schema transform templates. The design is well-motivated and the code is clean, with good separation between the generic wrapping mechanism in record.py and the JSON-specific escaping in schema_transform.py. The review covered all four changed files across processors, ginja environment, record handling, and tests.

The most important finding is a concrete bug: _escape_value_for_json produces Python-style "True"/"False" instead of JSON-conventional "true"/"false" for boolean values, because bool falls through all isinstance checks to the str(value) fallback. This was confirmed via smoke test. Additionally, the new _record_str_fn attribute lacks a class-level annotation (risking AttributeError if render_template is called before preparation), and the skip_record_sanitization parameter on safe_render widens the public security surface unnecessarily.

Verdict: needs-changes — 1 critical, 3 warnings, 5 suggestions.

- Fix boolean serialization: add bool check before str in _escape_value_for_json
  to produce JSON 'true'/'false' instead of Python 'True'/'False'
- Add class-level _record_str_fn annotation to WithJinja2UserTemplateRendering
- Rename skip_record_sanitization to _skip_record_sanitization (underscore prefix)
  to signal internal-only usage, and document it in safe_render docstring
- Add defensive error handling in TemplateValue.__getitem__ and __iter__
- Promote test input data to parametrize column, removing brittle string scan
- Add __eq__ and __hash__ to TemplateValue so Jinja2 equality
  conditionals (e.g. {% if result.label == "excellent" %}) work
- Add inline comment explaining deliberate double-encode in
  _escape_value_for_json for dict/list values
- Default _record_str_fn to None at class level so accessing it
  before prepare_jinja2_template_renderer doesn't mask the real error
@nabinchha
Copy link
Copy Markdown
Contributor

Small design thought, not a blocker: I wonder if there's a simpler path here using Jinja's finalize hook, so the template context can stay as plain sanitized dict / list data instead of introducing TemplateValue. Something along these lines:

env = UserTemplateSandboxEnvironment(...)
env.finalize = _escape_value_for_json
rendered = env.from_string(template).render(sanitize_record(record))

Since Jinja already supports nested access like {{ result.quality.score }} on normal dict-like data, this might preserve the nested lookup behavior while keeping the rendering model a bit simpler. I may be missing an edge case around the {{ result }} / JSON-string behavior, so totally fine if you already explored this and ruled it out. Just wanted to offer it as food for thought in case a quick spike shows it can cover the same cases with less custom wrapper logic.

Use Jinja2's built-in finalize hook for value-to-string conversion
and a getattr override for dict-key-priority lookup, eliminating the
custom TemplateValue wrapper class entirely.
@andreatgretel
Copy link
Copy Markdown
Contributor Author

@nabinchha Great call on the finalize approach — implemented in 5f90cbf. Used env.finalize = record_str_fn combined with a getattr override on the sandbox environment (to prefer dict key lookup over method resolution for keys like items). This eliminated TemplateValue, wrap_record, _skip_record_sanitization, and _record_str_fn entirely (-58 lines net). All tests pass.

@andreatgretel andreatgretel merged commit 61fa015 into main Mar 18, 2026
47 checks passed
@andreatgretel andreatgretel deleted the andreatgretel/feat/nested-schema-transform branch April 14, 2026 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants