fix: support nested field access in schema transform templates#435
fix: support nested field access in schema transform templates#435andreatgretel merged 5 commits intomainfrom
Conversation
Enable {{ result.quality.score }} style dot notation in schema
transform Jinja2 templates, where result is a deserialized JSON column.
Previously, _json_escape_record flattened all dict values to escaped
JSON strings before Jinja2 saw them. This made the rendered output
valid JSON but prevented nested access since Jinja2 only saw strings.
The fix introduces TemplateValue, a wrapper that defers the choice
between "drill into nested dict" and "render as escaped string" to
template evaluation time. Jinja2 resolves dot notation via __getattr__
(returning a new TemplateValue for the nested value), and converts to
string via __str__ (delegating to a caller-provided str_fn). This is
necessary because plain dicts render as Python repr ({'key': 'val'})
which is invalid JSON - we need to control __str__ to produce properly
escaped JSON, and that requires a wrapper object.
Other Jinja2 consumers (prompt templates, expression columns) don't
need this - Jinja2 natively supports dot access on plain dicts via
getattr-to-getitem fallback, and plain str() is fine for text output.
Schema transform is unique because its output must be valid JSON.
Greptile SummaryThis PR fixes a bug where schema transform templates could not perform nested dot-notation access on deserialized JSON columns (e.g. Key changes:
|
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py | Adds prefer_dict_key_access flag to UserTemplateSandboxEnvironment that overrides getattr to prefer dict key lookup over Python attribute/method access; extends prepare_jinja2_template_renderer to accept an optional record_str_fn which is wired up as Jinja2's finalize hook and automatically enables prefer_dict_key_access. Logically correct and safe after JSON-sanitization guarantees basic types, but the two concerns are coupled in the public API. |
| packages/data-designer-engine/src/data_designer/engine/processing/processors/schema_transform.py | Replaces _json_escape_record (which pre-flattened all dicts to escaped strings, preventing nested access) with _escape_value_for_json (a single-value escape function passed as record_str_fn). The deserialized record is now passed directly to render_template, with escaping deferred to Jinja2's finalize hook. Clean and correct. |
| packages/data-designer-engine/tests/engine/processing/processors/test_schema_transform.py | Adds well-structured parametrized tests covering nested dot access, mixed nested+flat rendering, and list indexing. Uses clear pytest.param with id= labels. No fragile branching heuristics. The items key test case specifically validates the prefer_dict_key_access behaviour. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[SchemaTransformProcessor] --> B[prepare renderer\nwith escape finalize fn]
B --> C[UserTemplateSandbox\nfinalize=escape fn\nprefer dict key access]
A --> D[Loop over DataFrame rows]
D --> E[deserialize JSON values\nstrings become Python dicts]
E --> F[render template]
F --> G[sanitize record\nJSON round-trip]
G --> H{Expression type}
H -->|dot access: result.quality.score| I[getattr override\ndict key takes priority\nover Python method]
H -->|whole dict: result| J[return full dict]
I --> K[finalize: escape for JSON\nstr for numbers\njson.dumps for strings\ndouble-encode for dicts]
J --> K
K --> L[Rendered JSON string]
L --> M[json.loads to output DataFrame]
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py
Line: 449-452
Comment:
**`prefer_dict_key_access` and `finalize` are always coupled together**
`finalize` (JSON escaping) and `prefer_dict_key_access` (nested-dict attribute resolution) are logically independent concerns, but the current implementation always sets them together whenever `record_str_fn is not None`. A future caller who wants one without the other—e.g., nested dot access on a non-JSON output, or a custom finalizer without the dict-key priority—would have to bypass `prepare_jinja2_template_renderer` and construct `UserTemplateSandboxEnvironment` directly.
Both parameters exist independently on `UserTemplateSandboxEnvironment.__init__`, so consider exposing `prefer_dict_key_access` as its own optional argument here to make the two axes of configuration independently reachable:
```python
env_kwargs: dict[str, Any] = {}
if record_str_fn is not None:
env_kwargs["finalize"] = record_str_fn
if record_str_fn is not None or prefer_dict_key_access:
env_kwargs["prefer_dict_key_access"] = True
```
Or at minimum, a brief docstring note clarifying that the flag is currently an implicit side-effect of providing `record_str_fn` would help future readers understand the intentional coupling.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/processing/processors/schema_transform.py
Line: 33-44
Comment:
**No explicit guard for `float` special values (`nan` / `inf`)**
For numeric types that are not `bool`, `str`, `dict`, `list`, or `None`, the function falls through to `return str(value)`. This is correct for ordinary integers and floats, but `str(float('nan'))` → `"nan"` and `str(float('inf'))` → `"inf"`, which are not valid JSON tokens. If either value were to reach `_escape_value_for_json` they would produce a string that survives rendering and then fail silently (or raise) only at the `json.loads(rendered)` call in `_transform`, which makes the error harder to trace.
In practice `sanitize_record` (which runs a JSON round-trip via `serialize_data`) should catch or normalize NaN/Inf before rendering. But if `serialize_data` happens to use a lenient serializer (e.g. `allow_nan=True`), those values pass through. Adding an explicit guard makes the contract clear:
```python
if isinstance(value, float) and not (value == value): # nan check
return "null"
```
or simply document the assumption that callers guarantee `value` has already passed a strict JSON serialization step.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: "docs: add missing pa..."
packages/data-designer-engine/tests/engine/processing/processors/test_schema_transform.py
Outdated
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py
Show resolved
Hide resolved
...ages/data-designer-engine/src/data_designer/engine/processing/processors/schema_transform.py
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py
Outdated
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/processing/ginja/record.py
Outdated
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/processing/ginja/record.py
Outdated
Show resolved
Hide resolved
packages/data-designer-engine/tests/engine/processing/processors/test_schema_transform.py
Outdated
Show resolved
Hide resolved
|
(AR) Suggestion: Multi-template path not extended with record_str_fn support
What: prepare_jinja2_multi_template_renderer and render_multi_template do not support record_str_fn, creating API asymmetry. Why: The two rendering paths are now architecturally divergent. Future callers needing nested access in multi-template scenarios have no path. Suggestion: |
|
(AR) Suggestion: No test coverage for error paths in nested access
What: Tests cover happy paths but not missing keys, None intermediates, or type errors in nested access. Why: TemplateValue introduces new error surfaces (AttributeError on missing keys) that are untested. Suggestion: |
|
(AR) This PR introduces The most important finding is a concrete bug: Verdict: needs-changes — 1 critical, 3 warnings, 5 suggestions. |
- Fix boolean serialization: add bool check before str in _escape_value_for_json to produce JSON 'true'/'false' instead of Python 'True'/'False' - Add class-level _record_str_fn annotation to WithJinja2UserTemplateRendering - Rename skip_record_sanitization to _skip_record_sanitization (underscore prefix) to signal internal-only usage, and document it in safe_render docstring - Add defensive error handling in TemplateValue.__getitem__ and __iter__ - Promote test input data to parametrize column, removing brittle string scan
packages/data-designer-engine/src/data_designer/engine/processing/ginja/record.py
Outdated
Show resolved
Hide resolved
...ages/data-designer-engine/src/data_designer/engine/processing/processors/schema_transform.py
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py
Show resolved
Hide resolved
- Add __eq__ and __hash__ to TemplateValue so Jinja2 equality
conditionals (e.g. {% if result.label == "excellent" %}) work
- Add inline comment explaining deliberate double-encode in
_escape_value_for_json for dict/list values
- Default _record_str_fn to None at class level so accessing it
before prepare_jinja2_template_renderer doesn't mask the real error
|
Small design thought, not a blocker: I wonder if there's a simpler path here using Jinja's env = UserTemplateSandboxEnvironment(...)
env.finalize = _escape_value_for_json
rendered = env.from_string(template).render(sanitize_record(record))Since Jinja already supports nested access like |
packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py
Outdated
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py
Outdated
Show resolved
Hide resolved
Use Jinja2's built-in finalize hook for value-to-string conversion and a getattr override for dict-key-priority lookup, eliminating the custom TemplateValue wrapper class entirely.
|
@nabinchha Great call on the |
Summary
Schema transform templates now support nested dot notation on deserialized JSON columns:
Previously,
{{ result.quality.score }}failed with'str object' has no attribute 'quality'.Why this needed a wrapper (TemplateValue) instead of a simpler fix
Schema transform is unlike other Jinja2 consumers in DataDesigner (prompts, expressions) because its output must be valid JSON. This creates a tension:
{{ x.y }}by looking upyonxHe said "hello"must becomeHe said \"hello\"inside a JSON stringThe old code (
_json_escape_record) chose option 2: flatten all dicts to escaped strings before Jinja2 sees them. Safe JSON, but no nested access.We can't just pass plain dicts and escape only the leaves either. When Jinja2 converts a whole dict to string (for
{{ result }}), it uses Python repr:{'key': 'val'}with single quotes - invalid JSON. The escaping needs to happen dynamically at render time depending on whether Jinja2 is drilling into a value or interpolating it.TemplateValuesolves this by deferring the decision:{{ result.quality }}) triggers__getattr__, which returns a newTemplateValuewrapping the nested value{{ result }}) triggers__str__, which applies a caller-provided escape functionThe escape function (
_escape_value_for_json) is specific to schema transform. Other Jinja2 consumers (prompt templates, expression columns) don't need any of this - Jinja2 natively handles dot access on plain dicts via getattr-to-getitem fallback, and plainstr()is fine for text output.Changes
ginja/record.py- newTemplateValueclass +wrap_recordhelperginja/environment.py-prepare_jinja2_template_rendereraccepts optionalrecord_str_fn;safe_renderacceptsskip_record_sanitization(same pattern as existingskip_template_validation)processors/schema_transform.py- replaced_json_escape_recordwith_escape_value_for_jsonpassed asrecord_str_fntest_schema_transform.py- regression tests for nested dot access, mixed nested+flat, and list indexingTest plan