Fix: expand dict transformation#9561
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
1 issue found across 3 files
Architecture diagram
sequenceDiagram
participant User as User (Marimo UI)
participant DFPlugin as Dataframe Plugin
participant Handler as ExpandDict Handler (handlers.py)
participant Narwhals as Narwhals Layer
participant Polars as Polars Engine
participant Backend as Original Backend (pandas/Ibis/other)
Note over User,Backend: Expand Dict Transformation Flow
User->>DFPlugin: Trigger expand dict on column
DFPlugin->>Handler: handle_expand_dict(df, transform)
Handler->>Narwhals: collect_and_preserve_type(df)
Narwhals->>Backend: Collect actual data from original backend
Backend-->>Narwhals: Data as native type
Narwhals-->>Handler: Collected DataFrame + undo function
Handler->>Polars: collected_df.to_polars()
Note over Handler,Polars: Convert to Polars for unnest support
Polars->>Polars: polars_df.unnest(column_id)
Note over Polars: Handles null dict values correctly
Polars-->>Handler: Unnested Polars DataFrame
Handler->>Narwhals: nw.from_native(unnested)
Narwhals->>Handler: Narwhals wrapper
Handler->>Handler: undo(narwhals_df)
Note over Handler: Convert back to original backend type
Handler-->>DFPlugin: Transformed DataFrame
DFPlugin-->>User: Updated table with expanded columns
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic
mscolnick
left a comment
There was a problem hiding this comment.
need to take an optional dep on polars
… rows with the create test dataframes instead.
refactor: adding None == NaN in assert frame equal with nans method to use it in expand_dict test.
Done. |
|
There are some pandas CI errors that i am looking into. |
… errors for mixed object columns.
This is happening because of data conversion mismatch between pandas and narwahls with mixed data types for So my only option is to fallback to pandas backend processing separately for the |
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
… json normalise following the handlers code.
Bundle ReportBundle size has no change ✅ Affected Assets, Files, and Routes:view changes for bundle: marimo-esmAssets Changed:
|
|
@kirangadhave I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
Pull request overview
This PR makes the dataframe Expand Dict transform robust to nulls by routing expansion through backend-native implementations (Polars unnest and Pandas json_normalize) and adds/updates tests to validate the behavior, including nested dict values.
Changes:
- Update runtime transform handling to expand dict/struct columns using Pandas-native logic for pandas inputs and Polars
unnestotherwise. - Update generated “print code” for Expand Dict in pandas and polars to match the new implementations.
- Expand test coverage for Expand Dict with nulls and nested dicts; adjust equality helper to optionally treat
NoneandNaNas equivalent.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| marimo/_plugins/ui/_impl/dataframes/transforms/handlers.py | Implements Expand Dict via pandas json_normalize or Polars unnest after collection. |
| marimo/_plugins/ui/_impl/dataframes/transforms/print_code.py | Updates printed code generation for Expand Dict for pandas and polars. |
| tests/_plugins/ui/_impl/dataframes/test_handlers.py | Unskips/extends Expand Dict tests (nulls + nested dicts) and tweaks dataframe comparison helper. |
| tests/_plugins/ui/_impl/dataframes/test_print_code.py | Adds print-code parity tests for Expand Dict with nested dicts for pandas and polars. |
| result = apply(df, in_transform) | ||
| assert_frame_equal_with_nans(result, expected) | ||
|
|
||
| @staticmethod | ||
| @pytest.mark.parametrize( | ||
| ("df", "expected"), | ||
| list( | ||
| zip( | ||
| create_test_dataframes( | ||
| {"nulls": [1, 2, 3, None, "hello"]}, include=["pandas"] | ||
| ), | ||
| create_test_dataframes({"nulls": [None]}, include=["pandas"]), | ||
| strict=False, | ||
| ) | ||
| ), | ||
| ) | ||
| def test_filter_rows_null_pandas_object( | ||
| df: DataFrameType, expected: DataFrameType | ||
| ) -> None: | ||
| in_transform = FilterRowsTransform( | ||
| type=TransformType.FILTER_ROWS, | ||
| operation="keep_rows", | ||
| where=FilterGroup( | ||
| type="group", | ||
| operator="and", | ||
| children=[ | ||
| FilterCondition( | ||
| type="condition", | ||
| column_id="nulls", | ||
| operator="in", | ||
| value=[None], | ||
| ) | ||
| ], | ||
| ), | ||
| ) | ||
| result = apply(df, in_transform) | ||
| assert_frame_equal_with_nans(result, expected) | ||
|
|
||
| @staticmethod | ||
| @pytest.mark.parametrize( |
There was a problem hiding this comment.
No issues found across 4 files
Architecture diagram
sequenceDiagram
participant UI as DataFrame UI
participant Handler as NarwhalsTransformHandler
participant Narwhals as Narwhals Layer
participant Backend as DataFrame Backend
participant PrintCode as Print Code Generator
Note over UI,PrintCode: Expand Dict Transform Flow - Current State
UI->>Handler: handle_expand_dict(DataFrame, ExpandDictTransform)
Handler->>Narwhals: collect_and_preserve_type(df)
Narwhals-->>Handler: (collected_df, undo)
Handler->>Narwhals: collected_df.to_native()
Narwhals-->>Handler: native_df
alt Pandas Backend
Handler->>Handler: Check if pandas dataframe
Handler->>Backend: result_df = native_df.copy()
Handler->>Backend: expanded = pd.json_normalize(result_df.pop(column_id).map(...), max_level=0)
Backend-->>Handler: expanded DataFrame
Handler->>Backend: expanded.index = result_df.index
Handler->>Backend: result_df.join(expanded)
Backend-->>Handler: joined DataFrame
Handler->>Narwhals: undo(nw.from_native(joined))
Narwhals-->>Handler: original backend type
else Polars Backend
Handler->>Narwhals: collected_df.to_polars()
Narwhals-->>Handler: polars_df
Handler->>Backend: polars_df.unnest(column_id)
Backend-->>Handler: unnested DataFrame
Handler->>Narwhals: undo(nw.from_native(unnested))
Narwhals-->>Handler: original backend type
end
Handler-->>UI: Transformed DataFrame
Note over UI,PrintCode: Print Code Generation
UI->>PrintCode: Generate Python code for transform
PrintCode->>PrintCode: Check backend type
alt Pandas Backend
PrintCode->>PrintCode: Generate: df.join(pd.json_normalize(df.pop(col).map(...), max_level=0).set_axis(...))
else Polars Backend
PrintCode->>PrintCode: Generate: df.unnest(column_id)
end
PrintCode-->>UI: Generated code string
|
@Shamik-07 can you please update the video in the PR to a higher res version? I'm having difficulty reading text in it |
Unfortunately, due to GH upload limitations, i can't upload a high res video. |
kirangadhave
left a comment
There was a problem hiding this comment.
please address copilot review. We can do a manual round of reviews after that.
Thanks, fixed. |
kirangadhave
left a comment
There was a problem hiding this comment.
Requesting a major change which should remove hard dependency on polars for other backends which support structs.
Also please address other nits.
| expanded.index = result_df.index | ||
| return undo(nw.from_native(result_df.join(expanded))) | ||
|
|
||
| DependencyManager.polars.require( |
There was a problem hiding this comment.
non pandas dataframes are forced to go through polars conversion here. For duckdb or ibis, that means forcing polars installation. We should not do that.
Narwhals has struct.field, so we can do:
schema = df.collect_schema()
fields = [f.name for f in schema[col].fields]
df.with_columns(
[nw.col(col).struct.field(f).alias(f) for f in fields]
).drop(col)This approach also stays lazy. Pandas approach with json_serialize is correct.
There was a problem hiding this comment.
Are you sure that all backends, which support struct schema would necessarily have struct.field?
| # older versions of pandas running on py310 otherwise CI will fail | ||
| expanded = pd.json_normalize( | ||
| result_df.pop(transform.column_id).map( | ||
| lambda value: {} if value is None else value |
| import pandas as pd | ||
|
|
||
| result_df = native_df.copy() | ||
| # max_level=0 was used so that pandas doesn't recursively unnest dicts |
There was a problem hiding this comment.
the comment is narrating the code, simplify to explain the why instead.
| max_level=0, | ||
| ) | ||
| expanded.index = result_df.index | ||
| return undo(nw.from_native(result_df.join(expanded))) |
There was a problem hiding this comment.
Duplicate column names after unnest will throw error here. Handle it gracefully.
| why="to expand dict/struct columns for non-pandas backends" | ||
| ) | ||
| polars_df = collected_df.to_polars() | ||
| unnested = polars_df.unnest(transform.column_id) |
There was a problem hiding this comment.
Duplicate column names after unnest will throw error here. Handle it gracefully.
| pd = pytest.importorskip("pandas") | ||
| pytest.importorskip("polars") | ||
| pytest.importorskip("pyarrow") | ||
| pytest.importorskip("polars") |
There was a problem hiding this comment.
import order change is unnecessary
feat: raising duplicate columns error test: adding necessary tests for duplicate columns
|
Thanks for the review comments. |
📝 Summary
Using narwahls to convert all backend to polars and then using the
unnestfunction of polars for expanding the dict and then convert it back to the original backend.Closes #4583
Screen.Recording.2026-05-15.at.18.07.461.mov
📋 Pre-Review Checklist
✅ Merge Checklist