BUG: fix validate(sample=x) for pl.DataFrame #1923

m-richards · 2025-03-02T10:06:24Z

Fixes #1912.

I had to feed through some additional information which meant passing through an additional argument to validate or subsample, which differs from the base class implementations. It seemed better to do that in subsample which is less user facing.

Not sure if I found the best place to put tests.

codecov · 2025-03-02T10:08:11Z

Codecov Report

Attention: Patch coverage is 89.65517% with 3 lines in your changes missing coverage. Please review.

Project coverage is 93.48%. Comparing base (812b2a8) to head (eea6292).
Report is 204 commits behind head on main.

Files with missing lines	Patch %	Lines
pandera/backends/polars/container.py	86.36%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1923      +/-   ##
==========================================
- Coverage   94.28%   93.48%   -0.80%     
==========================================
  Files          91      121      +30     
  Lines        7013     9398    +2385     
==========================================
+ Hits         6612     8786    +2174     
- Misses        401      612     +211

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Matt Richards <[email protected]>

cosmicBboy

The backend only accepting LazyFrame was an intentional choice here, to force the backend only use polar's lazy api when implementing the actual validation logic.

Thinking about this a little bit, instead of introducing an additional argument to the subsample api, what if we did the following:

Preserve the type of polars.DataFrame and pass it into the backend validate method
Do the subsampling on polars.DataFrame if that's what the user passed in
Convert it to a lazyframe.
At the end of the backend validate method, we may also want to convert that back into a polars.DataFrame if that's what the user passed in.

m-richards · 2025-03-07T10:41:11Z

I'll have a go and see how this works, I was aware that LazyFrame input was deliberate and wasn't sure if it made sense to change that.

m-richards · 2025-03-10T10:31:45Z

@cosmicBboy how do you see the sequencing of this working? If I pass in a pl.DataFrame, would you expect the backend validate to go:

pl.DataFrame -> pl.LazyFrame for parsing -> convert back to pl.DataFrame for subsampling?
Bring sampling forward so that parsing is applied after sampling?
Something else? Making the parsing support pl.DataFrames as well?

cosmicBboy · 2025-03-12T01:38:13Z

hmm, this is tricky. So parsing should be applied to the entire dataframe, regardless of subsampling. If the validation checks pass on the subsample, the user should still expect to get the full parsed dataframe.

The reason subsampling exists is to make validation tractable for extremely large dataframes (with the understanding that pandera may not catch errors in the out-of-sample set).

So I guess if you pass in a polars.DataFrame the sequence of events are:

parse (polars.DataFrame): this assumes that parsers work with both polars.Dataframe and polars.LazyFrame. If this doesn't work (or to make the parsing interface consistent) then maybe this step converts the DataFrame -> LazyFrame -> parsers -> DataFrame for subsample.
subsample (polars.DataFrame)
convert to polars.LazyFrame
run the rest of the validation checks on the polars.LazyFrame
convert back to polars.DataFrame

Signed-off-by: Matt Richards <[email protected]>

m-richards · 2025-03-16T00:09:43Z

mypy.ini

@@ -1,4 +1,5 @@
 [mypy]
+disable_error_code =annotation-unchecked


this subpresses all the tests/geopandas/test_pydantic.py:20: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked] type warnings/notes in the mypy pre-commit stage

m-richards · 2025-03-16T00:11:54Z

pandera/api/polars/types.py

@@ -20,7 +20,8 @@ class CheckResult(NamedTuple):


 PolarsCheckObjects = Union[pl.LazyFrame, pl.DataFrame]
-
+PolarsFrame = TypeVar("PolarsFrame", pl.LazyFrame, pl.DataFrame)


I wasn't sure if these should go here or somewhere else, since some packages treat api packages a public facing, but I saw PolarsCheckObjects and PolarsData are here and they're not exposed externally.

m-richards · 2025-03-16T00:15:00Z

pandera/backends/polars/container.py

+
+def _to_frame_kind(obj: PolarsFrame, kind: type[PolarsFrame2]) -> PolarsFrame2:
+    if isinstance(obj, pl.DataFrame):
+        if issubclass(kind, pl.DataFrame):


this could be if kind ==pl.DataFrame but mypy doesn't seem to narrow on that properly

m-richards · 2025-03-16T00:21:19Z

@cosmicBboy I've just done another iteration on this. I had a look into making the parsers accept pl.DataFrame directly, but this starts having to touch lots of places, we'd need to make pandera/engines/polars_engine.py::DataType and subclasses support pl.DataFrame inputs in the coerce and try_coerce methods. That's also probably doable if I were to do this, I'd prefer to make mypy run on polars_engine first (it's probably not strictly necessary but I think the tests inputs (or a subset) would also need to then be parametrised to feed in DataFrame/LazyFrame inputs as these would be slightly different code paths

fix validate(sample=x) for pl.DataFrame

ba4896e

Signed-off-by: Matt Richards <[email protected]>

m-richards force-pushed the bugfix/1912 branch from 5c0f037 to ba4896e Compare March 3, 2025 08:33

cosmicBboy reviewed Mar 7, 2025

View reviewed changes

m-richards added 3 commits March 16, 2025 11:01

make this work with lazy only parsers

2e4a845

Signed-off-by: Matt Richards <[email protected]>

update to run post sample checks on lf

db5ca18

Signed-off-by: Matt Richards <[email protected]>

minimise diff

eea6292

Signed-off-by: Matt Richards <[email protected]>

m-richards commented Mar 16, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: fix validate(sample=x) for pl.DataFrame #1923

BUG: fix validate(sample=x) for pl.DataFrame #1923

m-richards commented Mar 2, 2025 •

edited

Loading

codecov bot commented Mar 2, 2025 •

edited

Loading

cosmicBboy left a comment •

edited

Loading

m-richards commented Mar 7, 2025

m-richards commented Mar 10, 2025

cosmicBboy commented Mar 12, 2025 •

edited

Loading

m-richards Mar 16, 2025

m-richards Mar 16, 2025

m-richards Mar 16, 2025

m-richards commented Mar 16, 2025

		@@ -1,4 +1,5 @@
		[mypy]
		disable_error_code =annotation-unchecked

		@@ -20,7 +20,8 @@ class CheckResult(NamedTuple):


		PolarsCheckObjects = Union[pl.LazyFrame, pl.DataFrame]

		PolarsFrame = TypeVar("PolarsFrame", pl.LazyFrame, pl.DataFrame)

BUG: fix validate(sample=x) for pl.DataFrame #1923

Are you sure you want to change the base?

BUG: fix validate(sample=x) for pl.DataFrame #1923

Conversation

m-richards commented Mar 2, 2025 • edited Loading

codecov bot commented Mar 2, 2025 • edited Loading

Codecov Report

cosmicBboy left a comment • edited Loading

Choose a reason for hiding this comment

m-richards commented Mar 7, 2025

m-richards commented Mar 10, 2025

cosmicBboy commented Mar 12, 2025 • edited Loading

m-richards Mar 16, 2025

Choose a reason for hiding this comment

m-richards Mar 16, 2025

Choose a reason for hiding this comment

m-richards Mar 16, 2025

Choose a reason for hiding this comment

m-richards commented Mar 16, 2025

m-richards commented Mar 2, 2025 •

edited

Loading

codecov bot commented Mar 2, 2025 •

edited

Loading

cosmicBboy left a comment •

edited

Loading

cosmicBboy commented Mar 12, 2025 •

edited

Loading