Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix validate(sample=x) for pl.DataFrame #1923

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

m-richards
Copy link
Collaborator

@m-richards m-richards commented Mar 2, 2025

Fixes #1912.

I had to feed through some additional information which meant passing through an additional argument to validate or subsample, which differs from the base class implementations. It seemed better to do that in subsample which is less user facing.

Not sure if I found the best place to put tests.

Copy link

codecov bot commented Mar 2, 2025

Codecov Report

Attention: Patch coverage is 89.65517% with 3 lines in your changes missing coverage. Please review.

Project coverage is 93.48%. Comparing base (812b2a8) to head (eea6292).
Report is 204 commits behind head on main.

Files with missing lines Patch % Lines
pandera/backends/polars/container.py 86.36% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1923      +/-   ##
==========================================
- Coverage   94.28%   93.48%   -0.80%     
==========================================
  Files          91      121      +30     
  Lines        7013     9398    +2385     
==========================================
+ Hits         6612     8786    +2174     
- Misses        401      612     +211     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@cosmicBboy cosmicBboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backend only accepting LazyFrame was an intentional choice here, to force the backend only use polar's lazy api when implementing the actual validation logic.

Thinking about this a little bit, instead of introducing an additional argument to the subsample api, what if we did the following:

  • Preserve the type of polars.DataFrame and pass it into the backend validate method
  • Do the subsampling on polars.DataFrame if that's what the user passed in
  • Convert it to a lazyframe.
  • At the end of the backend validate method, we may also want to convert that back into a polars.DataFrame if that's what the user passed in.

@m-richards
Copy link
Collaborator Author

I'll have a go and see how this works, I was aware that LazyFrame input was deliberate and wasn't sure if it made sense to change that.

@m-richards
Copy link
Collaborator Author

@cosmicBboy how do you see the sequencing of this working? If I pass in a pl.DataFrame, would you expect the backend validate to go:

  1. pl.DataFrame -> pl.LazyFrame for parsing -> convert back to pl.DataFrame for subsampling?
  2. Bring sampling forward so that parsing is applied after sampling?
  3. Something else? Making the parsing support pl.DataFrames as well?

@cosmicBboy
Copy link
Collaborator

cosmicBboy commented Mar 12, 2025

hmm, this is tricky. So parsing should be applied to the entire dataframe, regardless of subsampling. If the validation checks pass on the subsample, the user should still expect to get the full parsed dataframe.

The reason subsampling exists is to make validation tractable for extremely large dataframes (with the understanding that pandera may not catch errors in the out-of-sample set).

So I guess if you pass in a polars.DataFrame the sequence of events are:

  1. parse (polars.DataFrame): this assumes that parsers work with both polars.Dataframe and polars.LazyFrame. If this doesn't work (or to make the parsing interface consistent) then maybe this step converts the DataFrame -> LazyFrame -> parsers -> DataFrame for subsample.
  2. subsample (polars.DataFrame)
  3. convert to polars.LazyFrame
  4. run the rest of the validation checks on the polars.LazyFrame
  5. convert back to polars.DataFrame

@@ -1,4 +1,5 @@
[mypy]
disable_error_code =annotation-unchecked
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this subpresses all the tests/geopandas/test_pydantic.py:20: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked] type warnings/notes in the mypy pre-commit stage

@@ -20,7 +20,8 @@ class CheckResult(NamedTuple):


PolarsCheckObjects = Union[pl.LazyFrame, pl.DataFrame]

PolarsFrame = TypeVar("PolarsFrame", pl.LazyFrame, pl.DataFrame)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure if these should go here or somewhere else, since some packages treat api packages a public facing, but I saw PolarsCheckObjects and PolarsData are here and they're not exposed externally.


def _to_frame_kind(obj: PolarsFrame, kind: type[PolarsFrame2]) -> PolarsFrame2:
if isinstance(obj, pl.DataFrame):
if issubclass(kind, pl.DataFrame):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be if kind ==pl.DataFrame but mypy doesn't seem to narrow on that properly

@m-richards
Copy link
Collaborator Author

@cosmicBboy I've just done another iteration on this. I had a look into making the parsers accept pl.DataFrame directly, but this starts having to touch lots of places, we'd need to make pandera/engines/polars_engine.py::DataType and subclasses support pl.DataFrame inputs in the coerce and try_coerce methods. That's also probably doable if I were to do this, I'd prefer to make mypy run on polars_engine first (it's probably not strictly necessary but I think the tests inputs (or a subset) would also need to then be parametrised to feed in DataFrame/LazyFrame inputs as these would be slightly different code paths

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Polars DataFrameModel.validate crashes with sample specified
2 participants