-
-
Notifications
You must be signed in to change notification settings - Fork 328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: fix validate(sample=x) for pl.DataFrame #1923
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1923 +/- ##
==========================================
- Coverage 94.28% 93.48% -0.80%
==========================================
Files 91 121 +30
Lines 7013 9398 +2385
==========================================
+ Hits 6612 8786 +2174
- Misses 401 612 +211 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Matt Richards <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The backend only accepting LazyFrame
was an intentional choice here, to force the backend only use polar's lazy api when implementing the actual validation logic.
Thinking about this a little bit, instead of introducing an additional argument to the subsample api, what if we did the following:
- Preserve the type of
polars.DataFrame
and pass it into the backend validate method - Do the subsampling on
polars.DataFrame
if that's what the user passed in - Convert it to a lazyframe.
- At the end of the backend validate method, we may also want to convert that back into a
polars.DataFrame
if that's what the user passed in.
I'll have a go and see how this works, I was aware that LazyFrame input was deliberate and wasn't sure if it made sense to change that. |
@cosmicBboy how do you see the sequencing of this working? If I pass in a pl.DataFrame, would you expect the backend validate to go:
|
hmm, this is tricky. So parsing should be applied to the entire dataframe, regardless of subsampling. If the validation checks pass on the subsample, the user should still expect to get the full parsed dataframe. The reason subsampling exists is to make validation tractable for extremely large dataframes (with the understanding that pandera may not catch errors in the out-of-sample set). So I guess if you pass in a
|
Signed-off-by: Matt Richards <[email protected]>
Signed-off-by: Matt Richards <[email protected]>
Signed-off-by: Matt Richards <[email protected]>
@@ -1,4 +1,5 @@ | |||
[mypy] | |||
disable_error_code =annotation-unchecked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this subpresses all the tests/geopandas/test_pydantic.py:20: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
type warnings/notes in the mypy pre-commit stage
@@ -20,7 +20,8 @@ class CheckResult(NamedTuple): | |||
|
|||
|
|||
PolarsCheckObjects = Union[pl.LazyFrame, pl.DataFrame] | |||
|
|||
PolarsFrame = TypeVar("PolarsFrame", pl.LazyFrame, pl.DataFrame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure if these should go here or somewhere else, since some packages treat api packages a public facing, but I saw PolarsCheckObjects and PolarsData are here and they're not exposed externally.
|
||
def _to_frame_kind(obj: PolarsFrame, kind: type[PolarsFrame2]) -> PolarsFrame2: | ||
if isinstance(obj, pl.DataFrame): | ||
if issubclass(kind, pl.DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could be if kind ==pl.DataFrame but mypy doesn't seem to narrow on that properly
@cosmicBboy I've just done another iteration on this. I had a look into making the parsers accept pl.DataFrame directly, but this starts having to touch lots of places, we'd need to make pandera/engines/polars_engine.py::DataType and subclasses support pl.DataFrame inputs in the |
Fixes #1912.
I had to feed through some additional information which meant passing through an additional argument to validate or subsample, which differs from the base class implementations. It seemed better to do that in subsample which is less user facing.
Not sure if I found the best place to put tests.