Consolidating/streamlining object-based and class-based pandera API #643

cosmicBboy · 2021-10-05T01:40:03Z

cosmicBboy
Oct 5, 2021
Maintainer

@jeffzi wanted to ping you on this question, and I wanted to discuss on here before turning it into a full-blown issue.

There are a few things I've been thinking about re: usability and wanted to list them off here under the broader theme of consolidating and streamlining the pandera API:

1. Support datatypes in SchemaModel

Make it so that users don't have to explicitly specify pa.typing.Series or pa.typing.Index. It would also support pandera datatypes.

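import pandera as pa
from pandera import SchemaModel

# Proposed syntax: bare dtypes instead of pa.typing.Series / pa.typing.Index
class Schema(SchemaModel):
    column1: int
    column2: int
    index1: int = pa.Field(index=True)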

Pro: more concise syntax, potentially fewer imports
Con: loss of editor auto-completion for Series/Index? (I haven't tested this)
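
To make the idea a bit more concrete, here is a minimal standard-library sketch of how plain annotations plus an index marker could be sorted into columns and indexes. This is not pandera's implementation: the `Field` class and `split_columns_and_indexes` function below are hypothetical stand-ins invented for illustration.

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for pa.Field; only the index flag is modeled."""

    index: bool = False


def split_columns_and_indexes(model):
    """Return (columns, indexes) dicts mapping field name -> annotated dtype."""
    columns, indexes = {}, {}
    for name, dtype in get_type_hints(model).items():
        # A field with no default has no class attribute, so fall back to None.
        default = getattr(model, name, None)
        is_index = isinstance(default, Field) and default.index
        (indexes if is_index else columns)[name] = dtype
    return columns, indexes


class Schema:
    column1: int
    column2: int
    index1: int = Field(index=True)


columns, indexes = split_columns_and_indexes(Schema)
```

Here `columns` ends up holding `column1` and `column2`, while `index1` lands in `indexes`, without any Series/Index wrapper in the annotations.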

2. Support `pa.Field` in `DataFrameSchema`

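import pandera as pa
from pandera import DataFrameSchema

# Proposed syntax: pa.Field takes the place of Column/Index objects
schema = DataFrameSchema({
    "column1": pa.Field(dtype=int),
    "column2": pa.Field(dtype=int),
    "index1": pa.Field(dtype=int, index=True),
})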

Pro: fewer concepts to learn, user doesn't have to think about the index kwarg
Cons: would have to introduce a dtype kwarg to Field, only to be specified for the object-based API. This would perhaps cause some confusion both to users and contributors, and muddy the distinction between SchemaModel components and DataFrameSchema object components. I've also wanted to preserve the first positional arg of Field to default, with similar semantics to the dataclass or pydantic implementation, except it would fill nan values with the default.

These changes should be designed to be backwards compatible, and I think they would be pretty straightforward to implement.

Thoughts?

jeffzi · 2021-10-06T21:12:31Z

jeffzi
Oct 6, 2021
Collaborator

Hi @cosmicBboy. I haven't had the chance to think carefully about it yet; I'll try to give a proper reply in the coming days.

Anyway, my first thoughts are:

1. I'm worried about mypy on top of auto-completion. We'd need to test. On the other hand, it's true that the type annotations can be super verbose. Perhaps designing official aliases for common patterns such as Optional[Series[]] would make it cleaner, if your proposal turns out to make mypy & friends very unhappy.
2. I love this! One thing that bothers me with the current dict of Columns is that the column name can be specified both as the dict key and via Column(name=). Most people will not use the name argument though.

> would have to introduce a dtype kwarg to Field, only to be specified for the object-based API.

We could throw an error if dtype is given in a SchemaModel, so users would quickly learn not to make that mistake.
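
A rough sketch of what that guard could look like, using only the standard library. The `Field` and `Model` classes are hypothetical stand-ins, not pandera's classes, and the use of `__init_subclass__` is just one assumed way to hook into class creation.

```python
class Field:
    """Hypothetical stand-in for pa.Field with an optional dtype kwarg."""

    def __init__(self, dtype=None, index=False):
        self.dtype = dtype
        self.index = index


class Model:
    """Hypothetical class-based base: dtypes must come from annotations."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Reject Field(dtype=...) at class-definition time.
        for name, value in vars(cls).items():
            if isinstance(value, Field) and value.dtype is not None:
                raise TypeError(
                    f"{cls.__name__}.{name}: pass the dtype as an annotation, "
                    "not as Field(dtype=...)"
                )


class Good(Model):
    column1: int = Field()  # dtype comes from the annotation: accepted


try:
    class Bad(Model):
        column1: int = Field(dtype=int)  # dtype given twice: rejected
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None
```

Because the check runs when the class body is evaluated, the mistake surfaces at import time rather than at validation time.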

> I've also wanted to preserve the first positional arg of Field to default, with similar semantics to the dataclass or pydantic implementation, except it would fill nan values with the default.

This scenario is not very common. I think the first positional argument should be given to dtype because it's probably the most frequent argument. Right now, Field has no positional argument so we can think carefully about what makes more sense without breaking anything.

0 replies

NickCrews · 2022-04-22T19:11:53Z

NickCrews
Apr 22, 2022

[coming from #839, pretty much just throwing in a bunch of new thoughts without giving concrete feedback on the above, sorry!]

IDK if either of you have ever looked at sqlalchemy, but they look to be solving a very similar problem. I think they have a very nice method of offering both a declarative and an imperative API to create a schema, but no matter your method of creation, you always end up with a class that represents the schema. They do some clever stuff where the imperative API actually returns a class, not an object, and when you use the declarative syntax, your class actually subclasses from a dynamically created base class. See the docs for a sense.

Installing sqlalchemy and playing around with it could be a good test to see how well you could possibly integrate with mypy, IDE autocompletion, etc.

The things I really like about this are:

- Easy to transition between creation APIs: I was using pandera's declarative API, but then I was getting frustrated with composing them to make derivative schemas (I couldn't remove columns that a superclass has). So I switched to the imperative API, but that transition was hard, because I couldn't do simple find/replace. The Column vs Field, dtype_kwargs, etc. were subtly different and I had to do it manually.
- Use a mix of APIs: I could have avoided a lot of that transition pain if I could have just changed SOME of the schemas to use imperative style, but kept the rest declarative. But currently pandera's imperative schemas do not interoperate with declarative schemas. It would be great if you always got the same type back from either construction API.

Things to note about this method:

- Schemas are **class**es: This means they are singletons. I think this usually should be desirable? Why would you have multiple slightly different versions of the same schema in one program run?
- They claim to have support for mypy? https://docs.sqlalchemy.org/en/14/orm/declarative_styles.html?highlight=mypy#creating-an-explicit-base-non-dynamically-for-use-with-mypy-similar
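
To illustrate the "both APIs give you a class" idea, here is a small standard-library sketch. It is neither sqlalchemy's nor pandera's code: `SchemaBase`, `make_schema`, and `drop_columns` are hypothetical names, and passing `columns` through `__init_subclass__` is just one assumed way to wire it up.

```python
from typing import get_type_hints


class SchemaBase:
    """Hypothetical common base; `columns` maps column name -> dtype."""

    columns = {}

    def __init_subclass__(cls, columns=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if columns is None:
            # Declarative path: read the columns from the class annotations.
            columns = get_type_hints(cls)
        cls.columns = dict(columns)


def make_schema(name, columns):
    """Imperative path: build and return a class, not an instance."""
    return type(name, (SchemaBase,), {}, columns=columns)


def drop_columns(schema, *names):
    """Derive a new schema class without the given columns."""
    remaining = {k: v for k, v in schema.columns.items() if k not in names}
    return make_schema(schema.__name__ + "Subset", remaining)


# Declarative style
class Users(SchemaBase):
    user_id: int
    name: str


# Imperative style, same resulting type
Orders = make_schema("Orders", {"order_id": int, "total": float})

# Deriving from a declarative schema through the imperative helper
PublicUsers = drop_columns(Users, "name")
```

Since both construction styles yield subclasses of the same base, code that consumes schemas would not need to care how they were written, and removing an inherited column becomes an ordinary function call.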

In the linked issue I'm coming from, @cosmicBboy said:

> Are you referring to how you construct the schema, or how you modify that schema? Seems like it would be hard to get away from this, as the two APIs serve two different purposes/programming preferences... the DataFrameSchema is better for inline validation while SchemaModel is better for people who care about python type annotations.

I don't have the background for this, sorry. I'm assuming I'm missing something when I ask what is getting in the way of us having both benefits? Maybe the sqlalchemy approach to typing that I linked above might be a lead?

1 reply

cosmicBboy Apr 25, 2022
Maintainer Author

Thanks for the thoughts @NickCrews !

I'll write a more comprehensive response in the next few days but the short answer is:

- great thoughts in there! I'll respond to some of these by pointing you to current (perhaps sub-optimal) ways of solving your pain-points.
- in the longer term I'd love to figure out a more streamlined interface for both object-/class-based APIs (not actually sure how sqlalchemy defines imperative vs declarative... both ways of writing sqlalchemy seem imperative to me 🤷‍♂️) but I think it's gonna take some time to transition to whatever this more streamlined interface looks like due to historical reasons (e.g. DataFrameSchema is the class that actually implements the validation logic, SchemaModel simply converts to a DataFrameSchema object). I'd like to not break current users as much as possible, so it'll be a tightrope dance 💃
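
For readers unfamiliar with that split, here is a toy standard-library sketch of the layering described above, where the object-based schema owns the validation logic and the class-based front end only translates annotations into it. `ObjectSchema` and `ModelBase` are hypothetical stand-ins, and validating a plain dict is a simplification standing in for a dataframe.

```python
from typing import get_type_hints


class ObjectSchema:
    """Hypothetical object-based schema: owns the validation logic."""

    def __init__(self, columns):
        self.columns = dict(columns)  # column name -> expected Python type

    def validate(self, record):
        for name, dtype in self.columns.items():
            if name not in record:
                raise ValueError(f"missing column: {name}")
            if not isinstance(record[name], dtype):
                raise TypeError(f"{name}: expected {dtype.__name__}")
        return record


class ModelBase:
    """Hypothetical class-based front end: only translates annotations."""

    @classmethod
    def to_schema(cls):
        return ObjectSchema(get_type_hints(cls))

    @classmethod
    def validate(cls, record):
        # All checks are delegated to the object-based schema.
        return cls.to_schema().validate(record)


class Schema(ModelBase):
    column1: int
    column2: int


validated = Schema.validate({"column1": 1, "column2": 2})
```

In this layout, any consolidation of the two APIs has to decide which layer remains the source of truth, which is part of why the transition touches so much of the codebase.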

@jeffzi feel free to provide your thoughts here too, since basically doing this consolidation will entail many structural changes to the codebase, including the SchemaModel and related modules.
