Fix Union type with dataclass ambiguous error and support superset comparison #5858
Conversation
Signed-off-by: mao3267 <[email protected]>
Codecov Report. Attention: Patch coverage is …

```
@@            Coverage Diff             @@
##           master    #5858      +/-   ##
==========================================
+ Coverage   36.90%   36.96%   +0.06%
==========================================
  Files        1310     1310
  Lines      131372   131487     +115
==========================================
+ Hits        48477    48608     +131
+ Misses      78682    78658      -24
- Partials     4213     4221       +8
```
One note @mao3267, this problem does not only affect dataclasses but any other type that uses protobuf struct as transport. Even combinations of different types that all use protobuf struct.
Maybe we can use the literal type's "type structure" field for this? Or we should document how transformers for other types can provide the schema in a way that they can "participate in the logic". TL;DR: It would be nice if the compiler logic in flytepropeller didn't have "special treatment" for dataclasses but general treatment for json-like structures with schemas that the dataclass transformer makes use of, but other transformers can as well.
Just discussed with @mao3267; he will explain how it works now. In summary, this will support both dataclass and Pydantic BaseModel.
Let's get this done this week @mao3267
Currently, we support both dataclass and Pydantic BaseModel. Although the schemas from …
The exact match part is okay and we should fix that, but I'm confused about the other part. Looking at this example, can you explain why this should work? Why doesn't mypy complain in this example? Or does it? I almost feel like if we're going to go down this route, it should be the other way around. If …

cc @fg91 if you want to take a look as well. Wasn't there another PR where we were discussing an LGPL library also?
```diff
@@ -19,6 +19,8 @@ type trivialChecker struct {
 }

 func removeTitleFieldFromProperties(schema map[string]*structpb.Value) {
 	// TODO: Explain why we need this
 	// TODO: give me an example about dataclass vs. Pydantic BaseModel
```
This is an example comparing dataclass and Pydantic BaseModel. As shown, the schema for dataclass includes a `title` field that records the name of the class. Additionally, the `additionalProperties` field is absent from the Pydantic BaseModel schema because its value is `false`. cc @eapolinario

(screenshots: dataclass schema | Pydantic.BaseModel schema)
To add the comment, writing the entire schema would make it too lengthy. Would it be acceptable to use something like this instead?

```python
class A:
    a: int
```

```
Pydantic.BaseModel: {"properties": {"a": {"title": "A", "type": "integer"}}}
dataclass:          {"properties": {"a": {"type": "integer"}}, "additionalProperties": false}
```
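Given schemas like the pair above, the normalization performed by the Go helper `removeTitleFieldFromProperties` could be sketched in Python as follows. This is a hypothetical illustration only (the name `remove_title_fields` is invented; the real logic lives in flytepropeller's Go code):

```python
def remove_title_fields(schema):
    """Recursively drop "title" entries so schemas produced by different
    generators (e.g. dataclass vs. Pydantic BaseModel) can be compared."""
    if isinstance(schema, dict):
        schema.pop("title", None)
        for value in schema.values():
            remove_title_fields(value)
    elif isinstance(schema, list):
        for item in schema:
            remove_title_fields(item)
    return schema
```

After this pass, the `properties` sections of the two example schemas above become identical; the remaining difference is the `additionalProperties` key, which would need its own normalization.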
Are you proposing to preprocess the schemas so that one can mix and match dataclasses and base models given their schemas are aligned? I.e. task expects a dataclass with schema "A" and I pass a base model that has the same schema.
I personally feel this is not necessary and think it would be totally acceptable to consider a dataclass and a base model not a match by default. Especially if this makes things a lot more complicated in the backend otherwise because the schemas need to be aligned. What do you think about this?
If you are confident in the logic I'm of course not opposing the feature but if you feel this makes things complicated and brittle, I'd rather keep it simple and more robust.
I think it actually makes things more complicated; will remove the related logic.
In the discussion here, we assumed that upstream refers to the exact input type and downstream refers to the expected type for the task input. Did we misunderstand this? By the way, I'm also curious about the reason for supporting this superset matching; it will help us decide our route.
The LGPL discussion applies to this PR as well; it is not mentioned because we are no longer using it.
@Future-Outlier @mao3267 The latter would be slightly concerning to me. Glancing over the code gives me the impression that there is quite a bit of dataclass/pydantic logic we need to apply. I wonder whether this could be done in the respective flytekit type transformer so that the backend is agnostic to the respective type, and as long as the type transformer provides the schema in the right way, the backend can make use of it. It would be really great if there was a tutorial in https://docs.flyte.org/en/latest/api/flytekit/types.extend.html in the end that documents how users need to provide the schema in their respective …
Oh got it, but this only works because there are defaults right? The original case you linked to should work because … If we have

```python
@dataclass
class A:
    a: int
    b: Optional[int] = None
    c: str
```

then this should not work, correct? (because …)
I don't think of this as superset matching. I think of this as schema compatibility, which is why I was thinking we'd find some off-the-shelf library that can just do it for us. 'Is this schema compatible with this other schema?'
Is there a comment somewhere that explains why we no longer need, or can't use, that library or a library like it?

Re @fg91's comments:
I don't know the answer, but it should be yes to the first question. It should be possible to easily provide a JSON schema and have everything on the backend just work. Isn't this the case @Future-Outlier? There should be no special logic for dataclasses or pydantic in the backend, at all. We should remove it if there is.
Replying to @wild-endeavor:

- Yes. This should not work.
- No. I just decided to use another package that directly compares the JSON schema. This is not documented anywhere.

Replying to @fg91:

To support any other custom types, could we limit the provided schema to one from a certain package, like Mashumaro, for the compatibility check? Without limitations, it will be hard to cover all kinds of scenarios. Or does anyone know of possible solutions to support JSON schemas from different versions and packages?
This is exactly what I'm trying to say :)
Yes, I think restricting what kind of schema needs to be supplied is absolutely reasonable! I think it would be good to add a tutorial here https://docs.flyte.org/en/latest/api/flytekit/types.extend.html that states something like "if you want propeller to understand the schema of your type, e.g. to distinguish union types, you need to provide a schema in the to_literal_type method in this specific way". And I personally feel that the dataclass and pydantic type transformers should provide the schema in this general way so that the backend doesn't have to have type-specific implementations for dataclasses/base models.

(As a side note, as I described here, for cache checks we don't use the schema in metadata but the so-called type structure. Maybe it's difficult to fix this in hindsight, but I kind of wish there was a single unified way for type transformers to make the schema available to propeller that is used for everything: cache checks, union type checks, ...)
@fg91 and @EngHabu mind chiming in on dataclass compatibility? Just thinking about it from the basics, if I have JSON schemas representing two dataclasses, let's say

```python
@dataclass
class A2:
    a: int
    b: Optional[int] = None
    c: str = "hello"
```

and

```python
@dataclass
class A1:
    a: int
```

which of the following two are valid?

Case 1

```python
def wf():
    a1 = create_a1()  # -> A1
    use_a2(a2=a1)     # a2: A2
```

Case 2

```python
def wf():
    a2 = create_a2()  # -> A2
    use_a1(a1=a2)     # a1: A1
```

Just thinking about compatibility in the loosest sense, both should be valid. The reason is that in the first case, when calling the downstream task … In the second case, the …

The implication here though is that if you have

```python
@task
def make_a1() -> A1: ...

@task
def use_either(a: typing.Union[A1, A2]): ...
```

then this will fail because A1 will match more than one variant. flytekit itself will not fail I think (right @Future-Outlier?) but we'll never get there because the compiler will fail. Should we just do exact matches only? Plus, both of the two earlier examples (case 1 & 2) will fail mypy type checking.
Thank you @pingsutw and @eapolinario for reminding me... the original issue was an issue because of inheritance. So the case to make work is

```python
@dataclass
class Parent:
    a: int

@dataclass
class Child(Parent):
    b: int
```

with

```python
def wf():
    c = create_child()  # -> Child
    use_parent(p=c)     # p: Parent
```

The reason we want to make this work is that it works in Python (type checks with mypy) and is common enough that we feel it should work. So the upshot is: we should do strict subset matching. If the downstream schema is a subset of the upstream, and the fields that are there are an exact match, then we can. So in this example, you should be able to pass …

Also, this case should not compile, but I think that's okay:

```python
@dataclass
class Parent:
    a: int

@dataclass
class Child(Parent):
    a: Optional[int]
```

This case should compile fine:

```python
@dataclass
class LeafBase:
    leaf: int

@dataclass
class Leaf(LeafBase):
    other: int

@dataclass
class Parent:
    l: LeafBase
```

A task that produces …
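The strict-subset rule sketched above could look roughly like the following. This is a hypothetical Python illustration only (the real check belongs in flytepropeller's Go compiler, and `is_strict_subset` is an invented name): the downstream (expected) schema matches if every property it declares exists in the upstream schema with an identical definition.

```python
def is_strict_subset(downstream: dict, upstream: dict) -> bool:
    """True if every property the downstream (expected) schema declares
    exists in the upstream schema with an exactly matching definition."""
    upstream_props = upstream.get("properties", {})
    for name, prop in downstream.get("properties", {}).items():
        if upstream_props.get(name) != prop:
            return False
    return True

# Child adds a field on top of Parent, so Parent's schema is a subset:
parent_schema = {"properties": {"a": {"type": "integer"}}}
child_schema = {"properties": {"a": {"type": "integer"},
                               "b": {"type": "integer"}}}
```

Under this rule, passing a Child where a Parent is expected succeeds, while passing a Parent where a Child is expected does not, matching the inheritance case above.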
FYI, originally we tried to use the LGPL package json_schema_compare to do the subset check. Even though we feel pretty sure that the license is compatible with Apache 2.0, we are just not using it now, without any other specific reason. If we want to support compatibility checks (the strict subset check) with schemas generated from different packages, maybe we should try the LGPL package, since handling all those schemas ourselves would require specific logic, which is not expected.
We can create a new workflow version to do this; the upgrade scenario will not fail. I tested it with @pingsutw.
Tracking issue
Related to #5489
Why are the changes needed?

When a function accepts a `Union` of two dataclasses as input, Flyte cannot distinguish which dataclass matches the user's input. This is because Flyte only compares the simple types, and both dataclasses are identified as `flyte.SimpleType_STRUCT` in this scenario. As a result, there will be multiple matches, causing ambiguity and leading to an error.

union_test_dataclass.py
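To make the ambiguity concrete, here is a minimal sketch with plain Python dataclasses (hypothetical names; the actual reproduction is in the attached union_test_dataclass.py). In Python the runtime class disambiguates the Union, but in Flyte both variants arrive as the same `SimpleType_STRUCT`, so a structural check such as a JSON-schema comparison is needed instead:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Foo:
    x: int

@dataclass
class Bar:
    y: str

# In Python, the runtime class tells the variants apart. In Flyte, both
# Foo and Bar serialize to a protobuf struct, so the engine sees two
# identical STRUCT simple types and cannot pick a Union variant.
def which_variant(value: Union[Foo, Bar]) -> str:
    return type(value).__name__
```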
What changes were proposed in this pull request?

For dataclasses, JSON schemas are generated with `marshmallow_jsonschema.JSONSchema` (draft-07) or `mashumaro.jsonschema.build_json_schema` (draft 2020-12); Pydantic.BaseModel generates a schema for itself. To check equivalence, we compare the bytes from marshaling the JSON schemas if they are in the same draft version. For now, we only consider supporting the comparison of schemas with the same version.

superset_A.py
superset_dataclass.py
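The equivalence check described above (compare marshaled bytes when the draft versions match) can be sketched as follows. This is an illustrative Python version, assuming the schemas are plain dicts whose keys are canonicalized before serialization:

```python
import json

def _canonical_bytes(schema: dict) -> bytes:
    # Sort keys so semantically identical schemas marshal to identical bytes.
    return json.dumps(schema, sort_keys=True, separators=(",", ":")).encode()

def schemas_equal(schema_a: dict, schema_b: dict) -> bool:
    # Only meaningful when both schemas target the same JSON Schema draft,
    # mirroring the same-version restriction described above.
    return _canonical_bytes(schema_a) == _canonical_bytes(schema_b)
```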
How was this patch tested?
union_test_basemodel.py
superset_basemodel.py
Setup process
Screenshots
Note for Optional values in JSON Schemas

While handling `Optional` values in Python, both `NoneType` and the target type are accepted. However, when defining such values, default values must still be provided. This is why `Optional` properties without default values are marked as `required` in JSON schemas.
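As a small illustration of the note above (plain Python, no Flyte involved): `Optional` only widens the annotation to include `None`; it does not supply a default, so such a field remains required when constructing the object and, correspondingly, in the generated schema. The class name below is hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Example:
    a: Optional[int]          # no default: still a required field
    b: Optional[int] = None   # default provided: not required

ok = Example(a=None)  # fine: 'a' is supplied explicitly
try:
    Example()  # omitting 'a' raises TypeError, since it has no default
except TypeError:
    pass
```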
Related PRs
None
Docs link
TODO