
Fix Union type with dataclass ambiguous error and support superset comparison #5858

Closed

Conversation

@mao3267 (Contributor) commented Oct 18, 2024

Tracking issue

Related to #5489

Why are the changes needed?

When a function accepts a Union of two dataclasses as input, Flyte cannot distinguish which dataclass matches the user's input. This is because Flyte only compares the simple types, and both dataclasses are identified as flyte.SimpleType_STRUCT in this scenario. As a result, there will be multiple matches, causing ambiguity and leading to an error.

union_test_dataclass.py

from typing import Union
from dataclasses import dataclass
from dataclasses_json import dataclass_json
from flytekit import task, workflow

@dataclass_json 
@dataclass
class A:
    a: int
    

@dataclass_json
@dataclass
class B:
    b: int


@task
def bar() -> A:
    return A(a=1)

@task
def foo(inp: Union[A, B]):
    print(inp)

@workflow
def wf():
    v = bar()
    foo(inp=v)
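
A minimal sketch that makes the collapse visible (illustrative; it assumes the dataclasses A and B above are in scope and uses flytekit's TypeEngine):

from flytekit.core.type_engine import TypeEngine

# Both dataclasses map to the same simple literal type, so a comparison
# of simple types alone cannot tell the two Union variants apart.
lt_a = TypeEngine.to_literal_type(A)
lt_b = TypeEngine.to_literal_type(B)
print(lt_a.simple == lt_b.simple)  # True -> both SimpleType.STRUCT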

What changes were proposed in this pull request?

  1. To distinguish between different types that use protobuf struct as transport (dataclass, Pydantic BaseModel), we compare their JSON schemas: dataclass schemas are generated by marshmallow_jsonschema.JSONSchema (draft-07) or mashumaro.jsonschema.build_json_schema (draft 2020-12), while Pydantic BaseModel generates its own schema. To check equivalence, we compare the bytes of the marshaled JSON schemas when they are in the same draft version; for now, we only support comparing schemas of the same version (see the sketch after this list).
  2. We plan to support superset matching for dataclass/Pydantic BaseModel schemas in draft 2020-12, meaning that class A and class supersetA can match in the following example (a Pydantic BaseModel example is in the screenshots section):

superset_A.py

# downstream
from dataclasses import dataclass
from typing import Optional

@dataclass
class A:
    a: int
    b: Optional[int] = None
    c: str = "Flyte"

superset_dataclass.py

from dataclasses import dataclass
from typing import Optional
from flytekit import task, workflow
from superset_A import A as supersetA

# upstream
@dataclass
class A:
    a: int

@dataclass
class B:
    b: str

@task
def foo() -> A:
    return A(a=1)

@task
def my_task(input: Union[supersetA, B]):
    print(input)

@workflow
def wf():
    a = foo()
    my_task(input=a)
  3. Unit tests will be added for different versions of JSON schema, including one-level, two-level, and superset examples.
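
A minimal Python sketch of the equivalence check in item 1 (the actual implementation in this PR is Go code in flytepropeller; this assumes mashumaro and Pydantic v2 are installed, and ignores the title/additionalProperties preprocessing discussed later in the thread):

import json
from dataclasses import dataclass

from mashumaro.jsonschema import build_json_schema  # draft 2020-12
from pydantic import BaseModel

@dataclass
class A:
    a: int

class B(BaseModel):
    a: int

# Both schemas target draft 2020-12, so marshaling them with sorted keys
# makes a byte-level comparison order-independent.
schema_a = build_json_schema(A).to_dict()
schema_b = B.model_json_schema()

def schemas_equal(s1: dict, s2: dict) -> bool:
    return json.dumps(s1, sort_keys=True) == json.dumps(s2, sort_keys=True)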

How was this patch tested?

  1. Run an example using union input with identical dataclass on remote (union_test_dataclass.py)
  2. Run an example with superset on remote (superset_dataclass.py)
  3. Run an example using union input with identical BaseModel class on remote (union_test_basemodel.py)

union_test_basemodel.py

from pydantic import BaseModel
from typing import Union
from flytekit import task, workflow
from flytekit.image_spec import ImageSpec

flytekit_hash = "3475ddc41f2ba31d23dd072362be704d7c2470a0"
flytekit = f"git+https://github.com/flyteorg/flytekit.git@{flytekit_hash}"

# Define custom image for the task
image = ImageSpec(
    packages=[
        flytekit,
        "pydantic>2",
        "pandas",
        "pyarrow",
    ],
    apt_packages=["git"],
    registry="localhost:30000",
    builder="default",
)

class A(BaseModel):
    a: int

class B(BaseModel):
    b: str

@task(container_image=image)
def bar() -> A:
    return A(a=1)

@task(container_image=image)
def foo(inp: Union[A, B]):
    print(inp)

@workflow
def wf():
    v = bar()
    foo(inp=v)

if __name__ == "__main__":
    wf()
  4. Run an example with superset on remote (superset_basemodel.py)

superset_basemodel.py

from pydantic import BaseModel
from typing import Optional, Union
from flytekit import task, workflow
from flytekit.image_spec import ImageSpec
from superset_A import A as supersetA

flytekit_hash = "3475ddc41f2ba31d23dd072362be704d7c2470a0"
flytekit = f"git+https://github.com/flyteorg/flytekit.git@{flytekit_hash}"

# Define custom image for the task
image = ImageSpec(
    packages=[
        flytekit,
        "pydantic>2",
        "pandas",
        "pyarrow",
    ],
    apt_packages=["git"],
    registry="localhost:30000",
    builder="default",
)

# downstream (contents of superset_A.py, shown here for reference)
class A(BaseModel):
    a: int
    b: Optional[int] = None
    c: str = "Flyte"

# upstream
class A(BaseModel):
    a: int

class B(BaseModel):
    b: str

@task(container_image=image)
def bar() -> A:
    return A(a=1)

@task(container_image=image)
def foo(inp: Union[supersetA, B]):
    print(inp)

@workflow
def wf():
    v = bar()
    foo(inp=v)

if __name__ == "__main__":
    wf()

Setup process

git clone https://github.com/flyteorg/flyte.git
gh pr checkout 5858
make compile
POD_NAMESPACE=flyte ./flyte start --config flyte-single-binary-local.yaml

Screenshots

  1. Example using union input with identical dataclass on remote (union_test_dataclass.py)
     [screenshot]
  2. Example with superset on remote (superset_dataclass.py)
     [screenshot]
  3. Example using union input with identical BaseModel class on remote (union_test_basemodel.py)
     [screenshot]
  4. Example with superset on remote (superset_basemodel.py)
     [screenshot]
  5. Input is a dataclass and the superset is an inherited BaseModel
     [screenshot]

Note for Optional values in JSON Schemas

When handling Optional values in Python, both NoneType and the target type are accepted. However, unless a default value is provided, such a field must still be supplied when the object is constructed. This is why Optional properties without default values are marked as required in the JSON schemas.
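
A small illustration of that rule, sketched with mashumaro's draft 2020-12 schema builder (field names are arbitrary):

from dataclasses import dataclass
from typing import Optional

from mashumaro.jsonschema import build_json_schema

@dataclass
class NoDefault:
    a: Optional[int]          # accepts None, but must still be passed explicitly

@dataclass
class WithDefault:
    a: Optional[int] = None   # has a default, so it is truly optional

print(build_json_schema(NoDefault).to_dict().get("required"))   # ['a']
print(build_json_schema(WithDefault).to_dict().get("required")) # None (field omitted)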

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

None

Docs link

TODO

codecov bot commented Oct 21, 2024

Codecov Report

Attention: Patch coverage is 64.76190% with 37 lines in your changes missing coverage. Please review.

Project coverage is 36.96%. Comparing base (b5f23a6) to head (ada05ed).
Report is 11 commits behind head on master.

Files with missing lines | Patch % | Lines
flytepropeller/pkg/compiler/validators/typing.go | 64.76% | 29 Missing and 8 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5858      +/-   ##
==========================================
+ Coverage   36.90%   36.96%   +0.06%     
==========================================
  Files        1310     1310              
  Lines      131372   131487     +115     
==========================================
+ Hits        48477    48608     +131     
+ Misses      78682    78658      -24     
- Partials     4213     4221       +8     
Flag Coverage Δ
unittests-datacatalog 51.58% <ø> (ø)
unittests-flyteadmin 54.07% <ø> (+0.01%) ⬆️
unittests-flytecopilot 22.23% <ø> (ø)
unittests-flytectl 62.39% <ø> (ø)
unittests-flyteidl 6.92% <ø> (ø)
unittests-flyteplugins 53.84% <ø> (ø)
unittests-flytepropeller 43.15% <64.76%> (+0.25%) ⬆️
unittests-flytestdlib 55.31% <ø> (ø)

Flags with carried forward coverage won't be shown.

@pingsutw pingsutw assigned mao3267 and pingsutw and unassigned pingsutw Oct 23, 2024
@Future-Outlier Future-Outlier self-assigned this Oct 28, 2024
@fg91 (Member) commented Nov 1, 2024

One note @mao3267, this problem does not only affect dataclasses but any other type that uses protobuf struct as transport. Even combinations of different types that all use protobuf struct.
We for instance have an internal type transformer for pydantic base models (historic reasons before an official one was introduced as a plugin). It has exactly the same problem because it also uses protobuf struct.
It would be nice if the solution found for this problem were general and not only working for dataclasses. That's what I'm slightly worried about when reading

distinguish between different dataclasses, we compare their JSON schemas generated by either marshmallow_jsonschema.JSONSchema (draft-07) or mashumaro.jsonschema.build_json_schema

Maybe we can use the literal type's "type structure" field for this? Or we should document how transformers for other types can provide the schema in a way that they can "participate in the logic".

TL;DR: It would be nice if the compiler logic in flytepropeller didn't have "special treatment" for dataclasses but general treatment for json-like structures with schemas that the dataclass transformer makes use of - but other transformers can as well.

@mao3267 mao3267 changed the title [WIP] Fix Union type with dataclass ambiguous error and support superset comparison Fix Union type with dataclass ambiguous error and support superset comparison Nov 8, 2024
@Future-Outlier (Member) commented:

[quoting @fg91's comment above in full]

Just discussed with @mao3267; he will explain how it works now. In summary, this will support both dataclass and Pydantic BaseModel.

@Future-Outlier (Member) commented:

Let's get this done this week @mao3267

@mao3267 mao3267 marked this pull request as ready for review November 11, 2024 02:48
@mao3267 (Contributor, Author) commented Nov 11, 2024

Just discussed with @mao3267; he will explain how it works now. In summary, this will support both dataclass and Pydantic BaseModel.

Currently, we support both Pydantic BaseModel and dataclass, including their combinations and nested structures. For dataclass_json, we only support equivalence without superset matching, due to significant differences between its JSON schema (draft-07) and the newer draft (2020-12) used by both Pydantic and dataclass.
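
For reference, a hedged sketch of how the draft-07 schema is obtained for a dataclass_json class (assuming dataclasses_json and marshmallow_jsonschema are installed):

from dataclasses import dataclass

from dataclasses_json import dataclass_json
from marshmallow_jsonschema import JSONSchema

@dataclass_json
@dataclass
class A:
    a: int

# dataclass_json attaches a marshmallow schema to the class;
# marshmallow_jsonschema renders it as a draft-07 JSON schema.
draft07_schema = JSONSchema().dump(A.schema())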

Although the schemas from Pydantic BaseModel and dataclass adhere to the same draft version, they differ in how they handle fields. For instance, only the schema generated by Pydantic BaseModel records the title in properties. Additionally, the required and additionalProperties fields are omitted if no required properties exist or if additional properties are disallowed. To address these discrepancies, we preprocess the schemas before comparison: we remove the title field and adjust the comparison logic for the required and additionalProperties fields.
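
A rough Python sketch of that preprocessing (the real implementation is the Go code in flytepropeller's typing.go; the helper names here are illustrative, and nested lists such as anyOf are skipped for brevity):

def strip_titles(schema: dict) -> dict:
    # Pydantic adds "title" entries that dataclass schemas lack, so drop
    # them everywhere before comparing.
    return {
        k: strip_titles(v) if isinstance(v, dict) else v
        for k, v in schema.items()
        if k != "title"
    }

def normalize(schema: dict) -> dict:
    s = strip_titles(schema)
    # Per the behavior described above, "required" and "additionalProperties"
    # may be omitted entirely, so fill in the values they stand for when
    # omitted before comparing.
    s.setdefault("required", [])
    s.setdefault("additionalProperties", False)
    return s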

@wild-endeavor (Contributor) commented Nov 11, 2024

The exact-match part is okay and we should fix that, but I'm confused about the other part.

looking at this example

@task
def my_task(input: Union[supersetA, B]):
    print(input)

@workflow
def wf():
    a = foo()
    my_task(a)

can you explain why this should work? foo creates an A object with only one field. my_task is a task that takes in supersetA or B. B is not relevant here. supersetA takes in three fields.

Why doesn't mypy complain in this example? Or does it?

I almost feel like if we're going to go down this route, it should be the other way around. If foo returned supersetA and my_task took in union of A and B. The reason is that supersetA contains more fields than A

cc @fg91 if you want to take a look as well.

Wasn't there another PR where we were discussing an LGPL library also?

@@ -19,6 +19,8 @@ type trivialChecker struct {
}

func removeTitleFieldFromProperties(schema map[string]*structpb.Value) {
// TODO: Explain why we need this
// TODO: give an example about dataclass vs. Pydantic BaseModel
@mao3267 (Contributor, Author) commented:
This is an example comparing dataclass and Pydantic BaseModel. As shown, the schema generated from the Pydantic BaseModel includes a title field in each property, while the additionalProperties field (with value false) appears only in the dataclass schema and is absent from the Pydantic BaseModel one. cc @eapolinario

[screenshots: dataclass schema vs. Pydantic BaseModel schema]

@mao3267 (Contributor, Author) commented:

For the comment, writing out the entire schema would make it too lengthy. Would it be acceptable to use something like this instead?

class A:
	a: int

Pydantic.BaseModel: 	{"properties": {"a": {"title": "A", "type": "integer"}}}
dataclass: 			{"properties": {"a": {"type": "integer"}}, "additionalProperties": false}

@fg91 (Member) commented Nov 12, 2024:

Are you proposing to preprocess the schemas so that one can mix and match dataclasses and base models given their schemas are aligned? I.e. task expects a dataclass with schema "A" and I pass a base model that has the same schema.

I personally feel this is not necessary and think it would be totally acceptable to consider a dataclass and a base model not a match by default. Especially if this makes things a lot more complicated in the backend otherwise because the schemas need to be aligned. What do you think about this?

If you are confident in the logic I'm of course not opposing the feature but if you feel this makes things complicated and brittle, I'd rather keep it simple and more robust.

@mao3267 (Contributor, Author) commented:

I think it actually makes things more complicated; I will remove the related logic.

@mao3267 (Contributor, Author) commented Nov 12, 2024

can you explain why this should work? foo creates an A object with only one field. my_task is a task that takes in supersetA or B. B is not relevant here. supersetA takes in three fields.
Why doesn't mypy complain in this example? Or does it?

Class B is used as an example of a type that does not match class A.
Mypy doesn't report any errors. I am not familiar with mypy; what kind of error do you expect it to raise?

I almost feel like if we're going to go down this route, it should be the other way around. If foo returned supersetA and my_task took in union of A and B. The reason is

In the discussion here, we assumed that upstream refers to the exact input type and downstream refers to the expected type for the task input. Did we misunderstand this? By the way, I'm also curious about the reason for supporting this superset matching; it would help us decide our route.

Wasn't there another PR where we were discussing an LGPL library also?

The LGPL discussion applies to this PR as well; it is not mentioned because we are no longer using that library.

cc @Future-Outlier @wild-endeavor

@fg91 (Member) commented Nov 12, 2024

Currently, we support both Pydantic BaseModel and dataclass, including their combinations and nested structures.

@Future-Outlier @mao3267
My question above wasn't specifically about base models but about how generalizable the solution is.
Let's consider the scenario that an org wants to build a custom internal type transformer for a json-like type similar to dataclasses or base models.
Is there a way they can provide the schema of their type in the to_literal_type method of their type transformer so that the backend can automatically perform schema checks for Union types? Or are there implementation details that are required in the backend too that limit this to dataclasses/base models and would be required for any additional json-like type?

The latter would be slightly concerning to me. Glancing over the code gives me the impression that there is quite a bit of dataclass/pydantic logic we need to apply. I wonder whether this could be done in the respective flytekit type transformer so that the backend is agnostic to the respective type and as long as the type transformer provides the schema in the right way, the backend can make use of it.

It would be really great if there was a tutorial in https://docs.flyte.org/en/latest/api/flytekit/types.extend.html in the end that documents how users need to provide the schema in their respective to_literal_type implementation so that the backend can automatically make use of it in the union type checker.

@wild-endeavor (Contributor) commented Nov 12, 2024

can you explain why this should work? foo creates an A object with only one field. my_task is a task that takes in supersetA or B. B is not relevant here. supersetA takes in three fields.
Why doesn't mypy complain in this example? Or does it?

Class B is used as an example of a type that does not match class A. Mypy doesn't report any errors. I am not familiar with mypy; what kind of error do you expect it to raise?

I almost feel like if we're going to go down this route, it should be the other way around. If foo returned supersetA and my_task took in union of A and B. The reason is

In the discussion here, we assumed that upstream refers to the exact input type and downstream refers to the expected type for the task input. Did we misunderstand this?

Oh got it, but this only works because there are defaults, right? The original case you linked to should work because of that.
So if your supersetA was

@dataclass
class A:
    a: int
    c: str
    b: Optional[int] = None

then this should not work, correct? (because c is missing)

By the way, I'm also curious about the reason for supporting this superset matching; it would help us decide our route.

I don't think of this as superset matching; I think of this as schema compatibility, which is why I was thinking we'd find some off-the-shelf library that can just do it for us: 'Is this schema compatible with this other schema?'

Wasn't there another PR where we were discussing an LGPL library also?
The LGPL discussion applies to this PR as well; it is not mentioned because we are no longer using that library.

Is there a comment somewhere that explains why we no longer need, or can't use, that library or a library like it?

Re @fg91's comments

Is there a way they can provide the schema of their type in the to_literal_type method of their type transformer so that the backend can automatically perform schema checks for Union types? Or are there implementation details that are required in the backend too that limit this to dataclasses/base models and would be required for any additional json-like type?

I don't know the answer, but it should be yes to the first question. It should be possible to easily provide a JSON schema and have everything on the backend just work. Isn't this the case @Future-Outlier? There should be no special logic for dataclasses or pydantic in the backend, at all. We should remove it if there is.

@mao3267 (Contributor, Author) commented Nov 13, 2024

Replying to @wild-endeavor:

then this should not work, correct? (because c is missing)

Yes. This should not work.

Is there a comment somewhere that explains why we no longer need, or can't use, that library or a library like it?

No. I just decided to use another package that directly compares the JSON schema. This is not documented anywhere.

Replying to @fg91:

[quoting @fg91's comment above on generalizability and on documenting how custom type transformers should provide their schema]

To support other custom types, could we require the provided schema to come from a specific package, like Mashumaro, for the compatibility check? Without such a restriction, it will be hard to cover all kinds of scenarios. Or does anyone know of possible solutions for supporting JSON schemas from different versions and packages?

@fg91 (Member) commented Nov 13, 2024

There should be no special logic for dataclasses or pydantic in the backend, at all. We should remove it if there is.

This is exactly what I'm trying to say :)

To support other custom types, could we require the provided schema to come from a specific package, like Mashumaro, for the compatibility check? Without such a restriction, it will be hard to cover all kinds of scenarios. Or does anyone know of possible solutions for supporting JSON schemas from different versions and packages?

Yes, I think restricting what kind of schema needs to be supplied is absolutely reasonable! I think it would be good to add a tutorial here https://docs.flyte.org/en/latest/api/flytekit/types.extend.html that states something like "if you want propeller to understand the schema of your type, e.g. to distinguish variants in union types, you need to provide a schema in the to_literal_type method in this specific way". And I personally feel that the dataclass and pydantic type transformers should provide the schema in this general way, so that the backend doesn't have to have type-specific implementations for dataclasses/base models.
What do you think about this?
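
A hedged sketch of what that could look like for a custom transformer (flytekit's actual hook is TypeTransformer.get_literal_type, referred to as to_literal_type above; the type, schema content, and names here are illustrative):

from dataclasses import dataclass

from flytekit.extend import TypeTransformer
from flytekit.models.types import LiteralType, SimpleType

@dataclass
class MyType:
    a: int

class MyTypeTransformer(TypeTransformer[MyType]):
    def __init__(self):
        super().__init__(name="my-type", t=MyType)

    def get_literal_type(self, t) -> LiteralType:
        # Expose a draft 2020-12 JSON schema in metadata so the backend's
        # union checker can compare variants without type-specific logic.
        schema = {
            "type": "object",
            "properties": {"a": {"type": "integer"}},
            "required": ["a"],
            "additionalProperties": False,
        }
        return LiteralType(simple=SimpleType.STRUCT, metadata=schema)

    # to_literal / to_python_value omitted for brevity.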

(As a side note, as I described here, for cache checks we don't use the schema in metadata but the so-called type structure. Maybe it's difficult to fix this in hindsight, but I kind of wish there were a single unified way for type transformers to make the schema available to propeller that is used for everything: cache checks, union type checks, ...)

@wild-endeavor (Contributor) commented:

@fg91 and @EngHabu mind chiming in on dataclass compatibility? Just thinking about it from the basics, if I have JSON schemas representing two dataclasses, let's say,

@dataclass
class A2:
    a: int
    b: Optional[int] = None
    c: str = "hello"

and

@dataclass
class A1:
    a: int

which of the following two are valid?
Case 1

def wf():
  a1 = create_a1()  # -> A1
  use_a2(a2=a1)  # a2: A2

Case 2

def wf():
  a2 = create_a2()  # -> A2
  use_a1(a1=a2)  # a1: A1

Just thinking about compatibility in the loosest sense, both should be valid. The reason is that in the first case, when calling the downstream task use_a2, fields b and c have defaults.

In the second case, the a field can be taken from the a2 object and b and c discarded.

The implication here though is that if you have

@task
def make_a1() -> A1: ...

@task
def use_either(a: typing.Union[A1, A2]): ...

Then this will fail

use_either(a=make_a1())

because A1 will match more than one variant. flytekit itself will not fail I think (right @Future-Outlier?) but we'll never get there because the compiler will fail.

Should we just do exact matches only? Plus, of the two examples earlier (cases 1 & 2), both will fail mypy type checking.

@wild-endeavor (Contributor) commented:

Thank you @pingsutw and @eapolinario for reminding me... the original issue was an issue because of inheritance.

So the case to make work is

@dataclass
class Parent():
    a: int

@dataclass
class Child(Parent):
    b: int

with

def wf():
  c = create_child()  # -> Child
  use_parent(p=c)  # p: Parent

The reason we want to make this work is that it works in Python (type checks with mypy) and is common enough that we feel it should work.

So the upshot is, we should do strict subset matching: if the downstream schema is a subset of the upstream one, and the fields that are there are an exact match, then we can allow it. So in this example, you should be able to pass Child to something that expects Parent, because Child has all the fields of Parent, and those fields are an exact type match.
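
A minimal Python sketch of that strict-subset rule (the compiler change itself is in Go; this only pins down the intended semantics on plain schema dicts):

def is_strict_subset(downstream: dict, upstream: dict) -> bool:
    # Every property the downstream schema (e.g. Parent) declares must exist
    # in the upstream schema (e.g. Child) with an exactly matching
    # sub-schema; extra upstream fields are fine.
    up_props = upstream.get("properties", {})
    for name, sub_schema in downstream.get("properties", {}).items():
        if up_props.get(name) != sub_schema:
            return False
    # Fields required downstream must also be required upstream.
    return set(downstream.get("required", [])) <= set(upstream.get("required", []))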

Also, typing.Union[Parent, Child] is okay to fail because it's ambiguous.

This case should not compile, but I think that's okay: Child overwrites a field in Parent, changing its type, which means it's no longer a strict subset match.

@dataclass
class Parent():
    a: int

@dataclass
class Child(Parent):
    a: Optional[int]

This case should compile fine.

@dataclass
class LeafBase():
    leaf: int

@dataclass
class Leaf(LeafBase):
    other: int

@dataclass
class Parent:
    l: LeafBase

A task that produces Parent(l=Leaf(leaf=5, other=3)) should be able to be passed to a downstream task that takes Parent.
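
Spelled out as tasks (an illustrative sketch reusing the LeafBase/Leaf/Parent dataclasses above; the task names are made up):

from flytekit import task, workflow

@task
def make_parent() -> Parent:
    # The leaf field holds the Leaf subclass, a strict superset of LeafBase.
    return Parent(l=Leaf(leaf=5, other=3))

@task
def take_parent(p: Parent):
    print(p)

@workflow
def wf():
    take_parent(p=make_parent())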

@mao3267 (Contributor, Author) commented Nov 14, 2024

FYI, originally we tried to use the LGPL package json_schema_compare to do the subset check. Even though we are fairly sure the license is compatible with Apache 2.0, we are simply not using it now, for no other specific reason.

If we want to support compatibility checks (the strict subset check) with schemas generated from different packages, maybe we should try the LGPL package, since handling all those schemas ourselves would require package-specific logic, which is not what we want.

@Future-Outlier (Member) commented:

We can create a new workflow version to do this; the upgrade scenario will not fail. I tested it with @pingsutw.
@wild-endeavor also agrees that this is low priority.
Let's close this and work on other high-priority items first.
