add a runtime type checker for metadata objects #3400
Conversation
this is pretty substantial so I would appreciate a lot of eyes @zarr-developers/python-core-devs. If anyone has concerns about whether we should do any runtime type checking at all, maybe send those thoughts to the issue this PR closes. I'm going to keep working on tests for the type checker, but so far it's working great.

This PR does end up violating Liskov for a few subclasses of our …. Similarly, there are lots of ….

@TomAugspurger I think you in particular will appreciate some of the effects of this PR, since we can annotate methods like …. That being said, I think the ArrayMetadata class will still need to do some internal consistency checks, like ensuring that the number of dimension names matches the length of the shape.
Codecov Report

@@           Coverage Diff            @@
##             main    #3400      +/-   ##
==========================================
- Coverage   94.92%   94.90%    -0.03%
==========================================
  Files          79       80        +1
  Lines        9503     9752      +249
==========================================
+ Hits         9021     9255      +234
- Misses        482      497       +15
In the dev meeting today @jhamman rightly challenged me to justify writing our own in-house runtime type checker instead of bringing in pydantic, and I had a few answers. My calculation was that ~500 LoC for an API that does exactly what we need (and nothing else) was a better deal than bringing in a big, celebrity dependency that would require pretty substantial changes to basic Zarr Python classes. Other folks might see this calculation differently.
as illustration, consider this diff enabled by this PR:

@@ -286,14 +283,7 @@ class NullTerminatedBytes(ZDType[np.dtypes.BytesDType[int], np.bytes_], HasLengt
         True if the input is a valid representation of this class in Zarr V3, False
         otherwise.
         """
-        return (
-            isinstance(data, dict)
-            and set(data.keys()) == {"name", "configuration"}
-            and data["name"] == cls._zarr_v3_name
-            and isinstance(data["configuration"], dict)
-            and "length_bytes" in data["configuration"]
-            and isinstance(data["configuration"]["length_bytes"], int)
-        )
+        return check_type(data, NullTerminatedBytesJSON_V3).success

I worry that if type safety requires writing tedious, error-prone functions like what we have in main, we might not do it, and then we lose type safety. Boiling it down to a much simpler function makes type safety look a lot easier.
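For reference, the declarative type that `check_type` validates against could look roughly like the following. This is a hypothetical sketch of a `NullTerminatedBytesJSON_V3`-style TypedDict (the literal name value and field layout are assumptions, not copied from the PR); the point is that the validation rules live in a type definition rather than a hand-written checker.

```python
from typing import Literal, TypedDict


# Hypothetical sketch: the real NullTerminatedBytesJSON_V3 in the PR may differ.
class LengthBytesConfigJSON(TypedDict):
    length_bytes: int


class NullTerminatedBytesJSON_V3(TypedDict):
    # the exact literal value is an assumption made for this illustration
    name: Literal["null_terminated_bytes"]
    configuration: LengthBytesConfigJSON
```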
Gave a quick look, but haven't gone through in detail. At a high level:
- I'd prefer to avoid a dependency on pydantic (or any other runtime validators)
- At a glance, the implementation looks reasonable, but I haven't had a chance to go through it in detail (and likely won't for a while)
- I'd say anything we can do to limit scope, with the goal of limiting complexity, is worth considering.
class CodecJSON_V2(TypedDict, Generic[TName]):
    """The JSON representation of a codec for Zarr V2"""
Probably should reexport these types here for backwards compatibility.
done in 30d48a8
        If the dictionary data is invalid or incompatible with either Zarr format 2 or 3 array creation.
        """
-       metadata = parse_array_metadata(data)
+       metadata = parse_array_metadata(data)  # type: ignore[call-overload]
Is the `type: ignore` needed due to the presence of `ArrayMetadata` in the signature for `__init__`?
we need the overload because the `data` input to `from_dict` is typed as `dict[str, JSON]`. I could change that to the union of the two typeddict metadata types, but this will incompatibly override the base class implementation of `from_dict`, and so I will need another `# type: ignore` there. The only way to clean all of this up is to redo our base `Metadata` ABC, which I deemed out of scope for this PR.
""" | ||
|
||
kind: Literal["inline"] | ||
must_understand: Literal["false"] |
Is this correct for v2 consolidated metadata? IIRC, it didn't have `must_understand` and maybe had a `version` field? I guess maybe I'm mixing up consolidated metadata as written by zarr-python 2.x, and the v3 spec.
wasn't v2 consolidated metadata exclusively via an external `.zmetadata` file, with no change to `.zattrs`?
you're right, the `.zmetadata` file had version and `"metadata"` keys:

out = {
    "zarr_consolidated_format": 1,
    "metadata": {key: json_loads(store[key]) for key in store if is_zarr_key(key)},
}

But the above typeddict is for modelling the use of our inline consolidated metadata model for Zarr V2 arrays. I would need a totally separate type for the zmetadata contents.
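If a separate type were added for that, it might look roughly like this; a hypothetical sketch only (the name `ZMetadataJSON` is not from this PR):

```python
from typing import Any, TypedDict


# Hypothetical sketch, not part of this PR: a TypedDict for the contents of a
# zarr-python 2.x ".zmetadata" document.
class ZMetadataJSON(TypedDict):
    zarr_consolidated_format: int
    # mapping from zarr key (e.g. "group/.zarray") to its parsed JSON document;
    # Any stands in for zarr's JSON alias to keep this sketch self-contained
    metadata: dict[str, Any]
```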
src/zarr/core/type_check.py
    return tp


def check_type(obj: Any, expected_type: Any, path: str = "value") -> TypeCheckResult:
Can we narrow the type on `expected_type` here? And do we gain anything if we do?
I'd like to minimize complexity. We know we have a relatively constrained set of expected objects we're going to be passing here (node metadata, codec configuration, ...). If there's anything we can leverage to make this simpler, let's do it.
I think you're kind of asking if we could define a "type-checkable type", i.e. a union of all the types we support runtime type checking for. I'd like that, but I haven't looked into setting that up yet. I don't know what the blockers would be, and I think it would be potentially helpful to more clearly define the scope of this checker.

Although, in its current implementation, any type that isn't associated with a specific checking routine falls back to `isinstance`, which does work for pretty much all user-defined classes...
have a look at c7096b1, it wasn't too painful to narrow from `Any` to `type | types.UnionType | ForwardRef | None`.
T = TypeVar("T")


def ensure_type(obj: object, expected_type: type[T], path: str = "value") -> T:
`path` is unused?
it's passed to `check_type` a few lines down
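For context, a minimal sketch of how such a wrapper might forward `path` to `check_type`; this is only an illustration of the shape of the function, not the PR's actual implementation (the exception type and message are assumptions):

```python
from typing import TypeVar, cast

from zarr.core.type_check import check_type

T = TypeVar("T")


def ensure_type(obj: object, expected_type: type[T], path: str = "value") -> T:
    # delegate validation to check_type, forwarding `path` so error messages
    # can point at the offending location
    result = check_type(obj, expected_type, path)
    if not result.success:
        # raising TypeError here is an assumption made for this sketch
        raise TypeError(f"expected {expected_type!r} at {path!r}, got {type(obj)!r}")
    return cast(T, obj)
```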
"array": ArrayV3Metadata.from_dict( | ||
{ | ||
**array_metadata, | ||
**array_metadata, # type: ignore[typeddict-item] |
mypy doesn't know about structural assignability in typeddicts, i.e. the fact that typeddicts tolerate extra keys.
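A small illustration of the kind of mismatch being described, with hypothetical names (not code from this PR): at runtime a dict carrying extra keys still satisfies the narrower TypedDict, but mypy flags the spread.

```python
from typing import TypedDict


class Narrow(TypedDict):
    zarr_format: int


class Wide(TypedDict):
    zarr_format: int
    node_type: str


wide: Wide = {"zarr_format": 3, "node_type": "array"}

# Structurally, a Wide value also works as a Narrow value: the extra
# "node_type" key is harmless at runtime. mypy nonetheless reports an error
# for the unexpected key when the dict is spread into a Narrow-typed value.
narrow: Narrow = {**wide}  # mypy complains here; runs fine
```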
This PR adds a runtime type checker specifically for checking JSON-like data against a type definition. It's currently a draft while I get the test suite happy and refine the API, but it's also ready for people to look at and try out. I'm pretty convinced of its utility, but I also think we should have a good discussion about whether this feature is a good idea.
Demo
The basic API looks like this:
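A rough sketch of the kind of usage this enables, pieced together from the `check_type` / `ensure_type` signatures shown in the review comments below; the types used here are hypothetical and the real demo may differ in its details (e.g. how errors are reported).

```python
from typing import Literal, TypedDict

from zarr.core.type_check import check_type, ensure_type


class BloscConfigJSON(TypedDict):
    # hypothetical metadata fragment, used only for this sketch
    cname: Literal["zstd", "lz4"]
    clevel: int


ok = {"cname": "zstd", "clevel": 5}
bad = {"cname": "zstd", "clevel": "five"}

print(check_type(ok, BloscConfigJSON).success)   # True
print(check_type(bad, BloscConfigJSON).success)  # False

# ensure_type hands the value back with a narrowed static type
# (raising on mismatch is assumed behavior in this sketch)
config = ensure_type(ok, BloscConfigJSON)
```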
Some aspects might evolve while this is a draft, like the nature of the error messages.
Supported types
This is not a general-purpose type checker. It is targeted at the types relevant for Zarr metadata documents, and so it supports the following narrow set of types:
cost
maintenance burden
The type checker itself is ~530 lines of commented code, broken up into functions which are mostly easy to understand. The typeddict part, and the logic for resolving generic types, are convoluted and potentially sensitive to changes in how Python exposes type annotations at runtime. Many type annotation features have been designed for static type checkers and not for use within a Python program, so some of this is rather fiddly. But I don't think we are relying on any brittle or private APIs here.
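To give a sense of the runtime introspection involved, these are the standard-library hooks such a checker generally builds on (a generic illustration with a hypothetical TypedDict, not code from the PR):

```python
from typing import Literal, TypedDict, get_args, get_origin, get_type_hints


class ChunkGridJSON(TypedDict):
    # hypothetical metadata fragment used only for this illustration
    name: Literal["regular"]
    configuration: dict[str, int]


# get_type_hints resolves string / forward-referenced annotations into real
# objects at runtime
hints = get_type_hints(ChunkGridJSON)

# get_origin / get_args decompose parameterized types so each piece can be
# checked separately, e.g. dict[str, int] -> dict, (str, int)
origin = get_origin(hints["configuration"])
args = get_args(hints["configuration"])

# TypedDicts also expose which keys are required vs optional
print(ChunkGridJSON.__required_keys__, ChunkGridJSON.__optional_keys__)
print(origin, args)
```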
performance
As currently implemented, the type checker will report all detectable errors. This is wasted compute when we don't care exactly how mismatched the data is, but it is a better user experience. We might need to tune this if performance becomes a problem, e.g. by introducing a `"fail_fast"` option that returns on the first error.

benefit
We can instantly remove a lot of special-purpose functions. Most of the functions named `parse_*` (~30+ functions) and essentially all of the functions named `*check_json*` (~30 functions) could be replaced or simplified with the `check_type` function.

We can also make our JSON loading routines type-safe:
mypy could not infer the type correctly, but basedpyright does:
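Roughly the sort of thing meant here, sketched with a hypothetical metadata TypedDict (the PR's actual loading routines and type names may differ, and this sketch does not try to reproduce the exact mypy vs. basedpyright inference behaviour):

```python
import json
from typing import Literal, TypedDict

from zarr.core.type_check import ensure_type


class GroupMetadataJSON_V3(TypedDict):
    # hypothetical stand-in for the PR's real metadata TypedDicts
    zarr_format: Literal[3]
    node_type: Literal["group"]


def load_group_metadata(raw: bytes) -> GroupMetadataJSON_V3:
    data = json.loads(raw)  # statically this is Any
    # ensure_type validates the document at runtime and hands back a value
    # with a precise static type, so callers get type safety for free
    return ensure_type(data, GroupMetadataJSON_V3)
```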
We could write a bespoke function that specifically checks all the possibilities for Zarr V3 metadata, but then we would need to painfully modify that function by hand to support something like this:
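The original example is not reproduced above; as a hypothetical stand-in, imagine a spec extension adding a new codec object. With a declarative approach, supporting it is just another TypedDict added to a union, with no hand-written checker to edit (all names below are made up for illustration, and unions of TypedDicts are assumed to be within the checker's supported set):

```python
from typing import Literal, TypedDict

from zarr.core.type_check import check_type


class GzipCodecJSON(TypedDict):
    name: Literal["gzip"]
    configuration: dict[str, int]


# a hypothetical new codec introduced by a future spec change
class ShinyNewCodecJSON(TypedDict):
    name: Literal["shiny_new_codec"]
    configuration: dict[str, int]


# widening the union is the whole change
CodecJSON = GzipCodecJSON | ShinyNewCodecJSON

result = check_type({"name": "shiny_new_codec", "configuration": {}}, CodecJSON)
print(result.success)  # True
```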
alternatives
we could use an external JSON validation / type checking library like pydantic, attrs, msgspec, beartype, etc. But I would rather not add a dependency. With the approach in this PR, we keep control in-house, and because this PR just adds functions, it composes with the rest of our codebase as it currently stands. (FWIW, right now this type checker doesn't do any parsing, it only validates. If you think we should parse instead of just validating, then IMO that's a job for our array metadata classes.)
we could also do nothing, and continue writing JSON parsing code by hand. But I would rather not do that, because this invites bugs and makes it hard to keep up with sneaky spec changes. Specifically, I'm planning on writing a lot of new types to model the codecs defined in #3376, and I would rather just write the type and get the type checking (and type safety) for free.
closes #3285