ZEP9 (phase 1): add clarifications for extension naming #330
@joshmoore - really glad you got this started! 🙌 My feedback is that the PR is hard to review. It touches 15 files, including a ton of minor, unrelated formatting changes to the core spec document. If we want folks to engage and give meaningful feedback, we need to make it easier to review. I'd recommend starting fresh with a minimal PR in which the diffs are reflective exclusively of the actual proposed changes. |
Remaining text blocks are likely to be re-used under the more general "Extension points" section. see: zarr-developers#312
👍
You're right. I've extracted out #331.
I disagree that they are unrelated. Take a look. The sections I've modified were basically already un-parseable. Since I was adding sections, the outline was getting more convoluted.
👍 Give it a look and let me know what you think. |
Thanks for all of your work on this! My current understanding of the practical effect of the proposal is as follows: raw names will be granted fairly easily, e.g. zstd, bfloat16, and others I've proposed would be assigned to me, and the ones that zarr-python has started using (string, bytes, vlen-utf8, etc.) would be assigned to someone from zarr-python. URL names will be used only for really experimental stuff; all commonly-used extensions will have raw names, since registering them will take minimal effort. Therefore, the verbosity of the URLs is not really a problem in practice.
The lack of basically any review worries me a bit. But ultimately I'm in favor of this proposal because I think it reflects the reality that the ZEP process isn't working for the existing extension points, and it would be better to just rely on a less formal process. |
I share your concerns to some degree. I think we can adapt the governance structure for extensions in the future, if we think that a more thorough review process would be necessary. We are thinking of forming a zarr specs team that could take on that responsibility. |
Thanks @rabernat. Fundamentally, the arrow extension mechanism seems very similar to what we have been proposing here, which is reassuring. We can of course continue discussing the details, but we should also be sure to steer away from a design-by-committee situation.
For Zarr, I don't see a good reason for differentiating between "bare names" and prefixed "canonical extensions". We can control the naming of both uniformly through the zarr-extensions repo. Extensions can include prefixes, if they want, but I wouldn't force that. Our process for registering raw names is definitely more lightweight than registering a canonical extension in arrow.
Re "user-defined extensions": I want to remind everybody that in our proposal it is very easy to register a "raw name" in zarr-extensions. Also, it is an express goal of our proposal to avoid naming conflicts. I wouldn't want to step back from that.
Re dropping URLs: URLs have a couple of nice properties that we want: avoiding naming conflicts, self-documentation, and compatibility with json-ld. On the other hand, the downside of URLs seems to boil down to people finding URLs weird. So, I would be inclined to stick with URLs. Also, I want to reiterate that it will be very easy to register raw names, so most extensions in the field will not use URLs.
Re maturing extensions: In an earlier version of our proposal, extensions would mature by changing their names (i.e. from URL, through prefixed, to raw name). Now, we think it is better to find a different way of denoting maturity so that extensions don't have to change their name, which would create unnecessary complexity for implementations.
In summary, I think our two-level naming system (i.e. centralized through zarr-extensions and uncoordinated free-for-all) is less complex than adopting arrow's system and would work really well for Zarr and fit the current community practice. |
Just to emphasize, I want a boring, simple solution here. Our default solution should be to copy something that has worked in a similar project. If you are rejecting something that has worked for another project, then I would like to see an engineering-based explanation for that decision. On its own terms, the arrow spec is pretty simple: there are two types of extensions. The first extension type is decentralized, and the spec makes NO requirements for what names they use, only recommendations:
IMO this language is very easy for implementations to understand. It doesn't aim to globally prevent name collisions, but I suspect that is OK in practice. We should learn from this. The second extension type is centralized, and there are more requirements, but crucially, there are no explicit name requirements. Instead, all the requirements are scoped to the extension itself. I think it's safe to assume that any name collisions will be handled by the process of vetting the extension. I think both of these extension types could work for us. I also think the arrow spec is simpler than this PR, because the arrow spec imposes fewer requirements. For an extension developer, you just have to choose a name that composes with the extensions your implementation already knows about, and you are done. That's far simpler than introducing a dependency on a separate github repo. As an implementation developer, I would be happy working with something like the arrow spec. I cannot say the same for this PR in its current state. |
to elaborate on this: I am specifically opposed to the requirement that extension names be registered on github OR a URL. I'm open to discuss alternatives to this requirement (e.g., making it a suggestion, or finding another way entirely to achieve the goals of this requirement). |
Thanks for all your work on enabling extensions! I have a few comments based on my experiences contributing to the GeoZarr WG over the past couple years and reading through all the linked documents.
Specific concern around URLs
I wanted to offer a couple recent experience-based observations that could help make concerns with URLs a bit more concrete:
If it's truly necessary to have a persistent identifier for extensions, DOIs were created for this purpose. The one property in the original description of URLs that isn't possible with DOIs is arguably self-describability, but in practice URLs can be quite challenging to interpret without visiting the hosted content.
Recommendation to bring back the
|
A DOI would already be allowed as a URL, but being a numeric identifier would make the metadata very difficult for humans to interpret. Do you have a specific alternative proposal for naming? The name registration process already would seem to address all of these concerns.
I'm not sure I understand the argument regarding performance benefits. But I'm also not sure exactly what you have in mind regarding being able to distinguish which keys are extensions vs core --- are you saying that you want The https://github.com/zarr-developers/zarr-extensions repo already provides a unified listing of things regardless of whether they are in the core spec or an extension. I think this proposal is in part an acknowledgement that the ZEP process has not worked well for defining extensions under the existing extension points and I expect that if this proposal is accepted, no new extensions under the existing extension points may be added to the core spec, and the ZEP process would only be used for new extension points.
Extensions are explicitly intended for things that alter the behavior of the zarr implementation itself. OME-Zarr just builds on top of Zarr and therefore would not be considered a zarr extension. @rabernat previously proposed to call such things as OME-Zarr "zarr conventions". I actually expect that it would be relatively unlikely for a zarr extension to store metadata in However, it could be very reasonable for there to be a registry of attribute names that is very similar to the registry of zarr extensions. The only issue is that currently there are no restrictions on what attribute keys are allowed and therefore it is not clear how to distinguish "registered attributes" from plain attributes. In any case I think it would be good to limit the scope of this discussion to just proper zarr extensions in the interest of getting that part sorted out more quickly and efficiently. |
Thanks @maxrjones. I largely agree with @jbms's reply, but would like to add two points:
In ZEP 9, we proposed the
Currently, OME-Zarr puts all its metadata under |
In the interest of speed and since it's easier to add than take away, you could just take out URLs as Davis and Ryan asked for, see how it goes with requiring non-url based registration in the zarr-extensions repo, and add it as a new PR if it seems like it's necessary. As I've now given both my concrete concern and a specific proposal, I'm not going to engage on the URL debate further. Thanks again for working on this! Regarding my other comments, I am now more confused about what could become a proper Zarr extension and therefore fall under the purview of these naming requirements. I started a thread on Zulip to get clarification without diluting the discussion on this PR, if anyone would be willing to offer clarifications there 🙏 |
Just wanted to note that there are two new contributions for dtypes and codecs in V3 over in Zarr Python
These offer a great opportunity for us to explore the implications of this ZEP. What sort of guidance would we provide to these contributors on naming their codecs? |
It's a good question, @rabernat. From the current written text, their next step would be to open a PR against zarr-extensions (And I know having testers there would make @normanrz happy). If they didn't want to do that, they could use a URL (e.g., What would you want the guidance to them to look like? |
I like that! |
I appreciate everyone's constructive comments and feedback. Here is a set of changes that I believe addresses some of the remaining points around extension naming. Specifically, it replaces URL-based extension names with Arrow-style namespaced extensions.
Simply dropping URLs is not ideal. There will inevitably be organizations who want and need totally private extensions, which they never intend to share with the rest of the world. This should be explicitly allowed by the spec. This is what namespaced extensions are for. This also covers all of the development scenarios.
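To make the private-extension scenario concrete, here is a minimal sketch in Python. It assumes the Zarr v3 convention that each codec entry is an object with a `name` and an optional `configuration`; the name `acme.fastpack`, its `level` setting, and the `KNOWN_CODECS` table are invented for illustration and do not refer to any real extension or implementation.

```python
# Hypothetical fragment of array metadata. "acme.fastpack" is an invented
# private, namespaced codec name; it is not a registered extension.
metadata = {
    "codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "acme.fastpack", "configuration": {"level": 3}},
    ]
}

# Hypothetical set of codec names that one particular reader can decode.
KNOWN_CODECS = {"bytes", "gzip", "acme.fastpack"}


def unknown_codec_names(meta: dict, known: set[str]) -> list[str]:
    """Return the codec names in `meta` that the reader does not recognize."""
    return [c["name"] for c in meta["codecs"] if c["name"] not in known]
```

A reader inside the organization that owns the namespace would list the private name among its known codecs, while a generic reader would report it as unknown and could refuse to open the array.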
docs/v3/core/index.rst
Extension naming | ||
---------------- | ||
|
||
The `name` field of an extension can take two forms: **raw names** and **URL-based names**. |
The `name` field of an extension can take two forms: **raw names** and **URL-based names**. | |
There are two types of extensions names: | |
- **raw names** - intended for well-known extensions aimed at broad adoption and maximum interoperability. | |
- **namespaced extensions** - intended for private extensions and development purposes. |
I feel like it's more clear to say that there are two types of extensions -- those that are centrally registered (these may have raw names OR namespaced names), and those that are not centrally registered (these should have namespaced names).
Why would we be registering namespaced codecs? In my head these were mutually exclusive categories. What you describe sounds more confusing and ambiguous.
If codec `foo.codec` becomes popular enough that its owners want to register it centrally, doesn't it make sense to keep the same name?
maybe it would help to sketch out the process for "publishing" an extension that was previously unpublished -- it sounds like you would want the prefix to be removed?
In principle, someone else might be using the same namespace for their own private use. Registering a namespaced name creates the possibility of a conflict.
I think this is a manageable risk. People creating their own extensions in private should be aware that there can be no guarantee that the name they are using will be globally reserved for them. Forcing registered extensions to change their name seems disruptive and I would rather avoid that.
But I think my main concern is conflating the structure of the extension name (raw vs prefixed) with its intended scope -- the text currently reads as if namespaced names cannot be used when broad interoperability is desired. But IMO organizations should be allowed to use namespaced names in this context, in the same way that geoarrow defines their extensions under the `geoarrow` prefix.
docs/v3/core/index.rst
Raw names MUST be assigned within a central repository. | ||
Raw names are unique and immutable. | ||
Raw names MUST start with one lower case letter a-z and then be followed | ||
by only lower case letters a-z, numerals 0-9, underscores, dashes, and dots. |
by only lower case letters a-z, numerals 0-9, underscores, dashes, and dots. | |
by only lower case letters a-z, numerals 0-9, underscores, and dashes. |
Dot characters are forbidden, to avoid confusion with namespaced extensions.
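The raw-name rule with this suggestion applied can be sketched as a small validator. This is one possible reading rather than normative text: it assumes a single lower-case letter on its own is acceptable, and the names `RAW_NAME_RE` and `is_raw_name` are made up for the example.

```python
import re

# Assumed reading of the suggested rule: one lower-case letter a-z, then any
# number of lower-case letters, digits, underscores, or dashes. Dots are
# rejected so that raw names cannot be confused with namespaced names.
RAW_NAME_RE = re.compile(r"[a-z][a-z0-9_-]*")


def is_raw_name(name: str) -> bool:
    """Return True if `name` fits the suggested raw-name syntax."""
    return RAW_NAME_RE.fullmatch(name) is not None
```

Under this reading, names mentioned in the thread such as `zstd`, `bfloat16`, and `vlen-utf8` pass, while a dotted name such as `numcodecs.zstd` does not.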
docs/v3/core/index.rst
* For raw names that are coming from well-known projects, use the same prefix followed | ||
by a dot for requesting your raw name, e.g. "numcodecs.". Other examples of prefixes can | ||
be found in the `zarr-extensions`_ repository. |
Note that these would be considered namespaced extensions under my proposal.
related to my comment above but we should clarify early on that registered names can also be namespaced.
Before too many comments are made here, a heads up that I'd appreciate having this as a PR so it can be downloaded and built with sphinx, check for warnings, etc. I didn't find a way to do that for @d-v-b's comments in the first round. |
Ok I will make a PR. |
Thanks! I assume the comments above though give folks a good sense of what you're thinking and the comments can start. Looking forward to everyone's feedback. Edit: I realized that I could accept all the comments and then revert, but it seems like that could be confusing. |
I made a PR against your branch here: joshmoore#1 |
The current spec says
So aren't all the changes proposed so far a regression? Zarr 3.0 arrays with a URI codec
|
I always found this requirement hard to understand, given that none of the codecs defined alongside the v3 spec followed it, which suggested that it was not actually a real requirement. Are there codecs in the wild that followed this requirement? |
Oh that is a good point, I'd interpreted it as just for codecs not defined in the spec. And you've of course noted in the past the inconsistency in that section. Although I think the intention was clear: use URIs to avoid clashes and make it possible to resolve how the data was encoded. ZEP0009 achieves the same objective and is a big improvement, but it should be adapted to achieve that without potentially invalidating existing data and breaking the stability policy. ZEP0009 can achieve this with a few extra words and no effort from implementations.
I did follow the spec requirement to use URIs for the custom codecs I've got in |
Interesting points, @LDeakin. Thanks. ZEP9's goal was to be as conformant as possible to the various interpretations in v3.0 ("use URIs" while clearly defining raw names). My earlier change to this PR restricting URIs to URLs to make the identifiers more useful (i.e. self-documenting) was unintentionally breaking. (I had wanted to get back to URIs with a later phase of ZEP9, but of course that doesn't fix it.)
Reading through Ryan's PR, I did wonder if there wasn't a URN (as a specific type of URI) that we could use. The closest I could find would be the "eXperimental" prefix: "Raw"/Registered names could be considered (or renamed to) "shortcuts" for "urn:zarr*"1 and the non-registered names could be definitively prefixed with
Going this route would mean all URIs are again permissible but discouraged. This runs the risk of not always fulfilling @d-v-b's ask for a clear code-compatible label (or "short name" as @jhamman described it from cfconventions) but might balance some of the other priorities that have been expressed above.
1 Here I use |
In joshmoore@5c03a24 I added the following section
|
Yesterday the @zarr-developers/steering-council met to discuss this important issue and build alignment. I'm happy to share we made some great progress and I think the end is in sight. Here's what we agreed. From a practical point of view, the spec will allow raw names, namespaced names, or URIs (discouraged, but needed for backwards compatibility) as extension names. @joshmoore will merge my changes and make a few more tweaks. More long term, @joshmoore will pursue the idea that Zarr extension names will formally become URNs by registering the |
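As a rough illustration of how an implementation might tell the three permitted forms apart, here is a hedged Python sketch. The thread does not fix an exact syntax for namespaced names, so the dotted pattern below is an assumption, as are the function name and the returned labels; the URI check only looks for an RFC 3986 style scheme followed by a colon.

```python
import re

# Raw names as discussed in the review: lower-case start, no dots.
RAW = re.compile(r"[a-z][a-z0-9_-]*")
# Assumption: a namespaced name is two or more raw-style segments joined by dots.
NAMESPACED = re.compile(r"[a-z][a-z0-9_-]*(\.[a-z][a-z0-9_-]*)+")
# A URI begins with a scheme (letter, then letters, digits, "+", ".", "-") and ":".
URI_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def classify_extension_name(name: str) -> str:
    """Classify `name` as 'raw', 'namespaced', 'uri', or 'invalid'."""
    if RAW.fullmatch(name):
        return "raw"
    if NAMESPACED.fullmatch(name):
        return "namespaced"
    if URI_SCHEME.match(name):
        return "uri"
    return "invalid"
```

Because neither raw nor namespaced names may contain a colon, the three categories cannot overlap under these assumptions, and a URI name kept for backwards compatibility is still recognized.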
Co-authored-by: Davis Bennett <[email protected]>
Use namespaced names instead of URLs
This PR clarifies the extension mechanism concept in the v3 specification. Comments on any changes which will break existing implementations are STRONGLY encouraged. Please see zarr-developers/zeps#65 for background material.
TODOs:
Post-merge: