module for metadata-aware IO #3017

Open

d-v-b opened this issue Apr 24, 2025 · 3 comments

@d-v-b
Contributor

d-v-b commented Apr 24, 2025

metadata-aware IO (I made this term up, please suggest a better name) is the use of our store API to do IO that depends on zarr semantics, like reading / writing array and group metadata, for each zarr version. E.g., in zarr v2, a function that reads array metadata has to make two requests: one for .zarray and another for .zattrs. A function that reads array metadata for zarr v3 has a different implementation -- it makes just one request, for a different key (zarr.json).
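
To make the difference concrete, here is a minimal sketch. A plain dict stands in for a key/value store, and the function names are made up for illustration; the real store API is async and returns Buffer objects, but the request pattern is the same:

import json
from typing import Any

# a plain dict stands in for a key/value store in this sketch
StoreLike = dict[str, bytes]

def read_array_metadata_v2(store: StoreLike, path: str) -> dict[str, Any]:
    # zarr v2: two requests, one for .zarray and one for .zattrs
    meta = json.loads(store[f"{path}/.zarray"])
    meta["attributes"] = json.loads(store.get(f"{path}/.zattrs", b"{}"))
    return meta

def read_array_metadata_v3(store: StoreLike, path: str) -> dict[str, Any]:
    # zarr v3: one request, for zarr.json (attributes live inside it)
    return json.loads(store[f"{path}/zarr.json"])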

We don't have a single place in our codebase for these operations. In fact, there's some worrying code duplication -- we have a function called get_array_metadata defined in core/array.py that overlaps with _read_metadata_v2 and _read_metadata_v3, which are both defined in core/group.py.

I think we should put these routines in one place. Eventually, that module would contain functions for (a rough sketch follows the list):

  • reading array metadata
  • reading group metadata
  • reading array or group metadata (for zarr v2 this case requires its own implementation for performance reasons)
  • checking if an array exists
  • checking if a group exists
  • writing array metadata
  • writing group metadata
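
A sketch of what that module's surface could look like; the module name and signatures are my guesses rather than settled API, and the import paths are assumptions based on where these classes live in zarr-python today:

# hypothetical zarr/core/metadata_io.py -- all names are placeholders
from zarr.abc.store import Store
from zarr.core.group import GroupMetadata
from zarr.core.metadata import ArrayV2Metadata, ArrayV3Metadata

ArrayMetadata = ArrayV2Metadata | ArrayV3Metadata

async def read_array_metadata(store: Store, path: str) -> ArrayMetadata: ...
async def read_group_metadata(store: Store, path: str) -> GroupMetadata: ...
async def read_metadata(store: Store, path: str) -> ArrayMetadata | GroupMetadata: ...
async def array_exists(store: Store, path: str) -> bool: ...
async def group_exists(store: Store, path: str) -> bool: ...
async def write_array_metadata(store: Store, path: str, meta: ArrayMetadata) -> None: ...
async def write_group_metadata(store: Store, path: str, meta: GroupMetadata) -> None: ...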

None of these functions would return an array or group. They would just return array / group metadata, which could be used to create an array or group as needed. For this reason, I don't think these functions belong in core/array.py or core/group.py, since those modules are concerned with the Array and Group classes. The metadata-aware IO layer, however, cuts across the array/group distinction (e.g., with functions that can return either array or group metadata).

Eventually we may want to formalize the set of all these operations as a protocol.

I'm not sure how chunk IO fits in here.

I reached this conclusion while working on #3012.

@d-v-b
Contributor Author

d-v-b commented Apr 24, 2025

note: this requires a solution to #3018 before it can be viable.

@paraseba
Contributor

A few random thoughts:

  • Shouldn't the Store tell Zarr if it supports V2? New stores have no reason to support V2, and probably zarr shouldn't even try it with them (for performance reasons, but also to allow these Stores to accept a narrower set of keys, which could be beneficial)

  • In general there should probably be more "introspection" between zarr and the Stores. We have a few support_* methods, but there is probably much more we could do (see the sketch after this list). Even in the performance area, for example, a Store could provide a "natural" concurrency limit for its implementation.

  • As another example of the point above: for some stores, "check if X exists" could mean just doing a get, while for others it could be much faster. Zarr can probably optimize differently for these two cases.

  • Eventually we may want to formalize the set of all these operations as a protocol.

    this point confuses me a bit. Who will implement the protocol?

  • I'm not sure how chunk IO fits in here.

    I think this one is a big deal. Storing metadata and storing chunks are very different goals. Today we have a single abstraction for both, but it doesn't feel right: the interfaces are different, and the performance requirements are very different. Ideally, I'd love to mix and match my chunk and metadata Stores; that would enable very powerful things.
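
A rough sketch of what richer store introspection could look like. Apart from the idea itself, everything below (names, types, values) is invented for illustration and does not exist in zarr-python today; only the supports_writes / supports_listing style of capability property is real:

class IntrospectableStore:
    # hypothetical capability flags, in the spirit of zarr-python's
    # existing supports_* properties

    @property
    def supports_v2(self) -> bool:
        # a new backend could refuse zarr v2 keys (.zarray, .zattrs, ...)
        # up front, so zarr never probes for them
        return False

    @property
    def concurrency_limit(self) -> int | None:
        # a "natural" request-concurrency limit for this backend,
        # or None if the backend has no preference
        return 64

    @property
    def has_fast_exists(self) -> bool:
        # True when existence checks are cheaper than a full get,
        # so zarr can pick the cheaper strategy
        return True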

@d-v-b
Contributor Author

d-v-b commented Apr 24, 2025

  • Shouldn't the Store tell Zarr if it supports V2? New stores have no reason to support V2, and probably zarr shouldn't even try it with them (for performance reasons, but also to allow these Stores to accept a narrower set of keys, which could be beneficial)

Our store API right now is at the "arbitrary key/value storage" abstraction level, which means we have not defined a way for a store to express an opinion about certain keys.

But if we used the idea I'm proposing in this issue, then we would define distinct "metadata-aware" IO layers for zarr v2 and zarr v3, as getting array metadata for v2 requires a different implementation than for v3. This would also formalize the notion that zarr v2 and zarr v3 hierarchies should not be mixed (something our current stores cannot enforce).

In schematic form, I am imagining that we would define something like this:

from zarr.core.metadata import ArrayV2Metadata, ArrayV3Metadata
from zarr.storage import LocalStore

# terrible name
class LocalV2MetaStore:
    def __init__(self, store: LocalStore) -> None:
        ...

    def read_array_metadata(self, path: str) -> ArrayV2Metadata:
        ...  # fetch .zarray, .zattrs

class LocalV3MetaStore:
    def __init__(self, store: LocalStore) -> None:
        ...

    def read_array_metadata(self, path: str) -> ArrayV3Metadata:
        ...  # fetch zarr.json, etc.

These two classes would still rely on a store (in the key-value storage sense). I don't think there would be a static way to check for compatibility between a store and one of these metastore classes. But key-value stores that didn't support zarr v2 would certainly fail at runtime.

This example sort of answers your question about who would implement the protocol -- zarr-python would, but anyone else who wants to provide their own metadata storage layer (e.g. icechunk) could implement it as well. It could also be a base class. What matters is that we define an API for doing all the metadata operations necessary for zarr, in such a way that's useful for other people to implement.
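
As a sketch of that framing, using typing.Protocol; the method set, the names, and the assumption of async methods are all mine, and the ArrayV3Metadata import path is an assumption about current zarr-python internals:

from typing import Protocol

from zarr.core.metadata import ArrayV3Metadata

class V3MetadataIO(Protocol):
    # a subset of the metadata operations listed at the top of this issue
    async def read_array_metadata(self, path: str) -> ArrayV3Metadata: ...
    async def write_array_metadata(self, path: str, meta: ArrayV3Metadata) -> None: ...
    async def array_exists(self, path: str) -> bool: ...

# LocalV3MetaStore above would satisfy this structurally, and so could a
# third-party metadata layer like icechunk, without inheriting from
# anything in zarr-python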
