module for metadata-aware IO #3017

Open

d-v-b opened this issue Apr 24, 2025 · 3 comments

@d-v-b
Contributor

d-v-b commented Apr 24, 2025

metadata-aware IO (I made this term up, please suggest a better name) is the use of our store API to do IO that depends on zarr semantics, like reading / writing array and group metadata, for each zarr version. E.g., in zarr v2, a function that reads array metadata has to make two requests: one for .zarray and another for .zattrs. A function that reads array metadata for zarr v3 has a different implementation -- it makes just one request, for a different key (zarr.json).
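
To make the difference concrete, here is a minimal sketch. A plain dict stands in for a key/value store, and the function names are made up for illustration; the real store API is async and returns Buffer objects, but the request pattern is the same:

import json
from typing import Any

# a plain dict stands in for a key/value store in this sketch
StoreLike = dict[str, bytes]

def read_array_metadata_v2(store: StoreLike, path: str) -> dict[str, Any]:
    # zarr v2: two requests, one for .zarray and one for .zattrs
    meta = json.loads(store[f"{path}/.zarray"])
    meta["attributes"] = json.loads(store.get(f"{path}/.zattrs", b"{}"))
    return meta

def read_array_metadata_v3(store: StoreLike, path: str) -> dict[str, Any]:
    # zarr v3: one request, for zarr.json (attributes live inside it)
    return json.loads(store[f"{path}/zarr.json"])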

We don't have a single place in our codebase for these operations. In fact, there's some worrying code duplication -- we have a function called get_array_metadata defined in core/array.py that overlaps with _read_metadata_v2 and _read_metadata_v3, which are both defined in core/group.py.

I think we should put these routines in one place. Eventually, that module would contain functions for (a rough sketch follows the list):

  • reading array metadata
  • reading group metadata
  • reading array or group metadata (for zarr v2 this case requires its own implementation for performance reasons)
  • checking if an array exists
  • checking if a group exists
  • writing array metadata
  • writing group metadata
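
A sketch of what that module's surface could look like; the module name and signatures are my guesses rather than settled API, and the import paths are assumptions based on where these classes live in zarr-python today:

# hypothetical zarr/core/metadata_io.py -- all names are placeholders
from zarr.abc.store import Store
from zarr.core.group import GroupMetadata
from zarr.core.metadata import ArrayV2Metadata, ArrayV3Metadata

ArrayMetadata = ArrayV2Metadata | ArrayV3Metadata

async def read_array_metadata(store: Store, path: str) -> ArrayMetadata: ...
async def read_group_metadata(store: Store, path: str) -> GroupMetadata: ...
async def read_metadata(store: Store, path: str) -> ArrayMetadata | GroupMetadata: ...
async def array_exists(store: Store, path: str) -> bool: ...
async def group_exists(store: Store, path: str) -> bool: ...
async def write_array_metadata(store: Store, path: str, meta: ArrayMetadata) -> None: ...
async def write_group_metadata(store: Store, path: str, meta: GroupMetadata) -> None: ...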

None of these functions would return an array or group. They would just return array / group metadata, which could be used to create an array or group as needed. For this reason, I don't think these functions belong in core/array.py or core/group.py, since those modules are concerned with the Array and Group classes. The metadata-aware IO layer, however, cuts across the array/group distinction (e.g., with functions that can return either array or group metadata).

Eventually we may want to formalize the set of all these operations as a protocol.

I'm not sure how chunk IO fits in here.

I reached this conclusion while working on #3012.

@d-v-b
Contributor Author

d-v-b commented Apr 24, 2025

note: this requires a solution to #3018 before it can be viable.

@paraseba
Contributor

A few random thoughts:

  • Shouldn't the Store tell Zarr if it supports V2? New stores have no reason to support V2, and probably zarr shouldn't even try it with them (for performance reasons, but also to allow these Stores to accept a narrower set of keys, which could be beneficial)

  • In general there should probably be more "introspection" between zarr and the Stores. We have a few support_* methods, but there is probably much more we could do (see the sketch after this list). Even in the performance area, for example, a Store could provide a "natural" concurrency limit for its implementation.

  • As another example of the point above: for some stores, "check if X exists" could mean just doing a get, while for others it could be much faster. Zarr can probably optimize differently for these two cases.

  • Eventually we may want to formalize the set of all these operations as a protocol.

    this point confuses me a bit. Who will implement the protocol?

  • I'm not sure how chunk IO fits in here.

    I think this one is a big deal. Storing metadata and storing chunks are very different goals. Today we have a single abstraction for both, but it doesn't feel right: the interfaces are different, and the performance requirements are very different. Ideally, I'd love to mix and match my chunk and metadata Stores; that would enable very powerful things.
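
A rough sketch of what richer store introspection could look like. Apart from the idea itself, everything below (names, types, values) is invented for illustration and does not exist in zarr-python today; only the supports_writes / supports_listing style of capability property is real:

class IntrospectableStore:
    # hypothetical capability flags, in the spirit of zarr-python's
    # existing supports_* properties

    @property
    def supports_v2(self) -> bool:
        # a new backend could refuse zarr v2 keys (.zarray, .zattrs, ...)
        # up front, so zarr never probes for them
        return False

    @property
    def concurrency_limit(self) -> int | None:
        # a "natural" request-concurrency limit for this backend,
        # or None if the backend has no preference
        return 64

    @property
    def has_fast_exists(self) -> bool:
        # True when existence checks are cheaper than a full get,
        # so zarr can pick the cheaper strategy
        return True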

@d-v-b
Contributor Author

d-v-b commented Apr 24, 2025

  • Shouldn't the Store tell Zarr if it supports V2? New stores have no reason to support V2, and probably zarr shouldn't even try it with them (for performance reasons, but also to allow these Stores to accept a narrower set of keys, which could be beneficial)

Our store API right now is at the "arbitrary key/value storage" abstraction level, which means we have not defined a way for a store to express an opinion about certain keys.

But if we used the idea I'm proposing in this issue, then we would define distinct "metadata-aware" IO layers for zarr v2 and zarr v3, as getting array metadata for v2 requires a different implementation than for v3. This would also formalize the notion that zarr v2 and zarr v3 hierarchies should not be mixed (something our current stores cannot enforce).

In schematic form, I am imagining that we would define something like this:

from zarr.core.metadata import ArrayV2Metadata, ArrayV3Metadata
from zarr.storage import LocalStore

# terrible name
class LocalV2MetaStore:
    def __init__(self, store: LocalStore) -> None:
        ...

    def read_array_metadata(self, path: str) -> ArrayV2Metadata:
        ...  # fetch .zarray, .zattrs

class LocalV3MetaStore:
    def __init__(self, store: LocalStore) -> None:
        ...

    def read_array_metadata(self, path: str) -> ArrayV3Metadata:
        ...  # fetch zarr.json, etc.

These two classes would still rely on a store (in the key-value storage sense). I don't think there would be a static way to check for compatibility between a store and one of these metastore classes. But key-value stores that didn't support zarr v2 would certainly fail at runtime.

This example sort of answers your question about who would implement the protocol -- zarr-python would, but anyone else who wants to provide their own metadata storage layer (e.g. icechunk) could implement it as well. It could also be a base class. What matters is that we define an API for doing all the metadata operations necessary for zarr, in such a way that's useful for other people to implement.
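
As a sketch of that framing, using typing.Protocol; the method set, the names, and the assumption of async methods are all mine, and the ArrayV3Metadata import path is an assumption about current zarr-python internals:

from typing import Protocol

from zarr.core.metadata import ArrayV3Metadata

class V3MetadataIO(Protocol):
    # a subset of the metadata operations listed at the top of this issue
    async def read_array_metadata(self, path: str) -> ArrayV3Metadata: ...
    async def write_array_metadata(self, path: str, meta: ArrayV3Metadata) -> None: ...
    async def array_exists(self, path: str) -> bool: ...

# LocalV3MetaStore above would satisfy this structurally, and so could a
# third-party metadata layer like icechunk, without inheriting from
# anything in zarr-python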
