module for metadata-aware IO #3017
Note: this requires a solution to #3018 before it can be viable.
A few random thoughts:
Our store API right now is at the "arbitrary key/value storage" abstraction level, which means we have not defined a way for a store to express an opinion about certain keys. But if we used the idea I'm proposing in this PR, then we would define distinct "metadata-aware" IO layers for zarr v2 and zarr v3, since getting array metadata for v2 requires a different implementation than for v3. This would also formalize the notion that zarr v2 and zarr v3 hierarchies should not be mixed (something our current stores cannot enforce). In schematic form, I am imagining that we would define something like this:

```python
# terrible name
class LocalV2MetaStore:
    def __init__(self, store: LocalStore) -> None:
        ...

    def read_array_metadata(self, path: str) -> ArrayV2Metadata:
        ...  # fetch .zarray, .zattrs


class LocalV3MetaStore:
    def __init__(self, store: LocalStore) -> None:
        ...

    def read_array_metadata(self, path: str) -> ArrayV3Metadata:
        ...  # fetch zarr.json, etc.
```

These two classes would still rely on a store (in the key-value storage sense). I don't think there would be a static way to check for compatibility between a store and one of these metastore classes, but key-value stores that don't support zarr v2 would certainly fail at runtime. This example sort of answers your question about who would implement the protocol -- zarr-python would, but anyone else who wants to provide their own metadata storage layer (e.g. icechunk) could implement it as well. It could also be a base class. What matters is that we define an API for doing all the metadata operations necessary for zarr stuff, in such a way that it's useful for other people to implement.
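As a usage sketch (a hypothetical caller; the store path and array name are made up, and `LocalStore` is assumed to be zarr-python's local store):

```python
from zarr.storage import LocalStore

store = LocalStore("/tmp/example.zarr")          # made-up path
meta_store = LocalV3MetaStore(store)             # class sketched above
meta = meta_store.read_array_metadata("my_arr")  # would fetch my_arr/zarr.json
```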
Metadata-aware IO (I made this term up, please suggest a better name) is the use of our store API to do IO that depends on zarr semantics, like reading / writing array and group metadata, for each zarr version. E.g., in zarr v2, a function that reads array metadata has to make 2 requests: one for `.zarray` and another for `.zattrs`. A function that reads array metadata for zarr v3 has a different implementation -- it makes just 1 request, for a different key (`zarr.json`).
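For concreteness, here is a minimal sketch of the two read paths just described. The synchronous `store.get` call, the JSON decoding, and the merging of attributes are assumptions for illustration, not zarr-python's actual internals:

```python
import json

def read_array_metadata_v2(store, path: str) -> dict:
    # zarr v2: two requests, one for .zarray and one for .zattrs
    zarray = json.loads(store.get(f"{path}/.zarray"))
    zattrs = json.loads(store.get(f"{path}/.zattrs"))
    zarray["attributes"] = zattrs  # merge the two documents into one
    return zarray

def read_array_metadata_v3(store, path: str) -> dict:
    # zarr v3: a single request for zarr.json, which already carries attributes
    return json.loads(store.get(f"{path}/zarr.json"))
```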
We don't have a single place in our codebase for these operations. In fact, there's some worrying code duplication -- we have a function called `get_array_metadata` defined in `core/array.py` that overlaps with `_read_metadata_v2` and `_read_metadata_v3`, which are both defined in `core/group.py`.

I think we should put these routines in one place. Eventually, that module would contain functions for:

- reading array metadata
- reading group metadata
- writing array metadata
- writing group metadata
None of these functions would return an array or group. They would just return array / group metadata, which could be used to create an array or group as needed. For this reason, I don't think these functions belong in `core/array.py` or `core/group.py`, since those modules are concerned with the `Array` and `Group` classes. The metadata-aware IO layer, however, cuts across the array / group distinction (e.g. with functions that can return either array or group metadata).

Eventually we may want to formalize the set of all these operations as a protocol.
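As a rough illustration, such a protocol might look something like this (a minimal sketch; the method names, signatures, and placeholder types are assumptions, not an existing zarr-python API):

```python
from typing import Protocol, Union

# placeholder types standing in for the real metadata classes
class ArrayV2Metadata: ...
class ArrayV3Metadata: ...
class GroupMetadata: ...

class MetadataIO(Protocol):
    """The metadata operations zarr needs, decoupled from the Array and Group classes."""

    def read_array_metadata(self, path: str) -> Union[ArrayV2Metadata, ArrayV3Metadata]: ...

    def read_group_metadata(self, path: str) -> GroupMetadata: ...

    def write_array_metadata(
        self, path: str, metadata: Union[ArrayV2Metadata, ArrayV3Metadata]
    ) -> None: ...

    def write_group_metadata(self, path: str, metadata: GroupMetadata) -> None: ...
```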
I'm not sure how chunk IO fits in here.
I reached this conclusion while working on #3012