[v3] Refactoring our serialization handling #2144
Comments
I am very interested in improving our serialization logic, so thanks for working on this.
Could you explain exactly whats painful about serialization today? I have my own list of complaints, but I would be interested in hearing yours.
That's a good summary of how we have approached this so far. The libraries you list all support a lot of use cases, but for zarr-python we merely need to round-trip a finite set of JSON objects into very simple python classes. That being said, my views on this have evolved a bit and personally I am more open to using a dependency here if it was lightweight and had a big positive effect on the codebase. "lightweight" rules out pydantic, but msgspec or cattrs could be viable. And if we do keep our home-grown solution, I think we should have an asymptotic goal for our type validator routines and metadata models to be usable for modelling zarr metadata in pydantic, cattrs, etc.
I would be curious to see how this looks. Personally I like having validation routines for class
related to: #2081, specifically: #2081 (comment). It would be great if this could get consolidated so it could also be hooked into by external objects. So IMO these two approaches would be simplified by having a serialization hook. So where the serialization currently happens, you could do:

```python
try:
    return json.dumps(obj)
except TypeError as e:
    if hasattr(obj, '__zarr_serialize__'):
        return json.dumps(obj.__zarr_serialize__())
    else:
        raise e
```

Then if you also added annotations to the serialization method, like:

```python
return {
    'module': obj.__module__,
    'type': obj.__class__.__name__,
    'value': obj.__zarr_serialize__()
}
```

(except less janky), you wouldn't need to do a big if/else switch statement on types in deserialization.
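The deserialization side of that idea can be sketched as follows. With `module` and `type` recorded in the payload, one generic routine can resolve the class by name instead of switching on types. The `__zarr_deserialize__` classmethod is hypothetical (it mirrors the hook proposed above); nothing like it exists in zarr-python today.

```python
import importlib


def deserialize(payload: dict):
    """Resolve the class named in the payload and hand it the stored value.

    Assumes the class defines a (hypothetical) __zarr_deserialize__
    classmethod that reconstructs an instance from the JSON value.
    """
    mod = importlib.import_module(payload["module"])
    cls = getattr(mod, payload["type"])
    return cls.__zarr_deserialize__(payload["value"])
```

Any class reachable by `module` + `type` name and implementing the hook would round-trip through this single function, which is the "no big if/else switch" property described above.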
At least for now, the ser/deserialization logic in question is explicitly scoped to the small, finite set of JSON types enumerated by the zarr specs, stuff like the contents of the metadata documents.
i don't mean to derail the conversation, so maybe we can limit this discussion to #2081, but the point i was trying to make is that you can get both by using serialization hooks :) - the boring types get a consistent serialization interface, but then if someone wanted to extend it (downstream, not in base zarr), they would have a clear means of doing so. multiple birds with one stone: clean up serialization/deserialization for zarr internals and support generic object serialization (which i think is being deprecated in v3, right?) using a system that can be reused across zarr, rather than having separate serdes implementations for different parts of the library (if i'm understanding correctly, this is just for metadata objects? not sure if it differs from the rest of serialization). but anyway, just chiming in because it was related to what we were talking about elsewhere - don't want to muddy the waters.
Yeah, the current model is that the zarr v3 and v2 specs define a small handful of JSON objects that schematize groups or arrays. Besides the user-defined attributes (which don't get inspected at all, save for checking that they are JSON serializable), this set of JSON objects is closed, and there is no general anticipation that users would extend this set (extending this set of objects would amount to extending the zarr spec itself). With that constraint, the appeal of generic object serialization is pretty limited. The design that tom has problems with does use a serialization hook: it's called `to_dict`.

At the same time, there's a completely separate serialization path for the contents of arrays (chunks). This path is defined by the particular configuration of compressors / codecs specified in the metadata for the array, and chunk serialization generally produces an opaque bytestream at the end (e.g., because chunks get compressed).
Just going off memory, I think the first thing I ran into was failing to implement or call `to_dict` / `from_dict` correctly.
Does using a static type checker like mypy solve your concerns when instances are being created directly? This doesn't really help if the program creating the metadata isn't type checked. And we do have the option of keeping a validating `__post_init__`.

Thanks for the feedback @sneakers-the-rat. For this issue, I'm mostly concerned with round-tripping things that are in zarr-python, not necessarily a system that would be exposed to users or developers building on zarr-python. That said, if I had to guess, at some point zarr-python will need to grow some kind of "registration" system for (de)serializing essentially arbitrary objects. But that's a bit murky to me.
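One minimal shape such a "registration" system could take is a table mapping types to encode/decode hook pairs. This is purely a sketch of the idea floated above (all names here are hypothetical; zarr-python has nothing like this today):

```python
from typing import Any, Callable

# Hypothetical registry mapping a type to a pair of (encode, decode) hooks.
_SERDE_REGISTRY: dict[type, tuple[Callable[[Any], Any], Callable[[Any], Any]]] = {}


def register(typ: type, encode: Callable[[Any], Any], decode: Callable[[Any], Any]) -> None:
    """Associate a type with hooks that convert to/from JSON-compatible values."""
    _SERDE_REGISTRY[typ] = (encode, decode)


def to_json(obj: Any) -> Any:
    encode, _ = _SERDE_REGISTRY[type(obj)]
    return encode(obj)


def from_json(typ: type, data: Any) -> Any:
    _, decode = _SERDE_REGISTRY[typ]
    return decode(data)


# Example registration: complex numbers aren't natively JSON-serializable.
register(complex, lambda c: [c.real, c.imag], lambda v: complex(v[0], v[1]))
```

Downstream code could then register its own types without zarr-python's internals needing to know about them, which is the extensibility property discussed earlier in the thread.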
Maybe we don't even disagree :) I didn't intend to fixate on that.

And the type checker is nice, but I'm specifically interested in runtime: we should ensure that people can import zarr metadata classes, create a schema for a zarr array, and save it to storage, without worrying that their metadata was invalid. If we have to have object validation somewhere, then validation on construction is, I think, the only way for this to be a smooth user experience.

Also, not all the zarr metadata invariants can be caught by the type system, e.g. positive integers. So we need runtime validation; the question is just where to put it, I guess.
to your points about the flaws in the current serialization:
If I recall correctly, the motivation for keeping python types around in the output of
The "validation on construction" question is interesting. There are (I think) two components:

And there are roughly two places where we could ensure / validate that:

```python
import dataclasses


@dataclasses.dataclass
class Metadata:
    shape: tuple[int, ...]

    def __post_init__(self):
        if not isinstance(self.shape, tuple):
            raise TypeError  # or we could cast
        if not all(x > 0 for x in self.shape):
            raise TypeError(...)
```

Whether or not to validate that is an open question. FWIW, my preference is probably to keep the generated `__init__` and validate at the boundary.
So I think that my proposal would allow for something like
This is allowed today. I definitely feel like
This sounds like a bad user experience? Why should we allow the construction of invalid objects? I could see an argument for using something like a typed dict to model the serialized form of the metadata.
Zarr version: v3
Numcodecs version: n/a
Python Version: n/a
Operating System: n/a
Installation: n/a

Description
While working on consolidated metadata, I bumped into some awkwardness of our serialization logic. It ended up being fine, but I wonder if we could do a bit better by implementing some dedicated logic to serialization and deserialization of the Zarr metadata objects.
For now, I'll assume that we aren't interested in using a 3rd-party library like msgspec, pydantic, or cattrs for this. My current implementation is at https://github.com/TomAugspurger/zarr-python/blob/feature/serde/src/zarr/_serialization.py and it's pretty complicated, but I'll be able to clean some things up (both in that file, and in the various `to_dict` / `from_dict` methods currently on our objects). I personally have a pretty high bar for libraries taking on dependencies, so I think this is worth it, but others may disagree.

I wrote up https://tomaugspurger.net/posts/serializing-dataclasses/ with a little introduction based on what I've learned so far. The short version is to use the hooks provided by the standard library (like `json.dumps(obj, default=custom_converter_function)`), but centralize all the serialization logic in one spot.

I'm not sure yet, but this might need to have some effect on the `__init__` / `__post_init__` of our dataclasses. In general, I'd like to move some of the `parse_<thing>` routines that are found in a lot of our dataclasses to the boundary of the program, and have the init of the dataclasses just be the default generated by `@dataclass`. This will mean duplicating some logic on the external facing methods like `zarr.open`, `zarr.create`, `Group.create`, but IMO that's worth it.

I still have some work to do (parsing nested objects like lists and tuples properly), but wanted to share https://tomaugspurger.net/posts/serializing-dataclasses/ sooner rather than later.
Steps to reproduce: n/a
Additional output: No response