-
-
Notifications
You must be signed in to change notification settings - Fork 327
(fix): structured dtype fill value consolidated metadata #3015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(fix): structured dtype fill value consolidated metadata #3015
Conversation
3f97416
to
8e23bf9
Compare
@d-v-b is this pending on anything? we would like to see this in the next patch release as well if possible :) |
tests/test_metadata/test_v2.py
Outdated
def test_structured_dtype_fill_value_serialization(tmp_path): | ||
group_path = tmp_path / "test.zarr" | ||
root_group = zarr.open_group(group_path, mode="w", zarr_format=2) | ||
root_group.create_array( | ||
name="structured_headers", | ||
shape=(100, 100), | ||
chunks=(100, 100), | ||
dtype=np.dtype([("foo", "i4"), ("bar", "i4")]), | ||
) | ||
|
||
zarr.consolidate_metadata(root_group.store, zarr_format=2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is weak (it only tests that consolidate metadata doesn't error). Do you think it makes sense to make this test a big stronger, e.g. by checking that the fill value was actually encoded the way we think it should have been?
Also I think we need a check to ensure that if the dtype is void and the fill value is None
, then there's no base64 encoding (I know from the implementation in this PR that the test will pass, but it's good to have the test in any case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is weak (it only tests that consolidate metadata doesn't error). Do you think it makes sense to make this test a big stronger, e.g. by checking that the fill value was actually encoded the way we think it should have been?
Oh wow, I totally intended to do that (hence the name of the test).
Also I think we need a check to ensure that if the dtype is void and the fill value is None, then there's no base64 encoding
Great suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! tried both!
thanks for the ping @tasansal, i left some comments about the test, but personally It's OK with me if those comments are ignored. Long term it's not sustainable for us to have duplicated fill value encoding logic across the codebase, and I think some upcoming PRs will help by heavily consolidating this, but I think it's OK in the short-term if this is pushed out quickly. |
@d-v-b Is there anything I can do to help push this PR forward? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BrianMichell thanks for the ping. I think this looks good, thanks for the improvements @ilan-gold
assert ( | ||
root_group.metadata.consolidated_metadata.to_dict()["metadata"]["structured_dtype"][ | ||
"fill_value" | ||
] | ||
== fill_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ilan-gold I'm a little confused by this test. If the fill value is a structured dtype scalar, then shouldn't the fill value that appears in metadata be base64 encoded? If so, shouldn't this check fail in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the python representation here is serlialized into the void data type. Wheter that is correct or not is a different story. Looking at it closely, I think this is either (a) expected, and the typing on ArrayV2Metadata
is wrong or (b) the typing is right, and the behavior is wrong.
The type is: fill_value: int | float | str | bytes | None = 0
But the call to parse_fill_value
yields numpy
object:
zarr-python/src/zarr/core/metadata/v2.py
Lines 297 to 308 in 36a1bac
try: | |
if isinstance(fill_value, list): | |
return np.array([tuple(fill_value)], dtype=dtype)[0] | |
elif isinstance(fill_value, tuple): | |
return np.array([fill_value], dtype=dtype)[0] | |
elif isinstance(fill_value, bytes): | |
return np.frombuffer(fill_value, dtype=dtype)[0] | |
elif isinstance(fill_value, str): | |
decoded = base64.standard_b64decode(fill_value) | |
return np.frombuffer(decoded, dtype=dtype)[0] | |
else: | |
return np.array(fill_value, dtype=dtype)[()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(What I'm trying to say is that this dictionary is not the on-disk json, but a parsed version)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know offhand what zarr-python 2 did here? (I can also check this later)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely not offhand, but just to understand, why does it matter? Is there a backwards compat concern with the in-memory python representation? TBH structured data types are much less essential to us than some of the other people who are raising these concerns from what it sounds like so I'm not super familiar with previous behavior. I don't think many people in our community use them, but they are in our CI and I like contributing so I make these PRs :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm touching a lot of data type representation code over in #2874 and I want to make sense of some of the test failures I'm seeing, and this was one of the tests that I tripped. I think your explanation makes sense (i.e., this is just the in-memory representation, and so the fill value should be the decoded version), sorry for the noise!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries @d-v-b always happy to help. Will be quick to report on the status of this all once 3.0.8 is out, but our tests show no errors with this feature at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for posterity, the specific thing in my PR that made this test fail was the use of to_dict
. In #2874, I am making fill value encoding happen in the call to to_dict
, instead of via a special JSON encoder (the status quo). So on my branch this test was comparing the JSON-serialized fill value against the in-memory version. I made the test pass by removing the to_dict
step and directly comparing the metadata.fill_value
attribute against the expected fill_value
.
Closes #2998
TODO:
docs/user-guide/*.rst
changes/