(fix): structured dtype fill value consolidated metadata #3015

Merged

Contributor

@ilan-gold ilan-gold commented Apr 24, 2025

Closes #2998

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 24, 2025
@ilan-gold ilan-gold force-pushed the ig/fix_structured_dtype_consolidated branch from 3f97416 to 8e23bf9 Compare April 24, 2025 15:21
@ilan-gold ilan-gold mentioned this pull request Apr 24, 2025
3 tasks
@tasansal
Contributor

@d-v-b is this pending on anything? we would like to see this in the next patch release as well if possible :)

Comment on lines 321 to 331
def test_structured_dtype_fill_value_serialization(tmp_path):
    group_path = tmp_path / "test.zarr"
    root_group = zarr.open_group(group_path, mode="w", zarr_format=2)
    root_group.create_array(
        name="structured_headers",
        shape=(100, 100),
        chunks=(100, 100),
        dtype=np.dtype([("foo", "i4"), ("bar", "i4")]),
    )

    zarr.consolidate_metadata(root_group.store, zarr_format=2)
Contributor

this test is weak (it only tests that consolidate metadata doesn't error). Do you think it makes sense to make this test a bit stronger, e.g. by checking that the fill value was actually encoded the way we think it should have been?

Also I think we need a check to ensure that if the dtype is void and the fill value is None, then there's no base64 encoding (I know from the implementation in this PR that the test will pass, but it's good to have the test in any case)
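
The two checks requested here can be sketched without zarr or NumPy. This is only an illustration under stated assumptions: `LAYOUT` and `encode_fill_value` are hypothetical names, and `struct` stands in for a structured dtype with two little-endian `i4` fields. It is not the encoding code in this PR.

```python
import base64
import struct

# Assumed layout: two little-endian int32 fields, standing in for
# np.dtype([("foo", "<i4"), ("bar", "<i4")]).
LAYOUT = struct.Struct("<ii")


def encode_fill_value(fill_value):
    """Hypothetical encoder: base64 for a structured scalar, None stays None."""
    if fill_value is None:
        # No fill value set, so there is nothing to base64-encode.
        return None
    packed = LAYOUT.pack(*fill_value)
    return base64.standard_b64encode(packed).decode("ascii")


def test_structured_scalar_is_base64_encoded():
    encoded = encode_fill_value((1, 2))
    # Decoding the metadata string must give back the raw field bytes.
    assert base64.standard_b64decode(encoded) == LAYOUT.pack(1, 2)


def test_none_fill_value_is_not_encoded():
    assert encode_fill_value(None) is None


test_structured_scalar_is_base64_encoded()
test_none_fill_value_is_not_encoded()
```

A real test would read the stored metadata back instead of calling a helper, but the two assertions have the same shape.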

Contributor Author

this test is weak (it only tests that consolidate metadata doesn't error). Do you think it makes sense to make this test a bit stronger, e.g. by checking that the fill value was actually encoded the way we think it should have been?

Oh wow, I totally intended to do that (hence the name of the test).

Also I think we need a check to ensure that if the dtype is void and the fill value is None, then there's no base64 encoding

Great suggestion

Contributor Author

Thanks! tried both!

@d-v-b
Contributor

d-v-b commented Apr 24, 2025

thanks for the ping @tasansal, I left some comments about the test, but personally it's OK with me if those comments are ignored. Long term it's not sustainable for us to have duplicated fill value encoding logic across the codebase, and I think some upcoming PRs will help by heavily consolidating this, but I think it's OK in the short term if this is pushed out quickly.

@BrianMichell
Contributor

@d-v-b Is there anything I can do to help push this PR forward?

Contributor

@d-v-b d-v-b left a comment

@BrianMichell thanks for the ping. I think this looks good, thanks for the improvements @ilan-gold

@d-v-b d-v-b merged commit 0c76778 into zarr-developers:main Apr 30, 2025
30 checks passed
Comment on lines +338 to +342
assert (
    root_group.metadata.consolidated_metadata.to_dict()["metadata"]["structured_dtype"][
        "fill_value"
    ]
    == fill_value
Contributor

@ilan-gold I'm a little confused by this test. If the fill value is a structured dtype scalar, then shouldn't the fill value that appears in metadata be base64 encoded? If so, shouldn't this check fail in that case?

Contributor Author

I think the Python representation here is serialized into the void data type. Whether that is correct or not is a different story. Looking at it closely, I think either (a) this is expected and the typing on ArrayV2Metadata is wrong, or (b) the typing is right and the behavior is wrong.

The type is: fill_value: int | float | str | bytes | None = 0

But the call to parse_fill_value yields a NumPy object:

try:
    if isinstance(fill_value, list):
        return np.array([tuple(fill_value)], dtype=dtype)[0]
    elif isinstance(fill_value, tuple):
        return np.array([fill_value], dtype=dtype)[0]
    elif isinstance(fill_value, bytes):
        return np.frombuffer(fill_value, dtype=dtype)[0]
    elif isinstance(fill_value, str):
        decoded = base64.standard_b64decode(fill_value)
        return np.frombuffer(decoded, dtype=dtype)[0]
    else:
        return np.array(fill_value, dtype=dtype)[()]
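
To show why the parsed value differs from the stored string, here is a rough NumPy-free analogue of the branches quoted above. It assumes a fixed layout of two little-endian int32 fields; `LAYOUT` and `parse_structured_fill_value` are hypothetical names, and the final `else` branch is simplified to raise instead of attempting a generic conversion.

```python
import base64
import struct

# Assumed layout, standing in for np.dtype([("foo", "<i4"), ("bar", "<i4")]).
LAYOUT = struct.Struct("<ii")


def parse_structured_fill_value(fill_value):
    """Normalize every accepted input form to one in-memory tuple."""
    if isinstance(fill_value, (list, tuple)):
        # Round-trip through pack/unpack to validate field count and range.
        return LAYOUT.unpack(LAYOUT.pack(*fill_value))
    if isinstance(fill_value, bytes):
        # Raw bytes are interpreted directly with the field layout.
        return LAYOUT.unpack(fill_value)
    if isinstance(fill_value, str):
        # Strings are treated as base64, the form used in the JSON metadata.
        return LAYOUT.unpack(base64.standard_b64decode(fill_value))
    # Simplification: the quoted code falls back to np.array(...) here.
    raise TypeError(f"unsupported fill value type: {type(fill_value).__name__}")


raw = LAYOUT.pack(1, 2)
forms = [[1, 2], (1, 2), raw, base64.standard_b64encode(raw).decode("ascii")]
parsed = [parse_structured_fill_value(form) for form in forms]
print(parsed)  # four identical in-memory values
```

All four inputs collapse to the same parsed value, so a dictionary built from parsed metadata holds the decoded scalar, not the base64 string.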

Contributor Author

(What I'm trying to say is that this dictionary is not the on-disk json, but a parsed version)

Contributor

do you know offhand what zarr-python 2 did here? (I can also check this later)

Contributor Author

Definitely not offhand, but just to understand, why does it matter? Is there a backwards compat concern with the in-memory Python representation? TBH, from what it sounds like, structured data types are much less essential to us than to some of the other people who are raising these concerns, so I'm not super familiar with previous behavior. I don't think many people in our community use them, but they are in our CI and I like contributing, so I make these PRs :)

Contributor

I'm touching a lot of data type representation code over in #2874 and I want to make sense of some of the test failures I'm seeing, and this was one of the tests that I tripped. I think your explanation makes sense (i.e., this is just the in-memory representation, and so the fill value should be the decoded version), sorry for the noise!

Contributor Author

No worries @d-v-b always happy to help. Will be quick to report on the status of this all once 3.0.8 is out, but our tests show no errors with this feature at the moment.

Contributor

@d-v-b d-v-b May 2, 2025

for posterity, the specific thing in my PR that made this test fail was the use of to_dict. In #2874, I am making fill value encoding happen in the call to to_dict, instead of via a special JSON encoder (the status quo). So on my branch this test was comparing the JSON-serialized fill value against the in-memory version. I made the test pass by removing the to_dict step and directly comparing the metadata.fill_value attribute against the expected fill_value.
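
The difference between the two designs can be illustrated with a toy model. Everything below is hypothetical (`ToyMetadata`, `FillValueEncoder`, `to_dict_encoded`) and uses raw `bytes` as a stand-in for a structured scalar; it is not the code from this PR or from #2874.

```python
import base64
import json
from dataclasses import dataclass
from typing import Optional


def b64(raw: Optional[bytes]) -> Optional[str]:
    """Base64-encode raw bytes for JSON; None passes through unchanged."""
    return None if raw is None else base64.standard_b64encode(raw).decode("ascii")


@dataclass
class ToyMetadata:
    """Hypothetical stand-in; not zarr's ArrayV2Metadata."""

    fill_value: Optional[bytes]

    def to_dict(self) -> dict:
        # Status quo as described: the dict carries the in-memory value.
        return {"fill_value": self.fill_value}

    def to_dict_encoded(self) -> dict:
        # Alternative design: encoding happens inside the dict conversion.
        return {"fill_value": b64(self.fill_value)}


class FillValueEncoder(json.JSONEncoder):
    """Encodes bytes at JSON-dump time, leaving to_dict() untouched."""

    def default(self, o):
        if isinstance(o, bytes):
            return b64(o)
        return super().default(o)


meta = ToyMetadata(fill_value=b"\x01\x00\x00\x00\x02\x00\x00\x00")
in_memory = meta.to_dict()
on_disk = json.loads(json.dumps(in_memory, cls=FillValueEncoder))

print(in_memory["fill_value"] == meta.fill_value)  # True
print(on_disk["fill_value"] == meta.fill_value)    # False
print(meta.to_dict_encoded() == on_disk)           # True
```

Under the first design an assertion against `to_dict()` sees the in-memory value. Under the second it sees the encoded string, which is why comparing the `fill_value` attribute directly is the more stable check.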

Development

Successfully merging this pull request may close these issues.

Structured dtype serialization with consolidated metadata fails
4 participants