(fix): structured dtype fill value consolidated metadata #3015

ilan-gold · 2025-04-24T14:50:30Z

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.rst
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

tasansal · 2025-04-24T18:17:55Z

@d-v-b is this pending on anything? we would like to see this in the next patch release as well if possible :)

d-v-b · 2025-04-24T18:24:59Z

tests/test_metadata/test_v2.py

+def test_structured_dtype_fill_value_serialization(tmp_path):
+    group_path = tmp_path / "test.zarr"
+    root_group = zarr.open_group(group_path, mode="w", zarr_format=2)
+    root_group.create_array(
+        name="structured_headers",
+        shape=(100, 100),
+        chunks=(100, 100),
+        dtype=np.dtype([("foo", "i4"), ("bar", "i4")]),
+    )
+
+    zarr.consolidate_metadata(root_group.store, zarr_format=2)


this test is weak (it only tests that consolidate metadata doesn't error). Do you think it makes sense to make this test a big stronger, e.g. by checking that the fill value was actually encoded the way we think it should have been?

Also I think we need a check to ensure that if the dtype is void and the fill value is None, then there's no base64 encoding (I know from the implementation in this PR that the test will pass, but it's good to have the test in any case)

this test is weak (it only tests that consolidate metadata doesn't error). Do you think it makes sense to make this test a big stronger, e.g. by checking that the fill value was actually encoded the way we think it should have been?

Oh wow, I totally intended to do that (hence the name of the test).

Also I think we need a check to ensure that if the dtype is void and the fill value is None, then there's no base64 encoding

Great suggestion

Thanks! tried both!

d-v-b · 2025-04-24T18:27:33Z

thanks for the ping @tasansal, i left some comments about the test, but personally It's OK with me if those comments are ignored. Long term it's not sustainable for us to have duplicated fill value encoding logic across the codebase, and I think some upcoming PRs will help by heavily consolidating this, but I think it's OK in the short-term if this is pushed out quickly.

BrianMichell · 2025-04-30T14:11:34Z

@d-v-b Is there anything I can do to help push this PR forward?

d-v-b

@BrianMichell thanks for the ping. I think this looks good, thanks for the improvements @ilan-gold

d-v-b · 2025-05-01T08:23:43Z

tests/test_metadata/test_v2.py

+    assert (
+        root_group.metadata.consolidated_metadata.to_dict()["metadata"]["structured_dtype"][
+            "fill_value"
+        ]
+        == fill_value


@ilan-gold I'm a little confused by this test. If the fill value is a structured dtype scalar, then shouldn't the fill value that appears in metadata be base64 encoded? If so, shouldn't this check fail in that case?

I think the python representation here is serlialized into the void data type. Wheter that is correct or not is a different story. Looking at it closely, I think this is either (a) expected, and the typing on ArrayV2Metadata is wrong or (b) the typing is right, and the behavior is wrong.

The type is: fill_value: int | float | str | bytes | None = 0

But the call to parse_fill_value yields numpy object:

zarr-python/src/zarr/core/metadata/v2.py

Lines 297 to 308 in 36a1bac

try:

if isinstance(fill_value, list):

return np.array([tuple(fill_value)], dtype=dtype)[0]

elif isinstance(fill_value, tuple):

return np.array([fill_value], dtype=dtype)[0]

elif isinstance(fill_value, bytes):

return np.frombuffer(fill_value, dtype=dtype)[0]

elif isinstance(fill_value, str):

decoded = base64.standard_b64decode(fill_value)

return np.frombuffer(decoded, dtype=dtype)[0]

else:

return np.array(fill_value, dtype=dtype)[()]

(What I'm trying to say is that this dictionary is not the on-disk json, but a parsed version)

do you know offhand what zarr-python 2 did here? (I can also check this later)

Definitely not offhand, but just to understand, why does it matter? Is there a backwards compat concern with the in-memory python representation? TBH structured data types are much less essential to us than some of the other people who are raising these concerns from what it sounds like so I'm not super familiar with previous behavior. I don't think many people in our community use them, but they are in our CI and I like contributing so I make these PRs :)

I'm touching a lot of data type representation code over in #2874 and I want to make sense of some of the test failures I'm seeing, and this was one of the tests that I tripped. I think your explanation makes sense (i.e., this is just the in-memory representation, and so the fill value should be the decoded version), sorry for the noise!

No worries @d-v-b always happy to help. Will be quick to report on the status of this all once 3.0.8 is out, but our tests show no errors with this feature at the moment.

for posterity, the specific thing in my PR that made this test fail was the use of to_dict. In #2874, I am making fill value encoding happen in the call to to_dict, instead of via a special JSON encoder (the status quo). So on my branch this test was comparing the JSON-serialized fill value against the in-memory version. I made the test pass by removing the to_dict step and directly comparing the metadata.fill_value attribute against the expected fill_value.

github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 24, 2025

ilan-gold added 3 commits April 24, 2025 17:20

(fix): structured dtype consolidated metadata fill value

0da74a4

(chore): relnote

8136b00

(chore): test

8e23bf9

ilan-gold force-pushed the ig/fix_structured_dtype_consolidated branch from 3f97416 to 8e23bf9 Compare April 24, 2025 15:21

ilan-gold mentioned this pull request Apr 24, 2025

(fix): zarr bound scverse/anndata#1976

Merged

3 tasks

tasansal mentioned this pull request Apr 24, 2025

Limit allowable zarr versions TGSAI/mdio-python#533

Merged

d-v-b reviewed Apr 24, 2025

View reviewed changes

ilan-gold added 2 commits April 25, 2025 00:05

(fix): more robust testing

feb367e

Merge branch 'main' into ig/fix_structured_dtype_consolidated

21b1fdd

d-v-b approved these changes Apr 30, 2025

View reviewed changes

Merge branch 'main' into ig/fix_structured_dtype_consolidated

a959e66

d-v-b merged commit 0c76778 into zarr-developers:main Apr 30, 2025
30 checks passed

d-v-b reviewed May 1, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(fix): structured dtype fill value consolidated metadata #3015

(fix): structured dtype fill value consolidated metadata #3015

ilan-gold commented Apr 24, 2025 •

edited

Loading

tasansal commented Apr 24, 2025

d-v-b Apr 24, 2025

ilan-gold Apr 24, 2025

ilan-gold Apr 24, 2025

d-v-b commented Apr 24, 2025

BrianMichell commented Apr 30, 2025

d-v-b left a comment

d-v-b May 1, 2025

ilan-gold May 1, 2025

ilan-gold May 1, 2025

d-v-b May 1, 2025

ilan-gold May 1, 2025

d-v-b May 1, 2025

ilan-gold May 1, 2025

d-v-b May 2, 2025 •

edited

Loading

	try:
	if isinstance(fill_value, list):
	return np.array([tuple(fill_value)], dtype=dtype)[0]
	elif isinstance(fill_value, tuple):
	return np.array([fill_value], dtype=dtype)[0]
	elif isinstance(fill_value, bytes):
	return np.frombuffer(fill_value, dtype=dtype)[0]
	elif isinstance(fill_value, str):
	decoded = base64.standard_b64decode(fill_value)
	return np.frombuffer(decoded, dtype=dtype)[0]
	else:
	return np.array(fill_value, dtype=dtype)[()]

(fix): structured dtype fill value consolidated metadata #3015

(fix): structured dtype fill value consolidated metadata #3015

Conversation

ilan-gold commented Apr 24, 2025 • edited Loading

tasansal commented Apr 24, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

d-v-b commented Apr 24, 2025

BrianMichell commented Apr 30, 2025

d-v-b left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

d-v-b May 2, 2025 • edited Loading

Choose a reason for hiding this comment

ilan-gold commented Apr 24, 2025 •

edited

Loading

d-v-b May 2, 2025 •

edited

Loading