Device support in zarr-python (especially for GPU) #2658

Open
nenb opened this issue Jan 6, 2025 · 4 comments

@nenb

nenb commented Jan 6, 2025

Problem
I would like to load zarr data directly onto non-CPU devices (especially GPU). The current approach appears to rely on using cupy to load onto cupy-supported devices e.g. https://github.com/rapidsai/kvikio/blob/branch-25.02/notebooks/zarr.ipynb.

Unfortunately, a number of devices are not supported by cupy; for example, I don't believe my Apple Metal GPU is supported. This means I must load from zarr via the CPU if I want to use these devices, e.g. zarr on disk -> numpy -> torch (which has Metal support).

This is slower, and I don't believe it is required by the zarr specification itself (?).
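
For concreteness, the CPU round-trip described above looks roughly like this (a sketch; the store path is illustrative and assumes the store holds a single array):

```python
import torch
import zarr

z = zarr.open("model_weights.zarr", mode="r")        # illustrative store path
cpu_array = z[:]                                      # decoded on the CPU into a numpy array
gpu_tensor = torch.from_numpy(cpu_array).to("mps")    # extra host-to-device copy to reach Metal
```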

Background
Multi-device support is a very important requirement in the AI/ML community. I would like to use zarr (and specifically the Python implementation) to run models such as LLMs on multiple devices. The quicker a model can be loaded onto a device (and the lower the memory overhead, etc.), the better the user and developer experience.

Questions

  1. Is cupy the correct/only way to load direct to GPU with zarr-python?
  2. Is there/will there be any way of loading direct to devices such as Metal with zarr-python?
  3. (Related) What is the best way to load a PyTorch neural network onto a GPU with zarr-python? Is it cupy plus something like DLPack for zero-copy exchange (see the sketch below)? Are there alternatives?
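
A hedged sketch of the exchange hypothesized in question 3: however a chunk ends up in GPU memory as a CuPy array (e.g. via kvikio), the DLPack protocol lets PyTorch wrap it without another copy. This assumes a CUDA machine with cupy and torch installed; the array below is a stand-in for a decoded zarr chunk.

```python
import cupy as cp
import torch

# Stand-in for a zarr chunk that has already been decoded into GPU memory.
gpu_array = cp.arange(16, dtype=cp.float32)

# Zero-copy hand-off: torch wraps the existing CUDA allocation via DLPack.
tensor = torch.from_dlpack(gpu_array)

gpu_array[0] = 42.0
print(tensor[0].item())  # 42.0 -> both objects share the same device memory
```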

Related issues
#1967
#2574

cc @jhamman (as suggested by @TomNicholas)

@ziw-liu
Contributor

ziw-liu commented Jan 8, 2025

CuPy/kvikio relies on NVIDIA's GPUDirect Storage (GDS) driver and goes through PCIe. Metal GPUs use unified memory, so a CPU-to-GPU transfer can in theory be almost zero-cost (passing an address). If there is a way to pass ownership of an array from CPU to GPU, nothing needs to be done in zarr unless there is a need for GPU-accelerated decompression.

In practice, though, torch at least implements the to("mps") method by cloning the tensor (a memcpy-ish cost), and each ML framework may do things differently. Another reference point is jax, which implements (experimental) serialization to zarr using tensorstore.
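
A small illustration of that cloning behaviour (a sketch assuming recent PyTorch on Apple Silicon with the MPS backend available):

```python
import numpy as np
import torch

cpu_arr = np.ones((4, 4), dtype=np.float32)
cpu_t = torch.from_numpy(cpu_arr)   # zero-copy view of the NumPy buffer
mps_t = cpu_t.to("mps")             # allocates MPS storage and copies into it

cpu_arr[0, 0] = 5.0
print(cpu_t[0, 0].item())           # 5.0 -> shares memory with the NumPy array
print(mps_t[0, 0].item())           # 1.0 -> the MPS tensor is a separate copy
```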

@nenb
Author

nenb commented May 1, 2025

Thanks for the pointers, @ziw-liu - your comment sent me down a rabbit-hole and I think I’ve now got a more concrete proposal worth floating.

TL;DR

  • Goal: let zarr-python decode straight into MLX arrays so that data arriving from disk is already resident in Apple-Silicon unified memory and instantly visible to both CPU and GPU.
  • How: add a UnifiedMemoryBuffer (name TBD) that satisfies zarr-python's Buffer abstraction and is backed by MLX arrays under the hood.
  • Why it matters: MLX is basically the (zero-copy, GPU-aware) “NumPy for Apple devices.” Bridging Zarr → MLX lets Mac users stream huge model weights or datasets directly into unified memory, and should hopefully also help grow the Zarr community.

1 Background – why MLX is special

  • Unified Memory
    On Apple Silicon the CPU and GPU share the same memory. Frameworks that understand this (such as MLX) can treat “device transfers” as metadata operations only. There is an issue by one of the framework authors that describes well why such a new framework was introduced.

  • PyTorch’s current limitation
    AFAIK (based on @ziw-liu's comment above and my further research) torch.Tensor.to("mps") still performs a clone because its device model was designed around discrete GPUs. That removes a lot of the benefit of unified memory.

  • Enter MLX
    MLX (docs, blog) wraps a fairly thin C++/Metal runtime in an almost-NumPy API. Because of this, MLX feels like the natural target array type for a Zarr buffer on Macs (a tiny illustration follows below).
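
A tiny illustration of that API (a sketch assuming an Apple Silicon machine with mlx installed): arrays live in unified memory and the compute device is chosen per operation, so there is no explicit host-to-device copy.

```python
import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])       # allocated in unified memory
b = mx.add(a, a, stream=mx.gpu)     # run the op on the GPU...
c = mx.add(a, a, stream=mx.cpu)     # ...or the CPU, over the same array
mx.eval(b, c)                       # MLX is lazy; force the computations
```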

2 What I’m proposing

A new Buffer implementation specifically for Macs with unified memory. A rough sketch of the shape this could take is below.
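
As a very rough sketch (not a working implementation): it assumes the abstract Buffer class keeps roughly the interface of the existing CPU/GPU buffers (from_bytes / as_numpy_array / to_bytes); the exact import path and method set may differ.

```python
import mlx.core as mx
import numpy as np

from zarr.core.buffer import Buffer  # import path assumed; may differ


class UnifiedMemoryBuffer(Buffer):
    """A zarr Buffer backed by an MLX array in Apple-Silicon unified memory."""

    def __init__(self, array: mx.array) -> None:
        self._data = array

    @classmethod
    def from_bytes(cls, data: bytes) -> "UnifiedMemoryBuffer":
        # Raw chunk bytes land directly in unified memory.
        return cls(mx.array(np.frombuffer(data, dtype=np.uint8)))

    def as_numpy_array(self) -> np.ndarray:
        # A CPU view; with unified memory this should not need a device transfer.
        return np.array(self._data)

    def to_bytes(self) -> bytes:
        return self.as_numpy_array().tobytes()
```

The interesting part is that as_numpy_array should be able to hand back a CPU view without a device transfer, since the underlying allocation is already visible to both CPU and GPU.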

3 Questions

  • There are many gaps in my knowledge of zarr-python. I would appreciate comments from those more familiar with it about why this might not be a good idea or might not work as I hope!

  • The potential of Unified Memory to facilitate shared CPU-GPU data access seems particularly relevant to Zarr codecs, especially with the ongoing exploration of GPU-based decompression to alleviate CPU bottlenecks in ML workflows. For those with deep knowledge of the Zarr codec pipeline, I'd greatly appreciate any considerations or challenges I should be aware of when (potentially) exploring GPU-accelerated decompression within a Unified Memory context.

4 Final Notes

There are a lot more subtleties to implementing this than I have outlined in this issue. But I wanted to start an initial discussion here to get some feedback, before proceeding to a PoC.

I know there’s a lot I don’t know about Zarr internals. Any pointers, pitfalls, or “please don’t do it this way” comments are very welcome!

@d-v-b
Contributor

d-v-b commented May 1, 2025

> A new Buffer implementation specifically for Macs with unified memory.

I think the idea behind the buffer API design was to support exactly this strategy, so it looks like the right direction to me!

@TomAugspurger
Contributor

One other thing to think through is the config system (docs: https://zarr.readthedocs.io/en/stable/user-guide/gpu.html). We currently have a high-level zarr.config.enable_gpu() that updates a few config settings (the default buffer type being the big one). At least at the moment that's tied directly to CUDA / cuPy / NVIDIA GPUs. We'll need to figure out whether we want "gpu" to mean "figure out stuff at runtime, based on the resources available". That sounds a bit complicated, so for now I'd recommend namespacing everything under "mlx".
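
For reference, a hedged sketch of what that namespacing could look like (the "buffer"/"ndbuffer" config keys reflect my understanding of the current defaults and may differ; the MLX class names are purely hypothetical):

```python
import zarr

# Today: flips the default buffer/ndbuffer config to the CUDA-backed classes.
zarr.config.enable_gpu()

# A possible explicit, namespaced opt-in for an MLX backend instead of
# overloading "gpu" (class names are hypothetical):
zarr.config.set(
    {
        "buffer": "zarr_mlx.UnifiedMemoryBuffer",
        "ndbuffer": "zarr_mlx.UnifiedMemoryNDBuffer",
    }
)
```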

> For those with deep knowledge of the Zarr codec pipeline, I'd greatly appreciate any considerations or challenges I should be aware of when (potentially) exploring GPU-accelerated decompression within a Unified Memory context

I'm not an expert, but am starting to dig into it as part of #2904. It's pretty challenging... At least for NVIDIA GPUs, I think we might need finer-grained control over what the input and output buffers are for each stage of the pipeline. Maybe that's not an issue with the unified memory model, though.
