Device support in zarr-python (especially for GPU) #2658

Open
nenb opened this issue Jan 6, 2025 · 4 comments

@nenb

nenb commented Jan 6, 2025

Problem
I would like to load zarr data directly onto non-CPU devices (especially GPU). The current approach appears to rely on using cupy to load onto cupy-supported devices e.g. https://github.com/rapidsai/kvikio/blob/branch-25.02/notebooks/zarr.ipynb.

Unfortunately, a number of devices are not supported by cupy; for example, I don't believe my Apple Metal GPU is supported. This means I must load from zarr via the CPU if I want to use these devices, e.g. zarr on disk -> numpy -> torch (which has Metal support).

This is slower, and I don't believe it is required by the zarr specification itself (?).
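
For concreteness, the CPU round-trip described above looks roughly like this (a sketch; the store path is illustrative and assumes the store holds a single array):

```python
import torch
import zarr

z = zarr.open("model_weights.zarr", mode="r")        # illustrative store path
cpu_array = z[:]                                      # decoded on the CPU into a numpy array
gpu_tensor = torch.from_numpy(cpu_array).to("mps")    # extra host-to-device copy to reach Metal
```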

Background
Multi-device support is a very important requirement in the AI/ML community. I would like to use zarr (and specifically the Python implementation) to run models such as LLMs on multiple devices. The quicker a model can be loaded onto a device (and the lower the memory overhead, etc.), the better the user and developer experience.

Questions

  1. Is cupy the correct/only way to load direct to GPU with zarr-python?
  2. Is there/will there be any way of loading direct to devices such as Metal with zarr-python?
  3. (Related) What is the best way to load a PyTorch neural network onto a GPU with zarr-python? Is it cupy plus something like DLPack for zero-copy exchange (see the sketch below)? Are there alternatives?
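
A hedged sketch of the exchange hypothesized in question 3: however a chunk ends up in GPU memory as a CuPy array (e.g. via kvikio), the DLPack protocol lets PyTorch wrap it without another copy. This assumes a CUDA machine with cupy and torch installed; the array below is a stand-in for a decoded zarr chunk.

```python
import cupy as cp
import torch

# Stand-in for a zarr chunk that has already been decoded into GPU memory.
gpu_array = cp.arange(16, dtype=cp.float32)

# Zero-copy hand-off: torch wraps the existing CUDA allocation via DLPack.
tensor = torch.from_dlpack(gpu_array)

gpu_array[0] = 42.0
print(tensor[0].item())  # 42.0 -> both objects share the same device memory
```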

Related issues
#1967
#2574

cc @jhamman (as suggested by @TomNicholas)

@ziw-liu
Contributor

ziw-liu commented Jan 8, 2025

CuPy/kvikio relies on NVIDIA's GPUDirect Storage (GDS) driver and goes through PCIe. Metal GPUs use unified memory, so a CPU-to-GPU transfer can in theory be almost zero-cost (passing an address). If there is a way to pass ownership of an array from CPU to GPU, nothing needs to be done in zarr unless there is a need for GPU-accelerated decompression.

In practice, though, torch at least implements the to("mps") method by cloning the tensor (a memcpy-ish cost), and each ML framework may do things differently. Another reference point is jax, which implements (experimental) serialization to zarr using tensorstore.
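
A small illustration of that cloning behaviour (a sketch assuming recent PyTorch on Apple Silicon with the MPS backend available):

```python
import numpy as np
import torch

cpu_arr = np.ones((4, 4), dtype=np.float32)
cpu_t = torch.from_numpy(cpu_arr)   # zero-copy view of the NumPy buffer
mps_t = cpu_t.to("mps")             # allocates MPS storage and copies into it

cpu_arr[0, 0] = 5.0
print(cpu_t[0, 0].item())           # 5.0 -> shares memory with the NumPy array
print(mps_t[0, 0].item())           # 1.0 -> the MPS tensor is a separate copy
```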

@nenb
Author

nenb commented May 1, 2025

Thanks for the pointers, @ziw-liu - your comment sent me down a rabbit-hole and I think I’ve now got a more concrete proposal worth floating.

TL;DR

  • Goal: let zarr-python decode straight into MLX arrays so that data arriving from disk is already resident in Apple-Silicon unified memory and instantly visible to both CPU and GPU.
  • How: add a UnifiedMemoryBuffer (name TBD) that satisfies zarr-python's Buffer abstraction and is backed by MLX arrays under the hood.
  • Why it matters: MLX is basically the (zero-copy, GPU-aware) “NumPy for Apple devices.” Bridging Zarr → MLX lets Mac users stream huge model weights or datasets directly into unified memory, and should hopefully also help grow the Zarr community.

1 Background – why MLX is special

  • Unified Memory
    On Apple Silicon the CPU and GPU share the same memory. Frameworks that understand this (such as MLX) can treat “device transfers” as metadata operations only. There is an issue by one of the framework authors that describes well why such a new framework was introduced.

  • PyTorch’s current limitation
    AFAIK (based on @ziw-liu's comment above and my further research) torch.Tensor.to("mps") still performs a clone because its device model was designed around discrete GPUs. That removes a lot of the benefit of unified memory.

  • Enter MLX
    MLX (docs, blog) wraps a fairly thin C++/Metal runtime in an almost-NumPy API. Because of this, MLX feels like the natural target array type for a Zarr buffer on Macs (a tiny illustration follows below).
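
A tiny illustration of that API (a sketch assuming an Apple Silicon machine with mlx installed): arrays live in unified memory and the compute device is chosen per operation, so there is no explicit host-to-device copy.

```python
import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])       # allocated in unified memory
b = mx.add(a, a, stream=mx.gpu)     # run the op on the GPU...
c = mx.add(a, a, stream=mx.cpu)     # ...or the CPU, over the same array
mx.eval(b, c)                       # MLX is lazy; force the computations
```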

2 What I’m proposing

A new Buffer implementation specifically for Macs with unified memory. A rough sketch of the shape this could take is below.
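
As a very rough sketch (not a working implementation): it assumes the abstract Buffer class keeps roughly the interface of the existing CPU/GPU buffers (from_bytes / as_numpy_array / to_bytes); the exact import path and method set may differ.

```python
import mlx.core as mx
import numpy as np

from zarr.core.buffer import Buffer  # import path assumed; may differ


class UnifiedMemoryBuffer(Buffer):
    """A zarr Buffer backed by an MLX array in Apple-Silicon unified memory."""

    def __init__(self, array: mx.array) -> None:
        self._data = array

    @classmethod
    def from_bytes(cls, data: bytes) -> "UnifiedMemoryBuffer":
        # Raw chunk bytes land directly in unified memory.
        return cls(mx.array(np.frombuffer(data, dtype=np.uint8)))

    def as_numpy_array(self) -> np.ndarray:
        # A CPU view; with unified memory this should not need a device transfer.
        return np.array(self._data)

    def to_bytes(self) -> bytes:
        return self.as_numpy_array().tobytes()
```

The interesting part is that as_numpy_array should be able to hand back a CPU view without a device transfer, since the underlying allocation is already visible to both CPU and GPU.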

3 Questions

  • There are many gaps in my knowledge of zarr-python. I would appreciate comments from those more familiar with it about why this might not be a good idea or might not work as I hope!

  • The potential of Unified Memory to facilitate shared CPU-GPU data access seems particularly relevant to Zarr codecs, especially with the ongoing exploration of GPU-based decompression to alleviate CPU bottlenecks in ML workflows. For those with deep knowledge of the Zarr codec pipeline, I'd greatly appreciate any considerations or challenges I should be aware of when (potentially) exploring GPU-accelerated decompression within a Unified Memory context.

4 Final Notes

There are a lot more subtleties to implementing this than I have outlined in this issue. But I wanted to start an initial discussion here to get some feedback, before proceeding to a PoC.

I know there’s a lot I don’t know about Zarr internals. Any pointers, pitfalls, or “please don’t do it this way” comments are very welcome!

@d-v-b
Contributor

d-v-b commented May 1, 2025

> A new Buffer implementation specifically for Macs with unified memory.

I think the idea behind the buffer API design was to support exactly this strategy, so it looks like the right direction to me!

@TomAugspurger
Contributor

One other thing to think through is the config system (docs: https://zarr.readthedocs.io/en/stable/user-guide/gpu.html). We currently have a high-level zarr.config.enable_gpu() that updates a few config settings (the default buffer type being the big one). At least at the moment that's tied directly to CUDA / cuPy / NVIDIA GPUs. We'll need to figure out whether we want "gpu" to mean "figure out stuff at runtime, based on the resources available". That sounds a bit complicated, so for now I'd recommend namespacing everything under "mlx".
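
For reference, a hedged sketch of what that namespacing could look like (the "buffer"/"ndbuffer" config keys reflect my understanding of the current defaults and may differ; the MLX class names are purely hypothetical):

```python
import zarr

# Today: flips the default buffer/ndbuffer config to the CUDA-backed classes.
zarr.config.enable_gpu()

# A possible explicit, namespaced opt-in for an MLX backend instead of
# overloading "gpu" (class names are hypothetical):
zarr.config.set(
    {
        "buffer": "zarr_mlx.UnifiedMemoryBuffer",
        "ndbuffer": "zarr_mlx.UnifiedMemoryNDBuffer",
    }
)
```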

> For those with deep knowledge of the Zarr codec pipeline, I'd greatly appreciate any considerations or challenges I should be aware of when (potentially) exploring GPU-accelerated decompression within a Unified Memory context

I'm not an expert, but am starting to dig into it as part of #2904. It's pretty challenging... At least for NVIDIA GPUs, I think we might need finer-grained control over what the input and output buffers are for each stage of the pipeline. Maybe that's not an issue with the unified memory model, though.
