Skip to content

Commit 58096a6

Browse files
TomNicholasdcherianpre-commit-ci[bot]
authored
Chunked array docs (#7951)
* draft updates * discuss array API standard * fix sparse examples so they run * Deepak's suggestions Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> * link to duck arrays user guide from internals page * fix various links * itemized list * mention dispatching on functions not in the array API standard * examples of duckarrays * add intended audience to xarray internals section * draft page on chunked arrays * move paragraph on why its called a duck array upwards * delete section on numpy ufuncs * explain difference between .values and to_numpy * strongly prefer to_numpy over values * recommend to_numpy instead of values in the how do I? page * clearer about using to_numpy * merge section on missing features * remove todense from examples * whatsnew * double that Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> * numpy array class clarification Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> * Remove sentence about xarray's internals Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> * array API standard Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> * proper link for sparse.COO type Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> * links to docstrings of array types Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> * don't put variable in parentheses Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> * double backquote formatting Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> * better bracketing Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> * fix list formatting * add links to glue packages, dask, and cubed * link to todense method Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> * link to numpy-like arrays page * link to numpy ufunc docs * more text about chunkmanagers * add example of using .to_numpy * note on ideally not having an entrypoint system * parallel processing without chunks * explain the user interface * how to register the chunkmanager * show example of .values failing * link from duck arrays page * whatsnew * move whatsnew entry to unreleased version * capitalization * fix warning in docs build * fix a bunch of links * display API of ChunkManagerEntrypoint class attributes and methods * improve docstrings in ABC * add cubed to intersphinx mapping * link to dask.array as module not class * typo * fix bold formatting * proper docstrings * mention from_array specifically and link to requirements section of duck array internals page * add explicit link to cubed * mention ramba and arkouda * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * py:mod * Present tense regarding wrapping cubed Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> * add links to cubed * add references for numpy links in apply_gufunc docstring * fix some broken links to docstrings --------- Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 0de7761 commit 58096a6

File tree

7 files changed

+489
-16
lines changed

7 files changed

+489
-16
lines changed

doc/conf.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -323,6 +323,7 @@
323323
"dask": ("https://docs.dask.org/en/latest", None),
324324
"cftime": ("https://unidata.github.io/cftime", None),
325325
"sparse": ("https://sparse.pydata.org/en/latest/", None),
326+
"cubed": ("https://tom-e-white.com/cubed/", None),
326327
}
327328

328329

doc/internals/chunked-arrays.rst

Lines changed: 102 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,102 @@
1+
.. currentmodule:: xarray
2+
3+
.. _internals.chunkedarrays:
4+
5+
Alternative chunked array types
6+
===============================
7+
8+
.. warning::
9+
10+
This is a *highly* experimental feature. Please report any bugs or other difficulties on `xarray's issue tracker <https://github.com/pydata/xarray/issues>`_.
11+
In particular see discussion on `xarray issue #6807 <https://github.com/pydata/xarray/issues/6807>`_
12+
13+
Xarray can wrap chunked dask arrays (see :ref:`dask`), but can also wrap any other chunked array type that exposes the correct interface.
14+
This allows us to support using other frameworks for distributed and out-of-core processing, with user code still written as xarray commands.
15+
In particular xarray also supports wrapping :py:class:`cubed.Array` objects
16+
(see `Cubed's documentation <https://tom-e-white.com/cubed/>`_ and the `cubed-xarray package <https://github.com/xarray-contrib/cubed-xarray>`_).
17+
18+
The basic idea is that by wrapping an array that has an explicit notion of ``.chunks``, xarray can expose control over
19+
the choice of chunking scheme to users via methods like :py:meth:`DataArray.chunk` whilst the wrapped array actually
20+
implements the handling of processing all of the chunks.
21+
22+
Chunked array methods and "core operations"
23+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
24+
25+
A chunked array needs to meet all the :ref:`requirements for normal duck arrays <internals.duckarrays.requirements>`, but must also
26+
implement additional features.
27+
28+
Chunked arrays have additional attributes and methods, such as ``.chunks`` and ``.rechunk``.
29+
Furthermore, Xarray dispatches chunk-aware computations across one or more chunked arrays using special functions known
30+
as "core operations". Examples include ``map_blocks``, ``blockwise``, and ``apply_gufunc``.
31+
32+
The core operations are generalizations of functions first implemented in :py:mod:`dask.array`.
33+
The implementation of these functions is specific to the type of arrays passed to them. For example, when applying the
34+
``map_blocks`` core operation, :py:class:`dask.array.Array` objects must be processed by :py:func:`dask.array.map_blocks`,
35+
whereas :py:class:`cubed.Array` objects must be processed by :py:func:`cubed.map_blocks`.
36+
37+
In order to use the correct implementation of a core operation for the array type encountered, xarray dispatches to the
38+
corresponding subclass of :py:class:`~xarray.core.parallelcompat.ChunkManagerEntrypoint`,
39+
also known as a "Chunk Manager". Therefore **a full list of the operations that need to be defined is set by the
40+
API of the** :py:class:`~xarray.core.parallelcompat.ChunkManagerEntrypoint` **abstract base class**. Note that chunked array
41+
methods are also currently dispatched using this class.
42+
43+
Chunked array creation is also handled by this class. As chunked array objects have a one-to-one correspondence with
44+
in-memory numpy arrays, it should be possible to create a chunked array from a numpy array by passing the desired
45+
chunking pattern to an implementation of :py:class:`~xarray.core.parallelcompat.ChunkManagerEntrypoint.from_array``.
46+
47+
.. note::
48+
49+
The :py:class:`~xarray.core.parallelcompat.ChunkManagerEntrypoint` abstract base class is mostly just acting as a
50+
namespace for containing the chunked-aware function primitives. Ideally in the future we would have an API standard
51+
for chunked array types which codified this structure, making the entrypoint system unnecessary.
52+
53+
.. currentmodule:: xarray.core.parallelcompat
54+
55+
.. autoclass:: xarray.core.parallelcompat.ChunkManagerEntrypoint
56+
:members:
57+
58+
Registering a new ChunkManagerEntrypoint subclass
59+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
60+
61+
Rather than hard-coding various chunk managers to deal with specific chunked array implementations, xarray uses an
62+
entrypoint system to allow developers of new chunked array implementations to register their corresponding subclass of
63+
:py:class:`~xarray.core.parallelcompat.ChunkManagerEntrypoint`.
64+
65+
66+
To register a new entrypoint you need to add an entry to the ``setup.cfg`` like this::
67+
68+
[options.entry_points]
69+
xarray.chunkmanagers =
70+
dask = xarray.core.daskmanager:DaskManager
71+
72+
See also `cubed-xarray <https://github.com/xarray-contrib/cubed-xarray>`_ for another example.
73+
74+
To check that the entrypoint has worked correctly, you may find it useful to display the available chunkmanagers using
75+
the internal function :py:func:`~xarray.core.parallelcompat.list_chunkmanagers`.
76+
77+
.. autofunction:: list_chunkmanagers
78+
79+
80+
User interface
81+
~~~~~~~~~~~~~~
82+
83+
Once the chunkmanager subclass has been registered, xarray objects wrapping the desired array type can be created in 3 ways:
84+
85+
#. By manually passing the array type to the :py:class:`~xarray.DataArray` constructor, see the examples for :ref:`numpy-like arrays <userguide.duckarrays>`,
86+
87+
#. Calling :py:meth:`~xarray.DataArray.chunk`, passing the keyword arguments ``chunked_array_type`` and ``from_array_kwargs``,
88+
89+
#. Calling :py:func:`~xarray.open_dataset`, passing the keyword arguments ``chunked_array_type`` and ``from_array_kwargs``.
90+
91+
The latter two methods ultimately call the chunkmanager's implementation of ``.from_array``, to which they pass the ``from_array_kwargs`` dict.
92+
The ``chunked_array_type`` kwarg selects which registered chunkmanager subclass to dispatch to. It defaults to ``'dask'``
93+
if Dask is installed, otherwise it defaults to whichever chunkmanager is registered if only one is registered.
94+
If multiple chunkmanagers are registered it will raise an error by default.
95+
96+
Parallel processing without chunks
97+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
98+
99+
To use a parallel array type that does not expose a concept of chunks explicitly, none of the information on this page
100+
is theoretically required. Such an array type (e.g. `Ramba <https://github.com/Python-for-HPC/ramba>`_ or
101+
`Arkouda <https://github.com/Bears-R-Us/arkouda>`_) could be wrapped using xarray's existing support for
102+
:ref:`numpy-like "duck" arrays <userguide.duckarrays>`.

doc/internals/duck-arrays-integration.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,8 @@ Integrating with duck arrays
1111
Xarray can wrap custom numpy-like arrays (":term:`duck array`\s") - see the :ref:`user guide documentation <userguide.duckarrays>`.
1212
This page is intended for developers who are interested in wrapping a new custom array type with xarray.
1313

14+
.. _internals.duckarrays.requirements:
15+
1416
Duck array requirements
1517
~~~~~~~~~~~~~~~~~~~~~~~
1618

doc/internals/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@ The pages in this section are intended for:
2121

2222
variable-objects
2323
duck-arrays-integration
24+
chunked-arrays
2425
extending-xarray
2526
zarr-encoding-spec
2627
how-to-add-new-backend

doc/user-guide/duckarrays.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,7 @@ Some numpy-like array types that xarray already has some support for:
2727

2828
For information on wrapping dask arrays see :ref:`dask`. Whilst xarray wraps dask arrays in a similar way to that
2929
described on this page, chunked array types like :py:class:`dask.array.Array` implement additional methods that require
30-
slightly different user code (e.g. calling ``.chunk`` or ``.compute``).
30+
slightly different user code (e.g. calling ``.chunk`` or ``.compute``). See the docs on :ref:`wrapping chunked arrays <internals.chunkedarrays>`.
3131

3232
Why "duck"?
3333
-----------

doc/whats-new.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,8 @@ Bug fixes
3838
Documentation
3939
~~~~~~~~~~~~~
4040

41+
- Added page on wrapping chunked numpy-like arrays as alternatives to dask arrays.
42+
(:pull:`7951`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
4143
- Expanded the page on wrapping numpy-like "duck" arrays.
4244
(:pull:`7911`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
4345
- Added examples to docstrings of :py:meth:`Dataset.isel`, :py:meth:`Dataset.reduce`, :py:meth:`Dataset.argmin`,

0 commit comments

Comments
 (0)