Skip to content

Update assign_coords with a MultiIndex to match new Coordinates API #8039

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
aulemahal opened this issue Aug 1, 2023 · 11 comments · Fixed by #8094
Closed

Update assign_coords with a MultiIndex to match new Coordinates API #8039

aulemahal opened this issue Aug 1, 2023 · 11 comments · Fixed by #8094

Comments

@aulemahal
Copy link
Contributor

What is your issue?

A pattern we used in xclim (and elsewhere) seems to be broken on the master.

See MWE:

import pandas as pd
import xarray as xr

da = xr.DataArray([1] * 730, coords={"time": xr.date_range('1900-01-01', periods=730, freq='D', calendar='noleap')})
mulind = pd.MultiIndex.from_arrays((da.time.dt.year.values, da.time.dt.dayofyear.values), names=('year', 'doy'))

# Override previous time axis with new MultiIndex
da.assign_coords(time=mulind).unstack('time')

Now this works ok with both the current master and the latest release. However, if we chunk da, the last line now fails:

da.chunk(time=50).assign_coords(time=mulind).unstack('time')

On the master, this gives: ValueError: unmatched keys found in indexes and variables: {'year', 'doy'}

Full traceback:


ValueError Traceback (most recent call last)
Cell In[44], line 1
----> 1 da.chunk(time=50).assign_coords(time=mulind).unstack("time")

File ~/Projets/xarray/xarray/core/dataarray.py:2868, in DataArray.unstack(self, dim, fill_value, sparse)
2808 def unstack(
2809 self,
2810 dim: Dims = None,
2811 fill_value: Any = dtypes.NA,
2812 sparse: bool = False,
2813 ) -> DataArray:
2814 """
2815 Unstack existing dimensions corresponding to MultiIndexes into
2816 multiple new dimensions.
(...)
2866 DataArray.stack
2867 """
-> 2868 ds = self._to_temp_dataset().unstack(dim, fill_value, sparse)
2869 return self._from_temp_dataset(ds)

File ~/Projets/xarray/xarray/core/dataset.py:5481, in Dataset.unstack(self, dim, fill_value, sparse)
5479 for d in dims:
5480 if needs_full_reindex:
-> 5481 result = result._unstack_full_reindex(
5482 d, stacked_indexes[d], fill_value, sparse
5483 )
5484 else:
5485 result = result._unstack_once(d, stacked_indexes[d], fill_value, sparse)

File ~/Projets/xarray/xarray/core/dataset.py:5365, in Dataset._unstack_full_reindex(self, dim, index_and_vars, fill_value, sparse)
5362 else:
5363 # TODO: we may depreciate implicit re-indexing with a pandas.MultiIndex
5364 xr_full_idx = PandasMultiIndex(full_idx, dim)
-> 5365 indexers = Indexes(
5366 {k: xr_full_idx for k in index_vars},
5367 xr_full_idx.create_variables(index_vars),
5368 )
5369 obj = self._reindex(
5370 indexers, copy=False, fill_value=fill_value, sparse=sparse
5371 )
5373 for name, var in obj.variables.items():

File ~/Projets/xarray/xarray/core/indexes.py:1435, in Indexes.init(self, indexes, variables, index_type)
1433 unmatched_keys = set(indexes) ^ set(variables)
1434 if unmatched_keys:
-> 1435 raise ValueError(
1436 f"unmatched keys found in indexes and variables: {unmatched_keys}"
1437 )
1439 if any(not isinstance(idx, index_type) for idx in indexes.values()):
1440 index_type_str = f"{index_type.module}.{index_type.name}"

ValueError: unmatched keys found in indexes and variables: {'year', 'doy'}

This seems related to PR #7368.

The reason for the title of this issue is that in both versions, I now realize the da.assign_coords(time=mulind) prints as:

<xarray.DataArray (time: 730)>
dask.array<xarray-<this-array>, shape=(730,), dtype=int64, chunksize=(50,), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) object MultiIndex

Something's fishy, because the two "sub" indexes are not showing.

And indeed, with the current master, I can get this to work by doing (again changing the last line):

da2 = xr.DataArray(da.data, coords=xr.Coordinates.from_pandas_multiindex(mulind, 'time'))
da2.chunk(time=50).unstack('time')

But it seems a bit odd to me that we need to reconstruct the DataArray to replace its coordinate with a "MultiIndex" one.

Thus, my questions are:

  1. How does one properly override a coordinate by a MultiIndex ? Is there a way to use assign_coords ? If not, then this issue would become a feature request.
  2. Is this a regression ? Or was I just "lucky" before ?
@aulemahal aulemahal added the needs triage Issue that has not been reviewed by xarray team member label Aug 1, 2023
@TomNicholas
Copy link
Member

@benbovy

@benbovy
Copy link
Member

benbovy commented Aug 2, 2023

On the current main you should be able to do this:

mulind_coords = xr.Coordinates.from_pandas_multiindex(mulind, 'time')

da.chunk(time=50).assign_coords(mulind_coords)

Directly using a pandas MultiIndex for creating or overriding Xarray coordinates is now deprecated. That said, it should still be supported in principle, so the inconsistent coordinates vs. multi-index state you get after chunk is probably a bug (maybe a special case that was missed during the index refactor and for which there is no xarray test?). We should fix it with a nicer user (deprecation) warning.

@dcherian dcherian added regression and removed needs triage Issue that has not been reviewed by xarray team member labels Aug 2, 2023
@aulemahal
Copy link
Contributor Author

aulemahal commented Aug 3, 2023

Thanks @benbovy !

Sadly, your proposed solution did not work, it failed with: ValueError: cannot re-index or align objects with conflicting indexes found for the following coordinates: 'time' (2 conflicting indexes).

However, the following line:

da.chunk(time=50).drop_vars('time').assign_coords(time=mulind)

works with both the master and 2023.7.0. Meaning that the multi-coordinates are correctly shown in the DataArray repr and that unstacking time doesn't fail with the unmatched keys error.
Of course, on master, this also works:

mulind_coords = xr.Coordinates.from_pandas_multiindex(mulind, 'time')
da.chunk(time=50).drop_vars('time').assign_coords(mulind_coords)

In my code, I'll fix the error by also converting mulind to a Coordinates object even though it currently doesn't seem to be needed.

It seems to me that overriding a coordinate variable should be possible with assign_coords, even if the new coordinate is not "alignable" with the old one.

Full traceback of the error above:


ValueError Traceback (most recent call last)
Cell In[9], line 8
5 mulind = pd.MultiIndex.from_arrays((da.time.dt.year.values, da.time.dt.dayofyear.values), names=('year', 'doy'))
7 # Override previous time axis with new MultiIndex
----> 8 da.assign_coords(xr.Coordinates.from_pandas_multiindex(mulind, 'time')).unstack('time')

File ~/Projets/xarray/xarray/core/common.py:620, in DataWithCoords.assign_coords(self, coords, **coords_kwargs)
617 else:
618 results = self._calc_assign_results(coords_combined)
--> 620 data.coords.update(results)
621 return data

File ~/Projets/xarray/xarray/core/coordinates.py:467, in Coordinates.update(self, other)
463 other_obj = getattr(other, "variables", other)
465 self._maybe_drop_multiindex_coords(set(other_obj))
--> 467 coords, indexes = merge_coords(
468 [self.variables, other_obj],
469 priority_arg=1,
470 indexes=self.xindexes,
471 )
473 self._update_coords(coords, indexes)

File ~/Projets/xarray/xarray/core/merge.py:555, in merge_coords(objects, compat, join, priority_arg, indexes, fill_value)
553 _assert_compat_valid(compat)
554 coerced = coerce_pandas_values(objects)
--> 555 aligned = deep_align(
556 coerced, join=join, copy=False, indexes=indexes, fill_value=fill_value
557 )
558 collected = collect_variables_and_indexes(aligned, indexes=indexes)
559 prioritized = _get_priority_vars_and_indexes(aligned, priority_arg, compat=compat)

File ~/Projets/xarray/xarray/core/alignment.py:847, in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value)
844 else:
845 out.append(variables)
--> 847 aligned = align(
848 *targets,
849 join=join,
850 copy=copy,
851 indexes=indexes,
852 exclude=exclude,
853 fill_value=fill_value,
854 )
856 for position, key, aligned_obj in zip(positions, keys, aligned):
857 if key is no_key:

File ~/Projets/xarray/xarray/core/alignment.py:783, in align(join, copy, indexes, exclude, fill_value, *objects)
587 """
588 Given any number of Dataset and/or DataArray objects, returns new
589 objects with aligned indexes and dimension sizes.
(...)
773
774 """
775 aligner = Aligner(
776 objects,
777 join=join,
(...)
781 fill_value=fill_value,
782 )
--> 783 aligner.align()
784 return aligner.results

File ~/Projets/xarray/xarray/core/alignment.py:567, in Aligner.align(self)
565 self.find_matching_indexes()
566 self.find_matching_unindexed_dims()
--> 567 self.assert_no_index_conflict()
568 self.align_indexes()
569 self.assert_unindexed_dim_sizes_equal()

File ~/Projets/xarray/xarray/core/alignment.py:312, in Aligner.assert_no_index_conflict(self)
308 if dup:
309 items_msg = ", ".join(
310 f"{k!r} ({v} conflicting indexes)" for k, v in dup.items()
311 )
--> 312 raise ValueError(
313 "cannot re-index or align objects with conflicting indexes found for "
314 f"the following {msg}: {items_msg}\n"
315 "Conflicting indexes may occur when\n"
316 "- they relate to different sets of coordinate and/or dimension names\n"
317 "- they don't have the same type\n"
318 "- they may be used to reindex data along common dimensions"
319 )

ValueError: cannot re-index or align objects with conflicting indexes found for the following coordinates: 'time' (2 conflicting indexes)
Conflicting indexes may occur when

  • they relate to different sets of coordinate and/or dimension names
  • they don't have the same type
  • they may be used to reindex data along common dimensions

aulemahal added a commit to Ouranosinc/xclim that referenced this issue Aug 3, 2023
<!--Please ensure the PR fulfills the following requirements! -->
<!-- If this is your first PR, make sure to add your details to the
AUTHORS.rst! -->
### Pull Request Checklist:
- [x] This PR addresses an already opened issue (for bug fixes /
features)
    - This PR fixes #1417
- [ ] Tests for the changes have been added (for bug fixes / features)
- [ ] (If applicable) Documentation has been added / updated (for bug
fixes / features)
- [x] CHANGES.rst has been updated (with summary of main changes)
- [x] Link to issue (:issue:`number`) and pull request (:pull:`number`)
has been added

### What kind of change does this PR introduce?

Overriding a coordinate with a pandas MultiIndex was not working
properly before, but this didn't result in any errors up to a recent
change in xarray's master. When calling `assign_coords` with a
multiindex on a coordinate that already existed, the sub-indexes would
not appear in the DataArray repr. In the most recent xarray master, that
now triggers an error when trying to unstack the multiindex, but only if
dask is used. Before, the unstacking would proceed as normal and the
sub-indexes would magically appear on the result. Dropping the previous
coordinate solves this problem.

The next xarray will implement a new public `xr.Coordinates` class that
should be used to assign coords, instead of a raw `pd.MultiIndex`.
Nonetheless, overriding is still not possible and dropping the previous
coordinate is still needed.

### Does this PR introduce a breaking change?
No.


### Other information:
See pydata/xarray#8039.
@benbovy
Copy link
Member

benbovy commented Aug 3, 2023

It seems to me that overriding a coordinate variable should be possible with assign_coords, even if the new coordinate is not "alignable" with the old one.

Agreed. We don't need full alignment but we would still need to check that dimension sizes match and whether the new assigned coordinate(s) do not break any multi-coordinate index in the original object. Maybe we could just use xarray.core.merge.merge_coordinates_without_align internally but I haven't checked.

@dcherian
Copy link
Contributor

@aulemahal Since you have a noleap calendar, you should be able to use .coarsen(time=365 ).construct() to reshape.

@dcherian
Copy link
Contributor

@pydata/xarray does any one have time to fix this. I think we should consider it a release blocker.

@benbovy
Copy link
Member

benbovy commented Aug 14, 2023

I can have a deeper look at it later this week. I'm not sure that it will be straightforward to fix, though, at least properly 😕. If not, I'll see if we can provide a quick and dirty fix.

@benbovy
Copy link
Member

benbovy commented Aug 17, 2023

@aulemahal I've just run your MWE on the main branch (eceec5f) and it seems to work, could you confirm?

>>> import pandas as pd
>>> import xarray as xr
>>> da = xr.DataArray([1] * 730, coords={"time": xr.date_range('1900-01-01', periods=730, freq='D', calendar='noleap')})
>>> mulind = pd.MultiIndex.from_arrays((da.time.dt.year.values, da.time.dt.dayofyear.values), names=('year', 'doy'))
>>> da.chunk(time=50).assign_coords(time=mulind).unstack('time')
<xarray.DataArray (year: 2, doy: 365)>
dask.array<reshape, shape=(2, 365), dtype=int64, chunksize=(1, 365), chunktype=numpy.ndarray>
Coordinates:
  * year     (year) int64 1900 1901
  * doy      (doy) int64 1 2 3 4 5 6 7 8 9 ... 358 359 360 361 362 363 364 365

There are still two issues to fix:

  1. da.assign_coords(time=mulind) should correctly create coordinates for multi-index levels, and we should warn that this is deprecated
  2. da.assign_coords(time=mulind_coords) shoudn't raise an alignment error, i.e., we should also check and maybe drop single indexed coordinates in the original object prior to align it with the new coordinates (currently only multi-index coordinates are checked in the _maybe_drop_multiindex_coords internal function).

But if the MWE works in main at least we shouldn't consider this as a release blocker.

@aulemahal
Copy link
Contributor Author

The initial "regression" is fixed, it works for me! Thanks a lot!

@benbovy
Copy link
Member

benbovy commented Aug 17, 2023

Great, although I have no idea on what could have fixed that initial regression since #7368 :-)

@dcherian dcherian changed the title How to override coord with a MultiIndex ? Update assign_coords with a MultiIndex to match new Coordinates API Aug 17, 2023
@dcherian
Copy link
Contributor

Thanks all! Let's leave the issue open to track #8039 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants