Skip to content

API: __getitem__[int] vs __setitem__[int] with Float64Index #33469

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jbrockmendel opened this issue Apr 10, 2020 · 6 comments · Fixed by #42215
Closed

API: __getitem__[int] vs __setitem__[int] with Float64Index #33469

jbrockmendel opened this issue Apr 10, 2020 · 6 comments · Fixed by #42215
Labels
API - Consistency Internal Consistency of API/Behavior API Design Enhancement Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@jbrockmendel
Copy link
Member

ser = pd.Series([1, 2, 3, 4], index=[1.1, 2.1, 3.0, 4.1])

>>> ser[5]                    # <--we are treating 5 as a label
KeyError: 5

>>> ser[5] = 5                # < --we are treating 5 as position
IndexError: index 5 is out of bounds for axis 0 with size 4

>>> ser[3] = 5                # <-- we are treating 3 as a label
>>> ser
1.1    1
2.1    2
3.0    5
4.1    4
dtype: int64

The ser[5] = 5 case is an outlier because instead of having its label-vs-positional behavior determined by ser.index._should_fallback_to_positional, it is defermined by if is_integer(key) and not self.index.inferred_type == "integer":

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 10, 2020
@jbrockmendel jbrockmendel added API - Consistency Internal Consistency of API/Behavior API Design Indexing Related to indexing on series/frames, not to indexes themselves and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 13, 2020
@jorisvandenbossche
Copy link
Member

This API is certainly a mess .. ;)

The proposal is to change the setitem case where it fallbacks to positional to also be strictly label-based?

But so the integer would also be interpreted as label, and thus do enlargement in case it is not yet present?

Using your example series:

In [40]: ser
Out[40]: 
1.1    1
2.1    2
3.0    3
4.1    4
dtype: int64

In [41]: ser[2] = 100

In [42]: ser
Out[42]: 
1.1      1
2.1      2
3.0    100
4.1      4
dtype: int64

In [43]: ser[2.0] = -100

In [44]: ser
Out[44]: 
1.1      1
2.1      2
3.0    100
4.1      4
2.0   -100
dtype: int64

I was first thinking this would be a relatively safe change, since it right now already tries label-based for setitem first (so if the integer key is present as a label, it would already do label-based now). So the potential difference in behaviour is only when the label is not present. Right now that can update a value, but in the future add another value like the above? Which is kind of silently resulting in different data. We could also deprecate integer keys for setitem first or raise an error for it?

@jbrockmendel
Copy link
Member Author

We could also deprecate integer keys for setitem first or raise an error for it?

Do you mean "deprecate ever treating integer as a label for Float64Index"? Or "deprecate allowing ser[integer] with Float64Index"? The first seems reasonable.

I'd be OK with "integers are always considered labels" or "integers are always considered positional", so there is no question of fall-back behavior.

@jbrockmendel
Copy link
Member Author

Options that come to mind for fixing this:

  1. Deprecate casting ints to floats in Float64Index.get_loc, so that ser[4] is always positional when we have a Float64Index.

  2. Change Float64Index._should_fallback_to_positional to return not (self == self.astype(int)).any() (ATM it always returns False).

    • Upsides: Fixes the OP issue without breaking any tests, fixes the inconsistency of how we check for fallback in Series.__setitem__ vs Series.__getitem__
    • Downsides: perf (caching would help some), value-dependent behavior
  3. Change Float64Index._should_fallback_to_positional to always return True.

    • Breaks 5 tests, 4 of which are checking that we raise, the last is trivial to fix.

@jorisvandenbossche
Copy link
Member

1. Deprecate casting ints to floats in Float64Index.get_loc, so that ser[4] is always positional when we have a Float64Index.

Currently I think ser[4] is always strictly label-based for getitem? Since that part is actually consistent / unambiguous (if you know the semantics ..) at this moment, I would maybe leave it as is. Or if we deprecate it, have it raise an error in the future instead of being positional.

2. Change Float64Index._should_fallback_to_positional to return not (self == self.astype(int)).any() (ATM it always returns False).

  • Downsides: perf (caching would help some), value-dependent behavior

For that last reason (value-dependent behavior), I wouldn't go for this. ser[2] meaning something different for pd.Series([..], index=[1.1, 2.0]) vs pd.Series([..], index=[1.0, 2.0]) seems quite confusing or surprising.

3. Change Float64Index._should_fallback_to_positional to always return True.

What's the behavioral consequence of this? So that if the key (cast to float) is not present, always use positional?

Is a 4th option to change Float64Index._should_fallback_to_positional to always return False? (like it is for Int64Index?)

@jbrockmendel
Copy link
Member Author

Note: in all the options described above, we are changing in Series.__setitem__

-            if is_integer(key) and self.index.inferred_type != "integer":
+            if is_integer(key) and self.index._should_fallback_to_positional():
  1. Change Float64Index._should_fallback_to_positional to always return True.

What's the behavioral consequence of this? So that if the key (cast to float) is not present, always use positional?

If we do that, and add in __getitem__ a clause so that we try label-based before falling back, then we break 5 tests e.g.

    def test_floating_misc(self):
        ser = Series(np.arange(5), index=np.arange(5) * 2.5, dtype=np.int64)

        with pytest.raises(KeyError, match=r"^4$"):
>           ser[4]
E           Failed: DID NOT RAISE <class 'KeyError'>

(without the extra clause in __getitem__ we break 9 tests, including a few that aren't with pytest.raises(...))

Is a 4th option to change Float64Index._should_fallback_to_positional to always return False? (like it is for Int64Index?)

That is the status quo for _should_fallback_to_positional. If we change the line in Series.__setitem__ and leave everything else unchanged, we break 1 test (huh, thought it was more:

    def test_setitem_index_float64(self):
        val = 5
        obj = pd.Series([1, 2, 3, 4], index=[1.1, 2.1, 3.1, 4.1])
        assert obj.index.dtype == np.float64
    
        # float + int -> int
        temp = obj.copy()
        msg = "index 5 is out of bounds for axis 0 with size 4"
        with pytest.raises(IndexError, match=msg):
>           temp[5] = 5
E           Failed: DID NOT RAISE <class 'IndexError'>

@jreback
Copy link
Contributor

jreback commented Jun 24, 2021

IIUC I think integers should always be label based for FloatIndex. so this means deprecating (and then changing) the setitem behavior.
getitem would be unchanged i think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior API Design Enhancement Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants