Skip to content

Assignment of column via .loc for numpy non-ns datetimes #27928

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Sep 13, 2019

Conversation

inmoonlight
Copy link
Contributor

@jreback jreback changed the title Fix issue 27395 Assignment of column via .loc for numpy non-ns datetimes Aug 15, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the most important types of tests are replicating the OP indexing issue. put in pandas/tests/indexing/test_loc.py (see the setting tests)

@@ -32,7 +32,7 @@ Categorical
Datetimelike
^^^^^^^^^^^^
- Bug in :func:`to_datetime` where passing a timezone-naive :class:`DatetimeArray` or :class:`DatetimeIndex` and ``utc=True`` would incorrectly return a timezone-naive result (:issue:`27733`)
-
- Bug in :func:`maybe_cast_to_datetime` where converting a `np.datetime64` to `datetime64[D]` raise `TypeError` (:issue: `27395`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a private function, you would have to indicate the user facing experience here instead (e.g. the assignment)

@@ -1026,7 +1026,7 @@ def maybe_cast_to_datetime(value, dtype, errors="raise"):
)

if is_datetime64 and not is_dtype_equal(dtype, _NS_DTYPE):
if dtype.name in ("datetime64", "datetime64[ns]"):
if dtype.name in ("datetime64", "datetime64[ns]", "datetime64[D]"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just use dtype.kind == 'M'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since pandas doesn't support to convert datetime[ps], I used (dtype.kind == 'M') and (dtype.name != 'datetime64[ps]')

@@ -1044,7 +1044,7 @@ def maybe_cast_to_datetime(value, dtype, errors="raise"):
value = [value]

elif is_timedelta64 and not is_dtype_equal(dtype, _TD_DTYPE):
if dtype.name in ("timedelta64", "timedelta64[ns]"):
if dtype.name in ("timedelta64", "timedelta64[ns]", "timedelta64[D]"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dtype.kind == 'm'

Copy link
Contributor Author

@inmoonlight inmoonlight Aug 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since pandas doesn't support to convert timedelta[ps], I used (dtype.kind == 'm') and (dtype.name != 'timedelta64[ps]')

@@ -169,3 +170,14 @@ def test_cast_scalar_to_array(obj, dtype):

arr = cast_scalar_to_array(shape, obj, dtype=dtype)
tm.assert_numpy_array_equal(arr, exp)


@pytest.mark.parametrize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its ok to test this i guess, but would need to parameterize on datetime64[D], ns, datetime64 etc and check the result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. I made all possible datetime formats except [ps] (as mentioned above)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @jreback's comment was requesting testing those other frequencies here.

And then assert the result matches too.

result = maybe_cast_to_datetime(obj, dtype)
expected = pd.Timestamp(...)  # right?
assert result == expected

@jreback jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label Aug 15, 2019
- update doc
- update dtype matching fucntion
- update test
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks close. A few questions on the tests.

@@ -85,6 +84,7 @@ Indexing
- Bug in partial-string indexing returning a NumPy array rather than a ``Series`` when indexing with a scalar like ``.loc['2015']`` (:issue:`27516`)
- Break reference cycle involving :class:`Index` and other index classes to allow garbage collection of index objects without running the GC. (:issue:`27585`, :issue:`27840`)
- Fix regression in assigning values to a single column of a DataFrame with a ``MultiIndex`` columns (:issue:`27841`).
- Fix assignment of column via `.loc` with numpy `non-ns datetime` type (:issue: `27395`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Fix assignment of column via `.loc` with numpy `non-ns datetime` type (:issue: `27395`)
- Fix assignment of column via `.loc` with numpy `non-ns datetime` type (:issue:`27395`)

I'm not sure if sphinx will complain about the space or not.

Also, I think the backticks on "non-ns datetime" aren't necessary.

@@ -169,3 +170,14 @@ def test_cast_scalar_to_array(obj, dtype):

arr = cast_scalar_to_array(shape, obj, dtype=dtype)
tm.assert_numpy_array_equal(arr, exp)


@pytest.mark.parametrize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @jreback's comment was requesting testing those other frequencies here.

And then assert the result matches too.

result = maybe_cast_to_datetime(obj, dtype)
expected = pd.Timestamp(...)  # right?
assert result == expected

}
)
df.loc[:, "day"] = df.loc[:, "timestamp"].values.astype(dtype)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an assert about the result here? tm.assert_series_equal(df['day'], expected)

Copy link
Contributor Author

@inmoonlight inmoonlight Aug 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I update the tests, I found a strange result. (Not sure this is intended.)

df = pd.DataFrame({'timestamp': [np.datetime64('2017-01-01 01:02:23'), np.datetime64('2018-11-12 12:34:12')]})
df['year'] = df.loc[:, 'timestamp'].values.astype('datetime64[Y]')
df['month'] = df.loc[:, 'timestamp'].values.astype('datetime64[M]')
df.loc[:, 'day'] = df.loc[:, 'timestamp'].values.astype('datetime64[D]')

result:

            timestamp        day      month       year
0 2017-01-01 01:02:23 2017-01-01 2017-01-01 2017-01-01
1 2018-11-12 12:34:12 2018-11-12 2018-11-01 2018-01-01

astype result is okay.

df.loc[:, 'timestamp'].values.astype('datetime64[Y]')

result:

array(['2017', '2018'], dtype='datetime64[Y]')

If this is intended, then I'll make an assert with this result and remove previous test (pandas/tests/dtypes/cast/test_infer_dtype.py)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without running the test in master, it isn't obvious to the reader what is being tested. I suggest something like:

@pytest.mark.parametrize("unit", ...)
def test_loc_setitem_datetime64_non_ns_converts(self, unit):
     # GH#27928 a few words about how this used to break
    df = DataFrame(
            {
                "timestamp": [
                    np.datetime64("2017-02-11 12:41:29"),
                    np.datetime64("1991-11-07 04:22:37"),
                ]
            }
        )
        df[unit] = df["timestamp"].values.astype("datetime64[{unit}]".format(unit=unit))
        df.loc[:, unit] = df["timestamp"].values.astype("datetime64[{unit}]".format(unit=unit))

@@ -1055,7 +1055,7 @@ def maybe_cast_to_datetime(value, dtype, errors="raise"):
)

if is_scalar(value):
if value == iNaT or isna(value):
if value is iNaT or isna(value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel something to address later, we need to separate integer tests here

@inmoonlight is there a reason you changed this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integer iNaT is not interned, so this seems dangerous

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
I found this led to the failure in pandas-dev.pandas (Linux py37_np_dev) env. If you all think it is dangerous, I'll put it back.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restore the original I think. We've had some flaky tests in master recently that we're fixing elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger
Restored it.

@@ -1044,7 +1044,7 @@ def maybe_cast_to_datetime(value, dtype, errors="raise"):
value = [value]

elif is_timedelta64 and not is_dtype_equal(dtype, _TD_DTYPE):
if dtype.name in ("timedelta64", "timedelta64[ns]"):
if (dtype.kind == "m") and (dtype.name != "timedelta64[ps]"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's leave [ps] for another issue. I am not sure what we should do about it. we could convert it with precision loss rather than actually raising (which is what I think it would do now)

Copy link
Contributor Author

@inmoonlight inmoonlight Aug 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
Without this, ci check failed...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you remember the CI failure? It may have been a flaky test.

Copy link
Contributor Author

@inmoonlight inmoonlight Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TomAugspurger and others added 3 commits August 21, 2019 08:29
- update docs
- remove previous tests
- update tests
- revert to the changes that made to pass the ci tests
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Aug 26, 2019 via email

@inmoonlight
Copy link
Contributor Author

@TomAugspurger

Desperately...

I looked into several CI issues and found they all raise the same error.

=================================== FAILURES ===================================
_ TestSeriesConstructors.test_constructor_generic_timestamp_bad_frequency[m8[ps]-cannot convert timedeltalike] _
[gw0] linux -- Python 3.7.4 /home/travis/miniconda3/envs/pandas-dev/bin/python
self = <pandas.tests.series.test_constructors.TestSeriesConstructors object at 0x7f2b6f5c2910>
dtype = 'm8[ps]', msg = 'cannot convert timedeltalike'
    @pytest.mark.parametrize(
        "dtype,msg",
        [
            ("m8[ps]", "cannot convert timedeltalike"),
            ("M8[ps]", "cannot convert datetimelike"),
        ],
    )
    def test_constructor_generic_timestamp_bad_frequency(self, dtype, msg):
        # see gh-15524, gh-15987
    
        with pytest.raises(TypeError, match=msg):
>           Series([], dtype=dtype)
E           Failed: DID NOT RAISE <class 'TypeError'>
pandas/tests/series/test_constructors.py:1363: Failed
_ TestSeriesConstructors.test_constructor_generic_timestamp_bad_frequency[M8[ps]-cannot convert datetimelike] _
[gw0] linux -- Python 3.7.4 /home/travis/miniconda3/envs/pandas-dev/bin/python
self = <pandas.tests.series.test_constructors.TestSeriesConstructors object at 0x7f2b6f5df750>
dtype = 'M8[ps]', msg = 'cannot convert datetimelike'
    @pytest.mark.parametrize(
        "dtype,msg",
        [
            ("m8[ps]", "cannot convert timedeltalike"),
            ("M8[ps]", "cannot convert datetimelike"),
        ],
    )
    def test_constructor_generic_timestamp_bad_frequency(self, dtype, msg):
        # see gh-15524, gh-15987
    
        with pytest.raises(TypeError, match=msg):
>           Series([], dtype=dtype)
E           Failed: DID NOT RAISE <class 'TypeError'>
pandas/tests/series/test_constructors.py:1363: Failed

At first glance, I thought pandas does not support [ps] level. This led to 98a5061 .

How could I solve it?

@TomAugspurger
Copy link
Contributor

At first glance, I thought pandas does not support [ps] level.

That's correct. I think if the user explicitly requests [ps] we raise, and if we're giving a [ps]-resolution array we convert to [ns] (is that correct @jbrockmendel?)

@jbrockmendel
Copy link
Member

I think that's right. xref #23571

@inmoonlight
Copy link
Contributor Author

@TomAugspurger

So,,, raising ValueError when allocating [ps] resolution is an issue to be solved...
Which means pandas/tests/series/test_constructors.py:1363 can be removed?

Am I right? 🤔

@TomAugspurger
Copy link
Contributor

Which means pandas/tests/series/test_constructors.py:1363 can be removed?

IIUC, I think that should stay. I think these are the expected behaviors

>>> pd.Series([], dtype='datetime64[ps]')                                    # raises
>>> pd.Series(np.array([], dtype='datetime64[ps]'), dtype='datetime64[ps]')  # raises
>>> pd.Series(np.array([], dtype='datetime64[ps]'))                          # OK

@inmoonlight
Copy link
Contributor Author

@TomAugspurger

Thanks for the explanations!
The final modification raises an error on those inputs.
Still more modification required?

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel can you take another look here?

@@ -1026,7 +1026,7 @@ def maybe_cast_to_datetime(value, dtype, errors="raise"):
)

if is_datetime64 and not is_dtype_equal(dtype, _NS_DTYPE):
if dtype.name in ("datetime64", "datetime64[ns]"):
if dtype <= np.dtype("M8[ns]"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does <= do for numpy dtypes? Check if it's a subtype?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear to me either:

>>> "timedelta64[ns]" < np.dtype("m8[ns]")
False
>>> "timedelta64[us]" < np.dtype("m8[ns]")
True
>>> "timedelta64[ps]" > np.dtype("m8[ns]")
True
>>> "timedelta64" < np.dtype("m8[ns]")
True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger @jbrockmendel
It checks whether the dtype timespan is longer than [ns] or shorter. (FYI, https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment to this effect (in both places)

@@ -49,7 +49,7 @@ Interval
Indexing
^^^^^^^^

-
- Fix assignment of column via `.loc` with numpy non-ns datetime type (:issue:`27395`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to go to v1.0.0.rst now, sorry.

@jreback jreback added this to the 1.0 milestone Sep 13, 2019
@jreback jreback merged commit 810fa77 into pandas-dev:master Sep 13, 2019
@jreback
Copy link
Contributor

jreback commented Sep 13, 2019

thanks @inmoonlight

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Assignment of column via .loc
4 participants