Change the default dtype of get_dummies to bool

kianelbo · kianelbo · commit 9f7fbc4ab9c7 · 2022-09-22T18:04:07.000+02:00
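
As a rough illustration of what this change means for callers (a sketch, not part of the commit itself): with this patch applied, calling `get_dummies` without a `dtype` should yield `bool` columns, while passing `dtype=np.uint8` explicitly reproduces the previous integer output. The variable names below are illustrative only, and the default shown by the first call depends on whether the installed pandas includes this change.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abca"))

# Default dtype: bool once this change is in place (uint8 before it).
default_dummies = pd.get_dummies(s)

# Passing dtype explicitly pins the column type regardless of the default.
uint8_dummies = pd.get_dummies(s, dtype=np.uint8)
bool_dummies = pd.get_dummies(s, dtype=bool)

print(default_dummies.dtypes)
print(uint8_dummies.dtypes)
```

Code that relies on integer arithmetic over the dummy columns can keep passing `dtype=np.uint8`, which is what the updated `test_sort_values_sparse_no_warning` test in this diff does.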
diff --git a/doc/source/user_guide/reshaping.rst b/doc/source/user_guide/reshaping.rst
@@ -608,7 +608,6 @@ values, can derive a :class:`DataFrame` containing ``k`` columns of 1s and 0s us
 :func:`~pandas.get_dummies`:
 
 .. ipython:: python
-   :okwarning:
 
    df = pd.DataFrame({"key": list("bbacab"), "data1": range(6)})
 
@@ -618,7 +617,6 @@ Sometimes it's useful to prefix the column names, for example when merging the r
 with the original :class:`DataFrame`:
 
 .. ipython:: python
-   :okwarning:
 
    dummies = pd.get_dummies(df["key"], prefix="key")
    dummies
@@ -628,7 +626,6 @@ with the original :class:`DataFrame`:
 This function is often used along with discretization functions like :func:`~pandas.cut`:
 
 .. ipython:: python
-   :okwarning:
 
    values = np.random.randn(10)
    values
@@ -645,7 +642,6 @@ variables (categorical in the statistical sense, those with ``object`` or
 
 
 .. ipython:: python
-    :okwarning:
 
     df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["c", "c", "b"], "C": [1, 2, 3]})
     pd.get_dummies(df)
@@ -654,7 +650,6 @@ All non-object columns are included untouched in the output. You can control
 the columns that are encoded with the ``columns`` keyword.
 
 .. ipython:: python
-    :okwarning:
 
     pd.get_dummies(df, columns=["A"])
 
@@ -672,7 +667,6 @@ the prefix separator. You can specify ``prefix`` and ``prefix_sep`` in 3 ways:
 * dict: Mapping column name to prefix.
 
 .. ipython:: python
-    :okwarning:
 
     simple = pd.get_dummies(df, prefix="new_prefix")
     simple
@@ -686,7 +680,6 @@ variable to avoid collinearity when feeding the result to statistical models.
 You can switch to this mode by turn on ``drop_first``.
 
 .. ipython:: python
-    :okwarning:
 
     s = pd.Series(list("abcaa"))
 
@@ -697,7 +690,6 @@ You can switch to this mode by turn on ``drop_first``.
 When a column contains only one level, it will be omitted in the result.
 
 .. ipython:: python
-    :okwarning:
 
     df = pd.DataFrame({"A": list("aaaaa"), "B": list("ababc")})
 
diff --git a/doc/source/whatsnew/v0.13.0.rst b/doc/source/whatsnew/v0.13.0.rst
@@ -501,7 +501,6 @@ Enhancements
 - ``NaN`` handing in get_dummies (:issue:`4446`) with ``dummy_na``
 
   .. ipython:: python
-     :okwarning:
 
      # previously, nan was erroneously counted as 2 here
      # now it is not counted at all
diff --git a/doc/source/whatsnew/v0.15.0.rst b/doc/source/whatsnew/v0.15.0.rst
@@ -1007,7 +1007,6 @@ Other:
   left untouched.
 
   .. ipython:: python
-    :okwarning:
 
     df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
                     'C': [1, 2, 3]})
diff --git a/doc/source/whatsnew/v0.19.0.rst b/doc/source/whatsnew/v0.19.0.rst
@@ -431,7 +431,6 @@ The ``pd.get_dummies`` function now returns dummy-encoded columns as small integ
 **New behavior**:
 
 .. ipython:: python
-   :okwarning:
 
    pd.get_dummies(["a", "b", "a", "c"]).dtypes
 
diff --git a/doc/source/whatsnew/v0.23.0.rst b/doc/source/whatsnew/v0.23.0.rst
@@ -366,7 +366,6 @@ Function ``get_dummies`` now supports ``dtype`` argument
 The :func:`get_dummies` now accepts a ``dtype`` argument, which specifies a dtype for the new columns. The default remains uint8. (:issue:`18330`)
 
 .. ipython:: python
-   :okwarning:
 
    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
    pd.get_dummies(df, columns=['c']).dtypes
diff --git a/doc/source/whatsnew/v0.24.0.rst b/doc/source/whatsnew/v0.24.0.rst
@@ -833,7 +833,6 @@ then all the columns are dummy-encoded, and a :class:`SparseDataFrame` was retur
 Now, the return type is consistently a :class:`DataFrame`.
 
 .. ipython:: python
-   :okwarning:
 
    type(pd.get_dummies(df, sparse=True))
    type(pd.get_dummies(df[['B', 'C']], sparse=True))
diff --git a/doc/source/whatsnew/v1.5.0.rst b/doc/source/whatsnew/v1.5.0.rst
@@ -932,7 +932,6 @@ Other Deprecations
 - Deprecated unused arguments ``encoding`` and ``verbose`` in :meth:`Series.to_excel` and :meth:`DataFrame.to_excel` (:issue:`47912`)
 - Deprecated the ``inplace`` keyword in :meth:`DataFrame.set_axis` and :meth:`Series.set_axis`, use ``obj = obj.set_axis(..., copy=False)`` instead (:issue:`48130`)
 - Deprecated producing a single element when iterating over a :class:`DataFrameGroupBy` or a :class:`SeriesGroupBy` that has been grouped by a list of length 1; A tuple of length one will be returned instead (:issue:`42795`)
-- Deprecated ``np.uint8`` as the default ``dtype`` for :func:`get_dummies` - in a future version, it will be changed to ``bool`` (:issue:`45848`)
 - Fixed up warning message of deprecation of :meth:`MultiIndex.lesort_depth` as public method, as the message previously referred to :meth:`MultiIndex.is_lexsorted` instead (:issue:`38701`)
 - Deprecated the ``sort_columns`` argument in :meth:`DataFrame.plot` and :meth:`Series.plot` (:issue:`47563`).
 - Deprecated positional arguments for all but the first argument of :meth:`DataFrame.to_stata` and :func:`read_stata`, use keyword arguments instead (:issue:`48128`).
@@ -1192,7 +1191,6 @@ Groupby/resample/rolling
 - Bug in :meth:`DataFrameGroupBy.resample` raises ``KeyError`` when getting the result from a key list which misses the resample key (:issue:`47362`)
 - Bug in :meth:`DataFrame.groupby` would lose index columns when the DataFrame is empty for transforms, like fillna (:issue:`47787`)
 - Bug in :meth:`DataFrame.groupby` and :meth:`Series.groupby` with ``dropna=False`` and ``sort=False`` would put any null groups at the end instead the order that they are encountered (:issue:`46584`)
--
 
 Reshaping
 ^^^^^^^^^
@@ -1210,6 +1208,7 @@ Reshaping
 - Bug in :meth:`concat` when ``axis=1`` and ``sort=False`` where the resulting Index was a :class:`Int64Index` instead of a :class:`RangeIndex` (:issue:`46675`)
 - Bug in :meth:`wide_to_long` raises when ``stubnames`` is missing in columns and ``i`` contains string dtype column (:issue:`46044`)
 - Bug in :meth:`DataFrame.join` with categorical index results in unexpected reordering (:issue:`47812`)
+- Bug in :func:`get_dummies` ``np.uint8`` being the default ``dtype``, changed to ``bool`` (:issue:`45848`)
 
 Sparse
 ^^^^^^
diff --git a/doc/source/whatsnew/v1.5.1.rst b/doc/source/whatsnew/v1.5.1.rst
@@ -23,7 +23,7 @@ Fixed regressions
 
 Bug fixes
 ~~~~~~~~~
--
+- Bug in :func:`get_dummies` with default ``dtype`` being ``uint8`` - the default ``dtype`` is now changed to ``bool`` (:issue:`45848`)
 -
 
 .. ---------------------------------------------------------------------------
diff --git a/pandas/core/reshape/encoding.py b/pandas/core/reshape/encoding.py
@@ -1,16 +1,13 @@
 from __future__ import annotations
 
 from collections import defaultdict
-import inspect
 import itertools
 from typing import Hashable
-import warnings
 
 import numpy as np
 
 from pandas._libs.sparse import IntIndex
 from pandas._typing import Dtype
-from pandas.util._exceptions import find_stack_level
 
 from pandas.core.dtypes.common import (
     is_integer_dtype,
@@ -66,7 +63,7 @@ def get_dummies(
     drop_first : bool, default False
         Whether to get k-1 dummies out of k categorical levels by removing the
         first level.
-    dtype : dtype, default np.uint8
+    dtype : dtype, default bool
         Data type for new columns. Only a single dtype is allowed.
 
     Returns
@@ -89,50 +86,50 @@ def get_dummies(
     >>> s = pd.Series(list('abca'))
 
     >>> pd.get_dummies(s)
-       a  b  c
-    0  1  0  0
-    1  0  1  0
-    2  0  0  1
-    3  1  0  0
+           a      b      c
+    0   True  False  False
+    1  False   True  False
+    2  False  False   True
+    3   True  False  False
 
     >>> s1 = ['a', 'b', np.nan]
 
     >>> pd.get_dummies(s1)
-       a  b
-    0  1  0
-    1  0  1
-    2  0  0
+           a      b
+    0   True  False
+    1  False   True
+    2  False  False
 
     >>> pd.get_dummies(s1, dummy_na=True)
-       a  b  NaN
-    0  1  0    0
-    1  0  1    0
-    2  0  0    1
+           a      b    NaN
+    0   True  False  False
+    1  False   True  False
+    2  False  False   True
 
     >>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
     ...                    'C': [1, 2, 3]})
 
     >>> pd.get_dummies(df, prefix=['col1', 'col2'])
        C  col1_a  col1_b  col2_a  col2_b  col2_c
-    0  1       1       0       0       1       0
-    1  2       0       1       1       0       0
-    2  3       1       0       0       0       1
+    0  1    True   False   False    True   False
+    1  2   False    True    True   False   False
+    2  3    True   False   False   False    True
 
     >>> pd.get_dummies(pd.Series(list('abcaa')))
-       a  b  c
-    0  1  0  0
-    1  0  1  0
-    2  0  0  1
-    3  1  0  0
-    4  1  0  0
+           a      b      c
+    0   True  False  False
+    1  False   True  False
+    2  False  False   True
+    3   True  False  False
+    4   True  False  False
 
     >>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
-       b  c
-    0  0  0
-    1  1  0
-    2  0  1
-    3  0  0
-    4  0  0
+           b      c
+    0  False  False
+    1   True  False
+    2  False   True
+    3  False  False
+    4  False  False
 
     >>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
          a    b    c
@@ -236,16 +233,11 @@ def _get_dummies_1d(
     codes, levels = factorize_from_iterable(Series(data))
 
     if dtype is None:
-        warnings.warn(
-            "In a future version of pandas the default dtype will change from "
-            "'uint8' to 'bool', please specify a dtype to silence this warning",
-            FutureWarning,
-            stacklevel=find_stack_level(inspect.currentframe()),
-        )
-        dtype = np.dtype(np.uint8)
+        dtype = bool
     # error: Argument 1 to "dtype" has incompatible type "Union[ExtensionDtype, str,
     # dtype[Any], Type[object]]"; expected "Type[Any]"
-    dtype = np.dtype(dtype)  # type: ignore[arg-type]
+    else:
+        dtype = np.dtype(dtype)  # type: ignore[arg-type]
 
     if is_object_dtype(dtype):
         raise ValueError("dtype=object is not a valid dtype for get_dummies")
diff --git a/pandas/tests/frame/indexing/test_getitem.py b/pandas/tests/frame/indexing/test_getitem.py
@@ -52,9 +52,7 @@ def test_getitem_list_of_labels_categoricalindex_cols(self):
         # GH#16115
         cats = Categorical([Timestamp("12-31-1999"), Timestamp("12-31-2000")])
 
-        expected = DataFrame(
-            [[1, 0], [0, 1]], dtype="uint8", index=[0, 1], columns=cats
-        )
+        expected = DataFrame([[1, 0], [0, 1]], dtype="bool", index=[0, 1], columns=cats)
         dummies = get_dummies(cats)
         result = dummies[list(dummies.columns)]
         tm.assert_frame_equal(result, expected)
diff --git a/pandas/tests/frame/methods/test_sort_values.py b/pandas/tests/frame/methods/test_sort_values.py
@@ -19,7 +19,7 @@ def test_sort_values_sparse_no_warning(self):
         # GH#45618
         # TODO(2.0): test will be unnecessary
         ser = pd.Series(Categorical(["a", "b", "a"], categories=["a", "b", "c"]))
-        df = pd.get_dummies(ser, sparse=True)
+        df = pd.get_dummies(ser, dtype=np.uint8, sparse=True)
 
         with tm.assert_produces_warning(None):
             # No warnings about constructing Index from SparseArray
diff --git a/pandas/tests/reshape/test_get_dummies.py b/pandas/tests/reshape/test_get_dummies.py
