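Pandas 2.2 Chinese Official Tutorials and Guides (17) (2) - Alibaba Cloud Developer Community

Pandas 2.2 Chinese Official Tutorials and Guides (17) (1): https://developer.aliyun.com/article/1509822

Setting categories

If you want to remove and add new categories in one step (which has some speed advantage), or simply set the categories to a predefined scale, use set_categories().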

In [84]: s = pd.Series(["one", "two", "four", "-"], dtype="category")
In [85]: s
Out[85]: 
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']
In [86]: s = s.cat.set_categories(["one", "two", "three", "four"])
In [87]: s
Out[87]: 
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']

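Note

Be aware that Categorical.set_categories() cannot know whether a category was omitted intentionally, because it was misspelled, or (under Python 3) because of a type difference (e.g., NumPy S1 dtype versus Python strings). This can result in surprising behaviour!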
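The risk described in this note can be seen in a small sketch. The misspelled label "fuor" and the variable names are made up for illustration; the point is only that set_categories() raises no error and the affected value silently becomes NaN.

```python
import pandas as pd

s = pd.Series(["one", "two", "four"], dtype="category")

# "fuor" is a deliberate typo for "four" (hypothetical mistake).
# set_categories() cannot tell this apart from an intentional omission.
fixed = s.cat.set_categories(["one", "two", "fuor"])

# Count the values that were dropped to NaN as a sanity check.
n_lost = int(fixed.isna().sum())
print(n_lost)
```

Comparing the number of missing values before and after the call is one cheap way to catch such mistakes.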

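Sorting and order

If categorical data is ordered (s.cat.ordered == True), then the order of the categories has a meaning and certain operations are possible. If the categorical is unordered, .min()/.max() will raise a TypeError.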

In [88]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=False))
In [89]: s = s.sort_values()
In [90]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))
In [91]: s = s.sort_values()
In [92]: s
Out[92]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']
In [93]: s.min(), s.max()
Out[93]: ('a', 'c')
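The TypeError mentioned above is not shown in the transcript, so here is a minimal sketch of it. The variable names are illustrative; the calls themselves (min(), max(), .cat.as_ordered()) are the ones discussed in this section.

```python
import pandas as pd

unordered = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=False))

# min()/max() need an order, so this is expected to raise TypeError.
try:
    unordered.min()
    raised = False
except TypeError:
    raised = True

# as_ordered() returns a new, ordered Series; the original is unchanged.
ordered = unordered.cat.as_ordered()
lowest, highest = ordered.min(), ordered.max()
print(raised, lowest, highest)
```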

You can set categorical data to be ordered by using as_ordered(), or unordered by using as_unordered(). By default these return a new object.

In [94]: s.cat.as_ordered()
Out[94]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']
In [95]: s.cat.as_unordered()
Out[95]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Sorting uses the order defined by the categories, not any lexical order present on the data type. This is true even for string and numeric data:

In [96]: s = pd.Series([1, 2, 3, 1], dtype="category")
In [97]: s = s.cat.set_categories([2, 3, 1], ordered=True)
In [98]: s
Out[98]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
In [99]: s = s.sort_values()
In [100]: s
Out[100]: 
1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
In [101]: s.min(), s.max()
Out[101]: (2, 1)

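Reordering

Reordering the categories is possible via the Categorical.reorder_categories() and Categorical.set_categories() methods. For Categorical.reorder_categories(), all old categories must be included in the new categories and no new categories are allowed. This necessarily makes the sort order the same as the categories order.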

In [102]: s = pd.Series([1, 2, 3, 1], dtype="category")
In [103]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)
In [104]: s
Out[104]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
In [105]: s = s.sort_values()
In [106]: s
Out[106]: 
1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
In [107]: s.min(), s.max()
Out[107]: (2, 1)
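The constraint on reorder_categories() can be checked directly. The sketch below assumes that a list which drops a category is rejected with a ValueError, while set_categories() accepts the same list and turns the dropped values into NaN; variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# Dropping category 1 is not a pure reordering, so a ValueError is expected.
try:
    s.cat.reorder_categories([2, 3], ordered=True)
    rejected = False
except ValueError:
    rejected = True

# set_categories() accepts the shorter list; values of 1 become NaN.
trimmed = s.cat.set_categories([2, 3], ordered=True)
n_missing = int(trimmed.isna().sum())
print(rejected, n_missing)
```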

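Note

Note the difference between assigning new categories and reordering the categories: the first renames the categories, and therefore the individual values in the Series, but if the first position was sorted last, the renamed value will still be sorted last. Reordering means that the way values are sorted is different afterwards, but not that individual values in the Series are changed.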
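A short side-by-side sketch of the two operations may make the difference concrete. The labels "x", "y", "z" are arbitrary and chosen only for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Renaming changes the labels, and therefore the values in the Series.
renamed = s.cat.rename_categories(["x", "y", "z"])

# Reordering keeps the values but changes how they sort.
reordered = s.cat.reorder_categories(["c", "b", "a"], ordered=True)
sorted_values = list(reordered.sort_values())

print(list(renamed))    # values were relabelled
print(list(reordered))  # values are untouched
print(sorted_values)    # but they now sort in the new category order
```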

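Note

If the Categorical is not ordered, Series.min() and Series.max() will raise TypeError. Numeric operations like +, -, *, / and operations based on them (e.g. Series.median(), which would need to compute the mean of two values if the length of the array is even) do not work and raise a TypeError.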
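As a minimal sketch of the arithmetic restriction: adding a number to a categorical Series is expected to fail, even when the categories are numeric. Converting back with astype("int64") first is one possible workaround and is an illustrative choice, not something this guide prescribes.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# Arithmetic on categorical data is expected to raise TypeError.
try:
    s + 1
    raised = False
except TypeError:
    raised = True

# Convert back to a numeric dtype first if arithmetic is really intended.
shifted = s.astype("int64") + 1
print(raised, list(shifted))
```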

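Multi-column sorting

A categorical dtyped column participates in a multi-column sort in a similar manner to other columns. The ordering of the categorical is determined by the categories of that column.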

In [108]: dfs = pd.DataFrame(
 .....:    {
 .....:        "A": pd.Categorical(
 .....:            list("bbeebbaa"),
 .....:            categories=["e", "a", "b"],
 .....:            ordered=True,
 .....:        ),
 .....:        "B": [1, 2, 1, 2, 2, 1, 2, 1],
 .....:    }
 .....: )
 .....: 
In [109]: dfs.sort_values(by=["A", "B"])
Out[109]: 
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

Reordering the categories changes a future sort.

In [110]: dfs["A"] = dfs["A"].cat.reorder_categories(["a", "b", "e"])
In [111]: dfs.sort_values(by=["A", "B"])
Out[111]: 
   A  B
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2
2  e  1
3  e  2

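Comparisons

Comparing categorical data with other objects is possible in three cases:

Comparing equality (== and !=) to a list-like object (list, Series, array, ...) of the same length as the categorical data.
All comparisons (==, !=, >, >=, <, and <=) of categorical data to another categorical Series, when ordered==True and the categories are the same.
All comparisons of categorical data to a scalar.

All other comparisons, especially "non-equality" comparisons of two categoricals with different categories or of a categorical with any list-like object, will raise a TypeError.

Note

Any "non-equality" comparison of categorical data with a Series, np.array, list, or categorical data with different categories or ordering will raise a TypeError, because a custom categories ordering could be interpreted in two ways: one taking the ordering into account and one without.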

In [112]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
In [113]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))
In [114]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))
In [115]: cat
Out[115]: 
0    1
1    2
2    3
dtype: category
Categories (3, int64): [3 < 2 < 1]
In [116]: cat_base
Out[116]: 
0    2
1    2
2    2
dtype: category
Categories (3, int64): [3 < 2 < 1]
In [117]: cat_base2
Out[117]: 
0    2
1    2
2    2
dtype: category
Categories (1, int64): [2]

Comparing to a categorical with the same categories and ordering, or to a scalar, works:

In [118]: cat > cat_base
Out[118]: 
0     True
1    False
2    False
dtype: bool
In [119]: cat > 2
Out[119]: 
0     True
1    False
2    False
dtype: bool

Equality comparisons work with any list-like object of the same length and with scalars:

In [120]: cat == cat_base
Out[120]: 
0    False
1     True
2    False
dtype: bool
In [121]: cat == np.array([1, 2, 3])
Out[121]: 
0    True
1    True
2    True
dtype: bool
In [122]: cat == 2
Out[122]: 
0    False
1     True
2    False
dtype: bool

This doesn't work because the categories are not the same:

In [123]: try:
 .....:    cat > cat_base2
 .....: except TypeError as e:
 .....:    print("TypeError:", str(e))
 .....: 
TypeError: Categoricals can only be compared if 'categories' are the same.

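If you want to do a "non-equality" comparison of a categorical Series with a list-like object that is not categorical data, you need to be explicit and convert the categorical data back to the original values: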

In [124]: base = np.array([1, 2, 3])
In [125]: try:
 .....:    cat > base
 .....: except TypeError as e:
 .....:    print("TypeError:", str(e))
 .....: 
TypeError: Cannot compare a Categorical for op __gt__ with type <class 'numpy.ndarray'>.
If you want to compare values, use 'np.asarray(cat) <op> other'.
In [126]: np.asarray(cat) > base
Out[126]: array([False, False, False])

When you compare two unordered categoricals with the same categories, the order is not considered:

In [127]: c1 = pd.Categorical(["a", "b"], categories=["a", "b"], ordered=False)
In [128]: c2 = pd.Categorical(["a", "b"], categories=["b", "a"], ordered=False)
In [129]: c1 == c2
Out[129]: array([ True,  True])

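Operations

Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:

Series methods like Series.value_counts() use all categories, even if some categories are not present in the data: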

In [130]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
In [131]: s.value_counts()
Out[131]: 
c    2
a    1
b    1
d    0
Name: count, dtype: int64

DataFrame methods like DataFrame.sum() also show "unused" categories when observed=False.

In [132]: columns = pd.Categorical(
 .....:    ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
 .....: )
 .....: 
In [133]: df = pd.DataFrame(
 .....:    data=[[1, 2, 3], [4, 5, 6]],
 .....:    columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
 .....: ).T
 .....: 
In [134]: df.groupby(level=1, observed=False).sum()
Out[134]: 
       0  1
One    3  9
Two    3  6
Three  0  0

Groupby will also show "unused" categories when observed=False:

In [135]: cats = pd.Categorical(
 .....:    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
 .....: )
 .....: 
In [136]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
In [137]: df.groupby("cats", observed=False).mean()
Out[137]: 
      values
cats
a        1.0
b        2.0
c        4.0
d        NaN
In [138]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
In [139]: df2 = pd.DataFrame(
 .....:    {
 .....:        "cats": cats2,
 .....:        "B": ["c", "d", "c", "d"],
 .....:        "values": [1, 2, 3, 4],
 .....:    }
 .....: )
 .....: 
In [140]: df2.groupby(["cats", "B"], observed=False).mean()
Out[140]: 
        values
cats B
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
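For contrast, here is a small sketch of the same kind of grouping with observed=True, which drops the unused categories from the result. The tiny DataFrame is made up for illustration.

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 3]})

# observed=False keeps the unused category "c" (its sum is 0).
all_groups = df.groupby("cats", observed=False)["values"].sum()

# observed=True reports only the categories present in the data.
seen_groups = df.groupby("cats", observed=True)["values"].sum()

print(all_groups)
print(seen_groups)
```

Passing observed explicitly, as the examples in this section do, avoids depending on the default, which differs between pandas versions.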

Pivot tables:

In [141]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
In [142]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
In [143]: pd.pivot_table(df, values="values", index=["A", "B"], observed=False)
Out[143]: 
     values
A B
a c     1.0
  d     2.0
b c     3.0
  d     4.0

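Data munging

The optimized pandas data access methods .loc, .iloc, .at, and .iat work as normal. The only differences are the return type (for getting) and that only values already in categories can be assigned.

Getting

If the slicing operation returns either a DataFrame or a column of type Series, the category dtype is preserved.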

In [144]: idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
In [145]: cats = pd.Series(["a", "b", "b", "b", "c", "c", "c"], dtype="category", index=idx)
In [146]: values = [1, 2, 2, 2, 3, 4, 5]
In [147]: df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
In [148]: df.iloc[2:4, :]
Out[148]: 
  cats  values
j    b       2
k    b       2
In [149]: df.iloc[2:4, :].dtypes
Out[149]: 
cats      category
values       int64
dtype: object
In [150]: df.loc["h":"j", "cats"]
Out[150]: 
h    a
i    b
j    b
Name: cats, dtype: category
Categories (3, object): ['a', 'b', 'c']
In [151]: df[df["cats"] == "b"]
Out[151]: 
  cats  values
i    b       2
j    b       2
k    b       2

An example where the category type is not preserved is taking a single row: the resulting Series is of dtype object:

# get the complete "h" row as a Series
In [152]: df.loc["h", :]
Out[152]: 
cats      a
values    1
Name: h, dtype: object

Returning a single item from categorical data also returns the value itself, not a categorical of length 1.

In [153]: df.iat[0, 0]
Out[153]: 'a'
In [154]: df["cats"] = df["cats"].cat.rename_categories(["x", "y", "z"])
In [155]: df.at["h", "cats"]  # returns a string
Out[155]: 'x'

Note

This is in contrast to R's factor function, where factor(c(1,2,3))[1] returns a single-value factor.

To get a single-value Series of type category, pass in a list with a single value:

In [156]: df.loc[["h"], "cats"]
Out[156]: 
h    x
Name: cats, dtype: category
Categories (3, object): ['x', 'y', 'z']

String and datetime accessors

The accessors .dt and .str work if s.cat.categories are of an appropriate type:

In [157]: str_s = pd.Series(list("aabb"))
In [158]: str_cat = str_s.astype("category")
In [159]: str_cat
Out[159]: 
0    a
1    a
2    b
3    b
dtype: category
Categories (2, object): ['a', 'b']
In [160]: str_cat.str.contains("a")
Out[160]: 
0     True
1     True
2    False
3    False
dtype: bool
In [161]: date_s = pd.Series(pd.date_range("1/1/2015", periods=5))
In [162]: date_cat = date_s.astype("category")
In [163]: date_cat
Out[163]: 
0   2015-01-01
1   2015-01-02
2   2015-01-03
3   2015-01-04
4   2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]
In [164]: date_cat.dt.day
Out[164]: 
0    1
1    2
2    3
3    4
4    5
dtype: int32

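Note

The returned Series (or DataFrame) is of the same type as if you had used .str.<method> / .dt.<method> on a Series of that type (and not of type category!).

This means that the values returned from methods and properties on the accessors of a Series are equal to the values returned from the same methods and properties on the accessors of that Series converted to type category: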

In [165]: ret_s = str_s.str.contains("a")
In [166]: ret_cat = str_cat.str.contains("a")
In [167]: ret_s.dtype == ret_cat.dtype
Out[167]: True
In [168]: ret_s == ret_cat
Out[168]: 
0    True
1    True
2    True
3    True
dtype: bool

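Note

The work is done on the categories and then a new Series is constructed. This has performance implications if you have a Series of type string in which many elements are repeated (i.e. the number of unique elements in the Series is much smaller than the length of the Series). In that case it can be faster to convert the original Series to type category and use .str.<method> or .dt.<property> on that.

Setting

Setting values in a categorical column (or Series) works as long as the value is included in the categories: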

In [169]: idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
In [170]: cats = pd.Categorical(["a", "a", "a", "a", "a", "a", "a"], categories=["a", "b"])
In [171]: values = [1, 1, 1, 1, 1, 1, 1]
In [172]: df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
In [173]: df.iloc[2:4, :] = [["b", 2], ["b", 2]]
In [174]: df
Out[174]: 
  cats  values
h    a       1
i    a       1
j    b       2
k    b       2
l    a       1
m    a       1
n    a       1
In [175]: try:
 .....:    df.iloc[2:4, :] = [["c", 3], ["c", 3]]
 .....: except TypeError as e:
 .....:    print("TypeError:", str(e))
 .....: 
TypeError: Cannot setitem on a Categorical with a new category, set the categories first
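The error message suggests setting the categories first. A minimal sketch of that fix on a Series, assuming .cat.add_categories() is the chosen way to register the new label (set_categories() would work as well):

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "a", "a"], categories=["a", "b"]))

try:
    s.iloc[0] = "c"  # "c" is not a category yet
    failed = False
except TypeError:
    failed = True

# Register the new category first, then the assignment is accepted.
s = s.cat.add_categories(["c"])
s.iloc[0] = "c"
print(failed, list(s), list(s.cat.categories))
```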

Setting values by assigning categorical data also checks that the categories match:

In [176]: df.loc["j":"k", "cats"] = pd.Categorical(["a", "a"], categories=["a", "b"])
In [177]: df
Out[177]: 
  cats  values
h    a       1
i    a       1
j    a       2
k    a       2
l    a       1
m    a       1
n    a       1
In [178]: try:
 .....:    df.loc["j":"k", "cats"] = pd.Categorical(["b", "b"], categories=["a", "b", "c"])
 .....: except TypeError as e:
 .....:    print("TypeError:", str(e))
 .....: 
TypeError: Cannot set a Categorical with another, without identical categories

Assigning a Categorical to parts of a column of another type uses the values:

In [179]: df = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": ["a", "a", "a", "a", "a"]})
In [180]: df.loc[1:2, "a"] = pd.Categorical(["b", "b"], categories=["a", "b"])
In [181]: df.loc[2:3, "b"] = pd.Categorical(["b", "b"], categories=["a", "b"])
In [182]: df
Out[182]: 
   a  b
0  1  a
1  b  a
2  b  b
3  1  b
4  1  a
In [183]: df.dtypes
Out[183]: 
a    object
b    object
dtype: object

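Merging / concatenation

By default, combining Series or DataFrames that contain the same categories results in category dtype; otherwise the result depends on the dtype of the underlying categories. Merges that result in non-categorical dtypes will likely have higher memory usage. Use .astype or union_categoricals to ensure category results.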

In [184]: from pandas.api.types import union_categoricals
# same categories
In [185]: s1 = pd.Series(["a", "b"], dtype="category")
In [186]: s2 = pd.Series(["a", "b", "a"], dtype="category")
In [187]: pd.concat([s1, s2])
Out[187]: 
0    a
1    b
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
# different categories
In [188]: s3 = pd.Series(["b", "c"], dtype="category")
In [189]: pd.concat([s1, s3])
Out[189]: 
0    a
1    b
0    b
1    c
dtype: object
# Output dtype is inferred based on categories values
In [190]: int_cats = pd.Series([1, 2], dtype="category")
In [191]: float_cats = pd.Series([3.0, 4.0], dtype="category")
In [192]: pd.concat([int_cats, float_cats])
Out[192]: 
0    1.0
1    2.0
0    3.0
1    4.0
dtype: float64
In [193]: pd.concat([s1, s3]).astype("category")
Out[193]: 
0    a
1    b
0    b
1    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
In [194]: union_categoricals([s1.array, s3.array])
Out[194]: 
['a', 'b', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']

The following table summarizes the results of merging Categoricals:

arg1	arg2	identical	result
category	category	True	category
category (object)	category (object)	False	object (dtype is inferred)
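One way to end up in the first row of the table is to cast both inputs to a shared CategoricalDtype before concatenating. This is a sketch of the .astype route mentioned above; the name shared and the category list are assumptions made for the example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s1 = pd.Series(["a", "b"], dtype="category")
s3 = pd.Series(["b", "c"], dtype="category")

# A dtype covering the labels of both inputs (illustrative choice).
shared = CategoricalDtype(categories=["a", "b", "c"])

# With identical categories on both sides, concat keeps the category dtype.
combined = pd.concat([s1.astype(shared), s3.astype(shared)], ignore_index=True)
print(combined.dtype, list(combined.cat.categories))
```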

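If you want to combine categoricals that do not necessarily have the same categories, the union_categoricals() function combines a list-like of categoricals. The new categories are the union of the categories being combined.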

In [195]: from pandas.api.types import union_categoricals
In [196]: a = pd.Categorical(["b", "c"])
In [197]: b = pd.Categorical(["a", "b"])
In [198]: union_categoricals([a, b])
Out[198]: 
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']

By default, the resulting categories are ordered as they appear in the data. If you want the categories to be lexsorted, use the sort_categories=True argument.

In [199]: union_categoricals([a, b], sort_categories=True)
Out[199]: 
['b', 'c', 'a', 'b']
Categories (3, object): ['a', 'b', 'c']

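union_categoricals also works with the "easy" case of combining two categoricals with the same categories and order information (that is, the case where a plain append would also work).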

In [200]: a = pd.Categorical(["a", "b"], ordered=True)
In [201]: b = pd.Categorical(["a", "b", "a"], ordered=True)
In [202]: union_categoricals([a, b])
Out[202]: 
['a', 'b', 'a', 'b', 'a']
Categories (2, object): ['a' < 'b']

The following raises TypeError because the categories are ordered and not identical.

In [203]: a = pd.Categorical(["a", "b"], ordered=True)
In [204]: b = pd.Categorical(["a", "b", "c"], ordered=True)
In [205]: union_categoricals([a, b])
---------------------------------------------------------------------------
TypeError  Traceback (most recent call last)
Cell In[205], line 1
----> 1 union_categoricals([a, b])
File ~/work/pandas/pandas/pandas/core/dtypes/concat.py:341, in union_categoricals(to_union, sort_categories, ignore_order)
  339     if all(c.ordered for c in to_union):
  340         msg = "to union ordered Categoricals, all categories must be the same"
--> 341         raise TypeError(msg)
  342     raise TypeError("Categorical.ordered must be the same")
  344 if ignore_order:
TypeError: to union ordered Categoricals, all categories must be the same

Ordered categoricals with different categories or orderings can be combined by using the ignore_order=True argument.

In [206]: a = pd.Categorical(["a", "b", "c"], ordered=True)
In [207]: b = pd.Categorical(["c", "b", "a"], ordered=True)
In [208]: union_categoricals([a, b], ignore_order=True)
Out[208]: 
['a', 'b', 'c', 'c', 'b', 'a']
Categories (3, object): ['a', 'b', 'c']

union_categoricals() also works with a CategoricalIndex, or a Series containing categorical data, but note that the resulting array will always be a plain Categorical:

In [209]: a = pd.Series(["b", "c"], dtype="category")
In [210]: b = pd.Series(["a", "b"], dtype="category")
In [211]: union_categoricals([a, b])
Out[211]: 
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']

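Note

union_categoricals may recode the integer codes for categories when combining categoricals. This is likely what you want, but be aware of it if you rely on the exact numbering of the categories.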

In [212]: c1 = pd.Categorical(["b", "c"])
In [213]: c2 = pd.Categorical(["a", "b"])
In [214]: c1
Out[214]: 
['b', 'c']
Categories (2, object): ['b', 'c']
# "b" is coded to 0
In [215]: c1.codes
Out[215]: array([0, 1], dtype=int8)
In [216]: c2
Out[216]: 
['a', 'b']
Categories (2, object): ['a', 'b']
# "b" is coded to 1
In [217]: c2.codes
Out[217]: array([0, 1], dtype=int8)
In [218]: c = union_categoricals([c1, c2])
In [219]: c
Out[219]: 
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']
# "b" is coded to 0 throughout, same as c1, different from c2
In [220]: c.codes
Out[220]: array([0, 1, 2, 0], dtype=int8)

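Getting data in/out

You can write data that contains category dtypes to an HDFStore. See the HDF5 section of the pandas IO documentation for an example and caveats.

It is also possible to write data to and read data from Stata format files. See the Stata section of the pandas IO documentation for an example and caveats.

Writing to a CSV file converts the data, effectively removing any information about the categorical (categories and ordering). So if you read the CSV file back, you have to convert the relevant columns back to category and assign the right categories and category ordering.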

In [221]: import io
In [222]: s = pd.Series(pd.Categorical(["a", "b", "b", "a", "a", "d"]))
# rename the categories
In [223]: s = s.cat.rename_categories(["very good", "good", "bad"])
# reorder the categories and add missing categories
In [224]: s = s.cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
In [225]: df = pd.DataFrame({"cats": s, "vals": [1, 2, 3, 4, 5, 6]})
In [226]: csv = io.StringIO()
In [227]: df.to_csv(csv)
In [228]: df2 = pd.read_csv(io.StringIO(csv.getvalue()))
In [229]: df2.dtypes
Out[229]: 
Unnamed: 0     int64
cats          object
vals           int64
dtype: object
In [230]: df2["cats"]
Out[230]: 
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: cats, dtype: object
# Redo the category
In [231]: df2["cats"] = df2["cats"].astype("category")
In [232]: df2["cats"] = df2["cats"].cat.set_categories(
 .....:    ["very bad", "bad", "medium", "good", "very good"]
 .....: )
 .....: 
In [233]: df2.dtypes
Out[233]: 
Unnamed: 0       int64
cats          category
vals             int64
dtype: object
In [234]: df2["cats"]
Out[234]: 
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: cats, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
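As an alternative to converting after the fact, the dtype argument of read_csv accepts a CategoricalDtype, which restores categories and ordering while reading. The sketch below uses a made-up quality scale and in-memory buffers; it assumes the categories are known in advance.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical ordered scale used for both writing and reading.
quality = CategoricalDtype(["bad", "good", "very good"], ordered=True)

df = pd.DataFrame(
    {"cats": pd.Series(["good", "bad", "good"]).astype(quality), "vals": [1, 2, 3]}
)

buf = io.StringIO()
df.to_csv(buf, index=False)  # the CSV text itself carries no category info

# Re-attach the categorical dtype while parsing.
restored = pd.read_csv(io.StringIO(buf.getvalue()), dtype={"cats": quality})
print(restored["cats"].dtype)
```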

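The same holds for writing to a SQL database.

Missing data

pandas primarily uses the value np.nan to represent missing data. By default it is not included in computations. See the Missing Data section.

Missing values should not be included in the Categorical's categories, only in the values. Instead, it is understood that NaN is different and is always a possibility. When working with the Categorical's codes, missing values always have a code of -1.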

In [235]: s = pd.Series(["a", "b", np.nan, "a"], dtype="category")
# only two categories
In [236]: s
Out[236]: 
0      a
1      b
2    NaN
3      a
dtype: category
Categories (2, object): ['a', 'b']
In [237]: s.cat.codes
Out[237]: 
0    0
1    1
2   -1
3    0
dtype: int8

Methods for working with missing data, e.g. isna(), fillna(), dropna(), all work normally:

In [238]: s = pd.Series(["a", "b", np.nan], dtype="category")
In [239]: s
Out[239]: 
0      a
1      b
2    NaN
dtype: category
Categories (2, object): ['a', 'b']
In [240]: pd.isna(s)
Out[240]: 
0    False
1    False
2     True
dtype: bool
In [241]: s.fillna("a")
Out[241]: 
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
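One caveat follows from the "Setting" rules earlier in this guide: the fill value must already be a category. A minimal sketch, assuming a placeholder label "missing" that is not in the original categories (the exact exception type has varied between pandas versions, so both are caught here):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan], dtype="category")

# Filling with a label that is not a category is expected to be rejected.
try:
    s.fillna("missing")
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Add the label as a category first, then fill.
filled = s.cat.add_categories(["missing"]).fillna("missing")
print(rejected, list(filled))
```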

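Differences to R's `factor`

The following differences to R's factor function can be observed:

R's levels are named categories.
R's levels are always of type string, while categories in pandas can be of any dtype.
It is not possible to specify labels at creation time. Use s.cat.rename_categories(new_labels) afterwards.
In contrast to R's factor function, using categorical data as the sole input to create a new categorical Series does not remove unused categories, but creates a new categorical Series equal to the one passed in!
R allows missing values to be included in its levels (pandas' categories). pandas does not allow NaN categories, but missing values can still be in the values.

Gotchas

Memory usage

The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, an object dtype is a constant times the length of the data.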

In [242]: s = pd.Series(["foo", "bar"] * 1000)
# object dtype
In [243]: s.nbytes
Out[243]: 16000
# category dtype
In [244]: s.astype("category").nbytes
Out[244]: 2016

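Note

If the number of categories approaches the length of the data, the Categorical will use nearly the same or more memory than an equivalent object dtype representation.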

In [245]: s = pd.Series(["foo%04d" % i for i in range(2000)])
# object dtype
In [246]: s.nbytes
Out[246]: 16000
# category dtype
In [247]: s.astype("category").nbytes
Out[247]: 20000
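Part of the difference between the two results comes from the integer codes: with few categories they fit in a small integer type, and with many categories a wider one is needed in addition to storing every category. The sketch below inspects the codes dtype for both cases; the variable names are illustrative.

```python
import pandas as pd

few = pd.Series(["foo", "bar"] * 1000).astype("category")
many = pd.Series(["foo%04d" % i for i in range(2000)]).astype("category")

# Two categories fit into 8-bit codes.
few_codes = str(few.cat.codes.dtype)

# 2000 distinct categories need wider codes, plus storage for each category.
many_codes = str(many.cat.codes.dtype)

print(few_codes, many_codes)
```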

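`Categorical` is not a `numpy` array

Currently, categorical data and the underlying Categorical are implemented as a Python object and not as a low-level NumPy array dtype. This leads to some problems.

NumPy itself doesn't know about the new dtype: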

In [248]: try:
 .....:    np.dtype("category")
 .....: except TypeError as e:
 .....:    print("TypeError:", str(e))
 .....: 
TypeError: data type 'category' not understood
In [249]: dtype = pd.Categorical(["a"]).dtype
In [250]: try:
 .....:    np.dtype(dtype)
 .....: except TypeError as e:
 .....:    print("TypeError:", str(e))
 .....: 
TypeError: Cannot interpret 'CategoricalDtype(categories=['a'], ordered=False, categories_dtype=object)' as a data type

Dtype comparisons work:

In [251]: dtype == np.str_
Out[251]: False
In [252]: np.str_ == dtype
Out[252]: False

To check whether a Series contains categorical data, use hasattr(s, 'cat'):

In [253]: hasattr(pd.Series(["a"], dtype="category"), "cat")
Out[253]: True
In [254]: hasattr(pd.Series(["a"]), "cat")
Out[254]: False
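An alternative check, shown here only as a sketch, is to test the dtype itself with isinstance against pd.CategoricalDtype. The helper name is_categorical_series is hypothetical and not part of pandas.

```python
import pandas as pd


def is_categorical_series(s: pd.Series) -> bool:
    """Hypothetical helper: True if the Series has a categorical dtype."""
    return isinstance(s.dtype, pd.CategoricalDtype)


cat_result = is_categorical_series(pd.Series(["a"], dtype="category"))
obj_result = is_categorical_series(pd.Series(["a"]))
print(cat_result, obj_result)
```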

Using NumPy functions on a Series of type category should not work, as Categoricals are not numeric data (even when .categories is numeric).

In [255]: s = pd.Series(pd.Categorical([1, 2, 3, 4]))
In [256]: try:
 .....:    np.sum(s)
 .....: except TypeError as e:
 .....:    print("TypeError:", str(e))
 .....: 
TypeError: 'Categorical' with dtype category does not support reduction 'sum'

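Note

If such a function works, please file a bug at pandas-dev/pandas!

dtype in apply

pandas currently does not preserve the dtype in apply functions: if you apply along rows you get a Series of object dtype (the same as getting a row, so getting one element returns a basic type), and applying along columns also converts to object. NaN values are unaffected. You can use fillna to handle missing values before applying a function.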

In [257]: df = pd.DataFrame(
 .....:    {
 .....:        "a": [1, 2, 3, 4],
 .....:        "b": ["a", "b", "c", "d"],
 .....:        "cats": pd.Categorical([1, 2, 3, 2]),
 .....:    }
 .....: )
 .....: 
In [258]: df.apply(lambda row: type(row["cats"]), axis=1)
Out[258]: 
0    <class 'int'>
1    <class 'int'>
2    <class 'int'>
3    <class 'int'>
dtype: object
In [259]: df.apply(lambda col: col.dtype, axis=0)
Out[259]: 
a          int64
b         object
cats    category
dtype: object

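Categorical index

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. It is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements. See the advanced indexing docs for a more detailed explanation.

Setting the index creates a CategoricalIndex: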

In [260]: cats = pd.Categorical([1, 2, 3, 4], categories=[4, 2, 3, 1])
In [261]: strings = ["a", "b", "c", "d"]
In [262]: values = [4, 2, 3, 1]
In [263]: df = pd.DataFrame({"strings": strings, "values": values}, index=cats)
In [264]: df.index
Out[264]: CategoricalIndex([1, 2, 3, 4], categories=[4, 2, 3, 1], ordered=False, dtype='category')
# This now sorts by the categories order
In [265]: df.sort_index()
Out[265]: 
  strings  values
4       d       1
2       b       2
3       c       3
1       a       4

Side effects

Constructing a Series from a Categorical does not copy the input Categorical. This means that changes to the Series will in most cases change the original Categorical:

In [266]: cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
In [267]: s = pd.Series(cat, name="cat")
In [268]: cat
Out[268]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [269]: s.iloc[0:2] = 10
In [270]: cat
Out[270]: 
[10, 10, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

Use copy=True to prevent this behaviour, or simply don't reuse Categoricals:

In [271]: cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
In [272]: s = pd.Series(cat, name="cat", copy=True)
In [273]: cat
Out[273]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [274]: s.iloc[0:2] = 10
In [275]: cat
Out[275]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

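Note

In some cases this also happens when you supply a NumPy array instead of a Categorical: using an int array (e.g. np.array([1,2,3,4])) exhibits the same behavior, while using a string array (e.g. np.array(["a","b","c","a"])) does not.

Object creation

Series creation

Categorical Series or columns in a DataFrame can be created in several ways:

By specifying dtype="category" when constructing a Series: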

In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

By converting an existing Series or column to a category dtype:

In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
In [4]: df["B"] = df["A"].astype("category")
In [5]: df
Out[5]: 
   A  B
0  a  a
1  b  b
2  c  c
3  a  a

By using special functions, such as cut(), which groups data into discrete bins. See the example on tiling in the docs.

In [6]: df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})
In [7]: labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
In [8]: df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
In [9]: df.head(10)
Out[9]: 
   value    group
0     65  60 - 69
1     49  40 - 49
2     56  50 - 59
3     43  40 - 49
4     43  40 - 49
5     91  90 - 99
6     32  30 - 39
7     87  80 - 89
8     36  30 - 39
9      8    0 - 9

By passing a pandas.Categorical object to a Series or assigning it to a DataFrame.

In [10]: raw_cat = pd.Categorical(
 ....:    ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
 ....: )
 ....: 
In [11]: s = pd.Series(raw_cat)
In [12]: s
Out[12]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']
In [13]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
In [14]: df["B"] = raw_cat
In [15]: df
Out[15]: 
   A    B
0  a  NaN
1  b    b
2  c    c
3  a  NaN

Categorical data has a specific category dtype:

In [16]: df.dtypes
Out[16]: 
A      object
B    category
dtype: object

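DataFrame creation

Similar to the previous section, where a single column was converted to categorical, all columns in a DataFrame can be batch converted to categorical either during or after construction.

This can be done during construction by specifying dtype="category" in the DataFrame constructor: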

In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")
In [18]: df.dtypes
Out[18]: 
A    category
B    category
dtype: object

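Note that the categories present in each column differ; the conversion is done column by column, so only labels present in a given column are categories: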

In [19]: df["A"]
Out[19]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']
In [20]: df["B"]
Out[20]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']

Analogously, all columns in an existing DataFrame can be batch converted using DataFrame.astype():

In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
In [22]: df_cat = df.astype("category")
In [23]: df_cat.dtypes
Out[23]: 
A    category
B    category
dtype: object

This conversion is likewise done column by column:

In [24]: df_cat["A"]
Out[24]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']
In [25]: df_cat["B"]
Out[25]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']

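Controlling behavior

In the examples above where we passed dtype='category', we used the default behavior:

Categories are inferred from the data.
Categories are unordered.

To control those behaviors, instead of passing 'category', use an instance of CategoricalDtype.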

In [26]: from pandas.api.types import CategoricalDtype
In [27]: s = pd.Series(["a", "b", "c", "a"])
In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
In [29]: s_cat = s.astype(cat_type)
In [30]: s_cat
Out[30]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']

Similarly, a CategoricalDtype can be used with a DataFrame to ensure that categories are consistent among all columns.

In [31]: from pandas.api.types import CategoricalDtype
In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
In [34]: df_cat = df.astype(cat_type)
In [35]: df_cat["A"]
Out[35]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
In [36]: df_cat["B"]
Out[36]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

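Note

To perform table-wise conversion, where all labels in the entire DataFrame are used as categories for each column, the categories parameter can be determined programmatically by categories = pd.unique(df.to_numpy().ravel()).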
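The one-liner from this note can be put to work as follows. This is a sketch on the same small DataFrame used above; the unordered CategoricalDtype built from the collected labels is an illustrative choice.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Collect every label that occurs anywhere in the table, in order of appearance.
categories = pd.unique(df.to_numpy().ravel())

# Every column now shares the same set of categories.
df_cat = df.astype(CategoricalDtype(categories=categories))
print(list(df_cat["A"].cat.categories))
print(list(df_cat["B"].cat.categories))
```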

If you already have codes and categories, you can use the from_codes() constructor to save the factorize step that the normal constructor performs:

In [37]: splitter = np.random.choice([0, 1], 5, p=[0.5, 0.5])
In [38]: s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
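Because the example above draws random codes, its result changes from run to run. A deterministic sketch with hand-written codes shows the mapping more directly; a code of -1 marks a missing value, as described in the "Missing data" section.

```python
import pandas as pd

# Each code is a position in `categories`; -1 stands for a missing value.
codes = [0, 1, 1, -1]
cat = pd.Categorical.from_codes(codes, categories=["train", "test"])

labels = list(cat[:3])
n_missing = int(cat.isna().sum())
print(labels, n_missing)
```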

Regaining original data

To get back to the original Series or NumPy array, use Series.astype(original_dtype) or np.asarray(categorical):

In [39]: s = pd.Series(["a", "b", "c", "a"])
In [40]: s
Out[40]: 
0    a
1    b
2    c
3    a
dtype: object
In [41]: s2 = s.astype("category")
In [42]: s2
Out[42]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
In [43]: s2.astype(str)
Out[43]: 
0    a
1    b
2    c
3    a
dtype: object
In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)

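Note

In contrast to R's factor function, categorical data does not convert input values to strings; categories end up with the same data type as the original values.

Note

In contrast to R's factor function, there is currently no way to assign or change labels at creation time. Use categories to change the categories after creation.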

Pandas 2.2 Chinese Official Tutorials and Guides (17) (3): https://developer.aliyun.com/article/1509825
