Pandas 2.2 Official Tutorials and Guides, Chinese Edition (17) (4) - Alibaba Cloud Developer Community

Previous part: Pandas 2.2 Official Tutorials and Guides, Chinese Edition (17) (3): https://developer.aliyun.com/article/1509825

Getting

If the slicing operation returns either a DataFrame or a column of type Series, the category dtype is preserved.

In [144]: idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
In [145]: cats = pd.Series(["a", "b", "b", "b", "c", "c", "c"], dtype="category", index=idx)
In [146]: values = [1, 2, 2, 2, 3, 4, 5]
In [147]: df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
In [148]: df.iloc[2:4, :]
Out[148]: 
  cats  values
j    b       2
k    b       2
In [149]: df.iloc[2:4, :].dtypes
Out[149]: 
cats      category
values       int64
dtype: object
In [150]: df.loc["h":"j", "cats"]
Out[150]: 
h    a
i    b
j    b
Name: cats, dtype: category
Categories (3, object): ['a', 'b', 'c']
In [151]: df[df["cats"] == "b"]
Out[151]: 
  cats  values
i    b       2
j    b       2
k    b       2

One case where the category dtype is not preserved is taking a single row: the resulting Series has dtype object:

# get the complete "h" row as a Series
In [152]: df.loc["h", :]
Out[152]: 
cats      a
values    1
Name: h, dtype: object

Returning a single item from categorical data also returns the value itself, not a categorical of length 1.

In [153]: df.iat[0, 0]
Out[153]: 'a'
In [154]: df["cats"] = df["cats"].cat.rename_categories(["x", "y", "z"])
In [155]: df.at["h", "cats"]  # returns a string
Out[155]: 'x'

Note

This is in contrast to R's factor function, where factor(c(1,2,3))[1] returns a single-value factor.

To get a single-value Series of type category, pass in a list with a single value:

In [156]: df.loc[["h"], "cats"]
Out[156]: 
h    x
Name: cats, dtype: category
Categories (3, object): ['x', 'y', 'z']
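The difference between scalar and list-based selection can be checked directly. The following is a minimal sketch; the frame `demo` and its contents are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Hypothetical frame for illustration; only the "cats" column is categorical.
demo = pd.DataFrame(
    {"cats": pd.Categorical(["x", "y", "y"]), "values": [1, 2, 2]},
    index=["h", "i", "j"],
)

# Scalar access returns the plain value stored in the cell.
scalar = demo.at["h", "cats"]

# Passing a one-element list keeps the result a Series, so the
# category dtype (and the full set of categories) survives.
one_row = demo.loc[["h"], "cats"]

print(type(scalar).__name__)
print(one_row.dtype)
```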

String and datetime accessors

The accessors .dt and .str work if s.cat.categories are of an appropriate type:

In [157]: str_s = pd.Series(list("aabb"))
In [158]: str_cat = str_s.astype("category")
In [159]: str_cat
Out[159]: 
0    a
1    a
2    b
3    b
dtype: category
Categories (2, object): ['a', 'b']
In [160]: str_cat.str.contains("a")
Out[160]: 
0     True
1     True
2    False
3    False
dtype: bool
In [161]: date_s = pd.Series(pd.date_range("1/1/2015", periods=5))
In [162]: date_cat = date_s.astype("category")
In [163]: date_cat
Out[163]: 
0   2015-01-01
1   2015-01-02
2   2015-01-03
3   2015-01-04
4   2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]
In [164]: date_cat.dt.day
Out[164]: 
0    1
1    2
2    3
3    4
4    5
dtype: int32

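Note

The returned Series (or DataFrame) is of the same type as if you had used .str.<method> / .dt.<method> on a Series of that underlying type (and not of type category!).

That means the values returned from methods and properties on the accessors of a Series, and the values returned from the same methods and properties after converting that Series to category, will be equal: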

In [165]: ret_s = str_s.str.contains("a")
In [166]: ret_cat = str_cat.str.contains("a")
In [167]: ret_s.dtype == ret_cat.dtype
Out[167]: True
In [168]: ret_s == ret_cat
Out[168]: 
0    True
1    True
2    True
3    True
dtype: bool

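Note

The work is done on the categories and then a new Series is constructed. This has performance implications if you have a Series of type string in which many elements are repeated (i.e. the number of unique elements in the Series is much smaller than the length of the Series). In this case it can be faster to convert the original Series to category and use .str.<method> or .dt.<property> on that.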
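As a rough illustration of the note above, the sketch below applies the same string method to a plain Series and to its categorical counterpart and checks that the values agree. The sample data is invented, and whether the categorical route is actually faster depends on the data, so it is worth timing on your own workload.

```python
import pandas as pd

# Many repeated values, only two unique strings (illustrative data).
s = pd.Series(["apple", "banana"] * 1000)
cat = s.astype("category")

upper_plain = s.str.upper()
# For the categorical Series the method runs on the two categories
# and the result is then expanded back to the full length.
upper_cat = cat.str.upper()

same_values = upper_plain.tolist() == upper_cat.tolist()
print(same_values)
print(upper_cat.dtype)  # not category
```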

Setting

Setting values in a categorical column (or Series) works as long as the value is included in the categories:

In [169]: idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
In [170]: cats = pd.Categorical(["a", "a", "a", "a", "a", "a", "a"], categories=["a", "b"])
In [171]: values = [1, 1, 1, 1, 1, 1, 1]
In [172]: df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
In [173]: df.iloc[2:4, :] = [["b", 2], ["b", 2]]
In [174]: df
Out[174]: 
  cats  values
h    a       1
i    a       1
j    b       2
k    b       2
l    a       1
m    a       1
n    a       1
In [175]: try:
 .....:    df.iloc[2:4, :] = [["c", 3], ["c", 3]]
 .....: except TypeError as e:
 .....:    print("TypeError:", str(e))
 .....: 
TypeError: Cannot setitem on a Categorical with a new category, set the categories first

Setting values by assigning categorical data also checks that the categories match:

In [176]: df.loc["j":"k", "cats"] = pd.Categorical(["a", "a"], categories=["a", "b"])
In [177]: df
Out[177]: 
  cats  values
h    a       1
i    a       1
j    a       2
k    a       2
l    a       1
m    a       1
n    a       1
In [178]: try:
 .....:    df.loc["j":"k", "cats"] = pd.Categorical(["b", "b"], categories=["a", "b", "c"])
 .....: except TypeError as e:
 .....:    print("TypeError:", str(e))
 .....: 
TypeError: Cannot set a Categorical with another, without identical categories
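When a new value really is needed, one option is to register it as a category first. This is a minimal sketch using `Series.cat.add_categories`; the Series below is illustrative rather than the frame from the example above.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "a", "a"], categories=["a", "b"]))

# "c" is not a known category yet, so register it first ...
s = s.cat.add_categories(["c"])

# ... after which the assignment is accepted.
s.iloc[0] = "c"

print(s.tolist())
print(list(s.cat.categories))
```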

Assigning a Categorical to parts of a column of another type uses the values:

In [179]: df = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": ["a", "a", "a", "a", "a"]})
In [180]: df.loc[1:2, "a"] = pd.Categorical(["b", "b"], categories=["a", "b"])
In [181]: df.loc[2:3, "b"] = pd.Categorical(["b", "b"], categories=["a", "b"])
In [182]: df
Out[182]: 
   a  b
0  1  a
1  b  a
2  b  b
3  1  b
4  1  a
In [183]: df.dtypes
Out[183]: 
a    object
b    object
dtype: object

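Merging / concatenation

By default, combining Series or DataFrames that contain the same categories results in category dtype; otherwise the result depends on the dtype of the underlying categories. Merges that result in non-categorical dtypes will likely use more memory. Use .astype or union_categoricals to ensure a category result.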

In [184]: from pandas.api.types import union_categoricals
# same categories
In [185]: s1 = pd.Series(["a", "b"], dtype="category")
In [186]: s2 = pd.Series(["a", "b", "a"], dtype="category")
In [187]: pd.concat([s1, s2])
Out[187]: 
0    a
1    b
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
# different categories
In [188]: s3 = pd.Series(["b", "c"], dtype="category")
In [189]: pd.concat([s1, s3])
Out[189]: 
0    a
1    b
0    b
1    c
dtype: object
# Output dtype is inferred based on categories values
In [190]: int_cats = pd.Series([1, 2], dtype="category")
In [191]: float_cats = pd.Series([3.0, 4.0], dtype="category")
In [192]: pd.concat([int_cats, float_cats])
Out[192]: 
0    1.0
1    2.0
0    3.0
1    4.0
dtype: float64
In [193]: pd.concat([s1, s3]).astype("category")
Out[193]: 
0    a
1    b
0    b
1    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
In [194]: union_categoricals([s1.array, s3.array])
Out[194]: 
['a', 'b', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']

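The following table summarizes the results of merging Categoricals:

arg1	arg2	identical	result
category	category	True	category
category (object)	category (object)	False	object (dtype is inferred)
category (int)	category (float)	False	float (dtype is inferred)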
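Another way to land in the first row of the table is to give both inputs the same categories before concatenating. The sketch below does this with a shared `CategoricalDtype`; the name `shared` and the choice of `ignore_index=True` are illustrative assumptions, not part of the original text.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s1 = pd.Series(["a", "b"], dtype="category")
s3 = pd.Series(["b", "c"], dtype="category")

# One dtype that covers the categories of both inputs.
shared = CategoricalDtype(categories=["a", "b", "c"])

# With identical categories on both sides, concat keeps category dtype.
combined = pd.concat([s1.astype(shared), s3.astype(shared)], ignore_index=True)

print(combined.dtype)
print(list(combined.cat.categories))
```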

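Unioning

If you want to combine categoricals that do not necessarily have the same categories, the union_categoricals() function combines a list-like of categoricals. The new categories are the union of the categories being combined.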

In [195]: from pandas.api.types import union_categoricals
In [196]: a = pd.Categorical(["b", "c"])
In [197]: b = pd.Categorical(["a", "b"])
In [198]: union_categoricals([a, b])
Out[198]: 
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']

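By default, the resulting categories are ordered as they appear in the data. If you want the categories sorted lexicographically, use the sort_categories=True argument.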

In [199]: union_categoricals([a, b], sort_categories=True)
Out[199]: 
['b', 'c', 'a', 'b']
Categories (3, object): ['a', 'b', 'c']

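union_categoricals also works in the "easy" case of combining two categoricals with the same categories and order information (that is, data you could otherwise simply append).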

In [200]: a = pd.Categorical(["a", "b"], ordered=True)
In [201]: b = pd.Categorical(["a", "b", "a"], ordered=True)
In [202]: union_categoricals([a, b])
Out[202]: 
['a', 'b', 'a', 'b', 'a']
Categories (2, object): ['a' < 'b']

The following raises TypeError because the categories are ordered and not identical.

In [203]: a = pd.Categorical(["a", "b"], ordered=True)
In [204]: b = pd.Categorical(["a", "b", "c"], ordered=True)
In [205]: union_categoricals([a, b])
---------------------------------------------------------------------------
TypeError  Traceback (most recent call last)
Cell In[205], line 1
----> 1 union_categoricals([a, b])
File ~/work/pandas/pandas/pandas/core/dtypes/concat.py:341, in union_categoricals(to_union, sort_categories, ignore_order)
  339     if all(c.ordered for c in to_union):
  340         msg = "to union ordered Categoricals, all categories must be the same"
--> 341         raise TypeError(msg)
  342     raise TypeError("Categorical.ordered must be the same")
  344 if ignore_order:
TypeError: to union ordered Categoricals, all categories must be the same

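Ordered categoricals with different categories or orderings can be combined by using the ignore_order=True argument. As the output below shows, the result is an unordered Categorical.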

In [206]: a = pd.Categorical(["a", "b", "c"], ordered=True)
In [207]: b = pd.Categorical(["c", "b", "a"], ordered=True)
In [208]: union_categoricals([a, b], ignore_order=True)
Out[208]: 
['a', 'b', 'c', 'c', 'b', 'a']
Categories (3, object): ['a', 'b', 'c']

union_categoricals() also works with a CategoricalIndex or a Series containing categorical data, but note that the resulting array is always a plain Categorical:

In [209]: a = pd.Series(["b", "c"], dtype="category")
In [210]: b = pd.Series(["a", "b"], dtype="category")
In [211]: union_categoricals([a, b])
Out[211]: 
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']
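Because the result is a plain Categorical, the index of the inputs is not carried over. If a Series is needed again, the result can be wrapped explicitly, as in this minimal sketch (the name `as_series` is illustrative):

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Series(["b", "c"], dtype="category")
b = pd.Series(["a", "b"], dtype="category")

combined = union_categoricals([a, b])      # a plain Categorical
as_series = pd.Series(combined, name="combined")  # gets a fresh default index

print(type(combined).__name__)
print(as_series.tolist())
```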

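Note

When combining categoricals, union_categoricals may recode the integer codes of the categories. This is likely what you want, but be aware of it if you rely on the exact numbering of the categories.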

In [212]: c1 = pd.Categorical(["b", "c"])
In [213]: c2 = pd.Categorical(["a", "b"])
In [214]: c1
Out[214]: 
['b', 'c']
Categories (2, object): ['b', 'c']
# "b" is coded to 0
In [215]: c1.codes
Out[215]: array([0, 1], dtype=int8)
In [216]: c2
Out[216]: 
['a', 'b']
Categories (2, object): ['a', 'b']
# "b" is coded to 1
In [217]: c2.codes
Out[217]: array([0, 1], dtype=int8)
In [218]: c = union_categoricals([c1, c2])
In [219]: c
Out[219]: 
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']
# "b" is coded to 0 throughout, same as c1, different from c2
In [220]: c.codes
Out[220]: array([0, 1, 2, 0], dtype=int8)
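Since the codes can change, it is usually safer to compare labels rather than raw codes. The helper below is a hypothetical sketch (the name `labels` is not a pandas function) that translates codes back to labels through each Categorical's own categories.

```python
import pandas as pd
from pandas.api.types import union_categoricals

c1 = pd.Categorical(["b", "c"])
c2 = pd.Categorical(["a", "b"])
c = union_categoricals([c1, c2])

def labels(cat):
    """Map each integer code to its label; code -1 marks a missing value."""
    return [None if code == -1 else cat.categories[code] for code in cat.codes]

# "b" has code 1 in c2 but code 0 in the union, yet the labels agree.
print(c2.codes.tolist(), labels(c2))
print(c.codes.tolist(), labels(c))
```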

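Getting data in/out

You can write data that contains category dtypes to an HDFStore. See the pandas IO documentation for an example and caveats.

It is also possible to write data to and read data from Stata format files. See the pandas IO documentation for an example and caveats.

Writing to a CSV file converts the data, effectively removing any information about the categorical (categories and ordering). So if you read the CSV file back, you have to convert the relevant columns back to category and assign the right categories and category ordering.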

In [221]: import io
In [222]: s = pd.Series(pd.Categorical(["a", "b", "b", "a", "a", "d"]))
# rename the categories
In [223]: s = s.cat.rename_categories(["very good", "good", "bad"])
# reorder the categories and add missing categories
In [224]: s = s.cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
In [225]: df = pd.DataFrame({"cats": s, "vals": [1, 2, 3, 4, 5, 6]})
In [226]: csv = io.StringIO()
In [227]: df.to_csv(csv)
In [228]: df2 = pd.read_csv(io.StringIO(csv.getvalue()))
In [229]: df2.dtypes
Out[229]: 
Unnamed: 0     int64
cats          object
vals           int64
dtype: object
In [230]: df2["cats"]
Out[230]: 
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: cats, dtype: object
# Redo the category
In [231]: df2["cats"] = df2["cats"].astype("category")
In [232]: df2["cats"] = df2["cats"].cat.set_categories(
 .....:    ["very bad", "bad", "medium", "good", "very good"]
 .....: )
 .....: 
In [233]: df2.dtypes
Out[233]: 
Unnamed: 0       int64
cats          category
vals             int64
dtype: object
In [234]: df2["cats"]
Out[234]: 
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: cats, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
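As an alternative to converting after the fact, the categories can be supplied while reading. The sketch below passes a `CategoricalDtype` through the `dtype` argument of `read_csv`; the small frame and the name `levels` are illustrative and differ from the example above.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

levels = ["very bad", "bad", "medium", "good", "very good"]
df = pd.DataFrame(
    {
        "cats": pd.Categorical(["good", "bad", "good"], categories=levels),
        "vals": [1, 2, 3],
    }
)

buf = io.StringIO()
df.to_csv(buf, index=False)

# Declare the categories (and their order in the list) at read time.
restored = pd.read_csv(
    io.StringIO(buf.getvalue()),
    dtype={"cats": CategoricalDtype(categories=levels)},
)

print(restored.dtypes)
```

The CSV file itself still carries no category information; the list of categories has to be known to the reader, for example from a shared constant.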

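The same holds for writing to a SQL database with to_sql.

Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section of the pandas documentation.

Missing values should not be included in the Categorical's categories, only in the values. Instead, it is understood that NaN is different and is always a possibility. When working with the Categorical's codes, missing values always have the code -1.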

In [235]: s = pd.Series(["a", "b", np.nan, "a"], dtype="category")
# only two categories
In [236]: s
Out[236]: 
0      a
1      b
2    NaN
3      a
dtype: category
Categories (2, object): ['a', 'b']
In [237]: s.cat.codes
Out[237]: 
0    0
1    1
2   -1
3    0
dtype: int8

Methods for working with missing data, e.g. isna(), fillna(), dropna(), all work normally:

In [238]: s = pd.Series(["a", "b", np.nan], dtype="category")
In [239]: s
Out[239]: 
0      a
1      b
2    NaN
dtype: category
Categories (2, object): ['a', 'b']
In [240]: pd.isna(s)
Out[240]: 
0    False
1    False
2     True
dtype: bool
In [241]: s.fillna("a")
Out[241]: 
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
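The fill value above works because "a" is already a category. To fill with a label that is not yet a category, one approach is to add it first, as in this minimal sketch (the label "unknown" is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan], dtype="category")

# Register the placeholder label, then use it as the fill value.
filled = s.cat.add_categories(["unknown"]).fillna("unknown")

print(s.cat.codes.tolist())      # the missing value has code -1
print(filled.tolist())
```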

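Differences to R's `factor`

The following differences to R's factor functions can be observed:

R's levels are named categories.
R's levels are always of type string, while categories in pandas can be of any dtype.
It is not possible to specify labels at creation time. Use s.cat.rename_categories(new_labels) afterwards.
In contrast to R's factor function, using categorical data as the sole input to create a new categorical series will not remove unused categories but will create a new categorical series equal to the one passed in!
R allows missing values to be included in its levels (pandas' categories). pandas does not allow NaN categories, but missing values can still be in the values.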
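The point about unused categories can be seen in a short sketch. The data is illustrative; `remove_unused_categories` is the explicit way to drop categories that no longer occur in the values.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")
subset = s.iloc[:2]                  # "c" no longer occurs in the values

# Rebuilding from categorical input carries the unused category along.
rebuilt = pd.Series(subset)

# Dropping unused categories has to be requested explicitly.
trimmed = subset.cat.remove_unused_categories()

print(list(rebuilt.cat.categories))
print(list(trimmed.cat.categories))
```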

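Gotchas

Memory usage

The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, the memory usage of an object dtype is a constant times the length of the data.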

In [242]: s = pd.Series(["foo", "bar"] * 1000)
# object dtype
In [243]: s.nbytes
Out[243]: 16000
# category dtype
In [244]: s.astype("category").nbytes
Out[244]: 2016

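Note

If the number of categories approaches the length of the data, the Categorical will use nearly the same or more memory than an equivalent object dtype representation.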

In [245]: s = pd.Series(["foo%04d" % i for i in range(2000)])
# object dtype
In [246]: s.nbytes
Out[246]: 16000
# category dtype
In [247]: s.astype("category").nbytes
Out[247]: 20000
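For object data, nbytes counts only the 8-byte references, not the strings they point to. A rough sketch of a fuller comparison uses `memory_usage(deep=True)`; the explicit `dtype=object` is an assumption made here so that the comparison is against object dtype regardless of the default string dtype in use.

```python
import pandas as pd

# Low-cardinality data stored as Python objects.
low_card = pd.Series(["foo", "bar"] * 1000, dtype=object)
as_cat = low_card.astype("category")

# deep=True also counts the memory of the string objects themselves.
obj_bytes = low_card.memory_usage(index=False, deep=True)
cat_bytes = as_cat.memory_usage(index=False, deep=True)

print(obj_bytes, cat_bytes)
```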

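`Categorical` is not a `numpy` array

Currently, categorical data and the underlying Categorical are implemented as a Python object and not as a low-level NumPy array dtype. This leads to some problems.

NumPy itself does not know about the new dtype: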

In [248]: try:
 .....:    np.dtype("category")
 .....: except TypeError as e:
 .....:    print("TypeError:", str(e))
 .....: 
TypeError: data type 'category' not understood
In [249]: dtype = pd.Categorical(["a"]).dtype
In [250]: try:
 .....:    np.dtype(dtype)
 .....: except TypeError as e:
 .....:    print("TypeError:", str(e))
 .....: 
TypeError: Cannot interpret 'CategoricalDtype(categories=['a'], ordered=False, categories_dtype=object)' as a data type

Dtype comparisons work:

In [251]: dtype == np.str_
Out[251]: False
In [252]: np.str_ == dtype
Out[252]: False

To check whether a Series contains categorical data, use hasattr(s, 'cat'):

In [253]: hasattr(pd.Series(["a"], dtype="category"), "cat")
Out[253]: True
In [254]: hasattr(pd.Series(["a"]), "cat")
Out[254]: False
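An alternative to the attribute check is to inspect the dtype itself. This is a small sketch; `is_categorical` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def is_categorical(s):
    """Return True if the Series has category dtype."""
    return isinstance(s.dtype, pd.CategoricalDtype)

print(is_categorical(pd.Series(["a"], dtype="category")))
print(is_categorical(pd.Series(["a"])))
```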

Using NumPy functions on a Series of type category should not work, because Categoricals are not numeric data (even when .categories is numeric).

In [255]: s = pd.Series(pd.Categorical([1, 2, 3, 4]))
In [256]: try:
 .....:    np.sum(s)
 .....: except TypeError as e:
 .....:    print("TypeError:", str(e))
 .....: 
TypeError: 'Categorical' with dtype category does not support reduction 'sum'
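If the categories are numeric and a numeric reduction is what you want, one option is to convert back to a numeric dtype first. A minimal sketch, assuming every value can be represented as int64:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical([1, 2, 3, 4]))

# Convert the categorical values back to integers before reducing.
total = np.sum(s.astype("int64"))

print(total)
```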

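Note

If such a function works, please file a bug at pandas-dev/pandas!

dtype in apply

pandas currently does not preserve the dtype in apply functions: if you apply along rows you get a Series of object dtype (the same as getting a row, so getting one element returns a basic type). Applying along columns is described as converting to object as well, although the column-wise call below still reports category for the cats column. NaN values are unaffected. You can use fillna to handle missing values before applying a function.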

In [257]: df = pd.DataFrame(
 .....:    {
 .....:        "a": [1, 2, 3, 4],
 .....:        "b": ["a", "b", "c", "d"],
 .....:        "cats": pd.Categorical([1, 2, 3, 2]),
 .....:    }
 .....: )
 .....: 
In [258]: df.apply(lambda row: type(row["cats"]), axis=1)
Out[258]: 
0    <class 'int'>
1    <class 'int'>
2    <class 'int'>
3    <class 'int'>
dtype: object
In [259]: df.apply(lambda col: col.dtype, axis=0)
Out[259]: 
a          int64
b         object
cats    category
dtype: object

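Categorical index

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. It is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements. See the advanced indexing docs for a more detailed explanation.

Setting the index creates a CategoricalIndex: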

In [260]: cats = pd.Categorical([1, 2, 3, 4], categories=[4, 2, 3, 1])
In [261]: strings = ["a", "b", "c", "d"]
In [262]: values = [4, 2, 3, 1]
In [263]: df = pd.DataFrame({"strings": strings, "values": values}, index=cats)
In [264]: df.index
Out[264]: CategoricalIndex([1, 2, 3, 4], categories=[4, 2, 3, 1], ordered=False, dtype='category')
# This now sorts by the categories order
In [265]: df.sort_index()
Out[265]: 
  strings  values
4       d       1
2       b       2
3       c       3
1       a       4
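Because sorting follows the category order, changing that order changes the sort result. The sketch below uses `reorder_categories` on the index; the single-column frame is a simplified, illustrative version of the one above.

```python
import pandas as pd

cats = pd.Categorical([1, 2, 3, 4], categories=[4, 2, 3, 1])
df = pd.DataFrame({"values": [4, 2, 3, 1]}, index=cats)

# Sorting follows the declared category order: 4, 2, 3, 1.
by_category = df.sort_index()

# Redeclare the category order, then sort again.
df.index = df.index.reorder_categories([1, 2, 3, 4])
by_value = df.sort_index()

print(by_category.index.tolist())
print(by_value.index.tolist())
```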

Side effects

Constructing a Series from a Categorical does not copy the input Categorical. This means that changes to the Series will in most cases change the original Categorical:

In [266]: cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
In [267]: s = pd.Series(cat, name="cat")
In [268]: cat
Out[268]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [269]: s.iloc[0:2] = 10
In [270]: cat
Out[270]: 
[10, 10, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

Use copy=True to prevent this behaviour, or simply do not reuse Categoricals:

In [271]: cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
In [272]: s = pd.Series(cat, name="cat", copy=True)
In [273]: cat
Out[273]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [274]: s.iloc[0:2] = 10
In [275]: cat
Out[275]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
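An equivalent defensive pattern is to copy the Categorical yourself before handing it to the constructor. A minimal sketch, using the same data as above:

```python
import pandas as pd

cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])

# The Series is built from an explicit copy, so it owns its own data.
s = pd.Series(cat.copy(), name="cat")
s.iloc[0:2] = 10

print(cat.tolist())   # original is untouched
print(s.tolist())
```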

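Note

This also happens in some cases when you supply a NumPy array instead of a Categorical: using an int array (e.g. np.array([1,2,3,4])) exhibits the same behavior, while using a string array (e.g. np.array(["a","b","c","a"])) does not.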