Pandas 2.2 Official Tutorials and Guides in Chinese (9.1) (4) - Alibaba Cloud Developer Community

Continued from Pandas 2.2 Official Tutorials and Guides in Chinese (9.1) (3): https://developer.aliyun.com/article/1509697

Defaults

By default, integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit). The following all result in the int64 dtype.

In [359]: pd.DataFrame([1, 2], columns=["a"]).dtypes
Out[359]: 
a    int64
dtype: object
In [360]: pd.DataFrame({"a": [1, 2]}).dtypes
Out[360]: 
a    int64
dtype: object
In [361]: pd.DataFrame({"a": 1}, index=list(range(2))).dtypes
Out[361]: 
a    int64
dtype: object

Note that NumPy chooses platform-dependent types when creating arrays. On a 32-bit platform, the following results in int32.

In [362]: frame = pd.DataFrame(np.array([1, 2]))

Upcasting

Types can be upcast when combined with other types, meaning they are promoted from the current type (for example, int to float).

In [363]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
In [364]: df3
Out[364]: 
 A         B      C
0  1.047606  0.256090   26.0
1  3.497968  1.426469   86.0
2 -0.150862 -0.416203   46.0
3  0.724370  1.139976  212.0
4 -1.203098 -1.193477   26.0
5  1.346426  0.096706    7.0
6 -0.052599 -1.956850  184.0
7 -0.756495 -0.714337  206.0
In [365]: df3.dtypes
Out[365]: 
A    float32
B    float64
C    float64
dtype: object

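DataFrame.to_numpy() returns the lowest common denominator of the dtypes, meaning the dtype that can accommodate all of the types in the resulting homogeneous NumPy array. This can force some upcasting.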

In [366]: df3.to_numpy().dtype
Out[366]: dtype('float64')
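
As a rough illustration of the common-dtype rule above, the sketch below builds a small frame (the name `mixed` and its columns are made up for this example) with one int64 column and one float32 column and inspects the dtype that to_numpy() picks.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one int64 column and one float32 column.
mixed = pd.DataFrame(
    {
        "i": np.array([1, 2, 3], dtype="int64"),
        "f": np.array([0.5, 1.5, 2.5], dtype="float32"),
    }
)

# Each column keeps its own dtype inside the DataFrame ...
print(mixed.dtypes)

# ... but a single 2-D array needs one dtype that can hold both columns.
arr = mixed.to_numpy()
print(arr.dtype)
```

float32 cannot represent every int64 value, so the combined array is expected to be float64 here.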

astype

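You can use the astype() method to explicitly convert dtypes from one to another. By default it returns a copy, even if the dtype was unchanged (pass copy=False to change this behavior). In addition, it raises an exception if the astype operation is invalid.

Upcasting always follows the NumPy rules. If two different dtypes are involved in an operation, the more general one is used as the result of the operation.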

In [367]: df3
Out[367]: 
 A         B      C
0  1.047606  0.256090   26.0
1  3.497968  1.426469   86.0
2 -0.150862 -0.416203   46.0
3  0.724370  1.139976  212.0
4 -1.203098 -1.193477   26.0
5  1.346426  0.096706    7.0
6 -0.052599 -1.956850  184.0
7 -0.756495 -0.714337  206.0
In [368]: df3.dtypes
Out[368]: 
A    float32
B    float64
C    float64
dtype: object
# conversion of dtypes
In [369]: df3.astype("float32").dtypes
Out[369]: 
A    float32
B    float32
C    float32
dtype: object
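
The statement that astype() raises on an invalid conversion can be checked with a small sketch. The Series `raw` below is a made-up example; the object dtype is requested explicitly so the behavior does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical input: the last element cannot be parsed as an integer.
raw = pd.Series(["1", "2", "x"], dtype=object)

raised = False
try:
    raw.astype("int64")
except ValueError as exc:
    # astype() does not silently skip bad values; it fails the whole conversion.
    raised = True
    print(f"conversion failed: {exc}")

# The valid prefix converts without problems.
ok = raw.iloc[:2].astype("int64")
print(ok.tolist())
```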

Convert a subset of columns to a specified type using astype().

In [370]: dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
In [371]: dft[["a", "b"]] = dft[["a", "b"]].astype(np.uint8)
In [372]: dft
Out[372]: 
 a  b  c
0  1  4  7
1  2  5  8
2  3  6  9
In [373]: dft.dtypes
Out[373]: 
a    uint8
b    uint8
c    int64
dtype: object

Convert certain columns to a specific dtype by passing a dict to astype().

In [374]: dft1 = pd.DataFrame({"a": [1, 0, 1], "b": [4, 5, 6], "c": [7, 8, 9]})
In [375]: dft1 = dft1.astype({"a": np.bool_, "c": np.float64})
In [376]: dft1
Out[376]: 
 a  b    c
0   True  4  7.0
1  False  5  8.0
2   True  6  9.0
In [377]: dft1.dtypes
Out[377]: 
a       bool
b      int64
c    float64
dtype: object

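Note

When trying to convert a subset of columns to a specified type using astype() together with .loc, upcasting occurs.

.loc tries to fit what is being assigned into the current dtypes, while [] overwrites them, taking the dtype from the right-hand side. Therefore the following snippet produces an unintended result.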

In [378]: dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
In [379]: dft.loc[:, ["a", "b"]].astype(np.uint8).dtypes
Out[379]: 
a    uint8
b    uint8
dtype: object
In [380]: dft.loc[:, ["a", "b"]] = dft.loc[:, ["a", "b"]].astype(np.uint8)
In [381]: dft.dtypes
Out[381]: 
a    int64
b    int64
c    int64
dtype: object
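
One way to sidestep the .loc behavior shown above is to avoid assignment altogether and let astype() build a new frame from a column-to-dtype mapping. This is only a sketch; `converted` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Passing a dict converts only the listed columns and returns a new frame,
# so no assignment through .loc is involved.
converted = dft.astype({"a": np.uint8, "b": np.uint8})

print(converted.dtypes)
```

The original `dft` is left untouched, and column "c" keeps whatever dtype it had.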

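Object conversion

pandas offers various functions to try to force conversion of types from the object dtype to other types. In cases where the data is already of the correct type but stored in an object array, the DataFrame.infer_objects() and Series.infer_objects() methods can be used to soft convert to the correct type.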

In [382]: import datetime
In [383]: df = pd.DataFrame(
 .....:    [
 .....:        [1, 2],
 .....:        ["a", "b"],
 .....:        [datetime.datetime(2016, 3, 2), datetime.datetime(2016, 3, 2)],
 .....:    ]
 .....: )
 .....: 
In [384]: df = df.T
In [385]: df
Out[385]: 
 0  1                    2
0  1  a  2016-03-02 00:00:00
1  2  b  2016-03-02 00:00:00
In [386]: df.dtypes
Out[386]: 
0    object
1    object
2    object
dtype: object

Because the data was transposed, the original inference stored all columns as object, which infer_objects will correct.

In [387]: df.infer_objects().dtypes
Out[387]: 
0             int64
1            object
2    datetime64[ns]
dtype: object

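The following functions are available for one-dimensional object arrays or scalars to perform hard conversion of objects to a specified type:

to_numeric() (conversion to numeric dtypes)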

In [388]: m = ["1.1", 2, 3]
In [389]: pd.to_numeric(m)
Out[389]: array([1.1, 2. , 3. ])

to_datetime() (conversion to datetime objects)

In [390]: import datetime
In [391]: m = ["2016-07-09", datetime.datetime(2016, 3, 2)]
In [392]: pd.to_datetime(m)
Out[392]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)

to_timedelta() (conversion to timedelta objects)

In [393]: m = ["5us", pd.Timedelta("1day")]
In [394]: pd.to_timedelta(m)
Out[394]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)

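To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with elements that cannot be converted to the desired dtype or object. By default, errors='raise', meaning that any error encountered during conversion is raised. However, if errors='coerce', these errors are ignored and pandas converts problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric). This can be useful if you are reading in data that is mostly of the desired dtype (e.g. numeric, datetime) but occasionally has non-conforming elements intermixed that you want to represent as missing: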

In [395]: import datetime
In [396]: m = ["apple", datetime.datetime(2016, 3, 2)]
In [397]: pd.to_datetime(m, errors="coerce")
Out[397]: DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)
In [398]: m = ["apple", 2, 3]
In [399]: pd.to_numeric(m, errors="coerce")
Out[399]: array([nan,  2.,  3.])
In [400]: m = ["apple", pd.Timedelta("1day")]
In [401]: pd.to_timedelta(m, errors="coerce")
Out[401]: TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
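
A typical use of errors="coerce" is to convert first and then count how many entries could not be parsed. The sketch below uses a made-up column `raw`; the object dtype is set explicitly so the result does not depend on string-dtype inference.

```python
import pandas as pd

# Hypothetical messy column: mostly numeric strings with one placeholder.
raw = pd.Series(["10", "20", "n/a", "40"], dtype=object)

# Unparseable entries become NaN instead of raising.
clean = pd.to_numeric(raw, errors="coerce")

# Counting the NaNs tells us how many entries failed to convert.
n_bad = int(clean.isna().sum())
print(clean.tolist(), n_bad)
```

This count is only meaningful if the input had no missing values to begin with; otherwise compare against `raw.isna().sum()`.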

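In addition to object conversion, to_numeric() provides another argument, downcast, which gives the option of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory: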

In [402]: m = ["1", 2, 3]
In [403]: pd.to_numeric(m, downcast="integer")  # smallest signed int dtype
Out[403]: array([1, 2, 3], dtype=int8)
In [404]: pd.to_numeric(m, downcast="signed")  # same as 'integer'
Out[404]: array([1, 2, 3], dtype=int8)
In [405]: pd.to_numeric(m, downcast="unsigned")  # smallest unsigned int dtype
Out[405]: array([1, 2, 3], dtype=uint8)
In [406]: pd.to_numeric(m, downcast="float")  # smallest float dtype
Out[406]: array([1., 2., 3.], dtype=float32)
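
To make the memory claim concrete, the following sketch compares the buffer size before and after downcasting. The Series `values` is an assumed example whose range (0 to 999) fits in int16 but not in int8.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 integers stored as int64 (8 bytes each).
values = pd.Series(np.arange(1000, dtype="int64"))

# downcast="integer" picks the smallest signed integer dtype that fits the data.
small = pd.to_numeric(values, downcast="integer")

bytes_before = values.to_numpy().nbytes
bytes_after = small.to_numpy().nbytes
print(values.dtype, bytes_before)
print(small.dtype, bytes_after)
```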

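Because these methods apply only to one-dimensional arrays, lists or scalars, they cannot be used directly on multi-dimensional objects such as DataFrames. However, with apply(), we can apply the function over each column efficiently: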

In [407]: import datetime
In [408]: df = pd.DataFrame([["2016-07-09", datetime.datetime(2016, 3, 2)]] * 2, dtype="O")
In [409]: df
Out[409]: 
 0                    1
0  2016-07-09  2016-03-02 00:00:00
1  2016-07-09  2016-03-02 00:00:00
In [410]: df.apply(pd.to_datetime)
Out[410]: 
 0          1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02
In [411]: df = pd.DataFrame([["1.1", 2, 3]] * 2, dtype="O")
In [412]: df
Out[412]: 
 0  1  2
0  1.1  2  3
1  1.1  2  3
In [413]: df.apply(pd.to_numeric)
Out[413]: 
 0  1  2
0  1.1  2  3
1  1.1  2  3
In [414]: df = pd.DataFrame([["5us", pd.Timedelta("1day")]] * 2, dtype="O")
In [415]: df
Out[415]: 
 0                1
0  5us  1 days 00:00:00
1  5us  1 days 00:00:00
In [416]: df.apply(pd.to_timedelta)
Out[416]: 
 0      1
0 0 days 00:00:00.000005 1 days
1 0 days 00:00:00.000005 1 days

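Gotchas

Performing selection operations on integer type data can easily upcast the data to floating. The dtype of the input data is preserved in cases where nans are not introduced. See also Support for integer NA.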

In [417]: dfi = df3.astype("int32")
In [418]: dfi["E"] = 1
In [419]: dfi
Out[419]: 
 A  B    C  E
0  1  0   26  1
1  3  1   86  1
2  0  0   46  1
3  0  1  212  1
4 -1 -1   26  1
5  1  0    7  1
6  0 -1  184  1
7  0  0  206  1
In [420]: dfi.dtypes
Out[420]: 
A    int32
B    int32
C    int32
E    int64
dtype: object
In [421]: casted = dfi[dfi > 0]
In [422]: casted
Out[422]: 
 A    B    C  E
0  1.0  NaN   26  1
1  3.0  1.0   86  1
2  NaN  NaN   46  1
3  NaN  1.0  212  1
4  NaN  NaN   26  1
5  1.0  NaN    7  1
6  NaN  NaN  184  1
7  NaN  NaN  206  1
In [423]: casted.dtypes
Out[423]: 
A    float64
B    float64
C      int32
E      int64
dtype: object

Float dtypes, by contrast, are unchanged.

In [424]: dfa = df3.copy()
In [425]: dfa["A"] = dfa["A"].astype("float32")
In [426]: dfa.dtypes
Out[426]: 
A    float32
B    float64
C    float64
dtype: object
In [427]: casted = dfa[df2 > 0]
In [428]: casted
Out[428]: 
 A         B      C
0  1.047606  0.256090   26.0
1  3.497968  1.426469   86.0
2       NaN       NaN   46.0
3       NaN  1.139976  212.0
4       NaN       NaN   26.0
5  1.346426  0.096706    7.0
6       NaN       NaN  184.0
7       NaN       NaN  206.0
In [429]: casted.dtypes
Out[429]: 
A    float32
B    float64
C    float64
dtype: object
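
The integer-to-float upcast described in this section can be avoided with the nullable integer dtype that the "Support for integer NA" reference points to. The sketch below contrasts the two; the frame `ints` is made up for illustration.

```python
import pandas as pd

# Hypothetical integer frame with one non-positive value.
ints = pd.DataFrame({"A": [1, -2, 3]})

# Default NumPy-backed integers: the masked-out entry becomes NaN,
# which forces the column to float64.
masked_default = ints[ints > 0]

# Nullable "Int64": the masked-out entry becomes <NA> and the dtype is kept.
nullable = ints.astype("Int64")
masked_nullable = nullable[nullable > 0]

print(masked_default.dtypes)
print(masked_nullable.dtypes)
```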

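Selecting columns based on `dtype`

The select_dtypes() method implements subsetting of columns based on their dtype.

First, let's create a DataFrame with a variety of different dtypes: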

In [430]: df = pd.DataFrame(
 .....:    {
 .....:        "string": list("abc"),
 .....:        "int64": list(range(1, 4)),
 .....:        "uint8": np.arange(3, 6).astype("u1"),
 .....:        "float64": np.arange(4.0, 7.0),
 .....:        "bool1": [True, False, True],
 .....:        "bool2": [False, True, False],
 .....:        "dates": pd.date_range("now", periods=3),
 .....:        "category": pd.Series(list("ABC")).astype("category"),
 .....:    }
 .....: )
 .....: 
In [431]: df["tdeltas"] = df.dates.diff()
In [432]: df["uint64"] = np.arange(3, 6).astype("u8")
In [433]: df["other_dates"] = pd.date_range("20130101", periods=3)
In [434]: df["tz_aware_dates"] = pd.date_range("20130101", periods=3, tz="US/Eastern")
In [435]: df
Out[435]: 
 string  int64  uint8  ...  uint64  other_dates            tz_aware_dates
0      a      1      3  ...       3   2013-01-01 2013-01-01 00:00:00-05:00
1      b      2      4  ...       4   2013-01-02 2013-01-02 00:00:00-05:00
2      c      3      5  ...       5   2013-01-03 2013-01-03 00:00:00-05:00
[3 rows x 12 columns]

And the dtypes:

In [436]: df.dtypes
Out[436]: 
string                                object
int64                                  int64
uint8                                  uint8
float64                              float64
bool1                                   bool
bool2                                   bool
dates                         datetime64[ns]
category                            category
tdeltas                      timedelta64[ns]
uint64                                uint64
other_dates                   datetime64[ns]
tz_aware_dates    datetime64[ns, US/Eastern]
dtype: object

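select_dtypes() has two parameters, include and exclude, that allow you to say "give me the columns with these dtypes" (include) and/or "give me the columns without these dtypes" (exclude).

For example, to select bool columns: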

In [437]: df.select_dtypes(include=[bool])
Out[437]: 
 bool1  bool2
0   True  False
1  False   True
2   True  False

You can also pass the name of a dtype in the NumPy dtype hierarchy:

In [438]: df.select_dtypes(include=["bool"])
Out[438]: 
 bool1  bool2
0   True  False
1  False   True
2   True  False

select_dtypes() also works with generic dtypes.

For example, to select all numeric and boolean columns while excluding unsigned integers:

In [439]: df.select_dtypes(include=["number", "bool"], exclude=["unsignedinteger"])
Out[439]: 
 int64  float64  bool1  bool2 tdeltas
0      1      4.0   True  False     NaT
1      2      5.0  False   True  1 days
2      3      6.0   True  False  1 days

To select string columns you must use the object dtype:

In [440]: df.select_dtypes(include=["object"])
Out[440]: 
 string
0      a
1      b
2      c
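
The include/exclude behavior is easier to see on a tiny frame. In the sketch below, `small` and its column names are hypothetical; the dtype names passed to select_dtypes() are the same generic NumPy names used above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column per dtype of interest.
small = pd.DataFrame(
    {
        "i": np.array([1, 2], dtype="int64"),
        "u": np.array([1, 2], dtype="uint8"),
        "f": [0.5, 1.5],
        "b": [True, False],
    }
)

# "number" covers signed and unsigned integers and floats, but not bool.
numeric_cols = small.select_dtypes(include=["number"]).columns.tolist()

# exclude is applied on top of include.
signed_cols = small.select_dtypes(
    include=["number"], exclude=["unsignedinteger"]
).columns.tolist()

print(numeric_cols)
print(signed_cols)
```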

To see all the child dtypes of a generic dtype like numpy.number, you can define a function that returns a tree of child dtypes:

In [441]: def subdtypes(dtype):
 .....:    subs = dtype.__subclasses__()
 .....:    if not subs:
 .....:        return dtype
 .....:    return [dtype, [subdtypes(dt) for dt in subs]]
 .....:

All NumPy dtypes are subclasses of numpy.generic:

In [442]: subdtypes(np.generic)
Out[442]: 
[numpy.generic,
 [[numpy.number,
 [[numpy.integer,
 [[numpy.signedinteger,
 [numpy.int8,
 numpy.int16,
 numpy.int32,
 numpy.int64,
 numpy.longlong,
 numpy.timedelta64]],
 [numpy.unsignedinteger,
 [numpy.uint8,
 numpy.uint16,
 numpy.uint32,
 numpy.uint64,
 numpy.ulonglong]]]],
 [numpy.inexact,
 [[numpy.floating,
 [numpy.float16, numpy.float32, numpy.float64, numpy.longdouble]],
 [numpy.complexfloating,
 [numpy.complex64, numpy.complex128, numpy.clongdouble]]]]]],
 [numpy.flexible,
 [[numpy.character, [numpy.bytes_, numpy.str_]],
 [numpy.void, [numpy.record]]]],
 numpy.bool_,
 numpy.datetime64,
 numpy.object_]]

Note

pandas also defines the types category and datetime64[ns, tz], which are not integrated into the normal NumPy hierarchy and will not show up with the above function.

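Head and tail

To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number.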

In [4]: long_series = pd.Series(np.random.randn(1000))
In [5]: long_series.head()
Out[5]: 
0   -1.157892
1   -1.344312
2    0.844885
3    1.075770
4   -0.109050
dtype: float64
In [6]: long_series.tail(3)
Out[6]: 
997   -0.289388
998   -1.020544
999    0.589993
dtype: float64

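Attributes and underlying data

pandas objects have a number of attributes enabling you to access the metadata

shape: gives the axis dimensions of the object, consistent with ndarray
Axis labels

Series: index (only axis)
DataFrame: index (rows) and columns

Note, these attributes can be safely assigned to!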

In [7]: df[:2]
Out[7]: 
 A         B         C
2000-01-01 -0.173215  0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929
In [8]: df.columns = [x.lower() for x in df.columns]
In [9]: df
Out[9]: 
 a         b         c
2000-01-01 -0.173215  0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929
2000-01-03  1.071804  0.721555 -0.706771
2000-01-04 -1.039575  0.271860 -0.424972
2000-01-05  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427
2000-01-07  0.524988  0.404705  0.577046
2000-01-08 -1.715002 -1.039268 -0.370647

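pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual data and do the actual computation. For many types, the underlying array is a numpy.ndarray. However, pandas and third-party libraries may extend NumPy's type system to add support for custom arrays (see dtypes).

To get the actual data inside an Index or Series, use the .array property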

In [10]: s.array
Out[10]: 
<NumpyExtensionArray>
[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
 -1.1356323710171934,  1.2121120250208506]
Length: 5, dtype: float64
In [11]: s.index.array
Out[11]: 
<NumpyExtensionArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

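array will always be an ExtensionArray. The exact details of what an ExtensionArray is and why pandas uses them are a bit beyond the scope of this introduction. See dtypes for more.

If you know you need a NumPy array, use to_numpy() or numpy.asarray().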

In [12]: s.to_numpy()
Out[12]: array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121])
In [13]: np.asarray(s)
Out[13]: array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121])

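When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing values. See dtypes for more.

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider datetimes with timezones. NumPy doesn't have a dtype to represent timezone-aware datetimes, so there are two possibly useful representations:

An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz
A datetime64[ns]-dtype numpy.ndarray, where the values have been converted to UTC and the timezone discarded

Timezones may be preserved with dtype=object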

In [14]: ser = pd.Series(pd.date_range("2000", periods=2, tz="CET"))
In [15]: ser.to_numpy(dtype=object)
Out[15]: 
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET'),
 Timestamp('2000-01-02 00:00:00+0100', tz='CET')], dtype=object)

Or thrown away with dtype='datetime64[ns]'

In [16]: ser.to_numpy(dtype="datetime64[ns]")
Out[16]: 
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
 dtype='datetime64[ns]')

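Getting the "raw data" inside a DataFrame is possibly a bit more complex. When your DataFrame only has a single data type for all the columns, DataFrame.to_numpy() will return the underlying data: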

In [17]: df.to_numpy()
Out[17]: 
array([[-0.1732,  0.1192, -1.0442],
 [-0.8618, -2.1046, -0.4949],
 [ 1.0718,  0.7216, -0.7068],
 [-1.0396,  0.2719, -0.425 ],
 [ 0.567 ,  0.2762, -1.0874],
 [-0.6737,  0.1136, -1.4784],
 [ 0.525 ,  0.4047,  0.577 ],
 [-1.715 , -1.0393, -0.3706]])

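If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in place, and the changes will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame's columns are not all the same dtype), this will not be the case. The values attribute itself, unlike the axis labels, cannot be assigned to.

Note

When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype.

In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame. You will still find references to these in old code bases and online. Going forward, we recommend avoiding .values and using .array or .to_numpy(). .values has the following drawbacks:

When your Series contains an extension type, it is unclear whether Series.values returns a NumPy array or the extension array. Series.array will always return an ExtensionArray, and will never copy data. Series.to_numpy() will always return a NumPy array, potentially at the cost of copying or coercing values.
When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data and coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.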
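
The difference between .array and .to_numpy() is easiest to see on a Series backed by an extension type. The sketch below assumes a nullable-integer Series named `ser_ext` (a name made up for this example).

```python
import numpy as np
import pandas as pd

# Hypothetical Series backed by the nullable Int64 extension dtype.
ser_ext = pd.Series([1, 2, None], dtype="Int64")

# .array hands back the extension array itself, without conversion.
ext = ser_ext.array
print(type(ext).__name__)

# .to_numpy() always yields a NumPy array; here we choose the target dtype
# and the value that should stand in for missing entries.
as_float = ser_ext.to_numpy(dtype="float64", na_value=np.nan)
print(as_float)
```

Choosing `dtype` and `na_value` explicitly makes the coercion visible at the call site, which is the point of preferring to_numpy() over .values.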

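Accelerated operations

pandas has support for accelerating certain types of binary numerical and boolean operations using the numexpr library and the bottleneck library.

These libraries are especially useful when dealing with large data sets, and provide large speedups. numexpr uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized cython routines that are especially fast when dealing with arrays that have nans.

Here is a sample (using 100 column x 100,000 row DataFrames):

Operation	0.11.0 (ms)	Prior version (ms)	Ratio to prior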
`df1 > df2`	13.32	125.35	0.1063
`df1 * df2`	21.71	36.63	0.5928
`df1 + df2`	22.04	36.50	0.6039

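You are highly encouraged to install both libraries. See the section Recommended Dependencies for more installation info.

Both are enabled by default; you can control this by setting the options: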

pd.set_option("compute.use_bottleneck", False)
pd.set_option("compute.use_numexpr", False)
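
If the option should only be switched off for a limited stretch of code, pandas' option_context() context manager can be used instead of set_option(). This is a sketch of that pattern, not something the section above prescribes.

```python
import pandas as pd

before = pd.get_option("compute.use_numexpr")

# Inside the block the option is overridden; it is restored on exit.
with pd.option_context("compute.use_numexpr", False):
    inside = pd.get_option("compute.use_numexpr")

after = pd.get_option("compute.use_numexpr")
print(before, inside, after)
```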

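Flexible binary operations

With binary operations between pandas data structures, there are two key points of interest:

Broadcasting behavior between higher- (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.
Missing data in computations.

We will demonstrate how to manage these issues independently, though they can be handled simultaneously.

Matching / broadcasting behavior

DataFrame has the methods add(), sub(), mul(), div() and related functions radd(), rsub(), ... for carrying out binary operations. For broadcasting behavior, Series input is of primary interest. Using these functions, you can match on either the index or the columns via the axis keyword: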

In [18]: df = pd.DataFrame(
 ....:    {
 ....:        "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
 ....:        "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
 ....:        "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
 ....:    }
 ....: )
 ....: 
In [19]: df
Out[19]: 
 one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172
In [20]: row = df.iloc[1]
In [21]: column = df["two"]
In [22]: df.sub(row, axis="columns")
Out[22]: 
 one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782
In [23]: df.sub(row, axis=1)
Out[23]: 
 one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782
In [24]: df.sub(column, axis="index")
Out[24]: 
 one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516
In [25]: df.sub(column, axis=0)
Out[25]: 
 one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516
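
A common application of the axis keyword is centering data. The sketch below (the frame `scores` is made up) subtracts column means by matching on the columns, and row means by matching on the index.

```python
import pandas as pd

# Hypothetical data with simple values so the result is easy to verify.
scores = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}, index=["a", "b", "c"]
)

# scores.mean() is indexed by column name, so align it on the columns.
col_centered = scores.sub(scores.mean(), axis="columns")

# scores.mean(axis=1) is indexed by row label, so align it on the index.
row_centered = scores.sub(scores.mean(axis=1), axis="index")

print(col_centered)
print(row_centered)
```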

Furthermore, you can align a level of a MultiIndexed DataFrame with a Series.

In [26]: dfmi = df.copy()
In [27]: dfmi.index = pd.MultiIndex.from_tuples(
 ....:    [(1, "a"), (1, "b"), (1, "c"), (2, "a")], names=["first", "second"]
 ....: )
 ....: 
In [28]: dfmi.sub(column, axis=0, level="second")
Out[28]: 
 one       two     three
first second 
1     a      -0.377535  0.000000       NaN
 b      -1.569069  0.000000 -1.962513
 c      -0.783123  0.000000 -0.250933
2     a            NaN -1.493173 -2.385688

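Series and Index also support the divmod() builtin. This function performs floor division and the modulo operation at the same time, returning a two-tuple of the same type as the left-hand side. For example: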

In [29]: s = pd.Series(np.arange(10))
In [30]: s
Out[30]: 
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
In [31]: div, rem = divmod(s, 3)
In [32]: div
Out[32]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int64
In [33]: rem
Out[33]: 
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int64
In [34]: idx = pd.Index(np.arange(10))
In [35]: idx
Out[35]: Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
In [36]: div, rem = divmod(idx, 3)
In [37]: div
Out[37]: Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')
In [38]: rem
Out[38]: Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')

We can also do elementwise divmod():

In [39]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
In [40]: div
Out[40]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int64
In [41]: rem
Out[41]: 
0    0
1    1
2    2
3    0
4    0
5    1
6    1
7    2
8    2
9    3
dtype: int64
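
As a quick sanity check on what divmod() returns, the quotient and remainder should recombine into the original values. The names `q`, `r` and `rebuilt` below are introduced only for this sketch.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

# divmod returns (floor quotient, remainder), both as Series.
q, r = divmod(s, 3)

# For every element, quotient * divisor + remainder gives back the input.
rebuilt = q * 3 + r
print(rebuilt.tolist())
```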

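Missing data / operations with fill values

In Series and DataFrame, the arithmetic functions have the option of inputting a fill_value, namely a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can later replace NaN with some other value using fillna if you wish).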

In [42]: df2 = df.copy()
In [43]: df2.loc["a", "three"] = 1.0
In [44]: df
Out[44]: 
 one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172
In [45]: df2
Out[45]: 
 one       two     three
a  1.394981  1.772517  1.000000
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172
In [46]: df + df2
Out[46]: 
 one       two     three
a  2.789963  3.545034       NaN
b  0.686107  3.824246 -0.100780
c  1.390491  2.956737  2.454870
d       NaN  0.558688 -1.226343
In [47]: df.add(df2, fill_value=0)
Out[47]: 
 one       two     three
a  2.789963  3.545034  1.000000
b  0.686107  3.824246 -0.100780
c  1.390491  2.956737  2.454870
d       NaN  0.558688 -1.226343
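
The same fill_value rule applies when the two operands do not share all labels. The sketch below uses two made-up Series, `a` and `b`, that overlap only on the label "y".

```python
import pandas as pd

a = pd.Series([1.0, 2.0], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Plain addition: labels present on only one side give NaN.
plain = a + b

# With fill_value=0 the missing side is treated as 0 before adding.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```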

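Flexible comparisons

Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge, whose behavior is analogous to the binary arithmetic operations described above: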

In [48]: df.gt(df2)
Out[48]: 
 one    two  three
a  False  False  False
b  False  False  False
c  False  False  False
d  False  False  False
In [49]: df2.ne(df)
Out[49]: 
 one    two  three
a  False  False   True
b  False  False  False
c  False  False  False
d   True  False  False

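These operations produce a pandas object of the same type as the left-hand-side input that is of dtype bool. These boolean objects can be used in indexing operations; see the section on Boolean indexing.

Boolean reductions

You can apply the reductions empty, any(), all(), and bool() to provide a way to summarize a boolean result.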

In [50]: (df > 0).all()
Out[50]: 
one      False
two       True
three    False
dtype: bool
In [51]: (df > 0).any()
Out[51]: 
one      True
two      True
three    True
dtype: bool

You can reduce to a final boolean value.

In [52]: (df > 0).any().any()
Out[52]: True

You can test if a pandas object is empty via the empty property.

In [53]: df.empty
Out[53]: False
In [54]: pd.DataFrame(columns=list("ABC")).empty
Out[54]: True

Warning

Asserting the truthiness of a pandas object will raise an error, as the testing of the emptiness or values is ambiguous.

In [55]: if df:
 ....:    print(True)
 ....: 
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
<ipython-input-55-318d08b2571a> in ?()
----> 1 if df:
  2     print(True)
~/work/pandas/pandas/pandas/core/generic.py in ?(self)
  1575     @final
  1576     def __nonzero__(self) -> NoReturn:
-> 1577         raise ValueError(
  1578             f"The truth value of a {type(self).__name__} is ambiguous. "
  1579             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
  1580         )
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [56]: df and df2
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
<ipython-input-56-b241b64bb471> in ?()
----> 1 df and df2
~/work/pandas/pandas/pandas/core/generic.py in ?(self)
  1575     @final
  1576     def __nonzero__(self) -> NoReturn:
-> 1577         raise ValueError(
  1578             f"The truth value of a {type(self).__name__} is ambiguous. "
  1579             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
  1580         )
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
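
In practice, the ambiguity in the tracebacks above is resolved by stating which question is being asked. The sketch below, using a made-up frame named `frame`, catches the error once and then shows explicit alternatives.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 0], "b": [2, 3]})

# Using the frame directly in a boolean context raises ValueError.
message = ""
try:
    if frame:
        pass
except ValueError as exc:
    message = str(exc)

# Explicit alternatives, each answering a different question:
has_rows = not frame.empty                      # does it contain any data?
any_positive = bool((frame > 0).any().any())    # is at least one value > 0?
all_positive = bool((frame > 0).all().all())    # is every value > 0?

print(message)
print(has_rows, any_positive, all_positive)
```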

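See gotchas for a more detailed discussion.

Comparing if objects are equivalent

Often you may find that there is more than one way to compute the same result. As a simple example, consider df + df and df * 2. To test that these two computations produce the same result, given the tools shown above, you might imagine using (df + df == df * 2).all(). But in fact, this expression is False: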

In [57]: df + df == df * 2
Out[57]: 
 one   two  three
a   True  True  False
b   True  True   True
c   True  True   True
d  False  True   True
In [58]: (df + df == df * 2).all()
Out[58]: 
one      False
two       True
three    False
dtype: bool

Notice that the boolean DataFrame df + df == df * 2 contains some False values! This is because NaNs do not compare as equal:

In [59]: np.nan == np.nan
Out[59]: False

So, NDFrames (such as Series and DataFrames) have an equals() method for testing equality, with NaNs in corresponding locations treated as equal.

In [60]: (df + df).equals(df * 2)
Out[60]: True

Note that the Series or DataFrame index needs to be in the same order for equality to be True:

In [61]: df1 = pd.DataFrame({"col": ["foo", 0, np.nan]})
In [62]: df2 = pd.DataFrame({"col": [np.nan, 0, "foo"]}, index=[2, 1, 0])
In [63]: df1.equals(df2)
Out[63]: False
In [64]: df1.equals(df2.sort_index())
Out[64]: True
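
The contrast between elementwise == and equals() can be reduced to a few lines. In this sketch, `left` and `right` are made-up Series that are identical, including a NaN in the same position.

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, np.nan, 3.0])
right = left.copy()

# Elementwise comparison: NaN == NaN is False, so all() is False.
elementwise = bool((left == right).all())

# equals() treats NaNs in matching positions as equal.
whole = left.equals(right)

print(elementwise, whole)
```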
