Pandas 2.2 中文官方教程和指南（二十四）（3）-阿里云开发者社区

Pandas 2.2 中文官方教程和指南（二十四）（2）https://developer.aliyun.com/article/1508865

常见问题（FAQ）

原文：pandas.pydata.org/docs/user_guide/gotchas.html

DataFrame 内存使用情况

在调用 info() 时，DataFrame 的内存使用情况（包括索引）会显示出来。一个配置选项，display.memory_usage（参见选项列表），指定了在调用 info() 方法时是否会显示 DataFrame 的内存使用情况。

例如，在调用 info() 时，下面的 DataFrame 的内存使用情况会显示如下：

In [1]: dtypes = [
 ...:    "int64",
 ...:    "float64",
 ...:    "datetime64[ns]",
 ...:    "timedelta64[ns]",
 ...:    "complex128",
 ...:    "object",
 ...:    "bool",
 ...: ]
 ...: 
In [2]: n = 5000
In [3]: data = {t: np.random.randint(100, size=n).astype(t) for t in dtypes}
In [4]: df = pd.DataFrame(data)
In [5]: df["categorical"] = df["object"].astype("category")
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   int64            5000 non-null   int64 
 1   float64          5000 non-null   float64 
 2   datetime64[ns]   5000 non-null   datetime64[ns] 
 3   timedelta64[ns]  5000 non-null   timedelta64[ns]
 4   complex128       5000 non-null   complex128 
 5   object           5000 non-null   object 
 6   bool             5000 non-null   bool 
 7   categorical      5000 non-null   category 
dtypes: bool(1), category(1), complex128(1), datetime64ns, float64(1), int64(1), object(1), timedelta64ns
memory usage: 288.2+ KB

+ 符号表示真实内存使用量可能更高，因为 pandas 不会计算具有 dtype=object 的列中的值所使用的内存。

传递 memory_usage='deep' 将启用更准确的内存使用报告，考虑到所包含对象的完整使用情况。这是可选的，因为进行这种更深层次的内省可能很昂贵。

In [7]: df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   int64            5000 non-null   int64 
 1   float64          5000 non-null   float64 
 2   datetime64[ns]   5000 non-null   datetime64[ns] 
 3   timedelta64[ns]  5000 non-null   timedelta64[ns]
 4   complex128       5000 non-null   complex128 
 5   object           5000 non-null   object 
 6   bool             5000 non-null   bool 
 7   categorical      5000 non-null   category 
dtypes: bool(1), category(1), complex128(1), datetime64ns, float64(1), int64(1), object(1), timedelta64ns
memory usage: 424.7 KB

默认情况下，显示选项设置为 True，但是在调用 info() 时可以通过显式传递 memory_usage 参数来明确覆盖。

可以通过调用 memory_usage() 方法找到每列的内存使用情况。这会返回一个 Series，其索引由列名表示，并显示每列的内存使用情况（以字节为单位）。对于上述的 DataFrame，可以通过 memory_usage() 方法找到每列的内存使用情况和总内存使用情况：

In [8]: df.memory_usage()
Out[8]: 
Index                128
int64              40000
float64            40000
datetime64[ns]     40000
timedelta64[ns]    40000
complex128         80000
object             40000
bool                5000
categorical         9968
dtype: int64
# total memory usage of dataframe
In [9]: df.memory_usage().sum()
Out[9]: 295096

默认情况下，返回的 Series 中显示 DataFrame 索引的内存使用情况，可以通过传递 index=False 参数来抑制索引的内存使用情况：

In [10]: df.memory_usage(index=False)
Out[10]: 
int64              40000
float64            40000
datetime64[ns]     40000
timedelta64[ns]    40000
complex128         80000
object             40000
bool                5000
categorical         9968
dtype: int64

info() 方法显示的内存使用情况利用了 memory_usage() 方法来确定 DataFrame 的内存使用情况，同时以人类可读的单位格式化输出（基于 2 的表示法；即 1KB = 1024 字节）。

另请参阅分类记忆用法。 ## 在 pandas 中使用 if/truth 语句

pandas 遵循 NumPy 的惯例，当你尝试将某些内容转换为 bool 时会引发错误。这会在 if 语句中或使用布尔操作：and、or 和 not 时发生。以下代码的结果不清楚：

>>> if pd.Series([False, True, False]):
...     pass

应该是 True 吗，因为它不是零长度，还是 False 因为有 False 值？不清楚，所以 pandas 引发了 ValueError：

In [11]: if pd.Series([False, True, False]):
 ....:    print("I was true")
 ....: 
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
<ipython-input-11-5c782b38cd2f> in ?()
----> 1 if pd.Series([False, True, False]):
  2     print("I was true")
~/work/pandas/pandas/pandas/core/generic.py in ?(self)
  1575     @final
  1576     def __nonzero__(self) -> NoReturn:
-> 1577         raise ValueError(
  1578             f"The truth value of a {type(self).__name__} is ambiguous. "
  1579             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
  1580         )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

你需要明确选择你想要对 DataFrame 做什么，例如使用 any()、all() 或 empty()。或者，你可能想要比较 pandas 对象是否为 None：

In [12]: if pd.Series([False, True, False]) is not None:
 ....:    print("I was not None")
 ....: 
I was not None

下面是如何检查任何值是否为 True：

In [13]: if pd.Series([False, True, False]).any():
 ....:    print("I am any")
 ....: 
I am any

位运算布尔值

位运算布尔运算符如 == 和 != 返回一个布尔 Series，与标量进行比较时执行逐元素比较。

In [14]: s = pd.Series(range(5))
In [15]: s == 4
Out[15]: 
0    False
1    False
2    False
3    False
4     True
dtype: bool

查看布尔值比较获取更多示例。

使用 `in` 运算符

在 Series 上使用 Python in 运算符测试成员身份在索引中，而不是在值之间。

In [16]: s = pd.Series(range(5), index=list("abcde"))
In [17]: 2 in s
Out[17]: False
In [18]: 'b' in s
Out[18]: True

如果这种行为令人惊讶，请记住，在 Python 字典上使用 in 测试键，而不是值，并且 Series 类似于字典。要测试成员身份是否在值中，请使用方法 isin()：

In [19]: s.isin([2])
Out[19]: 
a    False
b    False
c     True
d    False
e    False
dtype: bool
In [20]: s.isin([2]).any()
Out[20]: True

对于 DataFrame，同样地，in 应用于列轴，测试是否在列名列表中。 ## 通过用户定义的函数 (UDF) 方法进行变异

此部分适用于需要 UDF 的 pandas 方法。特别是 DataFrame.apply()、DataFrame.aggregate()、DataFrame.transform() 和 DataFrame.filter() 方法。

在编程中，通常的规则是在容器被迭代时不要改变容器。变异将使迭代器无效，导致意外行为。考虑以下例子：

In [21]: values = [0, 1, 2, 3, 4, 5]
In [22]: n_removed = 0
In [23]: for k, value in enumerate(values):
 ....:    idx = k - n_removed
 ....:    if value % 2 == 1:
 ....:        del values[idx]
 ....:        n_removed += 1
 ....:    else:
 ....:        values[idx] = value + 1
 ....: 
In [24]: values
Out[24]: [1, 4, 5]

人们可能会期望结果是 [1, 3, 5]。当使用需要 UDF 的 pandas 方法时，内部 pandas 通常会迭代 DataFrame 或其他 pandas 对象。因此，如果 UDF 改变了 DataFrame，可能会出现意外行为。

这里有一个类似的例子，使用 DataFrame.apply()：

In [25]: def f(s):
 ....:    s.pop("a")
 ....:    return s
 ....: 
In [26]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [27]: df.apply(f, axis="columns")
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
  3804 try:
-> 3805     return self._engine.get_loc(casted_key)
  3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'a'
The above exception was the direct cause of the following exception:
KeyError  Traceback (most recent call last)
Cell In[27], line 1
----> 1 df.apply(f, axis="columns")
File ~/work/pandas/pandas/pandas/core/frame.py:10374, in DataFrame.apply(self, func, axis, raw, result_type, args, by_row, engine, engine_kwargs, **kwargs)
  10360 from pandas.core.apply import frame_apply
  10362 op = frame_apply(
  10363     self,
  10364     func=func,
   (...)
  10372     kwargs=kwargs,
  10373 )
> 10374 return op.apply().__finalize__(self, method="apply")
File ~/work/pandas/pandas/pandas/core/apply.py:916, in FrameApply.apply(self)
  913 elif self.raw:
  914     return self.apply_raw(engine=self.engine, engine_kwargs=self.engine_kwargs)
--> 916 return self.apply_standard()
File ~/work/pandas/pandas/pandas/core/apply.py:1063, in FrameApply.apply_standard(self)
  1061 def apply_standard(self):
  1062     if self.engine == "python":
-> 1063         results, res_index = self.apply_series_generator()
  1064     else:
  1065         results, res_index = self.apply_series_numba()
File ~/work/pandas/pandas/pandas/core/apply.py:1081, in FrameApply.apply_series_generator(self)
  1078 with option_context("mode.chained_assignment", None):
  1079     for i, v in enumerate(series_gen):
  1080         # ignore SettingWithCopy here in case the user mutates
-> 1081         results[i] = self.func(v, *self.args, **self.kwargs)
  1082         if isinstance(results[i], ABCSeries):
  1083             # If we have a view on v, we need to make a copy because
  1084             #  series_generator will swap out the underlying data
  1085             results[i] = results[i].copy(deep=False)
Cell In[25], line 2, in f(s)
  1 def f(s):
----> 2     s.pop("a")
  3     return s
File ~/work/pandas/pandas/pandas/core/series.py:5391, in Series.pop(self, item)
  5366 def pop(self, item: Hashable) -> Any:
  5367  """
  5368 Return item and drops from series. Raise KeyError if not found.
  5369  
 (...)
  5389 dtype: int64
  5390 """
-> 5391     return super().pop(item=item)
File ~/work/pandas/pandas/pandas/core/generic.py:947, in NDFrame.pop(self, item)
  946 def pop(self, item: Hashable) -> Series | Any:
--> 947     result = self[item]
  948     del self[item]
  950     return result
File ~/work/pandas/pandas/pandas/core/series.py:1121, in Series.__getitem__(self, key)
  1118     return self._values[key]
  1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
  1123 # Convert generator to list before going through hashable part
  1124 # (We will iterate through the generator there to check for slices)
  1125 if is_iterator(key):
File ~/work/pandas/pandas/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
  1234     return self._values[label]
  1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
  1239 if is_integer(loc):
  1240     return self._values[loc]
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
  3807     if isinstance(casted_key, slice) or (
  3808         isinstance(casted_key, abc.Iterable)
  3809         and any(isinstance(x, slice) for x in casted_key)
  3810     ):
  3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
  3813 except TypeError:
  3814     # If we have a listlike key, _check_indexing_error will raise
  3815     #  InvalidIndexError. Otherwise we fall through and re-raise
  3816     #  the TypeError.
  3817     self._check_indexing_error(key)
KeyError: 'a'

要解决这个问题，可以制作一份副本，这样变异就不会应用于正在迭代的容器。

In [28]: values = [0, 1, 2, 3, 4, 5]
In [29]: n_removed = 0
In [30]: for k, value in enumerate(values.copy()):
 ....:    idx = k - n_removed
 ....:    if value % 2 == 1:
 ....:        del values[idx]
 ....:        n_removed += 1
 ....:    else:
 ....:        values[idx] = value + 1
 ....: 
In [31]: values
Out[31]: [1, 3, 5]

In [32]: def f(s):
 ....:    s = s.copy()
 ....:    s.pop("a")
 ....:    return s
 ....: 
In [33]: df = pd.DataFrame({"a": [1, 2, 3], 'b': [4, 5, 6]})
In [34]: df.apply(f, axis="columns")
Out[34]: 
 b
0  4
1  5
2  6

NumPy 类型的缺失值表示

`np.nan` 作为 NumPy 类型的 `NA` 表示

由于在 NumPy 和 Python 中普遍缺乏对 NA（缺失）的支持，NA 可以用以下方式表示：

一种 掩码数组 解决方案：一个数据数组和一个布尔值数组，指示值是否存在或缺失。
使用特殊的哨兵值、位模式或一组哨兵值来表示各种 dtypes 中的 NA。

选择特殊值 np.nan（非数字）作为 NumPy 类型的 NA 值，并且有一些 API 函数如 DataFrame.isna() 和 DataFrame.notna() 可以用于各种 dtypes 来检测 NA 值。然而，这个选择有一个缺点，即将缺失的整数数据强制转换为浮点类型，如整数 NA 的支持所示。

NumPy 类型的 `NA` 类型提升

当通过reindex()或其他方式向现有的Series或DataFrame引入 NA 时，布尔和整数类型将被提升为不同的 dtype 以存储 NA。这些提升总结在这个表中：

类型	用于存储 NA 的提升 dtype
`floating`	无变化
`object`	无变化
`integer`	转换为`float64`
`boolean`	转换为`object`

支持整数`NA`

在 NumPy 中没有从头开始构建高性能NA支持的情况下，主要的牺牲品是无法在整数数组中表示 NA。例如：

In [35]: s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))
In [36]: s
Out[36]: 
a    1
b    2
c    3
d    4
e    5
dtype: int64
In [37]: s.dtype
Out[37]: dtype('int64')
In [38]: s2 = s.reindex(["a", "b", "c", "f", "u"])
In [39]: s2
Out[39]: 
a    1.0
b    2.0
c    3.0
f    NaN
u    NaN
dtype: float64
In [40]: s2.dtype
Out[40]: dtype('float64')

这种权衡主要是出于内存和性能原因，以及确保生成的Series继续是“数值型”的原因。

如果需要表示可能缺失值的整数，请使用 pandas 或 pyarrow 提供的可空整数扩展 dtypes 之一

Int8Dtype
Int16Dtype
Int32Dtype
Int64Dtype
ArrowDtype

In [41]: s_int = pd.Series([1, 2, 3, 4, 5], index=list("abcde"), dtype=pd.Int64Dtype())
In [42]: s_int
Out[42]: 
a    1
b    2
c    3
d    4
e    5
dtype: Int64
In [43]: s_int.dtype
Out[43]: Int64Dtype()
In [44]: s2_int = s_int.reindex(["a", "b", "c", "f", "u"])
In [45]: s2_int
Out[45]: 
a       1
b       2
c       3
f    <NA>
u    <NA>
dtype: Int64
In [46]: s2_int.dtype
Out[46]: Int64Dtype()
In [47]: s_int_pa = pd.Series([1, 2, None], dtype="int64[pyarrow]")
In [48]: s_int_pa
Out[48]: 
0       1
1       2
2    <NA>
dtype: int64[pyarrow]

查看可空整数数据类型和 PyArrow 功能以获取更多信息。

为什么不让 NumPy 像 R 一样呢？

许多人建议 NumPy 应该简单地模仿更多领域特定的统计编程语言R中存在的NA支持。部分原因是 NumPy 类型层次结构：

类型	Dtypes
`numpy.floating`	`float16, float32, float64, float128`
`numpy.integer`	`int8, int16, int32, int64`
`numpy.unsignedinteger`	`uint8, uint16, uint32, uint64`
`numpy.object_`	`object_`
`numpy.bool_`	`bool_`
`numpy.character`	`bytes_, str_`

相比之下，R 语言只有少数几种内置数据类型：integer、numeric（浮点数）、character和boolean。NA类型是通过为每种类型保留特殊的位模式来实现的，用作缺失值。虽然在整个 NumPy 类型层次结构中执行此操作是可能的，但这将是一个更重大的权衡（特别是对于 8 位和 16 位数据类型），并且需要更多的实现工作。

但是，R 的NA语义现在可通过使用遮罩 NumPy 类型（例如Int64Dtype）或 PyArrow 类型（ArrowDtype）来实现。

与 NumPy 的差异

对于Series和DataFrame对象，var()通过N-1进行归一化以生成无偏的总体方差估计，而 NumPy 的numpy.var()通过 N 进行归一化，该方法测量样本的方差。请注意，cov()在 pandas 和 NumPy 中都通过N-1进行归一化。

线程安全性

pandas 并非 100%线程安全。已知问题与copy()方法有关。如果您在线程之间共享的DataFrame对象上进行大量复制操作，我们建议在发生数据复制的线程内持有锁定。

有关更多信息，请参见此链接。

字节顺序问题

偶尔你可能需要处理在与运行 Python 的机器上的字节顺序不同的机器上创建的数据。此问题的常见症状是错误，例如：

Traceback
    ...
ValueError: Big-endian buffer not supported on little-endian compiler

要处理此问题，您应该在将底层 NumPy 数组传递给Series或DataFrame构造函数之前将其转换为本机系统字节顺序，如下所示：

In [49]: x = np.array(list(range(10)), ">i4")  # big endian
In [50]: newx = x.byteswap().view(x.dtype.newbyteorder())  # force native byteorder
In [51]: s = pd.Series(newx)

有关更多详情，请参阅NumPy 关于字节顺序的文档。

DataFrame 内存使用情况

调用info()时，会显示DataFrame（包括索引）的内存使用情况。配置选项display.memory_usage（请参阅选项列表）指定在调用info()方法时是否显示DataFrame的内存使用情况。

例如，调用 info() 时，下面的 DataFrame 的内存使用情况会显示出来：

In [1]: dtypes = [
 ...:    "int64",
 ...:    "float64",
 ...:    "datetime64[ns]",
 ...:    "timedelta64[ns]",
 ...:    "complex128",
 ...:    "object",
 ...:    "bool",
 ...: ]
 ...: 
In [2]: n = 5000
In [3]: data = {t: np.random.randint(100, size=n).astype(t) for t in dtypes}
In [4]: df = pd.DataFrame(data)
In [5]: df["categorical"] = df["object"].astype("category")
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   int64            5000 non-null   int64 
 1   float64          5000 non-null   float64 
 2   datetime64[ns]   5000 non-null   datetime64[ns] 
 3   timedelta64[ns]  5000 non-null   timedelta64[ns]
 4   complex128       5000 non-null   complex128 
 5   object           5000 non-null   object 
 6   bool             5000 non-null   bool 
 7   categorical      5000 non-null   category 
dtypes: bool(1), category(1), complex128(1), datetime64ns, float64(1), int64(1), object(1), timedelta64ns
memory usage: 288.2+ KB

+ 符号表示真正的内存使用量可能更高，因为 pandas 不计算具有 dtype=object 的列中值的内存使用量。

通过传递 memory_usage='deep' 将启用更准确的内存使用报告，考虑到所包含对象的完整使用情况。这是可选的，因为进行更深入的内省可能会很昂贵。

In [7]: df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   int64            5000 non-null   int64 
 1   float64          5000 non-null   float64 
 2   datetime64[ns]   5000 non-null   datetime64[ns] 
 3   timedelta64[ns]  5000 non-null   timedelta64[ns]
 4   complex128       5000 non-null   complex128 
 5   object           5000 non-null   object 
 6   bool             5000 non-null   bool 
 7   categorical      5000 non-null   category 
dtypes: bool(1), category(1), complex128(1), datetime64ns, float64(1), int64(1), object(1), timedelta64ns
memory usage: 424.7 KB

默认情况下，显示选项设置为 True，但可以通过在调用 info() 时传递 memory_usage 参数来显式地覆盖。

通过调用 memory_usage() 方法可以找到每列的内存使用情况。这将返回一个由列名表示的索引的 Series，其中显示了每列的内存使用情况（以字节为单位）。对于上述的 DataFrame，可以通过 memory_usage() 方法找到每列的内存使用情况和总内存使用情况：

In [8]: df.memory_usage()
Out[8]: 
Index                128
int64              40000
float64            40000
datetime64[ns]     40000
timedelta64[ns]    40000
complex128         80000
object             40000
bool                5000
categorical         9968
dtype: int64
# total memory usage of dataframe
In [9]: df.memory_usage().sum()
Out[9]: 295096

默认情况下，返回的 Series 中显示了 DataFrame 索引的内存使用情况，可以通过传递 index=False 参数来抑制索引的内存使用情况：

In [10]: df.memory_usage(index=False)
Out[10]: 
int64              40000
float64            40000
datetime64[ns]     40000
timedelta64[ns]    40000
complex128         80000
object             40000
bool                5000
categorical         9968
dtype: int64

info() 方法显示的内存使用情况利用 memory_usage() 方法来确定 DataFrame 的内存使用情况，同时以人类可读的单位格式化输出（基于 2 的表示法；即 1KB = 1024 字节）。

另请参阅分类内存使用。

Pandas 2.2 中文官方教程和指南（二十四）（4）https://developer.aliyun.com/article/1508870

Pandas 2.2 中文官方教程和指南（二十四）（3）

常见问题（FAQ）

DataFrame 内存使用情况

位运算布尔值

使用 `in` 运算符

NumPy 类型的缺失值表示

`np.nan` 作为 NumPy 类型的 `NA` 表示

NumPy 类型的 `NA` 类型提升

支持整数`NA`

为什么不让 NumPy 像 R 一样呢？

与 NumPy 的差异

线程安全性

字节顺序问题

DataFrame 内存使用情况

热门文章

最新文章

相关课程

相关电子书

相关实验场景

Pandas 2.2 中文官方教程和指南（二十四）（3）

常见问题（FAQ）

DataFrame 内存使用情况

位运算布尔值

使用 in 运算符

NumPy 类型的缺失值表示

np.nan 作为 NumPy 类型的 NA 表示

NumPy 类型的 NA 类型提升

支持整数NA

为什么不让 NumPy 像 R 一样呢？

与 NumPy 的差异

线程安全性

字节顺序问题

DataFrame 内存使用情况

热门文章

最新文章

相关课程

相关电子书

相关实验场景

使用 `in` 运算符

`np.nan` 作为 NumPy 类型的 `NA` 表示

NumPy 类型的 `NA` 类型提升

支持整数`NA`