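Pandas 2.2 Official Tutorials and Guides in Chinese (24) (4) - Alibaba Cloud Developer Community

Previous part: Pandas 2.2 Official Tutorials and Guides in Chinese (24) (3), https://developer.aliyun.com/article/1508867

Using if/truth statements with pandas

pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the boolean operations and, or, and not. It is not clear what the result of the following code should be: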

>>> if pd.Series([False, True, False]):
...     pass

Should it be True because it is not zero-length, or False because there are False values? It is unclear, so pandas raises a ValueError instead:

In [11]: if pd.Series([False, True, False]):
 ....:    print("I was true")
 ....: 
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
<ipython-input-11-5c782b38cd2f> in ?()
----> 1 if pd.Series([False, True, False]):
  2     print("I was true")
~/work/pandas/pandas/pandas/core/generic.py in ?(self)
  1575     @final
  1576     def __nonzero__(self) -> NoReturn:
-> 1577         raise ValueError(
  1578             f"The truth value of a {type(self).__name__} is ambiguous. "
  1579             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
  1580         )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

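You need to state explicitly what you want to do with the DataFrame, for example by calling any() or all(), or by checking the empty attribute. Alternatively, you might want to check whether the pandas object is None: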

In [12]: if pd.Series([False, True, False]) is not None:
 ....:    print("I was not None")
 ....: 
I was not None

Here is how to check whether any of the values are True:

In [13]: if pd.Series([False, True, False]).any():
 ....:    print("I am any")
 ....: 
I am any
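
The alternatives named in the error message each answer a different question. The following is a minimal sketch, using the same three-element Series as above, of how any(), all(), and the empty attribute differ, and of how the ambiguous truth test can be caught; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([False, True, False])

# Each reduction answers a different, unambiguous question.
has_any_true = bool(s.any())   # is at least one element True?
all_true = bool(s.all())       # is every element True?
is_empty = s.empty             # does the Series have zero elements?

# Using the Series itself as a condition raises ValueError.
try:
    if s:
        pass
    raised = False
except ValueError:
    raised = True

print(has_any_true, all_true, is_empty, raised)  # True False False True
```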

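Bitwise boolean

Comparison operators such as == and != return a boolean Series: when compared against a scalar, the comparison is performed element-wise.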

In [14]: s = pd.Series(range(5))
In [15]: s == 4
Out[15]: 
0    False
1    False
2    False
3    False
4     True
dtype: bool

See boolean comparisons for more examples.
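
Because and, or, and not would trigger the ValueError described in the previous section, element-wise boolean Series are usually combined with the operators &, |, and ~ instead. A short sketch, assuming the same s as above; the variable names are illustrative:

```python
import pandas as pd

s = pd.Series(range(5))

# Parentheses are required: & and | bind more tightly than comparisons.
in_middle = (s > 0) & (s < 4)
at_ends = (s == 0) | (s == 4)
not_four = ~(s == 4)

print(in_middle.tolist())     # [False, True, True, True, False]
print(at_ends.tolist())       # [True, False, False, False, True]
print(s[in_middle].tolist())  # [1, 2, 3]
```

The resulting boolean Series can be used directly as a mask, as the last line shows.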

Using the `in` operator

Using the Python in operator on a Series tests for membership in the index, not membership among the values.

In [16]: s = pd.Series(range(5), index=list("abcde"))
In [17]: 2 in s
Out[17]: False
In [18]: 'b' in s
Out[18]: True

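If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and a Series is dict-like. To test for membership in the values, use the method isin():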

In [19]: s.isin([2])
Out[19]: 
a    False
b    False
c     True
d    False
e    False
dtype: bool
In [20]: s.isin([2]).any()
Out[20]: True

For DataFrame, likewise, in applies to the column axis, testing for membership in the list of column names.
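
The DataFrame case can be sketched as follows. The frame and variable names are made up for illustration; the point is that in looks at column labels, while a value test needs isin() followed by a reduction.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

label_hit = "a" in df       # "a" is a column name
value_as_label = 1 in df    # 1 is a value, not a column name

# isin() tests the values element-wise; reduce twice to get a single bool.
value_hit = bool(df.isin([1]).any().any())

print(label_hit, value_as_label, value_hit)  # True False True
```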

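Mutating with User Defined Function (UDF) methods

This section applies to pandas methods that take a UDF, in particular DataFrame.apply(), DataFrame.aggregate(), DataFrame.transform(), and DataFrame.filter().

It is a general rule in programming that one should not mutate a container while it is being iterated over. Mutation invalidates the iterator and causes unexpected behavior. Consider the following example: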

In [21]: values = [0, 1, 2, 3, 4, 5]
In [22]: n_removed = 0
In [23]: for k, value in enumerate(values):
 ....:    idx = k - n_removed
 ....:    if value % 2 == 1:
 ....:        del values[idx]
 ....:        n_removed += 1
 ....:    else:
 ....:        values[idx] = value + 1
 ....: 
In [24]: values
Out[24]: [1, 4, 5]

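One would probably have expected the result to be [1, 3, 5]. When using a pandas method that takes a UDF, pandas is often iterating internally over the DataFrame or other pandas object. Therefore, if the UDF mutates (changes) the DataFrame, unexpected behavior can arise.

Here is a similar example with DataFrame.apply():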

In [25]: def f(s):
 ....:    s.pop("a")
 ....:    return s
 ....: 
In [26]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [27]: df.apply(f, axis="columns")
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
  3804 try:
-> 3805     return self._engine.get_loc(casted_key)
  3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'a'
The above exception was the direct cause of the following exception:
KeyError  Traceback (most recent call last)
Cell In[27], line 1
----> 1 df.apply(f, axis="columns")
File ~/work/pandas/pandas/pandas/core/frame.py:10374, in DataFrame.apply(self, func, axis, raw, result_type, args, by_row, engine, engine_kwargs, **kwargs)
  10360 from pandas.core.apply import frame_apply
  10362 op = frame_apply(
  10363     self,
  10364     func=func,
   (...)
  10372     kwargs=kwargs,
  10373 )
> 10374 return op.apply().__finalize__(self, method="apply")
File ~/work/pandas/pandas/pandas/core/apply.py:916, in FrameApply.apply(self)
  913 elif self.raw:
  914     return self.apply_raw(engine=self.engine, engine_kwargs=self.engine_kwargs)
--> 916 return self.apply_standard()
File ~/work/pandas/pandas/pandas/core/apply.py:1063, in FrameApply.apply_standard(self)
  1061 def apply_standard(self):
  1062     if self.engine == "python":
-> 1063         results, res_index = self.apply_series_generator()
  1064     else:
  1065         results, res_index = self.apply_series_numba()
File ~/work/pandas/pandas/pandas/core/apply.py:1081, in FrameApply.apply_series_generator(self)
  1078 with option_context("mode.chained_assignment", None):
  1079     for i, v in enumerate(series_gen):
  1080         # ignore SettingWithCopy here in case the user mutates
-> 1081         results[i] = self.func(v, *self.args, **self.kwargs)
  1082         if isinstance(results[i], ABCSeries):
  1083             # If we have a view on v, we need to make a copy because
  1084             #  series_generator will swap out the underlying data
  1085             results[i] = results[i].copy(deep=False)
Cell In[25], line 2, in f(s)
  1 def f(s):
----> 2     s.pop("a")
  3     return s
File ~/work/pandas/pandas/pandas/core/series.py:5391, in Series.pop(self, item)
  5366 def pop(self, item: Hashable) -> Any:
  5367  """
  5368 Return item and drops from series. Raise KeyError if not found.
  5369  
 (...)
  5389 dtype: int64
  5390 """
-> 5391     return super().pop(item=item)
File ~/work/pandas/pandas/pandas/core/generic.py:947, in NDFrame.pop(self, item)
  946 def pop(self, item: Hashable) -> Series | Any:
--> 947     result = self[item]
  948     del self[item]
  950     return result
File ~/work/pandas/pandas/pandas/core/series.py:1121, in Series.__getitem__(self, key)
  1118     return self._values[key]
  1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
  1123 # Convert generator to list before going through hashable part
  1124 # (We will iterate through the generator there to check for slices)
  1125 if is_iterator(key):
File ~/work/pandas/pandas/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
  1234     return self._values[label]
  1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
  1239 if is_integer(loc):
  1240     return self._values[loc]
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
  3807     if isinstance(casted_key, slice) or (
  3808         isinstance(casted_key, abc.Iterable)
  3809         and any(isinstance(x, slice) for x in casted_key)
  3810     ):
  3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
  3813 except TypeError:
  3814     # If we have a listlike key, _check_indexing_error will raise
  3815     #  InvalidIndexError. Otherwise we fall through and re-raise
  3816     #  the TypeError.
  3817     self._check_indexing_error(key)
KeyError: 'a'

To resolve this issue, one can make a copy so that the mutation does not apply to the container being iterated over.

In [28]: values = [0, 1, 2, 3, 4, 5]
In [29]: n_removed = 0
In [30]: for k, value in enumerate(values.copy()):
 ....:    idx = k - n_removed
 ....:    if value % 2 == 1:
 ....:        del values[idx]
 ....:        n_removed += 1
 ....:    else:
 ....:        values[idx] = value + 1
 ....: 
In [31]: values
Out[31]: [1, 3, 5]

In [32]: def f(s):
 ....:    s = s.copy()
 ....:    s.pop("a")
 ....:    return s
 ....: 
In [33]: df = pd.DataFrame({"a": [1, 2, 3], 'b': [4, 5, 6]})
In [34]: df.apply(f, axis="columns")
Out[34]: 
 b
0  4
1  5
2  6
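
Another way to avoid the problem, offered here as a sketch rather than as the documentation's recommendation, is to use a method that returns a new object instead of mutating its input. Series.drop() does this, so no explicit copy is needed; drop_a is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

def drop_a(row):
    # Series.drop returns a new Series; the row passed in by apply is untouched.
    return row.drop("a")

result = df.apply(drop_a, axis="columns")

print(list(result.columns))  # ['b']
print(list(df.columns))      # ['a', 'b'], the original frame is unchanged
```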

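Missing value representation for NumPy types

`np.nan` as the `NA` representation for NumPy types

Because NumPy and Python in general lack NA (missing) support built in from the ground up, NA could have been represented with:

A masked array solution: an array of data and an array of boolean values indicating whether each value is present or missing.
A special sentinel value, bit pattern, or set of sentinel values to denote NA across the dtypes.

The special value np.nan (Not-A-Number) was chosen as the NA value for NumPy types, and there are API functions such as DataFrame.isna() and DataFrame.notna() that can be used across the dtypes to detect NA values. However, this choice has the downside of coercing missing integer data to float types, as shown in Support for integer NA.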
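
As a small illustration of detecting NA "across the dtypes", the sketch below builds a frame whose columns hold different kinds of missing marker (np.nan, None, and pd.NaT) and checks that isna() flags all of them. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "floats": [1.5, np.nan],
        "labels": ["x", None],
        "dates": [pd.Timestamp("2024-01-01"), pd.NaT],
    }
)

missing = df.isna()
complete_rows = df.notna().all(axis="columns")

print(missing)
print(complete_rows.tolist())  # [True, False]
```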

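`NA` type promotions for NumPy types

When NAs are introduced into an existing Series or DataFrame via reindex() or some other means, boolean and integer types are promoted to a different dtype in order to store the NAs. The promotions are summarized in this table:

Typeclass	Promotion dtype for storing NAs
`floating`	no change
`object`	no change
`integer`	cast to `float64`
`boolean`	cast to `object`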
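
The table can be checked with a short experiment. This sketch reindexes one small Series per typeclass onto an index with an extra label, which introduces an NA, and records the resulting dtype; the labels and values are arbitrary.

```python
import pandas as pd

idx = ["a", "b"]
new_idx = ["a", "b", "c"]  # "c" is absent, so reindex introduces an NA

originals = {
    "floating": pd.Series([1.5, 2.5], index=idx),
    "object": pd.Series(["x", "y"], index=idx, dtype=object),
    "integer": pd.Series([1, 2], index=idx),
    "boolean": pd.Series([True, False], index=idx),
}

# Map each typeclass to the dtype it has after the NA is introduced.
promoted = {name: str(s.reindex(new_idx).dtype) for name, s in originals.items()}

for name, s in originals.items():
    print(f"{name}: {s.dtype} -> {promoted[name]}")
```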

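Support for integer `NA`

In the absence of high-performance NA support built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays. For example: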

In [35]: s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))
In [36]: s
Out[36]: 
a    1
b    2
c    3
d    4
e    5
dtype: int64
In [37]: s.dtype
Out[37]: dtype('int64')
In [38]: s2 = s.reindex(["a", "b", "c", "f", "u"])
In [39]: s2
Out[39]: 
a    1.0
b    2.0
c    3.0
f    NaN
u    NaN
dtype: float64
In [40]: s2.dtype
Out[40]: dtype('float64')

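This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be "numeric".

If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas or pyarrow: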

Int8Dtype
Int16Dtype
Int32Dtype
Int64Dtype
ArrowDtype

In [41]: s_int = pd.Series([1, 2, 3, 4, 5], index=list("abcde"), dtype=pd.Int64Dtype())
In [42]: s_int
Out[42]: 
a    1
b    2
c    3
d    4
e    5
dtype: Int64
In [43]: s_int.dtype
Out[43]: Int64Dtype()
In [44]: s2_int = s_int.reindex(["a", "b", "c", "f", "u"])
In [45]: s2_int
Out[45]: 
a       1
b       2
c       3
f    <NA>
u    <NA>
dtype: Int64
In [46]: s2_int.dtype
Out[46]: Int64Dtype()
In [47]: s_int_pa = pd.Series([1, 2, None], dtype="int64[pyarrow]")
In [48]: s_int_pa
Out[48]: 
0       1
1       2
2    <NA>
dtype: int64[pyarrow]

See Nullable integer data type and PyArrow Functionality for more information.
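
When data has already been promoted to float64 because of missing values, it can usually be converted back to a nullable integer dtype with astype(). The sketch below assumes the float values are all whole numbers, which is what allows the cast to succeed.

```python
import numpy as np
import pandas as pd

s_float = pd.Series([1.0, 2.0, np.nan])

# "Int64" (capital I) is the string alias for pd.Int64Dtype().
s_nullable = s_float.astype("Int64")
total = s_nullable.sum()        # missing values are skipped by default
shifted = s_nullable + 1        # the missing value stays missing

print(s_nullable.dtype)            # Int64
print(s_nullable.isna().tolist())  # [False, False, True]
print(total)                       # 3
```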

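Why not make NumPy like R?

Many people have suggested that NumPy should simply emulate the NA support present in the more domain-specific statistical programming language R. Part of the reason it does not is the NumPy type hierarchy:

Typeclass	Dtypes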
`numpy.floating`	`float16, float32, float64, float128`
`numpy.integer`	`int8, int16, int32, int64`
`numpy.unsignedinteger`	`uint8, uint16, uint32, uint64`
`numpy.object_`	`object_`
`numpy.bool_`	`bool_`
`numpy.character`	`bytes_, str_`

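The R language, by contrast, has only a handful of built-in data types: integer, numeric (floating-point), character, and boolean. NA types are implemented by reserving special bit patterns for each type to be used as the missing value. While doing this with the full NumPy type hierarchy would be possible, it would be a more substantial trade-off (especially for the 8- and 16-bit data types) and implementation undertaking.

However, R NA semantics are now available by using masked NumPy types such as Int64Dtype or PyArrow types (ArrowDtype).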
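
The nullable dtypes represent missing values with pd.NA. A brief sketch of how pd.NA behaves: it propagates through arithmetic and comparisons, while logical operations return a definite result whenever the outcome does not depend on the missing value.

```python
import pandas as pd

# Arithmetic and comparisons with pd.NA propagate the missing value.
print(pd.NA + 1)      # <NA>
print(pd.NA == 1)     # <NA>

# Logical operations are decided when the known operand settles the result.
print(pd.NA | True)   # True
print(pd.NA & False)  # False

# The same rules apply element-wise to the nullable "boolean" dtype.
flags = pd.Series([True, False, None], dtype="boolean")
or_true = (flags | True).tolist()
and_false = (flags & False).tolist()
print(or_true)        # [True, True, True]
print(and_false)      # [False, False, False]
```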

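Differences with NumPy

For Series and DataFrame objects, var() normalizes by N-1 to produce an unbiased estimate of the population variance, whereas NumPy's numpy.var() normalizes by N, which measures the variance of the sample. Note that cov() normalizes by N-1 in both pandas and NumPy.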
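
The difference comes down to the default value of the ddof argument, 1 in pandas and 0 in NumPy. A minimal sketch on four made-up values, where the sum of squared deviations from the mean is 5:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

pandas_var = s.var()                # divides by N - 1: 5 / 3
numpy_var = np.var(s.to_numpy())    # divides by N:     5 / 4

# Either library can reproduce the other by setting ddof explicitly.
pandas_as_numpy = s.var(ddof=0)
numpy_as_pandas = np.var(s.to_numpy(), ddof=1)

print(pandas_var, numpy_var)
```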

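Thread-safety

pandas is not 100% thread safe. The known issues relate to the copy() method. If you are doing a lot of copying of DataFrame objects that are shared among threads, we recommend holding locks inside the threads where the data copying occurs.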
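
One way to follow this recommendation is sketched below, under the assumption that a single threading.Lock guards every copy of the shared frame. The names shared_df, copy_lock, and worker are hypothetical, and the lock is held only for the duration of the copy.

```python
import threading

import pandas as pd

shared_df = pd.DataFrame({"a": range(1000), "b": range(1000)})
copy_lock = threading.Lock()  # hypothetical: one lock guards shared_df
results = []

def worker():
    # Hold the lock only while copying; work on the private copy afterwards.
    with copy_lock:
        local = shared_df.copy()
    local["c"] = local["a"] + local["b"]
    results.append(int(local["c"].sum()))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

Each thread modifies only its own copy, so shared_df itself is never mutated.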

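See this link for more information.

Byte-ordering issues

Occasionally you may have to deal with data that was created on a machine with a different byte order from the one on which you are running Python. A common symptom of this issue is an error like: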

Traceback
    ...
ValueError: Big-endian buffer not supported on little-endian compiler

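To deal with this issue, convert the underlying NumPy array to the native system byte order before passing it to the Series or DataFrame constructor, using something similar to the following: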

In [49]: x = np.array(list(range(10)), ">i4")  # big endian
In [50]: newx = x.byteswap().view(x.dtype.newbyteorder())  # force native byteorder
In [51]: s = pd.Series(newx)

See the NumPy documentation on byte order for more details.
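
When it is not known in advance whether an array needs converting, the check can be made explicit. The sketch below uses a hypothetical helper, to_native, which relies on the dtype's isnative attribute and converts with astype() instead of the byteswap-and-view approach shown above; both leave the numeric values unchanged.

```python
import numpy as np
import pandas as pd

def to_native(arr):
    """Return arr in native byte order, converting only when needed."""
    if arr.dtype.isnative:
        return arr
    # astype converts the values, so the numbers are preserved.
    return arr.astype(arr.dtype.newbyteorder("="))

big = np.array(range(10), dtype=">i4")     # explicitly big-endian
little = np.array(range(10), dtype="<i4")  # explicitly little-endian

s_big = pd.Series(to_native(big))
s_little = pd.Series(to_native(little))

print(s_big.tolist() == s_little.tolist())  # True
```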
