Pandas 2.2 Chinese Official Tutorials and Guides (11·2) (Part 2) - Alibaba Cloud Developer Community

Pandas 2.2 Chinese Official Tutorials and Guides (11·2) (Part 1): https://developer.aliyun.com/article/1509869

Combining positional and label-based indexing

To get the 0th and 2nd elements from the index in the 'A' column, you can do:

In [107]: dfd = pd.DataFrame({'A': [1, 2, 3],
 .....:                    'B': [4, 5, 6]},
 .....:                   index=list('abc'))
 .....: 
In [108]: dfd
Out[108]: 
   A  B
a  1  4
b  2  5
c  3  6
In [109]: dfd.loc[dfd.index[[0, 2]], 'A']
Out[109]: 
a    1
c    3
Name: A, dtype: int64

This can also be expressed using .iloc, by explicitly getting locations on the indexers and using positional indexing to select things.

In [110]: dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
Out[110]: 
a    1
c    3
Name: A, dtype: int64

To get multiple indexers, use .get_indexer:

In [111]: dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]
Out[111]: 
   A  B
a  1  4
c  3  6
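
The two routes above can be compared side by side. The following is a minimal sketch, not part of the original guide; the missing column label "Z" is a hypothetical name added only to show that get_indexer reports absent labels as -1, which you would normally filter out before passing the positions to .iloc.

```python
import pandas as pd

dfd = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=list("abc"))

# Label route: turn row positions into labels, then select with .loc.
by_label = dfd.loc[dfd.index[[0, 2]], "A"]

# Positional route: turn the column label into a position, then select with .iloc.
by_position = dfd.iloc[[0, 2], dfd.columns.get_loc("A")]

# get_indexer returns -1 for labels that are not present ("Z" is hypothetical).
# Left unfiltered, -1 would be read by .iloc as "the last column".
wanted = ["A", "Z"]
positions = dfd.columns.get_indexer(wanted)
present = positions[positions >= 0]
subset = dfd.iloc[[0, 2], present]
```

Both routes return the same Series; which one reads better usually depends on whether most of your selectors are labels or positions.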

Reindexing

The idiomatic way to select elements that may not be present is .reindex(). See also the section on reindexing.

In [112]: s = pd.Series([1, 2, 3])
In [113]: s.reindex([1, 2, 3])
Out[113]: 
1    2.0
2    3.0
3    NaN
dtype: float64

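Alternatively, if you want to select only valid keys, the following is idiomatic and efficient; it is guaranteed to preserve the dtype of the selection.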

In [114]: labels = [1, 2, 3]
In [115]: s.loc[s.index.intersection(labels)]
Out[115]: 
1    2
2    3
dtype: int64
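
The dtype difference between the two approaches can be checked directly. This is a small sketch that reuses the Series from the example above; the variable names are chosen here for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
labels = [1, 2, 3]

# reindex keeps every requested label and fills the missing one with NaN,
# which forces the integer data to be upcast to float64.
reindexed = s.reindex(labels)

# Intersecting first drops the missing label, so no NaN is needed
# and the original integer dtype survives.
selected = s.loc[s.index.intersection(labels)]
```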

Calling .reindex() on an index that contains duplicate labels raises an error:

In [116]: s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
In [117]: labels = ['c', 'd']
In [118]: s.reindex(labels)
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
Cell In[118], line 1
----> 1 s.reindex(labels)
File ~/work/pandas/pandas/pandas/core/series.py:5153, in Series.reindex(self, index, axis, method, copy, level, fill_value, limit, tolerance)
  5136 @doc(
  5137     NDFrame.reindex,  # type: ignore[has-type]
  5138     klass=_shared_doc_kwargs["klass"],
   (...)
  5151     tolerance=None,
  5152 ) -> Series:
-> 5153     return super().reindex(
  5154         index=index,
  5155         method=method,
  5156         copy=copy,
  5157         level=level,
  5158         fill_value=fill_value,
  5159         limit=limit,
  5160         tolerance=tolerance,
  5161     )
File ~/work/pandas/pandas/pandas/core/generic.py:5610, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
  5607     return self._reindex_multi(axes, copy, fill_value)
  5609 # perform the reindex on the axes
-> 5610 return self._reindex_axes(
  5611     axes, level, limit, tolerance, method, fill_value, copy
  5612 ).__finalize__(self, method="reindex")
File ~/work/pandas/pandas/pandas/core/generic.py:5633, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
  5630     continue
  5632 ax = self._get_axis(a)
-> 5633 new_index, indexer = ax.reindex(
  5634     labels, level=level, limit=limit, tolerance=tolerance, method=method
  5635 )
  5637 axis = self._get_axis_number(a)
  5638 obj = obj._reindex_with_indexers(
  5639     {axis: [new_index, indexer]},
  5640     fill_value=fill_value,
  5641     copy=copy,
  5642     allow_dups=False,
  5643 )
File ~/work/pandas/pandas/pandas/core/indexes/base.py:4429, in Index.reindex(self, target, method, level, limit, tolerance)
  4426     raise ValueError("cannot handle a non-unique multi-index!")
  4427 elif not self.is_unique:
  4428     # GH#42568
-> 4429     raise ValueError("cannot reindex on an axis with duplicate labels")
  4430 else:
  4431     indexer, _ = self.get_indexer_non_unique(target)
ValueError: cannot reindex on an axis with duplicate labels

Generally, you can intersect the desired labels with the current axis, and then reindex.

In [119]: s.loc[s.index.intersection(labels)].reindex(labels)
Out[119]: 
c    3.0
d    NaN
dtype: float64

However, this still raises an error if the resulting index contains duplicates.

In [120]: labels = ['a', 'd']
In [121]: s.loc[s.index.intersection(labels)].reindex(labels)
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
Cell In[121], line 1
----> 1 s.loc[s.index.intersection(labels)].reindex(labels)
File ~/work/pandas/pandas/pandas/core/series.py:5153, in Series.reindex(self, index, axis, method, copy, level, fill_value, limit, tolerance)
  5136 @doc(
  5137     NDFrame.reindex,  # type: ignore[has-type]
  5138     klass=_shared_doc_kwargs["klass"],
   (...)
  5151     tolerance=None,
  5152 ) -> Series:
-> 5153     return super().reindex(
  5154         index=index,
  5155         method=method,
  5156         copy=copy,
  5157         level=level,
  5158         fill_value=fill_value,
  5159         limit=limit,
  5160         tolerance=tolerance,
  5161     )
File ~/work/pandas/pandas/pandas/core/generic.py:5610, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
  5607     return self._reindex_multi(axes, copy, fill_value)
  5609 # perform the reindex on the axes
-> 5610 return self._reindex_axes(
  5611     axes, level, limit, tolerance, method, fill_value, copy
  5612 ).__finalize__(self, method="reindex")
File ~/work/pandas/pandas/pandas/core/generic.py:5633, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
  5630     continue
  5632 ax = self._get_axis(a)
-> 5633 new_index, indexer = ax.reindex(
  5634     labels, level=level, limit=limit, tolerance=tolerance, method=method
  5635 )
  5637 axis = self._get_axis_number(a)
  5638 obj = obj._reindex_with_indexers(
  5639     {axis: [new_index, indexer]},
  5640     fill_value=fill_value,
  5641     copy=copy,
  5642     allow_dups=False,
  5643 )
File ~/work/pandas/pandas/pandas/core/indexes/base.py:4429, in Index.reindex(self, target, method, level, limit, tolerance)
  4426     raise ValueError("cannot handle a non-unique multi-index!")
  4427 elif not self.is_unique:
  4428     # GH#42568
-> 4429     raise ValueError("cannot reindex on an axis with duplicate labels")
  4430 else:
  4431     indexer, _ = self.get_indexer_non_unique(target)
ValueError: cannot reindex on an axis with duplicate labels
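
One possible way around this error is to remove the duplicate labels before reindexing. The sketch below is not from the original guide, and it assumes that keeping only the first value for each duplicated label is acceptable for your data; if it is not, aggregate the duplicates (for example with groupby) instead.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4), index=["a", "a", "b", "c"])
labels = ["a", "d"]

# Assumption: the first occurrence of each duplicated label is the one to keep.
deduplicated = s[~s.index.duplicated(keep="first")]

# The index is now unique, so reindex no longer raises.
result = deduplicated.reindex(labels)
```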

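Selecting random samples

Use the sample() method to randomly select rows or columns from a Series or DataFrame. The method samples rows by default, and accepts either a specific number of rows/columns to return or a fraction of rows/columns.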

In [122]: s = pd.Series([0, 1, 2, 3, 4, 5])
# When no arguments are passed, returns 1 row.
In [123]: s.sample()
Out[123]: 
4    4
dtype: int64
# One may specify either a number of rows:
In [124]: s.sample(n=3)
Out[124]: 
0    0
4    4
1    1
dtype: int64
# Or a fraction of the rows:
In [125]: s.sample(frac=0.5)
Out[125]: 
5    5
3    3
1    1
dtype: int64

By default, sample returns each row at most once, but you can also sample with replacement using the replace option:

In [126]: s = pd.Series([0, 1, 2, 3, 4, 5])
# Without replacement (default):
In [127]: s.sample(n=6, replace=False)
Out[127]: 
0    0
1    1
5    5
3    3
2    2
4    4
dtype: int64
# With replacement:
In [128]: s.sample(n=6, replace=True)
Out[128]: 
0    0
4    4
3    3
2    2
4    4
4    4
dtype: int64

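By default, each row has an equal probability of being selected. If you want rows to have different probabilities, pass sampling weights to sample through the weights parameter. The weights can be a list, a NumPy array, or a Series, but they must be the same length as the object you are sampling. Missing values are treated as a weight of zero, and inf values are not allowed. If the weights do not sum to 1, they are renormalized by dividing every weight by the sum of the weights. For example: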

In [129]: s = pd.Series([0, 1, 2, 3, 4, 5])
In [130]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
In [131]: s.sample(n=3, weights=example_weights)
Out[131]: 
5    5
4    4
3    3
dtype: int64
# Weights will be re-normalized automatically
In [132]: example_weights2 = [0.5, 0, 0, 0, 0, 0]
In [133]: s.sample(n=1, weights=example_weights2)
Out[133]: 
0    0
dtype: int64

When applied to a DataFrame, you can use one of its columns as sampling weights (provided you are sampling rows and not columns) by simply passing the column name as a string.

In [134]: df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
 .....:                    'weight_column': [0.5, 0.4, 0.1, 0]})
 .....: 
In [135]: df2.sample(n=3, weights='weight_column')
Out[135]: 
   col1  weight_column
1     8            0.4
0     9            0.5
2     7            0.1

sample also lets you sample columns instead of rows using the axis argument.

In [136]: df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
In [137]: df3.sample(n=1, axis=1)
Out[137]: 
   col1
0     1
1     2
2     3

Finally, you can set a seed for sample's random number generator with the random_state argument, which accepts either an integer (as a seed) or a NumPy RandomState object.

In [138]: df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
# With a given seed, the sample will always draw the same rows.
In [139]: df4.sample(n=2, random_state=2)
Out[139]: 
   col1  col2
2     3     4
1     2     3
In [140]: df4.sample(n=2, random_state=2)
Out[140]: 
   col1  col2
2     3     4
1     2     3
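
Weights and random_state can be combined. The following sketch is an illustration rather than part of the original guide; the seed values 42 and 0 are arbitrary. It relies only on two properties described above: a fixed seed reproduces the same draw, and a row with weight zero is never selected.

```python
import pandas as pd

df = pd.DataFrame({"col1": [9, 8, 7, 6],
                   "weight_column": [0.5, 0.4, 0.1, 0.0]})

# Same seed and same weights give the same rows in the same order.
first = df.sample(n=2, weights="weight_column", random_state=42)
second = df.sample(n=2, weights="weight_column", random_state=42)

# Only three rows have a non-zero weight, so drawing three rows without
# replacement must return exactly those three (in some order).
three = df.sample(n=3, weights="weight_column", random_state=0)
```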

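Setting with enlargement

The .loc/[] operations can perform enlargement when setting a key that does not exist on that axis.

In the Series case this is effectively an appending operation.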

In [141]: se = pd.Series([1, 2, 3])
In [142]: se
Out[142]: 
0    1
1    2
2    3
dtype: int64
In [143]: se[5] = 5.
In [144]: se
Out[144]: 
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

A DataFrame can be enlarged on either axis via .loc.

In [145]: dfi = pd.DataFrame(np.arange(6).reshape(3, 2),
 .....:                   columns=['A', 'B'])
 .....: 
In [146]: dfi
Out[146]: 
   A  B
0  0  1
1  2  3
2  4  5
In [147]: dfi.loc[:, 'C'] = dfi.loc[:, 'A']
In [148]: dfi
Out[148]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4

This is like an append operation on the DataFrame.

In [149]: dfi.loc[3] = 5
In [150]: dfi
Out[150]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5
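
A common use of row enlargement is appending one record at a time. The sketch below is an illustration under one stated assumption: the frame has the default 0..n-1 integer index, which is the only reason len(dfi) can serve as the next free label.

```python
import pandas as pd

dfi = pd.DataFrame({"A": [0, 2, 4], "B": [1, 3, 5]})

# Assigning to a column label that does not exist yet adds the column.
dfi["C"] = dfi["A"] + dfi["B"]

# Assigning to a row label that does not exist yet appends a row.
# Assumption: the index is the default RangeIndex, so len(dfi) is unused.
dfi.loc[len(dfi)] = [6, 7, 13]
```

When many rows have to be added, collecting them first and calling pd.concat once is usually the cheaper option, since each enlargement may copy the existing data.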

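Fast scalar value getting and setting

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you are asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures.

Similarly to loc, at provides label-based scalar lookups, while iat provides integer-based lookups analogously to iloc.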

In [151]: s.iat[5]
Out[151]: 5
In [152]: df.at[dates[5], 'A']
Out[152]: 0.1136484096888855
In [153]: df.iat[3, 0]
Out[153]: -0.7067711336300845

You can also set values using these same indexers.

In [154]: df.at[dates[5], 'E'] = 7
In [155]: df.iat[3, 0] = 7

at may enlarge the object in place as above if the indexer is missing.

In [156]: df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7
In [157]: df
Out[157]: 
                   A         B         C         D    E    0
2000-01-01 -0.282863  0.469112 -1.509059 -1.135632  NaN  NaN
2000-01-02 -0.173215  1.212112  0.119209 -1.044236  NaN  NaN
2000-01-03 -2.104569 -0.861849 -0.494929  1.071804  NaN  NaN
2000-01-04  7.000000  0.721555 -1.039575  0.271860  NaN  NaN
2000-01-05  0.567020 -0.424972  0.276232 -1.087401  NaN  NaN
2000-01-06  0.113648 -0.673690 -1.478427  0.524988  7.0  NaN
2000-01-07  0.577046  0.404705 -1.715002 -1.039268  NaN  NaN
2000-01-08 -1.157892 -0.370647 -1.344312  0.844885  NaN  NaN
2000-01-09       NaN       NaN       NaN       NaN  NaN  7.0
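
The transcript above depends on df and dates objects defined earlier in the guide. The following self-contained sketch shows the same three behaviours (scalar read, scalar write, enlargement through at) on a small frame whose values are made up for illustration.

```python
import pandas as pd

dates = pd.date_range("2000-01-01", periods=3)
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=dates)

# Scalar reads: at uses labels, iat uses integer positions.
by_label = df.at[dates[1], "A"]
by_position = df.iat[1, 0]

# Scalar writes with the same accessors.
df.at[dates[1], "A"] = 20.0
df.iat[2, 1] = 60.0

# Setting through at with a row label that is missing enlarges the frame;
# cells that were not assigned in the new row are filled with NaN.
new_day = dates[-1] + pd.Timedelta("1 day")
df.at[new_day, "A"] = 7.0
```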

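Boolean indexing

Another common operation is using boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped using parentheses, because by default Python evaluates an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).

Using a boolean vector to index a Series works exactly as in a NumPy ndarray: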

In [158]: s = pd.Series(range(-3, 4))
In [159]: s
Out[159]: 
0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64
In [160]: s[s > 0]
Out[160]: 
4    1
5    2
6    3
dtype: int64
In [161]: s[(s < -1) | (s > 0.5)]
Out[161]: 
0   -3
1   -2
4    1
5    2
6    3
dtype: int64
In [162]: s[~(s < 0)]
Out[162]: 
3    0
4    1
5    2
6    3
dtype: int64
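
The operator-precedence rule for &, | and ~ described in this section is easy to trip over, so here is a minimal sketch of what happens with and without parentheses. The frame and its values are made up for illustration; the unparenthesised form is wrapped in try/except because it fails rather than returning a wrong mask.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5], "B": [1, 2, 4]})

# Parenthesised comparisons are combined element-wise, as intended.
mask = (df["A"] > 2) & (df["B"] < 3)
selected = df[mask]

# Without parentheses, & binds tighter than > and <, so Python builds the
# chained comparison df["A"] > (2 & df["B"]) < 3 and then needs the truth
# value of a whole Series, which pandas refuses to provide.
try:
    df[df["A"] > 2 & df["B"] < 3]
    unparenthesised_failed = False
except (TypeError, ValueError):
    unparenthesised_failed = True
```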

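You may select rows from a DataFrame using a boolean vector the same length as the DataFrame's index (for example, something derived from one of the columns of the DataFrame):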

In [163]: df[df['A'] > 0]
Out[163]: 
                   A         B         C         D    E   0
2000-01-04  7.000000  0.721555 -1.039575  0.271860  NaN NaN
2000-01-05  0.567020 -0.424972  0.276232 -1.087401  NaN NaN
2000-01-06  0.113648 -0.673690 -1.478427  0.524988  7.0 NaN
2000-01-07  0.577046  0.404705 -1.715002 -1.039268  NaN NaN

List comprehensions and the map method of Series can also be used to produce more complex criteria:

In [164]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
 .....:                    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
 .....:                    'c': np.random.randn(7)})
 .....: 
# only want 'two' or 'three'
In [165]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [166]: df2[criterion]
Out[166]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075
# equivalent but slower
In [167]: df2[[x.startswith('t') for x in df2['a']]]
Out[167]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075
# Multiple criteria
In [168]: df2[criterion & (df2['b'] == 'x')]
Out[168]: 
       a  b         c
3  three  x  0.361719

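With the choice methods Selection by Label, Selection by Position, and Advanced Indexing, you may select along more than one axis using boolean vectors combined with other indexing expressions.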

In [169]: df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']
Out[169]: 
   b         c
3  x  0.361719

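Warning

iloc supports two kinds of boolean indexing. If the indexer is a boolean Series, an error will be raised. For instance, in the following example, df.iloc[s.values, 1] is fine, because the boolean indexer is an array, but df.iloc[s, 1] would raise ValueError.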

In [170]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
 .....:                  index=list('abc'),
 .....:                  columns=['A', 'B'])
 .....: 
In [171]: s = (df['A'] > 2)
In [172]: s
Out[172]: 
a    False
b     True
c     True
Name: A, dtype: bool
In [173]: df.loc[s, 'B']
Out[173]: 
b    4
c    6
Name: B, dtype: int64
In [174]: df.iloc[s.values, 1]
Out[174]: 
b    4
c    6
Name: B, dtype: int64
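
The failing case is not shown in the transcript above, so the sketch below demonstrates it explicitly. It uses to_numpy() in place of .values, which serves the same purpose here; the exception tuple is deliberately broad because the exact exception type can depend on the index type of the mask.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                  index=list("abc"), columns=["A", "B"])
mask = df["A"] > 2

# .loc accepts the boolean Series and aligns it on the index labels.
via_loc = df.loc[mask, "B"]

# .iloc is purely positional, so hand it the underlying boolean array.
via_iloc = df.iloc[mask.to_numpy(), 1]

# Passing the Series itself to .iloc is rejected.
try:
    df.iloc[mask, 1]
    series_mask_rejected = False
except (ValueError, NotImplementedError):
    series_mask_rejected = True
```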

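Indexing with isin

Consider the isin() method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select the rows that hold the values you want: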

In [175]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
In [176]: s
Out[176]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64
In [177]: s.isin([2, 4, 6])
Out[177]: 
4    False
3    False
2     True
1    False
0     True
dtype: bool
In [178]: s[s.isin([2, 4, 6])]
Out[178]: 
2    2
0    4
dtype: int64

The same method is available for Index objects and is useful when you don't know which of the sought labels are actually present:

In [179]: s[s.index.isin([2, 4, 6])]
Out[179]: 
4    0
2    2
dtype: int64
# compare it to the following
In [180]: s.reindex([2, 4, 6])
Out[180]: 
2    2.0
4    0.0
6    NaN
dtype: float64

In addition to that, MultiIndex allows selecting a separate level to use in the membership check:

In [181]: s_mi = pd.Series(np.arange(6),
 .....:                 index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
 .....: 
In [182]: s_mi
Out[182]: 
0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int64
In [183]: s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
Out[183]: 
0  c    2
1  a    3
dtype: int64
In [184]: s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]
Out[184]: 
0  a    0
   c    2
1  a    3
   c    5
dtype: int64

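DataFrame also has an isin() method. When calling isin, pass a set of values as either an array or a dict. If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.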

In [185]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
 .....:                   'ids2': ['a', 'n', 'c', 'n']})
 .....: 
In [186]: values = ['a', 'b', 1, 3]
In [187]: df.isin(values)
Out[187]: 
    vals    ids   ids2
0   True   True   True
1  False   True  False
2   True  False  False
3  False  False  False

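Often you will want to match certain values against certain columns. Just make values a dict where the key is the column and the value is a list of items you want to check for.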

In [188]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}
In [189]: df.isin(values)
Out[189]: 
    vals    ids   ids2
0   True   True  False
1  False   True  False
2   True  False  False
3  False  False  False

To get a boolean DataFrame that marks the elements that are not among the given values, use the ~ operator:

In [190]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}
In [191]: ~df.isin(values)
Out[191]: 
    vals    ids  ids2
0  False  False  True
1   True  False  True
2  False   True  True
3   True   True  True

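Combine DataFrame's isin with the any() and all() methods to quickly select subsets of your data that meet a given criterion. To select the rows where each column meets its own criterion: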

In [192]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
In [193]: row_mask = df.isin(values).all(axis=1)
In [194]: df[row_mask]
Out[194]: 
   vals ids ids2
0     1   a    a
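
The same membership frame can be reduced in different ways depending on the question being asked. This sketch reuses the data from the example above and contrasts all, any, and the negation of any; the result variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"vals": [1, 2, 3, 4],
                   "ids": ["a", "b", "f", "n"],
                   "ids2": ["a", "n", "c", "n"]})
values = {"ids": ["a", "b"], "ids2": ["a", "c"], "vals": [1, 3]}

# One boolean per cell: is this cell's value in the list for its column?
membership = df.isin(values)

all_match = df[membership.all(axis=1)]    # every column matches its list
any_match = df[membership.any(axis=1)]    # at least one column matches
none_match = df[~membership.any(axis=1)]  # no column matches
```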

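The `where()` method and masking

Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that the selection output has the same shape as the original data, you can use the where method in Series and DataFrame.

To return only the selected rows: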

In [195]: s[s > 0]
Out[195]: 
3    1
2    2
1    3
0    4
dtype: int64

To return a Series of the same shape as the original:

In [196]: s.where(s > 0)
Out[196]: 
4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64

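Selecting values from a DataFrame with a boolean criterion now also preserves the shape of the input data. where is used under the hood as the implementation. The code below is equivalent to df.where(df < 0).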

In [197]: dates = pd.date_range('1/1/2000', periods=8)
In [198]: df = pd.DataFrame(np.random.randn(8, 4),
 .....:                  index=dates, columns=['A', 'B', 'C', 'D'])
 .....: 
In [199]: df[df < 0]
Out[199]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838

In addition, where takes an optional other argument that replaces the values where the condition is False in the returned copy.

In [200]: df.where(df < 0, -df)
Out[200]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838

You may wish to set values based on some boolean criteria. This can be done intuitively like so:

In [201]: s2 = s.copy()
In [202]: s2[s2 < 0] = 0
In [203]: s2
Out[203]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64
In [204]: df2 = df.copy()
In [205]: df2[df2 < 0] = 0
In [206]: df2
Out[206]: 
                   A         B         C         D
2000-01-01  0.000000  0.000000  0.485855  0.245166
2000-01-02  0.000000  0.390389  0.000000  1.655824
2000-01-03  0.000000  0.299674  0.000000  0.281059
2000-01-04  0.846958  0.000000  0.600705  0.000000
2000-01-05  0.669692  0.000000  0.000000  0.342416
2000-01-06  0.868584  0.000000  2.297780  0.000000
2000-01-07  0.000000  0.000000  0.168904  0.000000
2000-01-08  0.801196  1.392071  0.000000  0.000000

where returns a modified copy of the data.

Note

The signature for DataFrame.where() differs from numpy.where(). Roughly, df1.where(m, df2) is equivalent to np.where(m, df1, df2).

In [207]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
Out[207]: 
               A     B     C     D
2000-01-01  True  True  True  True
2000-01-02  True  True  True  True
2000-01-03  True  True  True  True
2000-01-04  True  True  True  True
2000-01-05  True  True  True  True
2000-01-06  True  True  True  True
2000-01-07  True  True  True  True
2000-01-08  True  True  True  True
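
Because the examples above use random data, the following sketch repeats the comparison on a small fixed frame so the results can be read off by hand. The values are made up for illustration. One practical difference worth noting: DataFrame.where returns a labelled DataFrame, while np.where returns a plain ndarray.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 2.0, -3.0], "B": [4.0, -5.0, 6.0]})

# Entries where the condition is False become NaN; the shape is preserved.
kept = df.where(df < 0)

# With `other`, entries where the condition is False are replaced instead.
flipped = df.where(df < 0, -df)

# np.where takes the condition first and returns an unlabelled array.
as_array = np.where(df < 0, df, -df)
```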

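Alignment

Furthermore, where aligns the input boolean condition (ndarray or DataFrame), so that partial selection with setting is possible. This is analogous to partial setting via .loc (but on the contents rather than the axis labels).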

In [208]: df2 = df.copy()
In [209]: df2[df2[1:4] > 0] = 3
In [210]: df2
Out[210]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525  0.485855  0.245166
2000-01-02 -0.352480  3.000000 -1.192319  3.000000
2000-01-03 -0.864883  3.000000 -0.227870  3.000000
2000-01-04  3.000000 -1.222082  3.000000 -1.233203
2000-01-05  0.669692 -0.605656 -1.169184  0.342416
2000-01-06  0.868584 -0.948458  2.297780 -0.684718
2000-01-07 -2.670153 -0.114722  0.168904 -0.048048
2000-01-08  0.801196  1.392071 -0.048788 -0.808838

where can also accept axis and level parameters to align the input when performing the where.

In [211]: df2 = df.copy()
In [212]: df2.where(df2 > 0, df2['A'], axis='index')
Out[212]: 
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196

This is equivalent to (but faster than) the following.

In [213]: df2 = df.copy()
In [214]: df.apply(lambda x, y: x.where(x > 0, y), y=df['A'])
Out[214]: 
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196

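where can accept a callable as the condition and other arguments. The function must take one argument (the calling Series or DataFrame) and return valid output as the condition and other argument.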

In [215]: df3 = pd.DataFrame({'A': [1, 2, 3],
 .....:                    'B': [4, 5, 6],
 .....:                    'C': [7, 8, 9]})
 .....: 
In [216]: df3.where(lambda x: x > 4, lambda x: x + 10)
Out[216]: 
    A   B  C
0  11  14  7
1  12   5  8
2  13   6  9
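
Callables are most useful inside a method chain, where the intermediate frame has no name you could refer to. The sketch below illustrates this under an assumption: the extra column D is hypothetical and exists only to show that the callables receive the frame produced by the previous step.

```python
import pandas as pd

df3 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

result = (
    df3
    # Hypothetical intermediate step: add a derived column.
    .assign(D=lambda d: d["A"] * 10)
    # Both callables are called with the frame that already contains D.
    .where(lambda d: d > 4, lambda d: d + 10)
)
```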

Mask

mask() is the inverse boolean operation of where.

In [217]: s.mask(s >= 0)
Out[217]: 
4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64
In [218]: df.mask(df >= 0)
Out[218]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838
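
The relationship between the two methods can be stated as s.mask(cond) giving the same result as s.where(~cond). The sketch below checks this on a small Series with made-up values, and also shows mask with a replacement value in place of the default NaN.

```python
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2])
cond = s >= 0

# mask hides the entries where the condition is True ...
masked = s.mask(cond)
# ... which is the same as keeping the entries where it is False.
equivalent = s.where(~cond)

# Like where, mask accepts a replacement value as its second argument.
capped = s.mask(s > 1, 1)
```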

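Setting with enlargement conditionally using `numpy()`

An alternative to where() is numpy.where(). Combined with setting a new column, you can use it to enlarge a DataFrame where the values are determined conditionally.

Suppose you have two choices to choose from in the following DataFrame, and you want to set a new column color to 'green' when the second column has 'Z'. You can do the following: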

In [219]: df = pd.DataFrame({'col1': list('ABBC'), 'col2': list('ZZXY')})
In [220]: df['color'] = np.where(df['col2'] == 'Z', 'green', 'red')
In [221]: df
Out[221]: 
  col1 col2  color
0    A    Z  green
1    B    Z  green
2    B    X    red
3    C    Y    red

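If you have multiple conditions, you can use numpy.select() instead. Say there are three color choices corresponding to three conditions, with a fourth color as a fallback; you can do the following.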

In [222]: conditions = [
 .....:    (df['col2'] == 'Z') & (df['col1'] == 'A'),
 .....:    (df['col2'] == 'Z') & (df['col1'] == 'B'),
 .....:    (df['col1'] == 'B')
 .....: ]
 .....: 
In [223]: choices = ['yellow', 'blue', 'purple']
In [224]: df['color'] = np.select(conditions, choices, default='black')
In [225]: df
Out[225]: 
  col1 col2   color
0    A    Z  yellow
1    B    Z    blue
2    B    X  purple
3    C    Y   black
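
With numpy.select(), the order of the conditions matters: for each row, the first condition that is True decides the result. The sketch below illustrates this with the same data; the two-condition list is a simplified variant made up for this example, not the one used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": list("ABBC"), "col2": list("ZZXY")})

# A broad condition followed by a narrower one that it fully covers.
conditions = [
    df["col1"] == "B",
    (df["col1"] == "B") & (df["col2"] == "Z"),
]
choices = ["purple", "blue"]

# Broad condition first: the narrower one can never win.
shadowed = np.select(conditions, choices, default="black")

# Narrow condition first: it takes priority where it applies.
ordered = np.select(conditions[::-1], choices[::-1], default="black")
```

Listing the most specific conditions first is therefore usually the safer habit.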

Pandas 2.2 Chinese Official Tutorials and Guides (11·2) (Part 3): https://developer.aliyun.com/article/1509871
