Combining positional and label-based indexing

If you wish to get the 0th and the 2nd elements from the index in the 'A' column, you can do:
In [107]: dfd = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [4, 5, 6]},
   .....:                    index=list('abc'))
   .....:

In [108]: dfd
Out[108]:
   A  B
a  1  4
b  2  5
c  3  6

In [109]: dfd.loc[dfd.index[[0, 2]], 'A']
Out[109]:
a    1
c    3
Name: A, dtype: int64
This can also be expressed using .iloc, by explicitly getting the locations on the indexers and using positional indexing to select things.
In [110]: dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
Out[110]:
a    1
c    3
Name: A, dtype: int64
For getting multiple indexers, use .get_indexer:
In [111]: dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]
Out[111]:
   A  B
a  1  4
c  3  6
Reindexing

The idiomatic way to achieve selecting potentially not-found elements is via .reindex(). See also the section on reindexing.
In [112]: s = pd.Series([1, 2, 3])

In [113]: s.reindex([1, 2, 3])
Out[113]:
1    2.0
2    3.0
3    NaN
dtype: float64
Alternatively, if you want to select only valid keys, the following is idiomatic and efficient; it is guaranteed to preserve the dtype of the selection.
In [114]: labels = [1, 2, 3]

In [115]: s.loc[s.index.intersection(labels)]
Out[115]:
1    2
2    3
dtype: int64
Having a duplicated index will raise for a .reindex():
In [116]: s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])

In [117]: labels = ['c', 'd']

In [118]: s.reindex(labels)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[118], line 1
----> 1 s.reindex(labels)

File ~/work/pandas/pandas/pandas/core/series.py:5153, in Series.reindex(self, index, axis, method, copy, level, fill_value, limit, tolerance)
   5136 @doc(
   5137     NDFrame.reindex,  # type: ignore[has-type]
   5138     klass=_shared_doc_kwargs["klass"],
   (...)
   5151     tolerance=None,
   5152 ) -> Series:
-> 5153     return super().reindex(
   5154         index=index,
   5155         method=method,
   5156         copy=copy,
   5157         level=level,
   5158         fill_value=fill_value,
   5159         limit=limit,
   5160         tolerance=tolerance,
   5161     )

File ~/work/pandas/pandas/pandas/core/generic.py:5610, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
   5607     return self._reindex_multi(axes, copy, fill_value)
   5609 # perform the reindex on the axes
-> 5610 return self._reindex_axes(
   5611     axes, level, limit, tolerance, method, fill_value, copy
   5612 ).__finalize__(self, method="reindex")

File ~/work/pandas/pandas/pandas/core/generic.py:5633, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   5630     continue
   5632 ax = self._get_axis(a)
-> 5633 new_index, indexer = ax.reindex(
   5634     labels, level=level, limit=limit, tolerance=tolerance, method=method
   5635 )
   5637 axis = self._get_axis_number(a)
   5638 obj = obj._reindex_with_indexers(
   5639     {axis: [new_index, indexer]},
   5640     fill_value=fill_value,
   5641     copy=copy,
   5642     allow_dups=False,
   5643 )

File ~/work/pandas/pandas/pandas/core/indexes/base.py:4429, in Index.reindex(self, target, method, level, limit, tolerance)
   4426     raise ValueError("cannot handle a non-unique multi-index!")
   4427 elif not self.is_unique:
   4428     # GH#42568
-> 4429     raise ValueError("cannot reindex on an axis with duplicate labels")
   4430 else:
   4431     indexer, _ = self.get_indexer_non_unique(target)

ValueError: cannot reindex on an axis with duplicate labels
Generally, you can intersect the desired labels with the current axis, and then reindex.
In [119]: s.loc[s.index.intersection(labels)].reindex(labels)
Out[119]:
c    3.0
d    NaN
dtype: float64
However, this would still raise if your resulting index is duplicated.
In [120]: labels = ['a', 'd']

In [121]: s.loc[s.index.intersection(labels)].reindex(labels)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[121], line 1
----> 1 s.loc[s.index.intersection(labels)].reindex(labels)

File ~/work/pandas/pandas/pandas/core/series.py:5153, in Series.reindex(self, index, axis, method, copy, level, fill_value, limit, tolerance)
   5136 @doc(
   5137     NDFrame.reindex,  # type: ignore[has-type]
   5138     klass=_shared_doc_kwargs["klass"],
   (...)
   5151     tolerance=None,
   5152 ) -> Series:
-> 5153     return super().reindex(
   5154         index=index,
   5155         method=method,
   5156         copy=copy,
   5157         level=level,
   5158         fill_value=fill_value,
   5159         limit=limit,
   5160         tolerance=tolerance,
   5161     )

File ~/work/pandas/pandas/pandas/core/generic.py:5610, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
   5607     return self._reindex_multi(axes, copy, fill_value)
   5609 # perform the reindex on the axes
-> 5610 return self._reindex_axes(
   5611     axes, level, limit, tolerance, method, fill_value, copy
   5612 ).__finalize__(self, method="reindex")

File ~/work/pandas/pandas/pandas/core/generic.py:5633, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   5630     continue
   5632 ax = self._get_axis(a)
-> 5633 new_index, indexer = ax.reindex(
   5634     labels, level=level, limit=limit, tolerance=tolerance, method=method
   5635 )
   5637 axis = self._get_axis_number(a)
   5638 obj = obj._reindex_with_indexers(
   5639     {axis: [new_index, indexer]},
   5640     fill_value=fill_value,
   5641     copy=copy,
   5642     allow_dups=False,
   5643 )

File ~/work/pandas/pandas/pandas/core/indexes/base.py:4429, in Index.reindex(self, target, method, level, limit, tolerance)
   4426     raise ValueError("cannot handle a non-unique multi-index!")
   4427 elif not self.is_unique:
   4428     # GH#42568
-> 4429     raise ValueError("cannot reindex on an axis with duplicate labels")
   4430 else:
   4431     indexer, _ = self.get_indexer_non_unique(target)

ValueError: cannot reindex on an axis with duplicate labels
Selecting random samples

A random selection of rows or columns from a Series or DataFrame can be obtained with the sample() method. The method will sample rows by default, and accepts a specific number of rows/columns to return, or a fraction of rows.
In [122]: s = pd.Series([0, 1, 2, 3, 4, 5])

# When no arguments are passed, returns 1 row.
In [123]: s.sample()
Out[123]:
4    4
dtype: int64

# One may specify either a number of rows:
In [124]: s.sample(n=3)
Out[124]:
0    0
4    4
1    1
dtype: int64

# Or a fraction of the rows:
In [125]: s.sample(frac=0.5)
Out[125]:
5    5
3    3
1    1
dtype: int64
By default, sample will return each row at most once, but one can also sample with replacement using the replace option:
In [126]: s = pd.Series([0, 1, 2, 3, 4, 5])

# Without replacement (default):
In [127]: s.sample(n=6, replace=False)
Out[127]:
0    0
1    1
5    5
3    3
2    2
4    4
dtype: int64

# With replacement:
In [128]: s.sample(n=6, replace=True)
Out[128]:
0    0
4    4
3    3
2    2
4    4
4    4
dtype: int64
By default, each row has an equal probability of being selected, but if you want rows to have different probabilities, you can pass the sample function sampling weights as weights. These weights can be a list, a NumPy array, or a Series, but they must be of the same length as the object you are sampling. Missing values will be treated as a weight of zero, and inf values are not allowed. If the weights do not sum to 1, they will be re-normalized by dividing all weights by the sum of the weights. For example:
In [129]: s = pd.Series([0, 1, 2, 3, 4, 5])

In [130]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [131]: s.sample(n=3, weights=example_weights)
Out[131]:
5    5
4    4
3    3
dtype: int64

# Weights will be re-normalized automatically
In [132]: example_weights2 = [0.5, 0, 0, 0, 0, 0]

In [133]: s.sample(n=1, weights=example_weights2)
Out[133]:
0    0
dtype: int64
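When the weights are given as a Series rather than a list, pandas aligns them with the sampled object on the index, and labels without a weight entry are treated as weight zero. The short sketch below is an illustrative addition (the weight values and the weight_series name are made up for this example):

import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# Weights passed as a Series are aligned on the index;
# labels 0, 1 and 2 have no entry and therefore get weight 0.
weight_series = pd.Series([0.2, 0.3, 0.5], index=[3, 4, 5])

print(s.sample(n=2, weights=weight_series))  # only labels 3, 4 or 5 can be drawn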
When applied to a DataFrame, you can use a column of the DataFrame as sampling weights (provided you are sampling rows and not columns) by simply passing the name of the column as a string.
In [134]: df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
   .....:                     'weight_column': [0.5, 0.4, 0.1, 0]})
   .....:

In [135]: df2.sample(n=3, weights='weight_column')
Out[135]:
   col1  weight_column
1     8            0.4
0     9            0.5
2     7            0.1
sample also allows users to sample columns instead of rows using the axis argument.
In [136]: df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

In [137]: df3.sample(n=1, axis=1)
Out[137]:
   col1
0     1
1     2
2     3
Finally, one can also set a seed for sample's random number generator using the random_state argument, which will accept either an integer (as a seed) or a NumPy RandomState object.
In [138]: df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# With a given seed, the sample will always draw the same rows.
In [139]: df4.sample(n=2, random_state=2)
Out[139]:
   col1  col2
2     3     4
1     2     3

In [140]: df4.sample(n=2, random_state=2)
Out[140]:
   col1  col2
2     3     4
1     2     3
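As noted above, random_state also accepts a NumPy RandomState object. The following is an illustrative sketch (not from the original page) showing that re-creating a RandomState from the same seed reproduces the same draw:

import numpy as np
import pandas as pd

df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# Passing a RandomState object instead of an integer seed.
print(df4.sample(n=2, random_state=np.random.RandomState(2)))

# A fresh RandomState built from the same seed yields the same rows again.
print(df4.sample(n=2, random_state=np.random.RandomState(2)))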
Setting with enlargement

The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.

In the Series case this is effectively an appending operation.
In [141]: se = pd.Series([1, 2, 3])

In [142]: se
Out[142]:
0    1
1    2
2    3
dtype: int64

In [143]: se[5] = 5.

In [144]: se
Out[144]:
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64
A DataFrame can be enlarged on either axis via .loc.
In [145]: dfi = pd.DataFrame(np.arange(6).reshape(3, 2),
   .....:                    columns=['A', 'B'])
   .....:

In [146]: dfi
Out[146]:
   A  B
0  0  1
1  2  3
2  4  5

In [147]: dfi.loc[:, 'C'] = dfi.loc[:, 'A']

In [148]: dfi
Out[148]:
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
This is like an append operation on the DataFrame.
In [149]: dfi.loc[3] = 5

In [150]: dfi
Out[150]:
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5
Fast scalar value getting and setting

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures.
Similarly to loc, at provides label-based scalar lookups, while iat provides integer-based lookups analogously to iloc.
In [151]: s.iat[5]
Out[151]: 5

In [152]: df.at[dates[5], 'A']
Out[152]: 0.1136484096888855

In [153]: df.iat[3, 0]
Out[153]: -0.7067711336300845
You can also set using these same indexers.
In [154]: df.at[dates[5], 'E'] = 7

In [155]: df.iat[3, 0] = 7
at may enlarge the object in-place as above if the indexer is missing.
In [156]: df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7

In [157]: df
Out[157]:
                   A         B         C         D    E    0
2000-01-01 -0.282863  0.469112 -1.509059 -1.135632  NaN  NaN
2000-01-02 -0.173215  1.212112  0.119209 -1.044236  NaN  NaN
2000-01-03 -2.104569 -0.861849 -0.494929  1.071804  NaN  NaN
2000-01-04  7.000000  0.721555 -1.039575  0.271860  NaN  NaN
2000-01-05  0.567020 -0.424972  0.276232 -1.087401  NaN  NaN
2000-01-06  0.113648 -0.673690 -1.478427  0.524988  7.0  NaN
2000-01-07  0.577046  0.404705 -1.715002 -1.039268  NaN  NaN
2000-01-08 -1.157892 -0.370647 -1.344312  0.844885  NaN  NaN
2000-01-09       NaN       NaN       NaN       NaN  NaN  7.0
Boolean indexing

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
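As an illustrative aside (the frame and values below are made up, not part of the original examples), omitting the parentheses turns the expression into a Python chained comparison, which ends up calling bool() on a Series and typically fails with an error about ambiguous truth values:

import pandas as pd

df_demo = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 2, 4]})

# Correct: each comparison is grouped explicitly before combining with &.
print(df_demo[(df_demo['A'] > 2) & (df_demo['B'] < 3)])

# Without parentheses this parses as df_demo['A'] > (2 & df_demo['B']) < 3,
# a chained comparison that needs bool() on a Series and raises.
try:
    df_demo[df_demo['A'] > 2 & df_demo['B'] < 3]
except ValueError as err:
    print(err)  # the truth value of a Series is ambiguous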
Using a boolean vector to index a Series works exactly as in a NumPy ndarray:
In [158]: s = pd.Series(range(-3, 4))

In [159]: s
Out[159]:
0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [160]: s[s > 0]
Out[160]:
4    1
5    2
6    3
dtype: int64

In [161]: s[(s < -1) | (s > 0.5)]
Out[161]:
0   -3
1   -2
4    1
5    2
6    3
dtype: int64

In [162]: s[~(s < 0)]
Out[162]:
3    0
4    1
5    2
6    3
dtype: int64
You may select rows from a DataFrame using a boolean vector the same length as the DataFrame's index (for example, something derived from one of the columns of the DataFrame):
In [163]: df[df['A'] > 0]
Out[163]:
                   A         B         C         D    E    0
2000-01-04  7.000000  0.721555 -1.039575  0.271860  NaN  NaN
2000-01-05  0.567020 -0.424972  0.276232 -1.087401  NaN  NaN
2000-01-06  0.113648 -0.673690 -1.478427  0.524988  7.0  NaN
2000-01-07  0.577046  0.404705 -1.715002 -1.039268  NaN  NaN
List comprehensions and the map method of Series can also be used to produce more complex criteria:
In [164]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
   .....:                     'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....:

# only want 'two' or 'three'
In [165]: criterion = df2['a'].map(lambda x: x.startswith('t'))

In [166]: df2[criterion]
Out[166]:
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

# equivalent but slower
In [167]: df2[[x.startswith('t') for x in df2['a']]]
Out[167]:
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

# Multiple criteria
In [168]: df2[criterion & (df2['b'] == 'x')]
Out[168]:
       a  b         c
3  three  x  0.361719
With the choice methods Selection by Label, Selection by Position, and Advanced Indexing, you may select along more than one axis using boolean vectors combined with other indexing expressions.
In [169]: df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']
Out[169]:
   b         c
3  x  0.361719
Warning

iloc supports two kinds of boolean indexing. If the indexer is a boolean Series, an error will be raised. For instance, in the following example, df.iloc[s.values, 1] is ok. The boolean indexer is an array. But df.iloc[s, 1] would raise ValueError.
In [170]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
   .....:                   index=list('abc'),
   .....:                   columns=['A', 'B'])
   .....:

In [171]: s = (df['A'] > 2)

In [172]: s
Out[172]:
a    False
b     True
c     True
Name: A, dtype: bool

In [173]: df.loc[s, 'B']
Out[173]:
b    4
c    6
Name: B, dtype: int64

In [174]: df.iloc[s.values, 1]
Out[174]:
b    4
c    6
Name: B, dtype: int64
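To make the warning concrete, here is a small illustrative sketch (not part of the original examples): the boolean Series is rejected by .iloc, while its underlying NumPy array is accepted:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                  index=list('abc'), columns=['A', 'B'])
s = df['A'] > 2

# A plain boolean array works with .iloc.
print(df.iloc[s.values, 1])

# A boolean Series does not, because .iloc is purely positional.
try:
    df.iloc[s, 1]
except ValueError as err:
    print(err)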
Indexing with isin

Consider the isin() method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select rows with the values you want:
In [175]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

In [176]: s
Out[176]:
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [177]: s.isin([2, 4, 6])
Out[177]:
4    False
3    False
2     True
1    False
0     True
dtype: bool

In [178]: s[s.isin([2, 4, 6])]
Out[178]:
2    2
0    4
dtype: int64
The same method is available for Index objects and is useful for the cases when you don't know which of the sought labels are in fact present:
In [179]: s[s.index.isin([2, 4, 6])]
Out[179]:
4    0
2    2
dtype: int64

# compare it to the following
In [180]: s.reindex([2, 4, 6])
Out[180]:
2    2.0
4    0.0
6    NaN
dtype: float64
In addition to that, MultiIndex allows selecting a separate level to use in the membership check:
In [181]: s_mi = pd.Series(np.arange(6),
   .....:                  index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
   .....:

In [182]: s_mi
Out[182]:
0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int64

In [183]: s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
Out[183]:
0  c    2
1  a    3
dtype: int64

In [184]: s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]
Out[184]:
0  a    0
   c    2
1  a    3
   c    5
dtype: int64
DataFrame also has an isin() method. When calling isin, pass a set of values as either an array or a dict. If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.
In [185]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
   .....:                    'ids2': ['a', 'n', 'c', 'n']})
   .....:

In [186]: values = ['a', 'b', 1, 3]

In [187]: df.isin(values)
Out[187]:
    vals    ids   ids2
0   True   True   True
1  False   True  False
2   True  False  False
3  False  False  False
Oftentimes you'll want to match certain values with certain columns. Just make values a dict where the key is the column, and the value is a list of items you want to check for.
In [188]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}

In [189]: df.isin(values)
Out[189]:
    vals    ids   ids2
0   True   True  False
1  False   True  False
2   True  False  False
3  False  False  False
To return the DataFrame of booleans where the values are not in the original DataFrame, use the ~ operator:
In [190]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}

In [191]: ~df.isin(values)
Out[191]:
    vals    ids  ids2
0  False  False  True
1   True  False  True
2  False   True  True
3   True   True  True
Combine DataFrame's isin with the any() and all() methods to quickly select subsets of your data that meet a given criterion. To select a row where each column meets its own criterion:
In [192]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

In [193]: row_mask = df.isin(values).all(1)

In [194]: df[row_mask]
Out[194]:
   vals ids ids2
0     1   a    a
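For comparison, an illustrative variation (not from the original page): using any(axis=1) instead keeps every row in which at least one column matches its criterion:

import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': ['a', 'b', 'f', 'n'],
                   'ids2': ['a', 'n', 'c', 'n']})
values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

# True for rows where at least one column is in its allowed set.
row_mask_any = df.isin(values).any(axis=1)
print(df[row_mask_any])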
The where() Method and Masking

Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that the selection output has the same shape as the original data, you can use the where method in Series and DataFrame.
To return only the selected rows:
In [195]: s[s > 0]
Out[195]:
3    1
2    2
1    3
0    4
dtype: int64
To return a Series of the same shape as the original:
In [196]: s.where(s > 0)
Out[196]:
4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64
Selecting values from a DataFrame with a boolean criterion now also preserves input data shape. where is used under the hood as the implementation. The code below is equivalent to df.where(df < 0).
In [197]: dates = pd.date_range('1/1/2000', periods=8)

In [198]: df = pd.DataFrame(np.random.randn(8, 4),
   .....:                   index=dates, columns=['A', 'B', 'C', 'D'])
   .....:

In [199]: df[df < 0]
Out[199]:
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838
In addition, where takes an optional other argument for replacement of values where the condition is False, in the returned copy.
In [200]: df.where(df < 0, -df)
Out[200]:
                   A         B         C         D
2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838
You may wish to set values based on some boolean criteria. This can be done intuitively like so:
In [201]: s2 = s.copy()

In [202]: s2[s2 < 0] = 0

In [203]: s2
Out[203]:
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [204]: df2 = df.copy()

In [205]: df2[df2 < 0] = 0

In [206]: df2
Out[206]:
                   A         B         C         D
2000-01-01  0.000000  0.000000  0.485855  0.245166
2000-01-02  0.000000  0.390389  0.000000  1.655824
2000-01-03  0.000000  0.299674  0.000000  0.281059
2000-01-04  0.846958  0.000000  0.600705  0.000000
2000-01-05  0.669692  0.000000  0.000000  0.342416
2000-01-06  0.868584  0.000000  2.297780  0.000000
2000-01-07  0.000000  0.000000  0.168904  0.000000
2000-01-08  0.801196  1.392071  0.000000  0.000000
where returns a modified copy of the data.

Note

The signature for DataFrame.where() differs from numpy.where(). Roughly, df1.where(m, df2) is equivalent to np.where(m, df1, df2).
In [207]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
Out[207]:
               A     B     C     D
2000-01-01  True  True  True  True
2000-01-02  True  True  True  True
2000-01-03  True  True  True  True
2000-01-04  True  True  True  True
2000-01-05  True  True  True  True
2000-01-06  True  True  True  True
2000-01-07  True  True  True  True
2000-01-08  True  True  True  True
Alignment

Furthermore, where aligns the input boolean condition (ndarray or DataFrame), such that partial selection with setting is possible. This is analogous to partial setting via .loc (but on the contents rather than the axis labels).
In [208]: df2 = df.copy()

In [209]: df2[df2[1:4] > 0] = 3

In [210]: df2
Out[210]:
                   A         B         C         D
2000-01-01 -2.104139 -1.309525  0.485855  0.245166
2000-01-02 -0.352480  3.000000 -1.192319  3.000000
2000-01-03 -0.864883  3.000000 -0.227870  3.000000
2000-01-04  3.000000 -1.222082  3.000000 -1.233203
2000-01-05  0.669692 -0.605656 -1.169184  0.342416
2000-01-06  0.868584 -0.948458  2.297780 -0.684718
2000-01-07 -2.670153 -0.114722  0.168904 -0.048048
2000-01-08  0.801196  1.392071 -0.048788 -0.808838
where can also accept axis and level parameters to align the input when performing the where.
In [211]: df2 = df.copy()

In [212]: df2.where(df2 > 0, df2['A'], axis='index')
Out[212]:
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196
This is equivalent to (but faster than) the following.
In [213]: df2 = df.copy()

In [214]: df.apply(lambda x, y: x.where(x > 0, y), y=df['A'])
Out[214]:
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196
where can accept a callable as both the condition and the other argument. The function must take one argument (the calling Series or DataFrame) and return valid output to be used as the condition or other argument.
In [215]: df3 = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [4, 5, 6],
   .....:                     'C': [7, 8, 9]})
   .....:

In [216]: df3.where(lambda x: x > 4, lambda x: x + 10)
Out[216]:
    A   B  C
0  11  14  7
1  12   5  8
2  13   6  9
Mask

mask() is the inverse boolean operation of where.
In [217]: s.mask(s >= 0)
Out[217]:
4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64

In [218]: df.mask(df >= 0)
Out[218]:
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838
Setting with enlargement conditionally using numpy()

An alternative to where() is to use numpy.where(). Combined with setting a new column, you can use it to enlarge a DataFrame where the values are determined conditionally.
Consider you have two choices to choose from in the following DataFrame, and you want to set a new column color to 'green' when the second column has 'Z'. You can do the following:
In [219]: df = pd.DataFrame({'col1': list('ABBC'), 'col2': list('ZZXY')})

In [220]: df['color'] = np.where(df['col2'] == 'Z', 'green', 'red')

In [221]: df
Out[221]:
  col1 col2  color
0    A    Z  green
1    B    Z  green
2    B    X    red
3    C    Y    red
If you have multiple conditions, you can use numpy.select() to achieve that. Say corresponding to three conditions there are three choices of colors, with a fourth color as a fallback, you can do the following.
In [222]: conditions = [
   .....:     (df['col2'] == 'Z') & (df['col1'] == 'A'),
   .....:     (df['col2'] == 'Z') & (df['col1'] == 'B'),
   .....:     (df['col1'] == 'B')
   .....: ]
   .....:

In [223]: choices = ['yellow', 'blue', 'purple']

In [224]: df['color'] = np.select(conditions, choices, default='black')

In [225]: df
Out[225]:
  col1 col2   color
0    A    Z  yellow
1    B    Z    blue
2    B    X  purple
3    C    Y   black