Pandas 2.2 Chinese Official Tutorials and Guides (11·1) (2)

Pandas 2.2 Chinese Official Tutorials and Guides (11·1) (1): https://developer.aliyun.com/article/1509860

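## Selection by label

Warning

Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a view versus a copy.

Warning

.loc is strict when you present slicers that are not compatible (or convertible) with the index type, for example using integers in a DatetimeIndex. These will raise a TypeError.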

In [41]: dfl = pd.DataFrame(np.random.randn(5, 4),
 ....:                   columns=list('ABCD'),
 ....:                   index=pd.date_range('20130101', periods=5))
 ....: 
In [42]: dfl
Out[42]: 
                   A         B         C         D
2013-01-01  1.075770 -0.109050  1.643563 -1.469388
2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524  0.413738  0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
2013-01-05  0.895717  0.805244 -1.206412  2.565646
In [43]: dfl.loc[2:3]
---------------------------------------------------------------------------
TypeError  Traceback (most recent call last)
Cell In[43], line 1
----> 1 dfl.loc[2:3]
File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
  1189 maybe_callable = com.apply_if_callable(key, self.obj)
  1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
  1409 if isinstance(key, slice):
  1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
  1412 elif com.is_bool_indexer(key):
  1413     return self._getbool_axis(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
  1440     return obj.copy(deep=False)
  1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
  1445 if isinstance(indexer, slice):
  1446     return self.obj._slice(indexer, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexes/datetimes.py:682, in DatetimeIndex.slice_indexer(self, start, end, step)
  674 # GH#33146 if start and end are combinations of str and None and Index is not
  675 # monotonic, we can not use Index.slice_indexer because it does not honor the
  676 # actual elements, is only searching for start and end
  677 if (
  678     check_str_or_none(start)
  679     or check_str_or_none(end)
  680     or self.is_monotonic_increasing
  681 ):
--> 682     return Index.slice_indexer(self, start, end, step)
  684 mask = np.array(True)
  685 in_index = True
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
  6618 def slice_indexer(
  6619     self,
  6620     start: Hashable | None = None,
  6621     end: Hashable | None = None,
  6622     step: int | None = None,
  6623 ) -> slice:
  6624  """
  6625 Compute the slice indexer for input labels and step.
  6626  
 (...)
  6660 slice(1, 3, None)
  6661 """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
  6664     # return a slice
  6665     if not is_scalar(start_slice):
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
  6877 start_slice = None
  6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
  6880 if start_slice is None:
  6881     start_slice = 0
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6794, in Index.get_slice_bound(self, label, side)
  6790 original_label = label
  6792 # For datetime indices label may be a string that has to be converted
  6793 # to datetime boundary according to its resolution.
-> 6794 label = self._maybe_cast_slice_bound(label, side)
  6796 # we need to look up the label
  6797 try:
File ~/work/pandas/pandas/pandas/core/indexes/datetimes.py:642, in DatetimeIndex._maybe_cast_slice_bound(self, label, side)
  637 if isinstance(label, dt.date) and not isinstance(label, dt.datetime):
  638     # Pandas supports slicing with dates, treated as datetimes at midnight.
  639     # https://github.com/pandas-dev/pandas/issues/31501
  640     label = Timestamp(label).to_pydatetime()
--> 642 label = super()._maybe_cast_slice_bound(label, side)
  643 self._data._assert_tzawareness_compat(label)
  644 return Timestamp(label)
File ~/work/pandas/pandas/pandas/core/indexes/datetimelike.py:378, in DatetimeIndexOpsMixin._maybe_cast_slice_bound(self, label, side)
  376     return lower if side == "left" else upper
  377 elif not isinstance(label, self._data._recognized_scalars):
--> 378     self._raise_invalid_indexer("slice", label)
  380 return label
File ~/work/pandas/pandas/pandas/core/indexes/base.py:4301, in Index._raise_invalid_indexer(self, form, key, reraise)
  4299 if reraise is not lib.no_default:
  4300     raise TypeError(msg) from reraise
-> 4301 raise TypeError(msg)
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [2] of type int 

String-like values used in slicing can be converted to the type of the index and lead to natural slicing.

In [44]: dfl.loc['20130102':'20130104']
Out[44]: 
                   A         B         C         D
2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524  0.413738  0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061 
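
The strictness described above can be sketched as follows. This is a minimal illustration, not part of the original session: the frame name `ts_df` and its values are assumptions chosen so the result is deterministic.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a DatetimeIndex; the values are arbitrary.
ts_df = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    columns=list("ABCD"),
    index=pd.date_range("20130101", periods=5),
)

# Integers are not compatible with a DatetimeIndex, so .loc rejects the slice.
try:
    ts_df.loc[2:3]
    int_slice_error = None
except TypeError as exc:
    int_slice_error = exc

# Date-like strings are converted to timestamps; both endpoints are included.
by_string = ts_df.loc["20130102":"20130104"]
print(type(int_slice_error).__name__, len(by_string))
```

If positional access is what you actually want on such a frame, `ts_df.iloc[2:3]` is the positional counterpart.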

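pandas provides a suite of methods in order to have purely label based indexing. This is a strict inclusion based protocol. Every label asked for must be in the index, or a KeyError will be raised. When slicing, both the start bound and the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position.

The .loc attribute is the primary access method. The following are valid inputs:

  • A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index; this use is not an integer position along the index).
  • A list or array of labels ['a', 'b', 'c'].
  • A slice object with labels 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included, when present in the index! See Slicing with labels).
  • A boolean array.
  • A callable, see Selection by callable.
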
In [45]: s1 = pd.Series(np.random.randn(6), index=list('abcdef'))
In [46]: s1
Out[46]: 
a    1.431256
b    1.340309
c   -1.170299
d   -0.226169
e    0.410835
f    0.813850
dtype: float64
In [47]: s1.loc['c':]
Out[47]: 
c   -1.170299
d   -0.226169
e    0.410835
f    0.813850
dtype: float64
In [48]: s1.loc['b']
Out[48]: 1.3403088497993827 
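
To make the "labels, not positions" rule concrete, here is a small sketch. The Series `s_int` is a hypothetical example whose integer labels deliberately differ from the positions; it does not appear in the original session.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
s_int = pd.Series(["x", "y", "z"], index=[10, 20, 30])

by_label = s_int.loc[10]        # label 10, which happens to sit at position 0
label_slice = s_int.loc[10:20]  # both endpoints are included: labels 10 and 20

# A label that is not in the index raises KeyError, even inside a list.
try:
    s_int.loc[[10, 99]]
    missing_error = None
except KeyError as exc:
    missing_error = exc

print(by_label, list(label_slice.index), type(missing_error).__name__)
```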

Note that setting works as well:

In [49]: s1.loc['c':] = 0
In [50]: s1
Out[50]: 
a    1.431256
b    1.340309
c    0.000000
d    0.000000
e    0.000000
f    0.000000
dtype: float64 

With a DataFrame:

In [51]: df1 = pd.DataFrame(np.random.randn(6, 4),
 ....:                   index=list('abcdef'),
 ....:                   columns=list('ABCD'))
 ....: 
In [52]: df1
Out[52]: 
          A         B         C         D
a  0.132003 -0.827317 -0.076467 -1.187678
b  1.130127 -1.436737 -1.413681  1.607920
c  1.024180  0.569605  0.875906 -2.211372
d  0.974466 -2.006747 -0.410001 -0.078638
e  0.545952 -1.219217 -1.226825  0.769804
f -1.281247 -0.727707 -0.121306 -0.097883
In [53]: df1.loc[['a', 'b', 'd'], :]
Out[53]: 
          A         B         C         D
a  0.132003 -0.827317 -0.076467 -1.187678
b  1.130127 -1.436737 -1.413681  1.607920
d  0.974466 -2.006747 -0.410001 -0.078638 

Accessing via label slices:

In [54]: df1.loc['d':, 'A':'C']
Out[54]: 
          A         B         C
d  0.974466 -2.006747 -0.410001
e  0.545952 -1.219217 -1.226825
f -1.281247 -0.727707 -0.121306 

For getting a cross section using a label (equivalent to df.xs('a')):

In [55]: df1.loc['a']
Out[55]: 
A    0.132003
B   -0.827317
C   -0.076467
D   -1.187678
Name: a, dtype: float64 

For getting values with a boolean array:

In [56]: df1.loc['a'] > 0
Out[56]: 
A     True
B    False
C    False
D    False
Name: a, dtype: bool
In [57]: df1.loc[:, df1.loc['a'] > 0]
Out[57]: 
          A
a  0.132003
b  1.130127
c  1.024180
d  0.974466
e  0.545952
f -1.281247 

NA values in a boolean array propagate as False:

In [58]: mask = pd.array([True, False, True, False, pd.NA, False], dtype="boolean")
In [59]: mask
Out[59]: 
<BooleanArray>
[True, False, True, False, <NA>, False]
Length: 6, dtype: boolean
In [60]: df1[mask]
Out[60]: 
          A         B         C         D
a  0.132003 -0.827317 -0.076467 -1.187678
c  1.024180  0.569605  0.875906 -2.211372 
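
A compact way to check the NA rule is to compare the masked selection with an explicitly filled mask. The names `frame` and `mask` below are illustrative assumptions, and filling with `fillna(False)` is simply one way to spell out the same behaviour.

```python
import pandas as pd

frame = pd.DataFrame({"v": [1, 2, 3]}, index=list("abc"))
mask = pd.array([True, pd.NA, False], dtype="boolean")

selected = frame[mask]                      # the NA entry is treated as False
explicit = frame.loc[mask.fillna(False)]    # same selection, NA filled by hand

print(list(selected.index))
```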

For getting a value explicitly:

# this is also equivalent to ``df1.at['a','A']``
In [61]: df1.loc['a', 'A']
Out[61]: 0.13200317033032932 

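Slicing with labels

When using .loc with slices, if both the start and the stop labels are present in the index, then the elements located between the two (including them) are returned: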

In [62]: s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])
In [63]: s.loc[3:5]
Out[63]: 
3    b
2    c
5    d
dtype: object 

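If at least one of the two is absent, but the index is sorted and can be compared against the start and stop labels, then slicing will still work as expected, by selecting the labels which rank between the two: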

In [64]: s.sort_index()
Out[64]: 
0    a
2    c
3    b
4    e
5    d
dtype: object
In [65]: s.sort_index().loc[1:6]
Out[65]: 
2    c
3    b
4    e
5    d
dtype: object 

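However, if at least one of the two is absent and the index is not sorted, an error will be raised (since doing otherwise would be computationally expensive, as well as potentially ambiguous for mixed type indexes). For instance, in the above example, s.loc[1:6] would raise a KeyError.

For the rationale behind this behavior, see Endpoints are inclusive.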

In [66]: s = pd.Series(list('abcdef'), index=[0, 3, 2, 5, 4, 2])
In [67]: s.loc[3:5]
Out[67]: 
3    b
2    c
5    d
dtype: object 

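Also, if the index has duplicate labels and either the start or the stop label is duplicated, an error will be raised. For instance, in the above example, s.loc[2:5] would raise a KeyError.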
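
The three slicing cases in this subsection can be checked side by side. The sketch below reuses the index values from the session; the variable names (`s_unsorted`, `s_dup`) and the helper `raises_key_error` are hypothetical and introduced only for illustration.

```python
import pandas as pd

s_unsorted = pd.Series(list("abcde"), index=[0, 3, 2, 5, 4])
s_dup = pd.Series(list("abcdef"), index=[0, 3, 2, 5, 4, 2])


def raises_key_error(series, start, stop):
    """Return True if series.loc[start:stop] raises KeyError."""
    try:
        series.loc[start:stop]
    except KeyError:
        return True
    return False


# Both bounds present: rows come back in index order, endpoints included.
present = s_unsorted.loc[3:5]

# Bounds absent and index unsorted: KeyError.
absent_unsorted = raises_key_error(s_unsorted, 1, 6)

# Bounds absent but index sorted: labels ranking between the bounds.
absent_sorted = s_unsorted.sort_index().loc[1:6]

# A duplicated bound label (2) in a non-monotonic index: KeyError.
duplicated_bound = raises_key_error(s_dup, 2, 5)

print(list(present.index), absent_unsorted, list(absent_sorted.index), duplicated_bound)
```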

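For more information about duplicate labels, see Duplicate Labels.

## Selection by position

Warning

Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a view versus a copy.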

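pandas provides a suite of methods in order to get purely integer based indexing. The semantics closely follow Python and NumPy slicing. Indexing is 0-based. When slicing, the start bound is included, while the upper bound is excluded. Trying to use a non-integer, even a valid label, raises an error: the upstream text names IndexError, although pandas 2.x raises TypeError for a scalar key such as .iloc['a'].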

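The .iloc attribute is the primary access method. The following are valid inputs:

  • An integer, e.g. 5.
  • A list or array of integers [4, 3, 0].
  • A slice object with ints 1:7.
  • A boolean array.
  • A callable, see Selection by callable.
  • A tuple of row (and column) indexes, whose elements are one of the above types.
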
In [68]: s1 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))
In [69]: s1
Out[69]: 
0    0.695775
2    0.341734
4    0.959726
6   -1.110336
8   -0.619976
dtype: float64
In [70]: s1.iloc[:3]
Out[70]: 
0    0.695775
2    0.341734
4    0.959726
dtype: float64
In [71]: s1.iloc[3]
Out[71]: -1.110336102891167 

Note that setting works as well:

In [72]: s1.iloc[:3] = 0
In [73]: s1
Out[73]: 
0    0.000000
2    0.000000
4    0.000000
6   -1.110336
8   -0.619976
dtype: float64 
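
The difference between positions and labels is easiest to see when the two disagree. The following sketch assumes a hypothetical Series `s_pos` with even integer labels; the `except` clause catches both TypeError and IndexError because the exact exception type for a non-integer key is treated here as an implementation detail.

```python
import numpy as np
import pandas as pd

# Labels 0, 2, 4, 6, 8 at positions 0..4; values equal the position.
s_pos = pd.Series(np.arange(5.0), index=list(range(0, 10, 2)))

first_three = s_pos.iloc[:3]  # positions 0, 1, 2; the stop bound is excluded
fourth = s_pos.iloc[3]        # position 3 carries the label 6
by_label = s_pos.loc[4]       # label 4 sits at position 2

# A non-integer key is rejected by .iloc, even if it could be a valid label.
try:
    s_pos.iloc["a"]
    non_int_error = None
except (TypeError, IndexError) as exc:
    non_int_error = exc

print(list(first_three.index), fourth, by_label, type(non_int_error).__name__)
```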

With a DataFrame:

In [74]: df1 = pd.DataFrame(np.random.randn(6, 4),
 ....:                   index=list(range(0, 12, 2)),
 ....:                   columns=list(range(0, 8, 2)))
 ....: 
In [75]: df1
Out[75]: 
           0         2         4         6
0   0.149748 -0.732339  0.687738  0.176444
2   0.403310 -0.154951  0.301624 -2.179861
4  -1.369849 -0.954208  1.462696 -1.743161
6  -0.826591 -0.345352  1.314232  0.690579
8   0.995761  2.396780  0.014871  3.357427
10 -0.317441 -1.236269  0.896171 -0.487602 

Select via integer slicing:

In [76]: df1.iloc[:3]
Out[76]: 
          0         2         4         6
0  0.149748 -0.732339  0.687738  0.176444
2  0.403310 -0.154951  0.301624 -2.179861
4 -1.369849 -0.954208  1.462696 -1.743161
In [77]: df1.iloc[1:5, 2:4]
Out[77]: 
          4         6
2  0.301624 -2.179861
4  1.462696 -1.743161
6  1.314232  0.690579
8  0.014871  3.357427 

Select via integer list:

In [78]: df1.iloc[[1, 3, 5], [1, 3]]
Out[78]: 
           2         6
2  -0.154951 -2.179861
6  -0.345352  0.690579
10 -1.236269 -0.487602 
In [79]: df1.iloc[1:3, :]
Out[79]: 
          0         2         4         6
2  0.403310 -0.154951  0.301624 -2.179861
4 -1.369849 -0.954208  1.462696 -1.743161 
In [80]: df1.iloc[:, 1:3]
Out[80]: 
           2         4
0  -0.732339  0.687738
2  -0.154951  0.301624
4  -0.954208  1.462696
6  -0.345352  1.314232
8   2.396780  0.014871
10 -1.236269  0.896171 
# this is also equivalent to ``df1.iat[1,1]``
In [81]: df1.iloc[1, 1]
Out[81]: -0.1549507744249032 

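For getting a cross section using an integer position (for this frame, equivalent to df1.xs(2), because position 1 holds the label 2):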

In [82]: df1.iloc[1]
Out[82]: 
0    0.403310
2   -0.154951
4    0.301624
6   -2.179861
Name: 2, dtype: float64 

Out of range slice indexes are handled gracefully, just as in Python/NumPy.

# these are allowed in Python/NumPy.
In [83]: x = list('abcdef')
In [84]: x
Out[84]: ['a', 'b', 'c', 'd', 'e', 'f']
In [85]: x[4:10]
Out[85]: ['e', 'f']
In [86]: x[8:10]
Out[86]: []
In [87]: s = pd.Series(x)
In [88]: s
Out[88]: 
0    a
1    b
2    c
3    d
4    e
5    f
dtype: object
In [89]: s.iloc[4:10]
Out[89]: 
4    e
5    f
dtype: object
In [90]: s.iloc[8:10]
Out[90]: Series([], dtype: object) 

Note that using slices that go out of bounds can result in an empty axis (e.g. an empty DataFrame being returned).

In [91]: dfl = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
In [92]: dfl
Out[92]: 
          A         B
0 -0.082240 -2.182937
1  0.380396  0.084844
2  0.432390  1.519970
3 -0.493662  0.600178
4  0.274230  0.132885
In [93]: dfl.iloc[:, 2:3]
Out[93]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
In [94]: dfl.iloc[:, 1:3]
Out[94]: 
          B
0 -2.182937
1  0.084844
2  1.519970
3  0.600178
4  0.132885
In [95]: dfl.iloc[4:6]
Out[95]: 
         A         B
4  0.27423  0.132885 

A single indexer that is out of bounds will raise an IndexError. A list of indexers where any element is out of bounds will raise an IndexError.

In [96]: dfl.iloc[[4, 5, 6]]
---------------------------------------------------------------------------
IndexError  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexing.py:1714, in _iLocIndexer._get_list_axis(self, key, axis)
  1713 try:
-> 1714     return self.obj._take_with_is_copy(key, axis=axis)
  1715 except IndexError as err:
  1716     # re-raise with different error message, e.g. test_getitem_ndarray_3d
File ~/work/pandas/pandas/pandas/core/generic.py:4153, in NDFrame._take_with_is_copy(self, indices, axis)
  4144  """
  4145 Internal version of the `take` method that sets the `_is_copy`
  4146 attribute to keep track of the parent dataframe (using in indexing
 (...)
  4151 See the docstring of `take` for full explanation of the parameters.
  4152 """
-> 4153 result = self.take(indices=indices, axis=axis)
  4154 # Maybe set copy if we didn't actually change the index.
File ~/work/pandas/pandas/pandas/core/generic.py:4133, in NDFrame.take(self, indices, axis, **kwargs)
  4129     indices = np.arange(
  4130         indices.start, indices.stop, indices.step, dtype=np.intp
  4131     )
-> 4133 new_data = self._mgr.take(
  4134     indices,
  4135     axis=self._get_block_manager_axis(axis),
  4136     verify=True,
  4137 )
  4138 return self._constructor_from_mgr(new_data, axes=new_data.axes).__finalize__(
  4139     self, method="take"
  4140 )
File ~/work/pandas/pandas/pandas/core/internals/managers.py:891, in BaseBlockManager.take(self, indexer, axis, verify)
  890 n = self.shape[axis]
--> 891 indexer = maybe_convert_indices(indexer, n, verify=verify)
  893 new_labels = self.axes[axis].take(indexer)
File ~/work/pandas/pandas/pandas/core/indexers/utils.py:282, in maybe_convert_indices(indices, n, verify)
  281     if mask.any():
--> 282         raise IndexError("indices are out-of-bounds")
  283 return indices
IndexError: indices are out-of-bounds
The above exception was the direct cause of the following exception:
IndexError  Traceback (most recent call last)
Cell In[96], line 1
----> 1 dfl.iloc[[4, 5, 6]]
File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
  1189 maybe_callable = com.apply_if_callable(key, self.obj)
  1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1743, in _iLocIndexer._getitem_axis(self, key, axis)
  1741 # a list of integers
  1742 elif is_list_like_indexer(key):
-> 1743     return self._get_list_axis(key, axis=axis)
  1745 # a single integer
  1746 else:
  1747     key = item_from_zerodim(key)
File ~/work/pandas/pandas/pandas/core/indexing.py:1717, in _iLocIndexer._get_list_axis(self, key, axis)
  1714     return self.obj._take_with_is_copy(key, axis=axis)
  1715 except IndexError as err:
  1716     # re-raise with different error message, e.g. test_getitem_ndarray_3d
-> 1717     raise IndexError("positional indexers are out-of-bounds") from err
IndexError: positional indexers are out-of-bounds 
In [97]: dfl.iloc[:, 4]
---------------------------------------------------------------------------
IndexError  Traceback (most recent call last)
Cell In[97], line 1
----> 1 dfl.iloc[:, 4]
File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
  1182     if self._is_scalar_access(key):
  1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
  1185 else:
  1186     # we by definition only have the 0th axis
  1187     axis = self.axis or 0
File ~/work/pandas/pandas/pandas/core/indexing.py:1690, in _iLocIndexer._getitem_tuple(self, tup)
  1689 def _getitem_tuple(self, tup: tuple):
-> 1690     tup = self._validate_tuple_indexer(tup)
  1691     with suppress(IndexingError):
  1692         return self._getitem_lowerdim(tup)
File ~/work/pandas/pandas/pandas/core/indexing.py:966, in _LocationIndexer._validate_tuple_indexer(self, key)
  964 for i, k in enumerate(key):
  965     try:
--> 966         self._validate_key(k, i)
  967     except ValueError as err:
  968         raise ValueError(
  969             "Location based indexing can only have "
  970             f"[{self._valid_types}] types"
  971         ) from err
File ~/work/pandas/pandas/pandas/core/indexing.py:1592, in _iLocIndexer._validate_key(self, key, axis)
  1590     return
  1591 elif is_integer(key):
-> 1592     self._validate_integer(key, axis)
  1593 elif isinstance(key, tuple):
  1594     # a tuple should already have been caught by this point
  1595     # so don't treat a tuple as a valid indexer
  1596     raise IndexingError("Too many indexers")
File ~/work/pandas/pandas/pandas/core/indexing.py:1685, in _iLocIndexer._validate_integer(self, key, axis)
  1683 len_axis = len(self.obj._get_axis(axis))
  1684 if key >= len_axis or key < -len_axis:
-> 1685     raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds 
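
The contrast between forgiving slices and strict scalars or lists can be summarised in a few lines. This is a sketch under stated assumptions: the frame `frame` and the helper `raises_index_error` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

frame = pd.DataFrame({"A": range(5), "B": range(5, 10)})

# Slices are clipped to the axis, so these never raise.
empty_cols = frame.iloc[:, 2:3]  # starts beyond the last column: no columns
tail = frame.iloc[4:6]           # truncated at the end: one row


def raises_index_error(indexer):
    """Return True if frame.iloc[indexer] raises IndexError."""
    try:
        frame.iloc[indexer]
    except IndexError:
        return True
    return False


list_oob = raises_index_error([4, 5, 6])           # list with out-of-bounds items
scalar_oob = raises_index_error((slice(None), 4))  # same as frame.iloc[:, 4]

print(empty_cols.shape, len(tail), list_oob, scalar_oob)
```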
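## Selection by callable

`.loc`, `.iloc`, and also `[]` indexing can accept a `callable` as indexer. The `callable` must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.

Note

For `.iloc` indexing, returning a tuple from the callable is not supported, since tuple destructuring for row and column indexes occurs before applying callables.
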
In [98]: df1 = pd.DataFrame(np.random.randn(6, 4),
 ....:                   index=list('abcdef'),
 ....:                   columns=list('ABCD'))
 ....: 
In [99]: df1
Out[99]: 
          A         B         C         D
a -0.023688  2.410179  1.450520  0.206053
b -0.251905 -2.213588  1.063327  1.266143
c  0.299368 -0.863838  0.408204 -1.048089
d -0.025747 -0.988387  0.094055  1.262731
e  1.289997  0.082423 -0.055758  0.536580
f -0.489682  0.369374 -0.034571 -2.484478
In [100]: df1.loc[lambda df: df['A'] > 0, :]
Out[100]: 
          A         B         C         D
c  0.299368 -0.863838  0.408204 -1.048089
e  1.289997  0.082423 -0.055758  0.536580
In [101]: df1.loc[:, lambda df: ['A', 'B']]
Out[101]: 
          A         B
a -0.023688  2.410179
b -0.251905 -2.213588
c  0.299368 -0.863838
d -0.025747 -0.988387
e  1.289997  0.082423
f -0.489682  0.369374
In [102]: df1.iloc[:, lambda df: [0, 1]]
Out[102]: 
          A         B
a -0.023688  2.410179
b -0.251905 -2.213588
c  0.299368 -0.863838
d -0.025747 -0.988387
e  1.289997  0.082423
f -0.489682  0.369374
In [103]: df1[lambda df: df.columns[0]]
Out[103]: 
a   -0.023688
b   -0.251905
c    0.299368
d   -0.025747
e    1.289997
f   -0.489682
Name: A, dtype: float64 

You can use callable indexing in Series.

In [104]: df1['A'].loc[lambda s: s > 0]
Out[104]: 
c    0.299368
e    1.289997
Name: A, dtype: float64 

Using these methods / indexers, you can chain data selection operations without using a temporary variable.

In [105]: bb = pd.read_csv('data/baseball.csv', index_col='id')
In [106]: (bb.groupby(['year', 'team']).sum(numeric_only=True)
 .....:   .loc[lambda df: df['r'] > 100])
 .....: 
Out[106]: 
           stint    g    ab    r    h  X2b  ...     so   ibb   hbp    sh    sf  gidp
year team                                   ...
2007 CIN       6  379   745  101  203   35  ...  127.0  14.0   1.0   1.0  15.0  18.0
     DET       5  301  1062  162  283   54  ...  176.0   3.0  10.0   4.0   8.0  28.0
     HOU       4  311   926  109  218   47  ...  212.0   3.0   9.0  16.0   6.0  17.0
     LAN      11  413  1021  153  293   61  ...  141.0   8.0   9.0   3.0   8.0  29.0
     NYN      13  622  1854  240  509  101  ...  310.0  24.0  23.0  18.0  15.0  48.0
     SFN       5  482  1305  198  337   67  ...  188.0  51.0   8.0  16.0   6.0  41.0
     TEX       2  198   729  115  200   40  ...  140.0   4.0   5.0   2.0   8.0  16.0
     TOR       4  459  1408  187  378   96  ...  265.0  16.0  12.0   4.0  16.0  38.0
[8 rows x 18 columns] 
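
The baseball example depends on a CSV file that is not included here, so the same chaining pattern is sketched below on a small in-memory frame. The frame `games` and its values are invented for illustration; only the `groupby(...).sum(...).loc[lambda ...]` chain mirrors the session.

```python
import pandas as pd

# Hypothetical stand-in for the baseball data.
games = pd.DataFrame(
    {
        "year": [2007, 2007, 2007, 2008],
        "team": ["CIN", "CIN", "DET", "DET"],
        "r": [60, 50, 40, 120],
        "h": [100, 90, 80, 200],
    }
)

# The callable receives the aggregated frame, so no temporary variable is needed.
high_scoring = (
    games.groupby(["year", "team"])
    .sum(numeric_only=True)
    .loc[lambda df: df["r"] > 100]
)

print(high_scoring)
```

Because the lambda is evaluated against the intermediate result, the filter always refers to the aggregated `r` column rather than to the original `games` frame.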
## Combining positional and label-based indexing

If you wish to get the 0th and the 2nd elements from the index in the 'A' column, you can do:

In [107]: dfd = pd.DataFrame({'A': [1, 2, 3],
 .....:                    'B': [4, 5, 6]},
 .....:                   index=list('abc'))
 .....: 
In [108]: dfd
Out[108]: 
   A  B
a  1  4
b  2  5
c  3  6
In [109]: dfd.loc[dfd.index[[0, 2]], 'A']
Out[109]: 
a    1
c    3
Name: A, dtype: int64 

This can also be expressed using .iloc, by explicitly getting locations on the indexers and using positional indexing to select things.

In [110]: dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
Out[110]: 
a    1
c    3
Name: A, dtype: int64 

For getting multiple indexers, use .get_indexer:

In [111]: dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]
Out[111]: 
   A  B
a  1  4
c  3  6 
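
One detail worth keeping in mind when converting labels to positions: `get_indexer` reports a label that is not found as -1, and `.iloc` would read -1 as "the last position". The sketch below rebuilds `dfd` and filters those entries out; the label 'Z' and the filtering step are assumptions added for illustration.

```python
import pandas as pd

dfd = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=list("abc"))

# Two equivalent spellings of "rows 0 and 2 of column 'A'".
via_loc = dfd.loc[dfd.index[[0, 2]], "A"]
via_iloc = dfd.iloc[[0, 2], dfd.columns.get_loc("A")]

# 'Z' is not a column, so get_indexer marks it with -1.
positions = dfd.columns.get_indexer(["A", "Z"])

# Drop the -1 entries before handing the positions to .iloc.
valid = positions[positions >= 0]
found_only = dfd.iloc[:, valid]

print(list(positions), list(found_only.columns))
```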

Reindexing

The idiomatic way to select elements that may not be found is via .reindex(). See also the section on reindexing.

In [112]: s = pd.Series([1, 2, 3])
In [113]: s.reindex([1, 2, 3])
Out[113]: 
1    2.0
2    3.0
3    NaN
dtype: float64 

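Alternatively, if you want to select only valid keys, the following is idiomatic and efficient; it is guaranteed to preserve the dtype of the selection.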

In [114]: labels = [1, 2, 3]
In [115]: s.loc[s.index.intersection(labels)]
Out[115]: 
1    2
2    3
dtype: int64 
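
The dtype difference between the two approaches can be verified directly. This sketch repeats the session's Series; the `fill_value=0` variant is an extra illustration that is not in the original text.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
labels = [1, 2, 3]

# Label 3 is missing, so reindex inserts NaN and the result becomes float64.
reindexed = s.reindex(labels)

# Intersecting first keeps only existing labels and preserves int64.
only_valid = s.loc[s.index.intersection(labels)]

# Supplying a fill value avoids NaN for the missing label.
filled = s.reindex(labels, fill_value=0)

print(reindexed.dtype, only_valid.dtype, filled.tolist())
```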

Having a duplicated index will raise for a .reindex():

In [116]: s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
In [117]: labels = ['c', 'd']
In [118]: s.reindex(labels)
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
Cell In[118], line 1
----> 1 s.reindex(labels)
File ~/work/pandas/pandas/pandas/core/series.py:5153, in Series.reindex(self, index, axis, method, copy, level, fill_value, limit, tolerance)
  5136 @doc(
  5137     NDFrame.reindex,  # type: ignore[has-type]
  5138     klass=_shared_doc_kwargs["klass"],
   (...)
  5151     tolerance=None,
  5152 ) -> Series:
-> 5153     return super().reindex(
  5154         index=index,
  5155         method=method,
  5156         copy=copy,
  5157         level=level,
  5158         fill_value=fill_value,
  5159         limit=limit,
  5160         tolerance=tolerance,
  5161     )
File ~/work/pandas/pandas/pandas/core/generic.py:5610, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
  5607     return self._reindex_multi(axes, copy, fill_value)
  5609 # perform the reindex on the axes
-> 5610 return self._reindex_axes(
  5611     axes, level, limit, tolerance, method, fill_value, copy
  5612 ).__finalize__(self, method="reindex")
File ~/work/pandas/pandas/pandas/core/generic.py:5633, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
  5630     continue
  5632 ax = self._get_axis(a)
-> 5633 new_index, indexer = ax.reindex(
  5634     labels, level=level, limit=limit, tolerance=tolerance, method=method
  5635 )
  5637 axis = self._get_axis_number(a)
  5638 obj = obj._reindex_with_indexers(
  5639     {axis: [new_index, indexer]},
  5640     fill_value=fill_value,
  5641     copy=copy,
  5642     allow_dups=False,
  5643 )
File ~/work/pandas/pandas/pandas/core/indexes/base.py:4429, in Index.reindex(self, target, method, level, limit, tolerance)
  4426     raise ValueError("cannot handle a non-unique multi-index!")
  4427 elif not self.is_unique:
  4428     # GH#42568
-> 4429     raise ValueError("cannot reindex on an axis with duplicate labels")
  4430 else:
  4431     indexer, _ = self.get_indexer_non_unique(target)
ValueError: cannot reindex on an axis with duplicate labels 

Generally, you can intersect the desired labels with the current axis, and then reindex.

In [119]: s.loc[s.index.intersection(labels)].reindex(labels)
Out[119]: 
c    3.0
d    NaN
dtype: float64 

However, this would still raise if your resulting index is duplicated.

In [120]: labels = ['a', 'd']
In [121]: s.loc[s.index.intersection(labels)].reindex(labels)
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
Cell In[121], line 1
----> 1 s.loc[s.index.intersection(labels)].reindex(labels)
File ~/work/pandas/pandas/pandas/core/series.py:5153, in Series.reindex(self, index, axis, method, copy, level, fill_value, limit, tolerance)
  5136 @doc(
  5137     NDFrame.reindex,  # type: ignore[has-type]
  5138     klass=_shared_doc_kwargs["klass"],
   (...)
  5151     tolerance=None,
  5152 ) -> Series:
-> 5153     return super().reindex(
  5154         index=index,
  5155         method=method,
  5156         copy=copy,
  5157         level=level,
  5158         fill_value=fill_value,
  5159         limit=limit,
  5160         tolerance=tolerance,
  5161     )
File ~/work/pandas/pandas/pandas/core/generic.py:5610, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
  5607     return self._reindex_multi(axes, copy, fill_value)
  5609 # perform the reindex on the axes
-> 5610 return self._reindex_axes(
  5611     axes, level, limit, tolerance, method, fill_value, copy
  5612 ).__finalize__(self, method="reindex")
File ~/work/pandas/pandas/pandas/core/generic.py:5633, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
  5630     continue
  5632 ax = self._get_axis(a)
-> 5633 new_index, indexer = ax.reindex(
  5634     labels, level=level, limit=limit, tolerance=tolerance, method=method
  5635 )
  5637 axis = self._get_axis_number(a)
  5638 obj = obj._reindex_with_indexers(
  5639     {axis: [new_index, indexer]},
  5640     fill_value=fill_value,
  5641     copy=copy,
  5642     allow_dups=False,
  5643 )
File ~/work/pandas/pandas/pandas/core/indexes/base.py:4429, in Index.reindex(self, target, method, level, limit, tolerance)
  4426     raise ValueError("cannot handle a non-unique multi-index!")
  4427 elif not self.is_unique:
  4428     # GH#42568
-> 4429     raise ValueError("cannot reindex on an axis with duplicate labels")
  4430 else:
  4431     indexer, _ = self.get_indexer_non_unique(target)
ValueError: cannot reindex on an axis with duplicate labels 
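
The text above stops at the error. One possible way forward, sketched here as an assumption rather than as the guide's recommendation, is to decide which duplicate to keep before reindexing. Keeping the first occurrence with `Index.duplicated` is only one policy; aggregating the duplicates may suit other data better.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4), index=["a", "a", "b", "c"])
labels = ["a", "d"]

# Reindexing an axis with duplicate labels raises ValueError.
try:
    s.reindex(labels)
    dup_error = None
except ValueError as exc:
    dup_error = exc

# Assumed policy: keep the first occurrence of each label, then reindex.
deduplicated = s[~s.index.duplicated(keep="first")]
result = deduplicated.reindex(labels)

print(type(dup_error).__name__)
print(result)
```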
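## Selecting random samples

A random selection of rows or columns from a Series or DataFrame can be made with the `sample()` method. The method samples rows by default, and accepts either a specific number of rows/columns to return or a fraction of rows.
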
In [122]: s = pd.Series([0, 1, 2, 3, 4, 5])
# When no arguments are passed, returns 1 row.
In [123]: s.sample()
Out[123]: 
4    4
dtype: int64
# One may specify either a number of rows:
In [124]: s.sample(n=3)
Out[124]: 
0    0
4    4
1    1
dtype: int64
# Or a fraction of the rows:
In [125]: s.sample(frac=0.5)
Out[125]: 
5    5
3    3
1    1
dtype: int64 

By default, sample will return each row at most once, but one can also sample with replacement using the replace option:

In [126]: s = pd.Series([0, 1, 2, 3, 4, 5])
# Without replacement (default):
In [127]: s.sample(n=6, replace=False)
Out[127]: 
0    0
1    1
5    5
3    3
2    2
4    4
dtype: int64
# With replacement:
In [128]: s.sample(n=6, replace=True)
Out[128]: 
0    0
4    4
3    3
2    2
4    4
4    4
dtype: int64 

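By default, each row has an equal probability of being selected, but if you want rows to have different probabilities, you can pass the sample function sampling weights as weights. These weights can be a list, a NumPy array, or a Series, but they must be of the same length as the object you are sampling. Missing values will be treated as a weight of zero, and inf values are not allowed. If the weights do not sum to 1, they will be re-normalized by dividing all weights by the sum of the weights. For example: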

In [129]: s = pd.Series([0, 1, 2, 3, 4, 5])
In [130]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
In [131]: s.sample(n=3, weights=example_weights)
Out[131]: 
5    5
4    4
3    3
dtype: int64
# Weights will be re-normalized automatically
In [132]: example_weights2 = [0.5, 0, 0, 0, 0, 0]
In [133]: s.sample(n=1, weights=example_weights2)
Out[133]: 
0    0
dtype: int64 
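
Two consequences of the weighting rules can be checked without relying on a particular random draw: rows with zero weight are never selected, and weights that do not sum to 1 are accepted. The weights and seeds below are arbitrary values chosen for illustration.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# These weights sum to 10; sample() rescales them to probabilities.
weights = [0, 0, 2, 2, 2, 4]

# Without replacement, 4 draws from 4 non-zero weights must return exactly
# the rows holding 2, 3, 4 and 5 (in some order).
drawn = s.sample(n=4, weights=weights, random_state=0)

# With replacement, zero-weight rows still never appear.
drawn_many = s.sample(n=50, replace=True, weights=weights, random_state=0)

print(sorted(drawn.tolist()), drawn_many.isin([0, 1]).any())
```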

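When applied to a DataFrame, you can use a column of the DataFrame as sampling weights (provided you are sampling rows and not columns) by simply passing the name of the column as a string.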

In [134]: df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
 .....:                    'weight_column': [0.5, 0.4, 0.1, 0]})
 .....: 
In [135]: df2.sample(n=3, weights='weight_column')
Out[135]: 
   col1  weight_column
1     8            0.4
0     9            0.5
2     7            0.1 

sample also allows users to sample columns instead of rows using the axis argument.

In [136]: df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
In [137]: df3.sample(n=1, axis=1)
Out[137]: 
   col1
0     1
1     2
2     3 
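
The two DataFrame-specific options, a weight column and column sampling, can be exercised together as in the sketch below. The frame mirrors `df2` from the session; the seed value is an arbitrary assumption added so the run is repeatable.

```python
import pandas as pd

df2 = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})

# Row sampling weighted by a column: the last row has weight 0, so it can
# never be drawn, and n=3 must return the other three rows.
picked_rows = df2.sample(n=3, weights="weight_column", random_state=1)

# Column sampling: axis=1 draws columns and keeps every row.
picked_cols = df2.sample(n=1, axis=1, random_state=1)

print(sorted(picked_rows.index), picked_cols.shape)
```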

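Finally, one can also set a seed for sample's random number generator using the random_state argument, which will accept either an integer (as a seed) or a NumPy RandomState object.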

In [138]: df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
# With a given seed, the sample will always draw the same rows.
In [139]: df4.sample(n=2, random_state=2)
Out[139]: 
   col1  col2
2     3     4
1     2     3
In [140]: df4.sample(n=2, random_state=2)
Out[140]: 
   col1  col2
2     3     4
1     2     3 
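
Reproducibility can be asserted rather than eyeballed. In this sketch the integer seed and a freshly constructed `np.random.RandomState` with the same seed are compared; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df4 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})

# The same integer seed yields the same rows on every call.
first = df4.sample(n=2, random_state=2)
second = df4.sample(n=2, random_state=2)

# A RandomState object seeded with the same value is accepted as well.
from_state = df4.sample(n=2, random_state=np.random.RandomState(2))

print(first.equals(second), first.equals(from_state))
```

Note that a RandomState object carries state between calls, so reusing one object for several `sample` calls advances the generator and gives different draws each time.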

Pandas 2.2 Chinese Official Tutorials and Guides (11·1) (3): https://developer.aliyun.com/article/1509862
