Cross-section
The xs() method of DataFrame additionally takes a level argument, which makes it easier to select data at a particular level of a MultiIndex.
In [70]: df
Out[70]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [71]: df.xs("one", level="second")
Out[71]:
              A         B         C
first
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466

# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466
You can also select on the columns with xs by providing the axis argument.
In [73]: df = df.T

In [74]: df.xs("one", level="second", axis=1)
Out[74]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466
xs also allows selection with multiple keys.
In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]:
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681

# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64
You can pass drop_level=False to xs to retain the level that was selected.
In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466
Compare the above with the result using drop_level=True (the default value).
In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466
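The sessions above depend on a df built earlier in the guide. As a self-contained sketch of the same idea on the row axis, the snippet below uses a small hypothetical frame (the name frame and its values are made up for illustration) to compare the two drop_level settings with the slicer form.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the values are arbitrary, only the index layout matters.
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [5.0, 6.0, 7.0, 8.0]}, index=index
)

# drop_level=True (the default): the selected level is removed from the result.
dropped = frame.xs("one", level="second")

# drop_level=False: the selected level is kept, like the slicer form below.
kept = frame.xs("one", level="second", drop_level=False)
sliced = frame.loc[(slice(None), "one"), :]

print(list(dropped.index.names))
print(kept.index.tolist())
print(sliced.index.tolist())
```

With drop_level=False the result keeps both index levels, so it can be compared directly against the output of the slicer-based selection.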
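Advanced reindexing and alignment
Using the level parameter in the reindex() and align() methods of pandas objects is useful for broadcasting values across a level. For instance: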
In [80]: midx = pd.MultiIndex(
   ....:     levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
   ....: )
   ....:

In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [82]: df
Out[82]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [83]: df2 = df.groupby(level=0).mean()

In [84]: df2
Out[84]:
             0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [85]: df2.reindex(df.index, level=0)
Out[85]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)

In [87]: df_aligned
Out[87]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [88]: df2_aligned
Out[88]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416
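One common use of this broadcasting is subtracting a per-group statistic from every row of the group. The following is a minimal sketch with made-up values; the names values, group_means, broadcast, and centered are illustrative and do not come from the guide.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([["one", "zero"], ["x", "y"]])
values = pd.DataFrame({"v": [1.0, 3.0, 10.0, 20.0]}, index=midx)

# One row per outer label.
group_means = values.groupby(level=0).mean()

# Repeat each group mean for every inner label of that group.
broadcast = group_means.reindex(values.index, level=0)

# Both frames now share the same MultiIndex, so subtraction aligns row by row.
centered = values - broadcast
print(centered)
```

Because broadcast has the same index as values, the subtraction needs no further alignment step.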
Swapping levels with swaplevel
The swaplevel() method can switch the order of two levels:
In [89]: df[:5]
Out[89]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
Reordering levels with reorder_levels
The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:
In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
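With only two levels, swaplevel and reorder_levels give the same result, so the difference is easier to see with three. The sketch below uses a hypothetical three-level Series (the level names outer, mid, and inner are made up) to contrast the two.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("a", 1, "x"), ("a", 2, "y"), ("b", 1, "x")],
    names=["outer", "mid", "inner"],
)
s = pd.Series([10, 20, 30], index=idx)

# swaplevel exchanges exactly two levels and leaves the rest in place.
swapped = s.swaplevel("outer", "inner")

# reorder_levels takes a full permutation of all levels.
rotated = s.reorder_levels(["inner", "outer", "mid"])

print(list(swapped.index.names))
print(list(rotated.index.names))
```

Neither call reorders the rows; only the level layout of each index entry changes.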
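Renaming names of an Index or MultiIndex
The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename accepts a dictionary that includes only the columns you wish to rename.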
In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]:
            col0      col1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520
This method can also be used to rename specific labels of the main index of the DataFrame.
In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]:
               0         1
two  z  1.519970 -0.493662
     x  0.600178  0.274230
zero z  0.132885 -0.023688
     x  2.410179  1.450520
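The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to the columns.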
In [94]: df.rename_axis(index=["abc", "def"])
Out[94]:
                 0         1
abc  def
one  y    1.519970 -0.493662
     x    0.600178  0.274230
zero y    0.132885 -0.023688
     x    2.410179  1.450520
Note that the columns of a DataFrame are an index, so using rename_axis with the columns argument changes the name of that index.
In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')
Both rename and rename_axis support specifying a dictionary, a Series, or a mapping function to map labels/names to new values.
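To illustrate the dictionary and function forms side by side, here is a small sketch on a hypothetical frame (the column names c0 and c1 and the level names abc and def are made up). rename acts on the labels, while rename_axis acts on the level names.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["one", "zero"], ["x", "y"]], names=["abc", "def"]
)
df = pd.DataFrame({"c0": range(4), "c1": range(4, 8)}, index=idx)

# rename: a function is applied to every label of every index level.
upper_labels = df.rename(index=str.upper)

# rename: a dict only touches the labels it mentions.
renamed_cols = df.rename(columns={"c0": "first"})

# rename_axis: a dict maps old level names to new ones.
renamed_names = df.rename_axis(index={"abc": "outer"})

# rename_axis: a function is applied to every level name.
func_names = df.rename_axis(index=str.upper)

print(upper_labels.index.tolist())
print(list(renamed_names.index.names), list(func_names.index.names))
```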
When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used to change the names.
In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])

In [98]: mi2 = mi.rename("new name", level=0)

In [99]: mi2
Out[99]:
MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['new name', 'y'])
You cannot set the names of the MultiIndex via a level.
In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"

File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value)
   1686 @name.setter
   1687 def name(self, value: Hashable) -> None:
   1688     if self._no_setting_name:
   1689         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690         raise RuntimeError(
   1691             "Cannot set name on a level of a MultiIndex. Use "
   1692             "'MultiIndex.set_names' instead."
   1693         )
   1694     maybe_extract_name(value, None, type(self))
   1695     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.
Use Index.set_names() instead.
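As a brief sketch of that alternative, set_names returns a new index and accepts either a single name together with a level, or a full list of names. The new names used here are arbitrary.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

# Rename a single level; the original index is left unchanged.
one_level = mi.set_names("name via set_names", level=0)

# Rename all levels at once by passing one name per level.
all_levels = mi.set_names(["outer", "inner"])

print(list(one_level.names))
print(list(all_levels.names))
print(list(mi.names))
```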
Sorting a MultiIndex
For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().
In [101]: import random

In [102]: random.shuffle(tuples)

In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [104]: s
Out[104]:
qux  two    0.206053
bar  one   -0.251905
foo  one   -2.213588
qux  one    1.063327
foo  two    1.266143
baz  two    0.299368
bar  two   -0.863838
baz  one    0.408204
dtype: float64

In [105]: s.sort_index()
Out[105]:
bar  one   -0.251905
     two   -0.863838
baz  one    0.408204
     two    0.299368
foo  one   -2.213588
     two    1.266143
qux  one    1.063327
     two    0.206053
dtype: float64

In [106]: s.sort_index(level=0)
Out[106]:
bar  one   -0.251905
     two   -0.863838
baz  one    0.408204
     two    0.299368
foo  one   -2.213588
     two    1.266143
qux  one    1.063327
     two    0.206053
dtype: float64

In [107]: s.sort_index(level=1)
Out[107]:
bar  one   -0.251905
baz  one    0.408204
foo  one   -2.213588
qux  one    1.063327
bar  two   -0.863838
baz  two    0.299368
foo  two    1.266143
qux  two    0.206053
dtype: float64
You may also pass a level name to sort_index if the MultiIndex levels are named.
In [108]: s.index = s.index.set_names(["L1", "L2"])

In [109]: s.sort_index(level="L1")
Out[109]:
L1   L2
bar  one   -0.251905
     two   -0.863838
baz  one    0.408204
     two    0.299368
foo  one   -2.213588
     two    1.266143
qux  one    1.063327
     two    0.206053
dtype: float64

In [110]: s.sort_index(level="L2")
Out[110]:
L1   L2
bar  one   -0.251905
baz  one    0.408204
foo  one   -2.213588
qux  one    1.063327
bar  two   -0.863838
baz  two    0.299368
foo  two    1.266143
qux  two    0.206053
dtype: float64
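The session above relies on a tuples list defined earlier in the guide. A self-contained sketch of sorting by level name, using a small hypothetical Series with integer values so the resulting order is easy to check by eye, might look like this:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("qux", "two"), ("bar", "one"), ("qux", "one"), ("bar", "two")],
    names=["L1", "L2"],
)
s = pd.Series([1, 2, 3, 4], index=idx)

# Sort primarily by the named level; the remaining level breaks ties
# because sort_remaining defaults to True.
by_l1 = s.sort_index(level="L1")
by_l2 = s.sort_index(level="L2")

print(by_l1.index.tolist())
print(by_l2.index.tolist())
```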
On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:
In [111]: df.T.sort_index(level=1, axis=1)
Out[111]:
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688
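Indexing will work even if the data are not sorted, but it will be rather inefficient (and will show a PerformanceWarning). It will also return a copy of the data rather than a view: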
In [112]: dfm = pd.DataFrame(
   .....:     {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
   .....: )
   .....:

In [113]: dfm = dfm.set_index(["jim", "joe"])

In [114]: dfm
Out[114]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968

In [115]: dfm.loc[(1, 'z')]
Out[115]:
           jolie
jim joe
1   z    0.53702
Furthermore, if you try to index something that is not fully lexsorted, this can raise:
In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError                        Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6626
   (...)
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2904, in MultiIndex.slice_locs(self, start, end, step)
   2852 """
   2853 For an ordered MultiIndex, compute the slice locations for input
   2854 labels.
   (...)
   2900                           sequence of such.
   2901 """
   2902 # This function adds nothing to its parent implementation (the magic
   2903 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2904 return super().slice_locs(start, end, step)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
   6877 start_slice = None
   6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
   6880 if start_slice is None:
   6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2848, in MultiIndex.get_slice_bound(self, label, side)
   2846 if not isinstance(label, tuple):
   2847     label = (label,)
-> 2848 return self._partial_tup_index(label, side=side)

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2908, in MultiIndex._partial_tup_index(self, tup, side)
   2906 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
   2907     if len(tup) > self._lexsort_depth:
-> 2908         raise UnsortedIndexError(
   2909             f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
   2910             f"({self._lexsort_depth})"
   2911         )
   2913     n = len(tup)
   2914     start, end = 0, len(self)

UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
The is_monotonic_increasing property of a MultiIndex shows whether the index is sorted:
In [117]: dfm.index.is_monotonic_increasing
Out[117]: False

In [118]: dfm = dfm.sort_index()

In [119]: dfm
Out[119]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [120]: dfm.index.is_monotonic_increasing
Out[120]: True
And now selection works as expected.
In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]:
           jolie
jim joe
1   y    0.110968
    z    0.537020
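Putting the pieces of this section together, one defensive pattern is to check is_monotonic_increasing and sort only when needed before slicing. The sketch below rebuilds dfm with fixed values instead of random ones; the variable names raised, sorted_dfm, and result are illustrative.

```python
import pandas as pd

dfm = pd.DataFrame(
    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": [0.1, 0.2, 0.3, 0.4]}
).set_index(["jim", "joe"])

# Slicing the unsorted index is expected to fail, as in the traceback above.
try:
    dfm.loc[(0, "y"):(1, "z")]
    raised = False
except pd.errors.UnsortedIndexError:
    raised = True

# Sort only if the index is not already sorted, then slice.
sorted_dfm = dfm if dfm.index.is_monotonic_increasing else dfm.sort_index()
result = sorted_dfm.loc[(0, "y"):(1, "z")]

print(raised)
print(result)
```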
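Take methods
Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method, which retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as positions relative to the end of the object.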
In [122]: index = pd.Index(np.random.randint(0, 1000, 10))

In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [124]: positions = [0, 9, 3]

In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')

In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')

In [127]: ser = pd.Series(np.random.randn(10))

In [128]: ser.iloc[positions]
Out[128]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [129]: ser.take(positions)
Out[129]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64
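The session above only uses non-negative positions. As a small sketch of the negative-position behaviour mentioned in the text, with made-up data:

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40], index=list("abcd"))
idx = pd.Index([100, 200, 300])

# Negative positions count from the end, as with Python lists.
first_and_last = ser.take([0, -1])
last_two_reversed = idx.take([-1, -2])

print(first_and_last)
print(last_two_reversed)
```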
For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.
In [130]: frm = pd.DataFrame(np.random.randn(5, 3))

In [131]: frm.take([1, 4, 3])
Out[131]:
          0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [132]: frm.take([0, 2], axis=1)
Out[132]:
          0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704
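It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results. In the example below, the booleans are treated as the integer positions 0 and 1 rather than as a mask.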
In [133]: arr = np.random.randn(10)

In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [135]: arr[[0, 1]]
Out[135]: array([-1.1935,  0.6775])

In [136]: ser = pd.Series(np.random.randn(10))

In [137]: ser.take([False, False, True, True])
Out[137]:
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [138]: ser.iloc[[0, 1]]
Out[138]:
0    0.233141
1   -0.223540
dtype: float64
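If the starting point is a boolean mask, one way to stay within what take expects is to convert the mask to integer positions first. The sketch below uses np.flatnonzero for that conversion; this is a suggestion under that assumption, not something the guide itself prescribes.

```python
import numpy as np
import pandas as pd

ser = pd.Series([10, 20, 30, 40])
mask = np.array([False, False, True, True])

# Convert the mask to the integer positions where it is True.
positions = np.flatnonzero(mask)

# take with integer positions selects the same rows as boolean indexing.
picked = ser.take(positions)
masked = ser[mask]

print(positions)
print(picked.equals(masked))
```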
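Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can be a good deal faster than fancy indexing.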
In [139]: arr = np.random.randn(10000, 5)

In [140]: indexer = np.arange(10000)

In [141]: random.shuffle(indexer)

In [142]: %timeit arr[indexer]
   .....: %timeit arr.take(indexer, axis=0)
   .....:
257 us +- 4.44 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
79.7 us +- 1.15 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

In [143]: ser = pd.Series(arr[:, 0])

In [144]: %timeit ser.iloc[indexer]
   .....: %timeit ser.take(indexer)
   .....:
144 us +- 3.69 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
129 us +- 2 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
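The %timeit magic is only available in IPython. A rough equivalent for a plain Python script, using the standard timeit module, is sketched below. The repetition count of 200 is an arbitrary choice, and the timings will vary by machine and library version, so no particular speedup should be assumed.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal((10_000, 5))
indexer = rng.permutation(10_000)

# Both approaches select the same rows in the same order.
same_result = np.array_equal(arr[indexer], arr.take(indexer, axis=0))

# Time each approach over the same number of repetitions.
fancy = timeit.timeit(lambda: arr[indexer], number=200)
take = timeit.timeit(lambda: arr.take(indexer, axis=0), number=200)

print(f"same result: {same_result}")
print(f"fancy indexing: {fancy:.4f}s, take: {take:.4f}s")
```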