Pandas 2.2 中文官方教程和指南(十二·二)(2)

简介: Pandas 2.2 中文官方教程和指南(十二·二)

Pandas 2.2 中文官方教程和指南(十二·二)(1)https://developer.aliyun.com/article/1509799

交叉部分

DataFramexs()方法另外接受一个级别参数,使得在MultiIndex的特定级别选择数据更容易。

In [70]: df
Out[70]: 
 A         B         C
first second 
bar   one     0.895717  0.410835 -1.413681
 two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
 two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
 two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
 two    -0.226169 -1.436737 -2.006747
In [71]: df.xs("one", level="second")
Out[71]: 
 A         B         C
first 
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466 
# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]: 
 A         B         C
first second 
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466 

你也可以通过提供轴参数在列上使用xs进行选择。

In [73]: df = df.T
In [74]: df.xs("one", level="second", axis=1)
Out[74]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466 
# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466 

xs还允许使用多个键进行选择。

In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]: 
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681 
# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64 

你可以传递drop_level=Falsexs以保留所选的级别。

In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466 

将上述与使用drop_level=True(默认值)的结果进行比较。

In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466 

高级重新索引和对齐

在 pandas 对象的reindex()align()方法中使用参数level对值进行广播是很有用的。例如:

In [80]: midx = pd.MultiIndex(
 ....:    levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
 ....: )
 ....: 
In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)
In [82]: df
Out[82]: 
 0         1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520
In [83]: df2 = df.groupby(level=0).mean()
In [84]: df2
Out[84]: 
 0         1
one   1.060074 -0.109716
zero  1.271532  0.713416
In [85]: df2.reindex(df.index, level=0)
Out[85]: 
 0         1
one  y  1.060074 -0.109716
 x  1.060074 -0.109716
zero y  1.271532  0.713416
 x  1.271532  0.713416
# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)
In [87]: df_aligned
Out[87]: 
 0         1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520
In [88]: df2_aligned
Out[88]: 
 0         1
one  y  1.060074 -0.109716
 x  1.060074 -0.109716
zero y  1.271532  0.713416
 x  1.271532  0.713416 

使用swaplevel交换级别

swaplevel()方法可以交换两个级别的顺序:

In [89]: df[:5]
Out[89]: 
 0         1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520
In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]: 
 0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520 

使用reorder_levels重新排序级别

reorder_levels()方法推广了swaplevel方法,允许您一次性对分层索引级别进行排列:

In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]: 
 0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520 

重命名IndexMultiIndex的名称

rename()方法用于重命名MultiIndex的标签,通常用于重命名DataFrame的列。renamecolumns参数允许指定一个字典,其中只包含您希望重命名的列。

In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]: 
 col0      col1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520 

这种方法也可以用于重命名DataFrame主索引的特定标签。

In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]: 
 0         1
two  z  1.519970 -0.493662
 x  0.600178  0.274230
zero z  0.132885 -0.023688
 x  2.410179  1.450520 

rename_axis()方法用于重命名IndexMultiIndex的名称。特别是,可以指定MultiIndex级别的名称,这在稍后使用reset_index()将值从MultiIndex移动到列时非常有用。

In [94]: df.rename_axis(index=["abc", "def"])
Out[94]: 
 0         1
abc  def 
one  y    1.519970 -0.493662
 x    0.600178  0.274230
zero y    0.132885 -0.023688
 x    2.410179  1.450520

注意,DataFrame 的列是一个索引,因此使用 rename_axis 时使用 columns 参数会更改该索引的名称。

In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols') 

renamerename_axis都支持指定字典、Series或映射函数来将标签/名称映射到新值。

当直接使用Index对象而不是通过DataFrame时,可以使用Index.set_names()来更改名称。

In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])
In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])
In [98]: mi2 = mi.rename("new name", level=0)
In [99]: mi2
Out[99]: 
MultiIndex([(1, 'a'),
 (1, 'b'),
 (2, 'a'),
 (2, 'b')],
 names=['new name', 'y']) 

不能通过级别设置 MultiIndex 的名称。

In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError  Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"
File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value)
  1686 @name.setter
  1687 def name(self, value: Hashable) -> None:
  1688     if self._no_setting_name:
  1689         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690         raise RuntimeError(
  1691             "Cannot set name on a level of a MultiIndex. Use "
  1692             "'MultiIndex.set_names' instead."
  1693         )
  1694     maybe_extract_name(value, None, type(self))
  1695     self._name = value
RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead. 

而是使用 Index.set_names()

MultiIndex进行排序

为了有效地对MultiIndex对象进行索引和切片,它们需要被排序。与任何索引一样,您可以使用sort_index()

In [101]: import random
In [102]: random.shuffle(tuples)
In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
In [104]: s
Out[104]: 
qux  two    0.206053
bar  one   -0.251905
foo  one   -2.213588
qux  one    1.063327
foo  two    1.266143
baz  two    0.299368
bar  two   -0.863838
baz  one    0.408204
dtype: float64
In [105]: s.sort_index()
Out[105]: 
bar  one   -0.251905
 two   -0.863838
baz  one    0.408204
 two    0.299368
foo  one   -2.213588
 two    1.266143
qux  one    1.063327
 two    0.206053
dtype: float64
In [106]: s.sort_index(level=0)
Out[106]: 
bar  one   -0.251905
 two   -0.863838
baz  one    0.408204
 two    0.299368
foo  one   -2.213588
 two    1.266143
qux  one    1.063327
 two    0.206053
dtype: float64
In [107]: s.sort_index(level=1)
Out[107]: 
bar  one   -0.251905
baz  one    0.408204
foo  one   -2.213588
qux  one    1.063327
bar  two   -0.863838
baz  two    0.299368
foo  two    1.266143
qux  two    0.206053
dtype: float64 

如果MultiIndex的级别已命名,还可以将级别名称传递给sort_index

In [108]: s.index = s.index.set_names(["L1", "L2"])
In [109]: s.sort_index(level="L1")
Out[109]: 
L1   L2 
bar  one   -0.251905
 two   -0.863838
baz  one    0.408204
 two    0.299368
foo  one   -2.213588
 two    1.266143
qux  one    1.063327
 two    0.206053
dtype: float64
In [110]: s.sort_index(level="L2")
Out[110]: 
L1   L2 
bar  one   -0.251905
baz  one    0.408204
foo  one   -2.213588
qux  one    1.063327
bar  two   -0.863838
baz  two    0.299368
foo  two    1.266143
qux  two    0.206053
dtype: float64 

在更高维度的对象上,如果它们具有MultiIndex,则可以按级别对任何其他轴进行排序:

In [111]: df.T.sort_index(level=1, axis=1)
Out[111]: 
 one      zero       one      zero
 x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688 

即使数据未排序,索引也会起作用,但效率会非常低(并显示PerformanceWarning)。它还将返回数据的副本而不是视图:

In [112]: dfm = pd.DataFrame(
 .....:    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
 .....: )
 .....: 
In [113]: dfm = dfm.set_index(["jim", "joe"])
In [114]: dfm
Out[114]: 
 jolie
jim joe 
0   x    0.490671
 x    0.120248
1   z    0.537020
 y    0.110968
In [115]: dfm.loc[(1, 'z')]
Out[115]: 
 jolie
jim joe 
1   z    0.53702 

此外,如果尝试索引未完全按字典排序的内容,可能会引发:

In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError  Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]
File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
  1189 maybe_callable = com.apply_if_callable(key, self.obj)
  1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
  1409 if isinstance(key, slice):
  1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
  1412 elif com.is_bool_indexer(key):
  1413     return self._getbool_axis(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
  1440     return obj.copy(deep=False)
  1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
  1445 if isinstance(indexer, slice):
  1446     return self.obj._slice(indexer, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
  6618 def slice_indexer(
  6619     self,
  6620     start: Hashable | None = None,
  6621     end: Hashable | None = None,
  6622     step: int | None = None,
  6623 ) -> slice:
  6624  """
  6625 Compute the slice indexer for input labels and step.
  6626  
 (...)
  6660 slice(1, 3, None)
  6661 """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
  6664     # return a slice
  6665     if not is_scalar(start_slice):
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2904, in MultiIndex.slice_locs(self, start, end, step)
  2852  """
  2853 For an ordered MultiIndex, compute the slice locations for input
  2854 labels.
 (...)
  2900 sequence of such.
  2901 """
  2902 # This function adds nothing to its parent implementation (the magic
  2903 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2904 return super().slice_locs(start, end, step)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
  6877 start_slice = None
  6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
  6880 if start_slice is None:
  6881     start_slice = 0
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2848, in MultiIndex.get_slice_bound(self, label, side)
  2846 if not isinstance(label, tuple):
  2847     label = (label,)
-> 2848 return self._partial_tup_index(label, side=side)
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2908, in MultiIndex._partial_tup_index(self, tup, side)
  2906 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
  2907     if len(tup) > self._lexsort_depth:
-> 2908         raise UnsortedIndexError(
  2909             f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
  2910             f"({self._lexsort_depth})"
  2911         )
  2913     n = len(tup)
  2914     start, end = 0, len(self)
UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)' 

MultiIndex上的is_monotonic_increasing()方法显示索引是否已排序:

In [117]: dfm.index.is_monotonic_increasing
Out[117]: False 
In [118]: dfm = dfm.sort_index()
In [119]: dfm
Out[119]: 
 jolie
jim joe 
0   x    0.490671
 x    0.120248
1   y    0.110968
 z    0.537020
In [120]: dfm.index.is_monotonic_increasing
Out[120]: True 

现在选择按预期运行。

In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]: 
 jolie
jim joe 
1   y    0.110968
 z    0.537020 

采取方法

与 NumPy ndarrays 类似,pandas 的IndexSeriesDataFrame还提供了take()方法,该方法在给定轴上以给定索引检索元素。给定的索引必须是整数索引位置的列表或 ndarray。take还将接受负整数作为相对于对象末尾的位置。

In [122]: index = pd.Index(np.random.randint(0, 1000, 10))
In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')
In [124]: positions = [0, 9, 3]
In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')
In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')
In [127]: ser = pd.Series(np.random.randn(10))
In [128]: ser.iloc[positions]
Out[128]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64
In [129]: ser.take(positions)
Out[129]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64 

对于 DataFrame,给定的索引应该是指定行或列位置的一维列表或 ndarray。

In [130]: frm = pd.DataFrame(np.random.randn(5, 3))
In [131]: frm.take([1, 4, 3])
Out[131]: 
 0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855
In [132]: frm.take([0, 2], axis=1)
Out[132]: 
 0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704 

重要的是要注意,pandas 对象上的take方法不适用于布尔索引,可能会返回意外的结果。

In [133]: arr = np.random.randn(10)
In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935,  0.6775,  0.6775])
In [135]: arr[[0, 1]]
Out[135]: array([-1.1935,  0.6775])
In [136]: ser = pd.Series(np.random.randn(10))
In [137]: ser.take([False, False, True, True])
Out[137]: 
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64
In [138]: ser.iloc[[0, 1]]
Out[138]: 
0    0.233141
1   -0.223540
dtype: float64 

最后,关于性能的一个小提示,因为take方法处理的输入范围较窄,所以它的性能可能比花式索引快得多。

In [139]: arr = np.random.randn(10000, 5)
In [140]: indexer = np.arange(10000)
In [141]: random.shuffle(indexer)
In [142]: %timeit arr[indexer]
 .....: %timeit arr.take(indexer, axis=0)
 .....: 
257 us +- 4.44 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
79.7 us +- 1.15 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each) 
In [143]: ser = pd.Series(arr[:, 0])
In [144]: %timeit ser.iloc[indexer]
 .....: %timeit ser.take(indexer)
 .....: 
144 us +- 3.69 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
129 us +- 2 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each) 


Pandas 2.2 中文官方教程和指南(十二·二)(3)https://developer.aliyun.com/article/1509801

相关文章
|
2天前
|
SQL 数据采集 数据挖掘
Pandas 教程
10月更文挑战第25天
9 2
|
3月前
|
存储 JSON 数据格式
Pandas 使用教程 CSV - CSV 转 JSON
Pandas 使用教程 CSV - CSV 转 JSON
33 0
|
3月前
|
JSON 数据格式 Python
Pandas 使用教程 JSON
Pandas 使用教程 JSON
39 0
|
3月前
|
SQL 数据采集 JSON
Pandas 使用教程 Series、DataFrame
Pandas 使用教程 Series、DataFrame
54 0
|
5月前
|
数据采集 存储 数据可视化
Pandas高级教程:数据清洗、转换与分析
Pandas是Python的数据分析库,提供Series和DataFrame数据结构及数据分析工具,便于数据清洗、转换和分析。本教程涵盖Pandas在数据清洗(如缺失值、重复值和异常值处理)、转换(数据类型转换和重塑)和分析(如描述性统计、分组聚合和可视化)的应用。通过学习Pandas,用户能更高效地处理和理解数据,为数据分析任务打下基础。
568 3
|
6月前
|
索引 Python
Pandas 2.2 中文官方教程和指南(一)(4)
Pandas 2.2 中文官方教程和指南(一)
55 0
|
6月前
|
存储 SQL JSON
Pandas 2.2 中文官方教程和指南(一)(3)
Pandas 2.2 中文官方教程和指南(一)
91 0
|
6月前
|
XML 关系型数据库 PostgreSQL
Pandas 2.2 中文官方教程和指南(一)(2)
Pandas 2.2 中文官方教程和指南(一)
162 0
|
6月前
|
XML 关系型数据库 MySQL
Pandas 2.2 中文官方教程和指南(一)(1)
Pandas 2.2 中文官方教程和指南(一)
162 0
|
6月前
|
C++ 索引 Python
Pandas 2.2 中文官方教程和指南(五)(4)
Pandas 2.2 中文官方教程和指南(五)
45 0