# Reshaping and pivot tables
pandas provides methods for manipulating a `Series` and `DataFrame` to alter the representation of the data for further data processing or data summarization.

+ `pivot()` and `pivot_table()`: Group unique values within one or more discrete categories.
+ `stack()` and `unstack()`: Pivot a column or row level to the opposite axis respectively.
+ `melt()` and `wide_to_long()`: Unpivot a wide format `DataFrame` to a long format.
+ `get_dummies()` and `from_dummies()`: Conversions with indicator variables.
+ `explode()`: Convert a column of list-like values to individual rows.
+ `crosstab()`: Calculate a cross-tabulation of multiple 1-dimensional factor arrays.
+ `cut()`: Transform continuous variables to discrete, categorical values.
+ `factorize()`: Encode 1-dimensional variables into integer labels.
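A few of the helpers listed above (`explode()`, `cut()`, `factorize()`) are not demonstrated further in this part of the guide. The following is a minimal, illustrative sketch; the data here is invented for the example, not taken from the guide:

```py
import pandas as pd

# explode(): list-like values become one row per element (empty lists become NaN)
s = pd.Series([[1, 2], [3], []])
print(s.explode())

# cut(): bin a continuous variable into discrete categories
ages = pd.Series([5, 17, 35, 64])
print(pd.cut(ages, bins=[0, 18, 65], labels=["minor", "adult"]))

# factorize(): encode a 1-D variable as integer codes plus the unique values they map to
codes, uniques = pd.factorize(pd.Series(["b", "a", "b"]))
print(codes, uniques)
```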
## `pivot()` and `pivot_table()`

### `pivot()`
Data is often stored in so-called "stacked" or "record" format. In a "record" or "wide" format, typically there is one row for each subject. In a "stacked" or "long" format, there may be multiple rows for each subject where applicable.
```py
In [1]: data = {
   ...:    "value": range(12),
   ...:    "variable": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
   ...:    "date": pd.to_datetime(["2020-01-03", "2020-01-04", "2020-01-05"] * 4)
   ...: }
   ...: 

In [2]: df = pd.DataFrame(data)
```
To perform time series operations with each unique variable, a better representation would be where the `columns` are the unique variables and an `index` of dates identifies individual observations. To reshape the data into this form, we use the `DataFrame.pivot()` method (also implemented as a top-level function `pivot()`):
```py
In [3]: pivoted = df.pivot(index="date", columns="variable", values="value")

In [4]: pivoted
Out[4]: 
variable    A  B  C   D
date                   
2020-01-03  0  3  6   9
2020-01-04  1  4  7  10
2020-01-05  2  5  8  11
```
If the `values` argument is omitted, and the input `DataFrame` has more than one column of values that are not used as column or index inputs, then the resulting "pivoted" `DataFrame` will have hierarchical columns whose topmost level indicates the respective value column:
```py
In [5]: df["value2"] = df["value"] * 2

In [6]: pivoted = df.pivot(index="date", columns="variable")

In [7]: pivoted
Out[7]: 
           value           value2            
variable       A  B  C   D      A   B   C   D
date                                         
2020-01-03     0  3  6   9      0   6  12  18
2020-01-04     1  4  7  10      2   8  14  20
2020-01-05     2  5  8  11      4  10  16  22
```
You can then select subsets from the pivoted `DataFrame`:
```py
In [8]: pivoted["value2"]
Out[8]: 
variable    A   B   C   D
date                     
2020-01-03  0   6  12  18
2020-01-04  2   8  14  20
2020-01-05  4  10  16  22
```
Note that this returns a view on the underlying data in the case of homogeneously-typed data.
Note

`pivot()` can only handle unique rows specified by `index` and `columns`. If your data contains duplicates, use `pivot_table()`.
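As a brief illustration of the note above (the tiny `dup` frame here is invented for the example): `pivot()` raises on duplicate `index`/`columns` pairs, while `pivot_table()` resolves them by aggregating.

```py
import pandas as pd

dup = pd.DataFrame(
    {"date": ["2020-01-03", "2020-01-03"], "variable": ["A", "A"], "value": [1, 2]}
)

try:
    dup.pivot(index="date", columns="variable", values="value")
except ValueError as err:
    print(err)  # duplicate entries cannot be reshaped with pivot()

# pivot_table() aggregates the duplicates instead (mean by default)
print(dup.pivot_table(index="date", columns="variable", values="value"))
```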
### `pivot_table()`
While `pivot()` provides general-purpose pivoting with various data types, pandas also provides the top-level `pivot_table()` function and the `DataFrame.pivot_table()` method for pivoting with aggregation of numeric data.
The function `pivot_table()` can be used to create spreadsheet-style pivot tables. See the cookbook for some advanced strategies.
```py
In [9]: import datetime

In [10]: df = pd.DataFrame(
   ....:     {
   ....:         "A": ["one", "one", "two", "three"] * 6,
   ....:         "B": ["A", "B", "C"] * 8,
   ....:         "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 4,
   ....:         "D": np.random.randn(24),
   ....:         "E": np.random.randn(24),
   ....:         "F": [datetime.datetime(2013, i, 1) for i in range(1, 13)]
   ....:         + [datetime.datetime(2013, i, 15) for i in range(1, 13)],
   ....:     }
   ....: )
   ....: 

In [11]: df
Out[11]: 
        A  B    C         D         E          F
0     one  A  foo  0.469112  0.404705 2013-01-01
1     one  B  foo -0.282863  0.577046 2013-02-01
2     two  C  foo -1.509059 -1.715002 2013-03-01
3   three  A  bar -1.135632 -1.039268 2013-04-01
4     one  B  bar  1.212112 -0.370647 2013-05-01
..    ... ..  ...       ...       ...        ...
19  three  B  foo -1.087401 -0.472035 2013-08-15
20    one  C  foo -0.673690 -0.013960 2013-09-15
21    one  A  bar  0.113648 -0.362543 2013-10-15
22    two  B  bar -1.478427 -0.006154 2013-11-15
23  three  C  bar  0.524988 -0.923061 2013-12-15

[24 rows x 6 columns]

In [12]: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
Out[12]: 
C             bar       foo
A     B                    
one   A -0.995460  0.595334
      B  0.393570 -0.494817
      C  0.196903 -0.767769
three A -0.431886       NaN
      B       NaN -1.065818
      C  0.798396       NaN
two   A       NaN  0.197720
      B -0.986678       NaN
      C       NaN -1.274317

In [13]: pd.pivot_table(
   ....:     df, values=["D", "E"],
   ....:     index=["B"],
   ....:     columns=["A", "C"],
   ....:     aggfunc="sum",
   ....: )
   ....: 
Out[13]: 
           D                      ...         E                    
A        one               three  ...     three       two          
C        bar       foo       bar  ...       foo       bar       foo
B                                 ...                              
A  -1.990921  1.190667 -0.863772  ...       NaN       NaN -1.067650
B   0.787140 -0.989634       NaN  ...  0.372851   1.63741       NaN
C   0.393806 -1.535539  1.596791  ...       NaN       NaN -3.491906

[3 rows x 12 columns]

In [14]: pd.pivot_table(
   ....:     df, values="E",
   ....:     index=["B", "C"],
   ....:     columns=["A"],
   ....:     aggfunc=["sum", "mean"],
   ....: )
   ....: 
Out[14]: 
             sum                          mean                    
A            one     three       two       one     three       two
B C                                                                
A bar  -0.471593 -2.008182       NaN -0.235796 -1.004091       NaN
  foo   0.761726       NaN -1.067650  0.380863       NaN -0.533825
B bar  -1.665170       NaN  1.637410 -0.832585       NaN  0.818705
  foo  -0.097554  0.372851       NaN -0.048777  0.186425       NaN
C bar  -0.744154 -2.392449       NaN -0.372077 -1.196224       NaN
  foo   1.061810       NaN -3.491906  0.530905       NaN -1.745953
```
The result is a `DataFrame` potentially having a `MultiIndex` on the index or columns. If the `values` column name is not given, the pivot table will include all of the data in an additional level of hierarchy in the columns:
```py
In [15]: pd.pivot_table(df[["A", "B", "C", "D", "E"]], index=["A", "B"], columns=["C"])
Out[15]: 
                D                   E          
C             bar       foo       bar       foo
A     B                                        
one   A -0.995460  0.595334 -0.235796  0.380863
      B  0.393570 -0.494817 -0.832585 -0.048777
      C  0.196903 -0.767769 -0.372077  0.530905
three A -0.431886       NaN -1.004091       NaN
      B       NaN -1.065818       NaN  0.186425
      C  0.798396       NaN -1.196224       NaN
two   A       NaN  0.197720       NaN -0.533825
      B -0.986678       NaN  0.818705       NaN
      C       NaN -1.274317       NaN -1.745953
```
Also, you can use `Grouper` for the `index` and `columns` keywords. For details on `Grouper`, see Grouping with a Grouper specification.
```py
In [16]: pd.pivot_table(df, values="D", index=pd.Grouper(freq="ME", key="F"), columns="C")
Out[16]: 
C                bar       foo
F                             
2013-01-31       NaN  0.595334
2013-02-28       NaN -0.494817
2013-03-31       NaN -1.274317
2013-04-30 -0.431886       NaN
2013-05-31  0.393570       NaN
2013-06-30  0.196903       NaN
2013-07-31       NaN  0.197720
2013-08-31       NaN -1.065818
2013-09-30       NaN -0.767769
2013-10-31 -0.995460       NaN
2013-11-30 -0.986678       NaN
2013-12-31  0.798396       NaN
```
### Adding margins
Passing `margins=True` to `pivot_table()` will add a row and column with an `All` label, containing partial group aggregates across the categories on the rows and columns:
```py
In [17]: table = df.pivot_table(
   ....:     index=["A", "B"],
   ....:     columns="C",
   ....:     values=["D", "E"],
   ....:     margins=True,
   ....:     aggfunc="std"
   ....: )
   ....: 

In [18]: table
Out[18]: 
                D                             E                    
C             bar       foo       All       bar       foo       All
A     B                                                            
one   A  1.568517  0.178504  1.293926  0.179247  0.033718  0.371275
      B  1.157593  0.299748  0.860059  0.653280  0.885047  0.779837
      C  0.523425  0.133049  0.638297  1.111310  0.770555  0.938819
three A  0.995247       NaN  0.995247  0.049748       NaN  0.049748
      B       NaN  0.030522  0.030522       NaN  0.931203  0.931203
      C  0.386657       NaN  0.386657  0.386312       NaN  0.386312
two   A       NaN  0.111032  0.111032       NaN  1.146201  1.146201
      B  0.695438       NaN  0.695438  1.166526       NaN  1.166526
      C       NaN  0.331975  0.331975       NaN  0.043771  0.043771
All      1.014073  0.713941  0.871016  0.881376  0.984017  0.923568
```
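As a small aside not shown in the original examples: the `All` label added by `margins=True` can be renamed through the `margins_name` parameter of `pivot_table()`. A sketch reusing the `df` built in `In [10]` above:

```py
# Same pivot table as In [17], but the margin row/column is labelled "Total"
renamed = df.pivot_table(
    index=["A", "B"],
    columns="C",
    values=["D", "E"],
    margins=True,
    margins_name="Total",  # replaces the default "All" label
    aggfunc="std",
)
```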
Additionally, you can call `DataFrame.stack()` to display a pivoted DataFrame as having a multi-level index:
```py
In [19]: table.stack(future_stack=True)
Out[19]: 
                      D         E
A     B   C                      
one   A   bar  1.568517  0.179247
          foo  0.178504  0.033718
          All  1.293926  0.371275
      B   bar  1.157593  0.653280
          foo  0.299748  0.885047
...                 ...       ...
two   C   foo  0.331975  0.043771
          All  0.331975  0.043771
All       bar  1.014073  0.881376
          foo  0.713941  0.984017
          All  0.871016  0.923568

[30 rows x 2 columns]
```

## `stack()` and `unstack()`

Closely related to the `pivot()` method are the related `stack()` and `unstack()` methods available on `Series` and `DataFrame`. These methods are designed to work together with `MultiIndex` objects (see the section on hierarchical indexing).

+ `stack()`: "pivot" a level of the (possibly hierarchical) column labels, returning a `DataFrame` with an index with a new inner-most level of row labels.
+ `unstack()`: (inverse operation of `stack()`) "pivot" a level of the (possibly hierarchical) row index to the column axis, producing a reshaped `DataFrame` with a new inner-most level of column labels.

```py
In [20]: tuples = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....: 

In [21]: index = pd.MultiIndex.from_arrays(tuples, names=["first", "second"])

In [22]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

In [23]: df2 = df[:4]

In [24]: df2
Out[24]: 
                     A         B
first second                    
bar   one     0.895717  0.805244
      two    -1.206412  2.565646
baz   one     1.431256  1.340309
      two    -1.170299 -0.226169
```
The `stack()` function "compresses" a level in the `DataFrame` columns to produce either:

- A `Series`, in the case of a simple column `Index`.
- A `DataFrame`, in the case of a `MultiIndex` in the columns.
If the columns have a `MultiIndex`, you can choose which level to stack. The stacked level becomes the new lowest level in a `MultiIndex` on the columns:
```py
In [25]: stacked = df2.stack(future_stack=True)

In [26]: stacked
Out[26]: 
first  second   
bar    one     A    0.895717
               B    0.805244
       two     A   -1.206412
               B    2.565646
baz    one     A    1.431256
               B    1.340309
       two     A   -1.170299
               B   -0.226169
dtype: float64
```
With a "stacked" `DataFrame` or `Series` (having a `MultiIndex` as the `index`), the inverse operation of `stack()` is `unstack()`, which by default unstacks the last level:
```py
In [27]: stacked.unstack()
Out[27]: 
                     A         B
first second                    
bar   one     0.895717  0.805244
      two    -1.206412  2.565646
baz   one     1.431256  1.340309
      two    -1.170299 -0.226169

In [28]: stacked.unstack(1)
Out[28]: 
second        one       two
first                      
bar   A  0.895717 -1.206412
      B  0.805244  2.565646
baz   A  1.431256 -1.170299
      B  1.340309 -0.226169

In [29]: stacked.unstack(0)
Out[29]: 
first          bar       baz
second                      
one    A  0.895717  1.431256
       B  0.805244  1.340309
two    A -1.206412 -1.170299
       B  2.565646 -0.226169
```
If the indexes have names, you can use the level names instead of specifying the level numbers:
```py
In [30]: stacked.unstack("second")
Out[30]: 
second        one       two
first                      
bar   A  0.895717 -1.206412
      B  0.805244  2.565646
baz   A  1.431256 -1.170299
      B  1.340309 -0.226169
```
Notice that the `stack()` and `unstack()` methods implicitly sort the index levels involved. Hence a call to `stack()` and then `unstack()`, or vice versa, will result in a sorted copy of the original `DataFrame` or `Series`:
```py
In [31]: index = pd.MultiIndex.from_product([[2, 1], ["a", "b"]])

In [32]: df = pd.DataFrame(np.random.randn(4), index=index, columns=["A"])

In [33]: df
Out[33]: 
            A
2 a -1.413681
  b  1.607920
1 a  1.024180
  b  0.569605

In [34]: all(df.unstack().stack(future_stack=True) == df.sort_index())
Out[34]: True
```
### Multiple levels
You may also stack or unstack more than one level at a time by passing a list of levels, in which case the end result is as if each level in the list were processed individually.
```py
In [35]: columns = pd.MultiIndex.from_tuples(
   ....:     [
   ....:         ("A", "cat", "long"),
   ....:         ("B", "cat", "long"),
   ....:         ("A", "dog", "short"),
   ....:         ("B", "dog", "short"),
   ....:     ],
   ....:     names=["exp", "animal", "hair_length"],
   ....: )
   ....: 

In [36]: df = pd.DataFrame(np.random.randn(4, 4), columns=columns)

In [37]: df
Out[37]: 
exp                 A         B         A         B
animal            cat       cat       dog       dog
hair_length      long      long     short     short
0            0.875906 -2.211372  0.974466 -2.006747
1           -0.410001 -0.078638  0.545952 -1.219217
2           -1.226825  0.769804 -1.281247 -0.727707
3           -0.121306 -0.097883  0.695775  0.341734

In [38]: df.stack(level=["animal", "hair_length"], future_stack=True)
Out[38]: 
exp                          A         B
  animal hair_length                    
0 cat    long         0.875906 -2.211372
  dog    short        0.974466 -2.006747
1 cat    long        -0.410001 -0.078638
  dog    short        0.545952 -1.219217
2 cat    long        -1.226825  0.769804
  dog    short       -1.281247 -0.727707
3 cat    long        -0.121306 -0.097883
  dog    short        0.695775  0.341734
```
The list of levels can contain either level names or level numbers, but not a mixture of the two.
```py
# df.stack(level=['animal', 'hair_length'], future_stack=True)
# from above is equivalent to:
In [39]: df.stack(level=[1, 2], future_stack=True)
Out[39]: 
exp                          A         B
  animal hair_length                    
0 cat    long         0.875906 -2.211372
  dog    short        0.974466 -2.006747
1 cat    long        -0.410001 -0.078638
  dog    short        0.545952 -1.219217
2 cat    long        -1.226825  0.769804
  dog    short       -1.281247 -0.727707
3 cat    long        -0.121306 -0.097883
  dog    short        0.695775  0.341734
```
### Missing data
Unstacking can result in missing values if subgroups do not have the same set of labels. By default, missing values will be replaced with the default fill value for that data type.
```py
In [40]: columns = pd.MultiIndex.from_tuples(
   ....:     [
   ....:         ("A", "cat"),
   ....:         ("B", "dog"),
   ....:         ("B", "cat"),
   ....:         ("A", "dog"),
   ....:     ],
   ....:     names=["exp", "animal"],
   ....: )
   ....: 

In [41]: index = pd.MultiIndex.from_product(
   ....:     [("bar", "baz", "foo", "qux"), ("one", "two")], names=["first", "second"]
   ....: )
   ....: 

In [42]: df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)

In [43]: df3 = df.iloc[[0, 1, 4, 7], [1, 2]]

In [44]: df3
Out[44]: 
exp                  B          
animal             dog       cat
first second                    
bar   one    -1.110336 -0.619976
      two     0.687738  0.176444
foo   one     1.314232  0.690579
qux   two     0.380396  0.084844

In [45]: df3.unstack()
Out[45]: 
exp             B                              
animal        dog                 cat          
second        one       two       one       two
first                                          
bar     -1.110336  0.687738 -0.619976  0.176444
foo      1.314232       NaN  0.690579       NaN
qux           NaN  0.380396       NaN  0.084844
```
The missing value can be filled with a specific value with the `fill_value` argument.
```py
In [46]: df3.unstack(fill_value=-1e9)
Out[46]: 
exp                 B                                          
animal            dog                         cat              
second            one           two           one           two
first                                                          
bar     -1.110336e+00  6.877384e-01 -6.199759e-01  1.764443e-01
foo      1.314232e+00 -1.000000e+09  6.905793e-01 -1.000000e+09
qux     -1.000000e+09  3.803956e-01 -1.000000e+09  8.484421e-02
```

## `melt()` and `wide_to_long()`

The top-level `melt()` function and the corresponding `DataFrame.melt()` are useful to massage a `DataFrame` into a format where one or more columns are *identifier variables*, while all other columns, considered *measured variables*, are "unpivoted" to the row axis, leaving just two non-identifier columns, "variable" and "value". The names of those columns can be customized by supplying the `var_name` and `value_name` parameters.

```py
In [47]: cheese = pd.DataFrame(
   ....:     {
   ....:         "first": ["John", "Mary"],
   ....:         "last": ["Doe", "Bo"],
   ....:         "height": [5.5, 6.0],
   ....:         "weight": [130, 150],
   ....:     }
   ....: )
   ....: 

In [48]: cheese
Out[48]: 
  first last  height  weight
0  John  Doe     5.5     130
1  Mary   Bo     6.0     150

In [49]: cheese.melt(id_vars=["first", "last"])
Out[49]: 
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

In [50]: cheese.melt(id_vars=["first", "last"], var_name="quantity")
Out[50]: 
  first last quantity  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0
```
When transforming a DataFrame using `melt()`, the index is ignored. The original index values can be kept by setting the `ignore_index` parameter to `False` (it defaults to `True`). Setting `ignore_index=False` will, however, duplicate index values.
```py
In [51]: index = pd.MultiIndex.from_tuples([("person", "A"), ("person", "B")])

In [52]: cheese = pd.DataFrame(
   ....:     {
   ....:         "first": ["John", "Mary"],
   ....:         "last": ["Doe", "Bo"],
   ....:         "height": [5.5, 6.0],
   ....:         "weight": [130, 150],
   ....:     },
   ....:     index=index,
   ....: )
   ....: 

In [53]: cheese
Out[53]: 
         first last  height  weight
person A  John  Doe     5.5     130
       B  Mary   Bo     6.0     150

In [54]: cheese.melt(id_vars=["first", "last"])
Out[54]: 
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

In [55]: cheese.melt(id_vars=["first", "last"], ignore_index=False)
Out[55]: 
         first last variable  value
person A  John  Doe   height    5.5
       B  Mary   Bo   height    6.0
       A  John  Doe   weight  130.0
       B  Mary   Bo   weight  150.0
```
`wide_to_long()` is similar to `melt()`, but with more customization for column matching.
```py
In [56]: dft = pd.DataFrame(
   ....:     {
   ....:         "A1970": {0: "a", 1: "b", 2: "c"},
   ....:         "A1980": {0: "d", 1: "e", 2: "f"},
   ....:         "B1970": {0: 2.5, 1: 1.2, 2: 0.7},
   ....:         "B1980": {0: 3.2, 1: 1.3, 2: 0.1},
   ....:         "X": dict(zip(range(3), np.random.randn(3))),
   ....:     }
   ....: )
   ....: 

In [57]: dft["id"] = dft.index

In [58]: dft
Out[58]: 
  A1970 A1980  B1970  B1980         X  id
0     a     d    2.5    3.2  1.519970   0
1     b     e    1.2    1.3 -0.493662   1
2     c     f    0.7    0.1  0.600178   2

In [59]: pd.wide_to_long(dft, ["A", "B"], i="id", j="year")
Out[59]: 
                X  A    B
id year                  
0  1970  1.519970  a  2.5
1  1970 -0.493662  b  1.2
2  1970  0.600178  c  0.7
0  1980  1.519970  d  3.2
1  1980 -0.493662  e  1.3
2  1980  0.600178  f  0.1
```

## `get_dummies()` and `from_dummies()`

To convert categorical variables of a `Series` into "dummy" or "indicator" variables, `get_dummies()` creates a new `DataFrame` with columns of the unique variables and the values representing the presence of those variables per row.

```py
In [60]: df = pd.DataFrame({"key": list("bbacab"), "data1": range(6)})

In [61]: pd.get_dummies(df["key"])
Out[61]: 
       a      b      c
0  False   True  False
1  False   True  False
2   True  False  False
3  False  False   True
4   True  False  False
5  False   True  False

In [62]: df["key"].str.get_dummies()
Out[62]: 
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
```
`prefix` adds a prefix to the column names, which is useful for merging the result with the original `DataFrame`:
```py
In [63]: dummies = pd.get_dummies(df["key"], prefix="key")

In [64]: dummies
Out[64]: 
   key_a  key_b  key_c
0  False   True  False
1  False   True  False
2   True  False  False
3  False  False   True
4   True  False  False
5  False   True  False

In [65]: df[["data1"]].join(dummies)
Out[65]: 
   data1  key_a  key_b  key_c
0      0  False   True  False
1      1  False   True  False
2      2   True  False  False
3      3  False  False   True
4      4   True  False  False
5      5  False   True  False
```
This function is often used along with discretization functions like `cut()`:
```py
In [66]: values = np.random.randn(10)

In [67]: values
Out[67]: 
array([ 0.2742,  0.1329, -0.0237,  2.4102,  1.4505,  0.2061, -0.2519,
       -2.2136,  1.0633,  1.2661])

In [68]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [69]: pd.get_dummies(pd.cut(values, bins))
Out[69]: 
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0       False        True       False       False       False
1        True       False       False       False       False
2       False       False       False       False       False
3       False       False       False       False       False
4       False       False       False       False       False
5       False        True       False       False       False
6       False       False       False       False       False
7       False       False       False       False       False
8       False       False       False       False       False
9       False       False       False       False       False
```
`get_dummies()` also accepts a `DataFrame`. By default, columns of `object`, `string`, or `categorical` dtype are encoded as dummy variables, with other columns left unaltered.
```py
In [70]: df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["c", "c", "b"], "C": [1, 2, 3]})

In [71]: pd.get_dummies(df)
Out[71]: 
   C    A_a    A_b    B_b    B_c
0  1   True  False  False   True
1  2  False   True  False   True
2  3   True  False   True  False
```
Specifying the `columns` keyword will encode a column of any type.
```py
In [72]: pd.get_dummies(df, columns=["A"])
Out[72]: 
   B  C    A_a    A_b
0  c  1   True  False
1  c  2  False   True
2  b  3   True  False
```
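`from_dummies()`, named in this section's title, performs the reverse conversion: it rebuilds a single categorical column from indicator columns. A minimal sketch (the variable names and the `sep="_"` separator here are chosen for illustration and must match how the dummy columns were created):

```py
import pandas as pd

keys = pd.DataFrame({"key": list("bbacab")})
dummies = pd.get_dummies(keys["key"], prefix="key")

# sep tells from_dummies how to split "key_a" into the column name "key"
# and the category value "a"
recovered = pd.from_dummies(dummies, sep="_")
print(recovered)  # a single "key" column holding the original values
```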
Continued in Pandas 2.2 Official Tutorials and Guides (14), part 2: https://developer.aliyun.com/article/1509844