食谱
这是一个简短而精炼的示例和链接存储库,包含有用的 pandas 示例。我们鼓励用户为此文档添加内容。
在这一部分添加有趣的链接和/或内联示例是一个很好的首次拉取请求。
在可能的情况下,已插入简化、精简、适合新用户的内联示例,以补充 Stack-Overflow 和 GitHub 链接。许多链接包含了比内联示例提供的更详细的信息。
pandas(pd)和 NumPy(np)是唯一两个缩写导入的模块。其余模块都明确导入,以供新用户使用。
习语
这些都是一些很棒的 pandas 习语
对一列进行 if-then/if-then-else 条件判断,并对另一列或多列进行赋值:
In [1]: df = pd.DataFrame( ...: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ...: ) ...: In [2]: df Out[2]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50
if-then…
对一列进行 if-then 条件判断
In [3]: df.loc[df.AAA >= 5, "BBB"] = -1 In [4]: df Out[4]: AAA BBB CCC 0 4 10 100 1 5 -1 50 2 6 -1 -30 3 7 -1 -50
一个带有对 2 列赋值的 if-then:
In [5]: df.loc[df.AAA >= 5, ["BBB", "CCC"]] = 555 In [6]: df Out[6]: AAA BBB CCC 0 4 10 100 1 5 555 555 2 6 555 555 3 7 555 555
添加另一行具有不同逻辑,以执行-else
In [7]: df.loc[df.AAA < 5, ["BBB", "CCC"]] = 2000 In [8]: df Out[8]: AAA BBB CCC 0 4 2000 2000 1 5 555 555 2 6 555 555 3 7 555 555
或者在设置了掩码之后使用 pandas where
In [9]: df_mask = pd.DataFrame( ...: {"AAA": [True] * 4, "BBB": [False] * 4, "CCC": [True, False] * 2} ...: ) ...: In [10]: df.where(df_mask, -1000) Out[10]: AAA BBB CCC 0 4 -1000 2000 1 5 -1000 -1000 2 6 -1000 555 3 7 -1000 -1000
使用 NumPy 的 where()进行 if-then-else 条件判断
In [11]: df = pd.DataFrame( ....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ....: ) ....: In [12]: df Out[12]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 In [13]: df["logic"] = np.where(df["AAA"] > 5, "high", "low") In [14]: df Out[14]: AAA BBB CCC logic 0 4 10 100 low 1 5 20 50 low 2 6 30 -30 high 3 7 40 -50 high
分割
In [15]: df = pd.DataFrame( ....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ....: ) ....: In [16]: df Out[16]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 In [17]: df[df.AAA <= 5] Out[17]: AAA BBB CCC 0 4 10 100 1 5 20 50 In [18]: df[df.AAA > 5] Out[18]: AAA BBB CCC 2 6 30 -30 3 7 40 -50
构建条件
In [19]: df = pd.DataFrame( ....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ....: ) ....: In [20]: df Out[20]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50
…并且(不带赋值返回一个 Series)
In [21]: df.loc[(df["BBB"] < 25) & (df["CCC"] >= -40), "AAA"] Out[21]: 0 4 1 5 Name: AAA, dtype: int64
…或者(不带赋值返回一个 Series)
In [22]: df.loc[(df["BBB"] > 25) | (df["CCC"] >= -40), "AAA"] Out[22]: 0 4 1 5 2 6 3 7 Name: AAA, dtype: int64
…或者(带有赋值修改 DataFrame。)
In [23]: df.loc[(df["BBB"] > 25) | (df["CCC"] >= 75), "AAA"] = 999 In [24]: df Out[24]: AAA BBB CCC 0 999 10 100 1 5 20 50 2 999 30 -30 3 999 40 -50
In [25]: df = pd.DataFrame( ....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ....: ) ....: In [26]: df Out[26]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 In [27]: aValue = 43.0 In [28]: df.loc[(df.CCC - aValue).abs().argsort()] Out[28]: AAA BBB CCC 1 5 20 50 0 4 10 100 2 6 30 -30 3 7 40 -50
In [29]: df = pd.DataFrame( ....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ....: ) ....: In [30]: df Out[30]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 In [31]: Crit1 = df.AAA <= 5.5 In [32]: Crit2 = df.BBB == 10.0 In [33]: Crit3 = df.CCC > -40.0
可以硬编码:
In [34]: AllCrit = Crit1 & Crit2 & Crit3
…或者可以使用一个动态构建的条件列表
In [35]: import functools In [36]: CritList = [Crit1, Crit2, Crit3] In [37]: AllCrit = functools.reduce(lambda x, y: x & y, CritList) In [38]: df[AllCrit] Out[38]: AAA BBB CCC 0 4 10 100
选择
数据框
索引文档。
In [39]: df = pd.DataFrame( ....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ....: ) ....: In [40]: df Out[40]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 In [41]: df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))] Out[41]: AAA BBB CCC 0 4 10 100 2 6 30 -30
使用 loc 进行基于标签的切片和 iloc 进行基于位置的切片 GH 2904
In [42]: df = pd.DataFrame( ....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}, ....: index=["foo", "bar", "boo", "kar"], ....: ) ....:
有两种明确的切片方法,还有第三种通用情况
- 位置导向(Python 切片样式:不包括结束位置)
- 标签导向(非 Python 切片样式:包括结束位置)
- 通用(取决于切片样式:取决于切片是否包含标签或位置)
In [43]: df.loc["bar":"kar"] # Label Out[43]: AAA BBB CCC bar 5 20 50 boo 6 30 -30 kar 7 40 -50 # Generic In [44]: df[0:3] Out[44]: AAA BBB CCC foo 4 10 100 bar 5 20 50 boo 6 30 -30 In [45]: df["bar":"kar"] Out[45]: AAA BBB CCC bar 5 20 50 boo 6 30 -30 kar 7 40 -50
当索引由具有非零起始或非单位增量的整数组成时,就会出现歧义。
In [46]: data = {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} In [47]: df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4]) # Note index starts at 1. In [48]: df2.iloc[1:3] # Position-oriented Out[48]: AAA BBB CCC 2 5 20 50 3 6 30 -30 In [49]: df2.loc[1:3] # Label-oriented Out[49]: AAA BBB CCC 1 4 10 100 2 5 20 50 3 6 30 -30
In [50]: df = pd.DataFrame( ....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ....: ) ....: In [51]: df Out[51]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 In [52]: df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))] Out[52]: AAA BBB CCC 1 5 20 50 3 7 40 -50
新列
高效动态地使用 DataFrame.map(之前称为 applymap)创建新列
In [53]: df = pd.DataFrame({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 2, 2], "CCC": [2, 1, 3, 1]}) In [54]: df Out[54]: AAA BBB CCC 0 1 1 2 1 2 1 1 2 1 2 3 3 3 2 1 In [55]: source_cols = df.columns # Or some subset would work too In [56]: new_cols = [str(x) + "_cat" for x in source_cols] In [57]: categories = {1: "Alpha", 2: "Beta", 3: "Charlie"} In [58]: df[new_cols] = df[source_cols].map(categories.get) In [59]: df Out[59]: AAA BBB CCC AAA_cat BBB_cat CCC_cat 0 1 1 2 Alpha Alpha Beta 1 2 1 1 Beta Alpha Alpha 2 1 2 3 Alpha Beta Charlie 3 3 2 1 Charlie Beta Alpha
In [60]: df = pd.DataFrame( ....: {"AAA": [1, 1, 1, 2, 2, 2, 3, 3], "BBB": [2, 1, 3, 4, 5, 1, 2, 3]} ....: ) ....: In [61]: df Out[61]: AAA BBB 0 1 2 1 1 1 2 1 3 3 2 4 4 2 5 5 2 1 6 3 2 7 3 3
方法 1:使用 idxmin() 获取最小值的索引
In [62]: df.loc[df.groupby("AAA")["BBB"].idxmin()] Out[62]: AAA BBB 1 1 1 5 2 1 6 3 2
方法 2:先排序再取每个的第一个
In [63]: df.sort_values(by="BBB").groupby("AAA", as_index=False).first() Out[63]: AAA BBB 0 1 1 1 2 1 2 3 2
注意相同的结果,除了索引。 ## 多级索引
多级索引 文档。
In [64]: df = pd.DataFrame( ....: { ....: "row": [0, 1, 2], ....: "One_X": [1.1, 1.1, 1.1], ....: "One_Y": [1.2, 1.2, 1.2], ....: "Two_X": [1.11, 1.11, 1.11], ....: "Two_Y": [1.22, 1.22, 1.22], ....: } ....: ) ....: In [65]: df Out[65]: row One_X One_Y Two_X Two_Y 0 0 1.1 1.2 1.11 1.22 1 1 1.1 1.2 1.11 1.22 2 2 1.1 1.2 1.11 1.22 # As Labelled Index In [66]: df = df.set_index("row") In [67]: df Out[67]: One_X One_Y Two_X Two_Y row 0 1.1 1.2 1.11 1.22 1 1.1 1.2 1.11 1.22 2 1.1 1.2 1.11 1.22 # With Hierarchical Columns In [68]: df.columns = pd.MultiIndex.from_tuples([tuple(c.split("_")) for c in df.columns]) In [69]: df Out[69]: One Two X Y X Y row 0 1.1 1.2 1.11 1.22 1 1.1 1.2 1.11 1.22 2 1.1 1.2 1.11 1.22 # Now stack & Reset In [70]: df = df.stack(0, future_stack=True).reset_index(1) In [71]: df Out[71]: level_1 X Y row 0 One 1.10 1.20 0 Two 1.11 1.22 1 One 1.10 1.20 1 Two 1.11 1.22 2 One 1.10 1.20 2 Two 1.11 1.22 # And fix the labels (Notice the label 'level_1' got added automatically) In [72]: df.columns = ["Sample", "All_X", "All_Y"] In [73]: df Out[73]: Sample All_X All_Y row 0 One 1.10 1.20 0 Two 1.11 1.22 1 One 1.10 1.20 1 Two 1.11 1.22 2 One 1.10 1.20 2 Two 1.11 1.22
算术
In [74]: cols = pd.MultiIndex.from_tuples( ....: [(x, y) for x in ["A", "B", "C"] for y in ["O", "I"]] ....: ) ....: In [75]: df = pd.DataFrame(np.random.randn(2, 6), index=["n", "m"], columns=cols) In [76]: df Out[76]: A B C O I O I O I n 0.469112 -0.282863 -1.509059 -1.135632 1.212112 -0.173215 m 0.119209 -1.044236 -0.861849 -2.104569 -0.494929 1.071804 In [77]: df = df.div(df["C"], level=1) In [78]: df Out[78]: A B C O I O I O I n 0.387021 1.633022 -1.244983 6.556214 1.0 1.0 m -0.240860 -0.974279 1.741358 -1.963577 1.0 1.0
切片
In [79]: coords = [("AA", "one"), ("AA", "six"), ("BB", "one"), ("BB", "two"), ("BB", "six")] In [80]: index = pd.MultiIndex.from_tuples(coords) In [81]: df = pd.DataFrame([11, 22, 33, 44, 55], index, ["MyData"]) In [82]: df Out[82]: MyData AA one 11 six 22 BB one 33 two 44 six 55
要获取索引的第一个级别和第一个轴的交叉部分:
# Note : level and axis are optional, and default to zero In [83]: df.xs("BB", level=0, axis=0) Out[83]: MyData one 33 two 44 six 55
…现在是第一个轴的第二级。
In [84]: df.xs("six", level=1, axis=0) Out[84]: MyData AA 22 BB 55
In [85]: import itertools In [86]: index = list(itertools.product(["Ada", "Quinn", "Violet"], ["Comp", "Math", "Sci"])) In [87]: headr = list(itertools.product(["Exams", "Labs"], ["I", "II"])) In [88]: indx = pd.MultiIndex.from_tuples(index, names=["Student", "Course"]) In [89]: cols = pd.MultiIndex.from_tuples(headr) # Notice these are un-named In [90]: data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)] In [91]: df = pd.DataFrame(data, indx, cols) In [92]: df Out[92]: Exams Labs I II I II Student Course Ada Comp 70 71 72 73 Math 71 73 75 74 Sci 72 75 75 75 Quinn Comp 73 74 75 76 Math 74 76 78 77 Sci 75 78 78 78 Violet Comp 76 77 78 79 Math 77 79 81 80 Sci 78 81 81 81 In [93]: All = slice(None) In [94]: df.loc["Violet"] Out[94]: Exams Labs I II I II Course Comp 76 77 78 79 Math 77 79 81 80 Sci 78 81 81 81 In [95]: df.loc[(All, "Math"), All] Out[95]: Exams Labs I II I II Student Course Ada Math 71 73 75 74 Quinn Math 74 76 78 77 Violet Math 77 79 81 80 In [96]: df.loc[(slice("Ada", "Quinn"), "Math"), All] Out[96]: Exams Labs I II I II Student Course Ada Math 71 73 75 74 Quinn Math 74 76 78 77 In [97]: df.loc[(All, "Math"), ("Exams")] Out[97]: I II Student Course Ada Math 71 73 Quinn Math 74 76 Violet Math 77 79 In [98]: df.loc[(All, "Math"), (All, "II")] Out[98]: Exams Labs II II Student Course Ada Math 73 74 Quinn Math 76 77 Violet Math 79 80
排序
In [99]: df.sort_values(by=("Labs", "II"), ascending=False) Out[99]: Exams Labs I II I II Student Course Violet Sci 78 81 81 81 Math 77 79 81 80 Comp 76 77 78 79 Quinn Sci 75 78 78 78 Math 74 76 78 77 Comp 73 74 75 76 Ada Sci 72 75 75 75 Math 71 73 75 74 Comp 70 71 72 73
部分选择,需要排序 GH 2995
层次
展平分层列 ## 缺失数据
缺失数据 文档。
向前填充反向时间序列
In [100]: df = pd.DataFrame( .....: np.random.randn(6, 1), .....: index=pd.date_range("2013-08-01", periods=6, freq="B"), .....: columns=list("A"), .....: ) .....: In [101]: df.loc[df.index[3], "A"] = np.nan In [102]: df Out[102]: A 2013-08-01 0.721555 2013-08-02 -0.706771 2013-08-05 -1.039575 2013-08-06 NaN 2013-08-07 -0.424972 2013-08-08 0.567020 In [103]: df.bfill() Out[103]: A 2013-08-01 0.721555 2013-08-02 -0.706771 2013-08-05 -1.039575 2013-08-06 -0.424972 2013-08-07 -0.424972 2013-08-08 0.567020
Pandas 2.2 中文官方教程和指南(二十五·一)(2)https://developer.aliyun.com/article/1508892