Pandas 2.2 中文官方教程和指南(二十五·二)(1)https://developer.aliyun.com/article/1509409
In [120]: df = pd.DataFrame( .....: { .....: "code": ["foo", "bar", "baz"] * 2, .....: "data": [0.16, -0.21, 0.33, 0.45, -0.59, 0.62], .....: "flag": [False, True] * 3, .....: } .....: ) .....: In [121]: code_groups = df.groupby("code") In [122]: agg_n_sort_order = code_groups[["data"]].transform("sum").sort_values(by="data") In [123]: sorted_df = df.loc[agg_n_sort_order.index] In [124]: sorted_df Out[124]: code data flag 1 bar -0.21 True 4 bar -0.59 False 0 foo 0.16 False 3 foo 0.45 True 2 baz 0.33 False 5 baz 0.62 True
In [125]: rng = pd.date_range(start="2014-10-07", periods=10, freq="2min") In [126]: ts = pd.Series(data=list(range(10)), index=rng) In [127]: def MyCust(x): .....: if len(x) > 2: .....: return x.iloc[1] * 1.234 .....: return pd.NaT .....: In [128]: mhc = {"Mean": "mean", "Max": "max", "Custom": MyCust} In [129]: ts.resample("5min").apply(mhc) Out[129]: Mean Max Custom 2014-10-07 00:00:00 1.0 2 1.234 2014-10-07 00:05:00 3.5 4 NaT 2014-10-07 00:10:00 6.0 7 7.404 2014-10-07 00:15:00 8.5 9 NaT In [130]: ts Out[130]: 2014-10-07 00:00:00 0 2014-10-07 00:02:00 1 2014-10-07 00:04:00 2 2014-10-07 00:06:00 3 2014-10-07 00:08:00 4 2014-10-07 00:10:00 5 2014-10-07 00:12:00 6 2014-10-07 00:14:00 7 2014-10-07 00:16:00 8 2014-10-07 00:18:00 9 Freq: 2min, dtype: int64
In [131]: df = pd.DataFrame( .....: {"Color": "Red Red Red Blue".split(), "Value": [100, 150, 50, 50]} .....: ) .....: In [132]: df Out[132]: Color Value 0 Red 100 1 Red 150 2 Red 50 3 Blue 50 In [133]: df["Counts"] = df.groupby(["Color"]).transform(len) In [134]: df Out[134]: Color Value Counts 0 Red 100 3 1 Red 150 3 2 Red 50 3 3 Blue 50 1
In [135]: df = pd.DataFrame( .....: {"line_race": [10, 10, 8, 10, 10, 8], "beyer": [99, 102, 103, 103, 88, 100]}, .....: index=[ .....: "Last Gunfighter", .....: "Last Gunfighter", .....: "Last Gunfighter", .....: "Paynter", .....: "Paynter", .....: "Paynter", .....: ], .....: ) .....: In [136]: df Out[136]: line_race beyer Last Gunfighter 10 99 Last Gunfighter 10 102 Last Gunfighter 8 103 Paynter 10 103 Paynter 10 88 Paynter 8 100 In [137]: df["beyer_shifted"] = df.groupby(level=0)["beyer"].shift(1) In [138]: df Out[138]: line_race beyer beyer_shifted Last Gunfighter 10 99 NaN Last Gunfighter 10 102 99.0 Last Gunfighter 8 103 102.0 Paynter 10 103 NaN Paynter 10 88 103.0 Paynter 8 100 88.0
In [139]: df = pd.DataFrame( .....: { .....: "host": ["other", "other", "that", "this", "this"], .....: "service": ["mail", "web", "mail", "mail", "web"], .....: "no": [1, 2, 1, 2, 1], .....: } .....: ).set_index(["host", "service"]) .....: In [140]: mask = df.groupby(level=0).agg("idxmax") In [141]: df_count = df.loc[mask["no"]].reset_index() In [142]: df_count Out[142]: host service no 0 other web 2 1 that mail 1 2 this mail 2
类似于 Python 的 itertools.groupby 的分组
In [143]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=["A"]) In [144]: df["A"].groupby((df["A"] != df["A"].shift()).cumsum()).groups Out[144]: {1: [0], 2: [1], 3: [2], 4: [3, 4, 5], 5: [6], 6: [7, 8]} In [145]: df["A"].groupby((df["A"] != df["A"].shift()).cumsum()).cumsum() Out[145]: 0 0 1 1 2 0 3 1 4 2 5 3 6 0 7 1 8 2 Name: A, dtype: int64
扩展数据
分割
创建一个数据框列表,根据包含在行中的逻辑进行分割。
In [146]: df = pd.DataFrame( .....: data={ .....: "Case": ["A", "A", "A", "B", "A", "A", "B", "A", "A"], .....: "Data": np.random.randn(9), .....: } .....: ) .....: In [147]: dfs = list( .....: zip( .....: *df.groupby( .....: (1 * (df["Case"] == "B")) .....: .cumsum() .....: .rolling(window=3, min_periods=1) .....: .median() .....: ) .....: ) .....: )[-1] .....: In [148]: dfs[0] Out[148]: Case Data 0 A 0.276232 1 A -1.087401 2 A -0.673690 3 B 0.113648 In [149]: dfs[1] Out[149]: Case Data 4 A -1.478427 5 A 0.524988 6 B 0.404705 In [150]: dfs[2] Out[150]: Case Data 7 A 0.577046 8 A -1.715002
透视表
���视表 文档。
In [151]: df = pd.DataFrame( .....: data={ .....: "Province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"], .....: "City": [ .....: "Toronto", .....: "Montreal", .....: "Vancouver", .....: "Calgary", .....: "Edmonton", .....: "Winnipeg", .....: "Windsor", .....: ], .....: "Sales": [13, 6, 16, 8, 4, 3, 1], .....: } .....: ) .....: In [152]: table = pd.pivot_table( .....: df, .....: values=["Sales"], .....: index=["Province"], .....: columns=["City"], .....: aggfunc="sum", .....: margins=True, .....: ) .....: In [153]: table.stack("City", future_stack=True) Out[153]: Sales Province City AL Calgary 8.0 Edmonton 4.0 Montreal NaN Toronto NaN Vancouver NaN ... ... All Toronto 13.0 Vancouver 16.0 Windsor 1.0 Winnipeg 3.0 All 51.0 [48 rows x 1 columns]
In [154]: grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78] In [155]: df = pd.DataFrame( .....: { .....: "ID": ["x%d" % r for r in range(10)], .....: "Gender": ["F", "M", "F", "M", "F", "M", "F", "M", "M", "M"], .....: "ExamYear": [ .....: "2007", .....: "2007", .....: "2007", .....: "2008", .....: "2008", .....: "2008", .....: "2008", .....: "2009", .....: "2009", .....: "2009", .....: ], .....: "Class": [ .....: "algebra", .....: "stats", .....: "bio", .....: "algebra", .....: "algebra", .....: "stats", .....: "stats", .....: "algebra", .....: "bio", .....: "bio", .....: ], .....: "Participated": [ .....: "yes", .....: "yes", .....: "yes", .....: "yes", .....: "no", .....: "yes", .....: "yes", .....: "yes", .....: "yes", .....: "yes", .....: ], .....: "Passed": ["yes" if x > 50 else "no" for x in grades], .....: "Employed": [ .....: True, .....: True, .....: True, .....: False, .....: False, .....: False, .....: False, .....: True, .....: True, .....: False, .....: ], .....: "Grade": grades, .....: } .....: ) .....: In [156]: df.groupby("ExamYear").agg( .....: { .....: "Participated": lambda x: x.value_counts()["yes"], .....: "Passed": lambda x: sum(x == "yes"), .....: "Employed": lambda x: sum(x), .....: "Grade": lambda x: sum(x) / len(x), .....: } .....: ) .....: Out[156]: Participated Passed Employed Grade ExamYear 2007 3 2 3 74.000000 2008 3 3 0 68.500000 2009 3 2 2 60.666667
创建年份和月份交叉表:
In [157]: df = pd.DataFrame( .....: {"value": np.random.randn(36)}, .....: index=pd.date_range("2011-01-01", freq="ME", periods=36), .....: ) .....: In [158]: pd.pivot_table( .....: df, index=df.index.month, columns=df.index.year, values="value", aggfunc="sum" .....: ) .....: Out[158]: 2011 2012 2013 1 -1.039268 -0.968914 2.565646 2 -0.370647 -1.294524 1.431256 3 -1.157892 0.413738 1.340309 4 -1.344312 0.276662 -1.170299 5 0.844885 -0.472035 -0.226169 6 1.075770 -0.013960 0.410835 7 -0.109050 -0.362543 0.813850 8 1.643563 -0.006154 0.132003 9 -1.469388 -0.923061 -0.827317 10 0.357021 0.895717 -0.076467 11 -0.674600 0.805244 -1.187678 12 -1.776904 -1.206412 1.130127
应用
In [159]: df = pd.DataFrame( .....: data={ .....: "A": [[2, 4, 8, 16], [100, 200], [10, 20, 30]], .....: "B": [["a", "b", "c"], ["jj", "kk"], ["ccc"]], .....: }, .....: index=["I", "II", "III"], .....: ) .....: In [160]: def SeriesFromSubList(aList): .....: return pd.Series(aList) .....: In [161]: df_orgz = pd.concat( .....: {ind: row.apply(SeriesFromSubList) for ind, row in df.iterrows()} .....: ) .....: In [162]: df_orgz Out[162]: 0 1 2 3 I A 2 4 8 16.0 B a b c NaN II A 100 200 NaN NaN B jj kk NaN NaN III A 10 20.0 30.0 NaN B ccc NaN NaN NaN
滚动应用到多列,其中函数在返回系列之前计算系列的标量
In [163]: df = pd.DataFrame( .....: data=np.random.randn(2000, 2) / 10000, .....: index=pd.date_range("2001-01-01", periods=2000), .....: columns=["A", "B"], .....: ) .....: In [164]: df Out[164]: A B 2001-01-01 -0.000144 -0.000141 2001-01-02 0.000161 0.000102 2001-01-03 0.000057 0.000088 2001-01-04 -0.000221 0.000097 2001-01-05 -0.000201 -0.000041 ... ... ... 2006-06-19 0.000040 -0.000235 2006-06-20 -0.000123 -0.000021 2006-06-21 -0.000113 0.000114 2006-06-22 0.000136 0.000109 2006-06-23 0.000027 0.000030 [2000 rows x 2 columns] In [165]: def gm(df, const): .....: v = ((((df["A"] + df["B"]) + 1).cumprod()) - 1) * const .....: return v.iloc[-1] .....: In [166]: s = pd.Series( .....: { .....: df.index[i]: gm(df.iloc[i: min(i + 51, len(df) - 1)], 5) .....: for i in range(len(df) - 50) .....: } .....: ) .....: In [167]: s Out[167]: 2001-01-01 0.000930 2001-01-02 0.002615 2001-01-03 0.001281 2001-01-04 0.001117 2001-01-05 0.002772 ... 2006-04-30 0.003296 2006-05-01 0.002629 2006-05-02 0.002081 2006-05-03 0.004247 2006-05-04 0.003928 Length: 1950, dtype: float64
滚动应用于多列,其中函数返回标量(成交量加权平均价格)
In [168]: rng = pd.date_range(start="2014-01-01", periods=100) In [169]: df = pd.DataFrame( .....: { .....: "Open": np.random.randn(len(rng)), .....: "Close": np.random.randn(len(rng)), .....: "Volume": np.random.randint(100, 2000, len(rng)), .....: }, .....: index=rng, .....: ) .....: In [170]: df Out[170]: Open Close Volume 2014-01-01 -1.611353 -0.492885 1219 2014-01-02 -3.000951 0.445794 1054 2014-01-03 -0.138359 -0.076081 1381 2014-01-04 0.301568 1.198259 1253 2014-01-05 0.276381 -0.669831 1728 ... ... ... ... 2014-04-06 -0.040338 0.937843 1188 2014-04-07 0.359661 -0.285908 1864 2014-04-08 0.060978 1.714814 941 2014-04-09 1.759055 -0.455942 1065 2014-04-10 0.138185 -1.147008 1453 [100 rows x 3 columns] In [171]: def vwap(bars): .....: return (bars.Close * bars.Volume).sum() / bars.Volume.sum() .....: In [172]: window = 5 In [173]: s = pd.concat( .....: [ .....: (pd.Series(vwap(df.iloc[i: i + window]), index=[df.index[i + window]])) .....: for i in range(len(df) - window) .....: ] .....: ) .....: In [174]: s.round(2) Out[174]: 2014-01-06 0.02 2014-01-07 0.11 2014-01-08 0.10 2014-01-09 0.07 2014-01-10 -0.29 ... 2014-04-06 -0.63 2014-04-07 -0.02 2014-04-08 -0.03 2014-04-09 0.34 2014-04-10 0.29 Length: 95, dtype: float64
扩展数据
分割
创建一个数据框列表,根据行中包含的逻辑进行分割。
In [146]: df = pd.DataFrame( .....: data={ .....: "Case": ["A", "A", "A", "B", "A", "A", "B", "A", "A"], .....: "Data": np.random.randn(9), .....: } .....: ) .....: In [147]: dfs = list( .....: zip( .....: *df.groupby( .....: (1 * (df["Case"] == "B")) .....: .cumsum() .....: .rolling(window=3, min_periods=1) .....: .median() .....: ) .....: ) .....: )[-1] .....: In [148]: dfs[0] Out[148]: Case Data 0 A 0.276232 1 A -1.087401 2 A -0.673690 3 B 0.113648 In [149]: dfs[1] Out[149]: Case Data 4 A -1.478427 5 A 0.524988 6 B 0.404705 In [150]: dfs[2] Out[150]: Case Data 7 A 0.577046 8 A -1.715002
Pandas 2.2 中文官方教程和指南(二十五·二)(3)https://developer.aliyun.com/article/1509411