Pandas 2.2 Official Tutorials and Guides (25.1), part 1: https://developer.aliyun.com/article/1508891
Replacing
Using replace with backreferences
Grouping
The Grouping docs.
Unlike agg, apply's callable is passed a sub-DataFrame, which gives you access to all the columns.
```py
In [104]: df = pd.DataFrame(
   .....:     {
   .....:         "animal": "cat dog cat fish dog cat cat".split(),
   .....:         "size": list("SSMMMLL"),
   .....:         "weight": [8, 10, 11, 1, 20, 12, 12],
   .....:         "adult": [False] * 5 + [True] * 2,
   .....:     }
   .....: )
   .....: 

In [105]: df
Out[105]: 
  animal size  weight  adult
0    cat    S       8  False
1    dog    S      10  False
2    cat    M      11  False
3   fish    M       1  False
4    dog    M      20  False
5    cat    L      12   True
6    cat    L      12   True

# List the size of the animals with the highest weight.
In [106]: df.groupby("animal").apply(
   .....:     lambda subf: subf["size"][subf["weight"].idxmax()], include_groups=False
   .....: )
Out[106]: 
animal
cat     L
dog     M
fish    M
dtype: object
```
```py
In [107]: gb = df.groupby("animal")

In [108]: gb.get_group("cat")
Out[108]: 
  animal size  weight  adult
0    cat    S       8  False
2    cat    M      11  False
5    cat    L      12   True
6    cat    L      12   True
```
```py
In [109]: def GrowUp(x):
   .....:     avg_weight = sum(x[x["size"] == "S"].weight * 1.5)
   .....:     avg_weight += sum(x[x["size"] == "M"].weight * 1.25)
   .....:     avg_weight += sum(x[x["size"] == "L"].weight)
   .....:     avg_weight /= len(x)
   .....:     return pd.Series(["L", avg_weight, True], index=["size", "weight", "adult"])
   .....: 

In [110]: expected_df = gb.apply(GrowUp, include_groups=False)

In [111]: expected_df
Out[111]: 
        size   weight  adult
animal                      
cat        L  12.4375   True
dog        L  20.0000   True
fish       L   1.2500   True
```
```py
In [112]: S = pd.Series([i / 100.0 for i in range(1, 11)])

In [113]: def cum_ret(x, y):
   .....:     return x * (1 + y)
   .....: 

In [114]: def red(x):
   .....:     return functools.reduce(cum_ret, x, 1.0)
   .....: 

In [115]: S.expanding().apply(red, raw=True)
Out[115]: 
0    1.010000
1    1.030200
2    1.061106
3    1.103550
4    1.158728
5    1.228251
6    1.314229
7    1.419367
8    1.547110
9    1.701821
dtype: float64
```
```py
In [116]: df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, -1, 1, 2]})

In [117]: gb = df.groupby("A")

In [118]: def replace(g):
   .....:     mask = g < 0
   .....:     return g.where(~mask, g[~mask].mean())
   .....: 

In [119]: gb.transform(replace)
Out[119]: 
   B
0  1
1  1
2  1
3  2
```
```py
In [120]: df = pd.DataFrame(
   .....:     {
   .....:         "code": ["foo", "bar", "baz"] * 2,
   .....:         "data": [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
   .....:         "flag": [False, True] * 3,
   .....:     }
   .....: )
   .....: 

In [121]: code_groups = df.groupby("code")

In [122]: agg_n_sort_order = code_groups[["data"]].transform("sum").sort_values(by="data")

In [123]: sorted_df = df.loc[agg_n_sort_order.index]

In [124]: sorted_df
Out[124]: 
  code  data   flag
1  bar -0.21   True
4  bar -0.59  False
0  foo  0.16  False
3  foo  0.45   True
2  baz  0.33  False
5  baz  0.62   True
```
```py
In [125]: rng = pd.date_range(start="2014-10-07", periods=10, freq="2min")

In [126]: ts = pd.Series(data=list(range(10)), index=rng)

In [127]: def MyCust(x):
   .....:     if len(x) > 2:
   .....:         return x.iloc[1] * 1.234
   .....:     return pd.NaT
   .....: 

In [128]: mhc = {"Mean": "mean", "Max": "max", "Custom": MyCust}

In [129]: ts.resample("5min").apply(mhc)
Out[129]: 
                     Mean  Max Custom
2014-10-07 00:00:00   1.0    2  1.234
2014-10-07 00:05:00   3.5    4    NaT
2014-10-07 00:10:00   6.0    7  7.404
2014-10-07 00:15:00   8.5    9    NaT

In [130]: ts
Out[130]: 
2014-10-07 00:00:00    0
2014-10-07 00:02:00    1
2014-10-07 00:04:00    2
2014-10-07 00:06:00    3
2014-10-07 00:08:00    4
2014-10-07 00:10:00    5
2014-10-07 00:12:00    6
2014-10-07 00:14:00    7
2014-10-07 00:16:00    8
2014-10-07 00:18:00    9
Freq: 2min, dtype: int64
```
```py
In [131]: df = pd.DataFrame(
   .....:     {"Color": "Red Red Red Blue".split(), "Value": [100, 150, 50, 50]}
   .....: )
   .....: 

In [132]: df
Out[132]: 
  Color  Value
0   Red    100
1   Red    150
2   Red     50
3  Blue     50

In [133]: df["Counts"] = df.groupby(["Color"]).transform(len)

In [134]: df
Out[134]: 
  Color  Value  Counts
0   Red    100       3
1   Red    150       3
2   Red     50       3
3  Blue     50       1
```
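The same per-group row count can also be obtained with the built-in "size" aggregation name; this sketch is our own addition (not part of the original recipe) and avoids calling the Python len function on each group:

```python
import pandas as pd

df = pd.DataFrame(
    {"Color": "Red Red Red Blue".split(), "Value": [100, 150, 50, 50]}
)
# transform("size") broadcasts each group's row count back onto its rows
df["Counts"] = df.groupby("Color")["Value"].transform("size")
print(df)
```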
```py
In [135]: df = pd.DataFrame(
   .....:     {"line_race": [10, 10, 8, 10, 10, 8], "beyer": [99, 102, 103, 103, 88, 100]},
   .....:     index=[
   .....:         "Last Gunfighter",
   .....:         "Last Gunfighter",
   .....:         "Last Gunfighter",
   .....:         "Paynter",
   .....:         "Paynter",
   .....:         "Paynter",
   .....:     ],
   .....: )
   .....: 

In [136]: df
Out[136]: 
                 line_race  beyer
Last Gunfighter         10     99
Last Gunfighter         10    102
Last Gunfighter          8    103
Paynter                 10    103
Paynter                 10     88
Paynter                  8    100

In [137]: df["beyer_shifted"] = df.groupby(level=0)["beyer"].shift(1)

In [138]: df
Out[138]: 
                 line_race  beyer  beyer_shifted
Last Gunfighter         10     99            NaN
Last Gunfighter         10    102           99.0
Last Gunfighter          8    103          102.0
Paynter                 10    103            NaN
Paynter                 10     88          103.0
Paynter                  8    100           88.0
```
```py
In [139]: df = pd.DataFrame(
   .....:     {
   .....:         "host": ["other", "other", "that", "this", "this"],
   .....:         "service": ["mail", "web", "mail", "mail", "web"],
   .....:         "no": [1, 2, 1, 2, 1],
   .....:     }
   .....: ).set_index(["host", "service"])
   .....: 

In [140]: mask = df.groupby(level=0).agg("idxmax")

In [141]: df_count = df.loc[mask["no"]].reset_index()

In [142]: df_count
Out[142]: 
    host service  no
0  other     web   2
1   that    mail   1
2   this    mail   2
```
Grouping like Python's itertools.groupby
```py
In [143]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=["A"])

In [144]: df["A"].groupby((df["A"] != df["A"].shift()).cumsum()).groups
Out[144]: {1: [0], 2: [1], 3: [2], 4: [3, 4, 5], 5: [6], 6: [7, 8]}

In [145]: df["A"].groupby((df["A"] != df["A"].shift()).cumsum()).cumsum()
Out[145]: 
0    0
1    1
2    0
3    1
4    2
5    3
6    0
7    1
8    2
Name: A, dtype: int64
```
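For reference, the shift/cumsum trick above labels consecutive runs the same way plain Python's itertools.groupby splits a sequence; a minimal side-by-side sketch (our own addition):

```python
import itertools

values = [0, 1, 0, 1, 1, 1, 0, 1, 1]
# itertools.groupby starts a new group at every change of value,
# producing the same consecutive runs as the cumsum-based group labels
runs = [list(group) for _, group in itertools.groupby(values)]
print(runs)  # [[0], [1], [0], [1, 1, 1], [0], [1, 1]]
```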
Expanding data
Splitting
Create a list of DataFrames, split using a delineation based on logic included in rows.
```py
In [146]: df = pd.DataFrame(
   .....:     data={
   .....:         "Case": ["A", "A", "A", "B", "A", "A", "B", "A", "A"],
   .....:         "Data": np.random.randn(9),
   .....:     }
   .....: )
   .....: 

In [147]: dfs = list(
   .....:     zip(
   .....:         *df.groupby(
   .....:             (1 * (df["Case"] == "B"))
   .....:             .cumsum()
   .....:             .rolling(window=3, min_periods=1)
   .....:             .median()
   .....:         )
   .....:     )
   .....: )[-1]
   .....: 

In [148]: dfs[0]
Out[148]: 
  Case      Data
0    A  0.276232
1    A -1.087401
2    A -0.673690
3    B  0.113648

In [149]: dfs[1]
Out[149]: 
  Case      Data
4    A -1.478427
5    A  0.524988
6    B  0.404705

In [150]: dfs[2]
Out[150]: 
  Case      Data
7    A  0.577046
8    A -1.715002
```
Pivot
The Pivot docs.
```py
In [151]: df = pd.DataFrame(
   .....:     data={
   .....:         "Province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"],
   .....:         "City": [
   .....:             "Toronto",
   .....:             "Montreal",
   .....:             "Vancouver",
   .....:             "Calgary",
   .....:             "Edmonton",
   .....:             "Winnipeg",
   .....:             "Windsor",
   .....:         ],
   .....:         "Sales": [13, 6, 16, 8, 4, 3, 1],
   .....:     }
   .....: )
   .....: 

In [152]: table = pd.pivot_table(
   .....:     df,
   .....:     values=["Sales"],
   .....:     index=["Province"],
   .....:     columns=["City"],
   .....:     aggfunc="sum",
   .....:     margins=True,
   .....: )
   .....: 

In [153]: table.stack("City", future_stack=True)
Out[153]: 
                    Sales
Province City            
AL       Calgary      8.0
         Edmonton     4.0
         Montreal     NaN
         Toronto      NaN
         Vancouver    NaN
...                   ...
All      Toronto     13.0
         Vancouver   16.0
         Windsor      1.0
         Winnipeg     3.0
         All         51.0

[48 rows x 1 columns]
```
```py
In [154]: grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]

In [155]: df = pd.DataFrame(
   .....:     {
   .....:         "ID": ["x%d" % r for r in range(10)],
   .....:         "Gender": ["F", "M", "F", "M", "F", "M", "F", "M", "M", "M"],
   .....:         "ExamYear": [
   .....:             "2007",
   .....:             "2007",
   .....:             "2007",
   .....:             "2008",
   .....:             "2008",
   .....:             "2008",
   .....:             "2008",
   .....:             "2009",
   .....:             "2009",
   .....:             "2009",
   .....:         ],
   .....:         "Class": [
   .....:             "algebra",
   .....:             "stats",
   .....:             "bio",
   .....:             "algebra",
   .....:             "algebra",
   .....:             "stats",
   .....:             "stats",
   .....:             "algebra",
   .....:             "bio",
   .....:             "bio",
   .....:         ],
   .....:         "Participated": [
   .....:             "yes",
   .....:             "yes",
   .....:             "yes",
   .....:             "yes",
   .....:             "no",
   .....:             "yes",
   .....:             "yes",
   .....:             "yes",
   .....:             "yes",
   .....:             "yes",
   .....:         ],
   .....:         "Passed": ["yes" if x > 50 else "no" for x in grades],
   .....:         "Employed": [
   .....:             True,
   .....:             True,
   .....:             True,
   .....:             False,
   .....:             False,
   .....:             False,
   .....:             False,
   .....:             True,
   .....:             True,
   .....:             False,
   .....:         ],
   .....:         "Grade": grades,
   .....:     }
   .....: )
   .....: 

In [156]: df.groupby("ExamYear").agg(
   .....:     {
   .....:         "Participated": lambda x: x.value_counts()["yes"],
   .....:         "Passed": lambda x: sum(x == "yes"),
   .....:         "Employed": lambda x: sum(x),
   .....:         "Grade": lambda x: sum(x) / len(x),
   .....:     }
   .....: )
   .....: 
Out[156]: 
          Participated  Passed  Employed      Grade
ExamYear                                           
2007                 3       2         3  74.000000
2008                 3       3         0  68.500000
2009                 3       2         2  60.666667
```
To create a year and month cross tabulation:
```py
In [157]: df = pd.DataFrame(
   .....:     {"value": np.random.randn(36)},
   .....:     index=pd.date_range("2011-01-01", freq="ME", periods=36),
   .....: )
   .....: 

In [158]: pd.pivot_table(
   .....:     df, index=df.index.month, columns=df.index.year, values="value", aggfunc="sum"
   .....: )
   .....: 
Out[158]: 
        2011      2012      2013
1  -1.039268 -0.968914  2.565646
2  -0.370647 -1.294524  1.431256
3  -1.157892  0.413738  1.340309
4  -1.344312  0.276662 -1.170299
5   0.844885 -0.472035 -0.226169
6   1.075770 -0.013960  0.410835
7  -0.109050 -0.362543  0.813850
8   1.643563 -0.006154  0.132003
9  -1.469388 -0.923061 -0.827317
10  0.357021  0.895717 -0.076467
11 -0.674600  0.805244 -1.187678
12 -1.776904 -1.206412  1.130127
```
Apply
Rolling apply to organize - turning embedded lists into a MultiIndex frame
```py
In [159]: df = pd.DataFrame(
   .....:     data={
   .....:         "A": [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
   .....:         "B": [["a", "b", "c"], ["jj", "kk"], ["ccc"]],
   .....:     },
   .....:     index=["I", "II", "III"],
   .....: )
   .....: 

In [160]: def SeriesFromSubList(aList):
   .....:     return pd.Series(aList)
   .....: 

In [161]: df_orgz = pd.concat(
   .....:     {ind: row.apply(SeriesFromSubList) for ind, row in df.iterrows()}
   .....: )
   .....: 

In [162]: df_orgz
Out[162]: 
           0     1     2     3
I   A      2     4     8  16.0
    B      a     b     c   NaN
II  A    100   200   NaN   NaN
    B     jj    kk   NaN   NaN
III A   10.0  20.0  30.0   NaN
    B    ccc   NaN   NaN   NaN
```
Rolling apply to multiple columns, where the function calculates a Series before a scalar from that Series is returned
```py
In [163]: df = pd.DataFrame(
   .....:     data=np.random.randn(2000, 2) / 10000,
   .....:     index=pd.date_range("2001-01-01", periods=2000),
   .....:     columns=["A", "B"],
   .....: )
   .....: 

In [164]: df
Out[164]: 
                   A         B
2001-01-01 -0.000144 -0.000141
2001-01-02  0.000161  0.000102
2001-01-03  0.000057  0.000088
2001-01-04 -0.000221  0.000097
2001-01-05 -0.000201 -0.000041
...              ...       ...
2006-06-19  0.000040 -0.000235
2006-06-20 -0.000123 -0.000021
2006-06-21 -0.000113  0.000114
2006-06-22  0.000136  0.000109
2006-06-23  0.000027  0.000030

[2000 rows x 2 columns]

In [165]: def gm(df, const):
   .....:     v = ((((df["A"] + df["B"]) + 1).cumprod()) - 1) * const
   .....:     return v.iloc[-1]
   .....: 

In [166]: s = pd.Series(
   .....:     {
   .....:         df.index[i]: gm(df.iloc[i: min(i + 51, len(df) - 1)], 5)
   .....:         for i in range(len(df) - 50)
   .....:     }
   .....: )
   .....: 

In [167]: s
Out[167]: 
2001-01-01    0.000930
2001-01-02    0.002615
2001-01-03    0.001281
2001-01-04    0.001117
2001-01-05    0.002772
                ...   
2006-04-30    0.003296
2006-05-01    0.002629
2006-05-02    0.002081
2006-05-03    0.004247
2006-05-04    0.003928
Length: 1950, dtype: float64
```
Rolling apply to multiple columns, where the function returns a scalar (Volume Weighted Average Price)
```py
In [168]: rng = pd.date_range(start="2014-01-01", periods=100)

In [169]: df = pd.DataFrame(
   .....:     {
   .....:         "Open": np.random.randn(len(rng)),
   .....:         "Close": np.random.randn(len(rng)),
   .....:         "Volume": np.random.randint(100, 2000, len(rng)),
   .....:     },
   .....:     index=rng,
   .....: )
   .....: 

In [170]: df
Out[170]: 
                Open     Close  Volume
2014-01-01 -1.611353 -0.492885    1219
2014-01-02 -3.000951  0.445794    1054
2014-01-03 -0.138359 -0.076081    1381
2014-01-04  0.301568  1.198259    1253
2014-01-05  0.276381 -0.669831    1728
...              ...       ...     ...
2014-04-06 -0.040338  0.937843    1188
2014-04-07  0.359661 -0.285908    1864
2014-04-08  0.060978  1.714814     941
2014-04-09  1.759055 -0.455942    1065
2014-04-10  0.138185 -1.147008    1453

[100 rows x 3 columns]

In [171]: def vwap(bars):
   .....:     return (bars.Close * bars.Volume).sum() / bars.Volume.sum()
   .....: 

In [172]: window = 5

In [173]: s = pd.concat(
   .....:     [
   .....:         (pd.Series(vwap(df.iloc[i: i + window]), index=[df.index[i + window]]))
   .....:         for i in range(len(df) - window)
   .....:     ]
   .....: )
   .....: 

In [174]: s.round(2)
Out[174]: 
2014-01-06    0.02
2014-01-07    0.11
2014-01-08    0.10
2014-01-09    0.07
2014-01-10   -0.29
              ... 
2014-04-06   -0.63
2014-04-07   -0.02
2014-04-08   -0.03
2014-04-09    0.34
2014-04-10    0.29
Length: 95, dtype: float64
```
Timeseries
Turn a matrix with hours in columns and days in rows into a time series in the form of a continuous row sequence. How to rearrange a Python pandas DataFrame?
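The linked question carries no inline code; a minimal sketch of the rearrangement under assumed shapes (hour columns 0-23, one row per day; the names are ours, not from the original) stacks the hour columns and combines day and hour into one timestamp:

```python
import numpy as np
import pandas as pd

# Hypothetical matrix: one row per day, one column per hour of the day
days = pd.date_range("2014-01-01", periods=2, freq="D")
mat = pd.DataFrame(np.arange(48).reshape(2, 24), index=days, columns=range(24))

# Stack hour columns into rows, then merge (day, hour) into a single timestamp
ts = mat.stack()
ts.index = ts.index.map(lambda ix: ix[0] + pd.Timedelta(hours=ix[1]))
print(ts.head())
```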
Calculating the first date of the month for each entry in a DatetimeIndex
```py
In [175]: dates = pd.date_range("2000-01-01", periods=5)

In [176]: dates.to_period(freq="M").to_timestamp()
Out[176]: 
DatetimeIndex(['2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01',
               '2000-01-01'],
              dtype='datetime64[ns]', freq=None)
```
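An equivalent spelling (our own addition, not from the original recipe) replaces the day component of each timestamp directly, which can be easier to read for small indexes:

```python
import pandas as pd

# Dates that fall on different days of the same month
dates = pd.date_range("2000-01-29", periods=3, freq="D")
# Timestamp.replace(day=1) snaps each entry to the first of its month
first_days = pd.DatetimeIndex([d.replace(day=1) for d in dates])
print(first_days)
```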
Resampling
The Resample docs.
Using Grouper instead of TimeGrouper for time grouping of values
Valid frequency arguments to Grouper Timeseries
Using TimeGrouper and another grouping to create subgroups, then apply a custom function GH 3791
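The links above carry no inline code; a minimal sketch of time grouping with pd.Grouper combined with a second key (column names and frequencies are our own assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "when": pd.date_range("2014-10-07", periods=6, freq="12h"),
        "kind": ["a", "b"] * 3,
        "value": np.arange(6.0),
    }
)

# pd.Grouper replaces the deprecated TimeGrouper; key= points it at a column,
# and it can be mixed freely with ordinary grouping keys
out = df.groupby([pd.Grouper(key="when", freq="D"), "kind"])["value"].sum()
print(out)
```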
Resampling with a groupby
Merging
The Concat docs.
```py
In [177]: rng = pd.date_range("2000-01-01", periods=6)

In [178]: df1 = pd.DataFrame(np.random.randn(6, 3), index=rng, columns=["A", "B", "C"])

In [179]: df2 = df1.copy()
```
Depending on the df construction, ignore_index may be needed
```py
In [180]: df = pd.concat([df1, df2], ignore_index=True)

In [181]: df
Out[181]: 
           A         B         C
0  -0.870117 -0.479265 -0.790855
1   0.144817  1.726395 -0.464535
2  -0.821906  1.597605  0.187307
3  -0.128342 -1.511638 -0.289858
4   0.399194 -1.430030 -0.639760
5   1.115116 -2.012600  1.810662
6  -0.870117 -0.479265 -0.790855
7   0.144817  1.726395 -0.464535
8  -0.821906  1.597605  0.187307
9  -0.128342 -1.511638 -0.289858
10  0.399194 -1.430030 -0.639760
11  1.115116 -2.012600  1.810662
```
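If you need to remember which frame each row came from rather than renumbering, an alternative sketch (our own addition) passes keys= to concat, which builds an extra index level naming the source:

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=3)
df1 = pd.DataFrame(np.ones((3, 2)), index=rng, columns=["A", "B"])
df2 = df1.copy()

# keys= labels each input frame in an outer MultiIndex level
df = pd.concat([df1, df2], keys=["first", "second"])
print(df.loc["second"].shape)  # (3, 2)
```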
Self Join of a DataFrame GH 2996
```py
In [182]: df = pd.DataFrame(
   .....:     data={
   .....:         "Area": ["A"] * 5 + ["C"] * 2,
   .....:         "Bins": [110] * 2 + [160] * 3 + [40] * 2,
   .....:         "Test_0": [0, 1, 0, 1, 2, 0, 1],
   .....:         "Data": np.random.randn(7),
   .....:     }
   .....: )
   .....: 

In [183]: df
Out[183]: 
  Area  Bins  Test_0      Data
0    A   110       0 -0.433937
1    A   110       1 -0.160552
2    A   160       0  0.744434
3    A   160       1  1.754213
4    A   160       2  0.000850
5    C    40       0  0.342243
6    C    40       1  1.070599

In [184]: df["Test_1"] = df["Test_0"] - 1

In [185]: pd.merge(
   .....:     df,
   .....:     df,
   .....:     left_on=["Bins", "Area", "Test_0"],
   .....:     right_on=["Bins", "Area", "Test_1"],
   .....:     suffixes=("_L", "_R"),
   .....: )
   .....: 
Out[185]: 
  Area  Bins  Test_0_L    Data_L  Test_1_L  Test_0_R    Data_R  Test_1_R
0    A   110         0 -0.433937        -1         1 -0.160552         0
1    A   160         0  0.744434        -1         1  1.754213         0
2    A   160         1  1.754213         0         2  0.000850         1
3    C    40         0  0.342243        -1         1  1.070599         0
```
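The range-based merge recipe below is only linked, with no inline code; a rough sketch of the idea (the table layout and names are our own assumptions) uses np.searchsorted to assign each value to a bin of sorted range edges before attaching the matching row:

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: sorted lower edges of ranges, one label per range
ranges = pd.DataFrame({"low": [0, 10, 20], "label": ["small", "medium", "large"]})
data = pd.DataFrame({"value": [3, 12, 25, 18]})

# For each value, find the last range edge it is >= to
idx = np.searchsorted(ranges["low"].to_numpy(), data["value"].to_numpy(), side="right") - 1
data["label"] = ranges["label"].to_numpy()[idx]
print(data)
```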
Using searchsorted to merge based on values inside a range
Plotting
The Plotting docs.
Generate embedded plots in Excel files using Pandas, Vincent and xlsxwriter
```py
In [186]: df = pd.DataFrame(
   .....:     {
   .....:         "stratifying_var": np.random.uniform(0, 100, 20),
   .....:         "price": np.random.normal(100, 5, 20),
   .....:     }
   .....: )
   .....: 

In [187]: df["quartiles"] = pd.qcut(
   .....:     df["stratifying_var"], 4, labels=["0-25%", "25-50%", "50-75%", "75-100%"]
   .....: )
   .....: 

In [188]: df.boxplot(column="price", by="quartiles")
Out[188]: <Axes: title={'center': 'price'}, xlabel='quartiles'>
```