Pandas 2.2 中文官方教程和指南(二十五·二)(2)

简介: Pandas 2.2 中文官方教程和指南(二十五·二)

Pandas 2.2 中文官方教程和指南(二十五·二)(1)https://developer.aliyun.com/article/1509409

按聚合数据对组进行排序

In [120]: df = pd.DataFrame(
 .....:    {
 .....:        "code": ["foo", "bar", "baz"] * 2,
 .....:        "data": [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
 .....:        "flag": [False, True] * 3,
 .....:    }
 .....: )
 .....: 
In [121]: code_groups = df.groupby("code")
In [122]: agg_n_sort_order = code_groups[["data"]].transform("sum").sort_values(by="data")
In [123]: sorted_df = df.loc[agg_n_sort_order.index]
In [124]: sorted_df
Out[124]: 
 code  data   flag
1  bar -0.21   True
4  bar -0.59  False
0  foo  0.16  False
3  foo  0.45   True
2  baz  0.33  False
5  baz  0.62   True 

创建多个聚合列

In [125]: rng = pd.date_range(start="2014-10-07", periods=10, freq="2min")
In [126]: ts = pd.Series(data=list(range(10)), index=rng)
In [127]: def MyCust(x):
 .....:    if len(x) > 2:
 .....:        return x.iloc[1] * 1.234
 .....:    return pd.NaT
 .....: 
In [128]: mhc = {"Mean": "mean", "Max": "max", "Custom": MyCust}
In [129]: ts.resample("5min").apply(mhc)
Out[129]: 
 Mean  Max Custom
2014-10-07 00:00:00   1.0    2  1.234
2014-10-07 00:05:00   3.5    4    NaT
2014-10-07 00:10:00   6.0    7  7.404
2014-10-07 00:15:00   8.5    9    NaT
In [130]: ts
Out[130]: 
2014-10-07 00:00:00    0
2014-10-07 00:02:00    1
2014-10-07 00:04:00    2
2014-10-07 00:06:00    3
2014-10-07 00:08:00    4
2014-10-07 00:10:00    5
2014-10-07 00:12:00    6
2014-10-07 00:14:00    7
2014-10-07 00:16:00    8
2014-10-07 00:18:00    9
Freq: 2min, dtype: int64 

创建一个值计数列并重新分配回数据框

In [131]: df = pd.DataFrame(
 .....:    {"Color": "Red Red Red Blue".split(), "Value": [100, 150, 50, 50]}
 .....: )
 .....: 
In [132]: df
Out[132]: 
 Color  Value
0   Red    100
1   Red    150
2   Red     50
3  Blue     50
In [133]: df["Counts"] = df.groupby(["Color"]).transform(len)
In [134]: df
Out[134]: 
 Color  Value  Counts
0   Red    100       3
1   Red    150       3
2   Red     50       3
3  Blue     50       1 

基于索引将列值的组移位

In [135]: df = pd.DataFrame(
 .....:    {"line_race": [10, 10, 8, 10, 10, 8], "beyer": [99, 102, 103, 103, 88, 100]},
 .....:    index=[
 .....:        "Last Gunfighter",
 .....:        "Last Gunfighter",
 .....:        "Last Gunfighter",
 .....:        "Paynter",
 .....:        "Paynter",
 .....:        "Paynter",
 .....:    ],
 .....: )
 .....: 
In [136]: df
Out[136]: 
 line_race  beyer
Last Gunfighter         10     99
Last Gunfighter         10    102
Last Gunfighter          8    103
Paynter                 10    103
Paynter                 10     88
Paynter                  8    100
In [137]: df["beyer_shifted"] = df.groupby(level=0)["beyer"].shift(1)
In [138]: df
Out[138]: 
 line_race  beyer  beyer_shifted
Last Gunfighter         10     99            NaN
Last Gunfighter         10    102           99.0
Last Gunfighter          8    103          102.0
Paynter                 10    103            NaN
Paynter                 10     88          103.0
Paynter                  8    100           88.0 

每个组中选择具有最大值的行

In [139]: df = pd.DataFrame(
 .....:    {
 .....:        "host": ["other", "other", "that", "this", "this"],
 .....:        "service": ["mail", "web", "mail", "mail", "web"],
 .....:        "no": [1, 2, 1, 2, 1],
 .....:    }
 .....: ).set_index(["host", "service"])
 .....: 
In [140]: mask = df.groupby(level=0).agg("idxmax")
In [141]: df_count = df.loc[mask["no"]].reset_index()
In [142]: df_count
Out[142]: 
 host service  no
0  other     web   2
1   that    mail   1
2   this    mail   2 

类似于 Python 的 itertools.groupby 的分组

In [143]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=["A"])
In [144]: df["A"].groupby((df["A"] != df["A"].shift()).cumsum()).groups
Out[144]: {1: [0], 2: [1], 3: [2], 4: [3, 4, 5], 5: [6], 6: [7, 8]}
In [145]: df["A"].groupby((df["A"] != df["A"].shift()).cumsum()).cumsum()
Out[145]: 
0    0
1    1
2    0
3    1
4    2
5    3
6    0
7    1
8    2
Name: A, dtype: int64 

扩展数据

对齐和截止日期

基于值而不是计数的滚动计算窗口

时间间隔滚动均值

分割

拆分框架

创建一个数据框列表,根据包含在行中的逻辑进行分割。

In [146]: df = pd.DataFrame(
 .....:    data={
 .....:        "Case": ["A", "A", "A", "B", "A", "A", "B", "A", "A"],
 .....:        "Data": np.random.randn(9),
 .....:    }
 .....: )
 .....: 
In [147]: dfs = list(
 .....:    zip(
 .....:        *df.groupby(
 .....:            (1 * (df["Case"] == "B"))
 .....:            .cumsum()
 .....:            .rolling(window=3, min_periods=1)
 .....:            .median()
 .....:        )
 .....:    )
 .....: )[-1]
 .....: 
In [148]: dfs[0]
Out[148]: 
 Case      Data
0    A  0.276232
1    A -1.087401
2    A -0.673690
3    B  0.113648
In [149]: dfs[1]
Out[149]: 
 Case      Data
4    A -1.478427
5    A  0.524988
6    B  0.404705
In [150]: dfs[2]
Out[150]: 
 Case      Data
7    A  0.577046
8    A -1.715002 

透视表

���视表 文档。

部分总和和小计

In [151]: df = pd.DataFrame(
 .....:    data={
 .....:        "Province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"],
 .....:        "City": [
 .....:            "Toronto",
 .....:            "Montreal",
 .....:            "Vancouver",
 .....:            "Calgary",
 .....:            "Edmonton",
 .....:            "Winnipeg",
 .....:            "Windsor",
 .....:        ],
 .....:        "Sales": [13, 6, 16, 8, 4, 3, 1],
 .....:    }
 .....: )
 .....: 
In [152]: table = pd.pivot_table(
 .....:    df,
 .....:    values=["Sales"],
 .....:    index=["Province"],
 .....:    columns=["City"],
 .....:    aggfunc="sum",
 .....:    margins=True,
 .....: )
 .....: 
In [153]: table.stack("City", future_stack=True)
Out[153]: 
 Sales
Province City 
AL       Calgary      8.0
 Edmonton     4.0
 Montreal     NaN
 Toronto      NaN
 Vancouver    NaN
...                   ...
All      Toronto     13.0
 Vancouver   16.0
 Windsor      1.0
 Winnipeg     3.0
 All         51.0
[48 rows x 1 columns] 

类似于 R 中的 plyr 的频率表

In [154]: grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
In [155]: df = pd.DataFrame(
 .....:    {
 .....:        "ID": ["x%d" % r for r in range(10)],
 .....:        "Gender": ["F", "M", "F", "M", "F", "M", "F", "M", "M", "M"],
 .....:        "ExamYear": [
 .....:            "2007",
 .....:            "2007",
 .....:            "2007",
 .....:            "2008",
 .....:            "2008",
 .....:            "2008",
 .....:            "2008",
 .....:            "2009",
 .....:            "2009",
 .....:            "2009",
 .....:        ],
 .....:        "Class": [
 .....:            "algebra",
 .....:            "stats",
 .....:            "bio",
 .....:            "algebra",
 .....:            "algebra",
 .....:            "stats",
 .....:            "stats",
 .....:            "algebra",
 .....:            "bio",
 .....:            "bio",
 .....:        ],
 .....:        "Participated": [
 .....:            "yes",
 .....:            "yes",
 .....:            "yes",
 .....:            "yes",
 .....:            "no",
 .....:            "yes",
 .....:            "yes",
 .....:            "yes",
 .....:            "yes",
 .....:            "yes",
 .....:        ],
 .....:        "Passed": ["yes" if x > 50 else "no" for x in grades],
 .....:        "Employed": [
 .....:            True,
 .....:            True,
 .....:            True,
 .....:            False,
 .....:            False,
 .....:            False,
 .....:            False,
 .....:            True,
 .....:            True,
 .....:            False,
 .....:        ],
 .....:        "Grade": grades,
 .....:    }
 .....: )
 .....: 
In [156]: df.groupby("ExamYear").agg(
 .....:    {
 .....:        "Participated": lambda x: x.value_counts()["yes"],
 .....:        "Passed": lambda x: sum(x == "yes"),
 .....:        "Employed": lambda x: sum(x),
 .....:        "Grade": lambda x: sum(x) / len(x),
 .....:    }
 .....: )
 .....: 
Out[156]: 
 Participated  Passed  Employed      Grade
ExamYear 
2007                 3       2         3  74.000000
2008                 3       3         0  68.500000
2009                 3       2         2  60.666667 

使用年度数据绘制 pandas 数据框

创建年份和月份交叉表:

In [157]: df = pd.DataFrame(
 .....:    {"value": np.random.randn(36)},
 .....:    index=pd.date_range("2011-01-01", freq="ME", periods=36),
 .....: )
 .....: 
In [158]: pd.pivot_table(
 .....:    df, index=df.index.month, columns=df.index.year, values="value", aggfunc="sum"
 .....: )
 .....: 
Out[158]: 
 2011      2012      2013
1  -1.039268 -0.968914  2.565646
2  -0.370647 -1.294524  1.431256
3  -1.157892  0.413738  1.340309
4  -1.344312  0.276662 -1.170299
5   0.844885 -0.472035 -0.226169
6   1.075770 -0.013960  0.410835
7  -0.109050 -0.362543  0.813850
8   1.643563 -0.006154  0.132003
9  -1.469388 -0.923061 -0.827317
10  0.357021  0.895717 -0.076467
11 -0.674600  0.805244 -1.187678
12 -1.776904 -1.206412  1.130127 

应用

滚动应用以组织 - 将嵌套列表转换为多索引框架

In [159]: df = pd.DataFrame(
 .....:    data={
 .....:        "A": [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
 .....:        "B": [["a", "b", "c"], ["jj", "kk"], ["ccc"]],
 .....:    },
 .....:    index=["I", "II", "III"],
 .....: )
 .....: 
In [160]: def SeriesFromSubList(aList):
 .....:    return pd.Series(aList)
 .....: 
In [161]: df_orgz = pd.concat(
 .....:    {ind: row.apply(SeriesFromSubList) for ind, row in df.iterrows()}
 .....: )
 .....: 
In [162]: df_orgz
Out[162]: 
 0     1     2     3
I   A    2     4     8  16.0
 B    a     b     c   NaN
II  A  100   200   NaN   NaN
 B   jj    kk   NaN   NaN
III A   10  20.0  30.0   NaN
 B  ccc   NaN   NaN   NaN

使用返回系列的数据框的滚动应用

滚动应用到多列,其中函数在返回系列之前计算系列的标量

In [163]: df = pd.DataFrame(
 .....:    data=np.random.randn(2000, 2) / 10000,
 .....:    index=pd.date_range("2001-01-01", periods=2000),
 .....:    columns=["A", "B"],
 .....: )
 .....: 
In [164]: df
Out[164]: 
 A         B
2001-01-01 -0.000144 -0.000141
2001-01-02  0.000161  0.000102
2001-01-03  0.000057  0.000088
2001-01-04 -0.000221  0.000097
2001-01-05 -0.000201 -0.000041
...              ...       ...
2006-06-19  0.000040 -0.000235
2006-06-20 -0.000123 -0.000021
2006-06-21 -0.000113  0.000114
2006-06-22  0.000136  0.000109
2006-06-23  0.000027  0.000030
[2000 rows x 2 columns]
In [165]: def gm(df, const):
 .....:    v = ((((df["A"] + df["B"]) + 1).cumprod()) - 1) * const
 .....:    return v.iloc[-1]
 .....: 
In [166]: s = pd.Series(
 .....:    {
 .....:        df.index[i]: gm(df.iloc[i: min(i + 51, len(df) - 1)], 5)
 .....:        for i in range(len(df) - 50)
 .....:    }
 .....: )
 .....: 
In [167]: s
Out[167]: 
2001-01-01    0.000930
2001-01-02    0.002615
2001-01-03    0.001281
2001-01-04    0.001117
2001-01-05    0.002772
 ... 
2006-04-30    0.003296
2006-05-01    0.002629
2006-05-02    0.002081
2006-05-03    0.004247
2006-05-04    0.003928
Length: 1950, dtype: float64 

使用 DataFrame 返回标量的滚动应用

滚动应用于多列,其中函数返回标量(成交量加权平均价格)

In [168]: rng = pd.date_range(start="2014-01-01", periods=100)
In [169]: df = pd.DataFrame(
 .....:    {
 .....:        "Open": np.random.randn(len(rng)),
 .....:        "Close": np.random.randn(len(rng)),
 .....:        "Volume": np.random.randint(100, 2000, len(rng)),
 .....:    },
 .....:    index=rng,
 .....: )
 .....: 
In [170]: df
Out[170]: 
 Open     Close  Volume
2014-01-01 -1.611353 -0.492885    1219
2014-01-02 -3.000951  0.445794    1054
2014-01-03 -0.138359 -0.076081    1381
2014-01-04  0.301568  1.198259    1253
2014-01-05  0.276381 -0.669831    1728
...              ...       ...     ...
2014-04-06 -0.040338  0.937843    1188
2014-04-07  0.359661 -0.285908    1864
2014-04-08  0.060978  1.714814     941
2014-04-09  1.759055 -0.455942    1065
2014-04-10  0.138185 -1.147008    1453
[100 rows x 3 columns]
In [171]: def vwap(bars):
 .....:    return (bars.Close * bars.Volume).sum() / bars.Volume.sum()
 .....: 
In [172]: window = 5
In [173]: s = pd.concat(
 .....:    [
 .....:        (pd.Series(vwap(df.iloc[i: i + window]), index=[df.index[i + window]]))
 .....:        for i in range(len(df) - window)
 .....:    ]
 .....: )
 .....: 
In [174]: s.round(2)
Out[174]: 
2014-01-06    0.02
2014-01-07    0.11
2014-01-08    0.10
2014-01-09    0.07
2014-01-10   -0.29
 ... 
2014-04-06   -0.63
2014-04-07   -0.02
2014-04-08   -0.03
2014-04-09    0.34
2014-04-10    0.29
Length: 95, dtype: float64 

扩展数据

对齐和截止日期

基于值而不是计数的滚动计算窗口

按时间间隔计算滚动均值

分割

分割一个框架

创建一个数据框列表,根据行中包含的逻辑进行分割。

In [146]: df = pd.DataFrame(
 .....:    data={
 .....:        "Case": ["A", "A", "A", "B", "A", "A", "B", "A", "A"],
 .....:        "Data": np.random.randn(9),
 .....:    }
 .....: )
 .....: 
In [147]: dfs = list(
 .....:    zip(
 .....:        *df.groupby(
 .....:            (1 * (df["Case"] == "B"))
 .....:            .cumsum()
 .....:            .rolling(window=3, min_periods=1)
 .....:            .median()
 .....:        )
 .....:    )
 .....: )[-1]
 .....: 
In [148]: dfs[0]
Out[148]: 
 Case      Data
0    A  0.276232
1    A -1.087401
2    A -0.673690
3    B  0.113648
In [149]: dfs[1]
Out[149]: 
 Case      Data
4    A -1.478427
5    A  0.524988
6    B  0.404705
In [150]: dfs[2]
Out[150]: 
 Case      Data
7    A  0.577046
8    A -1.715002 

Pandas 2.2 中文官方教程和指南(二十五·二)(3)https://developer.aliyun.com/article/1509411

相关文章
|
SQL 数据采集 数据挖掘
Pandas 教程
10月更文挑战第25天
238 2
|
存储 JSON 数据格式
Pandas 使用教程 CSV - CSV 转 JSON
Pandas 使用教程 CSV - CSV 转 JSON
147 0
|
JSON 数据格式 Python
Pandas 使用教程 JSON
Pandas 使用教程 JSON
189 0
|
SQL 数据采集 JSON
Pandas 使用教程 Series、DataFrame
Pandas 使用教程 Series、DataFrame
275 0
|
数据采集 存储 数据可视化
Pandas高级教程:数据清洗、转换与分析
Pandas是Python的数据分析库,提供Series和DataFrame数据结构及数据分析工具,便于数据清洗、转换和分析。本教程涵盖Pandas在数据清洗(如缺失值、重复值和异常值处理)、转换(数据类型转换和重塑)和分析(如描述性统计、分组聚合和可视化)的应用。通过学习Pandas,用户能更高效地处理和理解数据,为数据分析任务打下基础。
1408 3
|
索引 Python
Pandas 2.2 中文官方教程和指南(一)(4)
Pandas 2.2 中文官方教程和指南(一)
169 0
|
存储 SQL JSON
Pandas 2.2 中文官方教程和指南(一)(3)
Pandas 2.2 中文官方教程和指南(一)
227 0
|
XML 关系型数据库 PostgreSQL
Pandas 2.2 中文官方教程和指南(一)(2)
Pandas 2.2 中文官方教程和指南(一)
477 0
|
XML 关系型数据库 MySQL
Pandas 2.2 中文官方教程和指南(一)(1)
Pandas 2.2 中文官方教程和指南(一)
707 0
|
C++ 索引 Python
Pandas 2.2 中文官方教程和指南(五)(4)
Pandas 2.2 中文官方教程和指南(五)
128 0

热门文章

最新文章