任何groupby操作都会涉及到下面的三个操作之一:
- Splitting:分割数据
- Applying:应用一个函数
- Combining:合并结果
在许多情况下,我们将数据分成几组,并在每个子集上应用一些功能。在应用中,我们可以执行以下操作:
- Aggregation :计算一些摘要统计
- Transformation :执行一些特定组的操作
- Filtration:根据某些条件下丢弃数据
1 加载数据
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data)
print(df)
Team Rank Year Points 0 Riders 1 2014 876 1 Riders 2 2015 789 2 Devils 2 2014 863 3 Devils 3 2015 673 4 Kings 3 2014 741 5 kings 4 2015 812 6 Kings 1 2016 756 7 Kings 1 2017 788 8 Riders 2 2016 694 9 Royals 4 2014 701 10 Royals 1 2015 804 11 Riders 2 2017 690
2 数据分组
Pandas对象可以拆分为任何对象。分割对象的方法有多种:
- obj.groupby('key')
- obj.groupby(['key1','key2'])
- obj.groupby(key,axis=1)
现在让我们看看如何将分组对象应用于DataFrame对象
2.1 根据某一列分组
df.groupby('Team')
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001B33FFA0DA0>
# 查看分组 df.groupby('Team').groups
{'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
2.2 根绝多列进行分组
df.groupby(['Team','Year']).groups
{('Devils', 2014): Int64Index([2], dtype='int64'), ('Devils', 2015): Int64Index([3], dtype='int64'), ('Kings', 2014): Int64Index([4], dtype='int64'), ('Kings', 2016): Int64Index([6], dtype='int64'), ('Kings', 2017): Int64Index([7], dtype='int64'), ('Riders', 2014): Int64Index([0], dtype='int64'), ('Riders', 2015): Int64Index([1], dtype='int64'), ('Riders', 2016): Int64Index([8], dtype='int64'), ('Riders', 2017): Int64Index([11], dtype='int64'), ('Royals', 2014): Int64Index([9], dtype='int64'), ('Royals', 2015): Int64Index([10], dtype='int64'), ('kings', 2015): Int64Index([5], dtype='int64')}
2.3 遍历分组
grouped = df.groupby('Team') for name,group in grouped: print(name) print(group)
Devils Team Rank Year Points 2 Devils 2 2014 863 3 Devils 3 2015 673 Kings Team Rank Year Points 4 Kings 3 2014 741 6 Kings 1 2016 756 7 Kings 1 2017 788 Riders Team Rank Year Points 0 Riders 1 2014 876 1 Riders 2 2015 789 8 Riders 2 2016 694 11 Riders 2 2017 690 Royals Team Rank Year Points 9 Royals 4 2014 701 10 Royals 1 2015 804 kings Team Rank Year Points 5 kings 4 2015 812
默认情况下,groupby对象标签名称与组名称相同,看下面的例子就清楚了
2.4 选取某一个分组
使用get_group()方法,我们可以选择一个组。
grouped = df.groupby('Year') print(grouped.get_group(2014))
Team Rank Year Points 0 Riders 1 2014 876 2 Devils 2 2014 863 4 Kings 3 2014 741 9 Royals 4 2014 701
3 Aggregations(聚合)
聚合函数返回每个组的单个聚合值。一旦创建了group by对象,就可以对分组数据执行多个聚合操作。
3.1 常见的是通过agg方法来实现aggregation
grouped = df.groupby('Year') print(grouped['Points'].agg(np.mean))
Year 2014 795.25 2015 769.50 2016 725.00 2017 739.00 Name: Points, dtype: float64
3.2 查看每个组大小的另一种方法是应用size()函数
grouped = df.groupby('Team') print(grouped.agg(np.size))
Rank Year Points Team Devils 2 2 2 Kings 3 3 3 Riders 4 4 4 Royals 2 2 2 kings 1 1 1
3.3 一次应用多个聚合函数
grouped = df.groupby('Team') print(grouped['Points'].agg([np.sum, np.mean, np.std]))
sum mean std Team Devils 1536 768.000000 134.350288 Kings 2285 761.666667 24.006943 Riders 3049 762.250000 88.567771 Royals 1505 752.500000 72.831998 kings 812 812.000000 NaN
4 Transformations
对组或列的转换将返回一个对象,该对象的索引大小与正在分组的对象的大小相同。因此,转换返回与组块大小相同的结果。
grouped = df.groupby('Team') score = lambda x: (x - x.mean()) / x.std()*10 print(grouped.transform(score))
Rank Year Points 0 -15.000000 -11.618950 12.843272 1 5.000000 -3.872983 3.020286 2 -7.071068 -7.071068 7.071068 3 7.071068 7.071068 -7.071068 4 11.547005 -10.910895 -8.608621 5 NaN NaN NaN 6 -5.773503 2.182179 -2.360428 7 -5.773503 8.728716 10.969049 8 5.000000 3.872983 -7.705963 9 7.071068 -7.071068 -7.071068 10 -7.071068 7.071068 7.071068 11 5.000000 11.618950 -8.157595
5 Filtration
过滤数据
print(df.groupby('Team').filter(lambda x: len(x) >= 3))
Team Rank Year Points 0 Riders 1 2014 876 1 Riders 2 2015 789 4 Kings 3 2014 741 6 Kings 1 2016 756 7 Kings 1 2017 788 8 Riders 2 2016 694 11 Riders 2 2017 690
6 参考
https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm