常见聚合方法
方法 |
说明 |
count |
计数 |
describe |
给出各列的常用统计量 |
min,max |
最大最小值 |
argmin,argmax |
最大最小值的索引位置(整数) |
idxmin,idxmax |
最大最小值的索引值 |
quantile |
计算样本分位数 |
sum,mean |
对列求和,均值 |
mediam |
中位数 |
mad |
根据平均值计算平均绝对离差 |
var,std |
方差,标准差 |
skew |
偏度(三阶矩) |
Kurt |
峰度(四阶矩) |
cumsum |
累积和 |
Cummins,cummax |
累计组大致和累计最小值 |
cumprod |
累计积 |
diff |
一阶差分 |
pct_change |
计算百分数变化 |
1 清洗无效数据
df[df.isnull()]
df[df.notnull()]
df.dropna()
df.dropna(axis=1,thresh=3)
df.dropna(how='ALL')
df.dropna()与data[data.notnull()]
df3.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
df.fillna({1:0, 2:0.5})
df.fillna(method='ffill')
2 drop函数使用
drop函数的使用:删除行、删除列
print frame.drop(['a'])
print frame.drop(['Ohio'], axis = 1)
drop函数默认删除行,列需要加axis = 1
drop函数的使用:inplace参数
采用drop方法,有下面三种等价的表达式:
1. DF= DF.drop('column_name', axis=1);
2. DF.drop('column_name',axis=1, inplace=True)
3. DF.drop([DF.columns[[0,1, 3]]], axis=1, inplace=True)
注意:凡是会对原数组作出修改并返回一个新数组的,往往都有一个 inplace可选参数。如果手动设定为True(默认为False),那么原数组直接就被替换。也就是说,采用inplace=True之后,原数组名(如2和3情况所示)对应的内存值直接改变;
而采用inplace=False之后,原数组名对应的内存值并不改变,需要将新的结果赋给一个新的数组或者覆盖原数组的内存位置(如1情况所示)。
drop函数的使用:数据类型转换
df['Name'] = df['Name'].astype(np.datetime64)
DataFrame.astype() 方法可对整个DataFrame或某一列进行数据格式转换,支持Python和NumPy的数据类型。
3 pandas数据处理方法
(1) 删除重复数据
df.duplicated() 返回boolean列表,重复为True
df.drop_duplicates() 删除重复元素即值为True的列行
参数列表
- subset : column label or sequence of labels, optional
用来指定特定的列,默认所有列
- keep : {‘first’, ‘last’, False}, default ‘first’
删除重复项并保留第一次出现的项
- inplace : boolean, default False
是直接在原来数据上修改还是保留一个副本
data.duplicated()`
data.drop_duplicated()
data.drop_duplicated(['key1'])
df = DataFrame({'color':['white','white','red','red','white'],
'value':[2,1,3,3,2]})
display(df,df.duplicated(),df.drop_duplicates())
color value
0 white 2
1 white 1
2 red 3
3 red 3
4 white 2
0 False
1 False
2 False
3 True
4 True
dtype: bool
color value
0 white 2
1 white 1
2 red 3
(2) 映射
1 replace() 替换元素 replace({索引键值对})
df = DataFrame({'item':['ball','mug','pen'],
'color':['white','rosso','verde'],
'price':[5.56,4.20,1.30]})
newcolors = {'rosso':'red','verde':'green'}
display(df,df.replace(newcolors))
color item price
0 white ball 5.56
1 rosso mug 4.20
2 verde pen 1.30
color item price
0 white ball 5.56
1 red mug 4.20
2 green pen 1.30
2.replace还经常用来替换NaN元素
df2 = DataFrame({'math':[100,139,np.nan],'English':[146,None,119]},index = ['张三','李四','Tom'])
newvalues = {np.nan:100}
display(df2,df2.replace(newvalues))
English math
张三 146.0 100.0
李四 NaN 139.0
Tom 119.0 NaN
English math
张三 146.0 100.0
李四 100.0 139.0
Tom 119.0 100.0
2 map()函数:新建一列
map(函数,可迭代对象) map(函数/{索引键值对})
map中返回的数据是一个具体值,不能迭代.
df3 = DataFrame({'color':['red','green','blue'],'project':['math','english','chemistry']})
price = {'red':5.56,'green':3.14,'chemistry':2.79}
df3['price'] = df3['color'].map(price)
display(df3)
color project price
0 red math 5.56
1 green english 3.14
2 blue chemistry NaN
df3 = DataFrame({'zs':[129,130,34],'ls':[136,98,8]},index = ['张三','李四','倩倩'])
display(df3)
display(df3['zs'].map({129:'你好',130:'非常好',34:'不错'}))
display(df3['zs'].map({129:120}))
def mapscore(score):
if score<90:
return 'failed'
elif score>120:
return 'excellent'
else:
return 'pass'
df3['status'] = ddd['zs'].map(mapscore)
df3
输出:
zs ls
张三 129 136
李四 130 98
倩倩 34 8
张三 你好
李四 非常好
倩倩 不错
Name: zs, dtype: object
张三 120.0
李四 NaN
倩倩 NaN
Name: zs, dtype: float64
Out[96]:
ls zs status
张三 136 129 excellent
李四 98 130 excellent
倩倩 8 34 failed
3 rename()函数:替换索引 rename({索引键值对})
df4 = DataFrame({'color':['white','gray','purple','blue','green'],'value':np.random.randint(10,size = 5)})
new_index = {0:'first',1:'two',2:'three',3:'four',4:'five'}
display(df4,df4.rename(new_index))
color value
0 white 2
1 gray 0
2 purple 9
3 blue 2
4 green 0
color value
first white 2
two gray 0
three purple 9
four blue 2
five green 0
(3) 异常值检测与过滤
1 使用describe()函数查看每一列的描述性统计量
df = DataFrame(np.random.randint(10,size = 10))
display(df.describe())
0
count 10.000000
mean 5.900000
std 2.685351
min 1.000000
25% 6.000000
50% 7.000000
75% 7.750000
max 8.000000
2 使用std()函数可以求得DataFrame对象每一列的标准差
df.std()
0 3.306559
dtype: float64
3 根据每一列的标准差,对DataFrame元素进行过滤。
借助any()函数,对每一列应用筛选条件,any过滤出所有符合条件的数据
display(df[(df>df.std()*3).any(axis = 1)])
df.drop(df[(np.abs(df) > (3*df.std())).any(axis=1)].index,inplace=True)
display(df,df.shape)
输出:
0 1
2 7 9
6 8 8
9 8 1
0 1
0 5 0
1 3 3
3 3 5
4 2 4
5 7 6
7 1 6
8 7 7
(7, 2)
(4) 排序
使用take()函数排序
可以借助np.random.permutation()函数随机排序
df5 = DataFrame(np.arange(25).reshape(5,5))
new_order = np.random.permutation(5)
display(new_order)
display(df5,df5.take(new_order))
array([4, 2, 3, 1, 0])
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
0 1 2 3 4
4 20 21 22 23 24
2 10 11 12 13 14
3 15 16 17 18 19
1 5 6 7 8 9
0 0 1 2 3 4
(5) 数据分类分组
groupby()函数
import pandas as pd
df = pd.DataFrame([{'col1':'a', 'col2':1, 'col3':'aa'}, {'col1':'b', 'col2':2, 'col3':'bb'}, {'col1':'c', 'col2':3, 'col3':'cc'}, {'col1':'a', 'col2':44, 'col3':'aa'}])
display(df)
display(df.groupby(by='col1').agg({'col2':sum}).reset_index())
display(df.groupby(by='col1').agg({'col2':['max', 'min']}).reset_index())
display(df.groupby(by=['col1', 'col3']).agg({'col2':sum}).reset_index())
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from datetime import datetime
'''
分组groupby
'''
df=pd.DataFrame({'key1':['a','a','b','b','a'],
'key2':['one','two','one','two','one'],
'data1':np.arange(5),
'data2':np.arange(5)})
print(df)
'''
根据分组进行计算
'''
grouped=df['data1'].groupby(df['key1'])
print(grouped.mean())
groupedmean=df['data1'].groupby([df['key1'],df['key2']]).mean()
print(groupedmean)
print(groupedmean.unstack())
df['key1']
states=np.array(['Oh','Ca','Ca','Oh','Oh'])
years=np.array([2005,2005,2006,2005,2006])
print(df['data1'].groupby([states,years]).mean())
print(df.groupby('key1').mean())
print(df.groupby(['key1','key2']).mean())
print(df.groupby(['key1','key2']).size())
'''
对分组信息进行迭代
'''
for name,group in df.groupby('key1'):
print(name)
print(group)
for (k1,k2),group in df.groupby(['key1','key2']):
print(name)
print(group)
'''
选取一个或一组列,返回的Series的分组对象
'''
print(df.groupby(df['key1'])['data1'])
print(df.groupby(['key1'])[['data1','data2']].mean())
print(df.groupby(['key1','key2'])['data1'].mean())
'''
通过函数进行分组
'''
print(people.groupby(len,axis=1).sum())
key_list=['one','one','one','two','two']
print(people.groupby([len,key_list],axis=1).min())
'''
根据索引级别分组
'''
columns=pd.MultiIndex.from_arrays([['US',"US",'US','JP','JP'],[1,3,5,1,3]],names=['cty','tenor'])
hier_df=pd.DataFrame(np.random.randn(4,5),columns=columns)
print(hier_df)
print(hier_df.groupby(level='cty',axis=1).count())
(6) 高级数据聚合
1 可以使用pd.merge()函数包聚合操作的计算结果添加到df的每一行
d1={'item':['luobo','baicai','lajiao','donggua','luobo','baicai','lajiao','donggua'],
'color':['white','white','red','green','white','white','red','green'],
'weight':np.random.randint(10,size = 8),
'price':np.random.randint(10,size = 8)}
df = DataFrame(d1)
sums = df.groupby('color').sum().add_prefix('total_')
items = df.groupby('item')['price','weight'].sum()
means = items['price']/items['weight']
means = DataFrame(means,columns=['means_price'])
df2 = pd.merge(df,sums,left_on = 'color',right_index = True)
df3 = pd.merge(df2,means,left_on = 'item',right_index = True)
display(df2,df3)
color item price weight
0 white luobo 9 2
1 white baicai 5 9
2 red lajiao 5 8
3 green donggua 1 1
4 white luobo 7 4
5 white baicai 8 0
6 red lajiao 6 8
7 green donggua 4 3
total_price total_weight
color
green 5 4
red 11 16
white 29 15
pandas.core.frame.DataFrame
pandas.core.frame.DataFrame
Out[141]:
color item price weight total_price total_weight
0 white luobo 9 2 29 15
1 white baicai 5 9 29 15
4 white luobo 7 4 29 15
5 white baicai 8 0 29 15
2 red lajiao 5 8 11 16
6 red lajiao 6 8 11 16
3 green donggua 1 1 5 4
7 green donggua 4 3 5 4
2 可以使用transform和apply实现相同功能
使用transform
d1={'item':['luobo','baicai','lajiao','donggua','luobo','baicai','lajiao','donggua'],
'color':['white','white','red','green','white','white','red','green'],
'weight':np.random.randint(10,size = 8),
'price':np.random.randint(10,size = 8)}
df = DataFrame(d1)
sum1 = df.groupby('color')['price','weight'].sum().add_prefix("total_")
sums2 = df.groupby('color')['price','weight'].transform(lambda x:x.sum()).add_prefix('total_')
sums3 = df.groupby('color')['price','weight'].transform(sum).add_prefix('total_')
display(sum,df,sum1,sums2,sums3)
输出:
<function sum>
color item price weight
0 white luobo 7 7
1 white baicai 7 7
2 red lajiao 2 7
3 green donggua 6 6
4 white luobo 1 2
5 white baicai 3 6
6 red lajiao 7 0
7 green donggua 0 2
total_price total_weight
color
green 6 8
red 9 7
white 18 22
total_price total_weight
0 18 22
1 18 22
2 9 7
3 6 8
4 18 22
5 18 22
6 9 7
7 6 8
total_price total_weight
0 18 22
1 18 22
2 9 7
3 6 8
4 18 22
5 18 22
6 9 7
7 6 8
使用apply
def sum_price(x):
return x.sum()
sums3 = df.groupby('color')['price','weight'].apply(lambda x:x.sum()).add_prefix('total_')
sums4 = df.groupby('color')['price','weight'].apply(sum_price).add_prefix('total_')
display(df,sums3,sums4)
输出:
color item price weight
0 white luobo 4 4
1 white baicai 0 3
2 red lajiao 0 4
3 green donggua 7 5
4 white luobo 3 1
5 white baicai 3 3
6 red lajiao 0 6
7 green donggua 0 7
total_price total_weight
color
green 7 12
red 0 10
white 10 11
totals_price totals_weight
color
green 7 12
red 0 10
white 10 11