Statistical Analysis with pandas
Basic statistical analysis of pandas data
The functions closely mirror their NumPy counterparts.
import pandas as pd
dates = pd.date_range('20130101',periods=10)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
'2013-01-09', '2013-01-10'],
dtype='datetime64[ns]', freq='D')
import numpy as np
df = pd.DataFrame(np.random.randn(10,4),index=dates,columns=['A','B','C','D'])
df
| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -1.587560 | -0.198819 | 0.720054 | 1.921686 |
| 2013-01-02 | 0.296288 | 1.876570 | 0.338344 | 0.597835 |
| 2013-01-03 | -1.832852 | 0.752045 | 2.184984 | -0.157722 |
| 2013-01-04 | -0.650829 | 1.690322 | -1.145963 | -0.798702 |
| 2013-01-05 | -0.729986 | -0.494417 | 2.166254 | 1.131232 |
| 2013-01-06 | -1.759444 | -1.104058 | 0.462934 | 2.050315 |
| 2013-01-07 | 0.760111 | -1.753986 | 0.104831 | 1.075343 |
| 2013-01-08 | 0.096572 | 0.383660 | 0.604831 | 0.715224 |
| 2013-01-09 | 0.126292 | 1.025429 | 0.019330 | -0.417396 |
| 2013-01-10 | -0.179047 | 0.175366 | 0.826219 | -0.451984 |
df.describe()
| | A | B | C | D |
|---|---|---|---|---|
| count | 10.000000 | 10.000000 | 10.000000 | 10.000000 |
| mean | -0.546045 | 0.235211 | 0.628182 | 0.566583 |
| std | 0.923341 | 1.164277 | 0.985506 | 1.001821 |
| min | -1.832852 | -1.753986 | -1.145963 | -0.798702 |
| 25% | -1.373167 | -0.420517 | 0.163209 | -0.352477 |
| 50% | -0.414938 | 0.279513 | 0.533883 | 0.656529 |
| 75% | 0.118862 | 0.957083 | 0.799678 | 1.117260 |
| max | 0.760111 | 1.876570 | 2.184984 | 2.050315 |
df.mean()
A -0.546045
B 0.235211
C 0.628182
D 0.566583
dtype: float64
df.mean(axis=1)
2013-01-01 0.213840
2013-01-02 0.777259
2013-01-03 0.236614
2013-01-04 -0.226293
2013-01-05 0.518271
2013-01-06 -0.087563
2013-01-07 0.046575
2013-01-08 0.450072
2013-01-09 0.188414
2013-01-10 0.092638
Freq: D, dtype: float64
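To make the comparison with NumPy concrete, here is a minimal sketch on a small hand-made frame. The name `demo` and its values are illustrative assumptions, not part of the session above. It shows that column and row means agree with `np.mean`, and that one default does differ: pandas `std()` uses the sample formula (`ddof=1`), while `np.std` defaults to the population formula (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with easy-to-check numbers.
demo = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]})

# Column means: pandas reduces along axis 0 by default.
pandas_col_mean = demo.mean()                        # A 2.5, B 25.0
numpy_col_mean = np.mean(demo.to_numpy(), axis=0)    # [2.5, 25.0]

# Row means: pass axis=1 to reduce across the columns of each row.
pandas_row_mean = demo.mean(axis=1)                  # 5.5, 11.0, 16.5, 22.0
numpy_row_mean = np.mean(demo.to_numpy(), axis=1)

# Standard deviation: the defaults differ between the two libraries.
pandas_std = demo["A"].std()                             # ddof=1 (sample)
numpy_std_population = np.std(demo["A"].to_numpy())      # ddof=0 (population)
numpy_std_sample = np.std(demo["A"].to_numpy(), ddof=1)  # matches pandas

print(pandas_std, numpy_std_population, numpy_std_sample)
```

Passing `ddof=1` to `np.std` (or `ddof=0` to the pandas method) brings the two results into agreement.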
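Basic statistical functions
- .describe() summary statistics along axis 0 (one summary per column): count, mean, standard deviation, minimum, quartiles, maximum
- .sum() sum of the data, computed along axis 0 by default (down the rows, giving one result per column); pass axis=1 to get one result per row. The same applies to the functions below
- .count() number of non-NaN values
- .mean() .median() .mode() arithmetic mean / median / mode of the data
- .var() .std() variance / standard deviation of the data
- .min() .max() minimum / maximum of the data
Locating extremes (.argmin()/.argmax() are Series-only; .idxmin()/.idxmax() also work on a DataFrame):
- .argmin() .argmax() integer position (the automatic 0-based index) of the minimum / maximum, convenient for positional slicing and similar operations
- .idxmin() .idxmax() index label (the custom index) of the minimum / maximum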
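The functions listed above can be tried on small inputs, as in the following sketch. The names `s`, `with_gap`, and `frame` are illustrative, and the sketch assumes a reasonably recent pandas (1.0 or later), where `Series.argmin()` returns an integer position rather than a label.

```python
import numpy as np
import pandas as pd

s = pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"])

# argmin/argmax give integer positions; idxmin/idxmax give index labels.
pos_min = s.argmin()      # 3
label_min = s.idxmin()    # 'd'
pos_max = s.argmax()      # 0
label_max = s.idxmax()    # 'a'

# A position pairs with .iloc, a label pairs with .loc.
same_value = s.iloc[pos_min] == s.loc[label_min]   # True

# NaN handling: count() ignores NaN, and sum()/median() skip it by default.
with_gap = pd.Series([1.0, np.nan, 3.0, 3.0])
n_valid = with_gap.count()    # 3
total = with_gap.sum()        # 7.0
middle = with_gap.median()    # 3.0
modes = with_gap.mode()       # a Series, since several values can tie

# Axis choice on a DataFrame: the default reduces down each column.
frame = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
col_totals = frame.sum()          # one value per column: 30, 34, 38, 42, 46
row_totals = frame.sum(axis=1)    # one value per row: 10, 35, 60, 85
```

Note that `.mode()` returns a Series rather than a scalar, because more than one value can share the highest frequency.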
a = pd.Series([9,8,7,6],index=['a','b','c','d'])
a
a 9
b 8
c 7
d 6
dtype: int64
b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| c | 0 | 1 | 2 | 3 | 4 |
| a | 5 | 6 | 7 | 8 | 9 |
| d | 10 | 11 | 12 | 13 | 14 |
| b | 15 | 16 | 17 | 18 | 19 |
a.describe()
count 4.000000
mean 7.500000
std 1.290994
min 6.000000
25% 6.750000
50% 7.500000
75% 8.250000
max 9.000000
dtype: float64
type(a.describe())
pandas.core.series.Series
a.describe()['count']
4.0
b.describe()
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| count | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 |
| mean | 7.500000 | 8.500000 | 9.500000 | 10.500000 | 11.500000 |
| std | 6.454972 | 6.454972 | 6.454972 | 6.454972 | 6.454972 |
| min | 0.000000 | 1.000000 | 2.000000 | 3.000000 | 4.000000 |
| 25% | 3.750000 | 4.750000 | 5.750000 | 6.750000 | 7.750000 |
| 50% | 7.500000 | 8.500000 | 9.500000 | 10.500000 | 11.500000 |
| 75% | 11.250000 | 12.250000 | 13.250000 | 14.250000 | 15.250000 |
| max | 15.000000 | 16.000000 | 17.000000 | 18.000000 | 19.000000 |
type(b.describe())
pandas.core.frame.DataFrame
b.describe().loc['max']
0 15.0
1 16.0
2 17.0
3 18.0
4 19.0
Name: max, dtype: float64
b.describe().iloc[7]
0 15.0
1 16.0
2 17.0
3 18.0
4 19.0
Name: max, dtype: float64
b.describe()[2]
count 4.000000
mean 9.500000
std 6.454972
min 2.000000
25% 5.750000
50% 9.500000
75% 13.250000
max 17.000000
Name: 2, dtype: float64
b.describe().loc[:,2]
count 4.000000
mean 9.500000
std 6.454972
min 2.000000
25% 5.750000
50% 9.500000
75% 13.250000
max 17.000000
Name: 2, dtype: float64
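Cumulative statistical analysis
- Applies a running operation to the first 1 through n elements of a sequence
- Can reduce the need for explicit for loops
Cumulative statistical functions, available on both Series and DataFrame
- .cumsum() running sum of the first 1/2/…/n elements
- .cumprod() running product of the first 1/2/…/n elements
- .cummax() running maximum of the first 1/2/…/n elements
- .cummin() running minimum of the first 1/2/…/n elements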
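As a minimal sketch of how a cumulative function stands in for an explicit loop, the snippet below computes a running total both ways. The Series `values` and its contents are illustrative assumptions.

```python
import pandas as pd

values = pd.Series([3, 1, 4, 1, 5])

# Explicit loop: keep a running total and record it at every step.
running = []
total = 0
for v in values:
    total += v
    running.append(total)

# The same result in a single call.
cumulative = values.cumsum()

print(running)              # [3, 4, 8, 9, 14]
print(cumulative.tolist())  # [3, 4, 8, 9, 14]
```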
b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| c | 0 | 1 | 2 | 3 | 4 |
| a | 5 | 6 | 7 | 8 | 9 |
| d | 10 | 11 | 12 | 13 | 14 |
| b | 15 | 16 | 17 | 18 | 19 |
b.cumsum()
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| c | 0 | 1 | 2 | 3 | 4 |
| a | 5 | 7 | 9 | 11 | 13 |
| d | 15 | 18 | 21 | 24 | 27 |
| b | 30 | 34 | 38 | 42 | 46 |
b.cumprod()
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| c | 0 | 1 | 2 | 3 | 4 |
| a | 0 | 6 | 14 | 24 | 36 |
| d | 0 | 66 | 168 | 312 | 504 |
| b | 0 | 1056 | 2856 | 5616 | 9576 |
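The running maximum and minimum work the same way as the running sum and product, and every cumulative function also accepts `axis=1` to accumulate across each row instead of down each column. A short sketch, in which the Series `prices` is a made-up example:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])

# axis=1 accumulates left to right within each row.
row_cumsum = frame.cumsum(axis=1)
# row 'c': 0, 1, 3, 6, 10
# row 'b': 15, 31, 48, 66, 85

# Hypothetical series for the running extremes.
prices = pd.Series([5, 3, 8, 2, 9])
running_max = prices.cummax()   # 5, 5, 8, 8, 9
running_min = prices.cummin()   # 5, 3, 3, 2, 2
```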
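Rolling (window) calculation functions
Available on both Series and DataFrame
- .rolling(w).sum() sum of each window of w adjacent elements
- .rolling(w).mean() arithmetic mean of each window of w adjacent elements
- .rolling(w).var() variance of each window of w adjacent elements
- .rolling(w).std() standard deviation of each window of w adjacent elements
- .rolling(w).min() .max() minimum / maximum of each window of w adjacent elements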
b.rolling(2).sum()
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| c | NaN | NaN | NaN | NaN | NaN |
| a | 5.0 | 7.0 | 9.0 | 11.0 | 13.0 |
| d | 15.0 | 17.0 | 19.0 | 21.0 | 23.0 |
| b | 25.0 | 27.0 | 29.0 | 31.0 | 33.0 |
b.rolling(3).sum()
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| c | NaN | NaN | NaN | NaN | NaN |
| a | NaN | NaN | NaN | NaN | NaN |
| d | 15.0 | 18.0 | 21.0 | 24.0 | 27.0 |
| b | 30.0 | 33.0 | 36.0 | 39.0 | 42.0 |
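The leading NaN rows appear because a window of size w only yields a result once it holds w values. The sketch below illustrates this on a small Series, which is an illustrative assumption, and shows the `min_periods` parameter of `.rolling()`, which allows results from partially filled windows.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Full windows only: the first w-1 results are NaN.
rolling_mean = s.rolling(3).mean()   # NaN, NaN, 2.0, 3.0, 4.0

# min_periods=1 accepts windows holding as little as one value.
partial_mean = s.rolling(3, min_periods=1).mean()   # 1.0, 1.5, 2.0, 3.0, 4.0

# Other reductions follow the same pattern.
rolling_max = s.rolling(2).max()     # NaN, 2.0, 3.0, 4.0, 5.0

# Manual check: the last window covers the final three elements.
manual_last = s.iloc[-3:].mean()     # 4.0
```

Whether partial windows are acceptable depends on the analysis: they avoid losing the first rows, but the early values are averaged over fewer observations.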