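Python Data Analysis
- 🌸Homepage: JoJo的数据分析历险记
- 📝About me: a senior undergraduate in statistics, admitted by recommendation to a graduate program in statistics at a top-3 statistics university
- 💌If this article helps you, feel free to follow, like, bookmark, and subscribe to the column
This column mainly covers applications of Python in data analysis
References:
利用python数据分析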
An earlier article covered NumPy for data processing; this one covers pandas. pandas is built on top of NumPy and makes data processing more convenient
Import the required libraries
import numpy as np
import pandas as pd
💮1. Series objects
pandas has two main data objects: Series, which resembles a one-dimensional vector, and DataFrame, which is a two-dimensional table. Let's first see how to create a Series.
s = pd.Series([12,-4,7,9])
s
0 12
1 -4
2 7
3 9
dtype: int64
🏵️1.1 Basic Series operations
# Select an element
s[2]
7
# Assign a value to an element
s[2]=5
s
s['a'] = 4
s
0 12
1 -4
2 5
3 9
a 4
dtype: int64
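# Build a new Series from another object
arr = np.array([1,2,3,4])
# In older pandas versions s2 shares memory with arr (it is a view), so changing the array also changes s2; below we change the second element of arr
# Recent pandas versions with copy-on-write may copy the data instead, in which case s2 keeps its original values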
s2 = pd.Series(arr)
s2
arr[1] = 9
s2
0 1
1 9
2 3
3 4
dtype: int32
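Because the sharing behaviour above depends on the pandas version, it can be safer to copy the array explicitly when the Series must not follow later changes. A minimal sketch (the name `s_copy` is only for illustration):

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])

# Copy the array first so the Series owns its own data.
s_copy = pd.Series(arr.copy())

# Changing the source array afterwards leaves the Series untouched.
arr[1] = 9
print(s_copy.tolist())
```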
# Filter elements with a boolean condition
s[s>8]
0 12
3 9
dtype: int64
# Inspect the elements that make up a Series
serd = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green','green','yellow'])
serd
white 1
white 0
blue 2
green 1
green 2
yellow 3
dtype: int64
# unique() removes duplicates and returns an array
serd.unique()
array([1, 0, 2, 3], dtype=int64)
# value_counts() returns each distinct value together with its number of occurrences
serd.value_counts()
2 2
1 2
3 1
0 1
dtype: int64
# isin() returns a boolean Series
serd.isin([0,3])
white False
white True
blue False
green False
green False
yellow True
dtype: bool
serd[serd.isin([0,3])]
white 0
yellow 3
dtype: int64
#NaN
s2 = pd.Series([-5,3,np.nan,14])  # np.nan is the spelling that also works on NumPy 2.x
s2
0 -5.0
1 3.0
2 NaN
3 14.0
dtype: float64
# Use isnull() and notnull() to test for missing values (the output below is from notnull())
s2.isnull()
s2.notnull()
0 True
1 True
2 False
3 True
dtype: bool
s2
0 -5.0
1 3.0
2 NaN
3 14.0
dtype: float64
# Build a Series from a dict
mydict = {'red':2000,'blue':1000,'yellow':500,'orange':1000}
myseries = pd.Series(mydict)
myseries
red 2000
blue 1000
yellow 500
orange 1000
dtype: int64
When a label in the index has no value in the dict, NaN is filled in
colors = ['red','blue','yellow','orange','green']
myseries = pd.Series(mydict, index = colors)
myseries
red 2000.0
blue 1000.0
yellow 500.0
orange 1000.0
green NaN
dtype: float64
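Arithmetic aligns on labels: a label missing from either Series produces NaN in the result, even when fillna(0) is applied to each operand first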
mydict2 ={'red':400,'yellow':1000,"black":700}
myseries2 = pd.Series(mydict2)
myseries.fillna(0) + myseries2.fillna(0)
black NaN
blue NaN
green NaN
orange NaN
red 2400.0
yellow 1500.0
dtype: float64
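If the goal is to treat a label missing from one side as 0, the `add()` method with `fill_value` does that during alignment. A small sketch with made-up data (`stock` and `extra` are illustrative names):

```python
import pandas as pd

stock = pd.Series({'red': 2000, 'blue': 1000})
extra = pd.Series({'red': 400, 'black': 700})

# Plain + leaves NaN for labels found on only one side.
plain = stock + extra

# add() with fill_value=0 treats the missing operand as 0 instead.
filled = stock.add(extra, fill_value=0)
print(filled)
```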
🌹2. DataFrame objects
The DataFrame is the data format we meet most often in data analysis. It is a table of rows and columns in which each column usually represents a variable and each row an observation. Let's look at some basic uses first.
Create a DataFrame
# A DataFrame built from a dict of lists
data = {'color':['blue','green','yellow','red','white'],
'object':['ball','pen','pencil','paper','mug'],
'price':[1.2,1.0,0.6,0.9,1.7]}
frame = pd.DataFrame(data)
frame
    color  object  price
0    blue    ball    1.2
1   green     pen    1.0
2  yellow  pencil    0.6
3     red   paper    0.9
4   white     mug    1.7
frame2 = pd.DataFrame(data, columns=['object','price'])
frame2
   object  price
0    ball    1.2
1     pen    1.0
2  pencil    0.6
3   paper    0.9
4     mug    1.7
frame3 = pd.DataFrame(data,index=['one','two','three','four','five'])
frame3
        color  object  price
one      blue    ball    1.2
two     green     pen    1.0
three  yellow  pencil    0.6
four      red   paper    0.9
five    white     mug    1.7
# Selecting elements
# Get the names of all columns
frame.columns
Index(['color', 'object', 'price'], dtype='object')
# Get the labels of all rows
frame.index
RangeIndex(start=0, stop=5, step=1)
# Get all values
frame.values
array([['blue', 'ball', 1.2],
['green', 'pen', 1.0],
['yellow', 'pencil', 0.6],
['red', 'paper', 0.9],
['white', 'mug', 1.7]], dtype=object)
# Get the values of one column
frame['price']
0 1.2
1 1.0
2 0.6
3 0.9
4 1.7
Name: price, dtype: float64
# Get one row by integer position with iloc (the old ix attribute has been removed from pandas)
frame.iloc[2]
color yellow
object pencil
price 0.6
Name: 2, dtype: object
# Pass a list of positions to select several rows
frame.iloc[[2,4]]
    color  object  price
2  yellow  pencil    0.6
4   white     mug    1.7
# Slices such as frame[0:1] or frame[0:2] select rows, but frame[0] raises a KeyError because a single key is looked up among the column names
# frame[0:1] returns the first row and frame[1:2] the second
frame[0:4]
    color  object  price
0    blue    ball    1.2
1   green     pen    1.0
2  yellow  pencil    0.6
3     red   paper    0.9
# To get a single element, give the column name first and then the row index or label
frame['object'][3]
'paper'
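The same element can also be reached in one step with the `loc`, `iloc` and `at` accessors, which avoids chaining two lookups. A short sketch on the same data:

```python
import pandas as pd

frame = pd.DataFrame({'color': ['blue', 'green', 'yellow', 'red', 'white'],
                      'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
                      'price': [1.2, 1.0, 0.6, 0.9, 1.7]})

by_label = frame.loc[3, 'object']   # row label 3, column name 'object'
by_position = frame.iloc[3, 1]      # fourth row, second column
scalar = frame.at[3, 'object']      # scalar access by label
print(by_label, by_position, scalar)
```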
# Assignment
frame['new']=12 # add a new column filled with a single value
frame
    color  object  price  new
0    blue    ball    1.2   12
1   green     pen    1.0   12
2  yellow  pencil    0.6   12
3     red   paper    0.9   12
4   white     mug    1.7   12
frame['new']=[1,2,3,4,5]
frame
    color  object  price  new
0    blue    ball    1.2    1
1   green     pen    1.0    2
2  yellow  pencil    0.6    3
3     red   paper    0.9    4
4   white     mug    1.7    5
# Modify a single element; .loc assigns in place and avoids chained indexing such as frame['price'][2]=3.3
frame.loc[2,'price']=3.3
frame
    color  object  price  new
0    blue    ball    1.2    1
1   green     pen    1.0    2
2  yellow  pencil    3.3    3
3     red   paper    0.9    4
4   white     mug    1.7    5
# Delete an entire column with del
frame['new'] = 12
frame
del frame['new']
frame
    color  object  price
0    blue    ball    1.2
1   green     pen    1.0
2  yellow  pencil    3.3
3     red   paper    0.9
4   white     mug    1.7
# Filter elements: values that fail the condition become NaN
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['red','white','blue','green'],
columns=['ball','pen','pencil','paper'])
frame3
frame3[frame3>12]
       ball   pen  pencil  paper
red     NaN   NaN     NaN    NaN
white   NaN   NaN     NaN    NaN
blue    NaN   NaN     NaN    NaN
green   NaN  13.0    14.0   15.0
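Filtering whole rows by a condition on one column is often more useful than masking individual cells. A brief sketch on the same table (the thresholds are arbitrary):

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'white', 'blue', 'green'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

# Keep only the rows whose 'pen' value is greater than 5.
rows = frame3[frame3['pen'] > 5]

# Combine conditions with & (and) or | (or); each one needs parentheses.
both = frame3[(frame3['pen'] > 5) & (frame3['paper'] < 15)]
print(rows)
print(both)
```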
# Build a DataFrame from a nested dict; missing entries are filled with NaN
nestdict = {'red':{2012:22, 2013:33},'white':{2011: 13,2012:22,2013:16},'blue':{2011:17,2012:27,2013:48}}
nestdict
{'red': {2012: 22, 2013: 33},
'white': {2011: 13, 2012: 22, 2013: 16},
'blue': {2011: 17, 2012: 27, 2013: 48}}
frame2 = pd.DataFrame(nestdict)
frame2
       red  white  blue
2011   NaN     13    17
2012  22.0     22    27
2013  33.0     16    48
Transpose
frame2.T
       2011  2012  2013
red     NaN  22.0  33.0
white  13.0  22.0  16.0
blue   17.0  27.0  48.0
# The Index object
ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])
ser.index
Index(['red', 'blue', 'yellow', 'white', 'green'], dtype='object')
ser.idxmax()
'white'
ser.idxmin()
'blue'
# An Index with duplicate labels
serd = pd.Series(range(6), index=['white','white','blue','green','green','yellow'])
serd
white 0
white 1
blue 2
green 3
green 4
yellow 5
dtype: int64
# When a label matches several elements, a Series is returned instead of a single value
serd['white']
white 0
white 1
dtype: int64
# Use is_unique to check whether the index contains duplicate labels
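A quick sketch of that check, using the `is_unique` attribute of the index and, as an extra, `duplicated()` to locate the repeats:

```python
import pandas as pd

serd = pd.Series(range(6),
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])
ser = pd.Series([5, 0, 3, 8, 4],
                index=['red', 'blue', 'yellow', 'white', 'green'])

has_unique_labels = serd.index.is_unique   # False: 'white' and 'green' repeat
all_unique = ser.index.is_unique           # True

# duplicated() marks the second and later occurrences of a label.
dup_mask = serd.index.duplicated()
print(has_unique_labels, all_unique, dup_mask.tolist())
```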
# Other features of Index objects
ser = pd.Series([2,5,7,4],index = ['one','two','three','four'])
ser
one 2
two 5
three 7
four 4
dtype: int64
# reindex() changes the index of a Series and returns a new Series
ser.reindex(['three','one','five','two'])
three 7.0
one 2.0
five NaN
two 5.0
dtype: float64
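When NaN is not wanted for the new labels, `reindex()` accepts a `fill_value`. A minimal sketch on the same Series:

```python
import pandas as pd

ser = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])

# 'five' does not exist in ser, so it receives the fill value instead of NaN.
filled = ser.reindex(['three', 'one', 'five', 'two'], fill_value=0)
print(filled)
```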
ser3 = pd.Series([1,5,6,3],index=[0,3,5,6])
ser3
0 1
3 5
5 6
6 3
dtype: int64
# Automatic filling
# reindex() with method='ffill' fills a new label with the preceding value; method='bfill' uses the following value
ser3.reindex(range(6),method='ffill')
0 1
1 1
2 1
3 5
4 5
5 6
dtype: int64
ser3.reindex(range(8),method='bfill')
0 1.0
1 5.0
2 5.0
3 5.0
4 6.0
5 6.0
6 3.0
7 NaN
dtype: float64
frame.reindex(range(5), method='ffill',columns=['colors','price','new','object'])
   colors  price     new  object
0    blue    1.2    blue    ball
1   green    1.0   green     pen
2  yellow    3.3  yellow  pencil
3     red    0.9     red   paper
4   white    1.7   white     mug
ser = pd.Series(np.arange(4.),index=['red','blue','yellow','white'])
ser
red 0.0
blue 1.0
yellow 2.0
white 3.0
dtype: float64
ser.drop('yellow')
red 0.0
blue 1.0
white 3.0
dtype: float64
ser.drop(['blue','white'])
red 0.0
yellow 2.0
dtype: float64
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
# drop() removes rows by default; axis selects the axis, and axis=1 removes columns
frame.drop(['pen'],axis=1)
        ball  pencil  paper
red        0       2      3
blue       4       6      7
yellow     8      10     11
white     12      14     15
🥀3. Basic pandas operations
🌺3.1 Arithmetic
- When two Series or DataFrame objects are added, the values under labels present in both are summed
- A label present in only one of them gets NaN
s1 = pd.Series([3,2,5,1],index=['white','yellow','green','blue'])
s1
white 3
yellow 2
green 5
blue 1
dtype: int64
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])
s1 + s2
black NaN
blue 3.0
brown NaN
green NaN
white 4.0
yellow 6.0
dtype: float64
# DataFrame objects behave the same way
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
columns=['ball','pen','pencil','paper'],
index = ['red','blue','yellow','white'])
frame1
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
index = ['blue','yellow','green','white']
,columns=['ball','pen','mug'])
frame2
        ball  pen  mug
blue       0    1    2
yellow     3    4    5
green      6    7    8
white      9   10   11
frame3 = frame1+frame2
frame3
        ball  mug  paper   pen  pencil
blue     4.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   21.0  NaN    NaN  23.0     NaN
yellow  11.0  NaN    NaN  13.0     NaN
🌻3.2 Arithmetic methods
The main arithmetic methods are:
- add(): frame1.add(frame2) is equivalent to frame1+frame2
- sub()
- div()
- mul()
Some examples follow
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
columns=['ball','pen','pencil','paper'],
index = ['red','blue','yellow','white'])
frame
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
ser = pd.Series(np.arange(4),['ball','pen','pencil','paper'])
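ser # its labels must match the column names of frame; with the plain operators a Series is not aligned against the row labels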
ball 0
pen 1
pencil 2
paper 3
dtype: int32
frame-ser
        ball  pen  pencil  paper
red        0    0       0      0
blue       4    4       4      4
yellow     8    8       8      8
white     12   12      12     12
When a label exists in only one of the two data structures, the result gains a new label whose values are NaN
For example, add an element labelled mug to ser
ser['mug'] = 9
ser
ball 0
pen 1
pencil 2
paper 3
mug 9
dtype: int64
frame - ser
        ball  mug  paper  pen  pencil
red        0  NaN      0    0       0
blue       4  NaN      4    4       4
yellow     8  NaN      8    8       8
white     12  NaN     12   12      12
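To subtract a Series whose labels match the row index instead of the column names, the method form `sub()` with `axis=0` can be used. A sketch under that assumption (`row_offsets` is an illustrative name):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# One value per row label; axis=0 aligns the Series with the row index.
row_offsets = pd.Series([0, 4, 8, 12],
                        index=['red', 'blue', 'yellow', 'white'])
result = frame.sub(row_offsets, axis=0)
print(result)
```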
🌼3.3 Applying functions
Functions can be applied to every element of a DataFrame or Series
frame
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
# Square root of every element
np.sqrt(frame)
            ball       pen    pencil     paper
red     0.000000  1.000000  1.414214  1.732051
blue    2.000000  2.236068  2.449490  2.645751
yellow  2.828427  3.000000  3.162278  3.316625
white   3.464102  3.605551  3.741657  3.872983
# Define a function
# Option 1:
f = lambda x: x.max()-x.min()  # range of the values (max minus min)
# Option 2:
def f(x):
    return x.max()-x.min()
# apply() calls the function on each column
frame.apply(f)
ball 12
pen 12
pencil 12
paper 12
dtype: int64
def f(x):
    return pd.Series([x.min(),x.max()],index = ['min','max'])
# axis=1 applies f to each row; the default axis=0 applies it to each column
frame.apply(f,axis = 1)
        min  max
red       0    3
blue      4    7
yellow    8   11
white    12   15
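For common statistics such as min and max, `agg()` with a list of function names gives a similar summary without writing a helper function. A small sketch (note that the result here has one column per original column and one row per statistic):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# Each named statistic is computed per column.
summary = frame.agg(['min', 'max'])
print(summary)
```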
🌷4. Statistical functions
- Most array statistical functions also work on DataFrame objects and can be called directly
frame.sum()
ball 24
pen 28
pencil 32
paper 36
dtype: int64
frame.mean()
ball 6.0
pen 7.0
pencil 8.0
paper 9.0
dtype: float64
# describe() computes several statistics at once
frame.describe()
            ball        pen     pencil      paper
count   4.000000   4.000000   4.000000   4.000000
mean    6.000000   7.000000   8.000000   9.000000
std     5.163978   5.163978   5.163978   5.163978
min     0.000000   1.000000   2.000000   3.000000
25%     3.000000   4.000000   5.000000   6.000000
50%     6.000000   7.000000   8.000000   9.000000
75%     9.000000  10.000000  11.000000  12.000000
max    12.000000  13.000000  14.000000  15.000000
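# ser is redefined here so that the ranking matches the output below
ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])
ser.rank(method='first')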
red 4.0
blue 1.0
yellow 2.0
white 5.0
green 3.0
dtype: float64
🌱4.1 Correlation and covariance
🌲4.1.1 Series objects
- These statistics usually involve two data objects
- The functions are corr() and cov()
seq = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq
2006 3
2007 4
2008 3
2009 4
2010 5
2011 4
2012 3
2013 2
dtype: int64
seq2 = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq2
2006 1
2007 2
2008 3
2009 4
2010 4
2011 3
2012 2
2013 1
dtype: int64
seq.corr(seq2)
0.7745966692414834
seq.cov(seq2)
0.8571428571428571
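As a cross-check, the two numbers above can be reproduced from the definitions: the sample covariance divides the summed products of deviations by n-1, and the Pearson correlation divides the covariance by the product of the sample standard deviations. A sketch with NumPy:

```python
import numpy as np
import pandas as pd

years = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013']
seq = pd.Series([3, 4, 3, 4, 5, 4, 3, 2], years)
seq2 = pd.Series([1, 2, 3, 4, 4, 3, 2, 1], years)

x = seq.to_numpy(dtype=float)
y = seq2.to_numpy(dtype=float)
n = len(x)

# Sample covariance: sum of products of deviations divided by n - 1.
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

# Pearson correlation: covariance over the product of sample std deviations.
corr_manual = cov_manual / (x.std(ddof=1) * y.std(ddof=1))
print(cov_manual, corr_manual)
```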
🌳4.1.2 DataFrame objects
For a DataFrame, corr() and cov() again return a DataFrame
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame2
        ball  pen  pencil  paper
red        1    4       3      6
blue       4    5       6      1
yellow     3    3       1      5
white      4    1       6      4
frame2.corr()
            ball       pen    pencil     paper
ball    1.000000 -0.276026  0.577350 -0.763763
pen    -0.276026  1.000000 -0.079682 -0.361403
pencil  0.577350 -0.079682  1.000000 -0.692935
paper  -0.763763 -0.361403 -0.692935  1.000000
frame2.cov()
            ball       pen    pencil     paper
ball    2.000000 -0.666667  2.000000 -2.333333
pen    -0.666667  2.916667 -0.333333 -1.333333
pencil  2.000000 -0.333333  6.000000 -3.666667
paper  -2.333333 -1.333333 -3.666667  4.666667
🌴4.1.3 Correlation between a DataFrame and a Series
corrwith() computes the pairwise correlation between the columns or rows of a DataFrame and a Series or another DataFrame
ser
red 5
blue 0
yellow 3
white 8
green 4
dtype: int64
frame2.corrwith(ser)
ball -0.140028
pen -0.869657
pencil 0.080845
paper 0.595854
dtype: float64
frame2.corrwith(frame)
ball 0.730297
pen -0.831522
pencil 0.210819
paper -0.119523
dtype: float64
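🌵4.2 Sorting and ranking
- For a Series use sort_values() and rank(); the default order is ascending, and ascending=False switches to descending order (the same applies below)
- For a DataFrame use sort_values(by='') and rank()
Sort ser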
ser.sort_values()
blue 0
yellow 3
green 4
red 5
white 8
dtype: int64
Rank ser
ser.rank()
red 4.0
blue 1.0
yellow 2.0
white 5.0
green 3.0
dtype: float64
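The descending variants mentioned above can be sketched as follows, using the same Series:

```python
import pandas as pd

ser = pd.Series([5, 0, 3, 8, 4],
                index=['red', 'blue', 'yellow', 'white', 'green'])

# Largest value first.
desc = ser.sort_values(ascending=False)

# Rank 1 goes to the largest value when ascending=False.
rank_desc = ser.rank(ascending=False)
print(desc)
print(rank_desc)
```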
Sort frame by the pen column
frame.sort_values(by='pen')
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
ser
red 5
blue 0
yellow 3
white 8
green 4
dtype: int64
ser.sort_index()
# Sorted by label in ascending alphabetical order
blue 0
green 4
red 5
white 8
yellow 3
dtype: int64
ser.sort_index(ascending=False)
# Switch to descending order
yellow 3
white 8
red 5
green 4
blue 0
dtype: int64
ser.sort_values()
# Sort by value
blue 0
yellow 3
green 4
red 5
white 8
dtype: int64
frame
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
frame.sort_index()
        ball  pen  pencil  paper
blue       4    5       6      7
red        0    1       2      3
white     12   13      14     15
yellow     8    9      10     11
axis selects which labels are sorted: axis=0 (the default) sorts the row labels and axis=1 sorts the column labels
frame.sort_index(axis=1)
        ball  paper  pen  pencil
red        0      3    1       2
blue       4      7    5       6
yellow     8     11    9      10
white     12     15   13      14
🌾5. Handling missing values in pandas
🌿5.1 Creating NaN data
ser = pd.Series([0,1,2,np.nan,9], index=['red','blue','yellow','white','green'])
ser
red 0.0
blue 1.0
yellow 2.0
white NaN
green 9.0
dtype: float64
ser['white']
nan
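☘️5.2 Removing NaN
- dropna()
- ser[ser.notnull()]
- On a DataFrame, dropna() by default removes every row that contains at least one NaN (how='any'). To avoid discarding too much, pass how='all' to remove only the rows whose elements are all NaN; add axis=1 to apply the same rule to columns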
frame3.dropna(how='all')
        ball  mug  paper   pen  pencil
blue     4.0  NaN    NaN   6.0     NaN
white   21.0  NaN    NaN  23.0     NaN
yellow  11.0  NaN    NaN  13.0     NaN
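The different `dropna()` options are easier to compare side by side. A sketch on a small made-up table (`df` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ball': [4.0, np.nan, 21.0],
                   'mug': [np.nan, np.nan, np.nan],
                   'pen': [6.0, np.nan, np.nan]},
                  index=['blue', 'green', 'white'])

rows_all = df.dropna(how='all')          # drop rows that are entirely NaN
cols_all = df.dropna(axis=1, how='all')  # drop columns that are entirely NaN
rows_any = df.dropna(how='any')          # drop rows containing any NaN
at_least_two = df.dropna(thresh=2)       # keep rows with >= 2 non-NaN values
print(rows_all, cols_all, rows_any, at_least_two, sep='\n\n')
```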
🍀5.3 Filling NaN with other values
frame3.fillna(0)
        ball  mug  paper   pen  pencil
blue     4.0  0.0    0.0   6.0     0.0
green    0.0  0.0    0.0   0.0     0.0
red      0.0  0.0    0.0   0.0     0.0
white   21.0  0.0    0.0  23.0     0.0
yellow  11.0  0.0    0.0  13.0     0.0
To replace NaN with a different value in each column, pass a dict that maps column names to fill values
frame3.fillna({'ball':1,"pen":99})
        ball  mug  paper   pen  pencil
blue     4.0  NaN    NaN   6.0     NaN
green    1.0  NaN    NaN  99.0     NaN
red      1.0  NaN    NaN  99.0     NaN
white   21.0  NaN    NaN  23.0     NaN
yellow  11.0  NaN    NaN  13.0     NaN
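A common variation is to fill each column with its own mean. Passing the Series returned by `mean()` to `fillna()` works like the dict above, since it is keyed by column name. A sketch with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ball': [4.0, np.nan, 21.0, 11.0],
                   'pen': [6.0, np.nan, 23.0, 13.0]},
                  index=['blue', 'green', 'white', 'yellow'])

# mean() skips NaN by default, so each column mean uses the known values only.
filled = df.fillna(df.mean())
print(filled)
```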
🍁6. Hierarchical indexing and statistics by level
Sometimes the data needs an index with several levels, as in the following example
mser = pd.Series(np.random.rand(8),index=[['white','white','white','blue','blue','red','red','red'],
['up','down','right','up','down','up','down','left']])
mser
white up 0.323513
down 0.080292
right 0.503630
blue up 0.201143
down 0.173879
red up 0.866267
down 0.601906
left 0.140885
dtype: float64
mser.index
MultiIndex(levels=[['blue', 'red', 'white'], ['down', 'left', 'right', 'up']],
codes=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])
mser['white']
up 0.323513
down 0.080292
right 0.503630
dtype: float64
mser[:,'up']
white 0.323513
blue 0.201143
red 0.866267
dtype: float64
mser[:,'right']
white 0.50363
dtype: float64
mser['white','up'] # returns one specific element
0.32351250980575463
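The same selections can be written more explicitly with `loc` and `xs()`. The sketch below uses fixed integers instead of random numbers so the results are reproducible, and sorts the index first, which is an optional step chosen here to keep lookups on a MultiIndex well behaved:

```python
import pandas as pd

mser = pd.Series([1, 2, 3, 4, 5, 6, 7, 8],
                 index=[['white', 'white', 'white', 'blue', 'blue', 'red', 'red', 'red'],
                        ['up', 'down', 'right', 'up', 'down', 'up', 'down', 'left']])
mser = mser.sort_index()

# A full (outer, inner) tuple selects a single element.
one = mser.loc[('white', 'up')]

# xs() takes a cross-section on the inner level, like mser[:, 'up'].
ups = mser.xs('up', level=1)
print(one)
print(ups)
```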
🍂6.1 unstack() and stack()
unstack() converts a Series with a hierarchical index into a plain DataFrame by turning the second index level into columns; stack() does the reverse, as shown below
mser.unstack() # convert the Series to a DataFrame
mser.unstack().fillna(0)
           down      left    right        up
blue   0.173879  0.000000  0.00000  0.201143
red    0.601906  0.140885  0.00000  0.866267
white  0.080292  0.000000  0.50363  0.323513
frame
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
frame.stack() # convert the DataFrame to a Series
red ball 0
pen 1
pencil 2
paper 3
blue ball 4
pen 5
pencil 6
paper 7
yellow ball 8
pen 9
pencil 10
paper 11
white ball 12
pen 13
pencil 14
paper 15
dtype: int32
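Since stack() and unstack() are inverse operations, a round trip should give back the original values. A sketch of that check (the explicit `loc` reordering is a precaution in case the label order changes on the way back):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

stacked = frame.stack()          # Series with a two-level index
restored = stacked.unstack()     # back to a DataFrame

# Put rows and columns in the original order before comparing.
restored = restored.loc[frame.index, frame.columns]
print(stacked.loc[('blue', 'pen')])
print(restored)
```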
The rows and columns of a DataFrame can also carry hierarchical indexes
mframe = pd.DataFrame(np.arange(16).reshape((4,4)),
index = [['white','white','red','red'],['up','down','up','down']],
columns=[['pen','pen','paper','paper'],[1,2,1,2]])
mframe
           pen     paper
             1   2     1   2
white up     0   1     2   3
      down   4   5     6   7
red   up     8   9    10  11
      down  12  13    14  15
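🍃6.2 Reordering levels
- swaplevel() takes the names of the two levels to swap and returns a new object with those levels exchanged; the order of the elements stays the same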
mframe.columns.names = ['object','id']
mframe.index.names = ['colors','status']
mframe
object         pen     paper
id               1   2     1   2
colors status
white  up        0   1     2   3
       down      4   5     6   7
red    up        8   9    10  11
       down     12  13    14  15
mframe.swaplevel('colors','status')
object         pen     paper
id               1   2     1   2
status colors
up     white     0   1     2   3
down   white     4   5     6   7
up     red       8   9    10  11
down   red      12  13    14  15
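🌍6.3 Statistics by level
# Older pandas accepted mframe.sum(level='colors'); the level argument was removed in pandas 2.0, so group by the level instead
mframe.groupby(level='colors', sort=False).sum()
object pen     paper
id       1   2     1   2
colors
white    4   6     8  10
red     20  22    24  26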
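To aggregate over a level of the columns instead, work along the column axis. Older pandas accepted mframe.sum(level='id', axis=1); an equivalent that avoids the removed argument is to transpose, group, and transpose back
mframe.T.groupby(level='id').sum().T
id              1   2
colors status
white  up       2   4
       down    10  12
red    up      18  20
       down    26  28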
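🌎7. Importing data
Often the data we want to analyse comes from files saved on the computer. This article shows how to import the most common format, CSV; later articles will cover importing JSON data, connecting to SQL databases, and other ways of loading data
import pandas as pd # load the module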
df = pd.read_csv('student.csv')
df
Student ID name age gender
11 1111 Dw 3 Female
12 1112 Q 23 Male
13 1113 W 21 Female
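To try `read_csv()` without a file on disk, the CSV text can be wrapped in `io.StringIO`. The sketch below uses hypothetical contents modelled on the output above, and shows two commonly used options, `index_col` and `usecols`:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file such as student.csv.
csv_text = """Student ID,name,age,gender
1111,Dw,3,Female
1112,Q,23,Male
1113,W,21,Female
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Load only some columns and use one of them as the row index.
by_id = pd.read_csv(io.StringIO(csv_text),
                    usecols=['Student ID', 'name', 'age'],
                    index_col='Student ID')
print(df)
print(by_id)
```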
🌏8. Data processing
🌐8.1 Joining
merge() works like a multi-table join in SQL
🎇8.1.1 Inner join
import numpy as np
import pandas as pd
frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],
'price':[12.33,11.44,33.21,13.23,33.62]})
frame1
        id  price
0     ball  12.33
1   pencil  11.44
2      pen  33.21
3      mug  13.23
4  ashtray  33.62
frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'],
'color':['white','red','red','black']})
frame2
       id  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black
pd.merge(frame1,frame2)
       id  price  color
0    ball  12.33    red
1  pencil  11.44  white
2  pencil  11.44    red
3     pen  33.21  black
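The DataFrame returned above consists of the rows whose id appears in both inputs. No join column was specified, so merge() used the column name that the two frames share; in practice the join key is usually given explicitly with the on option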
frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],
'color':['white','red','red','black','green'],
'brand':['OMG','ABC','ABC','POD','POD']})
frame1
        id  color brand
0     ball  white   OMG
1   pencil    red   ABC
2      pen    red   ABC
3      mug  black   POD
4  ashtray  green   POD
frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'],
'brand':['OMG','POD','ABC','POD']})
frame2
       id brand
0  pencil   OMG
1  pencil   POD
2    ball   ABC
3     pen   POD
pd.merge(frame1,frame2,on='id') # merge on id
       id  color brand_x brand_y
0    ball  white     OMG     ABC
1  pencil    red     ABC     OMG
2  pencil    red     ABC     POD
3     pen    red     ABC     POD
pd.merge(frame1,frame2,on='brand') # merge on brand
      id_x  color brand    id_y
0     ball  white   OMG  pencil
1   pencil    red   ABC    ball
2      pen    red   ABC    ball
3      mug  black   POD  pencil
4      mug  black   POD     pen
5  ashtray  green   POD  pencil
6  ashtray  green   POD     pen
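When the key columns have different names, use left_on and right_on. In the two tables below one key is called id and the other sid, so we join the id of the first table to the sid of the second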
frame2.columns = ['sid','brand']
frame2
      sid brand
0  pencil   OMG
1  pencil   POD
2    ball   ABC
3     pen   POD
pd.merge(frame1,frame2,left_on = 'id',right_on ='sid')
       id  color brand_x     sid brand_y
0    ball  white     OMG    ball     ABC
1  pencil    red     ABC  pencil     OMG
2  pencil    red     ABC  pencil     POD
3     pen    red     ABC     pen     POD
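merge() performs an inner join by default, so the keys in the result above are the intersection of the keys of the two frames
🎉8.1.2 Outer join
- The join type is chosen with the how option
- Left join: the shared keys plus the keys found only on the left
- Right join: the shared keys plus the keys found only on the right
- Outer join: all keys from both sides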
frame2.columns=['id','brand']
pd.merge(frame1,frame2,how='outer')
        id  color brand
0     ball  white   OMG
1   pencil    red   ABC
2      pen    red   ABC
3      mug  black   POD
4  ashtray  green   POD
5   pencil    NaN   OMG
6   pencil    NaN   POD
7     ball    NaN   ABC
8      pen    NaN   POD
pd.merge(frame1,frame2,how='left')
        id  color brand
0     ball  white   OMG
1   pencil    red   ABC
2      pen    red   ABC
3      mug  black   POD
4  ashtray  green   POD
pd.merge(frame1,frame2,how='right')
       id color brand
0  pencil   NaN   OMG
1  pencil   NaN   POD
2    ball   NaN   ABC
3     pen   NaN   POD
To merge on several keys, pass a list of keys to the on option
pd.merge(frame1,frame2,on=['id','brand'],how='outer')
        id  color brand
0     ball  white   OMG
1   pencil    red   ABC
2      pen    red   ABC
3      mug  black   POD
4  ashtray  green   POD
5   pencil    NaN   OMG
6   pencil    NaN   POD
7     ball    NaN   ABC
8      pen    NaN   POD
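When it is unclear which side each row of a join came from, `indicator=True` adds a `_merge` column that says so. A sketch using the same ids as above, joined on `id` only (the frame names `left` and `right` are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                     'color': ['white', 'red', 'red', 'black', 'green']})
right = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                      'brand': ['OMG', 'POD', 'ABC', 'POD']})

# _merge is 'both', 'left_only' or 'right_only' for every output row.
merged = pd.merge(left, right, on='id', how='outer', indicator=True)

n_both = (merged['_merge'] == 'both').sum()
n_left_only = (merged['_merge'] == 'left_only').sum()
n_right_only = (merged['_merge'] == 'right_only').sum()
print(merged)
```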
🎊8.1.3 Joining on the index
# Method 1
pd.merge(frame1,frame2,left_index=True,right_index=True)
     id_x  color brand_x    id_y brand_y
0    ball  white     OMG  pencil     OMG
1  pencil    red     ABC  pencil     POD
2     pen    red     ABC    ball     ABC
3     mug  black     POD     pen     POD
# Method 2: use join(), which performs a left join by default
frame2.columns = ['id2','brand2']
frame1.join(frame2)
        id  color brand     id2 brand2
0     ball  white   OMG  pencil    OMG
1   pencil    red   ABC  pencil    POD
2      pen    red   ABC    ball    ABC
3      mug  black   POD     pen    POD
4  ashtray  green   POD     NaN    NaN
🎄8.2 Concatenating
- NumPy's concatenate() function concatenates arrays
- pandas' concat() function concatenates objects along an axis
arr1 = np.arange(9).reshape(3,3)
arr1
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
arr2 = np.arange(6,15).reshape(3,3)
arr2
array([[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
np.concatenate([arr1,arr2],axis=1) # the default is axis=0
array([[ 0, 1, 2, 6, 7, 8],
[ 3, 4, 5, 9, 10, 11],
[ 6, 7, 8, 12, 13, 14]])
ser1 = pd.Series(np.random.rand(4), index = [1,2,3,4])
ser1
1 0.180191
2 0.061649
3 0.236378
4 0.105309
dtype: float64
ser2 = pd.Series(np.random.rand(4), index = [5,6,7,8])
ser2
5 0.935277
6 0.516146
7 0.210461
8 0.912048
dtype: float64
pd.concat([ser1,ser2])
1 0.180191
2 0.061649
3 0.236378
4 0.105309
5 0.935277
6 0.516146
7 0.210461
8 0.912048
dtype: float64
pd.concat([ser1,ser2],axis = 1)
          0         1
1  0.180191       NaN
2  0.061649       NaN
3  0.236378       NaN
4  0.105309       NaN
5       NaN  0.935277
6       NaN  0.516146
7       NaN  0.210461
8       NaN  0.912048
The default is an outer join; pass join='inner' to keep only the labels present in both objects
pd.concat([ser1,ser2],axis=1,join='inner')
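With ser1 and ser2 as defined above the two indexes do not overlap, so an inner join leaves no rows. The difference between the two join types is easier to see with partly overlapping indexes, as in this sketch (`a` and `b` are illustrative):

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=[1, 2, 3], name='a')
b = pd.Series([20, 30, 40], index=[2, 3, 4], name='b')

outer = pd.concat([a, b], axis=1)                 # union of labels: 1, 2, 3, 4
inner = pd.concat([a, b], axis=1, join='inner')   # intersection: 2, 3
print(outer)
print(inner)
```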
To create a hierarchical index, use the keys option
pd.concat([ser1,ser2],keys=[1,2])
1 1 0.180191
2 0.061649
3 0.236378
4 0.105309
2 5 0.935277
6 0.516146
7 0.210461
8 0.912048
dtype: float64
pd.concat([ser1,ser2],axis=1,keys=[1,2])
          1         2
1  0.180191       NaN
2  0.061649       NaN
3  0.236378       NaN
4  0.105309       NaN
5       NaN  0.935277
6       NaN  0.516146
7       NaN  0.210461
8       NaN  0.912048
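🎋8.3 Combining
- When merging or concatenating cannot combine the data as needed, use a combining function
- combine_first() combines Series objects while aligning the data: values from the calling Series take priority and the gaps are filled from the other one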
ser1 = pd.Series(np.random.rand(5), index=[1,2,3,4,5])
ser1
1 0.708279
2 0.233048
3 0.030991
4 0.261291
5 0.379752
dtype: float64
ser2 = pd.Series(np.random.rand(4), index = [2,4,5,6])
ser2
2 0.017397
4 0.764295
5 0.407552
6 0.352605
dtype: float64
ser1.combine_first(ser2)
1 0.708279
2 0.233048
3 0.030991
4 0.261291
5 0.379752
6 0.352605
dtype: float64
pd.concat([ser1,ser2])
1 0.708279
2 0.233048
3 0.030991
4 0.261291
5 0.379752
2 0.017397
4 0.764295
5 0.407552
6 0.352605
dtype: float64
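With random data the effect of combine_first() is hard to read off. A deterministic sketch with one NaN in the first Series makes the rule visible: existing values are kept, and NaN or missing labels are filled from the second Series.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0, 40.0], index=['a', 'b', 'c', 'd'])

# 'a' and 'c' come from s1; 'b' (NaN in s1) and 'd' (absent in s1) come from s2.
combined = s1.combine_first(s2)
print(combined)
```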