Python Data Analysis

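🌸Homepage: JoJo的数据分析历险记 (JoJo's Data Analysis Adventures)
📝About the author: a fourth-year statistics undergraduate who has been admitted to a graduate statistics program at one of the top three universities for statistics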

This column covers applications of Python in data analysis.
Reference:
Python for Data Analysis (利用Python进行数据分析)


We previously covered NumPy for data processing; this article covers pandas. pandas is built on top of NumPy but makes data processing more convenient.

Import the required libraries

import numpy as np
import pandas as pd

💮1. Series objects

pandas has two main data objects: Series, which resembles a labelled one-dimensional vector, and DataFrame, a two-dimensional table. Let's first see how to create a Series.

s = pd.Series([12,-4,7,9])
s

0    12
1    -4
2     7
3     9
dtype: int64

🏵️1.1 Basic Series operations

# Select an element
s[2]

# Assign to an element
s[2]=5
s
s['a'] = 4
s

0    12
1    -4
2     5
3     9
a     4
dtype: int64

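# Define a new Series from another object
arr = np.array([1,2,3,4])
# Here s2 may share memory with the original array instead of copying it (this depends on the pandas version), in which case it changes when the array changes. For example, we change the second element of the array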
s2 = pd.Series(arr)
s2
arr[1] = 9
s2

0    1
1    9
2    3
3    4
dtype: int32
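
Whether a Series built from an array shares memory with it depends on the pandas version, so relying on the view behaviour shown above is fragile. A minimal sketch, assuming you want the Series to be independent of the array: pass copy=True to the constructor (s_copy is an illustrative name, not from the original).

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])

# copy=True asks pandas to take its own copy of the data,
# so later changes to arr cannot show up in the Series.
s_copy = pd.Series(arr, copy=True)

arr[1] = 9
print(s_copy.tolist())
```

With an explicit copy the Series keeps the original values regardless of later writes to arr.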

# Filter elements
s[s>8]

0    12
3     9
dtype: int64

# Elements of a Series
serd = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green','green','yellow'])
serd

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

# unique() removes duplicates and returns an array
serd.unique()

array([1, 0, 2, 3], dtype=int64)

# value_counts() returns how many times each distinct value occurs
serd.value_counts()

2    2
1    2
3    1
0    1
dtype: int64

# isin() returns a boolean Series
serd.isin([0,3])

white     False
white      True
blue      False
green     False
green     False
yellow     True
dtype: bool

serd[serd.isin([0,3])]

white     0
yellow    3
dtype: int64

#NaN
s2 = pd.Series([-5,3,np.nan,14])
s2

0    -5.0
1     3.0
2     NaN
3    14.0
dtype: float64

# Use isnull() and notnull() to test for missing values (the output below is that of notnull())
s2.isnull()
s2.notnull()

0     True
1     True
2    False
3     True
dtype: bool
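
The boolean masks returned by isnull() and notnull() are usually combined with other operations. A short sketch, using the illustrative names s_missing, n_missing and clean, that counts the missing values and keeps only the present ones:

```python
import numpy as np
import pandas as pd

s_missing = pd.Series([-5, 3, np.nan, 14])

# True counts as 1, so summing the mask counts the missing values.
n_missing = int(s_missing.isnull().sum())

# Boolean indexing with notnull() keeps only the non-missing values.
clean = s_missing[s_missing.notnull()]

print(n_missing)
print(clean.tolist())
```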

s2

0    -5.0
1     3.0
2     NaN
3    14.0
dtype: float64

# Build a Series from a dict
mydict = {'red':2000,'blue':1000,'yellow':500,'orange':1000}
myseries = pd.Series(mydict)
myseries

red       2000
blue      1000
yellow     500
orange    1000
dtype: int64

A label with no matching value is filled with NaN



colors = ['red','blue','yellow','orange','green']
myseries = pd.Series(mydict, index = colors)
myseries

red       2000.0
blue      1000.0
yellow     500.0
orange    1000.0
green        NaN
dtype: float64

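Arithmetic only produces a value for labels present in both Series; every other label becomes NaN, and calling fillna(0) beforehand does not change that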


mydict2 ={'red':400,'yellow':1000,"black":700}
myseries2 = pd.Series(mydict2)
myseries.fillna(0) + myseries2.fillna(0)

black        NaN
blue         NaN
green        NaN
orange       NaN
red       2400.0
yellow    1500.0
dtype: float64
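
If the goal is to treat a label missing from one side as zero, one option is the add() method with fill_value. The sketch below uses two small Series named a and b (illustrative names) with the same colours as above.

```python
import pandas as pd

a = pd.Series({'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000})
b = pd.Series({'red': 400, 'yellow': 1000, 'black': 700})

# fill_value=0 substitutes 0 for a label that exists in only one Series,
# so the sum is defined for every label in the union of the two indexes.
total = a.add(b, fill_value=0)
print(total)
```

A label that is missing on both sides would still come out as NaN.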

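🌹2. DataFrame objects

The DataFrame is the most common data format in data analysis. It resembles a matrix made up of rows and columns, where each column usually represents a variable and each row an observation. Let's look at some basic DataFrame operations.

Creating a DataFrame

# DataFrame object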
data = {'color':['blue','green','yellow','red','white'],
        'object':['ball','pen','pencil','paper','mug'],
        'price':[1.2,1.0,0.6,0.9,1.7]}
frame = pd.DataFrame(data)
frame

	color	object	price
0	blue	ball	1.2
1	green	pen	1.0
2	yellow	pencil	0.6
3	red	paper	0.9
4	white	mug	1.7

frame2 = pd.DataFrame(data, columns=['object','price'])
frame2

	object	price
0	ball	1.2
1	pen	1.0
2	pencil	0.6
3	paper	0.9
4	mug	1.7

frame3 = pd.DataFrame(data,index=['one','two','three','four','five'])
frame3

	color	object	price
one	blue	ball	1.2
two	green	pen	1.0
three	yellow	pencil	0.6
four	red	paper	0.9
five	white	mug	1.7

# Selecting elements
# Get the names of all columns
frame.columns

Index(['color', 'object', 'price'], dtype='object')

# Get the row labels
frame.index

RangeIndex(start=0, stop=5, step=1)

# Get all values
frame.values

array([['blue', 'ball', 1.2],
       ['green', 'pen', 1.0],
       ['yellow', 'pencil', 0.6],
       ['red', 'paper', 0.9],
       ['white', 'mug', 1.7]], dtype=object)

# Get the values of one column
frame['price']

0    1.2
1    1.0
2    0.6
3    0.9
4    1.7
Name: price, dtype: float64

# Get a row by position with iloc (the older ix accessor has been removed from pandas)
frame.iloc[2]

color     yellow
object    pencil
price        0.6
Name: 2, dtype: object

# Passing several positions selects several rows
frame.iloc[[2,4]]

	color	object	price
2	yellow	pencil	0.6
4	white	mug	1.7

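# Slices such as frame[0:1] or frame[0:2] select rows, but frame[0] raises a KeyError because a single key is looked up as a column label
frame[0:4]

When a slice is used on a DataFrame it selects rows by position: frame[0:1] returns the first row and frame[1:2] returns the second.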

	color	object	price
0	blue	ball	1.2
1	green	pen	1.0
2	yellow	pencil	0.6
3	red	paper	0.9

# To get a single element, give the column name, then the row index value or label
frame['object'][3]

'paper'
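
Chained lookups such as frame['object'][3] work for reading, but the label-based and position-based accessors are usually clearer and are the safer choice for assignment. A minimal sketch on a copy of the same data (frame_demo is an illustrative name):

```python
import pandas as pd

frame_demo = pd.DataFrame({
    'color': ['blue', 'green', 'yellow', 'red', 'white'],
    'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
    'price': [1.2, 1.0, 0.6, 0.9, 1.7],
})

by_label = frame_demo.loc[3, 'object']   # row label 3, column 'object'
by_at = frame_demo.at[3, 'object']       # fast scalar access by label
by_pos = frame_demo.iloc[3, 1]           # fourth row, second column

# Assigning through loc writes into frame_demo itself.
frame_demo.loc[2, 'price'] = 3.3
print(by_label, by_at, by_pos)
```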

# Assignment
frame['new']=12 # add a column directly
frame

	color	object	price	new
0	blue	ball	1.2	12
1	green	pen	1.0	12
2	yellow	pencil	0.6	12
3	red	paper	0.9	12
4	white	mug	1.7	12

frame['new']=[1,2,3,4,5]
frame

	color	object	price	new
0	blue	ball	1.2	1
1	green	pen	1.0	2
2	yellow	pencil	0.6	3
3	red	paper	0.9	4
4	white	mug	1.7	5

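# Modify a single element; use loc rather than chained indexing (frame['price'][2]=3.3), which can fail to update frame
frame.loc[2,'price']=3.3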
frame

	color	object	price	new
0	blue	ball	1.2	1
1	green	pen	1.0	2
2	yellow	pencil	3.3	3
3	red	paper	0.9	4
4	white	mug	1.7	5

# Delete a whole column with del
frame['new'] = 12
frame
del frame['new']
frame

	color	object	price
0	blue	ball	1.2
1	green	pen	1.0
2	yellow	pencil	3.3
3	red	paper	0.9
4	white	mug	1.7

# Filter elements
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['red','white','blue','green'],
                      columns=['ball','pen','pencil','paper'])
frame3
frame3[frame3>12]

	ball	pen	pencil	paper
red	NaN	NaN	NaN	NaN
white	NaN	NaN	NaN	NaN
blue	NaN	NaN	NaN	NaN
green	NaN	13.0	14.0	15.0

# Build a DataFrame from a nested dict; missing values are filled with NaN
nestdict = {'red':{2012:22, 2013:33},'white':{2011: 13,2012:22,2013:16},'blue':{2011:17,2012:27,2013:48}}
nestdict

{'red': {2012: 22, 2013: 33},
 'white': {2011: 13, 2012: 22, 2013: 16},
 'blue': {2011: 17, 2012: 27, 2013: 48}}

frame2 = pd.DataFrame(nestdict)
frame2

	red	white	blue
2011	NaN	13	17
2012	22.0	22	27
2013	33.0	16	48

Transpose

frame2.T

	2011	2012	2013
red	NaN	22.0	33.0
white	13.0	22.0	16.0
blue	17.0	27.0	48.0

# Index objects
ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])
ser.index

Index(['red', 'blue', 'yellow', 'white', 'green'], dtype='object')

ser.idxmax()

'white'

ser.idxmin()

'blue'

# Index with duplicate labels
serd = pd.Series(range(6), index=['white','white','blue','green','green','yellow'])
serd

white     0
white     1
blue      2
green     3
green     4
yellow    5
dtype: int64

# When one label maps to several elements, a Series is returned instead of a single element
serd['white']

white    0
white    1
dtype: int64

# Check whether the index contains duplicate labels with is_unique
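
A small sketch of that check, rebuilding the Series with duplicate labels under the illustrative name serd_demo:

```python
import pandas as pd

serd_demo = pd.Series(range(6),
                      index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

# is_unique is False as soon as any label appears more than once.
unique_flag = serd_demo.index.is_unique

# duplicated() marks every repeat of a label after its first occurrence.
dup_mask = serd_demo.index.duplicated()

print(unique_flag)
print(dup_mask.tolist())
```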

# Other features of Index objects
ser = pd.Series([2,5,7,4],index = ['one','two','three','four'])
ser

one      2
two      5
three    7
four     4
dtype: int64

# reindex() changes the index of a Series and returns a new Series
ser.reindex(['three','one','five','two'])

three    7.0
one      2.0
five     NaN
two      5.0
dtype: float64

ser3 = pd.Series([1,5,6,3],index=[0,3,5,6])
ser3

0    1
3    5
5    6
6    3
dtype: int64

# Automatic filling
# reindex() with method='ffill' fills with the preceding value; 'bfill' fills with the following value
ser3.reindex(range(6),method='ffill')

0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64

ser3.reindex(range(8),method='bfill')

0    1.0
1    5.0
2    5.0
3    5.0
4    6.0
5    6.0
6    3.0
7    NaN
dtype: float64
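
Besides forward and backward filling, reindex() also accepts a constant through fill_value. A minimal sketch on the four-element Series from earlier (ser_demo and filled are illustrative names):

```python
import pandas as pd

ser_demo = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])

# 'five' does not exist in ser_demo, so it receives fill_value instead of NaN.
filled = ser_demo.reindex(['three', 'one', 'five', 'two'], fill_value=0)
print(filled)
```

Because no NaN is introduced, the values stay integers here.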

frame.reindex(range(5), method='ffill',columns=['colors','price','new','object'])

	colors	price	new	object
0	blue	1.2	blue	ball
1	green	1.0	green	pen
2	yellow	3.3	yellow	pencil
3	red	0.9	red	paper
4	white	1.7	white	mug

ser = pd.Series(np.arange(4.),index=['red','blue','yellow','white'])
ser

red       0.0
blue      1.0
yellow    2.0
white     3.0
dtype: float64

ser.drop('yellow')

red      0.0
blue     1.0
white    3.0
dtype: float64

ser.drop(['blue','white'])

red       0.0
yellow    2.0
dtype: float64

frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])
frame

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

# drop() removes rows by default; axis selects the axis, 1 for columns
frame.drop(['pen'],axis=1)

	ball	pencil	paper
red	0	2	3
blue	4	6	7
yellow	8	10	11
white	12	14	15

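🥀3. Basic pandas operations

🌺3.1 Arithmetic

When two Series or DataFrame objects are added, values are summed for labels present in both objects.
A label present in only one of the objects yields NaN.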

s1 = pd.Series([3,2,5,1],index=['white','yellow','green','blue'])
s1

white     3
yellow    2
green     5
blue      1
dtype: int64

s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])
s1 + s2

black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64

# The same applies to DataFrame objects
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                     columns=['ball','pen','pencil','paper'],
                      index = ['red','blue','yellow','white'])
frame1

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
                     index = ['blue','yellow','green','white']
                     ,columns=['ball','pen','mug'])
frame2

	ball	pen	mug
blue	0	1	2
yellow	3	4	5
green	6	7	8
white	9	10	11

frame3 = frame1+frame2
frame3

	ball	mug	paper	pen	pencil
blue	4.0	NaN	NaN	6.0	NaN
green	NaN	NaN	NaN	NaN	NaN
red	NaN	NaN	NaN	NaN	NaN
white	21.0	NaN	NaN	23.0	NaN
yellow	11.0	NaN	NaN	13.0	NaN

🌻3.2 Basic arithmetic methods

The main arithmetic methods are:

add() frame1.add(frame2) = frame1+frame2
sub()
div()
mul()

The following examples illustrate them

frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     columns=['ball','pen','pencil','paper'],
                      index = ['red','blue','yellow','white'])
frame

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

ser = pd.Series(np.arange(4),['ball','pen','pencil','paper'])
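ser # the index of ser matches the column names of frame; with the - operator a Series is aligned against the columns, not the rows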

ball      0
pen       1
pencil    2
paper     3
dtype: int32

frame-ser

	ball	pen	pencil	paper
red	0	0	0	0
blue	4	4	4	4
yellow	8	8	8	8
white	12	12	12	12

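When a label exists in only one of the two data structures, the result gains a new entry for that label, filled with NaN.

For example, let's add an element mug to ser.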

ser['mug'] = 9
ser

ball      0
pen       1
pencil    2
paper     3
mug       9
dtype: int64

frame - ser

	ball	mug	paper	pen	pencil
red	0	NaN	0	0	0
blue	4	NaN	4	4	4
yellow	8	NaN	8	8	8
white	12	NaN	12	12	12
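
The operator form always matches a Series against the column labels. To subtract a Series whose index matches the row labels instead, the method form with axis=0 can be used. A sketch, where row_ser is a hypothetical Series introduced only for this illustration:

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(np.arange(16).reshape((4, 4)),
                          index=['red', 'blue', 'yellow', 'white'],
                          columns=['ball', 'pen', 'pencil', 'paper'])

# One value per row label (hypothetical data for this example).
row_ser = pd.Series([0, 4, 8, 12], index=['red', 'blue', 'yellow', 'white'])

# axis=0 aligns row_ser with the row index, so each row has its own value subtracted.
result = frame_demo.sub(row_ser, axis=0)
print(result)
```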

🌼3.3 Function mapping

Functions can be applied to all elements of DataFrame and Series objects.

frame

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

# Square root of every element
np.sqrt(frame)

	ball	pen	pencil	paper
red	0.000000	1.000000	1.414214	1.732051
blue	2.000000	2.236068	2.449490	2.645751
yellow	2.828427	3.000000	3.162278	3.316625
white	3.464102	3.605551	3.741657	3.872983

# Define a function
# Option 1:
f = lambda x:x.max()-x.min() # returns the range (max minus min) of an array
# Option 2:
def f(x):
    return x.max()-x.min()

# apply() calls the function on each column

frame.apply(f)

ball      12
pen       12
pencil    12
paper     12
dtype: int64

def f(x):
    return pd.Series([x.min(),x.max()],index = ['min','max'])

frame.apply(f,axis = 1)
# axis=1 applies f to each row; the default is axis=0

	min	max
red	0	3
blue	4	7
yellow	8	11
white	12	15
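
When the function only wraps built-in reductions such as min and max, an alternative is to pass their names to agg(). This is a sketch of that alternative, not the approach used above; it summarises each column rather than each row.

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(np.arange(16).reshape((4, 4)),
                          index=['red', 'blue', 'yellow', 'white'],
                          columns=['ball', 'pen', 'pencil', 'paper'])

# Each name in the list becomes one row of the result, computed per column.
summary = frame_demo.agg(['min', 'max'])
print(summary)
```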

🌷4. Statistical functions

Most NumPy array statistics functions also work on DataFrame objects, so they can be used directly.

frame.sum()

ball      24
pen       28
pencil    32
paper     36
dtype: int64

frame.mean()

ball      6.0
pen       7.0
pencil    8.0
paper     9.0
dtype: float64

# describe() computes several statistics at once
frame.describe()

	ball	pen	pencil	paper
count	4.000000	4.000000	4.000000	4.000000
mean	6.000000	7.000000	8.000000	9.000000
std	5.163978	5.163978	5.163978	5.163978
min	0.000000	1.000000	2.000000	3.000000
25%	3.000000	4.000000	5.000000	6.000000
50%	6.000000	7.000000	8.000000	9.000000
75%	9.000000	10.000000	11.000000	12.000000
max	12.000000	13.000000	14.000000	15.000000

ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])
ser.rank(method='first')

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

🌱4.1 Correlation and covariance

🌲4.1.1 Series objects

These usually involve two data objects.
The functions are corr() and cov().

seq = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq

2006    3
2007    4
2008    3
2009    4
2010    5
2011    4
2012    3
2013    2
dtype: int64

seq2 = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq2

2006    1
2007    2
2008    3
2009    4
2010    4
2011    3
2012    2
2013    1
dtype: int64

seq.corr(seq2)

0.7745966692414834

seq.cov(seq2)

0.8571428571428571
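
The two results are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. A short sketch that checks this on the same data (manual_corr is an illustrative name):

```python
import pandas as pd

years = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013']
seq = pd.Series([3, 4, 3, 4, 5, 4, 3, 2], years)
seq2 = pd.Series([1, 2, 3, 4, 4, 3, 2, 1], years)

# cov() and std() both use the sample (n - 1) normalisation,
# so their ratio reproduces the Pearson correlation from corr().
manual_corr = seq.cov(seq2) / (seq.std() * seq2.std())

print(manual_corr)
print(seq.corr(seq2))
```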

🌳4.1.2 DataFrame objects

For a DataFrame, corr() and cov() again return a DataFrame

frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame2

	ball	pen	pencil	paper
red	1	4	3	6
blue	4	5	6	1
yellow	3	3	1	5
white	4	1	6	4

frame2.corr()

	ball	pen	pencil	paper
ball	1.000000	-0.276026	0.577350	-0.763763
pen	-0.276026	1.000000	-0.079682	-0.361403
pencil	0.577350	-0.079682	1.000000	-0.692935
paper	-0.763763	-0.361403	-0.692935	1.000000

frame2.cov()

	ball	pen	pencil	paper
ball	2.000000	-0.666667	2.000000	-2.333333
pen	-0.666667	2.916667	-0.333333	-1.333333
pencil	2.000000	-0.333333	6.000000	-3.666667
paper	-2.333333	-1.333333	-3.666667	4.666667

🌴4.1.3 Correlation between a DataFrame and a Series

corrwith() computes pairwise correlations between the columns (or rows) of a DataFrame and a Series or another DataFrame

ser

red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

frame2.corrwith(ser)

ball     -0.140028
pen      -0.869657
pencil    0.080845
paper     0.595854
dtype: float64

frame2.corrwith(frame)

ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64

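🌵4.2 Sorting and ranking

For a Series, use sort_values() and rank(). The default order is ascending; pass ascending=False for descending order (the same applies below).
For a DataFrame, use sort_values(by='') and rank().

Sort ser by value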

ser.sort_values()

blue      0
yellow    3
green     4
red       5
white     8
dtype: int64

Rank ser

ser.rank()

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

Sort frame by the pen column

frame.sort_values(by='pen')

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15
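
Since the pen column is already in ascending order, the result above looks identical to the input. A sketch with ascending=False makes the effect of the sort visible (sorted_desc is an illustrative name):

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(np.arange(16).reshape((4, 4)),
                          index=['red', 'blue', 'yellow', 'white'],
                          columns=['ball', 'pen', 'pencil', 'paper'])

# Sort the rows by the values in the pen column, largest first.
sorted_desc = frame_demo.sort_values(by='pen', ascending=False)
print(sorted_desc)
```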

ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
ser

red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

ser.sort_index()
# ascending alphabetical order

blue      0
green     4
red       5
white     8
yellow    3
dtype: int64

ser.sort_index(ascending=False)
# descending order

yellow    3
white     8
red       5
green     4
blue      0
dtype: int64

ser.sort_values()
# sort by value

blue      0
yellow    3
green     4
red       5
white     8
dtype: int64

frame

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

frame.sort_index()

	ball	pen	pencil	paper
blue	4	5	6	7
red	0	1	2	3
white	12	13	14	15
yellow	8	9	10	11

axis selects which labels are sorted: axis=0 (the default) sorts by the row index, axis=1 sorts by the column labels

frame.sort_index(axis=1)

	ball	paper	pen	pencil
red	0	3	1	2
blue	4	7	5	6
yellow	8	11	9	10
white	12	15	13	14

🌾5. Handling missing values in pandas

🌿5.1 Creating NaN data

Use np.nan to assign a NaN value

ser = pd.Series([0,1,2,np.nan,9], index=['red','blue','yellow','white','green'])
ser

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

ser['white']

nan

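☘️5.2 Dropping NaN

dropna()
ser[ser.notnull()]
On a DataFrame, dropna() by default removes every row that contains at least one NaN (how='any'). To avoid discarding that much data, pass how='all' so that only rows (or columns, with axis=1) whose elements are all NaN are dropped.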

frame3.dropna(how='all')

	ball	mug	paper	pen	pencil
blue	4.0	NaN	NaN	6.0	NaN
white	21.0	NaN	NaN	23.0	NaN
yellow	11.0	NaN	NaN	13.0	NaN

🍀5.3 Filling NaN with other values

The fillna() function

frame3.fillna(0)

	ball	mug	paper	pen	pencil
blue	4.0	0.0	0.0	6.0	0.0
green	0.0	0.0	0.0	0.0	0.0
red	0.0	0.0	0.0	0.0	0.0
white	21.0	0.0	0.0	23.0	0.0
yellow	11.0	0.0	0.0	13.0	0.0

To replace NaN with different values in different columns, pass a dict mapping each column name to its replacement value

frame3.fillna({'ball':1,"pen":99})

	ball	mug	paper	pen	pencil
blue	4.0	NaN	NaN	6.0	NaN
green	1.0	NaN	NaN	99.0	NaN
red	1.0	NaN	NaN	99.0	NaN
white	21.0	NaN	NaN	23.0	NaN
yellow	11.0	NaN	NaN	13.0	NaN
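
The replacement does not have to be a constant chosen by hand; a statistic computed from the data is a common choice. A minimal sketch that fills the gap in a Series with its mean (ser_nan and filled_mean are illustrative names):

```python
import numpy as np
import pandas as pd

ser_nan = pd.Series([0, 1, 2, np.nan, 9],
                    index=['red', 'blue', 'yellow', 'white', 'green'])

# mean() skips NaN by default, so this is the mean of the four known values.
filled_mean = ser_nan.fillna(ser_nan.mean())
print(filled_mean)
```

Whether the mean is an appropriate replacement depends on the data; it is used here only to show the mechanics.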

🍁6. Hierarchical indexing and statistics by level

Sometimes data needs to be indexed at several levels. Consider the following example

mser = pd.Series(np.random.rand(8),index=[['white','white','white','blue','blue','red','red','red'],
                                         ['up','down','right','up','down','up','down','left']])
mser

white  up       0.323513
       down     0.080292
       right    0.503630
blue   up       0.201143
       down     0.173879
red    up       0.866267
       down     0.601906
       left     0.140885
dtype: float64

mser.index

MultiIndex(levels=[['blue', 'red', 'white'], ['down', 'left', 'right', 'up']],
           codes=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])

mser['white']

up       0.323513
down     0.080292
right    0.503630
dtype: float64

mser[:,'up']

white    0.323513
blue     0.201143
red      0.866267
dtype: float64

mser[:,'right']

white    0.50363
dtype: float64

mser['white','up'] # selects one specific element

0.32351250980575463
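
The same selections can be written more explicitly with loc and xs(). The sketch below uses fixed values instead of random ones so the result is reproducible (mser_demo is an illustrative name):

```python
import pandas as pd

mser_demo = pd.Series([1.0, 2.0, 3.0, 4.0],
                      index=[['white', 'white', 'blue', 'blue'],
                             ['up', 'down', 'up', 'down']])

# A tuple of labels, one per level, selects a single element.
one_value = mser_demo.loc[('white', 'up')]

# xs() takes a cross-section: every entry whose second level equals 'up'.
ups = mser_demo.xs('up', level=1)

print(one_value)
print(ups)
```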

🍂6.1 unstack() and stack()

unstack() converts a Series with a hierarchical index into a plain DataFrame, turning the second index level into columns; stack() does the reverse:

mser.unstack() # convert the Series to a DataFrame
mser.unstack().fillna(0)

	down	left	right	up
blue	0.173879	0.000000	0.00000	0.201143
red	0.601906	0.140885	0.00000	0.866267
white	0.080292	0.000000	0.50363	0.323513

frame

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

frame.stack() # convert the DataFrame to a Series

red     ball       0
        pen        1
        pencil     2
        paper      3
blue    ball       4
        pen        5
        pencil     6
        paper      7
yellow  ball       8
        pen        9
        pencil    10
        paper     11
white   ball      12
        pen       13
        pencil    14
        paper     15
dtype: int32

The rows and columns of a DataFrame can also have hierarchical indexes

mframe = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index = [['white','white','red','red'],['up','down','up','down']],
                     columns=[['pen','pen','paper','paper'],[1,2,1,2]])
mframe

		pen		paper
		1	2	1	2
white	up	0	1	2	3
	down	4	5	6	7
red	up	8	9	10	11
	down	12	13	14	15

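🍃6.2 Reordering levels

swaplevel() takes the names of the two levels to swap and returns a new object with those levels exchanged; the order of the elements is unchanged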

mframe.columns.names = ['object','id']
mframe.index.names = ['colors','status']
mframe

	object	pen		paper
	id	1	2	1	2
colors	status
white	up	0	1	2	3
	down	4	5	6	7
red	up	8	9	10	11
	down	12	13	14	15

mframe.swaplevel('colors','status')

	object	pen	paper
	id	1	2	1	2
status	colors
up	white	0	1	2	3
down	white	4	5	6	7
up	red	8	9	10	11
down	red	12	13	14	15

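🌍6.3 Statistics by level

Group by the name of the level, then aggregate. (Older pandas versions accepted mframe.sum(level='colors'); the level argument was removed in pandas 2.0.)

mframe.groupby(level='colors', sort=False).sum()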

object	pen	paper
id	1	2	1	2
colors
white	4	6	8	10
red	20	22	24	26

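To aggregate over a level of the columns instead, transpose, group by that level, and transpose back (older versions used mframe.sum(level='id', axis=1))

mframe.T.groupby(level='id').sum().T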

	id	1	2
colors	status
white	up	2	4
	down	10	12
red	up	18	20
	down	26	28
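
The level argument of sum() is not available in pandas 2.0 and later. A self-contained sketch of the groupby-based equivalent, assuming a recent pandas version (mframe_demo, by_color and by_id are illustrative names):

```python
import numpy as np
import pandas as pd

mframe_demo = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=[['white', 'white', 'red', 'red'], ['up', 'down', 'up', 'down']],
    columns=[['pen', 'pen', 'paper', 'paper'], [1, 2, 1, 2]])
mframe_demo.index.names = ['colors', 'status']
mframe_demo.columns.names = ['object', 'id']

# Sum the rows that share a value in the 'colors' level;
# sort=False keeps the original order (white before red).
by_color = mframe_demo.groupby(level='colors', sort=False).sum()

# For a column level: transpose, group on that level, transpose back.
by_id = mframe_demo.T.groupby(level='id').sum().T

print(by_color)
print(by_id)
```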

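🌎7. Importing data

Very often the data to analyse lives in a file on disk. This article shows how to import the most common format, CSV; later articles will cover importing JSON data, connecting to SQL databases, and other ways of loading data.

import pandas as pd # load the module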

df = pd.read_csv('student.csv')
df

Student     ID     name   age  gender
11        1111    Dw    3  Female
12        1112     Q   23    Male
13        1113     W   21  Female
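
To try read_csv() without a file on disk, the text can be wrapped in io.StringIO. This is a sketch under stated assumptions: the column names and layout of csv_text are hypothetical stand-ins, since the real contents of student.csv are not shown in full.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file such as student.csv.
csv_text = (
    "id,name,age,gender\n"
    "1111,Dw,3,Female\n"
    "1112,Q,23,Male\n"
    "1113,W,21,Female\n"
)

# read_csv accepts any file-like object; index_col picks the column used as row labels.
df_demo = pd.read_csv(io.StringIO(csv_text), index_col='id')
print(df_demo)
```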


🌏8. Data processing

Joining
Concatenation
Combining

🌐8.1 Joins

merge() works like a multi-table join in SQL

🎇8.1.1 Inner joins

import numpy as np
import pandas as pd

frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],
                      'price':[12.33,11.44,33.21,13.23,33.62]})
frame1

	id	price
0	ball	12.33
1	pencil	11.44
2	pen	33.21
3	mug	13.23
4	ashtray	33.62

frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'],
                      'color':['white','red','red','black']})
frame2

	id	color
0	pencil	white
1	pencil	red
2	ball	red
3	pen	black

pd.merge(frame1,frame2)

	id	price	color
0	ball	12.33	red
1	pencil	11.44	white
2	pencil	11.44	red
3	pen	33.21	black

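The DataFrame returned above consists of the rows whose id matches in both original DataFrames. No join column was specified, so merge() used the column names the two frames share. In practice you should usually state the join condition explicitly with the on option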

frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],
                      'color':['white','red','red','black','green'],
                      'brand':['OMG','ABC','ABC','POD','POD']})
frame1

	id	color	brand
0	ball	white	OMG
1	pencil	red	ABC
2	pen	red	ABC
3	mug	black	POD
4	ashtray	green	POD

frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'],
                      'brand':['OMG','POD','ABC','POD']})
frame2

	id	brand
0	pencil	OMG
1	pencil	POD
2	ball	ABC
3	pen	POD

pd.merge(frame1,frame2,on='id') # join on id

	id	color	brand_x	brand_y
0	ball	white	OMG	ABC
1	pencil	red	ABC	OMG
2	pencil	red	ABC	POD
3	pen	red	ABC	POD

pd.merge(frame1,frame2,on='brand') # join on brand

	id_x	color	brand	id_y
0	ball	white	OMG	pencil
1	pencil	red	ABC	ball
2	pen	red	ABC	ball
3	mug	black	POD	pencil
4	mug	black	POD	pen
5	ashtray	green	POD	pencil
6	ashtray	green	POD	pen

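When the key columns have different names, use left_on and right_on. For example, in the two tables below one key is called id and the other sid, so we join the first table's id to the second table's sid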

frame2.columns = ['sid','brand']
frame2

	sid	brand
0	pencil	OMG
1	pencil	POD
2	ball	ABC
3	pen	POD

pd.merge(frame1,frame2,left_on = 'id',right_on ='sid')

	id	color	brand_x	sid	brand_y
0	ball	white	OMG	ball	ABC
1	pencil	red	ABC	pencil	OMG
2	pencil	red	ABC	pencil	POD
3	pen	red	ABC	pen	POD

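merge() performs an inner join by default: the keys in the result above are the intersection of the keys in the two frames

🎉8.1.2 Outer joins

The join type is specified with the how option:
a left join keeps the shared keys plus the keys found only in the left frame;
a right join keeps the shared keys plus the keys found only in the right frame;
an outer join combines all keys from both frames.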

frame2.columns=['id','brand']

pd.merge(frame1,frame2,how='outer')

	id	color	brand
0	ball	white	OMG
1	pencil	red	ABC
2	pen	red	ABC
3	mug	black	POD
4	ashtray	green	POD
5	pencil	NaN	OMG
6	pencil	NaN	POD
7	ball	NaN	ABC
8	pen	NaN	POD

pd.merge(frame1,frame2,how='left')

	id	color	brand
0	ball	white	OMG
1	pencil	red	ABC
2	pen	red	ABC
3	mug	black	POD
4	ashtray	green	POD

pd.merge(frame1,frame2,how='right')

	id	color	brand
0	pencil	NaN	OMG
1	pencil	NaN	POD
2	ball	NaN	ABC
3	pen	NaN	POD
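
To see which side each row of a join came from, merge() offers the indicator option, which adds a _merge column. A short sketch with two small hypothetical frames named left and right:

```python
import pandas as pd

left = pd.DataFrame({'id': ['ball', 'pencil', 'mug'],
                     'price': [12.33, 11.44, 13.23]})
right = pd.DataFrame({'id': ['pencil', 'ball', 'pen'],
                      'color': ['white', 'red', 'black']})

# indicator=True adds a _merge column with 'both', 'left_only' or 'right_only'.
merged = pd.merge(left, right, on='id', how='outer', indicator=True)

# Map each id to its origin for easy inspection.
origin = dict(zip(merged['id'], merged['_merge'].astype(str)))
print(merged)
```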

To join on multiple keys, pass a list of keys to the on option

pd.merge(frame1,frame2,on=['id','brand'],how='outer')

	id	color	brand
0	ball	white	OMG
1	pencil	red	ABC
2	pen	red	ABC
3	mug	black	POD
4	ashtray	green	POD
5	pencil	NaN	OMG
6	pencil	NaN	POD
7	ball	NaN	ABC
8	pen	NaN	POD

🎊8.1.3 Joining on the index

# Option 1
pd.merge(frame1,frame2,left_index=True,right_index=True)

	id_x	color	brand_x	id_y	brand_y
0	ball	white	OMG	pencil	OMG
1	pencil	red	ABC	pencil	POD
2	pen	red	ABC	ball	ABC
3	mug	black	POD	pen	POD

# Option 2: use join(), which defaults to a left join
frame2.columns = ['id2','brand2']
frame1.join(frame2)

	id	color	brand	id2	brand2
0	ball	white	OMG	pencil	OMG
1	pencil	red	ABC	pencil	POD
2	pen	red	ABC	ball	ABC
3	mug	black	POD	pen	POD
4	ashtray	green	POD	NaN	NaN

🎄8.2 Concatenation

NumPy's concatenate() function concatenates arrays.
pandas' concat() function concatenates objects along an axis.

arr1 = np.arange(9).reshape(3,3)
arr1

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

arr2 = np.arange(6,15).reshape(3,3)
arr2

array([[ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

np.concatenate([arr1,arr2],axis=1) # the default is axis=0

array([[ 0,  1,  2,  6,  7,  8],
       [ 3,  4,  5,  9, 10, 11],
       [ 6,  7,  8, 12, 13, 14]])

ser1 = pd.Series(np.random.rand(4), index = [1,2,3,4])
ser1

1    0.180191
2    0.061649
3    0.236378
4    0.105309
dtype: float64

ser2 = pd.Series(np.random.rand(4), index = [5,6,7,8])
ser2

5    0.935277
6    0.516146
7    0.210461
8    0.912048
dtype: float64

pd.concat([ser1,ser2])

1    0.180191
2    0.061649
3    0.236378
4    0.105309
5    0.935277
6    0.516146
7    0.210461
8    0.912048
dtype: float64

pd.concat([ser1,ser2],axis = 1)

	0	1
1	0.180191	NaN
2	0.061649	NaN
3	0.236378	NaN
4	0.105309	NaN
5	NaN	0.935277
6	NaN	0.516146
7	NaN	0.210461
8	NaN	0.912048

The default is an outer join

pd.concat([ser1,ser2],axis=1,join='inner')
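
With join='inner' only the labels present in every input survive. Since ser1 and ser2 above share no labels, their inner concatenation is empty. A sketch with fixed values showing both the overlapping and the disjoint case (s_a, s_b and s_c are illustrative names):

```python
import pandas as pd

s_a = pd.Series([0.1, 0.2], index=[1, 2])
s_b = pd.Series([0.3, 0.4], index=[2, 3])
s_c = pd.Series([0.5, 0.6], index=[7, 8])

# Label 2 is the only one present in both s_a and s_b.
inner = pd.concat([s_a, s_b], axis=1, join='inner')

# s_a and s_c have no label in common, so nothing is kept.
empty = pd.concat([s_a, s_c], axis=1, join='inner')

print(inner)
print(empty.shape)
```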

To create a hierarchical index, use the keys option

pd.concat([ser1,ser2],keys=[1,2])

1  1    0.180191
   2    0.061649
   3    0.236378
   4    0.105309
2  5    0.935277
   6    0.516146
   7    0.210461
   8    0.912048
dtype: float64

pd.concat([ser1,ser2],axis=1,keys=[1,2])

	1	2
1	0.180191	NaN
2	0.061649	NaN
3	0.236378	NaN
4	0.105309	NaN
5	NaN	0.935277
6	NaN	0.516146
7	NaN	0.210461
8	NaN	0.912048

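🎋8.3 Combining

When the data cannot be put together by merging or concatenation, use a combining function.
combine_first() combines Series objects while aligning their data: values from the calling Series take priority, and gaps are filled from the other one.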

ser1 = pd.Series(np.random.rand(5), index=[1,2,3,4,5])
ser1

1    0.708279
2    0.233048
3    0.030991
4    0.261291
5    0.379752
dtype: float64

ser2 = pd.Series(np.random.rand(4), index = [2,4,5,6])
ser2

2    0.017397
4    0.764295
5    0.407552
6    0.352605
dtype: float64

ser1.combine_first(ser2)

1    0.708279
2    0.233048
3    0.030991
4    0.261291
5    0.379752
6    0.352605
dtype: float64

pd.concat([ser1,ser2])

1    0.708279
2    0.233048
3    0.030991
4    0.261291
5    0.379752
2    0.017397
4    0.764295
5    0.407552
6    0.352605
dtype: float64
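
Because the examples above use random numbers, it can be hard to see which value won. A deterministic sketch, with illustrative names p, q and combined, in which the calling Series has a gap and lacks one label:

```python
import numpy as np
import pandas as pd

p = pd.Series([1.0, np.nan, 3.0], index=[1, 2, 3])
q = pd.Series([20.0, 30.0, 40.0], index=[2, 3, 4])

# Values from p win where they exist; the NaN at label 2 and the
# missing label 4 are taken from q.
combined = p.combine_first(q)
print(combined)
```

Unlike concat(), the result has one entry per label rather than repeated labels.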
