1 Introduction to pandas data structures
1.1 Series
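Note: A Series is a one-dimensional, array-like object. It consists of a sequence of values and an associated set of data labels, called its index.
1) String representation of a Series: the index is shown on the left and the values on the right. If no index is specified for the data, a default integer index from 0 through N-1 is created automatically.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

obj = pd.Series([4, 7, -5, 3])
print(obj)
'''
0    4
1    7
2   -5
3    3
dtype: int64
'''
print(obj.values)  # [ 4  7 -5  3]
print(obj.index)   # RangeIndex(start=0, stop=4, step=1)

# Create a Series with a custom index
obj2 = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
print(obj2)
'''
a    4
b    7
c   -5
d    3
dtype: int64
'''
# Select values by index label
print(obj2['a'])  # 4
print(obj2[['a', 'c', 'd']])
'''
a    4
c   -5
d    3
dtype: int64
'''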
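2) NumPy functions and NumPy-like operations (boolean filtering, scalar multiplication, and so on) preserve the link between index labels and values
print(obj2[obj2 > 0])
'''
a    4
b    7
d    3
dtype: int64
'''
print(obj2 * 2)
'''
a     8
b    14
c   -10
d     6
dtype: int64
'''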
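As a small additional sketch of the same point, a NumPy element-wise function can be applied to a Series directly. The Series `s` and its values below are made up for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not from the original notes)
s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

# An element-wise NumPy function returns a Series that keeps the same index
roots = np.sqrt(s)
print(roots)

# Membership tests look at the index labels, much like a dict
has_b = 'b' in s
has_z = 'z' in s
print(has_b, has_z)
```

The result of `np.sqrt` is still a Series, so label-based selection such as `roots['c']` continues to work.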
3) A Series can be created directly from a dict
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print(obj3)
'''
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
'''
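In the example below, the values in sdata whose keys match the labels in states are found and placed in the corresponding positions. No value exists in sdata for 'California', so the result holds NaN there (not a number, which pandas uses to mark missing data). 'Utah' is a key in sdata but is not in states, so it is excluded from the result.
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
print(obj4)
'''
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
'''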
The pandas functions isnull and notnull detect missing data:
print(pd.isnull(obj4))
'''
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
'''
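The example above only calls isnull. A minimal sketch of its complement, notnull, and of the equivalent Series methods follows. It rebuilds `obj4` from the same data so that it runs on its own; the names `mask`, `n_missing`, and `present` are introduced here for illustration.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# notnull is the complement of isnull: True where a value is present
mask = pd.notnull(obj4)
print(mask)

# The same checks exist as Series methods; summing the booleans counts missing entries
n_missing = int(obj4.isnull().sum())
print(n_missing)

# A boolean mask can be used to keep only the entries that are present
present = obj4[mask]
print(present)
```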
4) A Series also aligns data automatically by index label in arithmetic operations
print(obj3)
'''
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
'''
print(obj4)
'''
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
'''
# Labels present in only one operand (or missing in either) produce NaN
print(obj3 + obj4)
'''
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
'''
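When NaN for non-overlapping labels is not wanted, one option is the add method with fill_value, sketched below under the assumption that a label missing on one side should count as 0. The two Series are rebuilt from the same data so the block runs on its own; `total` is a name introduced here.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# fill_value=0 substitutes 0 for a value that is missing on ONE side only
total = obj3.add(obj4, fill_value=0)
print(total)

# 'Utah' exists only in obj3, so it keeps its own value.
# 'California' is missing on both sides (absent in obj3, NaN in obj4), so it stays NaN.
```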
1.2 DataFrame
Note: A DataFrame is a tabular data structure. It contains an ordered collection of columns, and each column can hold a different value type. A DataFrame has both a row index and a column index.
1) To build a DataFrame, pass a dict of equal-length lists or NumPy arrays
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
print(frame)
'''
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
'''
# For very large DataFrames, head() returns only the first 5 rows
print(frame.head())
'''
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
'''
2) If a sequence of columns is specified, the columns are arranged in that order
frame = pd.DataFrame(data, columns=['pop', 'state', 'year'])
print(frame)
'''
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003
'''
frame = pd.DataFrame(data, columns=['pop', 'year'])
print(frame)
'''
   pop  year
0  1.5  2000
1  1.7  2001
2  3.6  2002
3  2.4  2001
4  2.9  2002
5  3.2  2003
'''
# A requested column that is not found in the data is filled with the missing value NaN
frame = pd.DataFrame(data, columns=['pop', 'state', 'year', 'price'])
print(frame)
'''
   pop   state  year price
0  1.5    Ohio  2000   NaN
1  1.7    Ohio  2001   NaN
2  3.6    Ohio  2002   NaN
3  2.4  Nevada  2001   NaN
4  2.9  Nevada  2002   NaN
5  3.2  Nevada  2003   NaN
'''
3) A single DataFrame column can be retrieved as a Series
print(frame['state'])
'''
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object
'''
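4) Rows can be retrieved by index label with the loc attribute (here the labels are the default integers 0 through 5)
# Rebuild the frame with its original three columns so the output below matches
frame = pd.DataFrame(data)
print(frame.loc[3])
'''
state    Nevada
year       2001
pop         2.4
Name: 3, dtype: object
'''
# Get the rows labelled 1 and 3
print(frame.loc[[1, 3]])
'''
    state  year  pop
1    Ohio  2001  1.7
3  Nevada  2001  2.4
'''
# Select rows and specify columns at the same time
print(frame.loc[[1, 3], ['year', 'state']])
'''
   year   state
1  2001    Ohio
3  2001  Nevada
'''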
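5) Columns can be modified by assignment
frame = pd.DataFrame(data, columns=['state', 'year', 'pop', 'price'])
# Assigning a scalar broadcasts it to every row
frame['price'] = 14500
print(frame)
'''
    state  year  pop  price
0    Ohio  2000  1.5  14500
1    Ohio  2001  1.7  14500
2    Ohio  2002  3.6  14500
3  Nevada  2001  2.4  14500
4  Nevada  2002  2.9  14500
5  Nevada  2003  3.2  14500
'''
# Assign your own data; an array or list must have the same length as the DataFrame
frame['price'] = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
print(frame)
'''
    state  year  pop price
0    Ohio  2000  1.5     a
1    Ohio  2001  1.7     b
2    Ohio  2002  3.6     c
3  Nevada  2001  2.4     d
4  Nevada  2002  2.9     e
5  Nevada  2003  3.2     f
'''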
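Besides scalars and equal-length arrays, a Series can also be assigned to a column. The sketch below assumes a hypothetical Series `val` whose values and labels are invented for illustration; it shows that the assignment aligns on the row index and leaves NaN wherever `val` has no matching label.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data, columns=['state', 'year', 'pop', 'price'])

# Hypothetical values, labelled with only three of the six row labels
val = pd.Series([-1.2, -1.5, -1.7], index=[1, 3, 4])

# Assignment aligns val to the frame's index; rows 0, 2 and 5 receive NaN
frame['price'] = val
print(frame)
```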
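6) Use del to delete a column
# Rebuild the frame, then add a new boolean column
frame = pd.DataFrame(data, columns=['state', 'year', 'pop', 'price'])
frame['eastern'] = frame.state == 'Ohio'
print(frame)
'''
    state  year  pop price  eastern
0    Ohio  2000  1.5   NaN     True
1    Ohio  2001  1.7   NaN     True
2    Ohio  2002  3.6   NaN     True
3  Nevada  2001  2.4   NaN    False
4  Nevada  2002  2.9   NaN    False
5  Nevada  2003  3.2   NaN    False
'''
del frame['eastern']
print(frame.columns)
# Index(['state', 'year', 'pop', 'price'], dtype='object')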
7) Building a DataFrame from a nested dict
Note: For a nested dict, the outer keys become the columns and the inner keys become the row index.
pop = {'Nevada': {2001: 2.4, 2002: 2.9, 2000: 2.5},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame1 = pd.DataFrame(pop)
print(frame1)
'''
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     2.5   1.5
'''
# Transpose to swap rows and columns
print(frame1.T)
'''
        2001  2002  2000
Nevada   2.4   2.9   2.5
Ohio     1.7   3.6   1.5
'''
2 Essential functionality
2.1 Reindexing
1) reindex creates a new object whose data conforms to a new index
obj = pd.Series([4.5, .2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(obj)
'''
d    4.5
b    0.2
a   -5.3
c    3.6
dtype: float64
'''
# Rearrange the data according to the new index
obj2 = obj.reindex(['a', 'b', 'c', 'd'])
print(obj2)
'''
a   -5.3
b    0.2
c    3.6
d    4.5
dtype: float64
'''
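2) Reindexing can fill in values for labels that are new to the index. The method option controls this; 'ffill' forward-fills, carrying the last valid value forward.
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj3)
'''
0      blue
2    purple
4    yellow
dtype: object
'''
obj31 = obj3.reindex(range(6), method='ffill')
print(obj31)
'''
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
'''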
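For comparison, here is a short sketch of the same reindex call with no fill method and with 'bfill' (backward fill). The variable names `no_fill` and `back_filled` are introduced here for illustration; the data is the same as above.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Without a method, labels that are new to the index simply get NaN
no_fill = obj3.reindex(range(6))
print(no_fill)

# 'bfill' takes the NEXT valid value instead of the previous one.
# Label 5 has no later value to borrow from, so it stays NaN.
back_filled = obj3.reindex(range(6), method='bfill')
print(back_filled)
```

Both fill methods rely on the original index being sorted (monotonic), which holds for `[0, 2, 4]` here.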
3) For a DataFrame, reindex changes the row index by default; pass the columns keyword to reindex the columns
frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
print(frame)
'''
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
'''
# Reindex the rows; the new label 'b' is filled with NaN
frame1 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame1)
'''
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
'''
# Reindex the columns; the new column 'Utah' is filled with NaN
frame1 = frame.reindex(columns=['Texas', 'Utah', 'California'])
print(frame1)
'''
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
'''
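If NaN is not the desired placeholder for new labels, reindex also accepts a fill_value argument. The following is a minimal sketch that assumes 0 is an acceptable default for the missing row; `filled` is a name introduced here.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# The new row 'b' is filled with 0 instead of NaN
filled = frame.reindex(['a', 'b', 'c', 'd'], fill_value=0)
print(filled)
```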
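2.2 Dropping entries from an axis
Note: To drop one or more entries from an axis, all that is needed is an index array or a list of labels. The drop method returns a new object with the specified values removed from the given axis, and the original object is left unchanged. drop modifies the object in place (and returns None rather than a new object) only when inplace=True is passed.
1) First, on a Series object
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
print(obj)
'''
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
'''
new_obj = obj.drop('c')
print(new_obj)
'''
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
'''
print(obj.drop(['d', 'c']))
'''
a    0.0
b    1.0
e    4.0
dtype: float64
'''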
2) For a DataFrame, index values can be dropped from either axis
data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
# Calling drop with a sequence of labels removes those values from the row labels
print(data.drop(['Colorado', 'Ohio']))
'''
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
'''
# axis=1 and axis='columns' are equivalent, so both calls print the same result
print(data.drop(['two'], axis=1))
print(data.drop(['two'], axis='columns'))
'''
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
'''
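The difference between the default behaviour of drop and inplace=True can be checked directly. This is a small sketch using the same Series as in 1); the names `still_there` and `ret` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default: drop returns a new object and leaves obj untouched
new_obj = obj.drop('c')
still_there = 'c' in obj
print(still_there)

# With inplace=True, obj itself is modified and the call returns None
ret = obj.drop('c', inplace=True)
print(ret)
print(obj)
```

Because inplace=True destroys the dropped data, the default form that returns a new object is usually the safer choice.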
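2.3 Indexing, selection, and filtering
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj['b'])              # select by label
print(obj.iloc[1])           # select by integer position (plain obj[1] is deprecated in recent pandas)
print(obj.iloc[2:4])         # positional slice: start included, end excluded
print(obj[['b', 'a', 'c']])  # list of labels
print(obj.iloc[[1, 3]])      # list of positions
print(obj[obj < 2])          # boolean filter
'''
1.0
1.0
c    2.0
d    3.0
dtype: float64
b    1.0
a    0.0
c    2.0
dtype: float64
b    1.0
d    3.0
dtype: float64
a    0.0
b    1.0
dtype: float64
'''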
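One detail worth a quick check is that slicing by label differs from slicing by position: a label slice includes its end point. A minimal sketch with the same Series follows; `by_label` and `by_pos` are names introduced here.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: both end points 'b' and 'c' are included
by_label = obj['b':'c']
print(by_label)

# Positional slice: position 1 is included, position 2 is excluded
by_pos = obj.iloc[1:2]
print(by_pos)
```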
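2.4 Selection with loc and iloc
Note: loc selects by axis labels; iloc selects by integer position.
data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
# Both calls select the same two values from the 'Colorado' row and print the same result
print(data.loc['Colorado', ['two', 'three']])
print(data.iloc[1, [1, 2]])
'''
two      5
three    6
Name: Colorado, dtype: int32
'''
# The integer dtype shown (int32 here) depends on the platform and NumPy version; int64 is also common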
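Both indexers also accept slices and lists on either axis. The sketch below uses the same DataFrame; the names `col_by_label`, `col_by_pos`, and `block` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# loc with a label slice: rows up to AND including 'Utah', column 'two'
col_by_label = data.loc[:'Utah', 'two']
print(col_by_label)

# iloc with a positional slice: rows 0, 1, 2 (end excluded), column at position 1
col_by_pos = data.iloc[:3, 1]
print(col_by_pos)

# iloc with lists picks rows and columns in the order given
block = data.iloc[[1, 2], [3, 0, 1]]
print(block)
```

The first two selections return the same values, which shows how a label slice that includes its end point lines up with a positional slice that excludes it.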