Pandas Tutorial [National-Treasure-Level Tutorial, an 18,000-Character Summary] (Part 2) - Alibaba Cloud Developer Community

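Creating from a DataFrame object

Use from_frame() to create a hierarchical index from the columns of a DataFrame, as in the following example:

# First, create a DataFrame.
import numpy as np
import pandas as pd
df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
                   ['foo', 'one'], ['foo', 'two']],
                  columns=['first', 'second'])
# Then build the hierarchical index with from_frame().
index = pd.MultiIndex.from_frame(df)
# Apply the index to a Series.
s = pd.Series(np.random.randn(4), index=index)
print(s)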

Output (the values are random and differ on every run):

first  second
bar    one       1.151928
       two      -0.694435
foo    one      -1.701611
       two      -0.486157
dtype: float64
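As a small sketch of how from_frame() treats column names: the level names come from the DataFrame's columns, and the optional names argument overrides them. The names 'outer' and 'inner' below are made up for illustration and do not appear in the original example.

```python
import pandas as pd

df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
                   ['foo', 'one'], ['foo', 'two']],
                  columns=['first', 'second'])

# Level names default to the column names of the frame.
index = pd.MultiIndex.from_frame(df)
print(list(index.names))    # ['first', 'second']

# The names argument replaces them (hypothetical names, for illustration only).
renamed = pd.MultiIndex.from_frame(df, names=['outer', 'inner'])
print(list(renamed.names))  # ['outer', 'inner']

# to_frame(index=False) turns the index levels back into ordinary columns.
round_trip = index.to_frame(index=False)
print(round_trip.columns.tolist())  # ['first', 'second']
```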

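Creating from a Cartesian product

The Cartesian product (also called the direct product) is a mathematical operation that pairs every element of one set with every element of another. The example below uses from_product() to build a hierarchical index from a Cartesian product.

import numpy as np
import pandas as pd
# Build the input data
numbers = [0, 1, 2]
language = ['Python', 'Java']
# The Cartesian product yields 6 combinations
index = pd.MultiIndex.from_product([numbers, language], names=['number', 'language'])
# Apply the hierarchical index to a Series
dk_er = pd.Series(np.random.randn(6), index=index)
print(dk_er)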

Output (the values are random and differ on every run):

number  language
0       Python     -0.319739
        Java        1.599170
1       Python     -0.010520
        Java        0.262068
2       Python     -0.124177
        Java        0.315120
dtype: float64
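To make the "every element paired with every element" idea concrete, the following sketch compares the entries of the index with itertools.product from the standard library. It reuses the lists from the example above and adds no random data.

```python
from itertools import product

import pandas as pd

numbers = [0, 1, 2]
language = ['Python', 'Java']

index = pd.MultiIndex.from_product([numbers, language],
                                   names=['number', 'language'])

# Every number is paired with every language: 3 * 2 = 6 entries.
print(len(index))  # 6

# The entries match itertools.product, in the same order.
expected = list(product(numbers, language))
print(list(index) == expected)  # True
```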

Creating a hierarchical index from arrays

The from_arrays() method can also create a hierarchical index. For example:

import pandas as pd
index = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
index

Output:

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
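from_arrays() expects one list per level, whereas from_tuples() expects one tuple per row. The sketch below shows that the two describe the same index; the variable names by_arrays and by_tuples are illustrative.

```python
import pandas as pd

arrays = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]]

# from_arrays takes one list per level ...
by_arrays = pd.MultiIndex.from_arrays(arrays)

# ... while from_tuples takes one tuple per row; zip converts between the two.
by_tuples = pd.MultiIndex.from_tuples(list(zip(*arrays)))

print(by_arrays.equals(by_tuples))  # True
print(by_arrays.nlevels)            # 2
```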

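Applying a hierarchical index

The following example shows how to apply a hierarchical index to a DataFrame.

import pandas as pd
import numpy as np
# Create the arrays, one per level
arrays = [[0, 0, 1, 1], ['A', 'B', 'A', 'B']]
# Build the index from the arrays
index = pd.MultiIndex.from_arrays(arrays, names=('number', 'letter'))
print(index)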

Output:

MultiIndex([(0, 'A'),
            (0, 'B'),
            (1, 'A'),
            (1, 'B')],
           names=['number', 'letter'])

In the example above, the first level is number, which has the two elements 0 and 1, and the second level is letter, which has the two letters A and B.
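The levels described here can also be inspected programmatically. A minimal sketch using the nlevels, names and levels attributes and the get_level_values() method of MultiIndex:

```python
import pandas as pd

arrays = [[0, 0, 1, 1], ['A', 'B', 'A', 'B']]
index = pd.MultiIndex.from_arrays(arrays, names=('number', 'letter'))

# Number of levels and their names.
print(index.nlevels)      # 2
print(list(index.names))  # ['number', 'letter']

# levels holds the unique labels of each level.
print(index.levels[0].tolist())  # [0, 1]
print(index.levels[1].tolist())  # ['A', 'B']

# get_level_values returns the label of one level for every row.
print(index.get_level_values('letter').tolist())  # ['A', 'B', 'A', 'B']
```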

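Next, apply the hierarchical index created above to a DataFrame, as shown below:

import pandas as pd
import numpy as np
# Create the arrays, one per level
arrays = [[0, 0, 1, 1], ['A', 'B', 'A', 'B']]
index = pd.MultiIndex.from_arrays(arrays, names=('number', 'letter'))
# Use the hierarchical index as the row index;
# the scalar values are repeated for every row of the index
df = pd.DataFrame({'a': 11, 'b': 22}, index=index)
print(df)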

Output:

                a   b
number letter
0      A       11  22
       B       11  22
1      A       11  22
       B       11  22

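set_index() turns one or more existing DataFrame columns into the row index, as in the following example:

import pandas as pd
df = pd.DataFrame({'a': range(5), 'b': range(5, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1]})
print(df)
# Use columns a and d as the row index, keeping them as columns too
df1 = df.set_index(['a', 'd'], drop=False)
print(df1)
# Same, but append the new levels to the existing index
df2 = df.set_index(['a', 'd'], drop=False, append=True)
print(df2)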

Output:

Before conversion:

   a  b    c  d
0  0  5  one  0
1  1  4  one  1
2  2  3  one  2
3  3  2  two  0
4  4  1  two  1

After conversion:

     a  b    c  d
a d
0 0  0  5  one  0
1 1  1  4  one  1
2 2  2  3  one  2
3 0  3  2  two  0
4 1  4  1  two  1

With the append parameter:

       a  b    c  d
  a d
0 0 0  0  5  one  0
1 1 1  1  4  one  1
2 2 2  2  3  one  2
3 3 0  3  2  two  0
4 4 1  4  1  two  1

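Here set_index() converts the columns a and d into a hierarchical row index. drop=False keeps columns a and d in the DataFrame while the index is updated. The append=True parameter keeps the existing index (the default integers 0 to 4) as the outermost level and appends the new levels to it, instead of replacing it.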
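The effect of drop and append can be checked directly on the column list and the number of index levels. The sketch below also uses reset_index(), which is not part of the original example, to move the index levels back into columns.

```python
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1]})

# Default drop=True: a and d move into the index and leave the columns.
dropped = df.set_index(['a', 'd'])
print(dropped.columns.tolist())  # ['b', 'c']

# drop=False keeps a and d as columns as well.
kept = df.set_index(['a', 'd'], drop=False)
print(kept.columns.tolist())     # ['a', 'b', 'c', 'd']

# append=True keeps the existing integer index as an extra outer level.
appended = df.set_index(['a', 'd'], append=True)
print(appended.index.nlevels)      # 3
print(list(appended.index.names))  # [None, 'a', 'd']

# reset_index moves the index levels back into columns.
restored = dropped.reset_index()
print(sorted(restored.columns))  # ['a', 'b', 'c', 'd']
```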

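Slicing and selecting with a hierarchical index

This section covers slicing and selecting values with a hierarchical index. For example:

Hierarchical row index operations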

import pandas as pd
# Build the multi-level index (named tuples to avoid shadowing the built-in tuple)
tuples = [('Lakers', 2008), ('Pacers', 2008),
          ('Lakers', 2007), ('Celtics', 2007),
          ('Nets', 2007), ('Heat', 2008)]
salary = [10000, 20000, 11000, 30000, 19000, 22000]
# Then apply it to a Series
index = pd.MultiIndex.from_tuples(tuples)
s = pd.Series(salary, index=index)
print(s)
# Select by slicing
print(s['Lakers', 2007])
print(s['Lakers'])
print(s[:, 2008])
# Filter by comparing values
print(s[s <= 20000])

Output:

Lakers   2008    10000
Pacers   2008    20000
Lakers   2007    11000
Celtics  2007    30000
Nets     2007    19000
Heat     2008    22000
dtype: int64

Lakers salary in 2007:

11000

Lakers salaries:

2008    10000
2007    11000
dtype: int64

Salaries of all teams in 2008:

Lakers    10000
Pacers    20000
Heat      22000
dtype: int64

Teams and years with a salary of at most 20000:

Lakers  2008    10000
Pacers  2008    20000
Lakers  2007    11000
Nets    2007    19000
dtype: int64
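The same selections can be written with .loc and with a boolean mask on a single level. The sketch below assumes English team names and adds the level names 'team' and 'year', which are not in the original example, so that levels can be referred to by name.

```python
import pandas as pd

tuples = [('Lakers', 2008), ('Pacers', 2008),
          ('Lakers', 2007), ('Celtics', 2007),
          ('Nets', 2007), ('Heat', 2008)]
salary = [10000, 20000, 11000, 30000, 19000, 22000]
# The level names 'team' and 'year' are assumptions added for readability.
index = pd.MultiIndex.from_tuples(tuples, names=['team', 'year'])
s = pd.Series(salary, index=index)

# Sorting the index first keeps label-based lookups efficient.
s = s.sort_index()

# A full key is passed to .loc as a tuple.
print(s.loc[('Lakers', 2007)])  # 11000

# A boolean mask on one level selects every 2008 row.
in_2008 = s[s.index.get_level_values('year') == 2008]
print(in_2008.index.get_level_values('team').tolist())
# ['Heat', 'Lakers', 'Pacers']
```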

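Multi-level row and column index operations

Now consider a more complex case: when both the rows and the columns have multi-level indexes, how do you select values by slicing? For example:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(1, 13).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Jack', 'Jack', 'Helen'],
                           ['Python', 'Java', 'Python']])
# Select labels from the same level with a list; take care not to write
# df['Jack','Helen'], which is read as one key spanning both levels
print(df[['Jack', 'Helen']])
# Select one label from each level
print(df['Jack', 'Python'])
# iloc: integer position indexing
print(df.iloc[:3, :2])
# loc: column label indexing
print(df.loc[:, ('Helen', 'Python')])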

Output:

      Jack      Helen
    Python Java Python
a 1      1    2      3
  2      4    5      6
b 1      7    8      9
  2     10   11     12

a  1     1
   2     4
b  1     7
   2    10
Name: (Jack, Python), dtype: int32

      Jack
    Python Java
a 1      1    2
  2      4    5
b 1      7    8

a  1     3
   2     6
b  1     9
   2    12
Name: (Helen, Python), dtype: int32
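When the label of interest sits on an inner level, for example every 'Python' column regardless of the person, xs() is one option. This is a sketch that goes slightly beyond the original example; the variable names python_scores and second_rows are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Jack', 'Jack', 'Helen'],
                           ['Python', 'Java', 'Python']])

# Take every 'Python' column from the second column level (level=1).
python_scores = df.xs('Python', axis=1, level=1)
print(python_scores.columns.tolist())  # ['Jack', 'Helen']
print(python_scores.values.tolist())
# [[1, 3], [4, 6], [7, 9], [10, 12]]

# The same idea on the rows: every row whose second level equals 2.
second_rows = df.xs(2, level=1)
print(second_rows.index.tolist())      # ['a', 'b']
```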

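Applying aggregation functions

By specifying a level, you can choose which index level an aggregation such as a sum or a mean is computed over. Older pandas versions accepted a level argument directly in sum() and mean(); that argument was removed in pandas 2.0, so the example below groups by level instead:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(1, 13).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Jack', 'Jack', 'Helen'],
                           ['Python', 'Java', 'Python']])
# Step 1: name the row and column levels
df.index.names = ['key1', 'key2']
df.columns.names = ['name', 'course']
# Sum over the rows that share a key2 label
print(df.groupby(level='key2').sum())
# Mean over the columns that share a course label
# (transpose, group by the level, transpose back)
print(df.T.groupby(level='course', sort=False).mean().T)

Output:

# Sum of the values for labels 1 and 2 of the key2 level

name     Jack      Helen
course Python Java Python
key2
1           8   10     12
2          14   16     18

# Mean along the horizontal direction (across columns)

course     Python  Java
key1 key2
a    1        2.0   2.0
     2        5.0   5.0
b    1        8.0   8.0
     2       11.0  11.0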
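Level-wise aggregation can also be expressed with groupby(level=...), which is one approach that works in recent pandas versions. The sketch below checks the numbers against the table above; transposing before grouping the column level is a design choice made here, not something the original prescribes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Jack', 'Jack', 'Helen'],
                           ['Python', 'Java', 'Python']])
df.index.names = ['key1', 'key2']
df.columns.names = ['name', 'course']

# Sum the rows that share a key2 label.
by_key2 = df.groupby(level='key2').sum()
print(by_key2.values.tolist())  # [[8, 10, 12], [14, 16, 18]]

# Average the columns that share a course label; transposing turns the
# column levels into row levels so the same groupby call can be used.
by_course = df.T.groupby(level='course').mean().T
print(by_course['Python'].tolist())  # [2.0, 5.0, 8.0, 11.0]
print(by_course['Java'].tolist())    # [2.0, 5.0, 8.0, 11.0]
```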

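In data analysis, most of our time goes into preparing and preprocessing data. Pandas is a flexible and efficient preprocessing tool that provides many data-handling methods, and hierarchical indexing (MultiIndex) is one of them. Hierarchical indexing (also called multi-level indexing) is a core Pandas feature that strengthens its data preprocessing capabilities.

For a Series, passing a two-dimensional array (a list of two lists) as the index argument creates a MultiIndex with two levels, as shown below:

import pandas as pd
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
                 index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                        ['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])
print(info)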

Output:

x  obj1    11
   obj2    14
   obj3    17
   obj4    24
y  obj1    19
   obj2    32
   obj3    34
   obj4    27
dtype: int64

The example above creates an index with two levels, (x, y) and (obj1, …, obj4). You can inspect it through the index attribute:

info.index

Output:

MultiIndex([('x', 'obj1'),
            ('x', 'obj2'),
            ('x', 'obj3'),
            ('x', 'obj4'),
            ('y', 'obj1'),
            ('y', 'obj2'),
            ('y', 'obj3'),
            ('y', 'obj4')],
           )

You can also select data by the inner index level (the 'obj' labels), as shown below:

info[:, 'obj2']

Output:

x    14
y    32
dtype: int64
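Selecting on the inner level can also be written with xs() or by swapping the levels first. Both calls below are alternatives sketched for comparison; neither appears in the original text.

```python
import pandas as pd

info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
                 index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                        ['obj1', 'obj2', 'obj3', 'obj4',
                         'obj1', 'obj2', 'obj3', 'obj4']])

# xs picks one label from a given level and drops that level from the result.
obj2 = info.xs('obj2', level=1)
print(obj2.to_dict())  # {'x': 14, 'y': 32}

# swaplevel makes the inner level the outer one, so plain indexing works too.
swapped = info.swaplevel()
print(swapped['obj2'].to_dict())  # {'x': 14, 'y': 32}
```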

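Partial indexing

Partial indexing can be understood as selecting the data under a particular label of one level of a hierarchical index. For example, the following selects all data under the 'y' label:

import pandas as pd
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
                 index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                        ['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])
info['y']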

Output:

obj1    19
obj2    32
obj3    34
obj4    27
dtype: int64

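You can of course select data by the inner level as well.

Converting a row index level to a column index

unstack() turns a row index level into a column index, which is similar to a transpose. With unstack(), a Series (one-dimensional) can be converted into a DataFrame (two-dimensional). For example:

import pandas as pd
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
                 index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                        ['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])
# The outermost row index level holds the labels x and y
# 0 refers to the first (outermost) level, 1 to the second
print(info.unstack(0))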

Output:

       x   y
obj1  11  19
obj2  14  32
obj3  17  34
obj4  24  27

As the example shows, unstack(0) uses the first index level as the columns, while unstack(1) uses the second, as shown below:

import pandas as pd
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
                 index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                        ['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])
print(info.unstack(1))

Output:

   obj1  obj2  obj3  obj4
x    11    14    17    24
y    19    32    34    27
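Two details are worth a quick sketch: calling unstack() with no argument moves the innermost level, and stack() reverses the operation. The use of stack() here is an addition for illustration and is not covered in the original text; some pandas 2.x releases may emit a FutureWarning about its implementation.

```python
import pandas as pd

info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
                 index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                        ['obj1', 'obj2', 'obj3', 'obj4',
                         'obj1', 'obj2', 'obj3', 'obj4']])

# With no argument, unstack() moves the innermost level (same as unstack(1) here).
wide = info.unstack()
print(wide.shape)             # (2, 4)
print(wide.columns.tolist())  # ['obj1', 'obj2', 'obj3', 'obj4']

# stack() is the inverse: the columns become the innermost row level again.
restored = wide.stack()
print(restored.tolist() == info.tolist())  # True
```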
