Introduction
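Pandas is an open-source Python library for data analysis, and one of the most widely used. It handles spreadsheet-like (tabular) data and supports fast loading, manipulation, alignment, merging, and more. It adds these capabilities to Python through two data types: Series and DataFrame. A DataFrame represents a whole spreadsheet or rectangular table, while a Series is a single column of a DataFrame. A DataFrame can also be thought of as a dictionary or collection of Series.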
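To make the "DataFrame as a dictionary of Series" idea concrete, here is a minimal sketch. The two-row table is a hypothetical stand-in built from values that appear later in the gapminder output; the variable names are illustrative only.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative two-row sample)
year = pd.Series([1952, 2007], index=["Afghanistan", "Zimbabwe"])
life_exp = pd.Series([28.801, 43.487], index=["Afghanistan", "Zimbabwe"])

# A DataFrame built from a dict of Series: each key becomes a column name
df = pd.DataFrame({"year": year, "lifeExp": life_exp})

# Selecting one column hands back a Series again
col = df["lifeExp"]

print(type(df).__name__)
print(type(col).__name__)
```

Both Series are aligned on their index labels when the DataFrame is built, which is why they can be combined without any explicit join.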
Loading data
load.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: xurongzhong#126.com wechat:pythontesting qq:37391319
# QQ groups: 144081101 591302926 567351477
# CreateDate: 2018-06-07
# load.py
import pandas as pd
# gapminder.tsv is tab-separated, so pass sep='\t'
df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')
print("\n\nFirst five rows")
print(df.head())
print("\n\nType")
print(type(df))
print("\n\nShape (rows, columns)")
print(df.shape)
print("\n\nColumn names")
print(df.columns)
print("\n\ndtypes (per column)")
print(df.dtypes)
print("\n\nSummary info")
# info() prints its summary itself and returns None, so print() adds a trailing "None"
print(df.info())
Output
$ ./load.py
First five rows
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
Type
<class 'pandas.core.frame.DataFrame'>
Shape (rows, columns)
(1704, 6)
Column names
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
dtypes (per column)
country object
continent object
year int64
lifeExp float64
pop int64
gdpPercap float64
dtype: object
Summary info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country 1704 non-null object
continent 1704 non-null object
year 1704 non-null int64
lifeExp 1704 non-null float64
pop 1704 non-null int64
gdpPercap 1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
None
Pandas type | Python type |
---|---|
object | str |
int64 | int |
float64 | float |
datetime64 | datetime.datetime |
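The mapping in the table above can be checked on a small frame. This is a minimal sketch on a hypothetical two-row sample (not the full gapminder file); the `year_dt` column name is made up for illustration, and the dtype reported for text columns may differ between pandas versions.

```python
import pandas as pd

# Hypothetical two-row sample with the same column types as gapminder
df = pd.DataFrame({
    "country": ["Afghanistan", "Zimbabwe"],
    "year": [1952, 2007],
    "lifeExp": [28.801, 43.487],
})
print(df.dtypes)

# Build a datetime64 column from the integer year (January 1 of each year)
df["year_dt"] = pd.to_datetime(df["year"].astype(str), format="%Y")

# A single cell comes back as a NumPy scalar; .item() converts it
# to the plain Python type listed in the table
first_year = df.loc[0, "year"].item()
print(type(first_year).__name__)
```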
Rows, columns, and cells
col.py
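#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: xurongzhong#126.com wechat:pythontesting qq:37391319
# QQ groups: 144081101 591302926 567351477
# CreateDate: 2018-06-07
# col.py
import pandas as pd
df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')
# Column operations
country_df = df['country']  # select a single column by name (returns a Series)
print("\n\nFirst 5 rows of the column")
print(country_df.head())
print("\n\nLast 5 rows of the column")
print(country_df.tail())
# Dot access only works when the column name is a valid identifier and does not
# clash with a DataFrame attribute: df.pop is a method, so use df['pop'] for that column
country_df_dot = df.country
print("\n\nSelect a column with dot notation")
print(country_df_dot.head())
subset = df[['country', 'continent', 'year']]  # select several columns with a list of names
print("\n\nSelect multiple columns")
print(subset.head())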
Output
$ ./col.py
First 5 rows of the column
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
Name: country, dtype: object
Last 5 rows of the column
1699 Zimbabwe
1700 Zimbabwe
1701 Zimbabwe
1702 Zimbabwe
1703 Zimbabwe
Name: country, dtype: object
Select a column with dot notation
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
Name: country, dtype: object
Select multiple columns
country continent year
0 Afghanistan Asia 1952
1 Afghanistan Asia 1957
2 Afghanistan Asia 1962
3 Afghanistan Asia 1967
4 Afghanistan Asia 1972
row.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: xurongzhong#126.com wechat:pythontesting qq:37391319
# QQ groups: 144081101 591302926 567351477
# CreateDate: 2018-06-07
# row.py
import pandas as pd
df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')
# Row operations. loc selects by index label, iloc by integer position.
# Note: df.loc[-1] raises KeyError here because -1 is not a label in the index;
# use df.iloc[-1] to get the last row by position.
print("\n\nFirst row")
print(df.loc[0])
print("\n\nNumber of rows")
number_of_rows = df.shape[0]
print(number_of_rows)
last_row_index = number_of_rows - 1
print("\n\nLast row")
print(df.loc[last_row_index])
print("\n\nLast row via tail()")
print(df.tail(n=1))
subset_loc = df.loc[0]
subset_head = df.head(n=1)
print("\n\nloc with a single label returns a Series")
print(type(subset_loc))
print("\n\nhead returns a DataFrame")
print(type(subset_head))
print("\n\nloc selecting three rows returns a DataFrame")
print(df.loc[[0, 99, 999]])
print(type(df.loc[[0, 99, 999]]))
print("\n\niloc selecting the first row")
print(df.iloc[0])
print("\n\niloc selecting three rows")
print(df.iloc[[0, 99, 999]])
Output
$ ./row.py
First row
country Afghanistan
continent Asia
year 1952
lifeExp 28.801
pop 8425333
gdpPercap 779.445
Name: 0, dtype: object
Number of rows
1704
Last row
country Zimbabwe
continent Africa
year 2007
lifeExp 43.487
pop 12311143
gdpPercap 469.709
Name: 1703, dtype: object
Last row via tail()
country continent year lifeExp pop gdpPercap
1703 Zimbabwe Africa 2007 43.487 12311143 469.709298
loc with a single label returns a Series
<class 'pandas.core.series.Series'>
head returns a DataFrame
<class 'pandas.core.frame.DataFrame'>
loc selecting three rows returns a DataFrame
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
99 Bangladesh Asia 1967 43.453 62821884 721.186086
999 Mongolia Asia 1967 51.253 1149500 1226.041130
<class 'pandas.core.frame.DataFrame'>
iloc selecting the first row
country Afghanistan
continent Asia
year 1952
lifeExp 28.801
pop 8425333
gdpPercap 779.445
Name: 0, dtype: object
iloc selecting three rows
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
99 Bangladesh Asia 1967 43.453 62821884 721.186086
999 Mongolia Asia 1967 51.253 1149500 1226.041130
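In row.py, loc and iloc return the same rows because the default index labels happen to equal the row positions. The sketch below uses a hypothetical three-row frame whose index labels are 0, 99 and 999 to show where the two differ, and why -1 works with iloc but not with loc.

```python
import pandas as pd

# Hypothetical three-row frame; the index labels no longer match the positions
df = pd.DataFrame(
    {"country": ["Afghanistan", "Bangladesh", "Mongolia"],
     "year": [1952, 1967, 1967]},
    index=[0, 99, 999],
)

by_label = df.loc[999]          # the row whose index label is 999
by_position = df.iloc[2]        # the third row, counting from 0
last_by_position = df.iloc[-1]  # negative positions count from the end

# loc looks up labels, and -1 is not one of them
try:
    df.loc[-1]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

print(by_label["country"], last_by_position["country"], missing_label_raises)
```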
mix.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: xurongzhong#126.com wechat:pythontesting qq:37391319
# QQ groups: 144081101 591302926 567351477
# CreateDate: 2018-06-07
# mix.py
import pandas as pd
df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')
# Selecting rows and columns together: [row selector, column selector]
print("\n\nSelect a cell with loc")
print(df.loc[42, 'country'])      # row label 42, column name 'country'
print("\n\nSelect a cell with iloc")
print(df.iloc[42, 0])             # row position 42, column position 0
print("\n\nSelect a subset with loc")
print(df.loc[[0, 99, 999], ['country', 'lifeExp', 'gdpPercap']])
Output
$ ./mix.py
Select a cell with loc
Angola
Select a cell with iloc
Angola
Select a subset with loc
country lifeExp gdpPercap
0 Afghanistan 28.801 779.445314
99 Bangladesh 43.453 721.186086
999 Mongolia 51.253 1226.041130
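Beyond single cells and explicit lists, loc and iloc also accept slices and boolean masks. The following is a minimal sketch on a hypothetical five-row sample (the Afghanistan rows shown in the load.py output, reduced to three columns). One point that is easy to miss: a loc slice includes its end label, while an iloc slice excludes its end position.

```python
import pandas as pd

# Hypothetical five-row sample taken from the first rows of the dataset
df = pd.DataFrame({
    "country": ["Afghanistan"] * 5,
    "year": [1952, 1957, 1962, 1967, 1972],
    "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
})

# Label slice: rows labelled 0, 1 and 2 (end label included)
by_label = df.loc[0:2, ["year", "lifeExp"]]

# Position slice: rows 0 and 1 only, columns at positions 1 and 2
by_position = df.iloc[0:2, 1:3]

# Boolean mask as the row selector, column name as the column selector
recent = df.loc[df["year"] >= 1967, "lifeExp"]

print(len(by_label), len(by_position), len(recent))
```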
Grouping and aggregation
group.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: xurongzhong#126.com wechat:pythontesting qq:37391319
# QQ groups: 144081101 591302926 567351477
# CreateDate: 2018-06-07
# group.py
import pandas as pd
df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')
print("\n\nMean life expectancy by year")
print(df.groupby('year')['lifeExp'].mean())
# The same calculation, one step at a time
print("\n\nGroup by year")
grouped_year_df = df.groupby('year')
print(type(grouped_year_df))
print(grouped_year_df)
print("\n\nlifeExp")
grouped_year_df_lifeExp = grouped_year_df['lifeExp']
print(type(grouped_year_df_lifeExp))
print(grouped_year_df_lifeExp)
print("\n\nMean life expectancy by year (computed step by step)")
mean_lifeExp_by_year = grouped_year_df_lifeExp.mean()
print(mean_lifeExp_by_year)
print("\n\nGroup by year and continent")
print(df.groupby(['year', 'continent'])[['lifeExp',
                                         'gdpPercap']].mean())
print("\n\nNumber of countries per continent")
print(df.groupby('continent')['country'].nunique())
Output
$ ./group.py
Mean life expectancy by year
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
Name: lifeExp, dtype: float64
Group by year
<class 'pandas.core.groupby.groupby.DataFrameGroupBy'>
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f0e2b0c89e8>
lifeExp
<class 'pandas.core.groupby.groupby.SeriesGroupBy'>
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x7f0e151e2f28>
Mean life expectancy by year (computed step by step)
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
Name: lifeExp, dtype: float64
Group by year and continent
lifeExp gdpPercap
year continent
1952 Africa 39.135500 1252.572466
Americas 53.279840 4079.062552
Asia 46.314394 5195.484004
Europe 64.408500 5661.057435
Oceania 69.255000 10298.085650
1957 Africa 41.266346 1385.236062
Americas 55.960280 4616.043733
Asia 49.318544 5787.732940
Europe 66.703067 6963.012816
Oceania 70.295000 11598.522455
1962 Africa 43.319442 1598.078825
Americas 58.398760 4901.541870
Asia 51.563223 5729.369625
Europe 68.539233 8365.486814
Oceania 71.085000 12696.452430
1967 Africa 45.334538 2050.363801
Americas 60.410920 5668.253496
Asia 54.663640 5971.173374
Europe 69.737600 10143.823757
Oceania 71.310000 14495.021790
1972 Africa 47.450942 2339.615674
Americas 62.394920 6491.334139
Asia 57.319269 8187.468699
Europe 70.775033 12479.575246
Oceania 71.910000 16417.333380
1977 Africa 49.580423 2585.938508
Americas 64.391560 7352.007126
Asia 59.610556 7791.314020
Europe 71.937767 14283.979110
Oceania 72.855000 17283.957605
1982 Africa 51.592865 2481.592960
Americas 66.228840 7506.737088
Asia 62.617939 7434.135157
Europe 72.806400 15617.896551
Oceania 74.290000 18554.709840
1987 Africa 53.344788 2282.668991
Americas 68.090720 7793.400261
Asia 64.851182 7608.226508
Europe 73.642167 17214.310727
Oceania 75.320000 20448.040160
1992 Africa 53.629577 2281.810333
Americas 69.568360 8044.934406
Asia 66.537212 8639.690248
Europe 74.440100 17061.568084
Oceania 76.945000 20894.045885
1997 Africa 53.598269 2378.759555
Americas 71.150480 8889.300863
Asia 68.020515 9834.093295
Europe 75.505167 19076.781802
Oceania 78.190000 24024.175170
2002 Africa 53.325231 2599.385159
Americas 72.422040 9287.677107
Asia 69.233879 10174.090397
Europe 76.700600 21711.732422
Oceania 79.740000 26938.778040
2007 Africa 54.806038 3089.032605
Americas 73.608120 11003.031625
Asia 70.728485 12473.026870
Europe 77.648600 25054.481636
Oceania 80.719500 29810.188275
Number of countries per continent
continent
Africa 52
Americas 25
Asia 33
Europe 30
Oceania 2
Name: country, dtype: int64
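group.py applies one aggregation at a time. As a possible extension, the sketch below computes several statistics in one pass with `agg` and turns the group keys back into an ordinary column with `reset_index`. The four-row frame and its lifeExp values are made up for illustration and are not taken from the gapminder file.

```python
import pandas as pd

# Made-up sample: two continents, two rows each
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Africa", "Africa"],
    "country": ["Afghanistan", "Bangladesh", "Zimbabwe", "Zimbabwe"],
    "lifeExp": [28.0, 44.0, 40.0, 50.0],
})

# Several aggregations at once: one output column per function name
summary = df.groupby("continent")["lifeExp"].agg(["mean", "min", "max"])
print(summary)

# reset_index() moves the group key from the index back into a regular column
flat = df.groupby("continent")["lifeExp"].mean().reset_index()
print(flat)
```

The flattened form is often more convenient when the result is going to be merged with another table or written back to a file.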
Basic plotting
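import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')
# Mean life expectancy across all countries for each year
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
print(global_yearly_life_expectancy)
# Series.plot() draws a line chart via matplotlib: year on the x-axis, mean lifeExp on the y-axis
global_yearly_life_expectancy.plot()
plt.show()  # needed to display the figure when this runs as a script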