文章很长,高低要忍一下,如果忍不了,那就收藏吧,总会用到的
萝卜哥也贴心的做成了PDF,在文末获取!
- 如何使用列表和字典创建 Series
- 使用列表创建 Series
- 使用 name 参数创建 Series
- 使用简写的列表创建 Series
- 使用字典创建 Series
- 如何使用 Numpy 函数创建 Series
- 如何获取 Series 的索引和值
- 如何在创建 Series 时指定索引
- 如何获取 Series 的大小和形状
- 如何获取 Series 开始或末尾几行数据
- Head()
- Tail()
- Take()
- 使用切片获取 Series 子集
- 如何创建 DataFrame
- 如何设置 DataFrame 的索引和列信息
- 如何重命名 DataFrame 的列名称
- 如何根据 Pandas 列中的值从 DataFrame 中选择或过滤行
- 在 DataFrame 中使用“isin”过滤多行
- 迭代 DataFrame 的行和列
- 如何通过名称或索引删除 DataFrame 的列
- 向 DataFrame 中新增列
- 如何从 DataFrame 中获取列标题列表
- 如何随机生成 DataFrame
- 如何选择 DataFrame 的多个列
- 如何将字典转换为 DataFrame
- 使用 ioc 进行切片
- 检查 DataFrame 中是否是空的
- 在创建 DataFrame 时指定索引和列名称
- 使用 iloc 进行切片
- iloc 和 loc 的区别
- 使用时间索引创建空 DataFrame
- 如何改变 DataFrame 列的排序
- 检查 DataFrame 列的数据类型
- 更改 DataFrame 指定列的数据类型
- 如何将列的数据类型转换为 DateTime 类型
- 将 DataFrame 列从 floats 转为 ints
- 如何把 dates 列转换为 DateTime 类型
- 两个 DataFrame 相加
- 在 DataFrame 末尾添加额外的行
- 为指定索引添加新行
- 如何使用 for 循环添加行
- 在 DataFrame 顶部添加一行
- 如何向 DataFrame 中动态添加行
- 在任意位置插入行
- 使用时间戳索引向 DataFrame 中添加行
- 为不同的行填充缺失值
- append, concat 和 combine_first 示例
- 获取行和列的平均值
- 计算行和列的总和
- 连接两列
- 过滤包含某字符串的行
- 过滤索引中包含某字符串的行
- 使用 AND 运算符过滤包含特定字符串值的行
- 查找包含某字符串的所有行
- 如果行中的值包含字符串,则创建与字符串相等的另一列
- 计算 pandas group 中每组的行数
- 检查字符串是否在 DataFrme 中
- 从 DataFrame 列中获取唯一行值
- 计算 DataFrame 列的不同值
- 删除具有重复索引的行
- 删除某些列具有重复值的行
- 从 DataFrame 单元格中获取值
- 使用 DataFrame 中的条件索引获取单元格上的标量值
- 设置 DataFrame 的特定单元格值
- 从 DataFrame 行获取单元格值
- 用字典替换 DataFrame 列中的值
- 统计基于某一列的一列的数值
- 处理 DataFrame 中的缺失值
- 删除包含任何缺失数据的行
- 删除 DataFrame 中缺失数据的列
- 按降序对索引值进行排序
- 按降序对列进行排序
- 使用 rank 方法查找 DataFrame 中元素的排名
- 在多列上设置索引
- 确定 DataFrame 的周期索引和列
- 导入 CSV 指定特定索引
- 将 DataFrame 写入 csv
- 使用 Pandas 读取 csv 文件的特定列
- Pandas 获取 CSV 列的列表
- 找到列值最大的行
- 使用查询方法进行复杂条件选择
- 检查 Pandas 中是否存在列
- 为特定列从 DataFrame 中查找 n-smallest 和 n-largest 值
- 从 DataFrame 中查找所有列的最小值和最大值
- 在 DataFrame 中找到最小值和最大值所在的索引位置
- 计算 DataFrame Columns 的累积乘积和累积总和
- 汇总统计
- 查找 DataFrame 的均值、中值和众数
- 测量 DataFrame 列的方差和标准偏差
- 计算 DataFrame 列之间的协方差
- 计算 Pandas 中两个 DataFrame 对象之间的相关性
- 计算 DataFrame 列的每个单元格的百分比变化
- 在 Pandas 中向前和向后填充 DataFrame 列的缺失值
- 在 Pandas 中使用非分层索引使用 Stacking
- 使用分层索引对 Pandas 进行拆分
- Pandas 获取 HTML 页面上 table 数据
1如何使用列表和字典创建 Series
使用列表创建 Series
import pandas as pd ser1 = pd.Series([1.5, 2.5, 3, 4.5, 5.0, 6]) print(ser1)
Output:
0 1.5 1 2.5 2 3.0 3 4.5 4 5.0 5 6.0 dtype: float64
使用 name 参数创建 Series
import pandas as pd ser2 = pd.Series(["India", "Canada", "Germany"], name="Countries") print(ser2)
Output:
0 India 1 Canada 2 Germany Name: Countries, dtype: object
使用简写的列表创建 Series
import pandas as pd ser3 = pd.Series(["A"]*4) print(ser3)
Output:
0 A 1 A 2 A 3 A dtype: object
使用字典创建 Series
import pandas as pd ser4 = pd.Series({"India": "New Delhi", "Japan": "Tokyo", "UK": "London"}) print(ser4)
Output:
India New Delhi Japan Tokyo UK London dtype: object
2如何使用 Numpy 函数创建 Series
import pandas as pd import numpy as np ser1 = pd.Series(np.linspace(1, 10, 5)) print(ser1) ser2 = pd.Series(np.random.normal(size=5)) print(ser2)
Output:
0 1.00 1 3.25 2 5.50 3 7.75 4 10.00 dtype: float64 0 -1.694452 1 -1.570006 2 1.713794 3 0.338292 4 0.803511 dtype: float64
3如何获取 Series 的索引和值
import pandas as pd import numpy as np ser1 = pd.Series({"India": "New Delhi", "Japan": "Tokyo", "UK": "London"}) print(ser1.values) print(ser1.index) print("\n") ser2 = pd.Series(np.random.normal(size=5)) print(ser2.index) print(ser2.values)
Output:
['New Delhi' 'Tokyo' 'London'] Index(['India', 'Japan', 'UK'], dtype='object') RangeIndex(start=0, stop=5, step=1) [ 0.66265478 -0.72222211 0.3608642 1.40955436 1.3096732 ]
4如何在创建 Series 时指定索引
import pandas as pd values = ["India", "Canada", "Australia", "Japan", "Germany", "France"] code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"] ser1 = pd.Series(values, index=code) print(ser1)
Output:
IND India CAN Canada AUS Australia JAP Japan GER Germany FRA France dtype: object
5如何获取 Series 的大小和形状
import pandas as pd values = ["India", "Canada", "Australia", "Japan", "Germany", "France"] code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"] ser1 = pd.Series(values, index=code) print(len(ser1)) print(ser1.shape) print(ser1.size)
Output:
6 (6,) 6
6如何获取 Series 开始或末尾几行数据
Head()
import pandas as pd values = ["India", "Canada", "Australia", "Japan", "Germany", "France"] code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"] ser1 = pd.Series(values, index=code) print("-----Head()-----") print(ser1.head()) print("\n\n-----Head(2)-----") print(ser1.head(2))
Output:
-----Head()----- IND India CAN Canada AUS Australia JAP Japan GER Germany dtype: object -----Head(2)----- IND India CAN Canada dtype: object
Tail()
import pandas as pd values = ["India", "Canada", "Australia", "Japan", "Germany", "France"] code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"] ser1 = pd.Series(values, index=code) print("-----Tail()-----") print(ser1.tail()) print("\n\n-----Tail(2)-----") print(ser1.tail(2))
Output:
-----Tail()----- CAN Canada AUS Australia JAP Japan GER Germany FRA France dtype: object -----Tail(2)----- GER Germany FRA France dtype: object
Take()
import pandas as pd values = ["India", "Canada", "Australia", "Japan", "Germany", "France"] code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"] ser1 = pd.Series(values, index=code) print("-----Take()-----") print(ser1.take([2, 4, 5]))
Output:
-----Take()----- AUS Australia GER Germany FRA France dtype: object
7使用切片获取 Series 子集
import pandas as pd num = [000, 100, 200, 300, 400, 500, 600, 700, 800, 900] idx = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'] series = pd.Series(num, index=idx) print("\n [2:2] \n") print(series[2:4]) print("\n [1:6:2] \n") print(series[1:6:2]) print("\n [:6] \n") print(series[:6]) print("\n [4:] \n") print(series[4:]) print("\n [:4:2] \n") print(series[:4:2]) print("\n [4::2] \n") print(series[4::2]) print("\n [::-1] \n") print(series[::-1])
Output
[2:2] C 200 D 300 dtype: int64 [1:6:2] B 100 D 300 F 500 dtype: int64 [:6] A 0 B 100 C 200 D 300 E 400 F 500 dtype: int64 [4:] E 400 F 500 G 600 H 700 I 800 J 900 dtype: int64 [:4:2] A 0 C 200 dtype: int64 [4::2] E 400 G 600 I 800 dtype: int64 [::-1] J 900 I 800 H 700 G 600 F 500 E 400 D 300 C 200 B 100 A 0 dtype: int64
8如何创建 DataFrame
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp00'], 'Name': ['John Doe', 'William Spark'], 'Occupation': ['Chemist', 'Statistician'], 'Date Of Join': ['2018-01-25', '2018-01-26'], 'Age': [23, 24]}) print(employees)
Output:
Age Date Of Join EmpCode Name Occupation 0 23 2018-01-25 Emp001 John Doe Chemist 1 24 2018-01-26 Emp00 William Spark Statistician
9如何设置 DataFrame 的索引和列信息
import pandas as pd employees = pd.DataFrame( data={'Name': ['John Doe', 'William Spark'], 'Occupation': ['Chemist', 'Statistician'], 'Date Of Join': ['2018-01-25', '2018-01-26'], 'Age': [23, 24]}, index=['Emp001', 'Emp002'], columns=['Name', 'Occupation', 'Date Of Join', 'Age']) print(employees)
Output
Name Occupation Date Of Join Age Emp001 John Doe Chemist 2018-01-25 23 Emp002 William Spark Statistician 2018-01-26 24
10如何重命名 DataFrame 的列名称
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp00'], 'Name': ['John Doe', 'William Spark'], 'Occupation': ['Chemist', 'Statistician'], 'Date Of Join': ['2018-01-25', '2018-01-26'], 'Age': [23, 24]}) employees.columns = ['EmpCode', 'EmpName', 'EmpOccupation', 'EmpDOJ', 'EmpAge'] print(employees)
Output:
EmpCode EmpName EmpOccupation EmpDOJ EmpAge 0 23 2018-01-25 Emp001 John Doe Chemist 1 24 2018-01-26 Emp00 William Spark Statistician
11如何根据 Pandas 列中的值从 DataFrame 中选择或过滤行
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'], 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'], 'Occupation': ['Chemist', 'Statistician', 'Statistician', 'Statistician', 'Programmer'], 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26', '2018-03-16'], 'Age': [23, 24, 34, 29, 40]}) print("\nUse == operator\n") print(employees.loc[employees['Age'] == 23]) print("\nUse < operator\n") print(employees.loc[employees['Age'] < 30]) print("\nUse != operator\n") print(employees.loc[employees['Occupation'] != 'Statistician']) print("\nMultiple Conditions\n") print(employees.loc[(employees['Occupation'] != 'Statistician') & (employees['Name'] == 'John')])
Output:
Use == operator Age Date Of Join EmpCode Name Occupation 0 23 2018-01-25 Emp001 John Chemist Use < operator Age Date Of Join EmpCode Name Occupation 0 23 2018-01-25 Emp001 John Chemist 1 24 2018-01-26 Emp002 Doe Statistician 3 29 2018-02-26 Emp004 Spark Statistician Use != operator Age Date Of Join EmpCode Name Occupation 0 23 2018-01-25 Emp001 John Chemist 4 40 2018-03-16 Emp005 Mark Programmer Multiple Conditions Age Date Of Join EmpCode Name Occupation 0 23 2018-01-25 Emp001 John Chemist
12在 DataFrame 中使用“isin”过滤多行
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'], 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'], 'Occupation': ['Chemist', 'Statistician', 'Statistician', 'Statistician', 'Programmer'], 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26', '2018-03-16'], 'Age': [23, 24, 34, 29, 40]}) print("\nUse isin operator\n") print(employees.loc[employees['Occupation'].isin(['Chemist','Programmer'])]) print("\nMultiple Conditions\n") print(employees.loc[(employees['Occupation'] == 'Chemist') | (employees['Name'] == 'John') & (employees['Age'] < 30)])
Output:
Use isin operator Age Date Of Join EmpCode Name Occupation 0 23 2018-01-25 Emp001 John Chemist 4 40 2018-03-16 Emp005 Mark Programmer Multiple Conditions Age Date Of Join EmpCode Name Occupation 0 23 2018-01-25 Emp001 John Chemist
13迭代 DataFrame 的行和列
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'], 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'], 'Occupation': ['Chemist', 'Statistician', 'Statistician', 'Statistician', 'Programmer'], 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26', '2018-03-16'], 'Age': [23, 24, 34, 29, 40]}) print("\n Example iterrows \n") for index, col in employees.iterrows(): print(col['Name'], "--", col['Age']) print("\n Example itertuples \n") for row in employees.itertuples(index=True, name='Pandas'): print(getattr(row, "Name"), "--", getattr(row, "Age"))
Output:
Example iterrows John -- 23 Doe -- 24 William -- 34 Spark -- 29 Mark -- 40 Example itertuples John -- 23 Doe -- 24 William -- 34 Spark -- 29 Mark -- 40
14如何通过名称或索引删除 DataFrame 的列
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'], 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'], 'Occupation': ['Chemist', 'Statistician', 'Statistician', 'Statistician', 'Programmer'], 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26', '2018-03-16'], 'Age': [23, 24, 34, 29, 40]}) print(employees) print("\n Drop Column by Name \n") employees.drop('Age', axis=1, inplace=True) print(employees) print("\n Drop Column by Index \n") employees.drop(employees.columns[[0,1]], axis=1, inplace=True) print(employees)
Output:
Age Date Of Join EmpCode Name Occupation 0 23 2018-01-25 Emp001 John Chemist 1 24 2018-01-26 Emp002 Doe Statistician 2 34 2018-01-26 Emp003 William Statistician 3 29 2018-02-26 Emp004 Spark Statistician 4 40 2018-03-16 Emp005 Mark Programmer Drop Column by Name Date Of Join EmpCode Name Occupation 0 2018-01-25 Emp001 John Chemist 1 2018-01-26 Emp002 Doe Statistician 2 2018-01-26 Emp003 William Statistician 3 2018-02-26 Emp004 Spark Statistician 4 2018-03-16 Emp005 Mark Programmer Drop Column by Index Name Occupation 0 John Chemist 1 Doe Statistician 2 William Statistician 3 Spark Statistician 4 Mark Programmer
15向 DataFrame 中新增列
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'], 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'], 'Occupation': ['Chemist', 'Statistician', 'Statistician', 'Statistician', 'Programmer'], 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26', '2018-03-16'], 'Age': [23, 24, 34, 29, 40]}) employees['City'] = ['London', 'Tokyo', 'Sydney', 'London', 'Toronto'] print(employees)
Output:
Age Date Of Join EmpCode Name Occupation City 0 23 2018-01-25 Emp001 John Chemist London 1 24 2018-01-26 Emp002 Doe Statistician Tokyo 2 34 2018-01-26 Emp003 William Statistician Sydney 3 29 2018-02-26 Emp004 Spark Statistician London 4 40 2018-03-16 Emp005 Mark Programmer Toronto
16如何从 DataFrame 中获取列标题列表
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'], 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'], 'Occupation': ['Chemist', 'Statistician', 'Statistician', 'Statistician', 'Programmer'], 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26', '2018-03-16'], 'Age': [23, 24, 34, 29, 40]}) print(list(employees)) print(list(employees.columns.values)) print(employees.columns.tolist())
Output:
['Age', 'Date Of Join', 'EmpCode', 'Name', 'Occupation'] ['Age', 'Date Of Join', 'EmpCode', 'Name', 'Occupation'] ['Age', 'Date Of Join', 'EmpCode', 'Name', 'Occupation']
17如何随机生成 DataFrame
import pandas as pd import numpy as np np.random.seed(5) df_random = pd.DataFrame(np.random.randint(100, size=(10, 6)), columns=list('ABCDEF'), index=['Row-{}'.format(i) for i in range(10)]) print(df_random)
Output:
A B C D E F Row-0 99 78 61 16 73 8 Row-1 62 27 30 80 7 76 Row-2 15 53 80 27 44 77 Row-3 75 65 47 30 84 86 Row-4 18 9 41 62 1 82 Row-5 16 78 5 58 0 80 Row-6 4 36 51 27 31 2 Row-7 68 38 83 19 18 7 Row-8 30 62 11 67 65 55 Row-9 3 91 78 27 29 33
18如何选择 DataFrame 的多个列
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'], 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'], 'Occupation': ['Chemist', 'Statistician', 'Statistician', 'Statistician', 'Programmer'], 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26', '2018-03-16'], 'Age': [23, 24, 34, 29, 40]}) df = employees[['EmpCode', 'Age', 'Name']] print(df)
Output:
EmpCode Age Name 0 Emp001 23 John 1 Emp002 24 Doe 2 Emp003 34 William 3 Emp004 29 Spark 4 Emp005 40 Mark
19如何将字典转换为 DataFrame
import pandas as pd data = ({'Age': [30, 20, 22, 40, 32, 28, 39], 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black', 'Red'], 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'], 'Height': [165, 70, 120, 80, 180, 172, 150], 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }) print(data) df = pd.DataFrame(data) print(df)
Output:
{'Height': [165, 70, 120, 80, 180, 172, 150], 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'], 'Age': [30, 20, 22, 40, 32, 28, 39], 'Sco re': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], 'Color': ['Blue', 'Green', 'Red', 'Whi te', 'Gray', 'Black', 'Red'], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX' ]} Age Color Food Height Score State 0 30 Blue Steak 165 4.6 NY 1 20 Green Lamb 70 8.3 TX 2 22 Red Mango 120 9.0 FL 3 40 White Apple 80 3.3 AL 4 32 Gray Cheese 180 1.8 AK 5 28 Black Melon 172 9.5 TX 6 39 Red Beans 150 2.2 TX
20使用 ioc 进行切片
import pandas as pd df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39], 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black', 'Red'], 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'], 'Height': [165, 70, 120, 80, 180, 172, 150], 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print("\n -- Selecting a single row with .loc with a string -- \n") print(df.loc['Penelope']) print("\n -- Selecting multiple rows with .loc with a list of strings -- \n") print(df.loc[['Cornelia', 'Jane', 'Dean']]) print("\n -- Selecting multiple rows with .loc with slice notation -- \n") print(df.loc['Aaron':'Dean'])
Output:
-- Selecting a single row with .loc with a string -- Age 40 Color White Food Apple Height 80 Score 3.3 State AL Name: Penelope, dtype: object -- Selecting multiple rows with .loc with a list of strings -- Age Color Food Height Score State Cornelia 39 Red Beans 150 2.2 TX Jane 30 Blue Steak 165 4.6 NY Dean 32 Gray Cheese 180 1.8 AK -- Selecting multiple rows with .loc with slice notation -- Age Color Food Height Score State Aaron 22 Red Mango 120 9.0 FL Penelope 40 White Apple 80 3.3 AL Dean 32 Gray Cheese 180 1.8 AK
21检查 DataFrame 中是否是空的
import pandas as pd df = pd.DataFrame() if df.empty: print('DataFrame is empty!')
Output:
DataFrame is empty!
22在创建 DataFrame 时指定索引和列名称
import pandas as pd values = ["India", "Canada", "Australia", "Japan", "Germany", "France"] code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"] df = pd.DataFrame(values, index=code, columns=['Country']) print(df)
Output:
Country IND India CAN Canada AUS Australia JAP Japan GER Germany FRA France
23使用 iloc 进行切片
import pandas as pd df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39], 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black', 'Red'], 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'], 'Height': [165, 70, 120, 80, 180, 172, 150], 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print("\n -- Selecting a single row with .iloc with an integer -- \n") print(df.iloc[4]) print("\n -- Selecting multiple rows with .iloc with a list of integers -- \n") print(df.iloc[[2, -2]]) print("\n -- Selecting multiple rows with .iloc with slice notation -- \n") print(df.iloc[:5:3])
Output:
-- Selecting a single row with .iloc with an integer -- Age 32 Color Gray Food Cheese Height 180 Score 1.8 State AK Name: Dean, dtype: object -- Selecting multiple rows with .iloc with a list of integers -- Age Color Food Height Score State Aaron 22 Red Mango 120 9.0 FL Christina 28 Black Melon 172 9.5 TX -- Selecting multiple rows with .iloc with slice notation -- Age Color Food Height Score State Jane 30 Blue Steak 165 4.6 NY Penelope 40 White Apple 80 3.3 AL
24iloc 和 loc 的区别
- loc 索引器还可以进行布尔选择,例如,如果我们想查找 Age 小于 30 的所有行并仅返回 Color 和 Height 列,我们可以执行以下操作。我们可以用 iloc 复制它,但我们不能将它传递给一个布尔系列,必须将布尔系列转换为 numpy 数组
- loc 从索引中获取具有特定标签的行(或列)
- iloc 在索引中的特定位置获取行(或列)(因此它只需要整数)
import pandas as pd df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39], 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black', 'Red'], 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'], 'Height': [165, 70, 120, 80, 180, 172, 150], 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print("\n -- loc -- \n") print(df.loc[df['Age'] < 30, ['Color', 'Height']]) print("\n -- iloc -- \n") print(df.iloc[(df['Age'] < 30).values, [1, 3]])
Output:
-- loc -- Color Height Nick Green 70 Aaron Red 120 Christina Black 172 -- iloc -- Color Height Nick Green 70 Aaron Red 120 Christina Black 172
25使用时间索引创建空 DataFrame
import datetime import pandas as pd todays_date = datetime.datetime.now().date() index = pd.date_range(todays_date, periods=10, freq='D') columns = ['A', 'B', 'C'] df = pd.DataFrame(index=index, columns=columns) df = df.fillna(0) print(df)
Output:
A B C 2018-09-30 0 0 0 2018-10-01 0 0 0 2018-10-02 0 0 0 2018-10-03 0 0 0 2018-10-04 0 0 0 2018-10-05 0 0 0 2018-10-06 0 0 0 2018-10-07 0 0 0 2018-10-08 0 0 0 2018-10-09 0 0 0
26如何改变 DataFrame 列的排序
import pandas as pd df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39], 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black', 'Red'], 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'], 'Height': [165, 70, 120, 80, 180, 172, 150], 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print("\n -- Change order using columns -- \n") new_order = [3, 2, 1, 4, 5, 0] df = df[df.columns[new_order]] print(df) print("\n -- Change order using reindex -- \n") df = df.reindex(['State', 'Color', 'Age', 'Food', 'Score', 'Height'], axis=1) print(df)
Output:
-- Change order using columns -- Height Food Color Score State Age Jane 165 Steak Blue 4.6 NY 30 Nick 70 Lamb Green 8.3 TX 20 Aaron 120 Mango Red 9.0 FL 22 Penelope 80 Apple White 3.3 AL 40 Dean 180 Cheese Gray 1.8 AK 32 Christina 172 Melon Black 9.5 TX 28 Cornelia 150 Beans Red 2.2 TX 39 -- Change order using reindex -- State Color Age Food Score Height Jane NY Blue 30 Steak 4.6 165 Nick TX Green 20 Lamb 8.3 70 Aaron FL Red 22 Mango 9.0 120 Penelope AL White 40 Apple 3.3 80 Dean AK Gray 32 Cheese 1.8 180 Christina TX Black 28 Melon 9.5 172 Cornelia TX Red 39 Beans 2.2 150
27检查 DataFrame 列的数据类型
import pandas as pd df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39], 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black', 'Red'], 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'], 'Height': [165, 70, 120, 80, 180, 172, 150], 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print(df.dtypes)
Output:
Age int64 Color object Food object Height int64 Score float64 State object dtype: object
28更改 DataFrame 指定列的数据类型
import pandas as pd df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39], 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black', 'Red'], 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'], 'Height': [165, 70, 120, 80, 180, 172, 150], 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print(df.dtypes) df['Age'] = df['Age'].astype(str) print(df.dtypes)
Output:
Age int64 Color object Food object Height int64 Score float64 State object dtype: object Age object Color object Food object Height int64 Score float64 State object dtype: object
29如何将列的数据类型转换为 DateTime 类型
import pandas as pd df = pd.DataFrame({'DateOFBirth': [1349720105, 1349806505, 1349892905, 1349979305, 1350065705, 1349792905, 1349730105], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print("\n----------------Before---------------\n") print(df.dtypes) print(df) df['DateOFBirth'] = pd.to_datetime(df['DateOFBirth'], unit='s') print("\n----------------After----------------\n") print(df.dtypes) print(df)
Output:
----------------Before--------------- DateOFBirth int64 State object dtype: object DateOFBirth State Jane 1349720105 NY Nick 1349806505 TX Aaron 1349892905 FL Penelope 1349979305 AL Dean 1350065705 AK Christina 1349792905 TX Cornelia 1349730105 TX ----------------After---------------- DateOFBirth datetime64[ns] State object dtype: object DateOFBirth State Jane 2012-10-08 18:15:05 NY Nick 2012-10-09 18:15:05 TX Aaron 2012-10-10 18:15:05 FL Penelope 2012-10-11 18:15:05 AL Dean 2012-10-12 18:15:05 AK Christina 2012-10-09 14:28:25 TX Cornelia 2012-10-08 21:01:45 TX
30将 DataFrame 列从 floats 转为 ints
import pandas as pd df = pd.DataFrame({'DailyExp': [75.7, 56.69, 55.69, 96.5, 84.9, 110.5, 58.9], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print("\n----------------Before---------------\n") print(df.dtypes) print(df) df['DailyExp'] = df['DailyExp'].astype(int) print("\n----------------After----------------\n") print(df.dtypes) print(df)
Output:
----------------Before--------------- DailyExp float64 State object dtype: object DailyExp State Jane 75.70 NY Nick 56.69 TX Aaron 55.69 FL Penelope 96.50 AL Dean 84.90 AK Christina 110.50 TX Cornelia 58.90 TX ----------------After---------------- DailyExp int32 State object dtype: object DailyExp State Jane 75 NY Nick 56 TX Aaron 55 FL Penelope 96 AL Dean 84 AK Christina 110 TX Cornelia 58 TX
31如何把 dates 列转换为 DateTime 类型
import pandas as pd df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01', '1986-06-01', '1983-06-04', '1990-03-07', '1999-07-09'], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print("\n----------------Before---------------\n") print(df.dtypes) df['DateOfBirth'] = df['DateOfBirth'].astype('datetime64') print("\n----------------After----------------\n") print(df.dtypes)
Output:
----------------Before--------------- DateOfBirth object State object dtype: object ----------------After---------------- DateOfBirth datetime64[ns] State object dtype: object
32两个 DataFrame 相加
import pandas as pd df1 = pd.DataFrame({'Age': [30, 20, 22, 40], 'Height': [165, 70, 120, 80], 'Score': [4.6, 8.3, 9.0, 3.3], 'State': ['NY', 'TX', 'FL', 'AL']}, index=['Jane', 'Nick', 'Aaron', 'Penelope']) df2 = pd.DataFrame({'Age': [32, 28, 39], 'Color': ['Gray', 'Black', 'Red'], 'Food': ['Cheese', 'Melon', 'Beans'], 'Score': [1.8, 9.5, 2.2], 'State': ['AK', 'TX', 'TX']}, index=['Dean', 'Christina', 'Cornelia']) df3 = df1.append(df2, sort=True) print(df3)
Output:
Age Color Food Height Score State Jane 30 NaN NaN 165.0 4.6 NY Nick 20 NaN NaN 70.0 8.3 TX Aaron 22 NaN NaN 120.0 9.0 FL Penelope 40 NaN NaN 80.0 3.3 AL Dean 32 Gray Cheese NaN 1.8 AK Christina 28 Black Melon NaN 9.5 TX Cornelia 39 Red Beans NaN 2.2 TX
33在 DataFrame 末尾添加额外的行
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'], 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'], 'Occupation': ['Chemist', 'Statistician', 'Statistician', 'Statistician', 'Programmer'], 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26', '2018-03-16'], 'Age': [23, 24, 34, 29, 40]}) print("\n------------ BEFORE ----------------\n") print(employees) employees.loc[len(employees)] = [45, '2018-01-25', 'Emp006', 'Sunny', 'Programmer'] print("\n------------ AFTER ----------------\n") print(employees)
Output:
------------ BEFORE ---------------- Age Date Of Join EmpCode Name Occupation 0 23 2018-01-25 Emp001 John Chemist 1 24 2018-01-26 Emp002 Doe Statistician 2 34 2018-01-26 Emp003 William Statistician 3 29 2018-02-26 Emp004 Spark Statistician 4 40 2018-03-16 Emp005 Mark Programmer ------------ AFTER ---------------- Age Date Of Join EmpCode Name Occupation 0 23 2018-01-25 Emp001 John Chemist 1 24 2018-01-26 Emp002 Doe Statistician 2 34 2018-01-26 Emp003 William Statistician 3 29 2018-02-26 Emp004 Spark Statistician 4 40 2018-03-16 Emp005 Mark Programmer 5 45 2018-01-25 Emp006 Sunny Programmer
34为指定索引添加新行
import pandas as pd employees = pd.DataFrame( data={'Name': ['John Doe', 'William Spark'], 'Occupation': ['Chemist', 'Statistician'], 'Date Of Join': ['2018-01-25', '2018-01-26'], 'Age': [23, 24]}, index=['Emp001', 'Emp002'], columns=['Name', 'Occupation', 'Date Of Join', 'Age']) print("\n------------ BEFORE ----------------\n") print(employees) employees.loc['Emp003'] = ['Sunny', 'Programmer', '2018-01-25', 45] print("\n------------ AFTER ----------------\n") print(employees)
Output:
------------ BEFORE ---------------- Name Occupation Date Of Join Age Emp001 John Doe Chemist 2018-01-25 23 Emp002 William Spark Statistician 2018-01-26 24 ------------ AFTER ---------------- Name Occupation Date Of Join Age Emp001 John Doe Chemist 2018-01-25 23 Emp002 William Spark Statistician 2018-01-26 24 Emp003 Sunny Programmer 2018-01-25 45
35如何使用 for 循环添加行
import pandas as pd cols = ['Zip'] lst = [] zip = 32100 for a in range(10): lst.append([zip]) zip = zip + 1 df = pd.DataFrame(lst, columns=cols) print(df)
Output:
Zip 0 32100 1 32101 2 32102 3 32103 4 32104 5 32105 6 32106 7 32107 8 32108 9 32109
36在 DataFrame 顶部添加一行
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp002', 'Emp003', 'Emp004'], 'Name': ['John', 'Doe', 'William'], 'Occupation': ['Chemist', 'Statistician', 'Statistician'], 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26'], 'Age': [23, 24, 34]}) print("\n------------ BEFORE ----------------\n") print(employees) # New line line = pd.DataFrame({'Name': 'Dean', 'Age': 45, 'EmpCode': 'Emp001', 'Date Of Join': '2018-02-26', 'Occupation': 'Chemist' }, index=[0]) # Concatenate two dataframe employees = pd.concat([line,employees.ix[:]]).reset_index(drop=True) print("\n------------ AFTER ----------------\n") print(employees)
Output:
------------ BEFORE ---------------- Age Date Of Join EmpCode Name Occupation 0 23 2018-01-25 Emp002 John Chemist 1 24 2018-01-26 Emp003 Doe Statistician 2 34 2018-01-26 Emp004 William Statistician ------------ AFTER ---------------- Age Date Of Join EmpCode Name Occupation 0 45 2018-02-26 Emp001 Dean Chemist 1 23 2018-01-25 Emp002 John Chemist 2 24 2018-01-26 Emp003 Doe Statistician 3 34 2018-01-26 Emp004 William Statistician
37如何向 DataFrame 中动态添加行
import pandas as pd df = pd.DataFrame(columns=['Name', 'Age']) df.loc[1, 'Name'] = 'Rocky' df.loc[1, 'Age'] = 23 df.loc[2, 'Name'] = 'Sunny' print(df)
Output:
Name Age 1 Rocky 23 2 Sunny NaN
38在任意位置插入行
import pandas as pd df = pd.DataFrame(columns=['Name', 'Age']) df.loc[1, 'Name'] = 'Rocky' df.loc[1, 'Age'] = 21 df.loc[2, 'Name'] = 'Sunny' df.loc[2, 'Age'] = 22 df.loc[3, 'Name'] = 'Mark' df.loc[3, 'Age'] = 25 df.loc[4, 'Name'] = 'Taylor' df.loc[4, 'Age'] = 28 print("\n------------ BEFORE ----------------\n") print(df) line = pd.DataFrame({"Name": "Jack", "Age": 24}, index=[2.5]) df = df.append(line, ignore_index=False) df = df.sort_index().reset_index(drop=True) df = df.reindex(['Name', 'Age'], axis=1) print("\n------------ AFTER ----------------\n") print(df)
Output:
------------ BEFORE ---------------- Name Age 1 Rocky 21 2 Sunny 22 3 Mark 25 4 Taylor 28 ------------ AFTER ---------------- Name Age 0 Rocky 21 1 Sunny 22 2 Jack 24 3 Mark 25 4 Taylor 28
39使用时间戳索引向 DataFrame 中添加行
import pandas as pd df = pd.DataFrame(columns=['Name', 'Age']) df.loc['2014-05-01 18:47:05', 'Name'] = 'Rocky' df.loc['2014-05-01 18:47:05', 'Age'] = 21 df.loc['2014-05-02 18:47:05', 'Name'] = 'Sunny' df.loc['2014-05-02 18:47:05', 'Age'] = 22 df.loc['2014-05-03 18:47:05', 'Name'] = 'Mark' df.loc['2014-05-03 18:47:05', 'Age'] = 25 print("\n------------ BEFORE ----------------\n") print(df) line = pd.to_datetime("2014-05-01 18:50:05", format="%Y-%m-%d %H:%M:%S") new_row = pd.DataFrame([['Bunny', 26]], columns=['Name', 'Age'], index=[line]) df = pd.concat([df, pd.DataFrame(new_row)], ignore_index=False) print("\n------------ AFTER ----------------\n") print(df)
Output:
------------ BEFORE ---------------- Name Age 2014-05-01 18:47:05 Rocky 21 2014-05-02 18:47:05 Sunny 22 2014-05-03 18:47:05 Mark 25 ------------ AFTER ---------------- Name Age 2014-05-01 18:47:05 Rocky 21 2014-05-02 18:47:05 Sunny 22 2014-05-03 18:47:05 Mark 25 2014-05-01 18:50:05 Bunny 26
40为不同的行填充缺失值
import pandas as pd a = {'A': 10, 'B': 20} b = {'B': 30, 'C': 40, 'D': 50} df1 = pd.DataFrame(a, index=[0]) df2 = pd.DataFrame(b, index=[1]) df = pd.DataFrame() df = df.append(df1) df = df.append(df2).fillna(0) print(df)
Output:
A B C D 0 10.0 20 0.0 0.0 1 0.0 30 40.0 50.0
41append, concat 和 combine_first 示例
import pandas as pd a = {'A': 10, 'B': 20} b = {'B': 30, 'C': 40, 'D': 50} df1 = pd.DataFrame(a, index=[0]) df2 = pd.DataFrame(b, index=[1]) d1 = pd.DataFrame() d1 = d1.append(df1) d1 = d1.append(df2).fillna(0) print("\n------------ append ----------------\n") print(d1) d2 = pd.concat([df1, df2]).fillna(0) print("\n------------ concat ----------------\n") print(d2) d3 = pd.DataFrame() d3 = d3.combine_first(df1).combine_first(df2).fillna(0) print("\n------------ combine_first ----------------\n") print(d3)
Output:
------------ append ---------------- A B C D 0 10.0 20 0.0 0.0 1 0.0 30 40.0 50.0 ------------ concat ---------------- A B C D 0 10.0 20 0.0 0.0 1 0.0 30 40.0 50.0 ------------ combine_first ---------------- A B C D 0 10.0 20.0 0.0 0.0 1 0.0 30.0 40.0 50.0
42获取行和列的平均值
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3']) df['Mean Basket'] = df.mean(axis=1) df.loc['Mean Fruit'] = df.mean() print(df)
Output:
Apple Orange Banana Pear Mean Basket Basket1 10.000000 20.0 30.0 40.000000 25.0 Basket2 7.000000 14.0 21.0 28.000000 17.5 Basket3 5.000000 5.0 0.0 0.000000 2.5 Mean Fruit 7.333333 13.0 17.0 22.666667 15.0
43计算行和列的总和
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3']) df['Sum Basket'] = df.sum(axis=1) df.loc['Sum Fruit'] = df.sum() print(df)
Output:
Apple Orange Banana Pear Sum Basket Basket1 10 20 30 40 100 Basket2 7 14 21 28 70 Basket3 5 5 0 0 10 Sum Fruit 22 39 51 68 180
44连接两列
import pandas as pd df = pd.DataFrame(columns=['Name', 'Age']) df.loc[1, 'Name'] = 'Rocky' df.loc[1, 'Age'] = 21 df.loc[2, 'Name'] = 'Sunny' df.loc[2, 'Age'] = 22 df.loc[3, 'Name'] = 'Mark' df.loc[3, 'Age'] = 25 df.loc[4, 'Name'] = 'Taylor' df.loc[4, 'Age'] = 28 print('\n------------ BEFORE ----------------\n') print(df) df['Employee'] = df['Name'].map(str) + ' - ' + df['Age'].map(str) df = df.reindex(['Employee'], axis=1) print('\n------------ AFTER ----------------\n') print(df)
Output:
------------ BEFORE ---------------- Name Age 1 Rocky 21 2 Sunny 22 3 Mark 25 4 Taylor 28 ------------ AFTER ---------------- Employee 1 Rocky - 21 2 Sunny - 22 3 Mark - 25 4 Taylor - 28
45过滤包含某字符串的行
import pandas as pd df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01', '1986-06-01', '1983-06-04', '1990-03-07', '1999-07-09'], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print(df) print("\n---- Filter with State contains TX ----\n") df1 = df[df['State'].str.contains("TX")] print(df1)
Output:
DateOfBirth State Jane 1986-11-11 NY Nick 1999-05-12 TX Aaron 1976-01-01 FL Penelope 1986-06-01 AL Dean 1983-06-04 AK Christina 1990-03-07 TX Cornelia 1999-07-09 TX ---- Filter with State contains TX ---- DateOfBirth State Nick 1999-05-12 TX Christina 1990-03-07 TX Cornelia 1999-07-09 TX