46过滤索引中包含某字符串的行
import pandas as pd df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01', '1986-06-01', '1983-06-04', '1990-03-07', '1999-07-09'], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane', 'Christina', 'Cornelia']) print(df) print("\n---- Filter Index contains ane ----\n") df.index = df.index.astype('str') df1 = df[df.index.str.contains('ane')] print(df1)
Output:
DateOfBirth State Jane 1986-11-11 NY Pane 1999-05-12 TX Aaron 1976-01-01 FL Penelope 1986-06-01 AL Frane 1983-06-04 AK Christina 1990-03-07 TX Cornelia 1999-07-09 TX ---- Filter Index contains ane ---- DateOfBirth State Jane 1986-11-11 NY Pane 1999-05-12 TX Frane 1983-06-04 AK
47使用 AND 运算符过滤包含特定字符串值的行
import pandas as pd df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01', '1986-06-01', '1983-06-04', '1990-03-07', '1999-07-09'], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane', 'Christina', 'Cornelia']) print(df) print("\n---- Filter DataFrame using & ----\n") df.index = df.index.astype('str') df1 = df[df.index.str.contains('ane') & df['State'].str.contains("TX")] print(df1)
Output:
DateOfBirth State Jane 1986-11-11 NY Pane 1999-05-12 TX Aaron 1976-01-01 FL Penelope 1986-06-01 AL Frane 1983-06-04 AK Christina 1990-03-07 TX Cornelia 1999-07-09 TX ---- Filter DataFrame using & ---- DateOfBirth State Pane 1999-05-12 TX
48查找包含某字符串的所有行
import pandas as pd df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01', '1986-06-01', '1983-06-04', '1990-03-07', '1999-07-09'], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane', 'Christina', 'Cornelia']) print(df) print("\n---- Filter DataFrame using & ----\n") df.index = df.index.astype('str') df1 = df[df.index.str.contains('ane') | df['State'].str.contains("TX")] print(df1)
Output:
DateOfBirth State Jane 1986-11-11 NY Pane 1999-05-12 TX Aaron 1976-01-01 FL Penelope 1986-06-01 AL Frane 1983-06-04 AK Christina 1990-03-07 TX Cornelia 1999-07-09 TX ---- Filter DataFrame using & ---- DateOfBirth State Jane 1986-11-11 NY Pane 1999-05-12 TX Frane 1983-06-04 AK Christina 1990-03-07 TX Cornelia 1999-07-09 TX
49如果行中的值包含字符串,则创建与字符串相等的另一列
import pandas as pd import numpy as np df = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'], 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'], 'Occupation': ['Chemist', 'Accountant', 'Statistician', 'Statistician', 'Programmer'], 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26', '2018-03-16'], 'Age': [23, 24, 34, 29, 40]}) df['Department'] = pd.np.where(df.Occupation.str.contains("Chemist"), "Science", pd.np.where(df.Occupation.str.contains("Statistician"), "Economics", pd.np.where(df.Occupation.str.contains("Programmer"), "Computer", "General"))) print(df)
Output:
Age Date Of Join EmpCode Name Occupation Department 0 23 2018-01-25 Emp001 John Chemist Science 1 24 2018-01-26 Emp002 Doe Accountant General 2 34 2018-01-26 Emp003 William Statistician Economics 3 29 2018-02-26 Emp004 Spark Statistician Economics 4 40 2018-03-16 Emp005 Mark Programmer Computer
50计算 pandas group 中每组的行数
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0], [6, 6, 6, 6], [8, 8, 8, 8], [5, 5, 0, 0]], columns=['Apple', 'Orange', 'Rice', 'Oil'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print(df) print("\n ----------------------------- \n") print(df[['Apple', 'Orange', 'Rice', 'Oil']]. groupby(['Apple']).agg(['mean', 'count']))
Output:
Apple Orange Rice Oil Basket1 10 20 30 40 Basket2 7 14 21 28 Basket3 5 5 0 0 Basket4 6 6 6 6 Basket5 8 8 8 8 Basket6 5 5 0 0 ----------------------------- Orange Rice Oil mean count mean count mean count Apple 5 5 2 0 2 0 2 6 6 1 6 1 6 1 7 14 1 21 1 28 1 8 8 1 8 1 8 1 10 20 1 30 1 40 1
51检查字符串是否在 DataFrme 中
import pandas as pd df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01', '1986-06-01', '1983-06-04', '1990-03-07', '1999-07-09'], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane', 'Christina', 'Cornelia']) if df['State'].str.contains('TX').any(): print("TX is there")
Output:
TX is there
52从 DataFrame 列中获取唯一行值
import pandas as pd df = pd.DataFrame({'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print(df) print("\n----------------\n") print(df["State"].unique())
Output:
State Jane NY Nick TX Aaron FL Penelope AL Dean AK Christina TX Cornelia TX ---------------- ['NY' 'TX' 'FL' 'AL' 'AK']
53计算 DataFrame 列的不同值
import pandas as pd df = pd.DataFrame({'Age': [30, 20, 22, 40, 20, 30, 20, 25], 'Height': [165, 70, 120, 80, 162, 72, 124, 81], 'Score': [4.6, 8.3, 9.0, 3.3, 4, 8, 9, 3], 'State': ['NY', 'TX', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']}, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Jaane', 'Nicky', 'Armour', 'Ponting']) print(df.Age.value_counts())
Output:
20 3 30 2 25 1 22 1 40 1 Name: Age, dtype: int64
54删除具有重复索引的行
import pandas as pd df = pd.DataFrame({'Age': [30, 30, 22, 40, 20, 30, 20, 25], 'Height': [165, 165, 120, 80, 162, 72, 124, 81], 'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3], 'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']}, index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky', 'Armour', 'Ponting']) print("\n -------- Duplicate Rows ----------- \n") print(df) df1 = df.reset_index().drop_duplicates(subset='index', keep='first').set_index('index') print("\n ------- Unique Rows ------------ \n") print(df1)
Output:
-------- Duplicate Rows ----------- Age Height Score State Jane 30 165 4.6 NY Jane 30 165 4.6 NY Aaron 22 120 9.0 FL Penelope 40 80 3.3 AL Jaane 20 162 4.0 NY Nicky 30 72 8.0 TX Armour 20 124 9.0 FL Ponting 25 81 3.0 AL ------- Unique Rows ------------ Age Height Score State index Jane 30 165 4.6 NY Aaron 22 120 9.0 FL Penelope 40 80 3.3 AL Jaane 20 162 4.0 NY Nicky 30 72 8.0 TX Armour 20 124 9.0 FL Ponting 25 81 3.0 AL
55删除某些列具有重复值的行
import pandas as pd df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25], 'Height': [120, 162, 120, 120, 120, 72, 120, 81], 'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3], 'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']}, index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky', 'Armour', 'Ponting']) print("\n -------- Duplicate Rows ----------- \n") print(df) df1 = df.reset_index().drop_duplicates(subset=['Age','Height'], keep='first').set_index('index') print("\n ------- Unique Rows ------------ \n") print(df1)
Output:
-------- Duplicate Rows ----------- Age Height Score State Jane 30 120 4.6 NY Jane 40 162 4.6 NY Aaron 30 120 9.0 FL Penelope 40 120 3.3 AL Jaane 30 120 4.0 NY Nicky 30 72 8.0 TX Armour 20 120 9.0 FL Ponting 25 81 3.0 AL ------- Unique Rows ------------ Age Height Score State index Jane 30 120 4.6 NY Jane 40 162 4.6 NY Penelope 40 120 3.3 AL Nicky 30 72 8.0 TX Armour 20 120 9.0 FL Ponting 25 81 3.0 AL
56从 DataFrame 单元格中获取值
import pandas as pd df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25], 'Height': [120, 162, 120, 120, 120, 72, 120, 81], 'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3], 'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']}, index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky', 'Armour', 'Ponting']) print(df.loc['Nicky', 'Age'])
Output:
30
57使用 DataFrame 中的条件索引获取单元格上的标量值
import pandas as pd df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25], 'Height': [120, 162, 120, 120, 120, 72, 120, 81], 'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3], 'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']}, index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky', 'Armour', 'Ponting']) print("\nGet Height where Age is 20") print(df.loc[df['Age'] == 20, 'Height'].values[0]) print("\nGet State where Age is 30") print(df.loc[df['Age'] == 30, 'State'].values[0])
Output:
Get Height where Age is 20 120 Get State where Age is 30 NY
58设置 DataFrame 的特定单元格值
import pandas as pd df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25], 'Height': [120, 162, 120, 120, 120, 72, 120, 81]}, index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky', 'Armour', 'Ponting']) print("\n--------------Before------------\n") print(df) df.iat[0, 0] = 90 df.iat[0, 1] = 91 df.iat[1, 1] = 92 df.iat[2, 1] = 93 df.iat[7, 1] = 99 print("\n--------------After------------\n") print(df)
Output:
--------------Before------------ Age Height Jane 30 120 Jane 40 162 Aaron 30 120 Penelope 40 120 Jaane 30 120 Nicky 30 72 Armour 20 120 Ponting 25 81 --------------After------------ Age Height Jane 90 91 Jane 40 92 Aaron 30 93 Penelope 40 120 Jaane 30 120 Nicky 30 72 Armour 20 120 Ponting 25 99
59从 DataFrame 行获取单元格值
import pandas as pd df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25], 'Height': [120, 162, 120, 120, 120, 72, 120, 81]}, index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky', 'Armour', 'Ponting']) print(df.loc[df.Age == 30,'Height'].tolist())
Output:
[120, 120, 120, 72]
60用字典替换 DataFrame 列中的值
import pandas as pd df = pd.DataFrame({'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print(df) dict = {"NY": 1, "TX": 2, "FL": 3, "AL": 4, "AK": 5} df1 = df.replace({"State": dict}) print("\n\n") print(df1)
Output:
State Jane NY Nick TX Aaron FL Penelope AL Dean AK Christina TX Cornelia TX State Jane 1 Nick 2 Aaron 3 Penelope 4 Dean 5 Christina 2 Cornelia 2
61统计基于某一列的一列的数值
import pandas as pd df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01', '1986-06-01', '1983-06-04', '1990-03-07', '1999-07-09'], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']) print(df.groupby('State').DateOfBirth.nunique())
Output:
State AK 1 AL 1 FL 1 NY 1 TX 3 Name: DateOfBirth, dtype: int64
62处理 DataFrame 中的缺失值
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5,]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3']) print("\n--------- DataFrame ---------\n") print(df) print("\n--------- Use of isnull() ---------\n") print(df.isnull()) print("\n--------- Use of notnull() ---------\n") print(df.notnull())
Output:
--------- DataFrame --------- Apple Orange Banana Pear Basket1 10 20.0 30.0 40.0 Basket2 7 14.0 21.0 28.0 Basket3 5 NaN NaN NaN --------- Use of isnull() --------- Apple Orange Banana Pear Basket1 False False False False Basket2 False False False False Basket3 False True True True --------- Use of notnull() --------- Apple Orange Banana Pear Basket1 True True True True Basket2 True True True True Basket3 True False False False
63删除包含任何缺失数据的行
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5,]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3']) print("\n--------- DataFrame ---------\n") print(df) print("\n--------- Use of dropna() ---------\n") print(df.dropna())
Output:
--------- DataFrame --------- Apple Orange Banana Pear Basket1 10 20.0 30.0 40.0 Basket2 7 14.0 21.0 28.0 Basket3 5 NaN NaN NaN --------- Use of dropna() --------- Apple Orange Banana Pear Basket1 10 20.0 30.0 40.0 Basket2 7 14.0 21.0 28.0
64删除 DataFrame 中缺失数据的列
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5,]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3']) print("\n--------- DataFrame ---------\n") print(df) print("\n--------- Drop Columns) ---------\n") print(df.dropna(1))
Output:
--------- DataFrame --------- Apple Orange Banana Pear Basket1 10 20.0 30.0 40.0 Basket2 7 14.0 21.0 28.0 Basket3 5 NaN NaN NaN --------- Drop Columns) --------- Apple Basket1 10 Basket2 7 Basket3 5
65按降序对索引值进行排序
import pandas as pd df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01', '1986-06-01', '1983-06-04', '1990-03-07', '1999-07-09'], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane', 'Christina', 'Cornelia']) print(df.sort_index(ascending=False))
Output:
DateOfBirth State Penelope 1986-06-01 AL Pane 1999-05-12 TX Jane 1986-11-11 NY Frane 1983-06-04 AK Cornelia 1999-07-09 TX Christina 1990-03-07 TX Aaron 1976-01-01 FL
66按降序对列进行排序
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'], 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'], 'Occupation': ['Chemist', 'Statistician', 'Statistician', 'Statistician', 'Programmer'], 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26', '2018-03-16'], 'Age': [23, 24, 34, 29, 40]}) print(employees.sort_index(axis=1, ascending=False))
Output:
Occupation Name EmpCode Date Of Join Age 0 Chemist John Emp001 2018-01-25 23 1 Statistician Doe Emp002 2018-01-26 24 2 Statistician William Emp003 2018-01-26 34 3 Statistician Spark Emp004 2018-02-26 29 4 Programmer Mark Emp005 2018-03-16 40
67使用 rank 方法查找 DataFrame 中元素的排名
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3']) print("\n--------- DataFrame Values--------\n") print(df) print("\n--------- DataFrame Values by Rank--------\n") print(df.rank())
Output:
--------- DataFrame Values-------- Apple Orange Banana Pear Basket1 10 20 30 40 Basket2 7 14 21 28 Basket3 5 5 0 0 --------- DataFrame Values by Rank-------- Apple Orange Banana Pear Basket1 3.0 3.0 3.0 3.0 Basket2 2.0 2.0 2.0 2.0 Basket3 1.0 1.0 1.0 1.0
68在多列上设置索引
import pandas as pd employees = pd.DataFrame({ 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'], 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'], 'Occupation': ['Chemist', 'Statistician', 'Statistician', 'Statistician', 'Programmer'], 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26', '2018-03-16'], 'Age': [23, 24, 34, 29, 40]}) print("\n --------- Before Index ----------- \n") print(employees) print("\n --------- Multiple Indexing ----------- \n") print(employees.set_index(['Occupation', 'Age']))
Output:
Date Of Join EmpCode Name Occupation Age Chemist 23 2018-01-25 Emp001 John Statistician 24 2018-01-26 Emp002 Doe 34 2018-01-26 Emp003 William 29 2018-02-26 Emp004 Spark Programmer 40 2018-03-16 Emp005 Mark
69确定 DataFrame 的周期索引和列
import pandas as pd values = ["India", "Canada", "Australia", "Japan", "Germany", "France"] pidx = pd.period_range('2015-01-01', periods=6) df = pd.DataFrame(values, index=pidx, columns=['Country']) print(df)
Output:
Country 2015-01-01 India 2015-01-02 Canada 2015-01-03 Australia 2015-01-04 Japan 2015-01-05 Germany 2015-01-06 France
70导入 CSV 指定特定索引
import pandas as pd df = pd.read_csv('test.csv', index_col="DateTime") print(df)
Output:
Wheat Rice Oil DateTime 10/10/2016 10.500 12.500 16.500 10/11/2016 11.250 12.750 17.150 10/12/2016 10.000 13.150 15.500 10/13/2016 12.000 14.500 16.100 10/14/2016 13.000 14.825 15.600 10/15/2016 13.075 15.465 15.315 10/16/2016 13.650 16.105 15.030 10/17/2016 14.225 16.745 14.745 10/18/2016 14.800 17.385 14.460 10/19/2016 15.375 18.025 14.175
71将 DataFrame 写入 csv
import pandas as pd df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01', '1986-06-01', '1983-06-04', '1990-03-07', '1999-07-09'], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'] }, index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane', 'Christina', 'Cornelia']) df.to_csv('test.csv', encoding='utf-8', index=True)
Output:
检查本地文件
72使用 Pandas 读取 csv 文件的特定列
import pandas as pd df = pd.read_csv("test.csv", usecols = ['Wheat','Oil']) print(df)
73Pandas 获取 CSV 列的列表
import pandas as pd cols = list(pd.read_csv("test.csv", nrows =1)) print(cols)
Output:
['DateTime', 'Wheat', 'Rice', 'Oil']
74找到列值最大的行
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3']) print(df.ix[df['Apple'].idxmax()])
Output:
Apple 55 Orange 15 Banana 8 Pear 12 Name: Basket3, dtype: int64
75使用查询方法进行复杂条件选择
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3']) print(df) print("\n ----------- Filter data using query method ------------- \n") df1 = df.ix[df.query('Apple > 50 & Orange <= 15 & Banana < 15 & Pear == 12').index] print(df1)
Output:
Apple Orange Banana Pear Basket1 10 20 30 40 Basket2 7 14 21 28 Basket3 55 15 8 12 ----------- Filter data using query method ------------- Apple Orange Banana Pear Basket3 55 15 8 12
76检查 Pandas 中是否存在列
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3']) if 'Apple' in df.columns: print("Yes") else: print("No") if set(['Apple','Orange']).issubset(df.columns): print("Yes") else: print("No")
77为特定列从 DataFrame 中查找 n-smallest 和 n-largest 值
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12], [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n----------- nsmallest -----------\n") print(df.nsmallest(2, ['Apple'])) print("\n----------- nlargest -----------\n") print(df.nlargest(2, ['Apple']))
Output:
----------- nsmallest ----------- Apple Orange Banana Pear Basket6 5 4 9 2 Basket2 7 14 21 28 ----------- nlargest ----------- Apple Orange Banana Pear Basket3 55 15 8 12 Basket4 15 14 1 8
78从 DataFrame 中查找所有列的最小值和最大值
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12], [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n----------- Minimum -----------\n") print(df[['Apple', 'Orange', 'Banana', 'Pear']].min()) print("\n----------- Maximum -----------\n") print(df[['Apple', 'Orange', 'Banana', 'Pear']].max())
Output:
----------- Minimum ----------- Apple 5 Orange 1 Banana 1 Pear 2 dtype: int64 ----------- Maximum ----------- Apple 55 Orange 20 Banana 30 Pear 40 dtype: int64
79在 DataFrame 中找到最小值和最大值所在的索引位置
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12], [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n----------- Minimum -----------\n") print(df[['Apple', 'Orange', 'Banana', 'Pear']].idxmin()) print("\n----------- Maximum -----------\n") print(df[['Apple', 'Orange', 'Banana', 'Pear']].idxmax())
Output:
----------- Minimum ----------- Apple Basket6 Orange Basket5 Banana Basket4 Pear Basket6 dtype: object ----------- Maximum ----------- Apple Basket3 Orange Basket1 Banana Basket1 Pear Basket1 dtype: object
80计算 DataFrame Columns 的累积乘积和累积总和
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12], [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n----------- Cumulative Product -----------\n") print(df[['Apple', 'Orange', 'Banana', 'Pear']].cumprod()) print("\n----------- Cumulative Sum -----------\n") print(df[['Apple', 'Orange', 'Banana', 'Pear']].cumsum())
Output:
----------- Cumulative Product ----------- Apple Orange Banana Pear Basket1 10 20 30 40 Basket2 70 280 630 1120 Basket3 3850 4200 5040 13440 Basket4 57750 58800 5040 107520 Basket5 404250 58800 5040 860160 Basket6 2021250 235200 45360 1720320 ----------- Cumulative Sum ----------- Apple Orange Banana Pear Basket1 10 20 30 40 Basket2 17 34 51 68 Basket3 72 49 59 80 Basket4 87 63 60 88 Basket5 94 64 61 96 Basket6 99 68 70 98
81汇总统计
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12], [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n----------- Describe DataFrame -----------\n") print(df.describe()) print("\n----------- Describe Column -----------\n") print(df[['Apple']].describe())
Output:
----------- Describe DataFrame ----------- Apple Orange Banana Pear count 6.000000 6.000000 6.000000 6.000000 mean 16.500000 11.333333 11.666667 16.333333 std 19.180719 7.257180 11.587349 14.555640 min 5.000000 1.000000 1.000000 2.000000 25% 7.000000 6.500000 2.750000 8.000000 50% 8.500000 14.000000 8.500000 10.000000 75% 13.750000 14.750000 18.000000 24.000000 max 55.000000 20.000000 30.000000 40.000000 ----------- Describe Column ----------- Apple count 6.000000 mean 16.500000 std 19.180719 min 5.000000 25% 7.000000 50% 8.500000 75% 13.750000 max 55.000000
82查找 DataFrame 的均值、中值和众数
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12], [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n----------- Calculate Mean -----------\n") print(df.mean()) print("\n----------- Calculate Median -----------\n") print(df.median()) print("\n----------- Calculate Mode -----------\n") print(df.mode())
Output:
----------- Calculate Mean ----------- Apple 16.500000 Orange 11.333333 Banana 11.666667 Pear 16.333333 dtype: float64 ----------- Calculate Median ----------- Apple 8.5 Orange 14.0 Banana 8.5 Pear 10.0 dtype: float64 ----------- Calculate Mode ----------- Apple Orange Banana Pear 0 7 14 1 8
83测量 DataFrame 列的方差和标准偏差
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12], [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n----------- Calculate Mean -----------\n") print(df.mean()) print("\n----------- Calculate Median -----------\n") print(df.median()) print("\n----------- Calculate Mode -----------\n") print(df.mode())
Output:
----------- Measure Variance ----------- Apple 367.900000 Orange 52.666667 Banana 134.266667 Pear 211.866667 dtype: float64 ----------- Standard Deviation ----------- Apple 19.180719 Orange 7.257180 Banana 11.587349 Pear 14.555640 dtype: float64
84计算 DataFrame 列之间的协方差
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12], [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n----------- Calculating Covariance -----------\n") print(df.cov()) print("\n----------- Between 2 columns -----------\n") # Covariance of Apple vs Orange print(df.Apple.cov(df.Orange))
Output:
----------- Calculating Covariance ----------- Apple Orange Banana Pear Apple 367.9 47.600000 -40.200000 -35.000000 Orange 47.6 52.666667 54.333333 77.866667 Banana -40.2 54.333333 134.266667 154.933333 Pear -35.0 77.866667 154.933333 211.866667 ----------- Between 2 columns ----------- 47.60000000000001
85计算 Pandas 中两个 DataFrame 对象之间的相关性
import pandas as pd df1 = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12], [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n------ Calculating Correlation of one DataFrame Columns -----\n") print(df1.corr()) df2 = pd.DataFrame([[52, 54, 58, 41], [14, 24, 51, 78], [55, 15, 8, 12], [15, 14, 1, 8], [7, 17, 18, 98], [15, 34, 29, 52]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n----- Calculating correlation between two DataFrame -------\n") print(df2.corrwith(other=df1))
Output:
------ Calculating Correlation of one DataFrame Columns ----- Apple Orange Banana Pear Apple 1.000000 0.341959 -0.180874 -0.125364 Orange 0.341959 1.000000 0.646122 0.737144 Banana -0.180874 0.646122 1.000000 0.918606 Pear -0.125364 0.737144 0.918606 1.000000 ----- Calculating correlation between two DataFrame ------- Apple 0.678775 Orange 0.354993 Banana 0.920872 Pear 0.076919 dtype: float64
86计算 DataFrame 列的每个单元格的百分比变化
import pandas as pd df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12], [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n------ Percent change at each cell of a Column -----\n") print(df[['Apple']].pct_change()[:3]) print("\n------ Percent change at each cell of a DataFrame -----\n") print(df.pct_change()[:5])
Output:
------ Percent change at each cell of a Column ----- Apple Basket1 NaN Basket2 -0.300000 Basket3 6.857143 ------ Percent change at each cell of a DataFrame ----- Apple Orange Banana Pear Basket1 NaN NaN NaN NaN Basket2 -0.300000 -0.300000 -0.300000 -0.300000 Basket3 6.857143 0.071429 -0.619048 -0.571429 Basket4 -0.727273 -0.066667 -0.875000 -0.333333 Basket5 -0.533333 -0.928571 0.000000 0.000000
87在 Pandas 中向前和向后填充 DataFrame 列的缺失值
import pandas as pd df = pd.DataFrame([[10, 30, 40], [], [15, 8, 12], [15, 14, 1, 8], [7, 8], [5, 4, 1]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n------ DataFrame with NaN -----\n") print(df) print("\n------ DataFrame with Forward Filling -----\n") print(df.ffill()) print("\n------ DataFrame with Forward Filling -----\n") print(df.bfill())
Output:
------ DataFrame with NaN ----- Apple Orange Banana Pear Basket1 10.0 30.0 40.0 NaN Basket2 NaN NaN NaN NaN Basket3 15.0 8.0 12.0 NaN Basket4 15.0 14.0 1.0 8.0 Basket5 7.0 8.0 NaN NaN Basket6 5.0 4.0 1.0 NaN ------ DataFrame with Forward Filling ----- Apple Orange Banana Pear Basket1 10.0 30.0 40.0 NaN Basket2 10.0 30.0 40.0 NaN Basket3 15.0 8.0 12.0 NaN Basket4 15.0 14.0 1.0 8.0 Basket5 7.0 8.0 1.0 8.0 Basket6 5.0 4.0 1.0 8.0 ------ DataFrame with Forward Filling ----- Apple Orange Banana Pear Basket1 10.0 30.0 40.0 8.0 Basket2 15.0 8.0 12.0 8.0 Basket3 15.0 8.0 12.0 8.0 Basket4 15.0 14.0 1.0 8.0 Basket5 7.0 8.0 1.0 NaN Basket6 5.0 4.0 1.0 NaN
88在 Pandas 中使用非分层索引使用 Stacking
import pandas as pd df = pd.DataFrame([[10, 30, 40], [], [15, 8, 12], [15, 14, 1, 8], [7, 8], [5, 4, 1]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n------ DataFrame-----\n") print(df) print("\n------ Stacking DataFrame -----\n") print(df.stack(level=-1))
Output:
------ DataFrame----- Apple Orange Banana Pear Basket1 10.0 30.0 40.0 NaN Basket2 NaN NaN NaN NaN Basket3 15.0 8.0 12.0 NaN Basket4 15.0 14.0 1.0 8.0 Basket5 7.0 8.0 NaN NaN Basket6 5.0 4.0 1.0 NaN ------ Stacking DataFrame ----- Basket1 Apple 10.0 Orange 30.0 Banana 40.0 Basket3 Apple 15.0 Orange 8.0 Banana 12.0 Basket4 Apple 15.0 Orange 14.0 Banana 1.0 Pear 8.0 Basket5 Apple 7.0 Orange 8.0 Basket6 Apple 5.0 Orange 4.0 Banana 1.0 dtype: float64
89使用分层索引对 Pandas 进行拆分
import pandas as pd df = pd.DataFrame([[10, 30, 40], [], [15, 8, 12], [15, 14, 1, 8], [7, 8], [5, 4, 1]], columns=['Apple', 'Orange', 'Banana', 'Pear'], index=['Basket1', 'Basket2', 'Basket3', 'Basket4', 'Basket5', 'Basket6']) print("\n------ DataFrame-----\n") print(df) print("\n------ Unstacking DataFrame -----\n") print(df.unstack(level=-1))
Output:
------ DataFrame----- Apple Orange Banana Pear Basket1 10.0 30.0 40.0 NaN Basket2 NaN NaN NaN NaN Basket3 15.0 8.0 12.0 NaN Basket4 15.0 14.0 1.0 8.0 Basket5 7.0 8.0 NaN NaN Basket6 5.0 4.0 1.0 NaN ------ Unstacking DataFrame ----- Apple Basket1 10.0 Basket2 NaN Basket3 15.0 Basket4 15.0 Basket5 7.0 Basket6 5.0 Orange Basket1 30.0 Basket2 NaN Basket3 8.0 Basket4 14.0 Basket5 8.0 Basket6 4.0 Banana Basket1 40.0 Basket2 NaN Basket3 12.0 Basket4 1.0 Basket5 NaN Basket6 1.0 Pear Basket1 NaN Basket2 NaN Basket3 NaN Basket4 8.0 Basket5 NaN Basket6 NaN dtype: float64
90Pandas 获取 HTML 页面上 table 数据
import pandas as pd df pd.read_html("url")