Import the pandas and numpy libraries:
import pandas as pd
import numpy as np
We use a CSV file, titanic_train.csv, to show how pandas handles data preprocessing.
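Read the file titanic_train.csv and display the first ten rows:
titanic_survival = pd.read_csv("titanic_train.csv")
# head() with no argument returns only the first five rows, so pass 10 explicitly
titanic_survival.head(10)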
OUT:
Next, we inspect the Age column for missing values:
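# Select the Age column
age = titanic_survival["Age"]
# .loc slices by label and includes the end label, so this prints rows 0 through 10
print(age.loc[:10])
print("__________")
# Build a boolean mask that is True wherever Age is missing
age_is_null = pd.isnull(age)
print(age_is_null.loc[:10])
print("__________")
# Keep only the entries where the mask is True
age_null_true = age[age_is_null]
print(age_null_true.loc[:10])
print("__________")
# Count the missing values
age_null_count = len(age_null_true)
print(age_null_count)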
OUT:
0     22.0
1     38.0
2     26.0
3     35.0
4     35.0
5      NaN
6     54.0
7      2.0
8     27.0
9     14.0
10     4.0
Name: Age, dtype: float64
__________
0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
Name: Age, dtype: bool
__________
5   NaN
Name: Age, dtype: float64
__________
177
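The same boolean-mask pattern can be tried without the CSV file. The sketch below is a minimal, self-contained illustration on a small made-up Series: the name sample_age and its values are hypothetical and are not taken from titanic_train.csv. It also shows isnull().sum() as a shorter way to count missing values, and why .loc[:10] above printed eleven rows rather than ten.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from titanic_train.csv
sample_age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan], name="Age")

# Boolean mask: True where the value is missing
mask = pd.isnull(sample_age)

# Keep only the missing entries, then count them
missing = sample_age[mask]
count_by_len = len(missing)

# Equivalent shortcut: each True counts as 1 when summed
count_by_sum = int(sample_age.isnull().sum())

# .loc slices by label and includes the end label,
# so .loc[:2] returns the rows labeled 0, 1, and 2
first_three = sample_age.loc[:2]

print(count_by_len, count_by_sum, len(first_three))
```

Both counting approaches give the same result here; the len() version matches the step-by-step code above, while the sum() version avoids building the intermediate filtered Series.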