数据分析处理库Pandas-常用函数

简介: 数据分析处理库Pandas-常用函数

导入pandas库和numpy库


import pandas as pd
import numpy as np


我们以一个csv文件来展示pandas是如何来进行数据预处理的:titanic_train.csv


读入文件titanic_train.csv


titanic_survival = pd.read_csv("titanic_train.csv")


1、求平均值


①通过求和和求长函数来计算平均值


#去除缺失值
age_is_null = pd.isnull(titanic_survival["Age"])
good_ages = titanic_survival["Age"][age_is_null == False]
#求和后的数据除以总数据长度
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)


OUT:


29.6991176471


②通过pandas函数来求平均值


correct_mean_age_pn = titanic_survival["Age"].mean()
print(correct_mean_age_pn)


OUT:


29.69911764705882


③计算不同等级船舱对应的平均消费


passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    #取出Pclass值为this_class的行
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    #取出Pclass值为this_class的行的Fare值
    pclass_fare = pclass_rows["Fare"]
    #计算取出的Fare值的平均数
    fare_to_class = pclass_fare.mean()
    #将计算出的平均数传入fares_by_class字典
    fares_by_class[this_class] = fare_to_class
print(fares_by_class)


OUT:


{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}


2、透视表


①统计船舱等级和获救几率之间的关系


passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
print(passenger_survival)


参数解释:


index:统计基准

values:被统计数据

aggfunc:统计方式(默认值是mean)


OUT:


Pclass  Survived        
1       0.629630
2       0.472826
3       0.242363


②统计性别和获救几率的关系


sex_survival = titanic_survival.pivot_table(index="Sex", values="Survived", aggfunc=np.mean)
print(sex_survival)


OUT:


Sex     Survived        
female  0.742038
male    0.188908

3

③统计登船地点与消费和获救人数的关系


port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc=np.sum)
print(port_stats)


OUT:


Embarked     Fare        Survived        
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217


3、去除缺失值


①去除有缺失值的列


drop_na_columns = titanic_survival.dropna(axis=1)


OUT:


PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   
8            9         1       3   
9           10         1       2   
                                                Name     Sex  SibSp  Parch  \
0                            Braund, Mr. Owen Harris    male      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female      1      0   
2                             Heikkinen, Miss. Laina  female      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female      1      0   
4                           Allen, Mr. William Henry    male      0      0   
5                                   Moran, Mr. James    male      0      0   
6                            McCarthy, Mr. Timothy J    male      0      0   
7                     Palsson, Master. Gosta Leonard    male      3      1   
8  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female      0      2   
9                Nasser, Mrs. Nicholas (Adele Achem)  female      1      0   
             Ticket     Fare  
0         A/5 21171   7.2500  
1          PC 17599  71.2833  
2  STON/O2. 3101282   7.9250  
3            113803  53.1000  
4            373450   8.0500  
5            330877   8.4583  
6             17463  51.8625  
7            349909  21.0750  
8            347742  11.1333  
9            237736  30.0708


②去除有缺失值的行


new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"])
print(new_titanic_survival)


OUT:


PassengerId  Survived  Pclass  \
0             1         0       3   
1             2         1       1   
2             3         1       3   
3             4         1       1   
4             5         0       3   
6             7         0       1   
7             8         0       3   
8             9         1       3   
9            10         1       2   
10           11         1       3   
                                                 Name     Sex   Age  SibSp  \
0                             Braund, Mr. Owen Harris    male  22.0      1   
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                              Heikkinen, Miss. Laina  female  26.0      0   
3        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                            Allen, Mr. William Henry    male  35.0      0   
6                             McCarthy, Mr. Timothy J    male  54.0      0   
7                      Palsson, Master. Gosta Leonard    male   2.0      3   
8   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0   
9                 Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1   
10                    Sandstrom, Miss. Marguerite Rut  female   4.0      1   
    Parch            Ticket     Fare Cabin Embarked  
0       0         A/5 21171   7.2500   NaN        S  
1       0          PC 17599  71.2833   C85        C  
2       0  STON/O2. 3101282   7.9250   NaN        S  
3       0            113803  53.1000  C123        S  
4       0            373450   8.0500   NaN        S  
6       0             17463  51.8625   E46        S  
7       1            349909  21.0750   NaN        S  
8       2            347742  11.1333   NaN        S  
9       0            237736  30.0708   NaN        C  
10      1           PP 9549  16.7000    G6        S


4、定位


row_index_age = titanic_survival.loc[83, "Age"]
print(row_index_age)
row_index_pclass = titanic_survival.loc[766, "Pclass"]
print(row_index_pclass)


OUT:


28.0


5、排序


new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)
print(new_titanic_survival[:5])
print("----------")
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed[:5])


OUT:


PassengerId  Survived  Pclass                                  Name  \
630          631         1       1  Barkworth, Mr. Algernon Henry Wilson   
851          852         0       3                   Svensson, Mr. Johan   
493          494         0       1               Artagaveytia, Mr. Ramon   
96            97         0       1             Goldschmidt, Mr. George B   
116          117         0       3                  Connors, Mr. Patrick   
      Sex   Age  SibSp  Parch    Ticket     Fare Cabin Embarked  
630  male  80.0      0      0     27042  30.0000   A23        S  
851  male  74.0      0      0    347060   7.7750   NaN        S  
493  male  71.0      0      0  PC 17609  49.5042   NaN        C  
96   male  71.0      0      0  PC 17754  34.6542    A5        C  
116  male  70.5      0      0    370369   7.7500   NaN        Q  
----------
   PassengerId  Survived  Pclass                                  Name   Sex  \
0          631         1       1  Barkworth, Mr. Algernon Henry Wilson  male   
1          852         0       3                   Svensson, Mr. Johan  male   
2          494         0       1               Artagaveytia, Mr. Ramon  male   
3           97         0       1             Goldschmidt, Mr. George B  male   
4          117         0       3                  Connors, Mr. Patrick  male   
    Age  SibSp  Parch    Ticket     Fare Cabin Embarked  
0  80.0      0      0     27042  30.0000   A23        S  
1  74.0      0      0    347060   7.7750   NaN        S  
2  71.0      0      0  PC 17609  49.5042   NaN        C  
3  71.0      0      0  PC 17754  34.6542    A5        C  
4  70.5      0      0    370369   7.7500   NaN        Q


6、自定义函数apply()


①查看第100个样本的数据


def hundredth_row(column):
    hundredth_item = column.loc[99]
    return hundredth_item
hundredth_row = titanic_survival.apply(hundredth_row)
print(hundredth_row)


OUT:


PassengerId                  100
Survived                       0
Pclass                         2
Name           Kantor, Mr. Sinai
Sex                         male
Age                           34
SibSp                          1
Parch                          0
Ticket                    244367
Fare                          26
Cabin                        NaN
Embarked                       S
dtype: object


②统计每一列的缺失值情况


def not_null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
column_null_count = titanic_survival.apply(not_null_count)
print(column_null_count)


OUT:


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


③替换船舱等级称呼


def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknow"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"
classes = titanic_survival.apply(which_class, axis=1)
print(classes[:10])


OUT:


0     Third Class
1     First Class
2     Third Class
3     First Class
4     Third Class
5     Third Class
6     First Class
7     Third Class
8     Third Class
9    Second Class
dtype: object


7、综合:查看获救几率与年龄之间的关系


#年龄分段
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "Unknow"
    elif age < 18:
        return "Minor"
    else:
        return "Adult"
age_label = titanic_survival.apply(generate_age_label, axis=1)
#分析获救几率与年龄之间的关系
titanic_survival['age_label'] = age_label
age_group_survival = titanic_survival.pivot_table(index="age_label", values="Survived")
print(age_group_survival)


OUT:


age_label  Survived        
Adult      0.381032
Minor      0.539823
Unknow     0.293785
相关文章
|
15天前
|
数据采集 SQL 数据可视化
使用Pandas进行高效数据分析
【6月更文挑战第1天】Pandas是Python数据分析的核心库,基于NumPy,提供高效的数据结构如Series和DataFrame。它支持数据加载(CSV、Excel、SQL等)、清洗、预处理、探索、可视化及时间序列分析。通过实例展示了如何加载CSV文件,填充缺失值,进行数据统计和按部门平均薪资的可视化。Pandas与Matplotlib等库集成,简化了数据分析流程,对数据科学家和分析师极其重要。
|
4天前
|
数据采集 数据可视化 数据挖掘
数据分析大神养成记:Python+Pandas+Matplotlib助你飞跃!
【6月更文挑战第12天】在数字时代,Python因其强大的数据处理能力和易用性成为数据分析首选工具。结合Pandas(用于高效数据处理)和Matplotlib(用于数据可视化),能助你成为数据分析专家。Python处理数据预处理、分析和可视化,Pandas的DataFrame简化表格数据操作,Matplotlib则提供丰富图表展示数据。掌握这三个库,数据分析之路将更加畅通无阻。
|
5天前
|
JSON 数据挖掘 API
数据分析实战丨基于pygal与requests分析GitHub最受欢迎的Python库
数据分析实战丨基于pygal与requests分析GitHub最受欢迎的Python库
17 2
|
14天前
|
Python 数据挖掘 数据可视化
Python数据分析——Pandas与Jupyter Notebook
【6月更文挑战第1天】 本文探讨了如何使用Python的Pandas库和Jupyter Notebook进行数据分析。首先,介绍了安装和设置步骤,然后展示了如何使用Pandas的DataFrame进行数据加载、清洗和基本分析。接着,通过Jupyter Notebook的交互式环境,演示了数据分析和可视化,包括直方图的创建。文章还涉及数据清洗,如处理缺失值,并展示了如何进行高级数据分析,如数据分组和聚合。此外,还提供了将分析结果导出到文件的方法。通过销售数据的完整案例,详细说明了从加载数据到可视化和结果导出的全过程。最后,讨论了进一步的分析和可视化技巧,如销售额趋势、产品销售排名和区域分布,以及
34 2
|
18天前
|
数据采集 SQL 数据处理
Python中的Pandas库:数据处理与分析的利器
Python中的Pandas库:数据处理与分析的利器
30 0
|
18天前
|
存储 并行计算 数据挖掘
Python中的NumPy库:科学计算与数据分析的基石
Python中的NumPy库:科学计算与数据分析的基石
68 0
|
19天前
|
数据采集 数据挖掘 数据处理
Python数据分析实战:使用Pandas处理Excel文件
Python数据分析实战:使用Pandas处理Excel文件
96 0
|
19天前
|
数据采集 数据可视化 数据处理
Python中的高效数据处理:Pandas库详解
Python中的高效数据处理:Pandas库详解
35 2
|
19天前
|
数据采集 SQL 数据可视化
使用Python和Pandas库进行数据分析的入门指南
使用Python和Pandas库进行数据分析的入门指南
78 0
|
26天前
|
存储 数据采集 数据挖掘
Python数据分析实验一:Python数据采集与存储
Python数据分析实验一:Python数据采集与存储
118 1