Hands-on data analysis, Chapter 1

2022-06-15


1.1. Loading data

Import the following modules before running any of the later steps:

import numpy as np
import pandas as pd

Data is loaded with a different method depending on the file type.

For a CSV file, the usual call is:

pd.read_csv('train.csv')

For other file types, see the official pandas documentation:

IO tools (text, CSV, HDF5, …) — pandas 1.4.2 documentation (pydata.org)

Format Type	Data Description	Reader	Writer
text	CSV	read_csv	to_csv
text	Fixed-Width Text File	read_fwf
text	JSON	read_json	to_json
text	HTML	read_html	to_html
text	LaTeX		Styler.to_latex
text	XML	read_xml	to_xml
text	Local clipboard	read_clipboard	to_clipboard
binary	MS Excel	read_excel	to_excel
binary	OpenDocument	read_excel
binary	HDF5 Format	read_hdf	to_hdf
binary	Feather Format	read_feather	to_feather
binary	Parquet Format	read_parquet	to_parquet
binary	ORC Format	read_orc
binary	Stata	read_stata	to_stata
binary	SAS	read_sas
binary	SPSS	read_spss
binary	Python Pickle Format	read_pickle	to_pickle
SQL	SQL	read_sql	to_sql
SQL	Google BigQuery	read_gbq	to_gbq

The table above lists the reader and writer function for each file format.
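As a small illustration of the reader/writer pairs in the table, the sketch below writes a DataFrame to CSV and JSON text and reads it back. The sample data and the in-memory buffers are assumptions made for this example; with a real file you would pass a path such as 'train.csv' instead.

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from train.csv
sample = pd.DataFrame({"name": ["Ann", "Bob"], "age": [29, 41]})

# CSV round trip: to_csv produces text, read_csv parses it back
csv_text = sample.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON round trip: to_json and read_json are the matching pair
json_text = sample.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(from_csv)
print(from_json)
```

The other readers in the table follow the same pattern: a read_* function returns a DataFrame, and the matching to_* method writes one out.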

1.2. Renaming columns and redefining the index

df = pd.read_csv('train.csv', names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],index_col='乘客ID',header=0)

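names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'] replaces the original column names with Chinese ones (passenger ID, survived, ticket class, name, sex, age, number of siblings, number of parents and children, ticket, fare, cabin, port of embarkation).

index_col='乘客ID' uses the '乘客ID' column as the index.

header=0 tells pandas that the first row of the file holds the original column names. Because names is also given, that row is replaced by the new names instead of being read as data.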
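To make the role of header=0 concrete, here is a minimal sketch that reads the same text twice, once with header=0 and once without. The two-row CSV is a hypothetical stand-in for train.csv and uses only three of the columns.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for train.csv
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,38\n"
new_names = ["乘客ID", "是否幸存", "年龄"]

# header=0 marks the first line as the old header, so names replaces it
renamed = pd.read_csv(io.StringIO(csv_text), names=new_names,
                      index_col="乘客ID", header=0)

# Without header=0 the old header line is read as an ordinary data row
no_header = pd.read_csv(io.StringIO(csv_text), names=new_names)

print(renamed)
print(len(renamed), len(no_header))
```

The second frame has one extra row, which is the old English header parsed as data.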

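1.3. Viewing basic information about the data

df.info() prints basic information about the data. The output below comes from the DataFrame as loaded by plain pd.read_csv('train.csv'), so it still shows the original English column names. The trailing None is the return value of info(), which presumably appears because the call was wrapped in print():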

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

1.4. Viewing only the first or last few rows

df.head(10) shows the first ten rows.

df.tail(15) shows the last fifteen rows.
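A quick sketch of both calls on a small frame; the 20-row frame here is a hypothetical example, not the Titanic data.

```python
import pandas as pd

# Hypothetical frame with 20 rows numbered 0..19
numbers = pd.DataFrame({"n": range(20)})

first_three = numbers.head(3)   # rows at positions 0, 1, 2
last_two = numbers.tail(2)      # rows at positions 18, 19
default_head = numbers.head()   # with no argument, head() returns 5 rows

print(first_three)
print(last_two)
print(len(default_head))
```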

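1.5. Checking whether values are missing

df.isnull().head() shows which cells in the first rows are missing. Remember that it returns True or False for every cell, not the underlying values.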

    是否幸存    仓位等级    姓名    性别    年龄    兄弟姐妹个数    父母子女个数    船票信息    票价    客舱    登船港口
乘客ID                                            
1    False    False    False    False    False    False    False    False    False    True    False
2    False    False    False    False    False    False    False    False    False    False    False
3    False    False    False    False    False    False    False    False    False    True    False
4    False    False    False    False    False    False    False    False    False    False    False
5    False    False    False    False    False    False    False    False    False    True    False
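Because the result is a frame of booleans, it can be summed to count missing cells per column. The sketch below shows this on a small hypothetical frame; the column names only mimic the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries
people = pd.DataFrame({"Age": [22.0, np.nan, 26.0],
                       "Cabin": [np.nan, "C85", np.nan]})

mask = people.isnull()            # booleans, same shape as people
missing_per_column = mask.sum()   # True counts as 1, so this counts missing cells

print(mask)
print(missing_per_column)
```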

2.1. Sorting data

frame = pd.DataFrame(np.arange(8).reshape((2, 4)), 
                     index=['2', '1'], 
                     columns=['d', 'a', 'b', 'c'])
frame

The resulting frame is:

    d    a    b    c
2    0    1    2    3
1    4    5    6    7

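pd.DataFrame creates a DataFrame object.

np.arange(8).reshape((2, 4)) generates a two-dimensional array with 2 rows and 4 columns (2*4). The first row is 0, 1, 2, 3 and the second row is 4, 5, 6, 7.

index=['2', '1'] defines the index of the DataFrame object.
columns=['d', 'a', 'b', 'c'] defines the column names of the DataFrame object.

Sorting code: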

frame.sort_values(by='c', ascending=True)

The output is:

    d    a    b    c
2    0    1    2    3
1    4    5    6    7

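The rows are sorted by the values in column c, in ascending order. The frame was already in that order, so the output looks the same as the input.

The by parameter names the column to sort on, and the ascending parameter chooses the direction (ascending or descending).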
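Since ascending order leaves this particular frame unchanged, a descending sort makes the effect easier to see. This is a minimal sketch that rebuilds the same frame and flips the ascending parameter.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['2', '1'],
                     columns=['d', 'a', 'b', 'c'])

# ascending=False puts the largest value of column c first
descending = frame.sort_values(by='c', ascending=False)

print(descending)
print(frame)   # sort_values returns a new frame; the original keeps its order
```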

# sort by row index
frame.sort_index()

    d    a    b    c
1    4    5    6    7
2    0    1    2    3

The rows are now ordered by their index labels, ascending.

# sort by column index
frame.sort_index(axis=1)

    a    b    c    d
2    1    2    3    0
1    5    6    7    4
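One detail worth noting is that the index labels in this example are strings, so sort_index orders them as text. The sketch below uses a hypothetical three-row frame to show how that differs from numeric ordering once a label has more than one digit.

```python
import pandas as pd

# Hypothetical frame whose row labels are strings
labels = pd.DataFrame({"v": [0, 1, 2]}, index=['10', '2', '1'])

# String labels are compared character by character
as_text = labels.sort_index()

# Converting the labels to integers gives numeric ordering instead
numeric = labels.copy()
numeric.index = numeric.index.astype(int)
as_numbers = numeric.sort_index()

print(as_text.index.tolist())
print(as_numbers.index.tolist())
```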

2.2. Adding DataFrames

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                     columns=['a', 'b', 'c'],
                     index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=['a', 'e', 'c'],
                     index=['first', 'one', 'two', 'second'])

#frame1_a
         a    b    c
one    0.0  1.0  2.0
two    3.0  4.0  5.0
three  6.0  7.0  8.0

#frame1_b
          a     e     c
first   0.0   1.0   2.0
one     3.0   4.0   5.0
two     6.0   7.0   8.0
second  9.0  10.0  11.0

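Running frame1_a + frame1_b aligns the two frames on both row and column labels. Any cell whose row or column label is missing from either frame becomes NaN: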

        a    b    c    e
first    NaN    NaN    NaN    NaN
one    3.0    NaN    7.0    NaN
second    NaN    NaN    NaN    NaN
three    NaN    NaN    NaN    NaN
two    9.0    NaN    13.0    NaN
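If the NaN cells are not wanted, DataFrame.add accepts a fill_value that stands in for a value missing from one of the two frames. The sketch below rebuilds the two frames and compares both forms; note that a cell missing from both frames still comes out as NaN.

```python
import numpy as np
import pandas as pd

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'],
                        index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])

total = frame1_a + frame1_b                     # NaN wherever a label is unmatched
filled = frame1_a.add(frame1_b, fill_value=0)   # treat a one-sided gap as 0

print(total)
print(filled)
```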

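2.3. Viewing basic information about frame2

describe() returns summary statistics for the data:

count : number of non-missing values in the sample
mean : mean of the sample
std : standard deviation of the sample
min : minimum value in the sample
25% : 25th percentile of the sample (first quartile)
50% : 50th percentile of the sample (median)
75% : 75th percentile of the sample (third quartile)
max : maximum value in the sample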

frame2 = pd.DataFrame([[1.4, np.nan], 
                       [7.1, -4.5],
                       [np.nan, np.nan], 
                       [0.75, -1.3]
                      ], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
frame2.describe()

        one            two
count    3.000000    2.000000
mean    3.083333    -2.900000
std        3.493685    2.262742
min        0.750000    -4.500000
25%        1.075000    -3.700000
50%        1.400000    -2.900000
75%        4.250000    -2.100000
max        7.100000    -1.300000
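The rows of the describe() output can be reproduced one at a time with the corresponding Series methods, which helps when checking what each statistic means. A minimal sketch for the column one:

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame([[1.4, np.nan],
                       [7.1, -4.5],
                       [np.nan, np.nan],
                       [0.75, -1.3]],
                      index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

one = frame2['one']
summary = frame2.describe()

print(one.count())          # missing values are not counted
print(one.mean())
print(one.std())            # sample standard deviation (divides by n - 1)
print(one.quantile(0.25))   # linear interpolation between sorted values
```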

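3.1. Basic pandas data types

pandas has two core data types, DataFrame and Series.

A Series is a one-dimensional data structure made up of an index and values.
A DataFrame is a two-dimensional structure. Besides an index and values, it also has columns.

A DataFrame is made up of multiple Series, one per column.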
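The relationship between the two types can be sketched by building a DataFrame from two Series and then pulling one column back out. The names and values below are hypothetical.

```python
import pandas as pd

# Two hypothetical Series that share the same index
ages = pd.Series([22, 38], index=['a', 'b'], name='Age')
fares = pd.Series([7.25, 71.28], index=['a', 'b'], name='Fare')

# Placing them side by side gives a DataFrame with one column per Series
table = pd.concat([ages, fares], axis=1)

# Selecting a single column returns a Series again
column = table['Age']

print(table)
print(type(column).__name__)
```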

3.2. DataFrame column names

df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

This outputs the column names of the DataFrame.

3.3. Viewing the values in a column

# view the first five values of the "Cabin" column
df['Cabin'].head(5)
#df.Cabin.head(5)

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

3.4. Deleting a column

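del test_1['a']

df.drop(['a'], axis=1, inplace=True)
df.head(3)

inplace=True modifies the original DataFrame directly and makes drop() return None, so head(3) has to be called on df in a separate step. Here test_1 and 'a' stand in for whichever DataFrame and column you want to delete.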
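The difference between the two ways of calling drop can be checked on a small frame. This is a sketch with a hypothetical two-column frame named demo.

```python
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Default behaviour: drop returns a new frame and leaves demo untouched
without_a = demo.drop(columns=["a"])
columns_after_copy = list(demo.columns)

# inplace=True changes demo itself, and the call returns None
result = demo.drop(["b"], axis=1, inplace=True)

print(list(without_a.columns))
print(columns_after_copy)
print(result, list(demo.columns))
```

Because the in-place call returns None, chaining a method such as head() onto it raises an AttributeError.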

3.5. Filtering data

df[df["Age"]<10].head(3)

midage = df[(df["Age"]>10)& (df["Age"]<50)]
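Both filters work by building a boolean Series and using it to pick rows. The sketch below spells out that intermediate step on a small hypothetical frame; the parentheses around each comparison are needed because & binds more tightly than < and >.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
people = pd.DataFrame({"Age": [5.0, 30.0, np.nan, 50.0, 12.0]})

# Each comparison yields a boolean Series; & combines them element-wise
mask = (people["Age"] > 10) & (people["Age"] < 50)

# Indexing with the mask keeps only the rows where it is True
midage = people[mask]

print(mask.tolist())
print(midage)
```

Rows with a missing Age compare as False and are dropped, and the kept rows retain their original index labels.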

3.6. The loc and iloc methods

Use the loc method to display the "Pclass", "Name" and "Sex" data for rows 100, 105 and 108 of midage:

midage.loc[[100,105,108],['Pclass','Name','Sex']]

Use the iloc method to display the "Pclass", "Name" and "Sex" data for rows 100, 105 and 108 of midage:

midage.iloc[[100,105,108],[2,3,4]]

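iloc selects by integer position, while loc selects by label (index labels for rows, column names for columns). Because midage keeps the original row labels after filtering, the two calls above return the same rows only if its index has been reset first, for example with midage = midage.reset_index(drop=True).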
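The label versus position distinction is easiest to see on a filtered frame, where the two no longer coincide. The following sketch uses a hypothetical four-row frame in place of the Titanic data.

```python
import pandas as pd

# Hypothetical data; the default index labels are 0, 1, 2, 3
people = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                       "Age": [5, 30, 8, 40]})

# Filtering keeps the original labels, so adults has labels 1 and 3
adults = people[people["Age"] > 10]

by_label = adults.loc[[1, 3], ["Name"]]    # row labels and column names
by_position = adults.iloc[[0, 1], [0]]     # row positions and column positions

# After a reset, labels and positions agree again
reset = adults.reset_index(drop=True)

print(by_label)
print(by_position)
print(reset.index.tolist())
```

On adults, asking loc for label 0 would raise a KeyError, since that row was filtered out.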

References

hands-on-data-analysis, Unit 1 - 飞桨AI Studio (baidu.com)

DATAWHALE - a community that loves learning (linklearner.com)

pandas.read_csv — pandas 1.4.2 documentation (pydata.org)

IO tools (text, CSV, HDF5, …) — pandas 1.4.2 documentation (pydata.org)
