Python机器学习小知识：Pandas去重-阿里云开发者社区

Python机器学习小知识：Pandas去重

2021-07-20 848

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： 介绍通过drop_duplicates进行数据去重

在使用Pandas处理数据的时候经常遇到重复数据的情况，这时候可以使用drop_duplicates（pandas.DataFrame.drop_duplicates）来进行处理。函数定义如下：
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)[source]
返回去重后的数据集（如果inplace为True则返回None）

参数：
subset：column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns.
用来指定特定的列，默认所有列

keep：{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates (if any) to keep. - first : Drop duplicates except for the first occurrence. - last : Drop duplicates except for the last occurrence. - False : Drop all duplicates.
确定保留哪条重复的数据，默认第一条

inplace：bool, default False
Whether to drop duplicates in place or to return a copy.
是直接在原来数据上修改还是保留一个副本

ignore_index：bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
是否忽略Index，默认为True

例子：
df = pd.DataFrame({

'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]

})
df

brand style  rating

0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
By default, it removes duplicate rows based on all columns.

df.drop_duplicates()

brand style  rating

0 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
To remove duplicates on specific column(s), use subset.

具体可以参考说明：https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

Python机器学习小知识：Pandas去重

热门文章

最新文章

相关课程

相关电子书

推荐镜像

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

活动广场

任务中心

训练营

直播

乘风者计划

下载

镜像站

技术资料

Python机器学习小知识：Pandas去重

热门文章

最新文章

相关课程

相关电子书

推荐镜像