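Pandas Categorical Data
- Categorical data takes its values from a fixed set of labels, for example gender (male, female), class (Class 1, Class 2), or province (Hebei, Jiangsu, and so on). If such a variable is encoded numerically (say male = 1, female = 0), the numbers 1 and 0 are only labels: they carry no order, and 1 cannot be treated as greater than 0.
- Numerical data holds measured quantities, for example income (1000 yuan, 500 yuan). Such values can be compared with each other and used in arithmetic.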
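As a minimal sketch of this distinction (the `gender` and `income` values are made up for illustration), the snippet below shows that the integer codes behind a categorical column are only labels, while numeric data supports arithmetic:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
gender = pd.Series(["male", "female", "female", "male"], dtype="category")
income = pd.Series([1000, 500, 800, 1200])

# Each category is stored as an integer code. The codes follow the sorted
# category labels (female=0, male=1) and say nothing about magnitude.
codes = gender.cat.codes.tolist()
categories = gender.cat.categories.tolist()

# By default a categorical is unordered, so no value is "greater" than another.
is_ordered = gender.cat.ordered

# Numerical data can be aggregated meaningfully.
mean_income = income.mean()

print(categories, codes, is_ordered, mean_income)
```

Averaging the codes would run without error, but the result would have no interpretation, which is why the categorical dtype keeps labels and codes separate.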
Since version 0.15, pandas has supported the Categorical data type, so a DataFrame can contain categorical columns.
import pandas as pd
import numpy as np
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df
   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         a
5   6         e
df["raw_grade"]
0 a
1 b
2 b
3 a
4 a
5 e
Name: raw_grade, dtype: object
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
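The conversion above infers the categories from the values that occur in the column. As a hedged alternative sketch, the categories can also be declared up front with `CategoricalDtype`; the names `grade_dtype`, `inferred`, and `explicit` are illustrative and not part of the original example:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

raw = pd.Series(["a", "b", "b", "a", "a", "e"])

# Inferred categories: the sorted unique values of the column.
inferred = raw.astype("category")

# Explicit categories: "c" and "d" are declared even though they never occur,
# and ordered=True records that a < b < c < d < e.
grade_dtype = CategoricalDtype(categories=["a", "b", "c", "d", "e"], ordered=True)
explicit = raw.astype(grade_dtype)

print(inferred.cat.categories.tolist())   # ['a', 'b', 'e']
print(explicit.cat.categories.tolist())   # ['a', 'b', 'c', 'd', 'e']
print(explicit.cat.ordered)               # True
```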
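# Rename the categories to more meaningful labels. Assigning to .cat.categories
# directly was removed in pandas 2.0, so rename_categories is used instead.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])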
df
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
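`set_categories` replaces the whole category list, so it can both add unused categories and drop existing ones. The sketch below uses a small standalone Series (not the `df` above) to show both effects:

```python
import pandas as pd

grade = pd.Series(["very good", "good", "good", "very bad"], dtype="category")

# Adding categories that do not occur in the data is allowed; no value changes.
widened = grade.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

# Leaving out a category that does occur turns its values into NaN.
narrowed = grade.cat.set_categories(["good", "very good"])

print(widened.isna().sum())
print(narrowed.isna().tolist())
```

In the narrowed Series the "very bad" entry becomes missing, so it is worth checking for NaN after changing the category list.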
df.sort_values(by="grade")
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
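The rows above come out in category order, not alphabetical order. A minimal comparison, with illustrative variable names, between sorting the same labels as plain strings and as a categorical:

```python
import pandas as pd

labels = ["very good", "good", "good", "very bad"]
order = ["very bad", "bad", "medium", "good", "very good"]

as_text = pd.Series(labels)
as_category = pd.Series(pd.Categorical(labels, categories=order))

# Plain strings sort alphabetically.
text_sorted = as_text.sort_values().tolist()

# Categorical values sort by their position in the category list.
category_sorted = as_category.sort_values().tolist()

print(text_sorted)      # ['good', 'good', 'very bad', 'very good']
print(category_sorted)  # ['very bad', 'good', 'good', 'very good']
```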
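# observed=False keeps categories that have no rows; recent pandas versions
# otherwise warn about, or change, the default for categorical groupers.
df.groupby("grade", observed=False).size()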
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
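Whether empty categories such as "bad" and "medium" appear in the result is controlled by the `observed` parameter of `groupby`. The following sketch rebuilds a comparable frame on its own and shows both settings:

```python
import pandas as pd

order = ["very bad", "bad", "medium", "good", "very good"]
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "grade": pd.Categorical(
        ["very good", "good", "good", "very good", "very good", "very bad"],
        categories=order,
    ),
})

# observed=False: one group per declared category, including empty ones.
all_groups = df.groupby("grade", observed=False).size()

# observed=True: only categories that actually occur in the data.
seen_groups = df.groupby("grade", observed=True).size()

print(all_groups.tolist())   # [1, 0, 0, 2, 3]
print(seen_groups.tolist())  # [1, 2, 3]
```

Passing `observed` explicitly keeps the result stable across pandas versions, since the default has changed over time.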