目录
Pandas 可以很方便的处理 JSON 数据
demo.json
[ { "name":"张三", "age":23, "gender":true }, { "name":"李四", "age":24, "gender":true }, { "name":"王五", "age":25, "gender":false } ]
JSON 转换为 CSV
非常方便,只要通过 pd.read_json
读出JSON数据,再通过 df.to_csv 写入 CSV 即可
import pandas as pd json_path = 'data/demo.json' # 加载 JSON 数据 with open(json_path, 'r', encoding='utf8') as f: # 解析一个有效的JSON字符串并将其转换为Python字典 df = pd.read_json(f.read()) print(df.to_string()) # to_string() 用于返回 DataFrame 类型的数据,我们也可以直接处理 JSON 字符串。 print('-' * 10) # 重新定义标题 df.columns = ['姓名', '年龄', '性别'] print(df) df.to_csv('data/result.csv', index=False, encoding='GB2312')
简单 JSON
从 URL 中读取 JSON 数据:
import pandas as pd URL = 'https://static.runoob.com/download/sites.json' df = pd.read_json(URL) # 和读文件一样 print(df)
输出:
id name url likes 0 A001 菜鸟教程 www.runoob.com 61 1 A002 Google www.google.com 124 2 A003 淘宝 www.taobao.com 45
字典转化为 DataFrame 数据
import pandas as pd s = { "col1": {"row1": 1, "row2": 2, "row3": 3}, "col2": {"row1": "x", "row2": "y", "row4": "z"} } df = pd.DataFrame(s) print(df) print('-' * 10) new_df = df.dropna() # 数据清洗,删除包含空数据的行 print(new_df.to_string()) print('-' * 10) df.fillna(99, inplace=True) # fillna() 方法来替换一些空字段 print(df.to_string())
输出:不同的行会用 NaN 填充
col1 col2 row1 1.0 x row2 2.0 y row3 3.0 NaN row4 NaN z ---------- col1 col2 row1 1.0 x row2 2.0 y ---------- col1 col2 row1 1.0 x row2 2.0 y row3 3.0 99 row4 99.0 z
内嵌的 JSON 数据
nested_list.json 嵌套的JSON数据
{ "school_name": "ABC primary school", "class": "Year 1", "students": [ { "id": "A001", "name": "Tom", "math": 60, "physics": 66, "chemistry": 61 }, { "id": "A002", "name": "James", "math": 89, "physics": 76, "chemistry": 51 }, { "id": "A003", "name": "Jenny", "math": 79, "physics": 90, "chemistry": 78 } ] }
运行代码
data = json.loads(f.read())
使用 Python JSON 模块载入数据。
json_normalize()
使用了参数 record_path 并设置为 ['students'] 用于展开内嵌的 JSON 数据 students。
import pandas as pd import json # 打印出结果JSON结构 with open('data/nested_list.json', 'r') as f: data = pd.read_json(f.read()) print(data) # 使用 Python JSON 模块载入数据 with open('data/nested_list.json', 'r') as f: data = json.loads(f.read()) # 展平数据-- json_normalize() 方法将内嵌的数据完整的解析出来: df_nested_list = pd.json_normalize(data, record_path=['students']) print(df_nested_list)
import pandas as pd import json data_path = 'data/nested_list.json' print(('-' * 10) + ' 连同上级JSON值一起显示') # 使用 Python JSON 模块载入数据 with open(data_path, 'r') as f: data = json.loads(f.read()) # 展平数据 df_nested_list = pd.json_normalize( data, record_path=['students'], meta=['school_name', 'class'] ) print(df_nested_list)
复杂 JSON
该数据嵌套了列表和字典,数据文件 nested_mix.json 如下
nested_mix.json
{ "school_name": "local primary school", "class": "Year 1", "info": { "president": "John Kasich", "address": "ABC road, London, UK", "contacts": { "email": "admin@e.com", "tel": "123456789" } }, "students": [ { "id": "A001", "name": "Tom", "math": 60, "physics": 66, "chemistry": 61 }, { "id": "A002", "name": "James", "math": 89, "physics": 76, "chemistry": 51 }, { "id": "A003", "name": "Jenny", "math": 79, "physics": 90, "chemistry": 78 }] }
import pandas as pd import json # 使用 Python JSON 模块载入数据 with open('data/nested_mix.json', 'r') as f: data = json.loads(f.read()) df = pd.json_normalize( data, record_path=['students'], meta=[ 'class', ['info', 'president'], # 类似 info.president ['info', 'contacts', 'tel'] ] ) print(df)
id name math ... class info.president info.contacts.tel 0 A001 Tom 60 ... Year 1 John Kasich 123456789 1 A002 James 89 ... Year 1 John Kasich 123456789 2 A003 Jenny 79 ... Year 1 John Kasich 123456789 [3 rows x 8 columns]
读取内嵌数据中的一组数据
nested_deep.json
{ "school_name": "local primary school", "class": "Year 1", "students": [ { "id": "A001", "name": "Tom", "grade": { "math": 60, "physics": 66, "chemistry": 61 } }, { "id": "A002", "name": "James", "grade": { "math": 89, "physics": 76, "chemistry": 51 } }, { "id": "A003", "name": "Jenny", "grade": { "math": 79, "physics": 90, "chemistry": 78 } }] }
这里我们需要使用到 glom
模块来处理数据套嵌,glom 模块允许我们使用 . 来访问内嵌对象的属性。
第一次使用我们需要安装 glom:
pip3 install glom -i https://pypi.tuna.tsinghua.edu.cn/simple
import pandas as pd from glom import glom df = pd.read_json('nested_deep.json') data = df['students'].apply(lambda row: glom(row, 'grade.math')) print(data)
输出:
0 60 1 89 2 79