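I. PaddleNLP for the 5th China "Fayan Cup" LAIC2022: Few-Shot Multi-Task Competition on Judicial Text
1. Task Introduction
This track is hosted by the China Justice Big Data Institute (中国司法大数据研究院). Judicial work offers many NLP scenarios, but labeled samples are often scarce, which makes model training in few-shot settings an important research problem. The task releases few-shot, multi-task judicial data, with the aim of exploring the best few-shot learning models and their application in the judicial domain.
To download the dataset, visit the competition page at data.court.gov.cn/pages/laic2…. The data may be used only for this competition; use in other areas without permission is prohibited.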
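2. Data Description
- (1) The data comes from the legal domain and covers text classification and named entity recognition (NER), 10 tasks in total (classification: 8 case-element tasks and 1 sentencing-tier task; NER: 1 task);
- (2) Training set: 304 case-element records, 90 sentencing-tier records and 150 NER records, 544 records across the 10 tasks (this includes validation data; no separate validation set is provided, so contestants split it themselves);
- (3) Annotated sample examples: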
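2.1 Case Elements (案件要素)
The case-element data comprises 8 binary classification subtasks. Each record is a JSON dictionary with the following fields:
- id: text ID.
- data: text content, the fact-description part of the judgment document.
- label: whether the document exhibits the current case element. The task description gives the values as 是/否 (yes/no); the sample below and the prediction script later in this post use 正/负 (positive/negative).
- task: task type, CLS.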
{ "id": 491, "data": "经审理查明:2017年8月21日13时许,被告人杨信驾驶强制报废的赣0131133变型拖拉机,由北向南行驶至南昌县新莲塔公路曹骆十字路口时,闯红灯继续行驶,将由西向东被害人邓某1驾驶的三轮电动车撞倒,致被害人邓某1受伤。经南昌县公安局交通管理大队道路交通事故认定书认定:被告人杨信负本次事故的主要责任。2017年10月10日,经南昌县公安司法鉴定中心鉴定:被害人邓某1损伤构成重伤二级。2018年2月12日,被害人邓某1死亡。经鉴定被害人邓某1符合因交通事故致颅脑损伤后并衰竭死亡。\n案发后,被告人杨信打电话报警,并随救护车辆送被害人邓某1到医院救治,支付了部分医疗费。2017年10月24日,经传唤,被告人杨信主动向公安机关投案。\n上述事实,被告人在开庭审理过程中亦无异议,且有法医学鉴定书,事故责任认定书,现场勘验、检查笔录,现场照片,事故车辆技术鉴定书,交通事故车辆安全性能鉴定报告书,医疗费票据,预付款收据,归案经过等相关证据证实,足以认定。\n", "label": ["负"], "task": "CLS" }
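Each case-element file appears to hold one JSON object per line (the prediction script later in this post reads the files that way). With only a few dozen records per subtask, it helps to check the label balance first. The sketch below is illustrative only: `load_jsonl` and `label_counts` are hypothetical helper names, and the three in-memory lines stand in for a real file.

```python
import json
from collections import Counter


def load_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def label_counts(records):
    """Count how often each label occurs; `label` is a list, as in the sample above."""
    counts = Counter()
    for record in records:
        counts.update(record["label"])
    return counts


# Tiny in-memory stand-in for one case-element file (texts elided for brevity).
sample_lines = [
    '{"id": 1, "data": "...", "label": ["负"], "task": "CLS"}',
    '{"id": 2, "data": "...", "label": ["正"], "task": "CLS"}',
    '{"id": 3, "data": "...", "label": ["负"], "task": "CLS"}',
]
records = load_jsonl(sample_lines)
print(label_counts(records))
```

To use it on real data, pass an open file handle (opened with `encoding="utf8"`) instead of `sample_lines`.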
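2.2 Sentencing Tier (刑档)
The sentencing-tier data is in JSON format. Each dictionary contains the following fields:
- id: text ID.
- data: text content, the fact-description part of the judgment document.
- label: the sentencing tier of the document, one of 一档/二档/三档 (tier 1/2/3).
- task: task type, CLS.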
{ "id": 491, "data": "交通肇事罪[SEP]河北省[SEP]本院认为,被告人许祥东违反交通运输管理法规,发生重大交通事故,且肇事后逃逸致人死亡,其行为已构成交通肇事罪,公诉机关指控事实和罪名成立,予以支持。被告人许祥东到案后,如实供述犯罪事实,当庭自愿认罪认罚,可依法从轻处罚。公诉机关提出的对被告人许祥东的量刑意见以及被告人许祥东的辩护人提出的相关辩护意见予以采纳。为维护公共安全,保护公民人身权利,打击刑事犯罪。[SEP]经审理查明,2019年12月6日6时18分许,在308国道宁晋县处,被告人许祥东无机动车驾驶证驾驶冀E小型轿车由东南向西北行驶时,与由南向西北转弯的田某1驾驶自行车相撞,致田某1及其驾驶的自行车倒地,许祥东驾驶冀E小型轿车逃逸。后由东南向西北行驶的张某驾驶冀A冀A重型仓栅式半挂车又与倒在公路上的田某1及其自行车相撞,造成田某1死亡,三车不同程度损坏的交通事故。许祥东负事故的主要责任。经鉴定,田某1符合车祸致颅脑损伤合并肺脏、肝脏破裂死亡。", "label": ["三档"], "task": "CLS" }
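In the sample above, `data` packs four segments separated by `[SEP]`: what look like the charge, the province, the court's opinion, and the facts found at trial. That reading is inferred from this single sample, so the field names in the sketch below are assumptions, and `split_sentencing_text` is a hypothetical helper.

```python
def split_sentencing_text(data):
    """Split a sentencing-tier `data` string on the literal "[SEP]" marker."""
    parts = [part.strip() for part in data.split("[SEP]")]
    # Field names are illustrative guesses based on the single sample shown above.
    names = ["charge", "province", "court_opinion", "facts"]
    if len(parts) != len(names):
        # Fall back to the raw segments when the layout differs from the sample.
        return {"segments": parts}
    return dict(zip(names, parts))


# Shortened stand-in for the sample record's `data` field.
example = "交通肇事罪[SEP]河北省[SEP]本院认为,被告人违反交通运输管理法规[SEP]经审理查明,被告人无机动车驾驶证"
fields = split_sentencing_text(example)
print(fields["charge"], fields["province"])
```

The fine-tuning pipeline in this post feeds `data` to the model unsplit; separating the segments is only useful for inspection or for experimenting with shorter inputs.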
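2.3 Named Entity Recognition
The NER data is in JSON format. Each dictionary contains the following fields:
- id: unique identifier of the text.
- relations: an empty list that carries no information.
- text: text content, the fact-description part of the judgment document.
- entities: list of the entities annotated in the text. Each entry has:
  - label: entity label name.
  - start_offset: index where the entity starts.
  - end_offset: index where the entity ends.
- task: task type, NER.
The ten entity types used as label in the NER task are listed below. In the sample that follows, the numeric code appears in each entity's id field and the Chinese name in its label field.
label | Meaning |
11017 | 犯罪嫌疑人情况 (suspect's condition) |
11018 | 被害人 (victim) |
11019 | 被害人类型 (victim type) |
11020 | 犯罪嫌疑人交通工具 (suspect's vehicle) |
11021 | 犯罪嫌疑人交通工具情况 (condition of the suspect's vehicle) |
11022 | 被害人交通工具情况 (condition of the victim's vehicle) |
11023 | 犯罪嫌疑人责任认定 (suspect's liability determination) |
11024 | 被害人责任认定 (victim's liability determination) |
11025 | 事故发生地 (accident location) |
11027 | 被害人交通工具 (victim's vehicle) |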
{
  "id": 139,
  "relations": [],
  "text": "经审理查明:2016年10月12日10时50分许,被告人昌某无驾驶资格驾驶无号牌机动三轮车沿某政府1门前路段由东向西行驶,行驶过程中该机动三轮车车厢中人员王某2从车上掉落受伤。2016年10月13日,被告人昌某被某政府2处以行政拘留五日、罚款500元的行政处罚;2016年10月16日被害人王某2经医院抢救无效死亡。经某政府2交通警察支队二大队认定,被告人昌某承担本次事故的全部责任。2016年11月30日,被告人昌某接公安机关电话通知后主动到案。",
  "entities": [
    {"id": 11017, "start_offset": 30, "end_offset": 35, "label": "犯罪嫌疑人情况"},
    {"id": 11018, "start_offset": 77, "end_offset": 80, "label": "被害人"},
    {"id": 11018, "start_offset": 145, "end_offset": 148, "label": "被害人"},
    {"id": 11020, "start_offset": 40, "end_offset": 45, "label": "犯罪嫌疑人交通工具"},
    {"id": 11021, "start_offset": 37, "end_offset": 40, "label": "犯罪嫌疑人交通工具情况"},
    {"id": 11023, "start_offset": 187, "end_offset": 191, "label": "犯罪嫌疑人责任认定"},
    {"id": 11019, "start_offset": 72, "end_offset": 77, "label": "被害人类型"},
    {"id": 11025, "start_offset": 46, "end_offset": 54, "label": "事故发生地"}
  ],
  "task": "NER"
}
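The offsets in this sample behave like Python slice bounds: `start_offset` is inclusive and `end_offset` is exclusive, so `text[start_offset:end_offset]` yields the entity's surface text. The sketch below checks this on the opening of the sample text with three of its entities; `entity_spans` is a hypothetical helper name.

```python
def entity_spans(record):
    """Return (label, surface text) pairs by slicing `text` with the entity offsets."""
    text = record["text"]
    return [
        (ent["label"], text[ent["start_offset"]:ent["end_offset"]])
        for ent in record["entities"]
    ]


# Opening of the sample text above, with three of its annotated entities.
record = {
    "text": "经审理查明:2016年10月12日10时50分许,被告人昌某无驾驶资格驾驶无号牌机动三轮车沿某政府1门前路段由东向西行驶",
    "entities": [
        {"id": 11017, "start_offset": 30, "end_offset": 35, "label": "犯罪嫌疑人情况"},
        {"id": 11021, "start_offset": 37, "end_offset": 40, "label": "犯罪嫌疑人交通工具情况"},
        {"id": 11020, "start_offset": 40, "end_offset": 45, "label": "犯罪嫌疑人交通工具"},
    ],
}
for label, span in entity_spans(record):
    print(label, span)
```

Running the same check over every training record is a cheap way to catch misaligned annotations before fine-tuning.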
II. Data Processing
1. Unzip
!unzip -qoa data/data173002/PaddleNLP-develop.zip
!unzip -qoa data/data173234/1666234655361.zip
2. Inspect the Data
!tree 训练集
训练集
├── ner
├── 案件要素
│   ├── 被害人被后车撞击
│   ├── 被害人闯红灯
│   ├── 被害人为本车人员
│   ├── 交通肇事后逃逸
│   ├── 全部责任
│   ├── 肇事车辆超速行驶
│   ├── 肇事车辆逆行
│   └── 中型客车交通肇事
└── 刑档

1 directory, 10 files
!tree 测试集_选手
测试集_选手
├── ner
├── 案件要素
│   ├── 被害人被后车撞击
│   ├── 被害人闯红灯
│   ├── 被害人为本车人员
│   ├── 交通肇事后逃逸
│   ├── 全部责任
│   ├── 肇事车辆超速行驶
│   ├── 肇事车辆逆行
│   └── 中型客车交通肇事
└── 刑档

1 directory, 10 files
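3. Data Format Conversion
- Convert to the doccano data format
- Split the data 8:2 into a training set and a held-out set (the log below shows the held-out 20% written to dev.txt, with test.txt left empty)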
!python doccano.py
[2022-10-20 11:57:46,145] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,196] [ INFO] - Adding negative samples for first stage prompt...
[2022-10-20 11:57:46,198] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,200] [ INFO] - Adding negative samples for first stage prompt...
[2022-10-20 11:57:46,200] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,201] [ INFO] - Adding negative samples for first stage prompt...
[2022-10-20 11:57:46,213] [ INFO] - Save 1188 examples to ./data_save/train.txt.
[2022-10-20 11:57:46,216] [ INFO] - Save 300 examples to ./data_save/dev.txt.
[2022-10-20 11:57:46,216] [ INFO] - Save 0 examples to ./data_save/test.txt.
[2022-10-20 11:57:46,216] [ INFO] - Finished! It takes 0.07 seconds
训练集/案件要素/交通肇事后逃逸
[2022-10-20 11:57:46,217] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,218] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,218] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,219] [ INFO] - Save 29 examples to ./data_save/train.txt.
[2022-10-20 11:57:46,219] [ INFO] - Save 8 examples to ./data_save/dev.txt.
[2022-10-20 11:57:46,219] [ INFO] - Save 0 examples to ./data_save/test.txt.
[2022-10-20 11:57:46,219] [ INFO] - Finished! It takes 0.00 seconds
训练集/案件要素/肇事车辆超速行驶
[2022-10-20 11:57:46,220] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,221] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,221] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,222] [ INFO] - Save 27 examples to ./data_save/train.txt.
[2022-10-20 11:57:46,222] [ INFO] - Save 7 examples to ./data_save/dev.txt.
[2022-10-20 11:57:46,222] [ INFO] - Save 0 examples to ./data_save/test.txt.
[2022-10-20 11:57:46,222] [ INFO] - Finished! It takes 0.00 seconds
训练集/案件要素/被害人闯红灯
[2022-10-20 11:57:46,223] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,224] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,224] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,225] [ INFO] - Save 31 examples to ./data_save/train.txt.
[2022-10-20 11:57:46,225] [ INFO] - Save 8 examples to ./data_save/dev.txt.
[2022-10-20 11:57:46,225] [ INFO] - Save 0 examples to ./data_save/test.txt.
[2022-10-20 11:57:46,225] [ INFO] - Finished! It takes 0.00 seconds
训练集/案件要素/被害人为本车人员
[2022-10-20 11:57:46,226] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,227] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,228] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,229] [ INFO] - Save 35 examples to ./data_save/train.txt.
[2022-10-20 11:57:46,229] [ INFO] - Save 9 examples to ./data_save/dev.txt.
[2022-10-20 11:57:46,229] [ INFO] - Save 0 examples to ./data_save/test.txt.
[2022-10-20 11:57:46,229] [ INFO] - Finished! It takes 0.00 seconds
训练集/案件要素/被害人被后车撞击
[2022-10-20 11:57:46,230] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,231] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,231] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,232] [ INFO] - Save 32 examples to ./data_save/train.txt.
[2022-10-20 11:57:46,232] [ INFO] - Save 8 examples to ./data_save/dev.txt.
[2022-10-20 11:57:46,232] [ INFO] - Save 0 examples to ./data_save/test.txt.
[2022-10-20 11:57:46,232] [ INFO] - Finished! It takes 0.00 seconds
训练集/案件要素/肇事车辆逆行
[2022-10-20 11:57:46,233] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,234] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,234] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,235] [ INFO] - Save 29 examples to ./data_save/train.txt.
[2022-10-20 11:57:46,235] [ INFO] - Save 8 examples to ./data_save/dev.txt.
[2022-10-20 11:57:46,235] [ INFO] - Save 0 examples to ./data_save/test.txt.
[2022-10-20 11:57:46,235] [ INFO] - Finished! It takes 0.00 seconds
训练集/案件要素/中型客车交通肇事
[2022-10-20 11:57:46,236] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,237] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,237] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,238] [ INFO] - Save 30 examples to ./data_save/train.txt.
[2022-10-20 11:57:46,238] [ INFO] - Save 8 examples to ./data_save/dev.txt.
[2022-10-20 11:57:46,238] [ INFO] - Save 0 examples to ./data_save/test.txt.
[2022-10-20 11:57:46,238] [ INFO] - Finished! It takes 0.00 seconds
训练集/案件要素/全部责任
[2022-10-20 11:57:46,239] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,240] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,240] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,241] [ INFO] - Save 28 examples to ./data_save/train.txt.
[2022-10-20 11:57:46,241] [ INFO] - Save 7 examples to ./data_save/dev.txt.
[2022-10-20 11:57:46,241] [ INFO] - Save 0 examples to ./data_save/test.txt.
[2022-10-20 11:57:46,241] [ INFO] - Finished! It takes 0.00 seconds
训练集/刑档
[2022-10-20 11:57:46,242] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,243] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,244] [ INFO] - Converting doccano data...
[2022-10-20 11:57:46,245] [ INFO] - Save 72 examples to ./data_save/train.txt.
[2022-10-20 11:57:46,245] [ INFO] - Save 18 examples to ./data_save/dev.txt.
[2022-10-20 11:57:46,245] [ INFO] - Save 0 examples to ./data_save/test.txt.
[2022-10-20 11:57:46,246] [ INFO] - Finished! It takes 0.00 seconds
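The "Adding negative samples" messages reflect how prompt-based extraction data is built: each annotated text is paired with a prompt (here, an entity type), and prompts with no matching span become negative examples. The sketch below illustrates the idea under the assumption that each training line is a JSON object with `content`, `result_list` and `prompt` keys. It is not the actual doccano.py logic, and `ner_to_prompt_examples` is a hypothetical name.

```python
import json


def ner_to_prompt_examples(record, entity_types):
    """Expand one annotated record into one prompt example per entity type."""
    text = record["text"]
    examples = []
    for entity_type in entity_types:
        result_list = [
            {
                "text": text[ent["start_offset"]:ent["end_offset"]],
                "start": ent["start_offset"],
                "end": ent["end_offset"],
            }
            for ent in record["entities"]
            if ent["label"] == entity_type
        ]
        # An empty result_list makes this prompt a negative sample for the text.
        examples.append({"content": text, "result_list": result_list, "prompt": entity_type})
    return examples


# Shortened stand-in record with a single annotated entity.
record = {
    "text": "被告人昌某无驾驶资格驾驶无号牌机动三轮车",
    "entities": [{"start_offset": 5, "end_offset": 10, "label": "犯罪嫌疑人情况"}],
}
examples = ner_to_prompt_examples(record, ["犯罪嫌疑人情况", "被害人"])
for example in examples:
    print(json.dumps(example, ensure_ascii=False))
```

This simplified version keeps every absent type as a negative. The real script may sample negatives instead, so its example counts (such as the 1188 training examples saved for the ner task in the log above) need not equal documents times entity types.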
!tree data_save
data_save
├── dev.txt
├── test.txt
└── train.txt

0 directories, 3 files
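The 8:2 split described above can be sketched as a shuffle followed by a cut. This is a minimal illustration, not the conversion script's own code: `split_train_dev`, the fixed seed and the truncating cut are all assumptions.

```python
import random


def split_train_dev(records, train_ratio=0.8, seed=42):
    """Shuffle a copy of `records` and cut it into train and dev parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# 90 placeholder items, the size of the sentencing-tier (刑档) training data.
train, dev = split_train_dev(range(90))
print(len(train), len(dev))  # 72 18
```

With 90 records this yields 72 and 18, the train and dev sizes the log reports for 刑档. Fixing the seed keeps the split reproducible across runs, which matters when dev F1 is used to pick the best checkpoint.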
III. Model Training
1. Upgrade paddlenlp
# Upgrade paddlenlp
!pip install -q -U paddlenlp
!pip list|grep paddlenlp
2. Fine-tune the Model
%cd ~
!python PaddleNLP-develop/model_zoo/uie/finetune.py \
    --train_path data_save/train.txt \
    --dev_path data_save/dev.txt \
    --save_dir ./checkpoint \
    --learning_rate 5e-6 \
    --model uie-base \
    --batch_size 40 \
    --logging_steps 10 \
    --valid_steps 20 \
    --num_epochs 100 \
    --device gpu
[2022-10-20 12:12:54,048] [ INFO] - global step 10, epoch: 1, loss: 0.00643, speed: 0.68 step/s
[2022-10-20 12:13:07,360] [ INFO] - global step 20, epoch: 1, loss: 0.00553, speed: 0.75 step/s
[2022-10-20 12:13:15,567] [ INFO] - Evaluation precision: 0.57273, recall: 0.30657, F1: 0.39937
[2022-10-20 12:13:15,567] [ INFO] - best F1 performence has been updated: 0.00000 --> 0.39937
[2022-10-20 12:13:29,955] [ INFO] - global step 30, epoch: 1, loss: 0.00503, speed: 0.74 step/s
[2022-10-20 12:13:42,766] [ INFO] - global step 40, epoch: 1, loss: 0.00459, speed: 0.78 step/s
[2022-10-20 12:13:50,974] [ INFO] - Evaluation precision: 0.69167, recall: 0.60584, F1: 0.64591
[2022-10-20 12:13:50,974] [ INFO] - best F1 performence has been updated: 0.39937 --> 0.64591
[2022-10-20 12:14:08,071] [ INFO] - global step 50, epoch: 2, loss: 0.00430, speed: 0.75 step/s
[2022-10-20 12:14:21,344] [ INFO] - global step 60, epoch: 2, loss: 0.00403, speed: 0.75 step/s
[2022-10-20 12:14:30,020] [ INFO] - Evaluation precision: 0.74550, recall: 0.70560, F1: 0.72500
[2022-10-20 12:14:30,020] [ INFO] - best F1 performence has been updated: 0.64591 --> 0.72500
[2022-10-20 12:14:50,562] [ INFO] - global step 70, epoch: 2, loss: 0.00383, speed: 0.75 step/s
[2022-10-20 12:15:03,857] [ INFO] - global step 80, epoch: 2, loss: 0.00368, speed: 0.75 step/s
[2022-10-20 12:15:12,047] [ INFO] - Evaluation precision: 0.76684, recall: 0.72019, F1: 0.74279
[2022-10-20 12:15:12,047] [ INFO] - best F1 performence has been updated: 0.72500 --> 0.74279
3. Model Evaluation
!python PaddleNLP-develop/model_zoo/uie/evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path data_save/dev.txt \
    --batch_size 32 \
    --max_seq_len 512
[2022-10-20 12:58:50,085] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
W1020 12:58:50.112598 489 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W1020 12:58:50.115797 489 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022-10-20 12:59:00,970] [ INFO] - -----------------------------
[2022-10-20 12:59:00,971] [ INFO] - Class Name: all_classes
[2022-10-20 12:59:00,971] [ INFO] - Evaluation Precision: 0.83377 | Recall: 0.76886 | F1: 0.80000
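The reported F1 is the harmonic mean of precision and recall, which can be checked directly against the numbers in the log. The helper name `f1_score` below is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the evaluation log above.
print(f"{f1_score(0.83377, 0.76886):.5f}")
```

The same formula reproduces the intermediate values in the training log, for example precision 0.57273 and recall 0.30657 give roughly 0.39937.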
IV. Prediction
1. Predict ner
# coding: utf-8
import json
import os

from paddlenlp import Taskflow

# Entity types passed to UIE as the NER schema
NER_SCHEMA = ['犯罪嫌疑人情况', '被害人', '被害人类型', '犯罪嫌疑人交通工具', '犯罪嫌疑人交通工具情况',
              '被害人交通工具情况', '犯罪嫌疑人责任认定', '被害人责任认定', '事故发生地', '被害人交通工具']


def read_jsonl(file_path):
    # Each line of the file is one JSON object
    with open(file_path, encoding='utf8') as f:
        return [json.loads(line) for line in f]


def data_read(path):
    # Map each task name to its records; 案件要素 is a directory with one file per subtask
    data_dir = {}
    for file_name in os.listdir(path):
        if file_name != '案件要素':
            data_dir[file_name] = read_jsonl(os.path.join(path, file_name))
        else:
            for case_plot_name in os.listdir(os.path.join(path, file_name)):
                data_dir[case_plot_name] = read_jsonl(os.path.join(path, file_name, case_plot_name))
    return data_dir


def write_line(output, line, text_key, label):
    # One JSON object per line, keeping Chinese characters unescaped
    row = {'id': line['id'], 'data': line[text_key], 'task': line['task'], 'label': label}
    output.write(json.dumps(row, ensure_ascii=False) + '\n')


def predict(input_dir, output_file):
    data = data_read(input_dir)
    ie = Taskflow(task='information_extraction', schema=[], task_path='checkpoint/model_best')
    with open(output_file, 'w', encoding='utf8') as output:
        for task_name, task_data_list in data.items():
            if task_name == 'ner':
                # NER task: collect [start, end] pairs for every entity type
                ie.set_schema(NER_SCHEMA)
                for line in task_data_list:
                    result = ie(line['text'].strip())
                    p_lab = {}
                    for entity_type in NER_SCHEMA:
                        if entity_type not in result[0]:
                            p_lab[entity_type] = [[]]
                            continue
                        p_lab[entity_type] = [[samp['start'], samp['end']] for samp in result[0][entity_type]]
                    write_line(output, line, 'text', p_lab)
            else:
                # Classification tasks: the prompt is the task name plus the candidate labels
                if task_name == '刑档':
                    prompt = task_name + '[一档,二档,三档]'
                else:
                    prompt = task_name + '[正,负]'
                ie.set_schema(prompt)
                for line in task_data_list:
                    try:
                        result = ie(line['data'].strip())
                        p_lab = [[result[0][prompt][0]['text']]]
                    except Exception:
                        # No usable prediction for this prompt: write an empty label
                        p_lab = [[]]
                    write_line(output, line, 'data', p_lab)
predict('测试集_选手/', 'result.txt')
[2022-10-20 16:34:32,072] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'checkpoint/model_best'.
[{'犯罪嫌疑人情况': [{'text': '酒后', 'start': 34, 'end': 36, 'probability': 0.8422301310098987}, {'text': '无证', '
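The NER branch of `predict` turns each extraction result into a mapping from entity type to `[start, end]` pairs. Pulling that step into a pure function makes it testable without loading the model. In the sketch below, `spans_by_type` is a hypothetical name and `fake_result` is a hand-written stand-in shaped like one element of the output above, with the probability rounded.

```python
ENTITY_TYPES = ["犯罪嫌疑人情况", "被害人"]  # shortened list for illustration


def spans_by_type(result, entity_types):
    """Map each entity type to its [start, end] pairs, or [[]] when the type is absent."""
    labels = {}
    for entity_type in entity_types:
        if entity_type not in result:
            # Mirror the script above: an absent type is written as a single empty pair.
            labels[entity_type] = [[]]
        else:
            labels[entity_type] = [[hit["start"], hit["end"]] for hit in result[entity_type]]
    return labels


# Hand-written stand-in for one element of the Taskflow output (not real model output).
fake_result = {
    "犯罪嫌疑人情况": [{"text": "酒后", "start": 34, "end": 36, "probability": 0.84}],
}
print(spans_by_type(fake_result, ENTITY_TYPES))
```

Only the offsets are kept; the predicted text and probability are dropped, matching what the script writes to result.txt.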