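I. Introduction to the Low-Resource and Incremental-Type Named Entity Recognition Challenge
This is a competition baseline written with the do-everything PaddleNLP. It is my first submission, and although the score is fairly low it is passable; the main reason is that the preliminary-round dataset covers too narrow a range.
Competition URL: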
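1. Data overview
The data for this task focuses on the equipment domain. It was collected and organized from the following three kinds of sources, which gives it a degree of authority and domain value:
- Open-source information: data was collected from mainstream domestic and international news sites, Baidu Baike, Wikipedia, weapons encyclopedias, and other open-source information sites. Chinese sources were given priority, and foreign-language data was translated to obtain intelligence data;
- Think-tank reports: papers and reports containing equipment intelligence were obtained from think-tank websites;
- Internal results: result-related documents were obtained from internal sites such as domestic defense-industry enterprises, research institutes, domestic general libraries, digital libraries, and defense-institute libraries, then analyzed and organized.
After collecting sufficient raw unlabeled data from these sources, the organizers first filtered out off-topic, untrue, and biased data by combining manual review with automated methods such as keyword matching. They then cleaned invalid and illegal characters and filtered out overly long texts and texts containing no domain entities. Next, the text was annotated with a label scheme drawn up with reference to authoritative equipment standards and treatises, and models from earlier research in related fields were used to pre-label the data. Finally, samples whose type distribution met the task requirements were selected as the raw dataset.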
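2. Data description
• Preliminary-round data: the dataset contains about 6,000 samples covering the following 9 entity types: 飞行器 (aircraft), 单兵武器 (individual weapons), 炸弹 (bombs), 装甲车辆 (armored vehicles), 火炮 (artillery), 导弹 (missiles), 舰船舰艇 (ships and vessels), 太空装备 (space equipment), 其他武器装备 (other weapons and equipment). Following the task setup used in low-resource learning, about 50 sample cases were drawn from the raw dataset for each type, forming a training set of 97 annotated samples in total (each sample may contain several entities and entity types); all remaining samples go to the test set. All data files are encoded in UTF-8.
File type | File name | Description |
Training set | ner_train.json | 97 annotated samples. Each sample contains the sample id (sample_id), the original text (text), and a list of annotated entities (annotations). Each list element corresponds to one entity and includes its type (type), text (text), span start position (start), and end position (end) |
Test set | ner_test.json | 5,920 unlabeled samples. Each sample contains the sample id (sample_id) and the original text (text) |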
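To make the record layout concrete, here is a minimal sketch that checks one annotated record against the fields listed in the table. The record is hand-built for illustration (it reuses a sentence from the test set and is not an actual line of ner_train.json), check_spans is a hypothetical helper, and the end offset is assumed to be exclusive.

```python
def check_spans(sample):
    """Return the annotations whose [start, end) slice differs from their text."""
    text = sample["text"]
    return [
        ann for ann in sample["annotations"]
        if text[ann["start"]:ann["end"]] != ann["text"]
    ]


# Hand-built record in the documented layout (illustrative, not real training data).
sample = {
    "sample_id": 0,
    "text": "第五艘西班牙海军F-100级护卫舰即将装备集成通信控制系统。",
    "annotations": [
        {"type": "舰船舰艇", "text": "F-100级护卫舰", "start": 8, "end": 17},
    ],
}

mismatches = check_spans(sample)
print(f"{len(mismatches)} mismatched annotation(s)")
```

Running a check like this over the whole training file before conversion is a cheap way to catch offset conventions that differ from the assumption.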
II. Data processing
1. Inspecting the data
!ls data/data218296/
ner_test.json ner_train.json
- As described above, 9 entity types need to be extracted: 飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备
- The training set currently contains 97 annotated samples
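With so few samples it helps to see how the entity types are distributed. The following is a minimal sketch of such a count; count_entity_types is a hypothetical helper and the two toy records are made up for illustration. In the notebook it would be applied to the list loaded from ner_train.json.

```python
from collections import Counter


def count_entity_types(samples):
    """Count annotated entities per type across all samples."""
    counts = Counter()
    for sample in samples:
        for ann in sample["annotations"]:
            counts[ann["type"]] += 1
    return counts


# Toy records in the training-set layout (made up for illustration).
toy_samples = [
    {"sample_id": 0, "text": "...", "annotations": [
        {"type": "舰船舰艇", "text": "A", "start": 0, "end": 1},
        {"type": "导弹", "text": "B", "start": 1, "end": 2},
    ]},
    {"sample_id": 1, "text": "...", "annotations": [
        {"type": "舰船舰艇", "text": "C", "start": 0, "end": 1},
    ]},
]

type_counts = count_entity_types(toy_samples)
for entity_type, n in type_counts.most_common():
    print(entity_type, n)
```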
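2. Dataset format conversion and dataset split
The main steps are:
- Format conversion: data is usually annotated with doccano and converted afterwards. Here I convert the files directly into the format I need.
- Train/dev split: the annotated data is split into a training set and a dev set at 75:25 (the code uses a ratio of 0.75)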
%cd ~
import json
import random

# Read the JSON file
with open('data/data218296/ner_train.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
print(len(data))

# Shuffle, then split 75:25 into train and dev
data_len = len(data)
random.shuffle(data)
train_data = data[:int(data_len * 0.75)]
dev_data = data[int(data_len * 0.75):]
/home/aistudio 97
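Because random.shuffle is called without a seed, the train/dev split changes on every run, which makes dev scores hard to compare between runs. A minimal sketch of a seeded variant follows; split_train_dev and the seed value 42 are illustrative choices, not part of the original notebook.

```python
import random


def split_train_dev(samples, train_ratio=0.75, seed=42):
    """Shuffle a copy of samples with a fixed seed and split it by train_ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # private RNG, global state untouched
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 97 loaded samples.
train_part, dev_part = split_train_dev(range(97))
print(len(train_part), len(dev_part))  # 72 25
```

Using a private random.Random instance keeps the split reproducible even if other cells also draw random numbers.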
%cd ~
import json
import random

# Schema: the nine entity types
key_words = "飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备".split(", ")


# Convert the competition format into UIE prompt-style samples
# (train_data and dev_data come from the split in the previous cell)
def convert(source_data, key_word):
    convert_target = []
    for item in source_data:
        # Collect the annotated spans of this entity type for one sample
        result_list = []
        for item2 in item["annotations"]:
            if item2['type'] == key_word:
                result_list.append({
                    'text': item2['text'],
                    'start': item2['start'],
                    'end': item2['end'],
                })
        # Build one record; result_list stays empty if the type does not occur
        convert_target.append({
            'content': item['text'],
            'result_list': result_list,
            'prompt': key_word,
        })
    return convert_target


def convert_main(data, key_words):
    result = []
    for key_word in key_words:
        result = result + convert(data, key_word)
    random.shuffle(result)
    return result


# Converted lists
train_data_convert = convert_main(train_data, key_words)
dev_data_convert = convert_main(dev_data, key_words)

# Write one JSON object per line (JSON Lines)
with open('train.txt', 'w', encoding="utf-8") as f:
    for item in train_data_convert:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

with open('dev.txt', 'w', encoding="utf-8") as f:
    for item in dev_data_convert:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
/home/aistudio
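To see what the conversion produces, the sketch below expands one hand-built source sample into prompt-style records, one per entity type. to_prompt_samples is a hypothetical helper that mirrors the conversion logic for a single sample, only two of the nine types are used for brevity, and the sample itself is made up.

```python
import json


def to_prompt_samples(sample, entity_types):
    """Expand one annotated sample into one prompt record per entity type."""
    records = []
    for entity_type in entity_types:
        spans = [
            {"text": a["text"], "start": a["start"], "end": a["end"]}
            for a in sample["annotations"]
            if a["type"] == entity_type
        ]
        records.append(
            {"content": sample["text"], "result_list": spans, "prompt": entity_type}
        )
    return records


# Hand-built sample (illustrative only).
sample = {
    "sample_id": 0,
    "text": "第五艘西班牙海军F-100级护卫舰即将装备集成通信控制系统。",
    "annotations": [
        {"type": "舰船舰艇", "text": "F-100级护卫舰", "start": 8, "end": 17},
    ],
}

records = to_prompt_samples(sample, ["飞行器", "舰船舰艇"])
for record in records:
    print(json.dumps(record, ensure_ascii=False))
```

Note that a type absent from the sample still yields a record with an empty result_list, so with nine types most lines in train.txt carry no spans.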
III. Model training
1. Environment setup
This mainly consists of downloading and installing PaddleNLP.
# Download PaddleNLP with git
!git clone https://gitee.com/paddlepaddle/PaddleNLP.git --depth=1
Cloning into 'PaddleNLP'...
remote: Enumerating objects: 5825, done.
remote: Counting objects: 100% (5825/5825), done.
remote: Compressing objects: 100% (4099/4099), done.
remote: Total 5825 (delta 2254), reused 3581 (delta 1437), pack-reused 0
Receiving objects: 100% (5825/5825), 22.98 MiB | 1.19 MiB/s, done.
Resolving deltas: 100% (2254/2254), done.
Checking connectivity... done.
# Install or upgrade PaddleNLP
%cd ~/PaddleNLP
!pip install -U -e ./
from IPython.display import clear_output
clear_output()  # clear the long install output
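2. Model fine-tuning
The Trainer API is recommended for fine-tuning the model. Given just the model, the dataset, and so on, the Trainer API handles pre-training, fine-tuning, and model compression efficiently, and it can start multi-GPU training, mixed-precision training, gradient accumulation, checkpoint resumption, and logging with a single command. The Trainer API also wraps the common training configuration, such as the optimizer and learning-rate scheduling.
Configurable parameters: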
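- model_name_or_path: required. The pretrained model used for few-shot training. Options are "uie-base", "uie-medium", "uie-mini", "uie-micro", "uie-nano", "uie-m-base", and "uie-m-large".
- multilingual: whether the model is cross-lingual. Models fine-tuned from "uie-m-base", "uie-m-large", and similar are also multilingual and need this set to True; defaults to False.
- output_dir: required. The directory where the model is saved after training or compression; defaults to None.
- device: training device, one of 'cpu', 'gpu', or 'npu'; defaults to GPU training.
- per_device_train_batch_size: batch size for training. Adjust it to the available GPU memory and lower it if you run out of memory; defaults to 32.
- per_device_eval_batch_size: batch size for dev-set evaluation. Adjust it to the available GPU memory and lower it if you run out of memory; defaults to 32.
- learning_rate: maximum learning rate for training. 1e-5 is recommended for UIE; defaults to 3e-5.
- num_train_epochs: number of training epochs. 100 can be chosen when using early stopping; defaults to 10.
- logging_steps: interval, in steps, between log prints during training; defaults to 100.
- save_steps: interval, in steps, between model checkpoint saves during training; defaults to 100.
- seed: global random seed; defaults to 42.
- weight_decay: weight decay applied to all layers except bias and LayerNorm weights. Optional; defaults to 0.0.
- do_train: whether to run fine-tuning. Setting this flag enables fine-tuning; it is unset by default.
- do_eval: whether to run evaluation. Setting this flag enables evaluation.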
Because the example command sets --do_eval, evaluation runs automatically once training finishes.
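The step-based settings (logging_steps, save_steps) are easier to choose once you know roughly how many steps one epoch takes. The following back-of-the-envelope sketch assumes 72 training samples (int(97 * 0.75) from the split), one prompt record per entity type (9), a single GPU, no gradient accumulation, and no extra splitting of long texts; the batch size of 16 and the 32 epochs are the values this notebook passes to finetune.py. training_steps is a hypothetical helper.

```python
import math


def training_steps(n_samples, n_prompts, batch_size, epochs):
    """Return (examples, steps per epoch, total steps) under the stated assumptions."""
    examples = n_samples * n_prompts
    steps_per_epoch = math.ceil(examples / batch_size)
    return examples, steps_per_epoch, steps_per_epoch * epochs


examples, per_epoch, total = training_steps(
    n_samples=72, n_prompts=9, batch_size=16, epochs=32
)
print(examples, per_epoch, total)  # 648 41 1312
```

Under these assumptions an epoch is about 41 steps, so saving or evaluating every 50 steps happens a little less than once per epoch.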
%cd ~/PaddleNLP/model_zoo/uie/
# For multi-GPU training, start the command with:
#   python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune.py
!python finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 50 \
    --eval_steps 50 \
    --seed 1000 \
    --model_name_or_path uie-medium \
    --output_dir ./checkpoint/model_best \
    --train_path ~/train.txt \
    --dev_path ~/dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size 16 \
    --num_train_epochs 32 \
    --learning_rate 1e-5 \
    --label_names "start_positions" "end_positions" \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-19 16:20:21,620] [ WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[2023-06-19 16:20:21,620] [ INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2023-06-19 16:20:21,621] [ INFO] - ============================================================
[2023-06-19 16:20:21,621] [ INFO] - Model Configuration Arguments
[2023-06-19 16:20:21,621] [ INFO] - paddle commit id :3fa7a736e32508e797616b6344d97814c37d3ff8
[2023-06-19 16:20:21,621] [ INFO] - export_model_dir :./checkpoint/model_best
[2023-06-19 16:20:21,621] [ INFO] - model_name_or_path :uie-medium
[2023-06-19 16:20:21,621] [ INFO] - multilingual :False
[2023-06-19 16:20:21,621] [ INFO] -
[2023-06-19 16:20:21,621] [ INFO] - ============================================================
[2023-06-19 16:20:21,621] [ INFO] - Data Configuration Arguments
[2023-06-19 16:20:21,621] [ INFO] - paddle commit id :3fa7a736e32508e797616b6344d97814c37d3ff8
[2023-06-19 16:20:21,621] [ INFO] - dev_path :/home/aistudio/dev.txt
[2023-06-19 16:20:21,621] [ INFO] - dynamic_max_length :None
[2023-06-19 16:20:21,621] [ INFO] - max_seq_length :512
[2023-06-19 16:20:21,622] [ INFO] - train_path :/home/aistudio/train.txt
[2023-06-19 16:20:21,622] [ INFO] -
[2023-06-19 16:20:21,622] [ WARNING] - Process rank: -1, device: gpu, world_size: 1, distributed training: False, 16-bits training: False
[2023-06-19 16:20:21,622] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'uie-medium'.
[2023-06-19 16:20:21,622] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh_vocab.txt and saved to /home/aistudio/.paddlenlp/models/uie-medium
[2023-06-19 16:20:21,752] [ INFO] - Downloading ernie_3.0_medium_zh_vocab.txt from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh_vocab.txt
100%|█████████████████████████████████████████| 182k/182k [00:00<00:00, 889kB/s]
[2023-06-19 16:20:22,092] [ INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/uie-medium/tokenizer_config.json
[2023-06-19 16:20:22,092] [ INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/uie-medium/special_tokens_map.json
[2023-06-19 16:20:22,093] [ INFO] - Model config ErnieConfig { "attention_probs_dropout_prob": 0.1, "enable_recompute": false, "fuse": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 2048, "model_type": "ernie", "num_attention_heads": 12, "num_hidden_layers": 6, "pad_token_id": 0, "paddlenlp_version": null, "pool_act": "tanh", "task_id": 0, "task_type_vocab_size": 16, "type_vocab_size": 4, "use_task_id": true, "vocab_size": 40000}
[2023-06-19 16:20:22,094] [ INFO] - Configuration saved in /home/aistudio/.paddlenlp/models/uie-medium/config.json
[2023-06-19 16:20:22,095] [ INFO] - Downloading uie_medium.pdparams from https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_medium.pdparams
100%|████████████████████████████████████████| 288M/288M [02:40<00:00, 1.88MB/s]
W0619 16:23:05.005470 1653 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0619 16:23:05.010092 1653 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-19 16:23:05,669] [ INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-19 16:23:05,669] [ INFO] - All the weights of UIE were initialized from the model checkpoint at uie-medium.If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
IV. Model evaluation
1. Evaluating the model
Configurable parameters:
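- model_path: path to the model folder to evaluate. It must contain the weights file model_state.pdparams and the config file model_config.json.
- test_path: the test set file to evaluate on.
- batch_size: batch size; adjust it to your machine; defaults to 16.
- max_seq_len: maximum split length for text. Inputs longer than this are split automatically; defaults to 512.
- debug: whether to enable debug mode, which evaluates each positive class separately. This mode is intended for model debugging only; off by default.
- multilingual: whether the model is cross-lingual; off by default.
- schema_lang: language of the schema, either ch or en. Defaults to ch; choose en for English datasets.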
Run the following commands to evaluate the model:
%cd ~/PaddleNLP/model_zoo/uie/
!python evaluate.py \
    --model_path ./checkpoint0.67789894/model_best \
    --test_path ~/dev.txt \
    --batch_size 16 \
    --max_seq_len 512
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 19:58:07,106] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint0.67789894/model_best'.
[2023-06-03 19:58:07,132] [ INFO] - loading configuration file ./checkpoint0.67789894/model_best/config.json
[2023-06-03 19:58:07,133] [ INFO] - Model config ErnieConfig { "architectures": [ "UIE" ], "attention_probs_dropout_prob": 0.1, "dtype": "float32", "enable_recompute": false, "fuse": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 2048, "model_type": "ernie", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 0, "paddlenlp_version": null, "pool_act": "tanh", "task_id": 0, "task_type_vocab_size": 3, "type_vocab_size": 4, "use_task_id": true, "vocab_size": 40000 }
W0603 19:58:09.367168 9655 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 19:58:09.370771 9655 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 19:58:10,168] [ INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 19:58:10,169] [ INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint0.67789894/model_best. If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 19:58:15,447] [ INFO] - -----------------------------
[2023-06-03 19:58:15,447] [ INFO] - Class Name: all_classes
[2023-06-03 19:58:15,447] [ INFO] - Evaluation Precision: 0.92920 | Recall: 0.86066 | F1: 0.89362
%cd ~/PaddleNLP/model_zoo/uie/
!python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ~/dev.txt \
    --batch_size 16 \
    --max_seq_len 512
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 19:58:59,933] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
[2023-06-03 19:58:59,958] [ INFO] - loading configuration file ./checkpoint/model_best/config.json
[2023-06-03 19:58:59,960] [ INFO] - Model config ErnieConfig { "architectures": [ "UIE" ], "attention_probs_dropout_prob": 0.1, "dtype": "float32", "enable_recompute": false, "fuse": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 2048, "model_type": "ernie", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 0, "paddlenlp_version": null, "pool_act": "tanh", "task_id": 0, "task_type_vocab_size": 3, "type_vocab_size": 4, "use_task_id": true, "vocab_size": 40000 }
W0603 19:59:02.195425 9944 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 19:59:02.199051 9944 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 19:59:02,995] [ INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 19:59:02,996] [ INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint/model_best. If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 19:59:08,312] [ INFO] - -----------------------------
[2023-06-03 19:59:08,312] [ INFO] - Class Name: all_classes
[2023-06-03 19:59:08,312] [ INFO] - Evaluation Precision: 0.89922 | Recall: 0.95082 | F1: 0.92430
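For readers who want to see how the three reported numbers relate, here is a minimal sketch of span-level precision, recall, and F1 with exact matching. It is a generic formulation, not PaddleNLP's own evaluator; span_prf and f1_score are hypothetical helpers, and the gold and predicted spans are made up. The F1 values logged above are consistent with the harmonic-mean formula used in f1_score.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def span_prf(gold_spans, pred_spans):
    """Exact-match precision, recall, and F1 over (start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Made-up spans: two of three predictions match two of three gold spans.
gold = [(8, 17), (57, 62), (95, 103)]
pred = [(8, 17), (95, 103), (70, 78)]
p, r, f1 = span_prf(gold, pred)
print(f"precision={p:.5f} recall={r:.5f} f1={f1:.5f}")

# Recompute F1 from the precision and recall logged for the first checkpoint.
logged_f1 = f1_score(0.92920, 0.86066)
print(f"{logged_f1:.5f}")
```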
2. Evaluating the model in debug mode
Enabling debug mode evaluates each positive class separately; this mode is intended for model debugging only:
%cd ~/PaddleNLP/model_zoo/uie/
!python evaluate.py \
    --model_path ./checkpoint0.67789894/model_best \
    --test_path ~/dev.txt \
    --debug
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 19:59:51,590] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint0.67789894/model_best'.
[2023-06-03 19:59:51,617] [ INFO] - loading configuration file ./checkpoint0.67789894/model_best/config.json
[2023-06-03 19:59:51,618] [ INFO] - Model config ErnieConfig { "architectures": [ "UIE" ], "attention_probs_dropout_prob": 0.1, "dtype": "float32", "enable_recompute": false, "fuse": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 2048, "model_type": "ernie", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 0, "paddlenlp_version": null, "pool_act": "tanh", "task_id": 0, "task_type_vocab_size": 3, "type_vocab_size": 4, "use_task_id": true, "vocab_size": 40000 }
W0603 19:59:53.878067 10161 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 19:59:53.881646 10161 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 19:59:54,673] [ INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 19:59:54,673] [ INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint0.67789894/model_best. If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 19:59:55,584] [ INFO] - -----------------------------
[2023-06-03 19:59:55,585] [ INFO] - Class Name: 炸弹
[2023-06-03 19:59:55,585] [ INFO] - Evaluation Precision: 0.86667 | Recall: 0.86667 | F1: 0.86667
[2023-06-03 19:59:55,660] [ INFO] - -----------------------------
[2023-06-03 19:59:55,660] [ INFO] - Class Name: 装甲车辆
[2023-06-03 19:59:55,660] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:55,809] [ INFO] - -----------------------------
[2023-06-03 19:59:55,809] [ INFO] - Class Name: 火炮
[2023-06-03 19:59:55,809] [ INFO] - Evaluation Precision: 0.84615 | Recall: 0.73333 | F1: 0.78571
[2023-06-03 19:59:55,903] [ INFO] - -----------------------------
[2023-06-03 19:59:55,904] [ INFO] - Class Name: 舰船舰艇
[2023-06-03 19:59:55,904] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:55,966] [ INFO] - -----------------------------
[2023-06-03 19:59:55,966] [ INFO] - Class Name: 飞行器
[2023-06-03 19:59:55,966] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:56,072] [ INFO] - -----------------------------
[2023-06-03 19:59:56,072] [ INFO] - Class Name: 单兵武器
[2023-06-03 19:59:56,072] [ INFO] - Evaluation Precision: 0.90000 | Recall: 0.75000 | F1: 0.81818
[2023-06-03 19:59:56,169] [ INFO] - -----------------------------
[2023-06-03 19:59:56,170] [ INFO] - Class Name: 太空装备
[2023-06-03 19:59:56,170] [ INFO] - Evaluation Precision: 0.95833 | Recall: 0.95833 | F1: 0.95833
[2023-06-03 19:59:56,227] [ INFO] - -----------------------------
[2023-06-03 19:59:56,228] [ INFO] - Class Name: 导弹
[2023-06-03 19:59:56,228] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:56,308] [ INFO] - -----------------------------
[2023-06-03 19:59:56,308] [ INFO] - Class Name: 其他武器装备
[2023-06-03 19:59:56,308] [ INFO] - Evaluation Precision: 1.00000 | Recall: 0.50000 | F1: 0.66667
%cd ~/PaddleNLP/model_zoo/uie/
!python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ~/dev.txt \
    --debug
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 20:00:12,238] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
[2023-06-03 20:00:12,263] [ INFO] - loading configuration file ./checkpoint/model_best/config.json
[2023-06-03 20:00:12,265] [ INFO] - Model config ErnieConfig { "architectures": [ "UIE" ], "attention_probs_dropout_prob": 0.1, "dtype": "float32", "enable_recompute": false, "fuse": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 2048, "model_type": "ernie", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 0, "paddlenlp_version": null, "pool_act": "tanh", "task_id": 0, "task_type_vocab_size": 3, "type_vocab_size": 4, "use_task_id": true, "vocab_size": 40000 }
W0603 20:00:14.519615 10292 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 20:00:14.523311 10292 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 20:00:15,331] [ INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 20:00:15,332] [ INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint/model_best. If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 20:00:16,208] [ INFO] - -----------------------------
[2023-06-03 20:00:16,208] [ INFO] - Class Name: 炸弹
[2023-06-03 20:00:16,208] [ INFO] - Evaluation Precision: 0.93750 | Recall: 1.00000 | F1: 0.96774
[2023-06-03 20:00:16,281] [ INFO] - -----------------------------
[2023-06-03 20:00:16,281] [ INFO] - Class Name: 装甲车辆
[2023-06-03 20:00:16,281] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 20:00:16,425] [ INFO] - -----------------------------
[2023-06-03 20:00:16,426] [ INFO] - Class Name: 火炮
[2023-06-03 20:00:16,426] [ INFO] - Evaluation Precision: 0.92308 | Recall: 0.80000 | F1: 0.85714
[2023-06-03 20:00:16,519] [ INFO] - -----------------------------
[2023-06-03 20:00:16,519] [ INFO] - Class Name: 舰船舰艇
[2023-06-03 20:00:16,519] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 20:00:16,579] [ INFO] - -----------------------------
[2023-06-03 20:00:16,579] [ INFO] - Class Name: 飞行器
[2023-06-03 20:00:16,579] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 20:00:16,681] [ INFO] - -----------------------------
[2023-06-03 20:00:16,681] [ INFO] - Class Name: 单兵武器
[2023-06-03 20:00:16,681] [ INFO] - Evaluation Precision: 0.91667 | Recall: 0.91667 | F1: 0.91667
[2023-06-03 20:00:16,775] [ INFO] - -----------------------------
[2023-06-03 20:00:16,775] [ INFO] - Class Name: 太空装备
[2023-06-03 20:00:16,775] [ INFO] - Evaluation Precision: 0.88889 | Recall: 1.00000 | F1: 0.94118
[2023-06-03 20:00:16,830] [ INFO] - -----------------------------
[2023-06-03 20:00:16,830] [ INFO] - Class Name: 导弹
[2023-06-03 20:00:16,830] [ INFO] - Evaluation Precision: 0.75000 | Recall: 1.00000 | F1: 0.85714
[2023-06-03 20:00:16,907] [ INFO] - -----------------------------
[2023-06-03 20:00:16,908] [ INFO] - Class Name: 其他武器装备
[2023-06-03 20:00:16,908] [ INFO] - Evaluation Precision: 0.80000 | Recall: 0.85714 | F1: 0.82759
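Comparing two debug runs by eye is tedious, so here is a minimal sketch that pulls the per-class scores out of the log text and prints the F1 change per class. parse_eval_log is a hypothetical helper; the two log strings are short excerpts from the debug outputs of the two checkpoints, with the colour codes removed, and the regular expression assumes the "Class Name" / "Evaluation Precision" layout seen there.

```python
import re

# A "Class Name" line followed by its "Evaluation Precision ..." line.
BLOCK_RE = re.compile(
    r"Class Name: (?P<name>\S+).*?"
    r"Precision: (?P<p>[\d.]+) \| Recall: (?P<r>[\d.]+) \| F1: (?P<f1>[\d.]+)",
    re.S,
)


def parse_eval_log(log_text):
    """Map each class name to its (precision, recall, f1) tuple."""
    return {
        m["name"]: (float(m["p"]), float(m["r"]), float(m["f1"]))
        for m in BLOCK_RE.finditer(log_text)
    }


log_a = """
[2023-06-03 19:59:55,809] [ INFO] - Class Name: 火炮
[2023-06-03 19:59:55,809] [ INFO] - Evaluation Precision: 0.84615 | Recall: 0.73333 | F1: 0.78571
[2023-06-03 19:59:56,308] [ INFO] - Class Name: 其他武器装备
[2023-06-03 19:59:56,308] [ INFO] - Evaluation Precision: 1.00000 | Recall: 0.50000 | F1: 0.66667
"""
log_b = """
[2023-06-03 20:00:16,426] [ INFO] - Class Name: 火炮
[2023-06-03 20:00:16,426] [ INFO] - Evaluation Precision: 0.92308 | Recall: 0.80000 | F1: 0.85714
[2023-06-03 20:00:16,908] [ INFO] - Class Name: 其他武器装备
[2023-06-03 20:00:16,908] [ INFO] - Evaluation Precision: 0.80000 | Recall: 0.85714 | F1: 0.82759
"""

scores_a = parse_eval_log(log_a)
scores_b = parse_eval_log(log_b)
for name in scores_a:
    delta = scores_b[name][2] - scores_a[name][2]
    print(f"{name}: F1 {scores_a[name][2]:.5f} -> {scores_b[name][2]:.5f} ({delta:+.5f})")
```

With a dev set this small, a per-class difference may come from only one or two entities, so the comparison is best read as a hint rather than a verdict.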
V. Prediction
1. Loading the test dataset
%cd ~/PaddleNLP/model_zoo/uie/
import json
from pprint import pprint

# Read the JSON file
with open('/home/aistudio/data/data218296/ner_test.json', 'r', encoding='utf-8') as f:
    test_data = json.load(f)
print(f"Dataset size: {len(test_data)}")
print("Sample record:")
pprint(test_data[0])
/home/aistudio/PaddleNLP/model_zoo/uie
Dataset size: 5920
Sample record:
{'sample_id': 0,
 'text': '第五艘西班牙海军F-100级护卫舰即将装备集成通信控制系统。该系统由葡萄牙EID公司生产。该系统已经用于巴西海军的圣保罗航母,荷兰海军的四艘荷兰级海上巡逻舰和四艘西班牙海军BAM近海巡逻舰。F-105护卫舰于2009年初铺设龙骨。该舰预计2010年建造完成,2012年夏交付。'}
2. Setting the extraction targets and the custom model weights path
from pprint import pprint
from paddlenlp import Taskflow

schema = "飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备".split(", ")
# Set the extraction targets and the path to the fine-tuned weights
# (PaddleNLP/model_zoo/uie/checkpoint0.67789894)
my_ie = Taskflow("information_extraction", schema=schema, task_path='./checkpoint0.67789894/model_best')
# my_ie = Taskflow("information_extraction", schema=schema, task_path='./checkpoint/model_best')
[2023-06-03 20:04:27,292] [ INFO] - loading configuration file ./checkpoint0.67789894/model_best/config.json
[2023-06-03 20:04:27,297] [ INFO] - Model config ErnieConfig { "architectures": [ "UIE" ], "attention_probs_dropout_prob": 0.1, "dtype": "float32", "enable_recompute": false, "fuse": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 2048, "model_type": "ernie", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 0, "paddlenlp_version": null, "pool_act": "tanh", "task_id": 0, "task_type_vocab_size": 3, "type_vocab_size": 4, "use_task_id": true, "vocab_size": 40000 }
W0603 20:04:27.792821 9916 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 20:04:27.796425 9916 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 20:04:28,581] [ INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 20:04:28,584] [ INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint0.67789894/model_best. If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 20:04:28,590] [ INFO] - Converting to the inference model cost a little time.
[2023-06-03 20:04:42,631] [ INFO] - The inference model save in the path:./checkpoint0.67789894/model_best/static/inference
[2023-06-03 20:04:44,793] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint0.67789894/model_best'.
3. Running prediction
list_text = [item['text'] for item in test_data]
%%time
results = my_ie(list_text)
CPU times: user 16min 20s, sys: 2min 27s, total: 18min 47s
Wall time: 18min 45s
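A single call over all 5,920 texts runs for nearly 19 minutes without any feedback. The sketch below shows one way to process the texts in chunks and report progress. predict_in_chunks and the chunk size are illustrative, fake_ie is a stand-in so the example runs without a model, and it assumes the predictor takes a list of texts and returns one result per text, as the Taskflow call in this notebook does.

```python
def predict_in_chunks(predict_fn, texts, chunk_size=256):
    """Call predict_fn on successive chunks of texts and collect the results."""
    results = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        results.extend(predict_fn(chunk))
        done = min(start + chunk_size, len(texts))
        print(f"processed {done}/{len(texts)}")
    return results


def fake_ie(batch):
    """Stand-in predictor: returns an empty result dict for every text."""
    return [{} for _ in batch]


demo_results = predict_in_chunks(fake_ie, ["text"] * 5, chunk_size=2)
print(len(demo_results))
```

In the notebook, predict_fn would be my_ie and texts would be list_text; chunking also makes it possible to save partial results if the session is interrupted.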
print(len(results))
5920
print(results[0])
{'舰船舰艇': [{'text': 'F-105护卫舰', 'start': 95, 'end': 103, 'probability': 0.9999301440846722}, {'text': 'F-100级护卫舰', 'start': 8, 'end': 17, 'probability': 0.999897838723939}, {'text': 'BAM近海巡逻舰', 'start': 86, 'end': 94, 'probability': 0.9974057948851112}, {'text': '荷兰级海上巡逻舰', 'start': 70, 'end': 78, 'probability': 0.9998418145644905}, {'text': '圣保罗航母', 'start': 57, 'end': 62, 'probability': 0.9989470402669269}]}
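Each predicted span carries a probability, which can be used to drop low-confidence entities before building the submission. A minimal sketch follows, using two of the spans printed above; filter_by_probability is a hypothetical helper and the threshold of 0.999 is chosen only to make the effect visible, not tuned on any data.

```python
def filter_by_probability(result, threshold=0.5):
    """Keep only spans whose probability is at least threshold."""
    filtered = {}
    for entity_type, spans in result.items():
        kept = [span for span in spans if span["probability"] >= threshold]
        if kept:
            filtered[entity_type] = kept
    return filtered


# Two spans taken from the printed prediction for sample 0.
result = {
    "舰船舰艇": [
        {"text": "F-105护卫舰", "start": 95, "end": 103, "probability": 0.9999301440846722},
        {"text": "BAM近海巡逻舰", "start": 86, "end": 94, "probability": 0.9974057948851112},
    ]
}

strict = filter_by_probability(result, threshold=0.999)
print([span["text"] for span in strict["舰船舰艇"]])
```

Whether filtering helps depends on the precision/recall trade-off on the dev set, so any threshold should be checked there first.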
# Save the raw Taskflow output
with open('/home/aistudio/result_list.json', 'w', encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False)
# Flatten the per-sample results into one record per predicted entity
results_list = []
for sample, result in zip(test_data, results):
    for key, items in result.items():
        for entity in items:
            results_list.append({
                'sample_id': sample['sample_id'],
                'text': entity['text'],
                'type': key,
                'start': entity['start'],
                'end': entity['end'],
            })

with open('/home/aistudio/result.json', 'w', encoding="utf-8") as f:
    json.dump(results_list, f, indent=4, ensure_ascii=False)
# Alternative: predict one sample at a time instead of in a single batched call.
# Note that this rebinds results to the flat record list.
results = []
for sample in test_data:
    uie_result = my_ie(sample['text'])
    # pprint(uie_result)
    for key, items in uie_result[0].items():
        for entity in items:
            results.append({
                'sample_id': sample['sample_id'],
                'text': entity['text'],
                'type': key,
                'start': entity['start'],
                'end': entity['end'],
            })

with open('result.json', 'w', encoding="utf-8") as f:
    json.dump(results, f, indent=4, ensure_ascii=False)
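Before uploading, it can be worth checking that the flat records are internally consistent. The sketch below drops exact duplicates and flags records whose text does not match the source slice. validate_records is a hypothetical helper, the toy data is made up apart from the first sentence and span, and the check assumes an exclusive end offset; it does not verify the official submission format, which is not restated here.

```python
def validate_records(test_data, records):
    """Split records into (clean, problems): clean ones match their source text."""
    texts = {item["sample_id"]: item["text"] for item in test_data}
    seen, clean, problems = set(), [], []
    for record in records:
        source = texts.get(record["sample_id"], "")
        if source[record["start"]:record["end"]] != record["text"]:
            problems.append(record)
            continue
        key = (record["sample_id"], record["type"], record["start"], record["end"])
        if key not in seen:  # skip exact duplicates
            seen.add(key)
            clean.append(record)
    return clean, problems


toy_test_data = [
    {"sample_id": 0, "text": "第五艘西班牙海军F-100级护卫舰即将装备集成通信控制系统。"},
]
good = {"sample_id": 0, "text": "F-100级护卫舰", "type": "舰船舰艇", "start": 8, "end": 17}
shifted = {"sample_id": 0, "text": "F-100级护卫舰", "type": "舰船舰艇", "start": 9, "end": 18}

clean, problems = validate_records(toy_test_data, [good, dict(good), shifted])
print(len(clean), len(problems))
```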
VI. Submission