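PaddleNLP Solution for the Named Entity Recognition Challenge for Low-Resource and Incremental Types

Summary: a PaddleNLP (UIE) baseline for the Named Entity Recognition Challenge for Low-Resource and Incremental Types.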

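I. About the Named Entity Recognition Challenge for Low-Resource and Incremental Types


I wrote a competition baseline with the do-everything PaddleNLP. The score of this first submission is on the low side but acceptable. The main reason is that the preliminary-round dataset covers a very narrow range; it is simply too small.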

1696840200171.jpg

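Competition page:

Named Entity Recognition for Low-Resource and Incremental Types


1. Data overview


The data for this task focuses on the equipment domain. It was collected and curated from the following three kinds of sources, which gives it a degree of authority and domain value:

  • Open-source information: data was collected from mainstream domestic and international news sites, Baidu Baike, Wikipedia, weapon encyclopedias, and similar open sites. Chinese text was collected first, and foreign-language data was translated before being used as intelligence data;
  • Think-tank reports: papers and reports containing equipment intelligence were obtained from think-tank websites;
  • Internal results: documents related to research results were obtained from internal sites of domestic defense companies, research institutes, general libraries, digital libraries, and defense-institute libraries, then analyzed and organized.

After collecting enough raw unlabeled data from these sources, the organizers first combined manual inspection with automated methods such as keyword matching to filter out off-topic, untrue, and biased data. They then cleaned invalid and illegal characters and filtered out overly long texts and texts without domain entities. Next, the texts were annotated with a label scheme based on authoritative equipment standards and monographs, and models from earlier research in the field were used to pre-label the data. Finally, samples whose type distribution met the task requirements were selected as the original dataset.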


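2. Data description


• Preliminary-round data: the dataset contains about 6,000 samples and covers the following 9 entity types: 飞行器 (aircraft), 单兵武器 (individual weapons), 炸弹 (bombs), 装甲车辆 (armored vehicles), 火炮 (artillery), 导弹 (missiles), 舰船舰艇 (ships and vessels), 太空装备 (space equipment), 其他武器装备 (other weapons and equipment). Following the task setup used in low-resource learning, about 50 examples are sampled from the original dataset for each type, giving a training set of 97 labeled samples in total (a single sample may contain several entities and entity types). All remaining samples go into the test set. All data files are UTF-8 encoded.

File type | File name | Description
Training set | ner_train.json | 97 labeled samples. Each sample has a sample id (sample_id), the raw text (text), and a list of annotated entities (annotations). Each list element describes one entity: type (type), text (text), span start (start), and span end (end)
Test set | ner_test.json | 5,920 unlabeled samples. Each sample has a sample id (sample_id) and the raw text (text)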
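The description above does not say whether `end` is inclusive or exclusive, and the conversion later in this post copies `start` and `end` unchanged. A minimal sketch for checking the convention before converting is shown here; the record is hypothetical and only mimics the ner_train.json layout described above, and `span_convention` is an illustrative helper name.

```python
# Hypothetical record in the ner_train.json layout described above
# (not taken from the real competition data).
sample = {
    "sample_id": 0,
    "text": "F-105护卫舰于2009年初铺设龙骨。",
    "annotations": [
        {"type": "舰船舰艇", "text": "F-105护卫舰", "start": 0, "end": 8},
    ],
}


def span_convention(text, ann):
    """Report how `end` must be read for the span to reproduce ann['text']."""
    if text[ann["start"]:ann["end"]] == ann["text"]:
        return "end-exclusive"
    if text[ann["start"]:ann["end"] + 1] == ann["text"]:
        return "end-inclusive"
    return "mismatch"


for ann in sample["annotations"]:
    print(ann["type"], ann["text"], span_convention(sample["text"], ann))
```

Running the same check over every annotation in the real training file would show whether the offsets need adjusting before they are passed on.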


II. Data processing


1. Inspecting the data


!ls data/data218296/
ner_test.json  ner_train.json
  • The data shows that 9 entity types need to be extracted: 飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备
  • The training set currently has 97 labeled samples
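With so few labeled samples, it can help to see how many entity mentions each type actually has. The sketch below counts mentions per type; the two samples are hypothetical stand-ins for the contents of ner_train.json, and `count_entity_types` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical samples in the ner_train.json layout; replace with the loaded file.
samples = [
    {
        "sample_id": 0,
        "text": "F-100级护卫舰和F-105护卫舰",
        "annotations": [
            {"type": "舰船舰艇", "text": "F-100级护卫舰", "start": 0, "end": 9},
            {"type": "舰船舰艇", "text": "F-105护卫舰", "start": 10, "end": 18},
        ],
    },
    {
        "sample_id": 1,
        "text": "该导弹由战斗机发射",
        "annotations": [
            {"type": "导弹", "text": "导弹", "start": 1, "end": 3},
            {"type": "飞行器", "text": "战斗机", "start": 4, "end": 7},
        ],
    },
]


def count_entity_types(samples):
    """Count annotated entity mentions per entity type."""
    counts = Counter()
    for sample in samples:
        for ann in sample["annotations"]:
            counts[ann["type"]] += 1
    return counts


for entity_type, count in count_entity_types(samples).most_common():
    print(entity_type, count)
```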

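2. Dataset format conversion and train/dev split


The two main steps are:

  • Format conversion: data is usually annotated with doccano and converted once labeling is finished. Here I convert the provided file directly into the format I need.
  • Train/dev split: the labeled data is shuffled and split into a training set and a dev set. The code uses the first 75% for training and the remaining 25% for dev.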
%cd ~
import json
import random
# Read the labeled training data (JSON)
with open('data/data218296/ner_train.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
print(len(data))
data_len=len(data)
# Shuffle, then keep the first 75% for training and the rest for dev
random.shuffle(data)
split_idx=int(data_len*0.75)
train_data=data[:split_idx]
dev_data=data[split_idx:]
/home/aistudio
97
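The split above is not seeded, so each run produces a different train/dev partition and dev scores are not directly comparable between runs. A minimal sketch of a reproducible variant follows; the seed value 42 and the helper name `split_train_dev` are assumptions for illustration, not part of the original notebook.

```python
import random


def split_train_dev(samples, train_ratio=0.75, seed=42):
    """Shuffle a copy of the samples with a fixed seed and split it once."""
    rng = random.Random(seed)          # local RNG, leaves the global state alone
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# 97 placeholder items stand in for the 97 labeled samples.
train_part, dev_part = split_train_dev(range(97))
print(len(train_part), len(dev_part))  # 72 25
```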
%cd ~
import json
import random
# Schema: the 9 entity types, each used as a prompt
key_words = "飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备".split(", ")
# Dataset format conversion.
# train_data and dev_data come from the 75:25 split in the previous cell.
def convert(source_data, key_word):
    convert_target = []
    for item in source_data:
        # Entities of the requested type in this sample
        result_list = []
        for item2 in item["annotations"]:
            if item2['type'] == key_word:
                result_temp = dict()
                result_temp['text'] = item2['text']
                result_temp['start'] = item2['start']
                result_temp['end'] = item2['end']
                result_list.append(result_temp)
        # Build one record per (text, prompt) pair; result_list may be empty
        temp = dict()
        temp['content'] = item['text']
        temp['result_list'] = result_list
        temp['prompt'] = key_word
        convert_target.append(temp)
    return convert_target
def convert_main(data,key_words):
    result=[]
    for key_word in key_words:
        temp_list = convert(data, key_word)
        result=result+temp_list
    random.shuffle(result)
    return result
# Converted record lists
train_data_convert = convert_main(train_data,key_words)
dev_data_convert = convert_main(dev_data,key_words)
# Write the records as JSON Lines (one JSON object per line)
with open('train.txt', 'w', encoding="utf-8") as f:
    for item in train_data_convert:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
with open('dev.txt', 'w', encoding="utf-8") as f:
    for item in dev_data_convert:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
/home/aistudio
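To make the target format concrete, the sketch below applies the same idea to a single hypothetical sample and two of the nine types. Each sample yields one record per entity type, and types that do not occur in the text get an empty `result_list`. The sample and the helper name `to_prompt_records` are illustrative only.

```python
import json

# Two of the nine entity types, to keep the printed output short.
entity_types = ["飞行器", "舰船舰艇"]

# Hypothetical sample in the ner_train.json layout.
sample = {
    "sample_id": 0,
    "text": "F-105护卫舰于2009年初铺设龙骨。",
    "annotations": [
        {"type": "舰船舰艇", "text": "F-105护卫舰", "start": 0, "end": 8},
    ],
}


def to_prompt_records(sample, entity_types):
    """Build one {content, result_list, prompt} record per entity type."""
    records = []
    for entity_type in entity_types:
        spans = [
            {"text": a["text"], "start": a["start"], "end": a["end"]}
            for a in sample["annotations"]
            if a["type"] == entity_type
        ]
        records.append(
            {"content": sample["text"], "result_list": spans, "prompt": entity_type}
        )
    return records


for record in to_prompt_records(sample, entity_types):
    print(json.dumps(record, ensure_ascii=False))
```

With all nine types, every labeled sample therefore turns into nine lines in train.txt or dev.txt.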


III. Model training


1. Environment setup


This step downloads and installs PaddleNLP.

# Clone PaddleNLP with git
!git clone https://gitee.com/paddlepaddle/PaddleNLP.git  --depth=1
Cloning into 'PaddleNLP'...
remote: Enumerating objects: 5825, done.
remote: Counting objects: 100% (5825/5825), done.
remote: Compressing objects: 100% (4099/4099), done.
remote: Total 5825 (delta 2254), reused 3581 (delta 1437), pack-reused 0
Receiving objects: 100% (5825/5825), 22.98 MiB | 1.19 MiB/s, done.
Resolving deltas: 100% (2254/2254), done.
Checking connectivity... done.
# Install or upgrade PaddleNLP from the cloned source
%cd ~/PaddleNLP
!pip install -U -e ./
from IPython.display import clear_output
clear_output() # clear the very long install output


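2. Fine-tuning the model


The Trainer API is the recommended way to fine-tune the model. Given a model, a dataset, and a few other inputs, the Trainer API handles pre-training, fine-tuning, and model compression efficiently. It can launch multi-GPU training, mixed-precision training, gradient accumulation, checkpoint resumption, and logging with a single command, and it wraps common training configuration such as the optimizer and learning-rate schedule.

Configurable parameters:

  • model_name_or_path: required. The pretrained model used for few-shot training. Options are "uie-base", "uie-medium", "uie-mini", "uie-micro", "uie-nano", "uie-m-base", "uie-m-large".
  • multilingual: whether the model is cross-lingual. Models fine-tuned from "uie-m-base", "uie-m-large", and similar are also multilingual and need this set to True; default is False.
  • output_dir: required. Directory where the model is saved after training or compression; default is None.
  • device: training device, one of 'cpu', 'gpu', or 'npu'; GPU training is the default.
  • per_device_train_batch_size: batch size for training. Adjust it to the available GPU memory and lower it if memory runs out; default is 32.
  • per_device_eval_batch_size: batch size for evaluation on the dev set. Adjust it to the available GPU memory and lower it if memory runs out; default is 32.
  • learning_rate: maximum learning rate for training. 1e-5 is recommended for UIE; default is 3e-5.
  • num_train_epochs: number of training epochs. With early stopping, 100 is a reasonable choice; default is 10.
  • logging_steps: number of steps between log outputs during training; default is 100.
  • save_steps: number of steps between checkpoint saves during training; default is 100.
  • seed: global random seed; default is 42.
  • weight_decay: weight decay applied to all layers except bias and LayerNorm weights. Optional; default is 0.0.
  • do_train: whether to run fine-tuning. Setting this flag enables fine-tuning; it is not set by default.
  • do_eval: whether to run evaluation. Setting this flag enables evaluation.

Because this example sets --do_eval, evaluation runs automatically once training finishes.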

%cd ~/PaddleNLP/model_zoo/uie/
# !python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune.py \
!python finetune.py  \
    --device gpu \
    --logging_steps 10 \
    --save_steps 50 \
    --eval_steps 50 \
    --seed 1000 \
    --model_name_or_path uie-medium \
    --output_dir ./checkpoint/model_best \
    --train_path ~/train.txt \
    --dev_path ~/dev.txt  \
    --max_seq_length 512  \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size  16 \
    --num_train_epochs 32 \
    --learning_rate 1e-5 \
    --label_names "start_positions" "end_positions" \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end  True \
    --save_total_limit 1
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-19 16:20:21,620] [ WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[2023-06-19 16:20:21,620] [    INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2023-06-19 16:20:21,621] [    INFO] - ============================================================
[2023-06-19 16:20:21,621] [    INFO] -      Model Configuration Arguments      
[2023-06-19 16:20:21,621] [    INFO] - paddle commit id              :3fa7a736e32508e797616b6344d97814c37d3ff8
[2023-06-19 16:20:21,621] [    INFO] - export_model_dir              :./checkpoint/model_best
[2023-06-19 16:20:21,621] [    INFO] - model_name_or_path            :uie-medium
[2023-06-19 16:20:21,621] [    INFO] - multilingual                  :False
[2023-06-19 16:20:21,621] [    INFO] - 
[2023-06-19 16:20:21,621] [    INFO] - ============================================================
[2023-06-19 16:20:21,621] [    INFO] -       Data Configuration Arguments      
[2023-06-19 16:20:21,621] [    INFO] - paddle commit id              :3fa7a736e32508e797616b6344d97814c37d3ff8
[2023-06-19 16:20:21,621] [    INFO] - dev_path                      :/home/aistudio/dev.txt
[2023-06-19 16:20:21,621] [    INFO] - dynamic_max_length            :None
[2023-06-19 16:20:21,621] [    INFO] - max_seq_length                :512
[2023-06-19 16:20:21,622] [    INFO] - train_path                    :/home/aistudio/train.txt
[2023-06-19 16:20:21,622] [    INFO] - 
[2023-06-19 16:20:21,622] [ WARNING] - Process rank: -1, device: gpu, world_size: 1, distributed training: False, 16-bits training: False
[2023-06-19 16:20:21,622] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'uie-medium'.
[2023-06-19 16:20:21,622] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh_vocab.txt and saved to /home/aistudio/.paddlenlp/models/uie-medium
[2023-06-19 16:20:21,752] [    INFO] - Downloading ernie_3.0_medium_zh_vocab.txt from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh_vocab.txt
100%|█████████████████████████████████████████| 182k/182k [00:00<00:00, 889kB/s]
[2023-06-19 16:20:22,092] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/uie-medium/tokenizer_config.json
[2023-06-19 16:20:22,092] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/uie-medium/special_tokens_map.json
[2023-06-19 16:20:22,093] [    INFO] - Model config ErnieConfig {
  "attention_probs_dropout_prob": 0.1,
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 16,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
[2023-06-19 16:20:22,094] [    INFO] - Configuration saved in /home/aistudio/.paddlenlp/models/uie-medium/config.json
[2023-06-19 16:20:22,095] [    INFO] - Downloading uie_medium.pdparams from https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_medium.pdparams
100%|████████████████████████████████████████| 288M/288M [02:40<00:00, 1.88MB/s]
W0619 16:23:05.005470  1653 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0619 16:23:05.010092  1653 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-19 16:23:05,669] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-19 16:23:05,669] [    INFO] - All the weights of UIE were initialized from the model checkpoint at uie-medium.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
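To see how `--save_steps 50` and `--eval_steps 50` relate to the size of this dataset, the step counts can be estimated from the values in the command above. This is only a rough sketch under several assumptions: a single GPU, no gradient accumulation, the last partial batch is kept, and no text is split into several examples by `--max_seq_length`.

```python
import math

# Values taken from the data split and the finetune.py command above.
num_train_samples = int(97 * 0.75)   # samples kept for training
num_prompts = 9                      # one record per entity type
batch_size = 16                      # --per_device_train_batch_size
num_epochs = 32                      # --num_train_epochs
eval_every = 50                      # --eval_steps / --save_steps

# Assumption: every (sample, prompt) record becomes exactly one training example.
num_examples = num_train_samples * num_prompts
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
num_evaluations = total_steps // eval_every

print(f"examples: {num_examples}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"evaluations at every {eval_every} steps: {num_evaluations}")
```

Under these assumptions an evaluation happens a little less often than once per epoch, which is a reasonable cadence for a dataset this small.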


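IV. Model evaluation


1. Evaluating the model


Configurable parameters:

  • model_path: path of the model directory to evaluate. It must contain the weights file model_state.pdparams and the configuration file model_config.json.
  • test_path: the test-set file used for evaluation.
  • batch_size: batch size; adjust it to your machine. Default is 16.
  • max_seq_len: maximum text length. Inputs longer than this are split automatically; default is 512.
  • debug: whether to enable debug mode, which evaluates each positive class separately. It is intended for model debugging only and is off by default.
  • multilingual: whether the model is cross-lingual; off by default.
  • schema_lang: language of the schema, either ch or en. Default is ch; choose en for English datasets.

Run the following commands to evaluate the model: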

%cd ~/PaddleNLP/model_zoo/uie/
!python evaluate.py \
    --model_path ./checkpoint0.67789894/model_best \
    --test_path ~/dev.txt \
    --batch_size 16 \
    --max_seq_len 512
# --model_path ./checkpoint/model_best
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 19:58:07,106] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint0.67789894/model_best'.
[2023-06-03 19:58:07,132] [    INFO] - loading configuration file ./checkpoint0.67789894/model_best/config.json
[2023-06-03 19:58:07,133] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}

W0603 19:58:09.367168  9655 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 19:58:09.370771  9655 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 19:58:10,168] [    INFO] - All model checkpoint weights were used when initializing UIE.

[2023-06-03 19:58:10,169] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint0.67789894/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 19:58:15,447] [    INFO] - -----------------------------
[2023-06-03 19:58:15,447] [    INFO] - Class Name: all_classes
[2023-06-03 19:58:15,447] [    INFO] - Evaluation Precision: 0.92920 | Recall: 0.86066 | F1: 0.89362

%cd ~/PaddleNLP/model_zoo/uie/
!python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ~/dev.txt \
    --batch_size 16 \
    --max_seq_len 512
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 19:58:59,933] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
[2023-06-03 19:58:59,958] [    INFO] - loading configuration file ./checkpoint/model_best/config.json
[2023-06-03 19:58:59,960] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}

W0603 19:59:02.195425  9944 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 19:59:02.199051  9944 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 19:59:02,995] [    INFO] - All model checkpoint weights were used when initializing UIE.

[2023-06-03 19:59:02,996] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 19:59:08,312] [    INFO] - -----------------------------
[2023-06-03 19:59:08,312] [    INFO] - Class Name: all_classes
[2023-06-03 19:59:08,312] [    INFO] - Evaluation Precision: 0.89922 | Recall: 0.95082 | F1: 0.92430
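The F1 values in the two logs above follow from the reported precision and recall as their harmonic mean. A small sketch that recomputes them from the logged numbers; `f1_score` is an illustrative helper, not a function from evaluate.py.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the two evaluation logs above.
runs = {
    "./checkpoint0.67789894/model_best": (0.92920, 0.86066),
    "./checkpoint/model_best": (0.89922, 0.95082),
}

for name, (precision, recall) in runs.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.5f}")
```

The second checkpoint trades some precision for clearly higher recall, which is what lifts its F1.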



2. Evaluating the model in debug mode


Debug mode evaluates each positive class separately. It is intended for model debugging only:

%cd ~/PaddleNLP/model_zoo/uie/
!python evaluate.py \
    --model_path ./checkpoint0.67789894/model_best \
    --test_path ~/dev.txt \
    --debug
# --model_path ./checkpoint/model_best
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 19:59:51,590] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint0.67789894/model_best'.
[2023-06-03 19:59:51,617] [    INFO] - loading configuration file ./checkpoint0.67789894/model_best/config.json
[2023-06-03 19:59:51,618] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}

W0603 19:59:53.878067 10161 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 19:59:53.881646 10161 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 19:59:54,673] [    INFO] - All model checkpoint weights were used when initializing UIE.

[2023-06-03 19:59:54,673] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint0.67789894/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 19:59:55,584] [    INFO] - -----------------------------
[2023-06-03 19:59:55,585] [    INFO] - Class Name: 炸弹
[2023-06-03 19:59:55,585] [    INFO] - Evaluation Precision: 0.86667 | Recall: 0.86667 | F1: 0.86667
[2023-06-03 19:59:55,660] [    INFO] - -----------------------------
[2023-06-03 19:59:55,660] [    INFO] - Class Name: 装甲车辆
[2023-06-03 19:59:55,660] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:55,809] [    INFO] - -----------------------------
[2023-06-03 19:59:55,809] [    INFO] - Class Name: 火炮
[2023-06-03 19:59:55,809] [    INFO] - Evaluation Precision: 0.84615 | Recall: 0.73333 | F1: 0.78571
[2023-06-03 19:59:55,903] [    INFO] - -----------------------------
[2023-06-03 19:59:55,904] [    INFO] - Class Name: 舰船舰艇
[2023-06-03 19:59:55,904] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:55,966] [    INFO] - -----------------------------
[2023-06-03 19:59:55,966] [    INFO] - Class Name: 飞行器
[2023-06-03 19:59:55,966] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:56,072] [    INFO] - -----------------------------
[2023-06-03 19:59:56,072] [    INFO] - Class Name: 单兵武器
[2023-06-03 19:59:56,072] [    INFO] - Evaluation Precision: 0.90000 | Recall: 0.75000 | F1: 0.81818
[2023-06-03 19:59:56,169] [    INFO] - -----------------------------
[2023-06-03 19:59:56,170] [    INFO] - Class Name: 太空装备
[2023-06-03 19:59:56,170] [    INFO] - Evaluation Precision: 0.95833 | Recall: 0.95833 | F1: 0.95833
[2023-06-03 19:59:56,227] [    INFO] - -----------------------------
[2023-06-03 19:59:56,228] [    INFO] - Class Name: 导弹
[2023-06-03 19:59:56,228] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:56,308] [    INFO] - -----------------------------
[2023-06-03 19:59:56,308] [    INFO] - Class Name: 其他武器装备
[2023-06-03 19:59:56,308] [    INFO] - Evaluation Precision: 1.00000 | Recall: 0.50000 | F1: 0.66667

%cd ~/PaddleNLP/model_zoo/uie/
!python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ~/dev.txt \
    --debug 
/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 20:00:12,238] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
[2023-06-03 20:00:12,263] [    INFO] - loading configuration file ./checkpoint/model_best/config.json
[2023-06-03 20:00:12,265] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}

W0603 20:00:14.519615 10292 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 20:00:14.523311 10292 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 20:00:15,331] [    INFO] - All model checkpoint weights were used when initializing UIE.

[2023-06-03 20:00:15,332] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 20:00:16,208] [    INFO] - -----------------------------
[2023-06-03 20:00:16,208] [    INFO] - Class Name: 炸弹
[2023-06-03 20:00:16,208] [    INFO] - Evaluation Precision: 0.93750 | Recall: 1.00000 | F1: 0.96774
[2023-06-03 20:00:16,281] [    INFO] - -----------------------------
[2023-06-03 20:00:16,281] [    INFO] - Class Name: 装甲车辆
[2023-06-03 20:00:16,281] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 20:00:16,425] [    INFO] - -----------------------------
[2023-06-03 20:00:16,426] [    INFO] - Class Name: 火炮
[2023-06-03 20:00:16,426] [    INFO] - Evaluation Precision: 0.92308 | Recall: 0.80000 | F1: 0.85714
[2023-06-03 20:00:16,519] [    INFO] - -----------------------------
[2023-06-03 20:00:16,519] [    INFO] - Class Name: 舰船舰艇
[2023-06-03 20:00:16,519] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 20:00:16,579] [    INFO] - -----------------------------
[2023-06-03 20:00:16,579] [    INFO] - Class Name: 飞行器
[2023-06-03 20:00:16,579] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 20:00:16,681] [    INFO] - -----------------------------
[2023-06-03 20:00:16,681] [    INFO] - Class Name: 单兵武器
[2023-06-03 20:00:16,681] [    INFO] - Evaluation Precision: 0.91667 | Recall: 0.91667 | F1: 0.91667
[2023-06-03 20:00:16,775] [    INFO] - -----------------------------
[2023-06-03 20:00:16,775] [    INFO] - Class Name: 太空装备
[2023-06-03 20:00:16,775] [    INFO] - Evaluation Precision: 0.88889 | Recall: 1.00000 | F1: 0.94118
[2023-06-03 20:00:16,830] [    INFO] - -----------------------------
[2023-06-03 20:00:16,830] [    INFO] - Class Name: 导弹
[2023-06-03 20:00:16,830] [    INFO] - Evaluation Precision: 0.75000 | Recall: 1.00000 | F1: 0.85714
[2023-06-03 20:00:16,907] [    INFO] - -----------------------------
[2023-06-03 20:00:16,908] [    INFO] - Class Name: 其他武器装备
[2023-06-03 20:00:16,908] [    INFO] - Evaluation Precision: 0.80000 | Recall: 0.85714 | F1: 0.82759
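The per-class numbers from the two debug runs are easier to compare side by side. The sketch below copies the F1 values from the logs above, prints the change per class, and computes an unweighted (macro) average. Note that the macro average is an extra summary added here; the `all_classes` score reported by evaluate.py is computed differently and will not match it.

```python
# Per-class F1 copied from the debug logs above.
f1_a = {  # ./checkpoint0.67789894/model_best
    "炸弹": 0.86667, "装甲车辆": 1.00000, "火炮": 0.78571,
    "舰船舰艇": 1.00000, "飞行器": 1.00000, "单兵武器": 0.81818,
    "太空装备": 0.95833, "导弹": 1.00000, "其他武器装备": 0.66667,
}
f1_b = {  # ./checkpoint/model_best
    "炸弹": 0.96774, "装甲车辆": 1.00000, "火炮": 0.85714,
    "舰船舰艇": 1.00000, "飞行器": 1.00000, "单兵武器": 0.91667,
    "太空装备": 0.94118, "导弹": 0.85714, "其他武器装备": 0.82759,
}


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(scores.values()) / len(scores)


for name in f1_a:
    delta = f1_b[name] - f1_a[name]
    print(f"{name}: {f1_a[name]:.5f} -> {f1_b[name]:.5f} ({delta:+.5f})")

# Classes where the second checkpoint scores lower than the first.
regressions = [name for name in f1_a if f1_b[name] < f1_a[name]]
print("macro F1:", round(macro_f1(f1_a), 5), "->", round(macro_f1(f1_b), 5))
print("lower in second checkpoint:", regressions)
```

With only a couple of dozen dev samples, single-class differences like these rest on very few entities, so they are best read as hints rather than firm conclusions.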



V. Prediction


1. Reading the test set


%cd ~/PaddleNLP/model_zoo/uie/
import json
import csv
from pprint import pprint
# Read the JSON file
with open('/home/aistudio/data/data218296/ner_test.json', 'r', encoding='utf-8') as f:
    test_data = json.load(f)
print(f"Dataset size: {len(test_data)}")
print("Sample record:")
pprint(test_data[0])   
/home/aistudio/PaddleNLP/model_zoo/uie
Dataset size: 5920
Sample record:
{'sample_id': 0,
 'text': '第五艘西班牙海军F-100级护卫舰即将装备集成通信控制系统。该系统由葡萄牙EID公司生产。该系统已经用于巴西海军的圣保罗航母,荷兰海军的四艘荷兰级海上巡逻舰和四艘西班牙海军BAM近海巡逻舰。F-105护卫舰于2009年初铺设龙骨。该舰预计2010年建造完成,2012年夏交付。'}


2. Setting the extraction targets and the path to the fine-tuned weights


from pprint import pprint
from paddlenlp import Taskflow
schema = "飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备".split(", ")
# Set the extraction schema and the path to the fine-tuned model weights
# PaddleNLP/model_zoo/uie/checkpoint0.67789894
my_ie = Taskflow("information_extraction", schema=schema,  task_path='./checkpoint0.67789894/model_best')
# my_ie = Taskflow("information_extraction", schema=schema,  task_path='./checkpoint/model_best')
[2023-06-03 20:04:27,292] [    INFO] - loading configuration file ./checkpoint0.67789894/model_best/config.json
[2023-06-03 20:04:27,297] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
W0603 20:04:27.792821  9916 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 20:04:27.796425  9916 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 20:04:28,581] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 20:04:28,584] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint0.67789894/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 20:04:28,590] [    INFO] - Converting to the inference model cost a little time.
[2023-06-03 20:04:42,631] [    INFO] - The inference model save in the path:./checkpoint0.67789894/model_best/static/inference
[2023-06-03 20:04:44,793] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint0.67789894/model_best'.


3. Prediction


# Collect the raw text of every test sample
list_text=[item['text'] for item in test_data]
%%time
results=my_ie(list_text)
CPU times: user 16min 20s, sys: 2min 27s, total: 18min 47s
Wall time: 18min 45s
print(len(results))
5920
print(results[0])
{'舰船舰艇': [{'text': 'F-105护卫舰', 'start': 95, 'end': 103, 'probability': 0.9999301440846722}, {'text': 'F-100级护卫舰', 'start': 8, 'end': 17, 'probability': 0.999897838723939}, {'text': 'BAM近海巡逻舰', 'start': 86, 'end': 94, 'probability': 0.9974057948851112}, {'text': '荷兰级海上巡逻舰', 'start': 70, 'end': 78, 'probability': 0.9998418145644905}, {'text': '圣保罗航母', 'start': 57, 'end': 62, 'probability': 0.9989470402669269}]}
# Save the raw Taskflow output
with open('/home/aistudio/result_list.json','w', encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False)
# Flatten the predictions into one record per entity for the submission file
results_list=[]
for sample, prediction in zip(test_data, results):
    for entity_type, spans in prediction.items():
        for span in spans:
            results_list.append({
                'sample_id': sample['sample_id'],
                'text': span['text'],
                'type': entity_type,
                'start': span['start'],
                'end': span['end'],
            })
with open('/home/aistudio/result.json','w', encoding="utf-8") as f:
    json.dump(results_list, f,indent=4, ensure_ascii=False)
# Alternative: predict one sample at a time and build the same records.
# Note that this writes result.json into the current working directory.
results=[]
for sample in test_data:
    uie_result=my_ie(sample['text'])
    # pprint(uie_result)
    for entity_type, spans in uie_result[0].items():
        for span in spans:
            results.append({
                'sample_id': sample['sample_id'],
                'text': span['text'],
                'type': entity_type,
                'start': span['start'],
                'end': span['end'],
            })
with open('result.json','w', encoding="utf-8") as f:
    json.dump(results, f,indent=4, ensure_ascii=False)
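Each predicted span carries a `probability`, which the flattening code above drops. One possible refinement is to filter on it while building the submission records. The sketch below is an assumption-laden illustration: `flatten_predictions` is a hypothetical helper, the threshold 0.998 is arbitrary and chosen only so that the demo drops one span, and the demo data is the first prediction shown above with probabilities rounded.

```python
def flatten_predictions(sample_ids, predictions, min_probability=0.0):
    """Turn per-sample {type: [spans]} dicts into flat submission records."""
    rows = []
    for sample_id, prediction in zip(sample_ids, predictions):
        for entity_type, spans in prediction.items():
            for span in spans:
                # Skip spans the model is less confident about.
                if span.get("probability", 1.0) < min_probability:
                    continue
                rows.append({
                    "sample_id": sample_id,
                    "text": span["text"],
                    "type": entity_type,
                    "start": span["start"],
                    "end": span["end"],
                })
    return rows


# First prediction from the output above, probabilities rounded to 5 places.
demo_prediction = {
    "舰船舰艇": [
        {"text": "F-105护卫舰", "start": 95, "end": 103, "probability": 0.99993},
        {"text": "F-100级护卫舰", "start": 8, "end": 17, "probability": 0.99990},
        {"text": "BAM近海巡逻舰", "start": 86, "end": 94, "probability": 0.99741},
        {"text": "荷兰级海上巡逻舰", "start": 70, "end": 78, "probability": 0.99984},
        {"text": "圣保罗航母", "start": 57, "end": 62, "probability": 0.99895},
    ]
}

all_rows = flatten_predictions([0], [demo_prediction])
kept_rows = flatten_predictions([0], [demo_prediction], min_probability=0.998)
print(len(all_rows), len(kept_rows))
```

Whether a stricter threshold helps the leaderboard score would need to be checked on the dev set, since it trades recall for precision.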


VI. Submission


1696840872898.jpg
