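I. Introduction to the Low-Resource and Incremental-Type Named Entity Recognition Challenge

This is a competition baseline written with PaddleNLP. It is a first submission: the score is fairly low but passable, mainly because the preliminary-round dataset is very small and covers a narrow range.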

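Competition URL:

1. Data overview

The data for this task focuses on the equipment domain. It was collected and organized from the following three kinds of sources, which gives it a degree of authority and domain value:

Open-source information: data was collected from open-source sites such as mainstream news sites in China and abroad, Baidu Baike, Wikipedia, and 武器大全 (a weapons encyclopedia site). Chinese text was collected first, and foreign-language data was translated to obtain intelligence data;
Think-tank reports: papers and reports containing equipment intelligence were obtained from think-tank websites;
Internal results: documents related to research results were obtained from the internal websites of domestic defense-industry enterprises, research institutes, domestic comprehensive libraries, digital libraries, and defense-institute libraries, then analyzed and organized.

After collecting ample raw unlabeled data from these sources, the organizers first filtered out off-topic, untrue, and biased data by combining manual screening with automated methods such as keyword matching. They then cleaned invalid and illegal characters and filtered out overly long texts and texts without any domain entity. Next, the texts were annotated with a label scheme drawn up with reference to authoritative equipment standards and treatises, with pre-labeling done by models from earlier research in related fields. Finally, samples whose type distribution met the task requirements were selected to form the original dataset.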

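2. Data description

• Preliminary-round data: the dataset contains about 6,000 samples covering the following 9 entity types: 飞行器 (aircraft), 单兵武器 (individual weapons), 炸弹 (bombs), 装甲车辆 (armored vehicles), 火炮 (artillery), 导弹 (missiles), 舰船舰艇 (ships and vessels), 太空装备 (space equipment), 其他武器装备 (other weapons and equipment). Following the task setup common in low-resource learning, about 50 example cases per type are drawn from the original dataset, forming a training set of 97 annotated samples in total (each sample may contain several entities and entity types). All remaining samples go to the test set. All data files are UTF-8 encoded.

File type	File name	Description
Training set	ner_train.json	97 annotated samples. Each sample contains a sample ID (sample_id), the original text (text), and a list of annotated entities (annotations). Each list element is one entity with its type (type), text (text), span start (start), and span end (end)
Test set	ner_test.json	5,920 unannotated samples. Each sample contains a sample ID (sample_id) and the original text (text)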
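
The description above does not say whether `end` points at the last character of an entity or one past it. A minimal sketch for checking this on any record follows. `check_span` is a hypothetical helper, and the record below is an illustrative one built in the documented layout, not a line copied from ner_train.json.

```python
def check_span(text, annotation):
    """Report which end-index convention reproduces the annotated string."""
    start, end = annotation["start"], annotation["end"]
    if text[start:end] == annotation["text"]:
        return "exclusive"  # end is one past the last character
    if text[start:end + 1] == annotation["text"]:
        return "inclusive"  # end is the index of the last character
    return "mismatch"       # offsets do not reproduce the entity text


# Illustrative record in the documented layout (an assumption, not real training data).
sample = {
    "sample_id": 0,
    "text": "第五艘西班牙海军F-100级护卫舰即将装备集成通信控制系统。",
    "annotations": [
        {"type": "舰船舰艇", "text": "F-100级护卫舰", "start": 8, "end": 17},
    ],
}

for ann in sample["annotations"]:
    print(ann["type"], ann["text"], check_span(sample["text"], ann))
```

Running the same check over every annotation in the real training file would show which convention the organizers used before any conversion is attempted.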

II. Data Processing

1. Inspecting the data

!ls data/data218296/

ner_test.json  ner_train.json

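As shown above, 9 entity types need to be extracted: 飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备.
The training set currently contains 97 annotated samples.

2. Dataset format conversion and dataset split

This step does two things:

Format conversion: annotation is usually done with doccano, followed by a format conversion once labeling is finished. Here I convert the competition files directly into the format I need.
Train/dev split: the annotated samples are shuffled and split into a training set and a development set at a 75:25 ratio.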

%cd ~
import json
import random

# Load the annotated training file
with open('data/data218296/ner_train.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
print(len(data))
data_len = len(data)
# Shuffle, then keep the first 75% for training and the rest for dev
random.shuffle(data)
train_data = data[:int(data_len * 0.75)]
dev_data = data[int(data_len * 0.75):]

/home/aistudio
97
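
The split above is not seeded, so each run produces a different train/dev partition and the dev scores are not directly comparable between runs. A minimal sketch of a reproducible variant follows. `split_samples` and its `seed` argument are hypothetical additions, not part of the original notebook.

```python
import random


def split_samples(samples, train_ratio=0.75, seed=42):
    """Shuffle a copy of the samples with a fixed seed and split it in two."""
    rng = random.Random(seed)   # private generator, leaves global state alone
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# With 97 samples and a 0.75 ratio this yields 72 training and 25 dev samples.
train_part, dev_part = split_samples(range(97))
print(len(train_part), len(dev_part))
```

Using a private `random.Random` instance keeps the split stable even if other cells call `random.shuffle` in between.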

%cd ~
import json
import random

# Schema: the 9 entity types to extract
key_words = "飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备".split(", ")

# Convert the competition format into UIE training records.
# train_data and dev_data come from the split in the previous cell.
def convert(source_data, key_word):
    """Build one record per sample for a single entity type (the prompt)."""
    convert_target = []
    for item in source_data:
        # Keep only the annotations whose type matches this prompt
        result_list = []
        for item2 in item["annotations"]:
            if item2['type'] == key_word:
                result_list.append({
                    'text': item2['text'],
                    'start': item2['start'],
                    'end': item2['end'],
                })
        # One record per (sample, prompt); result_list is empty if the type is absent
        convert_target.append({
            'content': item['text'],
            'result_list': result_list,
            'prompt': key_word,
        })
    return convert_target

def convert_main(data, key_words):
    """Convert every sample once per entity type, then shuffle the records."""
    result = []
    for key_word in key_words:
        result.extend(convert(data, key_word))
    random.shuffle(result)
    return result

# Converted record lists
train_data_convert = convert_main(train_data, key_words)
dev_data_convert = convert_main(dev_data, key_words)

# Write one JSON object per line
with open('train.txt', 'w', encoding="utf-8") as f:
    for item in train_data_convert:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
with open('dev.txt', 'w', encoding="utf-8") as f:
    for item in dev_data_convert:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

/home/aistudio
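
Because every sample is paired with all 9 prompts, many converted records have an empty `result_list` and act as negative examples. A minimal sketch for counting positives and negatives per prompt follows. `prompt_balance` is a hypothetical helper, and the demo records are illustrative rather than real competition data.

```python
from collections import defaultdict


def prompt_balance(records):
    """Count records with and without gold spans for each prompt."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for rec in records:
        kind = "positive" if rec["result_list"] else "negative"
        counts[rec["prompt"]][kind] += 1
    return dict(counts)


# Illustrative records in the converted layout (an assumption, not real data).
demo = [
    {"content": "F-105护卫舰于2009年初铺设龙骨。", "prompt": "舰船舰艇",
     "result_list": [{"text": "F-105护卫舰", "start": 0, "end": 8}]},
    {"content": "F-105护卫舰于2009年初铺设龙骨。", "prompt": "导弹", "result_list": []},
    {"content": "F-105护卫舰于2009年初铺设龙骨。", "prompt": "火炮", "result_list": []},
]

for prompt, count in prompt_balance(demo).items():
    print(prompt, count)
```

To run it on the real output, load each line of train.txt with `json.loads` and pass the resulting list to `prompt_balance`.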

III. Model Training

1. Environment setup

This mainly consists of downloading and installing PaddleNLP.

# Clone PaddleNLP with git
!git clone https://gitee.com/paddlepaddle/PaddleNLP.git  --depth=1

Cloning into 'PaddleNLP'...
remote: Enumerating objects: 5825, done.
remote: Counting objects: 100% (5825/5825), done.
remote: Compressing objects: 100% (4099/4099), done.
remote: Total 5825 (delta 2254), reused 3581 (delta 1437), pack-reused 0
Receiving objects: 100% (5825/5825), 22.98 MiB | 1.19 MiB/s, done.
Resolving deltas: 100% (2254/2254), done.
Checking connectivity... done.

# Install or upgrade PaddleNLP from the cloned source
%cd ~/PaddleNLP
!pip install -U -e ./
from IPython.display import clear_output
clear_output()  # clear the long installation output
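2. Model fine-tuning

The Trainer API is the recommended way to fine-tune the model. Given a model, a dataset, and a few settings, it runs pre-training, fine-tuning, and model compression efficiently, and it supports multi-GPU training, mixed-precision training, gradient accumulation, checkpoint resumption, and logging out of the box. It also wraps the common training configuration, such as the optimizer and the learning-rate schedule.

Configurable parameters:

model_name_or_path: required; the pretrained model used for few-shot training. Options are "uie-base", "uie-medium", "uie-mini", "uie-micro", "uie-nano", "uie-m-base", "uie-m-large".
multilingual: whether the model is cross-lingual. Models fine-tuned from "uie-m-base", "uie-m-large", and similar are also multilingual and need this set to True; defaults to False.
output_dir: required; the directory where the trained or compressed model is saved; defaults to None.
device: training device, one of 'cpu', 'gpu', or 'npu'; defaults to GPU training.
per_device_train_batch_size: batch size for training. Adjust it to the available GPU memory and lower it if memory runs out; defaults to 32.
per_device_eval_batch_size: batch size for evaluation on the dev set. Adjust it to the available GPU memory and lower it if memory runs out; defaults to 32.
learning_rate: maximum learning rate; 1e-5 is recommended for UIE; defaults to 3e-5.
num_train_epochs: number of training epochs; 100 is a reasonable choice when early stopping is used; defaults to 10.
logging_steps: number of steps between log messages during training; defaults to 100.
save_steps: number of steps between model checkpoints during training; defaults to 100.
seed: global random seed; defaults to 42.
weight_decay: weight decay applied to all layers except bias and LayerNorm weights; optional; defaults to 0.0.
do_train: whether to run fine-tuning; set this flag to fine-tune; not set by default.
do_eval: whether to run evaluation; set this flag to evaluate.

Because the example command sets --do_eval, evaluation runs automatically once training finishes.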

%cd ~/PaddleNLP/model_zoo/uie/
# !python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune.py \
!python finetune.py  \
    --device gpu \
    --logging_steps 10 \
    --save_steps 50 \
    --eval_steps 50 \
    --seed 1000 \
    --model_name_or_path uie-medium \
    --output_dir ./checkpoint/model_best \
    --train_path ~/train.txt \
    --dev_path ~/dev.txt  \
    --max_seq_length 512  \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size  16 \
    --num_train_epochs 32 \
    --learning_rate 1e-5 \
    --label_names "start_positions" "end_positions" \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end  True \
    --save_total_limit 1

/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-19 16:20:21,620] [ WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[2023-06-19 16:20:21,620] [    INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2023-06-19 16:20:21,621] [    INFO] - ============================================================
[2023-06-19 16:20:21,621] [    INFO] -      Model Configuration Arguments      
[2023-06-19 16:20:21,621] [    INFO] - paddle commit id              :3fa7a736e32508e797616b6344d97814c37d3ff8
[2023-06-19 16:20:21,621] [    INFO] - export_model_dir              :./checkpoint/model_best
[2023-06-19 16:20:21,621] [    INFO] - model_name_or_path            :uie-medium
[2023-06-19 16:20:21,621] [    INFO] - multilingual                  :False
[2023-06-19 16:20:21,621] [    INFO] - 
[2023-06-19 16:20:21,621] [    INFO] - ============================================================
[2023-06-19 16:20:21,621] [    INFO] -       Data Configuration Arguments      
[2023-06-19 16:20:21,621] [    INFO] - paddle commit id              :3fa7a736e32508e797616b6344d97814c37d3ff8
[2023-06-19 16:20:21,621] [    INFO] - dev_path                      :/home/aistudio/dev.txt
[2023-06-19 16:20:21,621] [    INFO] - dynamic_max_length            :None
[2023-06-19 16:20:21,621] [    INFO] - max_seq_length                :512
[2023-06-19 16:20:21,622] [    INFO] - train_path                    :/home/aistudio/train.txt
[2023-06-19 16:20:21,622] [    INFO] - 
[2023-06-19 16:20:21,622] [ WARNING] - Process rank: -1, device: gpu, world_size: 1, distributed training: False, 16-bits training: False
[2023-06-19 16:20:21,622] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'uie-medium'.
[2023-06-19 16:20:21,622] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh_vocab.txt and saved to /home/aistudio/.paddlenlp/models/uie-medium
[2023-06-19 16:20:21,752] [    INFO] - Downloading ernie_3.0_medium_zh_vocab.txt from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh_vocab.txt
100%|█████████████████████████████████████████| 182k/182k [00:00<00:00, 889kB/s]
[2023-06-19 16:20:22,092] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/uie-medium/tokenizer_config.json
[2023-06-19 16:20:22,092] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/uie-medium/special_tokens_map.json
[2023-06-19 16:20:22,093] [    INFO] - Model config ErnieConfig {  "attention_probs_dropout_prob": 0.1,  "enable_recompute": false,  "fuse": false,  "hidden_act": "gelu",  "hidden_dropout_prob": 0.1,  "hidden_size": 768,  "initializer_range": 0.02,  "intermediate_size": 3072,  "layer_norm_eps": 1e-12,  "max_position_embeddings": 2048,  "model_type": "ernie",  "num_attention_heads": 12,  "num_hidden_layers": 6,  "pad_token_id": 0,  "paddlenlp_version": null,  "pool_act": "tanh",  "task_id": 0,  "task_type_vocab_size": 16,  "type_vocab_size": 4,  "use_task_id": true,  "vocab_size": 40000}
[2023-06-19 16:20:22,094] [    INFO] - Configuration saved in /home/aistudio/.paddlenlp/models/uie-medium/config.json
[2023-06-19 16:20:22,095] [    INFO] - Downloading uie_medium.pdparams from https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_medium.pdparams
100%|████████████████████████████████████████| 288M/288M [02:40<00:00, 1.88MB/s]
W0619 16:23:05.005470  1653 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0619 16:23:05.010092  1653 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-19 16:23:05,669] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-19 16:23:05,669] [    INFO] - All the weights of UIE were initialized from the model checkpoint at uie-medium.If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.

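As a rough sanity check on the training command above, the number of optimizer steps can be estimated from the split and the batch size. This is a sketch under assumptions the source does not confirm: single-device training, no gradient accumulation, the last partial batch kept, and no extra records created by splitting long texts. `training_schedule` is a hypothetical helper, not part of PaddleNLP.

```python
import math


def training_schedule(num_records, batch_size, epochs, eval_steps):
    """Estimate steps per epoch, total steps, and number of evaluations."""
    steps_per_epoch = math.ceil(num_records / batch_size)  # partial batch kept
    total_steps = steps_per_epoch * epochs
    evaluations = total_steps // eval_steps
    return steps_per_epoch, total_steps, evaluations


# 97 samples, 75% for training, each sample paired with 9 prompts.
num_records = int(97 * 0.75) * 9
print(num_records)                                 # 648
print(training_schedule(num_records, 16, 32, 50))  # (41, 1312, 26)
```

Under these assumptions, `--save_steps 50` and `--eval_steps 50` mean a checkpoint and an evaluation roughly every 1.2 epochs, which is worth keeping in mind when reading `--save_total_limit 1`.
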
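IV. Model Evaluation

1. Evaluating the model

Configurable parameters:

model_path: path to the folder of the model to evaluate; it must contain the weight file model_state.pdparams and the configuration file model_config.json.
test_path: the test set file used for evaluation.
batch_size: batch size; adjust it to your machine; defaults to 16.
max_seq_len: maximum text length; inputs longer than this are split automatically; defaults to 512.
debug: whether to enable debug mode, which evaluates each positive class separately; intended only for model debugging; off by default.
multilingual: whether the model is cross-lingual; off by default.
schema_lang: language of the schema, either ch or en; defaults to ch; choose en for English datasets.

Run the following command to evaluate the model: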

%cd ~/PaddleNLP/model_zoo/uie/
!python evaluate.py \
    --model_path ./checkpoint0.67789894/model_best \
    --test_path ~/dev.txt \
    --batch_size 16 \
    --max_seq_len 512
# Alternative checkpoint: --model_path ./checkpoint/model_best

/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 19:58:07,106] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint0.67789894/model_best'.
[2023-06-03 19:58:07,132] [    INFO] - loading configuration file ./checkpoint0.67789894/model_best/config.json
[2023-06-03 19:58:07,133] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
W0603 19:58:09.367168  9655 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 19:58:09.370771  9655 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 19:58:10,168] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 19:58:10,169] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint0.67789894/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 19:58:15,447] [    INFO] - -----------------------------
[2023-06-03 19:58:15,447] [    INFO] - Class Name: all_classes
[2023-06-03 19:58:15,447] [    INFO] - Evaluation Precision: 0.92920 | Recall: 0.86066 | F1: 0.89362

%cd ~/PaddleNLP/model_zoo/uie/
!python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ~/dev.txt \
    --batch_size 16 \
    --max_seq_len 512

/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 19:58:59,933] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
[2023-06-03 19:58:59,958] [    INFO] - loading configuration file ./checkpoint/model_best/config.json
[2023-06-03 19:58:59,960] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
W0603 19:59:02.195425  9944 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 19:59:02.199051  9944 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 19:59:02,995] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 19:59:02,996] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 19:59:08,312] [    INFO] - -----------------------------
[2023-06-03 19:59:08,312] [    INFO] - Class Name: all_classes
[2023-06-03 19:59:08,312] [    INFO] - Evaluation Precision: 0.89922 | Recall: 0.95082 | F1: 0.92430

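The precision, recall, and F1 values in the two logs above are related by the usual harmonic-mean formula. The sketch below assumes exact-match, span-level scoring, where a predicted span counts only if it equals a gold span. It illustrates the metric and is not a copy of what evaluate.py does internally. `f1_score` and `span_prf` are hypothetical helpers.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def span_prf(gold_spans, predicted_spans):
    """Exact-match precision, recall, and F1 over sets of (start, end) spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Recompute F1 from the precision and recall printed in the two logs.
print(f"{f1_score(0.92920, 0.86066):.5f}")  # 0.89362
print(f"{f1_score(0.89922, 0.95082):.5f}")  # 0.92430
```

Both recomputed values agree with the logged F1 scores. The second checkpoint trades some precision for noticeably higher recall.
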
2. Evaluating the model in debug mode

Debug mode evaluates each positive class separately. It is intended only for model debugging:

%cd ~/PaddleNLP/model_zoo/uie/
!python evaluate.py \
    --model_path ./checkpoint0.67789894/model_best \
    --test_path ~/dev.txt \
    --debug
# Alternative checkpoint: --model_path ./checkpoint/model_best

/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 19:59:51,590] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint0.67789894/model_best'.
[2023-06-03 19:59:51,617] [    INFO] - loading configuration file ./checkpoint0.67789894/model_best/config.json
[2023-06-03 19:59:51,618] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
W0603 19:59:53.878067 10161 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 19:59:53.881646 10161 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 19:59:54,673] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 19:59:54,673] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint0.67789894/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 19:59:55,584] [    INFO] - -----------------------------
[2023-06-03 19:59:55,585] [    INFO] - Class Name: 炸弹
[2023-06-03 19:59:55,585] [    INFO] - Evaluation Precision: 0.86667 | Recall: 0.86667 | F1: 0.86667
[2023-06-03 19:59:55,660] [    INFO] - -----------------------------
[2023-06-03 19:59:55,660] [    INFO] - Class Name: 装甲车辆
[2023-06-03 19:59:55,660] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:55,809] [    INFO] - -----------------------------
[2023-06-03 19:59:55,809] [    INFO] - Class Name: 火炮
[2023-06-03 19:59:55,809] [    INFO] - Evaluation Precision: 0.84615 | Recall: 0.73333 | F1: 0.78571
[2023-06-03 19:59:55,903] [    INFO] - -----------------------------
[2023-06-03 19:59:55,904] [    INFO] - Class Name: 舰船舰艇
[2023-06-03 19:59:55,904] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:55,966] [    INFO] - -----------------------------
[2023-06-03 19:59:55,966] [    INFO] - Class Name: 飞行器
[2023-06-03 19:59:55,966] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:56,072] [    INFO] - -----------------------------
[2023-06-03 19:59:56,072] [    INFO] - Class Name: 单兵武器
[2023-06-03 19:59:56,072] [    INFO] - Evaluation Precision: 0.90000 | Recall: 0.75000 | F1: 0.81818
[2023-06-03 19:59:56,169] [    INFO] - -----------------------------
[2023-06-03 19:59:56,170] [    INFO] - Class Name: 太空装备
[2023-06-03 19:59:56,170] [    INFO] - Evaluation Precision: 0.95833 | Recall: 0.95833 | F1: 0.95833
[2023-06-03 19:59:56,227] [    INFO] - -----------------------------
[2023-06-03 19:59:56,228] [    INFO] - Class Name: 导弹
[2023-06-03 19:59:56,228] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 19:59:56,308] [    INFO] - -----------------------------
[2023-06-03 19:59:56,308] [    INFO] - Class Name: 其他武器装备
[2023-06-03 19:59:56,308] [    INFO] - Evaluation Precision: 1.00000 | Recall: 0.50000 | F1: 0.66667

%cd ~/PaddleNLP/model_zoo/uie/
!python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ~/dev.txt \
    --debug

/home/aistudio/PaddleNLP/model_zoo/uie
[2023-06-03 20:00:12,238] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
[2023-06-03 20:00:12,263] [    INFO] - loading configuration file ./checkpoint/model_best/config.json
[2023-06-03 20:00:12,265] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
W0603 20:00:14.519615 10292 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 20:00:14.523311 10292 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 20:00:15,331] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 20:00:15,332] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 20:00:16,208] [    INFO] - -----------------------------
[2023-06-03 20:00:16,208] [    INFO] - Class Name: 炸弹
[2023-06-03 20:00:16,208] [    INFO] - Evaluation Precision: 0.93750 | Recall: 1.00000 | F1: 0.96774
[2023-06-03 20:00:16,281] [    INFO] - -----------------------------
[2023-06-03 20:00:16,281] [    INFO] - Class Name: 装甲车辆
[2023-06-03 20:00:16,281] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 20:00:16,425] [    INFO] - -----------------------------
[2023-06-03 20:00:16,426] [    INFO] - Class Name: 火炮
[2023-06-03 20:00:16,426] [    INFO] - Evaluation Precision: 0.92308 | Recall: 0.80000 | F1: 0.85714
[2023-06-03 20:00:16,519] [    INFO] - -----------------------------
[2023-06-03 20:00:16,519] [    INFO] - Class Name: 舰船舰艇
[2023-06-03 20:00:16,519] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 20:00:16,579] [    INFO] - -----------------------------
[2023-06-03 20:00:16,579] [    INFO] - Class Name: 飞行器
[2023-06-03 20:00:16,579] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2023-06-03 20:00:16,681] [    INFO] - -----------------------------
[2023-06-03 20:00:16,681] [    INFO] - Class Name: 单兵武器
[2023-06-03 20:00:16,681] [    INFO] - Evaluation Precision: 0.91667 | Recall: 0.91667 | F1: 0.91667
[2023-06-03 20:00:16,775] [    INFO] - -----------------------------
[2023-06-03 20:00:16,775] [    INFO] - Class Name: 太空装备
[2023-06-03 20:00:16,775] [    INFO] - Evaluation Precision: 0.88889 | Recall: 1.00000 | F1: 0.94118
[2023-06-03 20:00:16,830] [    INFO] - -----------------------------
[2023-06-03 20:00:16,830] [    INFO] - Class Name: 导弹
[2023-06-03 20:00:16,830] [    INFO] - Evaluation Precision: 0.75000 | Recall: 1.00000 | F1: 0.85714
[2023-06-03 20:00:16,907] [    INFO] - -----------------------------
[2023-06-03 20:00:16,908] [    INFO] - Class Name: 其他武器装备
[2023-06-03 20:00:16,908] [    INFO] - Evaluation Precision: 0.80000 | Recall: 0.85714 | F1: 0.82759

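The per-class lines in the debug logs are easier to compare once they are parsed into a dictionary. The sketch below is one possible way to do that, assuming the log layout shown above. `parse_debug_log` and the regular expressions are hypothetical and not part of PaddleNLP. The sample lines are copied from the second debug run.

```python
import re

# Strip terminal colour codes such as "[32m" or "[0m" (with or without the ESC byte).
ANSI_RE = re.compile(r"\x1b?\[\d+(?:;\d+)*m")
CLASS_RE = re.compile(r"Class Name: (.+?)\s*$")
SCORE_RE = re.compile(r"Precision: ([\d.]+) \| Recall: ([\d.]+) \| F1: ([\d.]+)")


def parse_debug_log(lines):
    """Map each class name to its (precision, recall, f1) tuple."""
    scores, current = {}, None
    for raw in lines:
        line = ANSI_RE.sub("", raw)
        match = CLASS_RE.search(line)
        if match:
            current = match.group(1)
            continue
        match = SCORE_RE.search(line)
        if match and current is not None:
            scores[current] = tuple(float(x) for x in match.groups())
            current = None
    return scores


log_lines = [
    "[32m[2023-06-03 20:00:16,426] [    INFO][0m - Class Name: 火炮[0m",
    "[32m[2023-06-03 20:00:16,426] [    INFO][0m - Evaluation Precision: 0.92308 | Recall: 0.80000 | F1: 0.85714[0m",
    "[32m[2023-06-03 20:00:16,830] [    INFO][0m - Class Name: 导弹[0m",
    "[32m[2023-06-03 20:00:16,830] [    INFO][0m - Evaluation Precision: 0.75000 | Recall: 1.00000 | F1: 0.85714[0m",
    "[32m[2023-06-03 20:00:16,908] [    INFO][0m - Class Name: 其他武器装备[0m",
    "[32m[2023-06-03 20:00:16,908] [    INFO][0m - Evaluation Precision: 0.80000 | Recall: 0.85714 | F1: 0.82759[0m",
]

scores = parse_debug_log(log_lines)
weakest = min(scores, key=lambda name: scores[name][2])
print(weakest, scores[weakest])
```

With only 25 dev samples, each class has very few gold entities, so per-class scores move in large jumps and are best read as rough hints.
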
V. Prediction

1. Loading the test dataset

%cd ~/PaddleNLP/model_zoo/uie/
import json
from pprint import pprint

# Load the test JSON file
with open('/home/aistudio/data/data218296/ner_test.json', 'r', encoding='utf-8') as f:
    test_data = json.load(f)
print(f"Dataset size: {len(test_data)}")
print("Sample record:")
pprint(test_data[0])

/home/aistudio/PaddleNLP/model_zoo/uie
Dataset size: 5920
Sample record:
{'sample_id': 0,
 'text': '第五艘西班牙海军F-100级护卫舰即将装备集成通信控制系统。该系统由葡萄牙EID公司生产。该系统已经用于巴西海军的圣保罗航母，荷兰海军的四艘荷兰级海上巡逻舰和四艘西班牙海军BAM近海巡逻舰。F-105护卫舰于2009年初铺设龙骨。该舰预计2010年建造完成，2012年夏交付。'}

2. Setting the extraction targets and the fine-tuned model path

from pprint import pprint
from paddlenlp import Taskflow
schema = "飞行器, 单兵武器, 炸弹, 装甲车辆, 火炮, 导弹, 舰船舰艇, 太空装备, 其他武器装备".split(", ")
# Set the extraction schema and point task_path at the fine-tuned weights
# PaddleNLP/model_zoo/uie/checkpoint0.67789894
my_ie = Taskflow("information_extraction", schema=schema,  task_path='./checkpoint0.67789894/model_best')
# my_ie = Taskflow("information_extraction", schema=schema,  task_path='./checkpoint/model_best')

[2023-06-03 20:04:27,292] [    INFO] - loading configuration file ./checkpoint0.67789894/model_best/config.json
[2023-06-03 20:04:27,297] [    INFO] - Model config ErnieConfig {
  "architectures": [
    "UIE"
  ],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "enable_recompute": false,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "ernie",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "task_id": 0,
  "task_type_vocab_size": 3,
  "type_vocab_size": 4,
  "use_task_id": true,
  "vocab_size": 40000
}
W0603 20:04:27.792821  9916 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0603 20:04:27.796425  9916 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-06-03 20:04:28,581] [    INFO] - All model checkpoint weights were used when initializing UIE.
[2023-06-03 20:04:28,584] [    INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint0.67789894/model_best.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-06-03 20:04:28,590] [    INFO] - Converting to the inference model cost a little time.
[2023-06-03 20:04:42,631] [    INFO] - The inference model save in the path:./checkpoint0.67789894/model_best/static/inference
[2023-06-03 20:04:44,793] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint0.67789894/model_best'.

3. Prediction

# Collect the raw text of every test sample
list_text = [item['text'] for item in test_data]

%%time
results=my_ie(list_text)

CPU times: user 16min 20s, sys: 2min 27s, total: 18min 47s
Wall time: 18min 45s

print(len(results))

print(results[0])

{'舰船舰艇': [{'text': 'F-105护卫舰', 'start': 95, 'end': 103, 'probability': 0.9999301440846722}, {'text': 'F-100级护卫舰', 'start': 8, 'end': 17, 'probability': 0.999897838723939}, {'text': 'BAM近海巡逻舰', 'start': 86, 'end': 94, 'probability': 0.9974057948851112}, {'text': '荷兰级海上巡逻舰', 'start': 70, 'end': 78, 'probability': 0.9998418145644905}, {'text': '圣保罗航母', 'start': 57, 'end': 62, 'probability': 0.9989470402669269}]}

# Save the raw Taskflow output (one dict per test sample)
with open('/home/aistudio/result_list.json', 'w', encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False)

# Flatten the output into one record per predicted entity
results_list = []
for sample, result in zip(test_data, results):
    for key, item in result.items():
        for entity in item:
            results_list.append({
                'sample_id': sample['sample_id'],
                'text': entity['text'],
                'type': key,
                'start': entity['start'],
                'end': entity['end'],
            })
with open('/home/aistudio/result.json', 'w', encoding="utf-8") as f:
    json.dump(results_list, f, indent=4, ensure_ascii=False)
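
Each predicted entity also carries a `probability`, which the flattening step discards. A minimal sketch of a variant that filters on it and checks the offsets against the text follows. `flatten_predictions` and its `min_prob` threshold are hypothetical additions, and whether a threshold improves the score is untested here. The demo data is the first test sample and its prediction as printed earlier in this section.

```python
def flatten_predictions(sample_ids, texts, results, min_prob=0.0):
    """Flatten per-sample predictions, dropping low-confidence or misaligned spans."""
    records = []
    for sample_id, text, result in zip(sample_ids, texts, results):
        for entity_type, spans in result.items():
            for span in spans:
                if span["probability"] < min_prob:
                    continue  # below the confidence threshold
                if text[span["start"]:span["end"]] != span["text"]:
                    continue  # offsets do not reproduce the entity string
                records.append({
                    "sample_id": sample_id,
                    "text": span["text"],
                    "type": entity_type,
                    "start": span["start"],
                    "end": span["end"],
                })
    return records


demo_text = ("第五艘西班牙海军F-100级护卫舰即将装备集成通信控制系统。该系统由葡萄牙EID公司生产。"
             "该系统已经用于巴西海军的圣保罗航母，荷兰海军的四艘荷兰级海上巡逻舰和"
             "四艘西班牙海军BAM近海巡逻舰。F-105护卫舰于2009年初铺设龙骨。")
demo_result = {"舰船舰艇": [
    {"text": "F-105护卫舰", "start": 95, "end": 103, "probability": 0.9999301440846722},
    {"text": "F-100级护卫舰", "start": 8, "end": 17, "probability": 0.999897838723939},
    {"text": "BAM近海巡逻舰", "start": 86, "end": 94, "probability": 0.9974057948851112},
    {"text": "荷兰级海上巡逻舰", "start": 70, "end": 78, "probability": 0.9998418145644905},
    {"text": "圣保罗航母", "start": 57, "end": 62, "probability": 0.9989470402669269},
]}

print(len(flatten_predictions([0], [demo_text], [demo_result])))                  # 5
print(len(flatten_predictions([0], [demo_text], [demo_result], min_prob=0.998)))  # 4
```

All five offsets in this sample reproduce their entity strings with an exclusive `end`, so the alignment check drops nothing here.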

# Alternative: run inference one sample at a time instead of in one batched call.
# Note that this rebinds `results` and writes result.json to the current directory.
results = []
for sample in test_data:
    uie_result = my_ie(sample['text'])
    # pprint(uie_result)
    # For a single input string, Taskflow returns a list holding one dict
    for key, item in uie_result[0].items():
        for entity in item:
            results.append({
                'sample_id': sample['sample_id'],
                'text': entity['text'],
                'type': key,
                'start': entity['start'],
                'end': entity['end'],
            })

with open('result.json', 'w', encoding="utf-8") as f:
    json.dump(results, f, indent=4, ensure_ascii=False)
