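I. Academic Paper Classification Challenge
Competition link: challenge.xfyun.cn/h5/invite?i…
1. Background
As artificial intelligence advances, a large number of papers are published every week. Classifying them has become a practical problem that researchers and research institutions face daily. The competition asks participants to build a paper classification model.
2. Task
Participants use each paper's information (paper id, title, and abstract) to predict its specific category.
Sample record (fields are separated by \t):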
paperid:9821
title:Calculation of prompt diphoton production cross sections at Tevatron and LHC energies
abstract:A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy.
categories:hep-ph
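3. Prediction file format
- Submit a CSV file encoded in UTF-8, with the header in the first row;
- Before submitting, make sure the predictions match the format of sample_submit.csv, shown below: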
paperid,categories
test_00000,cs.CV
test_00001,cs.DC
test_00002,cs.AI
test_00003,cs.NI
test_00004,cs.SE
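The format rules above can be checked mechanically before uploading. The following is a minimal sketch using only the standard library; the function name check_submission and the optional allowed_categories argument are illustrative assumptions, not part of the competition tooling.

```python
import csv
import io

# Header taken from the sample submission shown above.
EXPECTED_HEADER = ["paperid", "categories"]


def check_submission(csv_text, allowed_categories=None):
    """Return a list of problems found; an empty list means the file looks fine."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["first row must be the header paperid,categories"]
    problems = []
    seen = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 fields, got {len(row)}")
            continue
        paperid, category = row
        if paperid in seen:
            problems.append(f"line {line_no}: duplicate paperid {paperid}")
        seen.add(paperid)
        if allowed_categories is not None and category not in allowed_categories:
            problems.append(f"line {line_no}: unknown category {category}")
    return problems


sample = "paperid,categories\ntest_00000,cs.CV\ntest_00001,cs.DC\n"
print(check_submission(sample, allowed_categories={"cs.CV", "cs.DC"}))
```

For a real file, the text could be read with open(path, encoding="utf-8") and passed to the same function.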
II. Data Processing
1. Upgrade paddlenlp
!pip install -U paddlenlp
Found existing installation: paddlenlp 2.0.1
Uninstalling paddlenlp-2.0.1:
Successfully uninstalled paddlenlp-2.0.1
Successfully installed paddlenlp-2.0.5
import pandas as pd
from paddlenlp.datasets import load_dataset
import paddlenlp as ppnlp
from functools import partial
from paddlenlp.data import Stack, Tuple, Pad
from utils import convert_example, create_dataloader
import os
import numpy as np
import paddle
import paddle.nn.functional as F
2. Unzip the data
# Unzip
# !unzip -oq /home/aistudio/data/data100202/Datawhale_学术论文分类_数据集.zip -d dataset
# !rm dataset/__MACOSX/ -rf
# !unzip -oq /home/aistudio/dataset/Datawhale_学术论文分类_数据集/test.csv.zip -d dataset/
# !unzip -oq /home/aistudio/dataset/Datawhale_学术论文分类_数据集/train.csv.zip -d dataset/
3. Inspect the data
# Submission format
!head dataset/Datawhale_学术论文分类_数据集/sample_submit.csv
paperid,categories
test_00000,cs.CV
test_00001,cs.CV
test_00002,cs.CV
test_00003,cs.CV
test_00004,cs.CV
test_00005,cs.CV
test_00006,cs.CV
test_00007,cs.CV
test_00008,cs.CV
# Format of the training data
!head -n20 dataset/train.csv
paperid title abstract categories
train_00000 "Hard but Robust, Easy but Sensitive: How Encoder and Decoder Perform in Neural Machine Translation" " Neural machine translation (NMT) typically adopts the encoder-decoder framework. A good understanding of the characteristics and functionalities of the encoder and decoder can help to explain the pros and cons of the framework, and design better models for NMT. In this work, we conduct an empirical study on the encoder and the decoder in NMT, taking Transformer as an example. We find that 1) the decoder handles an easier task than the encoder in NMT, 2) the decoder is more sensitive to the input noise than the encoder, and 3) the preceding words/tokens in the decoder provide strong conditional information, which accounts for the two observations above. We hope those observations can shed light on the characteristics of the encoder and decoder and inspire future research on NMT. " cs.CL
train_00001 An Easy-to-use Real-world Multi-objective Optimization Problem Suite " Although synthetic test problems are widely used for the performance assessment of evolutionary multi-objective optimization algorithms, they are likely to include unrealistic properties which may lead to overestimation/underestimation. To address this issue, we present a multi-objective optimization problem suite consisting of 16 bound-constrained real-world problems. The problem suite includes various problems in terms of
# Format of the test data
!head dataset/test.csv
paperid title abstract
test_00000 "Analyzing 2.3 Million Maven Dependencies to Reveal an Essential Core in APIs" " This paper addresses the following question: does a small, essential, core set of API members emerges from the actual usage of the API by client applications? To investigate this question, we study the 99 most popular libraries available in Maven Central and the 865,560 client programs that declare dependencies towards them, summing up to 2.3M dependencies. Our key findings are as follows: 43.5% of the dependencies declared by the clients are not used in the bytecode; all APIs contain a large part of rarely used types and a few frequently used types, and the ratio varies according to the nature
4. Custom read method
import pandas as pd

train = pd.read_csv('dataset/train.csv', sep='\t')
test = pd.read_csv('dataset/test.csv', sep='\t')
sub = pd.read_csv('dataset/Datawhale_学术论文分类_数据集/sample_submit.csv')
# Use the title as the input text (the abstract is not concatenated in this run)
train['text'] = train['title']

# Map each category to an integer id in order of first appearance, and back
label_id2cate = dict(enumerate(train.categories.unique()))
label_cate2id = {value: key for key, value in label_id2cate.items()}
train['label'] = train['categories'].map(label_cate2id)
train = train[['text', 'label', 'paperid']]
train_y = train["label"]
# First 45,000 rows for training, the remaining rows for validation
train_df = train[['text', 'label', 'paperid']][:45000]
eval_df = train[['text', 'label', 'paperid']][45000:]
print(label_id2cate)
{0: 'cs.CL', 1: 'cs.NE', 2: 'cs.DL', 3: 'cs.CV', 4: 'cs.LG', 5: 'cs.DS', 6: 'cs.IR', 7: 'cs.RO', 8: 'cs.DM', 9: 'cs.CR', 10: 'cs.AR', 11: 'cs.NI', 12: 'cs.AI', 13: 'cs.SE', 14: 'cs.CG', 15: 'cs.LO', 16: 'cs.SY', 17: 'cs.GR', 18: 'cs.PL', 19: 'cs.SI', 20: 'cs.OH', 21: 'cs.HC', 22: 'cs.MA', 23: 'cs.GT', 24: 'cs.ET', 25: 'cs.FL', 26: 'cs.CC', 27: 'cs.DB', 28: 'cs.DC', 29: 'cs.CY', 30: 'cs.CE', 31: 'cs.MM', 32: 'cs.NA', 33: 'cs.PF', 34: 'cs.OS', 35: 'cs.SD', 36: 'cs.SC', 37: 'cs.MS', 38: 'cs.GL'}
print(label_cate2id)
{'cs.CL': 0, 'cs.NE': 1, 'cs.DL': 2, 'cs.CV': 3, 'cs.LG': 4, 'cs.DS': 5, 'cs.IR': 6, 'cs.RO': 7, 'cs.DM': 8, 'cs.CR': 9, 'cs.AR': 10, 'cs.NI': 11, 'cs.AI': 12, 'cs.SE': 13, 'cs.CG': 14, 'cs.LO': 15, 'cs.SY': 16, 'cs.GR': 17, 'cs.PL': 18, 'cs.SI': 19, 'cs.OH': 20, 'cs.HC': 21, 'cs.MA': 22, 'cs.GT': 23, 'cs.ET': 24, 'cs.FL': 25, 'cs.CC': 26, 'cs.DB': 27, 'cs.DC': 28, 'cs.CY': 29, 'cs.CE': 30, 'cs.MM': 31, 'cs.NA': 32, 'cs.PF': 33, 'cs.OS': 34, 'cs.SD': 35, 'cs.SC': 36, 'cs.MS': 37, 'cs.GL': 38}
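The two dictionaries above come from numbering categories in the order they first appear in train.csv. A minimal pure-Python sketch of the same idea follows; build_label_maps is a hypothetical helper name, and the short category list simply mirrors the first five training rows.

```python
def build_label_maps(categories):
    """Assign integer ids to categories in order of first appearance."""
    # dict.fromkeys keeps insertion order and drops repeats,
    # which mimics the ordering of pandas' Series.unique().
    unique_in_order = list(dict.fromkeys(categories))
    id2cate = dict(enumerate(unique_in_order))
    cate2id = {cate: idx for idx, cate in id2cate.items()}
    return id2cate, cate2id


# Categories of the first five training rows, as an illustration.
categories = ["cs.CL", "cs.NE", "cs.DL", "cs.CL", "cs.CV"]
id2cate, cate2id = build_label_maps(categories)
labels = [cate2id[c] for c in categories]
print(id2cate)
print(labels)
```

Because the ids depend on row order, the mapping has to be rebuilt from the same train.csv at prediction time, otherwise predicted ids would decode to the wrong categories.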
train_df.describe
<bound method NDFrame.describe of                                                     text  label      paperid
0      Hard but Robust, Easy but Sensitive: How Encod...      0  train_00000
1      An Easy-to-use Real-world Multi-objective Opti...      1  train_00001
2      Exploration of reproducibility issues in scien...      2  train_00002
3                    Scheduled Sampling for Transformers      0  train_00003
4      Hybrid Forests for Left Ventricle Segmentation...      3  train_00004
...                                                  ...    ...          ...
44995                 Categorizing Comparative Sentences      0  train_44995
44996  Fractional differentiation based image processing      3  train_44996
44997  A Misanthropic Reinterpretation of the Chinese...     12  train_44997
44998  Towards Purely Unsupervised Disentanglement of...      3  train_44998
44999  A Color Quantization Optimization Approach for...      3  train_44999
[45000 rows x 3 columns]>
from paddlenlp.datasets import load_dataset

# Read training data: yield one example dict per DataFrame row
def read(pd_data):
    for index, item in pd_data.iterrows():
        yield {'text': item['text'], 'label': item['label'], 'qid': item['paperid'].strip('train_')}
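The read function above removes the id prefix with strip('train_'). Python's str.strip treats its argument as a set of characters to remove from both ends, not as a prefix, so it only works here because the remaining part of each id is purely numeric. A small sketch of the difference, with drop_prefix as a hypothetical helper and 'train_ant' as a made-up id:

```python
def drop_prefix(value, prefix):
    """Remove prefix from value only if value actually starts with it."""
    return value[len(prefix):] if value.startswith(prefix) else value


numeric_id = "train_00000"
tricky_id = "train_ant"  # made-up id whose tail uses letters from the prefix

print(numeric_id.strip("train_"))        # digits are not in the character set
print(repr(tricky_id.strip("train_")))   # every character is in the set
print(drop_prefix(tricky_id, "train_"))  # prefix-aware removal
```

On Python 3.9 and later, str.removeprefix does the same job as drop_prefix.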
5. Load the data
# pd_data is forwarded to read() as its argument
train_ds = load_dataset(read, pd_data=train_df, lazy=False)
dev_ds = load_dataset(read, pd_data=eval_df, lazy=False)
for i in range(5):
    print(train_ds[i])
{'text': 'Hard but Robust, Easy but Sensitive: How Encoder and Decoder Perform in\n Neural Machine Translation', 'label': 0, 'qid': '00000'}
{'text': 'An Easy-to-use Real-world Multi-objective Optimization Problem Suite', 'label': 1, 'qid': '00001'}
{'text': 'Exploration of reproducibility issues in scientometric research Part 1:\n Direct reproducibility', 'label': 2, 'qid': '00002'}
{'text': 'Scheduled Sampling for Transformers', 'label': 0, 'qid': '00003'}
{'text': 'Hybrid Forests for Left Ventricle Segmentation using only the first\n slice label', 'label': 3, 'qid': '00004'}
III. Using a Pretrained Model
1. Choose a pretrained model
import paddlenlp as ppnlp
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

# Load the model by name
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en", num_classes=len(train.label.unique()))
# Load the matching tokenizer by name; it handles text processing such as tokenization and token-to-id conversion
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en")
[2021-07-25 01:19:30,419] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.pdparams
[2021-07-25 01:19:40,495] [ INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.vocab.txt
2. Read the data
Data is loaded asynchronously with multiple workers through the paddle.io.DataLoader interface.
import os
from functools import partial

import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
from utils import create_dataloader


def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    # Convert a raw example into model inputs; encoded_inputs is a dict with input_ids, token_type_ids, etc.
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)

    # input_ids: vocabulary ids of the tokens after tokenization
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids: whether each token belongs to sentence 1 or sentence 2 (segment ids)
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        # label: category id of the paper
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # qid: numeric id of the example
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid

# Batch size
batch_size = 128
# Maximum text sequence length
max_seq_length = 50
# Convert examples into the format the model reads
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
# Assemble examples into batches:
# pad text sequences of different lengths to the longest length in the batch
# and stack the labels of the examples together
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
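To make the batching step concrete, here is a plain-Python sketch of what padding and stacking do to a list of (input_ids, token_type_ids, label) examples. It illustrates the idea only and is not PaddleNLP's Pad, Stack, or Tuple implementation; the function names and the token ids in the sample are made up.

```python
def pad_batch(sequences, pad_val=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


def batchify(samples, pad_token_id=0, pad_token_type_id=0):
    """Turn a list of (input_ids, token_type_ids, label) into three batched fields."""
    input_ids, token_type_ids, labels = zip(*samples)
    return [
        pad_batch(input_ids, pad_token_id),
        pad_batch(token_type_ids, pad_token_type_id),
        [list(label) for label in labels],  # labels all have length 1, so stacking is enough
    ]


# Two examples of different lengths (token ids are made up).
samples = [
    ([1, 5, 2], [0, 0, 0], [3]),
    ([1, 7, 8, 9, 2], [0, 0, 0, 0, 0], [0]),
]
batch_input_ids, batch_type_ids, batch_labels = batchify(samples)
print(batch_input_ids)
print(batch_labels)
```

Padding to the longest sequence in each batch, instead of a global maximum, keeps short batches small.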
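3. Set the fine-tuning optimization strategy and the evaluation metric
IV. Model Training and Evaluation
Model training usually consists of the following steps:
- Take one batch of data from the dataloader.
- Feed the batch to the model for the forward pass.
- Pass the forward output to the loss function to compute the loss, and to the metric to compute the evaluation score.
- Backpropagate the loss and update the parameters. Repeat the steps above.
During training the model is evaluated periodically on the validation set; the code in this section does so every 100 steps.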
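The steps above can be sketched on a toy problem without any deep learning framework. The example below fits a one-variable linear model with mini-batch gradient descent and evaluates on held-out data after every epoch. Everything in it (the data, the model, the hyperparameters, the function names) is an illustrative assumption and unrelated to the PaddlePaddle code used in this post.

```python
import random


def batches(data, batch_size):
    """Yield consecutive mini-batches from a list."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def mse(w, b, data):
    """Mean squared error of the model y = w * x + b on data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(train_data, dev_data, epochs=200, batch_size=4, lr=0.05, seed=0):
    rng = random.Random(seed)
    data = list(train_data)  # copy so the caller's list is not shuffled
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        rng.shuffle(data)
        for batch in batches(data, batch_size):      # 1. take one batch
            errors = [(w * x + b - y, x) for x, y in batch]  # 2. forward pass
            # 3. gradients of the mean squared error loss
            grad_w = sum(2 * e * x for e, x in errors) / len(batch)
            grad_b = sum(2 * e for e, _ in errors) / len(batch)
            # 4. parameter update
            w -= lr * grad_w
            b -= lr * grad_b
        history.append(mse(w, b, dev_data))          # evaluate once per epoch
    return w, b, history


# Noise-free points on the line y = 2x + 1, split into train and dev sets.
points = [(i / 10, 2 * (i / 10) + 1) for i in range(-10, 11)]
dev_data = points[::4]
train_data = [p for p in points if p not in dev_data]
w, b, history = train(train_data, dev_data)
print(round(w, 3), round(b, 3))
```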
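1. Parameter configuration
# LinearDecayWithWarmup is imported here but not used below; the optimizer receives a constant learning rate
from paddlenlp.transformers import LinearDecayWithWarmup
import paddle

# Number of training epochs
epochs = 10
# len(train_data_loader) is the number of steps in one epoch
num_training_steps = len(train_data_loader) * epochs

# AdamW optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-5,
    parameters=model.parameters())
# Cross-entropy loss
criterion = paddle.nn.loss.CrossEntropyLoss()
# Accuracy metric
metric = paddle.metric.Accuracy()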
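The configuration above imports LinearDecayWithWarmup and computes num_training_steps, yet passes a constant learning rate to the optimizer. For readers curious what a linear warmup followed by linear decay looks like, here is a plain-Python sketch of that kind of schedule. It is an illustration of the general technique, not PaddleNLP's implementation; the function name is made up, total_steps=7040 matches the last global step in this post's training log, and the 10% warmup is an assumed value.

```python
def linear_warmup_decay(step, base_lr, total_steps, warmup_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


base_lr = 2e-5       # learning rate used in this post
total_steps = 7040   # last global step shown in the training log
warmup_steps = 704   # assumed: 10% of the total steps

for step in (0, 352, 704, 3872, 7040):
    print(step, linear_warmup_decay(step, base_lr, total_steps, warmup_steps))
```

A scheduler object of this kind is normally passed to the optimizer in place of the constant learning_rate and advanced once per training step.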
2. Add VisualDL
# Add log visualization
from visualdl import LogWriter

writer = LogWriter("./log")
3. The evaluate method
@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
    accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
    # Log evaluation metrics (global_step is the module-level counter set by the training loop)
    writer.add_scalar(tag="eval/loss", step=global_step, value=np.mean(losses))
    writer.add_scalar(tag="eval/acc", step=global_step, value=accu)
    model.train()
    metric.reset()
    return accu
4. Start training
save_dir = "checkpoint"
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

global_step = 0
pre_accu = 0
accu = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        logits = model(input_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        # Evaluate on the validation set every 100 steps
        if global_step % 100 == 0:
            accu = evaluate(model, criterion, metric, dev_data_loader)
            # Log training metrics
            writer.add_scalar(tag="train/loss", step=global_step, value=loss)
            writer.add_scalar(tag="train/acc", step=global_step, value=acc)
        if accu > pre_accu:
            # Save the parameters whenever validation accuracy improves
            save_param_path = os.path.join(save_dir, 'model_state.pdparams')
            paddle.save(model.state_dict(), save_param_path)
            pre_accu = accu
tokenizer.save_pretrained(save_dir)
5. Training log
eval loss: 1.22675, accu: 0.77660
global step 6810, epoch: 10, batch: 474, loss: 0.01697, acc: 0.99219
global step 6820, epoch: 10, batch: 484, loss: 0.04531, acc: 0.98984
global step 6830, epoch: 10, batch: 494, loss: 0.03325, acc: 0.98854
global step 6840, epoch: 10, batch: 504, loss: 0.04574, acc: 0.98672
global step 6850, epoch: 10, batch: 514, loss: 0.02137, acc: 0.98625
global step 6860, epoch: 10, batch: 524, loss: 0.19356, acc: 0.98516
global step 6870, epoch: 10, batch: 534, loss: 0.03456, acc: 0.98482
global step 6880, epoch: 10, batch: 544, loss: 0.09647, acc: 0.98438
global step 6890, epoch: 10, batch: 554, loss: 0.11611, acc: 0.98351
global step 6900, epoch: 10, batch: 564, loss: 0.05723, acc: 0.98344
global step 6910, epoch: 10, batch: 574, loss: 0.00518, acc: 0.98310
global step 6920, epoch: 10, batch: 584, loss: 0.01201, acc: 0.98281
global step 6930, epoch: 10, batch: 594, loss: 0.07870, acc: 0.98221
global step 6940, epoch: 10, batch: 604, loss: 0.01748, acc: 0.98237
global step 6950, epoch: 10, batch: 614, loss: 0.01542, acc: 0.98208
global step 6960, epoch: 10, batch: 624, loss: 0.01469, acc: 0.98184
global step 6970, epoch: 10, batch: 634, loss: 0.07767, acc: 0.98189
global step 6980, epoch: 10, batch: 644, loss: 0.01516, acc: 0.98186
global step 6990, epoch: 10, batch: 654, loss: 0.02567, acc: 0.98125
global step 7000, epoch: 10, batch: 664, loss: 0.09072, acc: 0.98102
global step 7010, epoch: 10, batch: 674, loss: 0.07557, acc: 0.98080
global step 7020, epoch: 10, batch: 684, loss: 0.13695, acc: 0.98047
global step 7030, epoch: 10, batch: 694, loss: 0.09411, acc: 0.98016
global step 7040, epoch: 10, batch: 704, loss: 0.10656, acc: 0.98007
tokenizer.save_pretrained(save_dir)
V. Model Prediction
1. Process the test data
import pandas as pd
from paddlenlp.datasets import load_dataset
import paddlenlp as ppnlp
from functools import partial
from paddlenlp.data import Stack, Tuple, Pad
from utils import convert_example, create_dataloader
import os
import numpy as np
import paddle
import paddle.nn.functional as F
test = pd.read_csv('dataset/test.csv', sep='\t')
sub = pd.read_csv('dataset/Datawhale_学术论文分类_数据集/sample_submit.csv')
train = pd.read_csv('dataset/train.csv', sep='\t')
# Rebuild the label mapping from train.csv so ids match those used in training
label_id2cate = dict(enumerate(train.categories.unique()))
label_cate2id = {value: key for key, value in label_id2cate.items()}
# Use the title as the input text, as in training (the abstract is not concatenated)
test['text'] = test['title']
print(label_id2cate)
print(label_cate2id)
print(len(label_cate2id))
{0: 'cs.CL', 1: 'cs.NE', 2: 'cs.DL', 3: 'cs.CV', 4: 'cs.LG', 5: 'cs.DS', 6: 'cs.IR', 7: 'cs.RO', 8: 'cs.DM', 9: 'cs.CR', 10: 'cs.AR', 11: 'cs.NI', 12: 'cs.AI', 13: 'cs.SE', 14: 'cs.CG', 15: 'cs.LO', 16: 'cs.SY', 17: 'cs.GR', 18: 'cs.PL', 19: 'cs.SI', 20: 'cs.OH', 21: 'cs.HC', 22: 'cs.MA', 23: 'cs.GT', 24: 'cs.ET', 25: 'cs.FL', 26: 'cs.CC', 27: 'cs.DB', 28: 'cs.DC', 29: 'cs.CY', 30: 'cs.CE', 31: 'cs.MM', 32: 'cs.NA', 33: 'cs.PF', 34: 'cs.OS', 35: 'cs.SD', 36: 'cs.SC', 37: 'cs.MS', 38: 'cs.GL'}
{'cs.CL': 0, 'cs.NE': 1, 'cs.DL': 2, 'cs.CV': 3, 'cs.LG': 4, 'cs.DS': 5, 'cs.IR': 6, 'cs.RO': 7, 'cs.DM': 8, 'cs.CR': 9, 'cs.AR': 10, 'cs.NI': 11, 'cs.AI': 12, 'cs.SE': 13, 'cs.CG': 14, 'cs.LO': 15, 'cs.SY': 16, 'cs.GR': 17, 'cs.PL': 18, 'cs.SI': 19, 'cs.OH': 20, 'cs.HC': 21, 'cs.MA': 22, 'cs.GT': 23, 'cs.ET': 24, 'cs.FL': 25, 'cs.CC': 26, 'cs.DB': 27, 'cs.DC': 28, 'cs.CY': 29, 'cs.CE': 30, 'cs.MM': 31, 'cs.NA': 32, 'cs.PF': 33, 'cs.OS': 34, 'cs.SD': 35, 'cs.SC': 36, 'cs.MS': 37, 'cs.GL': 38}
39
# read test data (the label is a placeholder, since test rows have no category)
def read_test(pd_data):
    for index, item in pd_data.iterrows():
        yield {'text': item['text'], 'label': 0, 'qid': item['paperid'].strip('test_')}
test_ds = load_dataset(read_test, pd_data=test, lazy=False)
for i in range(5):
    print(test_ds[i])
{'text': 'Analyzing 2.3 Million Maven Dependencies to Reveal an Essential Core in\n APIs', 'label': 0, 'qid': '00000'}
{'text': 'Finding Higher Order Mutants Using Variational Execution', 'label': 0, 'qid': '00001'}
{'text': 'Automatic Detection of Search Tactic in Individual Information Seeking:\n A Hidden Markov Model Approach', 'label': 0, 'qid': '00002'}
{'text': 'Polygon Simplification by Minimizing Convex Corners', 'label': 0, 'qid': '00003'}
{'text': 'Differentially passive circuits that switch and oscillate', 'label': 0, 'qid': '00004'}
print(len(test_ds))
10000
import paddlenlp as ppnlp
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

# Load the model by name
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en", num_classes=39)
# Load the matching tokenizer by name; it handles text processing such as tokenization and token-to-id conversion
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en")
[2021-07-25 01:21:10,126] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.pdparams
[2021-07-25 01:21:14,575] [ INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.vocab.txt
max_seq_length = 40
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack(dtype="int64")  # qid
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=80,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
2. Load the model
# Change the parameter path to match your own run
import os
import paddle

params_path = 'checkpoint/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    # Load the model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)
Loaded parameters from checkpoint/model_state.pdparams
3. Predict
import os
from functools import partial
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
from utils import create_dataloader

results = []
# Switch the model to evaluation mode to disable dropout and other sources of randomness
model.eval()
for batch in test_data_loader:
    input_ids, token_type_ids, qids = batch
    # Feed the data to the model
    logits = model(input_ids, token_type_ids)
    # Predict the class
    probs = F.softmax(logits, axis=-1)
    idx = paddle.argmax(probs, axis=1).numpy()
    idx = idx.tolist()
    labels = [label_id2cate[i] for i in idx]
    qids = qids.numpy().tolist()
    # Predictions are collected in dataset order; qids are not stored
    results.extend(labels)
print(results[:5])
print(len(results))
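The prediction step turns each row of logits into a category name by taking the index of the largest score and looking it up in label_id2cate. A framework-free sketch of that decoding follows; the logits are made-up numbers and the mapping is truncated to the first three entries of the dictionary printed earlier.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# First three entries of the label mapping shown earlier in the post.
label_id2cate = {0: "cs.CL", 1: "cs.NE", 2: "cs.DL"}

# Made-up logits for a batch of two papers.
batch_logits = [[2.0, 0.5, -1.0], [-0.3, 0.1, 1.7]]
predictions = [label_id2cate[argmax(softmax(row))] for row in batch_logits]
print(predictions)
```

Softmax preserves the ordering of the scores, so taking the argmax of the raw logits gives the same labels; the probabilities only matter if confidence values are needed.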
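4. Save and submit
# Rows of sample_submit.csv are assumed to be in the same order as test.csv
sub = pd.read_csv('dataset/Datawhale_学术论文分类_数据集/sample_submit.csv')
sub['categories'] = results
sub.to_csv('submit.csv', index=False)
!zip -qr result.zip submit.csv
VI. Submission Result
The submission ranked 14th.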