I. Academic Paper Classification Challenge

Competition link: challenge.xfyun.cn/h5/invite?i…

1. Background

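As artificial intelligence advances, a large number of papers are published every week. Classifying them has become a practical problem that researchers and research institutions face daily. Participants are asked to build a paper classification model.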

2. Task

Participants use the information provided for each paper (paper id, title, and abstract) to predict its specific category.

Sample record (fields are separated by \t):

paperid: 9821

title: Calculation of prompt diphoton production cross sections at Tevatron and LHC energies

abstract: A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy.

categories: hep-ph

3. Submission file format

Submit a CSV file encoded in UTF-8 whose first line is the header.

Before submitting, make sure the predictions follow the same format as sample_submit.csv, shown here:

paperid,categories

test_00000,cs.CV

test_00001,cs.DC

test_00002,cs.AI

test_00003,cs.NI

test_00004,cs.SE
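
The format rules above can be checked before uploading. The following is a minimal sketch using only the standard library; the helper name check_submission and the specific checks (header, two non-empty fields per row, no duplicate ids, optional row count) are assumptions for illustration, not the organizers' validator.

```python
import csv
import io


def check_submission(csv_text, expected_rows=None):
    """Return a list of problems found in a submission; an empty list means none were found."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    # The first line must be the header.
    if not rows or rows[0] != ["paperid", "categories"]:
        problems.append("first line must be the header 'paperid,categories'")
        return problems
    body = rows[1:]
    # Every data row needs exactly two non-empty fields.
    for line_no, row in enumerate(body, start=2):
        if len(row) != 2 or not row[0] or not row[1]:
            problems.append(f"line {line_no}: expected two non-empty fields")
    # Each paperid should appear only once.
    ids = [row[0] for row in body if row]
    if len(ids) != len(set(ids)):
        problems.append("duplicate paperid values")
    # Optionally compare against the number of test records.
    if expected_rows is not None and len(body) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(body)}")
    return problems


good = "paperid,categories\ntest_00000,cs.CV\ntest_00001,cs.DC\n"
bad = "test_00000,cs.CV\ntest_00001,cs.DC\n"  # header is missing
print(check_submission(good, expected_rows=2))  # → []
print(check_submission(bad))
```

In practice the text would come from reading submit.csv with UTF-8 encoding, and expected_rows would be the number of rows in the test set.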

II. Data Processing

1. Upgrade paddlenlp

!pip install -U paddlenlp

Found existing installation: paddlenlp 2.0.1
    Uninstalling paddlenlp-2.0.1:
      Successfully uninstalled paddlenlp-2.0.1
Successfully installed paddlenlp-2.0.5

import pandas as pd
from paddlenlp.datasets import load_dataset
import paddlenlp as ppnlp
from functools import partial
from paddlenlp.data import Stack, Tuple, Pad
from utils import  convert_example, create_dataloader
import os
import numpy as np
import paddle
import paddle.nn.functional as F

2. Unzip the data

# Unzip the dataset archives
# !unzip -oq /home/aistudio/data/data100202/Datawhale_学术论文分类_数据集.zip -d dataset
# !rm dataset/__MACOSX/ -rf
# !unzip -oq /home/aistudio/dataset/Datawhale_学术论文分类_数据集/test.csv.zip -d dataset/
# !unzip -oq /home/aistudio/dataset/Datawhale_学术论文分类_数据集/train.csv.zip -d dataset/

3. Inspect the data

# Submission format
!head dataset/Datawhale_学术论文分类_数据集/sample_submit.csv

paperid,categories
test_00000,cs.CV
test_00001,cs.CV
test_00002,cs.CV
test_00003,cs.CV
test_00004,cs.CV
test_00005,cs.CV
test_00006,cs.CV
test_00007,cs.CV
test_00008,cs.CV

# Format of the training data
!head -n20 dataset/train.csv

paperid title abstract  categories
train_00000 "Hard but Robust, Easy but Sensitive: How Encoder and Decoder Perform in
  Neural Machine Translation" "  Neural machine translation (NMT) typically adopts the encoder-decoder
framework. A good understanding of the characteristics and functionalities of
the encoder and decoder can help to explain the pros and cons of the framework,
and design better models for NMT. In this work, we conduct an empirical study
on the encoder and the decoder in NMT, taking Transformer as an example. We
find that 1) the decoder handles an easier task than the encoder in NMT, 2) the
decoder is more sensitive to the input noise than the encoder, and 3) the
preceding words/tokens in the decoder provide strong conditional information,
which accounts for the two observations above. We hope those observations can
shed light on the characteristics of the encoder and decoder and inspire future
research on NMT.
" cs.CL
train_00001 An Easy-to-use Real-world Multi-objective Optimization Problem Suite  "  Although synthetic test problems are widely used for the performance
assessment of evolutionary multi-objective optimization algorithms, they are
likely to include unrealistic properties which may lead to
overestimation/underestimation. To address this issue, we present a
multi-objective optimization problem suite consisting of 16 bound-constrained
real-world problems. The problem suite includes various problems in terms of

# Format of the test data
!head dataset/test.csv

paperid title abstract
test_00000  "Analyzing 2.3 Million Maven Dependencies to Reveal an Essential Core in
  APIs" "  This paper addresses the following question: does a small, essential, core
set of API members emerges from the actual usage of the API by client
applications? To investigate this question, we study the 99 most popular
libraries available in Maven Central and the 865,560 client programs that
declare dependencies towards them, summing up to 2.3M dependencies. Our key
findings are as follows: 43.5% of the dependencies declared by the clients are
not used in the bytecode; all APIs contain a large part of rarely used types
and a few frequently used types, and the ratio varies according to the nature
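
The previews above show that titles and abstracts are quoted and can span several lines, so the files cannot simply be split line by line. As a minimal sketch using only the standard library (the sample rows and the helper names read_tsv and normalize_whitespace are made up for illustration), a csv reader with a tab delimiter keeps such multi-line fields intact:

```python
import csv
import io

# A tiny in-memory stand-in for train.csv (hypothetical rows, same layout):
# tab-separated, with quoted fields that contain line breaks.
raw = (
    "paperid\ttitle\tabstract\tcategories\n"
    'train_00000\t"A title that wraps\n  onto a second line"'
    '\t"  First line of the abstract\nsecond line.\n"\tcs.CL\n'
    "train_00001\tA single-line title\tA single-line abstract\tcs.NE\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts, one per record."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter="\t")
    return list(reader)


def normalize_whitespace(value):
    """Collapse line breaks and repeated spaces into single spaces."""
    return " ".join(value.split())


rows = read_tsv(raw)
print(len(rows))                               # → 2
print(normalize_whitespace(rows[0]["title"]))  # → A title that wraps onto a second line
```

pd.read_csv with sep='\t', as used in the next step, handles the quoting in the same way; the whitespace normalization is an optional extra that the baseline does not apply.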

4. Define a custom read method

import pandas as pd
train = pd.read_csv('dataset/train.csv', sep='\t')
test = pd.read_csv('dataset/test.csv', sep='\t')
sub = pd.read_csv('dataset/Datawhale_学术论文分类_数据集/sample_submit.csv')
# Use only the title as the input text (the abstract is not concatenated here)
train['text'] = train['title']

# Map each category string to an integer label, in order of first appearance
label_id2cate = dict(enumerate(train.categories.unique()))
label_cate2id = {value: key for key, value in label_id2cate.items()}
train['label'] = train['categories'].map(label_cate2id)
train = train[['text', 'label', 'paperid']]
train_y = train["label"]
# First 45,000 rows for training, the remainder for validation
train_df = train[['text', 'label', 'paperid']][:45000]
eval_df = train[['text', 'label', 'paperid']][45000:]

print(label_id2cate)

{0: 'cs.CL', 1: 'cs.NE', 2: 'cs.DL', 3: 'cs.CV', 4: 'cs.LG', 5: 'cs.DS', 6: 'cs.IR', 7: 'cs.RO', 8: 'cs.DM', 9: 'cs.CR', 10: 'cs.AR', 11: 'cs.NI', 12: 'cs.AI', 13: 'cs.SE', 14: 'cs.CG', 15: 'cs.LO', 16: 'cs.SY', 17: 'cs.GR', 18: 'cs.PL', 19: 'cs.SI', 20: 'cs.OH', 21: 'cs.HC', 22: 'cs.MA', 23: 'cs.GT', 24: 'cs.ET', 25: 'cs.FL', 26: 'cs.CC', 27: 'cs.DB', 28: 'cs.DC', 29: 'cs.CY', 30: 'cs.CE', 31: 'cs.MM', 32: 'cs.NA', 33: 'cs.PF', 34: 'cs.OS', 35: 'cs.SD', 36: 'cs.SC', 37: 'cs.MS', 38: 'cs.GL'}

print(label_cate2id)

{'cs.CL': 0, 'cs.NE': 1, 'cs.DL': 2, 'cs.CV': 3, 'cs.LG': 4, 'cs.DS': 5, 'cs.IR': 6, 'cs.RO': 7, 'cs.DM': 8, 'cs.CR': 9, 'cs.AR': 10, 'cs.NI': 11, 'cs.AI': 12, 'cs.SE': 13, 'cs.CG': 14, 'cs.LO': 15, 'cs.SY': 16, 'cs.GR': 17, 'cs.PL': 18, 'cs.SI': 19, 'cs.OH': 20, 'cs.HC': 21, 'cs.MA': 22, 'cs.GT': 23, 'cs.ET': 24, 'cs.FL': 25, 'cs.CC': 26, 'cs.DB': 27, 'cs.DC': 28, 'cs.CY': 29, 'cs.CE': 30, 'cs.MM': 31, 'cs.NA': 32, 'cs.PF': 33, 'cs.OS': 34, 'cs.SD': 35, 'cs.SC': 36, 'cs.MS': 37, 'cs.GL': 38}
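
The two dictionaries printed above number the categories in the order they first appear in train.csv, so the ids depend on the row order of that file. The sketch below reproduces the idea with plain Python; the function name build_label_maps and the sample category list are hypothetical.

```python
def build_label_maps(categories):
    """Map each distinct category to an integer id in order of first appearance."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique_in_order = list(dict.fromkeys(categories))
    id2cate = dict(enumerate(unique_in_order))
    cate2id = {cate: idx for idx, cate in id2cate.items()}
    return id2cate, cate2id


# Hypothetical category column, in row order.
categories = ["cs.CL", "cs.NE", "cs.DL", "cs.CL", "cs.CV", "cs.NE"]
id2cate, cate2id = build_label_maps(categories)
labels = [cate2id[c] for c in categories]
print(id2cate)  # → {0: 'cs.CL', 1: 'cs.NE', 2: 'cs.DL', 3: 'cs.CV'}
print(labels)   # → [0, 1, 2, 0, 3, 1]
```

Because prediction has to translate ids back to category names, the mapping used at inference time must come from the same file in the same row order, or be saved alongside the checkpoint.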

train_df.describe

<bound method NDFrame.describe of                                                     text  label      paperid
0      Hard but Robust, Easy but Sensitive: How Encod...      0  train_00000
1      An Easy-to-use Real-world Multi-objective Opti...      1  train_00001
2      Exploration of reproducibility issues in scien...      2  train_00002
3                    Scheduled Sampling for Transformers      0  train_00003
4      Hybrid Forests for Left Ventricle Segmentation...      3  train_00004
...                                                  ...    ...          ...
44995                 Categorizing Comparative Sentences      0  train_44995
44996  Fractional differentiation based image processing      3  train_44996
44997  A Misanthropic Reinterpretation of the Chinese...     12  train_44997
44998  Towards Purely Unsupervised Disentanglement of...      3  train_44998
44999  A Color Quantization Optimization Approach for...      3  train_44999
[45000 rows x 3 columns]>

from paddlenlp.datasets import load_dataset
# Read the training data: yield one dict per row
def read(pd_data):
    for index, item in pd_data.iterrows():
        # qid is the numeric part of the id, e.g. 'train_00000' -> '00000'
        yield {'text': item['text'], 'label': item['label'], 'qid': item['paperid'].split('_')[-1]}
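
A note on extracting the numeric part of an id such as train_00000: str.strip(chars) removes any leading and trailing characters that occur in chars; it does not remove a literal prefix. It gives the expected result for these ids only because digits are not in the character set. A minimal sketch of a prefix-safe alternative follows (id_suffix is a hypothetical helper name, and train_a17 is a made-up id used to show the difference):

```python
def id_suffix(paperid):
    """Return the part after the last underscore, e.g. 'train_00042' -> '00042'."""
    return paperid.rsplit("_", 1)[-1]


# For purely numeric suffixes both approaches agree.
print("train_00042".strip("train_"))  # → 00042
print(id_suffix("train_00042"))       # → 00042

# With a suffix that starts with a letter from the stripped set they differ.
print("train_a17".strip("train_"))    # → 17
print(id_suffix("train_a17"))         # → a17
```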

5. Load the data

# pd_data is passed through to read() as its argument
train_ds = load_dataset(read, pd_data=train_df,lazy=False)
dev_ds = load_dataset(read, pd_data=eval_df,lazy=False)

for i in range(5):
    print(train_ds[i])

{'text': 'Hard but Robust, Easy but Sensitive: How Encoder and Decoder Perform in\n  Neural Machine Translation', 'label': 0, 'qid': '00000'}
{'text': 'An Easy-to-use Real-world Multi-objective Optimization Problem Suite', 'label': 1, 'qid': '00001'}
{'text': 'Exploration of reproducibility issues in scientometric research Part 1:\n  Direct reproducibility', 'label': 2, 'qid': '00002'}
{'text': 'Scheduled Sampling for Transformers', 'label': 0, 'qid': '00003'}
{'text': 'Hybrid Forests for Left Ventricle Segmentation using only the first\n  slice label', 'label': 3, 'qid': '00004'}

III. Using a Pretrained Model

1. Choose a pretrained model

import paddlenlp as ppnlp
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# Load the model in one step by specifying its name
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en", num_classes=len(train.label.unique()))
# Likewise, load the matching tokenizer by name; it processes the text, e.g. splitting it into tokens and converting tokens to ids.
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en")

[2021-07-25 01:19:30,419] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.pdparams
[2021-07-25 01:19:40,495] [    INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.vocab.txt

2. Data loading

Use the paddle.io.DataLoader interface to load the data asynchronously with multiple workers.

import os
from functools import partial
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
from utils import create_dataloader
def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False):
    # Convert a raw example into the format the model reads; encoded_inputs is a dict with fields such as input_ids and token_type_ids
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)
    # input_ids: the vocabulary ids of the tokens obtained by tokenizing the text
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids: whether each token belongs to sentence 1 or sentence 2 (the segment ids)
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        # label: the integer id of the paper category
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # qid: the numeric id of each record
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid

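# Batch size
batch_size = 128
# Maximum length of the text sequence
max_seq_length = 50
# Convert each example into the format the model reads
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
# Assemble examples into a batch:
# pad text sequences of different lengths to the longest length in the batch,
# and stack the labels of all examples together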
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
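
To make the batching step concrete, here is a plain-Python sketch of what padding and stacking a batch amounts to. It is an illustration, not PaddleNLP's Pad/Stack/Tuple implementation: the helper names pad_batch and batchify, the token ids, and the pad value 0 are all assumptions.

```python
def pad_batch(sequences, pad_val=0):
    """Right-pad each list of ids to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_val] * (max_len - len(seq)) for seq in sequences]


def batchify(samples, pad_token_id=0, pad_token_type_id=0):
    """Turn [(input_ids, token_type_ids, label), ...] into three batch-level lists."""
    input_ids, token_type_ids, labels = zip(*samples)
    return (
        pad_batch(list(input_ids), pad_token_id),
        pad_batch(list(token_type_ids), pad_token_type_id),
        list(labels),
    )


# Two hypothetical examples of different lengths.
samples = [
    ([101, 7, 8, 102], [0, 0, 0, 0], [3]),
    ([101, 9, 102], [0, 0, 0], [0]),
]
batch_input_ids, batch_type_ids, batch_labels = batchify(samples)
print(batch_input_ids)  # → [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch_labels)     # → [[3], [0]]
```

Padding to the longest sequence in each batch, rather than to a fixed global length, keeps batches of short titles small.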

3. Set the fine-tuning optimization strategy and the evaluation metric

IV. Model Training and Evaluation

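Model training usually follows these steps:

Take one batch of data from the dataloader.
Feed the batch to the model for the forward pass.
Pass the forward output to the loss function to compute the loss, and to the metric to compute the evaluation score.
Backpropagate the loss and update the parameters.

These steps repeat for every batch. In the code that follows, the model is evaluated on the validation set every 100 training steps to track how training is going.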

1. Parameter configuration

from paddlenlp.transformers import LinearDecayWithWarmup
import paddle
# Number of training epochs
epochs = 10
# len(train_data_loader) is the number of steps in one epoch
num_training_steps = len(train_data_loader) * epochs
# AdamW optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-5,
    parameters=model.parameters())
# Cross-entropy loss
criterion = paddle.nn.loss.CrossEntropyLoss()
# Accuracy metric
metric = paddle.metric.Accuracy()
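
The configuration above imports LinearDecayWithWarmup and computes num_training_steps, but the optimizer is given a constant learning rate of 2e-5, so no schedule is actually applied. For readers who want to see what a warmup-then-linear-decay schedule computes, here is a pure-Python sketch of its general shape. It is not PaddleNLP's implementation; the function name and the example values for total_steps and warmup_steps are assumptions.

```python
def linear_decay_with_warmup(step, base_lr, total_steps, warmup_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: grow linearly from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: shrink linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


base_lr = 2e-5
total_steps = 1000   # hypothetical; the baseline would use num_training_steps
warmup_steps = 100   # hypothetical 10% warmup
for step in (0, 50, 100, 550, 1000):
    print(step, linear_decay_with_warmup(step, base_lr, total_steps, warmup_steps))
```

The rate peaks at base_lr when warmup ends and falls back to zero at the last step.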

2. Add VisualDL logging

# Log scalars so they can be displayed in VisualDL
from visualdl import LogWriter
writer = LogWriter("./log")

3. The evaluate method

@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
    # Accuracy accumulated over the whole validation set
    accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
    # Log the evaluation results; writer and global_step are module-level globals
    writer.add_scalar(tag="eval/loss", step=global_step, value=np.mean(losses))
    writer.add_scalar(tag="eval/acc", step=global_step, value=accu)
    model.train()
    metric.reset()
    return accu
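
evaluate resets the metric before and after its loop, and the same metric object is shared with training. The accuracy it reports is therefore a running value over every batch seen since the last reset. The sketch below mimics that update/accumulate/reset behaviour in plain Python; the class RunningAccuracy and the sample logits are hypothetical and are not paddle.metric.Accuracy itself.

```python
class RunningAccuracy:
    """Accumulate top-1 accuracy across batches until reset() is called."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.correct = 0
        self.total = 0

    def update(self, logits, labels):
        for row, label in zip(logits, labels):
            # Predicted class = index of the largest logit.
            pred = max(range(len(row)), key=row.__getitem__)
            self.correct += int(pred == label)
            self.total += 1

    def accumulate(self):
        return self.correct / self.total if self.total else 0.0


metric_sketch = RunningAccuracy()
metric_sketch.update([[2.0, 0.1], [0.3, 0.9]], [0, 0])  # 1 of 2 correct
after_first = metric_sketch.accumulate()
metric_sketch.update([[0.1, 1.5]], [1])                 # 1 of 1 correct
after_second = metric_sketch.accumulate()
print(after_first)   # → 0.5
print(after_second)  # 2 correct out of 3 so far
metric_sketch.reset()
print(metric_sketch.accumulate())  # → 0.0
```

This is also why the training accuracy printed every 10 steps is an average over all batches since the most recent evaluation, not the accuracy of the current batch alone.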

4. Start training

save_dir = "checkpoint"
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
global_step = 0
pre_accu = 0
accu = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        # Pass the segment ids as well, consistent with evaluate()
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        global_step += 1
        if global_step % 10 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        # Evaluate on the validation set every 100 steps
        if global_step % 100 == 0:
            accu = evaluate(model, criterion, metric, dev_data_loader)
            # Log the training loss and accuracy
            writer.add_scalar(tag="train/loss", step=global_step, value=loss)
            writer.add_scalar(tag="train/acc", step=global_step, value=acc)
            # Save the parameters whenever validation accuracy improves
            if accu > pre_accu:
                save_param_path = os.path.join(save_dir, 'model_state.pdparams')
                paddle.save(model.state_dict(), save_param_path)
                pre_accu = accu
tokenizer.save_pretrained(save_dir)

5. Training log

eval loss: 1.22675, accu: 0.77660
global step 6810, epoch: 10, batch: 474, loss: 0.01697, acc: 0.99219
global step 6820, epoch: 10, batch: 484, loss: 0.04531, acc: 0.98984
global step 6830, epoch: 10, batch: 494, loss: 0.03325, acc: 0.98854
global step 6840, epoch: 10, batch: 504, loss: 0.04574, acc: 0.98672
global step 6850, epoch: 10, batch: 514, loss: 0.02137, acc: 0.98625
global step 6860, epoch: 10, batch: 524, loss: 0.19356, acc: 0.98516
global step 6870, epoch: 10, batch: 534, loss: 0.03456, acc: 0.98482
global step 6880, epoch: 10, batch: 544, loss: 0.09647, acc: 0.98438
global step 6890, epoch: 10, batch: 554, loss: 0.11611, acc: 0.98351
global step 6900, epoch: 10, batch: 564, loss: 0.05723, acc: 0.98344
global step 6910, epoch: 10, batch: 574, loss: 0.00518, acc: 0.98310
global step 6920, epoch: 10, batch: 584, loss: 0.01201, acc: 0.98281
global step 6930, epoch: 10, batch: 594, loss: 0.07870, acc: 0.98221
global step 6940, epoch: 10, batch: 604, loss: 0.01748, acc: 0.98237
global step 6950, epoch: 10, batch: 614, loss: 0.01542, acc: 0.98208
global step 6960, epoch: 10, batch: 624, loss: 0.01469, acc: 0.98184
global step 6970, epoch: 10, batch: 634, loss: 0.07767, acc: 0.98189
global step 6980, epoch: 10, batch: 644, loss: 0.01516, acc: 0.98186
global step 6990, epoch: 10, batch: 654, loss: 0.02567, acc: 0.98125
global step 7000, epoch: 10, batch: 664, loss: 0.09072, acc: 0.98102
global step 7010, epoch: 10, batch: 674, loss: 0.07557, acc: 0.98080
global step 7020, epoch: 10, batch: 684, loss: 0.13695, acc: 0.98047
global step 7030, epoch: 10, batch: 694, loss: 0.09411, acc: 0.98016
global step 7040, epoch: 10, batch: 704, loss: 0.10656, acc: 0.98007


V. Model Prediction

1. Prepare the test data

import pandas as pd
from paddlenlp.datasets import load_dataset
import paddlenlp as ppnlp
from functools import partial
from paddlenlp.data import Stack, Tuple, Pad
from utils import  convert_example, create_dataloader
import os
import numpy as np
import paddle
import paddle.nn.functional as F

test = pd.read_csv('dataset/test.csv', sep='\t')
sub = pd.read_csv('dataset/Datawhale_学术论文分类_数据集/sample_submit.csv')
train = pd.read_csv('dataset/train.csv', sep='\t')
label_id2cate = dict(enumerate(train.categories.unique()))
label_cate2id = {value: key for key, value in label_id2cate.items()}

# Use only the title as the input text (the abstract is not concatenated here)
test['text'] = test['title']

print(label_id2cate)
print(label_cate2id)
print(len(label_cate2id))

{0: 'cs.CL', 1: 'cs.NE', 2: 'cs.DL', 3: 'cs.CV', 4: 'cs.LG', 5: 'cs.DS', 6: 'cs.IR', 7: 'cs.RO', 8: 'cs.DM', 9: 'cs.CR', 10: 'cs.AR', 11: 'cs.NI', 12: 'cs.AI', 13: 'cs.SE', 14: 'cs.CG', 15: 'cs.LO', 16: 'cs.SY', 17: 'cs.GR', 18: 'cs.PL', 19: 'cs.SI', 20: 'cs.OH', 21: 'cs.HC', 22: 'cs.MA', 23: 'cs.GT', 24: 'cs.ET', 25: 'cs.FL', 26: 'cs.CC', 27: 'cs.DB', 28: 'cs.DC', 29: 'cs.CY', 30: 'cs.CE', 31: 'cs.MM', 32: 'cs.NA', 33: 'cs.PF', 34: 'cs.OS', 35: 'cs.SD', 36: 'cs.SC', 37: 'cs.MS', 38: 'cs.GL'}
{'cs.CL': 0, 'cs.NE': 1, 'cs.DL': 2, 'cs.CV': 3, 'cs.LG': 4, 'cs.DS': 5, 'cs.IR': 6, 'cs.RO': 7, 'cs.DM': 8, 'cs.CR': 9, 'cs.AR': 10, 'cs.NI': 11, 'cs.AI': 12, 'cs.SE': 13, 'cs.CG': 14, 'cs.LO': 15, 'cs.SY': 16, 'cs.GR': 17, 'cs.PL': 18, 'cs.SI': 19, 'cs.OH': 20, 'cs.HC': 21, 'cs.MA': 22, 'cs.GT': 23, 'cs.ET': 24, 'cs.FL': 25, 'cs.CC': 26, 'cs.DB': 27, 'cs.DC': 28, 'cs.CY': 29, 'cs.CE': 30, 'cs.MM': 31, 'cs.NA': 32, 'cs.PF': 33, 'cs.OS': 34, 'cs.SD': 35, 'cs.SC': 36, 'cs.MS': 37, 'cs.GL': 38}
39

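# Read the test data; label 0 is a placeholder because test labels are unknown
def read_test(pd_data):
    for index, item in pd_data.iterrows():
        # qid is the numeric part of the id, e.g. 'test_00000' -> '00000'
        yield {'text': item['text'], 'label': 0, 'qid': item['paperid'].split('_')[-1]}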

test_ds =  load_dataset(read_test, pd_data=test,lazy=False)

for i in range(5):
    print(test_ds[i])

{'text': 'Analyzing 2.3 Million Maven Dependencies to Reveal an Essential Core in\n  APIs', 'label': 0, 'qid': '00000'}
{'text': 'Finding Higher Order Mutants Using Variational Execution', 'label': 0, 'qid': '00001'}
{'text': 'Automatic Detection of Search Tactic in Individual Information Seeking:\n  A Hidden Markov Model Approach', 'label': 0, 'qid': '00002'}
{'text': 'Polygon Simplification by Minimizing Convex Corners', 'label': 0, 'qid': '00003'}
{'text': 'Differentially passive circuits that switch and oscillate', 'label': 0, 'qid': '00004'}

print(len(test_ds))

import paddlenlp as ppnlp
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# Load the model in one step by specifying its name
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en", num_classes=39)
# Likewise, load the matching tokenizer by name; it processes the text, e.g. splitting it into tokens and converting tokens to ids.
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en")

[2021-07-25 01:21:10,126] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.pdparams
[2021-07-25 01:21:14,575] [    INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_2.0_large_en/skep_ernie_2.0_large_en.vocab.txt

max_seq_length = 40
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)

batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack(dtype="int64")  # label
): [data for data in fn(samples)]

test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=80,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

2. Load the model

# Change the parameter path to match your own run
import os
import paddle
params_path = 'checkpoint/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    # Load the model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)

Loaded parameters from checkpoint/model_state.pdparams

3. Predict

import os
from functools import partial
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
from utils import create_dataloader
results = []
# Switch the model to evaluation mode, which disables dropout and other sources of randomness
model.eval()
for batch in test_data_loader:
    input_ids, token_type_ids, qids = batch
    # Feed the batch to the model
    logits = model(input_ids, token_type_ids)
    # Predict the class
    probs = F.softmax(logits, axis=-1)
    idx = paddle.argmax(probs, axis=1).numpy()
    idx = idx.tolist()
    labels = [label_id2cate[i] for i in idx]
    qids = qids.numpy().tolist()
    results.extend( labels)

print(results[:5])
print(len(results))
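
The prediction loop applies softmax and then takes the argmax. Softmax is monotonic, so the predicted class is the same whether the argmax is taken over the probabilities or over the raw logits; the probabilities only matter if confidence scores are needed. A small pure-Python sketch (the logits and the four-entry id-to-category table are made up for illustration):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits for one paper and a hypothetical id-to-category table.
logits = [1.2, -0.3, 3.1, 0.0]
id2cate_demo = {0: "cs.CL", 1: "cs.NE", 2: "cs.DL", 3: "cs.CV"}

probs = softmax(logits)
print(argmax(probs), argmax(logits))  # → 2 2
print(id2cate_demo[argmax(probs)])    # → cs.DL
```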

4. Save the submission

sub = pd.read_csv('dataset/Datawhale_学术论文分类_数据集/sample_submit.csv')
sub['categories'] = results
sub.to_csv('submit.csv', index=False)

!zip -qr result.zip submit.csv

VI. Submission Result

The submission ranked 14th.

Baseline for the Academic Paper Classification Challenge, based on PaddleNLP