A Treat for NLPers: The Chinese Language Understanding Evaluation Benchmark [CLUEbenchmark] (Part 2)

Summary: A Treat for NLPers: The Chinese Language Understanding Evaluation Benchmark [CLUEbenchmark] (Part 2)

Detailed Comparison Across Tasks


Evaluation of Different Models on Each Dataset


AFQMC Ant Semantic Similarity (Accuracy):


| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| ALBERT-xxlarge | - | - | - |
| ALBERT-tiny | 69.13% | 69.92% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| BERT-base | 74.16% | 73.70% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| BERT-wwm-ext-base | 73.74% | 74.07% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| ERNIE-base | 74.88% | 73.83% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-large | 73.32% | 74.02% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| XLNet-mid | 70.73% | 70.50% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-ext | 74.30% | 74.04% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-large-ext | 74.92% | 76.55% | batch_size=16, length=128, epoch=3, lr=2e-5 |
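
The training parameters above (batch_size=16, length=128, epoch=3, lr=2e-5) describe a standard sentence-pair fine-tuning setup. The sketch below shows what reproducing the BERT-base row might look like with the Hugging Face transformers and datasets libraries; it is a minimal illustration rather than the official CLUE baseline code, and the "clue"/"afqmc" dataset name and its sentence1/sentence2/label fields are assumptions about the community dataset mirror.

```python
# Minimal sketch (not the official CLUE baseline) of fine-tuning BERT-base on AFQMC
# with the hyperparameters listed in the table above.
from datasets import load_dataset
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

dataset = load_dataset("clue", "afqmc")          # assumed Hub mirror of the CLUE data
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

def encode(batch):
    # AFQMC is a sentence-pair similarity task, so both sentences are fed together.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(encode, batched=True)

model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

args = TrainingArguments(
    output_dir="afqmc-bert-base",
    per_device_train_batch_size=16,   # batch_size=16
    num_train_epochs=3,               # epoch=3
    learning_rate=2e-5,               # lr=2e-5
)

Trainer(model=model, args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"]).train()
```

The same recipe (with the per-table batch size, sequence length, epochs, and learning rate) applies to the other classification-style tasks below.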


TNEWS' Toutiao News Classification (Accuracy):


| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| ALBERT-xxlarge | - | - | - |
| ALBERT-tiny | 53.55% | 53.35% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| BERT-base | 56.09% | 56.58% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| BERT-wwm-ext-base | 56.77% | 56.86% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| ERNIE-base | 58.24% | 58.33% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-large | 57.95% | 57.84% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| XLNet-mid | 56.09% | 56.24% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-ext | 57.51% | 56.94% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-large-ext | 58.32% | 58.61% | batch_size=16, length=128, epoch=3, lr=2e-5 |


IFLYTEK' Long Text Classification (Accuracy):


| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| ALBERT-xlarge | - | - | batch=32, length=128, epoch=3, lr=2e-5 |
| ALBERT-tiny | 48.76 | 48.71 | batch=32, length=128, epoch=10, lr=2e-5 |
| BERT-base | 60.37 | 60.29 | batch=32, length=128, epoch=3, lr=2e-5 |
| BERT-wwm-ext-base | 59.88 | 59.43 | batch=32, length=128, epoch=3, lr=2e-5 |
| ERNIE-base | 59.52 | 58.96 | batch=32, length=128, epoch=3, lr=2e-5 |
| RoBERTa-large | 62.6 | 62.55 | batch=24, length=128, epoch=3, lr=2e-5 |
| XLNet-mid | 57.72 | 57.85 | batch=32, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-ext | 60.8 | 60.31 | batch=32, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-large-ext | 62.75 | 62.98 | batch=24, length=128, epoch=3, lr=2e-5 |


CMNLI Chinese Multi-Genre NLI (Accuracy):


| Model | Dev (%) | Test (%) | Training parameters |
| --- | --- | --- | --- |
| BERT-base | 79.47 | 79.69 | batch=64, length=128, epoch=2, lr=3e-5 |
| BERT-wwm-ext-base | 80.92 | 80.42 | batch=64, length=128, epoch=2, lr=3e-5 |
| ERNIE-base | 80.37 | 80.29 | batch=64, length=128, epoch=2, lr=3e-5 |
| ALBERT-xxlarge | - | - | - |
| ALBERT-tiny | 70.26 | 70.61 | batch=64, length=128, epoch=2, lr=3e-5 |
| RoBERTa-large | 82.40 | 81.70 | batch=64, length=128, epoch=2, lr=3e-5 |
| XLNet-mid | 82.21 | 81.25 | batch=64, length=128, epoch=2, lr=3e-5 |
| RoBERTa-wwm-ext | 80.70 | 80.51 | batch=64, length=128, epoch=2, lr=3e-5 |
| RoBERTa-wwm-large-ext | 83.20 | 82.12 | batch=64, length=128, epoch=2, lr=3e-5 |


Note: training ALBERT-xlarge on the XNLI task still has unresolved issues for now.


WSC The Winograd Schema Challenge, Chinese Version:


| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| ALBERT-xxlarge | - | - | - |
| ALBERT-tiny | 57.7 (52.9) | 58.5 (52.1) | lr=1e-4, batch_size=8, length=128, epoch=50 |
| BERT-base | 59.6 (56.7) | 62.0 (57.9) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| BERT-wwm-ext-base | 59.4 (56.7) | 61.1 (56.2) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| ERNIE-base | 58.1 (54.9) | 60.8 (55.9) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| RoBERTa-large | 68.6 (58.7) | 72.7 (63.6) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| XLNet-mid | 60.9 (56.8) | 64.4 (57.3) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| RoBERTa-wwm-ext | 67.2 (57.7) | 67.8 (63.5) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| RoBERTa-wwm-large-ext | 69.7 (64.5) | 74.6 (69.4) | lr=2e-5, batch_size=8, length=128, epoch=50 |


CSL Keyword Recognition (Accuracy):


| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| ALBERT-xlarge | 80.23 | 80.29 | batch_size=16, length=128, epoch=2, lr=5e-6 |
| ALBERT-tiny | 74.36 | 74.56 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| BERT-base | 79.63 | 80.23 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| BERT-wwm-ext-base | 80.60 | 81.00 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| ERNIE-base | 79.43 | 79.10 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| RoBERTa-large | 81.87 | 81.36 | batch_size=4, length=256, epoch=5, lr=5e-6 |
| XLNet-mid | 82.06 | 81.26 | batch_size=4, length=256, epoch=3, lr=1e-5 |
| RoBERTa-wwm-ext | 80.67 | 80.63 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| RoBERTa-wwm-large-ext | 82.17 | 82.13 | batch_size=4, length=256, epoch=5, lr=1e-5 |


DRCD Reading Comprehension for Traditional Chinese (F1, EM):


| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| BERT-base | F1: 92.30 / EM: 86.60 | F1: 91.46 / EM: 85.49 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| BERT-wwm-ext-base | F1: 93.27 / EM: 88.00 | F1: 92.63 / EM: 87.15 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ERNIE-base | F1: 92.78 / EM: 86.85 | F1: 92.01 / EM: 86.03 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-large | F1: 93.90 / EM: 88.88 | F1: 93.06 / EM: 87.52 | batch=32, length=512, epoch=3, lr=2e-5, warmup=0.05 |
| ALBERT-xlarge | F1: 94.63 / EM: 89.68 | F1: 94.70 / EM: 89.78 | batch_size=32, length=512, epoch=3, lr=2.5e-5, warmup=0.06 |
| ALBERT-xxlarge | F1: 93.69 / EM: 89.97 | F1: 94.62 / EM: 89.67 | batch_size=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-tiny | F1: 81.51 / EM: 71.61 | F1: 80.67 / EM: 70.08 | batch=32, length=512, epoch=3, lr=2e-4, warmup=0.1 |
| RoBERTa-large | F1: 94.93 / EM: 90.11 | F1: 94.25 / EM: 89.35 | batch=32, length=256, epoch=2, lr=3e-5, warmup=0.1 |
| XLNet-mid | F1: 92.08 / EM: 84.40 | F1: 91.44 / EM: 83.28 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| RoBERTa-wwm-ext | F1: 94.26 / EM: 89.29 | F1: 93.53 / EM: 88.12 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| RoBERTa-wwm-large-ext | F1: 95.32 / EM: 90.54 | F1: 95.06 / EM: 90.70 | batch=32, length=512, epoch=2, lr=2.5e-5, warmup=0.1 |
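
A quick note on the two metrics reported for DRCD and CMRC2018: EM (exact match) checks whether the predicted answer string is identical to the reference answer, while F1 measures character-level overlap between the two strings. The snippet below is a small self-contained illustration of this convention, not the official evaluation script (which additionally handles text normalization and multiple reference answers).

```python
# Toy illustration of EM and character-level F1 for Chinese span-extraction answers.
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the predicted answer equals the reference answer, else 0.0."""
    return float(pred == gold)

def char_f1(pred: str, gold: str) -> float:
    """Character-level F1 between a predicted and a reference answer string."""
    overlap = sum((Counter(pred) & Counter(gold)).values())  # shared characters
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match("北京", "北京市"))        # 0.0: the strings differ
print(round(char_f1("北京", "北京市"), 3))  # 0.8: 2 of 2 predicted chars, 2 of 3 gold chars
```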


CMRC2018 Reading Comprehension for Simplified Chinese (F1, EM):


| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| BERT-base | F1: 85.48 / EM: 64.77 | F1: 88.10 / EM: 71.60 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| BERT-wwm-ext-base | F1: 86.68 / EM: 66.96 | F1: 89.62 / EM: 73.95 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ERNIE-base | F1: 87.30 / EM: 66.89 | F1: 90.57 / EM: 74.70 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-base | F1: 85.86 / EM: 64.76 | F1: 89.66 / EM: 72.90 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-large | F1: 87.36 / EM: 67.31 | F1: 90.81 / EM: 75.95 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-xlarge | F1: 88.99 / EM: 69.08 | F1: 92.09 / EM: 76.30 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-xxlarge | F1: 87.47 / EM: 66.43 | F1: 90.77 / EM: 75.15 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-tiny | F1: 73.95 / EM: 48.31 | F1: 76.21 / EM: 53.35 | batch=32, length=512, epoch=3, lr=2e-4, warmup=0.1 |
| RoBERTa-large | F1: 88.61 / EM: 69.94 | F1: 92.04 / EM: 78.50 | batch=32, length=256, epoch=2, lr=3e-5, warmup=0.1 |
| XLNet-mid | F1: 85.63 / EM: 65.31 | F1: 86.11 / EM: 66.95 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| RoBERTa-wwm-ext | F1: 87.28 / EM: 67.89 | F1: 90.41 / EM: 75.20 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| RoBERTa-wwm-large-ext | F1: 89.42 / EM: 70.59 | F1: 92.11 / EM: 77.95 | batch=32, length=512, epoch=2, lr=2.5e-5, warmup=0.1 |


Note: the numbers currently on the leaderboard are evaluated on a 2k-example subset of the CMRC2018 test set, not on the official full CMRC2018 test set. To evaluate on the complete CMRC2018 reading-comprehension test set, you still need to submit through the CMRC2018 platform
(https://worksheets.codalab.org/worksheets/0x96f61ee5e9914aee8b54bd11e66ec647).


CHID Chinese IDiom Dataset for Cloze Test (Accuracy):


| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| BERT-base | 82.20 | 82.04 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| BERT-wwm-ext-base | 83.36 | 82.9 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ERNIE-base | 82.46 | 82.28 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-base | 70.99 | 71.77 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-large | 75.10 | 74.18 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-xlarge | 81.20 | 80.57 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-xxlarge | 83.61 | 83.15 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-tiny | 43.47 | 43.53 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| RoBERTa-large | 85.31 | 84.50 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| XLNet-mid | 83.76 | 83.47 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| RoBERTa-wwm-ext | 83.78 | 83.62 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| RoBERTa-wwm-large-ext | 85.81 | 85.37 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |


C3 Multiple-Choice Chinese Machine Reading Comprehension (Accuracy):


| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| BERT-base | 65.70 | 64.50 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| BERT-wwm-ext-base | 67.80 | 68.50 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ERNIE-base | 65.50 | 64.10 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-base | 60.43 | 59.58 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-large | 64.07 | 64.41 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-xlarge | 69.75 | 70.32 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-xxlarge | 73.66 | 73.28 | batch=16, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-tiny | 50.58 | 50.26 | batch=32, length=512, epoch=8, lr=5e-5, warmup=0.1 |
| RoBERTa-large | 67.79 | 67.55 | batch=24, length=256, epoch=8, lr=2e-5, warmup=0.1 |
| XLNet-mid | 66.17 | 67.68 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| RoBERTa-wwm-ext | 67.06 | 66.50 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| RoBERTa-wwm-large-ext | 74.48 | 73.82 | batch=16, length=512, epoch=8, lr=2e-5, warmup=0.1 |
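
Tasks such as C3 (and similarly CHID) are scored by letting the model compare all candidate answers for a question at once. The sketch below shows this input layout with BertForMultipleChoice from Hugging Face transformers, using a made-up toy passage rather than a real C3 item; it only illustrates the (batch, num_choices, seq_len) input format, is not the CLUE baseline, and its randomly initialized choice head will pick essentially at random until fine-tuned.

```python
# Toy sketch of multiple-choice reading comprehension scoring (not the CLUE baseline).
import torch
from transformers import BertForMultipleChoice, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMultipleChoice.from_pretrained("bert-base-chinese")
model.eval()

# Made-up example (not from C3): one passage/question with four candidate answers.
passage = "小明每天早上七点起床,然后坐公交车去学校。"
question = "小明怎么去学校?"
choices = ["走路", "坐公交车", "骑自行车", "开车"]

# Pair the same passage + question with every candidate answer.
contexts = [passage + question] * len(choices)
enc = tokenizer(contexts, choices, truncation=True, padding=True, return_tensors="pt")

# BertForMultipleChoice expects tensors of shape (batch, num_choices, seq_len).
enc = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**enc).logits            # shape: (1, num_choices)

print("predicted choice:", choices[logits.argmax(dim=-1).item()])
```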