A Treat for NLPers: The Chinese Language Understanding Evaluation Benchmark [CLUEbenchmark] (Part 2)

Detailed Per-Task Comparison

Evaluation of Each Dataset Across Models


AFQMC, Ant Semantic Similarity (Accuracy):

| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| ALBERT-xxlarge | - | - | - |
| ALBERT-tiny | 69.13% | 69.92% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| BERT-base | 74.16% | 73.70% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| BERT-wwm-ext-base | 73.74% | 74.07% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| ERNIE-base | 74.88% | 73.83% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-large | 73.32% | 74.02% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| XLNet-mid | 70.73% | 70.50% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-ext | 74.30% | 74.04% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-large-ext | 74.92% | 76.55% | batch_size=16, length=128, epoch=3, lr=2e-5 |
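For readers who want to reproduce these runs, the training-parameter column maps directly onto a standard fine-tuning loop. As a minimal stdlib sketch (the 34,334-example AFQMC train-set size and the helper name are illustrative assumptions, not taken from the CLUE baseline scripts), the hyperparameters determine the number of optimizer updates like this:

```python
import math

def training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Total optimizer updates for a fine-tuning run (last partial batch kept)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# Hyperparameters shared by most AFQMC rows above.
hparams = {"batch_size": 16, "max_length": 128, "epochs": 3, "lr": 2e-5}

# Assuming roughly 34,334 training pairs for AFQMC (illustrative figure).
print(training_steps(34_334, hparams["batch_size"], hparams["epochs"]))  # 6438
```

With epoch=3 and batch_size=16 this comes to only a few thousand updates, which is why these runs finish in well under an hour on a single GPU.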


TNEWS', Toutiao News Classification (Accuracy):

| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| ALBERT-xxlarge | - | - | - |
| ALBERT-tiny | 53.55% | 53.35% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| BERT-base | 56.09% | 56.58% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| BERT-wwm-ext-base | 56.77% | 56.86% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| ERNIE-base | 58.24% | 58.33% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-large | 57.95% | 57.84% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| XLNet-mid | 56.09% | 56.24% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-ext | 57.51% | 56.94% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-large-ext | 58.32% | 58.61% | batch_size=16, length=128, epoch=3, lr=2e-5 |


IFLYTEK', Long Text Classification (Accuracy):

| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| ALBERT-xlarge | - | - | batch=32, length=128, epoch=3, lr=2e-5 |
| ALBERT-tiny | 48.76 | 48.71 | batch=32, length=128, epoch=10, lr=2e-5 |
| BERT-base | 60.37 | 60.29 | batch=32, length=128, epoch=3, lr=2e-5 |
| BERT-wwm-ext-base | 59.88 | 59.43 | batch=32, length=128, epoch=3, lr=2e-5 |
| ERNIE-base | 59.52 | 58.96 | batch=32, length=128, epoch=3, lr=2e-5 |
| RoBERTa-large | 62.6 | 62.55 | batch=24, length=128, epoch=3, lr=2e-5 |
| XLNet-mid | 57.72 | 57.85 | batch=32, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-ext | 60.8 | 60.31 | batch=32, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-large-ext | 62.75 | 62.98 | batch=24, length=128, epoch=3, lr=2e-5 |


CMNLI, Chinese Multi-Genre Natural Language Inference (Accuracy):

| Model | Dev (%) | Test (%) | Training parameters |
| --- | --- | --- | --- |
| BERT-base | 79.47 | 79.69 | batch=64, length=128, epoch=2, lr=3e-5 |
| BERT-wwm-ext-base | 80.92 | 80.42 | batch=64, length=128, epoch=2, lr=3e-5 |
| ERNIE-base | 80.37 | 80.29 | batch=64, length=128, epoch=2, lr=3e-5 |
| ALBERT-xxlarge | - | - | - |
| ALBERT-tiny | 70.26 | 70.61 | batch=64, length=128, epoch=2, lr=3e-5 |
| RoBERTa-large | 82.40 | 81.70 | batch=64, length=128, epoch=2, lr=3e-5 |
| XLNet-mid | 82.21 | 81.25 | batch=64, length=128, epoch=2, lr=3e-5 |
| RoBERTa-wwm-ext | 80.70 | 80.51 | batch=64, length=128, epoch=2, lr=3e-5 |
| RoBERTa-wwm-large-ext | 83.20 | 82.12 | batch=64, length=128, epoch=2, lr=3e-5 |


Note: ALBERT-xlarge still has unresolved issues when training on the XNLI task.


WSC, The Winograd Schema Challenge (Chinese Version):

| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| ALBERT-xxlarge | - | - | - |
| ALBERT-tiny | 57.7 (52.9) | 58.5 (52.1) | lr=1e-4, batch_size=8, length=128, epoch=50 |
| BERT-base | 59.6 (56.7) | 62.0 (57.9) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| BERT-wwm-ext-base | 59.4 (56.7) | 61.1 (56.2) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| ERNIE-base | 58.1 (54.9) | 60.8 (55.9) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| RoBERTa-large | 68.6 (58.7) | 72.7 (63.6) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| XLNet-mid | 60.9 (56.8) | 64.4 (57.3) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| RoBERTa-wwm-ext | 67.2 (57.7) | 67.8 (63.5) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| RoBERTa-wwm-large-ext | 69.7 (64.5) | 74.6 (69.4) | lr=2e-5, batch_size=8, length=128, epoch=50 |


CSL, Keyword Recognition (Accuracy):

| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| ALBERT-xlarge | 80.23 | 80.29 | batch_size=16, length=128, epoch=2, lr=5e-6 |
| ALBERT-tiny | 74.36 | 74.56 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| BERT-base | 79.63 | 80.23 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| BERT-wwm-ext-base | 80.60 | 81.00 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| ERNIE-base | 79.43 | 79.10 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| RoBERTa-large | 81.87 | 81.36 | batch_size=4, length=256, epoch=5, lr=5e-6 |
| XLNet-mid | 82.06 | 81.26 | batch_size=4, length=256, epoch=3, lr=1e-5 |
| RoBERTa-wwm-ext | 80.67 | 80.63 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| RoBERTa-wwm-large-ext | 82.17 | 82.13 | batch_size=4, length=256, epoch=5, lr=1e-5 |


DRCD, Reading Comprehension for Traditional Chinese (F1, EM):

| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| BERT-base | F1: 92.30, EM: 86.60 | F1: 91.46, EM: 85.49 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| BERT-wwm-ext-base | F1: 93.27, EM: 88.00 | F1: 92.63, EM: 87.15 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ERNIE-base | F1: 92.78, EM: 86.85 | F1: 92.01, EM: 86.03 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-large | F1: 93.90, EM: 88.88 | F1: 93.06, EM: 87.52 | batch=32, length=512, epoch=3, lr=2e-5, warmup=0.05 |
| ALBERT-xlarge | F1: 94.63, EM: 89.68 | F1: 94.70, EM: 89.78 | batch=32, length=512, epoch=3, lr=2.5e-5, warmup=0.06 |
| ALBERT-xxlarge | F1: 93.69, EM: 89.97 | F1: 94.62, EM: 89.67 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-tiny | F1: 81.51, EM: 71.61 | F1: 80.67, EM: 70.08 | batch=32, length=512, epoch=3, lr=2e-4, warmup=0.1 |
| RoBERTa-large | F1: 94.93, EM: 90.11 | F1: 94.25, EM: 89.35 | batch=32, length=256, epoch=2, lr=3e-5, warmup=0.1 |
| XLNet-mid | F1: 92.08, EM: 84.40 | F1: 91.44, EM: 83.28 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| RoBERTa-wwm-ext | F1: 94.26, EM: 89.29 | F1: 93.53, EM: 88.12 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| RoBERTa-wwm-large-ext | F1: 95.32, EM: 90.54 | F1: 95.06, EM: 90.70 | batch=32, length=512, epoch=2, lr=2.5e-5, warmup=0.1 |
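DRCD and the reading-comprehension tasks report F1 and EM (exact match). For Chinese span QA these are conventionally computed at the character level, since whitespace tokenisation does not apply to Chinese text. A minimal sketch (the official evaluation scripts additionally normalise punctuation and take the maximum score over multiple gold answers):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the predicted span equals the gold span exactly, else 0.0."""
    return float(pred == gold)

def char_f1(pred: str, gold: str) -> float:
    """Character-level F1 between a predicted span and a gold span."""
    common = Counter(pred) & Counter(gold)   # multiset intersection of chars
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(char_f1("北京大学", "北京大学图书馆"))  # partial overlap, ≈ 0.727
print(exact_match("北京大学", "北京大学"))     # 1.0
```

The corpus-level scores in the table are simply these per-question values averaged over the dataset.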


CMRC2018, Reading Comprehension for Simplified Chinese (F1, EM):

| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| BERT-base | F1: 85.48, EM: 64.77 | F1: 88.10, EM: 71.60 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| BERT-wwm-ext-base | F1: 86.68, EM: 66.96 | F1: 89.62, EM: 73.95 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ERNIE-base | F1: 87.30, EM: 66.89 | F1: 90.57, EM: 74.70 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-base | F1: 85.86, EM: 64.76 | F1: 89.66, EM: 72.90 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-large | F1: 87.36, EM: 67.31 | F1: 90.81, EM: 75.95 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-xlarge | F1: 88.99, EM: 69.08 | F1: 92.09, EM: 76.30 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-xxlarge | F1: 87.47, EM: 66.43 | F1: 90.77, EM: 75.15 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-tiny | F1: 73.95, EM: 48.31 | F1: 76.21, EM: 53.35 | batch=32, length=512, epoch=3, lr=2e-4, warmup=0.1 |
| RoBERTa-large | F1: 88.61, EM: 69.94 | F1: 92.04, EM: 78.50 | batch=32, length=256, epoch=2, lr=3e-5, warmup=0.1 |
| XLNet-mid | F1: 85.63, EM: 65.31 | F1: 86.11, EM: 66.95 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| RoBERTa-wwm-ext | F1: 87.28, EM: 67.89 | F1: 90.41, EM: 75.20 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| RoBERTa-wwm-large-ext | F1: 89.42, EM: 70.59 | F1: 92.11, EM: 77.95 | batch=32, length=512, epoch=2, lr=2.5e-5, warmup=0.1 |


Note: the current leaderboard evaluates on a 2K-example subset of the CMRC 2018 test set, not the official full test set. To evaluate on the complete CMRC 2018 test set, you still need to submit through the CMRC 2018 platform (https://worksheets.codalab.org/worksheets/0x96f61ee5e9914aee8b54bd11e66ec647).
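The warmup entries in the tables above (e.g. warmup=0.1) give the fraction of total training steps spent warming up the learning rate. A BERT-style schedule with linear warmup followed by linear decay can be sketched as follows (an illustration with an assumed function name, not the exact baseline code):

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given step: linear warmup to peak_lr,
    then linear decay back to zero at the end of training."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)       # ramp up
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)  # decay

# Example: a 1000-step run with lr=3e-5 and warmup=0.1.
print(lr_at_step(0, 1000, 3e-5))    # 0.0 at the very start
print(lr_at_step(100, 1000, 3e-5))  # peak lr right after warmup
print(lr_at_step(1000, 1000, 3e-5)) # 0.0 at the end
```

Warmup matters most for the large models here; without it, the early high-variance gradients can destabilise fine-tuning at lr=2e-5 to 3e-5.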


CHID, Chinese IDiom Dataset for Cloze Test (Accuracy):

| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| BERT-base | 82.20 | 82.04 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| BERT-wwm-ext-base | 83.36 | 82.90 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ERNIE-base | 82.46 | 82.28 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-base | 70.99 | 71.77 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-large | 75.10 | 74.18 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-xlarge | 81.20 | 80.57 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-xxlarge | 83.61 | 83.15 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-tiny | 43.47 | 43.53 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| RoBERTa-large | 85.31 | 84.50 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| XLNet-mid | 83.76 | 83.47 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| RoBERTa-wwm-ext | 83.78 | 83.62 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| RoBERTa-wwm-large-ext | 85.81 | 85.37 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |


C3, Multiple-Choice Chinese Machine Reading Comprehension (Accuracy):

| Model | Dev | Test | Training parameters |
| --- | --- | --- | --- |
| BERT-base | 65.70 | 64.50 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| BERT-wwm-ext-base | 67.80 | 68.50 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ERNIE-base | 65.50 | 64.10 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-base | 60.43 | 59.58 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-large | 64.07 | 64.41 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-xlarge | 69.75 | 70.32 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-xxlarge | 73.66 | 73.28 | batch=16, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-tiny | 50.58 | 50.26 | batch=32, length=512, epoch=8, lr=5e-5, warmup=0.1 |
| RoBERTa-large | 67.79 | 67.55 | batch=24, length=256, epoch=8, lr=2e-5, warmup=0.1 |
| XLNet-mid | 66.17 | 67.68 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| RoBERTa-wwm-ext | 67.06 | 66.50 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| RoBERTa-wwm-large-ext | 74.48 | 73.82 | batch=16, length=512, epoch=8, lr=2e-5, warmup=0.1 |