Detailed Comparison by Task
Evaluation of Each Dataset Across Models
AFQMC Ant Semantic Similarity (Accuracy):

| Model | Dev | Test | Training Parameters |
| --- | --- | --- | --- |
| ALBERT-xxlarge | - | - | - |
| ALBERT-tiny | 69.13% | 69.92% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| BERT-base | 74.16% | 73.70% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| BERT-wwm-ext-base | 73.74% | 74.07% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| ERNIE-base | 74.88% | 73.83% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-large | 73.32% | 74.02% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| XLNet-mid | 70.73% | 70.50% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-ext | 74.30% | 74.04% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-large-ext | 74.92% | 76.55% | batch_size=16, length=128, epoch=3, lr=2e-5 |
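The sentence-pair classification tasks in this section are all produced by standard fine-tuning with the hyperparameters listed in the last column. As a rough illustration only, the sketch below shows what such a run could look like with the Hugging Face `transformers` Trainer; the checkpoint name, the `("clue", "afqmc")` dataset config, and the column names are assumptions for illustration, not the scripts that produced the numbers above.

```python
# Hypothetical fine-tuning sketch for AFQMC-style sentence-pair classification.
# Assumes the CLUE AFQMC config from the Hugging Face hub and a generic Chinese
# BERT checkpoint; neither is necessarily what generated the table above.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("clue", "afqmc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def tokenize(batch):
    # AFQMC examples are sentence pairs; truncate/pad to length=128 as in the table.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)

args = TrainingArguments(
    output_dir="afqmc-bert",
    per_device_train_batch_size=16,  # batch_size=16
    num_train_epochs=3,              # epoch=3
    learning_rate=2e-5,              # lr=2e-5
)
Trainer(model=model, args=args,
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"]).train()
```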
TNEWS' Toutiao News Classification (Accuracy):

| Model | Dev | Test | Training Parameters |
| --- | --- | --- | --- |
| ALBERT-xxlarge | - | - | - |
| ALBERT-tiny | 53.55% | 53.35% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| BERT-base | 56.09% | 56.58% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| BERT-wwm-ext-base | 56.77% | 56.86% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| ERNIE-base | 58.24% | 58.33% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-large | 57.95% | 57.84% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| XLNet-mid | 56.09% | 56.24% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-ext | 57.51% | 56.94% | batch_size=16, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-large-ext | 58.32% | 58.61% | batch_size=16, length=128, epoch=3, lr=2e-5 |
IFLYTEK' Long Text Classification (Accuracy):

| Model | Dev (%) | Test (%) | Training Parameters |
| --- | --- | --- | --- |
| ALBERT-xlarge | - | - | batch=32, length=128, epoch=3, lr=2e-5 |
| ALBERT-tiny | 48.76 | 48.71 | batch=32, length=128, epoch=10, lr=2e-5 |
| BERT-base | 60.37 | 60.29 | batch=32, length=128, epoch=3, lr=2e-5 |
| BERT-wwm-ext-base | 59.88 | 59.43 | batch=32, length=128, epoch=3, lr=2e-5 |
| ERNIE-base | 59.52 | 58.96 | batch=32, length=128, epoch=3, lr=2e-5 |
| RoBERTa-large | 62.6 | 62.55 | batch=24, length=128, epoch=3, lr=2e-5 |
| XLNet-mid | 57.72 | 57.85 | batch=32, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-ext | 60.8 | 60.31 | batch=32, length=128, epoch=3, lr=2e-5 |
| RoBERTa-wwm-large-ext | 62.75 | 62.98 | batch=24, length=128, epoch=3, lr=2e-5 |
CMNLI Chinese Multi-Genre Natural Language Inference (Accuracy):

| Model | Dev (%) | Test (%) | Training Parameters |
| --- | --- | --- | --- |
| BERT-base | 79.47 | 79.69 | batch=64, length=128, epoch=2, lr=3e-5 |
| BERT-wwm-ext-base | 80.92 | 80.42 | batch=64, length=128, epoch=2, lr=3e-5 |
| ERNIE-base | 80.37 | 80.29 | batch=64, length=128, epoch=2, lr=3e-5 |
| ALBERT-xxlarge | - | - | - |
| ALBERT-tiny | 70.26 | 70.61 | batch=64, length=128, epoch=2, lr=3e-5 |
| RoBERTa-large | 82.40 | 81.70 | batch=64, length=128, epoch=2, lr=3e-5 |
| XLNet-mid | 82.21 | 81.25 | batch=64, length=128, epoch=2, lr=3e-5 |
| RoBERTa-wwm-ext | 80.70 | 80.51 | batch=64, length=128, epoch=2, lr=3e-5 |
| RoBERTa-wwm-large-ext | 83.20 | 82.12 | batch=64, length=128, epoch=2, lr=3e-5 |
Note: training ALBERT-xlarge on the XNLI task currently still has problems.
WSC The Winograd Schema Challenge, Chinese Version (Accuracy):

| Model | Dev | Test | Training Parameters |
| --- | --- | --- | --- |
| ALBERT-xxlarge | - | - | - |
| ALBERT-tiny | 57.7 (52.9) | 58.5 (52.1) | lr=1e-4, batch_size=8, length=128, epoch=50 |
| BERT-base | 59.6 (56.7) | 62.0 (57.9) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| BERT-wwm-ext-base | 59.4 (56.7) | 61.1 (56.2) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| ERNIE-base | 58.1 (54.9) | 60.8 (55.9) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| RoBERTa-large | 68.6 (58.7) | 72.7 (63.6) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| XLNet-mid | 60.9 (56.8) | 64.4 (57.3) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| RoBERTa-wwm-ext | 67.2 (57.7) | 67.8 (63.5) | lr=2e-5, batch_size=8, length=128, epoch=50 |
| RoBERTa-wwm-large-ext | 69.7 (64.5) | 74.6 (69.4) | lr=2e-5, batch_size=8, length=128, epoch=50 |
CSL Keyword Recognition (Accuracy):

| Model | Dev (%) | Test (%) | Training Parameters |
| --- | --- | --- | --- |
| ALBERT-xlarge | 80.23 | 80.29 | batch_size=16, length=128, epoch=2, lr=5e-6 |
| ALBERT-tiny | 74.36 | 74.56 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| BERT-base | 79.63 | 80.23 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| BERT-wwm-ext-base | 80.60 | 81.00 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| ERNIE-base | 79.43 | 79.10 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| RoBERTa-large | 81.87 | 81.36 | batch_size=4, length=256, epoch=5, lr=5e-6 |
| XLNet-mid | 82.06 | 81.26 | batch_size=4, length=256, epoch=3, lr=1e-5 |
| RoBERTa-wwm-ext | 80.67 | 80.63 | batch_size=4, length=256, epoch=5, lr=1e-5 |
| RoBERTa-wwm-large-ext | 82.17 | 82.13 | batch_size=4, length=256, epoch=5, lr=1e-5 |
DRCD Reading Comprehension for Traditional Chinese (F1, EM):

| Model | Dev | Test | Training Parameters |
| --- | --- | --- | --- |
| BERT-base | F1:92.30 EM:86.60 | F1:91.46 EM:85.49 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| BERT-wwm-ext-base | F1:93.27 EM:88.00 | F1:92.63 EM:87.15 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ERNIE-base | F1:92.78 EM:86.85 | F1:92.01 EM:86.03 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-large | F1:93.90 EM:88.88 | F1:93.06 EM:87.52 | batch=32, length=512, epoch=3, lr=2e-5, warmup=0.05 |
| ALBERT-xlarge | F1:94.63 EM:89.68 | F1:94.70 EM:89.78 | batch=32, length=512, epoch=3, lr=2.5e-5, warmup=0.06 |
| ALBERT-xxlarge | F1:93.69 EM:89.97 | F1:94.62 EM:89.67 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-tiny | F1:81.51 EM:71.61 | F1:80.67 EM:70.08 | batch=32, length=512, epoch=3, lr=2e-4, warmup=0.1 |
| RoBERTa-large | F1:94.93 EM:90.11 | F1:94.25 EM:89.35 | batch=32, length=256, epoch=2, lr=3e-5, warmup=0.1 |
| XLNet-mid | F1:92.08 EM:84.40 | F1:91.44 EM:83.28 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| RoBERTa-wwm-ext | F1:94.26 EM:89.29 | F1:93.53 EM:88.12 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| RoBERTa-wwm-large-ext | F1:95.32 EM:90.54 | F1:95.06 EM:90.70 | batch=32, length=512, epoch=2, lr=2.5e-5, warmup=0.1 |
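DRCD and CMRC2018 below report span-extraction quality as exact match (EM) and token-overlap F1. As a minimal sketch of how these two numbers relate, the character-level variant commonly used for Chinese QA can be computed as follows; this is an illustrative re-implementation, not the official evaluation script (which additionally strips punctuation and handles mixed-language tokens).

```python
# Minimal character-level EM/F1 sketch for Chinese span-extraction QA.
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    # EM is all-or-nothing: 1.0 only if the predicted span matches exactly.
    return float(prediction == reference)

def f1_score(prediction: str, reference: str) -> float:
    # Treat each character as a token, which suits Chinese text.
    pred_chars, ref_chars = list(prediction), list(reference)
    common = Counter(pred_chars) & Counter(ref_chars)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)

print(exact_match("台北市", "台北"))          # 0.0: not an exact match
print(round(f1_score("台北市", "台北"), 2))   # 0.8: partial character overlap
```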
CMRC2018 Reading Comprehension for Simplified Chinese (F1, EM):

| Model | Dev | Test | Training Parameters |
| --- | --- | --- | --- |
| BERT-base | F1:85.48 EM:64.77 | F1:88.10 EM:71.60 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| BERT-wwm-ext-base | F1:86.68 EM:66.96 | F1:89.62 EM:73.95 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ERNIE-base | F1:87.30 EM:66.89 | F1:90.57 EM:74.70 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-base | F1:85.86 EM:64.76 | F1:89.66 EM:72.90 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-large | F1:87.36 EM:67.31 | F1:90.81 EM:75.95 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-xlarge | F1:88.99 EM:69.08 | F1:92.09 EM:76.30 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-xxlarge | F1:87.47 EM:66.43 | F1:90.77 EM:75.15 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| ALBERT-tiny | F1:73.95 EM:48.31 | F1:76.21 EM:53.35 | batch=32, length=512, epoch=3, lr=2e-4, warmup=0.1 |
| RoBERTa-large | F1:88.61 EM:69.94 | F1:92.04 EM:78.50 | batch=32, length=256, epoch=2, lr=3e-5, warmup=0.1 |
| XLNet-mid | F1:85.63 EM:65.31 | F1:86.11 EM:66.95 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| RoBERTa-wwm-ext | F1:87.28 EM:67.89 | F1:90.41 EM:75.20 | batch=32, length=512, epoch=2, lr=3e-5, warmup=0.1 |
| RoBERTa-wwm-large-ext | F1:89.42 EM:70.59 | F1:92.11 EM:77.95 | batch=32, length=512, epoch=2, lr=2.5e-5, warmup=0.1 |
Note: the leaderboard currently evaluates on a 2k-example subset of the CMRC2018 test set, not the official full test set. To evaluate on the full CMRC2018 test set, submissions must still go through the CMRC2018 platform
(https://worksheets.codalab.org/worksheets/0x96f61ee5e9914aee8b54bd11e66ec647).
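The reading-comprehension runs above pair the learning rate with a warmup fraction (e.g. warmup=0.1). As a sketch of how that proportion is typically turned into a schedule, assuming the linear-warmup helper from `transformers` (the model, optimizer, and step counts below are placeholders, not the actual training setup):

```python
# Hypothetical illustration of warmup=0.1: linearly ramp the learning rate up
# over the first 10% of training steps, then decay it linearly toward zero.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(10, 2)                              # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # lr=3e-5 as in the table

num_training_steps = 1000                                   # placeholder step count
num_warmup_steps = int(0.1 * num_training_steps)            # warmup=0.1

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps)

for step in range(num_training_steps):
    # ... forward/backward/optimizer.step() would go here ...
    scheduler.step()  # advance the warmup/decay schedule once per step
```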
CHID Chinese IDiom Dataset for Cloze Test (Accuracy):

| Model | Dev (%) | Test (%) | Training Parameters |
| --- | --- | --- | --- |
| BERT-base | 82.20 | 82.04 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| BERT-wwm-ext-base | 83.36 | 82.9 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ERNIE-base | 82.46 | 82.28 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-base | 70.99 | 71.77 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-large | 75.10 | 74.18 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-xlarge | 81.20 | 80.57 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-xxlarge | 83.61 | 83.15 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| ALBERT-tiny | 43.47 | 43.53 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| RoBERTa-large | 85.31 | 84.50 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| XLNet-mid | 83.76 | 83.47 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| RoBERTa-wwm-ext | 83.78 | 83.62 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
| RoBERTa-wwm-large-ext | 85.81 | 85.37 | batch=24, length=64, epoch=3, lr=2e-5, warmup=0.06 |
C3 Multiple-Choice Chinese Machine Reading Comprehension (Accuracy):

| Model | Dev (%) | Test (%) | Training Parameters |
| --- | --- | --- | --- |
| BERT-base | 65.70 | 64.50 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| BERT-wwm-ext-base | 67.80 | 68.50 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ERNIE-base | 65.50 | 64.10 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-base | 60.43 | 59.58 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-large | 64.07 | 64.41 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-xlarge | 69.75 | 70.32 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-xxlarge | 73.66 | 73.28 | batch=16, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| ALBERT-tiny | 50.58 | 50.26 | batch=32, length=512, epoch=8, lr=5e-5, warmup=0.1 |
| RoBERTa-large | 67.79 | 67.55 | batch=24, length=256, epoch=8, lr=2e-5, warmup=0.1 |
| XLNet-mid | 66.17 | 67.68 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| RoBERTa-wwm-ext | 67.06 | 66.50 | batch=24, length=512, epoch=8, lr=2e-5, warmup=0.1 |
| RoBERTa-wwm-large-ext | 74.48 | 73.82 | batch=16, length=512, epoch=8, lr=2e-5, warmup=0.1 |
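C3 (and CHID above) are scored by picking one candidate per question, which maps naturally onto a multiple-choice head that scores each (context, option) pair and applies a softmax over the options. A minimal sketch using `BertForMultipleChoice` follows; the checkpoint name and the toy inputs are assumptions for illustration, not the benchmark pipeline itself.

```python
# Sketch of multiple-choice scoring for C3-style tasks: encode the passage
# paired with each candidate answer, score each pair, argmax over candidates.
import torch
from transformers import BertForMultipleChoice, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMultipleChoice.from_pretrained("bert-base-chinese")

passage = "小明每天早上七点起床,然后去上学。"
question = "小明几点起床?"
options = ["六点", "七点", "八点", "九点"]

# One (passage+question, option) pair per candidate answer.
first = [passage + question] * len(options)
enc = tokenizer(first, options, return_tensors="pt",
                padding=True, truncation=True, max_length=128)
# BertForMultipleChoice expects tensors of shape (batch, num_choices, seq_len).
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, num_choices)
print(options[logits.argmax(dim=-1).item()])  # highest-scoring option
```

Note that the multiple-choice head is randomly initialized until fine-tuned, so the printed choice is only meaningful after training on the task.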