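[Large Language Models: Paper Close Reading] A Survey of Evaluating Large Language Models for Summarization Tasks in the Medical Domain (Part 2)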

Part 1 of this article: https://developer.aliyun.com/article/1628946

5.2 Parameter-Efficient Fine-Tuning

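Even though an LLM may have been pre-trained on a vast corpus, it can still struggle with tasks that require domain-specific knowledge or that involve nuanced inputs. To address this, supervised fine-tuning (SFT) can be combined with parameter-efficient fine-tuning (PEFT) techniques such as quantization and low-rank adapters. In SFT, the model is trained on a specialized dataset of prompt/response pairs tailored to the task at hand. Updating every weight in an LLM can demand substantial time and compute, which is where quantization and low-rank adapters come in. Quantization reduces the time and memory cost of training by storing the LLM weights in lower-precision data types, typically 4-bit or 8-bit [85]. Low-rank adaptation (LoRA) freezes the LLM's weights and trains a much smaller set of parameters that represent the weight update in low-rank form, which also lowers the cost of SFT [86]. PEFT helps refine an LLM by embedding task-specific knowledge so that the model responds accurately in a particular setting. Building these datasets is critical: performance gains depend directly on the quality and relevance of the prompt/response pairs used for fine-tuning. The goal of PEFT is to narrow the model's focus to task-specific behavior so that it performs better in a given use case, such as medical diagnosis or legal reasoning.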
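
To make the cost argument concrete, the sketch below counts the trainable parameters of a single weight matrix under full fine-tuning and under a LoRA adapter, then round-trips a few weights through a simple 4-bit quantizer. The layer size, the rank, and the symmetric absmax scheme are illustrative assumptions, not taken from the paper; QLoRA [85] uses a more elaborate 4-bit data type than the one shown here.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one d_out x d_in weight: (full fine-tuning, LoRA)."""
    full = d_out * d_in
    # LoRA keeps W frozen and trains B (d_out x rank) and A (rank x d_in),
    # so the effective weight is W + B @ A.
    lora = rank * (d_out + d_in)
    return full, lora


def quantize_absmax(weights, bits=4):
    """Map floats to signed integers of the given bit width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


# Illustrative sizes only: a 4096 x 4096 projection with a rank-8 adapter.
full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.4%}")

weights = [0.7, -0.4, 0.1, 0.0]
codes, scale = quantize_absmax(weights, bits=4)
restored = dequantize(codes, scale)
print(codes, [round(w, 3) for w in restored])
```

With these assumed sizes the adapter trains well under one percent of the parameters that full fine-tuning would touch, and each quantized weight is recovered to within half a quantization step.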

5.3 Parameter-Efficient Fine-Tuning with Human-Aware Loss Function

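In some applications, fine-tuning focuses on aligning the LLM with human values and preferences, especially when the model is prone to generating biased, incorrect, or harmful content. This is known as human alignment training, and it is driven by high-quality human feedback integrated into the training process. A widely recognized method in this area is reinforcement learning from human feedback (RLHF) [87]. RLHF updates the LLM so that its outputs score higher on a reward scale. In the reward-modeling stage, a dataset annotated with human feedback is used to determine the reward (usually a scalar) for a given response. The LLM is then trained with proximal policy optimization (PPO) [88] to produce responses that earn higher rewards. This iterative process keeps the model in line with human expectations, but it can be resource-intensive, requiring large amounts of memory, time, and compute.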
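
The two stages can be sketched with scalar toy values. The first function is a pairwise loss of the kind commonly used to fit a reward model to human comparisons (an assumption here, since the text does not specify the reward-model objective); the second is the clipped surrogate objective from the PPO paper [88] for a single sampled response. Function names and the example numbers are hypothetical.

```python
import math


def reward_model_pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred response scores higher."""
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(y|x) / pi_old(y|x)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the policy far from the old one.
    return min(ratio * advantage, clipped_ratio * advantage)


# A response the reward model liked (positive advantage) whose probability doubled:
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))
# The same probability change for a response it disliked (negative advantage):
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=-1.0))
```

In a real RLHF pipeline the advantage would be derived from the reward model's scalar score (usually with a penalty for drifting from the starting model), and both the reward model and the policy are full neural networks, which is where the memory and compute cost described above comes from.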

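To address these computational challenges, a newer paradigm simplifies human alignment training by optimizing the LLM directly on human preferences, removing the need for a separate reward model. The best-known example is direct preference optimization (DPO) [89]. DPO recasts alignment as a human-aware loss function (HALO) that is optimized on a human preference dataset in which each prompt is paired with a preferred and a dispreferred response (Figure 4). This approach is particularly promising for aligning LLMs with human preferences, and it can be applied to ordinal responses such as the Likert scales common in human evaluation rubrics. Although PPO improves LLM performance by aligning outputs with human preferences, it is often sample-inefficient and can be vulnerable to reward hacking [90]. DPO, by contrast, optimizes model outputs directly against human preferences without an explicit reward model, which makes it more sample-efficient and better aligned with human values. By focusing directly on the desired outcome, DPO simplifies training and yields more stable and interpretable alignment. These methods have been applied successfully in other domains [91, 92, 93], but their use in medicine remains underexplored. To overcome limits on clinician labor, small-scale training data derived from human evaluation rubrics can be incorporated into loss functions designed for human alignment, such as the DPO loss.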
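
A minimal sketch of the DPO loss [89] for one preference pair follows, together with one possible way to turn Likert-scored responses into preferred/dispreferred pairs. The pairing helper `likert_to_pairs`, the default `beta`, and the example scores are illustrative assumptions rather than a procedure described in the paper.

```python
import math
from itertools import combinations


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple, given sequence log-probabilities."""
    # Implicit rewards: how much more likely each response is under the
    # policy being trained than under the frozen reference model.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def likert_to_pairs(scored_responses):
    """Turn [(response, likert_score), ...] for one prompt into (chosen, rejected) pairs."""
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored_responses, 2):
        if score_a == score_b:
            continue  # ties carry no preference signal
        pairs.append((resp_a, resp_b) if score_a > score_b else (resp_b, resp_a))
    return pairs


pairs = likert_to_pairs([("summary A", 5), ("summary B", 2), ("summary C", 5)])
print(pairs)
print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0))
```

No reward model appears anywhere in the loss: the only inputs are log-probabilities from the policy and from a frozen reference copy, which is why DPO avoids the separate reward-modeling stage.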

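Over the past year, many DPO variants have emerged for alignment training. By modifying the underlying model and loss function, they aim to prevent overfitting and to sidestep DPO's modeling assumptions (Figure 5). Alternatives such as joint preference optimization (JPO) [94] and simple preference optimization (SimPO) [95] are derived from DPO. They introduce regularization terms and modify the loss function to prevent premature convergence and to achieve more robust alignment across a broader range of inputs. Other alternatives, such as Kahneman-Tversky optimization (KTO) [96] and the pluralistic alignment framework (PAL) [97], replace the Bradley-Terry preference model that DPO relies on. Their alternative modeling assumptions can prevent the alignment failures DPO suffers when direct preference data are unavailable or when human preferences are heterogeneous.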
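
As one example of how a variant changes the loss, the sketch below follows the SimPO objective as described in [95]: the reference model is dropped, each response's log-probability is averaged over its length, and a target margin `gamma` is subtracted. The default `beta` and `gamma` values and the toy inputs are illustrative assumptions.

```python
import math


def simpo_loss(sum_logp_chosen, len_chosen, sum_logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO-style loss for one preference pair, using length-normalized log-probabilities."""
    # Reference-free reward: average per-token log-probability, scaled by beta.
    reward_chosen = beta * sum_logp_chosen / len_chosen
    reward_rejected = beta * sum_logp_rejected / len_rejected
    # The chosen response must beat the rejected one by at least gamma.
    z = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Same per-token quality, different lengths: length normalization treats them equally.
print(simpo_loss(-10.0, 10, -20.0, 20, gamma=0.0))
# With a positive margin, equal rewards are penalized more heavily.
print(simpo_loss(-10.0, 10, -20.0, 20, gamma=1.0))
```

Compared with the DPO loss, only the policy's own log-probabilities are needed, so no second copy of the model has to be held in memory during training.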

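LLMs hold promise for automated evaluation, but like other automated evaluation methods they face significant challenges. A major concern is the rapid evolution of LLMs and their associated training strategies. This pace often outstrips the ability to thoroughly validate LLM-based evaluators before they are used in practice. In some cases, new optimization techniques are introduced before their predecessors have been peer reviewed, and these advances may lack sufficient mathematical justification. The speed of LLM development can make it difficult to allocate the time and resources needed for proper validation, which may compromise reliability.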

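In addition, despite their progress, LLMs remain sensitive to the prompts and inputs they receive. As LLMs are updated and their internal knowledge representations change, and as prompts themselves change, outputs can vary considerably. The exact LLM or model version used adds another layer of variability: depending on a model's internal architecture and pre-training, the same prompt and input can yield different results. LLMs have also drawn concern for egocentric bias, which can affect evaluation as more and more LLM-generated text appears in source texts [112]. Using an LLM as an evaluator therefore requires rigorous testing and safety checks to reduce risk. Ensuring the fairness of its responses is also essential, particularly in sensitive domains such as healthcare, where biased or stigmatizing language can have serious consequences. These challenges underscore the need for continuous evaluation, testing, and refinement so that LLM-based evaluators are both reliable and safe for medical evaluation.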
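
One simple check for prompt sensitivity is to score the same summary under several paraphrased evaluation prompts and look at the spread. The sketch below assumes a hypothetical `judge(prompt, source_text, summary)` callable that returns a numeric score; `stub_judge` is a placeholder standing in for a real LLM call, and the prompts are invented for illustration.

```python
import statistics


def score_spread(judge, prompt_variants, source_text, summary):
    """Score one summary under each prompt variant and summarize the variation."""
    scores = [judge(prompt, source_text, summary) for prompt in prompt_variants]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "range": max(scores) - min(scores),
    }


def stub_judge(prompt, source_text, summary):
    """Placeholder evaluator: the score shifts with prompt wording alone."""
    return 4 if "faithful" in prompt else 3


prompts = [
    "Rate from 1 to 5 how faithful the summary is to the note.",
    "Score the summary from 1 to 5 for accuracy against the note.",
]
result = score_spread(stub_judge, prompts, "clinical note text", "generated summary")
print(result)
```

The same loop can be repeated across model versions or repeated runs of one prompt; a large range on an unchanged summary is a sign that the evaluator needs further validation before its scores are trusted.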

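As innovation in GenAI outpaces the validation of these technologies, developing reliable evaluation strategies becomes increasingly important. In healthcare, the focus on clinical safety must also contend with the time constraints of medical professionals. Human evaluation rubrics are highly reliable and accurate, but they are severely limited by the time commitment required of the medical professionals who serve as evaluators. Ironically, the technologies being evaluated are usually intended to reduce the cognitive burden on these professionals, yet evaluating their performance demands even more of their time.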

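If properly designed for the clinical domain, automated evaluation offers a promising alternative to human evaluation. So far, however, traditional non-LLM automated evaluations have fallen short, failing to consistently match the rigor of human evaluation rubrics [5, 13]. These metrics frequently miss hallucinations, cannot assess reasoning quality, and struggle to determine the relevance of generated text. As LLMs are introduced as potential substitutes for human evaluators, it is essential to account for the unique requirements of the clinical domain. A well-designed LLM evaluator (an "LLM-as-a-judge") could combine the high reliability of human evaluation with the efficiency of automated methods while avoiding the shortcomings of existing automated metrics. Executed well, such LLM-based evaluation could offer the best of both, ensuring clinical safety without sacrificing evaluation quality.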


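Postscript

If you find my blog useful, please like, follow, and comment. I will keep sharing the latest academic papers and engineering practice in frontier AI (especially large language models, deep learning, and computer vision) to help you understand the field faster, more accurately, and more systematically.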

References

[1] Patterson BW, Hekman DJ, Liao FJ, Hamedani AG, Shah MN, Afshar M. Call me Dr Ishmael: trends in electronic health record notes available at emergency department visits and admissions. JAMIA Open. 2024 Apr;7(2):ooae039.

[2] Team G, Georgiev P, Lei VI, Burnell R, Bai L, Gulati A, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. 2024 Aug. ArXiv:2403.05530 [cs]. Available from: arxiv.org/abs/2403.0… .

[3] Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, et al. A Survey of Large Language Models. 2023 Jun. ArXiv:2303.18223 [cs]. Available from: arxiv.org/abs/2303.1… .

[4] Moramarco F, Papadopoulos Korfiatis A, Perera M, Juric D, Flann J, Reiter E, et al. Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation. In: Muresan S, Nakov P, Villavicencio A, editors. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics; 2022. p. 5739–5754. Available from: aclanthology.org/202… .

[5] Croxford E, Gao Y, Patterson B, To D, Tesch S, Dligach D, et al. Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses. 2024 Apr:2024.03.20.24304620. Available from: www.medrxiv.org/cont… 1101/2024.03.20.24304620v2 .

[6] Singh H, Khanna A, Spitzmueller C, Meyer AND. Recommendations for using the Revised Safer Dx Instrument to help measure and improve diagnostic safety. Diagnosis. 2019 Nov;6:315–323.

[7] Stetson PD, Bakken S, Wrenn JO, Siegler EL. Assessing Electronic Note Quality Using the Physician Documentation Quality Instrument (PDQI-9). Applied Clinical Informatics. 2012 Apr;3(2):164–174.

[8] Schaye V, Miller L, Kudlowitz D, Chun J, Burk-Rafel J, Cocks P, et al. Development of a Clinical Reasoning Documentation Assessment Tool for Resident and Fellow Admission Notes: a Shared Mental Model for Feedback. Journal of General Internal Medicine. 2022 Feb;37(3):507–512.

[9] Kawamura R, Harada Y, Sugimoto S, Nagase Y, Katsukura S, Shimizu T. Incidence of Diagnostic Errors Among Unexpectedly Hospitalized Patients Using an Automated Medical History–Taking System With a Differential Diagnosis Generator: Retrospective Observational Study. JMIR Medical Informatics. 2022 Jan;10(1):e35225.

[10] Tierney AA, Gayre G, Hoberman B, Mattern B, Ballesca M, Kipnis P, et al. Ambient Artificial Intelligence Scribes to Alleviate the Burden of Clinical Documentation. NEJM Catalyst. 2024.

[11] Eshel R, Bellolio F, Boggust A, Shapiro NI, Mullan AF, Heaton HA, et al. Comparison of clinical note quality between an automated digital intake tool and the standard note in the emergency department. The American Journal of Emergency Medicine. 2023;63:79–85.

[12] Cabral S, Restrepo D, Kanjee Z, Wilson P, Crowe B, Abdulnour RE, et al. Clinical Reasoning of a Generative Artificial Intelligence Model Compared With Physicians. JAMA Internal Medicine. 2024 May;184(5):581–583.

[13] Sai AB, Mohankumar AK, Khapra MM. A Survey of Evaluation Metrics Used for NLG Systems. ACM Computing Surveys. 2023;55(2).

[14] Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023 Jul:1–9.

[15] Otmakhova Y, Verspoor K, Baldwin T, Lau JH. The patient is more dead than alive: exploring the current state of the multi-document summarisation of the biomedical literature. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics; 2022. p. 5098–5111. Available from: aclanthology.org/202… .

[16] Adams G, Zucker J, Elhadad N. A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization. 2023 Mar. ArXiv:2303.03948 [cs]. Available from: arxiv.org/abs/2303.03948 .

[17] Guo Y, Qiu W, Wang Y, Cohen T. Automated Lay Language Summarization of Biomedical Scientific Reviews. 2022 Jan. ArXiv:2012.12573 [cs]. Available from: arxiv.org/abs/2012.1… .

[18] Wallace BC, Saha S, Soboczenski F, Marshall IJ. Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization; 2020. Available from: https://arxiv.org/abs/2008.11293v2 .

[19] Abacha AB, Yim Ww, Michalopoulos G, Lin T. An Investigation of Evaluation Metrics for Automated Medical Note Generation. 2023 May. ArXiv:2305.17364 [cs]. Available from: arxiv.org/abs/2305.17364 .

[20] Yadav S, Gupta D, Abacha AB, Demner-Fushman D. Reinforcement Learning for Abstractive Question Summarization with Question-aware Semantic Rewards. 2021 Jun. ArXiv:2107.00176 [cs]. Available from: arxiv.org/abs/2107.0… .

[21] Moor M, Huang Q, Wu S, Yasunaga M, Zakka C, Dalmia Y, et al. Med-Flamingo: a Multimodal Medical Few-shot Learner. 2023 Jul. ArXiv:2307.15189 [cs]. Available from: arxiv.org/abs/2307.15189 .

[22] Dalla Serra F, Clackett W, MacKinnon H, Wang C, Deligianni F, Dalton J, et al. Multimodal Generation of Radiology Reports using Knowledge-Grounded Extraction of Entities and Relations. In: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online only: Association for Computational Linguistics; 2022. p. 615–624. Available from: aclanthology.org/202… .

[23] Cai P, Liu F, Bajracharya A, Sills J, Kapoor A, Liu W, et al. Generation of Patient After-Visit Summaries to Support Physicians. In: Proceedings of the 29th International Conference on Computational Linguistics. Gyeongju, Republic of Korea: International Committee on Computational Linguistics; 2022. p. 6234–6247. Available from: aclanthology.org/202… .

[24] Umapathi LK, Pal A, Sankarasubbu M. Med-HALT: Medical Domain Hallucination Test for Large Language Models. 2023 Jul. ArXiv:2307.15343 [cs, stat]. Available from: arxiv.org/abs/2307.15343 .

[25] Levenshtein VI. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady. 1966 Feb;10:707.

[26] Gao Y, Dligach D, Miller T, Tesch S, Laffin R, Churpek MM, et al. Hierarchical Annotation for Building A Suite of Clinical Natural Language Processing Tasks: Progress Note Understanding. In: Calzolari N, Béchet F, Blache P, Choukri K, Cieri C, Declerck T, et al., editors. Proceedings of the Thirteenth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association; 2022. p. 5484–5493. Available from: aclanthology.org/202… .

[27] Goldsack T, Scarton C, Shardlow M, Lin C. Overview of the BioLaySumm 2024 Shared Task on the Lay Summarization of Biomedical Research Articles. In: Demner-Fushman D, Ananiadou S, Miwa M, Roberts K, Tsujii J, editors. Proceedings of the 23rd Workshop on Biomedical Natural Language Processing. Bangkok, Thailand: Association for Computational Linguistics; 2024. p. 122–131. Available from: aclanthology.org/202… .

[28] Gupta D, Demner-Fushman D. Overview of the MedVidQA 2022 Shared Task on Medical Video Question-Answering. In: Demner-Fushman D, Cohen KB, Ananiadou S, Tsujii J, editors. Proceedings of the 21st Workshop on Biomedical Language Processing. Dublin, Ireland: Association for Computational Linguistics; 2022. p. 264–274. Available from: aclanthology.org/202… .

[29] Lin CY. ROUGE: A Package for Automatic Evaluation of Summaries. In: Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics; 2004. p. 74-81. Available from: aclanthology.org/W04… .

[30] Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y. BERTScore: Evaluating Text Generation with BERT. 2020 Feb. ArXiv:1904.09675 [cs]. Available from: arxiv.org/abs/1904.0… .

[31] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019 May. ArXiv:1810.04805 [cs]. Available from: arxiv.org/abs/1810.04805 .

[32] Banerjee S, Lavie A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In: Goldstein J, Lavie A, Lin CY, Voss C, editors. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, Michigan: Association for Computational Linguistics; 2005. p. 65–72. Available from: aclanthology.org/W05… .

[33] Louis A, Nenkova A. Automatically Assessing Machine Summary Content Without a Gold Standard. Computational Linguistics. 2013 Jun;39(2):267–300.

[34] Vedantam R, Zitnick CL, Parikh D. CIDEr: Consensus-based image description evaluation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE; 2015. p. 4566–4575. Available from: ieeexplore.ieee.org/… .

[35] Gao Y, Sun C, Passonneau RJ. Automated Pyramid Summarization Evaluation. 2019.

[36] Papineni K, Roukos S, Ward T, Zhu WJ. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. ACL ’02. USA: Association for Computational Linguistics; 2002. p. 311–318. Available from: https://doi.org/10.3115/1073083.1073135 .

[37] Cohan A, Goharian N. Revisiting Summarization Evaluation for Scientific Articles. 2016.

[38] Lin J, Demner-Fushman D. Automatically Evaluating Answers to Definition Questions. In: Mooney R, Brew C, Chien LF, Kirchhoff K, editors. Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. Vancouver, British Columbia, Canada: Association for Computational Linguistics; 2005. p. 931–938. Available from: https://aclanthology.org/H05-1117 .

[39] Hovy E, Lin CY, Zhou L, Fukumoto J. Automated Summarization Evaluation with Basic Elements. In: Calzolari N, Choukri K, Gangemi A, Maegaard B, Mariani J, Odijk J, et al., editors. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06). Genoa, Italy: European Language Resources Association (ELRA); 2006. Available from: www.lrec-conf.org/proceedings/lrec2006/pdf/438_pdf.pdf .

[40] Turian JP, Shen L, Melamed ID. Evaluation of machine translation and its evaluation. In: Proceedings of Machine Translation Summit IX: Papers. New Orleans, USA; 2003. Available from: https://aclanthology.org/2003.mtsummit-papers.51 .

[41] Su KY, Wu MW, Chang JS. A New Quantitative Quality Measure for Machine Translation Systems. In: COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics; 1992. Available from: aclanthology.org/C92… .

[42] Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J. A Study of Translation Edit Rate with Targeted Human Annotation. In: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers. Cambridge, Massachusetts, USA: Association for Machine Translation in the Americas; 2006. p. 223–231. Available from: aclanthology.org/2006.amta-papers.25 .

[43] Panja J, Naskar SK. ITER: Improving Translation Edit Rate through Optimizable Edit Costs. In: Bojar O, Chatterjee R, Federmann C, Fishel M, Graham Y, Haddow B, et al., editors. Proceedings of the Third Conference on Machine Translation: Shared Task Papers. Belgium, Brussels: Association for Computational Linguistics; 2018. p. 746–750. Available from: aclanthology.org/W18… .

[44] Leusch G, Ueffing N, Ney H. CDER: Efficient MT Evaluation Using Block Movements. In: McCarthy D, Wintner S, editors. 11th Conference of the European Chapter of the Association for Computational Linguistics. Trento, Italy: Association for Computational Linguistics; 2006. p. 241–248. Available from: aclanthology.org/E06… .

[45] Popović M. chrF: character n-gram F-score for automatic MT evaluation. In: Bojar O, Chatterjee R, Federmann C, Haddow B, Hokamp C, Huck M, et al., editors. Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon, Portugal: Association for Computational Linguistics; 2015. p. 392–395. Available from: aclanthology.org/W15… .

[46] Wang W, Peter JT, Rosendahl H, Ney H. CharacTer: Translation Edit Rate on Character Level. In: Bojar O, Buck C, Chatterjee R, Federmann C, Guillou L, Haddow B, et al., editors. Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. Berlin, Germany: Association for Computational Linguistics; 2016. p. 505–510. Available from: https://aclanthology.org/W16-2342 .

[47] Stanchev P, Wang W, Ney H. EED: Extended Edit Distance Measure for Machine Translation. In: Bojar O, Chatterjee R, Federmann C, Fishel M, Graham Y, Haddow B, et al., editors. Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). Florence, Italy: Association for Computational Linguistics; 2019. p. 514–520. Available from: https://aclanthology.org/W19-5359 .

[48] Lo Ck. YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources. In: Bojar O, Chatterjee R, Federmann C, Fishel M, Graham Y, Haddow B, et al., editors. Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). Florence, Italy: Association for Computational Linguistics; 2019. p. 507–513. Available from: aclanthology.org/W19… .

[49] Nema P, Khapra MM. Towards a Better Metric for Evaluating Question Generation Systems. In: Riloff E, Chiang D, Hockenmaier J, Tsujii J, editors. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics; 2018. p. 3950–3959. Available from: aclanthology.org/D18… .

[50] Gao Y, Dligach D, Miller T, Xu D, Churpek MM, Afshar M. Summarizing Patients Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models. 2022 Sep. ArXiv:2208.08408 [cs]. Available from: arxiv.org/abs/2208.0… .

[51] Rei R, Stewart C, Farinha AC, Lavie A. COMET: A Neural Framework for MT Evaluation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics; 2020. p. 2685–2702. Available from: https://aclanthology.org/2020.emnlp-main.213 .

[52] Sellam T, Das D, Parikh AP. BLEURT: Learning Robust Metrics for Text Generation. 2020 May. ArXiv:2004.04696 [cs]. Available from: arxiv.org/abs/2004.0… .

[53] Lin Z, Liu C, Ng HT, Kan MY. Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation. In: Li H, Lin CY, Osborne M, Lee GG, Park JC, editors. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Jeju Island, Korea: Association for Computational Linguistics; 2012. p. 1006–1014. Available from: aclanthology.org/P12… .

[54] Stanojević M, Sima’an K. Fitting Sentence Level Translation Evaluation with Many Dense Features. In: Moschitti A, Pang B, Daelemans W, editors. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics; 2014. p. 202–206. Available from: aclanthology.org/D14… .

[55] Ma Q, Graham Y, Wang S, Liu Q. Blend: a Novel Combined MT Metric Based on Direct Assessment — CASICT-DCU submission to WMT17 Metrics Task. In: Bojar O, Buck C, Chatterjee R, Federmann C, Graham Y, Haddow B, et al., editors. Proceedings of the Second Conference on Machine Translation. Copenhagen, Denmark: Association for Computational Linguistics; 2017. p. 598–603. Available from: aclanthology.org/W17… .

[56] Sharif N, White L, Bennamoun M, Ali Shah SA. Learning-based Composite Metrics for Improved Caption Evaluation. In: Shwartz V, Tabassum J, Voigt R, Che W, de Marneffe MC, Nissim M, editors. Proceedings of ACL 2018, Student Research Workshop. Melbourne, Australia: Association for Computational Linguistics; 2018. p. 14–20. Available from: aclanthology.org/P18… .

[57] Chen Q, Zhu X, Ling ZH, Wei S, Jiang H, Inkpen D. Enhanced LSTM for Natural Language Inference. In: Barzilay R, Kan MY, editors. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics; 2017. p. 1657–1668. Available from: aclanthology.org/P17… .

[58] Shimanaka H, Kajiwara T, Komachi M. RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation. In: Bojar O, Chatterjee R, Federmann C, Fishel M, Graham Y, Haddow B, et al., editors. Proceedings of the Third Conference on Machine Translation: Shared Task Papers. Belgium, Brussels: Association for Computational Linguistics; 2018. p. 751–758. Available from: aclanthology.org/W18… .

[59] Shimanaka H, Kajiwara T, Komachi M. Machine Translation Evaluation with BERT Regressor. 2019 Jul. ArXiv:1907.12679 [cs]. Available from: arxiv.org/abs/1907.1… .

[60] Zhang S, Liu Y, Meng F, Chen Y, Xu J, Liu J, et al. Conditional Bilingual Mutual Information Based Adaptive Training for Neural Machine Translation. In: Muresan S, Nakov P, Villavicencio A, editors. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics; 2022. p. 2377–2389. Available from: aclanthology.org/202… .

[61] Doddington G. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the second international conference on Human Language Technology Research -. San Diego, California: Association for Computational Linguistics; 2002. p. 138. Available from: portal.acm.org/citat… .

[62] Zhao W, Peyrard M, Liu F, Gao Y, Meyer CM, Eger S. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. In: Inui K, Jiang J, Ng V, Wan X, editors. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics; 2019. p. 563–578. Available from: https://aclanthology.org/D19-1053 .

[63] Giannakopoulos G, Karkaletsis V. AutoSummENG and MeMoG in Evaluating Guided Summaries. 2011.

[64] Anderson P, Fernando B, Johnson M, Gould S. SPICE: Semantic Propositional Image Caption Evaluation. In: Leibe B, Matas J, Sebe N, Welling M, editors. Computer Vision – ECCV 2016. Cham: Springer International Publishing; 2016. p. 382–398.

[65] Mathur N, Baldwin T, Cohn T. Putting Evaluation in Context: Contextual Embeddings Improve Machine Translation Evaluation. In: Korhonen A, Traum D, Màrquez L, editors. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics; 2019. p. 2799–2808. Available from: aclanthology.org/P19-1269 .

[66] Echizen’ya H, Araki K, Hovy E. Word Embedding-Based Automatic MT Evaluation Metric using Word Position Information. In: Burstein J, Doran C, Solorio T, editors. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics; 2019. p. 1874–1883. Available from: aclanthology.org/N19-1186 .

[67] Kusner M, Sun Y, Kolkin N, Weinberger K. From Word Embeddings To Document Distances. In: Proceedings of the 32nd International Conference on Machine Learning. PMLR; 2015. p. 957–966. Available from: proceedings.mlr.pres… .

[68] Wieting J, Berg-Kirkpatrick T, Gimpel K, Neubig G. Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. In: Korhonen A, Traum D, Màrquez L, editors. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics; 2019. p. 4344–4355. Available from: aclanthology.org/P19… .

[69] Kane H, Kocyigit MY, Abdalla A, Ajanoh P, Coulibali M. NUBIA: NeUral Based Interchangeability Assessor for Text Generation. 2020 May. ArXiv:2004.14667 [cs]. Available from: arxiv.org/abs/2004.14667 .

[70] Liu F, Shareghi E, Meng Z, Basaldella M, Collier N. Self-Alignment Pretraining for Biomedical Entity Representations. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics; 2021. p. 4228-38. Available from: www.aclweb.org/antho… 2021.naacl-main.334 .

[71] Alsentzer E, Murphy JR, Boag W, Weng W, Jin D, Naumann T, et al. Publicly Available Clinical BERT Embeddings. CoRR. 2019;abs/1904.03323. Available from: arxiv.org/abs/1904.0… .

[72] Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing; 2020.

[73] Delbrouck JB. UMLS Scorer; 2023. Available from: storage.googleapis.c… vilmedic_dataset/packages/medcon/UMLSScorer.zip .

[74] Yuan W, Neubig G, Liu P. BARTScore: Evaluating Generated Text as Text Generation. 2021 Oct. ArXiv:2106.11520 [cs]. Available from: arxiv.org/abs/2106.1… .

[75] Son S, Park J, Hwang Ji, Lee J, Noh H, Lee Y. HaRiM+: Evaluating Summary Quality with Hallucination Risk. 2022.

[76] Akter M, Bansal N, Karmaker SK. Revisiting Automatic Evaluation of Extractive Summarization Task: Can We Do Better than ROUGE? In: Findings of the Association for Computational Linguistics: ACL 2022. Dublin, Ireland: Association for Computational Linguistics; 2022. p. 1547–1560. Available from: aclanthology.org/202… .

[77] Aracena C, Villena F, Rojas M, Dunstan J. A Knowledge-Graph-Based Intrinsic Test for Benchmarking Medical Concept Embeddings and Pretrained Language Models. 2022.

[78] Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In: Jurafsky D, Chai J, Schluter N, Tetreault J, editors. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics; 2020. p. 7871–7880. Available from: aclanthology.org/202… .

[79] Lindberg DA MA Humphreys BL. The Unified Medical Language System. Yearb Med Inform. 1993;1(4):41-51.

[80] Liu F, Shareghi E, Meng Z, Basaldella M, Collier N. Self-Alignment Pretraining for Biomedical Entity Representations. In: Toutanova K, Rumshisky A, Zettlemoyer L, Hakkani-Tur D, Beltagy I, Bethard S, et al., editors. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics; 2021. p. 4228–4238. Available from: aclanthology.org/202… .

[81] Christiano P, Leike J, Brown TB, Martic M, Legg S, Amodei D. Deep reinforcement learning from human preferences; 2017. Available from: arxiv.org/abs/1706.0… .

[82] OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 Technical Report. 2024 Mar. ArXiv:2303.08774 [cs]. Available from: arxiv.org/abs/2303.0… .

[83] Zheng L, Chiang WL, Sheng Y, Zhuang S, Wu Z, Zhuang Y, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. 2023 Dec. ArXiv:2306.05685 [cs]. Available from: http://arxiv.org/abs/2306.05685 .

[84] Lester B, Al-Rfou R, Constant N. The Power of Scale for Parameter-Efficient Prompt Tuning. In: Moens MF, Huang X, Specia L, Yih SWt, editors. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics; 2021. p. 3045–3059. Available from: aclanthology.org/2021.emnlp-main.243 .

[85] Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient Finetuning of Quantized LLMs. 2023 May. ArXiv:2305.14314 [cs]. Available from: arxiv.org/abs/2305.1… .

[86] Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al. LoRA: Low-Rank Adaptation of Large Language Models. 2021 Oct. ArXiv:2106.09685 [cs]. Available from: arxiv.org/abs/2106.09685 .

[87] Ziegler DM, Stiennon N, Wu J, Brown TB, Radford A, Amodei D, et al. Fine-Tuning Language Models from Human Preferences; 2019. Available from: arxiv.org/abs/1909.0… .

[88] Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal Policy Optimization Algorithms. 2017 Aug. ArXiv:1707.06347 [cs]. Available from: arxiv.org/abs/1707.0… .

[89] Rafailov R, Sharma A, Mitchell E, Ermon S, Manning CD, Finn C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2023 May. ArXiv:2305.18290 [cs]. Available from: arxiv.org/abs/2305.1… .

[90] Wen J, Zhong R, Khan A, Perez E, Steinhardt J, Huang M, et al. Language Models Learn to Mislead Humans via RLHF. 2024 Sep. ArXiv:2409.12822 [cs]. Available from: arxiv.org/abs/2409.12822 .

[91] Cao X, Xu W, Zhao J, Duan Y, Yang X. Research on Large Language Model for Coal Mine Equipment Maintenance Based on Multi-Source Text. APPLIED SCIENCES-BASEL. 2024 Apr;14(7).

[92] Iqbal S, Mehran K, IEEE. Reinforcement Learning Based Optimal Energy Management of A Microgrid; 2022.

[93] Sun Z, Zhou Y, Hao J, Fan X, Lu Y, Ma C, et al. Improving Contextual Query Rewrite for Conversational AI Agents through User-preference Feedback Learning. In: Wang M, Zitouni I, editors. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track. Singapore: Association for Computational Linguistics; 2023. p. 432–439. Available from: aclanthology.org/202… .

[94] Bansal H, Suvarna A, Bhatt G, Peng N, Chang KW, Grover A. Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization. 2024 Mar. ArXiv:2404.00530 [cs]. Available from: arxiv.org/abs/2404.0… .

[95] Meng Y, Xia M, Chen D. SimPO: Simple Preference Optimization with a Reference-Free Reward. 2024 May. ArXiv:2405.14734 [cs]. Available from: arxiv.org/abs/2405.1… .

[96] Ethayarajh K, Xu W, Muennighoff N, Jurafsky D, Kiela D. KTO: Model Alignment as Prospect Theoretic Optimization. 2024 Jun. ArXiv:2402.01306. Available from: arxiv.org/abs/2402.01306 .

[97] Rosset C, Cheng CA, Mitra A, Santacroce M, Awadallah A, Xie T. Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences. 2024 Apr. ArXiv:2404.03715 [cs]. Available from: arxiv.org/abs/2404.0… .

[98] Liu T, Zhao Y, Joshi R, Khalman M, Saleh M, Liu PJ, et al. Statistical Rejection Sampling Improves Preference Optimization. 2024 Jan. ArXiv:2309.06657 [cs]. Available from: arxiv.org/abs/2309.06657 .

[99] Azar MG, Rowland M, Piot B, Guo D, Calandriello D, Valko M, et al. A General Theoretical Paradigm to Understand Learning from Human Preferences. 2023 Nov. ArXiv:2310.12036 [cs, stat]. Available from: arxiv.org/abs/2310.1… .

[100] Mitchell E. A note on DPO with noisy preferences and relationship to IPO; 2023. V1.1.

[101] Hong J, Lee N, Thorne J. ORPO: Monolithic Preference Optimization without Reference Model. 2024 Mar. ArXiv:2403.07691 [cs]. Available from: arxiv.org/abs/2403.0… .

[102] Chowdhury SR, Kini A, Natarajan N. Provably Robust DPO: Aligning Language Models with Noisy Feedback. 2024 Apr. ArXiv:2403.00409 [cs]. Available from: arxiv.org/abs/2403.0… .

[103] Jung S, Han G, Nam DW, On KW. Binary Classifier Optimization for Large Language Model Alignment. 2024 Apr. ArXiv:2404.04656 [cs]. Available from: arxiv.org/abs/2404.0… .

[104] Gorbatovski A, Shaposhnikov B, Malakhov A, Surnachev N, Aksenov Y, Maksimov I, et al. Learn Your Reference Model for Real Good Alignment. 2024 May. ArXiv:2404.09656 [cs]. Available from: arxiv.org/abs/2404.0… .

[105] Xu H, Sharaf A, Chen Y, Tan W, Shen L, Van Durme B, et al. Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation. 2024 Jun. ArXiv:2401.08417 [cs]. Available from: arxiv.org/abs/2401.0… .

[106] Wu Y, Sun Z, Yuan H, Ji K, Yang Y, Gu Q. Self-Play Preference Optimization for Language Model Alignment. 2024 Jun. ArXiv:2405.00675 [cs, stat]. Available from: arxiv.org/abs/2405.00675 .

[107] Ji H, Lu C, Niu Y, Ke P, Wang H, Zhu J, et al. Towards Efficient Exact Optimization of Language Model Alignment. 2024 Jun. ArXiv:2402.00856 [cs]. Available from: arxiv.org/abs/2402.0… .

[108] Melnyk I, Mroueh Y, Belgodere B, Rigotti M, Nitsure A, Yurochkin M, et al. Distributional Preference Alignment of LLMs via Optimal Transport. 2024 Jun. ArXiv:2406.05882 [cs, stat]. Available from: arxiv.org/abs/2406.0… .

[109] Pang RY, Yuan W, Cho K, He H, Sukhbaatar S, Weston J. Iterative Reasoning Preference Optimization. 2024 Jun. ArXiv:2404.19733 [cs]. Available from: arxiv.org/abs/2404.1… .

[110] Chen H, He G, Yuan L, Cui G, Su H, Zhu J. Noise Contrastive Alignment of Language Models with Explicit Rewards. 2024 Jul. ArXiv:2402.05369 [cs]. Available from: arxiv.org/abs/2402.05369 .

[111] Zhong H, Feng G, Xiong W, Cheng X, Zhao L, He D, et al. DPO Meets PPO: Reinforced Token Optimization for RLHF. 2024 Jul. ArXiv:2404.18922 [cs, stat]. Available from: arxiv.org/abs/2404.18922 .

[112] Koo R, Lee M, Raheja V, Park JI, Kim ZM, Kang D. Benchmarking Cognitive Biases in Large Language Models as Evaluators. 2024 Aug. ArXiv:2309.17012 [cs]. Available from: http://arxiv.org/abs/2309.17012 .
