[LangChain Series] Part 9: Introduction and Practice of LLM Application Evaluation

Summary: This article explores how to evaluate increasingly complex and sophisticated LLM applications. We build a QA application on the GPT-3.5-Turbo model, construct test data both by writing examples by hand and by generating them with an LLM, and then measure performance through manual evaluation, debugging, and LLM-assisted evaluation. For manual evaluation, the langchain.debug utility exposes execution details at every step, while QAEvalChain uses the LLM's semantic understanding to grade predictions. Together, these methods help you optimize the accuracy and efficiency of your LLM applications.

[toc]


As language models (LLMs) continue to advance, the applications built on them are becoming ever more complex and sophisticated. With that complexity, evaluating the performance and accuracy of these LLM-based applications becomes more challenging as well. In this blog post, we dive into the world of LLM application evaluation and explore frameworks and tools that can help you assess and improve your model's performance.

I. Creating a QA Application

import os
from dotenv import load_dotenv, find_dotenv
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores.docarray import DocArrayInMemorySearch
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_openai import ChatOpenAI

# Load OPENAI_API_KEY (and any other secrets) from a .env file
_ = load_dotenv(find_dotenv())

# Resolve the catalog CSV relative to this notebook's directory
notebook_path = os.path.abspath("__file__")
notebook_directory = os.path.dirname(notebook_path)
csv_file_path = os.path.join(notebook_directory, '..', 'OutdoorClothingCatalog_1000.csv')
loader = CSVLoader(file_path=csv_file_path)
data = loader.load()

# Build an in-memory vector index over the catalog documents
index = VectorstoreIndexCreator(vectorstore_cls=DocArrayInMemorySearch).from_loaders(
    [loader]
)

llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)

# "stuff" packs all retrieved documents into a single prompt; the custom
# separator makes document boundaries easy to spot when debugging
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={"document_separator": "<<<<>>>>>"},
)
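With the chain assembled, a quick smoke test confirms the pipeline works end to end before we invest in building test data. This is a minimal sketch; the query is a hypothetical example, so substitute any item from the catalog above.

# Sanity check: ask one question against the indexed catalog
# (the query below is a hypothetical example)
response = qa.invoke("Do any of the shirts offer sun protection?")
print(response["result"])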

II. Building Test Data

Before we can evaluate an LLM application, we need a reliable set of test data. There are two main ways to generate it:

1. Manually Creating Examples

The traditional approach is to review your data by hand and craft query-answer pairs. Suppose you are working with a clothing dataset: you might read through the descriptions and write questions such as "Does the Cozy Comfort Pullover Set have side pockets?" along with their answers. This approach gives you full control over the examples, but it is time-consuming and does not scale well to larger datasets.

# Hard-coded examples
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes",
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection",
    },
]

2. Generating Examples with an LLM

You can also use an LLM itself to generate test data. LangChain provides QAGenerateChain, which automatically generates query-answer pairs from your documents. Think of it as an AI assistant that creates plausible questions and answers based on your data.

from langchain.evaluation.qa import QAGenerateChain
from pprint import pprint

# Generate one QA pair per document for the first five catalog entries
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))
new_examples = example_gen_chain.batch([{"doc": t} for t in data[:5]])
pprint(new_examples[0]["qa_pairs"])
# Output
# {'answer': "The approximate weight of the Women's Campside Oxfords per pair is "
#            '1 lb. 1 oz.',
#  'query': "What is the approximate weight of the Women's Campside Oxfords per "
#           'pair?'}

data[0]
# Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", 
#         metadata={'source': '/home/voldemort/Downloads/Code/Langchain_Harrison_Chase/Course_1/OutdoorClothingCatalog_1000.csv', 'row': 0})

By combining hand-crafted examples with LLM-generated ones, you can quickly build a robust test dataset.

# Merge the generated pairs into the hand-written examples
examples.extend([inst["qa_pairs"] for inst in new_examples])
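A quick check confirms the merged dataset: the two hand-written pairs plus the five generated from data[:5] give seven examples in total.

# Two hand-written + five LLM-generated examples
print(len(examples))
# 7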

III. Manual Evaluation and Debugging

With test data in hand, it is time to evaluate your LLM application's performance. The simplest approach is to run the examples through the application and inspect the final output.

qa.invoke(examples[-1]["query"])
# Output
# Entering new RetrievalQA chain...
# Finished chain.
# {'query': 'What technology is used in the EcoFlex 3L Storm Pants to make them more breathable and keep the wearer dry and comfortable?',
#  'result': 'The EcoFlex 3L Storm Pants use TEK O2 technology to make them more breathable and keep the wearer dry and comfortable.'}

However, this approach has limitations: it offers no insight into the intermediate steps of the application's pipeline or the problems that may be lurking there.

1. Running Examples Through the Application

For deeper visibility into your application's behavior, LangChain provides the langchain.debug utility. When enabled, it prints detailed information at every step of the application's execution, including prompts, context, and intermediate results.

import langchain
langchain.debug = True
qa.invoke(examples[0]["query"])

By inspecting this output, you can spot problems in the retrieval or prompting steps, which lets you pinpoint and fix issues more effectively.

"""
Output:

> Entering new RetrievalQA chain...
> Entering Chain run with input:
{
  "query": "Do the Cozy Comfort Pullover Set have side pockets?"
}
> Entering StuffDocumentsChain run with input:
[inputs]
> Entering LLMChain run with input:
{
  "question": "Do the Cozy Comfort Pullover Set have side pockets?",
  "context": ": 73\nname: Cozy Cuddles Knit Pullover Set\n...
}
[llm/start] Entering LLM run with input:
{
  "prompts": [
    "System: Use the following pieces of context to answer the user's question. \nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n: 73\nname: Cozy Cuddles Knit Pullover Set\n...
Human: Do the Cozy Comfort Pullover Set have side pockets?"
  ]
}
[llm/end] [1.89s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "Yes, the Cozy Comfort Pullover Set does have side pockets.",
        ...
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "completion_tokens": 14,
      "prompt_tokens": 733,
      "total_tokens": 747
    },
    "model_name": "gpt-3.5-turbo",
    "system_fingerprint": "fp_3b956da36b"
  },
  "run": null
}
[chain/end] [1.89s] Exiting Chain run with output:
{
  "text": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
[chain/end] [1.89s] Exiting Chain run with output:
{
  "output_text": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
[chain/end] [2.36s] Exiting Chain run with output:
{
  "result": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
"""

# Final Output:
# {'query': 'Do the Cozy Comfort Pullover Set have side pockets?',
#  'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'}
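One practical note: the flag stays on for the rest of the session, so once you have found the problem, switch it back off to keep subsequent runs readable.

# Turn off verbose tracing when you are done debugging
langchain.debug = False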

IV. LLM-Assisted Evaluation

Manual evaluation is valuable, but as the number of examples grows it quickly becomes tedious and subjective. This is where LLM-assisted evaluation comes in.

1. Getting Predictions for the Examples

The first step is to run your examples through the LLM application and collect its predictions.

# Run every example through the chain; each dict's "query" key is the input
predictions = qa.batch(inputs=examples)

2. Grading with QAEvalChain

LangChain provides QAEvalChain, an LLM-backed chain designed to grade the correctness of your application's predictions. Because it relies on the LLM's grasp of semantic similarity, it can grade accurately even when a prediction does not match the expected answer word for word.

from langchain.evaluation import QAEvalChain

llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)

# Grade each prediction against its reference answer
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions)

By reviewing the graded outputs, you can quickly identify areas for improvement and iterate on your LLM application.

for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]["query"])
    print("Real Answer: " + predictions[i]["answer"])
    print("Predicted Answer: " + predictions[i]["result"])
    print("Predicted Grade: " + graded_outputs[i]["results"])
    print()

The final output looks something like this:

Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the approximate weight of the Women's Campside Oxfords per pair?
Real Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?
Real Answer: The small size of the Recycled Waterhog Dog Mat, Chevron Weave has dimensions of 18" x 28", while the medium size has dimensions of 22.5" x 34.5".
Predicted Answer: The dimensions of the small size of the Recycled Waterhog Dog Mat, Chevron Weave are 18" x 28", and the dimensions of the medium size are 22.5" x 34.5".
Predicted Grade: CORRECT

Example 4:
Question: What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece as described in the document?
Real Answer: Some key features of the swimsuit include bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, fully lined bottom, secure fit, and maximum coverage.
Predicted Answer: Some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece are:
- Bright colors, ruffles, and exclusive whimsical prints
- Four-way-stretch and chlorine-resistant fabric
- UPF 50+ rated fabric for high sun protection
- Crossover no-slip straps and fully lined bottom for a secure fit and coverage
- Machine washable and line dry for best results
Predicted Grade: CORRECT

Example 5:
Question: What is the fabric composition of the Refresh Swimwear, V-Neck Tankini Contrasts?
Real Answer: The body of the tankini top is made of 82% recycled nylon and 18% Lycra® spandex, while the lining is made of 90% recycled nylon and 10% Lycra® spandex.
Predicted Answer: The fabric composition of the Refresh Swimwear, V-Neck Tankini Contrasts is as follows:
- Body: 82% recycled nylon, 18% Lycra® spandex
- Lining: 90% recycled nylon, 10% Lycra® spandex
Predicted Grade: CORRECT

Example 6:
Question: What technology is featured in the EcoFlex 3L Storm Pants that makes them more breathable?
Real Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which offers the most breathability ever tested.
Predicted Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which is a state-of-the-art air-permeable technology that offers the most breathability tested by the brand.
Predicted Grade: CORRECT

graded_outputs[-1]
# {'results': 'CORRECT'}
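Beyond inspecting individual grades, you can reduce them to a single accuracy number to track across iterations. Here is a minimal sketch, assuming each entry's "results" value is the string "CORRECT" or "INCORRECT" as in the output above.

# Aggregate per-example grades into an overall accuracy score
# (assumes each grade is "CORRECT" or "INCORRECT")
correct = sum(1 for g in graded_outputs if g["results"].strip() == "CORRECT")
print(f"Accuracy: {correct}/{len(graded_outputs)} = {correct / len(graded_outputs):.0%}")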

Summary

Evaluating LLM applications is a critical step in ensuring their reliability and performance. By leveraging tools such as LangChain's QAGenerateChain, langchain.debug, QAEvalChain, and the LangChain evaluation platform, you can streamline the evaluation process, gain insight into your application's behavior, and iterate more efficiently. Whether you are a seasoned machine learning practitioner or just getting started, these frameworks and tools can help you unlock the full potential of your LLM applications.

