【LangChain系列】第九篇：LLM 应用评估简介及实践-阿里云开发者社区

[toc]

随着语言模型（LLMs）的不断进步，它们的应用变得越来越复杂和精密。随着这种复杂性的增加，评估这些基于LLM的应用程序的性能和准确性也变得更具挑战性。在这篇博客文章中，我们将深入探讨LLM应用评估的世界，探讨可以帮助您评估和改进模型性能的框架和工具。

一、创建QA应用程序

import os
from dotenv import load_dotenv, find_dotenv
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores.docarray import DocArrayInMemorySearch
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_openai import ChatOpenAI

_ = load_dotenv(find_dotenv())
notebook_path = os.path.abspath("__file__")
notebook_directory = os.path.dirname(notebook_path)
csv_file_path = os.path.join(notebook_directory, '..', 'OutdoorClothingCatalog_1000.csv')
loader = CSVLoader(file_path=csv_file_path)
data = loader.load()
index = VectorstoreIndexCreator(vectorstore_cls=DocArrayInMemorySearch).from_loaders(
    [loader]
)
llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={
   "document_separator": "<<<<>>>>>"},
)

二、构建测试数据

在我们评估LLM应用程序之前，我们需要一组可靠的测试数据。生成测试数据有两种主要方法：

1.手动创建示例

传统的方法涉及手动审查您的数据并制作查询-答案对。假设您正在使用一个服装数据集。您可以浏览描述并创建问题，比如“Cozy Comfort Pullover Set有侧口袋吗？”并提供相应的答案。虽然这种方法让您完全控制示例，但它可能会耗费时间，并且在处理更大的数据集时可能不太容易扩展。

# Hard-coded examples
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set \
        have side pockets?",
        "answer": "Yes",
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection",
    },
]

2.使用LLM生成示例

您也可以使用LLM本身来生成测试数据。LangChain 提供了 QAGenerateChain，它可以从您的文档自动生成查询-答案对。它是一个可以根据您的数据创建假设性问题和答案的AI助手。

from langchain.evaluation.qa import QAGenerateChain
from pprint import pprint
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))
new_examples = example_gen_chain.batch([{"doc": t} for t in data[:5]])
pprint(new_examples[0]["qa_pairs"])
# Output
# {'answer': "The approximate weight of the Women's Campside Oxfords per pair is "
#            '1 lb. 1 oz.',
#  'query': "What is the approximate weight of the Women's Campside Oxfords per "
#           'pair?'}

data[0]
# Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", 
#         metadata={'source': '/home/voldemort/Downloads/Code/Langchain_Harrison_Chase/Course_1/OutdoorClothingCatalog_1000.csv', 'row': 0})

通过结合手工制作的示例和LLM生成的示例，您可以快速构建一个强大的测试数据集。

examples.extend([inst["qa_pairs"] for inst in new_examples])

三、手动评估和调试

有了测试数据，现在是时候评估你的LLM应用程序的性能了。最简单的方法是通过应用程序运行示例并检查最终输出。

qa.invoke(examples[-1]["query"])
# Output
# Entering new RetrievalQA chain...
# Finished chain.
# {'query': 'What technology is used in the EcoFlex 3L Storm Pants to make them more breathable and keep the wearer dry and comfortable?',
#  'result': 'The EcoFlex 3L Storm Pants use TEK O2 technology to make them more breathable and keep the wearer dry and comfortable.'}

然而，这种方法可能有局限性，因为它无法提供有关应用程序流程中间步骤或潜在问题的洞察。

1.通过应用程序运行示例

为了更深入了解您的应用程序行为，LangChain提供了langchain.debug工具。当启用时，此实用程序会在应用程序执行的每个步骤中打印出详细信息，包括提示、上下文和中间结果。

import langchain
langchain.debug = True
qa.invoke(examples[0]["query"])

通过检查这个输出，您可以识别检索或提示步骤中的潜在问题，从而让您更有效地找出并解决问题。

"""
Output:

> Entering new RetrievalQA chain...
> Entering Chain run with input:
{
  "query": "Do the Cozy Comfort Pullover Set have side pockets?"
}
> Entering StuffDocumentsChain run with input:
[inputs]
> Entering LLMChain run with input:
{
  "question": "Do the Cozy Comfort Pullover Set have side pockets?",
  "context": ": 73\nname: Cozy Cuddles Knit Pullover Set\n...
}
[llm/start] Entering LLM run with input:
{
  "prompts": [
    "System: Use the following pieces of context to answer the user's question. \nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n: 73\nname: Cozy Cuddles Knit Pullover Set\n...
Human: Do the Cozy Comfort Pullover Set have side pockets?"
  ]
}
[llm/end] [1.89s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "Yes, the Cozy Comfort Pullover Set does have side pockets.",
        ...
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "completion_tokens": 14,
      "prompt_tokens": 733,
      "total_tokens": 747
    },
    "model_name": "gpt-3.5-turbo",
    "system_fingerprint": "fp_3b956da36b"
  },
  "run": null
}
[chain/end] [1.89s] Exiting Chain run with output:
{
  "text": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
[chain/end] [1.89s] Exiting Chain run with output:
{
  "output_text": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
[chain/end] [2.36s] Exiting Chain run with output:
{
  "result": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
"""

# Final Output:
# {'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
#  'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'}

四、LLM辅助评估

虽然手动评估很有价值，但随着示例数量的增加，它可能会很快变得乏味和主观。这就是LLM辅助评估发挥作用的地方。

1.获取示例的预测

第一步是通过LLM应用程序运行您的示例并收集预测。

predictions = qa.batch(inputs=examples)

2.使用QAEvalChain进行评分

LangChain提供了QAEvalChain，这是一个基于LLM的链，旨在评估您的应用程序预测的正确性。该链使用LLM理解语义相似性的能力，确保即使预测与预期答案不完全匹配，也能准确评分。

from langchain.evaluation import QAEvalChain
llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions)

通过评分输出，您可以快速识别需要改进的领域，并对您的LLM应用程序进行迭代。

for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]["query"])
    print("Real Answer: " + predictions[i]["answer"])
    print("Predicted Answer: " + predictions[i]["result"])
    print("Predicted Grade: " + graded_outputs[i]["results"])
    print()

最终输出类似如下：

Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the approximate weight of the Women's Campside Oxfords per pair?
Real Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?
Real Answer: The small size of the Recycled Waterhog Dog Mat, Chevron Weave has dimensions of 18" x 28", while the medium size has dimensions of 22.5" x 34.5".
Predicted Answer: The dimensions of the small size of the Recycled Waterhog Dog Mat, Chevron Weave are 18" x 28", and the dimensions of the medium size are 22.5" x 34.5".
Predicted Grade: CORRECT

Example 4:
Question: What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece as described in the document?
Real Answer: Some key features of the swimsuit include bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, fully lined bottom, secure fit, and maximum coverage.
Predicted Answer: Some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece are:
- Bright colors, ruffles, and exclusive whimsical prints
- Four-way-stretch and chlorine-resistant fabric
- UPF 50+ rated fabric for high sun protection
- Crossover no-slip straps and fully lined bottom for a secure fit and coverage
- Machine washable and line dry for best results
Predicted Grade: CORRECT

Example 5:
Question: What is the fabric composition of the Refresh Swimwear, V-Neck Tankini Contrasts?
Real Answer: The body of the tankini top is made of 82% recycled nylon and 18% Lycra® spandex, while the lining is made of 90% recycled nylon and 10% Lycra® spandex.
Predicted Answer: The fabric composition of the Refresh Swimwear, V-Neck Tankini Contrasts is as follows:
- Body: 82% recycled nylon, 18% Lycra® spandex
- Lining: 90% recycled nylon, 10% Lycra® spandex
Predicted Grade: CORRECT

Example 6:
Question: What technology is featured in the EcoFlex 3L Storm Pants that makes them more breathable?
Real Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which offers the most breathability ever tested.
Predicted Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which is a state-of-the-art air-permeable technology that offers the most breathability tested by the brand.
Predicted Grade: CORRECT

graded_outputs[-1]
# {'results': 'CORRECT'}

小结

评估LLM应用程序是确保其可靠性和性能的关键步骤。通过利用类似LangChain的QAGenerateChain、langchain.debug、QAEvalChain和LangChain评估平台等工具，您可以简化评估过程，深入了解应用程序的行为，并更有效率地进行迭代。无论您是经验丰富的机器学习专业人员还是刚开始学习的人，这些框架和工具都可以帮助您发挥LLM应用程序的全部潜力。

小编是一名热爱人工智能的专栏作者，致力于分享人工智能领域的最新知识、技术和趋势。这里，你将能够了解到人工智能的最新应用和创新，探讨人工智能对未来社会的影响，以及探索人工智能背后的科学原理和技术实现。欢迎大家点赞，评论，收藏，让我们一起探索人工智能的奥秘，共同见证科技的进步！