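As large language models (LLMs) keep improving, the applications built on them are becoming more complex and sophisticated. That added complexity makes it harder to evaluate the performance and accuracy of LLM-based applications. In this blog post, we look at LLM application evaluation and at the frameworks and tools that can help you assess and improve your application's performance.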
I. Creating the QA Application
import os

from dotenv import load_dotenv, find_dotenv
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain.indexes import VectorstoreIndexCreator
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_openai import ChatOpenAI

# Load environment variables (such as the OpenAI API key) from a .env file.
_ = load_dotenv(find_dotenv())

# __file__ is not defined in a notebook, so the literal string "__file__" is
# resolved against the current working directory; its dirname is that directory.
notebook_path = os.path.abspath("__file__")
notebook_directory = os.path.dirname(notebook_path)
csv_file_path = os.path.join(notebook_directory, '..', 'OutdoorClothingCatalog_1000.csv')

# Load one Document per CSV row and index the rows in an in-memory vector store.
loader = CSVLoader(file_path=csv_file_path)
data = loader.load()
# Note: newer LangChain releases may require an explicit embedding= argument here.
index = VectorstoreIndexCreator(vectorstore_cls=DocArrayInMemorySearch).from_loaders(
    [loader]
)

llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)

# "stuff" places all retrieved documents into one prompt, joined by the separator.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={
        "document_separator": "<<<<>>>>>"},
)
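II. Building Test Data
Before we can evaluate an LLM application, we need a reliable set of test data. There are two main ways to produce it:
1. Creating examples manually
The traditional approach is to review your data by hand and write query-answer pairs. Suppose you are working with a clothing dataset. You can read through the descriptions and write questions such as "Do the Cozy Comfort Pullover Set have side pockets?" along with the corresponding answers. This gives you full control over the examples, but it is time-consuming and does not scale well to larger datasets.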
# Hard-coded examples
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes",
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection",
    },
]
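2. Generating examples with an LLM
You can also use an LLM to generate test data. LangChain provides QAGenerateChain, which automatically generates query-answer pairs from your documents. In effect, it acts as an assistant that drafts hypothetical questions and answers based on your data.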
from langchain.evaluation.qa import QAGenerateChain
from pprint import pprint
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))
new_examples = example_gen_chain.batch([{"doc": t} for t in data[:5]])
pprint(new_examples[0]["qa_pairs"])
# Output
# {'answer': "The approximate weight of the Women's Campside Oxfords per pair is "
# '1 lb. 1 oz.',
# 'query': "What is the approximate weight of the Women's Campside Oxfords per "
# 'pair?'}
data[0]
# Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.",
# metadata={'source': '/home/voldemort/Downloads/Code/Langchain_Harrison_Chase/Course_1/OutdoorClothingCatalog_1000.csv', 'row': 0})
By combining hand-written examples with LLM-generated ones, you can quickly build a solid test dataset.
examples.extend([inst["qa_pairs"] for inst in new_examples])
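LLM-generated pairs are not guaranteed to be well formed, so a light sanity check on the combined list can help. The sketch below is one possible way to do that in plain Python. `merge_examples` is a hypothetical helper (it is not part of LangChain), and it assumes every example is a dict with `query` and `answer` keys, as in the examples above.

```python
def merge_examples(manual, generated):
    """Combine two lists of QA dicts, skipping malformed or duplicate items.

    Hypothetical helper for illustration; not a LangChain API.
    """
    merged = []
    seen = set()
    for item in list(manual) + list(generated):
        query = str(item.get("query", "")).strip()
        answer = str(item.get("answer", "")).strip()
        if not query or not answer:
            continue  # skip items that lack a question or an answer
        key = " ".join(query.lower().split())  # case- and whitespace-insensitive key
        if key in seen:
            continue  # skip questions we have already kept
        seen.add(key)
        merged.append({"query": query, "answer": answer})
    return merged


manual = [
    {"query": "Do the Cozy Comfort Pullover Set have side pockets?", "answer": "Yes"},
]
generated = [
    # Same question with different casing and spacing: treated as a duplicate.
    {"query": "do the cozy comfort pullover set  have side pockets?", "answer": "Yes"},
    {
        "query": "What is the approximate weight of the Women's Campside Oxfords per pair?",
        "answer": "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.",
    },
    {"query": "", "answer": "malformed item with no question"},
]

merged = merge_examples(manual, generated)
print(len(merged))
```

A check like this only catches structural problems. Whether a generated answer is actually supported by the source document still needs a human or LLM review.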
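III. Manual Evaluation and Debugging
With test data in place, it is time to evaluate how your LLM application performs. The simplest approach is to run the examples through the application and inspect the final output.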
qa.invoke(examples[-1]["query"])
# Output
# Entering new RetrievalQA chain...
# Finished chain.
# {'query': 'What technology is used in the EcoFlex 3L Storm Pants to make them more breathable and keep the wearer dry and comfortable?',
# 'result': 'The EcoFlex 3L Storm Pants use TEK O2 technology to make them more breathable and keep the wearer dry and comfortable.'}
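However, this approach has limits: it gives no insight into the intermediate steps of the application's pipeline, or into where a problem might originate.
1. Running examples through the application
To look more closely at how your application behaves, LangChain provides the langchain.debug flag. When it is enabled, each step of the run prints detailed information, including prompts, context, and intermediate results.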
import langchain
langchain.debug = True
qa.invoke(examples[0]["query"])
By inspecting this output, you can spot potential problems in the retrieval or prompting steps and track them down more effectively.
"""
Output:
> Entering new RetrievalQA chain...
> Entering Chain run with input:
{
  "query": "Do the Cozy Comfort Pullover Set have side pockets?"
}
> Entering StuffDocumentsChain run with input:
[inputs]
> Entering LLMChain run with input:
{
  "question": "Do the Cozy Comfort Pullover Set have side pockets?",
  "context": ": 73\nname: Cozy Cuddles Knit Pullover Set\n...
}
[llm/start] Entering LLM run with input:
{
  "prompts": [
    "System: Use the following pieces of context to answer the user's question. \nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n: 73\nname: Cozy Cuddles Knit Pullover Set\n...
    Human: Do the Cozy Comfort Pullover Set have side pockets?"
  ]
}
[llm/end] [1.89s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "Yes, the Cozy Comfort Pullover Set does have side pockets.",
        ...
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "completion_tokens": 14,
      "prompt_tokens": 733,
      "total_tokens": 747
    },
    "model_name": "gpt-3.5-turbo",
    "system_fingerprint": "fp_3b956da36b"
  },
  "run": null
}
[chain/end] [1.89s] Exiting Chain run with output:
{
  "text": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
[chain/end] [1.89s] Exiting Chain run with output:
{
  "output_text": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
[chain/end] [2.36s] Exiting Chain run with output:
{
  "result": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
"""
# Final Output:
# {'query': 'Do the Cozy Comfort Pullover Set have side pockets?',
# 'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'}
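One concrete thing to check in a trace like this is which products the retriever actually returned. The sketch below is a rough illustration in plain Python: it splits a context string on the document separator configured earlier and lists the `name:` field of each retrieved document. `retrieved_names` and `names_matching` are hypothetical helpers, and `sample_context` is made up for illustration (the real context is truncated in the output above, so the sample does not reproduce it).

```python
DOCUMENT_SEPARATOR = "<<<<>>>>>"  # same separator passed via chain_type_kwargs


def retrieved_names(context, separator=DOCUMENT_SEPARATOR):
    """Return the value of the first 'name:' line in each retrieved document."""
    names = []
    for doc in context.split(separator):
        for line in doc.splitlines():
            if line.startswith("name:"):
                names.append(line[len("name:"):].strip())
                break  # only the first name line per document
    return names


def names_matching(names, phrase):
    """Return the retrieved names that contain the phrase (case-insensitive)."""
    needle = phrase.lower()
    return [name for name in names if needle in name.lower()]


# Hypothetical context: the first name is the one visible in the trace above,
# the second document is a placeholder invented for this sketch.
sample_context = (
    ": 73\nname: Cozy Cuddles Knit Pullover Set\ndescription: (elided)"
    + DOCUMENT_SEPARATOR
    + ": 0\nname: Example Product B\ndescription: (elided)"
)

names = retrieved_names(sample_context)
print(names)
print(names_matching(names, "Cozy Comfort Pullover Set"))
```

If the product named in the question does not appear among the retrieved names, the answer may not be grounded in the right document, which points to the retrieval step rather than the prompt.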
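IV. LLM-Assisted Evaluation
Manual evaluation is valuable, but as the number of examples grows it quickly becomes tedious and subjective. This is where LLM-assisted evaluation comes in.
1. Getting predictions for the examples
The first step is to run your examples through the LLM application and collect the predictions.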
predictions = qa.batch(inputs=examples)
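2. Grading with QAEvalChain
LangChain provides QAEvalChain, an LLM-based chain designed to assess the correctness of your application's predictions. It relies on the LLM's ability to judge semantic similarity, so a prediction can be graded as correct even when its wording does not exactly match the expected answer.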
from langchain.evaluation import QAEvalChain
llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions)
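To see why a semantic grader is useful here, compare it with a naive string-match baseline. The sketch below is not how QAEvalChain works; `normalize` and `exact_match` are hypothetical helpers that only show where literal comparison breaks down on the first hand-written example.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def exact_match(expected, predicted):
    """Naive baseline: correct only if the normalized strings are identical."""
    return normalize(expected) == normalize(predicted)


# The prediction agrees with the expected answer but adds extra wording,
# so the literal comparison rejects it.
print(exact_match("Yes", "Yes, the Cozy Comfort Pullover Set does have side pockets."))  # False

# Differences in case and trailing punctuation alone are tolerated.
print(exact_match("The DownTek collection", "the downtek collection."))  # True
```

A full-sentence answer that agrees with a one-word reference fails the baseline, which is the kind of case an LLM-based grader is meant to handle.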
With the graded outputs, you can quickly identify areas that need improvement and iterate on your LLM application.
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]["query"])
    print("Real Answer: " + predictions[i]["answer"])
    print("Predicted Answer: " + predictions[i]["result"])
    print("Predicted Grade: " + graded_outputs[i]["results"])
    print()
The final output looks similar to this:
Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.
Predicted Grade: CORRECT
Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT
Example 2:
Question: What is the approximate weight of the Women's Campside Oxfords per pair?
Real Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Grade: CORRECT
Example 3:
Question: What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?
Real Answer: The small size of the Recycled Waterhog Dog Mat, Chevron Weave has dimensions of 18" x 28", while the medium size has dimensions of 22.5" x 34.5".
Predicted Answer: The dimensions of the small size of the Recycled Waterhog Dog Mat, Chevron Weave are 18" x 28", and the dimensions of the medium size are 22.5" x 34.5".
Predicted Grade: CORRECT
Example 4:
Question: What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece as described in the document?
Real Answer: Some key features of the swimsuit include bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, fully lined bottom, secure fit, and maximum coverage.
Predicted Answer: Some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece are:
- Bright colors, ruffles, and exclusive whimsical prints
- Four-way-stretch and chlorine-resistant fabric
- UPF 50+ rated fabric for high sun protection
- Crossover no-slip straps and fully lined bottom for a secure fit and coverage
- Machine washable and line dry for best results
Predicted Grade: CORRECT
Example 5:
Question: What is the fabric composition of the Refresh Swimwear, V-Neck Tankini Contrasts?
Real Answer: The body of the tankini top is made of 82% recycled nylon and 18% Lycra® spandex, while the lining is made of 90% recycled nylon and 10% Lycra® spandex.
Predicted Answer: The fabric composition of the Refresh Swimwear, V-Neck Tankini Contrasts is as follows:
- Body: 82% recycled nylon, 18% Lycra® spandex
- Lining: 90% recycled nylon, 10% Lycra® spandex
Predicted Grade: CORRECT
Example 6:
Question: What technology is featured in the EcoFlex 3L Storm Pants that makes them more breathable?
Real Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which offers the most breathability ever tested.
Predicted Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which is a state-of-the-art air-permeable technology that offers the most breathability tested by the brand.
Predicted Grade: CORRECT
graded_outputs[-1]
# {'results': 'CORRECT'}
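Once the list of examples grows, reading every grade by hand stops being practical, so it can help to reduce the graded outputs to a few numbers. The sketch below assumes each graded item is a dict whose `results` key holds exactly `CORRECT` or `INCORRECT`, as in the output above. `summarize_grades` is a hypothetical helper, and `sample_grades` is made-up data used only to exercise it.

```python
from collections import Counter


def summarize_grades(graded_outputs, key="results"):
    """Count grades, compute accuracy, and list the indexes that need review."""
    grades = [str(item[key]).strip().upper() for item in graded_outputs]
    counts = Counter(grades)
    failing = [i for i, grade in enumerate(grades) if grade != "CORRECT"]
    accuracy = counts["CORRECT"] / len(grades) if grades else 0.0
    return {"counts": dict(counts), "accuracy": accuracy, "failing": failing}


# Made-up grades for illustration; in practice, pass graded_outputs instead.
sample_grades = [
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
    {"results": "CORRECT"},
    {"results": "CORRECT"},
]

summary = summarize_grades(sample_grades)
print(summary["accuracy"])  # 0.75
print(summary["failing"])   # [1]
```

The indexes in `failing` point back into `examples` and `predictions`, so those cases can be rerun with debugging enabled. The output key and the exact grade text may differ between LangChain versions, so adjust `key` and the comparison accordingly.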
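Summary
Evaluating an LLM application is a key step in ensuring its reliability and performance. With tools such as LangChain's QAGenerateChain, langchain.debug, QAEvalChain, and the LangChain evaluation platform, you can streamline the evaluation process, gain insight into your application's behavior, and iterate more efficiently. Whether you are an experienced machine learning practitioner or just getting started, these frameworks and tools can help you get the most out of your LLM applications.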