Image captioning
Image captioning is the task of predicting a caption for a given image. A common real-world application is to aid visually impaired people in navigating different situations, for instance, by exploring image content online.
To illustrate the task, get an image to be captioned, e.g.:
Photo by Hendo Wang.
IDEFICS accepts both text and image prompts. However, to caption an image, you do not have to provide a text prompt to the model, only the preprocessed input image. Without a text prompt, the model will start generating text from the BOS (beginning-of-sequence) token, thus creating a caption.
As image input to the model, you can use either an image object (PIL.Image) or a URL from which the image can be retrieved.
>>> prompt = [
...     "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed
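If you prefer to pass the image as an object rather than a URL, here is a minimal sketch of the PIL.Image variant (assuming the requests and PIL libraries are installed; prompt[0] is the URL from the snippet above):

>>> import requests
>>> from PIL import Image

>>> # fetch the same image and wrap it as a PIL.Image object
>>> image = Image.open(requests.get(prompt[0], stream=True).raw)
>>> inputs = processor([image], return_tensors="pt").to("cuda")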
When calling generate, it is a good idea to include bad_words_ids to avoid errors that arise when increasing max_new_tokens: the model will want to generate a new <image> or <fake_token_around_image> token when there is no image being generated by the model. You can set it on the fly as in this guide, or store it in a GenerationConfig as described in the Text generation strategies guide.
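For example, a minimal sketch of the GenerationConfig approach (reusing bad_words_ids from the snippet above):

>>> from transformers import GenerationConfig

>>> # store the generation settings once and reuse them across generate() calls
>>> generation_config = GenerationConfig(max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_ids = model.generate(**inputs, generation_config=generation_config)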
Prompted image captioning
You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take another image to illustrate:
Photo by Denys Nevozhai.
Textual and image prompts can be passed to the model's processor as a single list to create the appropriate inputs.
>>> prompt = [
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "This is an image of ",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.
Few-shot prompting
While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with other restrictions or requirements that increase the task's complexity. Few-shot prompting can be used to enable in-context learning. By providing examples in the prompt, you can steer the model to generate results that mimic the format of the given examples.
Let's use the previous image of the Eiffel Tower as an example for the model, and build a prompt that shows the model that in addition to learning what the object in an image is, we would also like to get some interesting information about it. Then, let's see if we can get the same response format for an image of the Statue of Liberty:
Photo by Juan Mayobre.
>>> prompt = ["User:",
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
...     "User:",
...     "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
...     "Describe this image.\nAssistant:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.
Notice that just from a single example (i.e. 1-shot) the model has learned how to perform the task. For more complex tasks, feel free to experiment with a larger number of examples (e.g. 3-shot, 5-shot, etc.).
Visual question answering
Visual question answering (VQA) is the task of answering open-ended questions based on an image. Similar to image captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer service (questions about products based on images), and image retrieval.
Let's get a new image for this task:
Photo by Jarritos Mexican Soda.
You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:
>>> prompt = [
...     "Instruction: Provide an answer to the question. Use the image to answer.\n",
...     "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Question: Where are these people and what's the weather like? Answer:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.
Image classification
IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing labeled examples from those specific categories. Given a list of categories and using its image and text understanding capabilities, the model can infer which category the image likely belongs to.
Say we have this image of a vegetable stand:
Photo by Peter Wendt.
We can instruct the model to classify the image into one of the categories that we have:
>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Category: "
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables
In the example above we instructed the model to classify the image into a single category; however, you can also prompt the model to do rank classification, as sketched below.
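A minimal sketch of what such a ranking prompt could look like (the instruction wording and max_new_tokens value are illustrative assumptions, not from the original guide; categories and bad_words_ids come from the snippet above):

>>> prompt = [f"Instruction: Rank the categories in the following list from most to least likely for the image: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Ranking: "
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])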
Image-guided text generation
For more creative applications, you can use image-guided text generation to generate text based on an image. This can be useful to create descriptions of products, ads, descriptions of a scene, etc.
Let's prompt IDEFICS to write a story based on a simple image of a red door:
Photo by Craig Tidball.
>>> prompt = ["Instruction: Use the image to write a story. \n",
...     "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
...     "Story: \n"]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Use the image to write a story.
Story:
Once upon a time, there was a little girl who lived in a house with a red door. She loved her red door. It was the prettiest door in the whole world. One day, the little girl was playing in her yard when she noticed a man standing on her doorstep. He was wearing a long black coat and a top hat. The little girl ran inside and told her mother about the man. Her mother said, “Don’t worry, honey. He’s just a friendly ghost.” The little girl wasn’t sure if she believed her mother, but she went outside anyway. When she got to the door, the man was gone. The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep. He was wearing a long black coat and a top hat. The little girl ran
Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.
For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help you significantly improve the quality of the generated output. Check out Text generation strategies to learn more.
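For instance, a minimal sketch of one such tweak, switching from beam search to sampling (the parameter values here are assumptions to experiment with, not recommendations from this guide):

>>> generated_ids = model.generate(
...     **inputs,
...     do_sample=True,          # sample instead of beam search
...     temperature=0.7,         # soften the token distribution
...     repetition_penalty=1.2,  # discourage the looping seen in the story above
...     max_new_tokens=200,
...     bad_words_ids=bad_words_ids,
... )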
Running inference in batch mode
All of the earlier sections illustrated IDEFICS on a single example. In a very similar fashion, you can run inference for a batch of examples by passing a list of prompts:
>>> prompts = [
...     [   "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
... ]

>>> inputs = processor(prompts, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i,t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.
IDEFICS instruct for conversational use
For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: HuggingFaceM4/idefics-80b-instruct and HuggingFaceM4/idefics-9b-instruct.
These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts downstream performance while making the models more usable in conversational settings.
Usage and prompting for conversational use is very similar to using the base models:
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> device = "cuda" if torch.cuda.is_available() else "cpu"

>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> prompts = [
...     [
...         "User: What is in this image?",
...         "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
...         "<end_of_utterance>",
...         "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
...         "\nUser:",
...         "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
...         "And who is that?<end_of_utterance>",
...         "\nAssistant:",
...     ],
... ]

>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(device)

>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")
LLM prompting guide
Original documentation: huggingface.co/docs/transformers/v4.37.2/en/tasks/prompting
Large language models such as Falcon, LLaMA, etc. are pretrained transformer models initially trained to predict the next token given some input text. They typically have billions of parameters and have been trained on trillions of tokens over long periods of time. As a result, these models become quite powerful and versatile, and you can solve multiple NLP tasks out of the box by instructing the models with natural language prompts.
Designing such prompts to ensure the optimal output is often called "prompt engineering". Prompt engineering is an iterative process that requires a fair amount of experimentation. Natural languages are much more flexible and expressive than programming languages, but they can also introduce some ambiguity. At the same time, prompts in natural language are quite sensitive to changes: even minor modifications in a prompt can lead to wildly different outputs.
While there is no exact recipe for creating prompts to match all cases, researchers have worked out a number of best practices that help to achieve optimal results more consistently.
This guide covers the prompt engineering best practices to help you craft better LLM prompts and solve various NLP tasks. You'll learn:
- Basics of prompting
- Best practices of LLM prompting
- Advanced prompting techniques: few-shot prompting and chain-of-thought
- When to fine-tune instead of prompting
Prompt engineering is only a part of the LLM output optimization process. Another essential component is choosing the optimal text generation strategy. You can customize how your LLM selects each of the subsequent tokens when generating text, without modifying any of the trainable parameters. By tweaking the text generation parameters, you can reduce repetition in the generated text and make it more coherent and human-sounding. Text generation strategies and parameters are out of the scope of this guide, but you can learn more about them in the following guides (a short illustrative sketch follows the list):
- Generation with LLMs
- Text generation strategies
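As a quick taste of what those guides cover, here is a minimal sketch of tweaking sampling parameters through the pipeline API (the checkpoint choice and parameter values are illustrative assumptions):

>>> from transformers import pipeline

>>> generator = pipeline("text-generation", model="tiiuae/falcon-7b-instruct")  # hypothetical checkpoint choice
>>> outputs = generator(
...     "Write a one-sentence summary of prompt engineering.",
...     do_sample=True,      # enable sampling for more varied text
...     temperature=0.8,     # higher values increase randomness
...     top_p=0.95,          # nucleus sampling cutoff
...     max_new_tokens=40,
... )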