Introduction
On the afternoon of July 4, at the Frontiers of Science forum of the World Artificial Intelligence Conference, OpenGVLab of the Shanghai AI Laboratory released InternVL 2.0 (Chinese name: 书生·万象). It is the first model from China to score above 60 on MMMU (multi-discipline question answering), setting a new performance bar for open-source multimodal large models. On the math benchmark MathVista it scores 66.3%, significantly higher than other closed-source commercial models and open-source models. It also achieves state-of-the-art (SOTA) results on the general chart benchmark ChartQA, the document benchmark DocVQA, the infographic benchmark InfographicVQA, and the general visual question answering benchmark MMBench (v1.1). On the science diagram benchmark AI2D it leads other strong open-source models by a wide margin and is on par with closed-source commercial models.
Open-source code:
https://github.com/OpenGVLab/InternVL
Deployment and fine-tuning code on the ModelScope community:
https://modelscope.cn/models/OpenGVLab/InternVL2-8B
书生 (InternVL) took root in vision and has advanced step by step, now evolving into the 书生·万象 multimodal large model. 万象 ("all things") expresses the vision for multimodal large models: understanding everything in the real world and achieving general intelligence across all modalities and tasks. The model covers five modalities: images, video, text, speech, and 3D point clouds. Through progressive alignment training it builds a vision foundation model aligned with large language models; with a progressive training strategy that scales the model "from small to large" and refines the data "from coarse to fine", the large model was trained at roughly 1/5 of the usual cost. It delivers strong performance under limited resources and is the first model in China to score above 60 on MMMU (multi-discipline question answering). It also performs well on math, chart analysis, OCR, and related tasks, and shows strong capability in handling complex multimodal tasks and real-world perception.
Generalist-specialist fusion: 万象 understands all things
书生·万象 has parameters on the hundred-billion scale and supports images, video, text, speech, and 3D point clouds. To support rich output formats, it uses a vector-linking technique that connects task-specific decoders while keeping the gradient path intact, fusing generalist and specialist capabilities and supporting over a hundred fine-grained tasks such as detection, segmentation, image generation, and visual question answering, with performance comparable to dedicated expert models. To train the model, the team built OmniCorpus, the largest image-text interleaved dataset, from diverse sources; it contains roughly 16 billion images and 3 trillion text tokens, about three times the images and ten times the text of existing open-source image-text datasets.
Innovative progressive alignment training: a vision foundation model aligned with large language models
The traditional pre-training paradigm trains a large model on large data in a single pass, which requires massive compute. To improve training efficiency, the research team adopted an innovative progressive training strategy: first pre-train a small model efficiently on massive noisy data, then align a large model on a smaller amount of high-quality curated data. With the model going "from small to large" and the data "from coarse to fine", comparable results are achieved with only about 20% of the compute. This strategy yields a vision foundation model aligned with large language models, and the resulting multimodal large model performs strongly, matching closed-source commercial models such as GPT-4o and Gemini 1.5 Pro on benchmarks including MathVista (math), AI2D (science diagrams), MMBench (general visual question answering), and MM-NIAH (multimodal long documents).
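To make the two-stage schedule concrete, here is a purely illustrative Python sketch with dummy tensors and tiny stand-in modules. None of the module names, sizes, or the placeholder MSE loss come from the InternVL codebase; the sketch only mirrors the order of operations: align a shared vision encoder with a small language model on a large noisy corpus first, then re-align the same encoder with a larger language model on a small curated corpus.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins (not InternVL code): one shared vision encoder and two
# "language model" heads of different sizes, represented here by tiny MLPs.
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
small_llm_head = nn.Linear(256, 64)  # stage 1: small, cheap language model
large_llm_head = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 64))  # stage 2: large model

def align(encoder, llm_head, dataset, lr):
    # Placeholder alignment objective (MSE against precomputed "text" features),
    # used only to make the two-stage schedule runnable end to end.
    params = list(encoder.parameters()) + list(llm_head.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    for images, text_feats in DataLoader(dataset, batch_size=8, shuffle=True):
        loss = nn.functional.mse_loss(llm_head(encoder(images)), text_feats)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Stage 1: the vision encoder is pre-aligned with a small model on a large noisy corpus.
noisy_corpus = TensorDataset(torch.randn(256, 3, 32, 32), torch.randn(256, 64))
align(vision_encoder, small_llm_head, noisy_corpus, lr=1e-3)

# Stage 2: the same encoder is re-aligned with a larger model on a small curated corpus.
curated_corpus = TensorDataset(torch.randn(32, 3, 32, 32), torch.randn(32, 64))
align(vision_encoder, large_llm_head, curated_corpus, lr=1e-4)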
Model inference
Inference code with transformers:
import numpy as np  # needed for the video example below
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu  # needed for the video example below
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from modelscope import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1)
        for i in range(1, n + 1) for j in range(1, n + 1)
        if i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

path = 'OpenGVLab/InternVL2-4B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()

generation_config = dict(
    num_beams=1,
    max_new_tokens=1024,
    do_sample=False,
)

# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}')
print(f'Assistant: {response}')

# single-image multi-round conversation (单图多轮对话)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# multi-image multi-round conversation (多图多轮对话)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}')
    print(f'Assistant: {response}')

# video multi-round conversation (视频多轮对话)
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
# pixel_values, num_patches_list = load_video(video_path, num_segments=32, max_num=1)
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=2)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = '\n'.join([f'Frame{i+1}:<image>' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1:<image>\nFrame2:<image>\n...\nFrame31:<image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
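The script above loads the 4B checkpoint onto a single GPU. For the larger InternVL2 checkpoints (e.g., InternVL2-26B), one common option is to let accelerate shard the weights across all visible GPUs via device_map='auto'. This is a hedged sketch relying on standard transformers/accelerate behaviour (and on the modelscope wrapper forwarding this argument), not an official InternVL recipe:

import torch
from modelscope import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL2-26B'  # larger checkpoint chosen for illustration
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto').eval()  # accelerate places layers across the visible GPUs
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

The rest of the chat calls are unchanged; only the explicit .cuda() on the model is dropped, since placement is handled by the device map.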
Model fine-tuning
Fine-tune the InternVL 2.0 models with ms-swift.
Open-source code:
https://github.com/modelscope/swift
Environment setup
# Set the global pip mirror (to speed up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
ms-swift already supports the InternVL2 series, including InternVL2-2B, InternVL2-4B, InternVL2-8B, and InternVL2-26B. Here we take the InternVL2-2B model and the COCO image caption dataset as an example to train the model's image captioning ability.
The dataset is available at:
https://www.modelscope.cn/datasets/modelscope/coco_2014_caption
The LoRA fine-tuning script is shown below. It applies LoRA only to the qkv projection matrices of the language model; to fine-tune all linear layers, specify --lora_target_modules ALL.
# Experimental environment: 4090
# 10GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type internvl2-2b \
    --dataset refcoco-unofficial-grounding#4000 \
    --gradient_checkpointing true
To use a custom dataset, simply specify it as follows:
--dataset train.jsonl \
Custom datasets support both json and jsonl formats. Multi-turn conversations are supported, but each conversation as a whole must contain one image, passed either as a local path or as a URL. Examples of the custom dataset format:
{"query": "55555", "response": "66666", "images": ["image_path"]} {"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]} {"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path"]}
Inference with the fine-tuned model:
Command-line inference
# Experimental environment: 4090
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/internvl2-2b/vx-xxx/checkpoint-xxx \
    --load_dataset_config true
You can also merge the LoRA weights (merge-lora) and then run inference:
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir output/internvl2-2b/vx-xxx/checkpoint-xxx \
    --merge_lora true

CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/internvl2-2b/vx-xxx/checkpoint-xxx-merged \
    --load_dataset_config true
Training loss visualization:
Resource usage:
Validation set inference examples
[PROMPT]<|im_start|>system
你是由上海人工智能实验室开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。<|im_end|><|im_start|>user
<img>[92546 * 1792]</img>
please describe the image.<|im_end|><|im_start|>assistant
[OUTPUT]A beach with people flying kites and sitting on the sand.<|im_end|>
[LABELS]A crowd of people on a beach flying kites.
[IMAGES]['https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/coco/2014/val2014/COCO_val2014_000000218119.jpg']
--------------------------------------------------
[PROMPT]<|im_start|>system
你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。<|im_end|><|im_start|>user
<img>[92546 * 1792]</img>
please describe the image.<|im_end|><|im_start|>assistant
[OUTPUT]A large airplane parked at an airport gate.<|im_end|>
[LABELS]An airplane sitting in front of a group of people.
[IMAGES]['https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/coco/2014/val2014/COCO_val2014_000000512334.jpg']
The corresponding images:
Model deployment
Step 1: With lmdeploy v0.5.0, first set the chat template by creating the following JSON file, chat_template.json:
{ "model_name":"internlm2", "meta_instruction":"你是由上海人工智能实验室开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。", "stop_words":["<|im_start|>", "<|im_end|>"] }
Step 2: Deploy the InternVL API service with lmdeploy:
lmdeploy serve api_server /mnt/workspace/InternVL2-2B --model-name InternVL2-2B --server-port 23333 --chat-template chat_template.json
Step 3: To call the service through the OpenAI-style interface, install the openai package first:
pip install openai
Calling the API:
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model="InternVL2-2B",
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': '描述这幅画',
        }, {
            'type': 'image_url',
            'image_url': {
                'url': 'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)
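Since the server follows the OpenAI chat completions protocol, streaming generally works in the usual way as well (assuming the deployed lmdeploy server supports it). A short sketch with the same hypothetical API key and endpoint as above:

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
stream = client.chat.completions.create(
    model='InternVL2-2B',
    messages=[{'role': 'user', 'content': 'Describe yourself in one sentence.'}],
    temperature=0.8,
    stream=True)  # receive the answer incrementally
for chunk in stream:
    # Each chunk carries part of the answer in choices[0].delta.content.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end='', flush=True)
print()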
Click the link 👇 to go directly to the original page:
https://www.modelscope.cn/models?name=InternVL%202.0&page=1?from=alizishequ__text