Introduction
On the afternoon of July 4, at the Frontiers of Science forum of the World Artificial Intelligence Conference, OpenGVLab of the Shanghai AI Laboratory released InternVL 2.0 (Chinese name: 书生·万象). It is the first model from China to score above 60 on MMMU (multi-discipline question answering), setting a new performance bar for open-source multimodal large models. On the math benchmark MathVista it scores 66.3%, significantly higher than other closed-source commercial models and open-source models. It also achieves state-of-the-art (SOTA) results on the general chart benchmark ChartQA, the document benchmark DocVQA, the infographic benchmark InfographicVQA, and the general visual question answering benchmark MMBench (v1.1). On the science diagram benchmark AI2D it leads other strong open-source models by a wide margin and is on par with closed-source commercial models.
Open-source code:
https://github.com/OpenGVLab/InternVL
Deployment and fine-tuning code on the ModelScope community:
https://modelscope.cn/models/OpenGVLab/InternVL2-8B
书生 (InternVL) took root in vision and has advanced step by step, now evolving into the 书生·万象 multimodal large model. 万象 ("all things") expresses the vision for multimodal large models: understanding everything in the real world and achieving general intelligence across all modalities and tasks. The model covers five modalities: images, video, text, speech, and 3D point clouds. Through progressive alignment training it builds a vision foundation model aligned with large language models; with a progressive training strategy that scales the model "from small to large" and refines the data "from coarse to fine", the large model was trained at roughly 1/5 of the usual cost. It delivers strong performance under limited resources and is the first model in China to score above 60 on MMMU (multi-discipline question answering). It also performs well on math, chart analysis, OCR, and related tasks, and shows strong capability in handling complex multimodal tasks and real-world perception.
Generalist-specialist fusion: 万象 understands all things
书生·万象 has parameters on the hundred-billion scale and supports images, video, text, speech, and 3D point clouds. To support rich output formats, it uses a vector-linking technique that connects task-specific decoders while keeping the gradient path intact, fusing generalist and specialist capabilities and supporting over a hundred fine-grained tasks such as detection, segmentation, image generation, and visual question answering, with performance comparable to dedicated expert models. To train the model, the team built OmniCorpus, the largest image-text interleaved dataset, from diverse sources; it contains roughly 16 billion images and 3 trillion text tokens, about three times the images and ten times the text of existing open-source image-text datasets.
Innovative progressive alignment training: a vision foundation model aligned with large language models
The traditional pre-training paradigm trains a large model on large data in a single pass, which requires massive compute. To improve training efficiency, the research team adopted an innovative progressive training strategy: first pre-train a small model efficiently on massive noisy data, then align a large model on a smaller amount of high-quality curated data. With the model going "from small to large" and the data "from coarse to fine", comparable results are achieved with only about 20% of the compute. This strategy yields a vision foundation model aligned with large language models, and the resulting multimodal large model performs strongly, matching closed-source commercial models such as GPT-4o and Gemini 1.5 Pro on benchmarks including MathVista (math), AI2D (science diagrams), MMBench (general visual question answering), and MM-NIAH (multimodal long documents).
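To make the two-stage schedule concrete, here is a purely illustrative Python sketch with dummy tensors and tiny stand-in modules. None of the module names, sizes, or the placeholder MSE loss come from the InternVL codebase; the sketch only mirrors the order of operations: align a shared vision encoder with a small language model on a large noisy corpus first, then re-align the same encoder with a larger language model on a small curated corpus.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins (not InternVL code): one shared vision encoder and two
# "language model" heads of different sizes, represented here by tiny MLPs.
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
small_llm_head = nn.Linear(256, 64)  # stage 1: small, cheap language model
large_llm_head = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 64))  # stage 2: large model

def align(encoder, llm_head, dataset, lr):
    # Placeholder alignment objective (MSE against precomputed "text" features),
    # used only to make the two-stage schedule runnable end to end.
    params = list(encoder.parameters()) + list(llm_head.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    for images, text_feats in DataLoader(dataset, batch_size=8, shuffle=True):
        loss = nn.functional.mse_loss(llm_head(encoder(images)), text_feats)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Stage 1: the vision encoder is pre-aligned with a small model on a large noisy corpus.
noisy_corpus = TensorDataset(torch.randn(256, 3, 32, 32), torch.randn(256, 64))
align(vision_encoder, small_llm_head, noisy_corpus, lr=1e-3)

# Stage 2: the same encoder is re-aligned with a larger model on a small curated corpus.
curated_corpus = TensorDataset(torch.randn(32, 3, 32, 32), torch.randn(32, 64))
align(vision_encoder, large_llm_head, curated_corpus, lr=1e-4)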
Model inference
Inference code with transformers:
import numpy as np  # needed for the video example below
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu  # needed for the video example below
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from modelscope import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1)
        for i in range(1, n + 1) for j in range(1, n + 1)
        if i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

path = 'OpenGVLab/InternVL2-4B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()

generation_config = dict(
    num_beams=1,
    max_new_tokens=1024,
    do_sample=False,
)

# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}')
print(f'Assistant: {response}')

# single-image multi-round conversation (单图多轮对话)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# multi-image multi-round conversation (多图多轮对话)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}')
    print(f'Assistant: {response}')

# video multi-round conversation (视频多轮对话)
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
# pixel_values, num_patches_list = load_video(video_path, num_segments=32, max_num=1)
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=2)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = '\n'.join([f'Frame{i+1}:<image>' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1:<image>\nFrame2:<image>\n...\nFrame31:<image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
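The script above loads the 4B checkpoint onto a single GPU. For the larger InternVL2 checkpoints (e.g., InternVL2-26B), one common option is to let accelerate shard the weights across all visible GPUs via device_map='auto'. This is a hedged sketch relying on standard transformers/accelerate behaviour (and on the modelscope wrapper forwarding this argument), not an official InternVL recipe:

import torch
from modelscope import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL2-26B'  # larger checkpoint chosen for illustration
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto').eval()  # accelerate places layers across the visible GPUs
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

The rest of the chat calls are unchanged; only the explicit .cuda() on the model is dropped, since placement is handled by the device map.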
Model fine-tuning
Fine-tune the InternVL 2.0 models with ms-swift.
Open-source code:
https://github.com/modelscope/swift
Environment setup
# Set the global pip mirror (to speed up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
ms-swift already supports the InternVL2 series, including InternVL2-2B, InternVL2-4B, InternVL2-8B, and InternVL2-26B. Here we take the InternVL2-2B model and the COCO image caption dataset as an example to train the model's image captioning ability.
The dataset is available at:
https://www.modelscope.cn/datasets/modelscope/coco_2014_caption
The LoRA fine-tuning script is shown below. It applies LoRA only to the qkv projection matrices of the language model; to fine-tune all linear layers, specify --lora_target_modules ALL.
# Experimental environment: 4090
# 10GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type internvl2-2b \
    --dataset refcoco-unofficial-grounding#4000 \
    --gradient_checkpointing true
To use a custom dataset, simply specify it as follows:
--dataset train.jsonl \
Custom datasets support both json and jsonl formats. Multi-turn conversations are supported, but each conversation as a whole must contain one image, passed either as a local path or as a URL. Examples of the custom dataset format:
{"query": "55555", "response": "66666", "images": ["image_path"]} {"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]} {"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path"]}
Inference with the fine-tuned model:
Command-line inference
# Experimental environment: 4090
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/internvl2-2b/vx-xxx/checkpoint-xxx \
    --load_dataset_config true
You can also merge the LoRA weights (merge-lora) and then run inference:
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir output/internvl2-2b/vx-xxx/checkpoint-xxx \
    --merge_lora true

CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/internvl2-2b/vx-xxx/checkpoint-xxx-merged \
    --load_dataset_config true
Training loss visualization:
Resource usage:
Validation set inference examples
[PROMPT]<|im_start|>system
你是由上海人工智能实验室开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。<|im_end|><|im_start|>user
<img>[92546 * 1792]</img>
please describe the image.<|im_end|><|im_start|>assistant
[OUTPUT]A beach with people flying kites and sitting on the sand.<|im_end|>
[LABELS]A crowd of people on a beach flying kites.
[IMAGES]['https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/coco/2014/val2014/COCO_val2014_000000218119.jpg']
--------------------------------------------------
[PROMPT]<|im_start|>system
你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。<|im_end|><|im_start|>user
<img>[92546 * 1792]</img>
please describe the image.<|im_end|><|im_start|>assistant
[OUTPUT]A large airplane parked at an airport gate.<|im_end|>
[LABELS]An airplane sitting in front of a group of people.
[IMAGES]['https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/coco/2014/val2014/COCO_val2014_000000512334.jpg']
The corresponding images:
Model deployment
Step 1: With lmdeploy v0.5.0, first set the chat template by creating the following JSON file, chat_template.json:
{ "model_name":"internlm2", "meta_instruction":"你是由上海人工智能实验室开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。", "stop_words":["<|im_start|>", "<|im_end|>"] }
Step 2: Deploy the InternVL API service with lmdeploy:
lmdeploy serve api_server /mnt/workspace/InternVL2-2B --model-name InternVL2-2B --server-port 23333 --chat-template chat_template.json
Step 3: To call the service through the OpenAI-style interface, install the openai package first:
pip install openai
Calling the API:
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model="InternVL2-2B",
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': '描述这幅画',
        }, {
            'type': 'image_url',
            'image_url': {
                'url': 'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)
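Since the server follows the OpenAI chat completions protocol, streaming generally works in the usual way as well (assuming the deployed lmdeploy server supports it). A short sketch with the same hypothetical API key and endpoint as above:

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
stream = client.chat.completions.create(
    model='InternVL2-2B',
    messages=[{'role': 'user', 'content': 'Describe yourself in one sentence.'}],
    temperature=0.8,
    stream=True)  # receive the answer incrementally
for chunk in stream:
    # Each chunk carries part of the answer in choices[0].delta.content.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end='', flush=True)
print()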
Click the link 👇 to go directly to the original page:
https://www.modelscope.cn/models?name=InternVL%202.0&page=1?from=alizishequ__text