The Visual Agent Is Here! Zhipu AI Open-Sources CogAgent with GUI Question-Answering Support (ModelScope Inference and Fine-Tuning Best Practices Included)

Summary: Zhipu AI recently open-sourced CogAgent, its latest work in the VLM field.

Introduction

Zhipu AI recently open-sourced CogAgent, its latest work in the VLM (visual language model) field.

CogAgent is an improved model built on CogVLM: an 18-billion-parameter visual language model that excels at GUI understanding and navigation. CogAgent-18B comprises 11 billion visual parameters and 7 billion language parameters.

CogAgent-18B achieves SOTA generalist performance on 9 cross-modal benchmarks, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It also achieves SOTA performance on GUI operation datasets such as AITW and Mind2Web.

In addition to all the capabilities CogVLM already offers (visual multi-turn dialogue, visual grounding), CogAgent also:

  1. Supports higher-resolution visual input and dialogue-based QA, accepting ultra-high-resolution image inputs of 1120×1120;
  2. Has visual Agent capabilities: given any GUI screenshot and a user-specified task, CogAgent returns a plan, the next action, and a concrete operation with coordinates;
  3. Improves GUI-related QA: it can answer questions about any GUI screenshot, such as web pages, PPT slides, and mobile apps, and can even narrate the Genshin Impact interface;
  4. Delivers greatly improved performance on OCR-related tasks through pre-training and fine-tuning.


Model Experience

This release of CogAgent adds dedicated support for GUI question answering, which can also play a big role in scenarios such as RPA going forward. In this walkthrough, we upload a desktop screenshot to put CogAgent's capabilities to the test.

Agent tasks on a GUI (screenshot):

The Agent prompt templates are as follows:

en_template_task = [
    "Can you advise me on how to <TASK>?",
    "I'm looking for guidance on how to <TASK>.",
    "What steps do I need to take to <TASK>?",
    "Could you provide instructions for <TASK>?",
    "I'm wondering what the process is for <TASK>.",
    "How can I go about <TASK>?",
    "I need assistance with planning to <TASK>.",
    "Do you have any recommendations for <TASK>?",
    "Please share some tips for <TASK>.",
    "I'd like to know the best way to <TASK>.",
    "What's the most effective way to <TASK>?",
    "I'm seeking advice on accomplishing <TASK>.",
    "Could you guide me through the steps to <TASK>?",
    "I'm unsure how to start with <TASK>.",
    "Is there a strategy for successfully <TASK>?",
    "What's the proper procedure for <TASK>?",
    "How should I prepare for <TASK>?",
    "I'm not sure where to begin with <TASK>.",
    "I need some insights on <TASK>.",
    "Can you explain how to tackle <TASK>?",
    "I'm interested in the process of <TASK>.",
    "Could you enlighten me on <TASK>?",
    "What are the recommended steps for <TASK>?",
    "Is there a preferred method for <TASK>?",
    "I'd appreciate your advice on <TASK>.",
    "Can you shed light on <TASK>?",
    "What would be the best approach to <TASK>?",
    "How do I get started with <TASK>?",
    "I'm inquiring about the procedure for <TASK>.",
    "Could you share your expertise on <TASK>?",
    "I'd like some guidance on <TASK>.",
    "What's your recommendation for <TASK>?",
    "I'm seeking your input on how to <TASK>.",
    "Can you provide some insights into <TASK>?",
    "How can I successfully accomplish <TASK>?",
    "What steps are involved in <TASK>?",
    "I'm curious about the best way to <TASK>.",
    "Could you show me the ropes for <TASK>?",
    "I need to know how to go about <TASK>.",
    "What are the essential steps for <TASK>?",
    "Is there a specific method for <TASK>?",
    "I'd like to get some advice on <TASK>.",
    "Can you explain the process of <TASK>?",
    "I'm looking for guidance on how to approach <TASK>.",
    "What's the proper way to handle <TASK>?",
    "How should I proceed with <TASK>?",
    "I'm interested in your expertise on <TASK>.",
    "Could you walk me through the steps for <TASK>?",
    "I'm not sure where to begin when it comes to <TASK>.",
    "What should I prioritize when doing <TASK>?",
    "How can I ensure success with <TASK>?",
    "I'd appreciate some tips on <TASK>.",
    "Can you provide a roadmap for <TASK>?",
    "What's the recommended course of action for <TASK>?",
    "I'm seeking your guidance on <TASK>.",
    "Could you offer some suggestions for <TASK>?",
    "I'd like to know the steps to take for <TASK>.",
    "What's the most effective way to achieve <TASK>?",
    "How can I make the most of <TASK>?",
    "I'm wondering about the best approach to <TASK>.",
    "Can you share your insights on <TASK>?",
    "What steps should I follow to complete <TASK>?",
    "I'm looking for advice on <TASK>.",
    "What's the strategy for successfully completing <TASK>?",
    "How should I prepare myself for <TASK>?",
    "I'm not sure where to start with <TASK>.",
    "What's the procedure for <TASK>?",
    "Could you provide some guidance on <TASK>?",
    "I'd like to get some tips on how to <TASK>.",
    "Can you explain how to tackle <TASK> step by step?",
    "I'm interested in understanding the process of <TASK>.",
    "What are the key steps to <TASK>?",
    "Is there a specific method that works for <TASK>?",
    "I'd appreciate your advice on successfully completing <TASK>.",
    "Can you shed light on the best way to <TASK>?",
    "What would you recommend as the first step to <TASK>?",
    "How do I initiate <TASK>?",
    "I'm inquiring about the recommended steps for <TASK>.",
    "Could you share some insights into <TASK>?",
    "I'm seeking your expertise on <TASK>.",
    "What's your recommended approach for <TASK>?",
    "I'd like some guidance on where to start with <TASK>.",
    "Can you provide recommendations for <TASK>?",
    "What's your advice for someone looking to <TASK>?",
    "I'm seeking your input on the process of <TASK>.",
    "How can I achieve success with <TASK>?",
    "What's the best way to navigate <TASK>?",
    "I'm curious about the steps required for <TASK>.",
    "Could you show me the proper way to <TASK>?",
    "I need to know the necessary steps for <TASK>.",
    "What's the most efficient method for <TASK>?",
    "I'd appreciate your guidance on <TASK>.",
    "Can you explain the steps involved in <TASK>?",
    "I'm looking for recommendations on how to approach <TASK>.",
    "What's the right way to handle <TASK>?",
    "How should I manage <TASK>?",
    "I'm interested in your insights on <TASK>.",
    "Could you provide a step-by-step guide for <TASK>?",
    "I'm not sure how to start when it comes to <TASK>.",
    "What are the key factors to consider for <TASK>?",
    "How can I ensure a successful outcome with <TASK>?",
    "I'd like some tips and tricks for <TASK>.",
    "Can you offer a roadmap for accomplishing <TASK>?",
    "What's the preferred course of action for <TASK>?",
    "I'm seeking your expert advice on <TASK>.",
    "Could you suggest some best practices for <TASK>?",
    "I'd like to understand the necessary steps to complete <TASK>.",
    "What's the most effective strategy for <TASK>?",
]

Replace <TASK> in a template with the task instruction wrapped in double quotes. This yields the model's predicted Plan and Next Action. If you additionally append (with grounding) at the end of the sentence, the model also returns a formalized representation that includes coordinates.
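As a minimal sketch of how such a query can be assembled from the en_template_task list above (the helper name build_agent_query is our own, not part of the CogAgent codebase):

import random

def build_agent_query(task: str, with_grounding: bool = False) -> str:
    """Fill a randomly chosen template with the double-quoted task instruction."""
    template = random.choice(en_template_task)
    query = template.replace("<TASK>", f'"{task}"')
    if with_grounding:
        # appending this suffix makes the model also return a formalized
        # operation with on-screen coordinates
        query += "(with grounding)"
    return query

print(build_agent_query("edit my photo", with_grounding=True))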

First, we upload a desktop screenshot:

Then we ask it: I'm looking for a software to "edit my photo with grounding"

As we can see, CogAgent returns the steps for editing a photo, states that the next action is to click Photoshop on the screen, and correctly points out Photoshop's coordinates.

Next, we try out its multi-turn dialogue capability and ask: I want to "calculate the average score of students with grounding"

As we can see, CogAgent suggests the steps using Excel, states that the next action is to click the Excel application on the screen, and correctly points out Excel's coordinates.

The official CogAgent documentation also provides many more fun examples on both PC and mobile. Come and give it a try!

Model Download and Inference

The CogAgent series is now available on the ModelScope community, where developers can download and use it.

Model links:

cogagent-chat:

https://modelscope.cn/models/ZhipuAI/cogagent-chat/summary

cogagent-vqa:

https://www.modelscope.cn/models/ZhipuAI/cogagent-vqa/summary
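Either model can be downloaded ahead of time with ModelScope's standard download API (a minimal sketch):

from modelscope import snapshot_download

# downloads the model weights into the local cache and returns the directory path
model_dir = snapshot_download('ZhipuAI/cogagent-chat')
print(model_dir)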

Inference with cogagent-chat using the ModelScope community's pipeline function:

from modelscope import pipeline
# llm_first=True prefers the LLM pipeline implementation for the chat task
pipe = pipeline(task='chat', model='ZhipuAI/cogagent-chat', llm_first=True, device_map='cuda')
messages_en = {
    'messages': [{
        'role': 'user',
        # the content list mixes an image (local path or URL) with a text query
        'content': [{'image': 'einstein.png'}, {'text': 'Who is him?'}]
    }]
}
# with do_sample=False decoding is greedy, so temperature has no effect
gen_kwargs = {"max_length": 2048,
              "temperature": 0.9,
              "do_sample": False}
print(pipe(messages_en, **gen_kwargs))
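Combining the agent templates above with the same pipeline, a GUI task query might look like the following sketch (screenshot.png is a placeholder for your own GUI screenshot):

messages_gui = {
    'messages': [{
        'role': 'user',
        'content': [
            {'image': 'screenshot.png'},  # placeholder: a desktop or app screenshot
            {'text': 'Can you advise me on how to "edit my photo"?(with grounding)'}
        ]
    }]
}
print(pipe(messages_gui, **gen_kwargs))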

Inference code using AutoModel:

import torch
from PIL import Image
from modelscope import AutoModelForCausalLM, AutoTokenizer
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--quant", choices=[4], type=int, default=None, help='quantization bits')
parser.add_argument("--from_pretrained", type=str, default="ZhipuAI/cogagent-chat", help='pretrained ckpt')
parser.add_argument("--local_tokenizer", type=str, default="AI-ModelScope/vicuna-7b-v1.5", help='tokenizer path')
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--bf16", action="store_true")
args, unknown = parser.parse_known_args()
MODEL_PATH = args.from_pretrained
TOKENIZER_PATH = args.local_tokenizer
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)
if args.bf16:
    torch_type = torch.bfloat16
else:
    torch_type = torch.float16
print("========Use torch type as:{} with device:{}========\n\n".format(torch_type, DEVICE))
# 4-bit quantized loading via bitsandbytes; device placement is handled automatically
if args.quant:
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch_type,
        low_cpu_mem_usage=True,
        load_in_4bit=True,
        trust_remote_code=True
    ).eval()
else:
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch_type,
        low_cpu_mem_usage=True,
        load_in_4bit=args.quant is not None,
        trust_remote_code=True
    ).to(DEVICE).eval()
# interactive loop: enter an image path to start; type "stop" to exit
while True:
    image_path = input("image path >>>>> ")
    if image_path == "stop":
        break
    image = Image.open(image_path).convert('RGB')
    history = []
    # multi-turn dialogue about the current image; type "clear" to switch images
    while True:
        query = input("Human:")
        if query == "clear":
            break
        input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, images=[image])
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(torch_type)]],
        }
        # CogAgent's high-resolution cross branch takes a separate image input
        if 'cross_images' in input_by_model and input_by_model['cross_images']:
            inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(torch_type)]]
        # add any transformers params here.
        gen_kwargs = {"max_length": 2048,
                      "temperature": 0.9,
                      "do_sample": False}
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("</s>")[0]
            print("\nCog:", response)
        history.append((query, response))
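To try the script out, save it (say, as cli_demo.py; the filename is our own choice) and run, for example:

python cli_demo.py --bf16

Then enter an image path at the prompt to start a conversation about that image; type clear to reset the history for a new image, or stop to exit.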

GPU memory usage:
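To measure peak GPU memory on your own machine, a minimal sketch using PyTorch's allocator statistics:

import torch

# call after at least one generation has completed
peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"peak GPU memory allocated: {peak_gib:.1f} GiB")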

Model Training

Training of the CogAgent-Chat and CogAgent-VQA models is now supported in SWIFT (https://github.com/modelscope/swift). The official training example uses the captcha-images dataset from the original GitHub training setup (note that the dataset id in SWIFT is spelled capcha-images, as in the script below). The dataset's input images contain letters and digits, and the labels are the recognized contents. Developers can train with the following script:

# Experimental environment: 2 * A100
# 2 * 45GB
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0,1 \
python llm_sft.py \
    --model_type cogagent-chat \
    --sft_type lora \
    --tuner_backend swift \
    --dtype fp16 \
    --output_dir output \
    --dataset capcha-images \
    --train_dataset_sample -1 \
    --num_train_epochs 2 \
    --max_length 1024 \
    --check_dataset_strategy warning \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --gradient_checkpointing false \
    --batch_size 1 \
    --weight_decay 0.01 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --push_to_hub false \
    --hub_model_id cogagent-chat-lora \
    --hub_private_repo true \
    --hub_token 'your-sdk-token'

Notes on the training process:

  1. Using the model's cross-image ViT together with FusedLayerNorm causes training to diverge; to avoid this error, run `pip uninstall apex`.
  2. When the model is combined with device_map, different operators may end up on different CUDA devices; adjust the device_map configuration as needed (see the sketch after this list).
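If you run into the multi-device operator issue, one workaround is to pin the whole model onto a single GPU rather than letting the weights be split automatically. A minimal sketch, assuming a single GPU has enough memory (MODEL_PATH and torch_type as in the inference script above):

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch_type,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map={'': 0},  # place every module on cuda:0 to avoid cross-device ops
).eval()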

The training loss curve:

GPU memory usage during training:

After training, inference can be run with the following script:

# Experimental environment: A100
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python llm_infer.py \
    --ckpt_dir "/xxx/xxx/cogagent-chat/vx-xxx/checkpoint-xx" \
    --load_args_from_ckpt_dir true \
    --eval_human true \
    --max_length 4096 \
    --use_flash_attn true \
    --max_new_tokens 2048 \
    --temperature 0.3 \
    --top_p 0.7 \
    --repetition_penalty 1.05 \
    --do_sample true \
    --merge_lora_and_save false
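Note that --merge_lora_and_save false keeps the LoRA adapter separate from the base weights; setting it to true would merge the adapter into the base model and save a standalone checkpoint.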

The original image:

Recognition output:

GPU memory usage during inference:

Both scripts can be found in the SWIFT examples.

Click to view the model details:

modelscope.cn/models/ZhipuAI/cogagent-chat/summary
