Summary
This paper introduces Qwen2.5-VL, the latest flagship model in the Qwen series, which demonstrates significant advances in both foundational capabilities and innovative features. Qwen2.5-VL achieves a major leap in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. The model can accurately localize objects with bounding boxes or points, and extracts structured data from invoices, forms, tables, charts, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, allowing it to process images of varying sizes and videos up to hours long, with second-level event localization.

Methodologically, Qwen2.5-VL makes four technical innovations: 1) implementing window attention in the vision encoder to improve inference efficiency; 2) introducing dynamic frame-rate sampling, which extends dynamic resolution to the temporal dimension and supports comprehensive video understanding across sampling rates; 3) upgrading MRoPE with absolute-time alignment to enable more sophisticated temporal learning; 4) carefully curating high-quality data for pre-training and supervised fine-tuning, scaling the pre-training corpus from 1.2 trillion to 4.1 trillion tokens.

Experimental results show that Qwen2.5-VL performs strongly across multiple benchmarks, even surpassing some leading closed-source models. Its strong document parsing, precise object localization, ultra-long video understanding, and enhanced agent capabilities give it broad applicability across multimodal tasks.
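The absolute time encoding idea in innovation 3 can be illustrated with a minimal sketch. This is a hypothetical helper, not the paper's implementation: it assumes a temporal position ID is derived from each frame's timestamp in seconds rather than from its frame index, so videos sampled at different frame rates share one time scale.

```python
def temporal_ids(num_frames, fps, ids_per_second=2):
    """Map each sampled frame to a temporal position ID based on its
    absolute timestamp (frame_index / fps), not its frame index.
    `ids_per_second` is an assumed granularity, not the paper's value."""
    return [round(i / fps * ids_per_second) for i in range(num_frames)]

# The same 4 seconds of video spans the same ID range at any sampling rate:
low_fps = temporal_ids(num_frames=4, fps=1)   # sampled at 1 FPS
high_fps = temporal_ids(num_frames=8, fps=2)  # sampled at 2 FPS
```

With frame-index positions, the 2 FPS clip would appear "twice as long" as the 1 FPS clip; with timestamp-derived IDs, both cover the same temporal range, which is what makes second-level event localization consistent across sampling rates.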
Abstract: We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.
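The computational benefit of Window Attention mentioned in the abstract can be shown with a toy cost model. This is an illustrative sketch, not the actual ViT code: it assumes full self-attention costs quadratically in the total token count, while window attention runs attention independently within fixed-size windows, so cost scales linearly with the number of tokens.

```python
def attention_cost(num_tokens, window=None):
    """FLOP proxy: full attention is O(N^2); window attention partitions
    N tokens into ceil(N / W) windows, each costing O(W^2), i.e. O(N * W).
    The window size 64 below is an assumption for illustration only."""
    if window is None:
        return num_tokens ** 2              # full self-attention
    num_windows = -(-num_tokens // window)  # ceiling division
    return num_windows * window ** 2

tokens = 16384  # e.g. a high-resolution image kept at native resolution
full = attention_cost(tokens)
windowed = attention_cost(tokens, window=64)
```

The gap between `full` and `windowed` widens as native-resolution images grow, which is why restricting most ViT layers to local windows keeps dynamic-resolution processing tractable.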
Model Evaluation
On several key benchmarks, the Qwen2.5-VL-72B model achieves the following scores:
1. MMMU (Yue et al., 2023): 70.2
2. MMMU-Pro (Yue et al., 2024): 51.1
3. MathVista (Lu et al., 2024): 74.8
4. MATH-Vision (Wang et al., 2024d): 38.1
5. MMBench-EN (Liu et al., 2023d): 88.6
6. MuirBench (Wang et al., 2024a): 70.7
7. MTVQA (Tang et al., 2024): 31.7
8. MM-MT-Bench (Agrawal et al., 2024): 7.6
9. CC-OCR (Yang et al., 2024b): 79.8
10. OCRBench_v2 (English/Chinese): 61.5/63.7
Paper Categories
Natural Language Processing, Computer Vision, Deep Learning, Computer Vision and Pattern Recognition (cs.CV), Computation and Language (cs.CL)
More Information
Model Name
Qwen2.5-VL
Model Developer
Alibaba Cloud team
Framework
Not mentioned
Model Hardware Information
Not mentioned