SubagentVL (Self-Calling Chain-of-Thoughts, sCoT)
Model Description
The SubagentVL model is a Vision-Language Model (VLM) trained using the Self-Calling Chain-of-Thoughts (sCoT) visual reasoning paradigm. This novel approach is designed to enhance visual reasoning, particularly on high-resolution images, by reformulating the complex interleaved multimodal Chain-of-Thought (iMCoT) into a language-only reasoning trajectory augmented with self-calling.
The core idea is that a main agent (the VLM itself) decomposes a complex visual query into a sequence of simple, atomic subtasks, which are then delegated to parameter-sharing virtual replicas called subagents. These subagents handle localized visual capabilities like grounding, OCR, or captioning in isolated contexts, returning concise textual outputs that the main agent aggregates to derive the final answer.
- Base Model: Qwen2.5-VL-7B.
- Paradigm: Thinking-with-images-through-self-calling (sCoT).
- Key Advantage: sCoT is significantly easier to incentivize through reinforcement learning than traditional iMCoT methods, yielding substantial gains in training effectiveness and efficiency.
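The decomposition-and-delegation loop described above can be sketched as follows. This is a minimal illustration, not the released implementation: the call syntax (`<call .../>` tags), its field names, and the `generate_step`/`run_subagent` helpers are all assumptions made for readability.

```python
import re
from dataclasses import dataclass

@dataclass
class SubagentCall:
    task: str      # e.g. "grounding", "ocr", "caption"
    prompt: str    # natural-language instruction for the subagent
    bbox: tuple    # (x1, y1, x2, y2) region of the high-resolution image

# Illustrative call syntax; the model's actual tool-calling protocol is not reproduced here.
CALL_PATTERN = re.compile(r'<call task="(\w+)" prompt="([^"]*)" bbox="([\d ,]+)"\s*/>')

def parse_subagent_call(step: str):
    """Return the first subagent call found in a reasoning step, or None."""
    m = CALL_PATTERN.search(step)
    if m is None:
        return None
    x1, y1, x2, y2 = (int(v) for v in m.group(3).replace(" ", "").split(","))
    return SubagentCall(task=m.group(1), prompt=m.group(2), bbox=(x1, y1, x2, y2))

def solve(question, image, generate_step, run_subagent, max_calls=8):
    """Main-agent loop: generate language-only reasoning, delegate atomic subtasks
    to parameter-sharing subagents, and aggregate their concise textual outputs."""
    trajectory = question
    for _ in range(max_calls):
        step = generate_step(trajectory, image)   # one reasoning/action step by the main agent
        trajectory += step
        call = parse_subagent_call(step)
        if call is None:                          # no further delegation -> final answer reached
            return trajectory
        # The subagent is the same model run in an isolated context on the referenced region;
        # only its short textual response is appended to the main trajectory.
        result = run_subagent(call.task, call.prompt, image, call.bbox)
        trajectory += f"\n<result>{result}</result>\n"
    return trajectory
```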
Training Details
Training Paradigm: Agentic Reinforcement Learning
The model was optimized using end-to-end Agentic Reinforcement Learning (RL) to explicitly reward coherent self-reflection and efficient subtask coordination.
- Algorithm: Group Relative Policy Optimization (GRPO) was employed due to its efficiency and stability.
- RL Steps: Training was conducted for 80 RL steps. Analysis of the training dynamics shows that the model initially learns to solve tasks independently, but by the third stage it consistently issues more subagent calls, indicating a matured coordination strategy.
- Optimization Strategy: Only the main agent’s reasoning and action outputs are optimized; a token-wise loss mask excludes the subagents’ textual responses from the gradient (see the sketch below).
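The token-wise loss mask in the last bullet can be illustrated with a short PyTorch sketch. This is a simplified, assumed form of the objective (the KL regularizer and the group-wise advantage computation of GRPO are omitted); tensor shapes and names are illustrative.

```python
import torch

def masked_policy_loss(logprobs, old_logprobs, advantages, main_agent_mask, clip_eps=0.2):
    """Clipped policy-gradient objective restricted to main-agent tokens.

    logprobs, old_logprobs: (batch, seq_len) per-token log-probabilities
    advantages:             (batch, 1) group-relative advantages, broadcast over tokens
    main_agent_mask:        (batch, seq_len) 1 for the main agent's reasoning/action
                            tokens, 0 for subagent response tokens (no gradient)
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token_loss = -torch.min(unclipped, clipped)
    # Zero out subagent-response tokens, then normalize by the number of kept tokens.
    per_token_loss = per_token_loss * main_agent_mask
    return per_token_loss.sum() / main_agent_mask.sum().clamp(min=1)
```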
Training Data
The model was trained on subsets of the comprehensive dataset collected by DeepEyes, which integrates diverse visual reasoning sources.
The training corpus specified for this model includes:
- Fine-grained data (47%): Derived from the V* training set, this data consists of high-resolution images and detailed perception questions, designed to maximize tool-based reasoning effectiveness.
- Chart data (30%): Composed of synthetic charts and graph images that enrich the diversity of visual elements and quantitative patterns.
Using both the fine-grained and chart subsets was shown to stabilize training and maintain strong scores on high-resolution benchmarks.
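For reference, the stated mixture can be written as a simple configuration. The dictionary below is hypothetical (the card does not specify how the remaining share of the corpus is composed).

```python
# Hypothetical data-mixture configuration reflecting the proportions stated above.
DATA_MIX = {
    "fine_grained_vstar": 0.47,  # V*-derived high-resolution images, detailed perception questions
    "chart": 0.30,               # synthetic chart / graph images
    # remaining fraction: other DeepEyes subsets, not detailed in this card
}
```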
Intended Uses and Limitations
Intended Uses
This model is intended for complex visual reasoning tasks that require decomposing a query and performing localized perception, such as the following (a minimal loading sketch appears after the list):
- Answering questions based on ultra-high-resolution images (up to 8K) where details are confined to small regions.
- Tasks involving tool-calling behaviors like OCR, visual grounding, and captioning through the self-calling mechanism.
- Applications benefiting from a resource-efficient RL-trained agent, as the sCoT paradigm is more scalable than previous iMCoT methods.
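Below is a minimal loading sketch for such use cases, assuming the checkpoint loads like its Qwen2.5-VL-7B base through Hugging Face transformers. The repo id is a placeholder, and only a plain single-turn generation is shown; the model’s actual self-calling inference wraps such calls in the orchestration loop sketched in the Model Description.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "path/to/SubagentVL-7B"  # placeholder, not a confirmed repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("ultra_high_res_scene.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is written on the small sign near the left edge?"},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```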
Limitations
- General Visual Ability: While sCoT substantially enhances complex reasoning, RL that optimizes only the reasoning trajectory (masking out subagent responses) yields only marginal gains in low-level visual skills such as grounding and OCR compared to DeepEyes.
- Abstract Reasoning: Including data that emphasizes abstract symbolic reasoning (such as the "Reason" subset of the DeepEyes data) can degrade performance on high-resolution visual tasks, as it shifts the model’s focus away from precise region-based perception and tool-calling strategies.
- Tool Constraints: The model relies on a strict tool-calling protocol (each call must specify a task type, a prompt, and a bounding box); relaxing these constraints leads to degenerate calling patterns and lower performance.
Citation
If you find our work helpful, please cite us with the following:
@article{yang2025thinking,
  title={Thinking with Images via Self-Calling Agent},
  author={Yang, Wenxi and Zhao, Yuzhong and Wan, Fang and Ye, Qixiang},
  journal={arXiv preprint arXiv:2512.08511},
  year={2025}
}