🧠 Teaching AI to "Think" with Images through Self-Calling

Community Article · Published December 14, 2025

AI is getting smarter every day, but one frontier remains notably tricky: visual reasoning, or the ability to "think with images."

Here is how OpenAI introduces the idea:

"Thinking with images" moves AI beyond just describing pictures to actively solving puzzles within them. Instead of a single static glance, the model treats the image like an investigation—dynamically zooming in on details, reading text, and piecing together visual clues step-by-step.

Imagine looking at a "Where’s Waldo?" puzzle. You don't just stare at the whole page at once; you zoom in on specific sections, analyze details, and piece together the answer. That is the kind of dynamic thinking we want our multimodal models to possess.

But how do we teach a model to do this without needing a supercomputer the size of a warehouse?

Enter Self-Calling Chain-of-Thoughts (sCoT). It’s a clever new paradigm that makes multimodal reasoning easier to train, cheaper to run, and surprisingly more effective. Let’s dive in! 🚀



The Old Way: The Juggling Act (iMCoT) 🤹

Until now, the standard for visual reasoning was something called interleaved Multimodal Chain-of-Thoughts (iMCoT). Think of this like a juggler trying to solve a math problem while keeping three balls in the air. The model had to constantly switch back and forth between processing heavy visual data and generating text reasoning.

The problem? It’s exhausting for the model (and the GPU).

  • Data Scarcity: High-quality training data for this specific "juggling act" is rare.
  • Hard to Optimize: Because the data is scarce, training these agents with reinforcement learning is incredibly difficult.

We needed a way to let the model "reason" without getting bogged down by the heavy lifting of constant visual processing.

The New Way: The "Cloning" Trick (sCoT) 👯‍♂️

Noticing that SOTA VLMs are already capable of solving downstream visual subtasks like grounding, OCR, and captioning on their own, sCoT flips the script. Instead of trying to do everything at once, it turns the visual reasoning problem into a language-only workflow.

Here is the magic trick: The model doesn't need external tools. It uses itself as a tool.

Imagine a hierarchy consisting of a Main Agent (the boss) and Subagents (the workers). But here's the kicker: the workers are just "virtual clones" of the boss. They share the exact same weights and parameters. You don't need to deploy extra models; you just invoke the same model in a different "mode."
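To make the "clone" idea concrete, here is a minimal sketch of one model serving in two modes. Everything here (the `call_vlm` helper, the two system prompts, the task vocabulary) is a hypothetical illustration, not the paper's actual prompts or API:

```python
# One set of weights, two roles. A "subagent" is just another forward pass
# through the same model with a different system prompt.

MAIN_PROMPT = ("You are the main agent. Break the user's visual question "
               "into subtasks (ocr / grounding / captioning) and delegate.")
SUB_PROMPT = ("You are a subagent. Perform exactly one subtask on the given "
              "image region and reply with plain text.")

def call_vlm(model, system_prompt: str, user_prompt: str, image) -> str:
    """Single entry point to the one and only set of weights.
    `model` stands in for whatever inference API you actually use."""
    return model(system_prompt, user_prompt, image)

def run_subagent(model, task: str, prompt: str, crop) -> str:
    # Same weights, different role: no second model is ever deployed.
    return call_vlm(model, SUB_PROMPT, f"[{task}] {prompt}", crop)
```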

🛠️ How it Works: The 4-Step Flow

  1. The Main Agent (The Manager): The model receives a complex user query (e.g., "What is the total price of the items on this menu?"). It breaks this down into bite-sized tasks—like OCR (reading text), grounding (finding objects), or captioning.
  2. The Call: The Main Agent summons a Subagent.
  3. The Subagent (The Specialist): This "virtual replica" wakes up to do one specific thing. It might look at a cropped section of the menu and just read the prices. It does the job and reports back with a simple text answer.
  4. The Synthesis: The Main Agent takes that text answer, combines it with what it already knows, and produces the final result.
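Putting the four steps together, the driver loop might look like the sketch below, reusing the hypothetical `call_vlm`, `run_subagent`, and prompts from the earlier snippet. The `<call>` tag, its JSON schema, and the PIL-style `image.crop` are all assumptions for illustration, not the paper's exact protocol:

```python
import json
import re

def self_calling_answer(model, image, question: str, max_calls: int = 4) -> str:
    """Drive the sCoT loop: the main agent reasons in text; whenever it emits
    a <call>{...}</call> block, the SAME model is invoked as a subagent on
    the requested crop, and its text answer joins the transcript."""
    transcript = f"Question: {question}\n"
    for _ in range(max_calls):
        reply = call_vlm(model, MAIN_PROMPT, transcript, image)      # step 1
        match = re.search(r"<call>(.*?)</call>", reply, re.S)
        if match is None:                  # step 4: no more delegation needed,
            return reply                   # the reply is the final synthesis
        # e.g. {"task": "ocr", "prompt": "read the prices", "bbox": [x0, y0, x1, y1]}
        spec = json.loads(match.group(1))
        crop = image.crop(tuple(spec["bbox"]))                       # step 2
        answer = run_subagent(model, spec["task"], spec["prompt"], crop)  # step 3
        transcript += f"[subagent:{spec['task']}] {answer}\n"        # text-only handoff
    return call_vlm(model, MAIN_PROMPT, transcript + "Final answer:", image)
```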

By coordinating everything through text, sCoT bypasses the headache of complex interleaved processing. It’s cleaner, faster, and purely linguistic.


⚡ The Impact: Speed & Smarts

The results of switching to sCoT are honestly kind of wild.

  1. Massive Efficiency Gains 📉 Because sCoT is easier to incentivize than the old "juggling" method, training costs plummet. In recent experiments, sCoT beat the state-of-the-art iMCoT approach (DeepEyes) while using ~75% fewer GPU hours.

  2. High-Res Performance 👁️ When tested on tough benchmarks like HR-Bench 4K/8K (which tests high-resolution understanding), sCoT didn't just match the baseline—it improved performance by up to 1.9%.

The performance boost doesn't come from the model suddenly seeing better (sharper OCR or grounding). It comes from the model learning a better strategy: figuring out exactly when and how to delegate tasks to its subagents.

| Model | Size | V* direct | V* relative | V* overall | HR-4K FSP | HR-4K FCP | HR-4K Overall | HR-8K FSP | HR-8K FCP | HR-8K Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o [19] | – | – | – | 66.0 | 70.0 | 48.0 | 59.0 | 62.0 | 49.0 | 55.5 |
| o3 [18] | – | – | – | 95.7 | – | – | – | – | – | – |
| *manually-defined workflow* | | | | | | | | | | |
| SEAL [29] | 7B | 74.8 | 76.3 | 75.4 | – | – | – | – | – | – |
| DyFo [12] | 7B | 80.0 | 82.9 | 81.2 | – | – | – | – | – | – |
| ZoomEye [24] | 7B | 93.9 | 85.5 | 90.6 | 84.3 | 55.0 | 69.6 | 85.5 | 50.0 | 69.3 |
| *baseline* | | | | | | | | | | |
| Qwen2.5-VL [1] | 7B | 73.9 | 67.1 | 71.2 | 85.2 | 52.2 | 68.8 | 78.8 | 51.8 | 65.3 |
| Qwen2.5-VL [1] | 32B | 87.8 | 88.1 | 87.9 | 89.8 | 58.0 | 73.9 | 84.5 | 56.3 | 70.4 |
| *think-with-images* | | | | | | | | | | |
| DeepEyes*† | 7B | 91.3 | 82.9 | 88.0 | 92.0 | 58.3 | 75.1 | 85.0 | 57.0 | 71.0 |
| DeepEyes [43] | 7B | 91.3 | 88.2 | 90.1 | 91.3 | 59.0 | 75.1 | 86.8 | 58.5 | 72.6 |
| *think-through-self-calling* | | | | | | | | | | |
| SubagentVL† (Ours) | 7B | 93.0 | 89.5 | 91.6 | 93.3 | 60.8 | 77.0 | 87.0 | 58.3 | 72.6 |
| Δ (vs Qwen2.5-VL-7B) | | +19.1 | +22.4 | +20.4 | +8.1 | +8.6 | +8.2 | +8.2 | +6.5 | +7.3 |
| Δ (vs DeepEyes) | | +1.7 | +1.3 | +1.2 | +1.8 | +1.9 | +1.9 | +0.2 | −0.3 | 0.0 |

\* means reproduced results

🧪 The Secret Sauce to Train the Self-Calling VLM

If you are looking to train your own sCoT agent, researchers found three "gotchas" you need to watch out for during Reinforcement Learning:

1. Don't Let It Be Lazy (Strict Constraints) 🛑

When calling a subagent, the model must provide three things: a Task Type, a Prompt, and a Bounding Box. If you aren't strict about this, the model gets lazy, asks vague questions, and performance tanks.
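As a rough illustration, a strict call validator might look like the sketch below. The exact checks (the task vocabulary, the minimum prompt length, the bounding-box sanity rules) are my assumptions, not the paper's published constraints:

```python
ALLOWED_TASKS = {"ocr", "grounding", "captioning"}  # illustrative vocabulary

def validate_call(spec: dict, img_w: int, img_h: int) -> bool:
    """Reject lazy or malformed subagent calls: all three fields must be
    present and well-formed, otherwise the call earns no reward."""
    if spec.get("task") not in ALLOWED_TASKS:
        return False                       # missing or unknown task type
    if len(spec.get("prompt", "").strip()) < 5:
        return False                       # vague or empty prompt
    bbox = spec.get("bbox")
    if not (isinstance(bbox, (list, tuple)) and len(bbox) == 4):
        return False                       # the bounding box is mandatory
    x0, y0, x1, y1 = bbox
    return 0 <= x0 < x1 <= img_w and 0 <= y0 < y1 <= img_h
```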

2. Stop the "Cheating" (Reward Hacking) ⚖️

AI loves to cheat. In early tests, models would guess the answer first and then call the tool afterwards just to get the reward points. Researchers had to add an ordering constraint $I_{\text{tool} \prec \text{ans}}$ to the reward function. Translation: you don't get a cookie unless you do the work (tool use) BEFORE you give the answer.
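In reward-function terms, the ordering constraint acts as a gate on the tool-use bonus. Here is a minimal sketch; the event encoding and the 0.2 bonus weight are made up for illustration:

```python
def reward(events: list[str], answer_correct: bool) -> float:
    """Accuracy reward plus a tool-use bonus gated by the ordering
    indicator I_{tool < ans}: tool calls only count if they all occur
    strictly before the answer is emitted."""
    ans_idx = events.index("answer") if "answer" in events else len(events)
    tools_first = all(i < ans_idx for i, e in enumerate(events) if e == "tool_call")
    r = 1.0 if answer_correct else 0.0     # correctness term
    if "tool_call" in events and tools_first:
        r += 0.2                           # bonus only if the work precedes the answer
    return r

# A model that answers first and calls the tool afterwards gets no bonus:
print(reward(["answer", "tool_call"], answer_correct=True))   # 1.0
print(reward(["tool_call", "answer"], answer_correct=True))   # 1.2
```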

3. Feed It the Right Data 🥗

Data matters. Training on fine-grained visual details (like charts or small objects) helps the model stabilize. However, feeding it abstract "reasoning data" without strong visual grounding actually made it worse. Stick to data that forces the model to look at specific regions.


Final Thoughts

sCoT proves that sometimes the best way to solve a complex problem is to break it down. By teaching models to "call themselves" and handle visual tasks via text, we are paving the way for AI that can see better, think faster, and train cheaper.

Ready to try "thinking with images via self-calling"? Check out the full paper and code!
