arxiv:2605.02396

HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

Published on May 4 · Submitted by WangJianing on May 6
AI-generated summary

HeavySkill presents a framework where complex reasoning is internalized as an intrinsic model skill rather than delegated to external orchestration, demonstrating superior performance through parallel reasoning and summarization stages that can be further enhanced via reinforcement learning.

Abstract

Recent advances in agentic harnesses, i.e., orchestration frameworks that coordinate multiple agents with memory, skills, and tool use, have achieved remarkable success on complex reasoning tasks. However, the mechanism that truly drives performance remains obscured behind intricate system designs. In this paper, we propose HeavySkill, a perspective that views heavy thinking not only as a minimal execution unit in an orchestration harness but also as an inner skill, internalized within the model's parameters, that drives the orchestrator to solve complex tasks. We identify this skill as a two-stage pipeline, parallel reasoning followed by summarization, which can operate beneath any agentic harness. We present a systematic empirical study of HeavySkill across diverse domains. Our results show that this inner skill consistently outperforms traditional Best-of-N (BoN) strategies; notably, stronger LLMs can even approach Pass@N performance. Crucially, we demonstrate that the depth and width of heavy thinking, as a learnable skill, can be scaled further via reinforcement learning, offering a promising path toward self-evolving LLMs that internalize complex reasoning without relying on brittle orchestration layers.
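The abstract's two baselines can be made concrete with a toy sketch. This is not the paper's code; the data and the noisy scorer below are hypothetical, chosen only to show why Best-of-N can miss a correct sample that Pass@N would count as solved:

```python
import random

random.seed(0)

def pass_at_n(correct_flags):
    """Pass@N: the task counts as solved if ANY of the N samples is correct."""
    return any(correct_flags)

def best_of_n(samples, scorer):
    """Best-of-N: commit to the single sample a (possibly noisy) scorer ranks highest."""
    return max(samples, key=scorer)

# Hypothetical toy data: (answer, is_correct) pairs for N=4 sampled solutions.
samples = [("a1", False), ("a2", True), ("a3", False), ("a4", True)]

# A noisy scorer: correct answers score higher on average, but with Gaussian noise,
# so Best-of-N can still select an incorrect sample.
noisy_score = lambda s: (1.0 if s[1] else 0.0) + random.gauss(0, 0.8)

print("Pass@N solved:", pass_at_n([c for _, c in samples]))
print("Best-of-N pick:", best_of_n(samples, noisy_score))
```

The gap between these two numbers is the headroom the paper targets: the claim is that the internalized parallel-reasoning-plus-summarization skill closes much of the distance from BoN toward the Pass@N ceiling.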

Community

Paper submitter

HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

HeavySkill is a test-time scaling technique that decomposes complex reasoning into two stages:

  • Parallel Reasoning — Generate K independent reasoning trajectories concurrently
  • Sequential Deliberation — Synthesize trajectories through critical analysis into a superior final answer
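The two stages above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate_trajectory` and `summarize` are hypothetical stand-ins for model calls, and a real system would query an LLM for both stages.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_trajectory(question, seed):
    # Stand-in for one independent reasoning rollout (a real system calls an LLM here).
    return f"[trajectory {seed}] reasoning about: {question}"

def summarize(question, trajectories):
    # Stage 2 (sequential deliberation): all K trajectories are placed in one
    # context and synthesized into a single final answer.
    context = "\n".join(trajectories)
    return (f"Final answer to '{question}' after reviewing "
            f"{len(trajectories)} trajectories ({len(context)} chars of context).")

def heavy_skill(question, k=4):
    # Stage 1 (parallel reasoning): launch K independent trajectories concurrently.
    with ThreadPoolExecutor(max_workers=k) as pool:
        trajectories = list(pool.map(
            lambda seed: generate_trajectory(question, seed), range(k)))
    return summarize(question, trajectories)

print(heavy_skill("Is 97 prime?", k=4))
```

Note the design tension this makes visible: stage 1 scales out cheaply, but stage 2 must fit all K trajectories into one context, which is where the interference question raised below comes in.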

The most interesting move here is treating heavy thinking as an internal two-stage skill, parallel reasoning followed by summarization, that travels with the model rather than with the harness. I'm curious about the memory cache and deliberation loop: when many trajectories are serialized into the cache, does the final synthesis risk information interference during revisitation? The arxivlens breakdown helped me parse where the bottlenecks live and what the internal skill is actually buying you, especially in terms of transferability across harnesses (https://arxivlens.com/PaperView/Details/heavyskill-heavy-thinking-as-the-inner-skill-in-agentic-harness-8685-925845c1). If RLVR is pushed to grow both breadth and depth, I'd want to see how compute scales and whether there is a sweet spot where extra trajectories stop paying off.



Get this paper in your agent:

hf papers read 2605.02396
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash


Collections including this paper 2