CodeScout-4B


Best efficiency–performance trade-off in the CodeScout family: outperforms the 8× larger Qwen3-32B across all benchmarks.

[Figure: CodeScout overview]

CodeScout-4B is part of the CodeScout family of open-source RL-trained code search agents. CodeScout models achieve state-of-the-art repository-level code localization using nothing more than a standard Unix terminal — no static analysis, no repository graphs, no language-specific tooling.

Key Highlights

  • Consistently outperforms the 8× larger Qwen3-32B on all benchmarks
  • Surpasses RepoNavigator-14B by 2–10% in file F1 and 8–11% in function F1
  • Exceeds GPT-5 with RepoNavigator by 9% in file F1 and 5% in function F1 on SWE-Bench Verified
  • Best efficiency–performance trade-off in the CodeScout family

Results

Performance on SWE-Bench code localization (instance-averaged F1 scores):

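| Benchmark          | Metric  | CodeScout-1.7B | CodeScout-4B | CodeScout-14B |
|--------------------|---------|----------------|--------------|---------------|
| SWE-Bench Verified | File F1 | 55.46          | 68.52        | 68.57         |
| SWE-Bench Verified | Func F1 | 28.22          | 36.78        | 40.32         |
| SWE-Bench Pro      | File F1 | 40.96          | 51.77        | 53.63         |
| SWE-Bench Pro      | Func F1 | 18.24          | 29.03        | 28.74         |
| SWE-Bench Lite     | File F1 | 56.57          | 67.03        | 71.84         |
| SWE-Bench Lite     | Func F1 | 27.07          | 39.87        | 44.43         |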
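
For readers unfamiliar with the metric, "instance-averaged F1" can be read as: compute F1 between the predicted and gold location sets for each issue, then take the mean over issues. The sketch below illustrates that reading under stated assumptions. It uses exact string matching on set members, and the helper names, the toy paths, and the empty-set convention are illustrative choices, not taken from the CodeScout evaluation code.

```python
def set_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 between a predicted and a gold set of locations (exact string match)."""
    if not predicted and not gold:
        return 1.0  # assumption: nothing to find and nothing predicted counts as perfect
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def instance_averaged_f1(instances: list[tuple[set[str], set[str]]]) -> float:
    """Mean of per-instance F1 scores, so every issue carries the same weight."""
    if not instances:
        return 0.0
    return sum(set_f1(pred, gold) for pred, gold in instances) / len(instances)


# Toy file-level example with made-up paths: (predicted, gold) per issue.
examples = [
    ({"pkg/io.py", "pkg/utils.py"}, {"pkg/io.py"}),  # precision 0.5, recall 1.0
    ({"pkg/core.py"}, {"pkg/core.py"}),              # exact match
]
score = instance_averaged_f1(examples)
print(round(score, 4))
```

Averaging per instance (rather than pooling all predictions) keeps issues with many gold locations from dominating the score.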

[Figures: File-level F1 vs. Model Size; Function-level F1 vs. Model Size]

Code localization performance on SWE-Bench Verified. CodeScout (⭐) achieves superior or competitive results over larger open-source LLMs and narrows the gap with closed-source frontier models.

Training

CodeScout-4B is trained directly from Qwen3-4B-Instruct-2507 using GSPO reinforcement learning.

  • Training data: 1,600 instances drawn from SWE-Smith (39K instances after filtering, 128 repos)
  • RL steps: 200
  • Batch size: 8, with 8 rollouts per instance
  • Max context length: 40K tokens
  • Max turns per episode: 6
  • Reward: Multi-level F1 (file + module + function)
  • Hardware: 8×H100 GPUs
  • Learning rate: 1e-6 (constant)
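
The numbers in this list fit together with some simple bookkeeping, sketched below. The `TrainingBudget` class is a hypothetical helper written for this illustration, and the calculation assumes each RL step draws a fresh batch of instances (a single pass over the data), which is consistent with the 1,600 instances stated above but is not confirmed by the card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingBudget:
    """Hypothetical container for the hyperparameters listed in the card."""
    rl_steps: int = 200
    batch_size: int = 8             # instances per RL step
    rollouts_per_instance: int = 8

    @property
    def instances_seen(self) -> int:
        # Assumption: every step uses a fresh batch, i.e. one pass over the data.
        return self.rl_steps * self.batch_size

    @property
    def rollouts_per_step(self) -> int:
        return self.batch_size * self.rollouts_per_instance

    @property
    def total_rollouts(self) -> int:
        return self.rl_steps * self.rollouts_per_step


budget = TrainingBudget()
print(budget.instances_seen)     # 1600
print(budget.rollouts_per_step)  # 64
print(budget.total_rollouts)     # 12800
```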

How It Works

CodeScout uses the OpenHands-Bash scaffold — an agent equipped with only a Terminal tool (supporting standard Unix commands like rg, find, grep, ls) and a LocalizationFinish tool for structured output submission. The agent iteratively navigates the repository to identify relevant files, classes, and functions related to a given issue.
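
The loop described above can be sketched roughly as follows. This is not the OpenHands-Bash implementation: `TerminalCall`, `LocalizationFinish` (as a Python class), `Policy`, `run_terminal`, and `localize` are hypothetical names chosen for illustration, and the policy here is a scripted stand-in for the model. The default of 6 turns mirrors the training setting listed under Training; how the real scaffold counts the final submission against that limit is an assumption.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, Union


@dataclass
class TerminalCall:
    """Hypothetical action: run one shell command inside the repository."""
    command: str


@dataclass
class LocalizationFinish:
    """Hypothetical action: submit the final structured answer."""
    files: list[str] = field(default_factory=list)
    classes: list[str] = field(default_factory=list)
    functions: list[str] = field(default_factory=list)


Action = Union[TerminalCall, LocalizationFinish]
# A policy maps (issue text, transcript of (command, output) pairs) to the next action.
Policy = Callable[[str, list[tuple[str, str]]], Action]


def run_terminal(command: str, repo_dir: str, timeout_s: int = 30) -> str:
    """Run a shell command in repo_dir and return stdout followed by stderr."""
    result = subprocess.run(
        command, shell=True, cwd=repo_dir,
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.stdout + result.stderr


def localize(issue: str, policy: Policy, repo_dir: str, max_turns: int = 6,
             terminal: Callable[[str, str], str] = run_terminal) -> LocalizationFinish:
    """Alternate between policy actions and terminal observations until submission."""
    transcript: list[tuple[str, str]] = []
    for _ in range(max_turns):
        action = policy(issue, transcript)
        if isinstance(action, LocalizationFinish):
            return action
        transcript.append((action.command, terminal(action.command, repo_dir)))
    return LocalizationFinish()  # assumption: empty answer when the turn budget runs out


def scripted_policy(issue: str, transcript: list[tuple[str, str]]) -> Action:
    """Toy stand-in for the model: one search, then submit."""
    if not transcript:
        return TerminalCall('rg -n "def parse_config" .')
    return LocalizationFinish(files=["app/config.py"],
                              functions=["app/config.py:parse_config"])


def fake_terminal(command: str, repo_dir: str) -> str:
    """Canned output so the example runs without a real repository."""
    return "app/config.py:12:def parse_config(path):\n"


result = localize("parse_config crashes on empty files", scripted_policy, ".",
                  terminal=fake_terminal)
print(result.files)
```

Passing the terminal as a parameter keeps the loop testable without touching the shell; swapping in `run_terminal` would execute the commands in a real checkout.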

The model is trained with GSPO (Group Sequence Policy Optimization) using multi-level F1 rewards at the file, module, and function level.
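
To make the training signal concrete, here is a minimal sketch of a multi-level F1 reward together with the two quantities usually associated with GSPO: a group-normalized advantage over the rollouts of one instance, and a length-normalized sequence-level importance ratio. Several details are assumptions rather than facts from this card: the unweighted sum across the three levels, the use of the population standard deviation, the `eps` constant, and all function names. Clipping and the policy-gradient update itself are omitted.

```python
import math
from statistics import mean, pstdev

LEVELS = ("file", "module", "function")


def set_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 between a predicted and a gold set of locations (exact string match)."""
    if not predicted and not gold:
        return 1.0  # assumption: both empty counts as a perfect match
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def multi_level_reward(pred: dict[str, set[str]], gold: dict[str, set[str]]) -> float:
    """Sum of F1 at file, module, and function level (equal weights are an assumption)."""
    return sum(set_f1(pred.get(lv, set()), gold.get(lv, set())) for lv in LEVELS)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within the group of rollouts sampled for one instance."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_ratio(new_logprobs: list[float], old_logprobs: list[float]) -> float:
    """exp(mean per-token log ratio): a length-normalized sequence-level ratio."""
    if not new_logprobs or len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-prob lists must be non-empty and of equal length")
    log_ratio = sum(n - o for n, o in zip(new_logprobs, old_logprobs))
    return math.exp(log_ratio / len(new_logprobs))


# Toy example with made-up locations.
gold = {"file": {"a.py"}, "module": {"a.py:Parser"}, "function": {"a.py:Parser.parse"}}
pred = {"file": {"a.py", "b.py"}, "module": {"a.py:Parser"}, "function": set()}
reward = multi_level_reward(pred, gold)        # 2/3 (file) + 1.0 (module) + 0.0 (function)
advantages = group_advantages([1.0, 2.0, 3.0])
ratio = sequence_ratio([-0.5, -0.5], [-1.0, -1.0])
```

With 8 rollouts per instance, `group_advantages` would be called on the 8 rewards of each group, so a rollout is credited only for doing better than its siblings on the same issue.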

Intended Use

CodeScout-4B is designed for repository-level code localization: given a GitHub issue description and a code repository, it identifies the relevant files, classes, and functions that need to be modified. It is intended to be used as a localization subagent within larger coding agent pipelines.

Limitations

  • Trained and evaluated exclusively on Python repositories
  • Designed for code localization, not code editing or issue resolution
  • Performance may vary on repositories significantly different from the training distribution
  • Requires the OpenHands-Bash scaffold for optimal performance

Citation

@misc{sutawika2026codescouteffectiverecipereinforcement,
      title={CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents}, 
      author={Lintang Sutawika and Aditya Bharat Soni and Bharath Sriraam R R and Apurva Gandhi and Taha Yassine and Sanidhya Vijayvargiya and Yuchen Li and Xuhui Zhou and Yilin Zhang and Leander Melroy Maben and Graham Neubig},
      year={2026},
      eprint={2603.17829},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2603.17829}, 
}

Model size: 4B params; tensor type: BF16 (Safetensors)