SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
Abstract
Continual skill learning methods for LLM agents show mixed performance across diverse tasks, with improvements dependent on task structure and feedback mechanisms rather than model scaling.
Skills have become the de facto way to enable LLM agents to perform complex real-world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluating continual skill learning methods, comprising 20 verified, skill-dependent tasks across 15 sub-domains derived from a real-world skill taxonomy, evaluated at three levels: skill quality, execution trajectory, and task outcome. Using this benchmark, we evaluate recent continual learning techniques that leverage one-shot learning, self/teacher feedback, or a dedicated skill creator to generate skills from agent experiences. We find that all continual learning methods improve over the no-skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs. Continual learning improves tasks with clear, reusable workflows but struggles on open-ended tasks, and scaling to stronger LLM backbones does not consistently produce better skills. Our analysis also reveals that multiple iterations of continual learning yield genuine improvement when driven by external feedback, whereas self-feedback alone induces recursive drift. Our data and code are open-source at https://github.com/cxcscmu/SkillLearnBench to enable further studies of automatic skill generation and continual learning.
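The iterative skill-refinement loop contrasted in the abstract (external feedback vs. self-feedback) can be sketched as follows. This is a minimal illustration only, not the benchmark's implementation: every function here (`generate_skill`, `run_task`, `external_feedback`) is a hypothetical stand-in for an LLM or verifier call.

```python
# Minimal sketch of continual skill learning with external feedback.
# All functions are hypothetical placeholders, not SkillLearnBench APIs.

def generate_skill(experience: str) -> str:
    """Stand-in for an LLM call that writes a skill from agent experience."""
    return f"skill derived from: {experience}"

def run_task(skill: str) -> str:
    """Stand-in for executing a benchmark task with the current skill."""
    return f"trajectory using ({skill})"

def external_feedback(trajectory: str) -> str:
    """Stand-in for a teacher/verifier signal grounded in the task outcome."""
    return f"verified notes on {trajectory}"

def continual_skill_learning(initial_experience: str, iterations: int = 3) -> str:
    """Refine a skill over multiple iterations using external feedback."""
    skill = generate_skill(initial_experience)
    for _ in range(iterations):
        trajectory = run_task(skill)
        # An external signal anchors each revision in verified outcomes;
        # replacing this with the agent's own critique is the self-feedback
        # setting that the paper reports drifting recursively.
        feedback = external_feedback(trajectory)
        skill = generate_skill(feedback)
    return skill

final_skill = continual_skill_learning("first run on a data-cleaning task")
```

The key design point is that the feedback source, not the loop itself, determines whether iteration helps: swapping `external_feedback` for a self-critique of the same trajectory reproduces the recursive-drift failure mode described above.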
Community
SkillLearnBench is the first benchmark for evaluating continual learning methods that automatically generate agent skills.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings (2026)
- SkillForge: Forging Domain-Specific, Self-Evolving Agent Skills in Cloud Technical Support (2026)
- Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning (2026)
- SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents (2026)
- SkillNet: Create, Evaluate, and Connect AI Skills (2026)
- SkillCraft: Can LLM Agents Learn to Use Tools Skillfully? (2026)
- XSkill: Continual Learning from Experience and Skills in Multimodal Agents (2026)