PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 8 days ago • 95 • 3
SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning Paper • 2602.19455 • Published Feb 23 • 1
Enterprise Agents and Benchmarks Collection Enterprise agent ecosystem featuring AssetOpsBench (industrial) and ITBench (SRE, FinOps, CISO), CUGA to accelerate AI Automation • 21 items • Updated 5 days ago • 17
Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows Paper • 2605.24219 • Published May 26 • 9
Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents Paper • 2606.12674 • Published 19 days ago • 5
Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents Paper • 2606.19704 • Published 11 days ago • 41
Appreciate the nice writeup. Can we add a) Leaderboard, b) Benchmark https://github.com/IBM/AssetOpsBench
Article Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic ibm-research • 28 days ago • 88
Article ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM ibm-research • May 27 • 17