Benchmark
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis • Paper • arXiv:2505.13227 • Published • 45 upvotes
facebook/natural_reasoning • Viewer • Updated • 1.15M rows • 1.25k downloads • 546 likes
Search Arena: Analyzing Search-Augmented LLMs • Paper • arXiv:2506.05334 • Published • 17 upvotes
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation • Paper • arXiv:2506.07977 • Published • 41 upvotes
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? • Paper • arXiv:2506.11928 • Published • 24 upvotes
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification • Paper • arXiv:2506.15569 • Published • 12 upvotes
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation • Paper • arXiv:2506.14028 • Published • 93 upvotes
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents • Paper • arXiv:2506.11763 • Published • 73 upvotes
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning • Paper • arXiv:2506.09049 • Published • 37 upvotes
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers • Paper • arXiv:2507.02694 • Published • 19 upvotes
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once • Paper • arXiv:2507.10541 • Published • 29 upvotes
HuggingFaceTB/SmolLM3-3B-Base • Text Generation • 3B params • Updated • 9.61k downloads • 145 likes
AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs • Paper • arXiv:2507.08616 • Published • 14 upvotes
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations • Paper • arXiv:2507.13302 • Published • 4 upvotes
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research • Paper • arXiv:2507.13300 • Published • 19 upvotes
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering • Paper • arXiv:2507.11527 • Published • 32 upvotes
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers • Paper • arXiv:2507.10787 • Published • 12 upvotes
WideSearch: Benchmarking Agentic Broad Info-Seeking • Paper • arXiv:2508.07999 • Published • 110 upvotes
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents • Paper • arXiv:2508.13186 • Published • 19 upvotes
AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions • Paper • arXiv:2508.16402 • Published • 14 upvotes
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers • Paper • arXiv:2508.14704 • Published • 43 upvotes
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model • Paper • arXiv:2508.14444 • Published • 39 upvotes
UQ: Assessing Language Models on Unsolved Questions • Paper • arXiv:2508.17580 • Published • 15 upvotes
T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation • Paper • arXiv:2508.17472 • Published • 26 upvotes
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks • Paper • arXiv:2508.15804 • Published • 15 upvotes
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers • Paper • arXiv:2508.20453 • Published • 63 upvotes
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks • Paper • arXiv:2509.01396 • Published • 57 upvotes
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs • Paper • arXiv:2509.04013 • Published • 4 upvotes
SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge • Paper • arXiv:2509.07968 • Published • 14 upvotes
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering • Paper • arXiv:2509.09614 • Published • 7 upvotes
GenExam: A Multidisciplinary Text-to-Image Exam • Paper • arXiv:2509.14232 • Published • 21 upvotes
ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark • Paper • arXiv:2501.01290 • Published
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use • Paper • arXiv:2509.24002 • Published • 174 upvotes
OceanGym: A Benchmark Environment for Underwater Embodied Agents • Paper • arXiv:2509.26536 • Published • 34 upvotes
PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs • Paper • arXiv:2510.09507 • Published • 10 upvotes
PICABench: How Far Are We from Physically Realistic Image Editing? • Paper • arXiv:2510.17681 • Published • 62 upvotes
LiveTradeBench: Seeking Real-World Alpha with Large Language Models • Paper • arXiv:2511.03628 • Published • 12 upvotes
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks • Paper • arXiv:2511.15065 • Published • 74 upvotes
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark • Paper • arXiv:2511.17729 • Published • 16 upvotes
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward • Paper • arXiv:2511.20561 • Published • 32 upvotes
RefineBench: Evaluating Refinement Capability of Language Models via Checklists • Paper • arXiv:2511.22173 • Published • 14 upvotes
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle • Paper • arXiv:2512.04324 • Published • 150 upvotes
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents • Paper • arXiv:2512.12730 • Published • 43 upvotes
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value • Paper • arXiv:2512.14051 • Published • 40 upvotes
MMGR: Multi-Modal Generative Reasoning • Paper • arXiv:2512.14691 • Published • 114 upvotes
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments • Paper • arXiv:2512.19432 • Published • 11 upvotes
FrontierCS: Evolving Challenges for Evolving Intelligence • Paper • arXiv:2512.15699 • Published • 5 upvotes
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models • Paper • arXiv:2512.15560 • Published • 24 upvotes