Agents
AfroBench
Comprehensive benchmark of LLMs on African Languages
computational linguistics, natural language processing
Forecasting Downstream Performance of LLMs With Proxy Metrics
Structured Distillation of Web Agent Capabilities Enables Generalization
Leaderboard for mSTEB benchmark
Visualize web interaction recordings
Leaderboard for AgentRewardBench
Explore agent trajectories and judgments in web benchmarks
SafeArena Leaderboard