Towards Scalable Pre-training of Visual Tokenizers for Generation • Paper • 2512.13687 • Published 23 days ago • 100
World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty • Paper • 2512.05927 • Published Dec 5, 2025 • 11
CaptionQA: Is Your Caption as Useful as the Image Itself? • Paper • 2511.21025 • Published Nov 26, 2025 • 27
Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation • Paper • 2512.03534 • Published Dec 3, 2025 • 20
Video Generation Models Are Good Latent Reward Models • Paper • 2511.21541 • Published Nov 26, 2025 • 45
EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation • Paper • 2511.11002 • Published Nov 14, 2025 • 3
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models • Paper • 2511.14582 • Published Nov 18, 2025 • 18
VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models • Paper • 2511.02712 • Published Nov 4, 2025 • 4
Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum • Paper • 2510.27571 • Published Oct 31, 2025 • 17
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning • Paper • 2510.23473 • Published Oct 27, 2025 • 84
Cache-to-Cache: Direct Semantic Communication Between Large Language Models • Paper • 2510.03215 • Published Oct 3, 2025 • 97 • 9
UniVideo: Unified Understanding, Generation, and Editing for Videos • Paper • 2510.08377 • Published Oct 9, 2025 • 71
Self-Improvement in Multimodal Large Language Models: A Survey • Paper • 2510.02665 • Published Oct 3, 2025 • 20 • 6
LongLive: Real-time Interactive Long Video Generation • Paper • 2509.22622 • Published Sep 26, 2025 • 184
Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts • Paper • 2506.10357 • Published Jun 12, 2025 • 21