Abstract
Self-refining video sampling improves motion coherence and physics alignment by using a pre-trained video generator as its own denoising autoencoder for iterative refinement with uncertainty-aware region selection.
Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, preventing artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference compared to both default and guidance-based samplers.
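The core idea described above can be illustrated with a toy sketch. This is a hypothetical, minimal illustration (the `denoise` stand-in, all parameter names, and the thresholding scheme are our own assumptions, not the paper's implementation): the generator is treated as a denoising autoencoder, several stochastic re-noise/denoise passes are drawn, and only regions where those passes disagree (high variance, i.e. low self-consistency) are refined.

```python
import numpy as np

def denoise(latent, noise_level, rng):
    # Stand-in for a pre-trained video generator used as a denoising
    # autoencoder: perturb the latent with noise, then "denoise" it.
    # (Toy model only: denoising here just damps the noisy latent.)
    noisy = latent + noise_level * rng.standard_normal(latent.shape)
    return noisy * (1.0 - noise_level)

def self_refine(latent, n_iters=3, n_samples=4, noise_level=0.3,
                uncertainty_threshold=0.1, seed=0):
    """Toy sketch of uncertainty-aware self-refinement.

    Regions where repeated re-noise/denoise passes disagree (high
    variance across samples) are treated as uncertain and refined;
    self-consistent regions are left untouched, which is the intuition
    behind avoiding over-refinement artifacts.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        # Draw several stochastic refinements of the same latent.
        candidates = np.stack(
            [denoise(latent, noise_level, rng) for _ in range(n_samples)]
        )
        mean = candidates.mean(axis=0)
        uncertainty = candidates.std(axis=0)
        # Refine only where the generator is self-inconsistent.
        mask = uncertainty > uncertainty_threshold
        latent = np.where(mask, mean, latent)
    return latent
```

In a real system the `denoise` call would be a partial forward/reverse pass of the pre-trained diffusion model, and the extra passes account for the roughly 50% additional NFEs mentioned in the TL;DR.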
Community
[TL;DR] We present a self-refining video sampling method that reuses a pre-trained video generator as a denoising autoencoder to iteratively refine latents. With ~50% additional NFEs, it improves physical realism (e.g., motion coherence and physics alignment) without any external verifier, training, or dataset.
Very interesting results!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PhyRPR: Training-Free Physics-Constrained Video Generation (2026)
- Inference-time Physics Alignment of Video Generative Models with Latent World Models (2026)
- InverseCrafter: Efficient Video ReCapture as a Latent Domain Inverse Problem (2025)
- CreativeVR: Diffusion-Prior-Guided Approach for Structure and Motion Restoration in Generative and Real Videos (2025)
- Light-X: Generative 4D Video Rendering with Camera and Illumination Control (2025)
- Scaling Zero-Shot Reference-to-Video Generation (2025)
- V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping (2025)
arXivlens breakdown of this paper: https://arxivlens.com/PaperView/Details/self-refining-video-sampling-7626-ab589746
- Executive Summary
- Detailed Breakdown
- Practical Applications