SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
Abstract
SVG-T2I, a scaled SVG framework, enables high-quality text-to-image synthesis directly in the Visual Foundation Model feature domain, achieving competitive performance in generative tasks.
Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.
Community
Hi everyone!
We're excited to introduce SVG-T2I, an experimental research project aimed at providing the community with a representation-based text-to-image generation framework for further exploration and study.
All code and model weights are fully open-sourced. If you find this work interesting or useful, we'd greatly appreciate your support with an Upvote on Hugging Face and a Star on GitHub!
Links:
- Hugging Face Paper: https://huggingface.co/papers/2512.11749
- Code: https://github.com/KlingTeam/SVG-T2I
- Model Weights: https://huggingface.co/KlingTeam/SVG-T2I
- arXiv: https://arxiv.org/abs/2512.11749
SVG-T2I is a pure VFM-based text-to-image generation framework that performs diffusion modeling directly in the representation space, completely removing the need for traditional VAEs.
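To make the idea concrete, here is a minimal, illustrative sketch of diffusion in a frozen VFM feature space rather than a VAE latent space. This is not the SVG-T2I codebase or its API; the encoder choice (DINOv2), the rectified-flow objective, and the feature-to-pixel decoder step are assumptions used purely for illustration:

```python
# Illustrative sketch only (not the official SVG-T2I implementation):
# run text-conditioned diffusion directly on frozen VFM features,
# with no KL-regularized VAE in the loop.
import torch
from transformers import AutoImageProcessor, AutoModel

# 1) A frozen VFM plays the role of the VAE encoder: images -> patch features.
#    "facebook/dinov2-base" is an assumed, stand-in encoder.
vfm = AutoModel.from_pretrained("facebook/dinov2-base").eval()
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def encode(images):
    inputs = processor(images=images, return_tensors="pt")
    # Patch tokens of shape (B, num_patches, hidden_dim); drop the CLS token.
    return vfm(**inputs).last_hidden_state[:, 1:, :]

# 2) A text-conditioned denoiser is trained on these features with a standard
#    diffusion-style objective (a rectified-flow loss is shown as one example).
def diffusion_loss(denoiser, feats, text_emb):
    noise = torch.randn_like(feats)
    t = torch.rand(feats.size(0), device=feats.device)
    # Linear interpolation between clean features and noise at time t.
    noisy = (1.0 - t.view(-1, 1, 1)) * feats + t.view(-1, 1, 1) * noise
    pred = denoiser(noisy, t, text_emb)   # hypothetical denoiser, predicts velocity
    target = noise - feats                # rectified-flow velocity target
    return torch.nn.functional.mse_loss(pred, target)

# 3) At inference, the sampled VFM features are mapped back to pixels by a
#    separately trained lightweight decoder (not shown here).
```

The point of the sketch is the division of labor: the VFM supplies a semantically rich latent space for free, the diffusion model only has to learn the text-to-feature mapping, and a small decoder handles feature-to-pixel reconstruction.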
The primary goal of this work is to validate the scalability and effectiveness of representation-based generation, while also providing the community with a fully open, end-to-end solution, including training code, inference and evaluation pipelines, and pre-trained checkpoints.
We hope this project can serve as a useful foundation for future research on representation-based generation and related directions.