arxiv:2512.11749

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Published on Dec 12 · Submitted by taesiri on Dec 15
#3 Paper of the day
Abstract

AI-generated summary

SVG-T2I, a scaled-up SVG framework, enables high-quality text-to-image synthesis directly in the Visual Foundation Model feature domain, achieving competitive performance on generative benchmarks.

Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.
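For readers who want a concrete picture of what "diffusion directly in the VFM feature domain" means, here is a minimal, hypothetical PyTorch sketch of a flow-matching-style training step in which the latent is a frozen VFM feature map rather than a VAE latent. All module names, shapes, and the specific objective are illustrative assumptions, not the official SVG-T2I implementation; see the open-source repository for the actual code.

```python
# Hypothetical sketch (not the official SVG-T2I code): one training step of a
# latent diffusion model whose "latent" is a frozen VFM feature map, not a VAE latent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in for a text-conditioned diffusion model operating on VFM feature tokens."""
    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, feat_dim)
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 4 * feat_dim),
            nn.GELU(),
            nn.Linear(4 * feat_dim, feat_dim),
        )

    def forward(self, noisy_feats, t, text_emb):
        # Broadcast the text conditioning over the token dimension; the timestep t is
        # ignored here purely to keep the sketch short.
        cond = self.text_proj(text_emb).unsqueeze(1)
        return self.net(noisy_feats + cond)

def training_step(vfm_encoder, denoiser, images, text_emb):
    """One flow-matching-style step: predict the velocity from noise toward VFM features."""
    with torch.no_grad():
        feats = vfm_encoder(images)                 # (B, n_tokens, feat_dim), frozen VFM
    noise = torch.randn_like(feats)
    t = torch.rand(feats.size(0), 1, 1)             # random time per sample
    noisy = (1 - t) * noise + t * feats             # point on the straight noise-to-feature path
    target = feats - noise                          # constant velocity along that path
    pred = denoiser(noisy, t, text_emb)
    return F.mse_loss(pred, target)

if __name__ == "__main__":
    batch, n_tokens, feat_dim, text_dim = 2, 16, 64, 32
    vfm_encoder = lambda x: torch.randn(x.size(0), n_tokens, feat_dim)  # frozen VFM stand-in
    denoiser = TinyDenoiser(feat_dim, text_dim)
    loss = training_step(vfm_encoder, denoiser,
                         torch.randn(batch, 3, 224, 224), torch.randn(batch, text_dim))
    print(float(loss))
```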

Community


Hi everyone! 👋
We're excited to introduce SVG-T2I, an experimental research project aimed at providing the community with a representation-based text-to-image generation framework for further exploration and study.

All code and model weights are fully open-sourced. If you find this work interesting or useful, we'd greatly appreciate your support with an Upvote on Hugging Face and a Star on GitHub ⭐🤗

Links:

SVG-T2I is a purely VFM-based text-to-image generation framework that performs diffusion modeling directly in the representation space, removing the need for a traditional VAE entirely.
The primary goal of this work is to validate the effectiveness and scalability of representation-based generation, while also providing the community with a fully open, end-to-end solution: training code, inference and evaluation pipelines, and pre-trained checkpoints.
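To illustrate the inference side under the same assumptions as the sketch above, the following hypothetical snippet samples a VFM feature map from noise with a simple Euler integrator and then maps it to pixels with a learned feature decoder, the component that takes the place of a VAE decoder in a conventional latent diffusion pipeline. The module names and shapes are placeholders, not the released SVG-T2I API.

```python
# Hypothetical sampling sketch: integrate a learned velocity field from noise to a
# VFM feature map, then decode features to pixels with a learned decoder (the VAE-free
# counterpart of a VAE decoder). Not the official SVG-T2I implementation.
import torch
import torch.nn as nn

@torch.no_grad()
def sample_features(denoiser, text_emb, n_tokens, feat_dim, steps=50):
    """Euler-integrate a learned velocity field from pure noise to a VFM feature map."""
    x = torch.randn(text_emb.size(0), n_tokens, feat_dim)
    for i in range(steps):
        t = torch.full((x.size(0), 1, 1), i / steps)  # current time along the noise-to-data path
        x = x + denoiser(x, t, text_emb) / steps      # one Euler step
    return x

if __name__ == "__main__":
    # Placeholder modules: in the real project these would be the trained diffusion
    # model and the feature-to-pixel decoder released with SVG-T2I.
    n_tokens, feat_dim, text_dim = 16, 64, 32
    denoiser = lambda x, t, c: torch.zeros_like(x)      # toy stand-in for the denoiser
    feature_decoder = nn.Linear(feat_dim, 3 * 14 * 14)  # toy stand-in for the pixel decoder
    feats = sample_features(denoiser, torch.randn(2, text_dim), n_tokens, feat_dim)
    pixels = feature_decoder(feats)                     # would be reshaped into image patches
    print(pixels.shape)
```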

We hope this project can serve as a useful foundation for future research on representation-based generation and related directions. 🚀
