UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
Abstract
UniT enables human-to-humanoid transfer by learning a unified, visually anchored action representation that bridges kinematic differences through a cross-reconstruction mechanism and a shared discrete latent space.
Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch merges these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By conditioning on the unified tokens to align cross-embodiment dynamics, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data translates seamlessly into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations showing human and humanoid features converging onto a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
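To make the tri-branch design concrete, here is a minimal, illustrative PyTorch sketch of such a tokenizer. It assumes MLP encoders over precomputed visual features and low-dimensional action vectors, with a standard VQ-VAE-style codebook for the discrete latent space; every module name, dimension, and loss weight below is hypothetical and chosen only to show the token flow, not taken from the paper.

```python
# Hypothetical sketch of a tri-branch cross-reconstruction tokenizer.
# Not the authors' implementation; all shapes and weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniTTokenizerSketch(nn.Module):
    def __init__(self, vis_dim=512, act_dim=32, latent_dim=128, codebook_size=256):
        super().__init__()
        # Per-modality encoders (assumed: precomputed visual features, raw actions).
        self.vis_enc = nn.Sequential(nn.Linear(vis_dim, latent_dim), nn.GELU())
        self.act_enc = nn.Sequential(nn.Linear(act_dim, latent_dim), nn.GELU())
        # Cross-reconstruction heads: action -> vision anchors kinematics to
        # physical outcomes; vision -> action filters visual confounders.
        self.act_to_vis = nn.Linear(latent_dim, vis_dim)
        self.vis_to_act = nn.Linear(latent_dim, act_dim)
        # Fusion branch feeding a discrete codebook of physical-intent tokens.
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def quantize(self, z):
        # Nearest-neighbor lookup into the shared discrete latent space.
        dists = torch.cdist(z, self.codebook.weight)          # (B, K)
        idx = dists.argmin(dim=-1)                            # token ids, (B,)
        z_q = self.codebook(idx)
        # Standard VQ-VAE codebook + commitment losses.
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        # Straight-through estimator so encoder gradients bypass the argmin.
        z_q = z + (z_q - z).detach()
        return z_q, idx, vq_loss

    def forward(self, vis_feat, action):
        h_v = self.vis_enc(vis_feat)
        h_a = self.act_enc(action)
        vis_pred = self.act_to_vis(h_a)        # branch 1: actions predict vision
        act_pred = self.vis_to_act(h_v)        # branch 2: vision reconstructs actions
        z = self.fuse(torch.cat([h_v, h_a], dim=-1))
        z_q, tokens, vq_loss = self.quantize(z)  # branch 3: fused discrete intent
        loss = (F.mse_loss(vis_pred, vis_feat)
                + F.mse_loss(act_pred, action)
                + vq_loss)
        return tokens, z_q, loss

# Usage with dummy batches (shapes are illustrative):
tok = UniTTokenizerSketch()
tokens, z_q, loss = tok(torch.randn(8, 512), torch.randn(8, 32))
```

Under this reading, human and humanoid batches would pass through the same codebook, so the emitted token ids play the role of the embodiment-agnostic physical intents that VLA-UniT predicts and WM-UniT conditions on.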
Community
the tri-branch cross-reconstruction and the shared discrete token space are slick, but i keep wondering how they scale when human and humanoid morphologies diverge a lot. the core assumption of universal visual consequences across embodiments is appealing, but the robustness of that alignment under large kinematic gaps or actuation delays isn’t clearly stress-tested. an ablation removing one branch would reveal which part of the cross-reconstruction actually drives the gains. btw the arxivlens breakdown helped me parse the method details, a nice quick map of the token flows here: https://arxivlens.com/PaperView/Details/unit-toward-a-unified-physical-language-for-human-to-humanoid-policy-learning-and-world-modeling-5072-682c0e25
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA (2026)
- Cross-Hand Latent Representation for Vision-Language-Action Models (2026)
- Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild (2026)
- JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy (2026)
- HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation (2026)
- LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment (2026)
- FutureVLA: Joint Visuomotor Prediction for Vision-Language-Action Model (2026)