Papers
arxiv:2604.19734

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Published on Apr 21
· Submitted by Yi Chen on Apr 24
#2 Paper of the day
Abstract

UniT enables human-to-humanoid transfer by creating a unified visual-language representation that bridges kinematic differences through cross-reconstruction mechanisms and shared latent spaces.

AI-generated summary

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
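The abstract's tri-branch token flow can be sketched in miniature. This is a minimal illustration, not the paper's implementation: the linear maps stand in for learned encoders/decoders, the dimensions and codebook are toy values, and the loss terms only mirror the described roles (action→vision anchoring, vision→action filtering, fusion into a discrete token).

```python
# Toy sketch of a UniT-style tri-branch cross-reconstruction step.
# Assumption: all shapes, weights, and losses here are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
D_VIS, D_ACT, D_LAT, K = 8, 4, 6, 16  # toy feature dims; K = codebook size

# Random linear maps as stand-ins for learned networks.
W_a2v = rng.normal(size=(D_ACT, D_VIS))           # action branch: predict visual outcome
W_v2a = rng.normal(size=(D_VIS, D_ACT))           # vision branch: reconstruct action
W_fuse = rng.normal(size=(D_VIS + D_ACT, D_LAT))  # fusion branch
codebook = rng.normal(size=(K, D_LAT))            # shared discrete latent space

def quantize(z):
    """Map a continuous latent to its nearest codebook entry (VQ step)."""
    idx = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    return idx, codebook[idx]

vision = rng.normal(size=D_VIS)   # e.g. an egocentric frame embedding
action = rng.normal(size=D_ACT)   # e.g. a human or humanoid action embedding

# 1) Action -> vision: anchor kinematics to their physical consequences.
loss_a2v = float(np.mean((action @ W_a2v - vision) ** 2))

# 2) Vision -> action: discard visual detail irrelevant to the action.
loss_v2a = float(np.mean((vision @ W_v2a - action) ** 2))

# 3) Fusion -> one discrete, embodiment-agnostic token.
z = np.concatenate([vision, action]) @ W_fuse
token_id, z_q = quantize(z)
loss_vq = float(np.mean((z - z_q) ** 2))

total_loss = loss_a2v + loss_v2a + loss_vq
print("token:", token_id, "loss:", round(total_loss, 3))
```

Both a human clip and a humanoid rollout would pass through the same codebook, which is what makes the resulting token stream embodiment-agnostic.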

Community


The tri-branch cross-reconstruction and the shared discrete token space are slick, but I keep wondering how they scale when human and humanoid morphologies diverge a lot. The core assumption of universal visual consequences across embodiments is appealing, but the robustness of that alignment under large kinematic gaps or actuation delays isn't clearly stress-tested. An ablation removing one branch would reveal which part of the cross-reconstruction actually drives the gains. BTW, the arxivlens breakdown helped me parse the method details; a nice quick map of the token flows here: https://arxivlens.com/PaperView/Details/unit-toward-a-unified-physical-language-for-human-to-humanoid-policy-learning-and-world-modeling-5072-682c0e25
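The branch ablation the comment asks for could be framed as a small sweep over per-branch loss weights. This harness is hypothetical (the branch names and weighting scheme are illustrative, not from the paper): each config zeroes one branch's loss, and each would require a retrain plus evaluation.

```python
# Hypothetical ablation harness: drop one branch's loss at a time.
# Assumption: branch names and the uniform 1.0 weights are illustrative only.
BRANCHES = ("action_to_vision", "vision_to_action", "fusion")

def loss_weights(ablated=None):
    """Per-branch loss weights, with one branch optionally zeroed out."""
    return {b: 0.0 if b == ablated else 1.0 for b in BRANCHES}

# Full model plus one run per removed branch.
configs = [("full", loss_weights())] + [
    (f"no_{b}", loss_weights(ablated=b)) for b in BRANCHES
]
for name, weights in configs:
    print(name, weights)
```

Comparing OOD generalization across these four runs would show whether the visual anchoring or the confounder-filtering branch carries most of the reported gains.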


Get this paper in your agent:

hf papers read 2604.19734
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper


Datasets citing this paper 0

No dataset linking this paper


Spaces citing this paper 0

No Space linking this paper


Collections including this paper 1