A demo video of our proposed Multimodal Interactive Omni-Avatar (M.I.O) framework.
Most existing digital humans remain primarily imitative, reproducing surface patterns of behavior without genuine interactive intelligence: the ability to generate real-time, emotionally coherent responses with a consistent personality across voice, facial expression, body motion, and appearance.
We model digital humans as autonomous agents with personality-consistent expression, adaptive interaction, and self-evolution, and propose a cascading paradigm composed of five modules: Thinker, Talker, Facial Animator, Body Animator, and Renderer. The Thinker performs contextual reasoning and control, while the remaining modules generate coordinated speech, facial motion, body motion, and final visual appearance in an end-to-end controllable manner.
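To make the cascade concrete, below is a minimal Python sketch of how the five modules might be chained. The module interfaces (`thinker.reason`, `talker.synthesize`, etc.) and the intermediate representations are our illustrative assumptions, not the framework's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical intermediate representations; the actual M.I.O
# interfaces are not specified in the text above.
@dataclass
class Plan:
    """Thinker output: what to say plus control signals for the avatar."""
    text: str
    emotion: str = "neutral"

@dataclass
class Speech:
    """Talker output: synthesized audio with timing for lip sync."""
    audio: bytes = b""
    phoneme_timings: list = field(default_factory=list)

def run_cascade(user_input, thinker, talker, facial_animator, body_animator, renderer):
    """Run the five-module cascade: every downstream stage conditions on the
    Thinker's plan so voice, face, and body stay emotionally coherent."""
    plan = thinker.reason(user_input)              # contextual reasoning and control
    speech = talker.synthesize(plan)               # speech conditioned on the plan
    face = facial_animator.animate(plan, speech)   # expression and lip motion
    body = body_animator.animate(plan, speech)     # gestures and body motion
    frames = renderer.render(face, body)           # final visual appearance
    return speech, frames
```

The cascading structure is what makes the pipeline end-to-end controllable: a single control signal from the Thinker propagates to every downstream generator.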
We further introduce a new benchmark for interactive intelligence that evaluates speech, expression, motion, visual style, and personality consistency. Together, these contributions move digital humans beyond superficial imitation toward truly intelligent interaction.
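As an illustration only, here is a sketch of how scores over the five evaluation axes might be aggregated; the axis names, score ranges, and equal weighting are our assumptions, not the benchmark's actual protocol.

```python
# Hypothetical scoring schema for the five benchmark axes; the real
# metrics and weighting are not specified in the text above.
BENCHMARK_AXES = ("speech", "expression", "motion", "visual_style", "personality_consistency")

def aggregate_score(scores: dict) -> float:
    """Equal-weight average over the five axes (scores assumed in [0, 1])."""
    missing = set(BENCHMARK_AXES) - set(scores)
    if missing:
        raise ValueError(f"missing axes: {missing}")
    return sum(scores[axis] for axis in BENCHMARK_AXES) / len(BENCHMARK_AXES)
```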
To be updated.