FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

Yiyi Cai¹, Yuhan Wu², Kunhang Li², You Zhou¹, Bo Zheng¹, Haiyang Liu²

¹ShandaAI Research Tokyo, ²The University of Tokyo

arXiv · Code · 🤗 Model · Real-time Demo · Video

FloodDiffusion is a framework for streaming human motion generation. Given time-varying text prompts, such as "raise knees" followed by "squats", it generates smooth, continuous human motion aligned with the text, with promising applications in interactive gaming.
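To make the streaming setting concrete, below is a minimal sketch of what such an interface could look like in Python. The class and method names (`StreamingMotionSession`, `push_prompt`, `next_frames`, `denoise_chunk`) are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

class StreamingMotionSession:
    """Illustrative only: holds the active prompt and yields motion frames."""

    def __init__(self, model):
        self.model = model   # hypothetical handle to a FloodDiffusion model
        self.prompt = None   # active text prompt; may change mid-stream

    def push_prompt(self, text: str) -> None:
        # Prompts are time-varying: the caller may switch at any frame.
        self.prompt = text

    def next_frames(self, n: int) -> np.ndarray:
        # Denoise the next n motion frames conditioned on the active prompt
        # (denoise_chunk is a placeholder for the model's streaming call).
        return self.model.denoise_chunk(self.prompt, n)

# Usage: switch prompts mid-stream; the generated motion stays continuous.
# session.push_prompt("raise knees"); clip_a = session.next_frames(30)
# session.push_prompt("squats");      clip_b = session.next_frames(30)
```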

Results with in-the-wild text prompts

FloodDiffusion can generate streaming human motion from time-varying text prompts.

Same text prompt, different input timesteps

For the same text prompts, FloodDiffusion can generate differently timed, text-aligned motions depending on the input timesteps (see the schedule sketch below).

Group 1: text prompts "walk forward", "sit down", and "stand up".
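As an illustration, one way to write down such schedules is as (onset frame, prompt) pairs; the tuple format and the onset values are made up for this sketch.

```python
# Same three prompts, two different input timesteps. FloodDiffusion aligns
# the motion to these onsets, so the two runs differ in timing but both
# follow the text.
schedule_a = [(0, "walk forward"), (60, "sit down"), (150, "stand up")]
schedule_b = [(0, "walk forward"), (120, "sit down"), (210, "stand up")]
```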

Stop by natural language command

FloodDiffusion can be stopped by a natural language command such as "stand"; otherwise, it keeps repeating the last command.
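The stopping policy described above can be sketched as a small dispatch function; the stop-word set and exact-text matching here are assumptions for illustration, not the repository's logic.

```python
STOP_COMMANDS = {"stand"}  # assumption: which commands count as "stop"

def active_command(incoming: str | None, last: str) -> str | None:
    """Sketch of the policy above: repeat the last command until a
    stop-like command arrives, then end the stream."""
    if incoming is None:
        return last           # no new input: keep repeating the last command
    if incoming in STOP_COMMANDS:
        return None           # stop command: signal the stream to end
    return incoming           # otherwise, switch to the new command
```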

Compare with non-streaming model

For each row, we compare the results of the non-streaming model MoMask (left) with those of FloodDiffusion (right).

Compare with other streaming models

For each row, we compare the results of PRIMAL (left) and MotionStreamer (middle) with those of our FloodDiffusion (right).

Ablation study

Comparison of two ablations: replacing bi-directional attention with causal attention (w. causal), and replacing the lower-triangular scheduler with a random one (w. random).
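To make the lower-triangular scheduler concrete, here is a sketch of the idea under our own assumptions (not the paper's exact schedule): each frame gets its own noise level, and at every denoising step earlier frames carry less noise than later ones, so frames are finalized left to right.

```python
import numpy as np

def lower_triangular_noise_levels(n_frames: int, n_steps: int) -> np.ndarray:
    """Sketch: per-frame noise levels over denoising steps. Rows are steps,
    columns are frames; noise falls with the step index and rises with the
    frame index, so frames finish denoising left to right."""
    assert n_steps >= 2
    levels = np.zeros((n_steps, n_frames))
    for s in range(n_steps):
        for f in range(n_frames):
            levels[s, f] = np.clip((f + 1) / n_frames - s / (n_steps - 1), 0.0, 1.0)
    return levels

# The "w. random" ablation would instead draw i.i.d. noise levels per frame,
# as in vanilla diffusion forcing.
```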