FloodDiffusion can generate streaming human motion from time-varying text prompts.
Given the same text prompt, FloodDiffusion can generate different time-aligned motion results under different input timesteps.
Group 1: text prompts "walk forward", "sit down", and "stand up".
FloodDiffusion can be stopped by a natural-language command such as "stand"; otherwise, it keeps repeating the last command.
In each row, we compare the results of FloodDiffusion (right) with those of the non-streaming model MoMask (left).
In each row, we compare the results of PRIMAL (left) and MotionStreamer (middle) with those of our FloodDiffusion (right).
Comparison between the variant without bi-directional attention (with causal attention instead) and the variant without the lower-triangular scheduler (with a random scheduler instead).
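For readers unfamiliar with the lower-triangular scheduler referenced above, the following is a minimal sketch of the underlying idea, not the authors' implementation: per-frame diffusion timesteps within the streaming window form a staircase in which earlier frames carry less noise than later ones, in contrast to a random per-frame assignment. The function name, shapes, and clamping rule here are illustrative assumptions.

```python
# Minimal sketch (not the FloodDiffusion code) of a lower-triangular-style
# per-frame timestep schedule for streaming diffusion: earlier frames in the
# window are denoised first, later frames lag behind by their offset.
import torch


def lower_triangular_timesteps(window_size: int, num_steps: int) -> torch.Tensor:
    """Return a (num_steps, window_size) matrix of per-frame timesteps.

    Row i gives each frame's denoising timestep at iteration i; frame f
    starts denoising only after frame f-1 has begun, producing a staircase
    of noise levels across the window (assumed behavior for illustration).
    """
    steps = torch.arange(num_steps).unsqueeze(1)      # (num_steps, 1)
    offsets = torch.arange(window_size).unsqueeze(0)  # (1, window_size)
    # Later frames lag earlier frames by their offset, clamped to [0, num_steps - 1].
    t = (num_steps - 1) - (steps - offsets)
    return t.clamp(0, num_steps - 1)


if __name__ == "__main__":
    # 4 denoising iterations over a window of 6 frames: earlier frames reach
    # timestep 0 (clean) first, later frames stay noisier for longer.
    print(lower_triangular_timesteps(window_size=6, num_steps=4))
```

In the ablated "w. random" variant referenced in the caption, this structured assignment would instead be replaced by independently sampled timesteps per frame.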