Text-based motion generation models are attracting growing interest for their potential to automate motion creation in the game, animation, and robotics industries. In this paper, we propose a diffusion-based motion synthesis and editing model named FLAME. Inspired by recent successes of diffusion models, we integrate diffusion-based generative modeling into the motion domain. FLAME can generate high-fidelity motions well aligned with the given text. It can also edit parts of a motion, both frame-wise and joint-wise, without any fine-tuning. FLAME uses a new transformer-based architecture we devise to better handle motion data, which proves crucial for managing variable-length motions and attending to free-form text. In experiments, we show that FLAME achieves state-of-the-art generation performance on three text-motion datasets: HumanML3D, BABEL, and KIT. We also demonstrate that the editing capability of FLAME extends to other tasks, such as motion prediction and motion in-betweening, which were previously handled by dedicated models.
FLAME learns the denoising process \(p_{\theta}\) from \(\boldsymbol{M}_{t}\) to \(\boldsymbol{M}_{t-1}\) at diffusion time-step \(t\). The input motion is projected and concatenated with a language pooler token (CLS), a motion-length token (ML), and a diffusion time-step token (TS) to form the input tokens of the transformer decoder. Additional language-side information is fed from a pre-trained, frozen language encoder as cross-attention context. FLAME outputs a \(2\cdot D_{mo}\)-dimensional sequence of vectors, as it predicts both the mean and variance of the noise at each diffusion time-step.
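The token layout and output split described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the dimensions, random stand-in projections, and helper names (`build_input_tokens`, `split_output`) are all hypothetical, and the transformer decoder itself is replaced by a single linear map just to show the shapes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper)
D_mo = 6      # per-frame motion feature dimension
D_model = 8   # transformer hidden size
n_frames = 4  # motion length in frames

# Stand-ins for learned parameters
W_proj = rng.normal(size=(D_mo, D_model))  # motion frame -> token projection
cls_tok = rng.normal(size=(1, D_model))    # language pooler (CLS) token
ml_tok = rng.normal(size=(1, D_model))     # motion-length (ML) token
ts_tok = rng.normal(size=(1, D_model))     # diffusion time-step (TS) token

def build_input_tokens(motion):
    """Concatenate [CLS, ML, TS] special tokens with projected motion frames."""
    frames = motion @ W_proj                              # (n_frames, D_model)
    return np.concatenate([cls_tok, ml_tok, ts_tok, frames], axis=0)

def split_output(decoder_out):
    """The 2*D_mo output per frame is split into noise mean and variance."""
    return decoder_out[:, :D_mo], decoder_out[:, D_mo:]

motion = rng.normal(size=(n_frames, D_mo))
tokens = build_input_tokens(motion)        # (3 + n_frames, D_model)

# Stand-in for the transformer decoder (with cross-attention omitted):
# map the per-frame tokens to 2*D_mo-dimensional outputs.
W_out = rng.normal(size=(D_model, 2 * D_mo))
frame_out = tokens[3:] @ W_out             # (n_frames, 2 * D_mo)
noise_mean, noise_var = split_output(frame_out)
```

In the real model the three special tokens and the cross-attention over the frozen language encoder's states carry the conditioning signal; here they only demonstrate how the sequence is assembled and how the doubled output dimension maps to the predicted mean and variance.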
@article{kim2022flame,
title={{FLAME}: Free-form language-based motion synthesis \& editing},
author={Kim, Jihoon and Kim, Jiseob and Choi, Sungjoon},
journal={arXiv preprint arXiv:2209.00349},
year={2022}
}