FLAME: Free-form Language-based Motion Synthesis & Editing

The 37th AAAI Conference on Artificial Intelligence, 2023

Jihoon Kim¹,² Jiseob Kim² Sungjoon Choi¹
Korea University¹
Kakao Brain²

Abstract

Text-based motion generation models are drawing a surge of interest for their potential to automate the motion-making process in the game, animation, and robot industries. In this paper, we propose FLAME, a diffusion-based motion synthesis and editing model. Inspired by the recent successes of diffusion models, we bring diffusion-based generative modeling to the motion domain. FLAME can generate high-fidelity motions that are well aligned with the given text. It can also edit parts of a motion, both frame-wise and joint-wise, without any fine-tuning. FLAME involves a new transformer-based architecture that we devise to better handle motion data; this architecture proves crucial for managing variable-length motions and attending well to free-form text. In experiments, we show that FLAME achieves state-of-the-art generation performance on three text-motion datasets: HumanML3D, BABEL, and KIT. We also demonstrate that the editing capability of FLAME extends to other tasks such as motion prediction and motion in-betweening, which were previously covered by dedicated models.

Architecture

FLAME learns the denoising process \(p_{\theta}\) from \(\boldsymbol{M}_{t}\) to \(\boldsymbol{M}_{t-1}\) at diffusion time-step \(t\). The input motion is projected and concatenated with a language pooler token (CLS), a motion-length token (ML), and a diffusion time-step token (TS) to form the input tokens for the transformer decoder. Additional language-side information is fed from a pre-trained, frozen language encoder as cross-attention context. FLAME outputs a sequence of \(2\cdot D_{mo}\)-dimensional vectors, since it predicts both the mean and the variance of the noise at each diffusion time-step.
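
The token layout described above can be illustrated with a small shape-level sketch. This is not the authors' implementation: the sizes, the random weights, the function names, the choice to place the three special tokens before the motion tokens, and the choice to read the output from the motion positions only are all assumptions made for illustration. The transformer decoder and the language encoder are replaced by a pass-through stand-in.

```python
import numpy as np

def assemble_input_tokens(motion, cls_tok, ml_tok, ts_tok, w_in):
    """Project motion frames and concatenate them with the CLS, ML and TS tokens."""
    motion_tokens = motion @ w_in                           # (T, D_model)
    special = np.stack([cls_tok, ml_tok, ts_tok], axis=0)   # (3, D_model)
    # Assumption: special tokens come first; the source only says "concatenated".
    return np.concatenate([special, motion_tokens], axis=0)  # (T + 3, D_model)

def split_output(decoder_out, w_out, d_mo, num_special=3):
    """Map motion positions to 2 * D_mo features and split them into two halves."""
    per_frame = decoder_out[num_special:] @ w_out           # (T, 2 * D_mo)
    return per_frame[:, :d_mo], per_frame[:, d_mo:]         # mean part, variance part

rng = np.random.default_rng(0)
T, D_MO, D_MODEL = 60, 64, 128          # hypothetical sizes
noisy_motion = rng.standard_normal((T, D_MO))               # stands in for M_t
cls_tok, ml_tok, ts_tok = (rng.standard_normal(D_MODEL) for _ in range(3))
w_in = rng.standard_normal((D_MO, D_MODEL))
w_out = rng.standard_normal((D_MODEL, 2 * D_MO))

tokens = assemble_input_tokens(noisy_motion, cls_tok, ml_tok, ts_tok, w_in)
decoder_out = tokens                    # stand-in for the transformer decoder
mean_part, var_part = split_output(decoder_out, w_out, D_MO)
```

The only point of the sketch is the bookkeeping: a motion of T frames becomes T + 3 input tokens, and each frame yields a \(2\cdot D_{mo}\)-dimensional output that is split into a mean half and a variance half.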

Text-to-Motion Generation

Left: "A person kicks with his right leg."
Right: "A person kicks with his left leg."

"A person walks forward and then bends down to pick up something."

"A person throws and catches an object."

Left: "Someone performs ballet dance."
Middle: "A person dances in the style of a ballerina."
Right: "A person spins while practicing a ballet routine."

(From left to right)

Left: "A person practices salsa dance."
Middle: "A person starts salsa dance."
Right: "Someone is doing the salsa dance."

Text-based Motion Editing

Lower-body fixed. (From left to right)

1: Reference motion
2: "A person is dancing."
3: "A person dribbles a ball."
4: "A person is clapping."

Lower-body fixed. (From left to right)

1: Reference motion
2: "A person throws and catches a ball."
3: "A person makes a phone call."
4: "A person is playing a violin."

Upper-body fixed. (From left to right)

1: Reference motion
2: "A person walks forward."
3: "A person walks backward."
4: "A person jumps several times."
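
Joint-wise editing without fine-tuning can be pictured as masked blending during sampling. The sketch below assumes an inpainting-style scheme: at a given reverse step, the features to keep are taken from a noised copy of the reference motion and the remaining features come from the model's text-conditioned sample. The page does not spell out the procedure, so the scheme itself, the function names, the column indices for the lower body, and the array sizes are all hypothetical.

```python
import numpy as np

def q_sample(x0, alpha_bar, rng):
    """Forward-diffuse clean motion x0 to the noise level given by alpha_bar."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

def blend_with_reference(generated, reference, keep_mask, alpha_bar, rng):
    """Keep masked features from the (noised) reference, the rest from the model."""
    known = q_sample(reference, alpha_bar, rng)
    return keep_mask * known + (1.0 - keep_mask) * generated

rng = np.random.default_rng(0)
T, D = 40, 12                           # hypothetical frames x features
LOWER_BODY_COLS = [0, 1, 2, 3, 4, 5]    # hypothetical lower-body feature columns
UPPER_BODY_COLS = [6, 7, 8, 9, 10, 11]  # hypothetical upper-body feature columns

reference = rng.standard_normal((T, D))  # stands in for the reference motion
generated = rng.standard_normal((T, D))  # stands in for one denoised sample

keep_mask = np.zeros((T, D))
keep_mask[:, LOWER_BODY_COLS] = 1.0      # "lower-body fixed"

# alpha_bar = 1.0 corresponds to the final, noise-free step of sampling.
edited = blend_with_reference(generated, reference, keep_mask, alpha_bar=1.0, rng=rng)
```

In a full sampler this blend would be applied after every denoising step with the matching noise level; swapping the column sets gives the "upper-body fixed" case.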

Motion Prediction & In-betweening

Motion Prediction

Reference: Green
Left: Reference motion
Middle: Motion prediction without prompt
Right: Motion prediction with prompt "a person is clapping."

Motion In-betweening

Reference: Green
Left: Reference motion
Middle: Motion in-betweening without prompt
Right: Motion in-betweening with prompt "a person is clapping."
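
Under the assumption that prediction and in-betweening are treated as frame-wise editing, the two tasks differ only in which frames are held fixed. The sketch below builds the corresponding frame masks; the function names, frame counts, and feature size are hypothetical and are not taken from the page.

```python
import numpy as np

def prediction_mask(num_frames, num_features, num_given):
    """Fix the first num_given frames; the remaining frames are to be generated."""
    mask = np.zeros((num_frames, num_features))
    mask[:num_given] = 1.0
    return mask

def inbetweening_mask(num_frames, num_features, num_head, num_tail):
    """Fix frames at both ends; the frames in between are to be generated."""
    mask = np.zeros((num_frames, num_features))
    mask[:num_head] = 1.0
    if num_tail > 0:
        mask[num_frames - num_tail:] = 1.0
    return mask

T, D = 60, 12                            # hypothetical frames x features
pred = prediction_mask(T, D, num_given=20)
inbetween = inbetweening_mask(T, D, num_head=10, num_tail=10)

fixed_pred_frames = int(pred[:, 0].sum())
fixed_inbetween_frames = int(inbetween[:, 0].sum())
```

A mask value of 1 marks a frame copied from the reference motion and 0 marks a frame left to the model, with or without a text prompt.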
