DeepGesture - Speech-Driven Periodic Gesture Generation for Semantic and Phonetic Language

ACM ICMI 2024
DSc of Computer Science, VNUHCM-University of Science, Ho Chi Minh City, Vietnam

Stylized gestures synthesized by DeepGesture, conditioned on four different text prompts.

Abstract

Gesture generation is a pivotal aspect of multimodal communication systems, enhancing the naturalness and contextuality of human-machine interaction. While significant progress has been made in this field for various languages, gesture generation tailored specifically to Vietnamese remains underexplored. This paper introduces DeepGesture, a novel diffusion-based gesture generation model designed for the Vietnamese language. Unlike traditional approaches that rely on English-centric models, DeepGesture leverages Vietnamese-specific tools and techniques: it integrates the PhoBERT language model and the VnCoreNLP toolkit for text conditioning, and it employs a Vietnamese version of WavLM, fine-tuned on a custom Vietnamese audio dataset, for speech conditioning. This work emphasizes the importance of language-specific adaptation in gesture generation systems and opens avenues for similar research in other languages.
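
To make the conditioning pipeline concrete, the following is a minimal sketch, not the authors' released code, of how Vietnamese text and speech could be encoded with the public vinai/phobert-base and microsoft/wavlm-base-plus checkpoints from Hugging Face Transformers. DeepGesture fine-tunes a Vietnamese WavLM variant, which is not shown here; the example text and all shapes are placeholders.

import torch
from transformers import AutoModel, AutoTokenizer, Wav2Vec2FeatureExtractor, WavLMModel

# PhoBERT expects word-segmented Vietnamese input (e.g. produced by VnCoreNLP),
# with multi-syllable words joined by underscores.
text = "chúng_tôi giới_thiệu một mô_hình sinh cử_chỉ"
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
phobert = AutoModel.from_pretrained("vinai/phobert-base")
with torch.no_grad():
    text_feats = phobert(**tokenizer(text, return_tensors="pt")).last_hidden_state  # (1, T_text, 768)

# WavLM features from 16 kHz speech; the base English checkpoint is loaded here
# purely for illustration, standing in for the fine-tuned Vietnamese model.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
waveform = torch.randn(16000 * 4)  # stand-in for 4 seconds of speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_feats = wavlm(**inputs).last_hidden_state  # (1, T_audio, 768)

# Both feature sequences would then be aligned to the gesture frame rate and
# passed to the diffusion denoiser as conditioning, alongside the noisy gestures.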


DeepGesture takes the audio and transcript of a speech as input and synthesizes realistic, stylized full-body gestures that align with the speech content both rhythmically and semantically. A desired style can be described with a short piece of text (a text prompt), a video clip (a video prompt), or a motion sequence (a motion prompt), and the generated gestures embody that style as closely as possible. Furthermore, the system can be extended to control the style of individual body parts through noise combination.
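
As an illustration only (none of the module names or dimensions below come from the paper), the sketch shows one way the three prompt modalities could be projected into a single shared style-embedding space that conditions generation.

import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    # Toy stand-in: each prompt modality gets its own projection into a shared
    # style space. All feature dimensions below are illustrative placeholders.
    def __init__(self, style_dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(768, style_dim)    # e.g. token features of a text prompt
        self.motion_proj = nn.Linear(165, style_dim)  # e.g. per-frame pose features of a motion prompt
        self.video_proj = nn.Linear(512, style_dim)   # e.g. per-frame features of a video prompt

    def forward(self, text=None, motion=None, video=None):
        if text is not None:
            return self.text_proj(text).mean(dim=1)
        if motion is not None:
            return self.motion_proj(motion).mean(dim=1)
        if video is not None:
            return self.video_proj(video).mean(dim=1)
        raise ValueError("Provide exactly one prompt modality")

encoder = StyleEncoder()
style = encoder(motion=torch.randn(1, 120, 165))  # a 120-frame motion prompt
print(style.shape)  # torch.Size([1, 256])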

We conduct an extensive set of experiments to evaluate our framework. Our system outperforms all baselines both qualitatively and quantitatively, as evidenced by the FGD, SRGR, SC, and SRA metrics and by user study results.
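
For reference, FGD (Fréchet Gesture Distance) compares the distribution of latent features of generated gestures with that of real gestures. The sketch below is a generic implementation of this distance, assuming feature vectors that would normally come from a pretrained gesture feature extractor (not shown).

import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats, gen_feats):
    # real_feats, gen_feats: (N, D) arrays of latent gesture features.
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Example with random stand-in features:
print(frechet_gesture_distance(np.random.randn(200, 32), np.random.randn(200, 32)))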

Control with Multimodal Prompts

Our system accepts text, motion, and video prompts as style descriptors and generates realistic gestures in the styles specified by those prompts. Some results are shown below.

Text Prompt

Motion Prompt

(The left video is the motion prompt, and the right video shows the results.)

Body Part-Level Style Control

Our system allows fine-grained style control over individual body parts through noise combination. We use different prompts to control the styles of different body parts; the resulting motions exhibit these styles while maintaining natural coordination among the parts. A sketch of the noise-combination idea is given below.

(The yellow-highlighted text is the action guidance added by the LLM.)
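
The following is a schematic sketch of noise combination, under our own assumptions rather than the paper's exact formulation: at each denoising step the denoiser is evaluated once per style prompt, and the per-prompt noise predictions are blended with binary masks that select the pose channels of each body part.

import torch

def combine_noise(denoiser, x_t, t, cond, style_embeddings, part_masks):
    # x_t: (B, T, D) noisy gesture sequence; part_masks: list of (D,) 0/1 masks
    # that partition the pose channels; one style embedding per body part.
    combined = torch.zeros_like(x_t)
    for style, mask in zip(style_embeddings, part_masks):
        eps = denoiser(x_t, t, cond, style)  # noise prediction under this style
        combined = combined + eps * mask     # keep only this part's channels
    return combined                          # plugged back into the DDPM/DDIM update

# Toy usage with a dummy denoiser and an upper-/lower-body split of 4 channels:
dummy_denoiser = lambda x, t, c, s: torch.randn_like(x)
masks = [torch.tensor([1.0, 1.0, 0.0, 0.0]), torch.tensor([0.0, 0.0, 1.0, 1.0])]
out = combine_noise(dummy_denoiser, torch.randn(1, 8, 4), t=10, cond=None,
                    style_embeddings=["style_a", "style_b"], part_masks=masks)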

BibTeX


@inproceedings{Thanh2024DeepGesture,
  author    = {Thanh, Ngoc},
  title     = {DeepGesture - Speech-Driven Periodic Gesture Generation for Semantic and Phonetic Language},
  booktitle = {Proceedings of the ACM International Conference on Multimodal Interaction (ICMI)},
  year      = {2024},
  doi       = {},
  publisher = {ACM},
  address   = {New York, NY, USA},
  keywords  = {co-speech gesture synthesis, Vietnamese, multi-modality, diffusion models}
}