Gesture generation is a pivotal aspect of multimodal communication systems, enhancing the naturalness and contextual relevance of human-machine interaction. While significant progress has been made in this field for various languages, gesture generation tailored specifically to Vietnamese remains underexplored. This paper introduces DeepGesture, a novel diffusion-based gesture generation model designed for the Vietnamese language. Unlike traditional approaches that rely on English-centric models, DeepGesture leverages Vietnamese-specific tools and techniques: it builds on diffusion-based gesture generation and integrates Vietnamese language models such as PhoBERT together with the VnCoreNLP toolkit. In addition, DeepGesture employs a version of WavLM fine-tuned on a custom Vietnamese audio dataset. This work underscores the importance of language-specific adaptation in gesture generation systems and opens avenues for similar research in other languages.
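To make the Vietnamese text-feature pipeline concrete, the sketch below extracts a transcript embedding with PhoBERT from the Hugging Face hub. PhoBERT expects word-segmented input (normally produced by VnCoreNLP's `wseg` annotator); here the helper `segment_words` is a hypothetical stand-in for that step, and the mean-pooling choice is an illustrative assumption, not the model's documented code.

```python
# Minimal sketch: Vietnamese transcript features for gesture conditioning.
# Assumes the `transformers` library; `segment_words` is a hypothetical
# placeholder for a VnCoreNLP word-segmentation call.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
phobert = AutoModel.from_pretrained("vinai/phobert-base")

def segment_words(sentence: str) -> str:
    # Placeholder: PhoBERT expects word-segmented text, e.g. from VnCoreNLP,
    # where multi-syllable words are joined by underscores ("Hà_Nội").
    return sentence

@torch.no_grad()
def transcript_embedding(sentence: str) -> torch.Tensor:
    inputs = tokenizer(segment_words(sentence), return_tensors="pt")
    hidden = phobert(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)           # pooled sentence vector (768,)

print(transcript_embedding("Tôi rất vui hôm nay").shape)  # torch.Size([768])
```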
DeepGesture takes the audio and transcript of a speech as input and synthesizes realistic, stylized full-body gestures that align with the speech content both rhythmically and semantically. A desired style can be described with a short piece of text (a text prompt), a video clip (a video prompt), or a motion sequence (a motion prompt); the generated gestures then embody that style as closely as possible. Furthermore, the system can be extended to control the style of individual body parts through noise combination.
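As a rough illustration of how a diffusion model turns these conditions into motion, the sketch below runs a standard DDPM reverse process conditioned on speech features and a style embedding. The `GestureDenoiser`-style call signature, tensor shapes, and fixed linear noise schedule are all assumptions for illustration, not the system's actual architecture.

```python
# Illustrative DDPM-style ancestral sampling for speech-conditioned gesture
# synthesis. `denoiser` is a hypothetical network that predicts the noise in
# a motion sequence given (noisy motion, timestep, speech features, style).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)     # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_gesture(denoiser, speech_feats, style_emb, n_frames, pose_dim):
    x = torch.randn(1, n_frames, pose_dim)             # start from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, t, speech_feats, style_emb)  # predicted noise
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                           # (1, n_frames, pose_dim)
```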
We conduct an extensive set of experiments to evaluate our framework. Our system outperforms all baselines both qualitatively and quantitatively, as measured by the FGD, SRGR, SC, and SRA metrics and confirmed by a user study.
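For reference, FGD (Fréchet Gesture Distance) follows the same formula as FID: a Fréchet distance between Gaussian fits of real and generated gesture features, typically taken from a pretrained motion autoencoder. The sketch below assumes the features have already been extracted and shows only the distance computation.

```python
# Fréchet Gesture Distance between two sets of latent gesture features.
# Mirrors the FID formula: ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}).
import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats: np.ndarray,
                             gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):        # drop tiny imaginary numerical residue
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```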
Our system accepts text, motion, and video prompts as style descriptors and generates realistic gestures in the styles required by the corresponding prompts. Some results are shown below.
Text Prompt: “the person is angry.”
Text Prompt: “the person is happy and excited.”
Text Prompt: “the person is sad.”
Text Prompt: “a person is holding a cup of coffee in the right hand.”
Text Prompt: “the person is playing the guitar.”
Text Prompt: “standing like a boxer.”
(The left video is the motion prompt, and the right video shows the results.)
Motion Prompt: “high right hand and low left hand.”
Motion Prompt: “sitting.”
Our system allows fine-grained style control of individual body parts via noise combination. We use different prompts to control the styles of different body parts; the resulting motions exhibit these styles while maintaining natural coordination among the body parts (see the sketch after the note below).
(The text highlighted in yellow is action guidance added by the LLM.)
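One plausible reading of the noise-combination mechanism is sketched below: at each denoising step, the denoiser is run once per prompt, and the predicted noises are blended with per-joint body-part masks before the sampler update. The two-prompt setup, the mask layout, and the `denoiser` signature are illustrative assumptions, e.g. one mask covering the upper body driven by “playing the guitar” and the complementary mask driven by “sitting”.

```python
# Hedged sketch of per-body-part style control via noise combination:
# blend noise predictions from different style prompts using joint masks.
import torch

def combined_noise(denoiser, x_t, t, speech_feats, style_embs, part_masks):
    """style_embs: list of style embeddings, one per prompt.
    part_masks:   list of (pose_dim,) 0/1 masks selecting the joints each
                  prompt controls; the masks are assumed to partition the pose."""
    eps = torch.zeros_like(x_t)
    for style, mask in zip(style_embs, part_masks):
        eps_i = denoiser(x_t, t, speech_feats, style)  # full-body prediction
        eps = eps + eps_i * mask                       # keep masked joints only
    return eps  # used in place of a single-prompt noise in the sampler
```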
@article{Thanh2024DeepGesture,
  author     = {Thanh, Ngoc},
  title      = {DeepGesture - Speech-Driven Periodic Gesture Generation for Semantic and Phonetic},
  journal    = {ACM Trans. Graph.},
  issue_date = {August 2024},
  numpages   = {18},
  doi        = {},
  publisher  = {ACM},
  address    = {New York, NY, USA},
  keywords   = {co-speech gesture synthesis, vietnamese, multi-modality, diffusion models}
}