DeepGesture: A conversational gesture synthesis system based on emotions and semantic

Abstract

Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals - text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces novel architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. A lightweight Transformer backbone combines full self-attention and cross-local attention for effective feature fusion of heterogeneous modalities. To visualize results, we implement a full rendering pipeline in Unity based on BVH output from the model. Evaluation on the ZeroEGGS dataset shows that DeepGesture produces gestures with improved human-likeness and contextual appropriateness, outperforming baselines on Mean Opinion Score and Fréchet Gesture Distance metrics. Our system supports interpolation between emotional states and demonstrates generalization to out-of-distribution speech, including synthetic voices - marking a step forward toward fully multimodal, emotionally aware digital humans.

Through experimental validation and qualitative analysis, the DeepGesture model - extending DiffuseStyleGesture - demonstrates the ability to generate realistic gestures for both in-distribution and out-of-distribution speech, including synthetic voices such as that of Steve Jobs. This highlights the promise of diffusion-based approaches in modeling complex, expressive, and low-frequency gesture behaviors.

Additionally, we contribute open-source code, including rendering pipelines and data processing tools built on Unity, available on GitHub. These resources provide a solid foundation for future development and reproducibility. The integration of text alongside speech and emotion into the gesture generation pipeline marks an important step toward building fully multimodal agents, capable of more intuitive and human-like interactions in diverse application domains.

Classifier-free Guidance of speech, text and emotion

Similarities

Uses the Diffusion model on gesture data \(\mathbf{x}^{1:M \times D}\), with \(M\) temporal frames and \(D = 1141\) representing motion coordinates per frame (analogous to image width and height).
Uses conditional Diffusion with \(\mathbf{x}_0\) objective.
In stages 4. Feature Encoding and 6. Feature Decoding in \(256\).

Differences

Conditional gesture generation:
- Emotional condition: \(c = [\mathbf{s}, \mathbf{e}, \mathbf{a}, \mathbf{v}]\) and \(c_{\varnothing} = [\varnothing, \varnothing, \mathbf{a}, \mathbf{v}]\).
- Emotion state interpolation between \(\mathbf{e}_1, \mathbf{e}_2\) using: \(c = [\mathbf{s}, \mathbf{e}_1, \mathbf{a}, \mathbf{v}]\) and \(c_{\varnothing} = [\mathbf{s}, \mathbf{e}_2, \mathbf{a}, \mathbf{v}]\).
We use uses Self-Attention: learning the relationship between emotions, seed gestures, and each frame (similar to DALL·E 2's text-image alignment).
We concatenates speech and text (analogous to ControlNet's pixel-wise condition).

DeepGesture: A conversational gesture synthesis system based on emotions and semantic

DeepGesture model architecture overview.

Abstract

Classifier-free Guidance of speech, text and emotion

Similarities

Differences

Emotion-guidance Gesture Generation

Speech-guidance Gesture Generation

Gesture Generation Result Sample

BibTeX