Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels
Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advances in powerful video generation foundation models. Moving beyond monologues, authentic human communication is an inherently full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios; however, relying on strict frame-to-frame alignment makes the model's response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the distinct temporal scale discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel that explicitly injects this physical intuition into the model as a progressive temporal inductive bias. Building on this, we construct a full-duplex interactive virtual agent capable of simultaneously processing dual-stream audio inputs for both talking and listening. Furthermore, we introduce VoxHear, a rigorously cleaned talking-listening dataset featuring fully decoupled speech and background audio tracks. Extensive experiments demonstrate that our approach successfully fuses strong temporal alignment with deep contextual semantics, setting a new state of the art for generating highly natural and responsive full-duplex interactive digital humans.
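The multi-head Gaussian kernel idea can be illustrated with a minimal sketch (all function names, shapes, and the audio-to-video position mapping below are our own illustrative assumptions, not the paper's implementation): each attention head receives a Gaussian temporal bias with its own bandwidth, so narrow-bandwidth heads stay near frame-aligned audio (preserving lip sync) while wide-bandwidth heads can attend to long-range conversational context.

```python
import numpy as np

def gaussian_kernel_bias(n_video, n_audio, sigmas):
    """Per-head Gaussian bias favoring audio frames near each video frame.

    Hypothetical sketch: head h gets bias -(d_ij^2) / (2 * sigma_h^2),
    where d_ij is the temporal offset between video frame i and audio
    frame j after rescaling audio positions onto the video timeline.
    Small sigmas keep attention local; large sigmas widen the context.
    """
    v = np.arange(n_video)[:, None]                    # video frame positions
    a = np.linspace(0, n_video - 1, n_audio)[None, :]  # audio mapped to video timeline
    d2 = (v - a) ** 2                                  # squared temporal offsets
    sig = np.asarray(sigmas, dtype=float)[:, None, None]
    return -d2[None] / (2.0 * sig ** 2)                # (heads, n_video, n_audio)

def biased_cross_attention(q, k, v, sigmas):
    """Audio-visual cross-attention with the Gaussian bias on the logits.

    q: (H, Tv, D) video queries; k, v: (H, Ta, D) audio keys/values.
    """
    H, Tv, D = q.shape
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(D)        # (H, Tv, Ta)
    logits += gaussian_kernel_bias(Tv, k.shape[1], sigmas)  # inject temporal prior
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                          # softmax over audio frames
    return w @ v
```

With a small sigma the softmax of the bias is sharply peaked at the temporally aligned audio frame, which is the sense in which per-head bandwidths form a progressive local-to-global inductive bias.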
Talking Portrait Video Generation
1.1 Various Styles
1.2 Normal
Talking-Listening Interactive Video Generation
2.1 Interactive Pairs
2.2 Various Styles
Long-Form Video Generation
Demonstrating the stability and temporal consistency of our model over extended generation durations.
Comparison on Talking Portrait Video Generation
Each sample is generated by six models from identical input. Videos are ordered by native aspect ratio: the most portrait-shaped baseline sits leftmost, the most square-shaped next to it, and Ours is always placed rightmost (highlighted). All videos share the same height so their shapes are directly comparable. Click Play All to start the whole group in sync; only Ours carries audio.
Comparison on Talking-Listening Interactive Video Generation
Comparison of Audio-Visual Cross-Attention Designs
Qualitative comparison of different audio–visual cross-attention designs. Our design (leftmost) uses 3D Cross-Attention with 1D RoPE and Multi-Head Gated Kernels (MHGK).
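The 1D RoPE component named in this comparison can be sketched as follows (a generic rotary-position-embedding sketch under our own assumptions, not the authors' code): each query/key feature pair is rotated by an angle proportional to its temporal position, so attention logits depend on relative rather than absolute frame offsets.

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Apply 1D rotary position embedding to x of shape (T, D), D even.

    Each of the D/2 feature pairs is rotated by angle position * freq,
    with per-pair frequencies following the usual geometric schedule.
    """
    T, D = x.shape
    half = D // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    ang = np.asarray(positions, dtype=float)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied independently to each (x1_i, x2_i) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
```

Because the per-pair rotations are orthogonal, shifting all positions by the same offset leaves query-key dot products unchanged, which is the relative-position property that makes RoPE attractive for temporal alignment.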
Dataset
Our collected conversational dataset contains paired dialogue videos. Each pair captures two participants in a conversation, enabling training and evaluation of interactive talking-listening avatar generation. Below are 10 sample pairs.
BibTeX
@misc{weng2026monologueinteractivetalkinglisteningavatar,
      title={Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels},
      author={Yuzhe Weng and Haotian Wang and Xinyi Yu and Xiaoyan Wu and Haoran Xu and Shan He and Jun Du},
      year={2026},
      eprint={2604.10367},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.10367},
}