Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

Yuzhe Weng1,*, Haotian Wang1,*, Xinyi Yu1, Xiaoyan Wu2, Haoran Xu2, Shan He2, Jun Du1,†

1 University of Science and Technology of China  ·  2 iFLYTEK

* Equal contribution  ·  † Corresponding author

Abstract

Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advances in powerful video generation foundation models. Moving beyond monologues, authentic human communication is inherently a full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios. However, relying on strict frame-to-frame alignment makes the model's response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the temporal scale discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel that explicitly injects this physical intuition into the model as a progressive temporal inductive bias. Building on this, we construct a full-duplex interactive virtual agent capable of simultaneously processing dual-stream audio inputs for both talking and listening. Furthermore, we introduce VoxHear, a rigorously cleaned talking-listening dataset featuring perfectly decoupled speech and background audio tracks. Extensive experiments demonstrate that our approach successfully fuses strong temporal alignment with deep contextual semantics, setting a new state of the art for generating highly natural and responsive full-duplex interactive digital humans.
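To make the idea of a multi-head Gaussian kernel as a temporal inductive bias more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the kernel is applied as an additive log-domain bias on audio-visual cross-attention logits, with one bandwidth per head. The function names, the bandwidth values in `SIGMAS`, and the linear mapping between video and audio timelines are all hypothetical choices made for illustration. A small bandwidth keeps a head close to frame-to-frame alignment (useful for lip synchronization), while a large bandwidth lets a head attend to a wider conversational context (useful for listening reactions).

```python
import math

def gaussian_bias(num_video, num_audio, sigma):
    """Return a log-domain Gaussian bias: -(t_audio - center)^2 / (2 * sigma^2)."""
    bias = []
    for i in range(num_video):
        # Assumption: video and audio cover the same clip, so frame i maps
        # linearly onto the audio timeline.
        center = i * (num_audio - 1) / max(num_video - 1, 1)
        bias.append([-((j - center) ** 2) / (2.0 * sigma ** 2)
                     for j in range(num_audio)])
    return bias

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention(logits, sigma):
    """Add the Gaussian bias to raw attention logits, then normalize each row."""
    bias = gaussian_bias(len(logits), len(logits[0]), sigma)
    return [softmax([l + b for l, b in zip(l_row, b_row)])
            for l_row, b_row in zip(logits, bias)]

def multi_head_weights(logits_per_head, sigmas):
    """Apply one bandwidth per head (hypothetical multi-head arrangement)."""
    return [biased_attention(logits, sigma)
            for logits, sigma in zip(logits_per_head, sigmas)]

# Hypothetical bandwidths, in frames: narrow, medium, wide.
SIGMAS = [0.5, 2.0, 8.0]
T = 9  # toy sequence length for both video and audio

# Flat logits isolate the effect of the kernel itself.
uniform_logits = [[0.0] * T for _ in range(T)]
weights = multi_head_weights([uniform_logits] * len(SIGMAS), SIGMAS)

for sigma, head in zip(SIGMAS, weights):
    middle_row = head[T // 2]
    print(f"sigma={sigma}: weight on aligned frame = {middle_row[T // 2]:.3f}")
```

In this toy setting the narrow head concentrates most of its attention mass on the temporally aligned audio frame, while the wide head spreads it across nearly the whole window, which mirrors the scale discrepancy between talking and listening described above.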

1 Talking Portrait Video Generation

1.1 Various Styles

3D · Standing Pose
Non-Human · Adult · 3D
Non-Human · Adult · 3D
Non-Human · Teen · 3D
Non-Human · Young · 3D

1.2 Normal

Female · Adult · 3D
Female · Adult · Photo
Female · Young · Photo
Male · Adult · CG
Male · Elder · 3D
Male · Elder · Illustration
Male · Elder · Photo
Male · Young · Photo
AI-Generated · Realistic
AI-Generated · Professional

2 Talking-Listening Interactive Video Generation

2.1 Interactive Pairs

Pair 1 · YouTuber Style
Left · Female
Right · Male
Pair 2 · Cartoon Style
Left · Female
Right · Male

2.2 Various Styles

3 Long-Form Video Generation

Demonstrating the stability and temporal consistency of our model over extended generation durations.

4 Comparison on Talking Portrait Video Generation

Each sample is generated by six models from the same input. Videos are ordered by native aspect ratio: baselines run from the most portrait-shaped on the left to the most square-shaped on the right, followed by Ours, which is always at the rightmost position (highlighted). All videos share the same height so that shapes remain directly comparable. Only Ours carries audio.

Sample 1
OmniAvatar
TalkVerse
Echomimic v3
Hallo3
StableAvatar
Ours
Sample 2
Echomimic v3
TalkVerse
Hallo3
OmniAvatar
StableAvatar
Ours

5 Comparison on Talking-Listening Interactive Video Generation

Sample: cmp-1
StreamAvatar
INFP
Ours
Sample: listen-2
Speaker
L2L
DIM
RLHG
INFP
Ours

6 Comparison of Audio-Visual Cross-Attention Designs

Qualitative comparison of different audio-visual cross-attention designs. Our design (leftmost) uses 3D Cross-Attention with 1D RoPE and Multi-Head Gaussian Kernels (MHGK).

Ours · 3D CrossAttn + 1D RoPE + MHGK
2D Cross-Attention
MHGK w/ CFG

7 Dataset

Our collected conversational dataset contains paired dialogue videos. Each pair captures two participants in a conversation, enabling training and evaluation of interactive talking-listening avatar generation. Below are 10 sample pairs.

Pair 1 · S0061
Person A
Person B
Pair 2 · S0086
Person A
Person B
Pair 3 · S0091
Person A
Person B
Pair 4 · S0102
Person A
Person B
Pair 5 · S0262
Person A
Person B
Pair 6 · S0269
Person A
Person B
Pair 7 · S0298
Person A
Person B
Pair 8 · S0336
Person A
Person B
Pair 9 · S0442
Person A
Person B
Pair 10 · S0491
Person A
Person B

BibTeX

@misc{weng2026monologueinteractivetalkinglisteningavatar,
      title={Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels},
      author={Yuzhe Weng and Haotian Wang and Xinyi Yu and Xiaoyan Wu and Haoran Xu and Shan He and Jun Du},
      year={2026},
      eprint={2604.10367},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.10367},
}