Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels
Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advances in powerful video generation foundation models. Moving beyond monologues, authentic human communication is an inherently full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios; however, relying on strict frame-to-frame alignment makes the model's response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the distinct temporal scale discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel that explicitly injects this physical intuition into the model as a progressive temporal inductive bias. Building on this, we construct a full-duplex interactive virtual agent capable of simultaneously processing dual-stream audio inputs for both talking and listening. Furthermore, we introduce VoxHear, a rigorously cleaned talking-listening dataset featuring fully decoupled speech and background audio tracks. Extensive experiments demonstrate that our approach successfully fuses strong temporal alignment with deep contextual semantics, setting a new state of the art for generating highly natural and responsive full-duplex interactive digital humans.
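The multi-head Gaussian kernel idea can be illustrated with a minimal sketch (all function names, shapes, and the audio-to-video position mapping below are our own illustrative assumptions, not the paper's implementation): each attention head receives a Gaussian temporal bias with its own bandwidth, so narrow-bandwidth heads stay near frame-aligned audio (preserving lip sync) while wide-bandwidth heads can attend to long-range conversational context.

```python
import numpy as np

def gaussian_kernel_bias(n_video, n_audio, sigmas):
    """Per-head Gaussian bias favoring audio frames near each video frame.

    Hypothetical sketch: head h gets bias -(d_ij^2) / (2 * sigma_h^2),
    where d_ij is the temporal offset between video frame i and audio
    frame j after rescaling audio positions onto the video timeline.
    Small sigmas keep attention local; large sigmas widen the context.
    """
    v = np.arange(n_video)[:, None]                    # video frame positions
    a = np.linspace(0, n_video - 1, n_audio)[None, :]  # audio mapped to video timeline
    d2 = (v - a) ** 2                                  # squared temporal offsets
    sig = np.asarray(sigmas, dtype=float)[:, None, None]
    return -d2[None] / (2.0 * sig ** 2)                # (heads, n_video, n_audio)

def biased_cross_attention(q, k, v, sigmas):
    """Audio-visual cross-attention with the Gaussian bias on the logits.

    q: (H, Tv, D) video queries; k, v: (H, Ta, D) audio keys/values.
    """
    H, Tv, D = q.shape
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(D)        # (H, Tv, Ta)
    logits += gaussian_kernel_bias(Tv, k.shape[1], sigmas)  # inject temporal prior
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                          # softmax over audio frames
    return w @ v
```

With a small sigma the softmax of the bias is sharply peaked at the temporally aligned audio frame, which is the sense in which per-head bandwidths form a progressive local-to-global inductive bias.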
Talking Portrait Video Generation
1.1 Various Styles
1.2 Normal
Talking-Listening Interactive Video Generation
2.1 Interactive Pairs
2.2 Various Styles
Long-Form Video Generation
Demonstrating the stability and temporal consistency of our model over extended generation durations.
Comparison on Talking Portrait Video Generation
Each sample is generated by six models from identical input. Videos are ordered by native aspect ratio: the most portrait-shaped baseline sits leftmost, the most square-shaped next to it, and Ours is always placed rightmost (highlighted). All videos share the same height so their shapes are directly comparable. Click Play All to start the whole group in sync; only Ours carries audio.
Comparison on Talking-Listening Interactive Video Generation
Comparison of Audio-Visual Cross-Attention Designs
Qualitative comparison of different audio–visual cross-attention designs. Our design (leftmost) uses 3D Cross-Attention with 1D RoPE and Multi-Head Gated Kernels (MHGK).
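The 1D RoPE component named in this comparison can be sketched as follows (a generic rotary-position-embedding sketch under our own assumptions, not the authors' code): each query/key feature pair is rotated by an angle proportional to its temporal position, so attention logits depend on relative rather than absolute frame offsets.

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Apply 1D rotary position embedding to x of shape (T, D), D even.

    Each of the D/2 feature pairs is rotated by angle position * freq,
    with per-pair frequencies following the usual geometric schedule.
    """
    T, D = x.shape
    half = D // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    ang = np.asarray(positions, dtype=float)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied independently to each (x1_i, x2_i) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
```

Because the per-pair rotations are orthogonal, shifting all positions by the same offset leaves query-key dot products unchanged, which is the relative-position property that makes RoPE attractive for temporal alignment.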
Dataset
Our collected conversational dataset contains paired dialogue videos. Each pair captures two participants in a conversation, enabling training and evaluation of interactive talking-listening avatar generation. Below are 10 sample pairs.
BibTeX
@misc{weng2026monologueinteractivetalkinglisteningavatar,
      title={Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels},
      author={Yuzhe Weng and Haotian Wang and Xinyi Yu and Xiaoyan Wu and Haoran Xu and Shan He and Jun Du},
      year={2026},
      eprint={2604.10367},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.10367},
}