Issue Description
Conversational AI agents may exhibit repetitive response patterns where the system addresses earlier interaction turns instead of focusing on the most recent user input. This phenomenon manifests as an incremental accumulation of user queries where new speech segments appear merged with preceding utterances within the same session.
Platform/SDK
Service: Agora Conversational AI
Integration: RESTful API
TTS Provider: ByteDance Duplex
Affected Resource: seed-tts-2.0
Affected Speaker: zh_male_naiqimengwa_uranus_bigtts
Root Cause
The repetitive behavior originates in how the conversation history module commits turns. Some Text-to-Speech voices, notably the zh_male_naiqimengwa_uranus_bigtts speaker, do not generate word-level metadata or subtitle timestamps.
When "enable_words": true is set for one of these voices, the internal media pipeline fails to produce valid text content, and the assistant response string comes back empty. The conversation engine requires a non-empty string to commit a turn to the session memory, so the empty output keeps the assistant's response from being recorded in the context window. On the next turn, the system treats the previous user input as unhandled and merges it with the current input, so the assistant processes the earlier input again.
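The merging effect described above can be illustrated with a toy model. This is only a sketch of the described behavior, not Agora's actual implementation; the Session class, its fields, and handle_turn are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    # Committed (role, text) turns; hypothetical stand-in for session memory.
    history: list = field(default_factory=list)
    # User inputs that never received a committed assistant reply.
    pending_user: list = field(default_factory=list)

    def handle_turn(self, user_text: str, assistant_text: str) -> str:
        """Return the user text the engine would effectively process."""
        # Earlier unanswered inputs are merged with the new one.
        self.pending_user.append(user_text)
        merged = " ".join(self.pending_user)
        # Assumption: only a non-empty assistant string commits the turn.
        if assistant_text:
            self.history.append(("user", merged))
            self.history.append(("assistant", assistant_text))
            self.pending_user.clear()
        return merged
```

In this model, two consecutive turns with an empty assistant string cause the second turn to be processed as "first input + second input", while non-empty replies keep each turn separate.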
Step by Step Solution
Disable Word-Level Subtitle Output:
If you are using this TTS voice and do not need word-by-word subtitles, disable word-level subtitle output.
Update the Configuration:
Change the configuration from:
"enable_words": true
to:
"enable_words": false
Save and Restart the Session:
Save the configuration and restart or recreate the session if needed.
Test the Conversation Again:
Test the conversation again to confirm that the assistant responds only to the latest user input and that user inputs are no longer merged across turns.
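The configuration change in the steps above can be sketched in Python as follows. This is a minimal sketch that assumes the agent configuration is held as a nested dictionary before being sent to the RESTful API; the function name disable_word_subtitles and the surrounding keys used in the usage example ("tts", "params", "speaker") are placeholders, not the documented request schema. Only the "enable_words" key comes from this article.

```python
import copy


def disable_word_subtitles(config: dict) -> dict:
    """Return a copy of config with every "enable_words" key set to False."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "enable_words":
                    node[key] = False
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(updated)
    return updated


# Usage with a hypothetical config layout (key names are assumptions):
example = {
    "tts": {
        "params": {
            "speaker": "zh_male_naiqimengwa_uranus_bigtts",
            "enable_words": True,
        }
    }
}
fixed = disable_word_subtitles(example)
```

The helper searches the whole structure because the exact location of "enable_words" in the request body is not stated here; check it against your own configuration before relying on it.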
Outcome
Disabling the unsupported word-level metadata ensures that the assistant's text is correctly captured and saved to the conversation history. With each assistant turn committed, the engine no longer merges earlier user inputs into the current one, which restores a linear, predictable conversational flow.
Corresponding Document/Link
- CSD-78200