Overview
Summary:
The OpenAI Realtime API integration is commonly used in a 1-to-1 scenario (client side <--> server side). In some cases, however, the OpenAI Realtime API is used alongside multiple client-side users.
Prerequisites:
SDK:
- Latest version of the RTC application (any client-side platform).
- Latest version of the OpenAI Realtime API (2.2.0 as of the creation of this article).
Tools:
- Python 3.11 or higher
- Ubuntu 22.04 LTS or higher
- CentOS 7.0 or higher
Problem Description:
In scenarios where the server-side client needs to determine which remote user is the active speaker, audio volume indication is not currently available in the Python SDK (see the Python SDK API reference).
Solution:
Server side (Python) - One way to detect the active user on the remote side is the on_playback_audio_frame_before_mixing callback, which carries the VAD information needed to identify the active speaker.
For reference on how to achieve active user identification, see this GitHub example (lines 77-97).
def on_playback_audio_frame_before_mixing(self, agora_local_user, channelId, uid, audio_frame: AudioFrame, vad_result_state: int, vad_result_bytearray: bytearray):
    # logger.info(f"on_playback_audio_frame_before_mixing, channelId={channelId}, uid={uid}, type={audio_frame.type}, samples_per_sec={audio_frame.samples_per_sec}, samples_per_channel={audio_frame.samples_per_channel}, bytes_per_sample={audio_frame.bytes_per_sample}, channels={audio_frame.channels}, len={len(audio_frame.buffer)}")
    print(f"before_mixing: far = {audio_frame.far_field_flag}, rms = {audio_frame.rms}, voice = {audio_frame.voice_prob}, music = {audio_frame.music_prob}, pitch = {audio_frame.pitch}")
    # VAD v2 processing: can be done in the SDK callback
    state, bytes = self._vad_instance.process(audio_frame)
    print("state = ", state, len(bytes) if bytes is not None else 0, vad_result_state, len(vad_result_bytearray) if vad_result_bytearray is not None else 0)
    # dump to the VAD file for debugging
    self._vad_dump.write(audio_frame, bytes, state)
    if bytes is not None:
        if state == 1:
            # start speaking: start sending bytes (not audio_frame) to ASR
            print("vad v2 start speaking")
        elif state == 2:
            # continue sending bytes to ASR
            pass
        elif state == 3:
            # stop speaking: send the remaining bytes to ASR, then stop ASR
            print("vad v2 stop speaking:")
        else:
            logger.info("unknown state")
    return 1
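
Building on the callback above, the following is a minimal sketch (not part of the referenced example) showing how the per-uid VAD state delivered to on_playback_audio_frame_before_mixing could be turned into an active-speaker lookup on the server side. The ActiveSpeakerTracker class and the silence_timeout parameter are hypothetical names introduced here for illustration; the state values (1 = start speaking, 2 = still speaking, 3 = stop speaking) follow the comments in the example.

import time
from threading import Lock

class ActiveSpeakerTracker:
    # Hypothetical helper: keeps the most recent VAD activity per remote uid
    # and reports which uid spoke last.

    def __init__(self, silence_timeout: float = 1.0):
        self._lock = Lock()
        self._last_active = {}            # uid -> time VAD last reported speech
        self._silence_timeout = silence_timeout

    def on_vad_state(self, uid, state: int):
        # Call from on_playback_audio_frame_before_mixing with the callback's uid and VAD state.
        with self._lock:
            if state in (1, 2):           # start speaking / still speaking
                self._last_active[uid] = time.monotonic()
            elif state == 3:              # stop speaking
                self._last_active.pop(uid, None)

    def active_speaker(self):
        # Return the uid that spoke most recently within the timeout window, or None if all are silent.
        now = time.monotonic()
        with self._lock:
            candidates = {u: t for u, t in self._last_active.items()
                          if now - t <= self._silence_timeout}
        if not candidates:
            return None
        return max(candidates, key=candidates.get)

Inside on_playback_audio_frame_before_mixing, the agent would call tracker.on_vad_state(uid, vad_result_state) (or pass the state returned by self._vad_instance.process), and query tracker.active_speaker() whenever it needs to know which remote user is currently speaking before forwarding audio to the OpenAI Realtime API.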
References:
1. https://api-ref.agora.io/en/voice-sdk/python/rtc-py-api.html
2. https://docs.agora.io/en/open-ai-integration/get-started/quickstart?platform=python