I am playing with multi-speaker #voicebot and seems like #openai realtime API is not detecting speakers natively. Surprisingly it keeps saying "can't tell only based on text" as if it's biased more towards transcriptions than full spectrum audio signals.
Probably the best way to do this right now would be to input speaker-tagged 'text' mode item in the API and let it emit audio in response.