
Speech-to-Speech AI: From Dr. Sbaitso to Amazon Nova Sonic
Create AI-powered speech-to-speech applications with Amazon Nova Sonic and Amazon Bedrock. Natural language conversations have evolved dramatically since the days of Dr. Sbaitso's 'TELL ME MORE ABOUT THAT'. Learn how to implement speech-to-speech bidirectional streaming in your applications.

Released by Creative Labs, Dr. Sbaitso was one of the first computer chat programs, created to demonstrate the capabilities of the Sound Blaster sound card. The name is actually an acronym for "Sound Blaster Artificial Intelligent Text to Speech Operator" (to be honest, I only learned it while I was writing this text).


- The StreamSession class manages the audio chunks and queues them for processing.
- The S2SBidirectionalStreamClient class handles the back-and-forth communication, using AsyncIterable to create a two-way stream with Amazon Bedrock through the new capabilities of the Amazon Bedrock SDK.
- The createSessionAsyncIterable method in S2SBidirectionalStreamClient creates an iterator that feeds into InvokeModelWithBidirectionalStreamCommand, a new command in the Amazon Bedrock SDK, sending those audio chunks (in base64) to Amazon Nova Sonic. At the same time, processResponseStream handles what comes back from Amazon Bedrock, working out the audioOutput and textOutput events it receives.
- S2SBidirectionalStreamClient keeps track of what is happening using the SessionData data structure, control signals, and event handlers. When the model responds, the response goes back through the same WebSocket to the frontend, where the browser plays it using the WebAudio API.
- The streamAudioChunk method in StreamSession manages a queue through audioBufferQueue, which prevents the stream from being overloaded and keeps all the audio data in the right order, so conversations feel natural and smooth.

- User => Server: The user speaks into their microphone, and their voice travels to the server.
- Server => Amazon Bedrock: The server forwards the user's voice to Amazon Bedrock without waiting for the user to finish speaking.
- Amazon Bedrock => Server: As soon as it can, Amazon Bedrock starts sending responses back to the server.
- Server => User: The server immediately forwards the responses to the user's browser, which plays them.
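
To make the queueing idea more concrete, here is a minimal sketch of a bounded audio queue exposed as an AsyncIterable, which is the kind of object the text describes feeding into InvokeModelWithBidirectionalStreamCommand. It is not the sample's actual code: the class name AudioChunkQueue, the maxQueueSize default, the drop-oldest policy, and the simplified event shape are all assumptions made for illustration, and the chunks are assumed to arrive already base64-encoded.

```typescript
// Hypothetical sketch: only the name audioBufferQueue comes from the post.

// Simplified event shape (assumption): the real Nova Sonic input event carries more fields.
type AudioInputEvent = { event: { audioInput: { content: string } } };

class AudioChunkQueue implements AsyncIterable<AudioInputEvent> {
  private audioBufferQueue: string[] = []; // base64-encoded chunks, oldest first
  private waiters: Array<() => void> = []; // consumers parked on an empty queue
  private closed = false;
  private maxQueueSize: number;

  constructor(maxQueueSize: number = 200) {
    this.maxQueueSize = maxQueueSize;
  }

  get size(): number {
    return this.audioBufferQueue.length;
  }

  // Producer side: called for each chunk received from the browser.
  push(base64Chunk: string): void {
    if (this.closed) return;
    if (this.audioBufferQueue.length >= this.maxQueueSize) {
      this.audioBufferQueue.shift(); // queue full: drop the oldest chunk (assumed policy)
    }
    this.audioBufferQueue.push(base64Chunk);
    this.wakeConsumers();
  }

  // Signals end of input; the iterator finishes after draining what is queued.
  close(): void {
    this.closed = true;
    this.wakeConsumers();
  }

  private wakeConsumers(): void {
    const parked = this.waiters;
    this.waiters = [];
    for (const resume of parked) resume();
  }

  // Consumer side: yields chunks in arrival order, waiting while the queue is empty.
  async *[Symbol.asyncIterator](): AsyncGenerator<AudioInputEvent> {
    while (true) {
      const chunk = this.audioBufferQueue.shift();
      if (chunk !== undefined) {
        yield { event: { audioInput: { content: chunk } } };
      } else if (this.closed) {
        return;
      } else {
        await new Promise<void>((resolve) => {
          this.waiters.push(resolve);
        });
      }
    }
  }
}

// Usage: with room for two chunks, the oldest of three is dropped and order is preserved.
async function demo(): Promise<string[]> {
  const queue = new AudioChunkQueue(2);
  queue.push("AAAA");
  queue.push("BBBB");
  queue.push("CCCC");
  queue.close();
  const sent: string[] = [];
  for await (const e of queue) sent.push(e.event.audioInput.content);
  return sent;
}
```

Because the producer (the WebSocket handler) and the consumer (the stream to Amazon Bedrock) only meet at the queue, the server can keep forwarding audio while the user is still speaking, which matches the flow described above. Dropping the oldest chunk when the queue is full trades a small gap in the audio for bounded memory; a real application might prefer a different policy.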
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.