Speech-to-Speech AI: From Dr. Sbaitso to Amazon Nova Sonic

Create AI-powered speech-to-speech applications with Amazon Nova Sonic and Amazon Bedrock. Natural language conversations have evolved dramatically since the days of Dr. Sbaitso's 'TELL ME MORE ABOUT THAT'. Learn how to implement speech-to-speech bidirectional streaming in your applications.

Marcelo Palladino
Amazon Employee
Published Apr 8, 2025
Last Modified Apr 9, 2025

A very personal memory

I will never forget a certain night in the 1990s. It must have been around 1992 or 1993. I was at my friend Junior's house, crowded around his parents' computer with a bunch of other kids. I am pretty sure it was a 386, or maybe a 286? All I knew back then was that it had a fancy multimedia kit. The source of our fascination was a program called Dr. Sbaitso.
Image of Dr. Sbaitso, one of the first computer chat programs
DR SBAITSO

Released by Creative Labs, Dr. Sbaitso was one of the first computer chat programs, created to demonstrate the capabilities of the Sound Blaster sound card. The name is actually an acronym for "Sound Blaster Artificial Intelligent Text to Speech Operator" (to be honest, I only learned it while I was writing this text).
The system simulated a digital psychotherapist and became known for its sometimes strange responses and its characteristic robotic voice. It was quite limited in its interactions. One of its most famous phrases was "TELL ME MORE ABOUT THAT", which it frequently repeated during conversations. Back then, in our cluelessness, it felt like we were living in a sci-fi movie.
"Can you fall in love?" I typed, while everyone laughed beside me.
"TELL ME MORE ABOUT THAT", it would respond with its characteristic artificial voice.
We would not give up. We spent hours making up stories, creating scenarios, trying to convince that digital therapist to fall in love with one of our friends. With each disconnected response, we laughed more and tried even harder.
It's funny how these memories stick with you. That clunky old PC with its robotic voice seems almost prehistoric now, but it was pure magic to us back then. Three decades later, I'm still amazed every time I think about how far we've come. In that time, the way people and machines interact has changed dramatically: from those first attempts at voice synthesis, through various assistant experiments, to the AI assistants that are now just part of our daily lives. And now here I am, working with Amazon Nova Sonic and Amazon Bedrock, building the kind of natural conversations I could only dream about back then.

Introducing Amazon Nova Sonic

Amazon Nova Sonic is a model that handles the whole pipeline in one place. Instead of the traditional approach, where you needed separate models and steps for STT (Speech-to-Text), processing, and TTS (Text-to-Speech), Amazon Nova Sonic handles everything in a single pass, processing audio in real time in both directions through full bidirectional streaming. This means you can build systems that preserve the important stuff, like tone, emotion, and the way people actually talk, throughout the whole conversation.
Let me show you a quick demo of how this works. Meet DR ANSMUE (Amazon Nova Sonic Model Usage Example).
An image of DR ANSMUE, a demo using the new capabilities of Amazon Nova Sonic
DR ANSMUE (Amazon Nova Sonic Model Usage Example)
Yeah, I know it is not super original. But cut me some slack, ok? In this example, users can submit their code and chat with our "code therapist" about whatever code they sent in.
DR ANSMUE architecture
DR ANSMUE architecture

Under the hood

What I'm going to show you here is based on the Bidirectional Audio Streaming sample (the one using Amazon Nova Sonic and Amazon Bedrock with TypeScript) that you can find in the Amazon Nova Sonic Speech-to-Speech Model Samples and the Bidirectional Streaming API user guide. So, instead of getting into all the specific technical details of this sample, I'll focus on walking you through the key things you need to know to build this kind of application.
When the user starts talking, the web frontend captures the audio from the microphone using the Web Audio API and sends it to the server through WebSockets. On the server side, the StreamSession class manages the audio chunks and queues them for processing.
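To make that concrete, here is a minimal browser-side sketch of the idea. The endpoint URL, chunk size, and wire format are illustrative assumptions, not the sample's exact code:

```typescript
// Sketch: capture microphone audio with the Web Audio API and forward
// raw PCM chunks to the server over a WebSocket.
const socket = new WebSocket("wss://example.com/audio"); // hypothetical endpoint

async function startStreaming(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioContext = new AudioContext({ sampleRate: 16000 });
  const source = audioContext.createMediaStreamSource(stream);

  // ScriptProcessorNode is deprecated in favor of AudioWorklet, but it keeps
  // this sketch short; 4096 samples per callback is an arbitrary choice.
  const processor = audioContext.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event: AudioProcessingEvent) => {
    const samples = event.inputBuffer.getChannelData(0); // Float32 in [-1, 1]
    // Convert to 16-bit PCM, a common wire format for speech audio.
    const pcm = new Int16Array(samples.length);
    for (let i = 0; i < samples.length; i++) {
      pcm[i] = Math.max(-32768, Math.min(32767, Math.round(samples[i] * 32768)));
    }
    if (socket.readyState === WebSocket.OPEN) {
      socket.send(pcm.buffer);
    }
  };

  source.connect(processor);
  processor.connect(audioContext.destination);
}
```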
This works in collaboration with the S2SBidirectionalStreamClient class, which handles the back-and-forth communication, using an AsyncIterable to create a two-way stream with Amazon Bedrock through the new capabilities of the Amazon Bedrock SDK.
The createSessionAsyncIterable method in S2SBidirectionalStreamClient creates an iterator that feeds the InvokeModelWithBidirectionalStreamCommand, a new invocation method in the Amazon Bedrock SDK, sending those audio chunks (base64-encoded) to Amazon Nova Sonic. At the same time, processResponseStream handles what comes back from Amazon Bedrock, parsing the audioOutput and textOutput events it receives.
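Stripped down to its essence, the pattern looks roughly like this. The event JSON below is a simplified assumption (the real protocol also involves session, prompt, and content start/end events; see the samples for the full schema), and the queue is a stand-in for the sample's session plumbing:

```typescript
import {
  BedrockRuntimeClient,
  InvokeModelWithBidirectionalStreamCommand,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

// Stand-in for the queue the WebSocket handler pushes base64 audio into.
const audioQueue: string[] = [];

// Async generator yielding input events to the model as audio arrives.
async function* sessionInput(): AsyncGenerator<{ chunk: { bytes: Uint8Array } }> {
  while (true) {
    const base64Audio = audioQueue.shift();
    if (base64Audio === undefined) {
      await new Promise((resolve) => setTimeout(resolve, 10)); // wait for more audio
      continue;
    }
    // Simplified event shape; the real protocol has more framing events.
    const event = { event: { audioInput: { content: base64Audio } } };
    yield { chunk: { bytes: new TextEncoder().encode(JSON.stringify(event)) } };
  }
}

async function runSession(): Promise<void> {
  const response = await client.send(
    new InvokeModelWithBidirectionalStreamCommand({
      modelId: "amazon.nova-sonic-v1:0",
      body: sessionInput(),
    })
  );

  // The other direction of the stream: audioOutput and textOutput events.
  for await (const part of response.body ?? []) {
    if (part.chunk?.bytes) {
      const payload = JSON.parse(new TextDecoder().decode(part.chunk.bytes));
      if (payload.event?.audioOutput) {
        // Forward the base64 audio back to the browser over the WebSocket.
      } else if (payload.event?.textOutput) {
        // Transcript text, useful for logging or on-screen captions.
      }
    }
  }
}
```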
S2SBidirectionalStreamClient keeps track of what is happening through the SessionData structure, which holds control signals and event handlers. When the model comes back with something, it goes back through the same WebSocket to the frontend, where the browser plays it using the Web Audio API.
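Conceptually, that per-session state looks something like the sketch below; the field names are illustrative assumptions, not the sample's exact definitions:

```typescript
// Illustrative per-session state, loosely modeled on the sample's SessionData.
interface SessionData {
  sessionId: string;
  isActive: boolean;                        // signal: should the stream keep running?
  audioQueue: string[];                     // base64 audio chunks waiting to be sent
  abortController: AbortController;         // lets us tear the stream down cleanly
  handlers: Map<string, (payload: unknown) => void>; // event name -> callback
}

// Dispatch a model event to whoever registered interest in it.
function emit(session: SessionData, eventName: string, payload: unknown): void {
  session.handlers.get(eventName)?.(payload);
}
```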
All of this happens immediately. The streamAudioChunk method in StreamSession manages a queue through audioBufferQueue, which prevents the pipeline from getting overloaded and keeps all the audio data in the right order, so conversations feel natural and smooth.
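A minimal version of that queueing idea might look like this; the class shape and the cap are assumptions for illustration, not the sample's exact StreamSession:

```typescript
// Sketch of ordered, bounded audio buffering, in the spirit of
// StreamSession.streamAudioChunk; names and limits are illustrative.
class AudioQueue {
  private buffer: Uint8Array[] = [];
  private readonly maxChunks = 200; // cap to avoid unbounded memory growth

  enqueue(chunk: Uint8Array): void {
    if (this.buffer.length >= this.maxChunks) {
      this.buffer.shift(); // drop the oldest chunk rather than overload
    }
    this.buffer.push(chunk); // push preserves arrival order
  }

  dequeue(): Uint8Array | undefined {
    return this.buffer.shift(); // FIFO keeps the audio in the right order
  }
}
```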
To make all of this work smoothly, we use HTTP/2 to call Amazon Bedrock. The cool thing is that, unlike HTTP/1.1, HTTP/2 lets us multiplex multiple streams over the same connection, which keeps the audio flowing in both directions without delays. Another nice thing is performance: it compresses headers and prioritizes streams, so everything feels quicker and more natural, especially when there is a lot going on at once.
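In the AWS SDK for JavaScript v3, you opt into HTTP/2 by passing a NodeHttp2Handler when constructing the client; the timeout values below are illustrative choices, not requirements:

```typescript
import { BedrockRuntimeClient } from "@aws-sdk/client-bedrock-runtime";
import { NodeHttp2Handler } from "@smithy/node-http-handler";

// Use HTTP/2 for the Bedrock connection so multiple streams can share it.
const client = new BedrockRuntimeClient({
  region: "us-east-1",
  requestHandler: new NodeHttp2Handler({
    sessionTimeout: 60_000,   // ms before an idle HTTP/2 session is closed
    requestTimeout: 300_000,  // generous budget for long-lived streaming requests
  }),
});
```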

In a nutshell

It might look complex at first, but in a nutshell, what is happening is:
  • User => Server: User speaks into their microphone, and their voice travels to the server.
  • Server => Amazon Bedrock: The server forwards the user's voice to Amazon Bedrock without waiting for the user to finish speaking.
  • Amazon Bedrock => Server: As soon as it can, Amazon Bedrock starts sending responses back to the server.
  • Server => User: The server immediately forwards the responses to the user's browser, which plays them.
This all happens in near real time, creating a natural conversation where responses can overlap with the user's speech, just like in human conversations.

New possibilities

This opens up a bunch of possibilities that used to feel out of reach. Imagine calling customer support and being able to interrupt and actually correct something mid-sentence, instead of waiting for a pause. Or think about a virtual assistant joining your team meeting and just keeping up, responding like a real person instead of lagging behind.
This kind of thing could be huge for accessibility too. People with visual or motor impairments could interact with systems more easily. And in classrooms, virtual tutors could actually listen and answer questions like a real conversation. Even live translation could feel smoother, like you’re actually talking to someone, not just waiting for a machine to catch up.
Start building with the AWS SDK's bidirectional streaming API, Amazon Nova Sonic, and Amazon Bedrock today. Check out the resources for developers to build, deploy, and scale AI-powered applications, the Bidirectional Streaming API user guide, and the Amazon Nova Sonic Speech-to-Speech Model Samples to learn more about implementing this in your own applications.
As I write about these possibilities, I find myself increasingly eager to explore their practical applications. But before diving into new projects, I need to try this one more time with DR ANSMUE:
“Hey Doc, can you fall in love?”
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
