
Voice-Controlled Humanoid Robots Using Amazon Nova Sonic and AWS IoT
This project uses Amazon Nova Sonic and AWS IoT for real-time, hands-free voice control of humanoid robots. By integrating AI speech-to-speech streaming with tool calling, developers can create intuitive and responsive robotic systems. The setup includes AWS IoT-enabled robots and AWS cloud infrastructure for robust and scalable voice command execution.
Published Apr 18, 2025
Last Modified May 14, 2025
Previously, hands-free voice control was complicated: it was difficult to determine the break points for extracting voice commands or user intent. In the era of large language models (LLMs), text commands can be handled with tool or function calling. Enabling voice commands, however, has required chaining multiple models: convert speech to text, process the text through the LLM, and then convert the response back to speech. That pipeline makes streaming input challenging, leading to significant delays and errors caused by incorrect stopping points.
The ideal solution is an LLM that supports speech-to-speech streaming with integrated tool use, eliminating the need for developers to manage voice commands. Amazon Nova Sonic is the model we've been searching for!


In this blog post, I will focus on the AWS cloud side. The AWS IoT-enabled robots are simply pub/sub Python services that subscribe to an AWS IoT topic: when a new message arrives, the service pushes the command into a buffer queue and triggers the corresponding robot action API call, as sketched below.
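As a minimal sketch of such a service, assuming the AWS IoT Device SDK for Python v2 (`awsiot`/`awscrt`) with an MQTT5 client; the endpoint, topic name, certificate paths, and the `trigger_robot_action` helper are all hypothetical placeholders, not the project's actual code:

```python
import json
import queue
import threading

from awscrt import mqtt5
from awsiot import mqtt5_client_builder

COMMAND_TOPIC = "robot/humanoid-01/commands"  # hypothetical topic
command_queue: "queue.Queue[dict]" = queue.Queue()

def trigger_robot_action(command: dict) -> None:
    """Placeholder for the robot's local motion API call."""
    print(f"executing {command}")

def on_publish_received(publish_packet_data):
    # Buffer each incoming voice command so that slow robot motions
    # never block the MQTT client's event thread.
    payload = publish_packet_data.publish_packet.payload
    command_queue.put(json.loads(payload))

client = mqtt5_client_builder.mtls_from_path(
    endpoint="xxxxxxxx-ats.iot.us-east-1.amazonaws.com",  # placeholder
    cert_filepath="certs/robot.cert.pem",
    pri_key_filepath="certs/robot.private.key",
    ca_filepath="certs/AmazonRootCA1.pem",
    client_id="humanoid-robot-01",
    on_publish_received=on_publish_received,
)
client.start()
client.subscribe(subscribe_packet=mqtt5.SubscribePacket(
    subscriptions=[mqtt5.Subscription(
        topic_filter=COMMAND_TOPIC, qos=mqtt5.QoS.AT_LEAST_ONCE)]))

def command_worker():
    # Drain the buffer queue and execute one command at a time.
    while True:
        trigger_robot_action(command_queue.get())
        command_queue.task_done()

threading.Thread(target=command_worker, daemon=True).start()
threading.Event().wait()  # keep the service running
```

The queue decouples message receipt from motion execution, so a long-running action never stalls incoming commands.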
For the AWS cloud, the setup is built with the AWS CDK. It includes an AWS IoT Thing (via the ThingWithCert L3 construct) and an aws-apprunner-alpha Service. The client certificate and robot client code must be deployed manually to each Raspberry Pi.
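A sketch of such a stack in Python CDK might look like the following; the original project may well define it in TypeScript. The `cdk_iot_core_certificates` module name, the ThingWithCert property names, and the container image identifier are assumptions for illustration.

```python
from aws_cdk import App, Stack
from aws_cdk import aws_apprunner_alpha as apprunner
from cdk_iot_core_certificates import ThingWithCert  # assumed module name
from constructs import Construct

class VoiceRobotStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # One IoT Thing plus client certificate per robot; the cert and key
        # are saved to SSM Parameter Store for manual copying to the Pi.
        ThingWithCert(
            self, "HumanoidRobot01",
            thing_name="humanoid-robot-01",
            save_to_param_store=True,
            param_prefix="/robots",
        )

        # App Runner hosts the Express.js/WebSocket voice web application.
        apprunner.Service(
            self, "VoiceWebApp",
            source=apprunner.Source.from_ecr_public(
                image_configuration=apprunner.ImageConfiguration(port=3000),
                # Placeholder image; the real app image is project-specific.
                image_identifier="public.ecr.aws/example/voice-app:latest",
            ),
        )

app = App()
VoiceRobotStack(app, "VoiceRobotStack")
app.synth()
```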
If the `keep_alive_interval_sec` parameter isn't explicitly set in the sample code, the client might appear to hang when, in reality, it is simply taking a long time to reconnect because of the default keep-alive setting.
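For example, the keep-alive can be set explicitly on the MQTT5 connect packet. This sketch uses the `awscrt` MQTT5 client directly; the endpoint and file paths are placeholders.

```python
from awscrt import io, mqtt5

# Build an mTLS context from the device certificate and private key.
tls_options = io.TlsContextOptions.create_client_with_mtls_from_path(
    "certs/robot.cert.pem", "certs/robot.private.key")
tls_options.override_default_trust_store_from_path(
    ca_filepath="certs/AmazonRootCA1.pem")

client = mqtt5.Client(mqtt5.ClientOptions(
    host_name="xxxxxxxx-ats.iot.us-east-1.amazonaws.com",  # placeholder
    port=8883,
    tls_ctx=io.ClientTlsContext(tls_options),
    connect_options=mqtt5.ConnectPacket(
        client_id="humanoid-robot-01",
        # An explicit keep-alive makes a dropped connection surface quickly
        # instead of looking like a hang during a long reconnect cycle.
        keep_alive_interval_sec=30,
    ),
))
client.start()
```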
The web application is built with Express.js and Node.js, using WebSocket for real-time voice input and speech output in the browser. We adapted the sample code and updated the prompt as follows:
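The original prompt isn't reproduced here; the string below is only a plausible illustration of a system prompt for this use case.

```python
# Illustrative only -- not the post's actual prompt.
SYSTEM_PROMPT = (
    "You are the voice interface of a humanoid robot. For every user "
    "request, call exactly one tool to perform the action (for example "
    "walk, turn, wave, or stop), then confirm it in one short spoken sentence."
)
```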
The prompt is paired with the following tool scheme.
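The original scheme isn't shown here; as a sketch, a Nova Sonic tool configuration with a single hypothetical `robotAction` tool might look like this. As in the public Nova Sonic samples, the JSON Schema is embedded as a string.

```python
import json

# Hypothetical tool definition; the project's actual scheme will differ.
ROBOT_ACTION_SCHEMA = json.dumps({
    "type": "object",
    "properties": {
        "action": {
            "type": "string",
            "enum": ["walk_forward", "walk_backward", "turn_left",
                     "turn_right", "wave", "stop"],
            "description": "The movement the robot should perform.",
        },
    },
    "required": ["action"],
})

TOOL_CONFIGURATION = {
    "tools": [{
        "toolSpec": {
            "name": "robotAction",
            "description": "Execute a movement on the humanoid robot.",
            # The schema is passed as a JSON string.
            "inputSchema": {"json": ROBOT_ACTION_SCHEMA},
        },
    }],
    # "any" forces at least one tool call per turn (discussed below).
    "toolChoice": {"any": {}},
}
```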
Then comes the code to start the streaming session.
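The original listing isn't reproduced here either. The web app itself is Node.js; for consistency with the other snippets, this sketch shows the first two events of a Nova Sonic session as Python dicts, with field values mirroring the public Nova Sonic sample and a hypothetical `send_event` helper.

```python
import json

MODEL_ID = "amazon.nova-sonic-v1:0"    # the bidirectional stream targets this model
PROMPT_NAME = "voice-control-session"  # arbitrary per-session identifier

# Elided; see the tool scheme snippet above for the full definition.
TOOL_CONFIGURATION = {"tools": [], "toolChoice": {"any": {}}}

def send_event(payload: str) -> None:
    """Hypothetical helper: in the real app, each event is written as one
    chunk of the InvokeModelWithBidirectionalStream request body."""
    print(payload)

# Session-level inference settings come first...
session_start = {"event": {"sessionStart": {
    "inferenceConfiguration": {"maxTokens": 1024, "topP": 0.9, "temperature": 0.7},
}}}

# ...followed by the prompt-level output and tool configuration.
prompt_start = {"event": {"promptStart": {
    "promptName": PROMPT_NAME,
    "textOutputConfiguration": {"mediaType": "text/plain"},
    "audioOutputConfiguration": {
        "mediaType": "audio/lpcm",
        "sampleRateHertz": 24000,
        "sampleSizeBits": 16,
        "channelCount": 1,
        "voiceId": "matthew",
        "encoding": "base64",
        "audioType": "SPEECH",
    },
    "toolUseOutputConfiguration": {"mediaType": "application/json"},
    "toolConfiguration": TOOL_CONFIGURATION,
}}}

for event in (session_start, prompt_start):
    send_event(json.dumps(event))
```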
The key is to set "toolChoice" to "any", which guarantees that at least one tool is invoked on every turn: the model still decides which tool to call, but a tool is always used. With the default "auto" setting, the model may respond without selecting any tool at all, which makes it unreliable for triggering an action.
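In configuration terms, the two behaviours differ by a single key, following the Bedrock toolChoice structure:

```python
# Force at least one tool call on every user turn:
TOOL_CHOICE_ANY = {"any": {}}

# Default behaviour: the model decides, and may reply in speech only:
TOOL_CHOICE_AUTO = {"auto": {}}
```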
The integration of Amazon Nova Sonic with AWS IoT for seamless speech-to-speech control of humanoid robots represents a significant advancement in hands-free voice command technology. By leveraging real-time AI speech-to-speech streaming and tool calling, developers can now create more intuitive and responsive robotic systems. The combination of AWS IoT-enabled robots and the AWS cloud infrastructure ensures robust and scalable solutions, making it easier to deploy and manage these systems.
This approach not only simplifies the process of voice command extraction but also enhances the accuracy and efficiency of robotic actions. As we continue to explore the capabilities of large language models and real-time AI, the potential for innovative applications in robotics and beyond is immense. Amazon Nova Sonic is a promising step towards a future where voice commands can seamlessly control complex systems, paving the way for more advanced and user-friendly technologies.
Students from the Higher Diploma in Cloud and Data Centre Administration
- Koei Tang - AWS Community Builder + AWS Educate Cloud Ambassador
- Angela Chow - AWS Educate Cloud Ambassador
- Kathy WU - AWS Educate Cloud Ambassador
- Hau Yee Leung - AWS Educate Cloud Ambassador
