
Audio mix for podcasts with Amazon Polly and pydub library
This article explains how to automatically mix podcast speech and sounds with Amazon Polly and the pydub Python library.
Amar Meriche
Amazon Employee
Published Jan 17, 2025
Amazon Polly is a cloud-based text-to-speech (TTS) service that converts text into lifelike speech. It uses advanced deep learning technologies to synthesize natural-sounding human speech. With Polly, you can:
- Generate speech in multiple languages and accents
- Choose from a variety of male and female voices
- Create applications that talk, such as e-learning platforms, accessibility applications, or automated news readers
- Customize the pronunciation of words
- Use SSML (Speech Synthesis Markup Language) for more fine-grained control over speech output
It's designed to be scalable and cost-effective for a wide range of use cases, from small-scale personal projects to enterprise-level applications requiring millions of audio files.
Amazon Polly can be combined with the pydub Python library to automate the audio mix when you want to add music, sounds, or any other audio to a podcast.
The first step is to create the speech in the Amazon Polly console using SSML tags (the CLI can also be used, but it is not covered in this article):
- Sign in to the AWS console and open the Amazon Polly service
- Choose the type of Engine you want to use (Standard, Neural, Long-Form, or Generative, depending on the Region you are using)
- Choose the Language and Voice you want to use
- Activate the SSML button on the right and type your Input Text using SSML and Speech Marks
- In Additional settings, choose the Sample rate and the File format for the output

Here's the Input text example I used:
In this example I included a custom SSML tag (<mark name="jingle1"/>) and a pause (<break time="8s"/>) in my speech so that the jingle music can be added later. You can then listen to the speech without the jingle by clicking the Listen button on the top right.
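As an illustration, an input along these lines could be assembled and checked for well-formedness before pasting it into the console. This is only a sketch: the spoken sentences are hypothetical, and only the <mark> and <break> tags are taken from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical podcast text; only the <mark> and <break> tags come from the article.
SSML_TEXT = (
    "<speak>"
    "Welcome to this episode of the podcast."
    '<mark name="jingle1"/>'   # position where the jingle should start
    '<break time="8s"/>'       # silence that leaves room for the jingle
    "Let's get started with today's topic."
    "</speak>"
)


def list_marks(ssml: str) -> list[str]:
    """Return the names of all <mark> tags, in document order."""
    root = ET.fromstring(ssml)  # raises ParseError if the SSML is not well-formed XML
    return [mark.get("name") for mark in root.iter("mark")]


if __name__ == "__main__":
    print(list_marks(SSML_TEXT))
```

Placing the mark just before the break means the speech mark's timestamp falls at the start of the pause, which is where the jingle is meant to begin.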
On your computer (don't save to S3), you now need to:
- Download the output file in the format you picked in the Additional settings (MP3): you'll get a file named speech_xxxx.mp3 (xxxx is the date of the download)
- Download the SSML speech marks using the "Speech Marks" File format and the "SSML" Speech marks type: the downloaded file is named speech_xxxx.marks (xxxx is the date of the download)
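For readers who prefer a script to the console, the two downloads above could be sketched with boto3, the AWS SDK for Python. This is an assumption-laden sketch, not the method used in this article: the voice (Joanna), the engine (neural), the sample rate, and the helper names build_requests and synthesize are all illustrative choices, and AWS credentials must already be configured.

```python
def build_requests(ssml, voice_id="Joanna", engine="neural"):
    """Build the two synthesize_speech requests: one for audio, one for speech marks."""
    common = {"Text": ssml, "TextType": "ssml", "VoiceId": voice_id, "Engine": engine}
    audio = {**common, "OutputFormat": "mp3", "SampleRate": "24000"}
    # Speech marks are requested with the "json" output format
    marks = {**common, "OutputFormat": "json", "SpeechMarkTypes": ["ssml"]}
    return audio, marks


def synthesize(ssml, audio_path="speech.mp3", marks_path="speech_marks.json"):
    """Call Amazon Polly twice and save the audio and the speech marks locally."""
    import boto3  # imported here so build_requests works without the AWS SDK installed

    polly = boto3.client("polly")
    audio_req, marks_req = build_requests(ssml)
    for request, path in ((audio_req, audio_path), (marks_req, marks_path)):
        response = polly.synthesize_speech(**request)
        with open(path, "wb") as f:
            f.write(response["AudioStream"].read())
```

The speech marks file written this way contains one JSON object per line, exactly like the file downloaded from the console.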

Now that you have downloaded both the MP3 and marks files, rename them speech.mp3 and speech_marks.json. Put them both in the same directory on your computer as the mix2audiosources.py Python script (see below) and the background music file (click on the previous link, unzip the JingleB2.zip file and rename it jingle.mp3).
On your computer, in your directory, you'll have the following files:
speech.mp3
speech_marks.json
jingle.mp3
mix2audiosources.py
Now edit the speech_marks.json file, which contains the speech mark entries produced by Polly. You have to add square brackets at the beginning and the end (and commas between the entries if there are several) to make this file a proper JSON file.
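The manual edit can also be done in code. The sketch below assumes the downloaded file holds one JSON object per line, which is how Polly returns speech marks. The function name marks_to_json_array is hypothetical, and the numbers in the sample line are made up for illustration.

```python
import json


def marks_to_json_array(raw: str) -> list[dict]:
    """Parse Polly speech marks (one JSON object per line) into a Python list."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


# Illustrative values only: the field names follow the speech mark format,
# the numbers are invented for this example.
SAMPLE = '{"time":3500,"type":"ssml","start":47,"end":70,"value":"jingle1"}\n'

if __name__ == "__main__":
    marks = marks_to_json_array(SAMPLE)
    # json.dumps writes the brackets (and commas) that make this a proper JSON array
    print(json.dumps(marks, indent=2))
```

Writing the result of json.dumps back to speech_marks.json gives the same bracketed file as the manual edit.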
Install pydub (it also needs ffmpeg or libav available on your system to read and write MP3 files):
pip install pydub
Start the mix2audiosources.py script to mix the speech with the jingle music.
python .\mix2audiosources.py
The jingle is added automatically according to the "time" value of the Speech Marks in the speech_marks.json file. As output, you'll get the speech_with_music.mp3 file, in which the jingle is overlaid on the pause in the speech. You can play this output file with any MP3 player.
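The mixing step can be sketched as follows. This is a minimal illustration of the approach described above, not the original mix2audiosources.py listing: the function names jingle_positions and mix are hypothetical, and the code assumes that speech_marks.json is already a bracketed JSON array and that the mark is named jingle1.

```python
import json


def jingle_positions(marks, mark_name="jingle1"):
    """Return the offsets (in milliseconds) of the SSML marks with the given name."""
    return [
        m["time"]
        for m in marks
        if m.get("type") == "ssml" and m.get("value") == mark_name
    ]


def mix(speech_path="speech.mp3", marks_path="speech_marks.json",
        jingle_path="jingle.mp3", out_path="speech_with_music.mp3"):
    """Overlay the jingle on the speech at every matching speech mark."""
    from pydub import AudioSegment  # requires pydub, plus ffmpeg or libav for MP3

    with open(marks_path, encoding="utf-8") as f:
        marks = json.load(f)  # expects the bracketed JSON array

    speech = AudioSegment.from_mp3(speech_path)
    jingle = AudioSegment.from_mp3(jingle_path)

    # overlay() keeps the length of the speech and mixes the jingle in at position (ms)
    for position_ms in jingle_positions(marks):
        speech = speech.overlay(jingle, position=position_ms)

    speech.export(out_path, format="mp3")
```

Calling mix() from the directory that contains the three input files should produce speech_with_music.mp3. If the jingle is longer than the pause, it keeps playing over the speech that follows. Slicing it first, for example jingle[:8000] for an 8-second pause, is one way to avoid that.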
Note:
The process described so far is manual, but it can be automated by implementing a Lambda function with more sophisticated code to mix sounds into the speech, triggered automatically when the speech file is uploaded to an S3 bucket. The only manual actions would then be to write the SSML speech with proper Speech Marks and pauses, and to use the CLI or the Polly console to create the speech file and upload it to the S3 bucket (this could be the subject of a new article).
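As a starting point, the entry point of such a Lambda function might look like the sketch below. It only shows how the uploaded object could be identified from the S3 event. The mixing itself is left as a placeholder, and the helper name uploaded_objects is hypothetical.

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    return [
        # Object keys arrive URL-encoded in S3 events, so decode them first
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]


def lambda_handler(event, context):
    """Entry point called by Lambda when the S3 trigger fires."""
    objects = uploaded_objects(event)
    for bucket, key in objects:
        if key.endswith(".mp3"):
            # Placeholder: download the speech, marks and jingle, mix them, upload the result
            print(f"Would mix s3://{bucket}/{key}")
    return {"processed": len(objects)}
```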
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.