
Ask your video: Analyze visual and audio speech in a RAG application
Learn how to create a containerized application to process and search video content using AWS Step Functions for orchestration, Amazon ECS for parallel processing, Aurora PostgreSQL with pgvector, and Amazon Bedrock for embeddings
Elizabeth Fuentes
Amazon Employee
Published May 29, 2025
In this second part of the series, you'll learn how to implement a containerized version of Ask Your Video using AWS Step Functions for orchestration. The application processes video content in parallel streams, enabling natural language search across visual and audio elements.
In Part 1: Building a RAG System for Video Content Search and Analysis, we explored implementing a RAG system in a Jupyter notebook. While that approach works well for prototypes and small applications, scaling it presents a principal challenge: video processing demands intensive CPU resources, especially during frame extraction and embedding generation.
To address this constraint, this blog demonstrates a containerized application that offers improved scalability and resource management. The containerized architecture provides these key benefits:
- Unlimited processing time using Amazon Elastic Container Service (Amazon ECS).
- Consistent environment management through Docker containers.
- Robust workflow orchestration with AWS Step Functions.
This architecture creates an application for processing video content at scale.

The solution uses AWS Step Functions to orchestrate a parallel workflow that processes both visual and audio content simultaneously:
1. Trigger: When a video is uploaded to Amazon S3, it initiates the Step Functions workflow
2. Parallel Processing Branches:
Visual Branch:
- An Amazon ECS task runs a containerized FFmpeg process that extracts frames at 1 FPS
- Each frame is compared with its neighbors for similarity, so only unique frames are kept and storage costs are minimized (see the sketch after this list)
- Unique frames are sent to Amazon Bedrock for embedding generation
Audio Branch:
- Amazon Transcribe processes the audio track with speaker diarization enabled
- The transcription is segmented based on speaker changes and timing
- Text segments are converted to embeddings using Amazon Bedrock
3. Convergence:
- A Lambda function processes both streams' outputs
- Generates final embeddings using the Amazon Bedrock Titan multimodal model
- Stores vectors in Amazon Aurora PostgreSQL with pgvector
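To make the visual branch concrete, here is a minimal sketch of frame extraction and similarity-based deduplication, assuming FFmpeg and Pillow are installed. The 1 FPS rate comes from the architecture above; the grayscale histogram comparison and the 0.9 threshold are illustrative assumptions, not the repository's exact logic.

```python
import glob
import subprocess

from PIL import Image


def extract_frames(video_path: str, out_dir: str) -> list[str]:
    """Extract one frame per second with FFmpeg, as the ECS task does."""
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", "fps=1", f"{out_dir}/frame_%05d.jpg"],
        check=True,
    )
    return sorted(glob.glob(f"{out_dir}/frame_*.jpg"))


def is_similar(path_a: str, path_b: str, threshold: float = 0.9) -> bool:
    """Rough similarity check: histogram overlap of grayscale thumbnails."""
    hist_a = Image.open(path_a).convert("L").resize((64, 64)).histogram()
    hist_b = Image.open(path_b).convert("L").resize((64, 64)).histogram()
    overlap = sum(min(a, b) for a, b in zip(hist_a, hist_b))
    return overlap / sum(hist_a) >= threshold


def unique_frames(frames: list[str]) -> list[str]:
    """Keep only frames that differ enough from the last kept frame."""
    kept: list[str] = []
    for frame in frames:
        if not kept or not is_similar(kept[-1], frame):
            kept.append(frame)
    return kept
```

Only the frames returned by `unique_frames` would then be sent to Amazon Bedrock for embedding generation.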
Set up the environment:
- Create a virtual environment:
- Activate the virtual environment:
- Install dependencies:
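These are the usual commands for a Python CDK project; the `requirements.txt` file name is an assumption based on the standard project layout:

```bash
python3 -m venv .venv              # create the virtual environment
source .venv/bin/activate          # activate it (Windows: .venv\Scripts\activate)
pip install -r requirements.txt    # install dependencies
```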
This CDK project creates the foundational infrastructure for an audio and video processing application that generates embeddings from media files. The infrastructure includes:
- An Amazon ECS cluster named "video-processing"
- A VPC with public and private subnets for secure networking
- SSM parameters to store cluster and VPC information for use by other stacks
This deployment takes approximately 162 seconds.
After deployment, you can verify the resources in the AWS CloudFormation console:

Check the parameters in Systems Manager Parameter Store, which are necessary to deploy the other stacks in this application:
- /videopgvector/cluster-name: Contains the ECS cluster name
- /videopgvector/vpc-id: Contains the VPC ID
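For orientation, here is a minimal CDK sketch in Python of the stack described above. The "video-processing" cluster name and the two parameter names come from this post; the construct IDs and VPC settings are illustrative assumptions.

```python
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_ssm as ssm
from constructs import Construct


class BaseInfraStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # VPC with public and private subnets (CDK's default subnet layout)
        vpc = ec2.Vpc(self, "VideoVpc", max_azs=2)
        # ECS cluster that will run the frame-extraction tasks
        cluster = ecs.Cluster(
            self, "Cluster", cluster_name="video-processing", vpc=vpc
        )
        # SSM parameters consumed by the other stacks
        ssm.StringParameter(
            self, "ClusterName",
            parameter_name="/videopgvector/cluster-name",
            string_value=cluster.cluster_name,
        )
        ssm.StringParameter(
            self, "VpcId",
            parameter_name="/videopgvector/vpc-id",
            string_value=vpc.vpc_id,
        )
```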
This CDK project creates an Amazon Aurora PostgreSQL database with vector capabilities for storing and querying embeddings generated from audio and video files.
The infrastructure includes:
- An Aurora PostgreSQL Serverless v2 cluster with pgvector extension
- Lambda functions for database setup and management
- Security groups and IAM roles for secure access
- SSM parameters to store database connection information
This deployment takes approximately 594 seconds.

After deployment, you can verify the resources in the AWS CloudFormation console:

Check the parameters in Systems Manager Parameter Store, which are necessary to deploy the other stacks in this application:
- /videopgvector/cluster_arn: Contains the Aurora cluster ARN
- /videopgvector/secret_arn: Contains the secret ARN for database credentials
- /videopgvector/video_table_name: Contains the table name for video embeddings
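Once this stack is deployed, you can query the vector table through the RDS Data API without managing database connections. A minimal sketch, assuming the default `postgres` database and illustrative column names (`source`, `text_chunk`, `embedding`); check the deployed schema for the real ones:

```python
import boto3

ssm = boto3.client("ssm")
rds = boto3.client("rds-data")

# Connection details published by the stack
cluster_arn = ssm.get_parameter(Name="/videopgvector/cluster_arn")["Parameter"]["Value"]
secret_arn = ssm.get_parameter(Name="/videopgvector/secret_arn")["Parameter"]["Value"]
table_name = ssm.get_parameter(Name="/videopgvector/video_table_name")["Parameter"]["Value"]


def nearest_segments(query_embedding: list[float], k: int = 5):
    """Return the k rows closest to the query vector (pgvector cosine distance)."""
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    result = rds.execute_statement(
        resourceArn=cluster_arn,
        secretArn=secret_arn,
        database="postgres",  # assumption: default database name
        sql=f"SELECT source, text_chunk FROM {table_name} "
            "ORDER BY embedding <=> CAST(:q AS vector) LIMIT :k",
        parameters=[
            {"name": "q", "value": {"stringValue": vector_literal}},
            {"name": "k", "value": {"longValue": k}},
        ],
    )
    return result["records"]
```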
This CDK project creates a complete workflow for processing audio and video files to generate embeddings.
The infrastructure includes:
- A Step Functions workflow that orchestrates the entire process
- Lambda functions for various processing steps
- An ECS Fargate task for video frame extraction
- Integration with Amazon Transcribe for audio transcription (see the sketch after this list)
- DynamoDB tables for tracking job status
- S3 bucket for storing media files and processing results
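The Transcribe integration listed above boils down to starting a transcription job with speaker labels enabled. Here is a hedged sketch of the equivalent API call; the job name, media location, and speaker count are illustrative:

```python
import boto3

transcribe = boto3.client("transcribe")
transcribe.start_transcription_job(
    TranscriptionJobName="ask-your-video-demo",  # illustrative name
    Media={"MediaFileUri": "s3://YOUR-MEDIA-BUCKET/sample_video.mp4"},
    MediaFormat="mp4",
    LanguageCode="en-US",
    Settings={
        "ShowSpeakerLabels": True,  # speaker diarization
        "MaxSpeakerLabels": 4,      # illustrative upper bound
    },
)
```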
Install Docker Desktop, make sure it is running, and then deploy the stack.
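The usual CDK command, assuming it is run from this stack's project directory (Docker must be running because CDK builds the Fargate task image locally):

```bash
cdk deploy
```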
This deployment takes approximately 171 seconds.

After deployment, you can verify the resources in the AWS CloudFormation console:

This CDK project creates a retrieval API for searching and querying embeddings generated from audio and video files.
The infrastructure includes:
- An API Gateway REST API with Cognito authentication.
- Lambda functions for retrieval operations.
- Integration with the Aurora PostgreSQL vector database.
This deployment takes approximately 57 seconds.

After deployment, you can verify the resources in the AWS CloudFormation console:

Check the parameters in Systems Manager Parameter Store, which are necessary to deploy the other stacks in this application:
- /videopgvector/api_retrieve: Contains the API endpoint URL
- /videopgvector/lambda_retreval_name: Contains the retrieval Lambda function name
Navigate to the test environment:
Upload the video file to the bucket created in the previous deployment.
Check the bucket name, then upload the file to start processing.
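A minimal boto3 sketch of both steps; the "videopgvector" name hint and the sample file name are assumptions, so adjust them to your deployment:

```python
import boto3

s3 = boto3.client("s3")


def find_media_bucket(hint: str = "videopgvector") -> str:
    """Return the first bucket whose name contains the hint (assumed naming)."""
    for bucket in s3.list_buckets()["Buckets"]:
        if hint in bucket["Name"]:
            return bucket["Name"]
    raise RuntimeError(f"No bucket matching '{hint}' found")


def upload_video(path: str, bucket: str, key: str) -> None:
    """Upload the video; the S3 event notification starts the workflow."""
    s3.upload_file(path, bucket, key)


bucket_name = find_media_bucket()
upload_video("sample_video.mp4", bucket_name, "sample_video.mp4")
```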
Once the file upload is complete, the Step Functions workflow is automatically triggered. The pipeline will automatically:
- Extract audio and start transcription
- Process video frames and generate embeddings
- Store results in Aurora PostgreSQL
You can test the application in two ways:
- Query:
Open the notebook 01_query_audio_video_embeddings.ipynb and make queries directly to Aurora PostgreSQL, similar to what we did in the previous blog.
- Try the API:
Open the notebook 02_test_webhook.ipynb. This notebook demonstrates how to:
- Upload video files to the S3 bucket for processing
- Test the retrieval API endpoints with different query parameters
You can also see the status in the AWS Step Functions console.
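If you prefer to check progress from code instead, here is a short sketch using the Step Functions API; the "video" name filter is an assumption about the state machine's name:

```python
import boto3

sfn = boto3.client("stepfunctions")

# Find the workflow's state machine (name filter is an assumption)
machine = next(
    m for m in sfn.list_state_machines()["stateMachines"]
    if "video" in m["name"].lower()
)

# Show the five most recent executions and their status
for execution in sfn.list_executions(
    stateMachineArn=machine["stateMachineArn"], maxResults=5
)["executions"]:
    print(execution["name"], execution["status"])
```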

Test the retrieval API endpoints with different query parameters:
Set the method to one of the following:
- retrieve: For basic search functionality.
- retrieve_generate: For enhanced search with generated responses.
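Here is a sketch of calling the API with either method. The request body shape and the Cognito authorization header are assumptions; the 02_test_webhook.ipynb notebook shows the exact contract:

```python
import boto3
import requests

# API endpoint published by the retrieval stack
api_url = boto3.client("ssm").get_parameter(
    Name="/videopgvector/api_retrieve"
)["Parameter"]["Value"]


def ask(query: str, method: str, id_token: str) -> dict:
    """Call the retrieval API; method is 'retrieve' or 'retrieve_generate'."""
    response = requests.post(
        api_url,
        json={"query": query, "method": method},  # assumed body shape
        headers={"Authorization": id_token},      # Cognito ID token
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# ask("Who talks about pricing?", "retrieve", token)
# ask("Summarize the demo section", "retrieve_generate", token)
```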
This containerized implementation of Ask Your Video demonstrates how you can scale video content processing using AWS Step Functions and Amazon ECS. The parallel processing architecture significantly improves performance while maintaining cost efficiency through optimized resource utilization.
The solution provides several key advantages over traditional approaches:
- Scalability: Handle multiple video files simultaneously without resource constraints
- Reliability: Robust error handling and workflow orchestration through Step Functions
- Cost optimization: Pay only for the compute resources you use with Fargate
- Maintainability: Containerized components ensure consistent deployments across environments
The complete source code and deployment instructions are available in the GitHub repository. Try implementing this solution in your AWS environment and share your feedback on how it performs with your video content.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.