Building a RAG System for Video Content Search and Analysis

Learn how to build an application that transforms video content into searchable vectors using Amazon Bedrock, Amazon Transcribe, and Amazon Aurora PostgreSQL

Elizabeth Fuentes
Amazon Employee
Published Apr 10, 2025
Creating embeddings for images and text is a common practice in Retrieval-Augmented Generation (RAG)-based applications. However, handling video content presents unique challenges because videos combine thousands of frames with audio streams that can be converted to text. This post shows you how to transform this complex content into searchable vector representations.
Using Amazon Bedrock to invoke Amazon Titan foundation models for generating multimodal embeddings, Amazon Transcribe for converting speech to text, and Amazon Aurora PostgreSQL for vector storage and similarity search, you can build an application that understands both visual and audio content, enabling natural language queries to find specific moments in videos.
For now, this solution uses Amazon Transcribe; the blog will be updated to implement Amazon Nova Sonic.

Solution Architecture

Create the Amazon Aurora PostgreSQL cluster with this AWS CDK stack.
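As a rough sketch of what such a CDK stack can look like (construct IDs, engine version, and capacity values here are assumptions, not the exact stack from the repo), an Aurora PostgreSQL cluster can be defined like this:

```python
# Hypothetical sketch of an Aurora PostgreSQL cluster in AWS CDK (Python).
# Construct IDs, engine version, and capacity values are illustrative only.
from aws_cdk import Stack, aws_ec2 as ec2, aws_rds as rds
from constructs import Construct

class AuroraVectorStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc(self, "VectorVpc", max_azs=2)

        # Aurora PostgreSQL cluster; pgvector is enabled afterwards by running
        # "CREATE EXTENSION IF NOT EXISTS vector;" on the database.
        rds.DatabaseCluster(
            self,
            "AuroraVectorCluster",
            engine=rds.DatabaseClusterEngine.aurora_postgres(
                version=rds.AuroraPostgresEngineVersion.VER_15_4
            ),
            writer=rds.ClusterInstance.serverless_v2("writer"),
            serverless_v2_min_capacity=0.5,
            serverless_v2_max_capacity=2,
            vpc=vpc,
            default_database_name="videorag",
        )
```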

Visual Content Processing:

Extract frames: The VideoProcessor class uses the ffmpeg libavcodec library to process the video and create frames at one-second intervals (customizable through FPS settings).
You can customize the interval by changing the FPS value in the command.
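A minimal sketch of that extraction step using the ffmpeg CLI (the repo's VideoProcessor class may invoke the libraries differently):

```python
# Minimal frame-extraction sketch with the ffmpeg CLI (assumed approach;
# the VideoProcessor class in the repo may differ).
import subprocess
from pathlib import Path

def extract_frames(video_path: str, output_dir: str, fps: int = 1) -> list[str]:
    """Extract one frame every 1/fps seconds as JPEG files."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", f"fps={fps}",            # change the FPS value here to adjust the interval
            f"{output_dir}/frame_%05d.jpg",
        ],
        check=True,
    )
    return sorted(str(p) for p in Path(output_dir).glob("frame_*.jpg"))
```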
Generate embeddings: Embeddings are created for each extracted frame with the Amazon Titan Multimodal Embeddings G1 model using the Amazon Bedrock InvokeModel API.
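A hedged sketch of that call, using the standard InvokeModel request format for Titan Multimodal Embeddings G1 (amazon.titan-embed-image-v1); the embedding size is an assumption:

```python
# Sketch: generate a multimodal embedding for one frame with Amazon Titan
# Multimodal Embeddings G1 via the Bedrock InvokeModel API.
import base64
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def embed_frame(image_path: str, dimensions: int = 1024) -> list[float]:
    """Return the embedding vector for a single frame image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps({
            "inputImage": image_b64,
            "embeddingConfig": {"outputEmbeddingLength": dimensions},
        }),
    )
    return json.loads(response["body"].read())["embedding"]
```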
Select key frames: The CompareFrames class identifies significant visual changes when frame similarity falls below 0.8, using cosine similarity.
🚨 Keeping 97 out of 2,097 frames is a big difference, especially when it comes to storage costs.
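The key-frame selection logic can be sketched roughly like this (a simplified stand-in for the CompareFrames class: a frame is kept only when its cosine similarity to the last kept frame drops below 0.8):

```python
# Simplified key-frame selection: keep a frame when its embedding's cosine
# similarity to the previously kept frame falls below a threshold.
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_key_frames(embeddings: list[list[float]], threshold: float = 0.8) -> list[int]:
    """Return the indices of frames considered visually distinct."""
    if not embeddings:
        return []
    key_indices = [0]                          # always keep the first frame
    for i in range(1, len(embeddings)):
        if cosine_similarity(embeddings[key_indices[-1]], embeddings[i]) < threshold:
            key_indices.append(i)              # significant visual change detected
    return key_indices
```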
Key frame vector storage: Using the AuroraPostgres class, key frame embeddings are stored in Amazon Aurora PostgreSQL using pgvector, enabling efficient similarity search.
The list of image embeddings should look like this:
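Since the original screenshot isn't reproduced here, this is a hypothetical illustration of what one entry in that list might contain (all field names are assumptions):

```python
# Hypothetical shape of one image-embedding record (field names are illustrative).
{
    "id": "frame_00097",
    "source": "s3://my-bucket/video.mp4",
    "second": 97,                          # timestamp of the frame in the video
    "content_type": "image",
    "embedding": [0.0123, -0.0456, ...],   # 1024-dimensional vector
}
```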

For audio content:


Speech-to-Text Conversion: The AudioProcessing class extracts and processes audio using the Amazon Transcribe StartTranscriptionJob API. With IdentifyMultipleLanguages set to True, Transcribe uses Amazon Comprehend to identify the language in the audio. If you know the language of your media file, specify it instead with the LanguageCode parameter.
Setting the ShowSpeakerLabels parameter to True enables speaker partitioning (diarization) in the transcription output. Speaker partitioning labels the speech from individual speakers in the media file; include MaxSpeakerLabels to specify the maximum number of speakers, in this case 10.
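A sketch of that transcription call with boto3 (the job name, bucket, and object key are placeholders):

```python
# Sketch: start a transcription job with language identification and
# speaker diarization enabled (bucket, key, and job name are placeholders).
import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="video-rag-demo",
    Media={"MediaFileUri": "s3://my-bucket/audio/video-audio.mp3"},
    IdentifyMultipleLanguages=True,          # let Transcribe detect the language(s)
    # LanguageCode="en-US",                  # or set this if the language is known
    OutputBucketName="my-bucket",
    Settings={
        "ShowSpeakerLabels": True,           # enable speaker partitioning (diarization)
        "MaxSpeakerLabels": 10,              # up to 10 speakers
    },
)
```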
Text Segmentation: The transcript is divided into speaker segments, each tagged with the second at which it was spoken. This allows for precise alignment with corresponding video frames.
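A rough sketch of that segmentation step, assuming the transcript JSON follows the standard Transcribe output format in which each pronunciation item carries a speaker_label and start_time when diarization is enabled:

```python
# Rough sketch: group the Transcribe output into speaker segments with the
# second at which each segment starts (assumes the standard output format
# where items include speaker_label/start_time when diarization is enabled).
def build_segments(transcript_json: dict) -> list[dict]:
    segments = []
    current = None
    for item in transcript_json["results"]["items"]:
        if item["type"] != "pronunciation":
            continue                                     # skip punctuation items
        speaker = item.get("speaker_label", "spk_0")
        word = item["alternatives"][0]["content"]
        start = float(item["start_time"])
        if current is None or current["speaker"] != speaker:
            current = {"speaker": speaker, "second": int(start), "text": word}
            segments.append(current)
        else:
            current["text"] += " " + word
    return segments
```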
Generate Text Embeddings: Each transcript segment is processed using Amazon Titan Text Embeddings.
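A sketch of the text-embedding call (the Titan Text Embeddings V2 model ID shown here is an assumption; the notebook may use a different version):

```python
# Sketch: embed one transcript segment with Amazon Titan Text Embeddings
# via the Bedrock InvokeModel API (the model ID is an assumption).
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def embed_text(text: str) -> list[float]:
    """Return the embedding vector for a transcript segment."""
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]
```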
The list of text embeddings should look like this (the same structure as the image embeddings above):
Vector Storage: Text segment embeddings are stored in Amazon Aurora PostgreSQL using the AuroraPostgres class.
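A minimal sketch of that storage step with plain SQL and pgvector (the repo's AuroraPostgres class may use a different client, such as the RDS Data API; the table and column names here are assumptions matching the hypothetical record shown earlier):

```python
# Minimal pgvector storage sketch (hypothetical table/column names; the
# repo's AuroraPostgres class may work differently).
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS video_embeddings (
    id           text PRIMARY KEY,
    source       text,
    second       integer,
    content_type text,          -- 'image' or 'text'
    content      text,          -- transcript text or frame path
    embedding    vector(1024)
);
"""

def init_schema(conn) -> None:
    """Create the extension and table once."""
    with conn.cursor() as cur:
        cur.execute(DDL)
    conn.commit()

def store_embedding(conn, record: dict) -> None:
    """Insert one image or text embedding record."""
    vector_literal = "[" + ",".join(str(x) for x in record["embedding"]) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO video_embeddings (id, source, second, content_type, content, embedding)
            VALUES (%s, %s, %s, %s, %s, %s::vector)
            ON CONFLICT (id) DO NOTHING
            """,
            (record["id"], record["source"], record["second"],
             record["content_type"], record["content"], vector_literal),
        )
    conn.commit()
```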

Multimodal Search Capabilities:

Amazon Titan Multimodal Embeddings transforms both text and images into vector representations, enabling powerful cross-modal search capabilities. When you submit a text query, the system searches across both visual and audio content to find relevant moments in your videos.
The solution supports multiple search approaches:

Vector Similarity Search

Using the AuroraPostgres class and based on the pgvector repository, two types of searches can be performed:
- Cosine Similarity: Measures vector alignment (range: -1 to 1); higher scores (closer to 1) indicate greater similarity. Ideal for semantic matching across modalities.
I tested the notebook with my AWS re:Invent 2024 session, AI self-service support with knowledge retrieval using PostgreSQL. I searched for Aurora and it returned images and text segments that mention it:
There you can see my friend Guillermo Ruiz 😆
- L2 Distance (Euclidean): Measures spatial distance between vectors. Lower scores indicate closer matches. Useful for fine-grained similarity detection.
In this case, I found the same answer.
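A hedged sketch of both query types with the pgvector operators (`<=>` is cosine distance, `<->` is L2 distance); the table and column names follow the hypothetical schema sketched earlier:

```python
# Sketch: the two pgvector search modes (cosine similarity and L2 distance).
# Table/column names follow the hypothetical schema above.

COSINE_QUERY = """
SELECT id, content_type, content, second,
       1 - (embedding <=> %s::vector) AS cosine_similarity
FROM video_embeddings
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""

L2_QUERY = """
SELECT id, content_type, content, second,
       embedding <-> %s::vector AS l2_distance
FROM video_embeddings
ORDER BY embedding <-> %s::vector
LIMIT %s;
"""

def search(conn, query_embedding: list[float], k: int = 5, metric: str = "cosine"):
    """Return the k nearest records for a query embedding."""
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    sql = COSINE_QUERY if metric == "cosine" else L2_QUERY
    with conn.cursor() as cur:
        cur.execute(sql, (vector_literal, vector_literal, k))
        return cur.fetchall()
```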

Search by image

An image is provided as input, which I convert into a vector to perform the search in the Aurora PostgreSQL vector store.
I send this image for the search:
This is one of the answers that the query gave me:

Retrieval Augmented Generation (RAG)

Combines retrieval with context from both visual and audio content, providing natural language responses.
In the code you'll find a custom LangChain Retriever that allows you to retrieve both images and text as context.
When I use the custom retriever to search for my name, Elizabeth, I get a list of text and images where it is mentioned:
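A simplified sketch of what such a custom retriever can look like with the current langchain-core interface (this is not the exact class from the repo; it reuses the hypothetical search and embed_text helpers sketched earlier, and the metadata fields are assumptions):

```python
# Simplified custom retriever sketch (not the exact class from the repo).
# It wraps the vector search sketched above and returns LangChain Documents
# whose metadata says whether the hit is a transcript segment or a key frame.
from typing import Any, Callable, List
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

class VideoRetriever(BaseRetriever):
    conn: Any                              # database connection
    embed_text: Callable[[str], list]      # turns the query into a vector
    k: int = 5

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        query_embedding = self.embed_text(query)
        rows = search(self.conn, query_embedding, k=self.k)   # hypothetical helper above
        return [
            Document(
                page_content=content,
                metadata={"id": id_, "type": content_type, "second": second},
            )
            for id_, content_type, content, second, _score in rows
        ]
```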
To generate the response with a Large Language Model (LLM), which is the core of RAG, I use the Amazon Bedrock Converse API to invoke Amazon Nova Pro as the model_id, with the following system prompt:
Answer the user's questions based on the below context. If the context has an image, indicate that it can be reviewed for further feedback.If the context doesn't contain any relevant information to the question, don't make something up and just say "I don't know". (IF YOU MAKE SOMETHING UP BY YOUR OWN YOU WILL BE FIRED). For each statement in your response provide a [n] where n is the document number that provides the response.
I'm a very tough boss 🤣
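A sketch of the generation step with the Converse API; how the retrieved documents are flattened into the user message here is an assumption:

```python
# Sketch: answer generation with the Bedrock Converse API and Amazon Nova Pro.
# The way retrieved documents are packed into the prompt is an assumption.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# The system prompt quoted above.
SYSTEM_PROMPT = """Answer the user's questions based on the below context. If the context has an image, indicate that it can be reviewed for further feedback. If the context doesn't contain any relevant information to the question, don't make something up and just say "I don't know". (IF YOU MAKE SOMETHING UP BY YOUR OWN YOU WILL BE FIRED). For each statement in your response provide a [n] where n is the document number that provides the response."""

def answer(question: str, docs: list) -> str:
    """Build a numbered context from the retrieved documents and call Nova Pro."""
    context = "\n\n".join(
        f"Document [{i + 1}] ({d.metadata['type']}): {d.page_content}"
        for i, d in enumerate(docs)
    )
    response = bedrock_runtime.converse(
        modelId="amazon.nova-pro-v1:0",
        system=[{"text": SYSTEM_PROMPT}],
        messages=[{"role": "user", "content": [{"text": f"{context}\n\nQuestion: {question}"}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```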
Using RAG, I can ask more complex questions like: What is the session about?
The retriever delivered this context list:
This is the final answer that I received:
The session appears to be about discussing the challenges and frustrations faced during customer service interactions, particularly with an airline. The speaker highlights issues such as long wait times, call drops, lack of continuity with agents, repetitive information requests, and disconnected experiences across different communication channels. Additionally, the session touches on the disappointment with a modern chatbot that failed to provide a better experience due to system failures and lack of personalization. The overall theme seems to be the poor customer experience resulting from these disconnected and inefficient service channels. [1][2]
The repo also includes a small library, created with Amazon Q CLI, that allows you to upload a local video to an Amazon S3 bucket.
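If you'd rather not use that helper, a plain boto3 upload does the same job (the bucket name and key are placeholders):

```python
# Plain boto3 alternative for uploading a local video to S3
# (bucket name and object key are placeholders).
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="my-session.mp4",
    Bucket="my-video-bucket",
    Key="videos/my-session.mp4",
)
```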
I hope you find it useful and can try it out with your own videos. Share your experience in this post; I'd love to hear from you.
Get your Builder ID to use Amazon Q Developer for free.

Implementation Notes

While this notebook demonstrates the core functionality, implementing a demo solution requires a more robust architecture. To address this need, I've developed "Ask Your Video" - a complete serverless solution that you can deploy using AWS Cloud Development Kit (CDK) 😏.
In the next part of this series, I'll provide a detailed, step-by-step explanation of how "Ask Your Video" works. Until then, you can explore the complete solution, including deployment instructions and architecture diagrams, in this GitHub repository.
Stay tuned for Part 2!
Thanks,
Eli
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
