Building a RAG System for Video Content Search and Analysis


Learn how to build an application that transforms video content into searchable vectors using Amazon Bedrock, Amazon Transcribe, and Amazon Aurora PostgreSQL

Elizabeth Fuentes
Amazon Employee
Published Apr 10, 2025
Creating embeddings for images and text is a common practice in Retrieval-Augmented Generation (RAG) based applications. However, handling video content presents unique challenges because videos combine thousands of frames with audio streams that can be converted to text. This post shows you how to transform this complex content into searchable vector representations.
Using Amazon Bedrock to invoke Amazon Titan Foundation Models for generating multimodal embeddings, Amazon Transcribe for converting speech to text, and Amazon Aurora PostgreSQL for vector storage and similarity search, you can build an application that understands both visual and audio content, enabling natural language queries to find specific moments in videos.
For now we use Amazon Transcribe, but this blog will be updated to implement Amazon Nova Sonic.

Solution Architecture

Create the Amazon Aurora PostgreSQL cluster with this AWS CDK stack.
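Once the cluster exists, it needs the pgvector extension and the table that the rest of the notebook writes to. A minimal setup sketch, run here through the AuroraPostgres helper described later (called aurora); the schema and column names come from the insert code further down, while the column types and the 1024 dimension are assumptions:

setup_sql = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE SCHEMA IF NOT EXISTS bedrock_integration;
CREATE TABLE IF NOT EXISTS bedrock_integration.knowledge_bases (
    id uuid PRIMARY KEY,
    embedding vector(1024),  -- must match the embedding dimension used in the Bedrock calls
    chunks text,
    time int,
    metadata json,
    date text,
    source text,
    sourceurl text,
    topic text,
    content_type text,
    language varchar(10)
)
"""
# Run each statement separately through the helper (assumed to wrap the RDS Data API)
for statement in setup_sql.split(";"):
    if statement.strip():
        aurora.execute_statement(statement)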

Visual Content Processing:

Extract frames: A VideoProcessor class uses the ffmpeg command-line tool (built on libavcodec) to process the video and extract frames at one-second intervals. You can customize the interval by changing the fps value in the command.
command = [
    'ffmpeg',
    '-v', 'quiet',
    '-stats',
    '-i', file_location,
    '-vf', 'fps=1,scale=1024:-1',
    '-y',
    f'{output_dir}/sec_%05d.jpg'
]
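For context, a minimal sketch of how this command list could be executed from Python with subprocess (the VideoProcessor internals are not shown in the post, so the wrapper below is an assumption):

import subprocess
from pathlib import Path

def extract_frames(file_location: str, output_dir: str) -> list:
    """Run ffmpeg to write one JPEG per second of video into output_dir."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    command = [
        'ffmpeg',
        '-v', 'quiet',
        '-stats',
        '-i', file_location,
        '-vf', 'fps=1,scale=1024:-1',
        '-y',
        f'{output_dir}/sec_%05d.jpg'
    ]
    subprocess.run(command, check=True)  # raises CalledProcessError if ffmpeg fails
    return sorted(str(p) for p in Path(output_dir).glob('sec_*.jpg'))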
Generate embeddings: For each extracted frame, an embedding is created with the Amazon Titan Multimodal Embeddings G1 model using the Amazon Bedrock InvokeModel API.
import base64
import json

import boto3

# Shared setup used across the notebook (values shown here as examples)
bedrock_runtime = boto3.client("bedrock-runtime")
default_model_id = "amazon.titan-embed-image-v1"  # Amazon Titan Multimodal Embeddings G1
embedding_dimension = 1024

def get_image_embeddings(image_bytes):
    # Titan Multimodal Embeddings expects the image as a base64-encoded string
    input_image = base64.b64encode(image_bytes).decode('utf8')

    body = json.dumps({"inputImage": input_image, "embeddingConfig": {"outputEmbeddingLength": embedding_dimension}})
    response = bedrock_runtime.invoke_model(
        body=body,
        modelId=default_model_id,
        accept="application/json",
        contentType="application/json",
    )
    response_body = json.loads(response.get("body").read())
    return response_body.get("embedding")
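A short usage sketch, assuming the extract_frames helper sketched above and frames read from disk:

frame_paths = extract_frames("video.mp4", "./tmp/frames")  # hypothetical paths

frame_vectors = []
for path in frame_paths:
    with open(path, "rb") as f:
        frame_vectors.append(get_image_embeddings(f.read()))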
Select key frames: The CompareFrames class identifies significant visual changes using cosine similarity, keeping a new key frame whenever similarity to the current reference frame falls below 0.8.
def filter_relevant_frames(self, vectors, difference_threshold=0.8):
    selected_frames = []
    current_index = 0

    for index, vec in enumerate(vectors):
        # Compare each frame against the current reference frame
        sim = self.cosine_similarity(vectors[current_index], vec)
        if sim < difference_threshold:
            # The scene changed enough: keep the reference frame and start a new one
            selected_frames.append(current_index)
            current_index = index

    selected_frames.append(current_index)
    return selected_frames
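The post doesn't show the cosine_similarity helper; a minimal sketch with NumPy of how CompareFrames might implement it (an assumption):

import numpy as np

def cosine_similarity(self, a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))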
🚨 Keeping only 97 of 2,097 frames is a big reduction, especially when it comes to storage costs.
Key frame vector storage: Using the AuroraPostgres class, key frame embeddings are stored in Amazon Aurora PostgreSQL with pgvector, enabling efficient similarity search.
def insert(self, rows):
    for row in rows:
        sql = f"\
            INSERT INTO bedrock_integration.knowledge_bases (id, embedding, chunks, time, metadata, date, source, sourceurl, topic, content_type, language) \
            VALUES ('{row['id']}', '{row['embedding']}', '{row['chunks']}','{row['time']}','{row['metadata']}', '{row['date']}', \
            '{row['source']}','{row['sourceurl']}','{row['topic']}','{row['content_type']}', '{row['language']}')"

        self.execute_statement(sql)
The result is a list of image embeddings, one entry per key frame.

For audio content:


Speech-to-Text Conversion: The AudioProcessing class extracts and processes the audio using the Amazon Transcribe StartTranscriptionJob API. With IdentifyLanguage set to True, Transcribe uses Amazon Comprehend to identify the language spoken in the audio; if you already know the language of your media file, specify it with the LanguageCode parameter instead.
Setting ShowSpeakerLabels to True enables speaker partitioning (diarization) in the transcription output, which labels the speech from individual speakers in the media file. MaxSpeakerLabels specifies the maximum number of speakers, in this case 10.
transcribe_client.start_transcription_job(
    TranscriptionJobName=job_name,
    IdentifyLanguage=True,
    OutputBucketName=bucket,
    OutputKey=f"{prefix}/{file}/transcribe.json",
    Media={'MediaFileUri': s3_uri},
    Settings={
        'ShowSpeakerLabels': True,
        'MaxSpeakerLabels': 10
    })
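StartTranscriptionJob is asynchronous, so the notebook has to wait for the job to finish before reading the result from S3. A minimal sketch of that wait-and-fetch step (my own assumption about how AudioProcessing handles it):

import json
import time

import boto3

transcribe_client = boto3.client("transcribe")
s3_client = boto3.client("s3")

def wait_for_transcript(job_name, bucket, key, poll_seconds=10):
    """Poll the transcription job until it finishes, then load the JSON output from S3."""
    while True:
        job = transcribe_client.get_transcription_job(TranscriptionJobName=job_name)
        status = job["TranscriptionJob"]["TranscriptionJobStatus"]
        if status == "COMPLETED":
            break
        if status == "FAILED":
            raise RuntimeError(job["TranscriptionJob"].get("FailureReason", "Transcription failed"))
        time.sleep(poll_seconds)

    obj = s3_client.get_object(Bucket=bucket, Key=key)
    return json.loads(obj["Body"].read())

# Example: transcript_json = wait_for_transcript(job_name, bucket, f"{prefix}/{file}/transcribe.json")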
Text Segmentation: The transcript is divided into speaker segments, each tagged with the second at which it was spoken. This allows precise alignment with the corresponding video frames.
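A sketch of this segmentation step, assuming the standard Transcribe output layout (results.speaker_labels.segments for speaker turns and results.items for the words; punctuation items carry no timestamps). The exact parsing in the notebook may differ:

def segment_by_speaker(transcript_json):
    """Group the transcript into per-speaker segments tagged with their start second."""
    results = transcript_json["results"]
    segments_out = []
    items = iter(results["items"])
    item = next(items, None)

    for seg in results["speaker_labels"]["segments"]:
        end = float(seg["end_time"])
        words = []
        # Consume words (and trailing punctuation) until we pass the end of this speaker turn
        while item is not None:
            if item["type"] == "pronunciation" and float(item["start_time"]) > end:
                break
            words.append(item["alternatives"][0]["content"])
            item = next(items, None)
        if words:
            segments_out.append({
                "speaker": seg["speaker_label"],
                "second": int(float(seg["start_time"])),
                "text": " ".join(words),
            })
    return segments_out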
Generate Text Embeddings: Each transcript segment is embedded with the same Amazon Titan Multimodal Embeddings model (which also accepts plain text input), so text and image vectors share the same embedding space.
def get_text_embeddings(text):
    body = json.dumps({"inputText": text, "embeddingConfig": {"outputEmbeddingLength": embedding_dimension}})
    response = bedrock_runtime.invoke_model(
        body=body,
        modelId=default_model_id,
        accept="application/json",
        contentType="application/json",
    )
    response_body = json.loads(response.get("body").read())
    return response_body.get("embedding")
The result is a list of text embeddings, one per transcript segment.
Vector Storage: Text segment embeddings are stored in Amazon Aurora PostgreSQL using the AuroraPostgres class.
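For context, a sketch of how a transcript segment could be turned into a row for the insert method shown earlier. The column values mirror the metadata seen later in the retrieval examples; helper names such as segment_by_speaker and the source/topic values are assumptions:

import uuid
from datetime import datetime, timezone

rows = []
for segment in segment_by_speaker(transcript_json):
    rows.append({
        "id": str(uuid.uuid4()),
        "embedding": get_text_embeddings(segment["text"]),
        "chunks": segment["text"],
        "time": segment["second"],
        "metadata": json.dumps({"speaker": segment["speaker"], "second": segment["second"]}),
        "date": datetime.now(timezone.utc).isoformat(),
        "source": s3_uri,           # the video object in S3 (assumption)
        "sourceurl": transcript_url,  # the transcribe.json location (assumption)
        "topic": "DEV315",          # hypothetical label for the video
        "content_type": "text",
        "language": "en",
    })

aurora.insert(rows)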

Multimodal Search Capabilities:

Amazon Titan Multimodal Embeddings transforms both text and images into vector representations, enabling powerful cross-modal search capabilities. When you submit a text query, the system searches across both visual and audio content to find relevant moments in your videos.
The solution supports multiple search approaches:

Vector Similarity Search

Using the AuroraPostgres class, and based on the pgvector repository, two types of searches can be performed:
- Cosine Similarity: Measures vector alignment (range: -1 to 1); higher scores (closer to 1) indicate greater similarity. Ideal for semantic matching across modalities.
search_query = "<your question>"
docs = retrieve(search_query, how="cosine", k=10)
I tested the notebook with my AWS re:Invent 2024 session, AI self-service support with knowledge retrieval using PostgreSQL. I asked about Aurora and it returned frames and transcript segments that mention it. One of the returned frames shows my friend Guillermo Ruiz 😆
text:
memory . A place where all the information is stored and can easily be retrievable , and that's where the vector database comes in . This is the the first building block . And a vector database stores and retrieves data in the form of vector embeddeds or mathematical representations . This allows us to find similarities between data rather than relying on the exact keyword match that is what usually happens up to today . This is essential for systems like retrieval ofmented generation or RAC , which combines external knowledge with the AI response to deliver those accurate and context aware response . And by the way , I think yesterday we announced the re-rank API for RAC . So now your rack applications , you can score and it will prioritize those documents that have the most accurate information . So at the end will be even faster and cheaper building rack . We're gonna use Amazon Aurora postgrade SQL with vector support that will give us a scalable and fully managed solution for our AI tasks .
similarity:0.5754164493071239
metadata:{"speaker":"spk_0","second":321}
- L2 Distance (Euclidean): Measures spatial distance between vectors. Lower scores indicate closer matches. Useful for fine-grained similarity detection; both measures map directly to pgvector operators, as sketched after the example below.
search_query = "<your question>"
docs = retrieve(search_query, how="l2", k=10)
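A sketch of what the AuroraPostgres similarity search might run under the hood, using pgvector's operators (<=> is cosine distance, <-> is L2 distance). The table and column names come from the code in this post; the method body itself is an assumption, not the actual implementation:

def similarity_search(self, vector, how="cosine", k=10):
    """Return the k nearest rows using the requested pgvector operator."""
    vector_literal = str(vector)  # pgvector accepts the '[x, y, ...]' text format
    if how == "cosine":
        # <=> is cosine distance; 1 - distance gives the similarity reported in the results
        measure = f"1 - (embedding <=> '{vector_literal}') AS similarity"
        order_by = f"embedding <=> '{vector_literal}'"
    else:  # "l2"
        measure = f"embedding <-> '{vector_literal}' AS distance"  # <-> is Euclidean distance
        order_by = f"embedding <-> '{vector_literal}'"

    sql = f"""
        SELECT id, chunks, time, metadata, source, sourceurl, content_type, {measure}
        FROM bedrock_integration.knowledge_bases
        ORDER BY {order_by}
        LIMIT {k}
    """
    return self.execute_statement(sql)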
In this case, I found the same answer.

Search by image

An image is given as input, which I convert into a vector to perform the search against the Aurora PostgreSQL vector store.
docs = retrieve(videomanager.read_image_from_local(one_image), how="cosine", k=3)
I sent an image as the query, and one of the answers the query returned was a matching frame from the video.

Retrieval-Augmented Generation

This approach combines retrieval of context from both visual and audio content with generation, providing natural language responses.
In the code you'll find a modification of a LangChain custom Retriever that lets you retrieve both images and text as context.
def _get_relevant_documents(
    self, query: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
    """Sync implementations for retriever."""
    search_vector = embedding_generation.get_embeddings(query)
    result = aurora.similarity_search(search_vector, how=self.how, k=self.k)
    rows = json.loads(result.get("formattedRecords"))

    matching_documents = []

    for row in rows:
        document_kwargs = dict(
            metadata=dict(**json.loads(row.get("metadata")),
                          content_type=row.get("content_type"),
                          source=row.get("sourceurl")))

        if self.how == "cosine":
            document_kwargs["similarity"] = row.get("similarity")
        elif self.how == "l2":
            document_kwargs["distance"] = row.get("distance")

        if row.get("content_type") == "text":
            matching_documents.append(Document(page_content=row.get("chunks"), **document_kwargs))
        if row.get("content_type") == "image":
            matching_documents.append(Document(page_content=row.get("source"), **document_kwargs))

    return matching_documents
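A short usage sketch, assuming the class holding this method is called something like VideoRetriever (the actual class name isn't shown in the post); since it's a LangChain retriever, it can be called through the standard invoke method:

# Hypothetical class name; `how` and `k` mirror the retrieve() calls above
retriever = VideoRetriever(how="cosine", k=10)
docs = retriever.invoke("Elizabeth")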
When I use the custom retriever to search for my name, Elizabeth, I get a list of text and images that mention it:
[Document(metadata={'second': 1009, 'content_type': 'image', 'source': 's3://MY-BUCKET/videos/DEV315.mp4'}, page_content='./tmp/DEV315/sec_01010.jpg'),
Document(metadata={'speaker': 'spk_0', 'second': 1081, 'content_type': 'text', 'source': 'https://MY-BUCKET/embeddings-demo-1234/videos/DEV315.mp4/transcribe.json'}, page_content="AI . We always say , if you train everything to have generic information , responses will not be good , but if you provide the context and deep dive on the things that you wanted to take , it will really reply the way we want . And now that we have everything , it's time for the fun part that is how we build the AI agents , and I will let Ellie to take over and speak about this . So"),
Document(metadata={'speaker': 'spk_0', 'second': 1, 'content_type': 'text', 'source': 'https://MY-BUCKET/embeddings-demo-1234/videos/DEV315.mp4/transcribe.json'}, page_content="Good afternoon . Uh , thanks for joining our session . My name is Guiller Mitz and I'm senior developer advocate at AWS and with me today is my colleague Elizabeth Fuentes . Um , Have you guys ever made a mistake that is worth storytelling ? Last year we were invited for the first time to speak here at Livent . And we were preparing the whole trip , booking the flights , getting our bags packed and ready to go , but suddenly things turn a little bit over when we found that we booked to the to the wrong Vegas . Vegas , New Mexico . Yeah , you laugh but 700 miles away from here . I never heard that it was another Vegas in the Mapa . It's funny because it's just a single letter . That changed the whole experience coming to reinvent . It also makes me sure that I need to wear some glasses to start reading . But with all this , The next logical step is like , gosh , I need to change tickets , I need to start calling support . So we went with the call support and I'm not sure how it goes on your side , but the experience is something I never forget but not for the right reasons ."),
Document(metadata={'speaker': 'spk_1', 'second': 2089, 'content_type': 'text', 'source': 'https://MY-BUCKET/embeddings-demo-1234/videos/DEV315.mp4/transcribe.json'}, page_content='JavaScript . And thankThanks sir .')]
To generate the response with a Large Language Model (LLM), which is the core of RAG, I use the Amazon Bedrock Converse API to invoke Amazon Nova Pro as the model_id, with the following system prompt:
Answer the user's questions based on the below context. If the context has an image, indicate that it can be reviewed for further feedback.If the context doesn't contain any relevant information to the question, don't make something up and just say "I don't know". (IF YOU MAKE SOMETHING UP BY YOUR OWN YOU WILL BE FIRED). For each statement in your response provide a [n] where n is the document number that provides the response.
I'm a very tough boss 🤣
def answer(model_id, system_prompt, content) -> str:
    """Get a completion from the model via the Bedrock Converse API.

    Returns:
        str: Model completion text
    """

    # Invoke model
    kwargs = dict(
        modelId=model_id,
        inferenceConfig=dict(maxTokens=max_tokens),
        messages=[
            {
                "role": "user",
                "content": content,
            }
        ],
    )

    kwargs["system"] = [{"text": system_prompt}]

    response = bedrock_runtime.converse(**kwargs)

    return response.get("output", {}).get("message", {}).get("content", [])
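A sketch of how the retrieved documents and the question could be assembled into the content list for the Converse call; the numbering matches the [n] citations requested by the system prompt, while the Nova Pro model ID and the context formatting are assumptions:

# docs: documents returned by the retriever; search_query: the user question
context = "\n\n".join(
    f"Document [{i + 1}]: {doc.page_content}" for i, doc in enumerate(docs)
)
content = [{"text": f"Question: {search_query}\n\nContext:\n{context}"}]

max_tokens = 1000  # assumed value for the free variable used in answer()
response_blocks = answer(
    model_id="amazon.nova-pro-v1:0",  # assumed Amazon Nova Pro model ID
    system_prompt=system_prompt,      # the system prompt shown above
    content=content,
)
print(response_blocks[0]["text"])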
Using RAG, I can ask more complex questions like: What is the session about?
The retriever delivered this context list:
[Document(metadata={'second': 133, 'content_type': 'image', 'source': 's3://MY-BUCKET/videos/DEV315.mp4'}, page_content='./tmp/DEV315/sec_00134.jpg'),
Document(metadata={'speaker': 'spk_0', 'second': 73, 'content_type': 'text', 'source': 'https://MY-BUCKET/embeddings-demo-1234/videos/DEV315.mp4/transcribe.json'}, page_content="not for the right reasons . The wait time usually feels endless . And we got several call drops and you have to recall again and try to get with the same agent that was speaking before rather than telling again the whole story that I'm an idiot and I lost the tickets to a different site . And also we got here a lot of repeating information and they didn't have really the history on that side . So it felt really disconnected . Not sure if you guys have gone over this thing that you felt like in a loop , a never ending story , but this uh airline had a new modern chatbot that you could speak directly with an agent . So we went and tried the agent . It's like , OK , this is a modern thing , maybe it goes much better . But something that we hope that was interactive and fun , it turned also into a problem . Systems were failing , they were not able to load that information there . A lot of questions again , what's your name , Guillermo , which is the flight , and all these things coming on ."),
Document(metadata={'second': 609, 'content_type': 'image', 'source': 's3://MY-BUCKET/videos/DEV315.mp4'}, page_content='./tmp/DEV315/sec_00610.jpg'),
Document(metadata={'speaker': 'spk_0', 'second': 132, 'content_type': 'text', 'source': 'https://MY-BUCKET/embeddings-demo-1234/videos/DEV315.mp4/transcribe.json'}, page_content="things coming on . The problem is that Oh , it went over huh . The thing is , how we feel all these channels , how they are disconnected , right ? First is the inconsistency there . How often have you reached out and depends on the interaction , you sometimes get quick answers , but sometimes it's like kind of a hell . We also have lack of personalization . Traditional systems treats you as anybody else . They don't understand the unique needs or their requirements that you may have on that side , but also there's limited knowledge . And the support agents usually don't have all the database and all the information in a single place , go and fix . It also turns into a pain point on that side . And last but not least is the data security . Uh , here the problem is that with all the interactions we have , we are giving them too much information and we don't really know if there's gonna be a data breach on that side . With all these things together , it really turned into a poor customer experience .")]
This is the final answer that I received:
The session appears to be about discussing the challenges and frustrations faced during customer service interactions, particularly with an airline. The speaker highlights issues such as long wait times, call drops, lack of continuity with agents, repetitive information requests, and disconnected experiences across different communication channels. Additionally, the session touches on the disappointment with a modern chatbot that failed to provide a better experience due to system failures and lack of personalization. The overall theme seems to be the poor customer experience resulting from these disconnected and inefficient service channels. [1][2]
The repo has a small library that I created using Amazon Q CLI that lets you upload a local video to an Amazon S3 bucket.
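For reference, the upload itself boils down to something like this with boto3 (a sketch, not the library's actual code; bucket and prefix are placeholders):

import boto3

s3_client = boto3.client("s3")

def upload_video(local_path, bucket, prefix="videos"):
    """Upload a local video file to S3 and return its s3:// URI."""
    key = f"{prefix}/{local_path.split('/')[-1]}"
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"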
I hope you find it useful and can try it out with your own videos. Share your experience in this post; I'd love to hear from you.
Get your AWS Builder ID to use Amazon Q Developer for free.

Implementation Notes

While this notebook demonstrates the core functionality, a deployable demo requires a more robust architecture. To address this need, I've developed "Ask Your Video", a complete serverless solution that you can deploy using the AWS Cloud Development Kit (CDK) 😏.
In the next part of this series, I'll provide a detailed, step-by-step explanation of how "Ask Your Video" works. Until then, you can explore the complete solution, including deployment instructions and architecture diagrams, in this GitHub repository.
Stay tuned for Part 2!
Gracias,
Eli
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
