Chat with videos and video understanding with generative AI
A blog post about the open-source, deployable Video Understanding Solution: its use cases and the architecture, science, and engineering behind it.
system_prompt = "You are an expert in extracting information from video frames. Each video frame is an image. You will extract the scene, text, and caption."
task_prompt = "Extract information from this image and output a JSON with this format:\n" \
"{\n" \
"\"scene\" : \"String\",\n" \
"\"caption\" : \"String\",\n" \
"\"text\" : [\"String\", \"String\" , . . .],\n" \
"}\n" \
"For \"scene\", look carefully, think hard, and describe what you see in the image in detail, yet succinct. \n" \
"For \"caption\", look carefully, think hard, and give a SHORT caption (3-8 words) that best describes what is happening in the image. This is intended for visually impaired ones. \n" \
"For \"text\", list the text you see in that image confidently. If nothing return empty list.\n"
vqa_response = self.call_vqa(image_data=base64.b64encode(image).decode("utf-8"), system_prompt = system_prompt, task_prompt=task_prompt)
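For context, the snippet below is a minimal sketch of what a call_vqa helper could look like when the frame analysis runs on Amazon Bedrock. It assumes the Anthropic Claude 3 Messages API and a model ID chosen purely for illustration; the actual helper in the solution may be structured differently.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def call_vqa(image_data, system_prompt, task_prompt,
             model_id="anthropic.claude-3-sonnet-20240229-v1:0"):  # model ID assumed for illustration
    # Build a multimodal request: the base64-encoded frame plus the task prompt
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "system": system_prompt,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/jpeg",
                            "data": image_data}},
                {"type": "text", "text": task_prompt},
            ],
        }],
    }
    response = bedrock_runtime.invoke_model(modelId=model_id, body=json.dumps(body))
    # The model is prompted to answer with the JSON document described in task_prompt
    return json.loads(response["body"].read())["content"][0]["text"]

The returned text can then be parsed as JSON to recover the scene, caption, and text fields for that frame.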
70.0:Scene:The image shows a blue wooden desk or table in a warehouse or storage facility. The desk has a sign on it that says 'Amazonians around the world still use door desks today'.
70.0:Texts:Amazonians around the world still use door desks today
70.8:Voice: Speaker 3 in en-US: You
70.9:Voice: Speaker 3 in en-US: go
71.0:Scene:The image shows a simple wooden desk or table with a text message displayed on it. The desk is placed in a setting with what appears to be curtains or fabric in the background, suggesting an indoor or enclosed space. The text message on the desk is the main focus of the image.
71.7:Voice: Speaker 3 in en-US: write
71.8:Voice: Speaker 3 in en-US: yourself
72.0:Scene:The image shows a blue wooden desk or table with a text overlay on it. The desk is placed in what appears to be a warehouse or storage facility, with shelves and other equipment visible in the background.
72.2:Voice: Speaker 3 in en-US: a
72.2:Voice: Speaker 3 in en-US: door
72.4:Voice: Speaker 3 in en-US: desk
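To show how entries like the ones above can be assembled, here is a minimal sketch that merges per-frame analysis results and transcription words into one timestamp-ordered timeline text. The frame_results and voice_segments structures are hypothetical illustrations rather than the solution's actual data model.

def build_timeline(frame_results, voice_segments):
    # frame_results: list of (timestamp, {"scene": str, "caption": str, "text": [str, ...]})
    # voice_segments: list of (timestamp, speaker_label, language_code, word)
    entries = []
    for ts, vqa in frame_results:
        entries.append((ts, f"{ts}:Scene:{vqa['scene']}"))
        if vqa.get("text"):
            entries.append((ts, f"{ts}:Texts:{' '.join(vqa['text'])}"))
    for ts, speaker, lang, word in voice_segments:
        entries.append((ts, f"{ts}:Voice: {speaker} in {lang}: {word}"))
    # Order everything by timestamp so the model reads the video chronologically
    entries.sort(key=lambda entry: entry[0])
    return "\n".join(text for _, text in entries)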
"You are an expert video analyst who reads a Video Timeline and creates summary of the video.\n" \
"The Video Timeline is a text representation of a video.\n" \
"The Video Timeline contains the visual scenes, the visual texts, human voice, celebrities, and human faces in the video.\n" \
"Visual objects (objects) represents what objects are visible in the video at that second. This can be the objects seen in camera, or objects from a screen sharing, or any other visual scenarios.\n" \
"Visual scenes (scene) represents the description of how the video frame look like at that second.\n" \
"Visual texts (text) are the text visible in the video. It can be texts in real world objects as recorded in video camera, or those from screen sharing, or those from presentation recording, or those from news, movies, or others. \n" \
"Human voice (voice) is the transcription of the video.\n" \
. . .
prompt = f"The video has {number_of_chunks} parts.\n\n" \
f"The below Video Timeline is only for part {chunk_number+1} of the video.\n\n" \
f"{core_prompt}\n\n" \
"Below is the summary of all previous part/s of the video:\n\n" \
f"{self.video_rolling_summary}\n\n" \
"<Task>\n" \
"Describe the summary of the video so far in paragraph format.\n" \
"In your summary, retain important details from the previous part/s summaries. Your summary MUST include the summary of all parts of the video so far.\n"\
"You can make reasonable extrapolation of the actual video given the Video Timeline.\n" \
"DO NOT mention 'Video Timeline' or 'video timeline'.\n" \
"Give the summary directly without any other sentence.\n" \
f"{self.prompt_translate()}\n" \
"</Task>\n\n"
chunk_summary = self.call_llm(system_prompt, prompt, prefilled_response, stop_sequences=["<Task>"])
self.video_rolling_summary = chunk_summary
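As a rough illustration of how this rolling summary can be driven, the sketch below splits the Video Timeline into chunks that fit the model's context window and folds each chunk's summary into self.video_rolling_summary before the next call. The chunk size and the summarize_chunk helper name are assumptions for illustration, not the solution's actual code.

def summarize_video(self, video_timeline, max_chars_per_chunk=100_000):
    # Split the timeline into context-window-sized chunks (size assumed for illustration)
    chunks = [video_timeline[i:i + max_chars_per_chunk]
              for i in range(0, len(video_timeline), max_chars_per_chunk)]
    number_of_chunks = len(chunks)
    self.video_rolling_summary = ""
    for chunk_number, chunk_timeline in enumerate(chunks):
        # summarize_chunk is a hypothetical wrapper around the prompt shown above;
        # it returns the summary of the video up to and including this chunk
        self.video_rolling_summary = self.summarize_chunk(
            chunk_timeline, chunk_number, number_of_chunks)
    return self.video_rolling_summary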
- clip generation (stitching together real segments of the original video) based on a semantic query;
- ability to use native video-to-text models;
- prompt customization from the UI;
- ability to synthesize speech that succinctly explains the video scenes to help visually impaired people;
- APIs for video Q&A over WebSocket so the chat function can be accessed programmatically;
- visual and/or semantic search within the video;
- video categorization (auto-labeling) support;
- analytics functions and integration with Amazon QuickSight;
- live video feed support.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.