
Chat with videos and video understanding with generative AI

A blog post about the open source, deployable Video Understanding Solution: its use cases and the architecture, science, and engineering behind it.

Published Nov 28, 2024
Authored by Yudho Ahmad Diponegoro (Sr. Solutions Architect at AWS) and Pengfei Zhang (Sr. Solutions Architect at AWS)
Videos can contain visual scenes, in-video text, and voice. With this multi-modality, a great deal of information can be derived from a video and used for various use cases. Think of summarizing videos based on both visuals and speech, extracting information, making highlights or clips, video Q&A, search and recognition, and categorization.
Video Understanding Solution is a deployable open source code sample which uses AWS (Amazon Web Services) services for video understanding. It leverages generative AI to extract information from videos and generate rich metadata which can then be used for specific purposes. The solution is built with the AWS CDK (Cloud Development Kit) and can be deployed as is or customized for your own purposes before deployment.

Functions and use cases

Users can use this solution to generate video summaries in various languages, for videos with both visuals and voice as well as visual-only videos. Users can also extract the entities (e.g. companies, concepts) mentioned in the videos, along with the sentiment and the reason for it. Users can chat with AI about the video, including requests like "Write an interesting tagline about the video." The functionality also includes finding a specific segment of the video with its start/stop seconds and searching the stored videos semantically, for example, for videos about Amazon's culture and history.
In terms of use cases, you can think of various possibilities. For example, you can use this solution to generate summaries of video recordings. You can also potentially use it to identify trends or your brand sentiment from posted short videos. Further, you can save time by asking and getting answers from longer videos without having to watch them. You can also extend the solution to make clips of specific parts of a video. It is important to adhere to the acceptable use terms or usage policies of the various providers whose products are used in this solution, including but not limited to AWS, Anthropic, and Cohere.
This solution is open source. You can customize the solution or extend it by adding your own components before deploying it into your own environment. This includes modifying the infrastructure, the code, or the prompts.
The Video Understanding Solution was showcased at AWS Summits in Singapore, Thailand, and Indonesia. Speaking with the attendees, I identified possible use cases across industries including media, advertising, education, and even real estate.

The architecture, science, and engineering behind it

The solution uses multiple AWS services and large language models (LLMs) through Amazon Bedrock, as shown in the diagram below.
Video Understanding Solution architecture
Once deployed, users can upload videos to the designated bucket in Amazon Simple Storage Service (S3), either directly or through the Video Understanding Solution UI. Uploads through the solution UI use multipart upload for big videos to accelerate the upload process. If you want to connect your existing application with this solution, you can also configure your main application to upload videos to this designated bucket so they can be processed by the Video Understanding Solution.
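For instance, here is a minimal sketch of such an upload with boto3, where multipart upload is handled automatically for larger files (the bucket and key names are hypothetical, not the solution's actual ones):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files above the threshold are uploaded as multiple parts in parallel,
# which speeds up large video uploads.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # switch to multipart above 8 MB
    multipart_chunksize=8 * 1024 * 1024,  # 8 MB per part
    max_concurrency=8,                    # parts uploaded concurrently
)

s3.upload_file(
    Filename="my-video.mp4",
    Bucket="video-understanding-raw-videos",  # hypothetical bucket name
    Key="raw/my-video.mp4",
    Config=config,
)
```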
Once the videos are uploaded, a series of asynchronous processes is triggered and orchestrated by AWS Step Functions. This includes speech-to-text transcription by Amazon Transcribe, unless it is disabled. You can choose to disable transcription to optimize cost and processing time when the videos have no audio channel. An AWS Lambda function is also triggered to insert an entry for the video into the database in Amazon Aurora PostgreSQL.
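As an illustration of the transcription step (not the solution's actual Step Functions task code; the bucket, key, job name, and settings are illustrative assumptions), a transcription job could be started like this:

```python
import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="my-video-transcription",
    Media={"MediaFileUri": "s3://video-understanding-raw-videos/raw/my-video.mp4"},
    MediaFormat="mp4",
    IdentifyLanguage=True,  # detect the spoken language automatically
    Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 10},  # distinguish speakers
    OutputBucketName="video-understanding-raw-videos",
    OutputKey="transcripts/my-video.json",
)
```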
The asynchronous process also includes the main processing, which runs in a container orchestrated by AWS Fargate. This process uses FFmpeg through OpenCV to extract video frames into images, in parallel using the concurrent.futures module. The Python code then sends these images, along with a crafted prompt asking for the visual information, to the Claude 3 Haiku LLM through Amazon Bedrock.
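Here is a rough sketch of that general technique; the sampling interval, prompt text, and model settings below are illustrative assumptions, not the solution's actual code or prompt:

```python
import base64
import concurrent.futures
import json

import boto3
import cv2

bedrock = boto3.client("bedrock-runtime")

def extract_frame_jpeg(video_path: str, second: int) -> bytes:
    """Grab the frame at the given second and encode it as JPEG."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, second * 1000)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise ValueError(f"No frame at {second}s")
    ok, jpeg = cv2.imencode(".jpg", frame)
    return jpeg.tobytes()

def describe_frames(jpeg_frames: list) -> str:
    """Send sampled frames plus a prompt to Claude 3 Haiku through Bedrock."""
    content = [
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": base64.b64encode(f).decode(),
            },
        }
        for f in jpeg_frames
    ]
    content.append({
        "type": "text",
        "text": "Describe the visual scenes and any visible text in these frames as JSON.",
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": content}],
        }),
    )
    return json.loads(response["body"].read())["content"][0]["text"]

# Sample one frame per second in parallel with concurrent.futures,
# then describe a small batch of frames with the LLM.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    frames = list(pool.map(lambda s: extract_frame_jpeg("my-video.mp4", s), range(0, 20)))
print(describe_frames(frames[:10]))
```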
The LLM returns a JSON string with information on the scenes and more. When needed, you can customize the prompts in this part of the code and redeploy to ask the LLM to extract specific information. The main processing also calls Amazon Rekognition, unless disabled, to process the image frames and get additional information.
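For example, a minimal sketch of detecting in-frame text with Amazon Rekognition, reusing the hypothetical extract_frame_jpeg helper from the sketch above:

```python
import boto3

rekognition = boto3.client("rekognition")

frame_bytes = extract_frame_jpeg("my-video.mp4", 12)  # helper from the previous sketch
response = rekognition.detect_text(Image={"Bytes": frame_bytes})

# Keep only full lines of detected text (Rekognition also returns individual words).
detected_lines = [t["DetectedText"] for t in response["TextDetections"] if t["Type"] == "LINE"]
print(detected_lines)
```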
The main processing combines all the information extracted by the LLMs and AWS services and sorts it by video timing. This can include visual scenes, voice, text detected in the video, and more. The combined information is written to a rich metadata file in S3, which your other applications can also consume for further custom processing. This file also powers multiple features in the Video Understanding Solution, such as Q&A and finding a specific segment in the video about something.
The metadata has a simple line-oriented structure. Each line starts with the number of seconds into the video, followed by the category of the information: a visual scene, visual text detected in the video frame, or voice. For voice, the solution tries to distinguish the speaker of the lines and detect the language used.
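As a made-up illustration of that layout (not the actual metadata generated by the solution), an excerpt might look like this:

```
12 [visual scene] A presenter stands next to an architecture diagram on a screen.
12 [visual text] "Video Understanding Solution"
14 [voice] Speaker 1 (en-US): Let me walk you through how the solution processes a video.
```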
This metadata may look fragmented to a human reader. However, with the attention mechanism used by many LLMs today, an LLM should be able to link these fragments, find what is important, develop correlations, and build an internal temporal understanding of how the video plays from beginning to end. This was evident when I tested it with questions like "Show the part where <something> appears in the video": the solution returned the second at which the requested scene appears. Also, while the voice metadata is scattered across different lines in the metadata file, the LLM was able to answer questions like "What did <somebody> say about <something>?"
This metadata format also allows the LLM to link information across metadata types. For example, in the last question mentioned above, the person's name was shown as visual/overlay text in the video when that person appears and speaks. The fact that the LLM could answer that question shows that it was able to infer the correlation between the voice metadata and the visual metadata. Of course, all of this depends on the choice of LLM. This solution uses the Claude 3 Haiku and Claude 3 Sonnet models. [Note: the choice of models was based on availability at the launch time of the solution. The solution may use newer model versions, such as the Claude 3.5 model family, in the future.]
By default, the main processing uses the Claude 3 Sonnet model to generate a summary and a list of extracted entities with the associated sentiments and the reasons for them, which can be viewed in the Video Understanding Solution UI or accessed directly in S3.
For the summary generation, the system prompt instructs the model on how to summarize the combined metadata; the full prompt is part of the repository code.
When a video is long enough that the metadata would exceed the context window of the LLM, a rolling/chaining technique is used. The Python code splits the metadata into chunks and generates a summary for each chunk, calling the LLM to summarize the chunks so far and including that running summary in the prompt for the next chunk. The prompt for an intermediate chunk (neither the first nor the last) carries this running summary forward; note that XML tags are used as one of the prompt engineering techniques advised for Claude models. A sketch of this technique is shown below.
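This is a minimal sketch of the rolling/chaining technique; the chunk size and prompt wording are illustrative assumptions rather than the solution's actual values:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

def call_claude(prompt: str) -> str:
    """Invoke Claude 3 Sonnet on Amazon Bedrock with a single user message."""
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 2048,
            "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
        }),
    )
    return json.loads(response["body"].read())["content"][0]["text"]

def summarize_long_metadata(metadata: str, chunk_chars: int = 50_000) -> str:
    """Summarize metadata that exceeds the context window, chunk by chunk."""
    chunks = [metadata[i:i + chunk_chars] for i in range(0, len(metadata), chunk_chars)]
    running_summary = ""
    for i, chunk in enumerate(chunks):
        is_last = i == len(chunks) - 1
        # XML tags delimit the running summary and the current chunk, as advised
        # for prompting Claude models.
        prompt = (
            f"<summary_so_far>{running_summary}</summary_so_far>\n"
            f"<video_metadata_chunk>{chunk}</video_metadata_chunk>\n"
            + ("Write the final summary of the whole video."
               if is_last
               else "Update the summary so far using this metadata chunk. Keep it concise.")
        )
        running_summary = call_claude(prompt)
    return running_summary
```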
The summary of the video is converted into an embedding vector using the Cohere Embed Multilingual model through Amazon Bedrock and stored with the pgvector extension in the Aurora PostgreSQL cluster. This allows searches to be performed in the Video Understanding Solution UI to find relevant videos using semantic search.
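As a sketch of that flow (the table and column names, connection string, and SQL are hypothetical, not the solution's actual schema):

```python
import json

import boto3
import psycopg2

bedrock = boto3.client("bedrock-runtime")

def embed(text: str, input_type: str) -> list:
    """Embed text with the Cohere Embed Multilingual model through Amazon Bedrock."""
    response = bedrock.invoke_model(
        modelId="cohere.embed-multilingual-v3",
        body=json.dumps({"texts": [text], "input_type": input_type}),
    )
    return json.loads(response["body"].read())["embeddings"][0]

# Embed the search query with the matching input type for retrieval.
query_vector = embed("videos about Amazon's culture and history", "search_query")

conn = psycopg2.connect("dbname=videos user=app")  # hypothetical connection string
with conn.cursor() as cur:
    # pgvector's <=> operator orders rows by distance to the query vector,
    # so the closest (most semantically similar) videos come first.
    cur.execute(
        "SELECT video_name FROM videos ORDER BY embedding <=> %s::vector LIMIT 5",
        (json.dumps(query_vector),),
    )
    print(cur.fetchall())
```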

Demos

While the video, the generated summary, the extracted entities and sentiments, and the rich metadata can be accessed directly from the S3 bucket, these data and additional functionalities can also be accessed through the deployed UI. This section demonstrates the functionalities using the UI.
The solution generates summaries for the videos, as shown in the screenshot below. The animated demo is available here.
Summary generation
The solution also helps extract the entities discussed or mentioned in the video, as shown below and demoed here.
Entity extraction
The video Q&A or chat function is shown in the screenshot below and in the full demo here.
Chat

Roadmap

In addition to the existing functionalities, I am also considering implementing the following features based on our interactions with customers and developers:
  1. clip generation (with actual video segment stitching) from the original video using a semantic query;
  2. ability to use native video-to-text models;
  3. allowing prompt customization from the UI;
  4. ability to synthesize speech that succinctly explains the video scenes to help visually impaired users;
  5. APIs for the video Q&A with WebSocket to allow the chat function to be accessed programmatically;
  6. visual and/or semantic search within the video;
  7. video categorization (auto-labeling) support;
  8. analytics function and integration with Amazon QuickSight;
  9. live video feed support.
Contributions are welcome; please submit pull requests to the Video Understanding Solution repository.

Conclusion

The Video Understanding Solution is a deployable, open source solution on AWS that extracts information from videos using generative AI and provides multiple functionalities, including summary generation and video Q&A (chat). Users can self-deploy the solution and extend or customize it as needed.
Please visit the solution repository to learn about the prerequisites and the deployment steps. Feel free to ask questions in the comments. Thank you for reading.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
