Chat with videos and video understanding with generative AI
A blog post about the open-source, deployable Video Understanding Solution: its use cases and the architecture, science, and engineering behind it.
system_prompt = "You are an expert in extracting information from video frames. Each video frame is an image. You will extract the scene, text, and caption."
task_prompt = "Extract information from this image and output a JSON with this format:\n" \
"{\n" \
"\"scene\" : \"String\",\n" \
"\"caption\" : \"String\",\n" \
"\"text\" : [\"String\", \"String\" , . . .],\n" \
"}\n" \
"For \"scene\", look carefully, think hard, and describe what you see in the image in detail, yet succinct. \n" \
"For \"caption\", look carefully, think hard, and give a SHORT caption (3-8 words) that best describes what is happening in the image. This is intended for visually impaired ones. \n" \
"For \"text\", list the text you see in that image confidently. If nothing return empty list.\n"
vqa_response = self.call_vqa(image_data=base64.b64encode(image).decode("utf-8"), system_prompt = system_prompt, task_prompt=task_prompt)
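For context, the snippet below is a minimal sketch of what a call_vqa helper could look like when the frame analysis runs on Amazon Bedrock. It assumes the Anthropic Claude 3 Messages API and a model ID chosen purely for illustration; the actual helper in the solution may be structured differently.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def call_vqa(image_data, system_prompt, task_prompt,
             model_id="anthropic.claude-3-sonnet-20240229-v1:0"):  # model ID assumed for illustration
    # Build a multimodal request: the base64-encoded frame plus the task prompt
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "system": system_prompt,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/jpeg",
                            "data": image_data}},
                {"type": "text", "text": task_prompt},
            ],
        }],
    }
    response = bedrock_runtime.invoke_model(modelId=model_id, body=json.dumps(body))
    # The model is prompted to answer with the JSON document described in task_prompt
    return json.loads(response["body"].read())["content"][0]["text"]

The returned text can then be parsed as JSON to recover the scene, caption, and text fields for that frame.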
70.0:Scene:The image shows a blue wooden desk or table in a warehouse or storage facility. The desk has a sign on it that says 'Amazonians around the world still use door desks today'.
70.0:Texts:Amazonians around the world still use door desks today
70.8:Voice: Speaker 3 in en-US: You
70.9:Voice: Speaker 3 in en-US: go
71.0:Scene:The image shows a simple wooden desk or table with a text message displayed on it. The desk is placed in a setting with what appears to be curtains or fabric in the background, suggesting an indoor or enclosed space. The text message on the desk is the main focus of the image.
71.7:Voice: Speaker 3 in en-US: write
71.8:Voice: Speaker 3 in en-US: yourself
72.0:Scene:The image shows a blue wooden desk or table with a text overlay on it. The desk is placed in what appears to be a warehouse or storage facility, with shelves and other equipment visible in the background.
72.2:Voice: Speaker 3 in en-US: a
72.2:Voice: Speaker 3 in en-US: door
72.4:Voice: Speaker 3 in en-US: desk
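To show how entries like the ones above can be assembled, here is a minimal sketch that merges per-frame analysis results and transcription words into one timestamp-ordered timeline text. The frame_results and voice_segments structures are hypothetical illustrations rather than the solution's actual data model.

def build_timeline(frame_results, voice_segments):
    # frame_results: list of (timestamp, {"scene": str, "caption": str, "text": [str, ...]})
    # voice_segments: list of (timestamp, speaker_label, language_code, word)
    entries = []
    for ts, vqa in frame_results:
        entries.append((ts, f"{ts}:Scene:{vqa['scene']}"))
        if vqa.get("text"):
            entries.append((ts, f"{ts}:Texts:{' '.join(vqa['text'])}"))
    for ts, speaker, lang, word in voice_segments:
        entries.append((ts, f"{ts}:Voice: {speaker} in {lang}: {word}"))
    # Order everything by timestamp so the model reads the video chronologically
    entries.sort(key=lambda entry: entry[0])
    return "\n".join(text for _, text in entries)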
"You are an expert video analyst who reads a Video Timeline and creates summary of the video.\n" \
"The Video Timeline is a text representation of a video.\n" \
"The Video Timeline contains the visual scenes, the visual texts, human voice, celebrities, and human faces in the video.\n" \
"Visual objects (objects) represents what objects are visible in the video at that second. This can be the objects seen in camera, or objects from a screen sharing, or any other visual scenarios.\n" \
"Visual scenes (scene) represents the description of how the video frame look like at that second.\n" \
"Visual texts (text) are the text visible in the video. It can be texts in real world objects as recorded in video camera, or those from screen sharing, or those from presentation recording, or those from news, movies, or others. \n" \
"Human voice (voice) is the transcription of the video.\n" \
. . .
prompt = f"The video has {number_of_chunks} parts.\n\n" \
f"The below Video Timeline is only for part {chunk_number+1} of the video.\n\n" \
f"{core_prompt}\n\n" \
"Below is the summary of all previous part/s of the video:\n\n" \
f"{self.video_rolling_summary}\n\n" \
"<Task>\n" \
"Describe the summary of the video so far in paragraph format.\n" \
"In your summary, retain important details from the previous part/s summaries. Your summary MUST include the summary of all parts of the video so far.\n"\
"You can make reasonable extrapolation of the actual video given the Video Timeline.\n" \
"DO NOT mention 'Video Timeline' or 'video timeline'.\n" \
"Give the summary directly without any other sentence.\n" \
f"{self.prompt_translate()}\n" \
"</Task>\n\n"
chunk_summary = self.call_llm(system_prompt, prompt, prefilled_response, stop_sequences=["<Task>"])
self.video_rolling_summary = chunk_summary
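As a rough illustration of how this rolling summary can be driven, the sketch below splits the Video Timeline into chunks that fit the model's context window and folds each chunk's summary into self.video_rolling_summary before the next call. The chunk size and the summarize_chunk helper name are assumptions for illustration, not the solution's actual code.

def summarize_video(self, video_timeline, max_chars_per_chunk=100_000):
    # Split the timeline into context-window-sized chunks (size assumed for illustration)
    chunks = [video_timeline[i:i + max_chars_per_chunk]
              for i in range(0, len(video_timeline), max_chars_per_chunk)]
    number_of_chunks = len(chunks)
    self.video_rolling_summary = ""
    for chunk_number, chunk_timeline in enumerate(chunks):
        # summarize_chunk is a hypothetical wrapper around the prompt shown above;
        # it returns the summary of the video up to and including this chunk
        self.video_rolling_summary = self.summarize_chunk(
            chunk_timeline, chunk_number, number_of_chunks)
    return self.video_rolling_summary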
- clip generation (stitching together real segments of the original video) based on a semantic query;
- ability to use native video-to-text models;
- prompt customization from the UI;
- ability to synthesize speech that succinctly explains the video scenes to help visually impaired people;
- APIs for video Q&A over WebSocket so the chat function can be accessed programmatically;
- visual and/or semantic search within the video;
- video categorization (auto-labeling) support;
- analytics functions and integration with Amazon QuickSight;
- live video feed support.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.