Extracting the Best Insights from AWS re:Invent with Amazon Bedrock – Part 1
In this first post, we explore leveraging Amazon Bedrock to build an automated system that extracts and summarizes insights from video content.
Published Dec 13, 2024
As many of you may know, re:Invent is one of the biggest cloud conferences in the world. This conference brings global talent to Las Vegas, NV, where attendees can experience new AWS services firsthand as they are announced throughout the week.
The most significant announcements regarding AWS are delivered during the keynotes—the major presentations that take place at the beginning of each day. As exciting as this may sound, these keynotes tend to last between 2 and 2.5 hours. Considering there are about four keynotes distributed across the week, this amounts to approximately 10 hours of content. While this is incredibly insightful for individuals pursuing careers in tech and cloud computing, it is also highly time-consuming. Additionally, if you forget key details from the talks, you might find yourself needing to rewatch the recordings.
If you have read any scientific papers before, you might remember that the first part of any paper is called an “abstract.” An abstract provides a concise summary of the research paper's key points, methodology, and findings, allowing readers to quickly grasp the main concepts without reading the entire document. Following this same principle, I wanted to create a solution that could provide a quick summary of lengthy video content.
That’s why I designed a small architecture that leverages Amazon Bedrock to extract the most important insights from a recording by simply providing a YouTube URL. While working on this project, I realized that it had the potential to scale into an almost fully automated system. So I decided to make this a series of blog posts where I start with an MVP—a straightforward, easy-to-implement solution—and then progressively enhance it to make the system more robust, reliable, and automated.
For the first version of this architecture, I am performing as much processing as possible on my local computer. This approach helps me better understand the various costs associated with AWS by starting with a simple concept rather than migrating an entire system to the cloud.
To begin, I am manually downloading the YouTube video transcript using the Python library `youtube_transcript_api`. After providing the video ID, the captions of the video are downloaded as a list of dictionaries containing several attributes, such as `text`, `start`, and `duration`. Since I am only interested in the text portion, I use the `map` function to extract just the text from each entry and discard the other attributes. Finally, I store the extracted texts in a `.txt` file.
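A minimal sketch of that step looks like this; the video ID and output file name are placeholders, and the `youtube_transcript_api` interface may differ slightly between library versions:

```python
from youtube_transcript_api import YouTubeTranscriptApi

video_id = "VIDEO_ID_HERE"  # placeholder: the ID portion of the YouTube URL

# Each entry is a dict with "text", "start", and "duration" keys.
transcript = YouTubeTranscriptApi.get_transcript(video_id)

# Keep only the text portion of each caption segment.
texts = list(map(lambda entry: entry["text"], transcript))

# Store the extracted texts in a .txt file for later use as model context.
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts))
```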
Now that I have the text from the YouTube video transcript, I can use a Foundation Model (FM) to analyze and summarize the content. For this task, I decided to use Anthropic’s Claude 3 Sonnet v1 model due to its capabilities in complex reasoning and analysis.
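For this first experiment I used the console Playground (described next), but the equivalent call can also be made programmatically through the Bedrock Runtime `Converse` API. The sketch below is a rough illustration only: the prompt wording, region, and transcript file name are assumptions, and the model ID corresponds to Claude 3 Sonnet v1.

```python
import boto3

# Bedrock Runtime client; region is an assumption.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Load the transcript saved in the previous step.
with open("transcript.txt", "r", encoding="utf-8") as f:
    transcript_text = f.read()

# Placeholder prompt: ask for the key details and most important points.
prompt = (
    "Extract the key details and most important points from the following "
    "keynote transcript:\n\n" + transcript_text
)

response = bedrock.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)

# Print the model's summary.
print(response["output"]["message"]["content"][0]["text"])
```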
After opening the FM in the AWS Console Playground, I entered the following prompt, provided the transcript manually, and clicked on Run:
After the execution was complete, this was the model’s response about the most important highlights of the video:
Here are the key details and important points from the transcript:
But I couldn’t just take the model’s response for granted; I had to validate its accuracy since this was my first experiment. Even though this project started as an automated agent to summarize lengthy videos, I still watched the entire keynote recording one more time to ensure the model didn’t generate any hallucinations—which it didn’t!
This is considered a huge win. Thanks to the transcript provided as part of the prompt, the model was able to generate a valid and coherent response that aligned with the content of the video. Without the transcript, the model could have produced random text that might not have been entirely related to the keynote’s content.
One of the main advantages of running this new project from scratch is that I implemented it in a brand-new AWS account, literally an account that has had $0 expenses so far. This will help me understand firsthand the actual costs involved in running this kind of task, as GenAI applications tend to be a bit expensive due to the costly compute resources required to use foundational models.
After running a couple of prompts, this is what I saw in the Billing and Cost Management panel.
Two executions cost about $0.25. While this may not seem like a significant expense upfront, if we imagine a scenario where this application serves a thousand requests per day, the cost could grow to approximately $125 per day. Some possible alternatives, which will be explored in future posts, include switching to different foundational models with more favorable pricing. For now, since we are in the experimentation phase, this cost is manageable—or is it?
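For reference, that projection comes from simple arithmetic, sketched here with an assumed load of 1,000 requests per day:

```python
# Back-of-the-envelope check of the projection above: ~$0.25 for two
# executions, scaled to a hypothetical 1,000 requests per day.
cost_per_request = 0.25 / 2   # ~$0.125 per execution
daily_requests = 1_000        # assumed load for the scenario
print(f"Projected daily cost: ${cost_per_request * daily_requests:.2f}")
# Projected daily cost: $125.00
```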
Since this was the first step in our MVP, the architecture remains quite simple: the only AWS service in use so far is Bedrock, which provides access to the foundational models. Everything else runs on a local computer, as follows:
As a first iteration, I think this fulfills the purpose of providing a quick response. However, it could be improved by automating parts of the workflow, such as obtaining the transcript automatically by passing the video ID through an API Gateway endpoint and running our code in a Lambda function. This is exactly what we will tackle in the next post!