AI-powered video summarizer with Amazon Bedrock

Explore how to use Amazon Bedrock with Anthropic's Claude to build a YouTube video summarizer

Published Jan 4, 2024
At times, I find myself wanting to quickly get a summary of a video or capture the key points of a tech talk. Thanks to the capabilities of generative AI, achieving this is entirely possible with minimal effort.
In this article, I’ll walk you through the process of creating a service that summarizes YouTube videos based on their transcripts and generates audio from these summaries.
We’ll leverage Anthropic’s Claude 2.1 foundation model through Amazon Bedrock for summary generation, and Amazon Polly to synthesize speech from these summaries.

I will use an AWS Step Functions state machine to orchestrate the steps involved in summary and audio generation:
🔍 Let’s break this down:
  • The ‘Get Video Transcript’ function retrieves the transcript from a specified YouTube video URL. Upon successful retrieval, the transcript is stored in an S3 bucket, ready for processing in the next step.
  • The ‘Generate Model Parameters’ function retrieves the transcript from the bucket and generates the prompt and inference parameters specific to Anthropic’s Claude 2.1 model. These parameters are then stored in the bucket for use by the Bedrock API in the subsequent step.
  • Invoking the Bedrock API is achieved through Step Functions’ AWS SDK integration, enabling the execution of model inference with inputs stored in the bucket. This step generates structured JSON containing the summary.
  • The ‘Generate Audio from Summary’ step relies on Amazon Polly to perform speech synthesis from the summary produced in the previous step. It returns the final output containing the video summary in text format, as well as a presigned URL for the generated audio file.
  • The bucket serves as state storage shared across all the steps of the state machine. We don’t know the size of the generated video transcript upfront; for lengthy videos, it might exceed the Step Functions payload size limit of 256 KB.
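To make the 256 KB constraint concrete, here is a small sketch (the helper name is mine, not from the repository) that checks whether a JSON payload would fit within the Step Functions limit:

```typescript
// Step Functions caps state input/output payloads at 256 KB, which is
// why the transcript goes through S3 instead of being passed between
// states directly.
const STEP_FUNCTIONS_PAYLOAD_LIMIT_BYTES = 256 * 1024;

const fitsInPayload = (payload: unknown): boolean =>
  Buffer.byteLength(JSON.stringify(payload), "utf8") <=
  STEP_FUNCTIONS_PAYLOAD_LIMIT_BYTES;
```

A transcript of a long tech talk can easily exceed this limit once serialized, so only S3 keys travel through the state machine.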

At the time of writing, the Claude 2.1 model supports a 200K-token context window, an estimated word count of 150K. It also provides good accuracy over long documents, making it well suited for summarizing lengthy video transcripts.
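A rough way to check that a transcript fits in that context window is to estimate its token count from its word count. The helper below is my own approximation (about 0.75 words per token, matching the 150K-words-for-200K-tokens estimate above), not code from the article's repository:

```typescript
// Heuristic: ~0.75 words per token (150K words ≈ 200K tokens).
const CLAUDE_2_1_CONTEXT_TOKENS = 200_000;
const WORDS_PER_TOKEN = 0.75;

const estimateTokens = (text: string): number => {
  // Count whitespace-separated words, ignoring empty strings.
  const wordCount = text.split(/\s+/).filter(Boolean).length;
  return Math.ceil(wordCount / WORDS_PER_TOKEN);
};

const fitsInContext = (text: string): boolean =>
  estimateTokens(text) <= CLAUDE_2_1_CONTEXT_TOKENS;
```

A real token count depends on the model's tokenizer, but this heuristic is enough to flag transcripts that would need truncation or chunking.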

You will find the complete source code here 👇
I will use Node.js, TypeScript, and the AWS CDK for IaC.

Amazon Bedrock offers a range of foundation models, including Amazon Titan, Anthropic’s Claude, Meta’s Llama 2, and others, which are accessible through the Bedrock APIs. By default, these foundation models are not enabled; they must be enabled through the console before use.
We’ll request access to Anthropic’s Claude models. But first, we’ll need to submit use case details:
Request Anthropic’s Claude access

I will rely on this lib for the video transcript extraction (it feels like a cheat code 😉). This library uses an unofficial YouTube API without relying on a headless Chrome solution. For now, it yields good results on several YouTube videos, but I might explore a more robust solution in the future:
import { storeTranscript } from "adapters/transcript-repository";
import { YoutubeTranscript } from "youtube-transcript";

export const handler = async (event: {
  youtubeVideoUrl: string;
  requestId: string;
}) => {
  const { youtubeVideoUrl, requestId } = event;
  const transcript = await YoutubeTranscript.fetchTranscript(youtubeVideoUrl);
  const sentences = Array.from(getSentencesFromYoutubeTranscript(transcript));

  await storeTranscript(requestId, sentences.join("\n"));
};

// Groups the raw transcript chunks into sentences, using a trailing
// period as the sentence boundary.
function* getSentencesFromYoutubeTranscript(transcript: { text: string }[]) {
  let currentSentence: string[] = [];
  for (const { text } of transcript) {
    currentSentence.push(text);

    if (text.endsWith(".")) {
      yield currentSentence.join(" ").replaceAll("\n", " ");
      currentSentence = [];
    }
  }
  // Flush any trailing words that did not end with a period.
  if (currentSentence.length > 0) {
    yield currentSentence.join(" ").replaceAll("\n", " ");
  }
}
The extracted transcript is then stored in the S3 bucket using ${requestId}/transcript as a key.
You can find the code for this lambda function here

At the time of writing, Bedrock only supports Claude’s Text Completions API. Prompts must be wrapped in \n\nHuman: and \n\nAssistant: markers so that Claude understands the conversation context.
Here is the prompt; I find that it produces good results for our use case:
You are a video transcript summarizer.
Summarize this transcript in a third person point of view in 10 sentences.
Identify the speakers and the main topics of the transcript and add them in the output as well.
Do not add or invent speaker names if you are not able to identify them.
Please output the summary in JSON format conforming to this JSON schema:
{
  "type": "object",
  "properties": {
    "speakers": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "topics": {
      "type": "string"
    },
    "summary": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  }
}

<transcript>{{transcript}}</transcript>
🤖 Helping Claude produce good results:
  • To clearly mark the transcript to summarize, we wrap it in <transcript/> XML tags. Claude specifically focuses on the content encapsulated by these tags. I will substitute the {{transcript}} placeholder with the actual video transcript.
  • To assist Claude in generating a reliable JSON output format, I include in the prompt the JSON schema that needs to be adhered to.
  • Finally, I also need to tell Claude to generate only a concise JSON response without unnecessary chattiness, meaning without a preamble or postscript around the JSON payload:
\n\nHuman:{{prompt}}\n\nAssistant:{
Note that the full prompt ends with a trailing {
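Putting the pieces together, the final prompt could be assembled like this (a minimal sketch; buildPrompt and its arguments are illustrative names, not the exact code from the repository):

```typescript
// Substitutes the transcript placeholder, wraps the prompt in the
// Text Completions markers, and primes the answer with "{" so Claude
// completes the JSON payload instead of adding a preamble.
const buildPrompt = (instructions: string, transcript: string): string => {
  const filled = instructions.replace("{{transcript}}", transcript);
  return `\n\nHuman:${filled}\n\nAssistant:{`;
};
```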
As mentioned in the section above, we store the generated prompt, along with the model parameters, in the bucket so that it can be used as input to the Bedrock API:
const modelParameters = {
  prompt,
  max_tokens_to_sample: prompt.length,
  top_k: 250,
  top_p: 1,
  temperature: 0.2,
  stop_sequences: ["Human:"],
  anthropic_version: "bedrock-2023-05-31",
};
You can follow this link for the full code of the generate-model-parameters lambda function.

In this step, we’ll avoid writing a custom Lambda function to invoke the Bedrock API. Instead, we’ll use Step Functions’ direct SDK integration. This state loads from the bucket the model inference parameters generated in the previous step:
new CustomState(this, "bedrock-invoke-model", {
  stateJson: {
    Type: "Task",
    Resource: "arn:aws:states:::bedrock:invokeModel",
    Parameters: {
      ModelId: "anthropic.claude-v2:1",
      Input: {
        "S3Uri.$": "$.Payload.modelParameters",
      },
      ContentType: "application/json",
    },
    ResultSelector: {
      "id.$": "$$.Execution.Name",
      "summaryTaskResult.$":
        "States.StringToJson(States.Format('\\{{}', $.Body.completion))",
    },
  },
});
☝️ Note: Since we primed the assistant’s turn with an opening {, the completion in the API response is missing that leading brace; Claude outputs only the rest of the requested JSON.
We use intrinsic functions in the state’s ResultSelector to add the missing opening curly brace and format the state output as a well-formed JSON payload:
ResultSelector: {
  "id.$": "$$.Execution.Name",
  "summaryTaskResult.$":
    "States.StringToJson(States.Format('\\{{}', $.Body.completion))",
}
I have to admit it is not ideal, but it helps us get by without writing a custom Lambda function.
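For clarity, here is a plain TypeScript equivalent of what those two intrinsic functions accomplish (repairCompletion is an illustrative name, not part of the project):

```typescript
// States.Format('\\{{}', completion) prepends the missing "{";
// States.StringToJson then parses the result into a JSON object.
const repairCompletion = (completion: string): unknown =>
  JSON.parse(`{${completion}`);
```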

This step is heavily inspired by this previous blog post. Amazon Polly generates the audio from the video summary:
import {
  getPubliclyAvailableUrl,
  storeAudio,
} from "adapters/audio-summary-repository";
import { synthesize } from "adapters/speech-synthesis";

export const handler = async (event: SummaryTaskOutput) => {
  const audio = await synthesize(event.summaryTaskResult);
  await storeAudio(event.id, audio);

  return {
    videoSummary: {
      ...event.summaryTaskResult,
      audioUrl: await getPubliclyAvailableUrl(event.id),
    },
  };
};
Here are the details of the synthesize function:
import { PollyClient, SynthesizeSpeechCommand } from "@aws-sdk/client-polly";

const polly = new PollyClient({});

export const synthesize = async (data: { topics: string; summary: string[] }) => {
  const audioBuffers: Buffer[] = [];
  for (const sentence of data.summary) {
    // Add an SSML break so Polly pauses between sentences.
    const sentenceWithBreak = `${sentence} <break strength="x-strong" />`;

    // Polly limits the input size, so split long sentences into chunks.
    const paragraphBuffers = await Promise.all(
      chunkString(sentenceWithBreak, 1500).map((chunk) => {
        return polly
          .send(
            new SynthesizeSpeechCommand({
              OutputFormat: "mp3",
              TextType: "ssml",
              Text: `<speak>${chunk}</speak>`,
              Engine: "neural",
              VoiceId: "Joanna",
              LanguageCode: "en-US",
            })
          )
          .then((data) => data.AudioStream.transformToByteArray())
          .then((byteArray) => Buffer.from(byteArray));
      })
    );
    audioBuffers.push(...paragraphBuffers);
  }

  // Merge the MP3 fragments into a single buffer.
  return audioBuffers.reduce(
    (total, buffer) => Buffer.concat([total, buffer]),
    Buffer.alloc(0)
  );
};
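The chunkString helper used above is not shown in the article; here is a minimal sketch of one possible implementation (the name matches the call site, but the body is my assumption, and a production version should avoid splitting inside a word or an SSML tag):

```typescript
// Splits a string into consecutive chunks of at most `size` characters.
const chunkString = (input: string, size: number): string[] => {
  const chunks: string[] = [];
  for (let i = 0; i < input.length; i += size) {
    chunks.push(input.slice(i, i + size));
  }
  return chunks;
};
```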
Once the audio is generated, we store it in the S3 bucket and generate a presigned URL so it can be downloaded afterwards.
☝️ On language detection: in this example, I am not performing language detection; by default, I assume the video is in English. You can find in my previous article how to perform such a process in speech synthesis. Alternatively, we can also leverage Claude’s capabilities to detect the language of the transcript.

Alright, let’s put it all together and let’s take a look at the CDK definition of the state machine:
const failState = new Fail(this, "fail");
const successState = new Succeed(this, "success");

const chainDefinition = new LambdaInvoke(this, "get-video-transcript", {
  lambdaFunction: getVideoTranscriptLambda,
  payload: TaskInput.fromObject({
    "requestId.$": "$$.Execution.Name",
    "youtubeVideoUrl.$": "$.youtubeVideoUrl",
  }),
})
  .addCatch(failState)
  .next(
    new LambdaInvoke(this, "generate-model-parameters", {
      lambdaFunction: generateModelParameters,
      payload: TaskInput.fromObject({
        "requestId.$": "$$.Execution.Name",
      }),
    }).addCatch(failState)
  )
  .next(
    new CustomState(this, "bedrock-invoke-model", {
      stateJson: {
        Type: "Task",
        Resource: "arn:aws:states:::bedrock:invokeModel",
        Parameters: {
          ModelId: "anthropic.claude-v2:1",
          Input: {
            "S3Uri.$": "$.Payload.modelParameters",
          },
          ContentType: "application/json",
        },
        ResultSelector: {
          "id.$": "$$.Execution.Name",
          "summaryTaskResult.$":
            "States.StringToJson(States.Format('\\{{}', $.Body.completion))",
        },
      },
    })
      .addCatch(failState)
      .next(
        new LambdaInvoke(this, "generate-audio-from-summary", {
          lambdaFunction: generateAudioFromSummary,
        }).addCatch(failState)
      )
      .next(successState)
  );

const stateMachine = new StateMachine(this, "StateMachine", {
  definitionBody: DefinitionBody.fromChainable(chainDefinition),
  stateMachineType: StateMachineType.EXPRESS,
  logs: {
    destination: new LogGroup(this, "ExpressLogs", {
      retention: RetentionDays.ONE_DAY,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    }),
    level: LogLevel.ALL,
    includeExecutionData: true,
  },
});
To be able to invoke the Bedrock API, we’ll need to add this policy to the workflow’s role (and remember to grant the state machine read and write permissions on the S3 bucket):
stateMachine.addToRolePolicy(
  new PolicyStatement({
    actions: ["bedrock:InvokeModel"],
    resources: [
      `arn:aws:bedrock:${Stack.of(this).region}::foundation-model/anthropic.claude-v2:1`,
    ],
  })
);

stateMachine.addToRolePolicy(
  new PolicyStatement({
    actions: ["s3:GetObject", "s3:PutObject"],
    resources: [`${bucket.bucketArn}/*`],
  })
);

I find creating generative AI-based applications to be a fun exercise; I am always impressed by how quickly we can develop such applications by combining serverless and generative AI.
Certainly, there is room for improvement to make this solution production-grade. This workflow could be integrated into a larger process, allowing the video summary to be sent asynchronously to a client, and let’s not forget robust error handling.
Follow this link to get the source code for this article.
Thanks for reading, and I hope you enjoyed it!