
AWS Transcribe: Converting Audio Files to Text
Turn audio files into text easily with AWS Transcribe
- Business Case: Recently, I faced the challenge of implementing a solution to convert audio conversations into text, including channel identification (speakers), for a Retrieval Augmented Generation (RAG) application. These conversations originated from various meeting platforms, including Zoom, Teams, and other recording tools. While some platforms, such as Zoom, offer built-in AI assistance and transcription, I needed a streamlined workflow that integrates seamlessly with the other AWS services in our solution. Alternative services such as Google Cloud Speech-to-Text and Azure Speech to Text are viable, each with distinct advantages and disadvantages, but AWS Transcribe proved to be the ideal service, offering the necessary features, a straightforward SDK, and robust API integration. In addition, AWS Transcribe can leverage Amazon Bedrock and its underlying models for powerful summarization, enhancing the value of the transcribed audio. During implementation, I encountered an issue with job history accumulation in the AWS Transcribe console: as the number of completed jobs grew, the history became cumbersome to manage. To address this, I developed a post-processing flow: upon job completion, the relevant job information is written to Amazon DynamoDB for persistent storage, and the Transcribe job is then deleted. This keeps the job history clean and manageable while retaining the essential data.
- Benefits: AWS Transcribe offers scalable, accurate, and cost-effective audio transcription, with summarization capabilities powered by Amazon Bedrock. It efficiently processes various audio types and supports use cases such as meetings, customer calls, and medical transcription. With event-driven automation and seamless AWS integration, it saves time and adapts well, making it suitable for businesses of all sizes. This is the first article in a three-part series; it can be used on its own to solve a specific business problem or as part of a complete solution.
- Overview: If you want to transcribe audio files with channel identification, which distinguishes individual speakers in multi-channel conversations, AWS Transcribe provides robust support for this feature. In this example, however, we will not use the channel identification option. Instead, we will discuss the use case and the architectural design needed to achieve our goal without it. AWS Transcribe's standard transcription capabilities can still process audio effectively, making them suitable for scenarios such as meeting transcriptions, call centre recordings, and more.
- Purpose: The benefits of using the AWS Transcribe service include being a fully managed, cost-effective AWS solution that enables the automation of audio transcription for a variety of file formats (e.g., MP4) across many industries and business use cases, such as meeting summarization, audio conversations, and other audio recordings. Additionally, Amazon Transcribe Medical is available, which is specifically trained to recognize medical vocabulary and terminology. The service offers numerous features, including custom language models, custom vocabulary, vocabulary filtering, and more. It can also be integrated seamlessly with other AWS services to enhance its capabilities.
- Objective: In this article, I will provide a high-level overview of how to use the service based on the architecture diagram presented. I will include a few concise code samples focused on an AWS Lambda function that initiates an AWS Transcribe job, while the remaining parts of the architecture will be described in detail. This design follows an event-driven approach but can be adapted to use AWS Step Functions for orchestration if needed. I chose this design for its simplicity and straightforward pattern.
- Security: Security considerations must always be part of the design and implementation of any solution to ensure data privacy, security, and integrity, following a least-privilege approach and ensuring that data is encrypted at rest and in transit. In this solution, Amazon Transcribe keeps data encrypted in transit, the S3 buckets where the source and destination data are saved are encrypted at rest, and the permissions granted to the AWS Lambda function are limited to the services the function actually uses.
- Workflow Overview: In this workflow, the user or the system uploads audio files to an AWS S3 bucket. Amazon EventBridge is configured to detect an object creation event, which triggers an AWS Lambda function. This function initiates an Amazon Transcribe job to process the uploaded file. Once the transcription is complete, the output is stored in the S3 bucket in JSON format, enabling future use and integration with other services.
- Amazon S3: In this solution, we use two S3 buckets: one for storing audio files and another for storing the transcribed JSON files. We configure Amazon EventBridge to trigger an AWS Lambda function when an audio file is uploaded.
- Amazon EventBridge: Triggers the AWS Lambda function upon an Object Created event.
- AWS Lambda: The function is developed in Python using the AWS SDK, Boto3. It is designed to accept a JSON payload from Amazon EventBridge and create an AWS Transcribe job. This allows seamless integration with event-driven architectures, where the function is automatically triggered upon file upload events. The Python code handles the input payload, configures the transcription parameters, and initiates the job, ensuring that each audio file is processed efficiently and stored in the specified S3 bucket. By default, an Amazon Transcribe job remains available for 90 days after completion. To delete the job upon completion, you can make an API call using the Python SDK Boto3.
- AWS Transcribe: We use an Amazon Transcribe job for each audio file to ensure accurate and efficient transcription. Each job processes the uploaded audio file and generates a JSON output stored in the designated S3 bucket. This setup allows for scalable handling of multiple audio files, ensuring that each transcription job runs independently and integrates seamlessly with other AWS services.
- IAM Roles: The IAM roles for the AWS Lambda function include the necessary policies to access both Amazon Transcribe and the S3 bucket.
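To make the EventBridge trigger and the least-privilege role concrete, here is a minimal sketch of both pieces as Python dicts of the kind you would pass to boto3 or embed in infrastructure-as-code. The bucket names are placeholders of my own, not values from this solution, and the policy is an illustrative minimum, not a complete production policy. Note that S3 must have EventBridge notifications enabled on the bucket for the Object Created pattern to fire.

```python
import json

# Hypothetical bucket names; replace with your own.
AUDIO_BUCKET = "my-audio-bucket"
OUTPUT_BUCKET = "my-transcripts-bucket"


def build_event_pattern(bucket: str) -> dict:
    # EventBridge pattern matching S3 "Object Created" events for the audio bucket
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket]}},
    }


def build_lambda_policy(audio_bucket: str, output_bucket: str) -> dict:
    # Least-privilege sketch for the Lambda execution role:
    # read audio, write transcripts, manage Transcribe jobs
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{audio_bucket}/*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{output_bucket}/*",
            },
            {
                "Effect": "Allow",
                "Action": [
                    "transcribe:StartTranscriptionJob",
                    "transcribe:GetTranscriptionJob",
                    "transcribe:DeleteTranscriptionJob",
                ],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_event_pattern(AUDIO_BUCKET), indent=2))
    print(json.dumps(build_lambda_policy(AUDIO_BUCKET, OUTPUT_BUCKET), indent=2))
```

In practice you would attach the policy to the Lambda execution role and register the event pattern on an EventBridge rule whose target is the function.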
import datetime
import logging
import urllib.parse

import boto3

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def start_job(job_name: str, audio_bucket: str, audio_bucket_prefix: str,
              output_prefix_key: str, file_name: str, media_format: str = "mp4"):
    client = boto3.client("transcribe", region_name="us-east-1")
    logger.info("Starting Transcribe job")
    logger.info(f"Job Name: {job_name}")
    logger.info(f"Media Format: {media_format}")
    logger.info(f"File Name: {file_name}")
    # Timestamp prefix keeps each output key unique
    now = datetime.datetime.now()
    prefix = now.strftime("%Y-%m-%d_%H-%M-%S")
    media_file_uri = f"s3://{audio_bucket}/{audio_bucket_prefix}/{file_name}.{media_format}"
    logger.info(f"Media File URI: {media_file_uri}")
    # Speaker labels and channel identification can be enabled via the
    # Settings parameter; they are not used in this example.
    response = client.start_transcription_job(
        TranscriptionJobName=job_name,
        LanguageCode="en-IN",
        MediaFormat=media_format,
        Media={
            "MediaFileUri": media_file_uri,
        },
        OutputBucketName=audio_bucket,
        OutputKey=f"{output_prefix_key}/{prefix}_{file_name.replace(' ', '_')}.json",
    )
    return response


def handler(event, context):
    logger.info(f"Event Received: {event}")
    # This event comes from Amazon EventBridge; extract the bucket and key from it
    audio_bucket = event["bucket"]
    audio_bucket_prefix = event["key"]
    key_object = urllib.parse.unquote_plus(event["key"])
    logger.info(f"key_object: {key_object}")
    # Test values; in production, derive these from the event payload
    job_name = "test_job"
    output_prefix_key = "transcription"
    file_name = "test_audio"
    # Create the Transcribe job; the transcription is stored in the S3 bucket
    transcription_job_response = start_job(
        job_name, audio_bucket, audio_bucket_prefix, output_prefix_key, file_name
    )
    logger.info(f"Transcribe Job Response: {transcription_job_response}")
    job_name = transcription_job_response["TranscriptionJob"]["TranscriptionJobName"]
    logger.info(f"Job Name: {job_name}")
    return {"statusCode": 200, "body": f"Job started: {job_name}"}
test_event = {
    "bucket": "s3_bucket_name",
    "key": "audio_files"
}
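The test payload above uses a simplified shape. A real EventBridge "Object Created" notification nests the bucket and key under a detail element, so the handler needs a small extraction step before it can build the media URI. Below is a sketch of that step; the helper name is mine, the sample event is trimmed to only the fields used, and the key decoding mirrors the article's use of unquote_plus:

```python
import os
import urllib.parse


def extract_s3_location(event: dict) -> tuple:
    """Return (bucket, prefix, file_stem) from an EventBridge S3 Object Created event."""
    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    # Decode the key as the handler above does
    key = urllib.parse.unquote_plus(detail["object"]["key"])
    prefix, file_name = os.path.split(key)
    stem, _ext = os.path.splitext(file_name)
    return bucket, prefix, stem


# Example event trimmed to the fields used above
sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "s3_bucket_name"},
        "object": {"key": "audio_files/team+meeting.mp4"},
    },
}

print(extract_s3_location(sample_event))
# → ('s3_bucket_name', 'audio_files', 'team meeting')
```

Splitting the key into prefix and file stem also gives start_job the separate audio_bucket_prefix and file_name values it expects.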
{
"jobName": "test-job1",
"accountId": "xxxxxxxxxxx",
"status": "COMPLETED",
"results": {
"transcripts": [
{ "transcript": "a line of people wait outside of a building." }
],
"items": [
{
"id": 0,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "a" }],
"start_time": "0.509",
"end_time": "0.519"
},
{
"id": 1,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "line" }],
"start_time": "0.579",
"end_time": "0.99"
},
{
"id": 2,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "of" }],
"start_time": "1.0",
"end_time": "1.11"
},
{
"id": 3,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "people" }],
"start_time": "1.12",
"end_time": "1.509"
},
{
"id": 4,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "wait" }],
"start_time": "1.519",
"end_time": "1.71"
},
{
"id": 5,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "outside" }],
"start_time": "1.72",
"end_time": "2.069"
},
{
"id": 6,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.996", "content": "of" }],
"start_time": "2.079",
"end_time": "2.18"
},
{
"id": 7,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.997", "content": "a" }],
"start_time": "2.19",
"end_time": "2.2"
},
{
"id": 8,
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "building" }],
"start_time": "2.21",
"end_time": "2.74"
},
{
"id": 9,
"type": "punctuation",
"alternatives": [{ "confidence": "0.0", "content": "." }]
}
],
"audio_segments": [
{
"id": 0,
"transcript": "a line of people wait outside of a building.",
"start_time": "0.479",
"end_time": "2.859",
"items": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
}
]
}
}
import datetime
import logging
import uuid

import boto3

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def generate_uid() -> str:
    # Unique identifier for each DynamoDB record
    return str(uuid.uuid4())


# DynamoDB Boto3 resource
def get_dynamodb_client(region):
    dynamodb_client = boto3.resource("dynamodb", region_name=region)
    return dynamodb_client


# AWS Transcribe Boto3 client
def get_transcribe_client(region):
    transcribe_client = boto3.client("transcribe", region_name=region)
    return transcribe_client


def store_log_in_dynamodb(dynamodb_client, dynamodb_table_name: str, data: dict[str, str]):
    logger.info(f"Table Name: {dynamodb_table_name}")
    logger.info(f"Data from event: {data}")
    table = dynamodb_client.Table(dynamodb_table_name)
    # Generate a unique record UID
    uid = generate_uid()
    table.put_item(
        Item={
            "RecordId": uid,
            "TranscriptionJobName": data.get("TranscriptionJobName"),
            "TranscriptFileUri": data.get("TranscriptFileUri"),
            "TranscriptionJobStatus": data.get("TranscriptionJobStatus"),
            "CreatedAt": datetime.datetime.utcnow().isoformat(),
        }
    )


def delete_transcribe_jobs(transcribe_client, job_names):
    logger.info("Deleting Transcribe jobs: %s", job_names)
    for job_name in job_names:
        try:
            transcribe_client.delete_transcription_job(TranscriptionJobName=job_name)
            logger.info("Deleted Transcribe job: %s", job_name)
        except Exception as e:
            logger.error("Error deleting job %s: %s", job_name, e)
            raise
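The post-processing helpers above are meant to run when a job finishes. Amazon Transcribe publishes a "Transcribe Job State Change" event to the default EventBridge bus, so a rule with a pattern like the following can route finished jobs to the cleanup Lambda. This is a sketch of the pattern only; wiring it to a rule and target is left out:

```python
import json


def build_completion_pattern() -> dict:
    # Matches Transcribe job state-change events for finished jobs,
    # including failures so they can be logged as well
    return {
        "source": ["aws.transcribe"],
        "detail-type": ["Transcribe Job State Change"],
        "detail": {"TranscriptionJobStatus": ["COMPLETED", "FAILED"]},
    }


if __name__ == "__main__":
    print(json.dumps(build_completion_pattern(), indent=2))
```

The matched event's detail carries the TranscriptionJobName, which the cleanup Lambda can pass to store_log_in_dynamodb and delete_transcribe_jobs.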
- Summary: Audio files are uploaded to an S3 bucket, triggering an AWS Lambda function via AWS EventBridge to start an Amazon Transcribe job. The transcribed text is stored as a JSON file in another S3 bucket for future use.
- Next Steps: In the next blog, we will embed the transcribed text using Amazon Bedrock embeddings and store it in an Amazon OpenSearch vector database. This will allow a Retrieval Augmented Generation (RAG) application to perform searches against the recorded files.