Generative AI-powered American Sign Language Avatars

A generative AI-powered solution that translates speech or text into expressive ASL avatar animations

Suresh Poopandi
Amazon Employee
Published Jun 27, 2024
This post is written by Alak Eswaradass, Senior Solutions Architect, Suresh Poopandi, Principal Solutions Architect, and Rob Koch, Principal at Slalom Build.
In today's world, effective communication is essential for fostering inclusivity and breaking down barriers. However, for individuals who rely on visual communication methods like American Sign Language (ASL), traditional communication tools often fall short. That's where GenASL comes in – a generative AI-powered solution that translates speech or text into expressive ASL avatar animations, bridging the gap between spoken/written language and sign language.
The rise of foundation models, and the fascinating world of generative AI we now live in, is incredibly exciting and opens the door to imagining and building what was not previously possible. In this blog post, we'll dive into the architecture and implementation details of GenASL, which uses AWS generative AI capabilities to create human-like ASL avatar videos.

Architecture Overview

The GenASL solution comprises several AWS services working together to enable seamless translation from speech or text to ASL avatar animations. Users send audio, video, or text to GenASL and receive an ASL avatar video that interprets the input. The solution uses AWS AI/ML services, namely Amazon Transcribe and Amazon SageMaker/Amazon Bedrock, together with foundation models.
Here's a high-level overview of the architecture:
Architecture Diagram
1. An Amazon EC2 instance initiates a batch process to create ASL avatars from a video dataset consisting of 8,000+ poses using RTMPose, a real-time multi-person pose estimation toolkit based on MMPose.
2. AWS Amplify distributes the GenASL web app consisting of HTML, JavaScript, and CSS to end users’ mobile devices.
3. The Amazon Cognito Identity pool grants temporary access to the Amazon S3 bucket.
4. The user uploads an audio file to the Amazon S3 bucket through the web app using the AWS SDK.
5. The GenASL web app invokes the backend services by sending the Amazon S3 object Key in the payload to an API hosted on Amazon API Gateway.
6. Amazon API Gateway starts an AWS Step Functions workflow. The state machine orchestrates the AI/ML services (Amazon Transcribe and Amazon Bedrock) and the NoSQL data store (Amazon DynamoDB) using AWS Lambda functions (a minimal sketch of starting the workflow follows this list).
7. The AWS Step Functions workflow generates a pre-signed URL of the ASL Avatar video for the corresponding audio file.
8. A pre-signed URL for the video file stored in Amazon S3 is sent back to the user’s browser through Amazon API Gateway. The user’s mobile device plays the video file using the pre-signed URL.
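To make step 6 concrete, here is a minimal sketch of how a Lambda function behind API Gateway might start the Step Functions execution with the S3 object key from the request payload. The handler shape and environment variable name are assumptions for illustration, not the exact code in the repository.

```python
import json
import os

import boto3

# Hypothetical environment variable holding the state machine ARN (illustrative).
STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]
sfn = boto3.client("stepfunctions")


def handler(event, context):
    """Start the ASL avatar generation workflow for an uploaded audio file."""
    body = json.loads(event["body"])
    execution = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({"bucket": body["bucket"], "key": body["key"]}),
    )
    # Return the execution ARN so the front end can poll for completion.
    return {
        "statusCode": 202,
        "body": json.dumps({"executionArn": execution["executionArn"]}),
    }
```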

Examples

Here is an example of an ASL video created for the text "Hello, how are you?". As a first step, this text is converted to the ASL gloss "HELLO HOW 2P".
You can test-drive the application hosted at https://www.aslavatars.com.

Technical Deep Dive

Let's dive into the implementation details of each component:

Batch Process

The American Sign Language Lexicon Video Dataset (ASLLVD) consists of multiple synchronized videos, captured from different angles, of more than 3,300 ASL signs in citation form, each produced by one to six native ASL signers. Linguistic annotations include gloss labels, sign start and end time codes, start and end handshape labels for both hands, and morphological and articulatory classifications of sign type. For compound signs, the dataset includes annotations for each morpheme. To facilitate computer vision-based sign language recognition, the dataset also includes numeric ID labels for sign variants, video sequences in uncompressed raw format, and camera calibration sequences.
We store the input dataset in an S3 bucket (Video dataset) and generate the ASL avatar videos using RTMPose, a real-time multi-person pose estimation model based on MMPose, a PyTorch-based open-source pose estimation toolkit. MMPose is part of the OpenMMLab project and contains a rich set of algorithms for 2D multi-person human pose estimation, 2D hand pose estimation, 2D face landmark detection, and 133-keypoint whole-body human pose estimation.
An Amazon EC2 instance initiates the batch process, which stores the ASL avatar video for every ASL gloss in another S3 bucket (ASL Avatars) and records the ASL gloss and the S3 key of its corresponding avatar video in a DynamoDB table.
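As a rough sketch of the final part of the batch job, the snippet below shows how each gloss-to-video mapping could be written to DynamoDB with boto3. The table and attribute names (AslAvatarTable, gloss, video_key) are illustrative assumptions, not the names used in the repository.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
# Illustrative table name; the real name comes from the dataprep configuration.
table = dynamodb.Table("AslAvatarTable")


def record_avatar_video(gloss: str, bucket: str, video_key: str) -> None:
    """Store the S3 location of the generated avatar video for one ASL gloss."""
    table.put_item(
        Item={
            "gloss": gloss,          # e.g. "HELLO"
            "bucket": bucket,        # the ASL Avatars bucket
            "video_key": video_key,  # e.g. "avatars/HELLO.mp4"
        }
    )
```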

Back End

The back-end process has three steps: processing the input audio into English text, translating the English text into ASL gloss, and generating the ASL avatar video from the ASL gloss. This API layer is fronted by Amazon API Gateway, which lets us authenticate, monitor, and throttle API requests. Whenever the API receives a request to generate a sign video, it invokes an AWS Step Functions workflow and returns the execution ARN to the front-end application. The workflow has three steps, outlined below.
Input Processing: First, we convert the audio input to English text using Amazon Transcribe, an automatic speech-to-text AI service that uses deep learning for speech recognition. Amazon Transcribe is a fully managed, continuously improving service designed to handle a wide range of speech and acoustic characteristics, including variations in volume, pitch, and speaking rate.
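A minimal sketch of this step with boto3 is shown below. In the actual workflow the wait is handled by the state machine rather than an in-process polling loop, and the job name and language code are assumptions.

```python
import time

import boto3

transcribe = boto3.client("transcribe")


def transcribe_audio(bucket: str, key: str, job_name: str) -> str:
    """Start a transcription job for the uploaded audio and return the transcript URI."""
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
        LanguageCode="en-US",  # assumption: English input
    )
    while True:
        job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        status = job["TranscriptionJob"]["TranscriptionJobStatus"]
        if status in ("COMPLETED", "FAILED"):
            break
        time.sleep(2)
    # The transcript text is downloaded from this URI as a JSON document.
    return job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
```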
Translation: In this step, we translate the English text to ASL gloss using Amazon Bedrock, the easiest way to build and scale generative AI applications with foundation models (FMs). Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API, so you can choose the model best suited for your use case. It provides an API-driven, serverless experience for builders, helping accelerate development. We used Anthropic's Claude 3 to create the ASL gloss.
Avatar Generation: In this last step, we generate the ASL avatar video from the ASL gloss. Using the gloss created in the translation step, we look up the corresponding ASL sign in the DynamoDB table. The "lookup ASL Avatar" Lambda function stitches the videos together, generates a temporary video, uploads it to the S3 bucket, creates a pre-signed URL, and sends the pre-signed URLs for both the sign video and the avatar video back to the front end. The front end plays the video in a loop.
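The sketch below outlines the idea behind this step: resolve each gloss to a clip in S3, concatenate the clips, upload the result, and return a pre-signed URL. It assumes an ffmpeg binary is available to the Lambda function and uses illustrative table, bucket, and file names; the actual implementation in the repository may differ.

```python
import subprocess

import boto3

s3 = boto3.client("s3")
# Illustrative table name holding the gloss-to-video mapping.
table = boto3.resource("dynamodb").Table("AslAvatarTable")


def build_avatar_video(gloss_tokens, avatar_bucket, output_bucket):
    """Stitch per-gloss avatar clips into one video and return a pre-signed URL."""
    clip_paths = []
    for i, gloss in enumerate(gloss_tokens):
        item = table.get_item(Key={"gloss": gloss}).get("Item")
        if not item:
            continue  # unknown gloss: skip it in this sketch
        path = f"/tmp/{i}_{gloss}.mp4"
        s3.download_file(avatar_bucket, item["video_key"], path)
        clip_paths.append(path)

    # Concatenate the clips with ffmpeg's concat demuxer.
    with open("/tmp/clips.txt", "w") as f:
        f.writelines(f"file '{p}'\n" for p in clip_paths)
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", "/tmp/clips.txt", "-c", "copy", "/tmp/avatar.mp4"],
        check=True,
    )

    # Upload the stitched video and hand back a time-limited pre-signed URL.
    output_key = "output/avatar.mp4"
    s3.upload_file("/tmp/avatar.mp4", output_bucket, output_key)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": output_bucket, "Key": output_key},
        ExpiresIn=3600,
    )
```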

Front-End

The front-end application is built using AWS Amplify, a framework that lets you build, develop, and deploy full-stack applications, including mobile and web applications. You can add authentication to a front-end AWS Amplify app using the Amplify CLI's add auth category, which generates the sign-up and login screens as well as the backend and the Amazon Cognito identity pools. During the audio file upload to S3, the front end connects to S3 using the temporary identity provided by the Cognito identity pool.
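For example, adding authentication with the Amplify CLI typically looks like the following (the interactive prompts vary by project):
amplify add auth
amplify push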

Best Practices

API design
  1. Polling: API Gateway supports a maximum timeout of 30 seconds, so a longer audio file that takes more than 30 seconds to process would hit the timeout. It's also a best practice not to build synchronous APIs for long-running processes. We therefore built an asynchronous API consisting of two stages. The first stage is an API endpoint that accepts the S3 key and bucket name, delegates the request to a Step Functions workflow, and responds with the execution ARN. The second stage is an API endpoint that checks the status of the Step Functions execution based on the execution ARN provided as input. If the ASL avatar creation is complete, this API returns the pre-signed URL; otherwise, it returns a RUNNING status, and the front end waits a couple of seconds before calling the second endpoint again. This is repeated until the API returns the pre-signed URL (a sketch of the status check appears after this list).
  2. Step Functions supports direct, optimized integration with Amazon Bedrock: we don't need a Lambda function in the middle to create the ASL gloss. We can call the Bedrock API directly from Step Functions and save the Lambda compute cost.
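For illustration, the second-stage status check could be a Lambda function that calls DescribeExecution and relays the result to the front end. The event shape and the output field name are assumptions, not the exact contract used by GenASL.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")


def handler(event, context):
    """Second-stage API: report workflow status, returning the pre-signed URL when done."""
    # Assumption: the front end passes the execution ARN as a query string parameter.
    execution_arn = event["queryStringParameters"]["executionArn"]
    result = sfn.describe_execution(executionArn=execution_arn)

    if result["status"] == "SUCCEEDED":
        # Assumption: the workflow output contains the pre-signed URL of the avatar video.
        output = json.loads(result["output"])
        body = {"status": "COMPLETED", "url": output.get("presignedUrl")}
    else:
        body = {"status": result["status"]}  # e.g. RUNNING or FAILED

    return {"statusCode": 200, "body": json.dumps(body)}
```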
Dev Ops
From a DevOps perspective, the front end is built and deployed with Amplify, and the backend uses AWS SAM (Serverless Application Model) to build, package, and deploy the serverless application. Using Amazon CloudWatch, we built a dashboard that captures metrics such as the number of API invocations (the number of ASL avatar videos generated), the average response time to create a video, and error metrics, so we can track failures and alert the DevOps team appropriately, creating a good user experience.
Prompt engineering
We provide a prompt for converting English text to ASL gloss, along with the input text message, to the Bedrock API when invoking Claude. We use a few-shot prompting technique, providing a few examples, to produce accurate ASL gloss.
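The snippet below is a minimal sketch of this translation step. The prompt wording and the few-shot example are illustrative (the example gloss comes from the demo earlier in this post), and the model ID shown is one Claude 3 variant; the exact prompt and model used by GenASL may differ.

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative few-shot prompt; add more English-to-gloss examples as needed.
PROMPT = (
    "Convert the English sentence into ASL gloss.\n\n"
    "English: Hello, how are you?\n"
    "Gloss: HELLO HOW 2P\n\n"
    "English: {sentence}\n"
    "Gloss:"
)


def english_to_gloss(sentence: str) -> str:
    """Ask Claude 3 on Amazon Bedrock to translate English text into ASL gloss."""
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # one Claude 3 variant
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 200,
            "messages": [
                {"role": "user", "content": PROMPT.format(sentence=sentence)}
            ],
        }),
    )
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"].strip()
```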
NOTE: The code sample is available in the amazon-bedrock-samples Git repository.

Instructions to set up this solution

Prerequisites

Before you begin, ensure you have the following set up:
Docker
Make sure Docker is installed and running on your machine. Docker is required for containerized application development and deployment. You can download and install Docker from Docker's official website.
AWS SAM CLI
Install the AWS Serverless Application Model (SAM) CLI. This tool is essential for building and deploying serverless applications. Follow the installation instructions provided in the AWS SAM CLI documentation.
Amplify CLI
Install the Amplify Command Line Interface (CLI). The Amplify CLI is a powerful toolchain for simplifying serverless web and mobile development. You can find the installation instructions in the Amplify CLI documentation.
Windows-based EC2 Instance
Ensure you have access to a Windows-based EC2 instance to run the batch process. This instance will be used for various tasks such as video processing and data preparation. You can launch an EC2 instance via the AWS Management Console. If you need help, refer to the EC2 launch documentation.

Steps to deploy

This section provides steps to deploy an ASL avatar generator using AWS services. The following sections outline the steps for cloning the repository, processing data, deploying the backend, and setting up the frontend.


1. Clone the Git Repository

Clone the Git repository using the following command:
git clone https://github.com/aws-samples/genai-asl-avatar-generator.git



2. Batch Process

Follow the instructions in the dataprep folder to initialize the database.


2.1 Modify Configuration File

Modify `genai-asl-avatar-generator/dataprep/config.ini` with information specific to your environment:

2.2 Set Up Your Environment

Set up your environment by installing the required Python packages:
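The exact dependency list lives in the repository; assuming the dataprep folder ships a requirements.txt file (an assumption; check the repository), the setup typically looks like:
cd genai-asl-avatar-generator/dataprep
pip install -r requirements.txt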


2.3 Prepare Sign Video Annotation File

Prepare the sign video annotation file for each processing run:



2.4 Download and Segment Sign Videos

Download sign videos, segment them, and store them in S3:


2.5 Generate Avatar Videos

Generate avatar videos:


3. Deploy the Backend

Use the following command to deploy the backend application:
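The backend is a SAM application, so the standard build-and-deploy commands apply. Run them from the directory that contains the SAM template (the guided deploy prompts for stack name and region):
sam build
sam deploy --guided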


4. Set Up the Frontend


4.1 Initialize Amplify Environment

Initialize your Amplify environment:
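Based on the repository layout referenced in the next step (the frontend folder), initialization typically looks like:
cd genai-asl-avatar-generator/frontend
amplify init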

4.2 Modify Frontend Configuration

Modify the frontend configuration to point to the backend API:
- Open `frontend/amplify/backend/function/Audio2Sign/index.py`
- Update the `stateMachineArn` variable with the state machine ARN shown in the output of the backend deployment



4.3 Add Hosting to the Amplify Project
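Hosting is added with the Amplify CLI:
amplify add hosting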

In the prompt, select "Amazon CloudFront and S3" and choose the bucket to host the GenASL application.


4.4 Install the Relevant Packages
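Assuming a standard npm-based Amplify web app (an assumption; use your project's package manager), run the following from the frontend folder:
npm install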


4.5 Deploy the Amplify Project
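Publish the app with the Amplify CLI:
amplify publish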

Running the solution

After deploying the Amplify project using the amplify publish command, a CloudFront URL will be returned. You can use this URL to access the GenASL demo application. With the application open, you can register a new user and test the ASL avatar generation functionality.

Cleanup

1. Delete the Frontend Amplify Application
   Remove all the frontend resources created by Amplify using the Amplify CLI's delete command (see the typical commands after this list).
2. Delete the Backend Resources
   Remove all the backend resources created by SAM using the SAM CLI's delete command.
3. Clean Up Resources Used by the Batch Process
   - If you created a new EC2 instance for running the batch process, terminate the instance using the AWS Management Console.
   - If you reused an existing EC2 instance, delete the project folder recursively to clean up all the resources.
4. Delete the S3 Buckets
   Use the AWS CLI to delete the buckets created for storing the ASL videos, replacing <bucket-name> with the name of your S3 bucket.
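A typical cleanup sequence for steps 1, 2, and 4, assuming default project and stack names, looks like this (run amplify delete from the frontend folder and sam delete from the directory containing the SAM template):
amplify delete
sam delete
aws s3 rb s3://<bucket-name> --force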

What's next?

3D Pose Estimation: GenASL currently generates a 2D avatar. We plan to extend the solution to create 3D avatars using the 3D pose estimation algorithms supported by MMPose, which would let us produce thousands of 3D keypoints. Using Stable Diffusion's image generation capabilities, we could then create realistic, human-like avatars in real-world settings.
Blending Techniques: In some videos generated by GenASL, frames are skipped when the clips are stitched together, resulting in sudden changes in motion. To fix this, we can use a technique called blending: partner solutions can create the intermediate frames needed for smooth transitions. As a next step, we will incorporate these partner solutions to produce smoother videos.
Bi-directional Translation: GenASL currently converts audio to an ASL video. We also want to translate in the reverse direction, from an ASL video back to English audio: record a real-time sign video, send it frame by frame through pose estimation algorithms, collect and combine the keypoints, search a keypoint database to retrieve the ASL gloss, and convert the gloss back to text. Using Amazon Polly, we can then convert the text back to audio.

Conclusion

By combining speech-to-text, machine translation, and text-to-video generation using AWS AI/ML services, the GenASL solution creates expressive ASL avatar animations, fostering inclusive and effective communication. This blog post provided an overview of the GenASL architecture and implementation details. As generative AI continues to evolve, we can create groundbreaking applications that enhance accessibility and inclusivity for all.

About Authors

Alak Eswaradass is a Senior Solutions Architect at AWS, based in Chicago, Illinois. She is passionate about helping customers design cloud architectures utilizing AWS services to solve business challenges. She hangs out with her daughters and explores the outdoors in her free time.
Suresh Poopandi is a Principal Solutions Architect at AWS, based in Chicago, Illinois, helping healthcare and life sciences customers on their cloud journey by providing architectures that utilize AWS services to achieve their business goals. He is passionate about building home automation and AI/ML solutions.
Rob Koch is a tech enthusiast who thrives on steering projects from their initial spark to successful fruition. He is a Principal at Slalom Build in Seattle, an AWS Data Hero, and co-chair of the CNCF Deaf and Hard of Hearing Working Group. His expertise in architecting event-driven systems is firmly rooted in the belief that data should be harnessed in real time. Rob relishes the challenge of examining existing systems and mapping the journey toward an event-driven architecture.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
