Serving OpenAI Whisper with Ray Serve and Gradio on Amazon EKS powered by AWS Neuron

Step-by-step instructions showing how to build an OpenAI Whisper-based application using Amazon EKS and AWS Neuron.

Published Dec 13, 2024
In this blog post, we'll explore how to deploy OpenAI's Whisper model using Ray Serve and Gradio on AWS Neuron. This powerful combination allows for efficient speech recognition capabilities on specialized hardware. Let's dive into the key components and the deployment process.

AWS Neuron

AWS Neuron is a software development kit (SDK) for running machine learning workloads on AWS Inferentia and AWS Trainium chips. These custom-built accelerators are purpose-built for machine learning, offering high performance and cost-effectiveness for inference tasks.

Ray Serve

Ray Serve is a scalable model serving library built on Ray. It allows you to scale machine learning models and other Python functions in production environments. Ray Serve provides a flexible architecture for building online inference systems, making it easier to deploy and manage machine learning models.

Whisper

Whisper is a state-of-the-art automatic speech recognition (ASR) system developed by OpenAI. It's capable of transcribing speech in multiple languages and can perform translation tasks. Whisper is known for its robustness and accuracy across various audio conditions.

Gradio

Gradio is an open-source Python library that simplifies the creation of web interfaces for machine learning models. It allows developers to quickly build and share interactive demos for their models, making it easier for non-technical users to interact with and test AI applications.

Data on EKS

Data on EKS is an open-source project that provides blueprints and best practices for running data and AI/ML workloads, including generative AI and Large Language Models (LLMs), on Amazon Elastic Kubernetes Service (EKS). It streamlines the entire AI/ML workflow, from data preparation and model training to deployment and inference, within the scalable and flexible environment of Amazon EKS.
Key features of Data on EKS include:
1. Support for various LLMs: Users can work with models like BERT-Large, Llama2, and Stable Diffusion.
2. Comprehensive ML lifecycle management: The platform covers training, fine-tuning, and inference stages.
3. Integration with popular ML frameworks: PyTorch, TensorFlow, TensorRT, and vLLM are supported.
4. Collaborative development environment: JupyterHub is available for interactive model development and experimentation.
5. Workflow management: Kubeflow and Ray are integrated for handling complex machine learning pipelines.
6. High-performance inference: Tools like Ray Serve, NVIDIA Triton Inference Server, and KServe ensure efficient model serving.
7. Optimization techniques: AWS Neuron for Inferentia and NVIDIA GPUs are utilized to accelerate inference.
8. Robust storage solutions: Integration with AWS storage services like S3, EBS, EFS, and FSx for scalable data handling.
9. Model versioning and registry: MLflow is used for tracking model versions and maintaining a model registry.
10. Container management: Amazon ECR is employed for managing container images.

Deploying the Solution

Architecture diagram of the deployed solution
Before we begin, ensure you have the prerequisites in place: the AWS CLI, kubectl, and Terraform installed and configured with credentials for your AWS account.

Deployment Steps

1. Clone the repository.
2. Navigate to the example directory and run the installation script (both steps are sketched below).
Important Note: Update the region in the `variables.tf` file before deploying, and make sure your local AWS region setting matches the specified region to prevent discrepancies.
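The commands below sketch these two steps, assuming the example lives in the Data on EKS repository; the example directory name is an assumption and may differ in the actual repository.

```bash
# Clone the Data on EKS repository (the example directory below is an assumption).
git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks/ai-ml/trainium-inferentia   # adjust to the directory for this example

# Review/update the region in variables.tf, then run the installation script.
./install.sh
```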

Verifying Resources

To verify the Amazon EKS Cluster:
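A quick way to check the cluster, assuming placeholders for the cluster name and region created by the blueprint:

```bash
# Update kubeconfig for the newly created cluster (cluster name/region are placeholders).
aws eks update-kubeconfig --region <region> --name <cluster-name>

# Confirm the nodes, including the Inferentia (inf2) nodes, are in Ready state.
kubectl get nodes
```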

Deploying the Ray Cluster with Whisper

Creating Whisper Encoder and Decoder Weights

Separate weights are necessary for optimizing performance on AWS Neuron. This approach allows for more efficient inference by leveraging the specialized hardware. For a detailed guide on creating these weights, refer to Samir Souza's repository.
Create separate encoder and decoder weights for Whisper with the names below (a compilation sketch follows the list):
- `whisper_large-v3_1_64_neuron_decoder.pt`
- `whisper_large-v3_1_neuron_encoder.pt`
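For orientation, here is a rough sketch of how the encoder artifact might be compiled with `torch_neuronx`; the wrapper, input shape, and file name are assumptions, and the referenced repository remains the authoritative guide (the decoder is traced similarly with a fixed 64-token decoder sequence, per its file name).

```python
# Rough sketch: compile the Whisper large-v3 encoder for AWS Neuron with torch_neuronx.
import torch
import torch_neuronx
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
model.eval()

class EncoderWrapper(torch.nn.Module):
    """Expose only the encoder forward pass so it can be traced with a fixed shape."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
    def forward(self, input_features):
        return self.encoder(input_features).last_hidden_state

# Whisper large-v3 expects 128 mel bins x 3000 frames (30 s of audio) at batch size 1.
example_input = torch.zeros(1, 128, 3000)
neuron_encoder = torch_neuronx.trace(EncoderWrapper(model.model.encoder), example_input)
torch.jit.save(neuron_encoder, "whisper_large-v3_1_neuron_encoder.pt")
```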

Creating Whisper Application

Let's create the Whisper application using Ray Serve and FastAPI, and save it as `whisper-neuron.py`:
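A simplified sketch of what `whisper-neuron.py` could look like is shown below; the endpoint path, Neuron core resource request, and the naive greedy-decoding loop are illustrative assumptions rather than the exact application.

```python
# whisper-neuron.py -- simplified sketch of the Ray Serve + FastAPI application.
import io

import librosa
import torch
from fastapi import FastAPI, UploadFile
from ray import serve
from transformers import WhisperProcessor

app = FastAPI()

@serve.deployment(ray_actor_options={"resources": {"neuron_cores": 2}})
@serve.ingress(app)
class WhisperDeployment:
    def __init__(self):
        self.processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
        # Pre-compiled Neuron artifacts created in the previous step.
        self.encoder = torch.jit.load("whisper_large-v3_1_neuron_encoder.pt")
        self.decoder = torch.jit.load("whisper_large-v3_1_64_neuron_decoder.pt")

    def _greedy_decode(self, encoder_hidden, max_len=64):
        # Naive greedy decoding against the fixed-shape traced decoder (sketch only).
        tok = self.processor.tokenizer
        ids = torch.full((1, max_len), tok.pad_token_id, dtype=torch.long)
        ids[0, 0] = tok.convert_tokens_to_ids("<|startoftranscript|>")
        for i in range(1, max_len):
            logits = self.decoder(ids, encoder_hidden)
            next_id = int(logits[0, i - 1].argmax())
            ids[0, i] = next_id
            if next_id == tok.eos_token_id:
                break
        return tok.decode(ids[0, : i + 1].tolist(), skip_special_tokens=True)

    @app.post("/transcribe")
    async def transcribe(self, file: UploadFile):
        # Resample the uploaded audio to 16 kHz, as expected by Whisper.
        audio, _ = librosa.load(io.BytesIO(await file.read()), sr=16000)
        features = self.processor(
            audio, sampling_rate=16000, return_tensors="pt"
        ).input_features
        encoder_hidden = self.encoder(features)
        return {"text": self._greedy_decode(encoder_hidden)}

entrypoint = WhisperDeployment.bind()
```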

Dockerfile for Whisper on Neuron

Create a Dockerfile to set up the environment for Whisper on AWS Neuron:
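The Dockerfile below is one plausible way to do this, starting from a Ray base image and installing the Neuron Python packages from the public AWS Neuron pip repository; the base image tag, package versions, and file layout are assumptions.

```dockerfile
# Illustrative Dockerfile for the Whisper Ray Serve application on AWS Neuron.
FROM rayproject/ray:2.32.0-py310

# Neuron Python packages come from the AWS Neuron pip repository.
RUN pip install --no-cache-dir \
      --extra-index-url=https://pip.repos.neuron.amazonaws.com \
      torch-neuronx neuronx-cc transformers librosa soundfile fastapi

WORKDIR /serve_app
COPY whisper-neuron.py /serve_app/
# Pre-compiled Neuron weights; these could also be pulled from Amazon S3 at startup.
COPY whisper_large-v3_1_neuron_encoder.pt /serve_app/
COPY whisper_large-v3_1_64_neuron_decoder.pt /serve_app/
```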

Building Gradio Application

Create a Gradio interface for the Whisper model and save it as a Python file:
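A minimal sketch of the Gradio front end is below; the file name (`gradio-app.py`), the in-cluster service URL, and the request/response format are assumptions that must match the Ray Serve endpoint defined above.

```python
# gradio-app.py -- minimal Gradio front end that calls the Whisper Ray Serve endpoint.
import gradio as gr
import requests

# In-cluster DNS name of the Ray Serve service (port 8000); adjust to your service/namespace.
WHISPER_ENDPOINT = "http://whisper-serve-svc.default.svc.cluster.local:8000/transcribe"

def transcribe(audio_path):
    with open(audio_path, "rb") as f:
        resp = requests.post(WHISPER_ENDPOINT, files={"file": f}, timeout=120)
    resp.raise_for_status()
    return resp.json().get("text", "")

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),
    outputs=gr.Textbox(label="Transcription"),
    title="Whisper on AWS Neuron",
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
```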

Gradio Dockerfile

Create a Dockerfile for the Gradio application:
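One possible Dockerfile, assuming the front end was saved as `gradio-app.py` as in the sketch above:

```dockerfile
# Illustrative Dockerfile for the Gradio front end; versions are assumptions.
FROM python:3.10-slim
RUN pip install --no-cache-dir gradio requests
WORKDIR /app
# Assumes the hypothetical file name used in the Gradio sketch above.
COPY gradio-app.py /app/gradio-app.py
EXPOSE 7860
CMD ["python", "gradio-app.py"]
```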

Deploying Whisper Ray Service

Create a Kubernetes manifest for deploying the Whisper Ray Service and save it as `ray-whisper-inf2.yaml`:
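An abbreviated sketch of `ray-whisper-inf2.yaml` is shown below; the image URIs, Serve import path, node selection, and Neuron resource counts are assumptions to adapt to your cluster.

```yaml
# Abbreviated sketch of a KubeRay RayService manifest for Whisper on inf2 nodes.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: whisper
spec:
  serveConfigV2: |
    applications:
      - name: whisper
        # Assumes the application module is importable as whisper_neuron
        # (hyphens are not valid in Python module names).
        import_path: whisper_neuron:entrypoint
        route_prefix: /
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: <account>.dkr.ecr.<region>.amazonaws.com/whisper-neuron:latest
              ports:
                - containerPort: 8265   # Ray dashboard
                - containerPort: 8000   # Ray Serve endpoint
    workerGroupSpecs:
      - groupName: inf2-workers
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        rayStartParams: {}
        template:
          spec:
            # Node selection for inf2 instances depends on your node group / Karpenter setup.
            containers:
              - name: ray-worker
                image: <account>.dkr.ecr.<region>.amazonaws.com/whisper-neuron:latest
                resources:
                  limits:
                    aws.amazon.com/neuron: "1"
```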
Deploy the Ray Service with Whisper:
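```bash
# Apply the RayService manifest (adjust the namespace to your environment).
kubectl apply -f ray-whisper-inf2.yaml
```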
This deployment also configures a service with multiple ports. Port 8265 is designated for the Ray dashboard, and port 8000 is for the Whisper inference endpoint served by Ray Serve.
Run the following command to verify the services:
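```bash
# List the services created by the RayService (names and namespace may differ).
kubectl get svc
```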
To access the Ray dashboard, you can port-forward the relevant port to your local machine:
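For example, assuming the head service created by the RayService is named `whisper-head-svc` as in the manifest sketch above:

```bash
# Forward the Ray dashboard port to localhost.
kubectl port-forward svc/whisper-head-svc 8265:8265
```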
Check the Ray dashboard for successful application deployment.
Ray Dashboard view of the deployed Whisper application

Cleanup

Finally, clean up and deprovision the resources when they are no longer needed (the commands are sketched below):
1. Delete the RayCluster.
2. Destroy the EKS cluster and its resources.
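A sketch of these cleanup steps, assuming the manifest and blueprint directory used earlier (the cleanup script name is an assumption):

```bash
# Delete the RayService, which removes the underlying Ray cluster.
kubectl delete -f ray-whisper-inf2.yaml

# Destroy the EKS cluster and supporting resources provisioned by the blueprint.
cd data-on-eks/ai-ml/trainium-inferentia   # directory is an assumption
./cleanup.sh                               # or: terraform destroy
```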
This article is contributed by Jagdeep Phoolkumar, Sr. Specialist Solutions Architect, Compute, and Utkarsh Pundir, Assoc. Specialist Solutions Architect, Containers.
Any opinions in this post are those of the individual authors and do not reflect the opinions of AWS.
 
