Leaving no language behind with Amazon SageMaker Serverless Inference πŸŒπŸ’¬

Host a high-quality translation model at scale using Amazon SageMaker Serverless Inference.

JoΓ£o Galego
Amazon Employee
Published Jul 2, 2024
πŸ’¬ β€œLanguage was just difference. A thousand different ways of seeing, of moving through the world. No; a thousand worlds within one. And translation – a necessary endeavour, however futile, to move between them.” ― R. F. Kuang, Babel, or the Necessity of Violence: An Arcane History of the Oxford Translators' Revolution
In this blog post, I'm going to show you how to deploy a translation model from πŸ€— Hub with Amazon SageMaker using Serverless Inference (SI).
The model in question is No Language Left Behind (NLLB for short), a high-quality translation model from Meta AI with support for over 200 languages, from Acehnese to Zulu.
One of the things that immediately stands out about NLLB is its human-centered approach and its inclusion of underrepresented languages like Asturian (~100,000 native speakers) or Minangkabau.
Meta AI presents Stories Told Through Translation: nllb.metademolab.com
🧐 Did you know: According to the Ethnologue, there are 7,164 languages in use today. However, this number is rapidly declining: 40% of all living languages are endangered, while the top 25 by number of speakers account for more than half (!) of the world's population.
"Language is a living thing" ― Gilbert Highet
Given its focus on underrepresented languages, some of which I'd never heard of before, my relationship with NLLB was really love at first sight. I've been meaning to write something about it for a very long time: using it to push the limits of SI feels like a great excuse. This post is my love letter (love sticky note? πŸ—’) to the intricacies of language and the fact, which still amazes me, that we can even think about translating between languages spread all over the globe.
❗ A word of caution before we proceed: please keep in mind that SI is not for everyone.
If your workload tolerates cold starts and idle periods between traffic spurts, or if you're just testing a new model in a non-prod environment, then by all means go for it; otherwise, you'll find that alternative deployment options like Real-Time Inference (RTI) are probably a better fit. If you're looking for guidance, the diagram below from the Model Hosting FAQs is a good place to start.
How do I choose a model deployment option in Amazon SageMaker?

Demo: NLLB on Serverless Inference πŸš€

If there's one thing I'd like you to take away from this demo, especially if it's your first time using the service, it's that the effort involved in deploying a model with Amazon SageMaker is essentially invariant to the choice of deployment method.
As you'll see in this section, using SI as opposed to, say, RTI is really just a matter of injecting the right set of attributes into the endpoint configuration: for RTI, it's things like instance type and instance count, while for SI we work with attributes like memory size and the number of concurrent requests the model should be able to handle.
Near the end of this article, I'll have more to say about how to find their optimal values, so stay tuned... but first things first.
Let's start by updating the Amazon SageMaker Python SDK
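Something like this should do the trick (a minimal sketch, assuming you're working inside a Jupyter notebook):

```python
# Upgrade the SageMaker Python SDK to the latest release
%pip install -U sagemaker
```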
and importing some useful classes
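Here's a sketch of the imports we'll lean on below:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel          # πŸ€— model wrapper for SageMaker
from sagemaker.serverless import ServerlessInferenceConfig  # SI endpoint configuration
```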
If you're running this demo on a SageMaker Studio environment (preferred) or a SageMaker Notebook Instance, you can retrieve the attached IAM role with the help of get_execution_role (we'll need it in a second to create the model endpoint).
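A minimal sketch (outside a SageMaker environment, you'd pass a role ARN explicitly instead):

```python
from sagemaker import get_execution_role

# IAM role attached to the Studio / Notebook Instance environment
role = get_execution_role()
```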
Next, we define the πŸ€— Hub model configuration, which allows us to select the model as well as the task we're going to use it for
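A sketch of that configuration; the two entries are passed to the inference container as environment variables:

```python
# πŸ€— Hub model configuration: which model to pull and which pipeline task to run
hub = {
    "HF_MODEL_ID": "facebook/nllb-200-distilled-600M",
    "HF_TASK": "translation",
}
```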
I chose the distilled 600M variant of the NLLB-200 model, which is small enough to run on SI, but I strongly encourage you to try a different version or a different model altogether.
We now have everything we need to initialize the model using the HuggingFaceModel class.
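A hedged sketch; the container versions below are one valid Hugging Face DLC combination, so adjust them to whatever is available in your region:

```python
huggingface_model = HuggingFaceModel(
    env=hub,                      # model ID + task from above
    role=role,                    # IAM role with SageMaker permissions
    transformers_version="4.26",  # pick a supported DLC version combination
    pytorch_version="1.13",
    py_version="py39",
)
```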
Lastly, in order to use SI endpoints, we need to create an SI configuration with our memory and concurrency requirements
🚩 Memory size for SI endpoints ranges from 1024 MB (1 GB) to 6144 MB (6 GB), in 1 GB increments. SageMaker automatically assigns compute resources proportional to the memory you select.
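For example (memory maxed out on purpose, more on that later; the max_concurrency value is just a placeholder you should tune to your expected traffic):

```python
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,  # 6 GB, the current maximum
    max_concurrency=10,      # concurrent invocations before throttling kicks in
)
```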
and pass it along to .deploy()
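Roughly like this; the returned predictor is what we'll use to test the endpoint:

```python
predictor = huggingface_model.deploy(
    serverless_inference_config=serverless_config,
)
```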
Depending on the model size, the deployment process should take at least a couple of minutes ⏱️
Let's run a quick test to see if it's working (if there are any Hindi speakers reading this, please let me know if the translated text is correct)
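Here's one way to do it, translating a short English sentence into Hindi; the parameters block is forwarded to the underlying translation pipeline (the sample sentence is mine):

```python
result = predictor.predict({
    "inputs": "Leaving no language behind.",
    "parameters": {
        "src_lang": "eng_Latn",  # source: English, Latin script
        "tgt_lang": "hin_Deva",  # target: Hindi, Devanagari script
    },
})
print(result)  # e.g. [{'translation_text': '...'}]
```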
🎯 The full list of supported languages and their codes is available in the special tokens map and in the model card metadata (language_details).
As an alternative, you can also invoke the endpoint with Boto3
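A sketch using the SageMaker runtime client; the endpoint name is taken from the predictor we just created:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "Leaving no language behind.",
        "parameters": {"src_lang": "eng_Latn", "tgt_lang": "por_Latn"},
    }),
)
print(json.loads(response["Body"].read()))
```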
or call the endpoint URL directly with cURL (keep in mind that SageMaker endpoints expect SigV4-signed requests, so you'll need curl's --aws-sigv4 option or a signing proxy in front).
πŸ’‘ Pro Tip: you can turn the Python snippet above into a locustfile and call it a stress test. If you didn't understand a bit of what I just said, then check the AWS ML Blog post Best practices for load testing Amazon SageMaker real-time inference endpoints which, despite the name, also works with SI endpoints.
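If you want to give it a spin, a minimal locustfile might look like the sketch below; the endpoint name is a placeholder and the request-event wiring follows Locust 2.x conventions:

```python
# locustfile.py -- hedged sketch of a load test against an SI endpoint
import json
import time

import boto3
from locust import User, task, between

ENDPOINT_NAME = "<your-endpoint-name>"  # placeholder, replace with your endpoint


class ServerlessEndpointUser(User):
    wait_time = between(1, 2)

    def on_start(self):
        self.client = boto3.client("sagemaker-runtime")

    @task
    def translate(self):
        payload = {
            "inputs": "Leaving no language behind.",
            "parameters": {"src_lang": "eng_Latn", "tgt_lang": "por_Latn"},
        }
        start = time.perf_counter()
        exception, response_length = None, 0
        try:
            response = self.client.invoke_endpoint(
                EndpointName=ENDPOINT_NAME,
                ContentType="application/json",
                Body=json.dumps(payload),
            )
            response_length = len(response["Body"].read())
        except Exception as e:  # surface throttling / cold-start errors to Locust
            exception = e
        # Report the call to Locust's statistics engine
        self.environment.events.request.fire(
            request_type="sagemaker",
            name="invoke_endpoint",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=response_length,
            exception=exception,
            context={},
        )
```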

Bonus: Serverless Inference Benchmarking Toolkit ✨

If you're the observant type, you may have noticed that I maxed out the memory size of the endpoint (6GB) in the SI configuration without so much as a "how do you do". Do we really need all this RAM? Is this the optimal choice? How can we tell?
There are several different ways to tackle these questions. Today, as a token of my appreciation for following along, I'd like to show you one specifically tailored to SI, the Amazon SageMaker Serverless Inference Benchmarking Toolkit (SIBT).
SIBT works by collecting cost and performance data to help you assess whether SI is the right option for your workloads.
We start by creating a set of representative samples
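For instance (hypothetical sentences and language pair; use payloads that mirror your real traffic as closely as possible):

```python
import json

sample_sentences = [
    "Language is a living thing.",
    "There are over seven thousand languages in use today.",
    "Serverless endpoints scale to zero when idle.",
]

# Each sample mirrors the arguments of a real invoke_endpoint call
example_invoke_args = [
    {
        "Body": json.dumps({
            "inputs": sentence,
            "parameters": {"src_lang": "eng_Latn", "tgt_lang": "por_Latn"},
        }),
        "ContentType": "application/json",
    }
    for sentence in sample_sentences
]
```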
which are converted into a JSONLines file
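In plain Python, that's just one JSON document per line:

```python
invoke_args_file = "invoke_args_examples.jsonl"

with open(invoke_args_file, "w") as f:
    for invoke_args in example_invoke_args:
        f.write(json.dumps(invoke_args) + "\n")
```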
We then feed these examples into the benchmark, which you can run either locally or as a SageMaker Processing Job
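Here's a rough sketch of the Processing Job route; I'm recalling the toolkit's entry point and arguments from memory, so treat every name below as an assumption and double-check it against the SIBT README:

```python
# Assumed API -- verify against the toolkit's documentation before running
from sm_serverless_benchmarking.sagemaker_runner import run_as_sagemaker_job

run_as_sagemaker_job(
    role=role,
    model_name=huggingface_model.name,           # SageMaker model created earlier
    invoke_args_examples_file=invoke_args_file,  # JSONLines file from the previous step
    memory_sizes=[4096, 5120, 6144],             # restrict coverage to the 4-6 GB range
    result_save_path="s3://<your-bucket>/sibt-results",  # placeholder output location
)
```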
In the configuration above, we're restricting the memory size coverage to the 4-6GB range to save time.
You can keep tabs on the benchmark's progress by checking the status of the processing job (for reference, the whole run took about 1 hour βŒ›)
Once it finishes, the tool generates multiple reports on performance (concurrency, endpoint stability) and spend (cost analysis) that are stored in the location specified by result_save_path.
Looking inside the consolidated report, which includes a cross-analysis of performance and cost for each endpoint configuration (picture below), it becomes clear from the hockey-stick curve that lowering the memory size to 5GB offers essentially the same level of performance (average latency +18 ms) at a fraction of the price (-20% when compared to 6GB).
That's all folks! See you next time... πŸ––

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
