Leaving no language behind with Amazon SageMaker Serverless Inference
Host a high-quality translation model at scale using Amazon SageMaker Serverless Inference.
João Galego
Amazon Employee
Published Jul 2, 2024
💬 "Language was just difference. A thousand different ways of seeing, of moving through the world. No; a thousand worlds within one. And translation – a necessary endeavour, however futile, to move between them." – R. F. Kuang, Babel, or the Necessity of Violence: An Arcane History of the Oxford Translators' Revolution
In this blog post, I'm going to show you how to deploy a translation model from the 🤗 Hub with Amazon SageMaker using Serverless Inference (SI). The model in question is No Language Left Behind (NLLB for short), a high-quality translation model from Meta AI with support for over 200 languages, from Acehnese to Zulu. One of the things that immediately stands out about NLLB is its human-centered approach and its inclusion of underrepresented languages like Asturian (~100,000 native speakers) or Minangkabau.
🧠 Did you know: According to Ethnologue, there are 7,164 languages in use today. However, this number is rapidly declining: 40% of all living languages are endangered, while the top 25 by number of speakers account for more than half (!) of the world's population.
Given its focus on underrepresented languages, some of which I'd never heard of before, my relationship with NLLB was really love at first sight. I've been meaning to write something about it for a very long time: using it to push the limits of SI feels like a great excuse. This post is my love letter (love sticky note? 📝) to the intricacies of language and to the fact, which still amazes me, that we can even think about translating between languages spread all over the globe.
⚠️ A word of caution before we proceed: please keep in mind that SI is not for everyone. If your workload tolerates cold starts and idle periods between traffic spurts, or if you're just testing a new model in a non-prod environment, then by all means go for it; otherwise, you'll find that alternative deployment options like Real-Time Inference (RTI) are probably a better fit. If you're looking for guidance, the diagram below from the Model Hosting FAQs is a good place to start.
If there's one thing I'd like you to take away from this demo, especially if it's your first time using the service, it's that the effort involved in deploying a model with Amazon SageMaker is essentially invariant to the choice of deployment method.
As you'll see in this section, using SI as opposed to, say, RTI is really just a matter of injecting the right set of attributes into the endpoint configuration: for RTI, it's things like `instance_type` and instance count, while for SI we work with attributes like memory size or the number of concurrent requests the model should be able to handle. Near the end of this article, I'll have more to say about how to find their optimal values, so stay tuned... but first things first.
Let's start by updating the Amazon SageMaker Python SDK and importing some useful classes.
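A minimal sketch, assuming a Jupyter environment (SageMaker Studio or a Notebook Instance) where the `%pip` magic is available; restart the kernel after the upgrade if the new version isn't picked up:

```python
# Update the SageMaker Python SDK
%pip install -U sagemaker

# Imports used throughout the rest of the walkthrough
import sagemaker
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig
```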
If you're running this demo in a SageMaker Studio environment (preferred) or on a SageMaker Notebook Instance, you can retrieve the attached IAM role with the help of `get_execution_role` (we'll need it in a second to create the model endpoint). Next, we define the 🤗 Hub model configuration, which allows us to select the model as well as the task we're going to use it for.
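For example, with the model ID pointing to the distilled 600M-parameter checkpoint on the Hub:

```python
# IAM role attached to the Studio/Notebook environment
role = get_execution_role()

# Hugging Face Hub model configuration: which model to pull and which task to serve
hub = {
    "HF_MODEL_ID": "facebook/nllb-200-distilled-600M",
    "HF_TASK": "translation",
}
```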
I chose the distilled 600M variant of the NLLB-200 model, which is small enough to run on SI, but I strongly encourage you to try a different version or a different model altogether.
We now have everything we need to initialize the model using the `HuggingFaceModel` class. Lastly, in order to use SI endpoints, we need to create an SI configuration with our memory and concurrency requirements.
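A sketch of both steps. The container versions and the concurrency value below are assumptions; use any `transformers`/`pytorch`/`py` combination supported by the SageMaker Hugging Face inference containers:

```python
# Model definition backed by a Hugging Face Deep Learning Container
huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.37",  # assumed versions; pick any supported combination
    pytorch_version="2.1",
    py_version="py310",
)

# Serverless Inference configuration: 6 GB of memory, a handful of concurrent invocations
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,
    max_concurrency=4,  # illustrative value
)
```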
🚩 Memory size for SI endpoints ranges between 1024 MB (1 GB) and 6144 MB (6 GB) in 1 GB increments. SageMaker will automatically assign compute resources aligned to the memory requirements.
and pass it along for `.deploy`ment. Depending on the model size, the deployment process should take at least a couple of minutes ⏱️
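Something along these lines (the endpoint name is illustrative):

```python
# Deploy the model to a serverless endpoint
predictor = huggingface_model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="nllb-200-distilled-600m",  # illustrative name
)
```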
Let's run a quick test to see if it's working (if there are any Hindi speakers reading this, please let me know if the translated text is correct)
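A quick sketch, passing FLORES-200 language codes (`eng_Latn` → `hin_Deva`) as pipeline parameters; the sample sentence is just an example:

```python
# English -> Hindi
response = predictor.predict({
    "inputs": "Translation is a necessary endeavour, however futile.",
    "parameters": {
        "src_lang": "eng_Latn",  # source language (FLORES-200 code)
        "tgt_lang": "hin_Deva",  # target language (FLORES-200 code)
    },
})
print(response)
# e.g. [{'translation_text': '...'}]
```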
🎯 The full list of supported languages and their codes is available in the special tokens map and in the model card metadata (`language_details`).
As an alternative, you can also invoke the endpoint with Boto3
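For instance, sending the same payload through the SageMaker Runtime client (the endpoint name is the one assumed at deploy time):

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="nllb-200-distilled-600m",  # assumed endpoint name
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "Translation is a necessary endeavour, however futile.",
        "parameters": {"src_lang": "eng_Latn", "tgt_lang": "hin_Deva"},
    }),
)
print(json.loads(response["Body"].read()))
```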
or call the endpoint URL directly with cURL
💡 Pro Tip: you can turn the Python snippet above into a locustfile and call it a stress test. If you didn't understand a bit of what I just said, then check the AWS ML Blog post Best practices for load testing Amazon SageMaker real-time inference endpoints which, despite the name, also works with SI endpoints.
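Just to illustrate the idea, here's what a bare-bones locustfile built around the Boto3 call above might look like; endpoint name and payload are assumptions, and the linked blog post covers a much more robust setup:

```python
# locustfile.py - a rough sketch for stress-testing the SI endpoint with Locust
import json
import time

import boto3
from locust import User, between, task


class NLLBUser(User):
    wait_time = between(1, 3)

    def on_start(self):
        self.client = boto3.client("sagemaker-runtime")

    @task
    def translate(self):
        start = time.time()
        exception = None
        try:
            self.client.invoke_endpoint(
                EndpointName="nllb-200-distilled-600m",  # assumed endpoint name
                ContentType="application/json",
                Body=json.dumps({
                    "inputs": "Language was just difference.",
                    "parameters": {"src_lang": "eng_Latn", "tgt_lang": "hin_Deva"},
                }),
            )
        except Exception as e:
            exception = e
        # Report the call to Locust's statistics engine
        self.environment.events.request.fire(
            request_type="sagemaker",
            name="invoke_endpoint",
            response_time=(time.time() - start) * 1000,
            response_length=0,
            exception=exception,
        )
```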
If you're the observant type, you may have noticed that I maxed out the memory size of the endpoint (6 GB) in the SI configuration without so much as a "how do you do". Do we really need all this RAM? Is this the optimal choice? How can we tell?
There are several different ways to tackle these questions. Today, as a token of my appreciation for following along, I'd like to show you one specifically tailored to SI: the Amazon SageMaker Serverless Inference Benchmarking Toolkit (SIBT).
SIBT works by collecting cost and performance data to help you assess whether SI is the right option for your workloads.
We start by creating a set of representative samples, which are converted into a JSONLines file.
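A sketch based on the toolkit's documented usage (install it first with `pip install sm-serverless-benchmarking`); the sample sentences and target languages are placeholders:

```python
import json

from sm_serverless_benchmarking.utils import convert_invoke_args_to_jsonl

# A handful of representative invocation arguments (placeholders)
sample_invoke_args = [
    {
        "Body": json.dumps({
            "inputs": text,
            "parameters": {"src_lang": "eng_Latn", "tgt_lang": lang},
        }),
        "ContentType": "application/json",
    }
    for text, lang in [
        ("Language was just difference.", "hin_Deva"),
        ("A thousand different ways of seeing.", "ast_Latn"),
        ("No; a thousand worlds within one.", "zul_Latn"),
    ]
]

# Convert the samples into a JSONLines file the benchmark can consume
example_args_file = convert_invoke_args_to_jsonl(sample_invoke_args, output_path=".")
```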
We then feed these examples into the benchmark, which you can run either locally or as a SageMaker Processing Job
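Roughly like this, following the toolkit's README; the parameter names may differ slightly between versions, and the model name and S3 path below are hypothetical:

```python
# Option 1: run the benchmark locally
from sm_serverless_benchmarking import benchmark

report = benchmark.run_serverless_benchmark(
    sm_model_name="<sagemaker-model-name>",          # name of the SageMaker Model resource
    invoke_args_examples_file=example_args_file,
    memory_sizes=[4096, 5120, 6144],                 # restrict coverage to the 4-6 GB range
    result_save_path="benchmark_results",
)

# Option 2: run it as a SageMaker Processing Job
from sm_serverless_benchmarking.sagemaker_runner import run_as_sagemaker_job

run_as_sagemaker_job(
    role=role,
    model_name="<sagemaker-model-name>",
    invoke_args_examples_file=example_args_file,
    memory_sizes=[4096, 5120, 6144],
    result_save_path="s3://<bucket>/benchmark-results",  # hypothetical S3 location
)
```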
In the configuration above, we're restricting the memory size coverage to the 4-6 GB range to save time. You can keep tabs on the benchmark's progress by checking the status of the processing job (as a reference, the whole thing took about 1 hour ⏳).
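For example, with Boto3 (the job name below is hypothetical; use the one shown in the SageMaker console):

```python
import boto3

sm_client = boto3.client("sagemaker")

# Hypothetical processing job name
job = sm_client.describe_processing_job(ProcessingJobName="nllb-serverless-benchmark")
print(job["ProcessingJobStatus"])  # InProgress | Completed | Failed | Stopping | Stopped
```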
Once it finishes, the tool generates multiple reports on performance (concurrency, endpoint stability) and spend (cost analysis) that are stored in the location specified by `result_save_path`. Looking inside the consolidated report, which includes a cross-analysis of performance and cost for each endpoint configuration (picture below), it becomes clear from the hockey-stick curve that lowering the memory size to 5 GB offers essentially the same level of performance (average latency +18 ms) at a fraction of the price (-20% when compared to 6 GB).
That's all folks! See you next time... 👋
- (NLLB Team et al., 2022) No Language Left Behind: Scaling Human-Centered Machine Translation
- Amazon SageMaker Examples - includes a section on Serverless Inference
- SageMaker Serverless Inference Benchmarking Toolkit - a tool to benchmark SageMaker serverless endpoint configurations and help find the optimal one
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.