
Leaving no language behind with Amazon SageMaker Serverless Inference 🌍💬
Host a high-quality translation model at scale using Amazon SageMaker Serverless Inference.
💬 “Language was just difference. A thousand different ways of seeing, of moving through the world. No; a thousand worlds within one. And translation – a necessary endeavour, however futile, to move between them.” ― R. F. Kuang, Babel, or the Necessity of Violence: An Arcane History of the Oxford Translators' Revolution
In this post, we'll host a translation model from the 🤗 Hub with Amazon SageMaker using Serverless Inference (SI). The model, NLLB-200 (distilled, 600M parameters), translates between 200 languages, from Acehnese to Zulu.

That coverage matters: there are 7,164 languages in use today, but the number is rapidly declining. 40% of all living languages are endangered, while the top 25 by number of speakers account for more than half (!) of the world population.

A quick word on the deployment option before we start. If SI is a good match for your prod environment then, by all means, go for it; otherwise, you'll find that alternative deployment options like Real-Time Inference (RTI) are probably a better fit. If you're looking for guidance, the decision diagram in the Model Hosting FAQs is a good place to start. The two options are also configured differently: for RTI we choose attributes like instance_type and instance count, while for SI we work with attributes like memory size or the number of concurrent requests the model should be able to handle.
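To make the difference concrete, here is a purely illustrative sketch of the two deploy calls side by side (the model object stands in for the HuggingFaceModel we'll build below; instance type, count, memory size and concurrency are placeholder values):

# Illustrative comparison only: `model` stands in for the HuggingFaceModel created later in this post
from sagemaker.serverless import ServerlessInferenceConfig

# Real-Time Inference: we pick the hardware explicitly
rti_predictor = model.deploy(
    instance_type="ml.m5.xlarge",   # placeholder instance type
    initial_instance_count=1,       # placeholder instance count
)

# Serverless Inference: we pick memory and concurrency, SageMaker picks the compute
si_predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=4096,     # placeholder memory size
        max_concurrency=5,          # placeholder concurrency
    ),
)

With that out of the way, let's make sure we're running a recent version of the SageMaker Python SDK: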
pip install -U sagemaker
pip show sagemaker
Name: sagemaker
Version: 2.224.2
Summary: Open source library for training and deploying models on Amazon SageMaker.
Home-page: https://github.com/aws/sagemaker-python-sdk/
Author: Amazon Web Services
Author-email:
License: Apache License 2.0
Location: /opt/conda/lib/python3.10/site-packages
Requires: attrs, boto3, cloudpickle, docker, google-pasta, importlib-metadata, jsonschema, numpy, packaging, pandas, pathos, platformdirs, protobuf, psutil, PyYAML, requests, schema, smdebug-rulesconfig, tblib, tqdm, urllib3
Required-by:
from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig
We also fetch the execution role via get_execution_role (we'll need it in a second to create the model endpoint).
import sagemaker
role = sagemaker.get_execution_role()
print(f"Role 👷: {role}")
Next, we specify which model to host and which task to run via environment variables:
env = {
    'HF_MODEL_ID': "facebook/nllb-200-distilled-600M",  # model to pull from the 🤗 Hub
    'HF_TASK': "translation"                            # pipeline task the container should run
}
These settings, together with the execution role, go into the HuggingFaceModel class:
model = HuggingFaceModel(
    transformers_version="4.37.0",  # Transformers version used
    pytorch_version="2.1.0",        # PyTorch version used
    py_version='py310',             # Python version used
    env=env,
    role=role,
)
🚩 Memory size for SI endpoints ranges between 1024 MB (1 GB) and 6144 MB (6 GB), in 1 GB increments. SageMaker automatically assigns compute resources in proportion to the memory you select.
serverless_inference_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,  # Memory size of the endpoint
    max_concurrency=10       # Max number of concurrent invocations the endpoint can handle
)
Now for the actual .deployment:
nllb = model.deploy(
    serverless_inference_config=serverless_inference_config
)
🎯 The full list of supported languages and their codes is available in the special tokens map and in the model card metadata (language_details).
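If you'd rather pull the codes programmatically, here is a small local sketch; it assumes the transformers library is installed and relies on the NLLB language codes being registered as additional special tokens on the tokenizer, which should hold for recent transformers releases:

# Sketch: list the NLLB-200 language codes locally (assumes `pip install transformers`)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
# Codes such as "eng_Latn" or "zul_Latn" are stored as additional special tokens
lang_codes = [tok for tok in tokenizer.additional_special_tokens if "_" in tok]
print(len(lang_codes), lang_codes[:5])

With a source and a target code picked, we can call the endpoint: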
nllb.predict({
    'inputs': "La vie est belle.",
    'parameters': {
        'src_lang': "fra_Latn",  # French (Latin script)
        'tgt_lang': "hin_Deva"   # Hindi (Devanagari script)
    }
})
# Output: [{'translation_text': 'जीवन सुंदर है।'}]
The endpoint isn't tied to the SageMaker Python SDK: we can also invoke it directly through the SageMaker Runtime API, e.g. with boto3.

import json
import boto3

# Initialize client
# https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Operations_Amazon_SageMaker_Runtime.html
sm_runtime = boto3.client("runtime.sagemaker")

# Create payload
payload = {
    'inputs': "La vie est belle.",
    'parameters': {
        'src_lang': "fra_Latn",
        'tgt_lang': "hin_Deva"
    }
}
payload = json.dumps(payload)

# Call endpoint and process response
response = sm_runtime.invoke_endpoint(
    EndpointName=nllb.endpoint_name,  # name of the endpoint we just deployed
    ContentType="application/json",
    Body=payload
)
json.loads(response["Body"].read().decode())
# Output: [{"translation_text":"जीवन सुंदर है।"}]
Or, if you prefer to skip the SDKs altogether, a SigV4-signed HTTP request does the job:
curl --aws-sigv4 "aws:amz:$AWS_REGION:sagemaker" \
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
-H "x-amz-security-token: $AWS_SESSION_TOKEN" \
-H "content-type: application/json" \
-d '{"inputs": "La vie est belle.", "parameters": {"src_lang": "fra_Latn", "tgt_lang": "hin_Deva"}}' \
https://runtime.sagemaker.$AWS_REGION.amazonaws.com/endpoints/$ENDPOINT_NAME/invocations
💡 Pro Tip: you can turn the Python snippet above into a locustfile and call it a stress test. If you didn't understand a bit of what I just said, then check the AWS ML Blog post Best practices for load testing Amazon SageMaker real-time inference endpoints which, despite the name, also works with SI endpoints.
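If you'd like to see what that could look like, here is a minimal locustfile sketch along the same lines; the endpoint name is a placeholder, and the manual event reporting mirrors the custom-client pattern from that blog post:

# locustfile.py — minimal sketch (assumes `pip install locust boto3`; endpoint name is a placeholder)
import json
import time

import boto3
from locust import User, task, between


class SageMakerUser(User):
    wait_time = between(1, 3)

    def on_start(self):
        self.client = boto3.client("runtime.sagemaker")
        self.payload = json.dumps({
            "inputs": "La vie est belle.",
            "parameters": {"src_lang": "fra_Latn", "tgt_lang": "hin_Deva"},
        })

    @task
    def invoke(self):
        start = time.perf_counter()
        exception = None
        try:
            self.client.invoke_endpoint(
                EndpointName="<your-endpoint-name>",  # placeholder
                ContentType="application/json",
                Body=self.payload,
            )
        except Exception as e:  # report any failure to locust
            exception = e
        # Record the request so it shows up in locust's statistics
        self.environment.events.request.fire(
            request_type="sagemaker",
            name="invoke_endpoint",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=0,
            exception=exception,
        )

Run it with something like locust -f locustfile.py --headless -u 10 -r 2 to simulate ten concurrent users.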
Earlier, we picked the maximum memory size (6GB) in the SI configuration without so much as a "how do you do". Do we really need all this RAM? Is this the optimal choice? How can we tell? The SageMaker Serverless Inference Benchmarking Toolkit can help us find out.
# See https://github.com/aws-samples/sagemaker-serverless-inference-benchmarking
pip install sm-serverless-benchmarking
With the library installed, we put together a few representative requests and convert them into the JSONL format the benchmark expects:

from sm_serverless_benchmarking.utils import convert_invoke_args_to_jsonl  # helper from the benchmarking toolkit

example_invoke_args = [
    {
        'Body': {
            'inputs': "No language should be left behind!",
            'parameters': {
                'src_lang': "eng_Latn",
                'tgt_lang': "por_Latn"
            }
        },
        'ContentType': "application/json"
    },
    {
        'Body': {
            'inputs': "La vie est belle.",
            'parameters': {
                'src_lang': "fra_Latn",
                'tgt_lang': "hin_Deva"
            }
        },
        'ContentType': "application/json"
    },
    {
        'Body': {
            'inputs': "Доверяй, но проверяй.",
            'parameters': {
                'src_lang': "rus_Cyrl",
                'tgt_lang': "mri_Latn"
            }
        },
        'ContentType': "application/json"
    },
]

# The request body must be a JSON string, so serialize each example
for i in range(len(example_invoke_args)):
    example_invoke_args[i]['Body'] = json.dumps(example_invoke_args[i]['Body'])

# Write the examples to a JSONL file
example_invoke_file = convert_invoke_args_to_jsonl(example_invoke_args)
Each line of the resulting file holds one JSON-encoded request:
{"Body": "{\"inputs\": \"No language should be left behind!\", \"parameters\": {\"src_lang\": \"eng_Latn\", \"tgt_lang\": \"por_Latn\"}}", "ContentType": "application/json"}
{"Body": "{\"inputs\": \"La vie est belle.\", \"parameters\": {\"src_lang\": \"fra_Latn\", \"tgt_lang\": \"hin_Deva\"}}", "ContentType": "application/json"}
{"Body": "{\"inputs\": \"\\u0414\\u043e\\u0432\\u0435\\u0440\\u044f\\u0439, \\u043d\\u043e \\u043f\\u0440\\u043e\\u0432\\u0435\\u0440\\u044f\\u0439.\", \"parameters\": {\"src_lang\": \"rus_Cyrl\", \"tgt_lang\": \"mri_Latn\"}}", "ContentType": "application/json"}
We then kick off the benchmark as a SageMaker Processing job:
from sm_serverless_benchmarking.sagemaker_runner import run_as_sagemaker_job

benchmark = run_as_sagemaker_job(
    role=role,
    model_name=model.name,                         # name of the SageMaker Model created by .deploy()
    invoke_args_examples_file=example_invoke_file,
    memory_sizes=[4096, 5120, 6144]                # memory configurations to benchmark
)
Note that we only benchmark the 4-6GB range to save time. Even so, the job takes a while to complete (about 1 hour ⌛), so we poll its status until it's done:
import time

# Initialize SageMaker client
sm = boto3.client('sagemaker')

# Wait for the benchmark run to finish
while (status := sm.describe_processing_job(ProcessingJobName=benchmark.latest_job.name)['ProcessingJobStatus']) == "InProgress":
    print(".", end="")
    time.sleep(60)
print(f"{status}!")
Once the job finishes, the reports are saved under the result_save_path output location:
# Where the reports are stored
print(f"Results: {benchmark.latest_job.outputs[0].destination}")
Comparing the benchmark reports, 5GB offers essentially the same level of performance (average latency +18 ms) at a fraction of the price (-20% when compared to 6GB).

- (NLLB Team et al., 2022) No Language Left Behind: Scaling Human-Centered Machine Translation
- Amazon SageMaker Examples - includes a section on Serverless Inference
- SageMaker Serverless Inference Toolkit - tool to benchmark SageMaker serverless endpoint configurations and help find the most optimal one
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.