Deploy a small LLM to a device using AWS IoT Greengrass

How to deploy a small language model to a device

Randy D
Amazon Employee
Published Apr 30, 2024
Small language models (SLMs) are generative AI foundation models designed to run on relatively low-powered hardware. While the largest models have tens or hundreds of billions of parameters and require powerful GPUs, an SLM may have only a few billion parameters and can run on a device with a few GB of RAM and no GPU.
I thought it'd be interesting to look at deploying an SLM to a device in an IoT scenario. Language models could help device operators use or troubleshoot equipment more effectively by explaining diagnostic codes, offering tips for proper use, or even interfacing with other equipment using agents. Devices may have limited connectivity to the cloud, or you may have concerns about transmitting data outside of a facility, so having a "local to the device" SLM is useful.
To test this out, I used AWS IoT Greengrass to deploy an SLM to a simulated Greengrass core device. The first step is setting up an EC2 instance to act as that device. I launched an m5.large instance (2 vCPU and 8 GB of RAM) with Ubuntu 24.04 and followed the instructions to configure it as a Greengrass core device. Note that the IAM role used by the core device needs read access to the S3 bucket that stores the model artifacts.
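As a rough illustration of the read access mentioned above, the sketch below builds a minimal IAM policy document that allows s3:GetObject on the objects in one bucket. The helper name, the bucket name "my-model-bucket", and the "phi/" prefix are placeholders chosen for this example, not values from the original setup, and the policy would still need to be attached to the core device's role.

```python
import json


def artifact_read_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy document allowing reads of objects in one bucket."""
    # Hypothetical helper: bucket and prefix are placeholders for illustration.
    resource = f"arn:aws:s3:::{bucket}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": resource,
            }
        ],
    }


# Print the policy as JSON so it can be pasted into the IAM console or a file.
print(json.dumps(artifact_read_policy("my-model-bucket", "phi/"), indent=2))
```

Scoping the resource to a prefix rather than the whole bucket is a design choice in this sketch; a bucket-wide "/*" resource would also work.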
Next, I picked an SLM to deploy. I used the new Phi-3 SLM from Microsoft, as it shows good performance and already has an ONNX version available. Using ONNX keeps the deployment agnostic to the device's ML framework and accelerator.
Using this sample as a starting point, I then wrote two Greengrass component recipes. The first just sets up the ONNX Runtime and other dependencies on the core device. The recipe (shown below) creates a Python virtual environment at /opt/venv and installs numpy and the pre-release onnxruntime-genai package.
{
  "RecipeFormatVersion": "2020-01-25",
  "ComponentName": "com.demo.onnxruntime.phi",
  "ComponentVersion": "1.0.0",
  "ComponentDescription": "A component that installs the ONNX Runtime",
  "ComponentPublisher": "Amazon",
  "Manifests": [
    {
      "Platform": {
        "os": "linux"
      },
      "Lifecycle": {
        "Install": {
          "RequiresPrivilege": true,
          "Script": "python3 -m venv /opt/venv && . /opt/venv/bin/activate && python3 -m pip install numpy && python3 -m pip install -U --pre onnxruntime-genai",
          "timeout": "900"
        }
      }
    }
  ]
}
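The second recipe downloads the model artifacts to the core device and installs the awsiotsdk module into the same virtual environment. Note that you need to replace the {BUCKET} placeholder in the artifact URI (line 27 of this recipe) with your own bucket name.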
{
  "RecipeFormatVersion": "2020-01-25",
  "ComponentName": "com.demo.onnx-phi",
  "ComponentVersion": "1.0.0",
  "ComponentDescription": "A component that installs the Phi SLM",
  "ComponentPublisher": "Amazon",
  "ComponentDependencies": {
    "com.demo.onnxruntime.phi": {
      "VersionRequirement": ">=1.0.0",
      "DependencyType": "HARD"
    }
  },
  "Manifests": [
    {
      "Platform": {
        "os": "linux"
      },
      "Lifecycle": {
        "Install": {
          "RequiresPrivilege": true,
          "Script": ". /opt/venv/bin/activate && python3 -m pip install awsiotsdk",
          "timeout": "900"
        }
      },
      "Artifacts": [
        {
          "URI": "s3://{BUCKET}/phi/greengrass-onnx.zip",
          "Unarchive": "ZIP"
        }
      ]
    }
  ]
}
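If you prefer not to edit the recipe by hand, the {BUCKET} placeholder can be filled in with a few lines of Python. The sketch below is only an illustration: the helper name and the bucket name "my-model-bucket" are made up for this example, and the template string is a trimmed-down stand-in for the full recipe file.

```python
import json


def fill_bucket(recipe_template: str, bucket: str) -> dict:
    """Replace the {BUCKET} placeholder and parse the result as JSON."""
    # str.replace is used rather than str.format because the JSON braces
    # in the recipe would be misread as format fields.
    rendered = recipe_template.replace("{BUCKET}", bucket)
    # Parsing confirms the edited recipe is still valid JSON.
    return json.loads(rendered)


# Trimmed-down stand-in for the recipe; only the artifact entry is shown.
template = (
    '{"Artifacts": [{"URI": "s3://{BUCKET}/phi/greengrass-onnx.zip", '
    '"Unarchive": "ZIP"}]}'
)
recipe = fill_bucket(template, "my-model-bucket")  # hypothetical bucket name
print(recipe["Artifacts"][0]["URI"])
```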
The zip file referenced in the recipe contains the model artifacts. To create this zip file, I downloaded the Phi-3 model from Hugging Face:
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx
I zipped up the downloaded artifacts and uploaded the zip file to my S3 bucket.
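The zipping step could be scripted along these lines. This is a sketch under assumptions: the helper name is made up, and nesting the files under a top-level "model" folder is inferred from the "-m model" argument used with the inference script later in this post, so adjust it to match your own layout. The zip file should be written outside the directory being archived so that it does not include itself.

```python
import zipfile
from pathlib import Path


def zip_model_dir(model_dir: Path, zip_path: Path, top_level: str = "model") -> list[str]:
    """Zip every file under model_dir into zip_path, nested under top_level/."""
    names = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(model_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to model_dir, prefixed with the top-level folder.
                arcname = f"{top_level}/{path.relative_to(model_dir).as_posix()}"
                archive.write(path, arcname)
                names.append(arcname)
    return names


# Example usage (paths and bucket name are placeholders):
# zip_model_dir(Path("path/to/downloaded/model"), Path("greengrass-onnx.zip"))
# The archive could then be uploaded with boto3, for instance:
# boto3.client("s3").upload_file("greengrass-onnx.zip", "my-model-bucket", "phi/greengrass-onnx.zip")
```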
After deploying the two components, you're ready to use the SLM. On the core device, activate the Python virtual environment, download the sample inference script, and try it out.
source /opt/venv/bin/activate
cd /greengrass/v2/packages/artifacts-unarchived/com.demo.onnx-phi/1.0.0/greengrass-onnx
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/model-qa.py -o model-qa.py
python model-qa.py -m model -k 40 -p 0.95 -t 0.8 -r 1.0
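The sample inference script runs in an interactive loop; the -k, -p, -t, and -r flags above set the top-k, top-p, temperature, and repetition penalty used for sampling. You can send it a sample prompt in this format: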
<|user|>Tell me a fact about Ubuntu OS<|end|><|assistant|>
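If another program on the device generates the prompts, a small helper can wrap the user's text in the same tags. The function below is a hypothetical convenience that simply mirrors the format shown above; it is not part of the sample script.

```python
def format_phi3_prompt(user_message: str) -> str:
    """Wrap a user message in the chat tags used in the example prompt above."""
    # Mirrors the single-line format from this post: user turn, end tag, assistant tag.
    return f"<|user|>{user_message}<|end|><|assistant|>"


prompt = format_phi3_prompt("Tell me a fact about Ubuntu OS")
print(prompt)
```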
In a real scenario you could integrate the SLM into any other application running on the device. You could also use the AWS IoT services to capture model input, output, and diagnostics, and send them to an MQTT topic for auditing and analysis in the cloud.
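As one possible shape for that auditing path, the sketch below packages a prompt, response, and latency into a JSON payload and publishes it to AWS IoT Core through the Greengrass IPC client in awsiotsdk (the module installed by the second recipe). The function names, record fields, and topic are assumptions made for illustration. Publishing only works from inside a Greengrass component whose recipe authorizes the MQTT proxy IPC service, which the recipes in this post do not yet configure.

```python
import json
import time


def build_audit_payload(prompt: str, response: str, latency_s: float) -> bytes:
    """Serialize one inference record as UTF-8 JSON (field names are illustrative)."""
    record = {
        "timestamp": int(time.time()),
        "prompt": prompt,
        "response": response,
        "latency_s": round(latency_s, 3),
    }
    return json.dumps(record).encode("utf-8")


def publish_audit_record(topic: str, payload: bytes) -> None:
    """Publish the payload to an MQTT topic via Greengrass IPC."""
    # Imported lazily so build_audit_payload can be used without awsiotsdk installed.
    from awsiot.greengrasscoreipc.clientv2 import GreengrassCoreIPCClientV2
    from awsiot.greengrasscoreipc.model import QOS

    client = GreengrassCoreIPCClientV2()
    try:
        client.publish_to_iot_core(
            topic_name=topic, qos=QOS.AT_LEAST_ONCE, payload=payload
        )
    finally:
        client.close()


# Example usage inside a component (topic name is a placeholder):
# publish_audit_record("demo/slm/audit", build_audit_payload("question", "answer", 1.2))
```

Keeping payload construction separate from publishing makes the record format easy to check on a development machine before the component is deployed.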
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
