Running Small Language Models on AWS Lambda 🤏
Learn how to run small language models (SLMs) at scale on AWS Lambda with function URLs and response streaming.
João Galego
Amazon Employee
Published Apr 25, 2024
Last Modified Jun 28, 2024
In this post, I'm going to show you a neat way to deploy small language models (SLMs), or quantized versions of larger ones, on AWS Lambda using function URLs and response streaming.
🚩 As of this writing, Lambda supports response streaming only on Node.js managed runtimes (14.x and later).
👨💻 All code and documentation for this post is available on GitHub.
27-06-2024: SLaMbda was featured in Let's Build a Startup! (S2E4)! 🎥
Since I'm not a Node.js expert, I'll be using GPT4All, which has a simple Node.js package for LLM bindings with some nice streaming capabilities. The code should be easy to adapt to other frameworks anyway.
Inspired by makit's article 'Running an LLM inside an AWS Lambda function', I'm also going to deploy `gpt4all-falcon-newbpe-q4_0.gguf` (trained by TII, fine-tuned by Nomic AI) just to have a baseline for comparison.

We'll also need to download the model config file to run in `offline` mode, i.e. without the need to pull anything from the GPT4All website. If you're interested, here's our model configuration:
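I won't copy the exact file here, but it's essentially a single entry from GPT4All's model catalog (`models3.json`), along these lines, with placeholders where you'd drop in the real checksum and file size:

```json
[
  {
    "order": "a",
    "name": "GPT4All Falcon",
    "filename": "gpt4all-falcon-newbpe-q4_0.gguf",
    "filesize": "<size-of-the-gguf-in-bytes>",
    "md5sum": "<md5-of-the-downloaded-gguf>",
    "ramrequired": "8",
    "parameters": "7 billion",
    "quant": "q4_0",
    "type": "Falcon",
    "systemPrompt": "",
    "promptTemplate": "### Instruction:\n%1\n### Response:\n"
  }
]
```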
Next, we need to create the handler function. So time to RTFM!
There's a nice section in the AWS Lambda developer guide on how to configure Lambda functions for response streaming, with a tutorial and all.
In a nutshell, here's what we need to do:
- Create a `pipeline` to handle the stream (⚠️ StackOverflow has a nice entry on why you should avoid `pipe` and `write`)
- Load the model to CPU using the `offline` mode (this is not mandatory, but it does speed things up)
- Create a completion stream to handle user requests
I'm sure the Node.js purists will have a field day with this one 😅 Here goes nothing:
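What follows is a trimmed-down sketch rather than the exact script from the repo: it assumes the `gpt4all` bindings' `loadModel`/`createCompletionStream` API, the `awslambda.streamifyResponse` helper injected by the Lambda Node.js runtime, and model files under `/var/task/models` (matching the Dockerfile sketch further down).

```javascript
// index.mjs: a minimal sketch of a streaming handler (not the exact repo code)
import { pipeline } from "node:stream/promises";
import { loadModel, createCompletionStream } from "gpt4all";

// Assumed locations: adjust to wherever the .gguf and its config file live inside the image
const MODEL_NAME = "gpt4all-falcon-newbpe-q4_0.gguf";
const MODEL_PATH = process.env.MODEL_PATH ?? "/var/task/models";

// Load the model once, outside the handler, so warm invocations can reuse it
const model = await loadModel(MODEL_NAME, {
  modelPath: MODEL_PATH,
  modelConfigFile: `${MODEL_PATH}/models3.json`,
  allowDownload: false, // offline mode: never pull anything from the GPT4All website
  device: "cpu",
});

// awslambda.streamifyResponse is provided globally by the Lambda Node.js runtime
export const handler = awslambda.streamifyResponse(
  async (event, responseStream, _context) => {
    // Function URLs may base64-encode the request body
    const body = event.isBase64Encoded
      ? Buffer.from(event.body, "base64").toString()
      : event.body;
    const { prompt } = JSON.parse(body ?? "{}");

    // createCompletionStream exposes a readable token stream plus a result promise
    const completion = createCompletionStream(model, prompt);

    // pipeline() handles backpressure and closes the response stream when the tokens run out
    await pipeline(completion.tokens, responseStream);
    await completion.result;
  }
);
```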
Don't forget to add the `gpt4all` package as a dependency (`npm install gpt4all`).

Finally, let's package the whole thing inside a Docker image 🐋
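The Dockerfile in the repo is the source of truth; a rough sketch, assuming the AWS-provided Node.js 20 base image and that the `.gguf` and config file sit next to the Dockerfile, looks like this:

```dockerfile
# A sketch, not the repo's exact Dockerfile
FROM public.ecr.aws/lambda/nodejs:20

# Install dependencies (including gpt4all)
COPY package*.json ${LAMBDA_TASK_ROOT}/
RUN npm install

# Copy the handler and the model files
# (LAMBDA_TASK_ROOT resolves to /var/task, matching the handler's MODEL_PATH)
COPY index.mjs ${LAMBDA_TASK_ROOT}/
COPY gpt4all-falcon-newbpe-q4_0.gguf models3.json ${LAMBDA_TASK_ROOT}/models/

CMD ["index.handler"]
```

Build it locally with `docker build -t slambda .` (`slambda` is just the image name I'll reuse in the ECR steps below).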
I'm calling it SLaMbda, as in SLM on Lambda, because I suck with names. Feel free to choose another one if you wish.
Time to get this up and running!
I'll be using the AWS CLI to perform all the deployment steps, but there are certainly better tools for the job, like AWS SAM or the Serverless Framework.
The commands below assume that you have defined some environment variables for the account ID (`AWS_ACCOUNT_ID`) and the region where you want to deploy the function (`AWS_DEFAULT_REGION`). If you don't have these set up, just run the following commands:
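Something along these lines works (the region value is just a placeholder, pick whichever one you're deploying to):

```bash
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_DEFAULT_REGION=us-east-1
```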
First, we need to push the Docker image to an ECR repository.
Let's log into the ECR registry
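The standard `get-login-password` flow should do the trick:

```bash
aws ecr get-login-password --region $AWS_DEFAULT_REGION | \
  docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
```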
and create a new repository
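I'm using `slambda` as the repository (and image) name throughout these sketches; swap in your own:

```bash
aws ecr create-repository --repository-name slambda
```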
Once the repository is ready, just tag the image and push it to ECR
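Assuming the image was built locally as `slambda:latest`, something like this should work:

```bash
export ECR_URI=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/slambda

docker tag slambda:latest $ECR_URI:latest
docker push $ECR_URI:latest
```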
Next, we need to create the Lambda function that is going to run my spaghetti script 🍝. This function will need an execution role, so let's get that out of the way first
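Here's one way to do it with the CLI; the role name (`slambda-role`) and the trust policy file name are assumptions:

```bash
# Trust policy that lets Lambda assume the role
cat > trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "lambda.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role \
  --role-name slambda-role \
  --assume-role-policy-document file://trust-policy.json

# Basic execution permissions (CloudWatch Logs)
aws iam attach-role-policy \
  --role-name slambda-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
```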
Now it's time to create the Lambda function λ
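Roughly like this, assuming the image and role from the previous steps (give IAM a few seconds to propagate the new role before running it):

```bash
aws lambda create-function \
  --function-name slambda \
  --package-type Image \
  --code ImageUri=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/slambda:latest \
  --role arn:aws:iam::$AWS_ACCOUNT_ID:role/slambda-role \
  --timeout 300 \
  --memory-size 10240
```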
Here are a few important things to notice:
- 📦 Package type is set to `Image`, which we're grabbing from ECR, but you can also implement this with `Zip` and add a Lambda layer for the model.
- ⏱️ Timeout is set to `5mins` so it has enough time to load the model, handle the user's request, and generate the tokens.
- 🧠 Memory size is set to the maximum allowed value (`10GB`) to improve performance; cf. AWS Lambda now supports up to 10 GB of memory and 6 vCPU cores for Lambda Functions for more information.
Finally, let's create a URL endpoint for our function
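With the CLI, that's a single call; the `--auth-type` and `--invoke-mode` values are exactly the ones discussed below:

```bash
aws lambda create-function-url-config \
  --function-name slambda \
  --auth-type AWS_IAM \
  --invoke-mode RESPONSE_STREAM
```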
🔐 Notice that the authentication to our function URL is handled by IAM, which means that all requests must be signed using AWS Signature Version 4 (SigV4); cf. Invoking Lambda function URLs for additional details.
We're also enabling `RESPONSE_STREAM`ing, as promised, to reduce the time to first token (TTFT). Be aware that if you're testing your function through the Lambda console, you'll always see responses as `BUFFERED`.

Save the function URL somewhere or just run this little snippet
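This is my stand-in for the repo's snippet; it just stashes the URL in an environment variable:

```bash
export FUNCTION_URL=$(aws lambda get-function-url-config \
  --function-name slambda \
  --query FunctionUrl \
  --output text)
```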
The easiest way to see this in action is to use a simple curl command:
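Here's a rough version, assuming curl 7.75 or later (for `--aws-sigv4`) and long-lived credentials in the usual environment variables; the request body shape (`prompt`) matches the handler sketch above:

```bash
curl -N "$FUNCTION_URL" \
  --aws-sigv4 "aws:amz:${AWS_DEFAULT_REGION}:lambda" \
  --user "${AWS_ACCESS_KEY_ID}:${AWS_SECRET_ACCESS_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me a joke about serverless computing"}'
```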
The `--no-buffer`/`-N` option disables buffering of the output stream so we can see the chunks arriving in real time.

The first run will be slow 🐢 because it has to load the model, but the second time will be a lot faster 🐇 Just make sure you keep the Lambda function warm!
Thanks for reading this far and see you next time! 👋
- GPT4All - free-to-use, locally running, privacy-aware chatbot
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.