Running Small Language Models on AWS Lambda 🤏

Learn how to run small language models (SLMs) at scale on AWS Lambda with function URLs and response streaming.

João Galego
Amazon Employee
Published Apr 25, 2024
Last Modified Jun 28, 2024

Overview

In this post, I'm going to show you a neat way to deploy small language models (SLMs) or quantized versions of larger ones on AWS Lambda using function URLs and response streaming.
🚩 As of this writing, Lambda supports response streaming only on Node.js managed runtimes (14.x+).
👨‍💻 All code and documentation for this post is available on GitHub.

News 📢

Build 🛠️

Since I'm not a Node.js expert, I'll be using GPT4All, which has a simple Node.js package for LLM bindings with some nice streaming capabilities. The code should be easy to adapt to other frameworks anyway.
Inspired by makit's article 'Running an LLM inside an AWS Lambda function', I'm also going to deploy gpt4all-falcon-newbpe-q4_0.gguf (trained by TII, fine-tuned by Nomic AI) just to have a baseline for comparison.
We'll also need to download the model config file to run in offline mode, i.e. without pulling anything from the GPT4All website at runtime.
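Something like the following should do (the URLs below are my best guess at the GPT4All download locations; double-check them against the GPT4All website before running):

```bash
mkdir -p models

# Model weights (~4 GB); URL assumed from the GPT4All download catalog
wget -P models https://gpt4all.io/models/gguf/gpt4all-falcon-newbpe-q4_0.gguf

# Model configuration file, needed to run in offline mode
wget -P models https://gpt4all.io/models/models3.json
```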
If you're interested, here's our model configuration:
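It's just the catalog entry for our model, trimmed down to the essentials. Treat the values below as placeholders and copy the real ones from the GPT4All model list:

```json
[
    {
        "name": "GPT4All Falcon",
        "filename": "gpt4all-falcon-newbpe-q4_0.gguf",
        "filesize": "...",
        "md5sum": "...",
        "ramrequired": 8,
        "parameters": "7 billion",
        "quant": "q4_0",
        "type": "Falcon",
        "systemPrompt": "",
        "promptTemplate": "### Instruction:\n%1\n### Response:\n"
    }
]
```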
Next, we need to create the handler function. So time to RTFM!
There's a nice section in the AWS Lambda Developer Guide on how to configure Lambda functions for response streaming, with a tutorial and all.
In a nutshell, here's what we need to do:
  1. Create a pipeline to handle the stream (⚠️ Stack Overflow has a nice entry on why you should avoid pipe and write)
  2. Load the model onto the CPU in offline mode (not mandatory, but it does speed things up)
  3. Create a completion stream to handle user requests
I'm sure the Node.js purists will have a field day with this one 😅 Here goes nothing:
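Here's a minimal sketch of what the handler could look like. The model location, environment variables and request shape are assumptions, so adapt them to your own image:

```javascript
// index.mjs: a minimal sketch of the streaming handler
import { pipeline } from "node:stream/promises";
import { loadModel, createCompletionStream } from "gpt4all";

// 2. Load the model once per execution environment, on CPU and in offline mode,
//    so that warm invocations can skip the (slow) initialization step
const model = await loadModel(
    process.env.MODEL_NAME ?? "gpt4all-falcon-newbpe-q4_0.gguf",
    {
        modelPath: process.env.MODEL_PATH ?? "/opt/models",               // assumed location inside the image
        modelConfigFile: process.env.MODEL_CONFIG ?? "/opt/models/models3.json",
        allowDownload: false,                                             // offline mode: never hit the GPT4All website
        device: "cpu",
    }
);

// The awslambda global is provided by the Node.js managed runtime
export const handler = awslambda.streamifyResponse(
    async (event, responseStream, _context) => {
        const raw = event.isBase64Encoded
            ? Buffer.from(event.body ?? "", "base64").toString()
            : event.body;
        const { prompt = "Hello!" } = JSON.parse(raw || "{}");

        // 3. Create a completion stream for the user's prompt...
        const completion = createCompletionStream(model, prompt);

        // 1. ...and pipe the generated tokens straight into the response stream
        //    (pipeline handles backpressure and closes the streams for us)
        await pipeline(completion.tokens, responseStream);
        await completion.result;
    }
);
```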
Don't forget to add the gpt4all package as a dependency:
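If you're starting the project from scratch, this will pull it in and record it in package.json:

```bash
npm install gpt4all
```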
Finally, let's package the whole thing inside a Docker image 🐋
I'm calling it SLaMbda, as in SLM on Lambda, because I suck with names. Feel free to choose another one if you wish.
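Here's a rough sketch of the Dockerfile; the base image tag and the model location are assumptions, so adjust them to taste:

```dockerfile
# Start from the AWS Lambda base image for Node.js
FROM public.ecr.aws/lambda/nodejs:20

# Install the handler and its dependencies into the task root
WORKDIR ${LAMBDA_TASK_ROOT}
COPY package*.json index.mjs ./
RUN npm install --omit=dev

# Bake the model weights and config into the image (offline mode)
COPY models/ /opt/models/

CMD ["index.handler"]
```

Build it with docker build -t slambda . (I'm keeping the tag lowercase so it can double as the ECR repository name later).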
Time to get this up and running!

Deploy 🚀

I'll be using the AWS CLI to perform all the deployment steps, but there are certainly better tools for the job, like AWS SAM or the Serverless Framework.
The commands below assume that you have defined some environment variables for the account ID (AWS_ACCOUNT_ID) and the region where you want to deploy the function (AWS_DEFAULT_REGION).
If you don't have these set up, just run the following commands:
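```bash
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_DEFAULT_REGION=us-east-1  # or whichever region you prefer
```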
First, we need to push the Docker image to an ECR repository.
Let's start by logging into the ECR registry:
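```bash
aws ecr get-login-password --region $AWS_DEFAULT_REGION | \
    docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
```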
Then, create a new repository for the image (I'll just reuse the project name, lowercased):
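```bash
aws ecr create-repository --repository-name slambda  # the repository name is up to you
```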
Once the repository is ready, just tag the image and push it to ECR:
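```bash
# Assumes the image was built locally as slambda:latest (see the build step above)
docker tag slambda:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/slambda:latest
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/slambda:latest
```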
Next, we need to create the Lambda function that is going to run my spaghetti script 🍝. This function will need an execution role, so let's get that out of the way first:
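A basic role that Lambda can assume, plus the managed logging policy, is enough (the role name below is just a placeholder):

```bash
aws iam create-role \
    --role-name slambda-role \
    --assume-role-policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }]
    }'

aws iam attach-role-policy \
    --role-name slambda-role \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
```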
Now it's time to create the Lambda function λ
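Something along these lines should work; the memory size and timeout below are just starting points (the model needs a few GB of RAM and the first load takes a while):

```bash
aws lambda create-function \
    --function-name slambda \
    --package-type Image \
    --code ImageUri=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/slambda:latest \
    --role arn:aws:iam::$AWS_ACCOUNT_ID:role/slambda-role \
    --memory-size 5120 \
    --timeout 300
```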
Here are a few important things to notice:
  1. The package type is Image, since we're shipping the function as a container image
  2. The memory size needs to be large enough to hold the model in RAM
  3. The timeout has to be generous enough to cover the initial model load
Finally, let's create a URL endpoint for our function:
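```bash
aws lambda create-function-url-config \
    --function-name slambda \
    --auth-type AWS_IAM \
    --invoke-mode RESPONSE_STREAM
```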
🔐 Notice that authentication to our function URL is handled by IAM, which means that all requests must be signed using AWS Signature Version 4 (SigV4); cf. Invoking Lambda function URLs for additional details.
We're also enabling the RESPONSE_STREAM invoke mode, as promised, to reduce the time to first token (TTFT). Be aware that if you're testing your function through the Lambda console, you'll always see responses as BUFFERED.
Save the function URL somewhere or just run this little snippet:
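```bash
# Stash the URL in an env var for the next step ("slambda" is whatever you named your function)
export FUNCTION_URL=$(aws lambda get-function-url-config \
    --function-name slambda \
    --query FunctionUrl --output text)
```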

Test 📋

The easiest way to see this in action is to use a simple curl command:
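Here's one way to do it with SigV4 signing (this needs curl 7.75+; the request body just has to match whatever shape your handler expects, mine reads a prompt field):

```bash
curl -N "$FUNCTION_URL" \
    --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
    --aws-sigv4 "aws:amz:$AWS_DEFAULT_REGION:lambda" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me a story about a lonely Lambda function"}'
```

If you're using temporary credentials, pass the session token as well with -H "x-amz-security-token: $AWS_SESSION_TOKEN".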
The --no-buffer/-N option disables the buffering of the output stream so we can see the chunks arriving in real time.
The first run will be slow 🐢 because it has to load the model, but the second time will be a lot faster 🐇 Just make sure you keep the Lambda function warm!
Thanks for reading this far and see you next time! 👋

References 📚

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
