Running Small Language Models on AWS Lambda 🤏
Learn how to run small language models (SLMs) at scale on AWS Lambda with function URLs and response streaming.
João Galego
Amazon Employee
Published Apr 25, 2024
Last Modified Jun 28, 2024
In this post, I'm going to show you a neat way to deploy small language models (SLMs), or quantized versions of larger ones, on AWS Lambda using function URLs and response streaming.
🚩 As of this writing, Lambda supports response streaming only on Node.js managed runtimes (14.x and later).
👨💻 All code and documentation for this post is available on GitHub.
27-06-2024: SLaMbda was featured in Let's Build a Startup! (S2E4)! 🎥
Since I'm not a Node.js expert, I'll be using GPT4All, which has a simple Node.js package for LLM bindings with some nice streaming capabilities. The code should be easy to adapt to other frameworks anyway.
Inspired by makit's article 'Running an LLM inside an AWS Lambda function', I'm also going to deploy `gpt4all-falcon-newbpe-q4_0.gguf` (trained by TII, fine-tuned by Nomic AI) just to have a baseline for comparison.

We'll also need to download the model config file to run in `offline` mode, i.e. without the need to pull anything from the GPT4All website. If you're interested, here's our model configuration:
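I won't copy the exact file here, but it's essentially a single entry from GPT4All's model catalog (`models3.json`), along these lines, with placeholders where you'd drop in the real checksum and file size:

```json
[
  {
    "order": "a",
    "name": "GPT4All Falcon",
    "filename": "gpt4all-falcon-newbpe-q4_0.gguf",
    "filesize": "<size-of-the-gguf-in-bytes>",
    "md5sum": "<md5-of-the-downloaded-gguf>",
    "ramrequired": "8",
    "parameters": "7 billion",
    "quant": "q4_0",
    "type": "Falcon",
    "systemPrompt": "",
    "promptTemplate": "### Instruction:\n%1\n### Response:\n"
  }
]
```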
Next, we need to create the handler function. So time to RTFM!
There's a nice section in the AWS Lambda developer guide on how to configure Lambda functions for response streaming, with a tutorial and all.
In a nutshell, here's what we need to do:
- Create a `pipeline` to handle the stream (⚠️ StackOverflow has a nice entry on why you should avoid `pipe` and `write`)
- Load the model to CPU using the `offline` mode (this is not mandatory, but it does speed things up)
- Create a completion stream to handle user requests
I'm sure the Node.js purists will have a field day with this one 😅 Here goes nothing:
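What follows is a trimmed-down sketch rather than the exact script from the repo: it assumes the `gpt4all` bindings' `loadModel`/`createCompletionStream` API, the `awslambda.streamifyResponse` helper injected by the Lambda Node.js runtime, and model files under `/var/task/models` (matching the Dockerfile sketch further down).

```javascript
// index.mjs: a minimal sketch of a streaming handler (not the exact repo code)
import { pipeline } from "node:stream/promises";
import { loadModel, createCompletionStream } from "gpt4all";

// Assumed locations: adjust to wherever the .gguf and its config file live inside the image
const MODEL_NAME = "gpt4all-falcon-newbpe-q4_0.gguf";
const MODEL_PATH = process.env.MODEL_PATH ?? "/var/task/models";

// Load the model once, outside the handler, so warm invocations can reuse it
const model = await loadModel(MODEL_NAME, {
  modelPath: MODEL_PATH,
  modelConfigFile: `${MODEL_PATH}/models3.json`,
  allowDownload: false, // offline mode: never pull anything from the GPT4All website
  device: "cpu",
});

// awslambda.streamifyResponse is provided globally by the Lambda Node.js runtime
export const handler = awslambda.streamifyResponse(
  async (event, responseStream, _context) => {
    // Function URLs may base64-encode the request body
    const body = event.isBase64Encoded
      ? Buffer.from(event.body, "base64").toString()
      : event.body;
    const { prompt } = JSON.parse(body ?? "{}");

    // createCompletionStream exposes a readable token stream plus a result promise
    const completion = createCompletionStream(model, prompt);

    // pipeline() handles backpressure and closes the response stream when the tokens run out
    await pipeline(completion.tokens, responseStream);
    await completion.result;
  }
);
```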
Don't forget to add the `gpt4all` package as a dependency (`npm install gpt4all`).

Finally, let's package the whole thing inside a Docker image 🐋
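The Dockerfile in the repo is the source of truth; a rough sketch, assuming the AWS-provided Node.js 20 base image and that the `.gguf` and config file sit next to the Dockerfile, looks like this:

```dockerfile
# A sketch, not the repo's exact Dockerfile
FROM public.ecr.aws/lambda/nodejs:20

# Install dependencies (including gpt4all)
COPY package*.json ${LAMBDA_TASK_ROOT}/
RUN npm install

# Copy the handler and the model files
# (LAMBDA_TASK_ROOT resolves to /var/task, matching the handler's MODEL_PATH)
COPY index.mjs ${LAMBDA_TASK_ROOT}/
COPY gpt4all-falcon-newbpe-q4_0.gguf models3.json ${LAMBDA_TASK_ROOT}/models/

CMD ["index.handler"]
```

Build it locally with `docker build -t slambda .` (`slambda` is just the image name I'll reuse in the ECR steps below).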
I'm calling it SLaMbda, as in SLM on Lambda, because I suck with names. Feel free to choose another one if you wish.
Time to get this up and running!
I'll be using the AWS CLI to perform all the deployment steps, but there are certainly better tools for the job, like AWS SAM or the Serverless Framework.
The commands below assume that you have defined some environment variables for the account ID (`AWS_ACCOUNT_ID`) and the region where you want to deploy the function (`AWS_DEFAULT_REGION`). If you don't have these set up, just run the following commands:
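Something along these lines works (the region value is just a placeholder, pick whichever one you're deploying to):

```bash
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_DEFAULT_REGION=us-east-1
```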
First, we need to push the Docker image to an ECR repository.
Let's log into the ECR registry
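The standard `get-login-password` flow should do the trick:

```bash
aws ecr get-login-password --region $AWS_DEFAULT_REGION | \
  docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
```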
and create a new repository
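I'm using `slambda` as the repository (and image) name throughout these sketches; swap in your own:

```bash
aws ecr create-repository --repository-name slambda
```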
Once the repository is ready, just tag the image and push it to ECR
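Assuming the image was built locally as `slambda:latest`, something like this should work:

```bash
export ECR_URI=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/slambda

docker tag slambda:latest $ECR_URI:latest
docker push $ECR_URI:latest
```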
Next, we need to create the Lambda function that is going to run my spaghetti script 🍝. This function will need an execution role, so let's get that out of the way first
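Here's one way to do it with the CLI; the role name (`slambda-role`) and the trust policy file name are assumptions:

```bash
# Trust policy that lets Lambda assume the role
cat > trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "lambda.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role \
  --role-name slambda-role \
  --assume-role-policy-document file://trust-policy.json

# Basic execution permissions (CloudWatch Logs)
aws iam attach-role-policy \
  --role-name slambda-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
```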
Now it's time to create the Lambda function λ
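Roughly like this, assuming the image and role from the previous steps (give IAM a few seconds to propagate the new role before running it):

```bash
aws lambda create-function \
  --function-name slambda \
  --package-type Image \
  --code ImageUri=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/slambda:latest \
  --role arn:aws:iam::$AWS_ACCOUNT_ID:role/slambda-role \
  --timeout 300 \
  --memory-size 10240
```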
Here are a few important things to notice:
- 📦 Package type is set to `Image`, which we're grabbing from ECR, but you can also implement this with `Zip` and add a Lambda layer for the model.
- ⏱️ Timeout is set to `5mins` so it has enough time to load the model, handle the user's request, and generate the tokens.
- 🧠 Memory size is set to the maximum allowed value (`10GB`) to improve performance; cf. AWS Lambda now supports up to 10 GB of memory and 6 vCPU cores for Lambda Functions for more information.
Finally, let's create a URL endpoint for our function
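With the CLI, that's a single call; the `--auth-type` and `--invoke-mode` values are exactly the ones discussed below:

```bash
aws lambda create-function-url-config \
  --function-name slambda \
  --auth-type AWS_IAM \
  --invoke-mode RESPONSE_STREAM
```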
🔐 Notice that the authentication to our function URL is handled by IAM, which means that all requests must be signed using AWS Signature Version 4 (SigV4); cf. Invoking Lambda function URLs for additional details.
We're also enabling `RESPONSE_STREAM`ing, as promised, to reduce the time to first token (TTFT). Be aware that if you're testing your function through the Lambda console, you'll always see responses as `BUFFERED`.

Save the function URL somewhere or just run this little snippet
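This is my stand-in for the repo's snippet; it just stashes the URL in an environment variable:

```bash
export FUNCTION_URL=$(aws lambda get-function-url-config \
  --function-name slambda \
  --query FunctionUrl \
  --output text)
```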
The easiest way to see this in action is to use a simple curl command:
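Here's a rough version, assuming curl 7.75 or later (for `--aws-sigv4`) and long-lived credentials in the usual environment variables; the request body shape (`prompt`) matches the handler sketch above:

```bash
curl -N "$FUNCTION_URL" \
  --aws-sigv4 "aws:amz:${AWS_DEFAULT_REGION}:lambda" \
  --user "${AWS_ACCESS_KEY_ID}:${AWS_SECRET_ACCESS_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me a joke about serverless computing"}'
```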
The `--no-buffer`/`-N` option disables buffering of the output stream so we can see the chunks arriving in real time.

The first run will be slow 🐢 because it has to load the model, but the second time will be a lot faster 🐇 Just make sure you keep the Lambda function warm!
Thanks for reading this far and see you next time! 👋
- GPT4All - free-to-use, locally running, privacy-aware chatbot
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.