Avoid LLM Lock-in: A Dev's Guide to Model-Agnostic AI Apps


Accelerate Generative AI Model selection and enable easy model switching in your workloads

Nick
Amazon Employee
Published Dec 6, 2024

Intro

The GenAI landscape is evolving rapidly. New models are released almost daily, and each claims to be the best based on one industry benchmark or another. So far, when building GenAI applications, teams have had to tightly couple their application to a single model. Even within the Bedrock ecosystem, switching between model providers, say from Anthropic Claude to Meta Llama, requires a refactoring effort. I have even invested time building abstraction layers that transform requests between models so I can switch on the fly in my demos and POCs. However, rolling your own abstraction layers isn't a cure-all; it leads to its own headaches.
Enter LiteLLM. This open-source project aims to eliminate the vendor lock-in associated with choosing a model. It handles the behind-the-scenes work of translating parameters ('TopP' vs. 'TOP_P' vs. no top-p parameter at all) as well as structuring the message format ("User" vs. "Human" vs. "Assistant"). It comes in two flavors: a stand-alone proxy that runs on a dedicated server, and a library that you can call directly in your code. I will explore the benefits of both, as well as why leveraging a proxy layer is crucial for any GenAI application.

Prerequisites

1/ Make sure to have Bedrock model access in your account.
2/ An EC2 instance to use as a dev environment and to test the proxy server.
3/ An EC2 IAM role that has access to Bedrock.
If you don't have an EC2 instance available, spin one up now (I prefer Ubuntu for dev work). Ensure the NACL/security group allows an SSH connection so you can connect with VS Code Remote.

Architecture

This post will walk through using LiteLLM as a stand-alone proxy. I will focus on the proxy version because it can be used from code as well as dropped in behind existing GenAI applications that support custom endpoints, like Cursor IDE, Obsidian GPT, and GitHub Copilot.
Architecture diagram: how the LiteLLM proxy is set up.

Benefits of this architecture and LiteLLM Proxy

Caching Mechanism
The centralized proxy allows for efficient prompt and response caching, reducing costs and improving response times by bypassing unnecessary LLM calls.
Reliability Features
The system ensures robust operation through a model fallback configuration that seamlessly handles failure scenarios. Automated retry mechanisms protect against transient issues, while load balancing across multiple endpoints optimizes performance. Configurable redundancy options further enhance system stability.
Operational Control
Centralized management enables cost tracking and monitoring across the entire system. Token usage tracking provides detailed visibility into consumption patterns. Request logging captures all interactions for analysis and debugging.
While the proxy introduces minimal latency, these benefits far outweigh the performance impact. Teams can leverage this architecture to build robust, cost-effective GenAI applications that scale efficiently.
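As a rough illustration, most of these features are toggled in the proxy's config file. The snippet below is a sketch based on options documented by LiteLLM; the Redis host and model labels are placeholders, and the key names should be verified against the current docs.

```yaml
litellm_settings:
  cache: True                                        # cache prompt/response pairs
  cache_params:
    type: redis                                      # a shared cache lets multiple proxy instances reuse responses
    host: my-redis.internal.example                  # hypothetical Redis endpoint
    port: 6379
  fallbacks: [{"primary-model": ["backup-model"]}]   # if the primary deployment fails, retry against the backup
```

Load balancing comes from the model list itself: defining several deployments under the same model_name lets the proxy's router spread traffic across them.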
 

Getting started with development

Jump into VS Code and make sure you are SSH'ed into your EC2 instance.
Let's run a couple of commands to get going:
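A minimal setup on a fresh Ubuntu instance might look like this (assuming Python 3 and pip are already installed; the [proxy] extra pulls in the proxy server dependencies):

```bash
# create and activate an isolated environment for the proxy
python3 -m venv litellm-env
source litellm-env/bin/activate

# install LiteLLM with the proxy server extras
pip install 'litellm[proxy]'
```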

Configuration

Before running the proxy, create a config file to define how the service will run.
First, the config; then the explanation.
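Here is a config.yaml sketch that matches the explanation below. The Bedrock model ID shown is an example of the cross-region inference profile format; confirm the exact ID in your Bedrock console.

```yaml
model_list:
  - model_name: gpt-4o                    # label only; handy when overriding OpenAI-only apps
    litellm_params:
      model: bedrock/us.anthropic.claude-3-5-sonnet-20240620-v1:0   # example cross-region inference profile ID
      aws_region_name: us-east-1
  - model_name: claude-3-5-sonnet         # a second label pointing at the same model
    litellm_params:
      model: bedrock/us.anthropic.claude-3-5-sonnet-20240620-v1:0
      aws_region_name: us-east-1

litellm_settings:
  set_verbose: True
  drop_params: True
  modify_params: True
```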
The config is split into 2 sections: Model List and LiteLLM Settings.
Model List
  • model_name — the label you use to refer to the model. In most instances, it can be whatever you want. However, if you are trying to override a desktop app that only supports custom OpenAI endpoints, you would have to use an OpenAI model name, like ‘gpt-4o’. This doesn’t affect which model is actually called; it is just a label.
  • model — the actual name of the model you want to use. In this example, I want to use the Claude 3.5 Sonnet model (actually its cross-region inference profile) in Amazon Bedrock.
  • aws_region_name — I usually work in ‘us-east-1’ for my POCs and minor projects. Use whatever region you’re set up for.
I have defined two models here, with two names, but they both point to Amazon Bedrock Claude 3.5 Sonnet.
LiteLLM
  • set_verbose — I wanted to see every request, response, and error out of Bedrock, because I was debugging some code when using this proxy.
  • drop_params — When switching between models, they don’t always support the same parameters, like ‘presence penalty’ or ‘topK’. LiteLLM will drop any unsupported parameters when translating the calls.
  • modify_params — Llama and Claude are very picky about where the ‘User’ role appears in the prompt. LiteLLM will clean up your messages and ensure they have been assigned the correct roles.
These are by no means exhaustive; I urge you to peruse the documentation to see what other knobs there are for your specific use cases.

Run the proxy

Back in the terminal, it’s time to run the proxy. By default, it binds to 0.0.0.0 on port 4000, so it is reachable externally as soon as it starts.
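Assuming the config above was saved as config.yaml, starting the proxy is a single command:

```bash
# start the proxy; --port is optional since 4000 is the default
litellm --config config.yaml --port 4000
```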
Let’s make sure it’s working.
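A quick sanity check from the instance is to hit the OpenAI-compatible chat completions endpoint with one of the model names defined in the config:

```bash
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "Say hello from Bedrock"}]
      }'
```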
 

Implementation Guides

Let's look at a couple of ways to leverage our new proxy.

Langchain

I can use the LangChain SDK and treat our proxy as a custom OpenAI endpoint. Just include the IP of your proxy and the model you defined in your config file.
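A minimal sketch, assuming langchain-openai is installed and the proxy is reachable at PROXY_IP (the API key is a placeholder, since the proxy authenticates to Bedrock with the EC2 role):

```python
from langchain_openai import ChatOpenAI

# Point LangChain's standard OpenAI chat wrapper at the LiteLLM proxy
llm = ChatOpenAI(
    base_url="http://PROXY_IP:4000/v1",  # hypothetical proxy address
    api_key="sk-placeholder",            # not validated unless the proxy enforces keys
    model="gpt-4o",                      # a model_name label from config.yaml
)

print(llm.invoke("Give me one reason to avoid LLM lock-in.").content)
```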

OpenAI

Likewise, I can use the OpenAI SDK. This opens up the ability to use any existing projects targeted for OpenAI without making any code changes. Just stand up the proxy and use it to front any other models you're interested in testing.
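A sketch with the official OpenAI Python SDK, again with a placeholder key and a hypothetical proxy address:

```python
from openai import OpenAI

# The proxy speaks the OpenAI API, so the official SDK works unchanged
client = OpenAI(
    base_url="http://PROXY_IP:4000/v1",  # hypothetical proxy address
    api_key="sk-placeholder",
)

resp = client.chat.completions.create(
    model="gpt-4o",  # model_name label from config.yaml
    messages=[{"role": "user", "content": "Summarize why a proxy layer helps."}],
)
print(resp.choices[0].message.content)
```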

LiteLLM SDK

LiteLLM also offers an SDK that you can point to the proxy. This allows you to write agnostic code without reliance on OpenAI or Bedrock APIs.
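One way to do that, as a sketch: the openai/ prefix tells the SDK to treat the target as an OpenAI-compatible endpoint, which is what the proxy exposes.

```python
import litellm

# Call the proxy through LiteLLM's completion() helper
response = litellm.completion(
    model="openai/gpt-4o",            # openai/ prefix -> OpenAI-compatible endpoint (our proxy)
    api_base="http://PROXY_IP:4000",  # hypothetical proxy address
    api_key="sk-placeholder",
    messages=[{"role": "user", "content": "Hello through the LiteLLM proxy"}],
)
print(response.choices[0].message.content)
```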

Future expansion

Deployment

Moving out of this simple dev environment into production will require additional thought and planning.

Performance

While LiteLLM doesn't add much overhead to individual calls, it is crucial to understand what latency to expect. You will likely also want to load test the EC2 instances behind the load balancer to ensure you have a good plan for when and how you need to scale.

Cost

LiteLLM itself is free to use in your environment. However, there is an extra cost associated with hosting the proxy itself. The most cost-efficient option is to rely solely on the SDK. That said, there are reasons to use the proxy instead of the SDK; if you are working with something like LangChain or legacy code that is hard-coded to use OpenAI, a proxy may be non-negotiable.
Model the required EC2 instances and load balancer so you aren't surprised when you finally get to production.

Security

Your proxy (or the SDK) will require access to Bedrock. Make sure to follow best practices around least privilege when assigning permissions.
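As a sketch of that least-privilege policy, the instance role generally only needs the Bedrock invoke actions; scope the Resource down to specific model or inference-profile ARNs where you can:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "*"
    }
  ]
}
```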
Also, remember that this proxy shouldn't be exposed to the world. Since it is responsible for talking to Bedrock (or other models), this could run up a considerable bill, pollute your cache, and even degrade the performance of your application. Lock this down appropriately.

Error handling and logging

LiteLLM can be configured to auto-retry requests to Bedrock. I recommend using this, but keep the retry count low, and consider exponential backoff if 2-3 attempts still fail. You can also take advantage of Bedrock cross-region inference endpoints. These are becoming the default way to interact with models, so using them is generally best practice and should result in fewer errors such as 'insufficient capacity'.
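In the proxy config, that retry behavior is a couple of settings; a sketch (verify the key names against the LiteLLM docs):

```yaml
litellm_settings:
  num_retries: 2        # keep the retry count low
  request_timeout: 30   # seconds before a call is abandoned; tune for your workload
```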
Also, ensure your EC2 instances are logging to CloudWatch so you can catch any errors that might otherwise be masked by whatever retry logic you land on.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
