Building a Foundation Model as a Service (FMaaS) on AWS

In today's world, organizations are increasingly leveraging the power of generative AI models, also known as foundation models (FMs), to create innovative applications and services. While fully managed services like Amazon Bedrock offer a convenient solution for many use cases, some organizations may prefer to build their own model serving platform to meet specific requirements or to serve their own custom models. This series of articles explores best practices for building a Foundation Model as a Service (FMaaS) on AWS.

Why Build a FMaaS?

There are several reasons why an organization might choose to build a FMaaS:

Model Providers: Companies that develop and train their own foundation models may want to provide a service for serving those models to their customers or partners.
Heavy Fine-tuners: Organizations that fine-tune existing foundation models to create customized models for their clients might prefer to host and serve those models through their own platform.
Internal Solutions: Some organizations may want to build an internal FMaaS to serve their own custom models or fine-tuned models across different teams or projects within the organization.

Prerequisites

Before building an FMaaS, consider whether a fully managed service would be more appropriate. If your organization needs minimal operational overhead, Amazon Bedrock may be a better choice, because building and maintaining an FMaaS requires dedicated resources and expertise. If you do decide to build one, here are two prerequisites:

Model Selection: This article assumes that you have already created or chosen models to serve. If you need assistance with selecting the right foundation model for your use case, consider tools like the Foundation Model Evaluations Library (fmeval) or the HuggingFace Open LLM Leaderboard.
SaaS Architecture Knowledge: This article focuses specifically on the additional considerations related to serving foundation models. If you need a refresher on the fundamentals of Software as a Service (SaaS) architecture, I recommend reviewing the relevant whitepapers or resources such as SaaS Architecture Fundamentals.

In this first part, I will focus on performance efficiency for LLMs (other metrics apply to other modalities). In future articles, I will address other architectural considerations such as cost optimization, security, and multi-tenancy, among others. Please let me know if you are interested in a specific part so that I can prioritize it accordingly.

Performance Efficiency Considerations

When building a FMaaS, performance efficiency is a critical factor to consider. Several input variables can impact the performance of your foundation model serving platform:

Model Size: The size of the model determines the High Bandwidth Memory (HBM) capacity and bandwidth it needs, as well as the compute throughput, measured in TFLOPS (tera floating-point operations per second), required for optimal performance.
Model Architecture: Different model architectures have varying performance characteristics and requirements.
Quantization: Quantization techniques can be used to reduce the precision of model parameters, allowing larger models to fit into smaller instances. However, aggressive quantization may impact model prediction accuracy and the quality of the generated text.
Context Length: The length of the input context can influence performance, particularly for real-time applications where latency is a critical metric.
Concurrent Users and Request Volume: The number of concurrent users and the volume of requests per user will directly impact the required compute resources and scaling considerations.
Input-Output Token Ratio: The ratio between input and output tokens can vary depending on the task (e.g., 2:1 for question-answering, 8:1 for summarization, 1:1 for text generation). Understanding this ratio can help optimize resource allocation.
Real-time vs. Batch Processing: Real-time applications prioritize low latency, while batch processing workloads focus more on throughput and total processing time. Keep in mind that the average human reading speed is around 200 ms per token, so there is little benefit in pushing latency below that for real-time use cases such as chatbots or code generation assistants.
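To make the interplay between model size, quantization, context length, and concurrency more concrete, here is a minimal back-of-the-envelope sketch. It assumes a standard decoder-only transformer whose KV cache stores one key and one value vector per layer, per KV head, per token. The model shape (7B parameters, 32 layers, 32 KV heads, head dimension 128), the 24 GiB accelerator, and the 10% memory reserve are hypothetical values chosen for illustration, not figures from a specific model or instance type. Real inference servers add their own overhead, so treat the result as a first estimate to validate with benchmarks.

```python
from dataclasses import dataclass

# Bytes used to store one parameter (or one KV cache element) at each precision.
BYTES_PER_VALUE = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}
GIB = 2**30


@dataclass
class ModelShape:
    n_params: float   # total parameter count
    n_layers: int     # number of transformer layers
    n_kv_heads: int   # number of key/value heads
    head_dim: int     # dimension of each attention head


def weights_gib(shape: ModelShape, dtype: str) -> float:
    """Memory needed to hold the model weights at the given precision."""
    return shape.n_params * BYTES_PER_VALUE[dtype] / GIB


def kv_cache_gib(shape: ModelShape, context_len: int, batch_size: int,
                 kv_dtype: str = "fp16") -> float:
    """Memory needed for the KV cache of `batch_size` sequences of `context_len` tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    bytes_per_token = (2 * shape.n_layers * shape.n_kv_heads
                       * shape.head_dim * BYTES_PER_VALUE[kv_dtype])
    return bytes_per_token * context_len * batch_size / GIB


def max_batch_size(shape: ModelShape, hbm_gib: float, dtype: str,
                   context_len: int, reserve_fraction: float = 0.1) -> int:
    """Largest number of full-length sequences that fit next to the weights.

    `reserve_fraction` is an assumed safety margin for activations and
    runtime overhead; tune it from real measurements.
    """
    usable = hbm_gib * (1 - reserve_fraction) - weights_gib(shape, dtype)
    per_sequence = kv_cache_gib(shape, context_len, batch_size=1)
    return max(0, int(usable // per_sequence))


# Hypothetical 7B-parameter model shape, for illustration only.
shape = ModelShape(n_params=7e9, n_layers=32, n_kv_heads=32, head_dim=128)

for dtype in ("fp16", "int8", "int4"):
    print(f"{dtype}: weights ~ {weights_gib(shape, dtype):.2f} GiB, "
          f"max batch on 24 GiB at 4096 tokens = "
          f"{max_batch_size(shape, 24, dtype, 4096)}")
```

With these assumptions, quantizing the weights from fp16 to int4 frees enough memory to roughly double the number of 4096-token sequences that fit on the same accelerator, which is the trade-off against accuracy described above.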

Key Performance Metrics

Right-sizing instances based on your workload characteristics is essential. For instances with multiple accelerators, you can choose between hosting multiple copies of the model (for improved throughput) or increasing the tensor parallelism degree (for improved latency). Tools such as the Foundation Model benchmarking tool (FMBench) let you compare model performance across different instance types.
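The choice between model copies and tensor parallelism can be enumerated before benchmarking. The sketch below lists the ways to split an instance's accelerators, assuming that tensor parallelism degrees are powers of two that divide the accelerator count evenly (a common constraint, but check your inference server). The function name and the `min_tp_degree` parameter are hypothetical; the latter represents the smallest degree at which the model fits in memory.

```python
from typing import Dict, List


def placement_options(n_accelerators: int, min_tp_degree: int = 1) -> List[Dict[str, int]]:
    """List (tensor parallel degree, model copies) pairs for one instance.

    Low degrees leave room for more copies (throughput); high degrees
    spread each request over more accelerators (latency).
    """
    options = []
    tp = 1
    while tp <= n_accelerators:
        # Keep only degrees where the model fits and no accelerator is left idle.
        if tp >= min_tp_degree and n_accelerators % tp == 0:
            options.append({
                "tensor_parallel_degree": tp,
                "model_copies": n_accelerators // tp,
            })
        tp *= 2
    return options


# Example: an 8-accelerator instance and a model that needs at least 2 accelerators.
for option in placement_options(8, min_tp_degree=2):
    print(option)
```

Each option is only a candidate configuration. Which one gives the best latency or throughput for your traffic still has to be measured, for example with a benchmarking tool such as FMBench.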

To monitor and optimize the performance of your FMaaS, consider tracking the following key metrics:

Time to First Token (TTFT): This metric measures the time required for the prefill stage and is directly related to the number of input tokens. The prefill stage is usually compute-bound.
Intertoken Latency (ITL): This metric measures the time between consecutive output tokens, which is critical for real-time applications that require token streaming.
Throughput: The number of output tokens generated per second is a key measure of overall performance. It also has a direct impact on cost, as we will see in future articles.
Batch Size: The number of concurrent inferences on the same model copy. Guidance: Maximize the batch size while avoiding out-of-memory errors. Consider techniques such as dynamic batching or continuous batching to achieve better hardware utilization.
Cold Start Time: If you plan to serve multiple models on the same instance, you'll need to account for the time required to download the container, download the model, load the model into GPU memory, and start the inference server. Guidance: Capabilities such as SageMaker Inference Components can help you reduce the cold start.
Accelerator Occupancy: This metric measures how effectively an accelerator's parallel processing capabilities are being used. Guidance: 1/ Inference workloads are often memory bandwidth-bound, so choosing an inference server that minimizes memory transfers is key. SageMaker LMI (Large Model Inference) offers the flexibility to experiment with various inference libraries like vLLM, TensorRT-LLM, DeepSpeed, and Transformers NeuronX, which support modern optimizations like FlashAttention and PagedAttention. 2/ In some use cases, for example to support new AI components and architectures, you might need to optimize further by writing your own kernels using CUDA, Triton, or the Neuron Kernel Interface (NKI). Kernel fusion is another technique: it combines several operations into a single kernel to reduce memory round trips and maximize occupancy. 3/ Profilers such as NVIDIA Nsight or Neuron Profiler help you find opportunities for optimization.
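The first three metrics above can be derived from the timestamps of a streamed response. Here is a minimal sketch, assuming your client or gateway records when each request was sent and when each output token arrived. The `RequestTrace` structure and function names are hypothetical, not part of any AWS SDK or inference server.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    sent_at: float            # time the request was sent, in seconds
    token_times: List[float]  # arrival time of each output token, in seconds


def ttft(trace: RequestTrace) -> float:
    """Time to First Token: delay between sending the request and the first token."""
    return trace.token_times[0] - trace.sent_at


def mean_itl(trace: RequestTrace) -> float:
    """Mean Intertoken Latency: average gap between consecutive output tokens."""
    gaps = [later - earlier
            for earlier, later in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def output_throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second across all requests in the observation window."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


# Two concurrent requests served by the same model copy (made-up timestamps).
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25]),
    RequestTrace(sent_at=0.0, token_times=[0.5, 1.0, 1.5, 2.0]),
]

for trace in traces:
    print(f"TTFT = {ttft(trace):.2f} s, mean ITL = {mean_itl(trace):.2f} s")
print(f"Throughput = {output_throughput(traces):.1f} tokens/s")
```

Note that throughput is computed across all concurrent requests, while TTFT and ITL are per request. This is why larger batch sizes can raise throughput even when individual requests see higher latency.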

Optimizing the performance efficiency under constraints is an iterative process. Initially, you may have limited visibility into your customers' usage patterns, making it challenging to fine-tune the system. However, gathering relevant metrics over time provides valuable insights, enabling you to identify optimization opportunities and dynamically adapt your system's behavior accordingly.
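As one way to turn collected metrics into something actionable, the sketch below summarizes latency samples with nearest-rank percentiles and compares the tail against a target. The sample values, the `slo_ms` threshold, and the function names are made up for illustration. In practice these numbers would come from your monitoring system.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: List[float], slo_ms: float) -> Dict[str, float]:
    """Report median and tail latency, and whether the tail meets the target."""
    p50 = percentile(samples, 50)
    p90 = percentile(samples, 90)
    return {"p50": p50, "p90": p90, "meets_slo": p90 <= slo_ms}


# Hypothetical TTFT samples in milliseconds, collected over one time window.
ttft_ms = [120, 95, 310, 150, 101, 99, 480, 130, 110, 105]

print(summarize(ttft_ms, slo_ms=300))
```

Tracking tail percentiles rather than averages makes it easier to spot the long prompts or traffic bursts that degrade the experience for a minority of requests, which is often where the next optimization opportunity lies.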

Conclusion

Building a Foundation Model as a Service (FMaaS) can be a complex undertaking, but it can provide organizations with greater control and flexibility over their generative AI model serving infrastructure. By carefully considering performance efficiency factors, monitoring key metrics, and leveraging the right tools and services, organizations can successfully build and optimize their FMaaS to meet their unique requirements.

Stay tuned for the next article in this series, where we will dive deeper into the specific architectural considerations and best practices for building a FMaaS on AWS.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
