Create Production-Ready Generative AI Applications

Authors: Andrew Walko (Enterprise Solutions Architect, AWS), Kait Healy (Startups Solutions Architect, AWS), Raj Pathak (Principal Enterprise Solutions Architect, AWS)

As businesses explore the capabilities of generative artificial intelligence (AI) solutions and business applications, they seek implementation best practices starting to take a concept from ideation to a production-ready application. Like all production level applications, there is no single answer on how to build the perfect generative AI application, but instead a set of best practices which can be applied to your specific situation in order to design a high-quality, production-grade application. In this blog, we will discuss the 5 phases required to build a production-ready generative AI application. In each phase, we will call out specific actions to take and considerations when designing, building, and implementing your generative AI application.

This is a 400-level blog and assumes you have an introductory understanding of generative AI concepts.

Phases of a Successful Generative AI Application Release

Regulatory and Compliance Requirements include 5 steps details below

Figure 1

There are 5 main phases to work through as you go from idea to production, which can be broken down into:

Identifying end users and targeting experience
Identifying required model capabilities, common patterns amongst end users, and testing prompt templates
Defining infrastructure and application requirements based on model selection and user expectations
Testing the application with real users to gain feedback and iterate
Deploying into a production environment with proper monitoring controls in place

Throughout all stages it’s important to take into account regulatory and compliance requirements that apply to your company and industry. This ranges from maintaining security of PII during end user requirements gathering to ensuring all technical infrastructure and services are eligible for your required compliance. Generative AI services and instances at AWS, such as AWS Inferentia and Trainium and Amazon Bedrock are eligible for common compliances such as HIPPA and PCI.

I. Target User Experience

The Target User Phase (TUE) phase lays the foundation for creating an AI application that effectively meets the needs of its target users. In this phase, you can work backwards from the end user to create an optimal user experience, and ultimately drive adoption. First, conduct user research to validate assumptions, prioritize features, and ultimately design a product with usability and business goals in mind. You can conduct user research using methods such as stakeholder interviews, focus groups, and card sorting. After gathering data and gaining a better understanding of user needs and behaviors, develop detailed user personas representing the target audience. Identify key demographics, goals, pain points, and the technical proficiency of target users. By the end of this step, the overarching goal is to have a comprehensive understanding of target users’ needs, goals and challenges, which can serve as a north star in subsequent phases.

There are internal and external personas to consider for generative AI applicaitons

Figure 2

Next, start to formulate a testing dataset for evaluating performance. Gather real-world sample prompts aligned with user needs. Collect diverse prompts covering common use cases and end scenarios, including edge cases. Next, create the ideal response pairs (golden answer) for each sample prompt, and define expected outputs and desired tone and style of AI responses. Creating a set of representative sample prompts and desired responses gives you a baseline at which to qualitatively and qualitatively measure success. You can use this evaluation dataset to evaluate different models, prompt templates, and to track performance during fine-tuning.

Carefully consider and identify which tasks are relevant to accomplishing business objectives. Although foundation models (FMs) are designed to handle a considerable range of tasks, it is important to clearly define a scope that outlines capabilities and limitations of the AI application as it pertains to your use case. Additionally, identify specific tasks the AI can assist with and areas it cannot. This can help better facilitate evaluation and improvement, mitigate risks, and promote consistency in generated content, ensuring that it remains relevant, coherent, and aligned with the application’s goals.

II. Model Capabilities, Patterns and Testing

Once you have a thorough understanding of your target end user and what they’ll use the application for, you need to experiment with different models, model architectures, and prompt templates in order to identify the combination with the best performance. The first step is determining which general architecture of models, such as language models, vision models, dialogue systems, or task-specific models, will deliver the TUE. Considerations when selecting a model architecture should include capability, latency, computational cost, context window and scalability. You need to decide what information and tasks will need to be used and executed to achieve the desired outcome. If specific company or private data is needed, this is where you decide whether to train your own model, fine-tune or continuously pre-train an existing model, or use retrieval augmented generation (RAG). Breaking down when to use each of these methods, or a combination of the methods, is a multi-faceted discussion; however, a general rule of thumb to follow is if you need to change model behavior train a new model or fine-tune an existing model. If you have a large corpus of data to add to a model, continuously pre-train an existing model. If you have continuously changing data or will need to consistently add new data, use RAG. There are additional considerations to be made when using RAG such as which embedding model to use and how that impacts search results from the vector database. When looking at the desired tasks to be completed by the model, determine if API calls need to be made, if agents need to be built, whether multiple models should be chained together, and what systems need to be integrated into the solution. A lot of the heavy lifting required to integrate these additional components can be abstracted away using features of Amazon Bedrock, such as Knowledge Bases for Amazon Bedrock for managed RAG and Agents for Amazon Bedrock.

Prompt and Model Evaluation

Next, you develop and test prompt templates with questions and corresponding golden answers against a range of models to determine the highest quality combination. This evaluation is arguably the most important part of creating an effective generative AI solution and consists of iterating over prompt templates and models. First, you develop a library of reusable, enterprise-level prompt templates aligned with the defined business requirements and optimized for model performance. Select multiple models within the selected model architecture family of various sizes and engage subject matter experts (SMEs) to test the different models with the defined prompt templates and the question:answer dataset created in phase 1. The number of test invocations should be equal to the number of questions * the number of prompt templates * the number of models. A few common ways to grade model evaluation is using code-based grading, human grading, or model-based grading. The collected metrics will change based on evaluation method, but common metrics include quantitative performance metrics, such as latency, input/output tokens, accuracy, robustness, and toxicity, and qualitative feedback, such as quality of response.

Compare the different responses to see which prompt templates with which models produce the best results. Iteratively change the prompt templates to see how it impacts model outputs until you have a set of prompt templates that return high quality responses when used with a specific model(s). In addition to being used to determine which model(s) and prompt template(s) will be used in the solution, the collected data provides a benchmark for logging and performance improvement/optimization in the future.

While iterating through prompt templates, it’s important to keep track of prompt versions. Keeping a version log of the evaluation metrics of each prompt template, question, and additional context combination helps give a wholistic view into what works and what does not. This is increasingly helpful when the application is deployed and versions can be compared with end-user inputs.

To help with model evaluation, Amazon Bedrock Model Evaluation allows you to provide a dataset of prompts in order to test models for accuracy, robustness, and toxicity. Bedrock can be integrated with Amazon CloudWatch Logs to log information about each request, including latency and input/output tokens. These logs can then be averaged out for each model using CloudWatch Logs Insights.

After completing this phase, you will have three main results. First, a comprehensive set of reusable prompt templates created to cover common enterprise use cases and ensure consist model behavior. Models which have been thoroughly tested by SMEs to validate performance, identify areas for improvement, and incorporate feedback into model selection, training, and prompt engineering. Lastly, quantitative benchmarks established to measure model performance across key metrics.

Step 1: understand top proprietary and open source models, Step 2: Test and Evaluate, Step 3: FM

Figure 3

III. Infrastructure and Application Requirements, and Testing

In this phase, consider the broader application architecture, application requirements, and best practices required to support an enterprise grade application. First, define application-specific outputs required to meet TUE. This includes designing a thoughtful user interface (UI) and user experience (UX), and selecting associated infrastructure. Consider other architectural components, such as APIs for integration and external databases, and aspects required to implement desired functionality and features. For example, you can provide a personalized chat experience through adding authentication capabilities and adding the user’s name and relevant information to the prompt template. If saving prior conversations is a valuable user feature, consider leveraging a database such as Amazon DynamoDB to persist chat history for its flexibility, scalability, performance and serverless nature. If conversational interaction is a requirement, provide relevant context from prior chat history. This enables the model to maintain a consistent dialogue, and follow-up questions.

One approach to achieve this conversational functionality is through designing a prompt template that includes chat history. As you add more data to the prompt template, consider how it affects the other information. Staying with the conversational history example, you would pass in the last n messages keeping in mind that too few prior messages would result in incomplete context, but too many would add unnecessary information which would take away the model’s attention from relevant context. There are many tools available to integrate such functionality, such as LangChain’s ChatMessageHistory and DynamoDBChatMessageHistory utility classes.

Below is a sample prompt template for a Claude 3-backed conversational RAG chatbot with personalization. Keep in mind that different models have different formatting for prompt templates, and this example is for Claude models.

Once you have mapped outputs, determine infrastructure requirements based on expected user load, data storage, and processing needs. Depending on whether model consumption is serverless or dedicated, you may need to consider selecting and sizing your instances types and volumes. If you have requirements to self-host the model, consider using Inferentia, a purpose-built silicon with best-in-class price performance for inference, with 9.2x higher throughput, up to 10x lower latency, and up to 70% lower cost than corresponding Elastic Compute Cloud (EC2) instances.

Use the documented requirements to begin mapping out your application architecture. Consider designing with best practices in mind, including but not limited to scalability, high availability, observability and security.

Backend for gen AI applications include embedding models, text/image generation models, & more

Figure 4

Scalability and High Availability

Estimate concurrency and throughput requirements to design architecture that can handle peak usage. Select and provision infrastructure configured to support the application, ensuring high availability and performance.

Observability

Set up logging and monitoring systems to track the performance of prompts, responses, and infrastructure/application components. In Bedrock, you can accomplish this through logging and monitoring systems in place to track the performance of the application, infrastructure, and AI models.

Continuous Integration and Continuous Deployment (CI/CD)

Implement continuous Integration and Continuous Deployment (CI/CD) controls to enable smooth updates and improvements to prompts and application components.

Security

Implement security controls to protect sensitive data, ensure privacy, and maintain compliance with relevant regulations. This includes application-level security such as leveraging identity and access management, firewalls, encryption, and general best practices. If applicable to your workload, use Guardrails for Bedrock to implement custom safeguards, detect misconduct, and detect prompt injection and jailbreak attempts in your generative AI application. Additionally, Bedrock doesn’t use your prompts and continuations to train any AWS models or distribute them to third parties, and doesn’t store or log your data.

IV. UAT and Feedback Controls

In the proof-of-concept (POC) phase and beyond, it is important to continuously incorporate feedback and iteration mechanisms. After a POC is developed, collect end-user feedback on the application experience, including usability, responsiveness, and quality of AI-generated responses. This serves as validation that the application meets end-user needs and delivers a satisfactory UX. Document processes and tools for continuous improvement based on user feedback and changing requirements. This results in established mechanisms for continuous enhancement and optimization based on user feedback and evolving requirements.

Establish a mechanism to feed user feedback into prompt refinement processes, as well as latency optimization efforts. Developing a data-driven approach to model and prompt improvement leveraging user feedback to refine AI performance and outputs. Consider adjusting your testing dataset as requirements shift to stay grounded in the business goal and adapt to user needs. This also provides a way to measure model performance as needs change over time.

Prior to production deployment, set up user acceptance testing (UAT) environments and procedures to validate application functionality and performance. By identifying potential issues and areas for improvement before production deployment, you can minimize risk and ensure a smooth rollout. To promote continuous iteration, create user feedback loops to gather insights on application usage, identify areas for improvement, and prioritize future enhancements. This promotes ongoing user engagement and a customer-centric approach to application development and enhancement.

GenAI Developers cover the backend and DevOps/AppDevs cover the Front-end

Figure 5

V. Production Deployments and Monitoring

The final phase to take your generative AI solution into production, consists of implementing logging and monitoring and executing a rollout strategy. Consider what information should be logged from the application in order to improve the application in future updates. Critical data includes input prompts, output responses, and end-to-end latency. Monitoring of input prompts and output responses can be set up to look for data drift and deltas in output quality to detect unwanted patterns or generative AI hallucinations. Strategies for monitoring output responses includes examining output for expected words from added data, expected formats, and similarity to established golden answers. Metrics from the application such as latency and number of requests should be monitored to ensure the application is performing at scale.

Now you are ready to release your generative AI application into production. Execute a phased production rollout strategy, such as canary deployments or staged rollouts, to minimize risk and validate performance.

After the application has been released and used for a period of time, use the collected data to rebuild or refine prompt templates based on production data and user feedback. Strategies discussed in phase 2 with prompt template and model evaluation and discussed above for monitoring, are utilized here as well.

It’s important to note, this final phase is not a checkbox, but instead a recurring iteration of implementing feedback and making improvements to prompt templates based on production data, and re-evaluating models as more foundation models are released.

Conclusion

Currently, many companies are exploring how generative AI can provide business value, and different patterns for implementation. In this blog post, we explored best practices for taking a generative AI application from ideation to production, addressing common challenges and considerations. Implementing these best practices can help you develop and production-grade a thoughtfully designed generative AI application built with users, performance, scalability, and security in mind.

For a deep dive on observability in Amazon Bedrock, check out Monitoring Generative AI Applications using Amazon Bedrock and Amazon CloudWatch Integration.

To get started building generative AI applications on AWS, check out the Generative AI Webpage.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.

Site Terms, Privacy, and more.