
Open Source GenAIOps: The new moat for innovators in AI

Learn how to refine generative AI applications with your unique data and domain expertise. Use open-source tools and Amazon Bedrock to gain competitive edge.

Daniel Wirjo
Amazon Employee
Published Feb 28, 2025
Thank you to startup founders, Marc Klingen (Langfuse), Ian Webster (Promptfoo) and Jeffrey Ip (DeepEval) for their open source insights.
This post draws upon expertise from AWS experts, Jawhny Cooke, Senior Specialist Solutions Architect - Anthropic, Melanie Li, PhD, Senior Specialist Solutions Architect - Generative AI, Bharathi Srinivasan, Generative AI Data Scientist, Aaron Su, Dustin Liu, Startup Solutions Architect, and Rafa Xu, Cloud Architect.

From startups to ambitious enterprises, innovators are increasingly turning to generative AI to gain a competitive edge. However, the real differentiator lies not in the large language models (LLMs) themselves, but in how effectively you can operationalize these technologies with your data. Enter Generative AI Operations (GenAIOps): an emerging discipline that is becoming the new competitive edge in the AI race.
You need more than just RAG and tool use to succeed with generative AI in production. In this post, you will learn about critical non-functional components that determine your application's real-world success. Whether you're just starting out or advancing your AI capabilities, this post provides a high-level roadmap for your generative AI journey across three key stages: foundation, optimization, and scaling for success, drawing upon insights from frontier model providers, popular open source tools and capabilities offered by Amazon Bedrock.

What is GenAIOps?

For those familiar with DevOps, GenAIOps covers the additional operational considerations specific to generative AI. To understand why these matter, let's look at some of the challenges we hear from customers building production-quality generative AI applications:
  • Rapid evolution of new LLMs and model capabilities means you need systems that can evaluate and adopt new capabilities quickly.
  • The probabilistic nature of LLMs introduces inherent unpredictability in outputs, even with identical inputs. This fundamentally shifts how you approach application development compared to traditional deterministic systems.
  • Tracing becomes critical, as each generation can follow different paths through prompt chains and parameter configurations. You need comprehensive visibility to understand and optimize this behavior.
  • Evals demand a blended approach combining human judgment and statistical analysis. You need to move beyond simple pass-fail criteria to account for the nuanced, contextual nature of generated outputs.

Stage 1 - Foundation: Lay the Groundwork

In the foundational stage, focus on working backwards from your intended behavior, much like test-driven development (TDD) principles.

Offline Evals

Offline evals involve assessing model performance on pre-defined golden datasets (or ground truths) without real-time interaction. These evaluations help you understand model behavior, identify potential biases, and optimize performance before deployment.
Recommendation: Write your own empirical eval scripts or consider tools such as Promptfoo, DeepEval, or Bedrock Evaluations. Start with 50-200 high-quality examples for your golden dataset and refine them as you gain real-world experience.
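As a starting point, here is a minimal sketch of an offline eval loop over a golden dataset. The file name, the stubbed generate function, and the keyword-coverage score are illustrative placeholders; tools like Promptfoo, DeepEval, or Bedrock Evaluations would replace the scoring logic with LLM-as-judge or statistical metrics.

```python
import json

def generate(prompt: str) -> str:
    """Call your LLM here (e.g. via Amazon Bedrock); stubbed for illustration."""
    raise NotImplementedError

def score(output: str, expected_keywords: list[str]) -> float:
    """Naive keyword-coverage score; swap in an LLM-as-judge or statistical metric."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in output.lower())
    return hits / max(len(expected_keywords), 1)

# Each line of the golden dataset: {"prompt": "...", "expected_keywords": ["..."]}
with open("golden_dataset.jsonl") as f:
    cases = [json.loads(line) for line in f]

results = []
for case in cases:
    output = generate(case["prompt"])
    results.append(score(output, case["expected_keywords"]))

print(f"Average score: {sum(results) / len(results):.2f} over {len(results)} cases")
```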

Guardrails

Guardrails are mechanisms that filter and control AI inputs and outputs, ensuring your solution remains safe, consistent, and aligned with security and responsible AI guidelines. Don't skip this critical step: it's easier to build in controls from the outset than to retrofit them after risks surface.
Recommendation: First, start with foundation models with built-in safeguards such as Anthropic Claude or Amazon Nova. Then, implement Bedrock Guardrails or NeMo Guardrails. You can apply guardrails to both input prompts and model outputs. Guardrails also mitigate security risks such as those identified in the OWASP Top 10 for LLM Applications.
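To illustrate, here is a minimal sketch of checking user input against a Bedrock Guardrail using the ApplyGuardrail API via boto3. The guardrail identifier and version are placeholders for a guardrail you have already created (in the console or via infrastructure as code), and field names follow the boto3 API at the time of writing.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def check_input(user_text: str) -> bool:
    """Return True if the guardrail allows the input, False if it intervened."""
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier="your-guardrail-id",  # placeholder: create this first
        guardrailVersion="1",
        source="INPUT",
        content=[{"text": {"text": user_text}}],
    )
    return response["action"] != "GUARDRAIL_INTERVENED"

if check_input("How do I reset my password?"):
    print("Safe to pass to the model")
else:
    print("Blocked by guardrail")
```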

Stage 2 - Optimization: Refine your AI

In the optimization stage, prior to production deployment, implement practices that enable you to experiment quickly and improve your AI application.

Prompt Management

Given the importance of prompts in effective AI solutions, you need a systematic approach to managing your prompt engineering workflow. Like source code management with Git, you may need capabilities such as version control, templating and parameterization, team collaboration and A/B testing prompt variants.
Recommendation: Migrate from hard-coded prompts in your source code to a specialized prompt management tool to enable faster collaboration and iteration. Consider Bedrock Prompt Management or the prompt management capabilities integrated into the LLM observability tools below. For systematic improvement, leverage prompt optimization frameworks such as Anthropic MetaPrompt and DSPy.
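As an example of moving prompts out of source code, here is a minimal sketch using Langfuse prompt management. It assumes a prompt named "support-answer" with a {{question}} variable has already been created in Langfuse, and that Langfuse credentials are configured via environment variables.

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

# Fetch the managed prompt so it can be versioned and edited outside your codebase.
prompt = langfuse.get_prompt("support-answer")  # placeholder prompt name

# Fill in template variables (e.g. {{question}}) before sending to your LLM.
compiled = prompt.compile(question="How do I rotate my API keys?")
print(compiled)
```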

LLM Observability

Monitor and analyze your generative AI applications through LLM observability. This includes monitoring, traces, metrics and evals of real-world usage to ensure alignment with your AI product goals.
Recommendation: Implement a robust observability solution before deploying to production, using tools like Langfuse, Arize, or Vellum. Choose tools offering unified visibility and support for feedback loops from automated metrics, human-in-the-loop review, and user inputs.
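For a sense of the developer experience, here is a minimal tracing sketch using the Langfuse observe decorator around a Bedrock call. The function name and model ID are illustrative, credentials are assumed to come from environment variables, and the decorator import path varies by Langfuse SDK version.

```python
import boto3
from langfuse.decorators import observe

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

@observe()  # records inputs, outputs and latency as a trace in Langfuse
def answer(question: str) -> str:
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
        messages=[{"role": "user", "content": [{"text": question}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

print(answer("Summarize our refund policy in one sentence."))
```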

Context Evals

Evaluate your RAG and tool-based retrievals to ensure accurate and relevant responses when using proprietary knowledge. Assess retrieval quality, relevance, and contextual grounding to identify pipeline improvements and reduce hallucinations.
Recommendation: Implement evaluation frameworks such as RAGAS with Langfuse, Promptfoo, or DeepEval to measure and optimize your retrieval performance. Add contextual grounding checks as an additional safeguard against hallucination.
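Here is a minimal sketch of scoring a RAG pipeline with RAGAS metrics. The evaluation data is illustrative, and the column names and metric imports follow the classic RAGAS API, which varies between versions.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Illustrative evaluation set: each row pairs a question with the retrieved
# contexts and the generated answer from your RAG pipeline.
eval_data = Dataset.from_dict({
    "question": ["What is our refund window?"],
    "contexts": [["Refunds are accepted within 30 days of purchase."]],
    "answer": ["You can request a refund within 30 days of purchase."],
    "ground_truth": ["Refunds are accepted within 30 days."],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```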

Low-Code Workflow Orchestration

Use low-code orchestration tools that provide visual interfaces and pre-built components, reducing complexity and enabling rapid iteration.
Recommendation: Consider workflow tools like Flowise, n8n, or Bedrock Flows. Evaluate based on total cost, integration needs, and importantly, ease of use for rapid iteration.

Stage 3 - Scaling for Success

In the scaling stage, your generative AI application is in production with real-world users. Firstly, congratulations if you made it here! In this stage, focus on leveraging user feedback for continuous improvement, while ensuring scalability and reliability to maintain seamless operations as demand grows.

Continuous Evals

Production AI applications face evolving real-world usage that offline evals may not be able to fully predict. To continuously improve, you need systematic monitoring and automated testing to align your evals to real-world use cases.
Recommendation: Set up automated evals of production traces. Periodically update your evals golden dataset using real-world data to ensure relevance.
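The pattern can be as simple as sampling recent production traces, scoring them with the same logic as your offline evals, and flagging regressions. In this sketch, fetch_recent_traces and score are hypothetical helpers you would back with your observability tool (for example, Langfuse) and your eval framework.

```python
def fetch_recent_traces(hours: int = 24) -> list[dict]:
    """Hypothetical helper: pull recent input/output pairs from your observability tool."""
    raise NotImplementedError

def score(trace: dict) -> float:
    """Hypothetical helper: reuse the same eval logic as your offline golden dataset."""
    raise NotImplementedError

THRESHOLD = 0.8  # illustrative quality bar

traces = fetch_recent_traces(hours=24)
scores = [score(t) for t in traces]
avg = sum(scores) / max(len(scores), 1)

if avg < THRESHOLD:
    print(f"Eval regression detected: average score {avg:.2f} below {THRESHOLD}")
# Low-scoring traces are also good candidates to add back into your golden dataset.
```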

Throughput Optimization and Load Testing

As your application grows, consider service quotas and throughput to avoid performance challenges and bottlenecks.
Recommendation: To ensure reliable performance, consider strategies to handle traffic spikes and optimize throughput, such as queuing, load balancing, and asynchronous processing. Consider load testing with tools such as LLMPerf and FMBench to validate performance.
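As one building block, here is a minimal sketch of retrying a Bedrock call with exponential backoff when throttled. The retry budget and model ID are illustrative; in production you would typically combine this with queuing, load balancing, or the SDK's built-in retry configuration.

```python
import time
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def invoke_with_backoff(question: str, max_retries: int = 5) -> str:
    for attempt in range(max_retries):
        try:
            response = bedrock.converse(
                modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
                messages=[{"role": "user", "content": [{"text": question}]}],
            )
            return response["output"]["message"]["content"][0]["text"]
        except bedrock.exceptions.ThrottlingException:
            # Back off exponentially before retrying (1s, 2s, 4s, ...).
            time.sleep(2 ** attempt)
    raise RuntimeError("Exceeded retry budget while throttled")
```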

Multi-LLM Workflows and Agents

Enhance performance and optimize costs by orchestrating specialized LLMs. Use focused models for specific tasks like analysis and reasoning, rather than relying on a single general-purpose model.
Recommendation: Start with simple agents using frameworks like LangGraph or CrewAI. Gradually expand to specialized multi-agent systems to optimize accuracy, latency and cost. Reference Anthropic's research on building effective agents, Bedrock multi-agent orchestration patterns and the six levels of agentic behavior for best practices.
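Before adopting a full agent framework, a simple model router already captures much of the cost and latency benefit. In this sketch, the keyword heuristic and model IDs are illustrative; a common refinement is to use a small classifier model for routing.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

MODELS = {
    "simple": "anthropic.claude-3-haiku-20240307-v1:0",      # fast, low cost
    "complex": "anthropic.claude-3-5-sonnet-20240620-v1:0",  # stronger reasoning
}

def route(task: str) -> str:
    """Illustrative heuristic; a small classifier LLM is a common replacement."""
    keywords = ("analyse", "analyze", "plan", "compare", "reason")
    return "complex" if any(k in task.lower() for k in keywords) else "simple"

def run(task: str) -> str:
    model_id = MODELS[route(task)]
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": task}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

print(run("Compare these two pricing plans and recommend one."))
```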

Multi-tenancy and SaaS

Design operational practices that can effectively serve your growing customer base (tenants) while ensuring security, performance and cost-effectiveness across diverse usage patterns.
Recommendation: Build on SaaS design principles. Implement tenant awareness across your AI application stack, from auth and prompts to retrievals and observability. This enables performance metrics and cost attribution on a per-tenant basis. Consider tiering by customer segments and service levels. This can include using different models and pricing for each tier, or mitigating noisy-neighbor issues through isolation.
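A minimal sketch of tenant awareness is to attach a tenant identifier to every trace, so usage, latency, and cost can be attributed per tenant. This example uses Langfuse trace metadata; resolve_tenant and generate_answer are hypothetical helpers, and the decorator import path varies by Langfuse SDK version.

```python
from langfuse.decorators import observe, langfuse_context

def resolve_tenant(request_headers: dict) -> str:
    """Hypothetical helper: resolve the tenant from your auth context."""
    return request_headers.get("x-tenant-id", "unknown")

def generate_answer(question: str) -> str:
    """Hypothetical downstream LLM call."""
    raise NotImplementedError

@observe()
def handle_request(question: str, request_headers: dict) -> str:
    tenant_id = resolve_tenant(request_headers)
    # Tag the trace so usage, latency and cost can be attributed per tenant.
    langfuse_context.update_current_trace(
        tags=[f"tenant:{tenant_id}"],
        metadata={"tenant_id": tenant_id},
    )
    return generate_answer(question)
```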

Standardizing Context Integrations

Standardize and secure your integrations with external tools and data sources to avoid exponential complexity and costs as you scale.
Recommendation: Adopt a modular and standardized approach such as the Model Context Protocol (MCP), open-sourced by Anthropic in late 2024.
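For a sense of what this looks like, here is a minimal sketch of an MCP server exposing one internal data source as a tool, following the pattern of the official MCP Python SDK. The server name, tool, and its body are placeholders.

```python
from mcp.server.fastmcp import FastMCP

# A single MCP server can expose many tools and resources behind one standard protocol.
mcp = FastMCP("internal-knowledge")  # placeholder server name

@mcp.tool()
def search_docs(query: str) -> str:
    """Placeholder tool: search the internal knowledge base and return top results."""
    return f"No results yet for: {query}"  # replace with your retrieval logic

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default; MCP clients connect via this transport
```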

Best-Practices for Success

  1. Evals as a new moat: Prioritize developing robust evaluation methods incorporating your unique data and domain knowledge to create a competitive advantage. Evals also help you adapt your system to new AI models.
  2. Invest in observability: Implement comprehensive monitoring to understand your AI systems in production, enabling rapid iteration.
  3. Implement controls early: Establish offline evaluations and guardrails from the outset to better control generations. If you are using frameworks that have pre-defined prompts, consider removing abstractions and implementing your own libraries for greater control.
  4. Think big, but start small: Master prompt engineering and effective use of existing models before venturing into complex agentic systems or fine-tuning. Remember that LLMs are increasingly more capable.

Conclusion: From Insights to Action

GenAIOps presents a unique opportunity for startups and enterprises to gain a competitive edge through operational mastery of generative AI. While traditional AI differentiation relied on costly training and complex machine learning expertise, GenAIOps leverages the inherent capabilities of foundation models and incorporates your proprietary data into your operations. This approach molds the probabilistic nature of foundation models into reliable solutions by combining your proprietary data with continuous iteration to address your specific customer and industry challenges.
To dive deeper, explore the resources linked throughout the post or try the GenAIOps with Langfuse and Amazon Bedrock hands-on workshop.

Additional resources

If you are building an open source tool, please contact us for collaboration opportunities.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
