Automate OpenSearch Cross Cluster Replication for Resilience with Sceptre and CloudFormation
Create a CloudFormation Custom Resource to automate the set-up of connectivity between two OpenSearch clusters
Stephen Beck
Amazon Employee
Published Nov 29, 2024
Motivation: As retrieval-augmented generation (RAG) becomes a common way to improve responses in generative AI applications, high availability and disaster recovery for the systems that host RAG data become important considerations for production workloads.
Amazon OpenSearch offers cross-cluster replication as a means to maintain consistency between data hosted on distinct clusters, which can be in different AWS regions, different AWS accounts or both.
Replication connectivity can be established with the AWS console or using the OpenSearch API.
However, many organizations would like to have their infrastructure, including OpenSearch, fully defined in Infrastructure-as-Code (IaC) using CloudFormation. Ideally, OpenSearch replication setup should be doable in CloudFormation as well. This blog shows example code which achieves this result.
The technique used here is a Lambda function (written in Python in this example) that is invoked by a CloudFormation custom resource. The function is built after installing the "crhelper" package, which provides the scaffolding for invocation by a custom resource. The function must be uploaded to an appropriate S3 bucket and referenced in the CloudFormation template that creates it. The Lambda function is deployed in the AWS account and region of the "follower" OpenSearch cluster, which receives the data replicated from the "leader".
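As a rough sketch of that scaffolding (not the code from aws-samples), a crhelper-based handler can be quite small. `make_handler`, `connect`, and `disconnect` are hypothetical names introduced here for illustration; `CfnResource` and its `create`/`delete` decorators come from crhelper. The import sits inside the function only so that the module can be loaded for local dry runs where crhelper is not installed.

```python
def make_handler(connect, disconnect):
    """Wrap two plain callables in a crhelper-backed Lambda entry point.

    connect(properties) -> connection id (str)
    disconnect(connection_id) -> None
    Both are hypothetical hooks for this sketch, not part of crhelper.
    """
    # Third-party package, bundled into the Lambda zip at build time.
    from crhelper import CfnResource

    helper = CfnResource(json_logging=False, log_level="INFO")

    @helper.create
    def create(event, context):
        # The returned string becomes the resource's PhysicalResourceId.
        return connect(event["ResourceProperties"])

    @helper.delete
    def delete(event, context):
        # CloudFormation hands the same id back when the stack is deleted.
        disconnect(event["PhysicalResourceId"])

    def handler(event, context):
        # crhelper dispatches on event["RequestType"] and reports the
        # outcome back to CloudFormation.
        helper(event, context)

    return handler
```

In the Lambda module this could be exposed as `handler = make_handler(my_connect, my_disconnect)`, where the two callables wrap whatever OpenSearch calls your environment needs.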
The custom resource which triggers this Lambda function can be run in a CloudFormation stack as part of the initial stand-up of the OpenSearch follower cluster or anytime thereafter. Permissions to make and to accept connection requests are granted in the IAM roles associated with the Lambda code.
Note: this code can be found in aws-samples. The README includes full build instructions. The example code shown here must be adapted through its configuration files to run in your environment. Build dependencies include the AWS CLI, the Sceptre Python package, and the crhelper Python module.
What does the Lambda code look like? The following shows the initial connection setup request from the follower. Later in the code we will see the acceptance from the leader.
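A minimal sketch of such a request, assuming the boto3 `opensearch` client and its `create_outbound_connection` call, might look like this. The helper names `domain_info` and `request_connection`, and the tuple layout for the domains, are assumptions made for illustration and are not taken from the aws-samples code. The client is passed in so the function can be exercised without AWS access.

```python
def domain_info(account_id, domain_name, region):
    """Build the domain descriptor used by the OpenSearch connection APIs."""
    return {
        "AWSDomainInformation": {
            "OwnerId": account_id,
            "DomainName": domain_name,
            "Region": region,
        }
    }


def request_connection(client, alias, follower, leader):
    """Ask for an outbound connection from the follower to the leader.

    client   -- a boto3 "opensearch" client in the follower's account/region
    alias    -- name of the connection to create
    follower -- (account_id, domain_name, region) of the requesting domain
    leader   -- (account_id, domain_name, region) of the remote domain
    Returns the id of the new connection, which stays pending until the
    leader side accepts it.
    """
    response = client.create_outbound_connection(
        LocalDomainInfo=domain_info(*follower),
        RemoteDomainInfo=domain_info(*leader),
        ConnectionAlias=alias,
    )
    return response["ConnectionId"]
```

Inside the Lambda, the client would typically be created with `boto3.client("opensearch", region_name=...)` using the follower's region.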
Note that a NAT instance is involved here. It is needed when the Lambda function runs "within" a VPC: at the time of writing, the OpenSearch APIs needed for connection requests are not available via a private OpenSearch endpoint, so a NAT instance is required to get requests to the OpenSearch public endpoint. An imaginative use of the NAT instance here is that it has already been created by an earlier CloudFormation stack and then stopped. The Lambda code "wakes it up" and waits until it is ready to pass traffic outward to the public endpoint. If the Lambda runs outside a VPC, the NAT instance and the Lambda code that manages it are not needed, nor are the EC2 and STS private endpoints which are created in the full implementation referenced above.
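The "wake up and wait" step could be sketched as follows, assuming a boto3 EC2 client and its built-in `instance_status_ok` waiter. The function name `wake_nat_instance` and the default polling values are hypothetical choices for this sketch; the real implementation may judge readiness differently.

```python
def wake_nat_instance(ec2, instance_id, delay=5, max_attempts=60):
    """Start a stopped NAT instance and block until its status checks pass.

    ec2 -- a boto3 "ec2" client (passed in so it can be faked in tests)
    Returns the state the instance was in before the call.
    """
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    state = reservations[0]["Instances"][0]["State"]["Name"]

    # Only a stopped instance needs an explicit start request.
    if state == "stopped":
        ec2.start_instances(InstanceIds=[instance_id])

    # Poll until the instance reports "ok" on its status checks.
    # delay * max_attempts must fit inside the Lambda timeout.
    waiter = ec2.get_waiter("instance_status_ok")
    waiter.wait(
        InstanceIds=[instance_id],
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
    return state
```

With the defaults above the wait is capped at roughly five minutes, which leaves headroom under the Lambda maximum timeout of fifteen minutes.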
The data to configure the Lambda at the time of its creation looks like this:
Note that these CloudFormation parameter values are obtained from the output values of other CloudFormation stacks. These stacks are assembled using Sceptre along with CloudFormation templates.
Additional invocation parameters come from the custom resource properties, which are passed to the Lambda in the "event" payload. Notice that the name, or "alias", of the OpenSearch connection to be created is a variable defined in the global file "var.yaml", which is passed to Sceptre as an option file when creating the custom resource: "sceptre --var-file var.yaml create es-follower/crossclustercr.yaml".
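To illustrate how the Lambda might read those values, here is a small sketch that pulls them out of the `ResourceProperties` section of the custom resource event. The property names (`ConnectionAlias`, `LeaderDomainName`, `LeaderRegion`, `LeaderAccountId`) and the `ConnectionRequest` type are hypothetical; substitute whatever names your custom resource template actually passes.

```python
from dataclasses import dataclass

# Hypothetical property names; they must match the custom resource template.
REQUIRED = ("ConnectionAlias", "LeaderDomainName", "LeaderRegion", "LeaderAccountId")


@dataclass(frozen=True)
class ConnectionRequest:
    request_type: str   # "Create", "Update" or "Delete"
    alias: str
    leader_domain: str
    leader_region: str
    leader_account: str


def parse_event(event):
    """Validate and unpack a CloudFormation custom resource event."""
    props = event["ResourceProperties"]
    missing = [key for key in REQUIRED if key not in props]
    if missing:
        raise ValueError("missing custom resource properties: " + ", ".join(missing))
    return ConnectionRequest(
        request_type=event["RequestType"],
        alias=props["ConnectionAlias"],
        leader_domain=props["LeaderDomainName"],
        leader_region=props["LeaderRegion"],
        # Account ids are kept as strings so leading zeros are preserved.
        leader_account=str(props["LeaderAccountId"]),
    )
```

Failing early on a missing property gives CloudFormation a clear error message instead of a failed API call deeper in the function.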
The crosscluster custom resource parameter file:
And the crosscluster custom resource CloudFormation main body which sets up the event data to be passed into the Lambda function:
Going back to the Lambda function code, note that it has been written to allow for a cross-account use case. Review the code for accepting the connection:
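One way the acceptance step could be written is sketched below. It assumes that cross-account access is obtained by assuming an IAM role in the leader account through STS, which is an assumption of this sketch rather than a confirmed detail of the sample code. `leader_client`, `accept_connection`, and the role session name are hypothetical; `accept_inbound_connection` is the boto3 `opensearch` client call.

```python
def leader_client(session_factory, region, role_arn=None):
    """Return an OpenSearch client for the leader's account and region.

    session_factory -- normally boto3.Session; injected so it can be faked
    role_arn        -- role to assume in the leader account, or None when
                       leader and follower share one account
    """
    session = session_factory()
    if role_arn:
        # Cross-account case: swap to temporary credentials for the leader.
        creds = session.client("sts").assume_role(
            RoleArn=role_arn,
            RoleSessionName="accept-ccr-connection",
        )["Credentials"]
        session = session_factory(
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )
    return session.client("opensearch", region_name=region)


def accept_connection(client, connection_id):
    """Accept the pending inbound connection on the leader side.

    Returns the status code reported for the connection.
    """
    response = client.accept_inbound_connection(ConnectionId=connection_id)
    return response["Connection"]["ConnectionStatus"]["StatusCode"]
```

A call such as `accept_connection(leader_client(boto3.Session, "us-east-1"), connection_id)` would cover the same-account, different-region setup; passing `role_arn` covers the cross-account one.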
How are the AWS accounts, with their credentials, and the AWS regions obtained? Sceptre supports config files at any level in its directory structure, so it is simple to define different AWS accounts for the follower stacks and the leader stacks. This is a handy feature of Sceptre, as it simplifies the build-out of multi-account environments. Example of the config file in the follower directory:
And this information is found in the global "var.yaml" file:
Notice that the follower and leader profiles use the same AWS account in this setup, with different regions, so this is not a cross-account configuration. (The actual AWS account profiles, such as "follower-account", are obtained from the standard AWS credentials file.)
CloudFormation custom resource stack creation will establish the OpenSearch replication connection. Conversely, the connection will be removed by deleting the stack.
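The removal path can be sketched in the same style. This assumes the connection id was stored as the custom resource's physical resource id at creation time, so that CloudFormation passes it back on delete; `remove_connection` is a hypothetical helper, while `delete_outbound_connection` is the boto3 `opensearch` client call.

```python
def remove_connection(client, connection_id):
    """Delete the follower's outbound connection.

    Returns True if a delete was issued, False if the connection was
    already gone, so that stack deletion can be retried safely.
    """
    try:
        client.delete_outbound_connection(ConnectionId=connection_id)
    except client.exceptions.ResourceNotFoundException:
        # Nothing left to remove; treat the delete as already done.
        return False
    return True
```

Treating a missing connection as success is a design choice of this sketch: it keeps a stack delete from failing when the connection was removed by hand beforehand.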
Also, note that network connectivity must be established between the two OpenSearch clusters for actual data replication. The full sample implementation found in aws-samples creates all of the infrastructure except the needed peering connection between the respective regional transit gateways.
The full implementation also features an EC2 tunneling instance, which uses an SSM-defined tunneling document to let a desktop browser access the OpenSearch dashboards.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.