Building a Multi-region Serverless IoT system
Building a multi-region serverless IoT system can be a huge challenge. IoT resources, Things, and Certificates must be replicated across regions to give IoT devices seamless multi-region connectivity.
Published Aug 18, 2023
Last Modified Apr 28, 2024
In this post we will take a look at how to build a multi-region serverless IoT system. We will dig deep into concepts like how to replicate certificates across regions and how to set up a DNS record for device connection, all in an event-driven and serverless way.
There are many reasons why we create multi-region systems on AWS. It can be compliance requirements, resiliency, or latency requirements. In an IoT system, we often have small devices with limited bandwidth, which makes latency requirements even more important. For a large system with a global footprint, a multi-region solution helps us solve that challenge. However, building a multi-region IoT system comes with challenges of its own.
In a multi-region IoT system, we don't want our IoT devices to be aware of which, or how many, AWS Regions we are running in. This should be totally transparent to the device; it should only connect to iot.example.com. For this to work, we must register our IoT Things, Policies, and certificates in all Regions. Therefore, we must be able to replicate all of this information across all Regions, regardless of which Region the device connects to.
We will look at the steps needed to perform this replication in an efficient way, which services to use, and what the setup looks like.
What we are building in this post is an architecture that spans two AWS Regions, with devices connecting to the closest Region using Route53. An AWS StepFunctions state-machine will act on incoming events from IoT Core and carry out the needed work. Now let's get building.
We will use a configurable endpoint with a custom domain: iot.example.com. This will be configured in two Regions, and we'll use a latency-based record to let devices connect to the Region closest to them. When a device connects for the very first time, its certificate will be registered in IoT Core using Just In Time Registration. If a device certificate is successfully registered and activated, it will be stored in a device registry, which is a Global DynamoDB table. With DynamoDB Streams, EventBridge Pipes, and a custom event-bus, the replication of the device certificate can be done in an event-driven way. Devices will use certificates issued by a self-signed Root CA. When we authenticate our devices towards AWS IoT Core, we will be using Mutual TLS.
Data from the IoT devices will be stored in a Global DynamoDB table to allow access from both Regions. Throughout this blog post we will be using the us-west-2 (Oregon) and eu-west-1 (Ireland) Regions when we build and create resources.
Let’s start off by creating the DynamoDB tables we need.
We will create two global DynamoDB tables that are deployed from a single Region but configured with a replica in a second Region. We need a way to keep track of the devices that have already been registered, regardless of which Region they were initially registered in. For this, we will create a Device Inventory table. For testing purposes, we will also create a global DynamoDB table where we can store data coming from the devices.
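As a rough illustration of the two tables, the sketch below generates a CloudFormation template fragment (in JSON form) with two `AWS::DynamoDB::GlobalTable` resources. The table names, the key attribute names (`ThingName`, `DeviceId`), and the billing mode are assumptions made for this example, not values taken from the template used in this post.

```python
import json


def global_table_resource(table_name, key_name, regions):
    """Build an AWS::DynamoDB::GlobalTable resource with one replica per Region."""
    return {
        "Type": "AWS::DynamoDB::GlobalTable",
        "Properties": {
            "TableName": table_name,
            "BillingMode": "PAY_PER_REQUEST",
            "AttributeDefinitions": [
                {"AttributeName": key_name, "AttributeType": "S"}
            ],
            "KeySchema": [{"AttributeName": key_name, "KeyType": "HASH"}],
            # A stream is needed for replication and for the event-driven flow.
            "StreamSpecification": {"StreamViewType": "NEW_AND_OLD_IMAGES"},
            "Replicas": [{"Region": region} for region in regions],
        },
    }


REGIONS = ["us-west-2", "eu-west-1"]

# Table and key names are hypothetical placeholders.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "DeviceInventoryTable": global_table_resource(
            "device-inventory", "ThingName", REGIONS
        ),
        "DeviceDataTable": global_table_resource(
            "device-data", "DeviceId", REGIONS
        ),
    },
}

print(json.dumps(template, indent=2))
```

The template is deployed once, in one of the listed Regions; the `Replicas` list is what creates the copy in the other Region.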
We will deploy these resources using AWS CloudFormation / Serverless Application Model (SAM).
We are going to use a self-signed Root CA when issuing server and device certificates. In a production environment, you should use a Root CA signed by a trusted third-party source. Our certificate chain will include a Root CA and an Intermediate CA. The Intermediate CA will issue both the server and client certificates. Read this to learn more about SSL/TLS certificates. Read this to learn more about the process that we will be using to create the certificates. We will be using OpenSSL when creating certificates.
We will start by creating the folder structure needed and the files needed to track our certificate creation.
Next, we create the private key and certificate for the Root CA.
With the Root CA created, we can use this to create and sign the Intermediate CA and create a certificate bundle with the Root and Intermediate CA.
With the Root CA and Intermediate CA created, we can continue to create the server certificate and generate a couple of device certificates as well. The process is very similar to the one used when creating the Intermediate CA. When the server certificate is created, it's very important that the CommonName is set to the server FQDN, in our case iot.example.com.
Finally, we create a set of different client certificates. Here we can create a simple bash script that helps us create each certificate and copy it to a clients folder so we can easily access it. For the client certificates, we want the common name to match the Thing name we'll use in IoT Core later.
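The post uses a bash script for this step; the sketch below expresses the same idea in Python by assembling the OpenSSL commands for one client certificate. The config path `intermediate/openssl.cnf`, the `usr_cert` extension section, the 375-day validity, and the file layout are assumptions for illustration and should be adapted to your own CA setup.

```python
import subprocess
from pathlib import Path


def client_cert_commands(thing_name, ca_config="intermediate/openssl.cnf",
                         out_dir="clients"):
    """Return the OpenSSL commands that create a key, a CSR, and a signed cert.

    The certificate common name is set to the Thing name.
    """
    key = f"{out_dir}/{thing_name}.key"
    csr = f"{out_dir}/{thing_name}.csr"
    crt = f"{out_dir}/{thing_name}.crt"
    return [
        # 1. Private key for the device
        ["openssl", "genrsa", "-out", key, "2048"],
        # 2. Signing request with CN = Thing name
        ["openssl", "req", "-new", "-key", key,
         "-subj", f"/CN={thing_name}", "-out", csr],
        # 3. Sign the request with the Intermediate CA
        ["openssl", "ca", "-batch", "-config", ca_config,
         "-extensions", "usr_cert", "-days", "375", "-notext",
         "-md", "sha256", "-in", csr, "-out", crt],
    ]


def create_client_cert(thing_name, ca_config="intermediate/openssl.cnf",
                       out_dir="clients"):
    """Run the commands in order, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in client_cert_commands(thing_name, ca_config, out_dir):
        subprocess.run(command, check=True)
```

Calling `create_client_cert("device-001")` would then leave `device-001.key` and `device-001.crt` in the `clients` folder, provided the Intermediate CA from the earlier steps exists at the assumed path.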
The configuration files used are the standard OpenSSL configuration files, with the absolute paths to the keys set.
With all the certificates created, we can move on and set up the IoT Core endpoints.
The first thing we need to do is import the server certificate to AWS Certificate Manager (ACM). We need three things: the server certificate, the server private key, and the certificate chain we also created earlier.
The import can also be done using the AWS CLI.
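For reference, a minimal sketch of the same import using boto3 instead of the CLI. The function expects an ACM client, for example `boto3.client("acm", region_name="us-west-2")`, and the three file paths are placeholders for the files created earlier.

```python
from pathlib import Path


def import_server_certificate(acm_client, cert_path, key_path, chain_path):
    """Import the server certificate, private key, and chain into ACM.

    Returns the ARN of the imported certificate.
    """
    response = acm_client.import_certificate(
        Certificate=Path(cert_path).read_bytes(),
        PrivateKey=Path(key_path).read_bytes(),
        CertificateChain=Path(chain_path).read_bytes(),
    )
    return response["CertificateArn"]
```

Since the custom domain is configured in both Regions, the import would be run once per Region with a client for that Region.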
The next step in the process is to create the domain configuration and a latency-based Route53 record. To prove that we are the rightful owners of the domain name, we also need to create a validation certificate in ACM. To create all of this, we can turn to AWS CloudFormation and SAM. It's assumed that a Hosted Zone for the domain exists in Route53.
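To make the shape of these resources concrete, here is a sketch that builds the two CloudFormation resources for one Region as a Python dictionary. The logical IDs, the configuration name, the TTL, and the `SetIdentifier` format are assumptions for this example; the ARNs, hosted zone ID, and endpoint are passed in as parameters.

```python
def domain_resources(domain, hosted_zone_id, region, iot_endpoint,
                     server_cert_arn, validation_cert_arn):
    """Build the domain configuration and latency record for one Region."""
    return {
        "IoTDomainConfiguration": {
            "Type": "AWS::IoT::DomainConfiguration",
            "Properties": {
                "DomainConfigurationName": "iot-custom-domain",  # hypothetical
                "DomainName": domain,
                "ServiceType": "DATA",
                "DomainConfigurationStatus": "ENABLED",
                "ServerCertificateArns": [server_cert_arn],
                "ValidationCertificateArn": validation_cert_arn,
            },
        },
        "IoTLatencyRecord": {
            "Type": "AWS::Route53::RecordSet",
            "Properties": {
                "HostedZoneId": hosted_zone_id,
                "Name": domain,
                "Type": "CNAME",
                "TTL": "60",
                # Points the custom domain at this Region's IoT Core endpoint.
                "ResourceRecords": [iot_endpoint],
                # Region plus a unique SetIdentifier make this a latency record.
                "Region": region,
                "SetIdentifier": f"iot-{region}",
            },
        },
    }
```

Each Region gets its own record with the same name but a different `Region` and `SetIdentifier`, which is what lets Route53 answer with the endpoint that has the lowest latency for the caller.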
We need the IoT Core Endpoint, so we can point the Route53 record to it. There is only one IoT Core endpoint per account and Region. The endpoint can be found under the Settings menu in the IoT Core console.
It's also possible to get the endpoint using the CLI. We need to get the Data ATS endpoint.
We will be using the same template throughout this blog post and just continue to add resources to it.
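Looking up the Data ATS endpoint programmatically could look like the sketch below. It assumes a boto3 IoT client, for example `boto3.client("iot", region_name="eu-west-1")`, and must be run once per Region because each Region has its own endpoint.

```python
def get_data_ats_endpoint(iot_client):
    """Return the account's Data ATS endpoint address for the client's Region."""
    response = iot_client.describe_endpoint(endpointType="iot:Data-ATS")
    return response["endpointAddress"]
```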
To validate the IoT domain configuration, we can navigate to the IoT Core console and check the configuration under Settings. The domain configuration should look like the images below.
With the endpoint configured, we can move to the next step: activating Just In Time Registration.
The first time a device connects to IoT Core with a certificate signed by our Intermediate CA, the certificate from the device will be registered in IoT Core. This way we can manufacture our devices, bundle them with certificates but not have them registered in IoT Core. The device will receive an immediate disconnect and then need to reconnect. There needs to be some form of wait logic in the device to allow for the cloud to activate the device in IoT Core before reconnecting. We can add our own validation in this activation step, making sure the device is who it claims to be.
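The wait logic on the device can be as simple as retrying with an increasing delay. The sketch below is one possible shape for it; `connect` is a hypothetical callable supplied by the device firmware that returns `True` once the broker accepts the connection, and the attempt count and delays are arbitrary example values.

```python
import time


def connect_with_retry(connect, max_attempts=5, initial_delay=2.0,
                       backoff=2.0, sleep=time.sleep):
    """Call connect() until it succeeds, waiting longer between each attempt.

    Returns the attempt number that succeeded, or raises ConnectionError
    when all attempts have been used.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        if connect():
            return attempt
        if attempt < max_attempts:
            # Give the cloud side time to activate the certificate.
            sleep(delay)
            delay *= backoff
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

With the defaults, a device that is disconnected on its first attempt waits two seconds before trying again, then four, and so on, which leaves room for the activation flow to finish.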
We need to run these steps in both Regions, because our devices will connect using the configured DNS name iot.example.com and will connect to the closest Region.
- Register our CA certificate in IoT Core and turn on auto-registration.
- Create an IoT Core Rule that will invoke an AWS StepFunction to carry out our activation logic.
- Replicate our device to the second Region. This step will use an event-driven architecture and react when our device is registered in the primary Region.
To register the CA certificate, we use the CLI. We'll register the certificate in SNI_ONLY mode. This way the same CA can be registered in several accounts and Regions.
It is also possible to do the registration using the console. Navigate to the IoT Console and select Security. In the console, SNI_ONLY mode is called Multi-account; it is the same mode, the graphical interface just calls it something different. Upload the ca_intermediate.crt file that we created in the certificate steps.
As usual, we can also do this using the CLI.
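As a boto3 counterpart to the CLI registration, the sketch below registers the CA in SNI_ONLY mode with auto-registration turned on. It assumes an IoT client for the target Region and the PEM text of `ca_intermediate.crt`; run it against each Region's client.

```python
def register_ca_sni_only(iot_client, ca_pem):
    """Register a CA certificate in SNI_ONLY mode and enable auto-registration.

    Returns the ID that IoT Core assigns to the registered CA certificate.
    """
    response = iot_client.register_ca_certificate(
        caCertificate=ca_pem,
        certificateMode="SNI_ONLY",
        setAsActive=True,
        allowAutoRegistration=True,
    )
    return response["certificateId"]
```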
To be able to invoke an AWS StepFunction when a certificate is registered, we must enable IoT Core Events. These events will be posted onto a reserved AWS topic, $aws/events/xyz, that we can then create a rule for. The events for registered certificates can ONLY be enabled using the CLI, as this option is not available in the console. To enable the events, we need to run the commands.
Once again, we turn to our CloudFormation / SAM template and add some additional resources. To start, we'll create the IoT Policy that we attach to our devices, the IoT Core Rule, and the StepFunction that implements our logic.
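For the event-enabling step described above, a boto3 sketch might look like this. It assumes that `CERTIFICATE` is the event type needed for registered certificates; check the event types available in your account before relying on it.

```python
def enable_certificate_events(iot_client, event_types=("CERTIFICATE",)):
    """Enable the given IoT Core event types and return what was sent.

    The default event type is an assumption made for this example.
    """
    configurations = {event_type: {"Enabled": True} for event_type in event_types}
    iot_client.update_event_configurations(eventConfigurations=configurations)
    return configurations
```

Like the CA registration, this is a per-Region setting, so it needs to be applied in both Regions.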
The JITR logic includes several steps that need to be coordinated and orchestrated. StepFunctions is an excellent service for this, and we can use the built-in error handling to roll back certain steps if we have a failure. We need to fetch information about the registered certificate and read out information from it. We'll check that this is a new device and that it doesn't already exist in our registry of devices. We then create an IoT Thing, attach the registered certificate to that Thing, attach the standard IoT Policy that we created in the previous part, and then activate the certificate to allow the device to actually connect to IoT Core. Finally, we register the device in our global device registry DynamoDB table.
The state-machine definition would be.
All but one of the tasks can be completed using the powerful service and SDK integration that exists in StepFunctions. For the task to read out information from the actual certificate, we need to use an AWS Lambda Function. We need to get the name of the thing, and this should match the common name in the certificate.
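To make the order of operations and the rollback idea easier to follow, here is the same sequence written as plain boto3 calls. This is not the state-machine definition from the post, only a sketch of the logic it orchestrates. `iot_client` is assumed to be a boto3 IoT client and `inventory_table` a boto3 DynamoDB `Table` resource; the item attribute names are hypothetical.

```python
def activate_device(iot_client, inventory_table, certificate_id,
                    thing_name, policy_name, region):
    """Create the Thing, wire up certificate and policy, then record the device.

    thing_name is expected to be the common name read from the certificate.
    Completed steps are undone in reverse order if a later step fails.
    """
    cert = iot_client.describe_certificate(certificateId=certificate_id)
    cert_arn = cert["certificateDescription"]["certificateArn"]

    # Only brand-new devices should go through activation.
    if "Item" in inventory_table.get_item(Key={"ThingName": thing_name}):
        raise ValueError(f"{thing_name} is already registered")

    undo_steps = []
    try:
        iot_client.create_thing(thingName=thing_name)
        undo_steps.append(lambda: iot_client.delete_thing(thingName=thing_name))

        iot_client.attach_thing_principal(thingName=thing_name, principal=cert_arn)
        undo_steps.append(lambda: iot_client.detach_thing_principal(
            thingName=thing_name, principal=cert_arn))

        iot_client.attach_policy(policyName=policy_name, target=cert_arn)
        undo_steps.append(lambda: iot_client.detach_policy(
            policyName=policy_name, target=cert_arn))

        iot_client.update_certificate(certificateId=certificate_id,
                                      newStatus="ACTIVE")
        undo_steps.append(lambda: iot_client.update_certificate(
            certificateId=certificate_id, newStatus="INACTIVE"))

        # Attribute names below are placeholders for this example.
        inventory_table.put_item(Item={
            "ThingName": thing_name,
            "CertificateId": certificate_id,
            "RegisteredRegion": region,
        })
    except Exception:
        for undo in reversed(undo_steps):
            undo()
        raise
    return cert_arn
```

In the state-machine, each of these calls maps to a task state, and the undo list corresponds to the error-handling branches.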
When a device has been registered and activated in one Region, we need to copy all of that information to the second Region. This includes the actual device certificate, and we need to create the Thing and attach the policy and certificate. We do this so our devices can then connect to any of our Regions, creating a multi-region setup with failover possibilities. This part of the system will be event-driven and invoked when a new device is stored in our device inventory. To accomplish this, we will use DynamoDB Streams together with EventBridge Pipes, which publishes an event to an EventBridge event-bus when there is a new record in the stream. From the event-bus we can invoke a StepFunction that will carry out the work. This StepFunction state-machine will be running in the destination Region, meaning we need to fetch information cross-region from the source Region.
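The decision the pipe and event-bus have to express can be sketched as a small function over a DynamoDB stream record. In the post this is done with EventBridge Pipes rather than code; the function below only illustrates the logic. The attribute names `ThingName`, `CertificateId`, and `RegisteredRegion` are hypothetical, and the check against the current Region assumes each item records where the device was first registered.

```python
def replication_event(stream_record, current_region):
    """Turn a stream record into a replication event, or None to skip it.

    Only newly inserted devices that were registered in another Region
    need to be replicated into current_region.
    """
    if stream_record.get("eventName") != "INSERT":
        return None

    image = stream_record["dynamodb"]["NewImage"]
    # Unwrap DynamoDB's typed values; this sketch only handles strings.
    item = {name: value["S"] for name, value in image.items() if "S" in value}

    source_region = item.get("RegisteredRegion")
    if source_region is None or source_region == current_region:
        return None

    return {
        "thingName": item["ThingName"],
        "certificateId": item["CertificateId"],
        "sourceRegion": source_region,
    }
```

The `sourceRegion` value tells the replication state-machine which Region to fetch the certificate from.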
We add some additional resources to our template and deploy this to both Regions.
The state-machine is very similar to the one used to register a new device. The main difference is that the device is not written to the device table. We also need to fetch certificate data cross-region.
The StepFunctions state-machine definition for replication of IoT resources would be like this.
With the replication done, we are ready to start testing the solution using our two regions.
What we will do during testing is set up an IoT Rule that will invoke a StepFunction state-machine, which will store the incoming data in our global data DynamoDB table.
To imitate devices, we will use EC2 instances running in the different Regions, and use Eclipse Mosquitto and mosquitto_pub.
First of all, let's create the resources we need by adding on to our CloudFormation template.
Make sure to upload the Root CA certificate, the intermediate CA certificate, and the device certificates and private keys that you have created to the CertificateStoreBucket bucket in both Regions. Since we use our self-signed CA certificate to sign the server certificate, we must present this during the connection attempt.
From the EC2 console, let's create an Ubuntu 22.04 instance.
For network settings, use the default VPC and allow SSH access from anywhere.
Under advanced settings, select the Instance Profile for the Region, which we created using the CloudFormation template.
Make sure to create a test instance in both Regions.
We can now connect to the instance using Instance Connect from the console.
When connected to the instance, install the AWS CLI and Mosquitto using snap, and download the device certificates from the S3 Bucket.
With this installed, we can use mosquitto_pub to connect and publish a message. The first time we try to connect, the device will receive an immediate disconnect, since the device certificate is not yet registered and active.
Navigating to the StepFunctions console in our current (primary) test Region should show at least one successful invocation of the JITR state-machine.
In our other (replication) Region we should see a successful invocation of the replication state-machine.
Returning to our instance connect session and issuing the command again will now generate a different result.
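The publish command used in both attempts might be assembled as in the sketch below. The certificate file names, the topic, and QoS 1 are assumptions for this example; the CA file is the self-signed CA that the client must present to trust the server certificate.

```python
def mosquitto_pub_command(thing_name, topic, message,
                          host="iot.example.com", cert_dir="."):
    """Build a mosquitto_pub command line for a mutual TLS connection."""
    return [
        "mosquitto_pub",
        "--cafile", f"{cert_dir}/ca.crt",         # hypothetical file name
        "--cert", f"{cert_dir}/{thing_name}.crt",
        "--key", f"{cert_dir}/{thing_name}.key",
        "-h", host,
        "-p", "8883",                              # MQTT over TLS
        "-q", "1",
        "-i", thing_name,                          # client ID matches the Thing
        "-t", topic,
        "-m", message,
        "-d",                                      # print debug output
    ]
```

The resulting list can be passed to `subprocess.run`, or joined with spaces and pasted into the Instance Connect session.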
Repeat the above using EC2 instances in both Regions. You can imitate the same device from both Regions. After sending some data, we can navigate to the DynamoDB table containing our IoT data. Verify which Region the data was written to the table from; this comes from our test state-machine, which writes ingress data to the table.
Creating a multi-region IoT system can be a challenge, with replication of Things and certificates. As shown in this post, it can be automated in an event-driven and serverless way that allows our IoT devices to connect and send data to several Regions using a custom FQDN endpoint.