AWS in the Enterprise: The Good, the Bad, and the Ugly
How to manage AWS at scale and where improvements need to be made.
Published Dec 3, 2023
For the past several years, I have worked with enterprises to strategize and implement large-scale AWS deployments. There is no shortage of things to learn when you move from mid-sized businesses to the enterprise. During my transition I quickly realized that the Dunning-Kruger effect is absolutely real: I had to learn how to take basic AWS concepts and scale them to the enterprise level. I was using all kinds of tools, such as CloudFormation StackSets, Step Functions, Service Catalog, and cross-account IAM at scale. Full-scale automation using custom functions was not uncommon either, meaning I spent a lot of time writing not just IaC but also code.
One of the main tools people use in the enterprise is AWS Control Tower, a service that makes it easy to manage multi-account environments, especially at scale. You can apply controls to all accounts based on AWS best practices and automatically implement features such as central aggregation of CloudTrail logs. All of this is much easier than setting up a well-architected multi-account environment by hand. On top of that, you can plug in AWS Control Tower customizations to automate the deployment of service control policies (SCPs).
Layered on top of AWS Control Tower is another service, AWS IAM Identity Center (formerly AWS SSO). It lets us manage AWS permissions across hundreds of accounts and uses STS to limit how long credentials remain active. This is a huge win over using traditional IAM in each account, and it provides a highly visible, secure way to handle access.
What's not so good about AWS Control Tower and its supporting services is also very apparent. While the service provides a lot of value, it is severely lacking in many areas.
First on the list is AWS Control Tower upgrades. An upgrade takes two steps: first you update Control Tower itself, and then you update every organizational unit (OU) in your organization. Each OU takes about 45-50 minutes to upgrade, and you cannot update more than one OU in parallel. On top of that, you can't queue updates across a large set of OUs, which hurts most at larger organizations. So updating 9-10 OUs takes roughly 400 to 500 minutes, or about 7 to 8.3 hours, of babysitting Control Tower.
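As a quick sanity check on the serial upgrade time, here is a minimal sketch. It assumes the 45-50 minute per-OU figure quoted above and that OUs are updated strictly one at a time; the function name is purely illustrative.

```python
def upgrade_window_minutes(ou_count: int, per_ou_minutes: tuple = (45, 50)) -> tuple:
    """Return (best_case, worst_case) total minutes for updating OUs serially.

    Assumes no parallelism and no queuing, so the totals are simple products.
    """
    low, high = per_ou_minutes
    return ou_count * low, ou_count * high


# Range for the 9-10 OU scenario: best case uses 9 OUs, worst case uses 10.
best_case, _ = upgrade_window_minutes(9)
_, worst_case = upgrade_window_minutes(10)
print(f"{best_case} to {worst_case} minutes "
      f"(up to {worst_case / 60:.1f} hours)")
```

Because the updates cannot overlap, the total grows linearly with the number of OUs, which is why this becomes painful for large organizations.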
The second issue is automation around AWS SSO. The service has no built-in feature for automating updates to permission sets, so we built an open-source tool to handle this (AWS SSO Permission Set Automation). We also built an open-source tool (AWS SSO Automation) that auto-assigns groups coming from an external identity provider: as long as a group follows a certain pattern, it is automatically assigned to an account or organization on creation. This should be a feature that AWS provides as part of the service, but it seems to be missing.
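To illustrate the idea of pattern-based group assignment, here is a minimal sketch. The naming convention (`AWS-<account id>-<permission set>`) and the names `GROUP_PATTERN`, `Assignment`, and `parse_group` are hypothetical and are not taken from the open-source tool; they only show how a group name could be mapped to a target account and permission set.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical convention (an assumption for this sketch, not the tool's own):
#   AWS-<12-digit account id>-<permission set name>
GROUP_PATTERN = re.compile(
    r"^AWS-(?P<account_id>\d{12})-(?P<permission_set>[A-Za-z0-9_+=,.@-]+)$"
)


class Assignment(NamedTuple):
    account_id: str
    permission_set: str


def parse_group(display_name: str) -> Optional[Assignment]:
    """Map a group name to an assignment, or None if it does not follow the pattern."""
    match = GROUP_PATTERN.match(display_name)
    if match is None:
        return None  # groups outside the convention are left alone
    return Assignment(match["account_id"], match["permission_set"])


print(parse_group("AWS-123456789012-ReadOnly"))
print(parse_group("Engineering"))
```

In a real deployment, the resulting pair would presumably feed a call to the IAM Identity Center `CreateAccountAssignment` API; that step is left out here so the sketch stays runnable without AWS credentials.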
We also implement certain tasks that aren't typical features of Control Tower, such as automating industry- or business-specific security configurations through code. This hardens the account before it's used. A practical approach is to use Step Functions together with a role that exists in every account, such as the Control Tower execution role. Custom code, running in either Lambda functions or containers, executes the hardening process; the complexity and execution time of the task determine which one is used. This approach offers incredible flexibility and can be customized for different business needs. Additionally, we trigger on the successful completion of Control Tower's CreateManagedAccount event, which starts the state machine as soon as a new account is created, so hardening is timely and efficient.
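The trigger described above can be sketched roughly as follows. The field names reflect my understanding of the Control Tower lifecycle event shape as delivered through EventBridge and should be verified against a real event; `CREATE_ACCOUNT_PATTERN`, `new_account_id`, and `state_machine_input` are illustrative names, not part of any AWS SDK.

```python
import json
from typing import Optional

# Assumed EventBridge rule pattern for a successful account creation.
CREATE_ACCOUNT_PATTERN = {
    "source": ["aws.controltower"],
    "detail-type": ["AWS Service Event via CloudTrail"],
    "detail": {
        "eventName": ["CreateManagedAccount"],
        "serviceEventDetails": {
            "createManagedAccountStatus": {"state": ["SUCCEEDED"]}
        },
    },
}


def new_account_id(event: dict) -> Optional[str]:
    """Return the new account ID for a successful CreateManagedAccount event, else None."""
    detail = event.get("detail", {})
    if event.get("source") != "aws.controltower":
        return None
    if detail.get("eventName") != "CreateManagedAccount":
        return None
    status = detail.get("serviceEventDetails", {}).get("createManagedAccountStatus", {})
    if status.get("state") != "SUCCEEDED":
        return None  # failed creations should not be hardened
    return status.get("account", {}).get("accountId")


def state_machine_input(event: dict) -> Optional[str]:
    """Build the JSON input for the hardening state machine, or None to skip."""
    account_id = new_account_id(event)
    if account_id is None:
        return None
    return json.dumps({"accountId": account_id})
```

A Lambda function subscribed to the rule could pass the resulting string as the input to a Step Functions `StartExecution` call, and the state machine would then assume the execution role in the new account to run the hardening steps.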
Moving on to the more challenging aspects, there's a significant issue with the suspension process for AWS accounts. Ideally, when an account is placed in a suspended state, pending a 90-day closure period, access to all resources within that account, like KMS keys and Lambda functions, would be restricted. However, these resources remain accessible, which eliminates the possibility of a reliable 'smoke test' to confirm that terminating the account will be seamless. This oversight can lead to unexpected complications for businesses that expect suspension to surface any hidden dependencies, only to encounter issues after the account is finally decommissioned.
Additionally, there's an inconsistency in the account deletion timeline. While AWS states that a suspended account is recoverable for up to 90 days, the actual deletion of the account seems to occur at an unpredictable time, sometimes extending beyond the 90-day window. This lack of precision, particularly from a company that champions automation, indicates a significant gap in its process management.
Despite these challenges with Control Tower, its utility cannot be overstated. It greatly simplifies the implementation of multi-account best practices, something that would be considerably more complex without it. I appreciate the ease it brings to setting up multiple accounts. However, AWS could benefit from integrating community-driven automation solutions into its offerings, enhancing its service and saving businesses from repeatedly 'reinventing the wheel' for common tasks.