Best practices for cost and usage visibility for ML workloads

This post outlines steps you can take to implement a comprehensive tagging governance plan across accounts, leveraging AWS tools and services that provide visibility and control. By setting up automated policy enforcement and checks, you can ensure cost optimization across your ML environment.

Gunjan Jain
Amazon Employee
Published Jul 30, 2024
Understanding the costs associated with running your business is always important, regardless of where these costs originate. These could relate to infrastructure, personnel, leases, etc. To maximize the agility, scalability, and overall value of the cloud, you need real-time insights into your costs and usage so that you can make effective decisions for the future. What makes cost visibility even more important for the cloud is that cloud usage is dynamic. This requires continuous cost reporting and monitoring to ensure costs do not exceed expectations and you only pay for the usage you need. Additionally, you can measure the value the cloud delivers to your organization by quantifying the associated cloud costs.
For a multi-account environment, you can track costs at the account level to associate expenses. However, to allocate costs to cloud resources, a tagging strategy is essential. A combination of an AWS account and tags provides the best results. Implementing a cost allocation strategy early is critical for managing your expenses and future optimization activities that will reduce your spend.

Implement a Tagging Strategy

A tag is a label you assign to an AWS resource. Tags consist of a customer-defined key and an optional value to help manage, search for, and filter resources. Both tag keys and tag values are case sensitive, so Production and production are treated as different values. It is important to define a tagging strategy for your resources as soon as possible when establishing your cloud foundation. Tagging is an effective scaling mechanism for implementing cloud management and governance strategies. When defining your tagging strategy, you need to determine the right tags that will gather all necessary information in your environment. You can remove tags when no longer needed and apply new tags whenever required.
Categories for designing tags
Some of the common categories used for designing tags are as follows:
  1. Cost allocation tags help track costs by different attributes like department, environment, or application. This allows reporting and filtering costs in billing consoles based on tags.
  2. Automation tags are used during resource creation or management workflows. For example, tagging resources with their environment allows automating tasks like stopping non-production instances after hours.
  3. Access control tags enable restricting access and permissions based on tags. IAM roles and policies can reference tags to control which users or services can access specific tagged resources.
  4. Technical tags provide metadata about resources. Tags like environment or owner help identify technical attributes. Tags with the AWS-reserved aws: prefix provide additional metadata tracked by AWS.
  5. Compliance tags may be needed to adhere to regulatory requirements, e.g. tagging with classification levels or whether data is encrypted or not.
  6. Business tags represent business-related attributes rather than technical metadata, e.g. cost centers, business lines, products, etc. These help track spending for cost allocation purposes.
A tagging strategy also defines a standardized convention and implementation of tags across all resource types.
When defining tags, use the following conventions:
  • Use all lowercase for consistency and to avoid confusion.
  • Separate words with hyphens.
  • Use a prefix to identify your own tags and separate them from AWS-generated tags and tags generated by third-party tools.
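The conventions above can be checked mechanically. The following is a minimal sketch, assuming a key format of colon-separated segments (for example organization, category, name) where each segment is lowercase with hyphen-separated words. The pattern and the function name is_valid_tag_key are illustrative assumptions, not an AWS-defined rule.

```python
import re

# Assumed convention: "<prefix>:<category>:<name>", where every segment is
# lowercase letters or digits with words separated by single hyphens.
_SEGMENT = r"[a-z0-9]+(?:-[a-z0-9]+)*"
TAG_KEY_PATTERN = re.compile(rf"{_SEGMENT}(?::{_SEGMENT})+")


def is_valid_tag_key(key: str) -> bool:
    """Return True if the key follows the lowercase, hyphenated, prefixed convention."""
    # The aws: prefix is reserved for AWS-generated tags, so reject it for custom keys.
    if key.startswith("aws:"):
        return False
    return TAG_KEY_PATTERN.fullmatch(key) is not None


if __name__ == "__main__":
    for candidate in ("anycompany:finance:cost-center", "CostCenter", "aws:createdBy"):
        print(candidate, is_valid_tag_key(candidate))
```

A check like this could run in a pre-deployment step so that malformed keys are caught before resources are created.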
Tagging Dictionary
When defining a tagging dictionary, delineate between mandatory and discretionary tags. Mandatory tags should identify every resource, regardless of purpose. These tags will enable identifying necessary metadata. Discretionary tags comprise the tags that your strategy defines. Make these tags available to assign to resources needing them. Below are examples of a tagging dictionary used for tagging ML resources.
| Tag Type | Tag Key | Purpose | Cost Allocation | Mandatory |
| --- | --- | --- | --- | --- |
| Workload | anycompany:workload:application-id | Used to identify disparate resources that are related to a specific application | Y | Y |
| Workload | anycompany:workload:environment | Used to distinguish between dev, test and production | Y | Y |
| Financial | anycompany:finance:owner | Who is responsible for the resource, e.g. SecurityLead, SecOps, Workload-1-Development-team | Y | Y |
| Financial | anycompany:finance:business-unit | Identifies the business unit the resource belongs to, e.g. Finance, Retail, Sales, DevOps, Shared | Y | Y |
| Financial | anycompany:finance:cost-center | Cost allocation and tracking, e.g. 5045, Sales-5045, HR-2045 | Y | Y |
| Security | anycompany:security:data-classification | Data confidentiality that the resource supports | N | Y |
| Automation | anycompany:automation:encryption | Does the resource need to store data encrypted | N | N |
| Workload | anycompany:workload:name | Identify an individual resource | N | N |
| Workload | anycompany:workload:cluster | Identify resources that share a common configuration or perform a specific function for the application | N | N |
| Workload | anycompany:workload:version | Used to distinguish between different versions of a resource or application component | N | N |
| Operations | anycompany:operations:backup | Identifies if the resource needs to be backed up based on the type of workload and the data that it manages | N | N |
| Regulatory | anycompany:regulatory:framework | Requirements for compliance to specific standards/frameworks, e.g. NIST, HIPAA, GDPR | N | N |
You need to define what resources require tagging and implement mechanisms to enforce mandatory tags on all necessary resources. For multiple accounts, assign mandatory tags to each one identifying its purpose and the owner responsible. Avoid Personally Identifiable Information (PII) when labeling resources since tags remain unencrypted and visible.
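One way to make the tagging dictionary actionable is to keep it as data and check resources against it. The sketch below assumes a subset of the example keys from the dictionary above. The TagDefinition class and the missing_mandatory_tags helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagDefinition:
    key: str
    cost_allocation: bool
    mandatory: bool


# A subset of the example tagging dictionary, expressed as data.
TAG_DICTIONARY = [
    TagDefinition("anycompany:workload:application-id", True, True),
    TagDefinition("anycompany:workload:environment", True, True),
    TagDefinition("anycompany:finance:owner", True, True),
    TagDefinition("anycompany:finance:business-unit", True, True),
    TagDefinition("anycompany:finance:cost-center", True, True),
    TagDefinition("anycompany:security:data-classification", False, True),
    TagDefinition("anycompany:automation:encryption", False, False),
]

MANDATORY_KEYS = [d.key for d in TAG_DICTIONARY if d.mandatory]
COST_ALLOCATION_KEYS = [d.key for d in TAG_DICTIONARY if d.cost_allocation]


def missing_mandatory_tags(resource_tags: dict[str, str]) -> list[str]:
    """Return the mandatory keys that are absent or have an empty value."""
    return [key for key in MANDATORY_KEYS if not resource_tags.get(key)]
```

Keeping the dictionary in one place means the same definitions can feed validation scripts, templates, and reports.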

Tagging Machine Learning Workloads on AWS

When running machine learning workloads on AWS, primary costs are incurred from compute resources required, such as Amazon EC2 instances for hosting notebooks, running training jobs, or deploying hosted models. You also incur storage costs for datasets, notebooks, models, etc. stored in Amazon Simple Storage Service (Amazon S3). Some of the key contributors towards the cost that should be tagged and tracked are as described below.
AWS Lake Formation helps manage data lakes and integrate them with other AWS analytics services. You can define metadata tags and assign them to resources like databases and tables. This identifies teams or cost centers responsible for those resources. Automating resource tags when creating databases or tables with the CLI or SDKs ensures consistent tagging. This enables accurate tracking of costs incurred by different teams.
Amazon SageMaker Feature Store allows you to tag your feature groups and search for feature groups using tags. You can add tags when creating a new feature group or edit the tags of an existing feature group.
When you tag Amazon SageMaker resources such as jobs or endpoints, you can track spending based on attributes like project, team, or environment. For example, you can specify tags when creating the Amazon SageMaker Estimator that launches a training job. Using tags allows you to attribute costs to the business needs that drive them. Monitoring expenses this way gives insight into how budgets are consumed.
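AWS APIs generally expect tags as a list of Key/Value pairs rather than a plain mapping. The following is a minimal sketch of a helper that converts a Python dict into that shape. The helper name to_aws_tags and the tag values are illustrative assumptions, and the comments describe where such a list would typically be passed rather than a tested end-to-end job.

```python
def to_aws_tags(tags: dict[str, str]) -> list[dict[str, str]]:
    """Convert {"key": "value"} into the [{"Key": ..., "Value": ...}] list form."""
    # Sorting keeps the output stable, which makes diffs and tests easier.
    return [{"Key": key, "Value": value} for key, value in sorted(tags.items())]


training_job_tags = to_aws_tags(
    {
        "anycompany:workload:environment": "dev",
        "anycompany:finance:cost-center": "5045",
    }
)

# The resulting list is the shape accepted by, for example:
#   - the `tags` argument of the SageMaker Python SDK Estimator
#   - the `Tags` parameter of the boto3 SageMaker client's create_training_job
# Both calls need AWS credentials and additional required arguments,
# so they are intentionally left out of this sketch.
print(training_job_tags)
```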
Enforcing a Tagging Strategy
An effective tagging strategy uses mandatory tags and applies them consistently and programmatically across AWS resources. You can use both reactive and proactive approaches for governing tags in your AWS environment. Proactive governance uses tools such as AWS CloudFormation, AWS Service Catalog, Tag Policies in AWS Organizations, or AWS Identity and Access Management (IAM) resource-level permissions to ensure you apply mandatory tags consistently at resource creation. For example, you can use the AWS CloudFormation Resource Tags property to apply tags to resource types. In AWS Service Catalog, you can add tags that automatically apply when you launch the service. Reactive governance is for finding resources that lack proper tags using tools such as the AWS Resource Groups Tagging API, AWS Config Rules, and custom scripts. To find resources manually, you can use Tag Editor and detailed billing reports.
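As an illustration of the AWS CloudFormation Tags property, the sketch below builds a small template for an Amazon S3 bucket, such as one holding training datasets, with tags attached at creation. The logical ID TrainingDataBucket, the function name, and the tag values are assumptions chosen for the example.

```python
import json


def tagged_bucket_template(logical_id: str, tags: dict[str, str]) -> dict:
    """Build a CloudFormation template for one S3 bucket with the given tags."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # The Tags property takes a list of Key/Value pairs.
                    "Tags": [{"Key": k, "Value": v} for k, v in tags.items()],
                },
            }
        },
    }


template = tagged_bucket_template(
    "TrainingDataBucket",
    {
        "anycompany:workload:environment": "dev",
        "anycompany:finance:cost-center": "5045",
    },
)
print(json.dumps(template, indent=2))
```

Generating the Tags block from a shared dictionary keeps every template aligned with the same mandatory keys.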
Proactive Governance
  1. Using AWS Service Catalog: You can apply tags to all resources created when a product launches from AWS Service Catalog. AWS Service Catalog provides a TagOptions library. Use this to define the tag key-value pairs to associate with the product.
  2. Using AWS CloudFormation Resource Tags: You can apply tags to resources using the AWS CloudFormation Resource Tags property. Tag only those resources which support tagging through AWS CloudFormation.
  3. Using Tag Policies: Tag policies help you standardize tags across resources in your organization's accounts. In a tag policy, you define tagging rules that apply to resources when they are tagged. For example, you can specify that when a CostCenter tag is attached to a resource, it must match the case treatment and tag values that the policy defines. You can also specify that noncompliant tagging operations on specified resource types are enforced, which prevents noncompliant requests from completing. Untagged resources and tags that are not defined in the policy are not evaluated for compliance. Using tag policies involves working with multiple AWS services:
    1. To enable the tag policies feature, use AWS Organizations. You can create tag policies. Then attach those policies to organization entities to put the tagging rules into effect.
    2. Use AWS Resource Groups to find noncompliant tags on account resources. Correct the noncompliant tags in the AWS service where you created the resource.
  4. Using Service Control Policies (SCP): You can restrict the creation of an AWS resource without proper tags. Use Service Control Policies (SCP) to set guardrails around requests to create resources. SCPs allow you to enforce tagging policies on resource creation. To create a Service Control Policy, navigate to the AWS Organizations console, select Policies followed by Service Control Policies.
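The two policy-based options above can be sketched as JSON documents built in Python. The first function produces a tag policy that fixes the capitalization and allowed values of a CostCenter tag. The second produces an SCP that denies selected SageMaker create actions when the request carries no cost center tag. The allowed values, the chosen actions, and the function names are assumptions for illustration. Before relying on such an SCP, confirm in the service's IAM documentation that each action supports the aws:RequestTag condition key.

```python
import json


def tag_policy(tag_key: str, allowed_values: list[str]) -> dict:
    """Tag policy that standardizes the key's case and restricts its values."""
    return {
        "tags": {
            tag_key: {
                "tag_key": {"@@assign": tag_key},
                "tag_value": {"@@assign": list(allowed_values)},
            }
        }
    }


def deny_untagged_scp(actions: list[str], required_tag_key: str) -> dict:
    """SCP that denies the actions when the required tag is missing from the request."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCreateWithoutRequiredTag",
                "Effect": "Deny",
                "Action": list(actions),
                "Resource": "*",
                # "Null": "true" matches requests where the tag key is absent.
                "Condition": {"Null": {f"aws:RequestTag/{required_tag_key}": "true"}},
            }
        ],
    }


cost_center_policy = tag_policy("CostCenter", ["5045", "2045"])
scp = deny_untagged_scp(
    ["sagemaker:CreateTrainingJob", "sagemaker:CreateEndpoint"],
    "anycompany:finance:cost-center",
)
print(json.dumps(cost_center_policy, indent=2))
print(json.dumps(scp, indent=2))
```

The tag policy governs what a tag may look like once applied, while the SCP blocks creation requests that omit the tag entirely, so the two are often used together.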
Reactive Governance
  1. Using AWS Config Rules: Check resources regularly for improper tagging. The AWS Config Rule required-tags examines resources to ensure they contain specified tags. You should take action when resources lack necessary tags.
  2. Using AWS Resource Groups Tagging API: The API lets you tag or untag resources. It also lets you search for resources in a specified Region or account using tag-based filters, list the tag keys that exist in a Region or account, and find the existing values for a key within a specific Region or account. To create a tag-based resource group, follow the instructions in the AWS Resource Groups documentation.
  3. Using AWS Tag Editor: With Tag Editor, you build a query to find resources in one or more AWS Regions that are available for tagging. To find resources to tag, follow the instructions in the Tag Editor documentation.
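A reactive check can be sketched as a function over the response shape returned by the Resource Groups Tagging API get_resources operation, which lists resources under ResourceTagMappingList with a ResourceARN and a list of Key/Value tags. The required keys, the sample ARNs, and the function name are assumptions. The API call itself is shown only in a comment because it needs credentials, and resources that have never been tagged may not appear in its results.

```python
def find_noncompliant(response: dict, required_keys: list[str]) -> dict[str, list[str]]:
    """Map each resource ARN to the required tag keys it lacks (or leaves empty)."""
    report = {}
    for mapping in response.get("ResourceTagMappingList", []):
        present = {t["Key"] for t in mapping.get("Tags", []) if t.get("Value")}
        missing = [key for key in required_keys if key not in present]
        if missing:
            report[mapping["ResourceARN"]] = missing
    return report


# In a real account the response would come from something like:
#   boto3.client("resourcegroupstaggingapi").get_resources()
# Here a hand-written sample with placeholder ARNs stands in for it.
sample_response = {
    "ResourceTagMappingList": [
        {
            "ResourceARN": "arn:aws:sagemaker:us-east-1:111122223333:training-job/example-a",
            "Tags": [
                {"Key": "anycompany:workload:environment", "Value": "dev"},
                {"Key": "anycompany:finance:cost-center", "Value": "5045"},
            ],
        },
        {
            "ResourceARN": "arn:aws:sagemaker:us-east-1:111122223333:endpoint/example-b",
            "Tags": [{"Key": "anycompany:workload:environment", "Value": "prod"}],
        },
    ]
}

REQUIRED = ["anycompany:workload:environment", "anycompany:finance:cost-center"]
print(find_noncompliant(sample_response, REQUIRED))
```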
Amazon SageMaker Tag Propagation
Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all ML development steps required to prepare data, as well as build, train, and deploy models. SageMaker Studio automatically copies and assigns tags to the SageMaker Studio notebooks created by users, so you can easily track and categorize the cost of SageMaker Studio notebooks.
Amazon SageMaker Pipelines allow you to create end-to-end workflows for managing and deploying SageMaker jobs. Each pipeline is composed of a sequence of steps that transform data into a trained model. Tags can be applied to pipelines similarly to how they are used for other SageMaker resources. When a pipeline is executed, its tags can potentially propagate to the underlying jobs launched as part of the pipeline steps.
When models are registered in Amazon SageMaker Model Registry, tags can be propagated from model packages to other related resources like endpoints. Model packages in the registry can be tagged when registering a model version. These tags become associated with the model package. Tags on model packages can potentially propagate to other resources that reference the model, such as endpoints created using the model.
Tag Policy Quotas
The number of policies that you can attach to an entity (root, OU, or account) is subject to quotas for AWS Organizations. See the AWS Organizations quotas for the number of tag policies that you can attach to a root, OU, or account.

Monitor resources

To achieve financial success and accelerate business value realization in the cloud, you need complete, near real-time visibility of cost and usage information to make informed decisions.
Cost organization
You can apply meaningful metadata to your AWS usage with AWS Cost Allocation Tags. Use AWS Cost Categories to create rules that logically group cost and usage information by account, tags, service, charge type, or other categories. Access the metadata and groupings in AWS Cost Management products like AWS Cost Explorer, AWS Cost & Usage Reports, and AWS Budgets to trace costs and usage back to specific teams, projects, and business initiatives.
Cost visualization
You can view and analyze your AWS costs and usage over the past 13 months using AWS Cost Explorer. You can also forecast your likely spending for the next 12 months and receive recommendations for Reserved Instance purchases that may reduce your costs. Using AWS Cost Explorer enables you to identify areas needing further inquiry and to view trends to understand your costs. For more detailed cost and usage data, use AWS Data Exports to create exports of your billing and cost management data, using SQL to select the columns and filter the rows you want to receive. Data Exports are delivered on a recurring basis to your Amazon S3 bucket for you to use with your business intelligence or data analytics solutions.
You can use AWS Budgets to set custom budgets that track cost and usage for simple or complex use cases. AWS Budgets also lets you enable email or SNS notifications when actual or forecasted cost and usage exceed your set budget threshold. In addition, AWS Budgets integrates with AWS Cost Explorer.
Cost allocation
AWS Cost Explorer lets you view and analyze your cost and usage data over time, up to 13 months, through the console. It provides preconfigured views that show your cost trends at a glance and that you can use as a starting point for customized views. You can apply the available filters, including cost allocation tags, to view specific costs, and you can save any view as a report.
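The same tag-based breakdown is available programmatically through the Cost Explorer get_cost_and_usage operation. The sketch below builds request parameters that group cost by one tag key and totals a response of that shape per tag value. The service name "Amazon SageMaker", the dates, and the function names are assumptions. The tag must also be activated as a cost allocation tag before it appears in cost data. The API call is shown only in a comment because it requires credentials.

```python
from decimal import Decimal


def cost_by_tag_request(tag_key: str, start: str, end: str,
                        service: str = "Amazon SageMaker") -> dict:
    """Build get_cost_and_usage parameters grouped by one tag key."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # dates as YYYY-MM-DD
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": [service]}},
    }


def total_by_tag_value(response: dict) -> dict[str, Decimal]:
    """Sum UnblendedCost across all periods for each tag value."""
    totals: dict[str, Decimal] = {}
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            # Tag groups are keyed as "<tag key>$<tag value>"; an empty value
            # after "$" means the cost carried no value for this tag.
            _, _, value = group["Keys"][0].partition("$")
            label = value or "(untagged)"
            amount = Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[label] = totals.get(label, Decimal("0")) + amount
    return totals


params = cost_by_tag_request("anycompany:workload:environment", "2024-06-01", "2024-07-01")
# With credentials configured, the call would look like:
#   response = boto3.client("ce").get_cost_and_usage(**params)
print(params)
```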
Monitoring in a multi-account setup
Amazon SageMaker supports cross-account lineage tracking. This allows you to associate and query lineage entities, like models and training jobs, owned by different accounts. It helps you track related resources and costs across accounts. Use the AWS Cost and Usage Report to track costs for SageMaker and other services across accounts. The report aggregates usage and costs based on tags, resources, etc. so you can analyze spending per team, project or other criteria spanning multiple accounts. AWS Cost Explorer allows you to visualize and analyze SageMaker costs from different accounts. You can filter costs by tags, resources or other dimensions. You can also export the data to third party business intelligence tools for customized reporting.

Conclusion

In this post, we looked at implementing a comprehensive tagging strategy to track costs for machine learning workloads across multiple accounts. We discussed tagging best practices for logically grouping resources and tracking costs by dimensions like environment, application, and team. Next, we looked at enforcing the tagging strategy using proactive and reactive approaches. We also looked at capabilities within Amazon SageMaker to apply tags. Lastly, we looked at approaches to provide visibility of cost and usage for your machine learning workloads.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
