HTCondor in AWS: Harnessing the Power of High-Throughput Computing in the Cloud, Cost-Effectively
This article offers a thorough exploration of HTCondor's integration with AWS, highlighting both its advantages and implementation challenges. Our goal is to guide readers through the intricacies of deploying HTCondor in AWS while providing guidance to expedite testing and proof-of-concept phases. By shedding light on both the benefits and potential challenges, we intend to empower organizations to effectively harness HTCondor's features for their high-throughput computing requirements within AWS.
Sudhi Bhat
Amazon Employee
Published Dec 13, 2024
What is HTCondor?
HTCondor stands as a robust workload management system tailored for high-throughput computing (HTC), excelling in the optimization of distributed computing resources. Its comprehensive suite of features encompasses efficient job submission and management, ensuring smooth handling of large-scale computational tasks. The system's intelligent resource matching and allocation capabilities maximize resource utilization, while built-in fault tolerance and automatic job recovery mechanisms safeguard against disruptions. HTCondor facilitates seamless file transfer between compute nodes, streamlining data movement across the distributed environment. Additionally, it incorporates comprehensive security and access control measures, protecting sensitive data and maintaining the integrity of the computational ecosystem. These key features collectively position HTCondor as a powerful solution for organizations seeking to harness the full potential of their distributed computing resources.
The integration of HTCondor with Amazon Web Services (AWS) creates a formidable solution for managing large-scale computational workloads, offering a suite of compelling advantages. This powerful combination leverages AWS's virtually unlimited computing resources, offered through 800+ instance types, allowing for seamless scalability that adapts to fluctuating demands. Organizations benefit from the cost-efficiency of AWS's pay-as-you-go model, significantly reducing infrastructure costs. The flexibility to access diverse AWS instance types ensures optimal matching of resources to specific workload requirements. Reliability is enhanced through AWS's global infrastructure, providing high availability for critical operations. AWS's robust monitoring and observability capabilities, coupled with HTCondor's job management and fault-tolerance features, ensure smooth and efficient workload processing. Furthermore, the integration enhances security by combining AWS's robust security features with HTCondor's access controls, creating a comprehensive protective framework for sensitive computational tasks and data.
Organizations can host HTCondor natively in AWS or leverage a feature of HTCondor called Condor Annex, which allows users to dynamically extend their HTCondor pool by adding cloud resources from Amazon Web Services (AWS). Condor Annex makes it easy to scale your use case by natively leveraging efficient compute choices such as EC2 Spot Instances and Graviton-based instances. HTCondor on AWS offers several compelling use cases for high-throughput computing and distributed workloads. Key Applications of HTCondor Annex in AWS:
1. Rapid Scaling for Time-Critical Tasks:
- Expand computing capacity swiftly to meet urgent deadlines
- Address sudden surges in computational demands
- Accelerate job completion by tapping into cloud resources
2. Access to Diverse Computing Resources:
- Utilize GPU-enabled instances for graphics-intensive workloads
- Employ high-memory instances for data-heavy operations
- Leverage instances with fast storage for I/O-bound tasks
3. Adaptive Resource Management:
- Elastically adjust computing power based on real-time needs
- Optimize resource allocation and reduce operational costs
- Implement automatic scaling to match workload fluctuations
4. Tailored Computing Environments:
- Deploy custom software configurations for specific projects
- Implement bespoke job scheduling policies
- Create segregated resource pools for distinct teams or tasks
5. Economical Large-Scale Processing:
- Harness cost-effective spot instances for flexible workloads
- Streamline spot instance management for uninterrupted processing
- Maximize cost efficiency in extensive data analysis tasks
6. Seamless On-Premises and Cloud Integration:
- Extend local resources to the cloud during peak demand periods
- Balance data proximity with cloud computing advantages
- Facilitate phased transition of workloads to cloud infrastructure
By synergizing HTCondor's robust workload management with AWS's versatile cloud platform, organizations can adeptly address diverse high-throughput computing challenges, achieving optimal performance and cost-effectiveness. Common use cases for HTCondor in AWS include running large-scale scientific simulations; data analysis in fields such as genomics, physics, and climate modeling; and financial modeling, market simulation, and risk analysis.
While HTCondor offers significant benefits for high-throughput computing, its implementation in an AWS environment can present a complex landscape for many customers. The multi-component architecture of HTCondor, comprising Access Points, Execution Points, and Annexes, requires careful setup and coordination. Integrating with AWS necessitates a basic understanding of various AWS services and their intricate configurations. Security configuration is another critical aspect, demanding proper setup of IAM roles, security groups, and HTCondor's native security features to ensure a robust and protected environment. Furthermore, the ongoing maintenance of the HTCondor pool and associated AWS resources adds another layer of complexity, demanding continuous attention and expertise. These challenges underscore the need for specialized knowledge and careful planning when implementing HTCondor in an AWS environment.
Setting Up HTCondor in AWS
Sample Blueprints are created to demonstrate how to configure HTCondor in AWS and are available here: https://github.com/aws-samples/HTCondorBaseline
The two CloudFormation templates provided are intended to create the following resources:
Access Point: This is similar to a "master" node in a cluster architecture. The Access Point is responsible for managing the overall state of the system and distributing workloads.
Execution Point: This is similar to a "worker" node in a cluster architecture. The Execution Points are responsible for actually performing the jobs and tasks that are assigned by the Access Point.
This setup follows a common pattern in distributed systems, where there is a central control plane (the Access Point) that manages the overall state and coordinates the work, while the individual worker nodes (the Execution Points) focus on executing the actual tasks. This division of responsibilities helps to scale the system and improve overall efficiency and reliability.
Prerequisites:
Configuring the Instance Metadata Service (IMDS) Defaults
When working with Amazon EC2 instances, it's important to properly configure the Instance Metadata Service (IMDS) to control access to each instance's metadata. This metadata can contain sensitive information, so it's crucial to ensure the appropriate level of security. EC2 instances launched by HTCondor Annex retrieve metadata such as the hostname and IP address from the IMDS integrated within the instance.
In the AWS Management Console, you can configure the IMDS defaults for your EC2 instances. Let's walk through the steps:
- Navigate to the EC2 service.
- In the left-hand menu, under Settings, select "Data protection and security".
- Under the "IMDS Defaults" section, you'll see the following options:
- Instance Metadata - Ensure this is set to "Enabled"
- Metadata version - Set this to "V1 and V2 (token optional)"
- Access to tags in metadata - Set this to "Enabled"
- Note: These settings are regional, so you'll need to repeat these steps in each AWS region you use to keep your IMDS defaults consistent across your infrastructure. Properly configuring the IMDS defaults is a crucial step in maintaining the security and integrity of your EC2 instances, helping protect your environment from unauthorized access to sensitive instance metadata.
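If you'd rather script this configuration, the same regional defaults can be set with the AWS CLI. The following is a minimal sketch, assuming a CLI version recent enough to include the modify-instance-metadata-defaults command; repeat it per region by changing --region (us-east-1 here is just a placeholder):

# Set the account-level IMDS defaults for one region
aws ec2 modify-instance-metadata-defaults \
  --region us-east-1 \
  --http-tokens optional \
  --instance-metadata-tags enabled

# Verify the defaults took effect
aws ec2 get-instance-metadata-defaults --region us-east-1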
Creating a Keypair and Launching the Access Point Stack
To get started, you'll need to use an existing EC2 Keypair or create a new one in the AWS Management Console. This will allow you to securely connect to your EC2 instances.
- Navigate to the EC2 service.
- In the left-hand menu, click on "Key Pairs".
- Click the "Create key pair" button.
- Provide a name for your key pair, and choose the file format (e.g., .pem for Linux/macOS, .ppk for Windows).
- Click "Create key pair" to download the private key file to your local machine.
With the keypair created, you can now launch the "AP-Stack" CloudFormation stack.
- Navigate to the CloudFormation service.
- Click "Create stack" and choose "With new resources (standard)".
- Select "Template is ready" and choose "Upload a template file".
- Click "Choose file" and select the "HTCondorAP.yml" template file.
- Click "Next".
In the "Specify stack details" section, you'll need to provide the following parameters:
- AMI ID: You can get the AMI ID by navigating to the EC2 service, clicking "Launch Instance", and then scrolling to the "AMI" section on the right-hand side. Copy the AMI ID, but don't actually launch the instance. To ensure clarity and consistency, we recommend using the "Amazon Linux 2023 AMI" for the examples in this post. This default AMI was utilized during the preparation of this content.
- HTCondor Password: Type a password of your choice for the HTCondor service.
- VPC CIDR: This should already be prefilled, so you don't need to change it.
- Subnet CIDR: This should also be prefilled, so you don't need to change it.
- Your IP: Provide your IP address in the format of "x.x.x.x/32". The "/32" is important to ensure that only your IP address has access. This enables your device to connect to the Access Point.
Review the other parameters, and if everything looks correct, click "Next" to proceed.
On the next page, review the stack details and click "Create stack" to begin the deployment.
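If you prefer to script the deployment, the AWS CLI can create the stack as well. A minimal sketch: the parameter keys shown (AmiId, YourIp) are illustrative names, so check the Parameters section of HTCondorAP.yml for the actual keys, and the IAM capability flag assumes the template creates named IAM resources:

aws cloudformation create-stack \
  --stack-name AP-Stack \
  --template-body file://HTCondorAP.yml \
  --parameters ParameterKey=AmiId,ParameterValue=ami-0123456789abcdef0 \
               ParameterKey=YourIp,ParameterValue=203.0.113.10/32 \
  --capabilities CAPABILITY_NAMED_IAM

# Block until the deployment finishes
aws cloudformation wait stack-create-complete --stack-name AP-Stack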
Launching the Execution Point (EP) Stack
Now that the Access Point (AP) stack has been deployed, you can proceed to launch the Execution Point (EP) stack.
- Navigate to the CloudFormation service.
- Click "Create stack" and choose "With new resources (standard)".
- Select "Template is ready" and choose "Upload a template file".
- Click "Choose file" and select the "EP-Stack" template file.
- Click "Next".
In the "Specify stack details" section, you'll need to provide the following parameters:
- AMI ID: Use the same AMI ID that you used for the AP stack.
- VPC: Use the same VPC that you used for the AP stack.
- Subnet: Use the same subnet that you used for the AP stack.
- Key Pair: Use the same key pair that you used for the AP stack.
- Your IP: Provide your IP address in the format of "x.x.x.x/32". This should be the same as what you used for the AP stack. For troubleshooting purposes, we recommend enabling SSH access to the execution nodes. This approach provides a valuable tool for diagnosing and resolving issues directly on the nodes when necessary.
- HTCondor Password: Use the same password that you used for the AP stack.
- Custom Private DNS: You can find this value in the output of the AP stack.
- Source Stack Name: This is the name you gave to the AP stack.
Review the other parameters, and if everything looks correct, click "Next" to proceed.
On the next page, review the stack details and click "Create stack" to begin the deployment. After the EP stack has been deployed, you'll need to copy the ID of the image that was created. You can find this by navigating to the EC2 service, clicking on "Images", and locating the newly created image. This image will be used later for the Annex stack, so make sure to keep track of the image ID.
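If you prefer the CLI, here is a sketch for locating the newly created image; the query simply grabs the most recently created AMI owned by your account, so confirm the name matches what the EP stack produced:

# List the most recently created AMI owned by this account
aws ec2 describe-images --owners self \
  --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name,CreationDate]' \
  --output table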
Setting up the Annex
After successfully deploying the Access Point (AP) and Execution Point (EP) stacks, the next step is to configure the HTCondor Annex. This configuration establishes an execution pool for your HTCondor cluster, integrating the resources you've just set up. The Annex functionality allows you to add or remove worker nodes (Execution Points) to adjust your cluster's capacity based on workload requirements. This adaptability helps optimize resource utilization, enabling you to match your computational resources more closely to your current needs.
Log in to the AP instance and navigate to the condor directory:
cd condor
Run the condor.sh script and start the condor_master:
. ~/condor/condor.sh
condor_master
Note: The first command sets up the environment variables and configuration that the HTCondor software needs for the subsequent steps; condor_master then starts the HTCondor master daemon, which is responsible for managing the overall state of the HTCondor cluster. You'll need to rerun these commands whenever you stop and start the EC2 instances. For production environments, it is recommended to configure condor_master to launch automatically as a system service (one illustrative unit file is sketched below), so that HTCondor starts reliably on boot and your cluster remains available.
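One illustrative way to do that with systemd, assuming the tarball installation under /home/ec2-user/condor that these stacks use (the paths and user name are assumptions; adjust them to match your AMI). Create /etc/systemd/system/condor.service with:

[Unit]
Description=HTCondor master daemon
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=ec2-user
# Source the HTCondor environment, then run the master in the foreground
ExecStart=/bin/bash -c 'source /home/ec2-user/condor/condor.sh && exec condor_master -f'
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then enable it with:

sudo systemctl daemon-reload
sudo systemctl enable --now condor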
Set up the Annex resources:
condor_annex -aws-region <your-region> -setup
Replace <your-region> with the AWS region you're working in.
- Update the "Check Connectivity" Lambda function timeout:
- Navigate to the Lambda service in the AWS Management Console.
- Locate the "Check Connectivity" function and go to its "Configuration" tab.
- Under "General configuration", update the "Timeout" to 1 minute.
- Add the necessary permissions to the Lambda function's role:
- Still in the Lambda service, go to the "Configuration" tab of the "Check Connectivity" function.
- Click on the role name under "Execution role".
- In the IAM console, click "Add permissions" and choose "Create inline policy".
In the JSON editor, paste the policy required by the function.
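The exact policy isn't reproduced in this article. Since the next step attaches the functions to your VPC, the execution role will at minimum need the permissions Lambda uses to manage network interfaces in a VPC; an illustrative policy along those lines (an assumption, not necessarily the blueprint's exact policy) is:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DeleteNetworkInterface"
      ],
      "Resource": "*"
    }
  ]
}

The AWS-managed policy AWSLambdaVPCAccessExecutionRole grants an equivalent set of permissions if you prefer to attach that instead.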
Name the policy as desired and click "Create policy".
- Associate the VPC, subnet, and security group with each of the three Lambda functions (CheckConnectivity, sfrLeaseFunction, odiLeaseFunction):
- Navigate to the individual Lambda functions and go to their "Configuration" tabs.
- Under the "VPC" section, select the appropriate VPC, subnet, and security group which was created from AP Stack.
Verify the Annex setup:
condor_annex -check-setup
This command will check if the Annex setup looks okay, and you should see a "Setup looks ok" response. Now that the Annex is set up, you can proceed to configure it and test it.
Testing the Annex
Now that the Annex is set up, you can proceed to test it by running some sample jobs.
Create the annex.json file:
1. In the condor directory, create a new file named annex.json.
2. In this file, you'll need to configure the Annex settings, including the IAM Fleet Role and Instance Profile.
3. Replace the "arn:xxxxxxxxxxxx" placeholders with the appropriate values:
- IAM Fleet Role: Use the value from the output of the AP stack.
- Instance Profile: Also use the value from the output of the AP stack.
4. The rest of the configuration, such as TargetCapacity, SpotPrice, and LaunchSpecifications, can be set as shown in the provided example (an illustrative skeleton follows this list).
5. The UserData value in the LaunchSpecifications section is a base64-encoded version of a startup script (an illustrative plain-text version also follows this list).
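Since the referenced example file isn't reproduced in this article, the skeleton below illustrates the general shape of an EC2 Spot Fleet configuration accepted by condor_annex. Treat it as a sketch: every value shown (capacity, price, instance type, key name, AMI ID, and the ARN placeholders) must be replaced with values from your own stacks:

{
  "TargetCapacity": 3,
  "SpotPrice": "0.05",
  "AllocationStrategy": "lowestPrice",
  "IamFleetRole": "arn:xxxxxxxxxxxx",
  "LaunchSpecifications": [
    {
      "ImageId": "ami-0123456789abcdef0",
      "InstanceType": "m5.large",
      "KeyName": "HTCondorKey",
      "IamInstanceProfile": { "Arn": "arn:xxxxxxxxxxxx" },
      "UserData": "<base64-encoded startup script>"
    }
  ]
}

And an illustrative plain-text version of the startup script that gets base64-encoded into UserData. The exact contents depend on how the EP AMI was built, so treat this as an assumption; its job is simply to point the node at the Access Point and start HTCondor:

#!/bin/bash
# Illustrative startup script: join this Execution Point to the pool
echo "CONDOR_HOST = <AP-private-DNS>" >> /etc/condor/config.d/99-annex.conf
systemctl start condor

You can produce the encoded form with base64 -w 0 <script-file>.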
Create the test scripts:
In the condor directory, create a new file named sleep.sh, along with a submit description file named sleep.sub that you'll submit in the next step. Minimal illustrative versions of both follow; the files shipped with the blueprint may differ.
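A trivial sleep.sh test job (make it executable with chmod +x sleep.sh):

#!/bin/bash
# sleep.sh - trivial test job: sleep for the requested number of seconds
sleep ${1:-60}
echo "Done sleeping on $(hostname)"

And a matching sleep.sub submit description file. The +MayUseAWS attribute lets jobs explicitly opt in to running on annex instances, which the default annex policy requires:

# sleep.sub - submit description for the test jobs
executable   = sleep.sh
arguments    = 60
output       = sleep.$(Process).out
error        = sleep.$(Process).err
log          = sleep.log
request_cpus = 1
+MayUseAWS   = TRUE
queue 3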
Run the Annex and submit the test jobs:
For a comprehensive understanding of condor_annex commands and their usage, please refer to the official HTCondor documentation at https://htcondor.readthedocs.io/en/latest/man-pages/condor_annex.html. This resource provides detailed information on command syntax, options, and best practices for managing your HTCondor Annex.
In the condor directory, run the following command to create the Annex:
condor_annex -annex-name MyAnnex -aws-spot-fleet-config-file /home/********/condor/annex.json -slots 3
Run the following command to submit the test jobs:
condor_submit sleep.sub
Check the job status:
Run the following command to check the status of the submitted jobs:
condor_q
If you see some of the jobs running, then the Annex is working properly.
By following these steps, you've created an annex.json file with the necessary configuration, set up test scripts, and submitted jobs to the Annex. You can now observe the job execution and verify that the Annex is functioning as expected.
Cleanup:
To avoid ongoing costs, it's crucial to clean up all resources when you've finished using them. Follow these steps for a thorough cleanup:
- Delete the Execution Point (EP) Stack.
- Deregister the Amazon Machine Image (AMI) associated with your project.
- Delete the snapshot that was used to create the AMI.
Remember, simply deleting the EP Stack is not sufficient; you must also manually deregister the AMI and delete its associated snapshot to ensure all resources are properly removed and to prevent any unexpected charges.
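A sketch of that manual cleanup with the AWS CLI (the IDs shown are placeholders; look up the snapshot ID from the AMI's block device mappings before deregistering):

# Find the snapshot that backs the AMI
aws ec2 describe-images --image-ids ami-0123456789abcdef0 \
  --query 'Images[0].BlockDeviceMappings[0].Ebs.SnapshotId'

# Deregister the AMI, then delete its snapshot
aws ec2 deregister-image --image-id ami-0123456789abcdef0
aws ec2 delete-snapshot --snapshot-id snap-0123456789abcdef0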
Compute Best Practices for Cost-Effective HTCondor Usage in AWS
1. Leverage Spot Instances
AWS provides multiple EC2 (Elastic Compute Cloud) instance purchasing options, enabling customers to tailor their expenses to their specific requirements and usage trends. One such option, Spot Instances, allows users to harness idle EC2 capacity at substantial discounts of up to 90% off On-Demand rates. This approach enables customers to leverage AWS's vast infrastructure for running large-scale workloads cost-effectively. However, these savings come with a caveat: AWS reserves the right to reclaim Spot Instances when EC2 capacity is needed elsewhere, providing a two-minute warning to allow for graceful shutdown of running workloads. Spot Instances are particularly well-suited for short-running, resilient, or stateless applications. Notably, HTCondor offers native support for Spot Instances through its condor_annex feature, facilitating smooth integration and enhanced cost efficiency. Learn more about Spot best practices here.
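For example, a node-level watchdog on each Execution Point can poll the instance metadata service for the interruption notice and drain the node during the two-minute window. A minimal sketch, assuming HTCondor's tools are on the PATH (the drain step is illustrative, not part of the blueprint):

#!/bin/bash
# Poll IMDS for a Spot interruption notice; the endpoint returns 200 only
# after AWS has issued the two-minute reclamation warning.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
while true; do
  CODE=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/spot/instance-action)
  if [ "$CODE" = "200" ]; then
    # Roughly two minutes remain: stop matching new jobs and drain this node
    condor_drain -graceful "$(hostname)"
    break
  fi
  sleep 5
done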
2. Optimize Instance Types
Choose the right instance types for your workloads. For example, compute-intensive jobs may benefit from instances with high CPU performance, while memory-intensive jobs may require instances with more RAM. AWS offers a variety of instance types to match different requirements, and HTCondor can be configured to utilize these optimally. AWS Graviton processors offer competitive performance compared to x86-based instances, often with better price-performance ratios. Learn more about AWS Graviton processors here.
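The matching itself is driven by the resource requests in your submit files; HTCondor only places a job on an Execution Point whose resources satisfy them, so sizing these accurately is what lets you choose cheaper, better-fitting instance types. A brief illustrative excerpt (the values are placeholders):

# Submit-file excerpt: declare what the job actually needs
request_cpus   = 8
request_memory = 32 GB
request_disk   = 100 GB
# For GPU workloads running on GPU-enabled instance types:
# request_gpus = 1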
3. Use Reserved Instances and Savings Plans
When dealing with predictable workloads, it's worth exploring Reserved Instances as a cost-effective option. These instances provide substantial discounts compared to on-demand pricing, provided you commit to using a specific instance type in a particular region for either one or three years. Similarly, Savings Plans offer significant cost reductions akin to EC2 Reserved Instances, requiring a commitment to use a predetermined amount of computing power (measured in dollars per hour) over a one or three-year period. It's important to note that while Savings Plans offer financial benefits, they don't include capacity reservations. However, you can still secure capacity through On-Demand Capacity Reservations and apply Savings Plans to reduce costs further. To maximize cost efficiency, many organizations implement a strategic mix of these purchasing options, allowing them to optimize savings while fully leveraging AWS's compute choices. Learn more about Amazon EC2 Billing and purchasing options here.
4. Monitor and continuously adjust the compute selection
It's crucial to consistently monitor the efficiency and expenses of your HTCondor pool. AWS offers various monitoring solutions, such as AWS Cost Explorer and Amazon CloudWatch, which enable you to track resource usage and optimize allocation. Use the insights gained from these tools to fine-tune your configurations, ensuring cost-effectiveness. Another valuable resource is AWS Trusted Advisor, an automated system that offers recommendations on best practices across Amazon services, with cost optimization being one of its five focus areas. Additionally, consider leveraging AWS Compute Optimizer, a tool designed to help you select the most suitable Amazon EC2 instances for optimal efficiency and performance. This service employs machine learning algorithms to analyze your historical resource consumption patterns. By examining this data, Compute Optimizer suggests ways to enhance compute efficiency without incurring additional costs. It provides a clear picture of whether your resources are over-provisioned, under-provisioned, or optimally configured, allowing you to make informed decisions about your AWS infrastructure.
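Both services can also be queried from the CLI; here is a sketch that pulls a month of EC2 spend grouped by instance type, plus the current rightsizing recommendations (the dates are placeholders, and Cost Explorer and Compute Optimizer must be enabled in the account):

# Last month's Amazon EC2 cost, grouped by instance type
aws ce get-cost-and-usage \
  --time-period Start=2024-11-01,End=2024-12-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --filter '{"Dimensions":{"Key":"SERVICE","Values":["Amazon Elastic Compute Cloud - Compute"]}}' \
  --group-by Type=DIMENSION,Key=INSTANCE_TYPE

# Rightsizing recommendations for running EC2 instances
aws compute-optimizer get-ec2-instance-recommendations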
Conclusion
Harnessing the power of HTCondor in AWS presents a transformative opportunity for organizations seeking to elevate their high-throughput computing capabilities. By combining HTCondor's robust workload management features with AWS's scalable infrastructure, businesses can create a dynamic, cost-effective, and highly efficient HTC environment.
While the setup process can be complex, the long-term benefits far outweigh the initial challenges. By following best practices and utilizing pre-configured templates, organizations can accelerate their implementation and quickly reap the rewards of this powerful combination.
This article is contributed by
Sudhi Bhat, Principal Specialist Solution Architect, Compute
Bright Dike, Solutions Architect
Chris Marshall, Sr Solutions Architect Manager
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.