Fine-tuning RoBERTa-base model for Semantic Similarity with Amazon SageMaker AI

This blog demonstrates how to fine-tune the HuggingFace RoBERTa-base model on the STS-B semantic similarity task using Amazon SageMaker AI.

Doron Bleiberg
Amazon Employee
Published Jun 9, 2025
Last Modified Jun 13, 2025
Large Language Model evaluation has become increasingly important for organizations deploying AI systems at scale. While API-based LLM services offer convenience and lower operational overhead, specialized evaluation tasks often require more targeted approaches that can deliver measurable, quantitative results. This notebook demonstrates how to fine-tune FacebookAI/roberta-base using Amazon SageMaker AI for the Semantic Textual Similarity Benchmark (STS-B) task, creating a purpose-built evaluation model that excels in specific domains.
The fine-tuned RoBERTa approach offers distinct advantages for organizations with specialized evaluation needs:
  1. Customer-specific learning: Fine-tuning enables the model to learn patterns and relationships specific to your domain, customer base, or organizational context
  2. Measurable regression outputs: Unlike general-purpose language models, this approach produces quantitative similarity scores on a continuous scale, providing precise metrics for evaluation workflows
  3. Automated evaluation capabilities: The model becomes ideal for automated evaluation of short sentence pairs in tasks such as response quality assessment, model comparison, and content consistency verification.
Amazon SageMaker AI streamlines the fine-tuning process with optimized training infrastructure, automated model deployment, and integrated monitoring capabilities, making it straightforward to develop and deploy specialized evaluation models for production use.

Semantic Textual Similarity Benchmark (STS-B)

STS-B is a machine learning task where, given two sentences, you predict a similarity score. This benchmark measures how closely two pieces of text match in meaning on a continuous scale from 0 to 5, where 0 indicates completely unrelated sentences and 5 represents perfect semantic equivalence.
While this notebook uses STS-B for demonstration, customers can easily bring their own custom datasets following the same format structure. STS-B serves as an excellent starting point because it provides a well-established baseline for semantic similarity tasks within the GLUE benchmark suite.
Expected Dataset Format: For customers using their own data, the required format is straightforward and consists of three columns:
  • sentence1: First text in the pair (string)
  • sentence2: Second text in the pair (string)
  • score: Similarity score (float)
Example format:
sentence1,sentence2,score
"A man is playing a large flute","A man is playing a flute",3.8
"The cat sits outside","The dog plays in the garden",2.1

Let's start

Imports and IAM role for SageMaker
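The original setup cell is not reproduced here, but a minimal sketch of the imports and IAM role retrieval it assumes looks like the following (run inside a SageMaker notebook or Studio environment where an execution role is attached):

import sagemaker

# SageMaker session and the IAM execution role used by the training job
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Default S3 bucket used for training artifacts
bucket = sess.default_bucket()
print(role, bucket)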

AWS and HuggingFace collaboration

The initial training code stub was taken from the Train button on the FacebookAI/roberta-base model page. HuggingFace and Amazon SageMaker AI collaborate to streamline model training and deployment. Within the HuggingFace Model Hub, users can access ready-to-use code for both training and deploying models on SageMaker directly from each model's webpage, making it quick and easy to move from model selection to running code.

Hyperparameters Configuration for RoBERTa Fine-tuning

model_name_or_path: Specifies the pre-trained model identifier from HuggingFace Hub or a local path to model files. In our case, 'FacebookAI/roberta-base' loads the base RoBERTa model with 125 million parameters, providing a solid foundation for semantic similarity tasks.
output_dir: Defines where model artifacts are saved within the training container. The path /opt/ml/model is SageMaker's designated directory for model outputs, ensuring proper artifact collection after training completion. After your training script saves model files to /opt/ml/model, SageMaker automatically uploads these artifacts to your Amazon S3 bucket as a compressed archive.
task_name: Identifies the specific GLUE benchmark task. Setting this to 'stsb' tells the training script to load the Semantic Textual Similarity Benchmark dataset and configure the appropriate evaluation metrics.
do_train: Boolean flag that explicitly enables training mode. Without this parameter set to True, the script may only run evaluation or exit early, which is a common cause of empty model outputs.
max_train_samples: Limits the number of training samples to use, ideal for quick demos or debugging. Setting this to 50 dramatically reduces training time from hours to minutes while still demonstrating the fine-tuning process.
Note: For simplicity and readability, several important hyperparameters such as learning_rate and num_train_epochs are left at their default values rather than being specified explicitly.
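Put together, the hyperparameters described above can be collected in a simple dictionary that is later passed to the estimator; this is a sketch rather than the notebook's exact cell:

hyperparameters = {
    "model_name_or_path": "FacebookAI/roberta-base",  # pre-trained model from the HuggingFace Hub
    "output_dir": "/opt/ml/model",                    # SageMaker's designated model output directory
    "task_name": "stsb",                              # GLUE task: Semantic Textual Similarity Benchmark
    "do_train": True,                                 # explicitly enable training mode
    "max_train_samples": 50,                          # small sample count for a quick demo
    # learning_rate and num_train_epochs are intentionally left at their defaults
}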

Default Training Script Usage

We use the default run_glue.py script from the Hugging Face Transformers repository. The script is located at https://github.com/huggingface/transformers/tree/v4.49.0/examples/pytorch/text-classification and provides a comprehensive implementation for GLUE benchmark tasks.
The default script is sufficient for our use case because it:
  • Handles standard GLUE tasks: Pre-configured for all nine GLUE benchmark tasks including STS-B
  • Includes proper data loading: Automatically downloads and preprocesses the specified dataset
  • Implements best practices: Uses Hugging Face Trainer API with optimized training loops
  • Supports key features: Mixed precision, distributed training, and proper model saving
Amazon SageMaker AI provides exceptional flexibility for customers through its "Bring Your Own Script" (BYOS) and "Bring Your Own Container" (BYOC) capabilities. This flexibility allows customers to implement exactly what they wish while leveraging all the benefits of SageMaker's managed infrastructure, including automated scaling, distributed training, and seamless deployment.

When Custom Scripts Are Necessary

You should consider writing a custom training script when:
  • Custom loss functions: If your task requires specialized loss calculations beyond standard classification or regression
  • Non-standard data formats: When your data doesn't fit the expected GLUE format or requires custom preprocessing
  • Advanced training techniques: Implementing custom optimization strategies, learning rate schedules, or training procedures
  • Integration requirements: When you need to integrate with custom logging, monitoring, or external systems
  • Multi-task learning: Training on multiple tasks simultaneously or with custom task weighting
In our case, the default script covers most standard fine-tuning scenarios effectively, making custom scripts unnecessary for typical transfer learning applications. For our semantic similarity demonstration, the pre-built script handles all necessary components: data loading, model configuration, training loops, and artifact saving.

Understanding SageMaker Estimators

SageMaker estimators are high-level interfaces that handle end-to-end Amazon SageMaker AI training and deployment tasks. An estimator is a SageMaker Python SDK object that manages the configuration and execution of your training job, allowing you to run training workloads on ephemeral compute instances and obtain a packaged trained model. The estimator launches the SageMaker-managed environment using a pre-built Docker container (or your own container, as mentioned above) and runs the training script that you provide through the entry_point argument.
The HuggingFace estimator simplifies direct model training from HuggingFace by providing several benefits:
  • Pre-configured Environment: The managed HuggingFace environment is an Amazon-built Docker container that executes the supplied entry point Python script within a SageMaker Training Job.
  • Framework Integration: Direct integration with HuggingFace Transformers library and associated dependencies.
  • Version Management: Explicit control over transformers_version, pytorch_version, and py_version parameters.

Understanding our estimator configuration

entry_point: Path to the Python source file that should be executed as the entry point to training.
source_dir: Path to a directory with any other training source code dependencies aside from the entry point file.
transformers_version: Transformers version you want to use for executing your model training code.
pytorch_version: PyTorch version you want to use for executing your model training code.
py_version: Python version you want to use for executing your model training code.
When you specify framework parameters like transformers_version='4.49.0', pytorch_version='2.5.1', and py_version='py311' in your HuggingFace estimator, SageMaker uses these parameters to automatically resolve the appropriate Docker container image URI.
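A minimal sketch of the resulting estimator is shown below. The instance type and git_config values are illustrative assumptions (the framework versions are the ones discussed above), and role and hyperparameters come from the earlier setup sketches.

from sagemaker.huggingface import HuggingFace

# Pull run_glue.py and its dependencies from the Transformers repository at the matching tag
git_config = {
    "repo": "https://github.com/huggingface/transformers.git",
    "branch": "v4.49.0",
}

huggingface_estimator = HuggingFace(
    entry_point="run_glue.py",
    source_dir="./examples/pytorch/text-classification",
    git_config=git_config,
    instance_type="ml.g5.2xlarge",   # assumed GPU instance type
    instance_count=1,
    role=role,
    transformers_version="4.49.0",
    pytorch_version="2.5.1",
    py_version="py311",
    hyperparameters=hyperparameters,
)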
You can also create a PyTorch Framework-specific estimator instead. The PyTorch estimator executes a PyTorch script in a managed PyTorch execution environment. This approach provides more flexibility when you need newer versions of libraries that might not be available in the HuggingFace estimator.
Use PyTorch Estimator when:
  • You need newer versions of the transformers library that are not yet supported by the HuggingFace estimator.
  • You want to use newer PyTorch versions than those available in the HuggingFace estimator's prebuilt containers.
  • You need custom dependencies through requirements.txt
Use HuggingFace Estimator when:
  • You're working with standard Hugging Face workflows and supported versions.
  • You want simplified configuration for transformer model training.
  • The pre-configured environment meets your requirements.
Both estimators support SageMaker's "Bring Your Own Script" (BYOS) capability. Script mode in SageMaker allows you to take control of the training and inference process without having to create and maintain your own Docker containers. This flexibility enables you to write custom training and inference code while still utilizing common ML framework containers maintained by AWS.
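For comparison, a hedged sketch of the equivalent PyTorch estimator is shown below; the framework_version is an assumption, and any extra libraries (such as transformers) would be supplied through a requirements.txt in the source directory.

from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(
    entry_point="run_glue.py",
    source_dir="./examples/pytorch/text-classification",
    git_config=git_config,            # same repository configuration as above
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    role=role,
    framework_version="2.5.1",        # assumed PyTorch container version
    py_version="py311",
    hyperparameters=hyperparameters,
)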
Next step will be to run the training job.

SageMaker Training Job Lifecycle: What Happens After huggingface_estimator.fit()

When you call an estimator fit() method, SageMaker initiates a comprehensive training job lifecycle that involves multiple phases of infrastructure provisioning, data preparation, and model training.

Phase 1: Infrastructure Provisioning and Instance Launch

EC2 Instance Spinning Up: SageMaker provisions the specified EC2 instances (selected by the instance_type parameter) from the available capacity pool.
Container Image Retrieval: SageMaker pulls the appropriate Docker container image from Amazon ECR based on your framework specifications. The container structure follows SageMaker's standardized /opt/ml directory layout:
  • /opt/ml/input/ - Contains configuration files and data channels
  • /opt/ml/model/ - Where your training script saves model artifacts
  • /opt/ml/code/ - Contains your training scripts and dependencies

Phase 2: Data and Code Preparation

Dataset Loading and Mounting: If you specify input data through the fit() method's inputs parameter, SageMaker downloads data from S3 to the ML storage volumes during the Downloading phase. For GLUE tasks like STS-B, the HuggingFace run_glue.py script automatically downloads the dataset, so this phase may be brief.
Training Script and Dependencies Setup: SageMaker copies your training script (run_glue.py) and source directory (./examples/pytorch/text-classification) to the /opt/ml/code/ directory within the container. When using git_config, SageMaker clones the specified repository branch and extracts the required files.
Your hyperparameters are written to /opt/ml/input/config/hyperparameters.json and passed to your training script as command-line arguments (script mode also exposes them through SageMaker environment variables).
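For illustration, this is roughly what the training container sees at runtime (note that SageMaker serializes hyperparameter values as strings in this file); the snippet below would run inside the container, not in the notebook.

import json

# Hyperparameters as written by SageMaker inside the training container
with open("/opt/ml/input/config/hyperparameters.json") as f:
    hps = json.load(f)

print(hps)  # e.g. {"task_name": "stsb", "do_train": "True", "max_train_samples": "50", ...}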

Phase 3: Model Loading and Training Execution

Pre-trained Model Download: The training script downloads the specified model (FacebookAI/roberta-base) from the HuggingFace Hub. This happens within your training script execution and is part of the actual training time. SageMaker also provides the capability to load models directly from S3 instead of downloading them from external sources like the HuggingFace Hub.
Training Script Invocation: SageMaker starts your training script by executing the entry point (e.g. run_glue.py) with the configured hyperparameters.
Model Artifact Generation: Your training script saves the fine-tuned model to /opt/ml/model/ as specified by the output_dir hyperparameter. This includes model weights, configuration files, and tokenizer components.

Phase 4: Cleanup and Artifact Upload

Model Artifact Compression and Upload: After training completes, SageMaker compresses all contents of /opt/ml/model/ into a model.tar.gz file and uploads it to your specified S3 location.
Instance Termination: SageMaker terminates the training instances and releases all associated resources. SageMaker only bills for the actual training time, not the automated preparation phases. You are charged based on the BillableTimeInSeconds value, which represents the time interval between TrainingStartTime and TrainingEndTime. This means:
  • Not billed: Instance provisioning, container image pulling, data downloading, and artifact uploading.
  • Billed: Only the time from when training actually begins until it completes.
This billing model provides significant cost advantages because the infrastructure preparation overhead (which can take several minutes) doesn't contribute to your charges.
After this long explanation, let's run the training job.
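The call itself is a single line; fit() blocks and streams the container logs to the notebook until the job completes.

# Launch the managed training job configured above
huggingface_estimator.fit()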
That's it! We are done.
With the training process successfully finished, all model artifacts - including the fine-tuned RoBERTa weights - have been securely stored in Amazon S3. These artifacts are now available for inference deployment across a variety of platforms and services. You are not restricted to using SageMaker; you may deploy the model in any environment that suits your workflow.
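If you stay within the SageMaker SDK, the S3 location of the packed model.tar.gz is available directly from the estimator:

# S3 URI of the compressed model artifacts produced by the training job
model_artifacts_s3_uri = huggingface_estimator.model_data
print(model_artifacts_s3_uri)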

Training Complete: Model Deployment and Management

Next Steps

While Amazon SageMaker offers a wide variety of inference options - each deserving its own dedicated discussion - the following steps focus on preparing your fine-tuned model for deployment.
After model fine-tuning, you have the option to register your model as a SageMaker Model Package. This package acts as a container for your trained model, housing all necessary artifacts, inference code, and metadata, and is designed for seamless deployment within SageMaker. For advanced model management, you may also consider using SageMaker Model Package Groups (optional).

SageMaker Model Package Groups for Model Management (Optional)

SageMaker Model Package Groups provide a centralized approach to model lifecycle management. A Model Package Group serves as a versioned container for models that share the same ML purpose, enabling systematic organization and governance of your model iterations.
Key Benefits
  • Model Versioning: Automatically track model iterations (v1, v2, etc.) with complete lineage and metadata
  • Organizational Structure: Group related models by business purpose or use case for improved discoverability
  • Governance Controls: Implement review and approval workflows before model promotion to production
  • Rollback Capabilities: Quickly revert to previous model versions when issues arise
  • CI/CD Integration: Seamlessly integrate with automated deployment pipelines for streamlined MLOps workflows
Model Package Groups streamline the transition from training to production by providing enterprise-grade model management capabilities that support both manual and automated deployment strategies.
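A hedged sketch of this optional registration flow is shown below: create a Model Package Group once, then register the trained model as a new version in it. The group name, instance types, and inference container versions are illustrative assumptions.

import boto3
from sagemaker.huggingface import HuggingFaceModel

# Create the Model Package Group (one-time setup; name is a hypothetical example)
sm_client = boto3.client("sagemaker")
sm_client.create_model_package_group(
    ModelPackageGroupName="roberta-stsb-similarity",
    ModelPackageGroupDescription="Fine-tuned RoBERTa models for STS-B similarity scoring",
)

# Wrap the trained artifacts in a model object (container versions are assumptions)
model = HuggingFaceModel(
    model_data=huggingface_estimator.model_data,
    role=role,
    transformers_version="4.49.0",
    pytorch_version="2.5.1",
    py_version="py311",
)

# Register the model as a new version in the group, pending manual approval
model.register(
    model_package_group_name="roberta-stsb-similarity",
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    approval_status="PendingManualApproval",
)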

Conclusion

Fine-tuning a HuggingFace RoBERTa-base model for semantic textual similarity using Amazon SageMaker AI offers a robust, scalable, and cost-effective approach to building specialized evaluation systems. By leveraging SageMaker’s managed infrastructure, you can quickly set up, train, and deploy models tailored to your specific needs - whether for LLM response evaluation, content consistency checks, or automated quality assurance.
SageMaker streamlines the entire ML workflow - from training with optimized frameworks and model versioning to seamless inference deployment. With features like Model Package Groups, you gain powerful model governance, traceability, and automation capabilities that simplify productionizing your models and integrating them into your MLOps pipelines.
Whether you’re just getting started with machine learning or scaling up your existing workflows, SageMaker empowers you to focus on innovation rather than infrastructure. Its flexibility - supporting everything from pre-built scripts to custom containers - ensures you can adapt to any use case or business requirement.
Ready to take your model deployment to the next level? Try fine-tuning RoBERTa-base on your own data with SageMaker and experience the benefits of managed ML infrastructure firsthand. Get started today by cloning the provided notebook, exploring different datasets, or integrating your models into your production systems. If you have questions or want to share your experience, join the AWS ML community or leave a comment below!
Happy training - and may your models always converge! 🚀

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
