Securing Generative AI Applications Against Adversarial Attacks - Part 1
This blog post provides an overview of adversarial attacks targeting generative AI applications powered by large language models (LLMs), including input-based, model-based, data-based, system-based, and operational attacks. It also discusses defense mechanisms to secure LLM systems against these threats, such as input validation, differential privacy, and secure integration practices.
Tony Trinh
Amazon Employee
Published Jul 24, 2024
Last Modified Jul 25, 2024
Authors: Gilbert Lepadatu, Tony Trinh
In the rapidly evolving landscape of artificial intelligence (AI) and machine learning (ML), a new frontier of cybersecurity challenges has emerged, particularly with the rise of generative AI applications which are powered by Large Language Models (LLMs) or other modalities of Foundation Models. These models are increasingly central to organizational strategies, driving decision-making, automating processes, and enhancing user experiences. However, as reliance on these technologies grows, so too does the attention of malicious actors who are devising sophisticated techniques to exploit them. Welcome to the intricate world of adversarial machine learning—a domain where the ongoing battle between attackers and defenders plays out over data and algorithms.
Adversarial attacks on ML applications, especially the one powered by LLMs, represent a unique and rapidly growing threat. Unlike traditional cyberattacks that might target software vulnerabilities or human error, these attacks exploit the fundamental principles upon which these models operate. Attackers manipulate input data to cause misclassifications, illegitimate responses, poison training datasets, or extract sensitive information. The consequences of such attacks are severe, ranging from data breaches to unauthorized access to sensitive information, potentially compromising the integrity of AI applications.
The complexity of LLMs, wherein there is no clear distinction between executable code and input data, presents unique vulnerabilities. LLMs are trained on vast amounts of textual data, which becomes an integral part of the model's knowledge base and influences its behavior during inference. This lack of separation allows attackers to craft malicious input data that can manipulate the model’s internal representations and induce unintended actions.
In this blog, we'll lay out the general landscape of the topic, introducing the various threats and the initial concepts of defense mechanisms. In subsequent blogs, we will dive much deeper into each relevant adversarial threat and defense mechanism, examining case studies, detailing specific attack vectors, and discussing advanced strategies for safeguarding AI systems. Our work will explore the multifaceted nature of adversarial attacks, categorizing them into distinct groups based on their methodologies and objectives.
Adversarial attacks can be broadly categorized into five types: input-based, model-based, data-based, system-based, and operational attacks. Each category has specific attack vectors and potential defense mechanisms that are crucial for maintaining the security, integrity, and reliability of applications using LLM technology. Let’s now discuss them in greater detail.
Input-based attacks manipulate the input data provided to the LLMs to generate incorrect or unexpected outputs. These attacks exploit the model's sensitivity to slight changes in input, such as synonym substitution, character manipulation, and grammar alteration. These attacks can lead to harmful content generation, misclassification, or even unauthorized access to sensitive information, severely impacting the model's reliability and security.
- Description: Evasion attacks involve subtly modifying the input data at inference time to deceive the model into making incorrect inferences.
- Examples:
- Synonym Substitution: Replacing words with synonyms to alter the model’s interpretation.
- Grammar and Syntax Alteration: Modifying sentence structure to confuse the model.
- Character-Level Manipulation: Introducing typos or misspellings to cause misunderstanding.
- Other tactics: with image models, attackers can alter the input such as the pixels or visual elements of an image to trick an image recognition model into misclassifying the object or scene.
- Description: Attackers craft input prompts that manipulate the model’s behavior to output harmful or unintended responses. In another subsequent blog, we will dive much deeper into different attack vectors and defense mechanisms.
- Examples:
- Direct Prompt Injection: Overwriting or revealing the system prompt used to initialize the LLM. Examples include biased prompts and command injection.
- Indirect Prompt Injection: Injecting prompts that are not human-readable but parsed by the LLM. Examples include malicious website input and hidden prompts in HTML/JavaScript or images fed to the prompt.
- Input Validation and Sanitization: Ensure that user inputs follow expected formats and remove potentially harmful content. Input validation ensures that user input follows the expected format, while sanitization removes potentially harmful content from user inputs.
- Remove or Escape Special Characters and Control Sequences: Prevent harmful inputs by sanitizing or escaping special characters that can be used to manipulate prompts.
- Limit Input Length: Restrict the length of inputs to avoid overly long, manipulative prompts that can bypass safeguards
- Content Filtering: Detect and block known malicious patterns or keywords using content filters and rules
- Regular Updates and Reviews: Continually update and review system prompts to address newly discovered vulnerabilities.
- Prompt Engineering and Hardening: Modifies the input instructions given to a language model to make its responses more controlled and less prone to generating harmful or biased content. For example, instead of simply asking the model to "write a news article", the prompt could be hardened to "write a factual, unbiased news article about [specific topic] without any inappropriate or harmful language."
- Segregation of User Input: Isolates user inputs from system instructions and other user inputs to prevent interference, maintain security, and ensure data integrity.
- This involves categorizing and labeling user inputs distinctly, applying specific validation and sanitization techniques for each category, and ensuring that user inputs do not mix with system commands or other critical operations.
- Example: User-generated content such as articles, comments, and media uploads are processed in isolation. Articles are checked for plagiarism and comments are filtered for abusive language
- Explicit Instructions to Ignore Overrides: Instruct the LLM to disregard any subsequent attempts to override initial directives. This is an additional defense mechanism that can complement prompt hardening. Even with a hardened prompt, an attacker may still attempt to override or inject new prompts during the conversation. Explicit instructions tell the language model to disregard any subsequent prompts that try to change or override the initial directives. Prompt hardening makes injections harder, while ignore instructions make them ineffective.
Model-based attacks target the internal workings and characteristics of the language model itself, rather than just the input or output. These attacks exploit vulnerabilities within the model's architecture, training, or parameters to achieve malicious objectives.
- Description: LLMs are trained on data sourced from various domains, including private conversations, medical records, financial data, and other sensitive information. Model inversion attacks aim to recover the training data used to create the language model by leveraging the model's knowledge and parameters to reconstruct or approximate the original training data, which could contain sensitive or private information.
- Examples:
- Gradient-Based Optimization: In gradient-based model inversion attacks, the attacker leverages the gradients of the language model to recover or approximate the original training data.
- Gradients are a measure of how the model's outputs change in response to small changes in the input. By carefully analyzing these gradients, the attacker can gain insights into the model's internal representations and use this information to reconstruct plausible training examples.
- Example: Imagine a language model trained on a dataset containing sensitive user information, such as financial records. An attacker could craft input prompts and observe the model's outputs and gradients. By optimizing these gradients through an iterative process, the attacker may be able to generate synthetic samples that closely resemble the original training data, potentially exposing the sensitive information.
- Generative Models: Using generative techniques to produce plausible training examples. In the case of model inversion attacks, attackers can leverage generative models to reconstruct or approximate the original training data used to create a language model.
- Description: Model memorization attacks aim to extract specific pieces of sensitive or private information that the language model has memorized during training. Attackers leverage the model's ability to recall verbatim data from the training set. While model inversion attacks aim to reconstruct or approximate training data, model memorization attacks focus on extracting pieces of training data. Inversion attacks involve analyzing the model's gradients and parameters to recreate data, whereas memorization attacks use specific prompts to directly recall data points from the model's memory. Inversion attacks produce data that closely resembles the original training set, while memorization attacks extract verbatim data points.
- Examples:
- Crafted Prompts: Using specific prompts to retrieve sensitive training data.
- Model Probing: Probing the model to take advantage of its memorization of data.
- Description: Model extraction attacks aim to obtain a copy or approximation of the target language model by probing its behavior and extracting information about its internal structure, parameters, and decision-making process.
- Examples:
- Query-Based Extraction: Sending numerous carefully crafted queries to the model and observing the outputs to build a surrogate model.
- Description: Model subversion attacks aim to cause the model to behave in unintended or malicious ways by exploiting vulnerabilities in the model's architecture, training process, or parameter updates.
- Examples:
- Backdoor / Trojan Attack: through training data, embedding hidden triggers in the model that, when activated, cause the model to perform malicious actions.
- Training Data Poisoning: Inserting false or harmful data into the training set to skew the model's learning.
- Test Data Poisoning: Modifying test data to cause the model to perform poorly or to produce biased results.
- Differential Privacy: Implementing differential privacy techniques to protect individual data points during training. Differential privacy is a technique used to protect the privacy of individual data points during the training process of a machine learning model, such as a language model. The core idea is to introduce controlled noise or randomness into the training data or the model's parameters, making it difficult for an attacker to infer the presence or absence of a specific data point.
- Continuous Monitoring and Regular Audits: Continuously monitoring and regularly auditing model inputs, output to detect and mitigate any leakage of sensitive information
- Regular Testing: Conducting regular testing to ensure the model does not memorize sensitive information.
- Data Anonymization: Anonymizing training data to protect sensitive information.
- Data Encryption: Encrypting data at rest and in transit to prevent unauthorized modifications.
- Rate Limiting: Limiting the number of queries a user can send to the model to prevent extensive probing.
- Anomaly Detection: Monitoring query patterns for unusual activity that may indicate extraction attempts.
- Robust Training Procedures: Implementing secure and robust training procedures to minimize vulnerabilities.
- Model Integrity Checks: Regularly checking the integrity of the model’s parameters and updates.
- Data Validation and Integrity Checks: Implementing strict validation and integrity checks to ensure that only clean and verified data is used for training.
- Tamper-Evident Logging: Using tamper-evident logging mechanisms to ensure the integrity of data logs.
- Regular Security Assessments: Conduct ongoing security evaluations and penetration testing.
- Adversarial Training: Train models on adversarial examples to improve robustness.
- Model Hardening: Apply techniques to make models less susceptible to trojan attacks by ensuring the integrity of the training data and process.
Defending against model-based and data-based attacks requires a multifaceted approach, including robust model architectures and training procedures, techniques for model transparency and interpretability, secure model versioning and update mechanisms, comprehensive testing and evaluation frameworks, and ongoing monitoring and response to emerging threats. Addressing these challenges is crucial for ensuring the safe and trustworthy deployment of language models in real-world applications.
This is the first part of a two-part blog. Read the second part [here].
- Arora, A., et al. (2020). Securing web applications and microservices: A survey of current solutions and open problems. arXiv preprint arXiv:2003.04884. https://arxiv.org/abs/2003.04884
- Cichonski, P., et al. (2012). Computer security incident handling guide (NIST Special Publication 800-61 Revision 2). National Institute of Standards and Technology. https://csrc.nist.gov/publications/detail/sp/800-61/rev-2/final
- Arora, A., et al. (2020). A review of deep learning security. Mathematical Problems in Engineering, 2020. https://doi.org/10.1155/2020/6535834
- Liu, Y., et al. (2023). Prompt injection attack against LLM-integrated applications. arXiv preprint arXiv:2306.05499. https://arxiv.org/abs/2306.05499
- Goodfellow, I., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. https://arxiv.org/abs/1412.6572
- Goodfellow, I., Shlens, J., & Szegedy, C. (2019). Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175. https://arxiv.org/abs/1905.02175
- Willison, S. (2024, June 6). Accidental prompt injection. https://simonwillison.net/2024/Jun/6/accidental-prompt-injection/
- OWASP Foundation. (2023). OWASP top 10. https://owasp.org/www-project-top-ten/
- MITRE. (2023). MITRE ATT&CK. https://attack.mitre.org/
- OWASP Foundation. (2023). OWASP API security project. https://owasp.org/www-project-api-security/
- AVIDML. (n.d.). AI vulnerability database. Retrieved July 23, 2024, from https://avidml.org/
- METR. (n.d.). METR. Retrieved July 23, 2024, from https://metr.org/#work
- Anthropic. (2023, September 19). Anthropic's responsible scaling policy. https://www.anthropic.com/news/anthropics-responsible-scaling-policy
- Embrace The Red. (2023). ChatGPT plugin vulnerabilities. https://embracethered.com/blog/posts/2023/chatgpt-plugin-vulns-chat-with-code/
- Embrace The Red. (2023). ChatGPT cross-plugin request forgery and prompt injection. https://embracethered.com/blog/posts/2023/chatgpt-cross-plugin-request-forgery-and-prompt-injection/
- Research Square. (2023). Defending ChatGPT against jailbreak attack via self-reminder. Retrieved July 23, 2024, from https://www.researchsquare.com/article/rs-2873090/v1
- AI Village. (n.d.). Threat modeling LLM. Retrieved July 23, 2024, from http://aivillage.org/large%20language%20models/threat-modeling-llm/
- LLM Attacks. (2024, July 23). Universal and transferable adversarial attacks on aligned language models. https://llm-attacks.org/
- Embrace The Red. (2024, July 23). Direct and indirect prompt injections and their implications. https://embracethered.com/blog/posts/2023/ai-injections-direct-and-indirect-prompt-injection-basics/
- Kudelski Security Research. (2024, July 23). Reducing the impact of prompt injection attacks through design. https://research.kudelskisecurity.com/2023/05/25/reducing-the-impact-of-prompt-injection-attacks-through-design/
- IBM. (2024, March 21). What is a prompt injection attack? https://www.ibm.com/blog/prevent-prompt-injection/
- Wikipedia. (n.d.). Adversarial machine learning. Retrieved July 23, 2024, from https://en.wikipedia.org/wiki/Adversarial_machine_learning
- Wikipedia. (n.d.). Differential privacy. Retrieved July 23, 2024, from https://en.wikipedia.org/wiki/Differential_privacy
- Wikipedia. (n.d.). Prompt engineering. Retrieved July 23, 2024, from https://en.wikipedia.org/wiki/Prompt_engineering
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.