How to securely build product features using AI APIs
A Practitioner’s Guide to Consuming AI
Companies are rapidly adopting AI capabilities, and security teams need to keep up.
As a Security Engineer, I went looking for a pragmatic guide on securely adopting Large Language Model (LLM) APIs. I wanted to know what risks I should consider and what controls are available to apply. I didn’t find one - so here is mine.
Companies that haven't spent years proactively investing in AI are launching as AI consumers. This involves building product features, often incremental ones, on top of third-party LLMs. Companies like OpenAI and Anthropic offer access to LLMs via popular APIs.
These are emerging capabilities, and there is time-pressure to launch transformative features. Security teams need to enable their businesses to grow and succeed in this environment. That means rapidly coming up to speed on the risks of these sorts of product features. More importantly it means awareness of the pragmatic set of controls emerging to reduce these risks.
Many organizations and individuals are looking at the security risks of AI. The Berryville Institute of Machine Learning has identified 78 risks via an Architectural Risk Analysis, including their own Top 10. Groups like Team8, CSA, OWASP, and NIST have also produced substantive guidance.
The goal of this post is narrow. It will synthesize only those risks that are relevant when consuming AI and building features on top of LLM APIs. We’ll also highlight the controls available today to address this risks and vulnerabilities.
From a16z’s Emerging Architectures for LLM Applications
Given our focus on AI consumers, we’ll put aside considerations with most of the LLM stack involved in creating models, and serving APIs for them.
The most prominent class of risks in products built on top of AI APIs lies in user provided input, at query time, attacking the underlying model. There are already numerous permutations of this attack. Tools like garak, the Adversarial Robustness Toolbox (ART), and MITRE Arsenal are automating the process of identifying susceptibility to these attacks.
Prompt Injection is the most straightforward adversarial attack. It was identified as early as May 2022 (by Preamble, called “command injection” at the time). Riley Goodside then released the first public example of such an attack that September. 13
Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions.
— Riley Goodside (@goodside)
Sep 12, 2022
In a basic Prompt Injection, an attacker can take advantage of the concatenation of user input to a pre-written prompt string to override the initial goals with attacker intent.11
A high impact example was found in MathGPT, which by design converts a natural language question into Python code that is then executed. This allowed for a nontraditional command injection vulnerability.
This is the exact same type of vulnerability we’ve been seeing for years in security:
Cross-Site Scripting: attacker-controlled input isn’t safely encoded for viewing on a web page
SQL Injection: attacker-controlled input gets mixed in with a database query
Command Injection: attacker-controlled input isn’t correctly shell escaped to be run as a command
Indirect Prompt Injection14
Later research, notably by Kai Greshake, has expanded on this idea. In Indirect Prompt Injection, instead of direct user input tainting the prompt an attacker can poisons data retrieved at inference time. The initial practical example exploited Bing’s access to the content on the current website.
In another example, Mark Riedl was able to taint Bing’s description of him via hidden text addressing Bing directly.
I have verified that one can leave secret messages to Bing Chat in web pages.
— Mark Riedl (more at @[email protected]) (@mark_riedl)
Mar 21, 2023
An additional exploitation vector used ChatGPT’s markdown image support as a native data exfiltration vector.
Positive Security took things a step further when researching AutoGPT, which is a tool that allows you to sequence a set of LLM tasks, including ones with capabilities like browsing the web and running python code. They were able to layer indirect prompt injection, code execution using the
execute_python_file feature, and escalation via either Docker escape or path traversal.
Prompt Leakage is one impact of Prompt Injection. Shawn Wang offers a great example of the process of Prompt Leakage, targeting Notion AI.28 In the wild examples also exist, such as manipulations of a “remote work” twitter bot.
However, Shawn also makes the cogent point that you should design your product such that this attack has no impact on your business.
Prompts are not moats
Plugin Request Forgery Attacks
Named after Cross-Site Request Forgery, this class of attacks demonstrates a notable application of confused deputy attacks via indirect prompt injection. These attacks require the presence of agents (such as ChatGPT plugins) that can take sensitive actions or are able to offer a data exfiltration channel.
Johann Rehberger found the initial example, which used the WebPilot plugin (vulnerable to indirect prompt injection on visited sites) and the Zapier plugin (to access the user’s email account and exfiltrate the data).25
Another example, “InjectGPT,” takes advantage of boxcars.ai's ability to run code to achieve traditional command injection.29
“Jailbreak” prompts, like “Do Anything Now” (DAN), are used to make an AI system perform unexpected jobs or ignore its prompt guardrails. For example, attackers could use this to turn a specific feature (like a support desk chatbot) into generic access to the backing model/API.5, 10
Researchers (like those behind gpt4free) have already made significant investments in reverse engineering APIs to gain free access to the underlying models.
Economic Denial of Service
Coined by Christofer Hoff as “Economic Denial of Sustainability” back in 2008, this attack is often discussed in relationship to cloud service providers. The attack takes the elasticity of the cloud and usage based billing, and posits an attacker who causes economic impact by driving resource consumption and incurs financial cost.
This applies to features using on LLM APIs due to the pricing and consumption models of these features, as well as the generally high relative cost-per-API-call
Even when building on LLM APIs, some concerns can arise around training data on your side of the shared responsibility model.
Outside of adversarial examples, data poisoning is a class of attacks that takes place before or alongside the actual user input.
Generally, models offered via API are pre-trained and frozen, moving the risks of data poisoning to the vendor side of the shared responsibility model. When adopting a third party model, interrogate what guarantees are offered against data poisoning. Ensure models only come from trusted sources, as malicious Trojans have been proven theoretically possible.
However, consumers may introduce task-specific fine-tuning via transfer learning on top of a generic model.13 In these cases, be thoughtful in using curated or licensed content that is validated and trusted.5
Models also can be provided context during inference. If the end user is offered any control over the content of the context, this could introduce bias, hijack responses to all end users, or even allow indirect prompt injection.33
Online models, which continue training during active use, carry a much higher level of risk. An attacker can introduce drift from the model’s intended operational use case, and otherwise poison it.1 However, these models are not traditional in the LLM API consumption pattern.
Attacks that successfully inject data into training models can be difficult to detect, impossible to remediate, and incur massive cost to retrain and redeploy the model.10
Training Data Confidentiality
In addition to poisoning risks, training data may be confidential or proprietary. As in data poisoning, much of this risk is carried on the model provider’s side of the shared responsibility model.
During the early days of ChatGPT’s popularity surge, there was considerable concern that the model might train on user’s inputs, and that user data could then be leaked by attackers. However, currently ChatGPT and similar models are not online and updating in real-time. This means user input isn’t part of their training data corpus at all.5
Often, features that are reliant on generative AI collect end user feedback. If this feedback is later used for future re-tuning, then attackers could leverage malicious feedback to taint the model, potentially introducing bias.33
So far, we’ve focused on risks around the inputs to models - whether that is training data or adversarial examples. However, the outputs from these models can also pose a threat.
BIMI proposes Output Integrity as a Top 10 risk that involves an attacker interposed between a model and the world. They posit that the “inscrutability of ML operations” lends itself more to this risk.1
Lack of Copyright Protection
The Team8 whitepaper notes that issues of generated content ownership, intellectual property infringement, and plagiarism are still unresolved. This leaves residual risk in usage of models. In fact, current guidance from The US Copyright Office refuses copyright protection for works produced by Generative AI.
Legally Sensitive Output
Beyond copyright, other legal consideration for output are expected to emerge, including around issues such as libel and defamation.10 OpenAI has already been sued for the latter.
Inadequate AI Alignment
OWASP LLM07:202327 , also referred to as “Edge Use Cases and Model Purpose” by the Team8 Whitepaper.
While there are generic models, models trained or fine tuned to specific tasks are also applied. These models can be fragile, if used for unintended purposes, in which case they can return inaccurate, incomplete, or false results. The models’ objectives and behavior can cause vulnerabilities or introduce risks via misaligned objectives and behavior.
This misalignment can be innate, but it can also occur as a result of model drift over time.
This attack relies on the fact that models often hallucinate, and in code generation these hallucinations can include non-existent software packages. An attacker who can predict such hallucinations can pre-emptively register and squat on the resource, and use it to deliver malicious content to end users who follow the model’s generated code.33
Having explored the breadth of AI risks facing this class of product or feature, we turn our attention to the controls practitioners have available today to mitigate those risks.
This is not a checklist, as controls present a set of tradeoffs between security assurance and product capabilities. At a high level, broader model and prompt flexibility allow more generic applicability, with a broader surface of risks. Models that are less easily transferable can provide more resiliency to attacker introspection.8
Design decisions also have outsized security impact43, such as:
Where to allow AI integration versus to build on alternative technologies
What steps in a business process are well suited to AI, for example output formatting benefits from determinism
The execution scope, permissions, and isolation of AI components
For example, significant risk can be reduced by isolating all authentication information from the LLM.
Tuning temperature (which influences randomness and creativity of the model), as more determinism can be safer but less organic
Traditional Governance, Risk, and Compliance Controls
It’s worth mentioning how standard practices on GRC still apply with these new LLM APIs, similar to any vendor. Some core considerations span:
Vendor Security: even in the base case, where you’re simply calling a pre-trained and frozen model via LLM API, you’re still passing data to a third party. Generic API concerns apply, including validation of the security of the vendor, and consideration for the sensitive of the data you’re therefor willing to share. Consider trying to quantify the impact to your business if the vendor has a major vulnerability or incident.
Data Compliance: when sending data to a vendor, you always need to consider whether that data is regulated or subject to a compliance regime. Are you authorized to share that data with the LLM vendor? For example, if you’re working with healthcare data, have you negotiated a BAA and the additional necessary steps to authorize sharing PHI?
Consolidation Risks: these vendors are seeing rapid adoption. As their profile and clientel grow, they become high value, centralized targets motivating more sophisticated adversaries.
Traditional Application Security Controls
Many of your standard security controls and practices maintain their significance when addressing the risks of AI products. One specific consideration for these controls is also the cost of AI APIs.
Access Control: Generally, AI-powered features can only be scalably offered to paying customers. Additionally, maintaining confidentiality across customers is a major concern for users of AI products, requiring standard authorization controls.
Caching: The architecture and cost of AI APIs favors caching whenever possible. As always, caching introduces complex failure models and potential for cross-tenant data leakage, depending on the implementation. ChatGPT has already had a Web Cache Deception vulnerability that could have resulted in account takeover. They have also had a bug in their use of Redis that led to cross-customer information disclosure.
Rate limiting: Controls on you users’ consumption are important as well - either via rate limits, usage caps, or usage-based pricing.
Data retention: Providers like OpenAI offer contractual commitments on data retention. Lowering the period for which data is retained reduces the blast radius of a vulnerability or incident.
Logging and Monitoring: maintaining a record of not only inputs but outputs can be important, due to the non-determinism of the models. Detecting attempted adversarial attacks or misuse are evolving considerations.
Protections against Adversarial Examples
When building on top of LLMs, it is crucial to acknowledge that preventing prompt injection is an unsolved problem. While Prompt Injection resembles historic issues like Cross Site Scripting and SQL Injection, prevention is significantly harder.
In those other cases, we only have to managed a constrained input space (e.g. the set of characters that can break out of a SQL context is bounded). With Prompt Injection and other Adversarial Examples, you have to contend with the full expressiveness of the native language interface. Also, generally the behavior of the LLM is not currently explainable, and responses are non-deterministic.
Despite the lack of a comprehensive solution, options for mitigation and risk reduction are rapidly manifesting.
Simon Willison’s Dual LLM pattern is one of the more interesting works in this space, but remains theoretical. It proposes a split between a Privileged LLM and a Quarantined LLM. The former acts on input from trusted sources, and is granted broad access. The latter is invoked when untrusted content needs to be processed, and is isolated from tools and sensitive data. The Quarantined LLM can also run verifiable processes. A Controller is then used to pass references between the two LLMs and the end user, providing assurance that content never contaminated the Privileged LLM.
Input allowlisting: If possible, limiting user input to a known set of queries can practically eliminate risks from adversarial examples.
Preflight prompt checks13: One novel approach to input validation, proposed by Yohei, involves sending a “preflight” prompt check to validate the user’s input doesn’t substantially change the expected function of the prompt.
Set a reasonable session length (or session depth) to balance extended context building, which can create more powerful, but higher risk, interaction models.
Platform support for control and data plane isolation
Ideally, LLM API providers will significantly reduce these risks by implementing a way to segment data from code.
OpenAI has taken the first steps with ChatML by making “explicit to the model the source of each piece of text, and particularly shows the boundary between human and AI text.” However, at this time the boundary is not enforceable in the model.
Vendors are rapidly cropping up and garnering investment to extend LLM API provider security or offer a secure intermediary between consumers and providers. To learn more, check out Venture in Security’s AI/ML Model Security Landscape.
Protections against Training Data Risks
Opt Out of data usage for training
Do not make proprietary consumer-owned training/finetuning data available to tenants
Do not offer shared tenancy when using a user-finetuned model
Isolate Indexes across tenants
Protections against Output Risks
Moderation and Safety Systems:
Treat model output as untrusted data: leverage the same security model with model output that you’d use with user-provided data. For example, parameterize any database queries sourced from model output.
Put a Human in the Loop: requiring human confirmation before taking action based on outputs can ensure that generated content matches user intention.
Output tokens and output allowlisting: narrowing possible outputs constrains misuse. In the most drastic version of this, you could allowlist outputs - such as only allowing the return of the best match from an existing knowledge base.
Watermarking: be aware of platform support for watermarking, and consider the utility. This is especially relevant when generating visual artifacts, but also applies to text generation. Some comapnies are voluntarily making assurances on watermarking, and some jurisdictions (namely China) are mandating watermarks.
Hands-on training (with AI CTFs)
One of the best ways to get a feel for these risks is by playing hands-on with adversarial examples. A number of free AI CTFs and challenges have cropped up, check them out!
Kaggle - AI Village Capture the Flag @ DEFCON
Gandalf | Lakera: “Your goal is to make Gandalf reveal the secret password for each level. However, Gandalf will level up each time you guess the password, and will try harder not to give it away. Can you beat level 7? (There is a bonus level 8)”
doublespeak.chat: “A text-based AI escape game by Forces Unseen”
Fortune ML CTF Challenge: “In this web application challenge, the 🕵️ security researcher needs to bypass AI Corp's Identity Verification neural network”
GPT Prompt Attack ⛳: “Goal of this game is to come up with the shortest user input that tricks the system prompt into returning the secret key back to you.”
Jupyter Notebook - ChatGPT Adversarial Prompting
🌟 Berryville Institute of Machine Learning: The Top 10 Risks of Machine Learning Security
Adversa: The Road to Secure and Trusted AI
Simon Willison: Prompt Injection attacks against GPT-3
🌟 Simon Willison: Prompt Injection: Whats the worst that can happen
Jose Selvi, NCC Group: Exploring Prompt Injection Attacks
Kai Greshake: How We Broke LLMS: Indirect Prompt Injection
🌟 Microsoft: Failure Modes in Machine Learning
Daniel Miessler: The AI Attack Surface Map v1.0
🌟 Kai Greshake: In Escalating Order of Stupidity
Will Pearce, Nick Landers: The answer to life the universe and everything offensive security
EmbracetheRed: Bing Chat: Data Exfiltration Exploit Explained
Phil Venables: AI Consequence and Intent - Second Order Risks
OpenAI: Safety best practices
EmbracetheRed: OpenAI Removes the "Chat with Code" Plugin From Store