• tl;dr sec
  • Posts
  • How to securely build product features using AI APIs

How to securely build product features using AI APIs

A Practitioner’s Guide to Consuming AI

Companies are rapidly adopting AI capabilities, and security teams need to keep up.

As a Security Engineer, I went looking for a pragmatic guide on securely adopting Large Language Model (LLM) APIs. I wanted to know what risks I should consider and what controls are available to apply. I didn’t find one - so here is mine.

Companies that haven't spent years proactively investing in AI are launching as AI consumers. This involves building product features, often incremental ones, on top of third-party LLMs. Companies like OpenAI and Anthropic offer access to LLMs via popular APIs.

90% of VC-Backed Companies Plan to Launch Generative AI in their Products, 64% this Year

These are emerging capabilities, and there is time-pressure to launch transformative features. Security teams need to enable their businesses to grow and succeed in this environment. That means rapidly coming up to speed on the risks of these sorts of product features. More importantly it means awareness of the pragmatic set of controls emerging to reduce these risks.

Currently, ML environments are illegible to security analysts as they have no operational insights.

AI Risks

Many organizations and individuals are looking at the security risks of AI. The Berryville Institute of Machine Learning has identified 78 risks via an Architectural Risk Analysis, including their own Top 10. Groups like Team8, CSA, OWASP, and NIST have also produced substantive guidance.

The goal of this post is narrow. It will synthesize only those risks that are relevant when consuming AI and building features on top of LLM APIs. We’ll also highlight the controls available today to address this risks and vulnerabilities.

Given our focus on AI consumers, we’ll put aside considerations with most of the LLM stack involved in creating models, and serving APIs for them.

Adversarial Examples

The most prominent class of risks in products built on top of AI APIs lies in user provided input, at query time, attacking the underlying model. There are already numerous permutations of this attack. Tools like garak, the Adversarial Robustness Toolbox (ART), and MITRE Arsenal are automating the process of identifying susceptibility to these attacks.

Prompt Injection

Prompt Injection is the most straightforward adversarial attack. It was identified as early as May 2022 (by Preamble, called “command injection” at the time). Riley Goodside then released the first public example of such an attack that September. 13

In a basic Prompt Injection, an attacker can take advantage of the concatenation of user input to a pre-written prompt string to override the initial goals with attacker intent.11

A high impact example was found in MathGPT, which by design converts a natural language question into Python code that is then executed. This allowed for a nontraditional command injection vulnerability.

This is the exact same type of vulnerability we’ve been seeing for years in security:

  • Cross-Site Scripting: attacker-controlled input isn’t safely encoded for viewing on a web page

  • SQL Injection: attacker-controlled input gets mixed in with a database query

  • Command Injection: attacker-controlled input isn’t correctly shell escaped to be run as a command

If this concept seems familiar, it is because it mirrors a well-known problem: the integration of control instructions and data, as seen in the Von Neumann architecture that permeate computing to this day.

Indirect Prompt Injection14

Later research, notably by Kai Greshake, has expanded on this idea. In Indirect Prompt Injection, instead of direct user input tainting the prompt an attacker can poisons data retrieved at inference time. The initial practical example exploited Bing’s access to the content on the current website.

In another example, Mark Riedl was able to taint Bing’s description of him via hidden text addressing Bing directly.

An additional exploitation vector used ChatGPT’s markdown image support as a native data exfiltration vector.

Positive Security took things a step further when researching AutoGPT, which is a tool that allows you to sequence a set of LLM tasks, including ones with capabilities like browsing the web and running python code. They were able to layer indirect prompt injection, code execution using the execute_python_file feature, and escalation via either Docker escape or path traversal.

Prompt Leakage

Prompt Leakage is one impact of Prompt Injection. Shawn Wang offers a great example of the process of Prompt Leakage, targeting Notion AI.28 In the wild examples also exist, such as manipulations of a “remote work” twitter bot.

However, Shawn also makes the cogent point that you should design your product such that this attack has no impact on your business.

Matt Rickard offers more examples in A List of Leaked System Prompts. Take Perplexity.AI - a product which primarily offered a summarization interface on top of search and ChatGPT:

Plugin Request Forgery Attacks

Named after Cross-Site Request Forgery, this class of attacks demonstrates a notable application of confused deputy attacks via indirect prompt injection. These attacks require the presence of agents (such as ChatGPT plugins) that can take sensitive actions or are able to offer a data exfiltration channel.

Johann Rehberger found the initial example, which used the WebPilot plugin (vulnerable to indirect prompt injection on visited sites) and the Zapier plugin (to access the user’s email account and exfiltrate the data).25

Another example, “InjectGPT,” takes advantage of boxcars.ai's ability to run code to achieve traditional command injection.29


“Jailbreak” prompts, like “Do Anything Now” (DAN), are used to make an AI system perform unexpected jobs or ignore its prompt guardrails. For example, attackers could use this to turn a specific feature (like a support desk chatbot) into generic access to the backing model/API.5, 10

Researchers (like those behind gpt4free) have already made significant investments in reverse engineering APIs to gain free access to the underlying models.

Economic Denial of Service

Coined by Christofer Hoff as “Economic Denial of Sustainability” back in 2008, this attack is often discussed in relationship to cloud service providers. The attack takes the elasticity of the cloud and usage based billing, and posits an attacker who causes economic impact by driving resource consumption and incurs financial cost.

Anecdotally, we’ve heard app layer AI companies spend 5% - 10%+ of their revenue on LLM costs today.

This applies to features using on LLM APIs due to the pricing and consumption models of these features, as well as the generally high relative cost-per-API-call

… there’s basically never a time where you come up with a good idea and say, “We can’t afford to build it.” That’s just not a thing, right?
That changes when you’re building with AI. There are features we could build that we won’t because they’re too expensive.

Training Data

Even when building on LLM APIs, some concerns can arise around training data on your side of the shared responsibility model.

Data Poisoning

Outside of adversarial examples, data poisoning is a class of attacks that takes place before or alongside the actual user input.

Generally, models offered via API are pre-trained and frozen, moving the risks of data poisoning to the vendor side of the shared responsibility model. When adopting a third party model, interrogate what guarantees are offered against data poisoning. Ensure models only come from trusted sources, as malicious Trojans have been proven theoretically possible.

However, consumers may introduce task-specific fine-tuning via transfer learning on top of a generic model.13 In these cases, be thoughtful in using curated or licensed content that is validated and trusted.5

Models also can be provided context during inference.  If the end user is offered any control over the content of the context, this could introduce bias, hijack responses to all end users, or even allow indirect prompt injection.33

Online models, which continue training during active use, carry a much higher level of risk. An attacker can introduce drift from the model’s intended operational use case, and otherwise poison it.1 However, these models are not traditional in the LLM API consumption pattern.

Attacks that successfully inject data into training models can be difficult to detect, impossible to remediate, and incur massive cost to retrain and redeploy the model.10

Training Data Confidentiality

In addition to poisoning risks, training data may be confidential or proprietary. As in data poisoning, much of this risk is carried on the model provider’s side of the shared responsibility model.

During the early days of ChatGPT’s popularity surge, there was considerable concern that the model might train on user’s inputs, and that user data could then be leaked by attackers. However, currently ChatGPT and similar models are not online and updating in real-time. This means user input isn’t part of their training data corpus at all.5

Feedback Poisoning

Often, features that are reliant on generative AI collect end user feedback. If this feedback is later used for future re-tuning, then attackers could leverage malicious feedback to taint the model, potentially introducing bias.33


So far, we’ve focused on risks around the inputs to models - whether that is training data or adversarial examples. However, the outputs from these models can also pose a threat.

Output Integrity

BIMI proposes Output Integrity as a Top 10 risk that involves an attacker interposed between a model and the world. They posit that the “inscrutability of ML operations” lends itself more to this risk.1

The Team8 whitepaper notes that issues of generated content ownership, intellectual property infringement, and plagiarism are still unresolved. This leaves residual risk in usage of models. In fact, current guidance from The US Copyright Office refuses copyright protection for works produced by Generative AI.

Legally Sensitive Output

Beyond copyright, other legal consideration for output are expected to emerge, including around issues such as libel and defamation.10 OpenAI has already been sued for the latter.

Inadequate AI Alignment

OWASP LLM07:202327 , also referred to as “Edge Use Cases and Model Purpose” by the Team8 Whitepaper.

While there are generic models, models trained or fine tuned to specific tasks are also applied. These models can be fragile, if used for unintended purposes, in which case they can return inaccurate, incomplete, or false results. The models’ objectives and behavior can cause vulnerabilities or introduce risks via misaligned objectives and behavior.

This misalignment can be innate, but it can also occur as a result of model drift over time.

Hallucination Abuse

This attack relies on the fact that models often hallucinate, and in code generation these hallucinations can include non-existent software packages. An attacker who can predict such hallucinations can pre-emptively register and squat on the resource, and use it to deliver malicious content to end users who follow the model’s generated code.33

Security Controls

Having explored the breadth of AI risks facing this class of product or feature, we turn our attention to the controls practitioners have available today to mitigate those risks.

This is not a checklist, as controls present a set of tradeoffs between security assurance and product capabilities. At a high level, broader model and prompt flexibility allow more generic applicability, with a broader surface of risks. Models that are less easily transferable can provide more resiliency to attacker introspection.8

Design decisions also have outsized security impact43, such as:

  • Where to allow AI integration versus to build on alternative technologies

  • What steps in a business process are well suited to AI, for example output formatting benefits from determinism

  • The execution scope, permissions, and isolation of AI components

  • Tuning temperature (which influences randomness and creativity of the model), as more determinism can be safer but less organic

Traditional Governance, Risk, and Compliance Controls

It’s worth mentioning how standard practices on GRC still apply with these new LLM APIs, similar to any vendor. Some core considerations span:

  • Vendor Security: even in the base case, where you’re simply calling a pre-trained and frozen model via LLM API, you’re still passing data to a third party. Generic API concerns apply, including validation of the security of the vendor, and consideration for the sensitive of the data you’re therefor willing to share. Consider trying to quantify the impact to your business if the vendor has a major vulnerability or incident.

  • Data Compliance: when sending data to a vendor, you always need to consider whether that data is regulated or subject to a compliance regime. Are you authorized to share that data with the LLM vendor? For example, if you’re working with healthcare data, have you negotiated a BAA and the additional necessary steps to authorize sharing PHI?

  • Consolidation Risks: these vendors are seeing rapid adoption. As their profile and clientel grow, they become high value, centralized targets motivating more sophisticated adversaries.

Traditional Application Security Controls

Many of your standard security controls and practices maintain their significance when addressing the risks of AI products. One specific consideration for these controls is also the cost of AI APIs.

  • Access Control: Generally, AI-powered features can only be scalably offered to paying customers. Additionally, maintaining confidentiality across customers is a major concern for users of AI products, requiring standard authorization controls.

  • Caching: The architecture and cost of AI APIs favors caching whenever possible. As always, caching introduces complex failure models and potential for cross-tenant data leakage, depending on the implementation. ChatGPT has already had a Web Cache Deception vulnerability that could have resulted in account takeover. They have also had a bug in their use of Redis that led to cross-customer information disclosure.

  • Rate limiting: Controls on you users’ consumption are important as well - either via rate limits, usage caps, or usage-based pricing.

  • Data retention: Providers like OpenAI offer contractual commitments on data retention. Lowering the period for which data is retained reduces the blast radius of a vulnerability or incident.

  • Logging and Monitoring: maintaining a record of not only inputs but outputs can be important, due to the non-determinism of the models. Detecting attempted adversarial attacks or misuse are evolving considerations.

Protections against Adversarial Examples

When building on top of LLMs, it is crucial to acknowledge that preventing prompt injection is an unsolved problem. While Prompt Injection resembles historic issues like Cross Site Scripting and SQL Injection, prevention is significantly harder.

In those other cases, we only have to managed a constrained input space (e.g. the set of characters that can break out of a SQL context is bounded). With Prompt Injection and other Adversarial Examples, you have to contend with the full expressiveness of the native language interface. Also, generally the behavior of the LLM is not currently explainable, and responses are non-deterministic.

Despite the lack of a comprehensive solution, options for mitigation and risk reduction are rapidly manifesting.

Simon Willison’s Dual LLM pattern is one of the more interesting works in this space, but remains theoretical. It proposes a split between a Privileged LLM and a Quarantined LLM. The former acts on input from trusted sources, and is granted broad access. The latter is invoked when untrusted content needs to be processed, and is isolated from tools and sensitive data. The Quarantined LLM can also run verifiable processes. A Controller is then used to pass references between the two LLMs and the end user, providing assurance that content never contaminated the Privileged LLM.

Protections against Training Data Risks

  • Opt Out of data usage for training

  • Finetuning isolation

    • Do not make proprietary consumer-owned training/finetuning data available to tenants

    • Do not offer shared tenancy when using a user-finetuned model

    • Isolate Indexes across tenants

Protections against Output Risks

  • Moderation and Safety Systems:

  • Treat model output as untrusted data: leverage the same security model with model output that you’d use with user-provided data. For example, parameterize any database queries sourced from model output.

  • Put a Human in the Loop: requiring human confirmation before taking action based on outputs can ensure that generated content matches user intention.

  • Output tokens and output allowlisting: narrowing possible outputs constrains misuse. In the most drastic version of this, you could allowlist outputs - such as only allowing the return of the best match from an existing knowledge base.

  • Watermarking: be aware of platform support for watermarking, and consider the utility. This is especially relevant when generating visual artifacts, but also applies to text generation. Some comapnies are voluntarily making assurances on watermarking, and some jurisdictions (namely China) are mandating watermarks.

Hands-on training (with AI CTFs)

One of the best ways to get a feel for these risks is by playing hands-on with adversarial examples. A number of free AI CTFs and challenges have cropped up, check them out!

  1. Kaggle - AI Village Capture the Flag @ DEFCON

  2. Gandalf | Lakera: “Your goal is to make Gandalf reveal the secret password for each level. However, Gandalf will level up each time you guess the password, and will try harder not to give it away. Can you beat level 7? (There is a bonus level 8)”

  3. doublespeak.chat: “A text-based AI escape game by Forces Unseen”

  4. Fortune ML CTF Challenge: “In this web application challenge, the 🕵️ security researcher needs to bypass AI Corp's Identity Verification neural network”

  5. GPT Prompt Attack ⛳: “Goal of this game is to come up with the shortest user input that tricks the system prompt into returning the secret key back to you.”

  6. Jupyter Notebook - ChatGPT Adversarial Prompting


  1. 🌟 Berryville Institute of Machine Learning: The Top 10 Risks of Machine Learning Security

  2. Adversarial Machine Learning - Industry Perspectives 

  3. Impacts and Risk of Generative AI Technology on Cyber Defense 

  4. Adversa: The Road to Secure and Trusted AI

  5. 🌟 Team8: Generative AI and ChatGPT Enterprise Risks

  6. ENISA: Artificial Intelligence and Cybersecurity Research

  7. Trail of Bits: Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems

  8. ENISA: Multilayer Framework for Good Cybersecurity Practices for AI

  9. NIST: Artificial Intelligence Risk Management Framework (AI RMF 1.0)

  10. CSA: Security Implications of ChatGPT

  11. Simon Willison: Prompt Injection attacks against GPT-3

  12. 🌟 Simon Willison: Prompt Injection: Whats the worst that can happen

  13. Jose Selvi, NCC Group: Exploring Prompt Injection Attacks

  14. Kai Greshake: How We Broke LLMS: Indirect Prompt Injection

  15. OWASP: AI Security and Privacy Guide

  16. 🌟 Microsoft: Failure Modes in Machine Learning

  17. Preamble: Declassifying the Responsible Disclosure of the Prompt Injection Attack Vulnerability of GPT-3

  18. Introducing Google’s Secure AI Framework

  19. leondz / garak 

  20. jiep / offensive-ai-compilation 

  21. Daniel Miessler: The AI Attack Surface Map v1.0

  22. IANS: ChatGPT: Uncovering Misuse Scenarios and AI Security Challenges

  23. IDC: Generative AI: Mitigating Data Security and Privacy Risks

  24. 🌟 Kai Greshake: In Escalating Order of Stupidity

  25. EmbracetheRed: Plugin Vulnerabilities: Visit a Website and Have Your Source Code Stolen

  26. Will Pearce, Nick Landers: The answer to life the universe and everything offensive security

  27. OWASP Top 10 List for Large Language Models version 0.1 

  28. 🌟 sxyx: Reverse Prompt Engineering for Fun and (no) Profit

  29. InjectGPT: the most polite exploit ever 

  30. MITRE: Achieving Code Execution in MathGPT via Prompt Injection

  31. EmbracetheRed: Bing Chat: Data Exfiltration Exploit Explained

  32. Phil Venables: AI Consequence and Intent - Second Order Risks

  33. 🌟 Wiz: How to leverage generative AI in cloud apps without putting user data at risk

  34. OpenAI: Safety best practices

  35. MLSMM: Machine Learning Security Maturity Model

  36. EmbracetheRed: ChatGPT Plugin Exploit Explained: From Prompt Injection to Accessing Private Data

  37. Roman Samoilenko: New prompt injection attack on ChatGPT web version. Markdown images can steal your chat data.

  38. Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

  39. Securing Machine Learning in the Cloud: A Systematic Review of Cloud Machine Learning Security

  40. MITRE ATLAS™ (Adversarial Threat Landscape for Artificial-Intelligence Systems)

  41. mitre-atlas / arsenal

  42. Prompt Injection attack against LLM-integrated Applications

  43. Reducing The Impact of Prompt Injection Attacks Through Design

  44. A Complete List of All (arXiv) Adversarial Example Papers

  45. Red-Teaming Large Language Models

  46. Copyright Protection and Accountability of Generative AI:

    Attack, Watermarking and Attribution

  47. Threat Modeling LLM Applications

  48. Hacking Auto-GPT and escaping its docker container

  49. A List of Leaked System Prompts

  50. NVIDIA AI Red Team: An Introduction

  51. Ian Goodfellow - Presentations

  52. EmbracetheRed: OpenAI Removes the "Chat with Code" Plugin From Store

  53. Using Large Language Models Effectively

  54. Prompt Injection Attacks and Mitigations

  55. MLSecOps Top 10