When an LLM processes untrusted content (user input, retrieved documents, emails, web pages) and that content contains instructions designed to override the system prompt or manipulate the model's behavior, that's prompt injection. The stakes rise dramatically when the agent has real-world capabilities.
The two attack classes
Direct Prompt Injection
The attacker directly controls the user input and injects instructions designed to override the system prompt, extract confidential context, or hijack the agent's behavior.
Indirect Prompt Injection
The attack payload is embedded in external content the agent retrieves (a webpage, email, document, or database record) and takes effect when the model processes it. The user never types the malicious instruction themselves. This is the harder problem to solve.
Five layers of defense
1. Input Validation & Sanitization
Detect and filter known injection patterns, instruction-like text in untrusted inputs, and anomalous token sequences. Not sufficient on its own (determined attackers evade pattern matching) but eliminates low-effort attacks.
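A minimal sketch of this kind of pattern filter, assuming a small illustrative blocklist (a real deployment would maintain a much larger, continuously updated list and combine it with other signals):

```python
import re

# Hypothetical patterns for illustration only; attackers routinely
# rephrase around any fixed list like this.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"disregard (the )?system prompt",
    r"reveal (your|the) (system )?prompt",
]

def flag_suspicious(text: str) -> bool:
    """Return True if the text matches a known injection pattern."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

Flagged inputs can be rejected outright or routed to stricter handling; either way, this layer only raises the cost of low-effort attacks.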
2. Privilege Separation
Follow least-privilege: agents should only have access to the tools and data required for their specific task. An email assistant should not have delete access to the entire inbox. An attacker who succeeds in injecting instructions is limited by what the agent can actually do.
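One way to enforce this is a per-agent tool allowlist checked at dispatch time. A sketch, with hypothetical agent and tool names:

```python
# Hypothetical registry: each agent gets only the tools its task requires.
AGENT_TOOL_ALLOWLIST = {
    "email_assistant": {"read_message", "draft_reply"},  # no delete, no send
    "billing_agent": {"read_invoice"},
}

def read_message(msg_id: str) -> str:
    return f"contents of {msg_id}"  # stand-in for the real tool

TOOLS = {"read_message": read_message}

def call_tool(agent: str, tool: str, *args):
    """Dispatch a tool call only if the agent is allowlisted for it."""
    if tool not in AGENT_TOOL_ALLOWLIST.get(agent, set()):
        raise PermissionError(f"{agent} is not allowed to call {tool}")
    return TOOLS[tool](*args)
```

The key design choice is that the check lives in the dispatcher, outside the model: even a fully hijacked agent cannot reach a tool it was never granted.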
3. Separate Instruction and Data Channels
Where possible, keep trusted instructions (system prompt) structurally separate from untrusted data (retrieved content, user input). Use XML delimiters or structured formats that make the instruction/data boundary explicit and harder to cross.
4. Output Validation
Validate LLM outputs against expected schemas and behavior patterns before acting on them, especially for agentic pipelines where outputs trigger tool calls. Detect outputs that deviate from expected task scope (e.g., a summarizer that suddenly issues API calls).
5. Human-in-the-Loop for High-Stakes Actions
Require explicit human approval before irreversible or high-impact actions (sending emails, making payments, deleting data). Even if injection succeeds, the attacker cannot complete the action without human confirmation. Highest-ROI defense for agentic systems.
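A minimal sketch of an approval gate, with a hypothetical set of high-stakes action names; anything on the list is held as pending until a human signs off:

```python
from typing import Optional

# Hypothetical list of irreversible or high-impact actions.
HIGH_STAKES = {"send_email", "make_payment", "delete_data"}

def execute(action: str, payload: dict, approved_by: Optional[str] = None) -> dict:
    """Run an action; high-stakes actions require an explicit human approver."""
    if action in HIGH_STAKES and approved_by is None:
        return {"status": "pending_approval", "action": action}
    return {"status": "executed", "action": action, "payload": payload}
```

Low-stakes actions flow through unattended, so the approval burden stays proportional to the risk.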
Security-by-default on MoltBot
Input validation, privilege separation, output guardrails: built into every agent. 14-day free trial.
Start Free Trial →