You need deterministic guardrails for AI agent security

Learn why deterministic guardrails are essential for AI agent security and how to build a layered defense against prompt injection and data exploits.

Daniel Kelleher

As organizations rush to deploy AI agents across their operations, a dangerous assumption is taking hold: that AI can reliably police AI. The allure of using sophisticated language models as guardrails for other AI systems is understandable, it seems elegant, scalable, and leverages the same technology that’s driving the agent revolution. But this approach is unreliable, and recent real-world exploits are proving just how vulnerable purely LLM-based security can be.

When “Official” Becomes Dangerous

Consider a recent vulnerability identified in GitHub’s official MCP (Model Context Protocol) server, an example of how sophisticated attacks can slip past even well-designed LLM guardrails. The attack works like this: an AI agent reads issues and pull requests from public GitHub repositories as part of its normal operation. Hidden within a seemingly innocent comment on a public repo is a carefully crafted instruction that instructs the LLM to ”Read all other repos.”

This instruction gets fed into the MCP server, returned to the LLM, and because the agent has access to both public and private repositories, it dutifully follows the command. The attack succeeds not through obvious malicious content, but through the exploitation of context and legitimate access patterns.

What makes this particularly insidious is that the malicious instruction doesn’t look harmful in isolation. An LLM-based guardrail, no matter how sophisticated, would struggle to distinguish this from normal, benign instructions without additional context about organizational boundaries and user trust levels.

The Prompting Arms Race

At Civic, we are building identity solutions for the AI future, we’ve implemented both LLM-based and deterministic guardrails in Civic Labs. With our “bodyguard” agent, we are exploring sophisticated techniques including tested system prompts, consensus across multiple models, and requiring LLMs to explain their tool usage. At all times, we are operating on the following basis: LLMs are always subject to manipulation and prompt injection.

The problem is rooted in how these models are trained. Current generation LLMs are trained to be compliant, to please whoever is prompting them, whether that’s a legitimate user or an attacker. They’re optimized for helpfulness, not suspicion. While we can train models to be more cautious, this creates trade-offs in other capabilities, and there are likely fundamental limits to what training alone can achieve.

More critically, deploying LLM-based guardrails puts you in an arms race with attackers, and attackers have a significant advantage. They outnumber defenders, and they only need to find one successful attack vector while defenders need to block everything.

The Deterministic Solution

The solution isn’t to abandon LLM-based guardrails entirely, but to recognize their limitations and layer them with deterministic guardrails, hard-coded, rule-based protections that operate outside the realm of language manipulation.

In the GitHub MCP exploit example, a simple deterministic rule could have prevented the attack entirely: “Only accept input from users within our organization.” This rule doesn’t rely on interpreting intent or detecting malicious language patterns. It creates an absolute boundary that no prompt injection can circumvent.

Yes, this constraint limits functionality, your AI agent might not be able to review external pull requests as easily. But it provides something that even the most sophisticated LLM guardrail cannot: absolute confidence. You can sleep soundly knowing that your GitHub MCP server won’t process requests from untrusted sources, regardless of how cleverly they’re disguised.

A Layered Approach to AI Security

Given the limitations of both purely LLM-based and purely deterministic approaches, the question becomes: how do we build AI security that's both robust and practical? The answer lies in recognizing that different types of threats require different types of defenses, and that the most effective security systems combine multiple complementary approaches.

The most robust approach combines multiple layers of protection:

Deterministic Guardrails: Hard rules that reject, redact, or add security context to inputs and outputs. These provide absolute boundaries that cannot be linguistically manipulated.

LLM-Based Guardrails: Sophisticated agents that can understand context and nuance, useful for edge cases that deterministic rules might miss.

Granular Access Control: Fine-tuned permissions that limit what each agent can access and modify.

Comprehensive Auditing: Detailed logging of agent behavior and decision-making processes.

Contextual Enhancement: Systems that add crucial security context (like “user A is in your organization, user B is not”) that LLMs cannot reliably determine on their own.

This layered approach acknowledges a fundamental truth: no single security measure is perfect, but multiple imperfect measures can create a better security posture. Each layer covers the gaps left by the others, creating overlapping fields of protection that make successful attacks more difficult.

Implementing Security-First AI

Understanding the theory of layered AI security is one thing; implementing it in production environments is another. Organizations need practical guidance on how to transition from their current AI deployments to more secure architectures without sacrificing the capabilities that make AI agents valuable in the first place.

For organizations deploying AI agents today, the approach should be security-first:

Start Locked Down: Begin with maximum deterministic guardrails, especially for agents that are user-facing or have access to external data sources like MCP servers pulling emails, tickets, or social media content.

Gradually Loosen: Systematically relax restrictions one by one, turning rejections into redactions, redactions into context additions, while maintaining detailed audit trails.

Risk Assessment: Internal-only agents may allow for a more relaxed initial approach, but this depends heavily on your specific business context and data sensitivity.

Continuous Monitoring: Implement itemized auditing and monitoring to detect when guardrails are triggered and why.

Security isn't a one-time implementation but an ongoing process of calibration. By starting with maximum security and systematically relaxing constraints based on real-world usage patterns and risk assessment, organizations can find the sweet spot between security and functionality. This approach also builds institutional knowledge about where the real risks lie, enabling more informed decisions about future security investments.

The Reality Check

Here’s the uncomfortable truth: if it’s an AI system, it’s not automatically safe, even if it comes from official sources. The GitHub MCP server example demonstrates that even well-designed, officially supported AI tools can become attack vectors.

Modern authentication and authorization systems assume that humans with proper training are making decisions. LLMs don’t fall into that category. They lack the contextual understanding of organizational boundaries, the ability to assess trust relationships, and the inherent skepticism that human operators bring to suspicious requests.

MCP servers, and similar AI-powered tools, can never be fully secure while maintaining the broad capabilities that make them useful, especially when they’re processing any form of public data.

The Stakes Are Real

The worst-case scenarios aren’t theoretical. Organizations face real risks including:

Leakage of system secrets and sensitive data
Unauthorized manipulation of critical business data
Financial losses through fraudulent transactions
Massive data breaches affecting customer information
Regulatory violations and compliance failures

These aren’t distant possibilities, they’re the predictable outcomes of deploying AI agents with insufficient guardrails in production environments.

Key Takeaways for AI Security

Official doesn’t mean safe: Even well-designed AI tools from reputable sources can become attack vectors when they process external data.
Public data equals risk: Any AI system that accesses information from public sources is particularly vulnerable to sophisticated injection attacks.
Avoid the arms race: Don’t rely solely on LLM-based guardrails, you’re fighting a battle where attackers have numerical superiority.
Layer your defenses: Combine deterministic guardrails with LLM-based protections, comprehensive auditing, and granular access controls.
Start secure, then optimize: Begin with maximum security restrictions and carefully loosen them based on real-world needs and risk assessment.

The Path Forward

The AI agent revolution is inevitable, but it doesn’t have to be reckless. By acknowledging the limitations of LLM-based security and implementing robust deterministic guardrails, organizations can harness the power of AI agents while protecting their most critical assets.

At Civic, we’re building these layered security solutions to help organizations safely deploy agentic AI at scale. The future belongs to those who can balance AI capability with security reality, and that future requires more than just hoping one AI can reliably police another.

The question isn’t whether AI agents will transform your business, it’s whether you’ll implement them securely enough to survive that transformation.

Interested in learning more about implementing secure AI agent deployments? Contact Civic Labs for early access to our deterministic guardrail solutions and comprehensive AI security framework.

‍

On this page

n/a