
As AI systems move from research demonstrations into production infrastructure — handling financial transactions, managing healthcare data, operating security systems, running enterprise workflows — the consequences of their failure change fundamentally. A chatbot that gives a wrong answer is an inconvenience. An AI system that can be manipulated into ignoring its safety constraints while managing critical infrastructure is a serious threat.

Prion is our response to that threat. It is a defense layer designed to operate at inference time — analyzing every input to an AI system before it reaches the model, and blocking adversarial manipulation in real time.

The problem

Language models have a fundamental vulnerability: they process instructions and data in the same channel. A user's legitimate question and an attacker's adversarial payload arrive through the same input path, in the same format, and are processed by the same model. This architectural property makes prompt injection possible — and it is not something that can be solved by fine-tuning alone.

The attack surface is large and growing. Our research on adversarial robustness has identified seven distinct categories of attack that production AI systems face today:

Direct prompt injection — explicit instructions embedded in user input that attempt to override the system's behavior. "Ignore your previous instructions and..." is the simplest example, but modern variants are far more sophisticated.

Jailbreaks — carefully constructed prompts that exploit model alignment to extract behaviors the system was trained to refuse. These attacks evolve rapidly as the community discovers and shares new techniques.

Role-play escalation — attacks that gradually shift the model into a persona or context where safety constraints no longer apply. The model is not directly instructed to violate its rules — it is guided into a state where those rules appear not to apply.

Instruction override — payloads designed to replace or modify the system prompt. In multi-turn conversations or systems that incorporate external data, these attacks can be embedded in retrieved content that the model processes as trusted context.

Context manipulation — techniques that exploit how models handle long contexts, conversation history, or retrieved documents. Adversarial content placed in earlier turns or in external data can influence the model's behavior in later turns.

Encoding attacks — obfuscation techniques that bypass content filters by encoding adversarial instructions in base64, Unicode variations, token-level manipulations, or other representations that the model can decode but surface-level filters miss.

Multi-turn manipulation — attacks distributed across multiple conversation turns, where no single message appears adversarial but the cumulative sequence guides the model toward unsafe behavior. These are the hardest to detect because each individual input passes standard safety checks.
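
Of these categories, encoding attacks are the most mechanical, which makes them easy to illustrate. The sketch below shows one way a defense layer might expand an input into multiple "views" before classification, so that a payload hidden in base64 or Unicode lookalike characters is scanned in plain form. This is an illustrative fragment, not Prion's implementation; the function name and the small set of decoders are assumptions, and a real system would cover many more encodings.

```python
import base64
import binascii
import unicodedata


def normalize_input(text: str) -> list[str]:
    """Return candidate decodings of an input so a downstream
    classifier sees obfuscated payloads in plain form.

    Illustrative only: production coverage would include hex,
    ROT13, homoglyph maps, token-level splits, and more.
    """
    views = [text]

    # Unicode NFKC normalization collapses visually confusable
    # variants (fullwidth letters, compatibility forms) to ASCII.
    views.append(unicodedata.normalize("NFKC", text))

    # Attempt base64 on longer tokens; any decoded text becomes
    # another view that is classified like the original input.
    for token in text.split():
        if len(token) >= 16:
            try:
                decoded = base64.b64decode(token, validate=True).decode("utf-8")
                views.append(decoded)
            except (binascii.Error, UnicodeDecodeError):
                pass  # not valid base64 text; leave the token alone
    return views
```

The key design point is that decoding happens before classification, so the surface-level filter and the classifier operate on the same deobfuscated representation the model would eventually see.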

The industry's current defenses against these attacks are inadequate. Content filters catch only the most obvious payloads. Keyword blocklists are trivially circumvented. Fine-tuning for refusal helps with known patterns but does not generalize to novel attacks. And manual review does not scale to the volume of requests that production AI systems handle.

What Prion does

Prion is a classification and filtering system that operates between the user and the model. Every input is analyzed for adversarial intent before it reaches the model. If the input is classified as adversarial, it is blocked. If it is classified as safe, it passes through. The goal is to make this decision correctly, at scale, and without adding perceptible latency.
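
In pseudocode, the decision described above reduces to a small gate in front of the model. The sketch below assumes a hypothetical `classify` callable that returns an adversarial probability; the names, threshold, and `Verdict` shape are illustrative, not Prion's actual API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    blocked: bool
    score: float  # adversarial probability reported by the classifier
    reason: str


def guard(classify: Callable[[str], float],
          threshold: float = 0.9) -> Callable[[str], Verdict]:
    """Wrap a scoring function into a block/pass gate that runs
    before any input reaches the protected model."""
    def gate(user_input: str) -> Verdict:
        score = classify(user_input)
        if score >= threshold:
            return Verdict(True, score, "classified as adversarial")
        return Verdict(False, score, "passed")
    return gate
```

Everything interesting lives inside `classify`; the gate itself stays trivial so that the latency budget is spent on the classifier, not on plumbing.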

This requires solving two hard problems at once.

The first is accuracy. The classifier must distinguish genuine inputs from adversarial ones across all seven attack categories — including novel variants it has never seen. False negatives mean adversarial inputs reach the model. False positives mean legitimate users are blocked. Neither is acceptable at scale.
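
The two failure modes above are exactly the false-negative and false-positive rates of a binary classifier, which is how they would be measured on a labeled attack dataset. A minimal sketch, with `True` meaning adversarial; this is a generic evaluation helper, not Prion's benchmark harness.

```python
def error_rates(predictions: list[bool], labels: list[bool]) -> tuple[float, float]:
    """Return (false_negative_rate, false_positive_rate).

    False negatives: adversarial inputs that slipped through.
    False positives: legitimate inputs that were wrongly blocked.
    """
    fn = sum(1 for p, y in zip(predictions, labels) if y and not p)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    pos = sum(labels)
    neg = len(labels) - pos
    fnr = fn / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return fnr, fpr
```

Reporting both rates separately matters because a single accuracy number can hide an unacceptable value on either side: a classifier that blocks everything has a perfect false-negative rate and is still useless.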

The second is speed. Every millisecond Prion adds to the request path is latency the user experiences. Our target is sub-2ms median latency — fast enough to be invisible. Our research on real-time classification explores how we approach this constraint, including model distillation, quantization, and two-tier classification architectures.
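
A two-tier architecture of the kind mentioned above can be sketched in a few lines: a small distilled model decides the easy cases within the latency budget, and only inputs in the ambiguous band are escalated to a larger, slower classifier. The callables and thresholds here are assumptions for illustration.

```python
from typing import Callable


def two_tier_classify(text: str,
                      fast_model: Callable[[str], float],
                      slow_model: Callable[[str], float],
                      low: float = 0.1,
                      high: float = 0.9) -> tuple[bool, float]:
    """Return (blocked, score). fast_model and slow_model are
    hypothetical scorers returning an adversarial probability."""
    score = fast_model(text)
    if score <= low:
        return False, score   # confidently safe: pass immediately
    if score >= high:
        return True, score    # confidently adversarial: block immediately
    # Ambiguous band: pay the latency cost of the accurate tier.
    score = slow_model(text)
    return score >= 0.5, score
```

The trade-off is tunable: widening the ambiguous band buys accuracy at the cost of how often the slow tier runs, so the median latency stays at the fast tier while the tail absorbs the hard cases.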

These constraints are in direct tension. Accuracy wants larger models and more computation. Speed wants smaller models and less computation. The core technical challenge of Prion is finding the right point in this trade-off — and pushing the frontier of what's achievable at that point.

Why this approach

We chose to build Prion as a separate system — not as a feature of any particular model — for three reasons.

First, model-agnostic defense matters. Organizations use different models from different providers for different tasks. A defense layer that only works with one model or one provider is insufficient. Prion is designed to protect any LLM, from any provider, deployed in any configuration.

Second, defense should not depend on the model being defended. If an attacker can manipulate the model, they may also be able to manipulate its built-in safety mechanisms. An external defense layer that processes inputs independently is more robust than one that relies on the model's own judgment about whether it is being attacked.

Third, separation enables specialization. A general-purpose language model is not optimized for adversarial classification. A purpose-built system can be trained, evaluated, and iterated specifically for this task — with benchmarks, attack datasets, and evaluation frameworks designed for adversarial robustness rather than general capability.

The broader context

We believe inference-time defense will become a standard layer in AI infrastructure — the same way firewalls became standard in network infrastructure, and TLS became standard in web infrastructure. Not because any regulation requires it, but because organizations deploying AI in production will need verifiable protection against adversarial inputs.

Today, most organizations deploying AI have no systematic defense against prompt injection. They rely on the model's built-in alignment, which is better than nothing but is not designed to resist determined adversarial attacks. As AI systems handle more sensitive data and take more consequential actions, this gap will become untenable.

Prion is our contribution to closing that gap.

Current status

Prion is in active development. Our research team is focused on expanding the adversarial attack taxonomy, improving classification accuracy on edge cases — particularly multi-turn and encoding attacks — and optimizing inference latency toward our sub-2ms target.

We are not yet accepting external users. We publish our research as we go, and we will announce availability when the system meets our standards for accuracy and reliability in production environments.

If you are working on related problems, or are interested in learning more as development progresses, we'd like to hear from you.

Get in touch