
Adversarial robustness in language model defense: methods and benchmarks

We are releasing the full text and supporting materials for our research on adversarial robustness in large language model defense systems. This work, conducted over the past nine months by Neuraphic's adversarial research group, proposes a structured taxonomy for classifying attacks against language models and introduces a benchmark suite for evaluating defensive mechanisms across seven distinct attack categories.

The paper is available on our research page. Here, we summarize its principal contributions and discuss their implications for the broader field of AI security.

A taxonomy of adversarial inputs

Existing literature on adversarial attacks against language models tends to treat the problem as monolithic — a model is either robust or it is not. This framing obscures important distinctions between fundamentally different attack strategies that require fundamentally different defenses. Our taxonomy identifies seven categories: prompt injection, jailbreak sequences, role-play escalation, instruction override, context manipulation, encoding attacks, and multi-turn manipulation.

Each category exploits a different aspect of how language models process and respond to input. Prompt injection targets the boundary between system instructions and user content. Jailbreak sequences exploit alignment weaknesses through carefully constructed payloads. Role-play escalation gradually shifts the model's behavioral frame across conversational turns. Instruction override attempts to supersede system-level directives. Context manipulation poisons the information a model uses for grounded responses. Encoding attacks use character-level obfuscation to evade input filters. Multi-turn manipulation distributes malicious intent across a sequence of individually benign messages.
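The seven categories can be represented directly as a closed classification schema. The sketch below is illustrative only; the identifier names are ours, not an API from the paper:

```python
from enum import Enum

class AttackCategory(Enum):
    """The seven attack categories from the taxonomy (names are illustrative)."""
    PROMPT_INJECTION = "prompt_injection"            # system/user boundary attacks
    JAILBREAK_SEQUENCE = "jailbreak_sequence"        # crafted alignment-bypass payloads
    ROLE_PLAY_ESCALATION = "role_play_escalation"    # gradual behavioral-frame shifts
    INSTRUCTION_OVERRIDE = "instruction_override"    # supersedes system directives
    CONTEXT_MANIPULATION = "context_manipulation"    # poisons grounding information
    ENCODING_ATTACK = "encoding_attack"              # character-level obfuscation
    MULTI_TURN_MANIPULATION = "multi_turn_manipulation"  # intent spread across turns
```

Treating the taxonomy as a closed enumeration, rather than free-form labels, is what makes downstream per-category routing and per-category metrics well-defined.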

Benchmark design

We constructed a benchmark of 14,200 adversarial samples across the seven categories, balanced against 28,000 benign samples drawn from production traffic patterns. Each adversarial sample was generated through a combination of automated methods and manual crafting by our red team. Critically, we included samples at varying levels of sophistication — from naive attacks that any keyword filter would catch to highly obfuscated inputs designed to evade state-of-the-art classifiers.
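A benchmark sample of this kind needs at minimum a label, a category, and a sophistication level. One plausible record shape, assuming a small integer scale for sophistication (the field names and scale are our illustration, not the released dataset's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BenchmarkSample:
    """One labeled record in an adversarial-robustness benchmark (hypothetical schema)."""
    text: str                  # the raw input presented to the model
    is_adversarial: bool       # False for the benign samples drawn from traffic
    category: Optional[str]    # one of the seven taxonomy categories; None if benign
    sophistication: int        # e.g. 1 (naive, keyword-filterable) .. 5 (highly obfuscated)
```

Keeping `category` as `None` on benign samples, rather than a sentinel string, makes the benign/adversarial split explicit when computing per-category statistics.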

The benchmark evaluates both detection accuracy (can the system correctly identify an adversarial input?) and classification granularity (can it correctly categorize which type of attack is being attempted?). We believe the latter is essential for building effective defenses. A system that can identify an input as adversarial but cannot distinguish between a prompt injection and a multi-turn manipulation cannot apply the appropriate mitigation strategy.
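The two evaluation axes can be scored separately: binary F1 for detection, and category agreement among correctly flagged adversarial samples for granularity. A minimal stdlib-only sketch, with our own choice of how to operationalize granularity:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def evaluate(samples, preds):
    """samples: list of (is_adversarial, category); preds: list of (flagged, predicted_category).
    Returns (detection_f1, granularity). Granularity here is the fraction of
    correctly detected adversarial samples whose category was also correct."""
    tp = sum(1 for (adv, _), (flag, _) in zip(samples, preds) if adv and flag)
    fp = sum(1 for (adv, _), (flag, _) in zip(samples, preds) if not adv and flag)
    fn = sum(1 for (adv, _), (flag, _) in zip(samples, preds) if adv and not flag)
    detected = [(cat, pcat) for (adv, cat), (flag, pcat) in zip(samples, preds) if adv and flag]
    granularity = (sum(1 for cat, pcat in detected if cat == pcat) / len(detected)
                   if detected else 0.0)
    return f1(tp, fp, fn), granularity
```

Separating the two numbers makes the failure mode in the text visible: a system can score a perfect detection F1 while its granularity, and hence its ability to pick a mitigation, is near chance.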

Results across attack categories

We evaluated five defensive architectures against our benchmark, including two commercially deployed systems and three research prototypes. The results reveal significant variation in performance across attack categories. All systems performed well against prompt injection (F1 scores above 0.91) and encoding attacks (F1 above 0.88). Performance degraded substantially for multi-turn manipulation (best F1 of 0.73) and role-play escalation (best F1 of 0.69).

These findings confirm a pattern we have observed in our own development work: attacks that distribute intent across time are significantly harder to detect than attacks concentrated in a single input. This has direct architectural implications. Effective defense against the full spectrum of adversarial strategies requires systems that maintain state across conversational turns — a requirement that most current classification approaches do not meet.
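The architectural requirement can be made concrete with a minimal stateful detector: accumulate per-turn suspicion scores over a sliding window so that intent distributed across individually benign messages still crosses a threshold. The window size, threshold, and scoring interface below are illustrative assumptions, not a mechanism described in the paper:

```python
from collections import deque

class MultiTurnDetector:
    """Sketch of a stateful detector. Each turn contributes a suspicion score
    (from any single-turn classifier); the detector flags the conversation
    once the windowed sum crosses a threshold, even if no single turn would."""

    def __init__(self, window: int = 8, threshold: float = 1.5):
        self.scores = deque(maxlen=window)  # evidence older than `window` turns expires
        self.threshold = threshold

    def observe(self, turn_score: float) -> bool:
        """Record one turn's score; return True if the conversation is now flagged."""
        self.scores.append(turn_score)
        return sum(self.scores) >= self.threshold
```

A stateless classifier is the special case `window=1`, which is exactly why it cannot see an attack whose per-turn scores each sit below the threshold.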

Implications for applied defense

This research directly informs the design of Prion, our inference-layer security system. The taxonomy provides the classification schema that Prion uses to route detected threats to category-specific mitigation handlers. The benchmark serves as a regression test for each iteration of our detection models.
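Routing detected threats to category-specific handlers is, at its core, a dispatch table keyed on the taxonomy. The handler names and behaviors below are hypothetical stand-ins, not Prion's actual API:

```python
# Hypothetical mitigation handlers; names and behaviors are illustrative only.
def quarantine(message: str) -> str:        return "quarantined"
def sanitize_encoding(message: str) -> str: return "sanitized"
def reset_persona(message: str) -> str:     return "persona_reset"

MITIGATION_HANDLERS = {
    "prompt_injection":     quarantine,
    "encoding_attack":      sanitize_encoding,
    "role_play_escalation": reset_persona,
    # ...one handler per taxonomy category
}

def route(category: str, message: str) -> str:
    """Dispatch a detected threat to its category-specific handler,
    falling back to the most conservative mitigation for unknown labels."""
    handler = MITIGATION_HANDLERS.get(category, quarantine)
    return handler(message)
```

The conservative fallback matters: a classifier that detects an attack but emits an unrecognized category label should degrade to the strictest mitigation rather than to no mitigation at all.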

We are releasing the benchmark dataset under an open license to enable independent evaluation and to establish a common standard for measuring progress in adversarial defense. The field cannot improve what it does not measure, and it cannot measure what it has not agreed to define. We hope this work contributes to both.

Open questions

Several important problems remain unresolved. The relationship between model scale and adversarial vulnerability is not well understood — larger models appear more robust to some attack categories and less robust to others, for reasons that resist simple explanation. The arms race between attack and defense methods shows no sign of reaching equilibrium. And the question of how to evaluate defenses against attacks that have not yet been invented remains, by definition, open.

We will continue publishing our findings as this research progresses. The full paper, benchmark data, and evaluation code are available on our research page.