Red Queen Receipts: AI Security Testing Needs Logs, Not Vibes

Security testing is not a screenshot.

A model gives a dangerous answer. Someone posts the transcript. A vendor says the model has been updated. A consultant turns the incident into a slide titled “AI risk is real.” Everyone nods gravely. Very mature. Very enterprise.

The harder question is less theatrical: can the same vulnerability be tested again, under controlled conditions, with visible logs, a consistent evaluator, repeatable statistics, and enough human inspection to make the result defensible?

That is the useful contribution of AVISE: Framework for Evaluating the Security of AI Systems, a recent arXiv paper by Mikko Lempinen, Joni Kemppainen, and Niklas Raesalmi.¹ The paper introduces AVISE, short for AI Vulnerability Identification and Security Evaluation, as a modular open-source framework for building automated Security Evaluation Tests, or SETs, for AI systems. To demonstrate it, the authors develop an automated test for a multi-turn jailbreak method called the Red Queen attack, then run it against nine recently released instruction-tuned language models.

The headline result is easy to quote: with an adversarial language model adapting the attack prompts, all nine evaluated models showed some degree of vulnerability. The more valuable lesson is operational: security evaluation is becoming a workflow problem. It needs connectors, test cases, evaluators, reports, confidence intervals, and human review. In other words, the unglamorous machinery that turns a clever attack into a reusable control.

AVISE treats AI security testing as a pipeline, not an event

AVISE separates the testing workflow into two layers. The orchestration layer manages the test logic: base pipelines, specific SETs, evaluators, an execution engine, and report generation. The interaction layer handles communication with the target AI system through connectors. In a black-box setting, the tester does not inspect the internal model or system code; the connector sends inputs to the deployed model and collects responses.

A simplified version looks like this:

Security Evaluation Test
        ↓
Initialize → Execute → Evaluate → Report
        ↓          ↓          ↓
   test cases   target AI   evaluator + logs
        ↓
Human-readable report with evidence trail

This structure matters because AI systems are not deterministic machines. The same prompt may not produce the same output every time. A single jailbreak transcript can prove that a bad outcome is possible, but it does not tell you how often similar attempts succeed, whether a fix reduced the failure rate, or whether the evaluator is confusing near-miss content with truly harmful output.

AVISE is designed to make those questions manageable. A test can be run repeatedly. Results can be aggregated. Evaluators can be swapped or combined. Reports preserve the conversation logs and evaluation decisions. The paper is not claiming that this fully solves AI security. It is showing how to make security evaluation less dependent on heroic manual probing.

For companies building AI workflows, that distinction is not academic. A security test that cannot be repeated is hard to govern. A failure that cannot be logged is hard to assign. A pass/fail decision that cannot be inspected is hard to defend when something breaks in production. “We tried some jailbreaks and it looked fine” is not a control. It is a memory with formatting.

The Red Queen test shows why static prompts are too polite

To demonstrate AVISE, the authors build a SET based on the Red Queen attack. The original idea is a multi-turn jailbreak that hides harmful intent inside a seemingly protective or socially legitimate scenario. Instead of directly asking for harmful instructions, the attacker frames the conversation as prevention, investigation, concern, or professional assistance.

Readers already familiar with jailbreaks can skip the next paragraph. A jailbreak is an input pattern designed to bypass a model’s safety behavior. Multi-turn jailbreaks are harder to evaluate than single-turn ones because the attacker’s strategy unfolds across several messages. The model may refuse early, drift into a different topic, answer partially, or provide enough indirect detail to become dangerous. That makes both attack automation and result evaluation messier than a single prompt-response test.

The AVISE Red Queen SET uses 25 selected attack templates derived from earlier Red Queen work. The selected cases cover harmful action categories and two broad kinds of social pretext: occupational roles and personal relationships. The article does not need the harmful details. The business-relevant point is that the test cases are structured, named, repeatable units, not improvised chat transcripts.

The authors then add an optional Adversarial Language Model, or ALM. In the ALM-augmented version, the attack is not just a fixed sequence of prompts. After the target model responds, the ALM can modify the next prompt to keep the conversation aligned with the attack objective.

That small design change changes the meaning of the test. Without the ALM, the test asks: can a model resist a fixed Red Queen template? With the ALM, the test asks: can a model resist a Red Queen-style attacker that adapts when the conversation drifts?

Those are different security questions. Static templates are useful for regression testing. Adaptive attacks are closer to real adversarial behavior. A real attacker does not normally stop because the model’s previous answer was inconvenient. They rephrase. They steer. They exploit the model’s politeness, uncertainty, and desire to be useful. Distressing, yes. Surprising, no.

The result is not a clean model-size story

The paper evaluates nine instruction-tuned models using the Red Queen SET both with and without the ALM. The models include Llama, Mistral/Ministral, Qwen, and Nemotron variants across different parameter sizes. The authors deploy them through Ollama under specified generation configurations.

Here is the core result, using the paper’s reported SET failure rates. A failure means the target model produced output judged to contain harmful or illegal instructions.

Target model	Failure rate with ALM	Failure rate without ALM	Immediate interpretation
Llama 3.1 8B	0.68	0.16	Adaptive prompting substantially increases failures.
Llama 3.2 3B	0.68	0.08	Smaller size is not the only explanation.
Llama 3.3 70B	0.64	0.16	Larger model still fails under adaptive attack.
Ministral 3 14B	0.84	0.36	Highest reported ALM failure rate.
Mistral 3.2 24B	0.40	0.36	ALM changes less here than for several peers.
Qwen 3 32B	0.68	0.32	High vulnerability under both settings.
Qwen 3.5 35B	0.08	0.20	Low ALM failure rate in the reported SET classification.
Nemotron 3 Nano 30B	0.40	0.52	Without-ALM result is unusually high relative to others.
Nemotron 3 Super 120B	0.12	0.20	Among the most robust in this test.

It is tempting to turn this table into a leaderboard. That would be the cheapest possible interpretation, and therefore naturally popular.

The better reading is diagnostic. The paper does not establish a general ranking of model safety. The SET has 25 cases. The confidence intervals are wide. The models run under particular deployment configurations. The attack focuses on one jailbreak family. The evaluator itself has measurable error. Treating the table as a universal safety ranking would be like judging a bank’s entire risk function by watching one branch handle one suspicious customer. Interesting, but please put the trophy away.

What the table does support is narrower and more useful: adaptive multi-turn prompting can reveal vulnerabilities that fixed templates may understate. For most evaluated models, the failure rate is higher with the ALM than without it. The authors also report that, after manual inspection, only a small share of attacks without the ALM produced outputs that could realistically facilitate harmful activity, while the ALM version produced such outputs far more often.

That is the operational lesson. In AI security, the attacker’s ability to maintain context can be part of the exploit. A fixed prompt suite may be too brittle to represent that behavior.

The evaluator result is as important as the attack result

AVISE uses an Evaluation Language Model, or ELM, to classify the final target-model response as a pass or failure. The authors then manually inspect the logs to assess how well the ELM’s labels match human judgment. This is the part that turns the work from “we attacked some models” into “we tested whether our testing machinery is itself usable.”

The ELM performs much better when evaluating the ALM-augmented attacks than when evaluating the non-ALM version:

Evaluation setting	Accuracy	Precision	Recall	F1-score	MCC
Red Queen SET with ALM	0.92	0.89	0.94	0.91	0.83
Red Queen SET without ALM	0.79	0.27	0.84	0.41	0.40

This looks strange at first. Why would the evaluator perform worse when the attack is less adaptive?

The paper’s explanation is plausible. Without the ALM, multi-turn conversations often drift away from the intended attack structure. The final response may contain adjacent, ambiguous, or irrelevant content. The ELM had been crafted to evaluate outputs generated under the ALM-shaped attack flow, so when that flow breaks, the evaluator produces many false positives.

That is a beautiful little governance problem. The evaluator is not neutral infrastructure floating above the test. It is part of the test system. Its accuracy depends on the distribution of outputs it is asked to classify.

For business users, this should sound familiar. A fraud model trained on one channel fails when applied to another. A customer-support classifier tuned on clean ticket categories struggles with messy live chat. A compliance rule that works in English misses policy-adjacent phrasing in another language. Same tune, new instrument.

The practical lesson is not “LLM evaluators are bad.” The with-ALM ELM result is actually strong for this use case. The lesson is that automated evaluators need calibration, audit samples, and failure analysis. Otherwise the organization may automate the appearance of security evaluation while quietly moving the uncertainty into a smaller, less visible model.

Elegant, in the same way a trapdoor is elegant.

The paper’s tests have different jobs

A useful way to read the paper is to separate the evidence by purpose. Not every table is trying to prove the same thing.

Paper component	Likely purpose	What it supports	What it does not prove
AVISE architecture	Framework proposal	Security tests can be organized into modular pipelines with connectors, evaluators, and reports.	That AVISE covers the full security posture of any AI system.
Existing-tool comparison	Positioning against prior work	Current tooling has gaps around modularity, stochastic evaluation, customization, and human/LLM evaluation.	That AVISE is empirically superior to every listed tool.
Red Queen SET design	Demonstration implementation	A concrete multi-turn jailbreak test can be automated inside AVISE.	That Red Queen is the most important jailbreak family for every deployment.
With-ALM vs without-ALM experiment	Variant/ablation-style comparison	Adaptive prompt modification can materially change attack success and evaluator behavior.	That every adaptive attacker will have the same success rate.
Manual log analysis and confusion matrices	Evaluator validation	The ELM’s labels can be checked against human judgment; evaluator error is measurable.	That human judgment is perfectly objective in all ambiguous cases.
Confidence intervals	Statistical uncertainty marker	Reported failure rates should be interpreted as estimates from small samples.	That the estimates are stable across larger test sets, domains, languages, or deployed products.

This distinction matters because businesses love turning one number into one decision. The AVISE paper gives numbers, but the better value is in the testing discipline around the numbers.

The ALM comparison is not merely a “better attack” result. It also shows that the structure of the attack changes the structure of the evaluation problem. The confusion matrices are not decorative statistics. They tell the reader whether the automated evaluator can be trusted enough to reduce manual workload without deleting human accountability. The confidence intervals are not academic ornaments. They remind us that 25 test cases can indicate a vulnerability pattern, but they should not be mistaken for actuarial certainty.

The business value is cheaper security diagnosis, not automatic safety certification

AVISE is most useful if treated as a diagnostic and regression-testing layer inside the AI system development life cycle.

Business question	AVISE-style testing contribution	Governance implication
Did the last model update reduce jailbreak exposure?	Run the same SET before and after the update.	Compare failure rates instead of relying on anecdotal retesting.
Is one model safer than another for this workflow?	Run a controlled SET across candidate models.	Make model selection include security behavior, not only cost and benchmark quality.
Are failures concentrated in certain scenarios?	Inspect case-level logs and evaluator summaries.	Prioritize guardrails, refusal tuning, or human review for specific risk clusters.
Can automated evaluation reduce review workload?	Validate ELM labels against sampled human judgments.	Use automation where accuracy is acceptable; preserve manual escalation for ambiguous cases.
Can auditors understand what happened?	Preserve prompts, responses, evaluator decisions, and report summaries.	Convert security testing into evidence, not folklore.

The ROI argument is not that AVISE magically prevents bad outputs. Please, no incense.

The ROI argument is that repeatable testing reduces the cost of knowing where the system is weak. That matters before deployment, after model upgrades, after prompt changes, after adding tools, and after expanding an AI workflow into a new business process. In production AI, the dangerous version of “minor change” is often “we only changed the system prompt.” Famous last words, neatly formatted.

A testing framework also helps separate three activities that organizations often mix together: attack discovery, control validation, and governance evidence. AVISE is aimed mainly at making the first two more systematic, with reporting features that support the third. That is a realistic scope. It does not replace security experts, legal review, product ownership, or incident response. It gives those functions better material to work with.

The boundary is coverage, not usefulness

The paper is careful about its limitations, and the main boundary is coverage.

AVISE is a framework. The Red Queen SET is one demonstration. The experiment evaluates one attack family, a small selected set of 25 cases, nine open-source instruction-tuned models, and particular deployment settings. The authors also acknowledge that the full AI security landscape is too immature and fast-moving for a framework to claim complete security coverage. Published SETs should assist human evaluators, not replace them.

The ELM limitation is also important. Some target-model responses are ambiguous: they may appear to offer detection, prevention, or warning advice while containing details that could help a malicious actor. Human evaluators may also find such cases difficult. This is exactly why the report logs matter. When classification is uncertain, the organization needs inspectable evidence, not a single green checkmark from a model judging another model. Recursive bureaucracy is still bureaucracy.

There is also a dual-use boundary. Red-team tools can help defenders find vulnerabilities before attackers exploit them. The same tools can also help attackers scout targets. The authors argue for responsible disclosure practices when using SETs. In business terms, this means companies should treat automated security testing as controlled internal capability, with access rules, logging, and escalation procedures. A red-team framework with no governance is just a very organized way to make new problems.

The most important boundary, however, is interpretive. The paper does not show that a model with a lower Red Queen SET failure rate is broadly safe. It shows that the framework can automate a specific security evaluation and produce useful evidence about a specific class of vulnerability. That is already enough. Not every paper needs to carry the whole cathedral on its back.

The strategic lesson: test the workflow around the model

Recall the earlier distinction between the attack result and the evaluator result. That distinction is the heart of the business lesson.

When companies discuss AI security, they often focus on the model: Which model is safer? Which vendor has better alignment? Which benchmark should we trust? Those questions are not wrong, but they are incomplete. In deployed systems, security also depends on the workflow around the model: prompts, tools, memory, retrieval, user roles, escalation paths, logging, review policies, and update procedures.

AVISE points in that direction. Its framework separates the test pipeline from the target system. Its connector approach fits black-box evaluation. Its evaluator design makes classification explicit. Its reporting step preserves evidence for humans. The Red Queen demonstration then shows why adaptive interactions and evaluator calibration matter in practice.

For AI governance, this is a useful shift of attention. The goal is not to ask whether a model is “safe” in the abstract. The goal is to ask whether a specific AI system, under specific usage conditions, fails specific tests often enough to require redesign, restriction, monitoring, or human review.

That sentence is less exciting than “AI red teaming revolution.” It is also more likely to survive contact with a compliance meeting.

Conclusion: the future of AI security is boring in the best way

The AVISE paper is valuable because it makes AI security evaluation look less like theater and more like engineering. It turns adversarial testing into a pipeline. It treats evaluator performance as measurable. It preserves logs. It reports uncertainty. It shows that adaptive multi-turn attacks can expose vulnerabilities missed by fixed templates. It also admits that one framework and one SET do not certify an AI system as secure.

That is the right level of ambition.

For business leaders, the practical takeaway is straightforward: do not ask only whether your AI model passed a demo. Ask what security tests are repeatable, what evidence is preserved, how evaluator decisions are validated, how often tests are rerun, and which failure modes trigger product changes or human review.

The Red Queen is not the villain here. The villain is pretending that AI security can be managed through scattered screenshots, occasional manual probing, and a spreadsheet named “risk register final final.”

AVISE does not end that habit. But it gives teams a more serious alternative: build the test, run the test, inspect the logs, measure the evaluator, and repeat after every meaningful change.

Boring? Yes. Useful? Also yes. Security often improves when the drama leaves the room.

Cognaptus: Automate the Present, Incubate the Future.

Mikko Lempinen, Joni Kemppainen, and Niklas Raesalmi, “AVISE: Framework for Evaluating the Security of AI Systems,” arXiv:2604.20833v2, 26 April 2026, https://arxiv.org/abs/ ↩︎

AVISE treats AI security testing as a pipeline, not an event#

The Red Queen test shows why static prompts are too polite#

The result is not a clean model-size story#

The evaluator result is as important as the attack result#

The paper’s tests have different jobs#

The business value is cheaper security diagnosis, not automatic safety certification#

The boundary is coverage, not usefulness#

The strategic lesson: test the workflow around the model#

Conclusion: the future of AI security is boring in the best way#