An answer can look safe and still leave fingerprints.
That is the uncomfortable point behind GuardTrace-VL: Detecting Unsafe Multimodal Reasoning via Iterative Safety Supervision [1]. The paper is not merely saying that multimodal models can be unsafe. We knew that. Congratulations, the fire is hot. Its sharper claim is architectural: once a model reasons over both images and text, the safety problem no longer lives only at the input or the final answer. It also lives in the middle.
That middle is where modern reasoning models interpret visual context, infer user intent, consider possible actions, and decide how to respond. In the old question-answer world, a safety guard could inspect the prompt and the output. In the new question-thinking-answer world, that is like auditing a company by reading only the first email and the final invoice. The interesting misconduct may have happened in the spreadsheet no one opened.
GuardTrace-VL is built around that missing spreadsheet. The authors introduce GuardTrace, a multimodal QTA benchmark with 9,862 training examples and 2,000 test examples, each containing an image-text question, a reasoning trace, and a final answer. They then train GuardTrace-VL, a 3B vision-language safety auditor, to judge the full trajectory. On their GuardTrace-Test benchmark, the full model reaches 93.10% average F1, ahead of GPT-5 at 88.86% and LLaMA-4-Guard-12B at 79.55%.
That headline number matters. But the real lesson is not “small model beats big model.” The better lesson is: a safety system trained to inspect the right object can beat a stronger model inspecting the wrong one.
The unsafe part can sit between the prompt and the answer
The familiar misconception is simple: if the final answer refuses the harmful request, the interaction is safe.
For ordinary QA moderation, that assumption is tempting. A user asks something risky. The model says no. The guard sees the refusal and passes the exchange. Case closed, stamp applied, dashboard updated. Everyone gets to go home and pretend governance happened.
Multimodal reasoning breaks that comfort. The model may recognize the risk, reason through visual details, expose sensitive intermediate information, and still end with a polite refusal. The answer looks aligned. The reasoning trace, however, may contain the operationally dangerous part.
The paper frames this as a Question-Thinking-Answer, or QTA, problem:
| Layer | What it contains | Why QA-only moderation can miss it |
|---|---|---|
| Question | Text plus possible image input | Risk may be embedded visually, indirectly, or through jailbreak-style image prompts |
| Thinking | The model’s intermediate reasoning over image and text | Unsafe inferences or procedural details may appear before the final response |
| Answer | The final user-facing output | A safe refusal can hide unsafe reasoning that already occurred |
This is especially important in multimodal settings because the image is not decorative. A visual cue can change the safety meaning of an otherwise vague question. A text-only detector may see a generic request; a vision-aware auditor may see that the image grounds the request in a restricted, harmful, private, or otherwise sensitive context.
The authors contrast three moderation styles. A QA guard sees the question and answer. A text-only reasoning guard sees the textual trace but lacks visual grounding. GuardTrace-VL sees the image, question, reasoning trace, and answer together. That last combination is the mechanism. Without it, the guard either misses the visual threat or gets distracted by the final answer’s surface-level safety.
This is why the paper should not be read as another benchmark announcement. The benchmark is the instrument. The conceptual move is to redefine the safety perimeter.
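The three moderation styles contrasted above can be made concrete with a small Python sketch of a QTA record and of what each guard style gets to inspect. The class `QTARecord`, its field names, and the helpers `view_for` and `blind_spots` are illustrative assumptions, not the paper's data schema. The sketch also assumes the QA guard receives the image along with the question, which the description above leaves open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QTARecord:
    """One Question-Thinking-Answer sample (hypothetical schema)."""
    question_text: str
    image_path: Optional[str]  # None for a no-image input
    thinking: str              # intermediate reasoning trace
    answer: str                # final user-facing output


# Fields each moderation style can inspect.
# Assumption: the QA guard is given the image; the text-only reasoning guard is not.
GUARD_VIEWS = {
    "qa_guard": ("question_text", "image_path", "answer"),
    "text_reasoning_guard": ("question_text", "thinking", "answer"),
    "qta_multimodal_guard": ("question_text", "image_path", "thinking", "answer"),
}


def view_for(guard: str, record: QTARecord) -> dict:
    """Return only the fields the named guard style is allowed to see."""
    return {name: getattr(record, name) for name in GUARD_VIEWS[guard]}


def blind_spots(guard: str) -> set:
    """Fields of the full QTA record that the guard never inspects."""
    all_fields = set(GUARD_VIEWS["qta_multimodal_guard"])
    return all_fields - set(GUARD_VIEWS[guard])


for guard in GUARD_VIEWS:
    print(guard, "misses:", sorted(blind_spots(guard)))
```

Under these assumptions the QA guard's blind spot is the reasoning trace, the text-only reasoning guard's blind spot is the image, and only the full QTA view has none.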
GuardTrace turns the hidden middle into a measurable object
The paper’s first contribution is the GuardTrace dataset. Its value is not just size, though 9,862 training and 2,000 test instances are enough to support a serious evaluation. Its value is that it turns multimodal reasoning safety into a supervised object.
The construction pipeline has three stages.
First, the authors start from text-based safety prompts, mainly from S-Eval, and expand them into multimodal inputs. They include several input forms: no-image baselines, irrelevant images, semantically aligned images, and jailbreak-oriented image constructions such as FigStep, HADES, and CS-DJ. This matters because multimodal safety failures are not all the same. Some images distract. Some clarify intent. Some carry the attack payload.
Second, they generate complete QTA triples using multimodal large reasoning models. For the training data, they use open-source reasoning models such as Qwen3-VL-30B-A3B-Thinking, Kimi-VL-A3B-Thinking, and GLM-4.1V-9B-Thinking. For the test set, they also include traces from closed-source models such as GPT-5-mini, Qwen3-VL-Plus, and doubao-seed-1.6. The paper notes that closed-source models are difficult to use for large-scale training-data collection because API filtering and stronger alignment restrict the diversity of unsafe traces. That is not a minor logistical footnote; it shapes what kind of safety data can realistically be built.
Third, the data is annotated through a human-AI collaborative protocol. Three multimodal models produce structured “Analysis-Judgment” labels. Cases with strong agreement become high-confidence training material. Ambiguous cases are reviewed by human experts. The labels use three levels: safe, potentially harmful, and harmful. For binary evaluation, the paper maps “potentially harmful” into the harmful class, a conservative choice that fits moderation practice but also simplifies the underlying nuance.
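The labeling logic described above has two mechanical pieces: summarising three judge votes, and collapsing the three-level label for binary evaluation. Here is a minimal Python sketch of both. The function names and the agreement-level names (`unanimous`, `split`, `no_majority`) are my own; how each level is routed afterwards (training pool, preference pair, expert review) is a policy decision this sketch deliberately leaves out.

```python
from collections import Counter

LABELS = ("safe", "potentially harmful", "harmful")


def to_binary(label: str) -> str:
    """Collapse the three-level label for binary evaluation.

    'potentially harmful' is folded into 'harmful', the conservative
    mapping described in the text.
    """
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return "safe" if label == "safe" else "harmful"


def agreement(votes: list) -> tuple:
    """Summarise three judge votes as (agreement_level, majority_label).

    Levels: 'unanimous' (3:0), 'split' (2:1), 'no_majority' (1:1:1).
    majority_label is None when no label has at least two votes.
    """
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    for vote in votes:
        if vote not in LABELS:
            raise ValueError(f"unknown label: {vote!r}")
    label, count = Counter(votes).most_common(1)[0]
    if count == 3:
        return "unanimous", label
    if count == 2:
        return "split", label
    return "no_majority", None


print(agreement(["harmful", "safe", "harmful"]))
print(to_binary("potentially harmful"))
```

Note that a vote of safe, potentially harmful, harmful has no majority at three levels but becomes a 2:1 harmful majority after the binary mapping. That is one concrete way the conservative mapping simplifies the underlying nuance.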
The dataset covers eight broad risk categories inherited from S-Eval: crime and illegal activity, hate speech, physical and mental health, ethics and morality, data privacy, cybersecurity, extremism, and inappropriate suggestions. That breadth matters less because it is exhaustive—it is not—and more because it prevents the benchmark from becoming a one-trick jailbreak demo.
The test set is also deliberately split. S-Eval-VL and HADES-Eval serve as in-domain evaluations. MM-Eval and MMJ-Eval serve as out-of-distribution tests. The latter is especially useful because a safety model that only performs well on the exact attack style it saw during training is not a guardrail. It is a memorized burglar alarm.
The training pipeline teaches the model where ambiguity lives
GuardTrace-VL is trained from Qwen2.5-VL-3B-Instruct, but the interesting part is the curriculum.
The authors use a three-stage process:
| Stage | Data used | Likely purpose | What it supports |
|---|---|---|---|
| SFT | 4,625 high-confidence examples with unanimous agreement | Main training foundation | Learn the safety taxonomy and basic QTA judgment format |
| DPO | 4,950 preference pairs from 2:1 voting splits | Boundary learning | Learn distinctions among disputed safety judgments |
| OGDPO | 1,013 refined examples, including 726 hard negatives and 287 expert-resolved ambiguous cases | Hard-case refinement | Improve performance near the messy edge of safety classification |
This sequence is more than standard “train, tune, improve” choreography. It reflects how safety work actually functions in organizations. Easy cases are not the problem. The real cost sits in borderline decisions: a medical-looking question that becomes harmful through visual context, a cybersecurity explanation that drifts from defensive to enabling, a legal or financial response that sounds analytical but crosses into reckless advice.
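The counts in the curriculum table can be sanity-checked with a few lines of Python. The arithmetic is certain. The interpretation is not: the fact that the unanimous, 2:1, and expert-resolved pools add up to the training-set size suggests they partition it, but that reading is my inference from the totals rather than something stated above.

```python
# Counts as reported in the curriculum table.
SFT_EXAMPLES = 4_625           # unanimous agreement
DPO_PAIRS = 4_950              # 2:1 voting splits
OGDPO_HARD_NEGATIVES = 726
OGDPO_EXPERT_RESOLVED = 287
OGDPO_TOTAL = 1_013
TRAIN_SET_SIZE = 9_862         # GuardTrace training examples

# The OGDPO row is internally consistent if its two parts sum to its total.
ogdpo_sum = OGDPO_HARD_NEGATIVES + OGDPO_EXPERT_RESOLVED

# Unanimous + 2:1 split + expert-resolved pools.
pool_sum = SFT_EXAMPLES + DPO_PAIRS + OGDPO_EXPERT_RESOLVED

print(f"OGDPO components sum to {ogdpo_sum} (reported {OGDPO_TOTAL})")
print(f"SFT + DPO + expert-resolved = {pool_sum} (training set {TRAIN_SET_SIZE})")
```

If the partition reading is right, the 726 hard negatives would be drawn from data already counted in an earlier pool rather than being new samples. Treat that as a hypothesis to check against the paper.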
The ablation table makes the role of this curriculum visible. The base Qwen2.5-VL-3B-Instruct model performs poorly as a QTA safety judge: F1 is 43.61 on S-Eval-VL and 34.27 on HADES-Eval. After supervised fine-tuning, F1 jumps to 89.89 and 94.14 on those same subsets. DPO adds smaller but consistent gains. OGDPO adds another lift, bringing the full model to 93.33 on S-Eval-VL, 95.88 on HADES-Eval, 91.31 on MM-Eval, and 92.39 on MMJ-Eval.
The magnitude tells a useful story. SFT supplies the main competence. DPO and OGDPO refine judgment at the boundary. In business language, the first stage builds the junior auditor; the preference stages teach it not to embarrass itself when the case is not in the handbook.
That distinction matters because companies often overinvest in generic model strength and underinvest in the evaluation object. A bigger general model may reason better, but if it is not trained to inspect the full multimodal trajectory under a safety taxonomy, it may still miss the dangerous middle.
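The stage-by-stage F1 figures quoted above make the "junior auditor first, judgment later" story easy to quantify. The Python sketch below splits the total improvement on the two in-domain subsets into the SFT jump and the later refinement. It uses only the numbers quoted in this article; the intermediate DPO-only scores are not quoted here, so DPO and OGDPO are lumped together as "refinement".

```python
# F1 scores quoted in the text (base model, after SFT, full model).
F1 = {
    "S-Eval-VL": {"base": 43.61, "sft": 89.89, "full": 93.33},
    "HADES-Eval": {"base": 34.27, "sft": 94.14, "full": 95.88},
}


def stage_gains(scores: dict) -> dict:
    """Split the total F1 improvement into the SFT jump and later refinement."""
    sft_gain = scores["sft"] - scores["base"]
    refine_gain = scores["full"] - scores["sft"]
    total_gain = scores["full"] - scores["base"]
    return {
        "sft_gain": round(sft_gain, 2),
        "refine_gain": round(refine_gain, 2),
        "sft_share": round(sft_gain / total_gain, 3),
    }


for subset, scores in F1.items():
    print(subset, stage_gains(scores))
```

On both subsets SFT accounts for more than nine tenths of the total F1 gain. The preference stages add only a few points, but those points sit at the boundary where moderation decisions are contested.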
The benchmark results support the architecture, not just the leaderboard
The main experimental table compares GuardTrace-VL against moderation APIs, prompted general-purpose models, safety-aligned multimodal models, and dedicated guard models. The results are straightforward:
| Model | Average Accuracy | Average F1 |
|---|---|---|
| OpenAI Moderation API | 67.25 | 64.86 |
| GPT-5 | 88.50 | 88.86 |
| Qwen3-VL-Plus | 85.30 | 87.54 |
| Qwen2.5-VL-32B-Instruct | 83.75 | 84.93 |
| LLaMA-4-Guard-12B | 77.51 | 79.55 |
| GuardReasoner-VL-7B | 77.75 | 74.32 |
| GuardTrace-VL-3B | 93.00 | 93.10 |
The clean interpretation: GuardTrace-VL performs best on the authors’ multimodal QTA benchmark.
The more useful interpretation: the benchmark rewards three capabilities at once. The guard must understand the image. It must inspect the reasoning trace. It must resist being reassured by a safe-looking final answer. Models designed for only one or two of these capabilities are structurally disadvantaged.
This is also why the comparison with GPT-5 should be read carefully. GPT-5 is a stronger general-purpose model, but it is being used as a prompted safety evaluator. GuardTrace-VL is a smaller model specialized for this exact supervision target. So the result does not prove that 3B specialized models generally outperform frontier models. It shows that specialization on the right safety object can outperform general capability on a narrowly defined audit task.
The paper’s multimodal ablation reinforces that point. When visual inputs are replaced by captions and text-based guards evaluate the resulting text-only QTA triples, performance falls behind the direct multimodal model. On MMJ-Eval, GuardTrace-VL reports 92.39 F1, while caption-augmented ReasoningShield reaches 88.85. The gap is not enormous, but it is meaningful. Captioning is a lossy safety interface. It may describe the image, but it does not preserve every visual relation, embedded cue, or adversarial artifact that matters for risk judgment.
That is the boring sentence businesses should pay attention to. “We caption the image and send the text to our existing guardrail” may be cheaper. It may also be precisely where the leak begins.
The appendix tests robustness, not a second thesis
The supplementary experiments should not be overread. They are useful, but they are not the core claim.
On the text-only ReasoningShield-Test benchmark, GuardTrace-VL achieves 88.11% F1, slightly below the specialized text-only ReasoningShield-3B at 90.23%. The authors explain that ReasoningShield is trained on in-domain text-only reasoning data, while GuardTrace-VL is operating out of domain. This is best read as a robustness result: the multimodal QTA guard does not collapse when evaluated on text-only reasoning safety, but it does not dethrone the specialist.
On broader QA moderation benchmarks, including BeaverTails, WildGuard, and SPA-VL-Test, GuardTrace-VL reaches a sample-weighted average of 84.50% F1, ahead of the listed dedicated guard baselines in that table. Again, useful. But the paper is not really about ordinary QA moderation. It is about trajectory-level multimodal safety.
The annotation reliability checks are more central. The authors evaluate their automated annotation pipeline on 150 randomly sampled test instances. A majority vote among three VLM judges reaches 95.33% accuracy and 92.79% F1 against human labels. Qwen3-VL-Plus as an external judge reaches 96.00% accuracy and 96.82% F1. Human annotators also reach a Fleiss’ Kappa of 0.74 on ambiguous cases, which the paper treats as substantial agreement.
These checks support a practical claim: scalable safety datasets probably need hybrid labeling. Pure human annotation is expensive. Pure model annotation is risky. A voting-and-escalation design, where consensus cases are handled automatically and ambiguous cases go to experts, is a plausible middle path. Not glamorous, but governance rarely is. Governance is mostly deciding which mess deserves a human salary.
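For readers who want to see what sits behind the Fleiss' Kappa figure cited above, here is a minimal pure-Python implementation of the standard formula. The toy rating table is invented for illustration and has nothing to do with the paper's annotation data. The number of human annotators is not given here, so the example's three raters are an arbitrary choice.

```python
def fleiss_kappa(counts: list) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] = number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (invented): 4 items, 3 raters, 2 categories.
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(toy), 3))
```

The toy table, with two unanimous items and two 2:1 items, comes out at about 0.33. On the commonly used Landis and Koch scale, values from 0.61 to 0.80 count as substantial agreement, which is the band the reported 0.74 falls into.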
The annotation-prompt ablation also matters. Using the full structured protocol, Qwen2.5-VL-32B-Instruct reaches 90.00% accuracy and 82.76% F1 on a 150-sample annotation test. Removing in-context examples drops F1 to 74.73. Removing structured analysis drops F1 to 70.48. Replacing the tailored prompt with a default LLaMA Guard prompt drops F1 to 59.56.
That is not a footnote about prompt hygiene. It says safety labeling itself is an engineered process. Generic “please classify this as safe or unsafe” prompting is not enough for multimodal reasoning traces. Shocking, I know: the checkbox was not the control.
Business value comes from changing the audit object
The business implication is not that every company should immediately deploy GuardTrace-VL. The dataset is restricted because it contains harmful content, and the paper evaluates curated benchmarks rather than live enterprise workflows. The model is a research artifact, not a compliance product with procurement paperwork and a friendly account executive.
The practical implication is deeper: organizations using multimodal agents should decide whether their safety controls inspect endpoints or trajectories.
| What the paper directly shows | Cognaptus interpretation | Boundary |
|---|---|---|
| QTA-aware multimodal auditing outperforms QA-only and text-only baselines on GuardTrace-Test | Safety architecture should include reasoning-trace inspection where traces are available | Many commercial systems do not expose full reasoning traces to customers |
| Direct image-text processing beats caption-only substitutes in the ablation | Captioning images before moderation can lose safety-critical information | The exact loss depends on domain, image type, and caption quality |
| SFT gives large gains; DPO and OGDPO refine hard cases | Safety models need curricula that include ambiguity, not only obvious violations | The paper does not provide production cost, latency, or false-positive economics |
| Hybrid annotation achieves strong agreement with expert labels | Scalable safety datasets may need AI voting plus human escalation | Human labels still reflect the chosen taxonomy and policy assumptions |
For enterprise AI systems, this points to a three-layer safety design.
First, keep input and output moderation. It still matters. A bad prompt and a bad answer are still bad. Revolutionary stuff.
Second, add multimodal grounding checks when images, screenshots, documents, video frames, or diagrams materially affect the user request. A safety layer that ignores the visual channel is not “modality agnostic.” It is blind.
Third, where reasoning traces or intermediate planning artifacts exist, audit the trajectory. This applies most clearly to agentic workflows: troubleshooting, claims review, document analysis, site inspection, medical triage support, legal intake, cybersecurity assistance, and financial advisory copilots. In those settings, the model’s intermediate interpretation can influence downstream actions even if the final answer is polished.
The ROI is not simply “fewer unsafe outputs.” The better operational value is cheaper diagnosis. If a final response fails, a trajectory-aware auditor can help identify whether the problem came from the visual input, the model’s inferred intent, the reasoning path, or the final wording. That matters for incident review, model improvement, and internal accountability.
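The three-layer design and the diagnosis argument can be sketched together. The Python below is a hedged illustration, not an implementation of GuardTrace-VL: `Interaction`, `AuditReport`, and `audit` are hypothetical names, and the three checks are keyword stubs standing in for whatever moderation models a team actually runs. The point is the control flow: layers are skipped when their input does not exist, and the report records which layer raised the flag.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Interaction:
    prompt: str
    answer: str
    image: Optional[bytes] = None  # None when the request is text-only
    trace: Optional[str] = None    # None when the provider hides reasoning


# A check returns True when it finds a problem in the interaction.
Check = Callable[[Interaction], bool]


@dataclass
class AuditReport:
    layers_run: List[str] = field(default_factory=list)
    flagged_by: List[str] = field(default_factory=list)


def audit(item: Interaction, endpoint: Check, grounding: Check,
          trajectory: Check) -> AuditReport:
    """Run the three layers, skipping those whose inputs are unavailable."""
    report = AuditReport()
    layers = [
        ("endpoint", endpoint, True),                         # always applies
        ("grounding", grounding, item.image is not None),     # needs an image
        ("trajectory", trajectory, item.trace is not None),   # needs a trace
    ]
    for name, check, applicable in layers:
        if not applicable:
            continue
        report.layers_run.append(name)
        if check(item):
            report.flagged_by.append(name)
    return report


# Keyword stubs standing in for real moderation models (purely illustrative).
def endpoint_stub(item: Interaction) -> bool:
    return "step 1" in item.answer.lower()


def grounding_stub(item: Interaction) -> bool:
    return False


def trajectory_stub(item: Interaction) -> bool:
    return "step 1" in item.trace.lower()


example = Interaction(
    prompt="How does this work?",
    answer="I can't help with that.",
    trace="Step 1 would be the risky part, but I should refuse.",
)
report = audit(example, endpoint_stub, grounding_stub, trajectory_stub)
print(report.layers_run, report.flagged_by)
```

In this toy case the endpoint layer passes the polite refusal and only the trajectory layer flags the interaction, which is the failure pattern the paper is built around.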
The uncomfortable constraint: many systems hide the trace
Now the limitation that actually matters.
GuardTrace-VL assumes the auditor can inspect the QTA triple. In research settings, the reasoning trace is available. In many deployed systems, it may not be. Some providers deliberately hide chain-of-thought-style reasoning. Some expose summaries rather than raw reasoning. Some agent frameworks produce logs, tool traces, plans, memory updates, and scratchpads, but not in a standardized format.
This creates a practical fork.
If the system exposes meaningful intermediate artifacts, trajectory-level safety auditing is feasible. The guard can inspect image-text input, reasoning or planning trace, and answer. If the system exposes only final answers, GuardTrace-style auditing becomes difficult. The company may still audit tool calls, retrieval results, intermediate plans, or structured logs, but it is no longer auditing the same object studied in the paper.
There is also the false-positive problem. The paper maps “potentially harmful” to harmful for binary evaluation, which is sensible in safety-critical moderation. But in business workflows, escalation has a cost. A compliance system that flags every mildly ambiguous trace may protect the organization while quietly destroying usability. The paper’s F1 scores are helpful, but they do not replace domain-specific threshold setting.
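Domain-specific threshold setting, mentioned above, can be illustrated with a short Python sketch. It assumes the guard exposes a numeric harm score per trace, which is an assumption on my part; a guard that only emits labels would need a different calibration step. The scores and labels are toy values, and `pick_threshold` is a hypothetical helper that chooses the highest-recall threshold whose false-positive rate stays within a stated budget.

```python
def confusion(scores, labels, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are flagged as harmful."""
    tp = fp = fn = tn = 0
    for score, harmful in zip(scores, labels):
        flagged = score >= threshold
        if flagged and harmful:
            tp += 1
        elif flagged:
            fp += 1
        elif harmful:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(scores, labels, threshold):
    """Precision, recall, F1, and false-positive rate at one threshold."""
    tp, fp, fn, tn = confusion(scores, labels, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


def pick_threshold(scores, labels, candidates, max_fpr):
    """Highest-recall candidate whose false-positive rate is within budget."""
    feasible = []
    for t in candidates:
        m = metrics(scores, labels, t)
        if m["fpr"] <= max_fpr:
            feasible.append((m["recall"], t))
    if not feasible:
        return None
    return max(feasible)[1]


# Toy data (invented): harm scores and ground truth (True = harmful).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, False, False, True, False]

for t in (0.25, 0.5, 0.75):
    print(t, metrics(scores, labels, t))
print("chosen:", pick_threshold(scores, labels, [0.25, 0.5, 0.75], max_fpr=0.25))
```

The design choice worth noticing is that the false-positive budget is an input, not an output. A claims-review copilot and a public chatbot can share the same guard and still need different thresholds.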
Finally, the benchmark is curated. It draws from established safety datasets and jailbreak methods, generates QTA traces from selected models, and uses a defined risk taxonomy. That is exactly what a benchmark should do. It is not the same as proving performance on messy enterprise traffic with local policies, multilingual users, proprietary documents, and domain-specific exceptions.
None of these boundaries weaken the paper’s central insight. They simply prevent the usual slide-deck disease where a research result becomes a universal solution after three arrows and a cloud icon.
What AI teams should take from GuardTrace-VL
The paper gives AI teams a useful checklist.
If your model processes images and text, do not assume text moderation is enough. If your model reasons explicitly or produces intermediate plans, do not assume final-answer moderation is enough. If your safety labels are generated by generic prompts, do not assume the labels are reliable enough for preference training. And if your benchmark contains only obvious safe-versus-harmful cases, do not assume the model has learned the boundary where real governance work happens.
The strongest part of GuardTrace-VL is that it treats safety as a trajectory problem. It asks not only “Was the answer safe?” but also “Did the model get to that answer safely?” That second question will become more important as multimodal agents move from chat interfaces into operational workflows.
The final answer is no longer the whole product. It is just the visible residue of a process. GuardTrace-VL reminds us that safety has to inspect the process too.
Cognaptus: Automate the Present, Incubate the Future.
[1] Yuxiao Xiang et al., “GuardTrace-VL: Detecting Unsafe Multimodal Reasoning via Iterative Safety Supervision,” arXiv:2511.20994, 2025. https://arxiv.org/abs/2511.20994