Hallucination-Resistant Security Planning: When LLMs Learn to Say No

Security teams do not need an AI that sounds decisive. They already have enough decisive systems. Some of them are called “legacy tools.” Some are called “urgent executive dashboards.” A few are called “we should probably reboot it.”

What security operations need is more uncomfortable: an AI system that can propose useful response actions, explain why they might work, and then refuse to act when its own reasoning becomes unstable. That refusal matters. In an incident-response workflow, a hallucinated recommendation is not merely a bad paragraph. It can isolate the wrong host, patch a vulnerability that does not exist, wipe evidence too early, or generate a playbook that looks official while quietly wasting the first thirty minutes of response time.

The paper behind today’s article, Hallucination-Resistant Security Planning with a Large Language Model, is interesting because it does not treat hallucination as a cosmetic language problem.¹ It treats hallucination as an operational planning risk. The authors’ central move is simple: stop asking the LLM to be a trusted decision-maker. Use it as a candidate generator inside a verification loop.

That sounds modest. It is not. In high-stakes systems, modesty is often the missing architecture.

The wrong lesson is “use a stronger model”

The easy interpretation of LLM security tools is still model-centric: take a stronger frontier model, give it longer context, write a better prompt, and hope the response plan improves. This is not irrational. Larger models often reason better, retrieve more relevant latent knowledge, and produce cleaner explanations.

But the paper is built around a different hypothesis: reliability does not come only from model scale. It comes from the decision loop around the model.

That distinction matters because incident response is not a normal text-generation task. The model is not being asked to summarize a report after the fact. It is being asked to choose the next recovery action from partial evidence: intrusion-detection logs, system descriptions, incident summaries, and previous actions. A plausible next step is not necessarily a useful next step. The authors therefore define a hallucinated action operationally: an action is hallucinated if it does not reduce the expected remaining time required to complete the task.

That definition is refreshingly unsentimental. It does not ask whether the sentence sounds believable. It asks whether the action moves recovery forward. If the model suggests patching a vulnerability that is not present, that is hallucination. If it recommends shutting down a component unrelated to the incident and thereby creates extra work, that is also hallucination. In this paper, hallucination is not “false text.” It is wasted or counterproductive operational movement.

This is the first useful shift for business readers. The question is not “Can the model answer like an expert?” The question is “Does the model reduce the number of validated actions needed to restore the system?”

The mechanism: generate, look ahead, abstain, learn

The framework has six moving parts. None is exotic on its own. The value is in how they are wired together.

Step	What happens	Why it matters operationally
Candidate generation	The LLM proposes multiple possible next actions, not just one.	Multiple candidates expose uncertainty that a single polished recommendation hides.
Lookahead prediction	The LLM predicts the expected recovery time after each candidate action.	Actions are scored by likely downstream effect, not by rhetorical confidence.
Consistency scoring	The system measures whether predicted outcomes agree or diverge.	Large disagreement becomes a warning signal for hallucination risk.
Abstention	If consistency is below a calibrated threshold, the system refuses to select an action.	“No action yet” becomes a safety feature, not a product failure.
External feedback	The rejected candidate is evaluated externally, for example by a digital twin or expert.	The system gets grounded feedback without pretending the LLM knows everything.
In-context learning	Feedback is appended to the model context and the candidate generation repeats.	The model adapts during the task without retraining its weights.

In plain language, the framework asks the model several times what to do next, asks it to estimate what each action would achieve, checks whether those estimates form a coherent story, and only then allows an action to pass. When the model’s internal picture is too scattered, the system does not negotiate with the chatbot. It abstains.

That is the paper’s most business-relevant idea. A safe AI assistant is not one that always answers. It is one whose answer is conditional on passing a verification gate.

The framework can be expressed as a compact loop:

incident logs + system description
        ↓
LLM generates candidate actions
        ↓
LLM predicts lookahead recovery impact
        ↓
consistency score is computed
        ↓
if consistent: select lowest predicted recovery-time action
if inconsistent: abstain, collect feedback, retry

The LLM is still doing important work. It generates candidate actions. It estimates consequences. It incorporates feedback. But it is no longer trusted as the sole authority. It is a hypothesis engine with a brake pedal. A shocking innovation, apparently.
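The loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate`, `predict`, and `get_feedback` are hypothetical callables standing in for the LLM's candidate generation, its lookahead prediction, and the external evaluator, and the stub functions exist only to make the example runnable. The consistency score uses the exponential-of-variance form the paper defines.

```python
import math
from typing import Callable, List, Optional, Tuple


def consistency(times: List[float], beta: float) -> float:
    """exp(-beta * variance) of the predicted remaining recovery times."""
    mean = sum(times) / len(times)
    variance = sum((t - mean) ** 2 for t in times) / len(times)
    return math.exp(-beta * variance)


def plan_next_action(
    context: str,
    generate: Callable[[str, int], List[str]],   # hypothetical LLM call
    predict: Callable[[str, str], float],        # hypothetical lookahead call
    get_feedback: Callable[[str], str],          # hypothetical twin or expert
    n: int = 3,
    beta: float = 0.9,
    gamma: float = 0.9,
    max_rounds: int = 5,
) -> Tuple[Optional[str], str]:
    """Return (action, context); action is None if every round abstained."""
    for _ in range(max_rounds):
        candidates = generate(context, n)
        times = [predict(context, a) for a in candidates]
        best = candidates[times.index(min(times))]
        if consistency(times, beta) > gamma:
            return best, context  # coherent enough: select lowest predicted time
        # Abstain: ask for external feedback on the action that would have
        # been selected, append it to the context, and try again.
        context += f"\nFeedback on '{best}': {get_feedback(best)}"
    return None, context


# Stubs standing in for the LLM and the evaluator (illustrative only).
def stub_generate(context: str, n: int) -> List[str]:
    return ["isolate host", "block source IP", "patch web server"][:n]


def stub_predict(context: str, action: str) -> float:
    scattered = {"isolate host": 3.0, "block source IP": 6.0, "patch web server": 9.0}
    settled = {"isolate host": 4.0, "block source IP": 4.2, "patch web server": 4.1}
    return (settled if "Feedback" in context else scattered)[action]


def stub_feedback(action: str) -> str:
    return "effective, the alerts point at this host"


action, final_context = plan_next_action(
    "Snort alerts + system description", stub_generate, stub_predict, stub_feedback
)
print(action)
```

With these stubs the first round abstains because the predictions are scattered, the feedback is appended, and the second round passes the gate. The `max_rounds` cap is an assumption of this sketch; it simply keeps the loop from asking for feedback forever.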

Consistency is used as a practical proxy for risk

The paper’s consistency function measures how dispersed the lookahead predictions are. If the predicted remaining recovery times are close to each other, the candidate set is treated as more coherent. If they spread wildly, the candidate set is treated as unreliable.

The paper defines consistency as:

$$ \lambda(A_t) = \exp\left(-\frac{\beta}{N}\sum_{i=1}^{N}(T^i_{t+1} - \bar{T}_{t+1})^2\right) $$

Here, $A_t$ is the set of candidate actions at step $t$, $N$ is the number of candidates, $T^i_{t+1}$ is the predicted remaining recovery time after candidate action $i$, $\bar{T}_{t+1}$ is the average predicted remaining time, and $\beta$ controls how quickly the score decays as predictions disagree.

The output sits between 0 and 1. A score near 1 means high consistency. A score near 0 means the model’s predicted outcomes are scattered. The framework then compares this score with a threshold $\gamma$:

$$ \pi_\gamma(A_t)= \begin{cases} \varnothing, & \text{if } \lambda(A_t) \leq \gamma \\ \arg\min_{a^i_t \in A_t} T^i_{t+1}, & \text{if } \lambda(A_t) > \gamma \end{cases} $$

In less formal language: if the candidates do not look internally coherent, select nothing. If they do, choose the action with the lowest predicted remaining recovery time.
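A worked example may make the rule concrete. The prediction values here are hypothetical; only $\beta = 0.9$ and $\gamma = 0.9$ come from the paper's experimental setup. Suppose three candidates have predicted remaining recovery times of 4, 4, and 5 actions:

$$ \bar{T}_{t+1} = \tfrac{13}{3}, \qquad \frac{1}{3}\sum_{i=1}^{3}(T^i_{t+1} - \bar{T}_{t+1})^2 = \tfrac{2}{9}, \qquad \lambda(A_t) = \exp\left(-0.9 \cdot \tfrac{2}{9}\right) = e^{-0.2} \approx 0.819 \leq 0.9 $$

The system abstains. If the predictions were instead 4.0, 4.1, and 4.3, the variance would be about 0.0156:

$$ \lambda(A_t) = \exp(-0.9 \cdot 0.0156) \approx 0.986 > 0.9 $$

The gate opens and the candidate with predicted time 4.0 is selected. A one-action disagreement among three candidates is enough to trigger refusal at these settings, which gives a feel for how strict the gate is.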

There is an important nuance here. Consistency is not truth. A model can be consistently wrong. A committee can agree on a bad idea; this is called “a meeting.” The paper does not claim that consistency alone proves correctness. Instead, consistency is used as a risk indicator that can be calibrated against historical hallucinated examples.

That calibration step is what turns this from a heuristic into a controllable decision rule. The authors collect a calibration dataset of candidate-action sets where the selected action is hallucinated. In their experiment, they use $n = 100$ such sets and configure $\gamma = 0.9$, which bounds hallucination probability by 0.05 under the paper’s i.i.d. calibration assumption.

For a security manager, the practical message is not “copy $\gamma = 0.9$.” The message is: build a rejection threshold from observed failure cases, not from a product manager’s optimism.
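One way such a threshold could be derived is a conformal-style quantile over the consistency scores of known-hallucinated sets. The sketch below is an assumption about how a threshold of this kind can be built, not a reproduction of the paper's exact procedure; the function name and the risk parameter `kappa` are hypothetical. The idea: if a new hallucinated set is exchangeable with the calibration sets, its score exceeds the k-th smallest calibration score with probability at most 1 - k/(n + 1).

```python
import math
from typing import Sequence


def calibrate_threshold(hallucinated_scores: Sequence[float], kappa: float) -> float:
    """Pick gamma so a hallucinated set passes the gate with probability <= kappa.

    Assumes calibration and future hallucinated sets are exchangeable (i.i.d.
    is sufficient). hallucinated_scores are consistency scores of sets whose
    selected action was judged hallucinated.
    """
    n = len(hallucinated_scores)
    k = math.ceil((n + 1) * (1 - kappa))
    if k > n:
        raise ValueError("calibration set too small for the requested risk level")
    return sorted(hallucinated_scores)[k - 1]


# Hypothetical calibration scores: 0.01, 0.02, ..., 1.00
scores = [i / 100 for i in range(1, 101)]
gamma = calibrate_threshold(scores, kappa=0.05)
print(gamma)
```

With 100 calibration scores and a 0.05 risk level, the rule picks the 96th smallest score. With only 10 scores the same risk level is unreachable and the function refuses, which is the calibration equivalent of saying "not yet."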

The theory says “controllable,” not “magical”

The paper gives two theoretical results, and both should be read carefully.

First, it shows that the hallucination probability can be upper-bounded by tuning the consistency threshold using a calibration set. This is the abstention guarantee. The cost is obvious: a stricter threshold means more refusals and more feedback requests. In low-risk advisory work, a higher tolerance may be acceptable. In automated or semi-automated incident response, the threshold should be harsher. Security automation is one of those areas where being “mostly right” can become an expensive personality trait.

Second, the paper gives a Bayesian regret bound for the in-context learning process under explicit assumptions. The authors interpret ICL as approximate Bayesian learning and frame the feedback process like a bandit problem. Under the assumptions that the LLM’s action distribution aligns with the posterior over optimal actions and that feedback is an unbiased reward signal, the regret after $K$ ICL iterations is bounded as:

$$ R^K_t \leq C\sqrt{|A|K\ln K} $$

This result matters because it connects the feedback loop to a known learning dynamic: the system should improve as it receives feedback, rather than merely accumulate prompt debris. But the assumptions are strong. Real security environments do not always provide clean, unbiased feedback. Digital twins can be incomplete. Experts can disagree. Logs can be missing. The LLM’s implicit distribution is not guaranteed to behave like a neat posterior distribution just because a proof would find that convenient.
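Dividing the bound by the number of iterations shows why it implies improvement rather than mere accumulation. This step follows directly from the inequality above and adds no new assumptions:

$$ \frac{R^K_t}{K} \leq C\sqrt{\frac{|A|\ln K}{K}} \longrightarrow 0 \quad \text{as } K \to \infty $$

Total regret may keep growing, but the average regret per feedback round shrinks toward zero. Note also that the bound grows with $|A|$: a larger action space means more rounds before the average settles.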

So the theoretical contribution should be interpreted as a design discipline, not a production warranty. It tells us what kind of structure is needed for controllable risk: calibration, abstention, feedback, and bounded learning under stated conditions.

That is still valuable. In enterprise AI, “this guarantee depends on assumptions” is much better than “our benchmark looks nice and the demo music was inspiring.”

The experiment tests incident-response planning, not autonomous remediation

The evaluation applies the framework to incident response across 25 incidents from four public datasets:

Dataset	Systems	Attacks / logs
CTU-Malware-2014	Windows XP SP2 servers	Malware and ransomware examples, using Snort alerts
CIC-IDS-2017	Windows and Linux servers	Denial-of-service, web attacks, Heartbleed, SQL injection, and related incidents, using Snort alerts
AIT-IDS-V2-2022	Linux and Windows servers / hosts	Multi-stage attack activity including reconnaissance, cracking, and escalation, using Wazuh alerts
CSLE-IDS-2024	Linux servers	Incidents such as SambaCry, Shellshock, and CVE-2015-1427 exploitation, using Snort alerts

Each incident is represented by logs and a textual system description. The model’s job is to generate a sequence of response actions that recovers the system. The paper evaluates three metrics:

Recovery time, measured as the number of actions required to recover.
Percentage of ineffective actions, including hallucinated or unnecessary actions.
Percentage of failed recoveries, where the generated sequence does not recover the system.
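The three metrics are simple to compute once each incident run is logged. The sketch below assumes a hypothetical `Episode` record and assumes that ineffective actions are pooled across incidents before taking a percentage; the paper's exact aggregation may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    """One incident run. Field names are hypothetical."""
    actions_taken: int        # recovery time, counted in actions
    ineffective_actions: int  # hallucinated or unnecessary actions
    recovered: bool           # did the sequence recover the system?


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    total_actions = sum(e.actions_taken for e in episodes)
    return {
        "avg_recovery_time": total_actions / len(episodes),
        "pct_ineffective": 100 * sum(e.ineffective_actions for e in episodes) / total_actions,
        "pct_failed": 100 * sum(not e.recovered for e in episodes) / len(episodes),
    }


# Made-up runs, for illustration only.
report = summarize([
    Episode(actions_taken=10, ineffective_actions=1, recovered=True),
    Episode(actions_taken=14, ineffective_actions=2, recovered=True),
    Episode(actions_taken=12, ineffective_actions=0, recovered=False),
])
print(report)
```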

The recovery criteria follow incident-response stages: containment, assessment, evidence preservation, eviction, hardening, and restoration. This is important because the paper is not simply judging whether the answer sounds good. It is checking whether the generated plan covers required recovery states.

The experimental framework uses a fine-tuned DeepSeek-R1-14B model, running with 4-bit quantization on four RTX 8000 GPUs. It generates $N = 3$ candidate actions and uses $\beta = 0.9$ in the consistency function. The comparison baselines are larger frontier models: DeepSeek-R1, Gemini 2.5 Pro, and OpenAI O3.

That setup creates a useful puzzle: why would a 14B model wrapped in a planning loop outperform much larger models used directly?

The answer is not that the smaller model is secretly smarter. It is that the framework changes the job. The baseline models are asked to produce plans. The proposed system is allowed to generate, test internal coherence, refuse bad candidates, receive feedback, and revise. Better workflow beats bigger monologue.

The main result: shorter plans and fewer wasted actions

Across all evaluation datasets, the proposed framework achieves the shortest recovery time. On average, the framework reaches a recovery time of 12.02 actions, compared with 16.21 actions for the next-best model, Gemini 2.5 Pro. OpenAI O3 and DeepSeek-R1 score higher average recovery times at 17.28 and 17.09, respectively.

This means the result is not just “the proposed method wins by a rounding error.” The average plan is materially shorter. Since recovery time is defined as number of actions, not wall-clock time, the result should be read as decision-efficiency: fewer validated steps are needed to complete the recovery sequence.

The paper also reports the average percentage of ineffective actions:

Model	Average recovery time	Average ineffective actions	Average failed recoveries
Proposed framework	12.02	7.62%	2.50%
Gemini 2.5 Pro	16.21	11.12%	3.30%
OpenAI O3	17.28	12.26%	4.21%
DeepSeek-R1	17.09	11.99%	4.48%
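To put the recovery-time gap in relative terms, the averages in the table can be turned into percentage reductions. The snippet uses only the numbers reported above; the helper name is illustrative.

```python
# Average recovery times (in actions) from the table above.
framework = 12.02
baselines = {"Gemini 2.5 Pro": 16.21, "OpenAI O3": 17.28, "DeepSeek-R1": 17.09}


def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100 * (baseline - value) / baseline


reductions = {name: relative_reduction(t, framework) for name, t in baselines.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% fewer actions")
```

The framework's plans are roughly 26% shorter than Gemini 2.5 Pro's and about 30% shorter than those of OpenAI O3 and DeepSeek-R1, measured in actions rather than minutes.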

The business interpretation is straightforward but should not be overstated. The paper shows that, in this benchmark construction, a calibrated planning loop produces shorter and cleaner incident-response plans than direct use of larger models. It does not show that the system can autonomously remediate a live enterprise network. It does not measure wall-clock recovery time. It does not measure the political time required to convince someone that, yes, the server really should be isolated now.

Still, action-count efficiency matters. In an incident, every unnecessary step increases cognitive load. Every irrelevant recommendation forces the analyst to spend attention rejecting it. The ROI is not only faster recovery. It is less analyst time wasted separating useful operational movement from fluent noise.

The ablations explain where the performance comes from

The ablation study is the part business readers should not skip. It tells us whether the framework is one clever trick or a system whose components each earn their place.

The paper tests variants with and without ICL feedback, lookahead, and abstention. The pattern is clear: removing any major component degrades performance. The two headline ablation numbers, along with three supporting checks, are summarized here:

Test	Likely purpose	Result	What it supports
Remove both lookahead and ICL	Ablation of planning and learning components	Average completion time rises from about 12 to 21 actions	The performance gain is not just from the base model or prompt format.
Remove abstention	Ablation of refusal mechanism	Hallucination probability rises from 0.02 to 0.06	The refusal gate materially reduces hallucination risk.
Track feedback requests	Implementation-cost check	Average of 2.24 feedback requests, including 1.13 unnecessary requests	Reliability has an operational cost; feedback loops are not free.
Track regret over ICL iterations	Learning-dynamics check	Regret plateaus after about 8 iterations	The feedback loop appears to converge in the tested setup.
Vary number of candidate actions	Compute-scaling check	Sequential lookahead scales roughly linearly; parallel execution keeps planning time nearly constant	Candidate evaluation can be computationally manageable with parallel resources.

These are not all the same kind of evidence. The main comparison with frontier models supports performance. The ablation study supports mechanism. The feedback-request analysis supports implementation cost. The regret curve supports the claim that ICL feedback stabilizes after repeated correction. The compute-time test supports deployability under parallel hardware.
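The compute-scaling observation is easy to picture in code. In this sketch a thread pool stands in for parallel inference workers, and `slow_predict` is a hypothetical stub that sleeps instead of calling a model; the paper's actual parallelization setup is not described here and may look quite different.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def lookahead_sequential(predict: Callable[[str], float], candidates: List[str]) -> List[float]:
    # One prediction after another: time grows with the number of candidates.
    return [predict(a) for a in candidates]


def lookahead_parallel(predict: Callable[[str], float], candidates: List[str]) -> List[float]:
    # One worker per candidate: time stays close to a single prediction,
    # provided enough workers are actually available.
    with ThreadPoolExecutor(max_workers=max(1, len(candidates))) as pool:
        return list(pool.map(predict, candidates))  # map preserves input order


def slow_predict(action: str) -> float:
    time.sleep(0.05)           # stand-in for model latency
    return float(len(action))  # dummy "predicted recovery time"


candidates = ["isolate host", "block source IP", "patch web server", "rotate credentials"]
sequential = lookahead_sequential(slow_predict, candidates)
parallel = lookahead_parallel(slow_predict, candidates)
print(sequential == parallel)
```

Both paths return the same predictions in the same order; only the waiting time differs. The "nearly constant" claim therefore depends on having as many free workers as candidates, which is a hardware budget question, not an algorithmic one.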

Bundling all of them into one sentence—“the framework works”—would be lazy. The better reading is that the paper gives a chain of evidence: the method performs better, the components explain part of the improvement, and the operational costs are visible enough to discuss.

Feedback is useful, but it is also a budget line

The framework requests feedback when candidate actions fail the consistency threshold. In the paper’s experiment, feedback is a textual evaluation of why the action that would have been selected is effective or ineffective. That evaluation is then appended to the model context.

In production, this feedback could come from several places:

Feedback source	What it can validate	Main weakness
Digital twin	Whether an action behaves safely in a simulated environment	The twin may not match the live system closely enough.
Expert analyst	Whether the action is operationally sensible	Human review is scarce, slow, and expensive.
Rule base / policy engine	Whether the action violates known constraints	Rules may miss novel incident context.
Historical incident repository	Whether similar actions worked before	Historical similarity can be misleading when architecture changes.
Automated test harness	Whether commands or configuration changes are syntactically and procedurally valid	Valid execution is not the same as correct response.

This is where business implementation becomes less glamorous. A feedback loop is only as good as the feedback substrate. If the enterprise has no useful incident history, no maintained asset inventory, no simulation environment, and no clear response criteria, the LLM wrapper will not magically discover operational truth from vibes.

The paper does not pretend otherwise. It frames the system as decision support. The final response plan is returned to a security operator. That operator validates the plan. In business terms, this is not a replacement for SOC expertise. It is a way to move the analyst’s work from log-sifting and generic playbook adaptation toward plan validation.

That is a meaningful labor shift. It is also a less silly promise than “autonomous cyber defense.”

The playbook comparison is really about actionability

The paper compares the framework with incident-response playbooks in an appendix. The point is not that playbooks are obsolete. They are still useful: textual, stage-based, and compatible with existing SOC workflows. The authors explicitly position their framework as a complement.

The difference is granularity. Traditional playbooks often provide generic response instructions. They must be configured by experts and kept current as threats and system architectures evolve. The paper argues that generated response plans can be more context-specific because they are conditioned on the system description, logs, incident state, and previous actions.

The authors also note a striking practical contrast: generated response plans typically span 2–3 pages, while playbooks can run 40+ pages. Longer is not automatically worse; sometimes documentation needs breadth. But during an active incident, a 40-page playbook can become a very scholarly way to lose time.

The best interpretation is not “replace playbooks with LLMs.” The better interpretation is:

Keep playbooks as governance and procedural scaffolding.
Use LLM planning loops to adapt response actions to current logs and system context.
Use abstention and feedback to prevent the adaptation layer from becoming a confident improvisation engine.
Keep the operator in the approval path unless the organization has validated a narrower automated action class.

That last point matters. Some actions can be automated safely after enough validation, such as generating a proposed firewall rule for review. Other actions, such as wiping a host or changing production access controls, should remain heavily gated. “The model had high consistency” is not a change-management policy.
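A gate of this kind can be stated as a rule in which consistency is necessary but never sufficient. The action-class names and the allowlist below are hypothetical; each organization would define its own after validation.

```python
# Hypothetical allowlist: action classes the organization has validated
# for automation. Everything else stays in the operator's approval path.
VALIDATED_AUTO_CLASSES = frozenset({"draft_firewall_rule"})


def needs_operator_approval(action_class: str, consistency: float, gamma: float = 0.9) -> bool:
    """True unless the action passed the consistency gate AND is allowlisted."""
    passed_gate = consistency > gamma
    return not (passed_gate and action_class in VALIDATED_AUTO_CLASSES)


print(needs_operator_approval("wipe_host", consistency=0.99))            # True
print(needs_operator_approval("draft_firewall_rule", consistency=0.99))  # False
print(needs_operator_approval("draft_firewall_rule", consistency=0.50))  # True
```

Unknown action classes fall through to human approval by default, so a new kind of action cannot be automated by accident.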

What Cognaptus infers for business use

The paper directly shows improved incident-response planning in a benchmark based on public datasets and manually constructed ground-truth response plans. It also directly shows that abstention, lookahead, and ICL contribute to the performance in the tested setup.

Cognaptus would infer three broader business lessons, with boundaries attached.

1. The valuable product is not a chatbot. It is a controlled planning service.

A SOC-facing AI product should not merely expose a chat window over logs. The useful architecture is closer to a planning service:

data ingestion → candidate response generation → constraint checks → lookahead scoring
→ calibrated abstention → feedback request → revised plan → human validation

This architecture creates several enterprise controls: thresholds, audit logs, feedback records, approval gates, and measurable action efficiency. It also makes procurement conversations less mystical. Buyers can ask: What is the abstention rate? What is the false-acceptance rate? How is the threshold calibrated? What happens when the model refuses?
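Two of those procurement questions reduce to counting over a decision log. The sketch assumes a hypothetical `Decision` record and defines the false-acceptance rate as the share of accepted actions later judged hallucinated; a vendor may define it differently, which is itself worth asking about.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Decision:
    """One gate decision. Field names are hypothetical."""
    abstained: bool
    hallucinated: Optional[bool]  # post-hoc label for accepted actions; None if abstained


def gate_report(log: List[Decision]) -> Dict[str, float]:
    accepted = [d for d in log if not d.abstained]
    false_accepts = sum(bool(d.hallucinated) for d in accepted)
    return {
        "abstention_rate": (len(log) - len(accepted)) / len(log),
        "false_acceptance_rate": false_accepts / len(accepted) if accepted else 0.0,
    }


# Made-up log: 3 abstentions, 7 accepted actions, 1 of them hallucinated.
log = (
    [Decision(abstained=True, hallucinated=None)] * 3
    + [Decision(abstained=False, hallucinated=False)] * 6
    + [Decision(abstained=False, hallucinated=True)]
)
report = gate_report(log)
print(report)
```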

A plain chatbot usually answers those questions with a brand deck. Delightful, but not a control surface.

2. Abstention should be priced as risk reduction, not counted as failure.

Many product teams dislike refusal because it looks like non-performance. In high-stakes workflows, refusal is performance when the alternative is an unsafe recommendation.

For incident response, the business metric should not be “percentage of prompts answered.” It should be closer to:

accepted action precision;
number of unnecessary actions avoided;
analyst review time saved;
recovery-action count;
escalation quality when the system abstains;
post-incident auditability of feedback and thresholds.

This changes how ROI is discussed. The value of the system is not only that it generates plans faster. It also filters its own bad plans before they reach the operator.

3. Smaller models can compete when surrounded by better process.

The paper’s proposed framework uses DeepSeek-R1-14B while the baselines are much larger frontier models. The result does not prove that smaller models generally beat frontier models. It shows that a smaller model inside a better decision loop can outperform larger models used more directly in this specific task.

For enterprises, that matters because cost and deployment constraints are real. A smaller model with domain fine-tuning, calibrated thresholds, local deployment, and external verification may be more attractive than sending sensitive logs to a frontier model and hoping the prompt is sufficiently majestic.

The uncertain part is portability. The benchmark uses 25 incidents and a specific experimental construction. Before a business generalizes the result, it should test the framework on its own incident types, network architecture, alert quality, analyst workflow, and permitted response actions.

Where the paper’s boundaries matter

The paper is promising, but several boundaries materially affect practical interpretation.

First, recovery time is measured as the number of actions, not elapsed wall-clock time. This is defensible because it abstracts away implementation details. But in real incidents, some actions take seconds and others take hours. A six-action plan may still be slow if one action requires cross-team approval or forensic imaging.
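A toy comparison shows how the two measures can disagree. All action names and durations below are invented for illustration; none come from the paper.

```python
# (action, assumed duration in minutes): hypothetical values only.
six_action_plan = [
    ("isolate host", 2),
    ("forensic imaging", 240),  # one slow step dominates
    ("block source IP", 1),
    ("remove malware", 15),
    ("patch service", 20),
    ("restore service", 10),
]
nine_action_plan = [(f"quick step {i}", 5) for i in range(1, 10)]


def total_minutes(plan):
    return sum(minutes for _, minutes in plan)


print(len(six_action_plan), total_minutes(six_action_plan))    # 6 288
print(len(nine_action_plan), total_minutes(nine_action_plan))  # 9 45
```

The shorter plan by action count is the slower plan by the clock. An organization adopting the paper's metric may want to log per-action durations alongside it.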

Second, the ground-truth response plans are manually constructed from dataset metadata. That is useful for evaluation, but it is not the same as observing response success in live enterprise conditions. The plans are only as good as the construction process and assumptions behind them.

Third, the theoretical guarantees rely on calibration and distributional assumptions. Proposition 1 assumes the calibration and test examples are i.i.d. The ICL regret analysis assumes Bayesian alignment and unbiased bandit feedback. These assumptions are not decorative; they define where the guarantee lives. Outside them, the framework remains sensible, but the formal comfort should be reduced.

Fourth, the feedback loop has cost. The paper reports an average of 2.24 feedback requests and 1.13 unnecessary feedback requests. That is not a flaw, but it is an implementation fact. If feedback depends on senior analysts, this cost appears as human workload. If feedback depends on simulation, it appears as engineering and infrastructure cost.

Finally, the system is decision support. The paper’s algorithm returns actions as support to a security operator. That framing should be preserved. The path to automation may exist for narrow, validated action classes, but the paper does not justify broad autonomous remediation.

The real lesson: refusal is part of intelligence

The most useful idea in this paper is not that LLMs can help with incident response. That has been obvious for a while. Logs are verbose, playbooks are long, and analysts are overloaded. The industry did not need another reminder that text models are good at text.

The useful idea is that LLM reliability can be engineered through a planning loop: generate several candidate actions, predict outcomes, measure consistency, abstain when coherence is weak, and learn from external feedback. The model’s job becomes narrower and safer. It proposes. The system checks. Feedback corrects. The operator validates.

That is a more mature design pattern for agentic AI in security operations. It is also applicable beyond security: finance, infrastructure, healthcare operations, compliance workflows, and any setting where confident wrongness is more dangerous than silence.

The old AI product instinct was to make the assistant answer more often. The better instinct is to make it answer only when the surrounding system has earned enough confidence.

Sometimes the smartest thing an AI can say is: not yet.

Cognaptus: Automate the Present, Incubate the Future.

¹ Kim Hammar, Tansu Alpcan, and Emil C. Lupu, “Hallucination-Resistant Security Planning with a Large Language Model,” arXiv:2602.05279, 2026.
