Beam planning sounds like the sort of work automation should have solved years ago.
There is a target. There are organs at risk. There are dose constraints. There is an optimizer. Surely the machine should find the best plan while humans do something more dignified than nudging parameters inside a treatment planning system for the seventeenth time.
And yet radiation oncology has not been waiting for a generic chatbot with a confident voice. It has been waiting for something much less glamorous: an automated planner that can make a good plan, explain why it made that plan, accept a physicist’s correction, and leave behind enough evidence that someone responsible can sleep at night.
That is the useful lens for reading the new SAGE paper on automated stereotactic radiosurgery planning using a human-in-the-loop reasoning large language model agent.[1] The tempting headline is “AI matches human planners.” The better headline is more specific: the study compares three planning modes (human clinical plans, a non-reasoning LLM agent, and a reasoning LLM agent) and asks where deliberation actually pays.
This matters because radiotherapy planning is not a benchmark puzzle. It is a constrained clinical workflow. A plan can be numerically acceptable and still operationally awkward. A model can hit dose limits and still be hard to trust. An agent can be “autonomous” in a demo and still be useless if every failure requires forensic archaeology.
So let us not ask the lazy question: will AI replace dosimetrists?
The sharper question is: which planning architecture belongs in which part of the clinic?
The comparison is the point, not the decoration
The paper evaluates SAGE, short for Secure Agent for Generative Dose Expertise, on 41 retrospective single-target brain-metastasis stereotactic radiosurgery cases. Each case used a single-fraction prescription of 18 Gy. The plans were created inside Varian Eclipse, with the beam geometry fixed to match the clinical plan configuration. That last detail matters. The agent was not allowed to win by secretly changing the game board.
The study compares:
| Planning mode | What it represents | Main operational question |
|---|---|---|
| Human clinical plans | Existing expert workflow | What does current accepted practice achieve? |
| Non-reasoning SAGE | Faster direct-output LLM planning | Can an agent produce acceptable plans without deliberative traces? |
| Reasoning SAGE | Slower deliberative LLM planning | Does explicit reasoning improve plan quality, refinement, reliability, or auditability? |
Both SAGE variants operated through an iterative loop: receive the clinical scenario and current optimizer state, adjust optimization parameters, calculate dose, evaluate the plan, and repeat up to ten iterations unless clinical goals were already satisfied. Then came the human-in-the-loop stage. A board-certified medical physicist accepted the plan or sent it back for a standardized natural-language refinement prompt focused on improving conformity while preserving target coverage and organ-at-risk constraints.
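The loop described above can be sketched in a few lines. This is a minimal illustration, not SAGE's implementation: the callables `propose`, `calculate`, and `goals_met`, the `PlanningResult` type, and the toy stand-ins are all hypothetical names introduced here, and only the ten-iteration cap and the accept-or-refine decision come from the paper's description.

```python
from dataclasses import dataclass
from typing import Callable, Dict

MAX_ITERATIONS = 10  # iteration cap described in the study

Params = Dict[str, float]
Metrics = Dict[str, float]


@dataclass
class PlanningResult:
    params: Params
    metrics: Metrics
    iterations: int
    goals_met: bool


def run_planning_loop(
    initial: Params,
    propose: Callable[[Params, Metrics], Params],  # hypothetical stand-in for the LLM agent
    calculate: Callable[[Params], Metrics],        # hypothetical stand-in for optimizer + dose calculation
    goals_met: Callable[[Metrics], bool],          # hypothetical clinical-goal check
) -> PlanningResult:
    """Adjust, recalculate, evaluate; stop early once goals are satisfied."""
    params = dict(initial)
    metrics = calculate(params)
    iterations = 0
    while iterations < MAX_ITERATIONS and not goals_met(metrics):
        params = propose(params, metrics)
        metrics = calculate(params)
        iterations += 1
    return PlanningResult(params, metrics, iterations, goals_met(metrics))


def physicist_review(result: PlanningResult, accept: Callable[[PlanningResult], bool]) -> str:
    """Human decision point: accept the plan or send it back for refinement."""
    return "accepted" if accept(result) else "refinement_requested"


# Toy stand-ins so the sketch runs; none of this models real dosimetry.
def toy_calculate(params: Params) -> Metrics:
    return {"coverage_pct": min(100.0, 90.0 + 2.0 * params["target_weight"])}


def toy_propose(params: Params, metrics: Metrics) -> Params:
    return {"target_weight": params["target_weight"] + 1.0}


def toy_goals_met(metrics: Metrics) -> bool:
    return metrics["coverage_pct"] >= 95.0


result = run_planning_loop({"target_weight": 1.0}, toy_propose, toy_calculate, toy_goals_met)
print(result.iterations, result.metrics["coverage_pct"])
print(physicist_review(result, accept=lambda r: r.goals_met))
```

Keeping the human decision as a separate function, outside the optimization loop, mirrors the study's structure: the agent iterates on its own, and the physicist rules on the finished draft.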
This design is more interesting than a simple “model versus doctor” contest. It separates three things that are often blurred together in AI healthcare discussions: dosimetric performance, responsiveness to expert feedback, and process transparency.
A model that scores well only on the first dimension may still be clinically irritating. Congratulations, it made a plan. Now please explain why it pushed dose in that direction, why conformity stalled, and whether the optic pathway constraint was checked before or after the parameter change. Medicine loves miracles, but it invoices through documentation.
On primary endpoints, reasoning mostly earns the right to stay in the room
The reasoning agent’s headline result is comparability with clinical plans on primary dosimetric endpoints.
The reasoning model achieved median planning target volume coverage of 96.8%, compared with 96.5% for clinical plans and 96.2% for the non-reasoning model. No patient in either AI cohort fell below the 95% coverage threshold. Median maximum doses stayed below the 21.6 Gy threshold across all three groups. In paired comparisons between the reasoning agent and clinical plans, target coverage, maximum dose, conformity index, and gradient index did not show statistically significant differences.
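The two thresholds quoted above amount to a simple pass/fail screen. The sketch below assumes only those two checks and treats a maximum dose exactly at 21.6 Gy as passing, which is my assumption. `PlanMetrics` and `acceptability_failures` are hypothetical names, and the example plans are invented apart from the 96.8% coverage figure.

```python
from dataclasses import dataclass
from typing import List

MIN_COVERAGE_PCT = 95.0  # coverage threshold quoted in the text
MAX_DOSE_GY = 21.6       # maximum-dose threshold quoted in the text


@dataclass(frozen=True)
class PlanMetrics:
    ptv_coverage_pct: float
    max_dose_gy: float


def acceptability_failures(plan: PlanMetrics) -> List[str]:
    """Return the list of failed checks; an empty list means both pass."""
    failures = []
    if plan.ptv_coverage_pct < MIN_COVERAGE_PCT:
        failures.append("coverage below 95%")
    if plan.max_dose_gy > MAX_DOSE_GY:
        failures.append("maximum dose above 21.6 Gy")
    return failures


# Hypothetical plans for illustration; only the 96.8% coverage echoes a reported median.
passing = PlanMetrics(ptv_coverage_pct=96.8, max_dose_gy=21.0)
failing = PlanMetrics(ptv_coverage_pct=94.1, max_dose_gy=22.3)
print(acceptability_failures(passing))
print(acceptability_failures(failing))
```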
That is not the same as “the AI is better than humans.” It is closer to “the reasoning agent produced clinically comparable plans under the conditions tested.”
That distinction is not pedantry. It is the difference between a product claim and an evaluation result.
The paper uses paired non-parametric Wilcoxon signed-rank tests and applies Benjamini-Hochberg false discovery rate correction within pre-specified endpoint families. But the authors also state a key boundary: this was not formal equivalence testing with pre-specified margins. So “no significant difference” should not be inflated into a regulatory-grade equivalence claim.
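For readers unfamiliar with the correction step, here is a minimal standard-library sketch of the Benjamini-Hochberg adjustment applied to one endpoint family. The p-values are hypothetical placeholders chosen for illustration, not the paper's numbers, and `benjamini_hochberg` is a name introduced here. A real analysis would more likely rely on an established statistics package.

```python
from typing import Dict


def benjamini_hochberg(p_values: Dict[str, float]) -> Dict[str, float]:
    """Return BH-adjusted p-values for one family of endpoints."""
    m = len(p_values)
    ranked = sorted(p_values.items(), key=lambda item: item[1])  # ascending by p
    adjusted: Dict[str, float] = {}
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        name, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[name] = running_min
    return adjusted


# HYPOTHETICAL raw p-values for a four-endpoint family (not taken from the paper).
raw = {
    "right_cochlea": 0.004,
    "left_cochlea": 0.30,
    "right_optic_nerve": 0.45,
    "left_optic_nerve": 0.62,
}
adjusted = benjamini_hochberg(raw)
significant = [name for name, q in adjusted.items() if q < 0.05]
print(adjusted)
print(significant)
```

Because the adjustment scales with the number of endpoints in the family, how the families are drawn matters, which is why the pre-specification noted above is worth the mention.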
Here is the practical interpretation:
| Result category | What the paper directly shows | Business meaning | Boundary |
|---|---|---|---|
| Target coverage | Reasoning SAGE met clinical coverage expectations across the cohort | The agent can produce plans that clear core acceptability checks | Retrospective, single-institution, 41 cases |
| Maximum dose | Reasoning plans were comparable to clinical plans | The agent did not buy coverage by obviously violating dose ceilings | Not a patient-outcome study |
| CI and GI | No significant paired difference versus clinical plans | Plan shape quality was not obviously inferior | No formal equivalence margins |
| OAR sparing | One significant advantage: right cochlea dose | The agent may continue optimizing after thresholds are met | Clinical impact on toxicity remains unknown |
For deployment, this is still valuable. In many clinical workflows, the first barrier is not superhuman performance. It is reaching a level where automation can draft a plausible plan without creating new review burden. A plan that starts near clinical acceptability can save time. A plan that starts near clinical acceptability and explains its reasoning can save attention.
Attention, not compute, is the scarce resource hiding inside many “AI productivity” stories.
The cochlea result is a clue, not a victory parade
The most statistically visible advantage for the reasoning agent appears in right cochlear sparing. The reasoning model achieved significantly lower right cochlear dose than clinical plans after correction. Other lateral organ-at-risk endpoints — left cochlea, right optic nerve, left optic nerve — did not differ significantly.
This is exactly the kind of result that needs careful handling.
The bad interpretation is: AI protects hearing better than human planners.
The better interpretation is: in this cohort, the reasoning agent found additional dose reduction for one secondary endpoint, even though both AI and human plans stayed below the relevant tolerance threshold.
The authors offer a plausible operational explanation. Human planners work under time pressure. Once an organ-at-risk dose is below its clinical threshold, further optimization may not be worth the marginal effort, especially if other cases are waiting. This is not laziness. It is rational workflow management. A hospital is not a philosophy seminar where every acceptable plan is refined until moral perfection.
The reasoning agent, by contrast, has no shift fatigue and no emotional relationship with queue length. Once a constraint is satisfied, it may continue applying the ALARA principle — as low as reasonably achievable — more aggressively.
But the laterality is unresolved. The paper notes possible contributors: the spatial distribution of tumor locations, beam geometry inherited from clinical plans, or statistical variation across a modest sample. That uncertainty belongs in the foreground because it keeps the finding from being oversold.
The useful business takeaway is not “AI discovered a hidden human failure.” It is “agents may be useful for chasing secondary optimization gains after humans would reasonably stop.”
That is a very different procurement argument. Less cinematic, more believable. Annoying, I know.
The human prompt test is where agentic planning becomes operational
The human-in-the-loop refinement stage is one of the paper’s most practical components.
Plans failing conformity benchmarks were sent back to SAGE with an identical natural-language prompt requesting better conformity while maintaining coverage and organ-at-risk constraints. Both model variants improved their conformity index after refinement. The non-reasoning model improved with $p = 0.007$; the reasoning model improved with $p < 0.001$.
The authors correctly warn against reading the smaller p-value as simply “bigger improvement.” They interpret it as greater consistency across patients, with lower variance. That matters because clinical deployment punishes variance.
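A toy calculation shows why a smaller p-value can reflect consistency rather than size. The sketch below runs an exact two-sided Wilcoxon signed-rank test by enumerating every sign pattern, using only the standard library. It is run on two invented sets of paired conformity changes with the same mean. The data, the sample size of eight, and the function names are hypothetical; this is not the paper's analysis code.

```python
from itertools import product
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values ascending; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def exact_wilcoxon_p(diffs: List[float]) -> float:
    """Exact two-sided signed-rank p-value (zero differences are dropped)."""
    nonzero = [d for d in diffs if d != 0]
    ranks = average_ranks([abs(d) for d in nonzero])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    observed = min(w_plus, total - w_plus)
    hits = 0
    # Under the null, every assignment of signs to the ranks is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return hits / 2 ** len(ranks)


# HYPOTHETICAL per-patient improvements; both lists average 0.05.
consistent = [0.02, 0.03, 0.04, 0.045, 0.055, 0.06, 0.07, 0.08]
erratic = [0.30, -0.10, 0.25, -0.15, 0.20, -0.12, 0.07, -0.05]

p_consistent = exact_wilcoxon_p(consistent)
p_erratic = exact_wilcoxon_p(erratic)
print(p_consistent, p_erratic)
```

In this toy setup, the uniformly positive changes yield a far smaller p-value than the erratic ones, even though the average improvement is identical. That is the reading the authors ask for: consistency, not magnitude.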
An assistant that sometimes makes a brilliant adjustment and sometimes wanders into malformed output is not an assistant. It is a slot machine with a medical license application.
The reasoning model’s post-refinement conformity approached clinical benchmark values more closely than the non-reasoning model’s did. This is where the comparison-based structure earns its keep. If we only compare the reasoning agent to human plans, we miss the operational question: does reasoning make the agent more responsive to expert steering?
The answer suggested by the paper is yes, especially when the human prompt is not a hard-coded instruction but a natural-language clinical correction.
This is not a small detail. Many AI deployment plans imagine a clean handoff: model generates, human approves. Real workflows are messier. Humans redirect. Models revise. The question becomes whether the system can participate in a correction loop without forcing the human to translate clinical intent into machine ritual.
SAGE’s refinement test is therefore not just an accuracy test. It is an interface test disguised as dosimetry.
The dialogue analysis is not decorative explainability
The most distinctive part of the study is the mechanistic content analysis of optimization dialogues.
The authors categorize reasoning traces into six System 2-like behaviors: problem decomposition, prospective verification, self-correction, mathematical reasoning, trade-off deliberation, and forward simulation. This is not a claim that the model has human cognition. The paper treats the System 1/System 2 distinction as a behavioral analogy, which is the sane version of that metaphor.
The results are stark. Problem decomposition, prospective verification, and self-correction were detected exclusively in the reasoning model: 537, 457, and 735 instances respectively. The non-reasoning model produced zero instances in these categories. Mathematical reasoning appeared mostly in the reasoning model, with 1,162 instances versus 49. Trade-off deliberation showed 609 versus 7. Forward simulation showed 1,888 versus 58.
The reasoning model also produced roughly five-fold fewer format errors: 25 total versus 122 for the non-reasoning model, with a median of 0 versus 3 per patient.
This is the part of the paper that moves the argument beyond “reasoning sounds nicer.” The reasoning traces correlate with observable workflow advantages: better structured outputs, more explicit constraint checking, and more consistent refinement behavior.
Still, the purpose of this analysis should be classified carefully.
It is not the main clinical evidence. The dosimetric endpoints carry that role.
It is not an ablation in the strictest sense, because the two models differ in more than a single controllable reasoning switch. The reasoning model is Qwen QwQ-32B-Reasoning; the non-reasoning model is Llama 3.1-70B. Their training data, architecture choices, and inference behavior all differ; this is not one model with a magic “think” button switched on and off.
It is best understood as behavioral evidence: a structured comparison of how the two agent variants act during planning.
| Experiment component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Clinical dosimetric comparison | Main evidence | Reasoning SAGE can produce plans comparable to clinical plans on tested endpoints | Superiority over human planners |
| Human-in-the-loop refinement | Operational workflow test | Agents can respond to standardized expert feedback; reasoning appears more consistent | Real-world multi-reviewer deployment readiness |
| Dialogue content analysis | Behavioral mechanism analysis | Reasoning model produces more verification, self-correction, trade-off language, and fewer format errors | Causal proof that chain-of-thought alone caused dosimetric gains |
| Right cochlea result | Secondary endpoint finding | Reasoning agent may pursue additional OAR sparing beyond thresholds | Generalized toxicity reduction |
This distinction is essential for business readers. “Explainability” often becomes a decorative dashboard bolted onto an opaque model after the fact. Here, the explanation is closer to an operational trace: a record of constraint checking and trade-off reasoning generated during the planning process itself.
That is much more useful than an apologetic heatmap saying, “The model looked over here, probably.”
The non-reasoning agent is not useless; it is the cheaper choice where overthinking does not pay
One of the more mature parts of the paper is its refusal to make reasoning universally superior.
Both model variants met clinical constraints for target coverage and organ-at-risk dose limits. Both improved conformity after human feedback. The non-reasoning model was faster and cheaper computationally. The reasoning model required roughly three times longer inference per plan.
This creates a practical routing problem.
For routine cases with clear constraints and limited degrees of freedom, a non-reasoning agent may be sufficient. It can generate acceptable plans more efficiently. In that setting, paying for extended deliberation may be like hiring a constitutional lawyer to read a parking sign.
For complex SRS cases, subtle geometric trade-offs, or situations where auditability is critical, the reasoning agent may justify its slower, costlier inference. The value is not merely better numbers. It is a more reviewable planning process.
A useful implementation would not ask, “Should we deploy reasoning agents?” It would ask:
| Case condition | Better default architecture | Reason |
|---|---|---|
| Routine anatomy, stable constraints, low ambiguity | Non-reasoning planner plus standard QA | Faster and likely adequate |
| Tight target-OAR geometry | Reasoning planner | Deliberation may improve trade-off handling |
| Plans requiring expert refinement | Reasoning planner | Better response to natural-language correction |
| High audit burden or disputed constraint trade-off | Reasoning planner | Traceable verification and decision logic matter |
| Resource-constrained environment | Non-reasoning planner or hybrid routing | Reasoning inference cost may dominate benefits |
This is the business argument hiding inside the technical paper: reasoning should be treated as a scarce workflow resource. Use it where reasoning creates value. Do not sprinkle it everywhere just because “agentic” looks good in a slide deck.
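The routing table above translates almost directly into code. The following is a sketch under stated assumptions: the `CaseProfile` flags, the `Planner` enum, and the `route` function are hypothetical names. The rule order, in which complexity and audit flags outrank the resource flag, is my own choice rather than anything the paper prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Planner(Enum):
    NON_REASONING = "non-reasoning planner plus standard QA"
    REASONING = "reasoning planner"


@dataclass(frozen=True)
class CaseProfile:  # hypothetical flags, one per non-default row of the table
    tight_target_oar_geometry: bool = False
    needs_expert_refinement: bool = False
    high_audit_burden: bool = False
    resource_constrained: bool = False


def route(case: CaseProfile) -> Tuple[Planner, str]:
    """Pick a default planner and give the reason, following the table's rows."""
    # Assumed precedence: any complexity or audit flag wins over resource limits.
    if case.tight_target_oar_geometry:
        return Planner.REASONING, "deliberation may improve trade-off handling"
    if case.needs_expert_refinement:
        return Planner.REASONING, "better response to natural-language correction"
    if case.high_audit_burden:
        return Planner.REASONING, "traceable verification and decision logic matter"
    if case.resource_constrained:
        return Planner.NON_REASONING, "reasoning inference cost may dominate benefits"
    return Planner.NON_REASONING, "faster and likely adequate"


routine = route(CaseProfile())
disputed = route(CaseProfile(high_audit_burden=True, resource_constrained=True))
print(routine)
print(disputed)
```

Returning the reason alongside the choice keeps the routing decision itself auditable, which fits the article's broader argument about reviewable process.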
The real ROI is not replacing planners; it is reducing review friction
The easiest sales pitch is labor substitution. Automated planning reduces planner workload, therefore hospitals save money.
That pitch is too thin for this paper.
A more credible ROI pathway has three layers.
First, SAGE may reduce the time spent generating initial acceptable plans. That is the obvious productivity layer. In a field facing planner shortages and compressed treatment timelines, even a strong draft can matter.
Second, reasoning traces may reduce review friction. A physicist does not only need a plan. They need to know whether constraints were checked, which trade-offs were considered, and whether a refinement request was followed correctly. If the agent’s process is inspectable, review becomes less like black-box inspection and more like supervised delegation.
Third, human-in-the-loop refinement may shift expert effort from parameter manipulation to decision oversight. The human’s role becomes less “manually search the parameter space” and more “judge whether the agent’s plan and reasoning satisfy clinical intent.”
That is a better fit for healthcare than the fantasy of full autonomy. Hospitals do not need an AI that impersonates responsibility. They need automation that makes responsibility easier to exercise.
The paper’s architecture points in that direction. It keeps a human decision point. It uses natural-language feedback. It produces structured traces. It compares reasoning and non-reasoning behavior rather than assuming every LLM is the same animal wearing a different badge.
Where the evidence stops
The study is promising, but its boundaries are not cosmetic.
The cohort contains 41 patients from one institution. The cases are single-target brain-metastasis SRS with a single-fraction 18 Gy prescription. External validation across centers, beam configurations, planning conventions, and more complex indications is still needed.
The study is retrospective. It evaluates dosimetric endpoints, not patient survival, toxicity, hearing preservation, or long-term clinical outcomes. The right cochlear sparing result is interesting, but whether it translates into reduced late toxicity remains unknown.
The comparison between reasoning and non-reasoning agents is also not a pure causal ablation. The models differ: Llama 3.1-70B for the non-reasoning variant and Qwen QwQ-32B-Reasoning for the reasoning variant. The authors reasonably frame the distinction behaviorally, but business readers should not interpret the result as proof that any chain-of-thought wrapper on any model will reproduce the same gains.
There is also the compute problem. The reasoning model took about three times longer per plan. In a large academic center with institutional GPU infrastructure, that may be acceptable for complex cases. In a smaller clinic, the deployment calculus changes.
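A back-of-envelope calculation shows how that ratio plays out under hybrid routing. It assumes, purely for illustration, that a fraction $f$ of cases goes to the reasoning model, that a non-reasoning plan takes time $t$, and that the roughly threefold ratio applies uniformly to every case. The symbols $f$ and $t$ and the example value below are not from the paper.

$$
\bar{t}(f) = (1 - f)\,t + f \cdot 3t = (1 + 2f)\,t
$$

With $f = 0.25$, for example, the mean inference time per plan is $1.5t$, half of the $3t$ an all-reasoning deployment would pay.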
Finally, the content analysis relies on keyword and phrase pattern matching with manual verification of a sample. That is useful for behavioral comparison, but not the same as fully validated cognitive process measurement. The traces are operationally valuable because they are reviewable, not because they prove the model “thinks” like a physicist in the human sense.
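To make the method concrete, here is a minimal sketch of keyword and phrase pattern matching over a planning dialogue. The six category names follow the paper, but every regular expression and both sample texts are hypothetical inventions for illustration. The paper's actual pattern lists are not reproduced here, and the manual verification step is not modeled.

```python
import re
from collections import Counter
from typing import Dict, List

# HYPOTHETICAL patterns; these are illustrative guesses, not the study's keyword lists.
PATTERNS: Dict[str, List[str]] = {
    "problem_decomposition": [r"\bfirst\b.*\bthen\b", r"\bstep \d+\b"],
    "prospective_verification": [r"\b(let me|i should) (check|verify)\b"],
    "self_correction": [r"\bwait\b", r"\bon second thought\b"],
    "mathematical_reasoning": [r"\d+(\.\d+)?\s*(gy|%)"],
    "trade_off_deliberation": [r"\btrade-?off\b", r"\bat the cost of\b"],
    "forward_simulation": [r"\bif (i|we) (increase|decrease|raise|lower)\b"],
}

COMPILED = {
    category: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for category, patterns in PATTERNS.items()
}


def count_behaviors(dialogue: str) -> Counter:
    """Count pattern hits per behavior category in one dialogue string."""
    counts = Counter({category: 0 for category in COMPILED})
    for category, regexes in COMPILED.items():
        for regex in regexes:
            counts[category] += sum(1 for _ in regex.finditer(dialogue))
    return counts


# Invented sample texts, written to resemble the two output styles.
reasoning_trace = (
    "First reduce the ring weight, then recheck coverage. "
    "If I increase the target priority, conformity may drop. "
    "Wait, coverage is 94.2%, so let me verify the limit before changing anything. "
    "There is a trade-off between gradient and the cochlea dose of 3.1 Gy."
)
direct_output = "Set ring weight to 80. Set target priority to 120."

print(count_behaviors(reasoning_trace))
print(count_behaviors(direct_output))
```

The sketch also makes the limitation visible: the counts measure phrasing, so a model that verifies silently scores zero, and one that merely says “wait” scores a self-correction.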
The planner of the future may be a router, not a robot
The article’s title says “think before you beam,” but the better operational rule is more conditional: think when thinking pays.
The reasoning agent in this paper earns attention because it does three things at once. It produces clinically comparable SRS plans on primary dosimetric endpoints. It responds meaningfully to human refinement prompts. It leaves behind richer traces of constraint verification, trade-off deliberation, and forward simulation than the non-reasoning variant.
That combination is more important than any single metric.
A non-reasoning agent can be useful when the task is straightforward. A reasoning agent becomes attractive when geometry, clinical trade-offs, and accountability collide. Human planners remain essential because the agent’s output still requires judgment, approval, and contextual interpretation.
That is not a disappointing compromise. It is probably the only deployable version of the story.
In healthcare, intelligence is not enough. Performance is not enough. Even explainability is not enough if it arrives after the decision like a press secretary after a scandal.
The valuable agent is one that can plan, revise, and show its work while a qualified human remains in control.
Not AI instead of physics. AI that has learned, at least behaviorally, to respect why physics is hard.
Cognaptus: Automate the Present, Incubate the Future.
[1] Humza Nusrat et al., “Automated stereotactic radiosurgery planning using a human-in-the-loop reasoning large language model agent,” arXiv:2512.20586, 2025. https://arxiv.org/abs/2512.20586