The surgeon does not need another chatbot

Operating rooms already have enough things demanding attention. Monitors, tools, imaging, staff coordination, alarms, procedural checklists, and the small matter of the patient. In robotic surgery, the problem becomes sharper: the surgeon’s hands are occupied and their visual attention is locked into the console. The data may be nearby, but nearby is not the same as usable.

That is the practical gap behind “Voice-Interactive Surgical Agent for Multimodal Patient Data Control,” which introduces VISA, a hierarchical LLM-based multi-agent system for controlling patient data during robotic surgery by voice.[1] The paper is not proposing an autonomous surgeon. It is not asking an LLM to decide where to cut, what to remove, or whether a vessel is safe. Good. Nobody sensible ordered that.

The actual proposal is narrower and more useful: let a surgeon speak naturally, then let an orchestration layer decide whether the request should retrieve patient information, move through CT slices, or manipulate a 3D anatomical model overlaid on the surgical video.

That distinction matters. The paper is not about replacing surgical judgement. It is about reducing interaction friction at the exact moment when interaction friction is expensive.

The mechanism is the contribution, not the microphone

A weak version of this idea would be a voice menu: “say command A to show CT, command B to hide CT, command C to rotate model.” Useful, perhaps, but brittle. Surgeons do not always speak in clean command syntax. Speech recognition mishears medical vocabulary. “CT” becomes “city.” “Coronal” becomes “corona.” “Lung” becomes “long.” This is not a future-of-work issue. It is Tuesday with a microphone.

VISA’s more interesting move is to split the work into stages. The system listens, transcribes, corrects, validates, reasons, routes, and then acts. The hierarchy is the point.

At the top sits a workflow orchestrator. It decides which function should run next and whether the workflow should continue, retry, or stop. Below it are workflow functions for audio capture, speech-to-text, correction and validation, and command reasoning. Below those are three task-specific agents:

| Agent | What it controls | Operational role |
| --- | --- | --- |
| Information Retrieval (IR) | Clinical patient data overlays | “Show diagnosis,” “display PFT,” “hide patient info” |
| Image Viewer (IV) | CT DICOM views | Scroll, zoom, move axial/coronal/sagittal slices |
| Anatomy Rendering (AR) | 3D anatomical models | Show structures, rotate views, zoom into anatomy, remove models |

The architecture is deliberately modular. The orchestrator does not need to know every detail of CT slice movement. The IV agent does. The AR agent does not need to manage wake-word detection. The audio function does. This separation is not glamorous, which is probably why it is useful.
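To make the separation concrete, here is a minimal sketch of one routing layer sitting above three bounded agents. It is not the paper's implementation: VISA uses an LLM for command reasoning, whereas the keyword table, the handler functions, and the names `route` and `dispatch` below are hypothetical stand-ins chosen for illustration.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical agent handlers: each one owns only its own domain logic.
def information_retrieval(command: str) -> str:
    return f"IR: {command}"

def image_viewer(command: str) -> str:
    return f"IV: {command}"

def anatomy_rendering(command: str) -> str:
    return f"AR: {command}"

AGENTS: Dict[str, Callable[[str], str]] = {
    "IR": information_retrieval,
    "IV": image_viewer,
    "AR": anatomy_rendering,
}

# Toy keyword table standing in for the LLM-based command-reasoning stage.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "IR": ("diagnosis", "pft", "patient info"),
    "IV": ("ct", "axial", "coronal", "sagittal", "slice"),
    "AR": ("model", "rotate", "anatomy", "3d"),
}

def route(command: str) -> Optional[str]:
    """Return the first agent whose keyword appears as a whole word or phrase."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", command.lower())
    padded = f" {' '.join(cleaned.split())} "
    for agent, keywords in KEYWORDS.items():
        if any(f" {kw} " in padded for kw in keywords):
            return agent
    return None

def dispatch(command: str) -> str:
    """The routing layer picks an agent; it never touches agent internals."""
    agent = route(command)
    if agent is None:
        return "no agent matched: ask the user to restate"
    return AGENTS[agent](command)
```

The point of the sketch is the shape, not the matching logic: `dispatch` knows which agent to call and nothing about how a CT slice moves.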

The paper’s core design claim is that clinical voice interaction becomes more robust when the system is allowed to repair and route commands before execution, rather than forcing a single model to map raw speech directly into an action.

Reliability comes from staged correction, not heroic reasoning

The most important evidence in the paper is not merely the final success rate. It is the pattern of recovery across stages.

The paper evaluates VISA on 240 commands: 44 for information retrieval, 81 for CT image viewing, and 115 for anatomy rendering. The dataset is annotated across three command dimensions: structure, type, and expression. Structure distinguishes single commands from composite ones. Type separates explicit, implicit, and natural-language-question (NLQ) forms. Expression separates baseline wording, abbreviations, and paraphrases.

This matters because a system that only handles “show CT” has not solved the operating room interface problem. It has solved a demo script.

The paper reports the following overall stage-level results:

| Stage | Reported accuracy | Interpretation |
| --- | --- | --- |
| Real-time audio wake-word detection | 88.6% | The system recognised “davinci” in 266 of 300 attempts; failures were mostly false rejections |
| Speech-to-text | 73.8% | The weakest stage, mainly because medical terms are acoustically fragile |
| Command correction | 98.3% | The LLM repaired many transcription errors using medical correction rules and memory |
| Command reasoning | 98.3% | Agent routing remained strong after correction |
| Action determination | 95.8% | Some loss appeared when choosing exact actions and parameters |
| Orchestration flow | 100% | The workflow sequence itself remained controlled |

The obvious reading is “speech-to-text is bad, LLM fixes it.” That is partly true, but too simple. The more useful reading is that the system treats early-stage failure as expected input. Transcription errors are not catastrophic because later stages are designed to absorb them.

For example, the paper describes cases where the STT output was wrong but still recoverable: “CT” being heard as “city,” or “age info” becoming something malformed enough to need correction before routing. The correction-and-validation stage uses the recent memory state and domain-specific rules to revise the command. Then the command reasoning stage selects the appropriate agent.
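A rough sketch of what a correction pass looks like, under heavy simplification. VISA prompts an LLM with medical correction rules and recent memory; the lookup table below is a hypothetical stand-in built only from the mishearings mentioned in this article, and the function name `correct` is an assumption.

```python
import re

# Hypothetical rule table: misheard word -> intended medical term.
# Taken from the examples in this article, not from the paper's prompt.
CORRECTIONS = {"city": "CT", "corona": "coronal", "long": "lung"}

def correct(transcript: str) -> str:
    """Replace whole words that match a known mishearing; leave the rest alone."""
    def fix(match: "re.Match[str]") -> str:
        word = match.group(0)
        return CORRECTIONS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", fix, transcript)
```

A static table like this would happily turn “a long incision” into “a lung incision,” which is one reason a context-aware correction stage is more plausible than a dictionary in practice.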

This is the lesson for enterprise agent systems generally: robustness often comes less from making the first model perfect and more from designing the workflow so that the first model is allowed to be imperfect.

MOEM is useful because single accuracy numbers lie politely

The authors introduce the Multi-level Orchestration Evaluation Metric, or MOEM, to evaluate both stage-level and workflow-level performance. That is a welcome change from the usual “intent classification accuracy” beauty contest, where a system can look excellent while failing the actual workflow.

The paper reports three workflow success conditions for VISA:

| Condition | Reported success rate | What it means |
| --- | --- | --- |
| Strict | 65.8% | Every stage succeeds without invalid loops |
| Single-pass | 89.2% | Early errors can be corrected in the same attempt |
| Multi-pass | 95.8% | The workflow may return to audio input and allow restatement, up to three invalid loops |

The gap between strict and single-pass is the story. Under a brittle evaluation, the system looks mediocre: 65.8%. Under a workflow-aware evaluation, it looks much more useful: 89.2% in single-pass and 95.8% in multi-pass.

This is not statistical decoration. It changes what the result means.

Strict success asks: did the system get everything right the first time? Single-pass asks: did the system recover internally before bothering the user? Multi-pass asks: can the human and system recover together when the first attempt fails?
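The three questions can be written down as three predicates over the same trial. This is a sketch of one reading of the definitions above, not MOEM as specified in the paper: the `Attempt` record, the function name, and the choice to count “three invalid loops” as three failed attempts before a final one are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

# The article's limit; counting loops as failed attempts is an assumption.
MAX_INVALID_LOOPS = 3

@dataclass
class Attempt:
    stage_ok: List[bool]  # correctness of each stage, in pipeline order
    final_ok: bool        # did the executed action match the intended one?

def success_conditions(attempts: List[Attempt]) -> Dict[str, bool]:
    """Evaluate one trial (one intended command) under all three conditions."""
    first = attempts[0]
    allowed = attempts[: MAX_INVALID_LOOPS + 1]
    return {
        # Every stage right on the first attempt.
        "strict": all(first.stage_ok) and first.final_ok,
        # First attempt ends correctly, even if an early stage needed repair.
        "single_pass": first.final_ok,
        # Some attempt within the loop budget ends correctly.
        "multi_pass": any(a.final_ok for a in allowed),
    }
```

Written this way, the conditions are nested: any trial that passes strict also passes single-pass, and any trial that passes single-pass also passes multi-pass. That is why the reported rates can only rise from left to right.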

In clinical interfaces, the third question is often the most realistic. Humans repeat themselves. Assistants ask for clarification. Systems that fail gracefully may be operationally better than systems that only perform well under sterile phrasing.

Still, the multi-pass number should not be mistaken for full autonomy. It depends on users restating commands more clearly. That is acceptable for a control interface. It would be less acceptable if the system were making clinical decisions. Again: this is interface intelligence, not surgical authority.

The agent split reveals where complexity actually lives

The task-specific breakdown is instructive. The IR agent performs best because retrieving and displaying predefined clinical fields is comparatively constrained. The IV and AR agents face more spatial and parameter complexity.

Workflow-level success follows that pattern. In the paper’s agent-level results, IR reaches 86.4% strict, 93.2% single-pass, and 97.7% multi-pass. IV is more volatile: 48.1% strict, 90.1% single-pass, and 97.5% multi-pass. AR reaches 70.4% strict, 87.0% single-pass, and 93.9% multi-pass.
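As a quick consistency check, the per-agent rates can be weighted by the command counts given earlier (44 IR, 81 IV, 115 AR) to see whether they reproduce the overall figures. The calculation below uses only numbers quoted in this article; the variable and function names are illustrative.

```python
# Command counts and per-agent success rates (percent) as quoted in the article.
counts = {"IR": 44, "IV": 81, "AR": 115}
rates = {
    "strict": {"IR": 86.4, "IV": 48.1, "AR": 70.4},
    "single-pass": {"IR": 93.2, "IV": 90.1, "AR": 87.0},
    "multi-pass": {"IR": 97.7, "IV": 97.5, "AR": 93.9},
}
total = sum(counts.values())  # 240 commands

def overall(condition: str) -> float:
    """Count-weighted average of the per-agent rates for one condition."""
    weighted = sum(rates[condition][agent] * counts[agent] for agent in counts)
    return round(weighted / total, 1)

for condition in rates:
    print(condition, overall(condition))
```

The weighted averages come out at 65.8, 89.2, and 95.8, matching the overall strict, single-pass, and multi-pass rates. The aggregate is therefore dominated by AR, which contributes nearly half of the commands.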

The IV agent’s low strict rate is not because CT navigation is impossible. It is because the STT stage often misrecognises CT-related terms. Once corrected, the agent recovers strongly. That tells a specific product story: the bottleneck is not necessarily reasoning about image movement; it is reliable recognition of the vocabulary that triggers the movement.

The AR agent’s weaker multi-pass rate is more structural. Manipulating 3D anatomy involves more actions, parameters, viewpoints, rotations, and zoom histories. There are more ways to be slightly wrong.

That difference is commercially important. Buyers will not purchase “an LLM surgery agent.” They will purchase specific capabilities with specific risk profiles. Clinical information overlays are one class of problem. CT navigation is another. 3D anatomical manipulation is another. Pretending they are the same because they all sit behind a voice interface is how procurement decks become expensive fiction.

Memory turns ambiguous commands into usable commands

One of VISA’s practical mechanisms is memory. The system stores previous commands, selected agents, and parameters. This helps when a surgeon says something underspecified like “zoom in,” “reset,” or “surgical view.”

Without memory, “zoom in” is not a command. It is a small semantic hostage situation. Zoom in on what? A CT plane? A 3D anatomical model? Which view? Which structure?

VISA uses recent context to resolve that ambiguity. The correction-and-validation stage can inherit the previous agent when the command is empty or implicit. The command reasoning stage also receives recent revised commands and associated agents. The IV and AR agents preserve state across sequential video clips: CT slice positions continue, AR zoom paths can be reversed, and reset commands restore default states.
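A small sketch of the state this implies, assuming a much simpler memory than the paper's. The class name `SessionMemory`, the zoom factor, and the idea of storing zoom steps as a list are hypothetical; the behaviours illustrated (inheriting the previous agent, reversing a zoom path, resetting to defaults) are the ones described above.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SessionMemory:
    last_agent: Optional[str] = None
    zoom_path: List[float] = field(default_factory=list)  # zoom steps applied so far

    def resolve_agent(self, routed: Optional[str]) -> Optional[str]:
        """Use the routed agent if there is one; otherwise inherit the previous agent."""
        agent = routed if routed is not None else self.last_agent
        if agent is not None:
            self.last_agent = agent
        return agent

    def zoom_in(self, factor: float = 2.0) -> None:
        self.zoom_path.append(factor)

    def zoom_out(self) -> None:
        """Reverse the most recent zoom step, if any."""
        if self.zoom_path:
            self.zoom_path.pop()

    def reset(self) -> None:
        """Restore the default view."""
        self.zoom_path.clear()

    @property
    def zoom(self) -> float:
        # Product of all recorded steps; an empty path means no zoom (1).
        return math.prod(self.zoom_path)
```

With this in place, a bare “zoom in” after an anatomy command resolves to the AR agent because `resolve_agent(None)` falls back to `last_agent`, and “zoom out” undoes exactly one recorded step rather than guessing a scale.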

This is not just convenience. It is the difference between a command system and an interaction system.

A command system treats every utterance as isolated. An interaction system understands that the next utterance is often a continuation of the last one. In surgery, manufacturing, field maintenance, and logistics, that continuity is not a luxury. It is how humans actually work.

The category tests are robustness probes, not a second thesis

The paper’s category-level evaluation is best read as a robustness analysis. It tests how the system handles variations in structure, intent clarity, and phrasing.

The results are unsurprising in the useful way:

| Category | Reported multi-pass success | Likely purpose of the test | What it supports |
| --- | --- | --- | --- |
| Single commands | 96.9% | Main robustness check | VISA handles straightforward one-step commands well |
| Composite commands | 80.0% | Stress test | Multi-action requests remain fragile |
| Explicit commands | 97.5% | Baseline linguistic clarity | Clear commands are stable |
| NLQ commands | 95.0% | Natural language variation | Question-style speech is mostly manageable |
| Implicit commands | 95.0% | Memory dependence | Context helps resolve underspecified commands |
| Baseline expressions | 97.2% | Prompt-aligned phrasing | Expected wording works well |
| Abbreviations | 100.0% | Domain shorthand test | Prompted abbreviations are handled cleanly |
| Paraphrases | 92.5% | Unseen expression test | Linguistic flexibility exists, but has limits |

The composite-command result is the one to watch. VISA can handle some combined instructions when one function can absorb the parameters. For example, moving a CT slice to the middle and zooming in can be represented as one image-viewer action with parameters. But commands requiring genuinely sequential multi-step actions are harder. The paper identifies failures involving requests such as zooming and rotating, or initialising and zooming, where the system struggles with execution decomposition.
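The distinction between “one action with several parameters” and “several actions in order” can be sketched as a merge step over a parsed plan. Everything here is illustrative: the `Action` record, the function names such as `set_view`, and the merge rule are assumptions, not the paper's decomposition logic.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Action:
    agent: str
    function: str
    params: Dict[str, str] = field(default_factory=dict)

def merge_if_possible(steps: List[Action]) -> List[Action]:
    """Collapse adjacent steps that call the same function on the same agent
    with non-overlapping parameters; keep everything else as an ordered sequence."""
    merged: List[Action] = []
    for step in steps:
        prev = merged[-1] if merged else None
        same_target = (
            prev is not None
            and prev.agent == step.agent
            and prev.function == step.function
        )
        if same_target and not (prev.params.keys() & step.params.keys()):
            prev.params.update(step.params)  # one call absorbs both parameters
        else:
            merged.append(Action(step.agent, step.function, dict(step.params)))
    return merged
```

Under these assumptions, “move to the middle slice and zoom in” collapses into a single image-viewer call, while “zoom in and rotate left” stays as two anatomy-rendering calls that must run in order. The second case is the one where the paper reports trouble.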

This is the same frontier facing many agentic systems outside healthcare. Single-step tool use is increasingly manageable. Multi-step execution under ambiguity is where the floorboards creak.

Voice robustness looks promising, but the test is still polite

The paper also includes an applicability analysis for real surgical settings. This section is best treated as a robustness and sensitivity test, not as definitive clinical validation.

The authors compare four synthesized Microsoft Edge TTS voices with human speech. Synthesized voices produce higher STT accuracy because they are clear and native-like. Human speech has lower strict success because of STT failures, but it achieves the highest multi-pass success because humans can restate commands more clearly after failure.

That is a useful finding, but it should not be overextended. Synthetic voices are tidy. Real operating rooms are not. Accents, masks, stress, overlapping conversation, device noise, multilingual switching, and institution-specific shorthand will all push the interface harder than the paper’s setup.

The unsupported-command test is also useful. The wake-word module stayed active for one hour, with unsupported commands spoken every four minutes. The system recorded zero unintended wake-word activations. After the wake word was triggered, however, three of 15 unsupported commands were misclassified as valid patient-information requests: vital signs, CO₂ pressure, and frozen-section timing. Because those fields were not present in the available data columns, the system either displayed previous information or hid the overlay.

That is not a disaster, but it is exactly the kind of edge case clinical software teams must respect. A system that says “I cannot access that field” is safer than one that silently does something adjacent. Adjacent is where operational risk likes to live.
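The safer behaviour is cheap to express: check the requested field against the configured scope before touching the display. The field names and the function below are hypothetical; only the principle, refusing explicitly instead of doing something adjacent, comes from the discussion above.

```python
# Hypothetical set of configured data columns for the IR agent.
AVAILABLE_FIELDS = {"diagnosis", "age", "pft"}

def handle_info_request(field_name: str) -> str:
    """Show a field only if it is in scope; otherwise refuse explicitly."""
    key = field_name.strip().lower()
    if key not in AVAILABLE_FIELDS:
        # Leave the overlay untouched and say so, rather than guessing.
        return f"I cannot access that field: {field_name.strip()}"
    return f"Showing {key}"
```

For example, `handle_info_request("vital signs")` returns a refusal, while `handle_info_request("PFT")` returns `Showing pft`. The overlay state changes only on the second path.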

What the paper directly shows

The direct evidence supports three claims.

First, hierarchical orchestration improves robustness for voice-controlled multimodal data interaction. The system does not merely classify intent; it plans stage execution, corrects transcription errors, routes commands, and invokes specialised agents.

Second, memory is not optional for natural voice control. It helps resolve implicit commands and maintain state across sequential interactions.

Third, workflow-aware evaluation is more informative than one-shot accuracy. VISA looks very different under strict success, single-pass recovery, and multi-pass recovery. That difference is not a methodological footnote. It is the product.

The paper’s strongest evidence comes from the 240-command evaluation, the stage-level breakdown, the workflow success conditions, the category analysis, the voice-type comparison, the unsupported-command test, and the sequential-command demonstrations. The category and voice tests are robustness analyses. The unsupported-command test is a safety-adjacent validation probe. The sequential-command examples are an implementation-focused continuity check. None of them should be inflated into proof of clinical deployment readiness.

What Cognaptus infers for business use

The business relevance is not “LLMs can do surgery.” Please do not put that on a slide unless the goal is to frighten the adults in the room.

The better interpretation is that agent orchestration can make hands-free operational interfaces more flexible than fixed command menus. Robotic surgery is the high-stakes example, but the design pattern travels.

In healthcare, the near-term pathway is likely to be data interaction: clinical overlays, imaging navigation, procedural documentation, and context-aware retrieval inside existing surgical or radiology workflows. The ROI logic is attention preservation, not labour replacement. If an interface reduces the need to look away, call an assistant, or manually manipulate another screen, it may improve workflow continuity.

Outside healthcare, the same pattern applies wherever workers need data while their hands and eyes are busy: manufacturing inspection, aviation maintenance, warehouse operations, field service, oil and gas, and emergency response. The useful architecture is not “one chatbot for operations.” It is an orchestrator plus specialised agents, each with bounded authority and measurable failure modes.

For vendors, the product lesson is equally blunt: sell the workflow, not the model. The model is only one component. The orchestrator, validation rules, memory state, domain vocabulary, fallback loops, and UI integration are what make the system operational.

The boundaries are not cosmetic

The study has real limits, and they affect interpretation.

The evaluation uses 240 commands and patient data from a single retrospective lung-surgery case involving surgical video, clinical information, CT images, and 3D anatomical models. That is enough to test feasibility, not enough to establish broad clinical reliability.

The command set is structured and curated. It includes explicit, implicit, NLQ, abbreviation, paraphrase, single, and composite examples, but it still cannot represent the full linguistic mess of a real operating room. The authors acknowledge that multilingual mixing and clinical variability remain open issues.

The system is also closely tied to its prompting strategy and Gemma3 27B setup. The paper reports local execution through Ollama, with the model occupying approximately 17 GB on a single NVIDIA RTX 4500 Ada GPU, while a second GPU handles other APIs, models, or agents. That is plausible for a research platform, but sequential LLM calls create cumulative latency. In a time-sensitive clinical environment, delay is not merely annoying. It changes whether the interface is usable.

There are also product-specific constraints. The IR agent displays a predefined set of clinician-selected fields. Different departments and surgeons will want different fields. The AR agent uses 3D models reconstructed from CT images using TotalSegmentator and 3D Slicer; clinical deployment would require validated software pipelines and stronger anatomical precision guarantees. The unsupported-command failures show that validation needs to be more conservative when a request asks for data outside the configured scope.

Most importantly, composite commands remain fragile. That is the difference between “show this” and “show this, then move that, then rotate the other thing.” Agentic workflows keep running into this wall. Surgery merely makes the wall less forgiving.

The quiet lesson: autonomy begins as orchestration

VISA is not exciting because it lets a surgeon talk to a system. Voice control has been around for decades. It is exciting because it treats voice interaction as a workflow orchestration problem rather than a command-recognition problem.

That is the right abstraction. Real operational AI will not arrive as a single grand model making grand decisions. It will arrive as layers: capture, transcription, correction, validation, routing, task-specific execution, state memory, error recovery, and human override. Less cinematic, more useful. Tragic for conference keynotes, excellent for actual work.

The paper’s contribution is therefore not a promise that robotic surgery is about to become autonomous. It is a proof-of-concept for a more disciplined kind of autonomy: bounded, modular, recoverable, and aware that humans often speak imperfectly while doing difficult things.

That is where many enterprise AI systems should be heading. Not toward magical assistants with unlimited authority, but toward orchestrated workflows that know what they can do, repair what they can repair, and ask again when the command is not safe enough to execute.

Cognaptus: Automate the Present, Incubate the Future.


[1] Hyeryun Park, Byung Mo Gu, Jun Hee Lee, Byeong Hyeon Choi, Sekeun Kim, Hyun Koo Kim, and Kyungsang Kim, “Voice-Interactive Surgical Agent for Multimodal Patient Data Control,” arXiv:2511.07392, 2025, https://arxiv.org/abs/2511.07392