A chatbot can say yes to almost anything. That is part of the charm. It is also part of the problem.
Ask an agent to “clean this dataset, train a model, compare alternatives, and generate a report,” and the conversation feels wonderfully frictionless. The system can interpret intent, improvise steps, write code, call tools, and explain itself in a tone that suggests adult supervision is somewhere nearby.
But in real business and scientific workflows, the important question is not whether the agent sounds competent. The important question is: what exactly ran?
That question is embarrassingly simple. It is also where many agentic AI designs start to wobble.
The paper “Talk Freely, Execute Strictly: Schema-Gated Agentic AI for Flexible and Reproducible Scientific Workflows” argues that the central architectural problem is not “how autonomous should the agent be?” but where execution authority should live.[1] In other words: should the LLM decide what computation runs, or should execution be allowed only when the proposed action passes a machine-checkable boundary?
The authors’ answer is schema-gated orchestration. Let the user talk freely. Let the model reason freely. But when the system moves from conversation to execution, nothing runs unless the full action validates against a schema.
This sounds like a technical detail. It is not. It is closer to a constitutional rule for agentic AI systems: conversation may propose; schemas dispose.
The real trade-off is not intelligence versus automation
Most public discussion of agentic AI still circles around autonomy: Can the agent plan? Can it use tools? Can it complete a task without hand-holding? Can it replace a junior analyst, a research assistant, or the poor intern who used to copy numbers between spreadsheets?
The paper takes a more useful route. It studies the tension between two requirements that industrial R&D practitioners actually care about:
- Execution determinism: the system must run stable, constrained, replayable operations.
- Conversational flexibility: users must be able to express intent naturally, iterate quickly, and avoid rigid workflow authoring overhead.
The authors derive these requirements from semi-structured interviews with 18 experts across 10 industrial R&D stakeholders. The interviews are not just ornamental qualitative wallpaper. They are used to identify what practitioners repeatedly asked for: integration, workflow automation, data preparation, security, domain expertise, data search, natural-language interaction, visualisation, explainability, and human oversight.
The striking point is that practitioners want both poles at once. They want the freedom of natural language and the discipline of reproducible computation. They want the agent to be conversational when interpreting a request, but bureaucratic when running something consequential. A rare case where bureaucracy is not the villain.
The authors operationalise this tension using two ordinal axes:
| Axis | What it measures | What high score means |
|---|---|---|
| Execution Determinism (ED) | How strongly “what runs” is constrained before execution | Validated, replayable, versioned execution artifacts |
| Conversational Flexibility (CF) | How directly natural language can shape system actions | Natural language can select, parameterise, or drive actions |
This framing is more useful than the usual “agent versus workflow” debate because it separates two questions that are often blurred:
- How flexible is the user interface?
- How controlled is the execution boundary?
A system can have a friendly chat interface and still refuse to run anything unsafe. Or it can be highly deterministic but force users into configuration files, DAGs, YAML, and other forms of professional suffering.
Three architectural families, three different failure modes
The paper reviews 20 systems across five groups, but the practical comparison reduces to three broad paradigms:
| Paradigm | Execution authority lives in… | Strength | Failure mode |
|---|---|---|---|
| Generative / tool-augmented agents | The LLM or agent loop | Very flexible interaction | Hallucinated APIs, silent variation, weak reproducibility |
| Workflow-centric systems | A workflow specification, registry, DAG, or DSL | Strong reproducibility and provenance | Configuration friction and limited conversational exploration |
| Schema-gated orchestration | A schema-validated tool/workflow layer | Tries to preserve both flexibility and control | Limited by registry coverage and schema maintenance |
This is the paper’s central business-relevant comparison.
Generative agents give users the feeling of momentum. The user asks, the agent acts, and work appears to move forward. The problem is that “work moved forward” is not the same as “a valid, auditable, repeatable process occurred.” If the LLM writes code, chooses defaults, calls tools opportunistically, or modifies a plan across runs, the final artifact may depend on conversational drift. The output may be useful for exploration, but it is weak as an organisational record.
Workflow-centric systems solve the opposite problem. Tools such as scientific workflow managers, data pipelines, and DAG-based systems make computation explicit. They know what ran, in what order, with what dependencies. That is excellent for reproducibility. It is less excellent when the researcher or analyst wants to ask, “try the same thing, but with this variable excluded,” without editing a workflow specification.
Schema-gated orchestration tries to split the difference. The LLM can interpret the user’s intent, ask clarifying questions, retrieve candidate workflows, and help fill parameters. But the executable artifact is not free-form model output. It is a validated invocation object: a workflow ID, version, parameter set, dependencies, and constraints that pass a formal gate before execution.
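To make the idea of an invocation object concrete, here is a minimal Python sketch. The field names, the `REGISTRY` contents, and the `gate` function are illustrative assumptions and are not taken from the paper's schema; the point is only that what reaches the runtime is a small structured object checked against a registry, not model-written code.

```python
from dataclasses import dataclass

# Hypothetical registry: (workflow_id, version) -> parameter names the workflow accepts.
REGISTRY = {
    ("train_surrogate", "1.2.0"): {"dataset_id", "target_columns"},
}

@dataclass(frozen=True)
class Invocation:
    workflow_id: str
    version: str
    parameters: dict
    depends_on: tuple = ()  # IDs of upstream invocations, if any

def gate(inv: Invocation) -> list:
    """Return rejection reasons; an empty list means the invocation may run."""
    allowed = REGISTRY.get((inv.workflow_id, inv.version))
    if allowed is None:
        return [f"unregistered workflow {inv.workflow_id}@{inv.version}"]
    unknown = sorted(set(inv.parameters) - allowed)
    return [f"unknown parameter: {name}" for name in unknown]

proposal = Invocation(
    "train_surrogate", "1.2.0",
    {"dataset_id": "ds-7", "target_columns": ["hardness"]},
)
print(gate(proposal))  # prints [] because the proposal is admissible
```

Whatever the model proposes, anything outside the registry or outside the declared parameter set is rejected before it can run.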
That distinction matters. Tool calling alone does not solve the problem.
Tool schemas are not enough when workflows have dependencies
A common misconception is that once an AI system uses JSON schemas or function calling, execution is already controlled. That is only partly true.
Tool-level schemas can validate individual calls. They can check that a function receives a string where it expects a string, a number where it expects a number, and an allowed value where the interface requires one. Good. Better than letting an agent freestyle Python in the production basement.
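A tool-level check of this kind can be sketched in a few lines of plain Python. The schema layout and the parameter names below are hypothetical; a real system would more likely use JSON Schema or a validation library, but the checks are the same in spirit: presence, primitive type, allowed values.

```python
# Hypothetical schema for one tool call:
# parameter name -> (expected type, allowed values or None for "any").
TRAIN_SCHEMA = {
    "dataset_id": (str, None),
    "n_estimators": (int, None),
    "model_type": (str, {"random_forest", "gaussian_process"}),
}

def validate_call(args: dict, schema: dict) -> list:
    """Check one call against one schema; return a list of error strings."""
    errors = []
    for name, (expected_type, allowed) in schema.items():
        if name not in args:
            errors.append(f"{name}: missing")
            continue
        value = args[name]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        wrong_type = not isinstance(value, expected_type) or (
            expected_type is int and isinstance(value, bool)
        )
        if wrong_type:
            errors.append(f"{name}: expected {expected_type.__name__}")
        elif allowed is not None and value not in allowed:
            errors.append(f"{name}: {value!r} not in allowed values")
    return errors

good = {"dataset_id": "ds-7", "n_estimators": 100, "model_type": "random_forest"}
bad = {"dataset_id": "ds-7", "n_estimators": "100", "model_type": "svm"}
print(validate_call(good, TRAIN_SCHEMA))
print(validate_call(bad, TRAIN_SCHEMA))
```

Note what this function cannot see: it validates one call in isolation and knows nothing about the step before or after it.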
But scientific and business workflows are rarely single calls. They are chains.
The paper gives a materials-discovery-style example: load a dataset, train a surrogate model, then run inverse design using that model. Each step can be locally valid while the overall workflow is wrong. The dataset may not contain the target columns requested by the model training step. The trained model may predict properties that do not match the inverse design objective. The steps may be individually well-typed but collectively unsound.
That is the key move from schema-gated tool execution to schema-gated orchestration.
| Validation level | What it can catch | What it may miss |
|---|---|---|
| Tool-level schema | Missing fields, wrong primitive types, invalid parameter values for one call | Cross-step mismatches, incompatible outputs and inputs, invalid dependency order |
| Workflow-level schema | Inter-step data-flow types, dependency ordering, parameter compatibility across a DAG | Scientific appropriateness, data quality, model validity, stochastic behaviour inside tools |
The second row is where the paper becomes more than another “use structured outputs” essay. The authors argue that the execution boundary should apply to the composed workflow, not just the individual tool call.
This is the difference between checking whether each Lego brick is real and checking whether the bridge assembled from those bricks can actually stand.
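The load, train, and inverse-design example can be turned into a rough cross-step check. The `plan` structure and the column names are invented for illustration and are not the paper's workflow schema; the sketch only shows that the checks compare one step's outputs with the next step's inputs.

```python
# Hypothetical three-step plan: load dataset -> train surrogate -> inverse design.
plan = {
    "dataset_columns": {"composition", "temperature", "hardness"},
    "train_targets": {"hardness"},
    "design_objectives": {"conductivity"},
}

def check_workflow(plan: dict) -> list:
    """Check data flow across steps; each step alone may still be well-typed."""
    errors = []
    # Step 1 -> step 2: the training targets must exist in the loaded dataset.
    missing_cols = plan["train_targets"] - plan["dataset_columns"]
    if missing_cols:
        errors.append(f"training targets not in dataset: {sorted(missing_cols)}")
    # Step 2 -> step 3: the model must predict every property the design step optimises.
    unpredicted = plan["design_objectives"] - plan["train_targets"]
    if unpredicted:
        errors.append(f"objectives the model does not predict: {sorted(unpredicted)}")
    return errors

print(check_workflow(plan))
```

Here every individual call would pass a tool-level schema, yet the plan is rejected because the surrogate model is trained on hardness while the design step asks for conductivity.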
The schema gate changes the unit of accountability
The paper’s most useful phrase is not “agentic AI.” It is execution authority.
Execution authority refers to the component that determines the concrete executable behaviour, that is, what actually runs and ends up in the record. In a code-generating agent, execution authority effectively sits with the model-generated script or agent loop. In a workflow system, it sits with the explicit workflow specification. In schema-gated orchestration, it sits with the schema-validated invocation layer.
That shift changes the unit of accountability.
Instead of asking, “What did the assistant say?” or “What code did the model happen to generate?” the organisation can ask:
- Which workflow was invoked?
- Which version?
- Which parameters were resolved?
- Which datasets, models, and tools were referenced?
- Which schema accepted the invocation?
- Which user approved it?
- Which artifacts were produced?
This is not glamorous. It is operationally valuable.
A validated invocation object becomes the shared object for execution, approval, provenance, and audit. The same object can be inspected by the user before execution, stored in logs after execution, and replayed later if the environment is controlled.
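One way to make an invocation usable as a shared record is to give it a stable fingerprint. The sketch below is an assumption about how that could be done, not a description of the paper's implementation: it serialises the invocation to canonical JSON and hashes it with SHA-256, so the approval screen, the log entry, and a later replay can all refer to the same identifier.

```python
import hashlib
import json

def fingerprint(invocation: dict) -> str:
    """Hash a canonical JSON form so that key order does not affect the result."""
    canonical = json.dumps(invocation, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two hypothetical records of the same invocation, written with different key order.
a = {
    "workflow_id": "churn_report",
    "version": "3.1.0",
    "parameters": {"start": "2024-01-01", "end": "2024-01-07"},
}
b = {
    "version": "3.1.0",
    "parameters": {"end": "2024-01-07", "start": "2024-01-01"},
    "workflow_id": "churn_report",
}

print(fingerprint(a) == fingerprint(b))  # prints True
```

Any change to the workflow version or a parameter value yields a different fingerprint, which is what distinguishes a replay from a variant.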
That is why the paper’s architecture separates two forms of authority:
| Authority type | What it does | Where it should live |
|---|---|---|
| Conversational authority | Interprets intent, proposes actions, asks questions, explains options | LLM and orchestration controller |
| Execution authority | Decides what can run and under what constraints | Schema-validated tool/workflow layer |
This separation is the paper’s core design principle. The LLM may be persuasive, creative, and helpful. It may not become the runtime’s monarch. Sensible system design, finally showing a little self-respect.
The evidence is architectural, not a deployed product benchmark
The paper has three evidence layers, and they should not be confused.
| Evidence / analysis | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Interviews with 18 experts across 10 stakeholders | Requirements elicitation | Industrial R&D users value both deterministic execution and conversational flexibility | That all industries or all users rank these needs the same way |
| Review of 20 systems on ED/CF axes | Architectural comparison | Existing systems cluster along a trade-off between flexibility and determinism | That the ordinal scores are precise quantitative performance metrics |
| Reference architecture and schema examples | Design proposal | A plausible way to separate conversational and execution authority | That a deployed schema-gated system will outperform alternatives in task completion, cost, or user satisfaction |
This distinction matters because the paper is strongest as an architectural argument. It is not a full product evaluation. It does not show a large deployed proof-of-concept beating alternatives across controlled tasks. The authors are explicit that empirical evaluation of a deployed system remains future work.
That is not a weakness if read correctly. The paper is doing architecture before benchmark theatre. In an area where many agent demos are basically “look, the robot clicked the button,” this restraint is refreshing.
The interview analysis gives the design problem. The system review maps the design space. The reference architecture proposes a resolution. The paper’s claim is not “we solved agentic AI.” The claim is closer to: if you want both conversational flexibility and reproducible execution, you need to stop letting the conversation itself define what runs.
The ED/CF map is a business decision map
The authors score 20 systems on execution determinism and conversational flexibility using a 1–5 ordinal rubric. The paper reports substantial-to-near-perfect agreement across multiple LLM-based scoring sessions, with Krippendorff’s alpha of 0.80 for ED and 0.98 for CF across the full scoring process.
The exact coordinates should not be overread. These are ordinal architectural placements, not measurements like latency or accuracy. But the map is useful because it turns a vague vendor question into a sharper diagnosis.
When evaluating an AI workflow platform, the buyer should not merely ask:
“Does it have agents?”
That question is nearly useless now. Everything has agents, copilots, assistants, copilots for assistants, and assistants for copilots. The better questions are:
- What is the final execution artifact?
- Can execution happen outside a validated schema?
- Are multi-step dependencies checked before execution?
- Is provenance captured automatically?
- Can users iterate conversationally without creating untracked variants?
The paper’s ED/CF framework helps compare systems by where they place execution authority. A flexible agent with weak gates may be acceptable for brainstorming, notebook prototyping, or low-stakes exploratory analysis. A workflow-centric system may be ideal for regulated pipelines but too rigid for early exploration. A schema-gated system is most attractive when the organisation needs both: chat-first interaction and governed execution.
For business-process AI, that is a familiar requirement. An operations manager may want to say, “prepare the monthly sales variance analysis, exclude discontinued SKUs, and compare against the last approved forecast.” The AI should understand that request conversationally. But it should not invent a new analysis pipeline each time. It should map the request to an approved workflow, validate parameters, check permissions, and produce a replayable record.
The user experience can feel like conversation. The execution layer should feel like accounting.
What schema-gated orchestration would look like in a company
For Cognaptus-style business automation, the paper’s reference architecture generalises beyond scientific workflows.
A schema-gated business process platform would have five layers:
| Layer | Business role | Example |
|---|---|---|
| Chat interface | Lets users express intent naturally | “Generate the weekly customer churn report and highlight abnormal segments.” |
| Orchestration controller | Converts intent into candidate actions and parameter requests | Finds the approved churn-analysis workflow and asks for missing date range |
| Validated registry | Stores approved tools and workflows with schemas | Report generator, customer segmentation tool, anomaly detector |
| Execution engine | Runs only validated invocations | Executes workflow after checking inputs, permissions, dependencies |
| Provenance layer | Records what ran, with versions and outputs | Logs workflow ID, schema version, dataset snapshot, user approval |
This design changes what “AI automation” means.
The naive automation pitch is: let the agent do the task.
The schema-gated version is: let the agent help the user reach a valid invocation of an approved task.
That sounds less magical because it is less magical. Also less likely to create an invisible compliance bonfire.
For repetitive business workflows, the value is obvious. Sales reporting, invoice reconciliation, customer support triage, procurement checks, HR document generation, and compliance monitoring all involve known procedures with variation at the edges. Users do not want to author workflows. They also do not want every run to become a unique improvisation.
A schema gate lets the variation occur in parameters, not in hidden execution logic.
The ROI is not just automation; it is cheaper diagnosis
The business value of schema-gated orchestration is often described as governance. That is true, but incomplete.
The deeper value is cheaper diagnosis.
When an agentic workflow produces a wrong result, debugging can be painful. Did the model misunderstand the user? Did it select the wrong tool? Did it use the wrong parameter? Did a dependency fail? Did it generate subtly different code? Did the dataset change? Did the output parser quietly accept nonsense? Wonderful questions. Terrible afternoon.
With schema-gated orchestration, failures become more localised:
| Failure point | What the system can expose |
|---|---|
| Missing parameter | Clarification request before execution |
| Invalid parameter | Schema validation error |
| Unsupported operation | Registry coverage gap |
| Incompatible workflow step | Dependency/type validation failure |
| Tool runtime failure | Execution log tied to a validated invocation |
| Bad scientific or business judgment | Human review of assumptions and outputs |
This does not eliminate errors. It makes them less mysterious.
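A sketch of what localised failure can look like in code: each pre-execution failure point from the table gets its own exception type, so the answer to “where did it go wrong?” is the type of the error and not an archaeology project. The class names, the registry entry, and the `month` parameter are all hypothetical.

```python
class RegistryGap(Exception):
    """The requested workflow is not in the registry."""

class ClarificationNeeded(Exception):
    """A required parameter is missing; ask the user before running."""

class SchemaViolation(Exception):
    """A parameter is present but not admissible."""

# Hypothetical registry entry for a monthly report.
REGISTRY = {
    "variance_report": {"required": {"month"}, "allowed_months": range(1, 13)},
}

def admit(workflow_id: str, params: dict) -> dict:
    spec = REGISTRY.get(workflow_id)
    if spec is None:
        raise RegistryGap(workflow_id)
    missing = spec["required"] - set(params)
    if missing:
        raise ClarificationNeeded(sorted(missing))
    if params["month"] not in spec["allowed_months"]:
        raise SchemaViolation(f"month={params['month']!r}")
    return {"workflow_id": workflow_id, "params": params}

def diagnose(workflow_id: str, params: dict) -> str:
    """Report which stage rejected the request, or 'admitted'."""
    try:
        admit(workflow_id, params)
        return "admitted"
    except (RegistryGap, ClarificationNeeded, SchemaViolation) as exc:
        return type(exc).__name__

print(diagnose("variance_report", {"month": 13}))  # prints SchemaViolation
```

Runtime failures and bad judgment still need logs and human review; the sketch covers only the failures that can be caught before anything executes.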
That matters because enterprise AI cost is not only model inference, software licensing, or integration work. A large hidden cost is human time spent investigating whether the system did something reasonable. When every run has an inspectable invocation object, review becomes more structured. Approval gates and audit trails are not separate features awkwardly glued onto the product. They become natural consequences of the execution object.
In plain business terms: schema gates reduce the cost of asking, “Why did this happen?”
Where the paper is careful, and where adopters must be more careful
The authors are appropriately cautious about the limits of schema-gated orchestration.
First, schema gates do not guarantee scientific correctness. A workflow can be structurally valid and still answer the wrong question. It can use a poor dataset, a weak model, an inappropriate assumption, or a misleading metric. Schemas validate shape, dependency, and admissibility. They do not replace domain judgment.
Second, boundary determinism is not full determinism. Even if the invocation is stable, tools inside the workflow may remain stochastic. Reproducibility still requires controlled seeds, pinned dependencies, containerised environments, stable datasets, and versioned models. The schema gate tells us what was supposed to run. The infrastructure must still make that run repeatable.
Third, registry coverage is the central cost. A schema-gated system can execute only what its registry represents. If the needed tool or workflow is absent, the system must refuse, ask for authoring, or move into a less governed mode. This shifts work from prompt engineering to registry management: schema design, validation, versioning, review, and retirement.
That shift is sensible, but not free.
For mature workflows, the overhead can be amortised. For frontier research or highly bespoke client work, registry gaps may appear exactly where users most want flexibility. The paper suggests mitigation paths: workflow templates, LLM-assisted schema drafting, tiered registries, and eventually federated ecosystems where providers publish schema-conformant tools. These are plausible, but they are not yet solved by wishing earnestly in an architecture diagram.
Finally, the interview sample has boundaries. Participants were recruited through Intellegens’ professional network, and several authors are affiliated with Intellegens. The paper discloses this. The sample likely overrepresents organisations already interested in ML-assisted workflows. That does not invalidate the findings, but it does mean the requirements should be tested across broader industries and less ML-mature organisations.
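The second caveat above, that boundary determinism is not full determinism, is easy to demonstrate. In this sketch `noisy_tool` is a made-up stand-in for any stochastic step, such as a model fit with random initialisation. The validated parameters are identical on every call; only pinning the seed makes the results identical too.

```python
import random

def noisy_tool(params: dict, seed=None) -> float:
    """Stand-in for a stochastic workflow step (hypothetical)."""
    rng = random.Random(seed)  # seed=None means the generator is seeded from system entropy
    return params["scale"] * rng.random()

invocation_params = {"scale": 10.0}  # the same validated parameters every time

# Same invocation, pinned seed: the two runs agree exactly.
first = noisy_tool(invocation_params, seed=42)
second = noisy_tool(invocation_params, seed=42)
print(first == second)  # prints True

# Same invocation, no seed: the runs will almost certainly differ.
print(noisy_tool(invocation_params), noisy_tool(invocation_params))
```

In practice the seed would itself be a schema parameter, recorded alongside dependency versions and dataset snapshots, so that the record captures everything needed to repeat the run.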
A practical adoption checklist
For companies building or buying agentic AI workflow systems, the paper translates into a simple checklist.
| Question | Good sign | Warning sign |
|---|---|---|
| What is the final execution artifact? | Versioned invocation, workflow spec, or schema-validated object | Free-form code or hidden agent plan |
| Can the model bypass validation? | No; schema is the sole execution path | Yes; tools are optional or loosely enforced |
| Are multi-step dependencies validated? | Workflow-level checks for data flow, types, ordering, parameters | Only individual tool calls are checked |
| How are missing values handled? | Clarification-before-execution | Defaults silently invented by the model |
| How is provenance captured? | Workflow ID, schema version, parameters, datasets, outputs, user identity | Chat logs and vibes |
| How are registry gaps handled? | Refusal, authoring path, or sandboxed mode with explicit downgrade | Agent improvises production execution |
| Who governs the registry? | Review, versioning, permissions, retirement policy | Anyone can add tools and hope for civilisation |
The checklist is deliberately unromantic. That is the point. Agentic AI becomes business infrastructure only when it can survive boring questions.
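The registry-gap row of the checklist can be sketched as a routing decision. The mode names, the registry contents, and the `allow_sandbox` flag are assumptions for illustration; the design point is that a missing workflow leads to an explicit refusal or an explicit downgrade, never to quiet improvisation in the governed path.

```python
from enum import Enum

class Mode(Enum):
    GOVERNED = "governed"  # registered workflow, full validation and provenance
    SANDBOX = "sandbox"    # explicit downgrade, kept out of the official record
    REFUSED = "refused"    # nothing runs; the user is pointed to an authoring path

# Hypothetical registry of approved workflow IDs.
REGISTRY = {"invoice_reconciliation", "churn_report"}

def route(workflow_id: str, allow_sandbox: bool = False) -> Mode:
    """Decide how a request is handled based on registry coverage."""
    if workflow_id in REGISTRY:
        return Mode.GOVERNED
    return Mode.SANDBOX if allow_sandbox else Mode.REFUSED

print(route("churn_report"))                        # prints Mode.GOVERNED
print(route("ad_hoc_scrape"))                       # prints Mode.REFUSED
print(route("ad_hoc_scrape", allow_sandbox=True))   # prints Mode.SANDBOX
```

Making the sandbox opt-in means the downgrade is a decision someone took, and can be asked about later.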
The strategic lesson: separate freedom from authority
The most useful business interpretation of this paper is not that every company should immediately build a grand schema-gated workflow platform.
The useful lesson is narrower and stronger:
Do not confuse conversational freedom with execution authority.
A user should be able to speak naturally. The model should be able to reason, propose, retrieve, compare, and clarify. But production execution should pass through a boundary that is explicit, validated, versioned, and auditable.
This principle applies beyond scientific workflows. It applies to financial analysis, procurement automation, customer operations, HR document workflows, regulatory reporting, and any process where “the AI did it” is not an acceptable audit trail.
The future of agentic AI in business is unlikely to be a fully unconstrained digital employee wandering across systems with a charming personality and root access. That version makes good demo videos and bad incident reports.
The more durable model is quieter: chat-first interfaces over validated workflow registries, with LLMs used to reduce interaction cost rather than define executable behaviour from scratch.
Talk freely. Execute strictly.
That is not a slogan. It is an architecture.
Cognaptus: Automate the Present, Incubate the Future.
[1] Joel Strickland et al., “Talk Freely, Execute Strictly: Schema-Gated Agentic AI for Flexible and Reproducible Scientific Workflows,” arXiv:2603.06394, 2026.