Tables Turned: Why LLM-Based Table Agents Are the Next Big Leap in Business AI

TL;DR for operators

Most business data does not live in pristine chatbot-friendly prose. It lives in spreadsheets, ledgers, CSV exports, relational databases, dashboards, compliance reports, and those heroic Excel files with merged cells, colour-coded warnings, unexplained abbreviations, and one column called misc.

The paper behind this article, Toward Real-World Table Agents, argues that LLM-based table agents should not be judged as smarter versions of Text-to-SQL alone.¹ Real-world table work requires an end-to-end workflow: reading table structure, cleaning noisy semantics, retrieving only the relevant parts, executing traceable reasoning steps, and adapting to domains such as finance, healthcare, public administration, and industrial operations.

The most operationally useful part of the paper is also the least flattering to the current agent narrative. In the authors’ Text-to-SQL experiments, simple Chain-of-Thought prompting often made open-source models worse. Several agent frameworks added complexity and token cost without reliable gains. MAC-SQL, for example, sharply degraded TableGPT2-7B on all three reported benchmarks. OpenSearch-SQL was the clear positive exception, especially on BIRD-dev, where TableGPT2-7B rose from 46.94 to 64.34 execution accuracy.

The business lesson is not “agents are fake”. That would be satisfyingly dramatic and mostly wrong. The lesson is narrower and more useful: agent scaffolding only helps when the underlying model can follow the workflow, preserve intermediate formats, and avoid cascading errors. Otherwise, the system becomes a very expensive way to misunderstand a table in several stages.

For enterprise buyers and builders, the evaluation question should shift from “Can it answer this spreadsheet question?” to “Can it survive our actual table environment?” That means testing preprocessing, retrieval quality, code execution safety, audit logs, privacy constraints, ambiguous user intent, and domain-specific terminology. A demo on clean benchmark tables is nice. So is a showroom kitchen. Neither tells you whether dinner service survives a Saturday night.

The uncomfortable result: more agent does not always mean more accuracy

A familiar enterprise fantasy runs like this: connect an LLM to company tables, add a few tools, ask questions in natural language, and let the agent perform analysis. The spreadsheet becomes conversational. The database becomes democratic. The analyst backlog vanishes into a tasteful cloud of automation.

The paper interrupts that fantasy with an empirical nuisance. The authors test several Text-to-SQL agent methods using three open-source models: TableGPT2-7B, Qwen2.5-Coder-7B-Instruct, and Qwen2.5-Coder-32B-Instruct. They evaluate on Spider-dev, Spider-test, and BIRD-dev using execution accuracy, meaning whether the generated SQL returns the correct result.
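To make "execution accuracy" concrete, here is a minimal sketch of the metric using Python's built-in `sqlite3`. The `orders` table, the query pairs, and the order-insensitive comparison are illustrative assumptions; the official Spider and BIRD evaluation scripts apply their own comparison rules.

```python
import sqlite3
from collections import Counter

def run(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def execution_accuracy(conn, pairs):
    """pairs: list of (predicted_sql, gold_sql). Returns a percentage."""
    hits = 0
    for predicted, gold in pairs:
        gold_rows = run(conn, gold)
        pred_rows = run(conn, predicted)
        # Correct only if the gold query runs and both return the same rows.
        if gold_rows is not None and pred_rows == gold_rows:
            hits += 1
    return 100.0 * hits / len(pairs)

# Hypothetical toy database, not taken from any benchmark.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount REAL);
    INSERT INTO orders VALUES ('north', 10), ('north', 5), ('south', 7);
""")

pairs = [
    # Identical SQL: correct.
    ("SELECT SUM(amount) FROM orders WHERE region = 'north'",
     "SELECT SUM(amount) FROM orders WHERE region = 'north'"),
    # Different SQL text, same result rows: still correct.
    ("SELECT region FROM orders GROUP BY region",
     "SELECT DISTINCT region FROM orders"),
    # Wrong filter: runs, but returns a different result.
    ("SELECT SUM(amount) FROM orders WHERE region = 'south'",
     "SELECT SUM(amount) FROM orders WHERE region = 'north'"),
    # Invalid column: the predicted query fails to execute.
    ("SELECT total FROM orders",
     "SELECT SUM(amount) FROM orders"),
]

score = execution_accuracy(conn, pairs)
print(score)
```

The metric rewards results, not SQL text. That is its strength, and also why it says nothing about how the query was reached.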

Here is the part worth pinning above the procurement spreadsheet:

Method	Model	Spider-dev	Spider-test	BIRD-dev	Practical reading
Baseline	TableGPT2-7B	73.69	71.73	46.94	Simple prompting is not helpless.
CoT	TableGPT2-7B	74.95	71.73	46.28	Chain-of-Thought barely helps here and slightly hurts on BIRD.
MAC-SQL	TableGPT2-7B	49.61	52.26	30.44	Multi-agent decomposition can collapse when the model cannot reliably manage the workflow.
MAGIC	TableGPT2-7B	77.08	72.38	49.80	Failure-guideline correction gives modest gains.
OpenSearch-SQL	TableGPT2-7B	76.89	78.16	64.34	Carefully aligned multi-stage design can materially help weaker models.
Baseline	Qwen2.5-Coder-32B-Instruct	75.82	75.59	59.91	Larger model is stronger on the harder BIRD-dev benchmark.
CHESS	Qwen2.5-Coder-32B-Instruct	80.85	82.81	64.34	Some structured workflows help when the model can follow them.
OpenSearch-SQL	Qwen2.5-Coder-32B-Instruct	79.50	82.07	63.62	Strong, but not universally superior to every alternative.

This is not a full market benchmark. It is a focused experiment inside one subfield: Text-to-SQL. Still, it lands on a very business-relevant point. Agent frameworks are not free intelligence. They are workflows. Workflows impose overhead. They require format discipline. They create intermediate failure points. They consume more tokens. They may ask a model to do exactly the thing it is weakest at: maintain state, follow constraints, and recover from its own earlier mistake without theatrics.
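The "intermediate failure points" argument can be sketched as simple arithmetic. The snippet below assumes each stage succeeds independently and that nothing downstream repairs an earlier error; the 0.90 success rate is a made-up illustration, not a figure from the paper.

```python
def chain_success(stage_success_rates):
    """Probability that every stage succeeds, assuming independent stages
    and no recovery from earlier mistakes (both simplifying assumptions)."""
    p = 1.0
    for rate in stage_success_rates:
        p *= rate
    return p

single_step = chain_success([0.90])      # one prompt, one chance to fail
five_stage = chain_success([0.90] * 5)   # five stages, five chances to fail

print(f"single step: {single_step:.2f}, five stages: {five_stage:.2f}")
```

Under these assumptions, a five-stage workflow built from stages that are each right nine times in ten finishes cleanly less than six times in ten. Real agent frameworks add correction steps precisely to beat this arithmetic, which only works if the model can execute the correction.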

The authors’ explanation is sensible. Many existing agent systems use techniques such as query decomposition, multi-step SQL generation, execution-based refinement, Chain-of-Thought, and voting. Those methods can help strong models. But with weaker open-source models, especially those deployed locally for cost, privacy, or governance reasons, the extra scaffolding often creates more opportunities for failure than for correction.

That matters because real enterprise deployment often prefers local or domain-specialised open-source models. Not because CIOs have suddenly become romantics of open weights, but because external APIs can be expensive, hard to govern, and awkward around sensitive financial, medical, customer, or operational records. In other words, the very environments where table agents are most useful are also the environments where the model may be weaker than the benchmark leaderboard darling.

The paper’s evidence therefore forces a better question: not “Should we use agents?” but “Which parts of the table workflow deserve agency, and which should remain boring, deterministic software?”

Boring software, as usual, is underrated.

The misconception: a table agent is not Text-to-SQL with a larger context window

The paper’s deeper contribution is conceptual. It reframes table intelligence around five capabilities that real systems need before they can be trusted with messy business tables:

Capability	What it means	Operational failure if ignored
Table structure understanding	Preserving headers, merged cells, hierarchies, layout, and table relationships	The agent reads the wrong row, loses hierarchy, or treats formatting as irrelevant when it is carrying meaning.
Table and query semantic understanding	Cleaning column names, resolving ambiguity, interpreting user intent	The agent maps “GM” to the wrong concept, mistakes a code for a value, or answers an underspecified query with fake confidence.
Table retrieval and compression	Selecting the relevant schema, rows, columns, or cells without destroying meaning	The model wastes context on irrelevant data or drops the one cell that matters. Elegant, fatal.
Executable reasoning with traceability	Producing verifiable SQL, Python, DSL steps, or logged intermediate operations	The final answer cannot be audited, debugged, or safely used in regulated workflows.
Cross-domain generalisation	Adapting to finance, healthcare, government, materials, or other specialised domains	The system performs well on generic demos and then fails on abbreviations, formulas, and domain conventions.

The misconception is easy to understand. Text-to-SQL is measurable, familiar, and commercially attractive. Ask a question, generate SQL, run it, return an answer. Nice and tidy. Unfortunately, most real table work is not tidy.

A financial report may require numerical reasoning over footnotes, subtotals, fiscal periods, and domain-specific metrics. A healthcare table may combine lab values, units, patient timelines, and clinical abbreviations. A public administration spreadsheet may contain semi-structured forms, inconsistent naming, and multiple tables embedded in one document. A business analyst’s workbook may include formatting conventions that were never written down because apparently civilisation is built on vibes and conditional formatting.

A larger context window does not solve this. The paper explicitly notes that even if models can technically ingest large contexts, cost and performance degradation remain problems. More importantly, serialising a table into text can destroy structural information. A table is not merely a long sentence wearing gridlines. Its meaning often depends on position, hierarchy, relation, aggregation, and visual convention.

This is why the paper’s capability framework is more useful than a catalogue of models. It tells operators where table-agent systems break.

Structure is the first bottleneck because tables are not prose

LLMs are comfortable with sequences. Tables are not naturally sequential. Turning a two-dimensional or higher-dimensional structure into a token stream is already an act of interpretation.

The paper reviews several table formats: text, image, graph, table-specific encodings, and other specialised representations. Text is currently the most common because it is easy to feed into LLMs. Markdown, JSON, HTML, LaTeX, attribute-value pairs, and natural-language serialisations all appear in the literature. Text is convenient. It is also lossy.
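A small sketch of what "lossy" can look like in practice. The sheet below is hypothetical: a merged header cell for each year spans two quarter columns. Serialising only the leaf header row to Markdown produces duplicate column names, while carrying the parent header into each name keeps the hierarchy at the price of more tokens.

```python
# Hypothetical sheet: "2023" and "2024" are merged cells over two quarters each.
header_top = ["Region", "2023", "2023", "2024", "2024"]
header_leaf = ["Region", "Q1", "Q2", "Q1", "Q2"]
rows = [["North", 10, 12, 11, 15]]

def to_markdown(header, rows):
    """Serialise a header and rows as a plain Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(v) for v in row) + " |")
    return "\n".join(lines)

def qualified_header(top, leaf):
    """Prefix each leaf name with its parent header when the two differ."""
    return [l if t == l else f"{t} / {l}" for t, l in zip(top, leaf)]

naive = to_markdown(header_leaf, rows)
qualified = to_markdown(qualified_header(header_top, header_leaf), rows)

# Duplicate names mean the model cannot tell 2023 Q1 from 2024 Q1.
ambiguous = len(set(header_leaf)) < len(header_leaf)

print(naive)
print(qualified)
```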

HTML and LaTeX can preserve more structure than plain Markdown, but they consume more tokens. Images preserve visual layout and can handle scanned or handwritten tables, but they make editing and programmatic operations harder. Graph formats look promising because they can represent relationships and may handle permutation invariance better, but they remain less explored in LLM table systems. Table-specific encoders can preserve structural properties, but compatibility with heterogeneous real-world tables remains limited.

The paper’s practical conclusion is not that one format wins. It is that no universal table representation currently exists. That is annoying, which is usually how you know the conclusion is close to reality.

For enterprise design, the implication is straightforward: table agents should not be built around a single ingestion assumption. A system that only understands clean CSV or Markdown tables is not a table agent. It is a polite intern for benchmark datasets.

A more serious architecture needs format routing:

Input condition	Likely representation need	Why it matters
Relational database	Schema-aware text or SQL representation	Query generation and execution can be grounded in table relationships.
Spreadsheet with merged cells and visual cues	Hybrid text plus layout or image-aware parsing	Formatting may carry semantic meaning.
Large sparse table	Retrieval and compression before generation	Full-table ingestion wastes context and increases error risk.
Multi-table database	Schema graph or relationship-aware representation	Foreign keys and joins must be preserved.
Domain-specific forms	Preprocessing plus metadata enrichment	Abbreviations and implicit conventions need expansion.

The business value here is not elegance. It is error reduction. If structure is misread at ingestion, every later reasoning step becomes theatre.
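The routing table above could be sketched as a small dispatch function. Everything here is an assumption made for illustration: the `TableInput` fields, the step names, and the cell-count threshold are hypothetical, and a real router would inspect the file rather than trust declared metadata.

```python
from dataclasses import dataclass

@dataclass
class TableInput:
    source: str                  # "database", "spreadsheet", or "form"
    n_tables: int = 1
    n_cells: int = 0
    has_merged_cells: bool = False

def route(table: TableInput, large_cell_threshold: int = 50_000) -> list:
    """Return the ordered preprocessing steps for one input (hypothetical names)."""
    steps = []
    if table.source == "form":
        steps.append("metadata_enrichment")       # expand implicit conventions
    if table.has_merged_cells:
        steps.append("layout_aware_parse")        # formatting may carry meaning
    if table.source == "database":
        # Multi-table schemas need joins and foreign keys preserved.
        steps.append("schema_graph" if table.n_tables > 1 else "schema_text")
    if table.n_cells > large_cell_threshold:
        steps.append("retrieve_and_compress")     # avoid full-table ingestion
    if not steps:
        steps.append("plain_text")                # small, clean, single table
    return steps

print(route(TableInput("database", n_tables=3)))
print(route(TableInput("spreadsheet", n_cells=200_000, has_merged_cells=True)))
```

The point of the sketch is the shape, not the rules: routing is ordinary deterministic code that runs before any model sees the table.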

Semantic cleaning is not janitorial work; it is model alignment

The paper treats table and query semantic understanding as a core capability, and this is one of its best operational instincts.

Real-world tables are noisy. Column names are abbreviated. Values are missing. Date formats shift halfway down the sheet because someone copied from another system. Units are implied. Codes are used without dictionaries. A field called status may mean payment status, employee status, clinical status, project status, or the emotional state of the person maintaining the spreadsheet. Usually the last one is “tired”.

Semantic preprocessing includes column-name cleaning, schema construction, column relationship detection, value normalisation, and query clarification. These sound like data-engineering chores. In LLM systems, they are also alignment steps. They translate the organisation’s messy data language into something the model can reason about without hallucinating glue.
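As a minimal sketch of column-name cleaning, the snippet below expands abbreviations from a governed glossary and flags anything it cannot resolve instead of guessing. The glossary entries are hypothetical, and this lookup is a deliberately simple stand-in, not the learned approach used by methods such as NameGuess.

```python
# Hypothetical governed glossary; in practice this comes from a data catalogue.
GLOSSARY = {
    "cust_id": "customer identifier",
    "inv_dt": "invoice date",
    "gm": "gross margin",   # assumes this organisation's definition of "GM"
}

def expand_columns(columns, glossary):
    """Map raw column names to governed definitions; collect the unknowns."""
    expanded, unresolved = {}, []
    for col in columns:
        key = col.strip().lower()
        if key in glossary:
            expanded[col] = glossary[key]
        else:
            unresolved.append(col)   # flag for a human, do not invent a meaning
    return expanded, unresolved

expanded, unresolved = expand_columns(["cust_id", "GM", "misc"], GLOSSARY)
print(expanded)
print(unresolved)
```

The unresolved list is the useful output. A column called misc should reach a person before it reaches a prompt.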

The paper highlights column-name expansion methods such as NameGuess and schema construction methods such as PoTable. It also discusses query ambiguity. In table contexts, ambiguity is not a fringe issue. The authors cite prior work where human annotators agreed only 62% of the time on whether a SQL query was ambiguous. That should terrify anyone planning to put a natural-language layer over enterprise data and call the result “self-service analytics”.

The right behaviour for a table agent is often not to answer. It is to clarify.

For example:

User asks	Hidden ambiguity	Better agent behaviour
“Show overdue accounts by region.”	Overdue by invoice date, due date, collection status, or accounting policy?	Ask which overdue definition to apply or present candidates.
“Compare sales this quarter.”	Compared with last quarter, same quarter last year, forecast, or target?	Ask for comparison baseline.
“Which hospitals are high risk?”	Financial risk, clinical risk, compliance risk, operational risk?	Request domain-specific risk definition.
“Summarise abnormal values.”	Abnormal by statistical outlier, business rule, medical range, or audit policy?	Use metadata or ask for threshold rules.

This is where many demos cheat. They reward the agent for sounding decisive. Production rewards the agent for knowing when the query is under-specified. A table agent that never asks questions is not efficient. It is just confidently skipping requirements gathering.
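The clarify-before-answering behaviour can be sketched as a gate that runs ahead of any query generation. The term list and candidate definitions below are hypothetical, and plain substring matching is a simplification; a production system would draw both from governed metadata.

```python
# Hypothetical business terms that need a definition before any query is written.
AMBIGUOUS_TERMS = {
    "overdue": ["past invoice date", "past due date", "flagged by collections"],
    "high risk": ["financial", "clinical", "compliance", "operational"],
}

def plan_response(question, resolved=None):
    """Return a clarification request if any ambiguous term is still undefined."""
    resolved = resolved or {}
    q = question.lower()
    pending = {
        term: options
        for term, options in AMBIGUOUS_TERMS.items()
        if term in q and term not in resolved
    }
    if pending:
        return {"action": "clarify", "ask": pending}
    return {"action": "answer", "definitions": resolved}

first = plan_response("Show overdue accounts by region.")
second = plan_response("Show overdue accounts by region.",
                       resolved={"overdue": "past due date"})
print(first["action"], second["action"])
```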

Retrieval and compression decide whether the model sees the right evidence

Large tables are usually sparse with respect to a given question. Most rows, columns, and cells are irrelevant to the immediate task. The obvious response is retrieval: select the parts that matter, feed only those into the model, and preserve enough context to reason correctly.

The paper reviews schema linking, table retrieval, row and column selection, graph-based schema pruning, and cell-level retrieval. The technical detail varies, but the operational pattern is the same: table agents need a relevance filter before they need eloquence.

This introduces a fragile dependency. If retrieval drops the right column, no amount of downstream reasoning can recover it. If compression rewrites the table in a semantically ambiguous way, the model may answer the compressed artefact rather than the original data.

The authors note that retrieval and compression quality is not yet evaluated systematically. Standard precision, recall, and F1 can describe retrieval overlap, but they do not fully capture whether the retained table subset is sufficient for downstream reasoning. For business users, this is not a minor benchmark nuance. It is the difference between a defensible answer and a misleading one.
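The gap between overlap and sufficiency is easy to show. In this sketch, `relevant` is a hypothetical annotated set of useful columns and `required` is the smaller set without which the question cannot be answered; both labels are assumptions introduced for illustration.

```python
def retrieval_report(retrieved, relevant, required):
    """Overlap metrics plus a sufficiency check on the must-have columns."""
    retrieved, relevant, required = set(retrieved), set(relevant), set(required)
    overlap = retrieved & relevant
    precision = len(overlap) / len(retrieved) if retrieved else 0.0
    recall = len(overlap) / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "sufficient": required <= retrieved,
        "missing": sorted(required - retrieved),
    }

report = retrieval_report(
    retrieved=["region", "amount", "status"],
    relevant=["region", "amount", "status", "due_date"],
    required=["amount", "due_date"],
)
print(report)
```

Here precision is perfect and F1 looks respectable, yet the one column needed to define "overdue" never reached the model.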

A production table-agent evaluation should therefore include retrieval-level tests, not only final-answer tests:

Test layer	What to inspect	Why final accuracy alone misses it
Schema retrieval	Did the agent select the necessary tables and columns?	A lucky final answer can hide unstable retrieval.
Value retrieval	Did it include required rows, cells, and units?	The SQL or Python step may be correct over the wrong subset.
Compression	Did summarisation preserve constraints, hierarchy, and definitions?	A compact representation can quietly delete business rules.
Downstream sensitivity	Does the answer change if retrieved context varies slightly?	High sensitivity indicates brittle grounding.

The business version is simple: do not buy a table agent unless you can inspect what it chose to look at. An answer without evidence selection is just autocomplete wearing a blazer.

Traceability is the difference between analysis and gossip

The paper’s fourth capability, executable reasoning with traceability, is where table agents become operationally serious.

A natural-language answer may be useful for a quick explanation. But table work often needs executable steps: SQL queries, Python transformations, spreadsheet operations, or domain-specific languages. Executable outputs can be run, tested, logged, inspected, and corrected. They turn a final answer into a process.
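A minimal sketch of what "logged and inspectable" could mean for SQL, using Python's built-in `sqlite3`. The `invoices` table and the log fields are hypothetical. The connection is switched to read-only with SQLite's `PRAGMA query_only`, so a generated write statement fails and leaves an error entry in the log.

```python
import hashlib
import json
import sqlite3
import time

def traced_query(conn, sql, log):
    """Execute SQL and record the attempt, row count, and a result hash."""
    entry = {"ts": time.time(), "sql": sql, "status": "error",
             "row_count": None, "result_sha256": None}
    log.append(entry)                       # log the attempt even if it fails
    rows = conn.execute(sql).fetchall()
    digest = hashlib.sha256(json.dumps(rows, default=str).encode()).hexdigest()
    entry.update(status="ok", row_count=len(rows), result_sha256=digest)
    return rows

# Hypothetical data for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (region TEXT, amount REAL);
    INSERT INTO invoices VALUES ('north', 10), ('north', 5), ('south', 7);
""")
conn.execute("PRAGMA query_only = ON")      # reject writes from here on

audit_log = []
rows = traced_query(
    conn,
    "SELECT region, SUM(amount) FROM invoices GROUP BY region ORDER BY region",
    audit_log,
)

try:
    traced_query(conn, "DELETE FROM invoices", audit_log)
    write_blocked = False
except sqlite3.Error:
    write_blocked = True

print(rows, write_blocked, [e["status"] for e in audit_log])
```

The hash lets a reviewer confirm later that a rerun of the same query on the same data produced the same result, without storing the rows themselves in the log.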

The paper discusses SQL, Python, and DSLs as common output languages. SQL is powerful for relational databases and benefits from execution feedback. Python is flexible for analysis, transformation, and visualisation. DSLs can constrain behaviour and improve safety, but may reduce flexibility and require maintenance.

This is the trade-off that matters:

Output style	Strength	Weakness
Natural language	Easy for users to read	Hard to verify; weak auditability
SQL	Executable and precise for databases	Limited for non-relational or messy spreadsheet tasks
Python	Flexible for analysis and transformation	Requires sandboxing and code validation
DSL	Safer and more constrained	Can limit capability and increase maintenance burden

For regulated or high-stakes domains, traceability is not a nice-to-have. It is the product. Finance teams need to know which rows were aggregated. Healthcare users need to know which fields were used. Public-sector workflows need reproducibility and accountability. Internal audit does not accept “the agent seemed confident” as a control procedure, at least not yet, though give the industry a few quarters.

The paper also notes the security tension. Sandboxed execution can reduce risk, but it adds infrastructure and may reduce efficiency. DSLs can prevent malicious code generation, but they may constrain useful capability. Table datasets often ignore security, which means benchmark success may not reflect deployment readiness.
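The DSL side of that trade-off can be sketched in a few lines. The plan format below is a hypothetical illustration, not one from the paper: the model may only emit a small dictionary naming a whitelisted aggregate, a column, and equality filters, so there is no path from model output to arbitrary code. The cost is visible too, since anything outside the whitelist simply cannot be expressed.

```python
# Whitelisted operations; anything else is rejected before execution.
ALLOWED_AGGREGATES = {"sum": sum, "count": len, "max": max, "min": min}
ALLOWED_KEYS = {"aggregate", "column", "where"}

def run_plan(plan, rows):
    """Execute a constrained plan over rows (a list of dicts)."""
    unknown = set(plan) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unsupported plan keys: {sorted(unknown)}")
    aggregate = plan.get("aggregate")
    if aggregate not in ALLOWED_AGGREGATES:
        raise ValueError(f"unsupported aggregate: {aggregate!r}")
    column = plan.get("column")
    if not rows or column not in rows[0]:
        raise ValueError(f"unknown column: {column!r}")
    where = plan.get("where", {})
    selected = [r[column] for r in rows
                if all(r.get(k) == v for k, v in where.items())]
    return ALLOWED_AGGREGATES[aggregate](selected)

# Hypothetical rows for illustration.
rows = [
    {"region": "north", "amount": 10},
    {"region": "north", "amount": 5},
    {"region": "south", "amount": 7},
]

total_north = run_plan(
    {"aggregate": "sum", "column": "amount", "where": {"region": "north"}}, rows)

try:
    run_plan({"aggregate": "__import__('os').system", "column": "amount"}, rows)
    rejected = False
except ValueError:
    rejected = True

print(total_north, rejected)
```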

The practical conclusion: do not evaluate table agents only by answer quality. Evaluate the reasoning artefact. Can the generated SQL be reviewed? Can Python execution be sandboxed? Are intermediate transformations logged? Can the system explain how table A became table A′? Can access control prevent a user from extracting restricted fields through a clever prompt?

If not, the system may still be a prototype. It is not yet an enterprise table agent.

Domain adaptation is where generic agents meet actual business language

The paper’s fifth capability, cross-domain generalisation, is not a generic “models should generalise” statement. It is about the stubborn specificity of table work.

Finance tables involve ratios, time periods, accounting rules, risk calculations, and numerical reasoning. Medical tables involve lab values, patient records, units, ranges, and clinical terminology. Public administration tables may involve semi-structured document generation and policy-specific templates. Petrochemical data may use compressed column names that make perfect sense to domain insiders and look like keyboard accidents to everyone else.

The authors review domain-specific datasets and methods in finance, medicine, materials, and petrochemicals. Their conclusion is blunt enough to be useful: no universal domain-adaptation method currently balances performance and cost across domains. RAG can help, especially with terminology and column-name understanding, but it requires well-constructed knowledge bases and may deliver limited gains. Fine-tuning can improve performance but costs more and reduces flexibility. Custom methods may work well for one data type and travel badly.

This is where enterprise planning often becomes too optimistic. A vendor demo on generic sales data does not prove readiness for bank regulatory reports, hospital EHR tables, insurance claims, or construction procurement sheets. Domain adaptation is not a setting. It is a programme of work.

For operators, the question is not “Does the agent support our industry?” That phrase is too vague to survive contact with a spreadsheet. Better questions are:

Deployment question	What it reveals
Can the agent expand our abbreviations and map them to governed definitions?	Whether semantic understanding is domain-aware.
Can it handle our units, periods, currencies, and rounding policies?	Whether numerical outputs are operationally valid.
Can we inject domain documentation without leaking sensitive data?	Whether RAG is practical under governance constraints.
Can we fine-tune or configure behaviour for recurring table types?	Whether adaptation cost is sustainable.
Can domain experts inspect and correct intermediate steps?	Whether the system can improve without becoming a black box.

The phrase “cross-domain generalisation” sounds grand. In practice, it means surviving the abbreviation habits of one department.

The current agent landscape is still incomplete

The paper compares existing LLM-based table agents from a workflow perspective, including systems such as SheetAgent, SheetCopilot, Data-Copilot, ReAcTable, DAAgent, DB-GPT, Data Formulator, TableGPT, TableGPT2, and EHRAgent.

The authors’ observations are useful because they cut across named systems rather than worshipping any one of them. Most agents rely primarily on text formats. Many do not adequately preprocess table data. Intent recognition is often borrowed from general agents rather than tailored to table tasks. Retrieval and compression remain underused, leaving agents poorly prepared for large tables. SQL and Python dominate output because they are practical. Safety measures tend to rely on private deployment or sandboxes, while traceability remains underdeveloped. Domain adaptation is still unsolved.

TableGPT2 appears as one of the more comprehensive systems in the survey, with table normalisation, schema retrieval, context aggregation, SQL/Python outputs, sandboxing, and RAG for domain adaptation. Even there, the paper says the workflow remains incomplete. That is not an insult. It is a useful calibration of where the field is.

Here is the compressed business reading:

Paper observation	Business interpretation
Text format dominates current agents	Many systems are still optimised for LLM convenience, not table fidelity.
Preprocessing is insufficient	“Bring your own clean data” remains a hidden requirement.
Retrieval and compression are underused	Large-table deployment is weaker than demo performance suggests.
SQL and Python are popular outputs	Executability matters, but so do sandboxing and review.
Safety mostly means private deployment or sandboxing	Governance is present but immature.
Domain adaptation remains difficult	Industry-specific rollout will require configuration, RAG, fine-tuning, or expert-in-the-loop processes.

This is where the paper is most valuable as a procurement lens. It gives buyers a way to look past the interface. A polished chat panel over a spreadsheet does not reveal whether the system has preprocessing, retrieval, traceability, security, and adaptation. Those are the unglamorous layers where enterprise value either appears or quietly dies.

What the experiments actually support, and what they do not

The Text-to-SQL experiment is the article’s best hook, but it should not be overextended.

What the paper directly shows:

Test	Likely purpose	What it supports	What it does not prove
Baseline prompting on Spider and BIRD	Establish a simple reference point for each model	Simple prompting can be competitive, especially against poorly matched agent workflows	That baseline prompting is enough for real enterprise table work
Zero-shot CoT	Test whether a common reasoning prompt helps	CoT often degrades performance on these open-source models and tasks	That CoT is useless for all table reasoning
MAC-SQL, MAG-SQL, CHESS, MAGIC, OpenSearch-SQL	Compare representative Text-to-SQL agent methods	Agent scaffolding has uneven effects; careful alignment can help	That one framework dominates all settings
Open-source model comparison	Reflect practical deployment constraints	Weaker local models may struggle with complex agent workflows	That closed-source models would behave the same way
Execution accuracy as sole metric	Measure whether generated SQL returns correct results	Useful for Text-to-SQL correctness	That the system is acceptable on security, retrieval quality, traceability, latency, cost, or user trust

The strongest inference is not that agent systems fail. It is that agent systems are highly dependent on model capability and workflow design. OpenSearch-SQL’s results matter because they show the positive case: an alignment agent can reduce error propagation in a multi-stage workflow. For TableGPT2-7B, OpenSearch-SQL improves BIRD-dev from 46.94 to 64.34. That is not cosmetic. It is the difference between “interesting” and “maybe worth engineering”.

But the failures matter just as much. MAC-SQL drops TableGPT2-7B from 73.69 to 49.61 on Spider-dev and from 46.94 to 30.44 on BIRD-dev. CoT reduces Qwen2.5-Coder-7B-Instruct from 76.89 to 68.47 on Spider-dev and from 53.32 to 45.31 on BIRD-dev. These are not rounding errors. They are reminders that adding reasoning words does not guarantee better reasoning.
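For readers who prefer the deltas spelled out, the snippet below recomputes them from the execution-accuracy figures quoted above. The numbers are the ones reported in this article; only the arithmetic is added.

```python
# (method, model, benchmark, baseline accuracy, accuracy with method)
reported = [
    ("MAC-SQL", "TableGPT2-7B", "Spider-dev", 73.69, 49.61),
    ("MAC-SQL", "TableGPT2-7B", "BIRD-dev", 46.94, 30.44),
    ("OpenSearch-SQL", "TableGPT2-7B", "BIRD-dev", 46.94, 64.34),
    ("CoT", "Qwen2.5-Coder-7B-Instruct", "Spider-dev", 76.89, 68.47),
    ("CoT", "Qwen2.5-Coder-7B-Instruct", "BIRD-dev", 53.32, 45.31),
]

deltas = {}
for method, model, benchmark, baseline, with_method in reported:
    delta = round(with_method - baseline, 2)   # change in accuracy points
    deltas[(method, model, benchmark)] = delta
    print(f"{method:15s} {model:26s} {benchmark:10s} {delta:+.2f}")
```

The same framework family spans a swing of more than forty points between its best and worst pairing on these rows, which is the case for testing under your own model constraints.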

For businesses, the correct reading is disciplined: test the full workflow under your own model constraints. If you plan to use local models, do not assume agent methods designed around stronger closed models will transfer. If you plan to use multi-step workflows, measure intermediate failures. If you plan to use CoT-like prompting, verify that it improves outputs rather than merely making mistakes more verbose.

Verbose mistakes are still mistakes. They just bill better.
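Measuring intermediate failures can start with something as plain as recording the first stage that breaks on each test case. The stage names, checks, and case records below are hypothetical; the idea is only that a pilot log should say where a wrong answer went wrong.

```python
from collections import Counter

# Ordered stage checks over one logged test case (hypothetical record format).
STAGES = [
    ("retrieval", lambda c: set(c["required_columns"]) <= set(c["retrieved_columns"])),
    ("execution", lambda c: c["sql_ran"]),
    ("answer", lambda c: c["predicted"] == c["expected"]),
]

def first_failure(stages, case):
    """Return the name of the first failing stage, or None if all pass."""
    for name, check in stages:
        if not check(case):
            return name
    return None

cases = [
    {"required_columns": ["amount", "due_date"], "retrieved_columns": ["amount", "region"],
     "sql_ran": True, "predicted": 12, "expected": 40},
    {"required_columns": ["amount"], "retrieved_columns": ["amount"],
     "sql_ran": False, "predicted": None, "expected": 40},
    {"required_columns": ["amount"], "retrieved_columns": ["amount"],
     "sql_ran": True, "predicted": 40, "expected": 40},
    {"required_columns": ["amount"], "retrieved_columns": ["amount"],
     "sql_ran": True, "predicted": 39, "expected": 40},
]

failures = Counter(first_failure(STAGES, c) or "pass" for c in cases)
print(failures)
```

A final accuracy of one in four would hide the fact that three different stages each own one of the failures.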

A practical evaluation checklist for table-agent pilots

The paper’s seven design principles can be translated into a pilot checklist for enterprise teams.

Evaluation area	Pilot test	Pass condition
Multiformat input	Feed CSV, spreadsheet, database table, scanned table, and hierarchical-header examples	The system preserves structure or routes to the right parser instead of flattening everything blindly.
Preprocessing	Include abbreviations, missing values, inconsistent dates, mixed units, and dirty column names	The system normalises or flags issues before reasoning.
Query understanding	Ask ambiguous business questions	The system clarifies rather than pretending the query is complete.
Retrieval/compression	Use large sparse tables with hidden relevant cells	The system exposes selected evidence and does not drop required fields.
Executable reasoning	Require SQL/Python/intermediate steps	Outputs are runnable, inspectable, and linked to final answers.
Security	Attempt unsafe code, restricted field access, and prompt injection through table content	The system validates code, enforces permissions, and runs in a sandbox where needed.
Domain adaptation	Use real terminology, policies, and recurring report formats	The system can use domain documentation or expert corrections without expensive reinvention.
Cost and latency	Compare baseline prompting, agent workflow, and retrieval-assisted designs	Accuracy gains justify extra tokens, time, and infrastructure.

Notice what is not on the list: “Does the answer sound good?” That is a demo question. Operational evaluation starts when the answer sounds good and you ask how it got there.

Where this paper is strongest, and where the boundary sits

This paper is strongest as a framework and deployment warning. Its five-capability model is a useful map of the field. Its workflow comparison shows that many current table agents still cover only fragments of the real problem. Its Text-to-SQL experiment adds evidence to a point practitioners already suspect: complex agent scaffolding can underperform when paired with weaker local models.

The boundary is equally important.

First, the quantitative experiment covers a specific subfield: Text-to-SQL. Table agents are broader than SQL generation. They include spreadsheet manipulation, table question answering, data visualisation, prediction, transformation, and report generation.

Second, the experiment uses selected open-source models and methods under resource constraints. It does not establish a universal ranking of all agent frameworks, all models, or all production systems.

Third, execution accuracy is useful but narrow. It does not measure privacy, auditability, retrieval quality, robustness to dirty spreadsheets, latency, user clarification, or cost per successful task.

Fourth, the survey’s design principles are well-motivated but not themselves validated as a complete production architecture. They are a roadmap, not a finished blueprint.

That said, the paper’s limitations do not weaken its core message. They sharpen it. The field does not yet have enough real-world table benchmarks, security evaluations, retrieval metrics, or domain-adaptation tests. Therefore, serious adopters should build their own evaluation harnesses rather than waiting for a leaderboard to bless their specific mess.

The business value is workflow reliability, not spreadsheet magic

The market temptation will be to sell table agents as spreadsheet magic: ask anything, get answers, replace analysts, liberate data, enjoy the usual confetti.

The more defensible business value is narrower and better:

Reduce manual table cleaning by automating schema alignment, value normalisation, and validation.
Improve analyst throughput by retrieving relevant table regions and generating executable draft queries or transformations.
Increase auditability by logging intermediate SQL, Python, or transformation steps.
Support governed self-service analytics by forcing clarification when business terms are ambiguous.
Enable domain-specific workflows where RAG, fine-tuning, or expert feedback can adapt the agent to recurring table formats.

That is still a large opportunity. It is just not magic. It is infrastructure.

The paper’s title points “toward real-world table agents”, and the wording matters. We are not there merely because a model can answer a question about a clean table. We get there when the system can ingest messy inputs, ask better questions, select the right evidence, execute traceable operations, survive domain terminology, and respect security boundaries.

For operators, that means the next table-agent pilot should not begin with the prettiest dashboard. It should begin with the ugliest spreadsheet in the organisation.

The ugly spreadsheet knows the truth.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jiaming Tian, Liyao Li, Wentao Ye, Haobo Wang, Lingxin Wang, Lihua Yu, Zujie Ren, Gang Chen, and Junbo Zhao, “Toward Real-World Table Agents: Capabilities, Workflows, and Design Principles for LLM-based Table Intelligence,” arXiv:2507.10281, 2025. https://arxiv.org/pdf/2507.10281
