TL;DR for operators
Most business data does not live in pristine chatbot-friendly prose. It lives in spreadsheets, ledgers, CSV exports, relational databases, dashboards, compliance reports, and those heroic Excel files with merged cells, colour-coded warnings, unexplained abbreviations, and one column called misc.
The paper behind this article, Toward Real-World Table Agents, argues that LLM-based table agents should not be judged as smarter versions of Text-to-SQL alone [1]. Real-world table work requires an end-to-end workflow: reading table structure, cleaning noisy semantics, retrieving only the relevant parts, executing traceable reasoning steps, and adapting to domains such as finance, healthcare, public administration, and industrial operations.
The most operationally useful part of the paper is also the least flattering to the current agent narrative. In the authors’ Text-to-SQL experiments, simple Chain-of-Thought prompting often made open-source models worse. Several agent frameworks added complexity and token cost without reliable gains. MAC-SQL, for example, sharply degraded TableGPT2-7B on all three reported benchmarks. OpenSearch-SQL was the clear positive exception, especially on BIRD-dev, where TableGPT2-7B rose from 46.94 to 64.34 execution accuracy.
The business lesson is not “agents are fake”. That would be satisfyingly dramatic and mostly wrong. The lesson is narrower and more useful: agent scaffolding only helps when the underlying model can follow the workflow, preserve intermediate formats, and avoid cascading errors. Otherwise, the system becomes a very expensive way to misunderstand a table in several stages.
For enterprise buyers and builders, the evaluation question should shift from “Can it answer this spreadsheet question?” to “Can it survive our actual table environment?” That means testing preprocessing, retrieval quality, code execution safety, audit logs, privacy constraints, ambiguous user intent, and domain-specific terminology. A demo on clean benchmark tables is nice. So is a showroom kitchen. Neither tells you whether dinner service survives a Saturday night.
The uncomfortable result: more agent does not always mean more accuracy
A familiar enterprise fantasy runs like this: connect an LLM to company tables, add a few tools, ask questions in natural language, and let the agent perform analysis. The spreadsheet becomes conversational. The database becomes democratic. The analyst backlog vanishes into a tasteful cloud of automation.
The paper interrupts that fantasy with an empirical nuisance. The authors test several Text-to-SQL agent methods using three open-source models: TableGPT2-7B, Qwen2.5-Coder-7B-Instruct, and Qwen2.5-Coder-32B-Instruct. They evaluate on Spider-dev, Spider-test, and BIRD-dev using execution accuracy, meaning whether the generated SQL returns the correct result.
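Execution accuracy, as defined above, can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module and an invented `invoices` table. The helper names and the order-insensitive row comparison are assumptions made for the sketch; the official Spider and BIRD evaluators apply their own comparison rules.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries return the same rows (order ignored: an assumption)."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to run counts as incorrect
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted, key=repr) == sorted(gold, key=repr)

def execution_accuracy(conn, pairs):
    """Percentage of (predicted, gold) pairs whose results match."""
    hits = sum(execution_match(conn, p, g) for p, g in pairs)
    return 100.0 * hits / len(pairs)

# Invented toy data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 40.0)],
)

pairs = [
    # Same result as the gold query: counted as correct.
    ("SELECT SUM(amount) FROM invoices WHERE region = 'north'",
     "SELECT SUM(amount) FROM invoices WHERE region = 'north'"),
    # Runs, but returns a different result: incorrect.
    ("SELECT COUNT(*) FROM invoices",
     "SELECT SUM(amount) FROM invoices"),
    # References a table that does not exist: incorrect.
    ("SELECT amount FROM invoice",
     "SELECT amount FROM invoices"),
]
score = execution_accuracy(conn, pairs)
print(f"execution accuracy: {score:.2f}")
```

Note what the metric ignores: a query can score as correct here while being unreadable, unsafe, or right for the wrong reason.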
Here is the part worth pinning above the procurement spreadsheet:
| Method | Model | Spider-dev | Spider-test | BIRD-dev | Practical reading |
|---|---|---|---|---|---|
| Baseline | TableGPT2-7B | 73.69 | 71.73 | 46.94 | Simple prompting is not helpless. |
| CoT | TableGPT2-7B | 74.95 | 71.73 | 46.28 | Chain-of-Thought barely helps here and slightly hurts on BIRD. |
| MAC-SQL | TableGPT2-7B | 49.61 | 52.26 | 30.44 | Multi-agent decomposition can collapse when the model cannot reliably manage the workflow. |
| MAGIC | TableGPT2-7B | 77.08 | 72.38 | 49.80 | Failure-guideline correction gives modest gains. |
| OpenSearch-SQL | TableGPT2-7B | 76.89 | 78.16 | 64.34 | Carefully aligned multi-stage design can materially help weaker models. |
| Baseline | Qwen2.5-Coder-32B-Instruct | 75.82 | 75.59 | 59.91 | Larger model is stronger on the harder BIRD-dev benchmark. |
| CHESS (ISC) | Qwen2.5-Coder-32B-Instruct | 80.85 | 82.81 | 64.34 | Some structured workflows help when the model can follow them. |
| OpenSearch-SQL | Qwen2.5-Coder-32B-Instruct | 79.50 | 82.07 | 63.62 | Strong, but not universally superior to every alternative. |
This is not a full market benchmark. It is a focused experiment inside one subfield: Text-to-SQL. Still, it lands on a very business-relevant point. Agent frameworks are not free intelligence. They are workflows. Workflows impose overhead. They require format discipline. They create intermediate failure points. They consume more tokens. They may ask a model to do exactly the thing it is weakest at: maintain state, follow constraints, and recover from its own earlier mistake without theatrics.
The authors’ explanation is sensible. Many existing agent systems use techniques such as query decomposition, multi-step SQL generation, execution-based refinement, Chain-of-Thought, and voting. Those methods can help strong models. But with weaker open-source models, especially those deployed locally for cost, privacy, or governance reasons, the extra scaffolding often creates more opportunities for failure than for correction.
That matters because real enterprise deployment often prefers local or domain-specialised open-source models. Not because CIOs have suddenly become romantics of open weights, but because external APIs can be expensive, hard to govern, and awkward around sensitive financial, medical, customer, or operational records. In other words, the very environments where table agents are most useful are also the environments where the model may be weaker than the benchmark leaderboard darling.
The paper’s evidence therefore forces a better question: not “Should we use agents?” but “Which parts of the table workflow deserve agency, and which should remain boring, deterministic software?”
Boring software, as usual, is underrated.
The misconception: a table agent is not Text-to-SQL with a larger context window
The paper’s deeper contribution is conceptual. It reframes table intelligence around five capabilities that real systems need before they can be trusted with messy business tables:
| Capability | What it means | Operational failure if ignored |
|---|---|---|
| Table structure understanding | Preserving headers, merged cells, hierarchies, layout, and table relationships | The agent reads the wrong row, loses hierarchy, or treats formatting as irrelevant when it is carrying meaning. |
| Table and query semantic understanding | Cleaning column names, resolving ambiguity, interpreting user intent | The agent maps “GM” to the wrong concept, mistakes a code for a value, or answers an underspecified query with fake confidence. |
| Table retrieval and compression | Selecting the relevant schema, rows, columns, or cells without destroying meaning | The model wastes context on irrelevant data or drops the one cell that matters. Elegant, fatal. |
| Executable reasoning with traceability | Producing verifiable SQL, Python, DSL steps, or logged intermediate operations | The final answer cannot be audited, debugged, or safely used in regulated workflows. |
| Cross-domain generalisation | Adapting to finance, healthcare, government, materials, or other specialised domains | The system performs well on generic demos and then fails on abbreviations, formulas, and domain conventions. |
The misconception is easy to understand. Text-to-SQL is measurable, familiar, and commercially attractive. Ask a question, generate SQL, run it, return an answer. Nice and tidy. Unfortunately, most real table work is not tidy.
A financial report may require numerical reasoning over footnotes, subtotals, fiscal periods, and domain-specific metrics. A healthcare table may combine lab values, units, patient timelines, and clinical abbreviations. A public administration spreadsheet may contain semi-structured forms, inconsistent naming, and multiple tables embedded in one document. A business analyst’s workbook may include formatting conventions that were never written down because apparently civilisation is built on vibes and conditional formatting.
A larger context window does not solve this. The paper explicitly notes that even if models can technically ingest large contexts, cost and performance degradation remain problems. More importantly, serialising a table into text can destroy structural information. A table is not merely a long sentence wearing gridlines. Its meaning often depends on position, hierarchy, relation, aggregation, and visual convention.
This is why the paper’s capability framework is more useful than a catalogue of models. It tells operators where table-agent systems break.
Structure is the first bottleneck because tables are not prose
LLMs are comfortable with sequences. Tables are not naturally sequential. Turning a two-dimensional or higher-dimensional structure into a token stream is already an act of interpretation.
The paper reviews several table formats: text, image, graph, table-specific encodings, and other specialised representations. Text is currently the most common because it is easy to feed into LLMs. Markdown, JSON, HTML, LaTeX, attribute-value pairs, and natural-language serialisations all appear in the literature. Text is convenient. It is also lossy.
HTML and LaTeX can preserve more structure than plain Markdown, but they consume more tokens. Images preserve visual layout and can handle scanned or handwritten tables, but they make editing and programmatic operations harder. Graph formats look promising because they can represent relationships and may handle permutation invariance better, but they remain less explored in LLM table systems. Table-specific encoders can preserve structural properties, but compatibility with heterogeneous real-world tables remains limited.
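The token trade-off between Markdown and HTML can be made concrete with a rough sketch. The table contents and the two serialiser functions here are invented for illustration, and character count is only a crude stand-in for token count, which depends on the tokenizer in use.

```python
# Invented example table.
header = ["region", "q1_sales", "q2_sales"]
rows = [["north", 120, 135], ["south", 80, 95]]

def to_markdown(header, rows):
    """Serialise as a pipe table: compact, but cannot express merged cells."""
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join(lines)

def to_html(header, rows):
    """Serialise as HTML: verbose, but attributes such as colspan could carry structure."""
    head = "".join(f"<th>{h}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{v}</td>" for v in row) + "</tr>" for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"

md = to_markdown(header, rows)
html = to_html(header, rows)
print(len(md), len(html))  # character counts as a rough proxy for token cost
```

Even on a two-row table the HTML form is markedly longer. Whether that overhead is worth paying depends on whether the table has structure that Markdown would silently flatten.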
The paper’s practical conclusion is not that one format wins. It is that no universal table representation currently exists. That is annoying, which is usually how you know the conclusion is close to reality.
For enterprise design, the implication is straightforward: table agents should not be built around a single ingestion assumption. A system that only understands clean CSV or Markdown tables is not a table agent. It is a polite intern for benchmark datasets.
A more serious architecture needs format routing:
| Input condition | Likely representation need | Why it matters |
|---|---|---|
| Relational database | Schema-aware text or SQL representation | Query generation and execution can be grounded in table relationships. |
| Spreadsheet with merged cells and visual cues | Hybrid text plus layout or image-aware parsing | Formatting may carry semantic meaning. |
| Large sparse table | Retrieval and compression before generation | Full-table ingestion wastes context and increases error risk. |
| Multi-table database | Schema graph or relationship-aware representation | Foreign keys and joins must be preserved. |
| Domain-specific forms | Preprocessing plus metadata enrichment | Abbreviations and implicit conventions need expansion. |
The business value here is not elegance. It is error reduction. If structure is misread at ingestion, every later reasoning step becomes theatre.
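The routing table above can be read as a small dispatch function. The sketch below is one possible shape, not the paper's design: the `TableInput` fields, the step names, and the size threshold are all hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class TableInput:
    source: str                    # "database", "spreadsheet", or "form" (hypothetical labels)
    has_merged_cells: bool = False
    n_tables: int = 1
    n_cells: int = 0

LARGE_TABLE_CELLS = 50_000         # hypothetical cut-off for "too big to ingest whole"

def route(table: TableInput) -> list[str]:
    """Return the ordered ingestion steps suggested by the table's properties."""
    steps = []
    if table.source == "form":
        steps.append("preprocess_and_enrich_metadata")
    if table.source == "spreadsheet" and table.has_merged_cells:
        steps.append("layout_aware_parse")
    if table.source == "database":
        steps.append("schema_graph" if table.n_tables > 1 else "schema_text")
    if table.n_cells > LARGE_TABLE_CELLS:
        steps.append("retrieve_and_compress")
    if not steps:
        steps.append("plain_text_serialisation")
    return steps

print(route(TableInput("spreadsheet", has_merged_cells=True, n_cells=100_000)))
```

The point of writing it down is that the routing decision becomes inspectable and testable, rather than an implicit assumption buried in a prompt.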
Semantic cleaning is not janitorial work; it is model alignment
The paper treats table and query semantic understanding as a core capability, and this is one of its best operational instincts.
Real-world tables are noisy. Column names are abbreviated. Values are missing. Date formats shift halfway down the sheet because someone copied from another system. Units are implied. Codes are used without dictionaries. A field called status may mean payment status, employee status, clinical status, project status, or the emotional state of the person maintaining the spreadsheet. Usually the last one is “tired”.
Semantic preprocessing includes column-name cleaning, schema construction, column relationship detection, value normalisation, and query clarification. These sound like data-engineering chores. In LLM systems, they are also alignment steps. They translate the organisation’s messy data language into something the model can reason about without hallucinating glue.
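Column-name cleaning is the easiest of these steps to illustrate. The sketch below uses a plain glossary lookup, which is a deliberately simple stand-in and not how learned methods such as NameGuess work. The glossary entries and column names are invented.

```python
import re

# Hypothetical governed glossary; in practice this would come from a data dictionary.
GLOSSARY = {"cust": "customer", "inv": "invoice", "amt": "amount", "dt": "date"}
# Tokens that are already full words and need no expansion (also hypothetical).
KNOWN_WORDS = {"id", "status", "region", "total"}

def expand_column(name):
    """Return (expanded_name, unresolved_tokens) for one raw column name."""
    tokens = [t for t in re.split(r"[_\W]+", name.lower()) if t]
    expanded, unresolved = [], []
    for token in tokens:
        if token in GLOSSARY:
            expanded.append(GLOSSARY[token])
        else:
            expanded.append(token)
            if token not in KNOWN_WORDS:
                unresolved.append(token)  # flag for a human; do not guess
    return "_".join(expanded), unresolved

report = {col: expand_column(col) for col in ["cust_id", "inv_amt", "misc", "Pmt-Dt"]}
for raw, (clean, unresolved) in report.items():
    print(raw, "->", clean, "| needs review:", unresolved)
```

The design choice that matters is the second return value: an abbreviation the glossary cannot resolve is surfaced for review instead of being handed to the model to improvise.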
The paper highlights column-name expansion methods such as NameGuess and schema construction methods such as PoTable. It also discusses query ambiguity. In table contexts, ambiguity is not a fringe issue. The authors cite prior work where human annotators agreed only 62% of the time on whether a SQL query was ambiguous. That should terrify anyone planning to put a natural-language layer over enterprise data and call the result “self-service analytics”.
The right behaviour for a table agent is often not to answer. It is to clarify.
For example:
| User asks | Hidden ambiguity | Better agent behaviour |
|---|---|---|
| “Show overdue accounts by region.” | Overdue by invoice date, due date, collection status, or accounting policy? | Ask which overdue definition to apply or present candidates. |
| “Compare sales this quarter.” | Compared with last quarter, same quarter last year, forecast, or target? | Ask for comparison baseline. |
| “Which hospitals are high risk?” | Financial risk, clinical risk, compliance risk, operational risk? | Request domain-specific risk definition. |
| “Summarise abnormal values.” | Abnormal by statistical outlier, business rule, medical range, or audit policy? | Use metadata or ask for threshold rules. |
This is where many demos cheat. They reward the agent for sounding decisive. Production rewards the agent for knowing when the query is under-specified. A table agent that never asks questions is not efficient. It is just confidently skipping requirements gathering.
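A clarify-before-answer gate can be sketched as a simple check that runs before any query is generated. The registry of ambiguous terms and the `plan_response` function are hypothetical; a real system would more likely derive candidates from governed metadata than from a hard-coded dictionary.

```python
# Hypothetical registry of business terms with more than one governed definition.
AMBIGUOUS_TERMS = {
    "overdue": ["by invoice date", "by due date", "by collection status"],
    "high risk": ["financial risk", "clinical risk", "compliance risk", "operational risk"],
}

def plan_response(question, resolved=None):
    """Ask for clarification if the question uses an unresolved ambiguous term."""
    resolved = resolved or {}
    text = question.lower()
    for term, options in AMBIGUOUS_TERMS.items():
        if term in text and term not in resolved:
            return {"action": "clarify", "term": term, "options": options}
    return {"action": "answer", "definitions": resolved}

first = plan_response("Show overdue accounts by region.")
second = plan_response("Show overdue accounts by region.", {"overdue": "by due date"})
print(first["action"], "->", second["action"])
```

The chosen definition travels with the answer, so a later reader can see which meaning of "overdue" the numbers rest on.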
Retrieval and compression decide whether the model sees the right evidence
Large tables are usually sparse with respect to a given question. Most rows, columns, and cells are irrelevant to the immediate task. The obvious response is retrieval: select the parts that matter, feed only those into the model, and preserve enough context to reason correctly.
The paper reviews schema linking, table retrieval, row and column selection, graph-based schema pruning, and cell-level retrieval. The technical detail varies, but the operational pattern is the same: table agents need a relevance filter before they need eloquence.
This introduces a fragile dependency. If retrieval drops the right column, no amount of downstream reasoning can recover it. If compression rewrites the table in a semantically ambiguous way, the model may answer the compressed artefact rather than the original data.
The authors note that retrieval and compression quality is not yet evaluated systematically. Standard precision, recall, and F1 can describe retrieval overlap, but they do not fully capture whether the retained table subset is sufficient for downstream reasoning. For business users, this is not a minor benchmark nuance. It is the difference between a defensible answer and a misleading one.
A production table-agent evaluation should therefore include retrieval-level tests, not only final-answer tests:
| Test layer | What to inspect | Why final accuracy alone misses it |
|---|---|---|
| Schema retrieval | Did the agent select the necessary tables and columns? | A lucky final answer can hide unstable retrieval. |
| Value retrieval | Did it include required rows, cells, and units? | The SQL or Python step may be correct over the wrong subset. |
| Compression | Did summarisation preserve constraints, hierarchy, and definitions? | A compact representation can quietly delete business rules. |
| Downstream sensitivity | Does the answer change if retrieved context varies slightly? | High sensitivity indicates brittle grounding. |
The business version is simple: do not buy a table agent unless you can inspect what it chose to look at. An answer without evidence selection is just autocomplete wearing a blazer.
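A retrieval-level test of the kind described above might look like the following sketch. The column identifiers and the `retrieval_report` helper are invented; the separate `sufficient` flag reflects the point that overlap metrics alone do not say whether the answer is still computable.

```python
def retrieval_report(selected, required):
    """Compare the columns an agent selected with those the question needs."""
    selected, required = set(selected), set(required)
    hits = selected & required
    return {
        "precision": len(hits) / len(selected) if selected else 0.0,
        "recall": len(hits) / len(required) if required else 1.0,
        "missing": sorted(required - selected),
        # Stricter than overlap: every required column must be present.
        "sufficient": required <= selected,
    }

# Hypothetical case: "overdue invoices by region" needs the due date.
selected = ["invoices.region", "invoices.amount", "customers.name"]
required = ["invoices.region", "invoices.amount", "invoices.due_date"]
report = retrieval_report(selected, required)
print(report)
```

In this example two thirds of the required columns were retrieved, which sounds tolerable as a score and is useless in practice: without the due date, the question cannot be answered correctly.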
Traceability is the difference between analysis and gossip
The paper’s fourth capability, executable reasoning with traceability, is where table agents become operationally serious.
A natural-language answer may be useful for a quick explanation. But table work often needs executable steps: SQL queries, Python transformations, spreadsheet operations, or domain-specific languages. Executable outputs can be run, tested, logged, inspected, and corrected. They turn a final answer into a process.
The paper discusses SQL, Python, and DSLs as common output languages. SQL is powerful for relational databases and benefits from execution feedback. Python is flexible for analysis, transformation, and visualisation. DSLs can constrain behaviour and improve safety, but may reduce flexibility and require maintenance.
This is the trade-off that matters:
| Output style | Strength | Weakness |
|---|---|---|
| Natural language | Easy for users to read | Hard to verify; weak auditability |
| SQL | Executable and precise for databases | Limited for non-relational or messy spreadsheet tasks |
| Python | Flexible for analysis and transformation | Requires sandboxing and code validation |
| DSL | Safer and more constrained | Can limit capability and increase maintenance burden |
For regulated or high-stakes domains, traceability is not a nice-to-have. It is the product. Finance teams need to know which rows were aggregated. Healthcare users need to know which fields were used. Public-sector workflows need reproducibility and accountability. Internal audit does not accept “the agent seemed confident” as a control procedure, at least not yet, though give the industry a few quarters.
The paper also notes the security tension. Sandboxed execution can reduce risk, but it adds infrastructure and may reduce efficiency. DSLs can prevent malicious code generation, but they may constrain useful capability. Table datasets often ignore security, which means benchmark success may not reflect deployment readiness.
The practical conclusion: do not evaluate table agents only by answer quality. Evaluate the reasoning artefact. Can the generated SQL be reviewed? Can Python execution be sandboxed? Are intermediate transformations logged? Can the system explain how table A became table A′? Can access control prevent a user from extracting restricted fields through a clever prompt?
If not, the system may still be a prototype. It is not yet an enterprise table agent.
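Logging intermediate operations does not require heavy machinery. The sketch below wraps SQL execution so that every step leaves a record with the statement, its parameters, the row count, and a hash of the result. The `ledger` table, the `run_logged` helper, and the log format are assumptions for illustration, not a scheme taken from the paper.

```python
import hashlib
import json
import sqlite3

def run_logged(conn, sql, log, params=()):
    """Execute a query and append an audit record describing what was run."""
    rows = conn.execute(sql, params).fetchall()
    digest = hashlib.sha256(json.dumps(rows, default=str).encode("utf-8")).hexdigest()
    log.append({
        "step": len(log) + 1,
        "sql": sql,
        "params": list(params),
        "row_count": len(rows),
        "result_sha256": digest,  # lets a reviewer confirm a replay matches
    })
    return rows

# Invented toy ledger.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (account TEXT, period TEXT, amount REAL)")
conn.executemany("INSERT INTO ledger VALUES (?, ?, ?)", [
    ("4000", "2024-Q1", 100.0),
    ("4000", "2024-Q2", 150.0),
    ("5000", "2024-Q1", -40.0),
])

audit_log = []
filtered = run_logged(
    conn,
    "SELECT account, amount FROM ledger WHERE period = ? ORDER BY account",
    audit_log,
    ("2024-Q1",),
)
total = run_logged(
    conn, "SELECT SUM(amount) FROM ledger WHERE period = ?", audit_log, ("2024-Q1",)
)
print(json.dumps(audit_log, indent=2))
```

With a log like this, "which rows were aggregated" becomes a question someone can answer by replaying step 1, not by asking the model to remember.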
Domain adaptation is where generic agents meet actual business language
The paper’s fifth capability, cross-domain generalisation, is not a generic “models should generalise” statement. It is about the stubborn specificity of table work.
Finance tables involve ratios, time periods, accounting rules, risk calculations, and numerical reasoning. Medical tables involve lab values, patient records, units, ranges, and clinical terminology. Public administration tables may involve semi-structured document generation and policy-specific templates. Petrochemical data may use compressed column names that make perfect sense to domain insiders and look like keyboard accidents to everyone else.
The authors review domain-specific datasets and methods in finance, medicine, materials, and petrochemicals. Their conclusion is blunt enough to be useful: no universal domain-adaptation method currently balances performance and cost across domains. RAG can help, especially with terminology and column-name understanding, but it requires well-constructed knowledge bases and may deliver limited gains. Fine-tuning can improve performance but costs more and reduces flexibility. Custom methods may work well for one data type and travel badly.
This is where enterprise planning often becomes too optimistic. A vendor demo on generic sales data does not prove readiness for bank regulatory reports, hospital EHR tables, insurance claims, or construction procurement sheets. Domain adaptation is not a setting. It is a programme of work.
For operators, the question is not “Does the agent support our industry?” That phrase is too vague to survive contact with a spreadsheet. Better questions are:
| Deployment question | What it reveals |
|---|---|
| Can the agent expand our abbreviations and map them to governed definitions? | Whether semantic understanding is domain-aware. |
| Can it handle our units, periods, currencies, and rounding policies? | Whether numerical outputs are operationally valid. |
| Can we inject domain documentation without leaking sensitive data? | Whether RAG is practical under governance constraints. |
| Can we fine-tune or configure behaviour for recurring table types? | Whether adaptation cost is sustainable. |
| Can domain experts inspect and correct intermediate steps? | Whether the system can improve without becoming a black box. |
The phrase “cross-domain generalisation” sounds grand. In practice, it means surviving the abbreviation habits of one department.
The current agent landscape is still incomplete
The paper compares existing LLM-based table agents from a workflow perspective, including systems such as SheetAgent, SheetCopilot, Data-Copilot, ReAcTable, DAAgent, DB-GPT, Data Formulator, TableGPT, TableGPT2, and EHRAgent.
The authors’ observations are useful because they cut across named systems rather than worshipping any one of them. Most agents rely primarily on text formats. Many do not adequately preprocess table data. Intent recognition is often borrowed from general agents rather than tailored to table tasks. Retrieval and compression remain underused, leaving agents poorly prepared for large tables. SQL and Python dominate output because they are practical. Safety measures tend to rely on private deployment or sandboxes, while traceability remains underdeveloped. Domain adaptation is still unsolved.
TableGPT2 appears as one of the more comprehensive systems in the survey, with table normalisation, schema retrieval, context aggregation, SQL/Python outputs, sandboxing, and RAG for domain adaptation. Even there, the paper says the workflow remains incomplete. That is not an insult. It is a useful calibration of where the field is.
Here is the compressed business reading:
| Paper observation | Business interpretation |
|---|---|
| Text format dominates current agents | Many systems are still optimised for LLM convenience, not table fidelity. |
| Preprocessing is insufficient | “Bring your own clean data” remains a hidden requirement. |
| Retrieval and compression are underused | Large-table deployment is weaker than demo performance suggests. |
| SQL and Python are popular outputs | Executability matters, but so do sandboxing and review. |
| Safety mostly means private deployment or sandboxing | Governance is present but immature. |
| Domain adaptation remains difficult | Industry-specific rollout will require configuration, RAG, fine-tuning, or expert-in-the-loop processes. |
This is where the paper is most valuable as a procurement lens. It gives buyers a way to look past the interface. A polished chat panel over a spreadsheet does not reveal whether the system has preprocessing, retrieval, traceability, security, and adaptation. Those are the unglamorous layers where enterprise value either appears or quietly dies.
What the experiments actually support, and what they do not
The Text-to-SQL experiment is the article’s best hook, but it should not be overextended.
What the paper directly shows:
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Baseline prompting on Spider and BIRD | Establish a simple reference point for each model | Simple prompting can be competitive, especially against poorly matched agent workflows | That baseline prompting is enough for real enterprise table work |
| Zero-shot CoT | Test whether a common reasoning prompt helps | CoT often degrades performance on these open-source models and tasks | That CoT is useless for all table reasoning |
| MAC-SQL, MAG-SQL, CHESS, MAGIC, OpenSearch-SQL | Compare representative Text-to-SQL agent methods | Agent scaffolding has uneven effects; careful alignment can help | That one framework dominates all settings |
| Open-source model comparison | Reflect practical deployment constraints | Weaker local models may struggle with complex agent workflows | That closed-source models would behave the same way |
| Execution accuracy as sole metric | Measure whether generated SQL returns correct results | Useful for Text-to-SQL correctness | Anything about security, retrieval quality, traceability, latency, cost, or user trust |
The strongest inference is not that agent systems fail. It is that agent systems are highly dependent on model capability and workflow design. OpenSearch-SQL’s results matter because they show the positive case: an alignment agent can reduce error propagation in a multi-stage workflow. For TableGPT2-7B, OpenSearch-SQL improves BIRD-dev from 46.94 to 64.34. That is not cosmetic. It is the difference between “interesting” and “maybe worth engineering”.
But the failures matter just as much. MAC-SQL drops TableGPT2-7B from 73.69 to 49.61 on Spider-dev and from 46.94 to 30.44 on BIRD-dev. CoT reduces Qwen2.5-Coder-7B-Instruct from 76.89 to 68.47 on Spider-dev and from 53.32 to 45.31 on BIRD-dev. These are not rounding errors. They are reminders that adding reasoning words does not guarantee better reasoning.
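The size of these swings is easier to compare once they are expressed as point changes and relative changes. The short script below does only that arithmetic, using the before and after figures quoted in this article.

```python
# Execution accuracy figures quoted in this article: (baseline, with method).
reported = {
    ("TableGPT2-7B", "MAC-SQL", "Spider-dev"): (73.69, 49.61),
    ("TableGPT2-7B", "MAC-SQL", "BIRD-dev"): (46.94, 30.44),
    ("TableGPT2-7B", "OpenSearch-SQL", "BIRD-dev"): (46.94, 64.34),
    ("Qwen2.5-Coder-7B-Instruct", "CoT", "Spider-dev"): (76.89, 68.47),
    ("Qwen2.5-Coder-7B-Instruct", "CoT", "BIRD-dev"): (53.32, 45.31),
}

def change(before, after):
    """Return (percentage-point change, relative change in percent)."""
    points = round(after - before, 2)
    relative = round(100 * (after - before) / before, 1)
    return points, relative

deltas = {key: change(*pair) for key, pair in reported.items()}
for (model, method, bench), (points, relative) in deltas.items():
    print(f"{model} + {method} on {bench}: {points:+.2f} points ({relative:+.1f}% relative)")
```

Running it shows that the best and worst outcomes for TableGPT2-7B on BIRD-dev are of similar magnitude in opposite directions, which is the practical argument for measuring before committing to a framework.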
For businesses, the correct reading is disciplined: test the full workflow under your own model constraints. If you plan to use local models, do not assume agent methods designed around stronger closed models will transfer. If you plan to use multi-step workflows, measure intermediate failures. If you plan to use CoT-like prompting, verify that it improves outputs rather than merely making mistakes more verbose.
Verbose mistakes are still mistakes. They just bill better.
A practical evaluation checklist for table-agent pilots
The paper’s seven design principles can be translated into a pilot checklist for enterprise teams.
| Evaluation area | Pilot test | Pass condition |
|---|---|---|
| Multiformat input | Feed CSV, spreadsheet, database table, scanned table, and hierarchical-header examples | The system preserves structure or routes to the right parser instead of flattening everything blindly. |
| Preprocessing | Include abbreviations, missing values, inconsistent dates, mixed units, and dirty column names | The system normalises or flags issues before reasoning. |
| Query understanding | Ask ambiguous business questions | The system clarifies rather than pretending the query is complete. |
| Retrieval/compression | Use large sparse tables with hidden relevant cells | The system exposes selected evidence and does not drop required fields. |
| Executable reasoning | Require SQL/Python/intermediate steps | Outputs are runnable, inspectable, and linked to final answers. |
| Security | Attempt unsafe code, restricted field access, and prompt injection through table content | The system validates code, enforces permissions, and runs in a sandbox where needed. |
| Domain adaptation | Use real terminology, policies, and recurring report formats | The system can use domain documentation or expert corrections without expensive reinvention. |
| Cost and latency | Compare baseline prompting, agent workflow, and retrieval-assisted designs | Accuracy gains justify extra tokens, time, and infrastructure. |
Notice what is not on the list: “Does the answer sound good?” That is a demo question. Operational evaluation starts when the answer sounds good and you ask how it got there.
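The security row of the checklist can be exercised with a small harness. The sketch below uses SQLite's authorizer hook, exposed in Python as `Connection.set_authorizer`, to allow only reads and to block one restricted column. The `patients` table, the restricted-column list, and the helper names are invented, and the approach is SQLite-specific: a production database would rely on its own grants and policies, with sandboxing alongside.

```python
import sqlite3

RESTRICTED = {("patients", "ssn")}  # hypothetical restricted (table, column) pairs

def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Permit SELECT statements and column reads, except restricted columns."""
    if action == sqlite3.SQLITE_SELECT:
        return sqlite3.SQLITE_OK
    if action == sqlite3.SQLITE_READ:  # arg1 = table name, arg2 = column name
        return sqlite3.SQLITE_DENY if (arg1, arg2) in RESTRICTED else sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY  # writes, schema changes, and everything else

def run_guarded(conn, sql):
    """Return (rows, None) on success or (None, error_message) when blocked."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)

# Invented toy data, loaded before the guard is switched on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (name TEXT, ward TEXT, ssn TEXT)")
conn.execute("INSERT INTO patients VALUES ('A. Example', 'W3', '000-00-0000')")
conn.commit()
conn.set_authorizer(read_only_authorizer)

allowed, _ = run_guarded(conn, "SELECT name, ward FROM patients")
_, leak_error = run_guarded(conn, "SELECT ssn FROM patients")
_, write_error = run_guarded(conn, "DELETE FROM patients")
print(allowed, leak_error, write_error)
```

The useful property is that the restriction is enforced by the database layer, so it holds regardless of how persuasive the prompt that produced the SQL happened to be.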
Where this paper is strongest, and where the boundary sits
This paper is strongest as a framework and deployment warning. Its five-capability model is a useful map of the field. Its workflow comparison shows that many current table agents still cover only fragments of the real problem. Its Text-to-SQL experiment adds evidence to a point practitioners already suspect: complex agent scaffolding can underperform when paired with weaker local models.
The boundary is equally important.
First, the quantitative experiment covers a specific subfield: Text-to-SQL. Table agents are broader than SQL generation. They include spreadsheet manipulation, table question answering, data visualisation, prediction, transformation, and report generation.
Second, the experiment uses selected open-source models and methods under resource constraints. It does not establish a universal ranking of all agent frameworks, all models, or all production systems.
Third, execution accuracy is useful but narrow. It does not measure privacy, auditability, retrieval quality, robustness to dirty spreadsheets, latency, user clarification, or cost per successful task.
Fourth, the survey’s design principles are well-motivated but not themselves validated as a complete production architecture. They are a roadmap, not a finished blueprint.
That said, the paper’s limitations do not weaken its core message. They sharpen it. The field does not yet have enough real-world table benchmarks, security evaluations, retrieval metrics, or domain-adaptation tests. Therefore, serious adopters should build their own evaluation harnesses rather than waiting for a leaderboard to bless their specific mess.
The business value is workflow reliability, not spreadsheet magic
The market temptation will be to sell table agents as spreadsheet magic: ask anything, get answers, replace analysts, liberate data, enjoy the usual confetti.
The more defensible business value is narrower and better:
- Reduce manual table cleaning by automating schema alignment, value normalisation, and validation.
- Improve analyst throughput by retrieving relevant table regions and generating executable draft queries or transformations.
- Increase auditability by logging intermediate SQL, Python, or transformation steps.
- Support governed self-service analytics by forcing clarification when business terms are ambiguous.
- Enable domain-specific workflows where RAG, fine-tuning, or expert feedback can adapt the agent to recurring table formats.
That is still a large opportunity. It is just not magic. It is infrastructure.
The paper’s title points “toward real-world table agents”, and the wording matters. We are not there merely because a model can answer a question about a clean table. We get there when the system can ingest messy inputs, ask better questions, select the right evidence, execute traceable operations, survive domain terminology, and respect security boundaries.
For operators, that means the next table-agent pilot should not begin with the prettiest dashboard. It should begin with the ugliest spreadsheet in the organisation.
The ugly spreadsheet knows the truth.
Cognaptus: Automate the Present, Incubate the Future.
[1] Jiaming Tian, Liyao Li, Wentao Ye, Haobo Wang, Lingxin Wang, Lihua Yu, Zujie Ren, Gang Chen, and Junbo Zhao, “Toward Real-World Table Agents: Capabilities, Workflows, and Design Principles for LLM-based Table Intelligence,” arXiv:2507.10281, 2025. https://arxiv.org/pdf/2507.10281