Numbers are where language models usually stop sounding impressive.

Ask a model to summarize a financial report and it may produce a fluent paragraph with just enough confidence to make everyone in the meeting relax. Ask it to calculate a percentage change from a table, preserve the correct scale, and return a verifiable number, and the poetry ends. Suddenly the model must select the right values, understand the wording, apply the right operation, avoid sign mistakes, avoid scale mistakes, and not hallucinate a formula because the word “change” appeared nearby.

This is not a minor inconvenience. In finance, insurance, healthcare operations, procurement, and audit workflows, tabular arithmetic is often the boring part that decides whether a system is useful. Nobody wants a charming assistant that reads a balance sheet like a literary critic and then mislabels a percentage as millions. We already have humans for meetings like that.

The paper behind today’s article, “Error-Driven Prompt Optimization for Arithmetic Reasoning: A Code Generation Approach Using On-Premises Small Language Models on Tabular Data,” studies a practical version of this problem: can a small language model running on-premise become good enough at arithmetic question answering over financial tables without fine-tuning? [1]

The paper’s answer is not “just use a better prompt.” That phrase should probably be taxed by now. The answer is more operational: restructure the table, make the model generate executable code, observe where it fails, cluster those failures, ask a human expert to formulate narrow rules, and keep only the rules that improve both the local failure cluster and global accuracy.

That is the important mechanism. The headline result is easier to remember: Qwen3 4B improves from 59.96% to 70.82% exact match on a filtered arithmetic subset of TAT-QA, surpassing the paper’s reported GPT-3.5 Turbo exact-match baseline of 66.27% under comparable code-generation and rule-augmented conditions. But the headline is not the lesson. The lesson is that small models do not become reliable by being shouted at with a longer instruction list. They improve when errors are treated as diagnostic material.

The real bottleneck is not arithmetic alone

The naive story says small models fail because they cannot do arithmetic. That is partly true, but not precise enough to be useful.

In tabular financial QA, a model has to solve several linked problems:

| Subtask | What can go wrong | Why arithmetic alone is the wrong diagnosis |
|---|---|---|
| Understand the question | Confuse “percentage change” with “change in percentage” | The formula depends on wording, not only numbers |
| Select values from the table | Pick the wrong row, year, or category | The calculation may be correct over the wrong inputs |
| Interpret scale | Return “million” when the answer should be “percent” | The numeric value and answer unit can be separately wrong |
| Generate computation | Apply subtraction, division, averaging, or loops incorrectly | The model must translate language into an operation |
| Return final answer | Format number and scale inconsistently | Exact-match evaluation penalizes output structure, not only math |

The paper uses TAT-QA because it contains financial reports with questions and derivations. The authors filter it to focus on arithmetic questions answerable from tables alone: 215 tables and 497 questions, down from the broader 278 tables and 1,668 questions in the dataset. This matters. The result is not a claim about all financial QA, all table reasoning, or all document analysis. It is a controlled test of arithmetic over structured financial tables.

That scope is narrow, but useful. A narrow test can expose a mechanism cleanly. The mistake would be to treat it as a universal benchmark trophy.

The mechanism starts by taking arithmetic away from the model

The first move is the Code Generation Agent, or CGA. Instead of asking the model to directly produce the answer, the system asks it to generate a Python function. The function receives a list of annotated table values and returns a tuple: the numeric answer and its scale.

This decomposition changes the job.

The language model no longer has to be both interpreter and calculator in a single opaque step. It still has to understand the question and generate suitable code, but the arithmetic itself is delegated to deterministic execution. If the generated code is correct, the computation is no longer subject to the model’s internal numerical fragility.

The authors combine this with table restructuring. Rather than feeding the original financial table in its raw layout, they convert it into annotated values. Each data item can carry contextual metadata such as category, header, and number value. This is not glamorous. It is also where much of the useful engineering lives.
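To make the setup concrete, here is a minimal sketch of what annotated values and a generated function might look like. The field names (`category`, `header`, `value`), the sample figures, the `lookup` helper, and the function name `solve` are all illustrative assumptions, not the paper's actual schema or model output.

```python
# Hypothetical annotated values for one table. Field names and figures are
# assumptions for illustration, not the paper's schema or TAT-QA data.
values = [
    {"category": "Revenue", "header": "2018", "value": 80.0},
    {"category": "Revenue", "header": "2019", "value": 100.0},
    {"category": "Costs", "header": "2018", "value": 60.0},
    {"category": "Costs", "header": "2019", "value": 66.0},
]


def lookup(values, category, header):
    """Return the single value matching a row category and a column header."""
    matches = [v["value"] for v in values
               if v["category"] == category and v["header"] == header]
    if len(matches) != 1:
        raise LookupError(f"expected one match for {category}/{header}")
    return matches[0]


# The kind of function the model is asked to write for a question such as
# "What was the percentage change in revenue from 2018 to 2019?"
def solve(values):
    old = lookup(values, "Revenue", "2018")
    new = lookup(values, "Revenue", "2019")
    return (new - old) / old * 100, "percent"


answer = solve(values)
print(answer)
```

The model's job ends at writing `solve`. The division and multiplication are done by the Python interpreter, which is the whole point of the decomposition.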

In business terms, the architecture is closer to an audit workflow than a chatbot trick:

  1. normalize the table into machine-friendly records;
  2. ask the model to produce executable logic;
  3. run the logic;
  4. compare output against known answers;
  5. diagnose recurring failures;
  6. add rules only when the evidence justifies them.
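Steps 2 to 4 of that list can be sketched as a small harness that executes model-written source and separates syntax failures from runtime failures. This is an assumed control flow, not the authors' implementation; the function name `solve` and the helper `run_generated` are hypothetical, and calling `exec` on model output is only reasonable inside a sandbox.

```python
def run_generated(code, values, func_name="solve"):
    """Execute model-written source and call the function it defines.

    Returns (result, error_type); error_type is None on success.
    NOTE: exec() on model output is unsafe outside a sandbox. This is
    only a sketch of the control flow.
    """
    namespace = {}
    try:
        exec(compile(code, "<generated>", "exec"), namespace)
    except SyntaxError:
        return None, "syntax error"
    except Exception:
        return None, "runtime error"
    try:
        # A missing function or a malformed return value also lands here.
        number, scale = namespace[func_name](values)
    except Exception:
        return None, "runtime error"
    return (number, scale), None


# Three invented model outputs: one valid, one malformed, one that crashes.
good_code = "def solve(values):\n    return values[1] - values[0], 'million'\n"
bad_syntax = "def solve(values) return 1"
bad_runtime = "def solve(values):\n    return values[5], 'million'\n"

print(run_generated(good_code, [3.0, 5.0]))
print(run_generated(bad_syntax, [3.0, 5.0]))
print(run_generated(bad_runtime, [3.0, 5.0]))
```

Keeping the error type alongside the result is what makes the later diagnosis step possible: a failure that is logged with a label can be counted and grouped.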

The paper’s previous-work tables are important background rather than the main evidence. They show why the authors selected Qwen3 4B and why a simplified prompt matters. In their earlier comparison, Qwen3 4B with CGA and table restructuring reached 53.72% exact match, and with a simplified prompt reached 59.96%. That 59.96% becomes the base prompt in the new error-driven experiment.

The mechanism is now ready for the real question: if the model still fails after decomposition, can the failures teach us what prompt rules are worth adding?

Error clustering turns prompt writing into diagnosis

The paper’s central contribution is a semi-automated prompt extension loop.

The loop does not begin with a human brainstorming rules. It begins with the model’s wrong answers. For each failed case, the authors extract features that describe the type of calculation, the generated code’s calculation pattern, whether the scale mismatched, whether the numeric values matched, and the broader error type.

The error types are practical and interpretable: selection error, calculation error, scale error, sign error, syntax error, and runtime error. This taxonomy matters because prompt optimization without error categories is usually just vibes in a lab coat.

The authors then cluster failed cases with HDBSCAN using Hamming distance over categorical or binary features. They tune clustering parameters by looking at cluster count and noise ratio, with the goal of producing clusters that are neither too coarse nor too fragmented. In the first iteration, the selected setup produces 45 clusters and a noise ratio of about 0.14; the largest cluster contains 34 items.
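The distance side of this step is easy to illustrate. The sketch below encodes failed cases as tuples of categorical features and computes pairwise Hamming distances in plain Python. The feature names paraphrase the description above, the three cases are invented, and the clustering call itself is left out: a matrix like this is the kind of input an HDBSCAN implementation that accepts precomputed distances could consume.

```python
from itertools import combinations

# Feature names paraphrase the article's description. The three failed
# cases below are invented for illustration.
FEATURES = ("question_type", "code_pattern", "scale_mismatch",
            "value_match", "error_type")

failed_cases = [
    ("percentage change", "relative change", True, True, "scale error"),
    ("percentage change", "relative change", True, True, "scale error"),
    ("average", "loop", False, False, "selection error"),
]


def hamming(a, b):
    """Fraction of feature positions where two cases disagree."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(cases):
    """Symmetric pairwise Hamming distances, zero on the diagonal."""
    n = len(cases)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = hamming(cases[i], cases[j])
    return dist


dist = distance_matrix(failed_cases)
for row in dist:
    print(row)
```

Hamming distance suits this data because the features are labels, not magnitudes: two failures are close when they share the same error type, question type, and scale behaviour, regardless of the numbers involved.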

That largest cluster is revealing. Its members are all scale errors around percentage-change questions. The model often calculates the percentage-change value but returns the wrong scale, such as “million” or “thousand,” when the answer should be “percent.”

The human-in-the-loop part enters here. The algorithm identifies the cluster; the expert formulates the rule. The first accepted rule is simple:

“percentage change” results “percent” scale

This is not a broad theory of financial reasoning. It is a small corrective instruction aimed at a recurring root cause. Very unfashionable. Very useful.

The rule is then tested locally on the cluster and globally on the full evaluation set. The authors use the exact binomial variant of the McNemar test for paired binary outcomes in the local cluster, because the cluster sizes are small. A rule is not accepted simply because it sounds reasonable. It must improve the selected failure cluster and also improve global exact match.
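For readers who want to see the acceptance gate in code, here is a minimal sketch of the exact binomial McNemar test plus a two-part acceptance check. The 0.05 threshold, the function names, and the example counts are assumptions; the paper's exact decision procedure may differ in detail.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact (binomial) McNemar p-value.

    b: cases wrong before the rule and right after it
    c: cases right before the rule and wrong after it
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of change
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def accept_rule(b, c, em_before, em_after, alpha=0.05):
    """Keep a rule only if it helps the cluster and the full evaluation set."""
    return mcnemar_exact(b, c) < alpha and em_after > em_before


# Invented counts: 8 cluster cases fixed, none broken.
p_value = mcnemar_exact(8, 0)
print(p_value)
print(accept_rule(8, 0, em_before=60.0, em_after=64.0))
print(accept_rule(8, 0, em_before=67.0, em_after=66.5))
```

Inside a failure cluster every case was wrong before the rule, so `c` is zero and the test reduces to asking how many cases the rule fixed. With eight fixed and none broken, the p-value is 2 × 0.5^8, roughly 0.0078. The second call shows the other half of the gate: a rule that repairs its cluster but lowers global exact match is still rejected.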

This is where the paper becomes more interesting than ordinary prompt engineering. A rule is treated as a candidate intervention, not as sacred text.

The accepted rules are small, specific, and not equally obvious

The final prompt contains three accepted rules:

| Accepted rule | Failure it targets | Operational interpretation |
|---|---|---|
| “percentage change” results “percent” scale | Scale errors in percentage-change questions | The model could calculate the value but return the wrong unit |
| “change in percentage” is a subtraction | Confusion between relative change and difference in percentage points | The wording determines whether division is appropriate |
| For “year average,” average the given year and previous year | Misunderstanding a domain-specific financial phrase | The model lacked a local convention needed for the dataset |

The first rule is almost embarrassingly simple, which is precisely why the result is useful. Many production failures are not grand reasoning failures. They are narrow, repeated, boring errors that survive because nobody has built the monitoring loop to find them.

The second rule is more semantic. “Percentage change” and “change in percentage” look similar, especially to a small model trying to match patterns. But they can imply different operations. A percentage change usually involves a relative-change formula. A change in percentage can mean subtracting one percentage value from another. The model needs wording discipline.

The third rule is domain-specific. “Year average” is handled as the average between the given year and the previous one. That may be natural in the dataset context, but it is not a universal financial law. This is exactly the kind of rule that can help a deployment and embarrass a generalized benchmark claim if used carelessly.
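The three rules are easiest to compare as arithmetic. The sketch below is one reading of them, with hypothetical function names and invented inputs. Labelling the result of a percentage-point difference as "percent" is an assumption about the dataset's scale vocabulary.

```python
def percentage_change(old, new):
    """Relative change from old to new, expressed in percent."""
    if old == 0:
        raise ZeroDivisionError("relative change is undefined for old == 0")
    return (new - old) / old * 100, "percent"


def change_in_percentage(old_pct, new_pct):
    """Plain difference between two values that are already percentages."""
    return new_pct - old_pct, "percent"


def year_average(current_year, previous_year, scale):
    """Dataset convention: mean of the given year and the one before it."""
    return (current_year + previous_year) / 2, scale


# The same pair of numbers, read two different ways.
print(percentage_change(40, 50))      # relative change
print(change_in_percentage(40, 50))   # percentage-point difference
print(year_average(120, 100, "million"))
```

A figure that moves from 40 to 50 has a percentage change of 25 percent. If those same figures are already percentages, the change in percentage is 10 points. The inputs are identical and the answers differ by a factor of 2.5, which is why the wording rule earns its place in the prompt.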

The paper’s mechanism therefore has two distinct layers:

| Layer | What is automated | What remains human |
|---|---|---|
| Error discovery | Run model, compare predictions, extract features, cluster failures | Decide whether a cluster reflects a meaningful root cause |
| Rule formulation | Evaluate local and global performance after rule insertion | Write the domain rule in language the model can follow |
| Rule acceptance | Use paired evaluation and global exact match | Interpret whether the rule is worth operational adoption |

This is semi-automated, not fully automated. That is not a weakness if the target user is a regulated business. In compliance-heavy workflows, full autonomy is often less valuable than repeatable diagnosis with human accountability.

The main evidence is the iteration curve, not a single score

The paper’s most useful table is the iteration summary. It shows what happens as rules are added, accepted, or rejected.

| Prompt version | Rule change | Exact match | Likely purpose of test | Interpretation |
|---|---|---|---|---|
| V1 | Base simplified prompt, no added rules | 59.96% | Baseline | Starting point for Qwen3 4B with CGA and restructuring |
| V2 | Add percentage-change scale rule | 64.59% | Main evidence | Large gain from correcting a frequent scale failure |
| V3 | Add “change in percentage” subtraction rule | 67.61% | Main evidence | Further gain from clarifying semantic operation |
| V4 | Add broad difference/change subtraction-order rule | 67.40% | Rejected rule / sensitivity test | Plausible rule slightly hurts global performance |
| V5 | Add year-average rule to accepted rule set | 70.82% | Main evidence | Best reported prompt; small but meaningful additional gain |
| V6 | Add “use all relevant years” for averages | 70.02% | Rejected rule / sensitivity test | Local-looking fix reduces global performance |
| V7 | Add exact category/header selection rule | 69.22% | Rejected rule / sensitivity test | Implementation-oriented instruction overloads or misguides the model |

The improvement from 59.96% to 70.82% is a 10.86 percentage-point absolute gain. On a 497-question filtered test set, that is not cosmetic. It means the system moves from “interesting prototype” toward “worth testing inside a controlled workflow.”

But the rejected rules are just as important as the accepted ones. V4, V6, and V7 show that more rules do not automatically help. Some rules target small clusters. Some rules are too broad. Some may conflict with other instructions. Some may ask the small model to hold too many operational details in its prompt context.
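The iteration curve can be replayed in a few lines. The sketch below applies only the global criterion, keeping a candidate when its exact match beats the best prompt so far, and assumes each candidate was evaluated on top of the rules accepted before it. The local cluster test is omitted because the per-cluster counts are not reproduced in this article.

```python
# Exact-match scores from the iteration summary, in percent.
iterations = [
    ("V2", "percentage-change scale rule", 64.59),
    ("V3", "change-in-percentage subtraction rule", 67.61),
    ("V4", "broad subtraction-order rule", 67.40),
    ("V5", "year-average rule", 70.82),
    ("V6", "use all relevant years for averages", 70.02),
    ("V7", "exact category/header selection rule", 69.22),
]

baseline = 59.96  # V1, no added rules
best = baseline
accepted, rejected = [], []
for version, rule, exact_match in iterations:
    # Global criterion only: keep a rule if it beats the best prompt so far.
    if exact_match > best:
        accepted.append(version)
        best = exact_match
    else:
        rejected.append(version)

total_gain = round(best - baseline, 2)
# Rough translation of the gain into questions on the 497-question set.
extra_correct = round(total_gain / 100 * 497)
print(accepted, rejected, total_gain, extra_correct)
```

Replayed this way, V2, V3, and V5 survive and V4, V6, and V7 do not. The 10.86-point gain corresponds to roughly 54 additional questions answered correctly out of 497.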

This is the paper’s quiet warning to prompt maximalists: a prompt is not a landfill for every bug report.

“Small beats GPT-3.5” is true here, but read the fine print

The paper compares the final Qwen3 4B setup with a GPT-3.5 Turbo reference setup using CGA, table restructuring, and model-specific prompt rules. Qwen3 4B reaches 70.82% exact match, compared with GPT-3.5 Turbo at 66.27%.

That is a strong result for an on-premise small model. It supports the claim that small local models can become competitive in narrow, structured arithmetic workflows when the architecture is built around them.

However, the comparison has a detail that deserves more attention than the usual “small model beats large model” headline. GPT-3.5 Turbo still has a slightly higher value match score: 78.92% versus Qwen3 4B’s 77.06%. Exact match includes both value and scale. The Qwen setup wins exact match partly because the prompt rules improve final answer formatting and scale handling.
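The gap between the two metrics is easy to see with a toy scorer. The sketch below assumes predictions and gold answers are `(number, scale)` pairs and uses a small numeric tolerance. The paper's actual matching rules, such as rounding, may differ, and the four pairs are invented.

```python
def value_match(pred, gold, tol=1e-6):
    """True when the numeric parts agree, ignoring scale."""
    return abs(pred[0] - gold[0]) <= tol


def exact_match(pred, gold, tol=1e-6):
    """True when both the number and the scale agree."""
    return value_match(pred, gold, tol) and pred[1] == gold[1]


# Invented (prediction, gold) pairs for illustration.
pairs = [
    ((25.0, "percent"), (25.0, "percent")),   # right value, right scale
    ((25.0, "million"), (25.0, "percent")),   # right value, wrong scale
    ((12.0, "percent"), (25.0, "percent")),   # wrong value
    ((3.5, "thousand"), (3.5, "thousand")),   # right value, right scale
]

vm = sum(value_match(p, g) for p, g in pairs) / len(pairs)
em = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
print(vm, em)
```

In this toy set, value match is 0.75 and exact match is 0.5. The second pair is the interesting one: the arithmetic is right and the answer is still wrong. A rule that only fixes the scale label raises exact match without touching value match, which is the pattern the comparison above describes.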

That does not weaken the paper. It sharpens it.

The result suggests that targeted prompt rules can make a smaller model produce the right final answer more often in the tested setup, even if its underlying value selection and reasoning are not uniformly stronger than the larger model’s. For a business workflow, the final answer matters. For system design, the distinction matters even more.

A deployment team should not conclude, “Qwen3 4B is better than GPT-3.5 at financial reasoning.” The safer conclusion is:

Under this filtered TAT-QA arithmetic setup, with table restructuring, code generation, and error-driven rule selection, Qwen3 4B achieved higher exact match than the reported GPT-3.5 Turbo reference, while value-only performance remained slightly lower.

That sentence is less viral. It is also less likely to get someone fired.

The rule-capacity trade-off is the business lesson

The paper frames a theoretical implication around the number of prompt rules. Small language models have limited capacity to process complex instructions. Early rules fix common, high-impact errors. Later rules tend to address rarer edge cases, introduce conflicts, or overload the model.

This is not just a model behavior observation. It is a governance pattern.

In a real organization, every rule added to an AI workflow has a maintenance cost. Someone must know why it exists, which failure mode it targets, whether it still helps after data distribution changes, and whether it conflicts with newer rules. If rules are added casually, the prompt becomes a museum of past incidents. The model then has to perform under a pile of institutional anxiety.

The paper’s algorithm gives a more disciplined approach:

| Governance question | Paper’s mechanism | Business value |
|---|---|---|
| Which failures deserve attention? | Cluster errors and inspect the largest or most coherent groups | Focus effort on repeated failure modes |
| Who writes the fix? | Human expert formulates the domain-specific rule | Preserve domain accountability |
| How is a rule accepted? | Test local cluster improvement and global exact-match impact | Avoid fixes that help one corner but hurt the system |
| When do we stop? | Stop when candidate rules no longer meet acceptance criteria | Prevent prompt bloat |

This is especially relevant for regulated sectors. In finance or healthcare, the selling point is not that the model is small because small is cute. The selling point is that the system can run locally, keep sensitive data inside controlled infrastructure, and produce auditable intermediate artifacts: annotated values, generated code, execution output, error clusters, and rule histories.
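A rule history does not need heavy tooling. The sketch below shows one possible shape for it, with hypothetical class and field names. The two entries reuse the V2 and V4 scores reported above, and the "before" score for V4 assumes it was tested on top of V3.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RuleRecord:
    """One entry in a rule history. All field names are hypothetical."""
    rule_text: str
    target_failure: str
    em_before: float
    em_after: float
    accepted: bool


@dataclass
class RuleHistory:
    records: list = field(default_factory=list)

    def add(self, record):
        self.records.append(record)

    def active_rules(self):
        """Rules currently in the prompt, in the order they were accepted."""
        return [r.rule_text for r in self.records if r.accepted]

    def to_json(self):
        """Serialize every decision, including rejections, for audit."""
        return json.dumps([asdict(r) for r in self.records], indent=2)


history = RuleHistory()
history.add(RuleRecord(
    rule_text='"percentage change" results "percent" scale',
    target_failure="scale errors in percentage-change questions",
    em_before=59.96, em_after=64.59, accepted=True,
))
history.add(RuleRecord(
    rule_text="broad difference/change subtraction-order rule",
    target_failure="not detailed in this article",
    em_before=67.61, em_after=67.40, accepted=False,
))
print(history.active_rules())
print(history.to_json())
```

Storing rejected rules next to accepted ones is deliberate. When someone proposes the same fix a year later, the record of why it failed is already there.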

That is a more credible business story than “we used AI to automate analysis.” Almost everyone says that now. Some even mean it.

What the paper directly shows

The direct findings are specific:

| Claim | Evidence in the paper | Boundary |
|---|---|---|
| CGA plus table restructuring gives small models a better arithmetic setup | Prior results show Qwen3 4B improves substantially under CGA and restructuring | This is inherited from the authors’ earlier work and used as the base for the current study |
| Error-driven prompt optimization improves Qwen3 4B | Exact match rises from 59.96% to 70.82% after three accepted rules | Tested on 497 filtered TAT-QA arithmetic table questions |
| Clustering can identify interpretable recurring failure modes | The largest first cluster contains percentage-change scale errors | Cluster usefulness depends on feature design and expert interpretation |
| Not all plausible rules help | Several candidate rules reduce or fail to improve global performance | This supports selective rule acceptance, not automatic rule accumulation |
| The small on-premise model can beat the reported GPT-3.5 Turbo exact-match baseline | Qwen3 4B: 70.82% EM; GPT-3.5 Turbo: 66.27% EM | Value match remains slightly lower for Qwen3 4B |

The distinction between direct evidence and business inference is important. The paper does not prove that any small model can replace cloud LLMs in every regulated workflow. It does show that, for a narrow class of financial table arithmetic questions, a carefully engineered small-model pipeline can become competitive without fine-tuning.

What Cognaptus would infer for business use

For business automation, the paper points to a deployable pattern:

  1. use local models where data sovereignty matters;
  2. convert messy tables into annotated values before asking the model to reason;
  3. make the model generate executable code rather than final arithmetic directly;
  4. log failures as structured data;
  5. cluster errors to identify repeated root causes;
  6. let a domain expert write narrow corrective rules;
  7. accept only rules that improve both targeted errors and global performance.

This pattern is not limited to financial reports. It could apply to invoice reconciliation, medical billing tables, procurement variance analysis, insurance claim calculations, inventory reports, regulatory templates, or any workflow where the data is structured enough to validate answers and the cost of sending data to an external API is legally or commercially unattractive.

But the transfer is architectural, not automatic. A hospital billing workflow would need different annotated fields, different error categories, and different domain rules. An insurance workflow would have its own scale conventions, exclusions, and calculation semantics. The paper offers a process for learning operational rules from mistakes. It does not hand over a universal rulebook.

That is probably the healthier kind of contribution.

The appendix is operational evidence, not decoration

The prompt appendix matters because it shows how small the final intervention actually is. The base prompt asks the model to generate a Python function over a value list and return (number, scale). The final prompt adds only three domain-specific rules before repeating the instruction not to generate explanations or example code.
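To show how little text is involved, here is a sketch of prompt assembly. The base instruction and closing line are placeholder paraphrases, not the paper's actual prompt, and `build_prompt` is a hypothetical helper. The three rules are given as worded in this article.

```python
# Placeholder wording: the paper's actual base prompt is not quoted here.
BASE_PROMPT = (
    "Write a Python function that takes a list of annotated values "
    "and returns (number, scale)."
)
CLOSING = "Do not generate explanations or example code."

# The three accepted rules, as worded in this article.
ACCEPTED_RULES = [
    '"percentage change" results "percent" scale',
    '"change in percentage" is a subtraction',
    'For "year average," average the given year and previous year',
]


def build_prompt(question, rules):
    """Assemble base instruction, numbered rules, closing line, question."""
    lines = [BASE_PROMPT]
    if rules:
        lines.append("Rules:")
        lines.extend(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    lines.append(CLOSING)
    lines.append(f"Question: {question}")
    return "\n".join(lines)


prompt = build_prompt("What is the percentage change in revenue?",
                      ACCEPTED_RULES)
print(prompt)
```

Keeping the rules in a list, separate from the base instruction, means a rejected candidate can be removed without touching anything else in the prompt.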

That compactness is part of the result.

The error-cluster appendix is also useful. It lists many percentage-change questions whose generated code often computed a percentage-like expression but returned the wrong scale. This supports the interpretation that the first rule was not arbitrary. It was targeted at a visible family of failures.

In other words, the appendix is not a second thesis. It is an audit trail. It lets the reader see the failure family, the prompt intervention, and the final prompt state. For business readers, that auditability is not academic overhead. It is what separates a governed automation system from a very confident spreadsheet intern.

Boundaries that matter before deployment

The first boundary is dataset scope. The test set contains 497 filtered TAT-QA arithmetic questions answerable from tabular data. This is valuable, but it excludes broader hybrid questions requiring paragraph evidence. Many real financial workflows mix tables, footnotes, definitions, and management commentary. The paper’s mechanism may extend to those settings, but the reported numbers do not prove it.

The second boundary is human rule formulation. The system clusters errors algorithmically, but a human still interprets the cluster and writes the rule. That is appropriate for high-stakes domains, but it means the method is not fully self-improving. It is better described as error-driven assisted prompt optimization.

The third boundary is domain specificity. The “year average” rule is useful in the tested context, but it is not a universal accounting axiom. Domain-specific rules should be versioned, documented, and periodically revalidated. Otherwise today’s clever fix becomes tomorrow’s silent bug.

The fourth boundary is model capacity. The rejected rules show that small models can be harmed by additional instructions. This means prompt optimization for small language models (SLMs) is not simply LLM prompt engineering applied to a model with fewer parameters. It is its own discipline: shorter prompts, cleaner rules, stricter acceptance tests.

The fifth boundary is comparison framing. Qwen3 4B surpasses GPT-3.5 Turbo in exact match in the reported setup, but GPT-3.5 Turbo remains slightly ahead on value match. The result is a win for the engineered pipeline, not a general declaration that the smaller model reasons better.

The practical message: stop treating mistakes as anecdotes

The paper’s most useful idea is cultural as much as technical. In many AI deployments, errors are handled as anecdotes. A user finds a bad answer, someone patches a prompt, the team moves on, and the prompt slowly turns into a haunted attic.

This paper treats errors as data. That changes the workflow.

A failed prediction is not just a defect. It is a labeled event with features. A group of failures is not just a bad week. It may be a root cause. A new rule is not just a clever sentence. It is an intervention that must pass local and global tests. A rejected rule is not wasted effort. It is evidence that plausibility and usefulness are different things.

That is the part worth bringing into business AI systems.

The future of on-premise AI in regulated sectors will not be decided only by model size. It will be decided by whether organizations can build reliable loops around models: preprocessing, decomposition, execution, monitoring, diagnosis, targeted correction, and governance.

Small models do not need to be magical to be useful. They need scaffolding. They need error logs. They need domain experts who write fewer, better rules. And, apparently, they need someone to remind them that “percentage change” returns a percent. The machines are brilliant, but let us not get carried away.

Conclusion: the model learns, but the workflow teaches

The title says small models learn from their mistakes, but the more precise version is this: the workflow learns from the model’s mistakes and then teaches the model through a concise rule set.

That distinction matters. Fine-tuning changes model weights. This paper changes the operating environment around the model. It decomposes the task into code generation, turns wrong answers into clustered diagnostics, and adds domain rules only when they survive evidence. The result is a small on-premise model that becomes materially stronger on a narrow but commercially relevant class of tabular arithmetic tasks.

For Cognaptus readers, the lesson is not “replace your cloud LLM with Qwen3 4B tomorrow.” The lesson is that reliable automation in regulated workflows may come less from chasing ever-larger models and more from designing tighter feedback loops around smaller ones.

A model that can be audited, corrected, and deployed locally is not always the smartest model in the room. Sometimes it is simply the one that can be trusted to stay in the room.

References

  1. Árpád Pándy, Róbert Lakatos, and András Hajdu, “Error-Driven Prompt Optimization for Arithmetic Reasoning: A Code Generation Approach Using On-Premises Small Language Models on Tabular Data,” arXiv:2512.13323, 2025. https://arxiv.org/abs/2512.13323