Opening — Why this matters now

Regulated industries love spreadsheets and hate surprises. Finance, healthcare, and insurance all depend on tabular data—and all have strict constraints on where that data is allowed to go. Shipping sensitive tables to an API-hosted LLM is often a non‑starter. Yet small, on‑prem language models have a reputation problem: they speak fluently but stumble over arithmetic.

This paper quietly dismantles that assumption. It shows that arithmetic weakness is not an inherent limitation of small language models (SLMs), but a consequence of how we ask them to think.

Background — From “LLMs can’t count” to code‑first reasoning

Prior work has already established two uncomfortable truths:

  1. Large language models are unreliable at multi‑step arithmetic over tables.
  2. Simply scaling parameters does not reliably fix this.

The authors’ earlier work reframed tabular QA as code generation: instead of answering directly, the model generates executable Python code that selects values and performs deterministic calculations. This Code Generation Agent (CGA), combined with table restructuring, pushed accuracy close to 80%—but relied on large, API‑hosted models.
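To make the code-first idea concrete, here is a minimal sketch assuming a toy revenue table; the restructured value-list format and the generated function are illustrative, not reproduced from the paper.

```python
# Illustrative only: a toy financial table, restructured into an annotated
# value list, and the kind of small Python function a Code Generation Agent
# might emit instead of answering in free text.

# Raw table (row-oriented), as it might appear in a financial report.
raw_table = {
    "Revenue":    {"2022": 1200.0, "2023": 1380.0},
    "Net income": {"2022": 150.0,  "2023": 135.0},
}

# Step 1: restructure into an annotated value list so every number carries
# its row label, column label, and unit.
value_list = [
    {"metric": metric, "year": year, "value": value, "unit": "USD millions"}
    for metric, years in raw_table.items()
    for year, value in years.items()
]

# Step 2: a generated function that selects the relevant values and performs
# the arithmetic deterministically (here: percentage change in revenue).
def solve() -> float:
    def lookup(metric: str, year: str) -> float:
        return next(v["value"] for v in value_list
                    if v["metric"] == metric and v["year"] == year)

    old, new = lookup("Revenue", "2022"), lookup("Revenue", "2023")
    return (new - old) / old * 100  # percentage change, on a 0-100 scale

print(f"{solve():.2f}%")  # -> 15.00%
```

The model never does the arithmetic itself; it only writes the selection and the formula, and the interpreter does the rest.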

The natural question followed: can small, local models do the same job—without fine‑tuning and without leaking data?

Analysis — Error‑driven prompt optimization

The core contribution of this paper is deceptively simple: make the model learn from its mistakes, but at the prompt level.

The pipeline

  1. Table restructuring converts raw tables into annotated value lists.
  2. Code Generation Agent produces Python functions instead of answers.
  3. Deterministic execution guarantees arithmetic correctness if the code is right.
  4. Error clustering groups failed cases by shared root causes.
  5. Rule induction adds targeted, domain‑specific prompt rules.
  6. Statistical validation (McNemar tests) ensures improvements are real.

No fine‑tuning. No gradient updates. Just disciplined iteration.
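Step 6 deserves a concrete illustration: a candidate rule is only worth keeping if it fixes significantly more cases than it breaks. A minimal sketch using the McNemar implementation in statsmodels, with hypothetical before/after counts rather than figures from the paper:

```python
# Sketch of the validation step: accept a candidate rule only if the accuracy
# gain is statistically significant under McNemar's test on paired outcomes.
# The counts below are hypothetical, not figures from the paper.
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table over the same questions, before vs. after adding the rule:
# rows = baseline prompt (correct, wrong), cols = prompt + rule (correct, wrong)
table = [
    [412, 9],   # correct before: stays correct / breaks after the rule
    [38, 141],  # wrong before:   fixed by the rule / still wrong
]

result = mcnemar(table, exact=True)  # exact binomial test on the discordant cells
print(f"p-value = {result.pvalue:.4f}")

# Keep the rule only if it fixes more cases than it breaks and the difference
# is unlikely to be noise.
if result.pvalue < 0.05 and table[1][0] > table[0][1]:
    print("Rule accepted.")
else:
    print("Rule rejected.")
```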

Why clustering matters

Rather than guessing which rules might help, the authors cluster errors using features such as:

  • Calculation pattern
  • Scale mismatch
  • Value/sign errors
  • Runtime vs logic errors

This reveals patterns like:

  • “percentage change” returning results at the wrong scale
  • “change in percentage” being computed as a relative change rather than a percentage‑point difference
  • financial terms (e.g., year average) being misunderstood

Each accepted rule fixes a class of errors, not a single example.
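A minimal sketch of this kind of grouping: build a signature from the error features and collect failed cases under it. The field names and example failures below are hypothetical, and the paper's actual clustering features may differ.

```python
# Illustrative sketch: group failed QA cases by a shared error signature so
# that one prompt rule can target a whole cluster of failures.
from collections import defaultdict

failed_cases = [
    {"question": "What is the percentage change in revenue?",
     "pattern": "percentage_change", "scale_mismatch": True,
     "sign_error": False, "error_type": "logic"},
    {"question": "What is the percentage change in operating costs?",
     "pattern": "percentage_change", "scale_mismatch": True,
     "sign_error": False, "error_type": "logic"},
    {"question": "What is the change in gross margin (in percentage points)?",
     "pattern": "percentage_point_change", "scale_mismatch": False,
     "sign_error": False, "error_type": "logic"},
    {"question": "What is the average quarterly revenue?",
     "pattern": "average", "scale_mismatch": False,
     "sign_error": False, "error_type": "runtime"},
]

# Cluster by the tuple of error features: each key is a candidate root cause.
clusters = defaultdict(list)
for case in failed_cases:
    signature = (case["pattern"], case["scale_mismatch"],
                 case["sign_error"], case["error_type"])
    clusters[signature].append(case["question"])

# The largest clusters are the best candidates for a new prompt rule,
# e.g. "return percentage changes on a 0-100 scale, not as fractions".
for signature, questions in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    print(f"{len(questions)} case(s) with signature {signature}")
```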

Findings — Results that actually matter

The headline result is not flashy—but it is consequential.

| Model | Setup | Exact Match |
|---|---|---|
| Qwen3 4B | CGA, no rules | 59.96% |
| Qwen3 4B | Error‑driven rules | 70.82% |
| GPT‑3.5 Turbo | Same pipeline | 66.27% |

A 4B‑parameter model, running fully on‑prem, outperforms GPT‑3.5 Turbo on arithmetic tabular QA.

Even more interesting is the shape of improvement:

  • Early rules deliver large gains (fixing common semantic errors)
  • Later rules yield diminishing—or negative—returns

This leads to a practical insight rarely stated explicitly.

Implications — The myth of “more rules is better”

The paper formalizes an idea many practitioners intuitively feel but rarely quantify: prompt overload is real.

There exists an optimal rule count $K_{\text{opt}}$ such that:

$$ K_{\text{opt}} = \arg\max_{K} P_{\text{accuracy}}(K) $$

Beyond this point:

  • Cognitive load increases
  • Rule conflicts emerge
  • Accuracy stagnates or declines

In other words, prompt engineering for SLMs is not about maximal instruction—it is about minimal sufficiency.
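In code, finding the optimal rule count is just a sweep over rule counts on a held‑out validation set. The accuracy numbers below are made up to illustrate the shape of the curve, not taken from the paper.

```python
# Hypothetical validation accuracies as rules are added one at a time.
# The values are invented to show the diminishing-returns pattern.
accuracy_by_rule_count = {
    0: 0.600,  # CGA baseline, no rules
    1: 0.655,  # early rules fix common semantic errors
    2: 0.690,
    3: 0.705,
    4: 0.708,  # gains flatten out
    5: 0.701,  # rule conflicts start to hurt
    6: 0.695,
}

# K_opt = argmax over K of validation accuracy: keep only the rules that help.
k_opt = max(accuracy_by_rule_count, key=accuracy_by_rule_count.get)
print(f"Optimal rule count: {k_opt} "
      f"(accuracy {accuracy_by_rule_count[k_opt]:.1%})")
```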

Business relevance — Why this changes deployment economics

For organizations operating under privacy or compliance constraints, this framework is quietly disruptive:

  • On‑prem deployment with commodity hardware
  • Auditable reasoning via executable code
  • No fine‑tuning cost or model retraining
  • Model‑agnostic methodology applicable beyond finance

This is not “prompt hacking.” It is a repeatable optimization loop.

Conclusion — Small models, properly disciplined

This paper reframes the debate around small language models. Their arithmetic failures are not fundamental—they are procedural. When reasoning is decomposed into code and refined through systematic error analysis, compact models become reliable analytical agents.

The real takeaway is not that Qwen3 4B beats GPT‑3.5 in one benchmark. It is that capability emerges from structure, not scale.

Cognaptus: Automate the Present, Incubate the Future.