Opening — Why this matters now

Most AI teams are still asking the wrong question: Which model should we use?

The more uncomfortable—and far more expensive—question is: How much are you paying for each correct answer?

In production environments, especially those involving structured classification tasks, performance is no longer judged by benchmark scores alone. It is judged by accuracy per dollar, per call, per decision.

A recent case study on receipt-item categorisation using AWS Bedrock quietly exposes something the industry prefers not to say out loud: prompt design can outperform model upgrades in both accuracy and ROI.

And yes, the difference can be as small—and as decisive—as $0.004 per call.


Background — From model obsession to system design

Text classification is not new. What is new is the assumption that LLMs can replace traditional pipelines with minimal effort.

Historically, classification systems relied on:

  • Feature engineering (painful)
  • Fine-tuned models (expensive)
  • Rule-based systems (fragile but predictable)

LLMs promised to dissolve this trade-off.

In reality, they reintroduced it—just at a different layer.

Instead of tuning weights, we now tune prompts, schemas, and category definitions.

The paper’s setup is deceptively simple:

  • Task: classify receipt items into predefined expense categories
  • Models: Claude 3.7, Claude 4, Mixtral 8x7B, Mistral 7B
  • Platform: AWS Bedrock (standardized API layer)
  • Dataset: 389 manually labeled receipt items

The ambition is not academic elegance. It is operational clarity: Which setup actually works in production?


Analysis — What the paper actually does

The study runs in two phases.

Phase 1 — Model selection under controlled prompts

All models are tested under identical, schema-first prompts:

  • Fixed category list
  • JSON output constraint
  • Zero-shot (no examples)
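The three constraints above can be sketched as a single prompt builder. This is a minimal illustration, not the paper's actual prompt: the category names and wording here are assumptions.

```python
import json

# Illustrative category list -- the paper's actual taxonomy is not reproduced here.
CATEGORIES = ["Groceries", "Dining", "Transport", "Office Supplies", "Other"]

def build_zero_shot_prompt(item: str, categories: list[str]) -> str:
    """Schema-first, zero-shot prompt: a fixed category list,
    a JSON output constraint, and no examples."""
    return (
        "Classify the receipt item into exactly one category.\n"
        f"Allowed categories: {json.dumps(categories)}\n"
        'Respond with JSON only, in the form {"category": "<one of the allowed categories>"}.\n'
        f"Item: {item}"
    )

prompt = build_zero_shot_prompt("oat milk 1L", CATEGORIES)
print(prompt)
```

Because every model receives this same string, differences in the Phase 1 results can be attributed to the model, not the prompt.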

The goal is fairness. The result is clarity.

Phase 2 — Prompt engineering as optimization

Once the best model is identified, the real work begins.

Four prompt variants are tested:

Variant | Categories | Rules | Few-shot | Intent
V1      | Baseline   | –     | –        | Minimal cost baseline
V2      | Refined    | –     | –        | Better taxonomy
V3      | Refined    | ✓     | –        | Rule-guided disambiguation
V4      | Refined    | –     | ✓        | Example-driven guidance

This is not prompt tinkering. It is system design disguised as prompting.


Findings — The uncomfortable economics of LLMs

1. Model choice matters. But not as much as you think.

Model        | Accuracy | Balanced Accuracy | F1    | Observation
Claude 3.7   | 0.902    | 0.773             | 0.905 | Best overall trade-off
Claude 4     | 0.848    | 0.748             | 0.851 | Slower, no clear gain
Mixtral 8x7B | 0.694    | 0.608             | 0.696 | Faster but unstable
Mistral 7B   | 0.596    | 0.492             | 0.600 | Cheapest, least reliable

The gap between proprietary and open-weight models is not subtle. It is structural.

But the real story begins after this.


2. Prompt design delivers larger gains than model upgrades

Variant                | Accuracy | Balanced Accuracy | Cost Impact
V1 (Baseline)          | ~90.7%   | ~81.4%            | 1x
V2 (Better categories) | ~90.7%   | –                 | ~1.05x
V3 (Rules)             | 93.3%    | ↑↑                | ~2x
V4 (Few-shot)          | 92.5%    | mixed             | highest cost

The conclusion is almost offensive in its simplicity:

Adding rules beats adding examples.

Few-shot prompting—often treated as a default upgrade—adds cost without improving outcomes.

Rules, on the other hand, reshape the decision boundary.
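A V3-style variant can be realized by appending numbered disambiguation rules to the same schema-first prompt. The rules below are invented placeholders to show the mechanism; the study's actual rules are not reproduced here.

```python
# Hypothetical disambiguation rules -- stand-ins, not the study's actual rules.
RULES = [
    "Prepared food from a restaurant counter is Dining, even inside a supermarket.",
    "Ride-hailing and public transit fares are Transport.",
]

def add_rules(base_prompt: str, rules: list[str]) -> str:
    """V3-style variant: append numbered disambiguation rules,
    trading extra input tokens for a sharper decision boundary."""
    numbered = "\n".join(f"{i + 1}. {rule}" for i, rule in enumerate(rules))
    return f"{base_prompt}\nRules:\n{numbered}"

base_prompt = (
    "Classify the receipt item into exactly one category.\n"
    'Respond with JSON only: {"category": "..."}'
)
v3_prompt = add_rules(base_prompt, RULES)
print(v3_prompt)
```

Unlike few-shot examples, each rule targets a known ambiguity directly, which is why the tokens it adds tend to pay for themselves.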


3. Accuracy is not binary—it is contextual

The study introduces an important distinction:

  • Strict accuracy: exact label match
  • Lenient accuracy: acceptable alternative categories allowed

Under lenient evaluation:

  • Accuracy rises to ~95%
  • Only ~5% of outputs are truly wrong

This reframes the problem entirely.

LLMs are not “wrong” as often as metrics suggest. They are ambiguous in human ways.


4. Cost is driven by input tokens, not output

Variant | Input Tokens | Cost per Call
V1      | ~421         | ~$0.00395
V3      | ~979         | ~$0.00874
V4      | ~1212        | ~$0.01067

Output tokens barely change.

Which means:

Every extra word in your prompt is a recurring expense.

At scale, this is not a technical detail. It is a budget line.
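The arithmetic behind that budget line is simple. The per-1K-token rates below are assumptions for illustration, not quotes from Bedrock's pricing page; check current pricing before relying on them.

```python
# Assumed rates (USD per 1,000 tokens) -- illustrative, not actual Bedrock pricing.
INPUT_RATE = 0.008
OUTPUT_RATE = 0.024

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Cost of one classification call. Input tokens dominate because the
    prompt (categories, rules, examples) dwarfs the tiny JSON output."""
    return input_tokens / 1000 * INPUT_RATE + output_tokens / 1000 * OUTPUT_RATE

# Lengthening the prompt from ~421 to ~1212 input tokens while the
# output stays ~constant roughly triples the recurring spend.
print(cost_per_call(421, 15))
print(cost_per_call(1212, 15))
```

Multiply the difference by millions of calls per month and the prompt's word count becomes a line item, not a style choice.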


Implications — What this means for real systems

1. Prompt engineering is now an economic function

Not a creative one. Not an experimental one.

An economic one.

Teams should treat prompts as:

  • Cost centers
  • Optimization surfaces
  • Versioned assets
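In practice, treating a prompt this way can be as simple as pinning each variant with an id and its measured cost, so changes surface in code review like any other asset. A minimal sketch, with all names and figures taken from the tables above or assumed:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    """A prompt as a versioned asset: immutable, diffable, and tagged
    with its measured cost so regressions are visible in review."""
    version: str
    template: str
    avg_input_tokens: int
    cost_per_call_usd: float

V1 = PromptVersion("v1-baseline", "Classify the item...", 421, 0.00395)
V3 = PromptVersion("v3-rules", "Classify the item...\nRules: ...", 979, 0.00874)

# A cost gate a CI check might enforce before promoting a new variant.
MAX_COST_PER_CALL = 0.01
assert V3.cost_per_call_usd <= MAX_COST_PER_CALL
```

The `frozen=True` flag makes each version immutable: a change to the prompt forces a new version, which is exactly the discipline the list above calls for.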

2. “Better models” are often a lazy substitute for design

Switching from Mistral to Claude improves performance.

But refining categories and adding rules improves it more efficiently.

This is the difference between:

  • Buying capability
  • Engineering capability

3. Schema-first design quietly curbs hallucination

The study’s most underrated insight:

Constraining outputs to a fixed schema:

  • Eliminates format errors
  • Reduces hallucinations
  • Improves consistency

In other words:

Most hallucinations are not intelligence failures. They are interface failures.


4. The real trade-off is not accuracy vs cost

It is:

Strategy        | Outcome
Bigger model    | Higher cost, marginal gain
Better prompt   | Moderate cost, meaningful gain
Better taxonomy | Structural improvement

The winning combination in this study:

Claude 3.7 + refined categories + rules (no few-shot)

Not the newest model. Not the most complex setup.

Just the most disciplined one.


Conclusion — The quiet shift from AI to systems thinking

This paper is not really about receipt categorisation.

It is about a broader transition:

From model-centric AI to system-centric AI.

The industry is slowly realizing:

  • Models provide capability
  • Prompts shape behavior
  • Schemas enforce discipline
  • Costs determine viability

And somewhere between those layers, the real product emerges.

The next wave of AI advantage will not come from larger models.

It will come from teams who understand that:

A well-designed prompt is a cheaper model upgrade.

And sometimes, the difference between a prototype and a product…

…is exactly $0.004.


Cognaptus: Automate the Present, Incubate the Future.