Opening — Why this matters now
Most AI teams are still asking the wrong question: Which model should we use?
The more uncomfortable—and far more expensive—question is: How much are you paying for each correct answer?
In production environments, especially those involving structured classification tasks, performance is no longer judged by benchmark scores alone. It is judged by accuracy per dollar, per call, per decision.
A recent case study on receipt-item categorisation using AWS Bedrock quietly exposes something the industry prefers not to say out loud: prompt design can outperform model upgrades in both accuracy and ROI.
And yes, the difference can be as small—and as decisive—as $0.004 per call.
Background — From model obsession to system design
Text classification is not new. What is new is the assumption that LLMs can replace traditional pipelines with minimal effort.
Historically, classification systems relied on:
- Feature engineering (painful)
- Fine-tuned models (expensive)
- Rule-based systems (fragile but predictable)
LLMs promised to dissolve this trade-off.
In reality, they reintroduced it—just at a different layer.
Instead of tuning weights, we now tune prompts, schemas, and category definitions.
The paper’s setup is deceptively simple:
- Task: classify receipt items into predefined expense categories
- Models: Claude 3.7, Claude 4, Mixtral 8x7B, Mistral 7B
- Platform: AWS Bedrock (standardized API layer)
- Dataset: 389 manually labeled receipt items
The ambition is not academic elegance. It is operational clarity: Which setup actually works in production?
Analysis — What the paper actually does
The study runs in two phases.
Phase 1 — Model selection under controlled prompts
All models are tested under identical, schema-first prompts:
- Fixed category list
- JSON output constraint
- Zero-shot (no examples)
The goal is fairness. The result is clarity.
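A schema-first, zero-shot setup of this kind can be sketched as follows. The category names and prompt wording here are illustrative, not the paper's exact taxonomy or template; the point is the pattern: a fixed label set, a JSON-only output constraint, and validation against the schema.

```python
import json

# Illustrative category list -- the study uses its own expense taxonomy.
CATEGORIES = ["Groceries", "Dining", "Transport", "Office Supplies", "Other"]

def build_prompt(item: str) -> str:
    """Zero-shot, schema-first prompt: fixed categories, JSON-only output."""
    return (
        "Classify the receipt item into exactly one category.\n"
        f"Allowed categories: {', '.join(CATEGORIES)}\n"
        'Respond with JSON only: {"category": "<one of the allowed categories>"}\n'
        f"Item: {item}"
    )

def parse_response(raw: str) -> str:
    """Reject any output that escapes the fixed schema."""
    category = json.loads(raw)["category"]
    if category not in CATEGORIES:
        raise ValueError(f"Out-of-schema category: {category}")
    return category
```

Because the prompt is model-agnostic text, the same string can be sent to each model through Bedrock's standardized API layer (e.g. boto3's `bedrock-runtime` client), which is what makes the cross-model comparison fair.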
Phase 2 — Prompt engineering as optimization
Once the best model is identified, the real work begins.
Four prompt variants are tested:
| Variant | Categories | Rules | Few-shot | Intent |
|---|---|---|---|---|
| V1 | Baseline | ✗ | ✗ | Minimal cost baseline |
| V2 | Refined | ✗ | ✗ | Better taxonomy |
| V3 | Refined | ✓ | ✗ | Rule-guided disambiguation |
| V4 | Refined | ✓ | ✓ | Example-driven guidance |
This is not prompt tinkering. It is system design disguised as prompting.
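The four variants can be expressed as versioned prompt configurations rather than four hand-edited strings. The rule and example fragments below are placeholders, not the paper's actual content; what matters is that each variant is a declarative combination of taxonomy, rules, and examples.

```python
from dataclasses import dataclass

# Placeholder fragments -- the study's real rules and examples are not reproduced here.
RULES = "If an item is a cafe drink, prefer Dining over Groceries."
EXAMPLES = 'Item: bus ticket -> {"category": "Transport"}'

@dataclass
class PromptVariant:
    name: str
    categories: list
    rules: str = ""
    examples: str = ""

    def render(self, item: str) -> str:
        parts = [f"Allowed categories: {', '.join(self.categories)}"]
        if self.rules:
            parts.append("Rules:\n" + self.rules)
        if self.examples:
            parts.append("Examples:\n" + self.examples)
        parts.append('Respond with JSON only: {"category": "..."}')
        parts.append(f"Item: {item}")
        return "\n".join(parts)

BASELINE = ["Food", "Travel", "Other"]                    # V1: coarse taxonomy
REFINED = ["Groceries", "Dining", "Transport", "Other"]   # V2-V4: refined taxonomy

VARIANTS = {
    "V1": PromptVariant("V1", BASELINE),
    "V2": PromptVariant("V2", REFINED),
    "V3": PromptVariant("V3", REFINED, rules=RULES),
    "V4": PromptVariant("V4", REFINED, rules=RULES, examples=EXAMPLES),
}
```

Treating variants as data also makes the later cost comparison mechanical: each variant's rendered length maps directly to input tokens, and therefore to cost per call.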
Findings — The uncomfortable economics of LLMs
1. Model choice matters. But not as much as you think.
| Model | Accuracy | Balanced Accuracy | F1 | Observation |
|---|---|---|---|---|
| Claude 3.7 | 0.902 | 0.773 | 0.905 | Best overall trade-off |
| Claude 4 | 0.848 | 0.748 | 0.851 | Slower, no clear gain |
| Mixtral 8x7B | 0.694 | 0.608 | 0.696 | Faster but unstable |
| Mistral 7B | 0.596 | 0.492 | 0.600 | Cheapest, least reliable |
The gap between proprietary and open-weight models is not subtle. It is structural.
But the real story begins after this.
2. Prompt design delivers larger gains than model upgrades
| Variant | Accuracy | Balanced Accuracy | Cost Impact |
|---|---|---|---|
| V1 (Baseline) | ~90.7% | ~81.4% | 1x |
| V2 (Better categories) | ~90.7% | ↑ | ~1.05x |
| V3 (Rules) | 93.3% | ↑↑ | ~2x |
| V4 (Few-shot) | 92.5% | mixed | ~2.7x (highest) |
The conclusion is almost offensive in its simplicity:
Adding rules beats adding examples.
Few-shot prompting—often treated as a default upgrade—adds cost without improving outcomes.
Rules, on the other hand, reshape the decision boundary.
3. Accuracy is not binary—it is contextual
The study introduces an important distinction:
- Strict accuracy: exact label match
- Lenient accuracy: acceptable alternative categories allowed
Under lenient evaluation:
- Accuracy rises to ~95%
- Only ~5% of outputs are truly wrong
This reframes the problem entirely.
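The two metrics differ only in whether a hypothetical map of acceptable alternative labels is consulted. A minimal sketch, using synthetic labels rather than the paper's data:

```python
def strict_accuracy(gold, pred):
    """Exact label match only."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def lenient_accuracy(gold, pred, acceptable):
    """Exact match, or a prediction listed as an acceptable alternative."""
    hits = sum(g == p or p in acceptable.get(g, set())
               for g, p in zip(gold, pred))
    return hits / len(gold)

# Synthetic example -- not the study's dataset.
gold = ["Dining", "Transport", "Groceries", "Office"]
pred = ["Dining", "Travel",    "Dining",    "Office"]
acceptable = {"Transport": {"Travel"}}  # ambiguous pair counted as acceptable

print(strict_accuracy(gold, pred))               # 0.5
print(lenient_accuracy(gold, pred, acceptable))  # 0.75
```

The gap between the two numbers is exactly the share of predictions that are ambiguous rather than wrong, which is the distinction the study is drawing.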
LLMs are not “wrong” as often as metrics suggest. They are ambiguous in human ways.
4. Cost is driven by input tokens, not output
| Variant | Input Tokens | Cost per Call |
|---|---|---|
| V1 | ~421 | ~$0.00395 |
| V3 | ~979 | ~$0.00874 |
| V4 | ~1212 | ~$0.01067 |
Output tokens barely change across variants: the response is always a short JSON object.
Which means:
Every extra word in your prompt is a recurring expense.
At scale, this is not a technical detail. It is a budget line.
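The cost structure is easy to model. The per-1K-token rates below are placeholders (check current Bedrock pricing for real values), so the absolute dollar figures will not match the table above, but the shape does: with a short, near-constant output, the prompt length drives the marginal cost of every call.

```python
# Hypothetical per-1K-token rates -- not actual Bedrock pricing.
INPUT_RATE = 0.003   # $ per 1K input tokens
OUTPUT_RATE = 0.015  # $ per 1K output tokens

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1000 * INPUT_RATE + output_tokens / 1000 * OUTPUT_RATE

# Output is a short JSON object of roughly constant length, so added
# prompt tokens (rules, examples) recur as cost on every single call.
out_tok = 20
for name, in_tok in [("V1", 421), ("V3", 979), ("V4", 1212)]:
    print(name, round(cost_per_call(in_tok, out_tok), 5))
```

Multiplying the per-call delta by daily call volume turns a prompt-length decision into a budget line, which is the paper's point.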
Implications — What this means for real systems
1. Prompt engineering is now an economic function
Not a creative one. Not an experimental one.
An economic one.
Teams should treat prompts as:
- Cost centers
- Optimization surfaces
- Versioned assets
2. “Better models” are often a lazy substitute for design
Switching from Mistral to Claude improves performance.
But refining categories and adding rules improves it more efficiently.
This is the difference between:
- Buying capability
- Engineering capability
3. Schema-first design quietly solves hallucination
The study’s most underrated insight:
Constraining outputs to a fixed schema:
- Eliminates format errors
- Reduces hallucinations
- Improves consistency
In other words:
Most hallucinations are not intelligence failures. They are interface failures.
4. The real trade-off is not accuracy vs cost
It is:
| Strategy | Outcome |
|---|---|
| Bigger model | Higher cost, marginal gain |
| Better prompt | Moderate cost, meaningful gain |
| Better taxonomy | Structural improvement |
The winning combination in this study:
Claude 3.7 + refined categories + rules (no few-shot)
Not the newest model. Not the most complex setup.
Just the most disciplined one.
Conclusion — The quiet shift from AI to systems thinking
This paper is not really about receipt categorisation.
It is about a broader transition:
From model-centric AI to system-centric AI.
The industry is slowly realizing:
- Models provide capability
- Prompts shape behavior
- Schemas enforce discipline
- Costs determine viability
And somewhere between those layers, the real product emerges.
The next wave of AI advantage will not come from larger models.
It will come from teams who understand that:
A well-designed prompt is a cheaper model upgrade.
And sometimes, the difference between a prototype and a product…
…is exactly $0.004.
Cognaptus: Automate the Present, Incubate the Future.