Receipts are not glamorous. That is precisely why they are useful.
A receipt-item categoriser is not a benchmark leaderboard, a launch demo, or a dramatic agentic workflow with a glowing dashboard. It is the kind of small, repetitive business decision that quietly determines whether an AI system becomes a product or remains an expensive toy. A bottle of iced coffee needs a category. A supermarket item needs to land in the right expense bucket. The output must be parseable. The cost must be low enough to repeat thousands or millions of times. Nobody wants a philosophical essay from the model. They want a JSON array.
That makes the case study in “Analysis of LLM Performance on AWS Bedrock: Receipt-item Categorisation Case Study” more useful than its modest title suggests [1]. The paper compares Claude 3.7 Sonnet, Claude 4 Sonnet, Mixtral 8x7B Instruct, and Mistral 7B Instruct on AWS Bedrock for classifying receipt items into predefined expense categories. It then takes the best model from the first round and tests whether prompt and taxonomy refinements can improve the result.
The headline is not “Claude wins.” That would be too easy, and therefore suspiciously close to a procurement memo.
The more useful finding is narrower and more operational: in this task, Claude 3.7 Sonnet gives the strongest accuracy-consistency-cost trade-off, but the larger lesson is that category design, schema discipline, and explicit disambiguation rules matter as much as, and sometimes more than, upgrading the model. The system gets better not because the model becomes magically smarter, but because the task becomes less sloppy.
The study is really a comparison of decisions, not just models
The paper has two phases, and the distinction matters.
Phase 1 is the model-selection phase. The authors evaluate four AWS Bedrock models on the same dataset of 389 manually labelled receipt items, using schema-first zero-shot prompts and a fixed category list. This phase supplies the paper’s main evidence on one question: which model produces the best combination of classification accuracy, output stability, runtime, and cost behaviour under controlled conditions?
Phase 2 is the system-design phase. After Claude 3.7 Sonnet performs best in Phase 1, the authors keep that model and compare four prompt configurations: the baseline category set, a refined category set, the refined set plus explicit rules, and the refined set plus rules plus few-shot examples. This is closer to an ablation study than a second model benchmark. It asks which part of the prompt system actually improves performance: the taxonomy, the rules, or the examples.
That difference is important for business readers. Many teams treat model selection as the main decision and prompt work as implementation detail. The paper reverses the weight of those two activities. Model choice gets the system into a viable range. Prompt and taxonomy design determine whether it is economically useful.
| Decision layer | What the paper tests | Likely purpose | Business meaning |
|---|---|---|---|
| Model family | Claude 3.7, Claude 4, Mixtral 8x7B, Mistral 7B | Main evidence | Choose a viable baseline model for structured classification |
| Category taxonomy | Original categories vs refined categories | Ablation / design test | Reduce ambiguity before blaming the model |
| Disambiguation rules | Explicit boundaries for overlapping categories | Ablation / implementation test | Turn vague business logic into reusable prompt logic |
| Few-shot examples | Rules plus examples | Sensitivity test of a common prompting habit | Check whether examples justify extra token cost |
| Strict vs lenient scoring | Exact match vs acceptable alternative | Robustness / ambiguity test | Separate true errors from defensible category choices |
| Runtime and token use | Latency and estimated cost per call | Operational constraint | Estimate whether accuracy gains survive production economics |
This is why a comparison-based reading is better than a linear summary. The paper is not one argument moving from introduction to conclusion. It is a decision matrix hiding inside an experiment.
Claude 3.7 wins Phase 1, but not because “newer is better”
The first comparison is straightforward. Claude 3.7 Sonnet achieves the best overall result among the four models in Phase 1.
| Model | Precision | Recall | F1 | Accuracy | Balanced accuracy |
|---|---|---|---|---|---|
| Claude 3.7 Sonnet | 0.907 | 0.902 | 0.905 | 0.902 | 0.773 |
| Claude 4 Sonnet | 0.853 | 0.848 | 0.851 | 0.848 | 0.748 |
| Mixtral 8x7B | 0.698 | 0.694 | 0.696 | 0.694 | 0.608 |
| Mistral 7B | 0.604 | 0.596 | 0.600 | 0.596 | 0.492 |
The obvious temptation is to convert this into a model ranking: Claude 3.7 good, Claude 4 surprisingly less useful, open-weight models faster but weaker. That is mostly correct, but incomplete.
The more interesting point is that Claude 3.7 does not merely classify more accurately. It also obeys the output contract better. For a production classification pipeline, this distinction is not cosmetic. If the model returns one label for each detected item, downstream systems can continue. If it returns too many labels, too few labels, or invented categories, the pipeline now needs repair logic, human review, or silent failure. Silent failure is, of course, the enterprise version of stepping on a rake.
The paper reports perfect array-length consistency for both Claude models: a 0.00% mismatch rate. Mixtral 8x7B has an approximate mismatch rate of 1.9%, while Mistral 7B reaches approximately 9.3%. That may look small if one thinks like a benchmark reader. It looks less small if one thinks like the person maintaining an expense automation system where every malformed row becomes an exception queue.
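To make the output contract concrete, here is a minimal sketch of the kind of check a pipeline could run on every response before trusting it. The category subset, the function names, and the sample responses are all illustrative assumptions; the paper does not publish its validation code.

```python
import json

# Hypothetical subset of the category list, for illustration only.
ALLOWED = {"Pantry & Snacks", "Beverages", "Coffee & Tea", "Eating Out", "Other"}


def check_output(raw: str, n_items: int, allowed: set = ALLOWED) -> str:
    """Return 'ok' or the name of the first contract violation found."""
    try:
        labels = json.loads(raw)
    except json.JSONDecodeError:
        return "not_json"
    if not isinstance(labels, list):
        return "not_array"
    if len(labels) != n_items:
        return "length_mismatch"
    # Reject non-string entries and labels outside the fixed category list.
    if any(not isinstance(label, str) or label not in allowed for label in labels):
        return "unknown_category"
    return "ok"


def mismatch_rate(results: list) -> float:
    """Share of responses whose array length did not match the item count."""
    if not results:
        return 0.0
    return sum(r == "length_mismatch" for r in results) / len(results)


# Made-up responses, each paired with the number of items on the receipt.
responses = [
    ('["Beverages", "Pantry & Snacks"]', 2),
    ('["Beverages"]', 2),
    ('["Beverages", "Groceries"]', 2),
    ("Sure! Here are the categories: ...", 2),
]
outcomes = [check_output(raw, n) for raw, n in responses]
print(outcomes)
print(mismatch_rate(outcomes))
```

Anything other than `ok` is a row that needs repair logic or a human. That is why a mismatch rate of a few percent matters more than it looks on a leaderboard.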
The runtime comparison adds another layer. Mixtral is the fastest on mean latency, at 431 ms. Claude 3.7 averages 1265 ms, Claude 4 averages 1433 ms, and Mistral 7B averages 912 ms but has an extreme maximum latency of 15351 ms. So the fastest option is not the most reliable, and the newest Claude model is not the best trade-off. For this task, Claude 3.7 sits in the operational middle: slower than Mixtral, but steadier and much more accurate.
That is the first useful correction to the common buyer instinct. Do not ask only which model is strongest. Ask which model fails in the least expensive way.
The $0.004 baseline is cheap only if the answer is usable
The paper’s cost analysis is available mainly for the Claude models, because those models report token counts. In Phase 1, Claude 3.7 averages 421 input tokens and 24 output tokens per receipt classification, with an estimated total cost of $0.003950 per call. Claude 4 is almost identical at $0.003960 per call. In round numbers, that is about 250 calls per dollar.
This is the source of the article title. A $0.004 decision sounds almost free. It is not.
At small scale, it is trivial. At production scale, it becomes a repeating unit of operational economics. One million calls at the baseline cost is roughly $3,950 before considering surrounding extraction, storage, orchestration, monitoring, exception handling, and human review. The model call is only one invoice line, but it is the line most teams can accidentally double by adding a friendly paragraph of “helpful” instructions to the prompt.
The paper makes this visible in Phase 2. Output tokens remain nearly stable across prompt variants, around 24–27 output tokens. The cost changes mostly because input tokens grow.
| Claude 3.7 variant | Prompt design | Avg. input tokens | Avg. output tokens | Avg. total cost/call | Calls per $1 |
|---|---|---|---|---|---|
| Variant 1 | Baseline categories, zero-shot | 421 | 24 | $0.003950 | 253 |
| Variant 2 | Refined categories, zero-shot | 433 | 26 | $0.004160 | 240 |
| Variant 3 | Refined categories + rules, zero-shot | 979 | 26.1 | $0.008740 | 114 |
| Variant 4 | Refined categories + rules + few-shot | 1212 | 26.5 | $0.010670 | 94 |
The practical lesson is blunt: prompt engineering is not free engineering. It is recurring-cost engineering.
But that does not mean shorter prompts are always better. It means every extra token needs a job. Variant 3 more than doubles the baseline cost, but it also produces the strongest strict accuracy in Phase 2. Variant 4 costs even more and does not improve strict performance. That is the difference between paying for decision logic and paying for decorative examples.
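The arithmetic behind that table is simple enough to keep in a script next to the prompt itself. The sketch below uses only the per-call costs reported above; the one-million-call volume and the function name are assumptions for illustration.

```python
# Average total cost per call (USD) for each Claude 3.7 prompt variant,
# copied from the table above.
COST_PER_CALL = {
    "Variant 1": 0.003950,
    "Variant 2": 0.004160,
    "Variant 3": 0.008740,
    "Variant 4": 0.010670,
}


def cost_summary(costs: dict, baseline: str, calls: int = 1_000_000) -> dict:
    """Calls per dollar, multiple of baseline cost, and spend at a given volume."""
    base = costs[baseline]
    return {
        name: {
            "calls_per_dollar": round(1 / cost),
            "multiple_of_baseline": round(cost / base, 2),
            "cost_at_volume": round(cost * calls, 2),
        }
        for name, cost in costs.items()
    }


summary = cost_summary(COST_PER_CALL, baseline="Variant 1")
for name, row in summary.items():
    print(name, row)
```

On these figures Variant 3 costs about 2.2 times the baseline and Variant 4 about 2.7 times, which at one million calls is the difference between roughly $3,950 and $8,740 or $10,670 for the model call alone.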
Better categories do part of the work before the model starts thinking
Phase 2 begins with a deceptively simple intervention: revise the category set.
The original Phase 1 taxonomy contains 26 categories, including labels such as Fresh Produce, Pantry & Snacks, Coffee & Tea, Beverages, Eating Out, Home & Cleaning, and Other. The revised Phase 2 taxonomy expands to 27 categories and adjusts boundaries. For example, Fresh Produce becomes Fruits & Vegetables; Meat & Seafood becomes Meat & Seafood & Deli; Dairy & Eggs becomes Dairy & Eggs & Fridge; Beverages becomes Drinks; Home & Cleaning is split into Cleaning & Maintenance and Home & Lifestyle.
This is not merely renaming. It is business semantics being translated into machine-usable boundaries.
The paper notes several ambiguous items that motivated refinement: frozen dumplings, hummus, deli meats, and iced coffee drinks. These are exactly the kinds of items that make receipt categorisation annoying. A human can often defend two labels. A model forced to return one label will be punished if the taxonomy does not explain which interpretation the business prefers.
This is where many LLM implementations go wrong. Teams assume the model is confused. Sometimes it is. But often the business process itself is under-specified. The model exposes the ambiguity; it does not create it.
Variant 2, which uses the refined categories without extra rules or examples, gives only a modest improvement in overall strict metrics. That should not be read as failure. Taxonomy refinement often produces its value unevenly: it improves some boundaries, reveals other weak boundaries, and creates a cleaner base for rules. In the paper’s terms, refined categories are not the whole solution. They are the floor on which the solution stands.
Rules beat examples because this is a boundary problem
The strongest strict result comes from Variant 3: refined categories plus explicit rules and disambiguation heuristics, still zero-shot.
Under strict evaluation, Variant 3 reaches 93.30% overall accuracy, compared with 90.70% for the baseline. Its overall precision, recall, and F1 are around 93%. Under lenient evaluation, Variant 3 reaches 95.60% accuracy, with overall precision of 95.40% and F1 of 95.50%.
| Variant | Strict accuracy | Strict F1 | Lenient accuracy | Lenient F1 | Interpretation |
|---|---|---|---|---|---|
| Variant 1 | 90.70% | 90.50% | Not reported in same table | Not reported in same table | Strong baseline, cheap |
| Variant 2 | 90.70% | 90.60% | 91.80% | 91.60% | Better taxonomy, modest gain |
| Variant 3 | 93.30% | 93.20% | 95.60% | 95.50% | Best strict result; best overall practical balance if accuracy matters |
| Variant 4 | 92.50% | 92.40% | 95.40% | 95.20% | Few-shot adds cost without strict gain |
The reason is not mysterious. Receipt-item classification is a boundary problem. The hard cases are not random strings. They are adjacent categories: pantry versus fridge, drink versus coffee, eating out versus packaged food, personal care versus household cleaning. Examples can help a model infer a pattern, but explicit rules tell it which boundary the business wants.
A rule such as “If an item is refrigerated or perishable, classify it under dairy_eggs_fridge rather than pantry_snacks” is not just prompt text. It is a compressed business policy. It resolves a recurring ambiguity before the model has to improvise.
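One way to picture a rule as compressed policy is to treat rules as data that the prompt is assembled from. The sketch below is an assumption about how such a prompt might be built; the wording, the function name, and the three-category list are hypothetical and are not the paper's actual prompt. Only the sample rule is taken from the text above.

```python
import json


def build_prompt(items: list, categories: list, rules: list) -> str:
    """Assemble a zero-shot, schema-first classification prompt (hypothetical wording)."""
    lines = [
        "Classify each receipt item into exactly one category.",
        "Allowed categories: " + json.dumps(categories),
    ]
    if rules:
        lines.append("Disambiguation rules:")
        lines.extend(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    lines.append(
        f"Return only a JSON array of {len(items)} category strings, in item order."
    )
    lines.append("Items: " + json.dumps(items))
    return "\n".join(lines)


# Illustrative inputs; the category ids follow the snake_case style quoted above.
CATEGORIES = ["dairy_eggs_fridge", "pantry_snacks", "drinks"]
RULES = [
    "If an item is refrigerated or perishable, classify it under "
    "dairy_eggs_fridge rather than pantry_snacks."
]
ITEMS = ["hummus 200g", "iced coffee 500ml"]

with_rules = build_prompt(ITEMS, CATEGORIES, RULES)
without_rules = build_prompt(ITEMS, CATEGORIES, [])
print(with_rules)
print(len(with_rules) - len(without_rules), "extra characters sent on every call")
```

Keeping the rules in a list makes two things visible at once: the policy is reviewable by someone who does not write prompts, and every rule added is extra input sent on every call.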
Few-shot examples, by contrast, are heavier. They consume input tokens every time the prompt runs. In this experiment, they do not outperform the rules-only variant under strict scoring. Variant 4 slightly improves balanced metrics under lenient evaluation, but it costs the most and does not beat Variant 3 on overall strict or lenient accuracy.
That does not prove few-shot prompting is useless. It proves something narrower and more actionable: for this task, with this taxonomy, this dataset, and this model, examples are not the first lever to pull. Define the categories. Add the rules. Then ask whether examples still earn their rent.
Strict accuracy is useful, but ambiguity needs its own accounting
The paper’s strict-versus-lenient evaluation is one of its more important design choices.
Strict evaluation counts only exact category matches as correct. This is necessary for a production system because the database wants one value, not a seminar. But strict scoring can overstate practical error when the receipt item is genuinely ambiguous. The authors therefore add a lenient evaluation in Phase 2, where a prediction can be counted as correct if it matches either the primary category or a valid alternative.
This is best understood as a robustness or sensitivity test. It does not replace strict accuracy. It explains how much of the apparent error comes from ambiguous labelling rather than useless classification.
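The two scoring modes differ by one condition, which a short sketch makes plain. The data structure (a primary label plus a set of accepted alternatives per item) and the sample items are assumptions for illustration; the paper describes the idea but not this code.

```python
def score(predictions: list, gold: list) -> tuple:
    """Return (strict_accuracy, lenient_accuracy).

    gold holds one (primary_label, accepted_alternatives) pair per item.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    strict = sum(pred == primary for pred, (primary, _) in zip(predictions, gold))
    lenient = sum(
        pred == primary or pred in alternatives
        for pred, (primary, alternatives) in zip(predictions, gold)
    )
    n = len(gold)
    return strict / n, lenient / n


# Made-up items: the third one is ambiguous and has an accepted alternative.
gold = [
    ("drinks", set()),
    ("pantry_snacks", set()),
    ("drinks", {"dairy_eggs_fridge"}),
    ("frozen", set()),
]
predictions = ["drinks", "pantry_snacks", "dairy_eggs_fridge", "eating_out"]

strict_acc, lenient_acc = score(predictions, gold)
print(strict_acc, lenient_acc)
```

The gap between the two numbers is the part of the error that a policy decision could remove. Whatever remains after lenient scoring is the part the model actually got wrong.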
The paper estimates that around 20 of the 389 items are inherently ambiguous. That is about 5% of the dataset. Under lenient evaluation, Variant 3 reaches approximately 95.6% accuracy. In business terms, this suggests that some residual “error” may be a product-design question rather than a model-quality question.
Consider iced coffee. If the business sees it as a drink, “drinks” is acceptable. If the business treats refrigerated dairy-based drinks differently, “dairy_eggs_fridge” may also be defensible. The correct answer depends on accounting logic, nutrition logic, inventory logic, or user expectation. The model cannot infer that policy from the laws of nature. Someone has to decide.
This is why the paper’s evaluation design matters beyond receipts. When deploying LLMs for classification, organisations need at least three error buckets:
| Error bucket | Meaning | Operational response |
|---|---|---|
| True misclassification | The output is clearly wrong | Improve prompt, model, training data, or review process |
| Ambiguous but acceptable | The output differs from the primary label but is defensible | Clarify business policy or allow accepted alternatives |
| Schema or format failure | The output cannot be safely parsed or aligned | Strengthen schema constraints or reject model configuration |
Without this separation, teams overreact to ambiguity and underreact to malformed output. That is a bad trade. Ambiguity can often be governed. Malformed output breaks automation.
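A triage step along these lines can be sketched in a few lines. The bucket names mirror the table above, with a fourth outcome for exact matches; the function, the gold-label structure, and the sample responses are hypothetical.

```python
import json
from collections import Counter


def triage(raw: str, gold: list) -> list:
    """Sort one model response into outcome buckets, one per item.

    gold holds one (primary_label, accepted_alternatives) pair per item.
    A response that cannot be parsed or aligned yields a single schema failure.
    """
    try:
        labels = json.loads(raw)
    except json.JSONDecodeError:
        return ["schema_failure"]
    if not isinstance(labels, list) or len(labels) != len(gold):
        return ["schema_failure"]
    outcomes = []
    for label, (primary, alternatives) in zip(labels, gold):
        if label == primary:
            outcomes.append("correct")
        elif isinstance(label, str) and label in alternatives:
            outcomes.append("ambiguous_acceptable")
        else:
            outcomes.append("true_misclassification")
    return outcomes


# Made-up receipt with two items and three made-up model responses.
gold = [("drinks", {"dairy_eggs_fridge"}), ("pantry_snacks", set())]
responses = [
    '["dairy_eggs_fridge", "pantry_snacks"]',
    '["drinks", "cleaning_maintenance"]',
    '["drinks"]',
]

buckets = Counter()
for raw in responses:
    buckets.update(triage(raw, gold))
print(dict(buckets))
```

Each bucket then routes to a different owner: misclassifications to whoever maintains the prompt, ambiguous cases to whoever owns the category policy, and schema failures to whoever owns the pipeline.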
The category-level results warn against celebrating averages too early
The paper reports category-level performance, and this is where the story becomes less flattering but more useful.
In Phase 1, Claude models perform strongly on common categories such as Pantry & Snacks, Eating Out, and Fresh Produce. Open-weight models struggle more visibly, especially in household and personal-care categories. Some low-support categories behave erratically because there are only one or a few examples. A single mistake can turn a category’s apparent performance from perfect to disastrous.
Phase 2 shows similar patterns. Variant 3 performs very well in several categories under lenient evaluation: Fruits & Vegetables at 100%, Dairy & Eggs & Fridge at 100%, Frozen at 100%, Coffee & Tea at 100%, and Eating Out at 100%. But Drinks remains weak, at 56.5% for both Variant 3 and Variant 4. Utilities & Bills stays at 60%. Travel & Holidays remains at 0%, though with only one item.
This is not a reason to dismiss the result. It is a reason to avoid treating one overall accuracy number as a deployment certificate.
For a production team, category-level performance should become a prioritisation tool. High-volume categories with high accuracy can be automated aggressively. High-volume categories with boundary problems need policy work. Low-volume categories need more data before anyone should pretend the percentage means much. A category with one example is not a category benchmark. It is a coin toss wearing a lab coat.
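A per-category report that carries its own support count is one way to keep the lab coat off the coin toss. In the sketch below, the minimum-support threshold of 10 and the toy labels are assumptions, not values from the paper. Balanced accuracy is computed under its common definition, the unweighted mean of per-category recall; whether the paper computes it in exactly this way is not stated here.

```python
from collections import defaultdict

MIN_SUPPORT = 10  # hypothetical cut-off below which a percentage is not trusted


def per_category_report(gold: list, predicted: list, min_support: int = MIN_SUPPORT) -> dict:
    """Per-category support and recall, flagged when support is too low."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    totals = defaultdict(int)
    hits = defaultdict(int)
    for g, p in zip(gold, predicted):
        totals[g] += 1
        if g == p:
            hits[g] += 1
    return {
        category: {
            "support": n,
            "recall": hits[category] / n,
            "enough_data": n >= min_support,
        }
        for category, n in totals.items()
    }


def balanced_accuracy(report: dict) -> float:
    """Unweighted mean of per-category recall."""
    return sum(row["recall"] for row in report.values()) / len(report)


# Toy data: one well-populated category and one category with a single item.
gold = ["drinks"] * 12 + ["travel_holidays"]
predicted = ["drinks"] * 9 + ["coffee_tea"] * 3 + ["eating_out"]

report = per_category_report(gold, predicted)
overall = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(report)
print(round(overall, 3), balanced_accuracy(report))
```

In this toy case a single missed item drags balanced accuracy to 0.375 while overall accuracy sits near 0.69, which is the same effect the one-item Travel & Holidays category has on the paper's averages.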
What businesses should copy from the paper
The paper’s most transferable contribution is not the exact model choice. Claude 3.7 may be the best option in this experiment, but the next team’s domain, pricing contract, latency target, data quality, and compliance requirements may differ.
What is worth copying is the evaluation discipline.
First, start with a schema-first prompt. The model should choose from a fixed category list and return a parseable structure. In this paper, the required output is a JSON array whose length must match the number of detected items. That design turns model output into something a pipeline can check.
Second, evaluate more than overall accuracy. Accuracy, balanced accuracy, precision, recall, F1, category-level performance, output-length consistency, runtime, and token cost each answer a different deployment question. A model can be accurate but expensive, cheap but malformed, fast but unstable, or strong on common classes while failing minority classes.
Third, separate taxonomy work from model work. If categories overlap, the model will inherit the confusion. Refining the category set is not administrative cleanup. It is part of model performance engineering.
Fourth, test rules before examples when the task is boundary-heavy. In this case, explicit disambiguation rules outperform adding few-shot examples under strict evaluation and cost less than the few-shot variant. The lesson is not anti-example. The lesson is pro-causality: use the intervention that matches the error mechanism.
Fifth, price the decision, not the demo. A per-call cost of $0.003950 looks harmless until prompt refinements push it to $0.008740 or $0.010670, and until call volume turns decimals into invoices. Accuracy gains can justify this. Prompt verbosity cannot justify itself by sounding thoughtful.
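Pricing the decision can be made literal by dividing cost by accuracy. The sketch below combines the per-call costs and strict accuracies reported above. It treats each call as one decision, which is a simplifying assumption: costs are reported per call while accuracy is measured per item, so the outputs are rough comparisons between variants, not invoice figures. The function names are hypothetical.

```python
BASELINE = "Variant 1"

# (average cost per call in USD, strict accuracy as a fraction), from the tables above.
VARIANTS = {
    "Variant 1": (0.003950, 0.907),
    "Variant 2": (0.004160, 0.907),
    "Variant 3": (0.008740, 0.933),
    "Variant 4": (0.010670, 0.925),
}


def cost_per_correct(cost: float, accuracy: float) -> float:
    """Spend per strictly correct answer, assuming one decision per call."""
    return cost / accuracy


def marginal_cost_per_extra_correct(name: str, baseline: str = BASELINE):
    """Extra spend per additional correct answer versus the baseline.

    Returns None when the variant does not improve strict accuracy.
    """
    cost, accuracy = VARIANTS[name]
    base_cost, base_accuracy = VARIANTS[baseline]
    gain = accuracy - base_accuracy
    if gain <= 0:
        return None
    return (cost - base_cost) / gain


for name, (cost, accuracy) in VARIANTS.items():
    extra = marginal_cost_per_extra_correct(name)
    print(
        name,
        round(cost_per_correct(cost, accuracy), 5),
        None if extra is None else round(extra, 3),
    )
```

Under that assumption, Variant 3 buys each additional correct answer for roughly 18 cents over the baseline and Variant 4 for roughly 37 cents, while Variant 2 adds cost with no strict gain to divide by. Whether 18 cents is cheap depends on what a wrong category costs downstream.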
What should not be overgeneralised
The boundaries of the study are clear enough to matter.
The dataset contains 389 expense items from Australian receipts. It is useful as a controlled case study, not as a universal receipt benchmark. Several categories have very low or zero support. The authors note that categories such as Baby & Maternity, Entertainment, and Transport & Fuel had no ground-truth examples in the relevant phase, while other categories had only one or a few items. Balanced accuracy helps, but it cannot manufacture evidence where no evidence exists.
The inputs are already text extracted from receipt images. The experiment is not a full end-to-end document AI evaluation from raw receipt image to final category. OCR or extraction quality is upstream of the classification task. A business deployment would still need to test the full pipeline, including image quality, vendor variation, layout issues, and extraction errors.
The receipt sources are also limited. The paper says the receipts were collected by the authors, mostly as photographed physical receipts and partly as electronic screenshots. No PDF receipts were included. This matters for businesses that process emailed invoices, PDF receipts, international formats, or enterprise expense exports.
Cost reporting is complete mainly for the Claude models because token accounting was available there. The open-weight models are expected to be cheaper, but the paper does not quantify their Bedrock cost in the same way. For procurement, that means the Claude cost comparison is more precise than the open-weight cost comparison.
Finally, the paper is an early prototype-oriented study, not a statistical endpoint. Its value is in the disciplined comparison, the operational metrics, and the prompt-design lesson. It should guide a pilot design. It should not be pasted into a board deck as proof that one model is globally best at finance automation. Please do not make the receipt paper carry the burden of civilisation.
The real upgrade is less glamorous than the model upgrade
The article title says “when prompt engineering beats model upgrades,” but the more accurate phrase is probably “when system design beats model shopping.”
The paper shows three layers of improvement. Claude 3.7 gives the strongest base model result. Refined categories reduce avoidable confusion. Explicit rules turn recurring ambiguity into operational policy. Few-shot examples, despite their popularity, do not justify their added cost in this setup.
For business teams, that sequence matters.
Do not begin by asking whether the newest model can solve the problem. Begin by asking whether the task has been made solvable: fixed schema, clear categories, explicit boundary rules, measurable ambiguity, and cost per usable decision. Once those are in place, model comparison becomes meaningful. Before that, it is just expensive astrology with API keys.
The useful conclusion is not that every company should use Claude 3.7 for receipt categorisation. The useful conclusion is that LLM ROI often lives in the unglamorous parts of the system: the taxonomy spreadsheet, the prompt contract, the evaluator script, the exception policy, and the cost table.
A better model may buy capability. A better prompt system buys reliability. In production, reliability is usually the one that sends the invoice.
**Cognaptus: Automate the Present, Incubate the Future.**
---
[1] Gabby Sanchez, Sneha Oommen, Cassandra T. Britto, Di Wang, Jung-De Chiou, and Maria Spichkova, “Analysis of LLM Performance on AWS Bedrock: Receipt-item Categorisation Case Study,” arXiv:2604.01615, 2026. https://arxiv.org/abs/2604.01615