Expense Categorization with LLMs
Manual expense coding is slow for a reason finance teams know well: the same merchant can map to different accounts depending on business purpose, entity, tax treatment, and supporting evidence. An LLM can help interpret messy descriptions and normalize intake, but it should not be treated as the final accounting authority.
Why This Matters
Expense categorization is a high-volume workflow where small mistakes compound. Misclassified travel costs, software subscriptions, client entertainment, or capitalizable purchases can distort management reporting, tax treatment, departmental budgets, and audit support. The point of AI here is not to let a model “do the books.” It is to reduce repetitive coding work while preserving the control logic that finance already depends on.
What AI Is Good At Here
LLMs are useful when the source material is language-heavy or inconsistent: receipt text, merchant names, employee descriptions, memo fields, and invoice notes. They can infer likely categories, extract supporting fields, and explain why a category may fit.
They are weak when the answer depends on accounting policy edge cases, entity-specific treatment, capitalization thresholds, tax rules, or incomplete supporting evidence. In those cases, the model should suggest and flag, not decide.
Before-and-After Workflow in Prose
Before AI: employees submit expenses with vague descriptions; AP or accounting staff read receipts manually, infer purpose, map to the chart of accounts, correct common miscoding patterns, and chase missing support. Senior reviewers then spend time rechecking routine classifications because low-level intake quality is inconsistent.
After AI: the system extracts merchant, date, amount, currency, tax, employee, memo text, and likely expense type; it proposes a chart-of-accounts code and a brief rationale; deterministic policy rules screen for prohibited categories, missing receipts, spending-cap breaches, or entity-specific tax issues; high-confidence, low-risk items move into a standard review stream, while ambiguous or material items go to an exception queue for finance approval.
Control Objective
The control objective is simple: let AI assist with interpretation and preparation, but keep policy ownership and final posting authority with finance.
Control Matrix
| Workflow Step | AI May Suggest | Human Must Approve | Key Control |
|---|---|---|---|
| Receipt and memo extraction | Merchant, amount, date, currency, tax, purpose hints | Only when extraction is incomplete or conflicts with source | Source document retained and viewable |
| Initial category proposal | Likely expense category and account code candidate | Final category for ambiguous, policy-sensitive, or material items | Approved category list only; no freeform categories |
| Policy checks | Missing receipt, weekend spend, cap exceedance, duplicate clues | Exception disposition | Policy rule log |
| Posting recommendation | Routing to standard or exception queue | Final posting / reimbursement approval | Segregation of duties |
| Learning from corrections | Pattern suggestions for future mapping | Rule or prompt changes | Change control and version history |
What AI May Suggest vs What Humans Must Approve
AI may suggest
- merchant normalization
- likely expense type
- likely chart-of-accounts mapping
- supporting rationale
- confidence score or review band
- duplicate or unusual-spend flags
Humans must approve
- final category on ambiguous merchants
- capitalization vs expense treatment
- tax-sensitive treatment
- policy exceptions
- employee reimbursement approval
- any posting above the materiality threshold
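One way to keep this split enforceable in code is to store AI suggestions and human approvals in separate fields, so posting can only ever read the approved side. This is a minimal sketch; the field names and `ExpenseClaim` type are illustrative, not a real schema.

```python
# Sketch: AI suggestions and human approvals live in separate fields,
# and posting reads only the human-approved side. Names are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExpenseClaim:
    merchant_raw: str
    amount: float
    # AI may populate these (suggestions only)
    suggested_category: Optional[str] = None
    suggested_rationale: Optional[str] = None
    confidence_band: Optional[str] = None      # e.g. "high" / "medium" / "low"
    flags: list = field(default_factory=list)  # e.g. ["possible_duplicate"]
    # Only a human reviewer may populate these
    approved_category: Optional[str] = None
    approved_by: Optional[str] = None

def posting_category(claim: ExpenseClaim) -> str:
    """Posting always reads the approved field, never the suggestion."""
    if claim.approved_category is None:
        raise ValueError("cannot post: no human-approved category")
    return claim.approved_category

claim = ExpenseClaim(merchant_raw="UBER *TRIP", amount=24.50,
                     suggested_category="local_transport", confidence_band="high")
claim.approved_category = "local_transport"   # set by a reviewer, not the model
claim.approved_by = "reviewer_01"
```

The design choice here is structural rather than procedural: even a high-confidence suggestion cannot be posted by accident, because the posting path simply has no access to the suggestion field.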
Chart-of-Accounts Mapping Examples
Below is the kind of mapping logic that should be made explicit rather than left implicit in the model.
| Merchant / Description | Possible Category | Why It Can Be Ambiguous |
|---|---|---|
| Uber | Local transport / client travel / employee commute exception | Same merchant, different business purpose |
| Amazon | Office supplies / IT peripherals / books / small equipment | Merchant is too broad to classify without line detail |
| Microsoft | Software subscriptions / cloud services / one-time license / training | Depends on contract and department |
| Marriott | Lodging / conference expense / client event | Requires trip purpose and attendee context |
| Apple | Small tools / capitalizable device / employee equipment reimbursement | May cross capitalization threshold |
The model should never be allowed to invent a new account outside the approved coding structure. If the answer does not map cleanly, the right outcome is “needs review”, not creative classification.
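That constraint can be enforced with a simple guard between the model and the ledger: accept only exact codes from the approved chart, and map everything else to a review state. The account codes below are invented for illustration.

```python
# Guard that constrains model output to the approved chart of accounts.
# Anything else, including plausible freeform text, becomes "needs_review".
APPROVED_ACCOUNTS = {
    "6100": "Travel - Local Transport",
    "6110": "Travel - Lodging",
    "6200": "Office Supplies",
    "6300": "Software Subscriptions",
}

def constrain_to_chart(model_output: str) -> str:
    """Accept only an exact approved code; never a freeform category."""
    code = model_output.strip()
    return code if code in APPROVED_ACCOUNTS else "needs_review"
```

The guard runs outside the model, so prompt drift or a creative completion can never introduce a new account.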
Merchant Ambiguity Cases
A strong production workflow should explicitly define ambiguous merchants and mixed-purpose expenses. Common examples include:
- broad marketplaces like Amazon or Lazada
- travel merchants used for both internal travel and client-facing events
- software vendors that serve multiple departments
- restaurants that may represent travel meals, team meals, or client entertainment
- hardware purchases that may either be expensed or capitalized
These should feed an ambiguity library so that the system routes them more conservatively.
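An ambiguity library can be as simple as a lookup that overrides normal routing for known-ambiguous merchants. This sketch assumes substring matching on normalized merchant strings; the merchant entries and queue names are placeholders.

```python
# A tiny ambiguity library: merchants listed here are routed conservatively
# regardless of model confidence. Entries and reasons are illustrative.
AMBIGUOUS_MERCHANTS = {
    "amazon": "broad marketplace; needs line-item detail",
    "uber": "transport vs commute vs client travel",
    "marriott": "lodging vs conference vs client event",
}

def route_for_merchant(merchant: str, default_queue: str = "standard") -> str:
    key = merchant.strip().lower()
    for known in AMBIGUOUS_MERCHANTS:
        if known in key:
            return "exception"  # conservative routing for known-ambiguous merchants
    return default_queue
```

Because the library is data, finance can extend it from observed corrections without touching prompts or model behavior.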
Confidence-Based Review Rules
A practical rule set might look like this:
| Confidence / Risk Band | Example Handling |
|---|---|
| High confidence + low amount + low-risk category | Auto-route into standard reviewer queue |
| Medium confidence or ambiguous merchant | Mandatory accounting review |
| Low confidence, policy conflict, or missing support | Exception queue |
| Any item above materiality threshold | Senior finance review regardless of confidence |
Confidence alone is not enough. A high-confidence answer on a material or policy-sensitive item should still require human approval.
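The table above can be sketched as a deterministic routing function in which the materiality check runs first, so it overrides confidence entirely. The threshold value and band names are assumptions; real values belong to finance policy.

```python
# Sketch of the confidence/risk routing table. The materiality threshold
# is an illustrative placeholder, not a recommended value.
MATERIALITY_THRESHOLD = 5000.00

def review_route(confidence: str, amount: float,
                 ambiguous: bool = False,
                 policy_conflict: bool = False,
                 missing_support: bool = False) -> str:
    # Materiality overrides confidence: material items always go up.
    if amount >= MATERIALITY_THRESHOLD:
        return "senior_finance_review"
    if confidence == "low" or policy_conflict or missing_support:
        return "exception_queue"
    if confidence == "medium" or ambiguous:
        return "accounting_review"
    return "standard_review"
```

Ordering matters: a high-confidence answer on a material item still lands in senior review because that rule is evaluated before confidence is consulted.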
Materiality Thresholds
Finance should define materiality thresholds by workflow. For example:
- small employee reimbursements may enter a lighter review band
- cross-entity allocations may always require review
- expenses near capitalization policy thresholds should be routed upward
- tax-relevant expenses may require stricter handling regardless of amount
The lesson here is that materiality belongs to finance policy, not to the model.
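One way to keep materiality in finance hands is to express it as configuration the model never sees. All amounts, workflow names, and the buffer logic below are illustrative assumptions.

```python
# Materiality as finance-owned configuration, not model behavior.
# Amounts and workflow names are placeholders.
WORKFLOW_THRESHOLDS = {
    "employee_reimbursement": {"light_review_under": 100.00},
    "cross_entity_allocation": {"always_review": True},
    "asset_purchase": {"capitalization_threshold": 2500.00, "buffer": 0.10},
}

def near_capitalization(amount: float) -> bool:
    """Flag purchases within a 10% buffer below the capitalization
    threshold, so borderline items are routed upward rather than expensed
    quietly."""
    cfg = WORKFLOW_THRESHOLDS["asset_purchase"]
    floor = cfg["capitalization_threshold"] * (1 - cfg["buffer"])
    return amount >= floor
```

Changing a threshold is then a policy change under normal change control, not a prompt edit.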
Exception Queue Design
The exception queue should capture:
- ambiguous merchant or mixed-purpose signal
- missing receipt or incomplete support
- suspected duplicate
- prohibited spend category
- cap exceedance
- unusual tax treatment
- materiality threshold breach
Each case should include the original source, the AI suggestion, the reason for escalation, the final human decision, and the reviewer identity.
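A possible shape for an exception-queue entry, covering the fields listed above; all field names are illustrative.

```python
# One possible shape for an exception-queue entry. Resolution records
# both the decision and the reviewer identity. Names are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExceptionCase:
    source_document_id: str   # link back to the original receipt or invoice
    ai_suggestion: str        # the category the model proposed
    escalation_reason: str    # e.g. "missing_receipt", "suspected_duplicate"
    final_decision: Optional[str] = None
    reviewer_id: Optional[str] = None

    def resolve(self, decision: str, reviewer_id: str) -> None:
        self.final_decision = decision
        self.reviewer_id = reviewer_id

case = ExceptionCase("rcpt-0042", "6300", "cap_exceedance")
case.resolve("6300", "reviewer_07")
```

Note that a case can confirm the AI suggestion (as here) and still be valuable: the record shows a human looked at it and why.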
Audit Trail Requirements
A defensible audit trail should preserve:
- original receipt or invoice
- extracted fields
- model output and rationale
- category suggested
- policy rules triggered
- reviewer decision and timestamp
- final posted category
- any override reason
If finance cannot reconstruct why an item was categorized a certain way, the workflow is not production-ready.
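That reconstruction test can be made mechanical: define the required audit fields once and check every entry against them before it is considered complete. The field names mirror the list above but are otherwise an assumed schema.

```python
# Completeness check for audit entries: an item is reconstructable only
# if every required field is present. Keys mirror the list above.
REQUIRED_AUDIT_FIELDS = [
    "source_document", "extracted_fields", "model_output",
    "suggested_category", "policy_rules_triggered",
    "reviewer_decision", "reviewed_at", "posted_category",
    "override_reason",
]

def is_reconstructable(audit_entry: dict) -> bool:
    """True only if every required key exists. A value may be None
    (e.g. override_reason when nothing was overridden), but the key
    itself must be present so its absence is never ambiguous."""
    return all(key in audit_entry for key in REQUIRED_AUDIT_FIELDS)
```

Requiring the key even when the value is empty distinguishes "no override happened" from "we forgot to record whether an override happened".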
Typical Workflow
- Ingest the receipt, invoice, or employee claim.
- Extract core fields from the document or email.
- Normalize merchant and employee metadata.
- Ask the LLM to propose category, rationale, and confidence band using only approved categories.
- Apply deterministic checks for duplicates, caps, prohibited spend, entity rules, and tax-sensitive cases.
- Route low-risk items to standard review and exceptions to the appropriate queue.
- Capture final finance decisions for audit and future tuning.
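The steps above can be sketched end to end as one routing function. The LLM call is stubbed here; in practice it would go to a model provider with the approved category list embedded in the prompt. Category names, queue names, and the stub's behavior are all assumptions.

```python
# End-to-end sketch of the workflow steps above. The LLM call is a stub;
# all categories, merchants, and queue names are illustrative.
APPROVED = {"local_transport", "lodging", "office_supplies", "software"}
AMBIGUOUS = {"amazon"}  # would come from the ambiguity library

def llm_propose(description: str) -> tuple[str, str]:
    """Stub standing in for the model: returns (category, confidence_band)."""
    if "uber" in description.lower():
        return "local_transport", "high"
    return "office_supplies", "low"

def process_claim(description: str, amount: float, has_receipt: bool) -> str:
    category, confidence = llm_propose(description)
    # 1. Constrain to the approved taxonomy
    if category not in APPROVED:
        return "exception_queue"
    # 2. Deterministic policy checks run after, and can override, the model
    if not has_receipt:
        return "exception_queue"
    # 3. Known-ambiguous merchants are routed conservatively
    if any(m in description.lower() for m in AMBIGUOUS):
        return "accounting_review"
    # 4. Confidence only matters once policy checks have passed
    if confidence != "high":
        return "accounting_review"
    return "standard_review"
```

The key property to preserve in a real implementation is the ordering: taxonomy and policy checks are deterministic gates that the model's confidence can never bypass.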
Risks, Limits, and Common Mistakes
- treating merchant name alone as sufficient classification evidence
- allowing the model to generate categories outside the approved chart
- over-trusting confidence scores
- forgetting that tax and capitalization rules can override ordinary pattern matching
- failing to store overrides and correction reasons
Example Scenario
A regional company receives 1,200 employee expense claims per month. Before AI, AP staff manually read each claim, corrected routine miscoding, and escalated borderline cases late in the cycle. After AI, the system extracts core fields, proposes a category, identifies ambiguous merchants, and routes only flagged cases to accounting. Routine low-risk items move faster, while policy-sensitive items remain under explicit review.
Practical Metrics
Useful metrics include:
- first-pass categorization accuracy
- exception rate by merchant or employee
- turnaround time for standard claims
- override rate by category
- duplicate detection rate
- share of items requiring senior finance review
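Two of these metrics can be computed directly from a decision log. The log shape is an assumption: each row carries the AI suggestion, the final posted category, and the queue the item went through.

```python
# Override rate and exception rate computed from a simple decision log.
# The row schema (suggested / final / queue) is an illustrative assumption.
def override_rate(log: list[dict]) -> float:
    """Share of items where the final category differs from the suggestion."""
    if not log:
        return 0.0
    overridden = sum(1 for row in log if row["final"] != row["suggested"])
    return overridden / len(log)

def exception_rate(log: list[dict]) -> float:
    """Share of items that passed through the exception queue."""
    if not log:
        return 0.0
    return sum(1 for row in log if row["queue"] == "exception") / len(log)

log = [
    {"suggested": "6200", "final": "6200", "queue": "standard"},
    {"suggested": "6200", "final": "6300", "queue": "exception"},
    {"suggested": "6110", "final": "6110", "queue": "standard"},
    {"suggested": "6100", "final": "6100", "queue": "exception"},
]
```

A rising override rate in one category is a concrete signal to update the ambiguity library or the mapping rules, which closes the "learning from corrections" loop under change control.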
Practical Checklist
- Is the chart of accounts or expense taxonomy fixed and approved?
- Are ambiguous merchants explicitly listed?
- Do confidence scores trigger different review paths?
- Are policy-sensitive and material items always escalated?
- Can every final category be traced back to source evidence and reviewer action?