Expense Categorization with LLMs

Manual expense coding is slow for exactly the reason finance teams know well: the same merchant can map to different accounts depending on business purpose, entity, tax treatment, and supporting evidence. An LLM can help interpret messy descriptions and normalize intake, but it should not be treated as the final accounting authority.

Why This Matters

Expense categorization is a high-volume workflow where small mistakes compound. Misclassified travel costs, software subscriptions, client entertainment, or capitalizable purchases can distort management reporting, tax treatment, departmental budgets, and audit support. The point of AI here is not to let a model “do the books.” It is to reduce repetitive coding work while preserving the control logic that finance already depends on.

What AI Is Good At Here

LLMs are useful when the source material is language-heavy or inconsistent: receipt text, merchant names, employee descriptions, memo fields, and invoice notes. They can infer likely categories, extract supporting fields, and explain why a category may fit.

They are weak when the answer depends on accounting policy edge cases, entity-specific treatment, capitalization thresholds, tax rules, or incomplete supporting evidence. In those cases, the model should suggest and flag, not decide.

Before-and-After Workflow in Prose

Before AI: employees submit expenses with vague descriptions; AP or accounting staff read receipts manually, infer purpose, map to the chart of accounts, correct common miscoding patterns, and chase missing support. Senior reviewers then spend time rechecking routine classifications because low-level intake quality is inconsistent.

After AI: the system extracts merchant, date, amount, currency, tax, employee, memo text, and likely expense type; it proposes a chart-of-accounts code and a brief rationale; deterministic policy rules screen for prohibited categories, missing receipts, spending-cap breaches, or entity-specific tax issues; high-confidence, low-risk items move into a standard review stream, while ambiguous or material items go to an exception queue for finance approval.

Control Objective

The control objective is simple: let AI assist with interpretation and preparation, but keep policy ownership and final posting authority with finance.

Control Matrix

Workflow Step | AI May Suggest | Human Must Approve | Key Control
Receipt and memo extraction | Merchant, amount, date, currency, tax, purpose hints | Only when extraction is incomplete or conflicts with source | Source document retained and viewable
Initial category proposal | Likely expense category and account code candidate | Final category for ambiguous, policy-sensitive, or material items | Approved category list only; no freeform categories
Policy checks | Missing receipt, weekend spend, cap exceedance, duplicate clues | Exception disposition | Policy rule log
Posting recommendation | Routing to standard or exception queue | Final posting / reimbursement approval | Segregation of duties
Learning from corrections | Pattern suggestions for future mapping | Rule or prompt changes | Change control and version history

What AI May Suggest vs What Humans Must Approve

AI may suggest

  • merchant normalization
  • likely expense type
  • likely chart-of-accounts mapping
  • supporting rationale
  • confidence score or review band
  • duplicate or unusual-spend flags

Humans must approve

  • final category on ambiguous merchants
  • capitalization vs expense treatment
  • tax-sensitive treatment
  • policy exceptions
  • employee reimbursement approval
  • any posting above the materiality threshold
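This boundary can be made concrete in the data model: the model's output and the human decision live in separate records, and nothing posts until the human side is filled in. A minimal sketch, assuming a Python system; all class and field names here are illustrative, not a real product's schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AiSuggestion:
    """Everything the model may produce. Advisory only; never posted directly."""
    normalized_merchant: str
    likely_expense_type: str
    account_code_candidate: str   # must come from the approved chart
    rationale: str
    confidence_band: str          # e.g. "high" / "medium" / "low"
    flags: tuple = ()             # duplicate or unusual-spend flags

@dataclass
class HumanDecision:
    """Fields that only a reviewer can set."""
    final_account_code: Optional[str] = None
    reviewer_id: Optional[str] = None
    approved: bool = False

def can_post(suggestion: AiSuggestion, decision: HumanDecision) -> bool:
    # Posting authority stays with finance: an unreviewed suggestion never posts.
    return decision.approved and decision.final_account_code is not None

# An AI suggestion with no reviewer decision cannot be posted.
s = AiSuggestion("Uber", "local transport", "6110", "airport ride per memo", "high")
assert can_post(s, HumanDecision()) is False
```

The point of the split is that "approve" is a write to a different record than "suggest", which makes segregation of duties enforceable in code rather than by convention.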

Chart-of-Accounts Mapping Examples

Below is the kind of mapping logic the lesson should make explicit.

Merchant / Description | Possible Category | Why It Can Be Ambiguous
Uber | Local transport / client travel / employee commute exception | Same merchant, different business purpose
Amazon | Office supplies / IT peripherals / books / small equipment | Merchant is too broad to classify without line detail
Microsoft | Software subscriptions / cloud services / one-time license / training | Depends on contract and department
Marriott | Lodging / conference expense / client event | Requires trip purpose and attendee context
Apple | Small tools / capitalizable device / employee equipment reimbursement | May cross capitalization threshold

The model should never be allowed to invent a new account outside the approved coding structure. If the answer does not map cleanly, the right outcome is “needs review”, not creative classification.
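Enforcing the closed category set is a one-line guard, not a prompt instruction. A minimal sketch, with an illustrative subset of a chart of accounts:

```python
APPROVED_ACCOUNTS = {  # illustrative subset of an approved chart of accounts
    "6100": "Travel - Transport",
    "6110": "Travel - Lodging",
    "6200": "Office Supplies",
    "6300": "Software Subscriptions",
}

def resolve_category(model_output: str) -> str:
    """Accept a model-proposed account code only if it exists in the approved
    chart; anything else becomes 'needs review' rather than a new account."""
    code = model_output.strip()
    return code if code in APPROVED_ACCOUNTS else "NEEDS_REVIEW"

assert resolve_category("6300") == "6300"
assert resolve_category("6999 - Misc AI Expense") == "NEEDS_REVIEW"  # invented code rejected
```

Because the check runs after the model, it holds even when the prompt fails: a hallucinated account can never reach posting.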

Merchant Ambiguity Cases

A strong production workflow should explicitly define ambiguous merchants and mixed-purpose expenses. Common examples include:

  • broad marketplaces like Amazon or Lazada
  • travel merchants used for both internal travel and client-facing events
  • software vendors that serve multiple departments
  • restaurants that may represent travel meals, team meals, or client entertainment
  • hardware purchases that may either be expensed or capitalized

These should feed an ambiguity library so that the system routes them more conservatively.
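A sketch of how such an ambiguity library might override the model's own confidence; the merchant set and routing labels are illustrative assumptions:

```python
# Illustrative ambiguity library: merchants whose category depends on context.
AMBIGUOUS_MERCHANTS = {"amazon", "lazada", "uber", "marriott", "apple", "microsoft"}

def review_path(merchant: str, model_confidence: str) -> str:
    """Route known-ambiguous merchants conservatively, regardless of confidence."""
    if merchant.lower() in AMBIGUOUS_MERCHANTS:
        return "accounting_review"          # never auto-routed
    return "standard_queue" if model_confidence == "high" else "accounting_review"

assert review_path("Amazon", "high") == "accounting_review"
assert review_path("Local Print Shop", "high") == "standard_queue"
```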

Confidence-Based Review Rules

A practical rule set might look like this:

Confidence / Risk Band | Example Handling
High confidence + low amount + low-risk category | Auto-route into standard reviewer queue
Medium confidence or ambiguous merchant | Mandatory accounting review
Low confidence, policy conflict, or missing support | Exception queue
Any item above materiality threshold | Senior finance review regardless of confidence

Confidence alone is not enough. A high-confidence answer on a material or policy-sensitive item should still require human approval.
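The band logic above fits in a single routing function; note that the materiality check comes first, so confidence can never bypass it. A sketch with illustrative queue names and thresholds:

```python
def route(confidence: str, amount: float, risk: str,
          policy_conflict: bool, materiality_threshold: float) -> str:
    """Confidence-based routing per the bands above. Values are illustrative."""
    if amount >= materiality_threshold:
        return "senior_finance_review"   # overrides confidence entirely
    if confidence == "low" or policy_conflict:
        return "exception_queue"
    if confidence == "high" and risk == "low":
        return "standard_queue"
    return "accounting_review"           # medium confidence / ambiguous cases

assert route("high", 50_000.0, "low", False, 10_000.0) == "senior_finance_review"
assert route("high", 120.0, "low", False, 10_000.0) == "standard_queue"
assert route("medium", 120.0, "low", False, 10_000.0) == "accounting_review"
```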

Materiality Thresholds

Finance should define materiality thresholds by workflow. For example:

  • small employee reimbursements may enter a lighter review band
  • cross-entity allocations may always require review
  • expenses near capitalization policy thresholds should be routed upward
  • tax-relevant expenses may require stricter handling regardless of amount

The lesson here is that materiality belongs to finance policy, not to the model.
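One way to keep that ownership explicit is to hold the thresholds in finance-controlled configuration that the model never sees or edits. A sketch with illustrative names and amounts:

```python
# Materiality and policy thresholds live in finance-owned configuration,
# not in the prompt or the model. All values here are illustrative.
REVIEW_POLICY = {
    "small_reimbursement_cap": 200.00,     # lighter review band below this
    "capitalization_threshold": 2_500.00,  # route upward near or above this
    "always_review_cross_entity": True,
    "strict_tax_categories": {"client_entertainment", "gifts"},
}

def needs_escalation(amount: float, category: str, cross_entity: bool) -> bool:
    if cross_entity and REVIEW_POLICY["always_review_cross_entity"]:
        return True
    if category in REVIEW_POLICY["strict_tax_categories"]:
        return True  # tax-relevant: stricter handling regardless of amount
    return amount >= REVIEW_POLICY["capitalization_threshold"]

assert needs_escalation(35.00, "gifts", False) is True          # amount irrelevant
assert needs_escalation(35.00, "office_supplies", False) is False
```

Changing a threshold is then an ordinary, reviewable config change rather than a prompt edit.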

Exception Queue Design

The exception queue should capture:

  • ambiguous merchant or mixed-purpose signal
  • missing receipt or incomplete support
  • suspected duplicate
  • prohibited spend category
  • cap exceedance
  • unusual tax treatment
  • materiality threshold breach

Each case should include the original source, the AI suggestion, the reason for escalation, the final human decision, and the reviewer identity.
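A minimal record shape for such a case, assuming Python; field names are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ExceptionCase:
    """One exception-queue entry; all field names are illustrative."""
    source_document_id: str        # link back to the original receipt/invoice
    ai_suggestion: str             # what the model proposed
    escalation_reason: str         # e.g. "missing_receipt", "suspected_duplicate"
    final_decision: str = ""       # filled in by the reviewer, not the model
    reviewer_id: str = ""          # identity of the deciding human
    decided_at: Optional[datetime] = None

case = ExceptionCase("rcpt-4471", "6300 Software Subscriptions", "cap_exceedance")
# Only the reviewer workflow writes the decision fields:
case.final_decision = "6300"
case.reviewer_id = "fin-02"
case.decided_at = datetime.now(timezone.utc)
```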

Audit Trail Requirements

A defensible audit trail should preserve:

  • original receipt or invoice
  • extracted fields
  • model output and rationale
  • category suggested
  • policy rules triggered
  • reviewer decision and timestamp
  • final posted category
  • any override reason

If finance cannot reconstruct why an item was categorized a certain way, the workflow is not production-ready.
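One simple implementation is an append-only log with one JSON line per decision, capturing every element listed above. A sketch; the function and key names are assumptions, not a standard:

```python
import json

def audit_record(receipt_id, extracted, model_output, suggested_code,
                 rules_triggered, reviewer, final_code, override_reason=None):
    """Build one append-only audit entry. Names are illustrative; the key
    property is that the entry alone reconstructs the decision."""
    return {
        "receipt_id": receipt_id,            # pointer to the stored source doc
        "extracted_fields": extracted,
        "model_output": model_output,        # includes the model's rationale
        "suggested_category": suggested_code,
        "policy_rules_triggered": rules_triggered,
        "reviewer": reviewer,                # e.g. {"id": ..., "decided_at": ...}
        "final_posted_category": final_code,
        "override_reason": override_reason,  # set whenever final != suggested
    }

entry = audit_record("rcpt-9001", {"merchant": "Marriott", "amount": 640.0},
                     "lodging, conference trip", "6110",
                     ["amount_over_cap"], {"id": "fin-03"}, "6110")
line = json.dumps(entry)  # one JSON line per decision, appended, never edited
```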

Typical Workflow

  1. Ingest the receipt, invoice, or employee claim.
  2. Extract core fields from the document or email.
  3. Normalize merchant and employee metadata.
  4. Ask the LLM to propose category, rationale, and confidence band using only approved categories.
  5. Apply deterministic checks for duplicates, caps, prohibited spend, entity rules, and tax-sensitive cases.
  6. Route low-risk items to standard review and exceptions to the appropriate queue.
  7. Capture final finance decisions for audit and future tuning.
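The steps above can be sketched end to end. The LLM call and extraction are stubbed with trivial stand-ins; every function name here is an assumption for illustration, not a real API:

```python
# End-to-end sketch of the seven steps. The model call is stubbed out.
def extract_fields(document: str) -> dict:
    # Steps 1-3: in production this is OCR/parsing plus normalization;
    # here a trivial "merchant, amount" stand-in.
    merchant, amount = document.split(",")
    return {"merchant": merchant.strip(), "amount": float(amount)}

def propose_category(fields: dict) -> dict:
    # Step 4: an LLM constrained to approved categories would go here.
    return {"code": "6100", "confidence": "high", "rationale": "transport merchant"}

def policy_checks(fields: dict) -> list:
    # Step 5: deterministic rules owned by finance, not the model.
    return ["cap_exceedance"] if fields["amount"] > 500 else []

def process(document: str) -> dict:
    fields = extract_fields(document)
    suggestion = propose_category(fields)
    violations = policy_checks(fields)
    # Step 6: any violation or non-high confidence routes conservatively.
    queue = ("exception" if violations or suggestion["confidence"] != "high"
             else "standard_review")
    # Step 7: this whole record is what gets persisted for audit and tuning.
    return {"fields": fields, "suggestion": suggestion,
            "violations": violations, "queue": queue}

assert process("Uber, 42.50")["queue"] == "standard_review"
assert process("Marriott, 980.00")["queue"] == "exception"
```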

Risks, Limits, and Common Mistakes

  • treating merchant name alone as sufficient classification evidence
  • allowing the model to generate categories outside the approved chart
  • over-trusting confidence scores
  • forgetting that tax and capitalization rules can override ordinary pattern matching
  • failing to store overrides and correction reasons

Example Scenario

A regional company receives 1,200 employee expense claims per month. Before AI, AP staff manually read each claim, corrected routine miscoding, and escalated borderline cases late in the cycle. After AI, the system extracts core fields, proposes a category, identifies ambiguous merchants, and routes only flagged cases to accounting. Routine low-risk items move faster, while policy-sensitive items remain under explicit review.

Practical Metrics

Useful metrics include:

  • first-pass categorization accuracy
  • exception rate by merchant or employee
  • turnaround time for standard claims
  • override rate by category
  • duplicate detection rate
  • share of items requiring senior finance review
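Several of these metrics fall directly out of the audit log. A sketch over an illustrative in-memory log of (suggested, final, queue) decisions:

```python
# Computing first-pass accuracy, exception rate, and override rate from a
# decision log. The log format is an illustrative assumption.
decisions = [
    {"suggested": "6100", "final": "6100", "queue": "standard_review"},
    {"suggested": "6300", "final": "6300", "queue": "standard_review"},
    {"suggested": "6200", "final": "6110", "queue": "exception"},  # override
    {"suggested": "6100", "final": "6100", "queue": "exception"},
]

total = len(decisions)
first_pass_accuracy = sum(d["suggested"] == d["final"] for d in decisions) / total
exception_rate = sum(d["queue"] == "exception" for d in decisions) / total
override_rate = sum(d["suggested"] != d["final"] for d in decisions) / total

assert first_pass_accuracy == 0.75
assert exception_rate == 0.5
assert override_rate == 0.25
```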

Practical Checklist

  • Is the chart of accounts or expense taxonomy fixed and approved?
  • Are ambiguous merchants explicitly listed?
  • Do confidence scores trigger different review paths?
  • Are policy-sensitive and material items always escalated?
  • Can every final category be traced back to source evidence and reviewer action?
