Expense Categorization with LLMs
Manual expense coding is slow for a reason finance teams know well: the same merchant can map to different accounts depending on business purpose, entity, tax treatment, and supporting evidence. An LLM can help interpret messy descriptions and normalize intake, but it should not be treated as the final accounting authority.
Why This Matters
Expense categorization is a high-volume workflow where small mistakes compound. Misclassified travel costs, software subscriptions, client entertainment, or capitalizable purchases can distort management reporting, tax treatment, departmental budgets, and audit support. The point of AI here is not to let a model “do the books.” It is to reduce repetitive coding work while preserving the control logic that finance already depends on.
What AI Is Good At Here
LLMs are useful when the source material is language-heavy or inconsistent: receipt text, merchant names, employee descriptions, memo fields, and invoice notes. They can infer likely categories, extract supporting fields, and explain why a category may fit.
They are weak when the answer depends on accounting policy edge cases, entity-specific treatment, capitalization thresholds, tax rules, or incomplete supporting evidence. In those cases, the model should suggest and flag, not decide.
Before-and-After Workflow in Prose
Before AI: employees submit expenses with vague descriptions; AP or accounting staff read receipts manually, infer purpose, map to the chart of accounts, correct common miscoding patterns, and chase missing support. Senior reviewers then spend time rechecking routine classifications because low-level intake quality is inconsistent.
After AI: the system extracts merchant, date, amount, currency, tax, employee, memo text, and likely expense type; it proposes a chart-of-accounts code and a brief rationale; deterministic policy rules screen for prohibited categories, missing receipts, spending-cap breaches, or entity-specific tax issues; high-confidence, low-risk items move into a standard review stream, while ambiguous or material items go to an exception queue for finance approval.
Control Objective
The control objective is simple: let AI assist with interpretation and preparation, but keep policy ownership and final posting authority with finance.
Control Matrix
| Workflow Step | AI May Suggest | Human Must Approve | Key Control |
|---|---|---|---|
| Receipt and memo extraction | Merchant, amount, date, currency, tax, purpose hints | Only when extraction is incomplete or conflicts with source | Source document retained and viewable |
| Initial category proposal | Likely expense category and account code candidate | Final category for ambiguous, policy-sensitive, or material items | Approved category list only; no freeform categories |
| Policy checks | Missing receipt, weekend spend, cap exceedance, duplicate clues | Exception disposition | Policy rule log |
| Posting recommendation | Routing to standard or exception queue | Final posting / reimbursement approval | Segregation of duties |
| Learning from corrections | Pattern suggestions for future mapping | Rule or prompt changes | Change control and version history |
What AI May Suggest vs What Humans Must Approve
AI may suggest
- merchant normalization
- likely expense type
- likely chart-of-accounts mapping
- supporting rationale
- confidence score or review band
- duplicate or unusual-spend flags
Humans must approve
- final category on ambiguous merchants
- capitalization vs expense treatment
- tax-sensitive treatment
- policy exceptions
- employee reimbursement approval
- any posting above the materiality threshold
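One way to keep this split enforceable in code is to store AI suggestions and human approvals in separate fields, so posting can only ever read the approved side. This is a minimal sketch; the field names and `ExpenseClaim` type are illustrative, not a real schema.

```python
# Sketch: AI suggestions and human approvals live in separate fields,
# and posting reads only the human-approved side. Names are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExpenseClaim:
    merchant_raw: str
    amount: float
    # AI may populate these (suggestions only)
    suggested_category: Optional[str] = None
    suggested_rationale: Optional[str] = None
    confidence_band: Optional[str] = None      # e.g. "high" / "medium" / "low"
    flags: list = field(default_factory=list)  # e.g. ["possible_duplicate"]
    # Only a human reviewer may populate these
    approved_category: Optional[str] = None
    approved_by: Optional[str] = None

def posting_category(claim: ExpenseClaim) -> str:
    """Posting always reads the approved field, never the suggestion."""
    if claim.approved_category is None:
        raise ValueError("cannot post: no human-approved category")
    return claim.approved_category

claim = ExpenseClaim(merchant_raw="UBER *TRIP", amount=24.50,
                     suggested_category="local_transport", confidence_band="high")
claim.approved_category = "local_transport"   # set by a reviewer, not the model
claim.approved_by = "reviewer_01"
```

The design choice here is structural rather than procedural: even a high-confidence suggestion cannot be posted by accident, because the posting path simply has no access to the suggestion field.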
Chart-of-Accounts Mapping Examples
Below is the kind of mapping logic that should be made explicit rather than left implicit in the model.
| Merchant / Description | Possible Category | Why It Can Be Ambiguous |
|---|---|---|
| Uber | Local transport / client travel / employee commute exception | Same merchant, different business purpose |
| Amazon | Office supplies / IT peripherals / books / small equipment | Merchant is too broad to classify without line detail |
| Microsoft | Software subscriptions / cloud services / one-time license / training | Depends on contract and department |
| Marriott | Lodging / conference expense / client event | Requires trip purpose and attendee context |
| Apple | Small tools / capitalizable device / employee equipment reimbursement | May cross capitalization threshold |
The model should never be allowed to invent a new account outside the approved coding structure. If the answer does not map cleanly, the right outcome is “needs review”, not creative classification.
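That constraint can be enforced with a simple guard between the model and the ledger: accept only exact codes from the approved chart, and map everything else to a review state. The account codes below are invented for illustration.

```python
# Guard that constrains model output to the approved chart of accounts.
# Anything else, including plausible freeform text, becomes "needs_review".
APPROVED_ACCOUNTS = {
    "6100": "Travel - Local Transport",
    "6110": "Travel - Lodging",
    "6200": "Office Supplies",
    "6300": "Software Subscriptions",
}

def constrain_to_chart(model_output: str) -> str:
    """Accept only an exact approved code; never a freeform category."""
    code = model_output.strip()
    return code if code in APPROVED_ACCOUNTS else "needs_review"
```

The guard runs outside the model, so prompt drift or a creative completion can never introduce a new account.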
Merchant Ambiguity Cases
A strong production workflow should explicitly define ambiguous merchants and mixed-purpose expenses. Common examples include:
- broad marketplaces like Amazon or Lazada
- travel merchants used for both internal travel and client-facing events
- software vendors that serve multiple departments
- restaurants that may represent travel meals, team meals, or client entertainment
- hardware purchases that may either be expensed or capitalized
These should feed an ambiguity library so that the system routes them more conservatively.
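An ambiguity library can be as simple as a lookup that overrides normal routing for known-ambiguous merchants. This sketch assumes substring matching on normalized merchant strings; the merchant entries and queue names are placeholders.

```python
# A tiny ambiguity library: merchants listed here are routed conservatively
# regardless of model confidence. Entries and reasons are illustrative.
AMBIGUOUS_MERCHANTS = {
    "amazon": "broad marketplace; needs line-item detail",
    "uber": "transport vs commute vs client travel",
    "marriott": "lodging vs conference vs client event",
}

def route_for_merchant(merchant: str, default_queue: str = "standard") -> str:
    key = merchant.strip().lower()
    for known in AMBIGUOUS_MERCHANTS:
        if known in key:
            return "exception"  # conservative routing for known-ambiguous merchants
    return default_queue
```

Because the library is data, finance can extend it from observed corrections without touching prompts or model behavior.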
Confidence-Based Review Rules
A practical rule set might look like this:
| Confidence / Risk Band | Example Handling |
|---|---|
| High confidence + low amount + low-risk category | Auto-route into standard reviewer queue |
| Medium confidence or ambiguous merchant | Mandatory accounting review |
| Low confidence, policy conflict, or missing support | Exception queue |
| Any item above materiality threshold | Senior finance review regardless of confidence |
Confidence alone is not enough. A high-confidence answer on a material or policy-sensitive item should still require human approval.
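The table above can be sketched as a deterministic routing function in which the materiality check runs first, so it overrides confidence entirely. The threshold value and band names are assumptions; real values belong to finance policy.

```python
# Sketch of the confidence/risk routing table. The materiality threshold
# is an illustrative placeholder, not a recommended value.
MATERIALITY_THRESHOLD = 5000.00

def review_route(confidence: str, amount: float,
                 ambiguous: bool = False,
                 policy_conflict: bool = False,
                 missing_support: bool = False) -> str:
    # Materiality overrides confidence: material items always go up.
    if amount >= MATERIALITY_THRESHOLD:
        return "senior_finance_review"
    if confidence == "low" or policy_conflict or missing_support:
        return "exception_queue"
    if confidence == "medium" or ambiguous:
        return "accounting_review"
    return "standard_review"
```

Ordering matters: a high-confidence answer on a material item still lands in senior review because that rule is evaluated before confidence is consulted.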
Materiality Thresholds
Finance should define materiality thresholds by workflow. For example:
- small employee reimbursements may enter a lighter review band
- cross-entity allocations may always require review
- expenses near capitalization policy thresholds should be routed upward
- tax-relevant expenses may require stricter handling regardless of amount
The lesson here is that materiality belongs to finance policy, not to the model.
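One way to keep materiality in finance hands is to express it as configuration the model never sees. All amounts, workflow names, and the buffer logic below are illustrative assumptions.

```python
# Materiality as finance-owned configuration, not model behavior.
# Amounts and workflow names are placeholders.
WORKFLOW_THRESHOLDS = {
    "employee_reimbursement": {"light_review_under": 100.00},
    "cross_entity_allocation": {"always_review": True},
    "asset_purchase": {"capitalization_threshold": 2500.00, "buffer": 0.10},
}

def near_capitalization(amount: float) -> bool:
    """Flag purchases within a 10% buffer below the capitalization
    threshold, so borderline items are routed upward rather than expensed
    quietly."""
    cfg = WORKFLOW_THRESHOLDS["asset_purchase"]
    floor = cfg["capitalization_threshold"] * (1 - cfg["buffer"])
    return amount >= floor
```

Changing a threshold is then a policy change under normal change control, not a prompt edit.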
Exception Queue Design
The exception queue should capture:
- ambiguous merchant or mixed-purpose signal
- missing receipt or incomplete support
- suspected duplicate
- prohibited spend category
- cap exceedance
- unusual tax treatment
- materiality threshold breach
Each case should include the original source, the AI suggestion, the reason for escalation, the final human decision, and the reviewer identity.
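A possible shape for an exception-queue entry, covering the fields listed above; all field names are illustrative.

```python
# One possible shape for an exception-queue entry. Resolution records
# both the decision and the reviewer identity. Names are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExceptionCase:
    source_document_id: str   # link back to the original receipt or invoice
    ai_suggestion: str        # the category the model proposed
    escalation_reason: str    # e.g. "missing_receipt", "suspected_duplicate"
    final_decision: Optional[str] = None
    reviewer_id: Optional[str] = None

    def resolve(self, decision: str, reviewer_id: str) -> None:
        self.final_decision = decision
        self.reviewer_id = reviewer_id

case = ExceptionCase("rcpt-0042", "6300", "cap_exceedance")
case.resolve("6300", "reviewer_07")
```

Note that a case can confirm the AI suggestion (as here) and still be valuable: the record shows a human looked at it and why.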
Audit Trail Requirements
A defensible audit trail should preserve:
- original receipt or invoice
- extracted fields
- model output and rationale
- category suggested
- policy rules triggered
- reviewer decision and timestamp
- final posted category
- any override reason
If finance cannot reconstruct why an item was categorized a certain way, the workflow is not production-ready.
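That reconstruction test can be made mechanical: define the required audit fields once and check every entry against them before it is considered complete. The field names mirror the list above but are otherwise an assumed schema.

```python
# Completeness check for audit entries: an item is reconstructable only
# if every required field is present. Keys mirror the list above.
REQUIRED_AUDIT_FIELDS = [
    "source_document", "extracted_fields", "model_output",
    "suggested_category", "policy_rules_triggered",
    "reviewer_decision", "reviewed_at", "posted_category",
    "override_reason",
]

def is_reconstructable(audit_entry: dict) -> bool:
    """True only if every required key exists. A value may be None
    (e.g. override_reason when nothing was overridden), but the key
    itself must be present so its absence is never ambiguous."""
    return all(key in audit_entry for key in REQUIRED_AUDIT_FIELDS)
```

Requiring the key even when the value is empty distinguishes "no override happened" from "we forgot to record whether an override happened".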
Typical Workflow
- Ingest the receipt, invoice, or employee claim.
- Extract core fields from the document or email.
- Normalize merchant and employee metadata.
- Ask the LLM to propose category, rationale, and confidence band using only approved categories.
- Apply deterministic checks for duplicates, caps, prohibited spend, entity rules, and tax-sensitive cases.
- Route low-risk items to standard review and exceptions to the appropriate queue.
- Capture final finance decisions for audit and future tuning.
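The steps above can be sketched end to end as one routing function. The LLM call is stubbed here; in practice it would go to a model provider with the approved category list embedded in the prompt. Category names, queue names, and the stub's behavior are all assumptions.

```python
# End-to-end sketch of the workflow steps above. The LLM call is a stub;
# all categories, merchants, and queue names are illustrative.
APPROVED = {"local_transport", "lodging", "office_supplies", "software"}
AMBIGUOUS = {"amazon"}  # would come from the ambiguity library

def llm_propose(description: str) -> tuple[str, str]:
    """Stub standing in for the model: returns (category, confidence_band)."""
    if "uber" in description.lower():
        return "local_transport", "high"
    return "office_supplies", "low"

def process_claim(description: str, amount: float, has_receipt: bool) -> str:
    category, confidence = llm_propose(description)
    # 1. Constrain to the approved taxonomy
    if category not in APPROVED:
        return "exception_queue"
    # 2. Deterministic policy checks run after, and can override, the model
    if not has_receipt:
        return "exception_queue"
    # 3. Known-ambiguous merchants are routed conservatively
    if any(m in description.lower() for m in AMBIGUOUS):
        return "accounting_review"
    # 4. Confidence only matters once policy checks have passed
    if confidence != "high":
        return "accounting_review"
    return "standard_review"
```

The key property to preserve in a real implementation is the ordering: taxonomy and policy checks are deterministic gates that the model's confidence can never bypass.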
Risks, Limits, and Common Mistakes
- treating merchant name alone as sufficient classification evidence
- allowing the model to generate categories outside the approved chart
- over-trusting confidence scores
- forgetting that tax and capitalization rules can override ordinary pattern matching
- failing to store overrides and correction reasons
Example Scenario
A regional company receives 1,200 employee expense claims per month. Before AI, AP staff manually read each claim, corrected routine miscoding, and escalated borderline cases late in the cycle. After AI, the system extracts core fields, proposes a category, identifies ambiguous merchants, and routes only flagged cases to accounting. Routine low-risk items move faster, while policy-sensitive items remain under explicit review.
Practical Metrics
Useful metrics include:
- first-pass categorization accuracy
- exception rate by merchant or employee
- turnaround time for standard claims
- override rate by category
- duplicate detection rate
- share of items requiring senior finance review
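Two of these metrics can be computed directly from a decision log. The log shape is an assumption: each row carries the AI suggestion, the final posted category, and the queue the item went through.

```python
# Override rate and exception rate computed from a simple decision log.
# The row schema (suggested / final / queue) is an illustrative assumption.
def override_rate(log: list[dict]) -> float:
    """Share of items where the final category differs from the suggestion."""
    if not log:
        return 0.0
    overridden = sum(1 for row in log if row["final"] != row["suggested"])
    return overridden / len(log)

def exception_rate(log: list[dict]) -> float:
    """Share of items that passed through the exception queue."""
    if not log:
        return 0.0
    return sum(1 for row in log if row["queue"] == "exception") / len(log)

log = [
    {"suggested": "6200", "final": "6200", "queue": "standard"},
    {"suggested": "6200", "final": "6300", "queue": "exception"},
    {"suggested": "6110", "final": "6110", "queue": "standard"},
    {"suggested": "6100", "final": "6100", "queue": "exception"},
]
```

A rising override rate in one category is a concrete signal to update the ambiguity library or the mapping rules, which closes the "learning from corrections" loop under change control.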
Practical Checklist
- Is the chart of accounts or expense taxonomy fixed and approved?
- Are ambiguous merchants explicitly listed?
- Do confidence scores trigger different review paths?
- Are policy-sensitive and material items always escalated?
- Can every final category be traced back to source evidence and reviewer action?