Paths > Outcomes: Measuring Agent Quality Beyond the Final State

When we measure a marathon by who crosses the line, we ignore how they ran it. For LLM agents that operate through tool calls—editing a CRM, moving a robot arm, or filing a compliance report—the “how” is the difference between deployable and dangerous. Today’s paper introduces CORE: Full‑Path Evaluation of LLM Agents Beyond Final State, a framework that scores agents on the entire execution path rather than only the end state. Here’s why this matters for your roadmap.

The problem with “it worked in the end”

Final‑state benchmarks bless any trace that lands on the right answer, even if the agent skipped a required audit step, issued a risky write then silently reversed it, or spammed an API three times before succeeding. In production, those detours are costs and risks: rate limits, inconsistent state under partial failure, and latent liability.

CORE’s core idea (pun intended)

CORE models each task as a deterministic finite automaton (DFA): valid tool calls are edges; safe terminal states are accept nodes. The agent’s raw trace is condensed to remove harmless self‑loops (reads with no state change) while preserving progress steps and harmful attempts. Scoring then compares the condensed path to golden (loop‑free, harm‑free) paths.

The five deployment‑oriented metrics

Path Correctness (PC): How close is the agent’s condensed path to a canonical solution (via normalized edit distance)?
PC‑KTC: A composite that blends token‑level correctness with Kendall‑tau order agreement—rewarding the right steps in the right order.
Prefix Criticality: Penalizes earlier harmful calls more, since they can cascade downstream.
Harmful‑Call Rate: Proportion of substantive steps that are policy‑violating.
Efficiency: Raw‑trace economy: shortest valid length vs. actual calls (reads and detours count).

TL;DR: CORE converts vague “vibes” about an agent’s behavior into a 5‑vector that exposes safety, order, and waste—exactly what deployment owners care about.

A simple mental picture

Imagine a smart‑farm rover. A textbook path is: unlock_safety → move → scan → open_valve → water → log. A final‑state benchmark would let an agent “water” before opening the valve and still pass if the end moisture looks OK. CORE would dock it for order errors, early harms, and inefficiency—even though the terminal moisture reading is fine.

Why this is different (and better for business)

Final‑state metrics optimize for demos. Path‑aware metrics optimize for operations:

Operational risk	How it shows up	Which CORE metric flags it	What to mitigate
Skipped preconditions (e.g., no consent check)	Looks fine at the end	Harmful‑Call Rate ↑, PC/PC‑KTC ↓	Enforce DFA checks; add must‑pass reads
Compensating pairs (do‑bad → undo)	End state clean, but non‑atomic	Prefix Criticality ↓, PC ↓	Make risky writes atomic; gate with idempotency keys
API spam/looping	Higher cost and rate‑limit risk	Efficiency ↓, PC‑KTC ↓	Add retry budgets and backoff policies
Out‑of‑order steps	Fragile workflows	PC‑KTC ↓	Make order explicit in tools & prompts

HLR: rewarding the right instinct, punishing the wrong act

A neat add‑on called Harm‑Local Refinement (HLR) builds a small set of agent‑consistent references by repairing only the harmful tokens (delete or replace with a legal read) and then compares against these. It avoids over‑penalizing traces where the agent did all the right progress steps but briefly probed something unsafe. Net effect: fairer, more stable scoring without excusing harm.

What early numbers suggest

Across small (≤10B) models, CORE surfaces practical differences that end‑state scores blur: models that “pass” by final state can still have noisy, harmful, or wasteful paths. In other words, cost, safety, and reliability separate even when “accuracy” looks similar. That’s exactly the separation you need for vendor selection and change‑control.

Implementation playbook for teams

Define DFAs where it counts. Start with compliance‑sensitive or state‑ful workflows: finance operations, HR changes, robotic manipulation, security controls.
Instrument your tools. Make every function’s name + argument pattern a token; label reads vs writes; record state deltas.
Adopt the 5‑vector as your gate. Promote agents/models only if they meet target bands (e.g., Harmful‑Call Rate < 3%, Prefix Criticality > 0.9, Efficiency > 0.8).
Close the loop. Use low PC‑KTC findings to rewrite prompts/tools; use high early‑harm to add interlocks; track trends per release.

Example target bands by context

Back‑office RPA: Efficiency ≥ 0.85, Harmful‑Call Rate ≤ 2%, PC‑KTC ≥ 0.9
Customer‑facing edits (PII): Prefix Criticality ≥ 0.95, Harmful‑Call Rate ≤ 1%, PC ≥ 0.9
Robotics/IoT: Efficiency ≥ 0.75 (latency), PC‑KTC ≥ 0.9, early‑harm penalties strict (β small)

Where CORE fits in your evaluation stack

Keep final‑state checks (they’re necessary). Add CORE to see how the sausage is made.
Pair with cost/tokens and wall‑clock budgets for a complete “Ops‑ready” dashboard.
For stochastic worlds, compute distributions (means/quantiles) of CORE metrics across rollouts.

Limitations to mind

CORE assumes you can encode critical behavior in a DFA. UX quality, continuous control nuances, or purely aesthetic outputs need extra rubrics. Still, many high‑stakes enterprise actions are DFA‑friendly: precondition → act → verify → log.

Bottom line: If your agents touch money, data, or machines, paths matter. CORE gives you the language—and the numbers—to make that operational.

Cognaptus: Automate the Present, Incubate the Future

The problem with “it worked in the end”#

CORE’s core idea (pun intended)#

The five deployment‑oriented metrics#

A simple mental picture#

Why this is different (and better for business)#

HLR: rewarding the right instinct, punishing the wrong act#

What early numbers suggest#

Implementation playbook for teams#

Example target bands by context#

Where CORE fits in your evaluation stack#

Limitations to mind#