When we measure a marathon by who crosses the line, we ignore how they ran it. For LLM agents that operate through tool calls—editing a CRM, moving a robot arm, or filing a compliance report—the “how” is the difference between deployable and dangerous. Today’s paper introduces CORE: Full‑Path Evaluation of LLM Agents Beyond Final State, a framework that scores agents on the entire execution path rather than only the end state. Here’s why this matters for your roadmap.
The problem with “it worked in the end”
Final‑state benchmarks bless any trace that lands on the right answer, even if the agent skipped a required audit step, issued a risky write then silently reversed it, or spammed an API three times before succeeding. In production, those detours are costs and risks: rate limits, inconsistent state under partial failure, and latent liability.
CORE’s core idea (pun intended)
CORE models each task as a deterministic finite automaton (DFA): valid tool calls are edges; safe terminal states are accept nodes. The agent’s raw trace is condensed to remove harmless self‑loops (reads with no state change) while preserving progress steps and harmful attempts. Scoring then compares the condensed path to golden (loop‑free, harm‑free) paths.
The five deployment‑oriented metrics
- Path Correctness (PC): How close is the agent’s condensed path to a canonical solution (via normalized edit distance)?
- PC‑KTC: A composite that blends token‑level correctness with Kendall‑tau order agreement—rewarding the right steps in the right order.
- Prefix Criticality: Penalizes earlier harmful calls more, since they can cascade downstream.
- Harmful‑Call Rate: Proportion of substantive steps that are policy‑violating.
- Efficiency: Raw‑trace economy: shortest valid length vs. actual calls (reads and detours count).
TL;DR: CORE converts vague “vibes” about an agent’s behavior into a 5‑vector that exposes safety, order, and waste—exactly what deployment owners care about.
A simple mental picture
Imagine a smart‑farm rover. A textbook path is: unlock_safety → move → scan → open_valve → water → log. A final‑state benchmark would let an agent “water” before opening the valve and still pass if the end moisture looks OK. CORE would dock it for order errors, early harms, and inefficiency—even though the terminal moisture reading is fine.
Why this is different (and better for business)
Final‑state metrics optimize for demos. Path‑aware metrics optimize for operations:
Operational risk | How it shows up | Which CORE metric flags it | What to mitigate |
---|---|---|---|
Skipped preconditions (e.g., no consent check) | Looks fine at the end | Harmful‑Call Rate ↑, PC/PC‑KTC ↓ | Enforce DFA checks; add must‑pass reads |
Compensating pairs (do‑bad → undo) | End state clean, but non‑atomic | Prefix Criticality ↓, PC ↓ | Make risky writes atomic; gate with idempotency keys |
API spam/looping | Higher cost and rate‑limit risk | Efficiency ↓, PC‑KTC ↓ | Add retry budgets and backoff policies |
Out‑of‑order steps | Fragile workflows | PC‑KTC ↓ | Make order explicit in tools & prompts |
HLR: rewarding the right instinct, punishing the wrong act
A neat add‑on called Harm‑Local Refinement (HLR) builds a small set of agent‑consistent references by repairing only the harmful tokens (delete or replace with a legal read) and then compares against these. It avoids over‑penalizing traces where the agent did all the right progress steps but briefly probed something unsafe. Net effect: fairer, more stable scoring without excusing harm.
What early numbers suggest
Across small (≤10B) models, CORE surfaces practical differences that end‑state scores blur: models that “pass” by final state can still have noisy, harmful, or wasteful paths. In other words, cost, safety, and reliability separate even when “accuracy” looks similar. That’s exactly the separation you need for vendor selection and change‑control.
Implementation playbook for teams
- Define DFAs where it counts. Start with compliance‑sensitive or state‑ful workflows: finance operations, HR changes, robotic manipulation, security controls.
- Instrument your tools. Make every function’s name + argument pattern a token; label reads vs writes; record state deltas.
- Adopt the 5‑vector as your gate. Promote agents/models only if they meet target bands (e.g., Harmful‑Call Rate < 3%, Prefix Criticality > 0.9, Efficiency > 0.8).
- Close the loop. Use low PC‑KTC findings to rewrite prompts/tools; use high early‑harm to add interlocks; track trends per release.
Example target bands by context
- Back‑office RPA: Efficiency ≥ 0.85, Harmful‑Call Rate ≤ 2%, PC‑KTC ≥ 0.9
- Customer‑facing edits (PII): Prefix Criticality ≥ 0.95, Harmful‑Call Rate ≤ 1%, PC ≥ 0.9
- Robotics/IoT: Efficiency ≥ 0.75 (latency), PC‑KTC ≥ 0.9, early‑harm penalties strict (β small)
Where CORE fits in your evaluation stack
- Keep final‑state checks (they’re necessary). Add CORE to see how the sausage is made.
- Pair with cost/tokens and wall‑clock budgets for a complete “Ops‑ready” dashboard.
- For stochastic worlds, compute distributions (means/quantiles) of CORE metrics across rollouts.
Limitations to mind
CORE assumes you can encode critical behavior in a DFA. UX quality, continuous control nuances, or purely aesthetic outputs need extra rubrics. Still, many high‑stakes enterprise actions are DFA‑friendly: precondition → act → verify → log.
Bottom line: If your agents touch money, data, or machines, paths matter. CORE gives you the language—and the numbers—to make that operational.
Cognaptus: Automate the Present, Incubate the Future