Paths > Outcomes: Measuring Agent Quality Beyond the Final State

A calendar assistant creates the right meeting. A compliance agent files the right flag. A robotic controller moves the right object. Everyone applauds, because the final state is correct.

Then someone checks the logs.

The calendar assistant created, deleted, recreated, and re-notified the same meeting. The compliance agent skipped the required policy check and jumped straight to enforcement. The robot got the object into place only after executing a step that would have been unsafe if the power had cut out halfway through. The destination was fine. The route was a mess. In enterprise automation, this is not a philosophical distinction. It is the difference between “the demo worked” and “legal now wants a meeting.”

That is the problem addressed by CORE: Full-Path Evaluation of LLM Agents Beyond Final State, a paper that proposes a deterministic finite automaton, or DFA, framework for evaluating tool-using LLM agents by their full action paths rather than only their final outcomes.¹ The contribution is not simply another benchmark score to add to the dashboard, because naturally what agent engineering needed was one more number. CORE is more useful than that. It gives a structured way to ask whether an agent followed an acceptable procedure, avoided harmful calls, preserved ordering, and used tools economically.

The important comparison is not “CORE versus no evaluation.” It is final-state evaluation versus full-path evaluation. Final-state evaluation asks whether the world ended up looking right. CORE asks whether the agent got there without breaking procedural, safety, or efficiency constraints along the way.

That second question is where agent quality starts to look less like chatbot grading and more like operations risk.

The final state is a receipt, not an audit trail

Many tool-agent evaluations are built around a simple premise: run the agent, inspect the backend state or emitted response, and decide whether the task was completed. This is intuitive. It is also incomplete in exactly the places where businesses care most.

If an agent is answering a read-only query, final-state evaluation can be adequate. There may be no meaningful state to corrupt, no fragile ordering requirement, and no expensive physical action hiding behind a tool call. But once the agent starts changing state, the path matters.

A bank transfer can be reversed, but the reversal is not atomic. A compliance action can arrive at the right enforcement flag while skipping the mandated review step. A smart-home controller can end in the right configuration after issuing redundant or unsafe commands. A robot can place an object correctly after executing a sequence that would fail under a timing glitch. Final-state scoring sees the polished table. CORE looks under it and notices the missing screws.

The paper’s core intuition is that agent execution should be represented as a sequence of function calls over a finite action space. The task is encoded as a DFA: states represent control conditions, transitions represent valid tool invocations, and undefined transitions are harmful calls. A task does not merely have a correct endpoint; it has a set of acceptable paths.
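
To make the encoding concrete, here is a minimal sketch of a task DFA for a two-step compliance workflow. The class name `TaskDFA`, the state names, and the tool names `check_policy` and `file_flag` are illustrative assumptions, not identifiers from the paper. The sketch only performs a strict membership test on the set of acceptable paths.

```python
from dataclasses import dataclass


@dataclass
class TaskDFA:
    """Hypothetical container for one task's control structure."""
    start: str
    accepting: set
    transitions: dict  # maps (state, action) -> next state

    def step(self, state, action):
        # None signals an undefined transition, i.e. a harmful call.
        return self.transitions.get((state, action))

    def accepts(self, actions):
        # Strict membership test: any undefined transition rejects the path.
        state = self.start
        for action in actions:
            state = self.step(state, action)
            if state is None:
                return False
        return state in self.accepting


compliance = TaskDFA(
    start="start",
    accepting={"flagged"},
    transitions={
        ("start", "check_policy"): "checked",
        ("checked", "file_flag"): "flagged",
    },
)

followed_procedure = compliance.accepts(["check_policy", "file_flag"])
skipped_check = compliance.accepts(["file_flag"])
print(followed_procedure, skipped_check)
```

Both traces could leave the same flag in the backend. Only the first is an acceptable path under this toy automaton.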

That difference matters because many real workflows are not just goal-driven. They are procedure-driven.

CORE turns agent behaviour into a path problem

CORE models each task as a world with tools, an initial state, a user prompt, and a target solution. The agent produces a raw sequence of function calls. Those calls are mapped into discrete actions, where both the function name and the arguments matter. The calls water_plant(plant_A) and water_plant(plant_B) are not the same action with a cosmetic parameter difference. They are different operational steps.
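
One way to picture that mapping is to canonicalise each raw call into a hashable token. The sketch below is an assumption about how such a mapping could be built, not the paper's implementation. The helper `to_action` and the argument key `plant` are hypothetical.

```python
import json


def to_action(name, arguments):
    """Map a raw tool call to a hashable discrete action token."""
    # Sorting keys makes the token independent of argument order.
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    return (name, canonical)


a = to_action("water_plant", {"plant": "plant_A"})
b = to_action("water_plant", {"plant": "plant_B"})
c = to_action("water_plant", {"plant": "plant_A"})

print(a != b)  # same function, different argument: different action
print(a == c)  # same function, same argument: same action
```

Because the tokens are plain tuples, they can serve directly as symbols in a transition table.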

The DFA then labels what happens as the path unfolds. A valid transition advances the control state. A harmless self-loop, such as a read that does not change the state, may be ignored for some metrics. An undefined transition is harmful: it does not advance the DFA, but it is recorded. This is a clean design choice. It prevents one bad call from making the rest of the trace impossible to evaluate, while still preserving the evidence of harm.
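
The labelling rule described above can be sketched in a few lines. The transition table and tool names are hypothetical, and the three label strings are a naming choice made for this example rather than the paper's vocabulary.

```python
def label_trace(transitions, start, actions):
    """Label each call as 'progress', 'self_loop', or 'harmful'."""
    state = start
    labels = []
    for action in actions:
        nxt = transitions.get((state, action))
        if nxt is None:
            labels.append("harmful")    # recorded, but the state does not move
        elif nxt == state:
            labels.append("self_loop")  # e.g. a read that changes nothing
        else:
            labels.append("progress")
            state = nxt
    return labels, state


TRANSITIONS = {
    ("start", "read_case"): "start",
    ("start", "check_policy"): "checked",
    ("checked", "read_case"): "checked",
    ("checked", "file_flag"): "flagged",
}

labels, final_state = label_trace(
    TRANSITIONS, "start", ["read_case", "file_flag", "check_policy", "file_flag"]
)
print(labels, final_state)
```

The premature `file_flag` is logged as harmful, yet the rest of the trace is still evaluated and the run ends in the flagged state.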

CORE also condenses action paths for several metrics. State-preserving repetitions are dropped, while meaningful progress steps and harmful attempts remain. This matters because it separates harmless observation from substantive behaviour. A read-only inspection may not deserve the same penalty as an invalid write. But for efficiency, the paper deliberately returns to the raw path: every call counts because every call consumes time, tokens, compute, battery, API quota, or patience. Usually several at once.
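
A rough sketch of the condensing step follows, assuming every state-preserving call is dropped while progress steps and harmful attempts are kept. The paper's exact condensing rule may differ in detail. The transition table and tool names are hypothetical.

```python
def condense(transitions, start, actions):
    """Drop state-preserving calls; keep progress steps and harmful attempts."""
    state = start
    condensed = []
    for action in actions:
        nxt = transitions.get((state, action))
        if nxt is None:
            condensed.append(action)  # harmful attempt stays visible
        elif nxt != state:
            condensed.append(action)  # meaningful progress
            state = nxt
        # otherwise: harmless self-loop, dropped from the condensed path
    return condensed


TRANSITIONS = {
    ("start", "read_case"): "start",
    ("start", "check_policy"): "checked",
    ("checked", "read_case"): "checked",
    ("checked", "file_flag"): "flagged",
}

raw = ["read_case", "read_case", "file_flag", "check_policy", "read_case", "file_flag"]
condensed = condense(TRANSITIONS, "start", raw)
print(len(raw), condensed)
```

The condensed path feeds the path-alignment metrics, while the raw length of six calls is what an efficiency measure would charge for.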

The result is a five-metric view of agent behaviour.

CORE metric	What it asks	Operational meaning
Path Correctness (PC)	How close is the condensed action path to a valid golden path?	Did the agent perform the right substantive steps?
PC-KTC	Did the agent preserve both the right tokens and the right order?	Did it do the right things in the right sequence?
Prefix Criticality	Did harmful calls happen early or late?	Did early mistakes create causal risk?
Harmful-Call Rate	How often did the agent attempt invalid actions?	Is the agent habitually unsafe, even if it sometimes recovers?
Efficiency	How many raw calls did the agent use relative to the shortest valid path?	How much operational waste did it create?
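
The sketch below gives one plausible reading of three of these metrics. It assumes Path Correctness is a normalised edit-distance similarity to the nearest golden path, Harmful-Call Rate is the share of harmful calls in the raw trace, and Efficiency is the shortest golden length divided by the raw call count. These formulas are assumptions made for illustration and may not match the paper's definitions. PC-KTC and Prefix Criticality are left out because their formulas are not spelled out here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def path_correctness(condensed, golden_paths):
    # Assumed form: similarity to the closest golden path, in [0, 1].
    best = 0.0
    for golden in golden_paths:
        longest = max(len(condensed), len(golden)) or 1
        best = max(best, 1.0 - edit_distance(condensed, golden) / longest)
    return best


def harmful_call_rate(labels):
    return labels.count("harmful") / len(labels) if labels else 0.0


def efficiency(raw_length, golden_paths):
    shortest = min(len(g) for g in golden_paths)
    return min(1.0, shortest / raw_length) if raw_length else 0.0


golden = [["check_policy", "file_flag"]]
pc = path_correctness(["file_flag", "check_policy", "file_flag"], golden)
harm = harmful_call_rate(["self_loop", "harmful", "progress", "progress"])
eff = efficiency(4, golden)
print(round(pc, 3), harm, eff)
```

Even in this toy case the three numbers disagree, which is the argument for reporting them separately.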

The metrics are intentionally not collapsed into one grand “agent score.” The appendix argues for reporting them as a vector or using deployment-specific weights. That is sensible. A robotic controller, a legal compliance assistant, and a browser research agent should not share the same risk appetite just because someone in procurement likes tidy leaderboards.
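
As a small illustration of deployment-specific weighting, the sketch below scores the same metric vector under two weight profiles. The metric values are the Legal Compliance figures quoted later in this article. The weight profiles and the helper `weighted_score` are hypothetical and exist only to show that the ranking signal depends on who is asking.

```python
def weighted_score(metrics, weights):
    """Weighted average of selected metrics; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total


# Legal Compliance values as reported in the per-world table.
legal_compliance = {
    "pc": 0.408,
    "pc_ktc": 0.444,
    "prefix_crit": 0.526,
    "efficiency": 0.472,
}

# Hypothetical profiles: a compliance team cares about order and harm timing,
# a cost-focused team cares about substantive steps and call count.
compliance_weights = {"pc": 0.2, "pc_ktc": 0.4, "prefix_crit": 0.3, "efficiency": 0.1}
cost_weights = {"pc": 0.4, "pc_ktc": 0.1, "prefix_crit": 0.1, "efficiency": 0.4}

compliance_view = weighted_score(legal_compliance, compliance_weights)
cost_view = weighted_score(legal_compliance, cost_weights)
print(compliance_view, cost_view)
```

The vector stays the reportable artefact. Any scalar derived from it is a local policy decision.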

The benchmark comparison shows where final-state scoring is still useful

The paper evaluates CORE across 14 simulated worlds, each with tool interfaces, task prompts, initial states, manually verified prompt-specific DFAs, and finite golden sets of loop-free, harm-free paths. The worlds include Farm Rover, Robotic Arm, Transactions, Web Browsing, Automation, Legal Compliance, Communication, CRUD operations, Desktop Manager, Event Scheduler, File Management, Navigation, Validation, Computations, and Writing.

The comparison baseline is BFCL-style evaluation, reported in two forms: state-based evaluation, which checks whether the final backend state matches the ground truth, and response-based evaluation, which checks whether the execution contains the minimal viable call sequence needed for the requested response.
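
The two baseline checks can be approximated as follows. This is a simplified sketch, not BFCL's implementation: it assumes the backend state is a plain dictionary and that "contains the minimal viable call sequence" means the required calls appear in order somewhere in the trace. The function names are hypothetical.

```python
def state_match(final_state, expected_state):
    """State-based check: does the final backend state equal the ground truth?"""
    return final_state == expected_state


def contains_in_order(trace, required):
    """Response-style check: do the required calls appear in order in the trace?"""
    remaining = iter(trace)
    # 'call in remaining' advances the iterator, so order is enforced.
    return all(call in remaining for call in required)


noisy_trace = ["file_flag", "check_policy", "file_flag"]
required = ["check_policy", "file_flag"]

state_ok = state_match({"case_42": "flagged"}, {"case_42": "flagged"})
response_ok = contains_in_order(noisy_trace, required)
print(state_ok, response_ok)
```

Under these assumptions both checks pass, even though the trace opens with an enforcement call made before the policy check.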

The useful finding is not that final-state scoring is always wrong. It is more precise than that.

In read-dominant or deterministic workflows, CORE and BFCL mostly agree. File Management, Validation, and Event Scheduler show relatively high path alignment and temporal safety. In the paper’s per-world table, File Management has PC 0.711, PC-KTC 0.741, PrefixCrit 0.985, and BFCL-State 83.3%. Validation has PC 0.705, PC-KTC 0.694, PrefixCrit 0.966, and BFCL-State 100.0%. These are not identical signals, but they are directionally consistent.

That is important. CORE is not arguing that final-state evaluation is useless. It is saying that final-state evaluation works best when the workflow itself is forgiving: low path sensitivity, transparent state changes, few required preconditions, and little opportunity for harmful intermediate action.

This is the first business lesson. If your agent operates in a domain where many paths are acceptable and intermediate actions are cheap, reversible, and harmless, then final-state scoring may be a reasonable first pass. Not complete. Reasonable. There is a difference, though benchmark culture sometimes treats the two as inconvenient synonyms.

The discrepancies are the paper’s main evidence

The central evidence comes from cases where BFCL and CORE disagree.

Legal Compliance and Web Browsing are the headline examples. Both achieve 100.0% BFCL-State in the per-world table. Under a final-state view, they look perfect. Under CORE, they do not. Legal Compliance records PC 0.408, PC-KTC 0.444, PrefixCrit 0.526, Harmful total 124, average harmful calls 3.100, and Efficiency 0.472. Web Browsing records PC 0.452, PC-KTC 0.491, BFCL-State 100.0%, and BFCL-Response 57.6%.

The paper’s interpretation is straightforward: the final state can be correct even when the route contains skipped preconditions, meandering reads, or invalid calls. This is not an academic edge case. Compliance workflows often require the check to happen before the enforcement action. Browser workflows can reach the right page or extracted content after wasteful or unstable navigation. The endpoint is not the process.

Communication gives another kind of mismatch. It reaches BFCL-Response 100.0%, but PC-KTC is only 0.530. The paper attributes this to redundant sends and order instability. In a consumer demo, maybe that looks like a quirk. In a business workflow, sending the same high-priority message three times is not “extra diligence.” It is spam with a graduate degree.

The paper’s qualitative examples explain the mechanism behind the numbers.

Failure mode	What final-state evaluation can miss	What CORE adds
Mandatory precondition skipped	The final enforcement flag is correct, so the task appears successful	The missing check is visible as a path violation
Redundant or unsafe repetitions	The requested message or action exists somewhere in the trace	Harm rate, prefix criticality, and efficiency distinguish mild detours from operational mess
Missing intermediate action	The terminal state happens to match	Path metrics expose non-atomic omissions that could be unsafe mid-trajectory

This is the strongest part of the paper because it changes the evaluation question. The issue is not “Can the agent eventually get the right answer?” The issue is “Can the agent execute the procedure in a way the business can defend, reproduce, and pay for?”

The model results show that capability and cleanliness are not the same thing

The per-model results add a second layer. Across the evaluated models, GPT-o4-mini has the strongest aggregate profile in the table: PC 0.812, PC-KTC 0.834, Efficiency 0.748, PrefixCrit 0.896, BFCL-State 79.8%, and BFCL-Response 79.8%. GPT-4o-mini follows with PC 0.715, PC-KTC 0.744, and Efficiency 0.675. Qwen3-8B is competitive on path metrics, with PC 0.744, PC-KTC 0.777, and the highest PrefixCrit in the table at 0.897, but lower Efficiency at 0.591.

The Qwen3 family shows size-related improvement: Qwen3-0.6B has PC 0.585 and Efficiency 0.446; Qwen3-1.7B rises to PC 0.642 and Efficiency 0.525; Qwen3-8B reaches PC 0.744 and Efficiency 0.591. No surprise there. Larger models generally behave better. Champagne may remain unopened.

But the more interesting comparison is with the Qwen2.5 rows. Qwen2.5-7B has BFCL-Response 76.7%, which looks respectable. Yet CORE records PC 0.460, PC-KTC 0.598, Efficiency 0.291, average path length 12.4, and average harmful calls 4.13. Qwen2.5-3B is worse on path correctness and harm: PC 0.346, PC-KTC 0.542, Efficiency 0.277, average path length 11.3, and average harmful calls 5.71.

That tells a practical story. A model can appear acceptable under response-based scoring while producing long, noisy, harmful traces. For an enterprise buyer, that matters because a tool agent is not just generating text. It is touching systems. Long traces mean cost, latency, audit exposure, customer-visible weirdness, and more chances for failure between compensating actions.

The smallest Qwen2.5 model, meanwhile, has a different failure shape. Qwen2.5-0.5B produces very short average traces, with average length 1.9. Its harmful average is low at 0.72, but PC-KTC is only 0.405 and BFCL-Response is 15.9%. The paper interprets this as premature termination rather than clean execution. That distinction is valuable: an agent that stops early is not safer in the business sense. It may simply fail before it has time to cause trouble. This is the automation equivalent of calling a sleeping guard “low incident.”

PC+HLR is a calibration tool, not a free pass

CORE also introduces Harm-Local Refinement, or HLR, which augments Path Correctness by generating task-consistent reference candidates from harmful sites in the agent’s path. The likely purpose is not to create a second thesis, but to address a measurement problem: strict comparison against loop-free golden paths can over-penalise cases where the agent mostly followed the right progress structure but inserted a localised harmful or benign detour.

HLR works locally. It identifies harmful positions in the condensed path, then either deletes the harmful token or replaces it with a legal read self-loop in that state. If the repaired prefix can be extended along a golden path, it is extended. PC+HLR is then computed against this augmented set of harm-free candidates.
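
The local repair step might look roughly like the sketch below. It is an interpretation of the description above, not the paper's algorithm: the function `hlr_candidates` is hypothetical, the read token is assumed to be a legal self-loop at the repair site without being checked against a DFA, and "can be extended" is read as "the progress made so far is a prefix of some golden path".

```python
def hlr_candidates(condensed, labels, golden_paths, read_token=None):
    """Build harm-free reference candidates around each harmful site.

    labels[i] is 'progress' or 'harmful' for condensed[i].
    """
    candidates = []
    for i, label in enumerate(labels):
        if label != "harmful":
            continue
        # Progress made before this harmful site, earlier harmful tokens removed.
        prefix = [a for a, l in zip(condensed[:i], labels[:i]) if l == "progress"]
        repairs = [prefix]                          # repair 1: delete the token
        if read_token is not None:
            repairs.append(prefix + [read_token])   # repair 2: swap in a read
        for repaired in repairs:
            for golden in golden_paths:
                if golden[:len(prefix)] == prefix:  # extendable along this path
                    candidate = repaired + golden[len(prefix):]
                    if candidate not in candidates:
                        candidates.append(candidate)
    return candidates


golden = [["check_policy", "file_flag"]]
candidates = hlr_candidates(
    ["check_policy", "delete_case", "file_flag"],
    ["progress", "harmful", "progress"],
    golden,
    read_token="read_case",
)
print(candidates)
```

A PC+HLR style score would then compare the agent's condensed path against the golden set plus these candidates, while the harmful `delete_case` call still counts in the harm metrics.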

This is an important nuance. HLR does not erase harmful behaviour. Harmful-Call Rate and Prefix Criticality still report it. HLR simply asks whether the path-correctness distance should compare the agent only to a pristine canonical path, or also to nearby task-consistent variants that reflect localised repairs.

In the model table, PC+HLR is generally higher than PC. GPT-o4-mini rises from PC 0.812 to PC+HLR 0.858. Qwen2.5-7B rises from 0.460 to 0.649. In the world table, Desktop Manager rises from PC 0.573 to PC+HLR 0.814, and Web Browsing rises from 0.452 to 0.635. This does not mean the agents were secretly excellent. It means the raw PC score can sometimes be harsh when the remaining path structure is closer to a valid reference than the canonical golden set alone suggests.

For business use, this distinction is useful. PC is the strict procedural-alignment signal. PC+HLR is a more forgiving structural-alignment signal. HarmRate and PrefixCrit keep the safety account honest. Use all three, or enjoy the familiar enterprise sport of optimising one metric until it lies politely.

The experiments are best read as an evaluation study, not a product leaderboard

The paper’s experimental components serve different purposes. Treating them all as “results” in the same way would flatten the argument.

Paper component	Likely purpose	What it supports	What it does not prove
DFA task formulation	Main method	Tool-agent tasks can be represented as acceptable and harmful paths	That every real workflow is easy to encode
Five CORE metrics	Main contribution	Agent quality decomposes into correctness, order, harm incidence, harm timing, and efficiency	That one universal weighting is appropriate
BFCL comparison	Comparison with prior evaluation style	Final-state/response scoring can overrate path-sensitive tasks	That BFCL is useless in low-risk or read-only workflows
Per-model table	Main evidence	Models with similar final scores can differ sharply in path quality	That the ranking generalises to all models or larger deployments
Per-world table	Main evidence	Path sensitivity varies by workflow category	That the simulated worlds cover every enterprise process
Qualitative failure cases	Mechanism explanation	Discrepancies arise from skipped checks, repetitions, and non-atomic omissions	That these are the only possible failure modes
HLR / PC+HLR	Implementation refinement / calibration	Strict golden-path distance can be adjusted without ignoring harm	That repaired alignment means the original execution was safe
Appendix deployment discussion	Practical framing	Metrics map to operational desiderata under explicit assumptions	That the assumptions always hold in messy production systems

This matters because the business value of CORE is not “we now have a better leaderboard.” The value is cheaper diagnosis. If two agents both pass a final-state benchmark, CORE can tell you whether one is quietly burning API calls, skipping checks, violating order constraints, or relying on lucky reversibility.

That is procurement-relevant. It is also governance-relevant. A model that reaches the correct end state through unacceptable intermediate actions is not merely lower quality. It may be unapprovable.

What businesses should take from CORE

The direct claim of the paper is limited and concrete: a DFA-based full-path evaluation framework can expose differences in tool-agent behaviour that final-state and response-based scoring miss. The empirical evidence comes from 14 simulated worlds with prompt-specific, manually verified DFAs and multiple LLM-powered agents. The strongest discrepancies appear in path-sensitive domains such as compliance, robotics-like manipulation, and browsing workflows with order and precondition constraints.

The business inference is broader, but still disciplined: companies deploying agents should evaluate the trajectory, not only the final artefact. This is especially true when the agent can mutate state, trigger external actions, send messages, execute transactions, alter records, or operate physical or quasi-physical systems.

A practical adoption path would look like this:

Business question	CORE-style evaluation lens
Did the agent complete the task?	Path Correctness plus final-state checks
Did it follow required procedure?	PC-KTC and DFA precondition modelling
Did it attempt prohibited calls?	Harmful-Call Rate
Did harmful behaviour occur early enough to cascade?	Prefix Criticality
Did it waste calls, time, or resources?	Efficiency
Is the agent structurally near-correct despite local mistakes?	PC+HLR, interpreted alongside harm metrics

The point is not to replace all evaluation with formal automata. The point is to stop pretending that a successful endpoint is the same as a safe execution. For internal pilots, the first step may be modest: define golden paths for a small number of high-value workflows, label invalid transitions, and compare agent traces across models. The likely return is not a magical safety guarantee. It is visibility. In operations, visibility usually beats superstition.
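
A pilot of that kind can start very small. The sketch below compares two hypothetical agent traces against one hand-written transition table. The agent names, tool names, and the `summarize` helper are all assumptions made for the example.

```python
def summarize(transitions, start, accepting, trace):
    """Replay one trace and report goal attainment, call count, and harmful calls."""
    state, harmful = start, 0
    for action in trace:
        nxt = transitions.get((state, action))
        if nxt is None:
            harmful += 1  # invalid transition: count it, keep the state
        else:
            state = nxt
    return {"reached_goal": state in accepting, "calls": len(trace), "harmful": harmful}


TRANSITIONS = {
    ("start", "read_case"): "start",
    ("start", "check_policy"): "checked",
    ("checked", "file_flag"): "flagged",
}

traces = {
    "agent_a": ["check_policy", "file_flag"],
    "agent_b": ["file_flag", "read_case", "check_policy", "file_flag", "file_flag"],
}

report = {
    name: summarize(TRANSITIONS, "start", {"flagged"}, trace)
    for name, trace in traces.items()
}
for name, row in report.items():
    print(name, row)
```

Both agents reach the goal. Only one of them does it in two calls with no invalid attempts, which is exactly the difference a final-state check cannot show.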

The boundary: CORE is only as good as the process model

CORE’s main limitation is also its main strength: it depends on a formal representation of the task. The paper uses manually verified, prompt-specific DFAs. That is appropriate for a research study and for some enterprise workflows, but it is not free.

A claims-processing workflow, access-control procedure, refund pipeline, compliance check, or robotic routine may be encodable as states and transitions. A negotiation, customer-support conversation, design review, or ambiguous policy interpretation may resist clean modelling. If important effects are not expressible as state/action symbols, CORE will not see them unless the alphabet or metrics are extended.

The paper is explicit about this boundary. Fine-grained timing within a call, continuous control, and human-facing UX quality may require additional modelling. Stochastic environments may require distributional scores over repeated rollouts rather than single-path judgments. The appendix also states standing assumptions: reads are side-effect free, the golden set is non-empty, the progress graph is acyclic, and execution cost or latency is roughly proportional to the number of calls. Those assumptions are reasonable in many software workflows. They are not laws of nature, despite how often architecture diagrams try to look like physics.
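
For the stochastic case, one simple option is to summarise a metric over repeated rollouts instead of trusting a single trace. The sketch below is an assumption about what such a distributional report could contain. The scores are made-up illustrative values, not results from the paper.

```python
from statistics import mean, pstdev


def summarize_rollouts(scores):
    """Summarise one metric across repeated rollouts of the same task."""
    return {
        "mean": mean(scores),
        "std": pstdev(scores),   # population standard deviation
        "worst": min(scores),    # the run an auditor will ask about
    }


# Illustrative path-correctness scores from four hypothetical rollouts.
rollout_scores = [0.8, 0.6, 1.0, 0.6]
summary = summarize_rollouts(rollout_scores)
print(summary)
```

Reporting the worst run next to the mean keeps an occasional bad trajectory from disappearing into an average.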

There is also a governance boundary. A DFA tells the evaluator what counts as valid or harmful. That definition has to come from somewhere: policy, engineering constraints, legal requirements, operational risk rules, or subject-matter expertise. If the business cannot define the procedure, CORE cannot rescue it with notation. The framework evaluates discipline; it does not invent it.

The destination was never enough

CORE lands at the right moment because tool-using agents are moving from toy environments into workflows where intermediate behaviour has consequences. The old question—“Did the agent finish the task?”—is too small. It belongs to demos, not deployments.

The better question is comparative: did the agent reach the outcome through a path the business can tolerate?

That is the intellectual move in this paper. It takes agent evaluation away from final-state theatre and toward procedural evidence. It does not claim every workflow can be perfectly formalised. It does not claim path metrics solve safety. It gives evaluators a sharper instrument for seeing failures that were already there, hiding between the first call and the final state.

The final state says what was left behind. The path says what actually happened.

For tool agents, that is where the quality lives.

Cognaptus: Automate the Present, Incubate the Future.

¹ Panagiotis Michelakis, Yiannis Hadjiyiannis, and Dimitrios Stamoulis, “CORE: Full-Path Evaluation of LLM Agents Beyond Final State,” arXiv:2509.20998, 2025. https://arxiv.org/abs/2509.20998
