Aligned, or Just Agreeable? The Quiet Failure Mode of Modern LLMs

A support agent can sound calm, ask polite questions, invoke a few tools, and finish with a reassuring summary. The customer leaves. The dashboard shows completion. Everyone feels civilized.

Then someone opens the actual transaction log.

The reservation was not cancelled. The reminder was searched before the timestamp was retrieved. The contact update succeeded for the wrong person. The model was not exactly malicious, or even spectacularly wrong. It was simply agreeable in the familiar corporate way: fluent enough to pass the meeting, not reliable enough to run the process.

That is the quiet failure mode this article is interested in. Not whether a model says safe things in a static answer. Not whether it can win a leaderboard by completing a benchmark once. The harder question is whether an AI agent can complete a workflow when the user is vague, the task is multi-step, the tools matter, and the conversation path changes.

A recent paper, Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis, introduces TED, a framework that evaluates agents through three linked stages: Talk, Evaluate, and Diagnose.¹ The contribution is less glamorous than a new model release, which is usually a good sign. TED is not selling a larger brain. It is asking whether the brain did the work, how quickly it did the work, and what it tends to break when paired with different kinds of users.

That sounds modest. It is not. For business deployment, this is close to the real question.

The old evaluation habit rewards the final answer and ignores the conversation that produced it

Most enterprise discussions about AI agents still inherit a single-response mindset. Ask the system a question. Grade the answer. Repeat at scale. This made sense when LLMs were mostly text generators. It becomes brittle when the system is an agent that must talk to users, call tools, update state, and recover from incomplete instructions.

A workflow agent is not just producing text. It is moving through a sequence of decisions:

What does the user actually want?
What information is missing?
Which tool should be called first?
Did the previous tool output change the plan?
What must be communicated back to the user?
Has the real task been completed, or only narrated as completed?

A final success metric compresses all of this into one number. Useful, but blunt. It treats two agents as similar if they eventually reach the same endpoint, even if one reaches it cleanly and the other gets there through accidental tool calls, redundant turns, and a small miracle from the user.

TED begins from a different assumption: the conversation is part of the evaluation object. The same task can be easy with an expert user and messy with a non-expert user. A good agent should be tested under both conditions because real users, in a shocking development, often fail to behave like benchmark designers.

TED turns evaluation into a loop: Talk, Evaluate, Diagnose

The paper’s core mechanism is simple enough to be useful.

Stage	What TED does	Operational meaning
Talk	Simulates user-agent conversations with reusable expert and non-expert personas	Tests whether the agent works with different user behaviors, not only different task prompts
Evaluate	Converts benchmark goals into natural-language grading notes and uses an LLM judge to score achieved subgoals	Makes heterogeneous workflow tasks comparable without building a custom evaluator for every domain
Diagnose	Uses judge and agent inconsistencies to identify and cluster recurring errors	Turns evaluation from a scorecard into a debugging input

The important word is loop. TED does not stop at “model A scored 0.91 and model B scored 0.87, please clap.” It asks what kind of user produced the score, which subgoals were achieved, when progress happened, where variance appeared, and whether recurring failures can be fed back into agent design.

That is why the framework is more relevant to enterprise deployment than another generic benchmark table. Businesses do not merely need to know which model is globally best. They need to know which agent fails when a customer omits context, which one asks too many clarifying questions, which one calls tools too early, and which one hallucinates an internal state that no system actually changed.

Talk: the same task should be tested against different user expertise levels

TED’s first move is to separate the task from the user persona.

This sounds obvious until you look at many agent benchmarks. Some simulate users dynamically, but the user behavior is often coupled to the task, the domain, or the benchmark’s own scenario design. When user behavior and task difficulty change together, the evaluation cannot easily tell which factor caused the failure.

TED instead uses reusable persona templates:

an expert user who understands the system goal and gives clearer, more precise information;
a non-expert user who is vague, incomplete, and casual, offering details only when prompted.

The same underlying task can then be tested with both persona types. This matters because a workflow agent’s quality is partly its ability to extract missing information. If it only works when the user supplies a perfectly formed instruction, it is not an agent. It is a form with better manners.
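
A minimal sketch of that separation, assuming hypothetical persona wording and task names (the paper's actual templates are not reproduced here): personas are defined once, then crossed with every task so that user behavior and task content vary independently.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Persona:
    name: str
    instructions: str  # hypothetical wording, not the paper's actual template


@dataclass(frozen=True)
class Task:
    task_id: str
    goal: str


# Reusable personas: defined once, independent of any task or domain.
EXPERT = Persona(
    "expert",
    "You understand what the system can do. State precise details up front.",
)
NON_EXPERT = Persona(
    "non_expert",
    "You are vague and casual. Reveal a detail only when the agent asks for it.",
)


def build_user_prompt(task: Task, persona: Persona) -> str:
    """Combine a task goal with a persona to brief the simulated user."""
    return f"{persona.instructions}\nYour goal in this conversation: {task.goal}"


def evaluation_grid(tasks, personas):
    """Cross every task with every persona so the two factors vary independently."""
    return [
        (task.task_id, persona.name, build_user_prompt(task, persona))
        for task, persona in product(tasks, personas)
    ]


# Illustrative tasks only; the IDs and goals are assumptions.
tasks = [
    Task("airline_cancel", "Cancel your flight reservation."),
    Task("contact_update", "Change a contact's phone number."),
]
grid = evaluation_grid(tasks, [EXPERT, NON_EXPERT])
for task_id, persona_name, _prompt in grid:
    print(task_id, persona_name)
```

Because the persona never mentions the task, a failure that appears only in the non_expert rows can be attributed to user behavior rather than to task difficulty.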

The paper applies this setup to two benchmarks: $\tau^2$-bench, including airline and retail tasks, and ToolSandbox, which includes stateful tool-use scenarios such as contacts, reminders, messaging, location, and currency conversion. The point is not that these datasets perfectly represent the enterprise universe. They do not. The point is that TED can reuse the same persona logic across heterogeneous domains without rebuilding the entire evaluation approach each time.

That is the first business lesson: evaluate the user-agent pair, not the agent in isolation.

A procurement benchmark that tests only clean expert prompts will tend to overstate readiness. A customer-support deployment, HR assistant, finance operations bot, or internal IT agent must survive users who say things like “fix the thing from yesterday” and expect the system to know what “the thing” means. Enterprise reality is not adversarial by default. It is just under-specified, which is sometimes worse.

Evaluate: grading notes make messy workflows comparable

The second TED stage converts task requirements into natural-language grading notes.

A grading note is a subgoal written as an assertion, such as “Agent should enable WiFi,” “Agent should call search_contacts to find the contact information,” or “Agent should inform the user that the phone number has been updated.” These notes can cover tool calls, ordering requirements, intermediate steps, and final user-facing communication.

This is more flexible than relying only on final database state, regex matching, or exact output comparison. Different benchmarks encode success differently. Some rely on tool signatures. Some inspect environment state. Some check final responses. TED’s abstraction is to represent the task as a checklist of natural-language subgoals, then use an LLM-as-a-judge to evaluate whether each subgoal was achieved from the trajectory log.

The evaluation unit therefore becomes not “did the final answer look right?” but “which required pieces of the workflow were actually completed?”
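
One way to picture the checklist, as a sketch rather than the paper's implementation: each grading note is a plain assertion, and the judge is any callable that decides whether a note holds for a trajectory log. The toy judge below is a deliberately dumb keyword matcher standing in for the LLM judge, and the log is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GradingNote:
    text: str  # one subgoal, written as an assertion


# A judge answers: given this note and this trajectory log, was the subgoal met?
# In TED this role is played by an LLM; here it is just a callable.
Judge = Callable[[GradingNote, str], bool]


def grade_trajectory(notes: List[GradingNote], log: str, judge: Judge) -> List[bool]:
    """Return one verdict per grading note, in order."""
    return [judge(note, log) for note in notes]


notes = [
    GradingNote("Agent should call search_contacts to find the contact information"),
    GradingNote("Agent should inform the user that the phone number has been updated"),
]

# Toy stand-in for the LLM judge: looks for a literal piece of evidence per note.
TOY_EVIDENCE = {
    notes[0].text: "search_contacts(",
    notes[1].text: "has been updated",
}


def toy_judge(note: GradingNote, log: str) -> bool:
    return TOY_EVIDENCE[note.text] in log


# Hypothetical trajectory: the agent uses the tool but never confirms the update.
log = "\n".join([
    "user: can you fix Sam's number",
    "agent -> tool: search_contacts(name='Sam')",
    "agent: Anything else I can help with?",
])

verdicts = grade_trajectory(notes, log, toy_judge)
print(verdicts)  # [True, False]
```

The useful property is the shape of the output: one verdict per subgoal, not one verdict per conversation.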

That distinction is central. A model can produce a correct-sounding final answer after failing an important tool call. Conversely, an agent can make partial progress on a complex task even when it does not fully complete the workflow. A binary success metric treats both cases too crudely.

TED defines progress as the proportion of grading notes achieved. If a task has five subgoals and the trajectory satisfies three, its progress is 0.6. That already gives a more useful signal than pass/fail.

The paper then extends this idea to account for stochastic agent runs. Instead of only asking whether at least one trial succeeds, TED introduces MaxProgressRate@$k$: the maximum progress achieved across $k$ trials, averaged across tasks. This softens the harshness of pass@$k$, where 0.99 progress and 0.01 progress are both failures if the threshold is full completion. Fine-grained progress is preserved. The bar, for once, is allowed to have units.
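
The arithmetic is small enough to sketch. The functions below follow the definitions as described in this article (progress as the fraction of notes achieved, MaxProgressRate@$k$ as the best progress over $k$ trials averaged across tasks); the function names, the task IDs, and the verdicts are illustrative assumptions, and the all-or-nothing rate is a simplified "at least one trial fully succeeds" reading of pass@$k$.

```python
from typing import Dict, List

Verdicts = List[bool]  # one bool per grading note, for a single trial


def progress(verdicts: Verdicts) -> float:
    """Fraction of grading notes achieved in one trial."""
    return sum(verdicts) / len(verdicts)


def max_progress_rate_at_k(trials_by_task: Dict[str, List[Verdicts]]) -> float:
    """Best progress across each task's k trials, averaged over tasks."""
    best = [max(progress(v) for v in trials) for trials in trials_by_task.values()]
    return sum(best) / len(best)


def full_success_rate_at_k(trials_by_task: Dict[str, List[Verdicts]]) -> float:
    """Share of tasks where at least one trial achieved every note."""
    solved = [any(all(v) for v in trials) for trials in trials_by_task.values()]
    return sum(solved) / len(solved)


# Illustrative data: two tasks, k = 2 trials each.
trials_by_task = {
    "task_a": [[True, True, True, False, False], [True, True, True, True, False]],
    "task_b": [[True, False], [True, True]],
}

print(progress([True, True, True, False, False]))           # 0.6
print(round(max_progress_rate_at_k(trials_by_task), 2))     # 0.9
print(full_success_rate_at_k(trials_by_task))               # 0.5
```

In this toy data task_a never fully succeeds, so the all-or-nothing rate records it as a plain failure, while the progress-based metric still credits its best trial with 0.8.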

Turn efficiency tells you whether the agent solved the task or merely survived it

Progress at the final turn is still not enough. A workflow agent may eventually complete the task, but only after burning ten turns, asking redundant questions, or delaying tool use until the conversation is already tangled.

TED adds two turn-aware metrics:

Metric	What it captures	When it matters
MaxAUC@$k$	Whether the agent achieves subgoals earlier in the conversation	Tasks where early progress reduces downstream uncertainty, such as navigation, web search, or multi-step planning
MaxPPT@$k$	Progress per turn, using final achieved progress divided by the minimum turns needed to reach it	Tasks where order is less important but unnecessary turns still matter

The difference is subtle but valuable.

AUC rewards early progress. If one agent calls the right tool in the first turn and another waits until the fifth, their final success may match, but their AUC will differ. PPT is more forgiving about the order of subgoals. In a travel booking task, booking the flight before the hotel may not matter if both are independent; what matters is how much progress is gained per turn.
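
To make the distinction concrete, here is a hedged sketch. It assumes a per-turn progress curve (cumulative progress after each turn), treats AUC as the mean height of that curve, and treats PPT as final progress divided by the first turn at which that progress was reached. The paper's exact normalization may differ, and the two curves are invented for illustration.

```python
from typing import Callable, Dict, List

Curve = List[float]  # cumulative progress after each turn, non-decreasing


def auc(curve: Curve) -> float:
    """Normalized area under the progress curve: higher when progress comes early."""
    return sum(curve) / len(curve)


def ppt(curve: Curve) -> float:
    """Final progress divided by the minimum number of turns needed to reach it."""
    final = curve[-1]
    turns_needed = next(i for i, p in enumerate(curve, start=1) if p >= final)
    return final / turns_needed


def max_at_k(metric: Callable[[Curve], float], curves_by_task: Dict[str, List[Curve]]) -> float:
    """Best value of a metric across each task's k trials, averaged over tasks."""
    best = [max(metric(c) for c in curves) for curves in curves_by_task.values()]
    return sum(best) / len(best)


# Two hypothetical agents that both finish the task on turn 5.
early_tool_use = [0.8, 0.8, 0.8, 0.8, 1.0]   # calls tools first, clarifies later
clarify_first = [0.0, 0.25, 0.5, 0.75, 1.0]  # asks questions, progresses gradually

print(round(ppt(early_tool_use), 2), round(ppt(clarify_first), 2))  # 0.2 0.2
print(round(auc(early_tool_use), 2), round(auc(clarify_first), 2))  # 0.84 0.5
```

Same PPT, different AUC: the two agents take equally long to finish, but one front-loads its progress.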

The paper gives a concrete example from ToolSandbox. In a search_reminder_with_recency_upcoming scenario, both gpt-5 and Mistral-Nemo, under the non-expert persona, achieve the same PPT value of 0.20. But Mistral-Nemo has a much higher AUC, 0.88 versus 0.61, because it makes rapid early progress while gpt-5 progresses more gradually. The authors’ trajectory analysis shows the trade-off: Mistral-Nemo invokes tools early and then clarifies, while gpt-5 starts by asking clarifying questions before tool use.

Neither behavior is automatically “better” in every setting. This is exactly the point. For a high-risk financial operation, early tool calls may be dangerous if the user intent is not yet clear. For a low-risk information retrieval workflow, delayed tool use may be wasteful. TED does not declare one universal winner. It gives you the language to ask which failure you prefer to pay for.

That is refreshingly adult. Slightly less leaderboard, slightly more operations meeting.

The results show that rankings change when user behavior and progress timing are visible

The paper’s main results are not merely that some models score higher than others. The more useful finding is that model rankings can change when evaluation accounts for user expertise and turn-level progress.

On the easy airline split of $\tau^2$-bench, several traditional metrics saturate. Strong agents look near-perfect on final or best-trial completion. TED’s progress and turn-aware metrics reveal differences underneath that saturation.

For example, in the easy airline domain, gpt-4o-mini and Mistral-Large under the expert persona have similar mean progress scores, but MaxAUC shows a larger gap: 0.85 for gpt-4o-mini versus 0.96 for Mistral-Large. That changes how one interprets the agents. They may look similar by average progress, yet differ in how quickly progress accumulates.

The user persona effect is also consistent. Non-expert users generally force more conversational turns, lowering AUC and PPT even when final success or MaxProgressRate remains high. In the easy airline setup, gpt-4o-mini has the same MaxProgressRate under expert and non-expert users, but its turn-efficiency metrics fall under the non-expert persona. A final score would miss this. A workflow owner would not.

The comparison with original benchmark evaluation paradigms is especially important. Using the original $\tau^2$-bench evaluation approach, model rankings cluster in ways that TED later complicates. In the paper’s analysis, Mistral-Nemo can outperform gpt-4o-mini under expert-user settings on some TED metrics, while the non-expert setting can reverse or narrow that relationship. Similarly, in ToolSandbox, the original milestone-similarity ranking places gpt-4o first and Mistral-Nemo last. TED still finds gpt-4o strong, but reveals a more nuanced relationship between Mistral-Nemo and gpt-4o-mini depending on persona and metric.

The message is not “Mistral-Nemo is secretly better” or “gpt-4o-mini is underrated.” That would be the shallow reading. The message is that a model ranking without a user model is an incomplete procurement artifact.

For businesses, this matters because users are not a constant. Internal experts, junior staff, customers, vendors, and executives all produce different instruction patterns. An agent that excels with expert users may not be the best customer-facing agent. An agent that asks cautious questions may be excellent for regulated workflows and irritating for routine automation. TED makes those differences measurable.

Diagnosis: the score is only useful if it becomes a repair signal

The third stage, Diagnose, is where TED becomes more than a benchmark wrapper.

The framework uses multiple judge runs and multiple agent trajectories to identify inconsistencies. Judge variance indicates uncertainty or instability in grading. Different expected progress values across agent trials indicate agent inconsistency. The paper then uses a two-step automated error analysis process:

identify low-level errors from failed or inconsistent subgoals;
semantically cluster similar errors into higher-level categories.
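
The inconsistency signals can be sketched with nothing more than the standard library. The layout below (a list of judge runs per agent trial, each run a list of per-note verdicts) and the function names are assumptions made for illustration, not the paper's data format.

```python
from statistics import mean, pvariance
from typing import List

JudgeRun = List[bool]   # per-note verdicts from one judge run
Trial = List[JudgeRun]  # several judge runs over the same agent trajectory


def progress(verdicts: JudgeRun) -> float:
    return sum(verdicts) / len(verdicts)


def majority_vote(judge_runs: Trial) -> List[bool]:
    """Per-note majority across judge runs (use an odd number of runs)."""
    n_runs = len(judge_runs)
    return [sum(column) * 2 > n_runs for column in zip(*judge_runs)]


def summarize_trial(judge_runs: Trial) -> dict:
    """Expected progress and judge variance for one agent trajectory."""
    scores = [progress(run) for run in judge_runs]
    return {
        "expected_progress": mean(scores),
        "judge_variance": pvariance(scores),  # > 0 means the judge disagrees with itself
        "majority": majority_vote(judge_runs),
    }


def unstable_notes(trials: List[Trial]) -> List[int]:
    """Indices of notes whose majority verdict differs between agent trials."""
    majorities = [majority_vote(trial) for trial in trials]
    return [i for i, column in enumerate(zip(*majorities)) if len(set(column)) > 1]


# Illustrative data: two agent trials, three judge runs each, five grading notes.
trial_1 = [[True, True, True, False, False]] * 3
trial_2 = [
    [True, True, False, False, False],
    [True, True, False, False, False],
    [True, True, True, False, False],
]

summaries = [summarize_trial(t) for t in (trial_1, trial_2)]
for s in summaries:
    print(round(s["expected_progress"], 2), round(s["judge_variance"], 4))
print(unstable_notes([trial_1, trial_2]))  # [2]
```

In this toy data the third note is achieved in one trial and missed in the other, which makes it the natural place to start reading trajectories.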

This is the part business teams should pay attention to. A single evaluation score tells you whether the system is good. Error clusters tell you what to fix.

In the paper’s $\tau^2$-bench airline sample 14, for example, the analysis identifies a problematic subgoal: the agent should cancel a reservation. Some trajectories cluster around expected progress values of 0.6, others around 0.4, suggesting repeated failures around specific subgoals. In one trajectory, the agent fails to check that an existing flight is basic economy and does not cancel the previous flight when rescheduling, causing a payment discrepancy. In another, the agent hallucinates a payment value of $2,613.00, which blocks a booking call and cascades into later errors.

This kind of failure is familiar to anyone who has audited automation logs. The system does not collapse dramatically. It makes one wrong assumption, skips one tool, invents one value, and the rest of the workflow quietly inherits the damage. Enterprise failures rarely wear capes.

TED’s contribution is to make this pattern easier to surface. It does not require predefined error categories. It derives candidate errors from judge explanations and failed subgoals, then clusters them. That makes the diagnostic layer portable across tasks, although not magic. The quality of the diagnosis still depends on the quality of the grading notes, trajectories, and judge behavior.
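
The grouping step can be illustrated with a crude stand-in. The sketch below clusters error descriptions greedily by token overlap (Jaccard similarity), which is not semantic clustering and is not the paper's method; it only shows the shape of the step, where free-text low-level errors go in and a handful of categories come out. The error strings and the 0.5 threshold are invented for illustration.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b)


def greedy_cluster(errors: List[str], threshold: float = 0.5) -> List[List[str]]:
    """Attach each error to the first cluster whose first member is similar enough."""
    clusters: List[List[str]] = []
    for err in errors:
        for cluster in clusters:
            if jaccard(tokens(err), tokens(cluster[0])) >= threshold:
                cluster.append(err)
                break
        else:
            # No existing cluster matched, so this error starts a new one.
            clusters.append([err])
    return clusters


# Hypothetical low-level errors, loosely modeled on the failures described above.
errors = [
    "agent did not cancel the previous reservation",
    "agent did not cancel the existing reservation before rebooking",
    "agent invented a payment amount",
    "agent invented a payment amount for the booking call",
]

for cluster in greedy_cluster(errors):
    print(len(cluster), "x", cluster[0])
```

A real pipeline would replace the token overlap with a semantic comparison, which is exactly where the quality of the judge explanations starts to matter.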

Error-informed prompts improve performance, but not uniformly enough to call it solved

The paper tests whether identified errors can be fed back into agent design. This is an implementation experiment rather than the main conceptual contribution, and it should be read that way.

The authors try several in-context approaches: direct Error Insertion, manually refined Human Notes, the HiTEC-ICL method using generic global errors, and a variant that replaces generic errors with TED-discovered errors. On ToolSandbox, most approaches improve a majority of metrics. On the special low-performing $\tau^2$-bench airline split, the trend is mixed.

Two results are worth separating.

First, TED-discovered errors can help. The paper reports that Error Insertion improves several setups, including gpt-4o-mini on $\tau^2$-bench with gains of +9% in one proposed metric and +5% in another. Human Notes produce more consistent improvements in some settings, including 7–10% gains for gpt-4.1 on ToolSandbox.

Second, error feedback is not a universal patch. For gpt-4.1 on the $\tau^2$-bench samples, the authors do not find a clear trend showing one error-incorporation method consistently outperforming the others. The paper is careful here: it evaluates whether TED errors are useful, but it does not claim to introduce a new best prompting method.

That boundary matters. The business takeaway is not “paste the error clusters into the system prompt and enjoy reliable automation.” That would be adorable, in the way a spreadsheet named final_FINAL_v7.xlsx is adorable.

The better takeaway is that automated error discovery can shorten the evaluation-to-debugging loop. It can turn raw benchmark failures into structured hypotheses for prompt design, tool-contract changes, user clarification policies, and workflow guardrails.
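
For the simplest of these approaches, direct insertion of discovered errors into the prompt, a sketch might look like the following. The prompt wording, the cap on the number of items, and the function name are assumptions for illustration; the paper's actual prompt format is not reproduced here.

```python
from typing import List


def insert_errors(base_prompt: str, error_categories: List[str], max_items: int = 5) -> str:
    """Append discovered error categories to a system prompt as things to avoid."""
    if not error_categories:
        return base_prompt  # nothing discovered, leave the prompt untouched
    bullets = "\n".join(f"- {category}" for category in error_categories[:max_items])
    return f"{base_prompt}\n\nKnown failure patterns to avoid:\n{bullets}"


# Hypothetical categories, loosely modeled on the airline failures described earlier.
categories = [
    "Skipping cancellation of the old reservation when rescheduling",
    "Stating a payment amount that no tool returned",
]

prompt = insert_errors("You are an airline support agent.", categories)
print(prompt)
```

The cap matters more than it looks: an ever-growing list of warnings is its own failure mode, so each added category should earn its place by moving the target metric on a re-run.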

What the appendix adds: robustness, not a second thesis

The appendix is not decoration. It contains several tests that affect how confidently the main results should be interpreted.

Test or appendix material	Likely purpose	What it supports	What it does not prove
Human validation of user proxy	Reliability check	Expert and non-expert user simulations behave mostly as intended, with instruction-following errors around 6–12%	That simulated users fully represent real human users
Human validation of LLM-as-judge	Reliability check	Majority-vote judge outputs often agree with human raters; Cohen’s Kappa ranges from 0.60 to 0.94 across reported settings	That LLM judging is always reliable or unbiased
Human validation of TED error labels	Reliability check	TED-identified errors show limited disagreement with human reference annotations, roughly 6–23% depending on model and dataset	That automated error clustering is perfectly stable or exhaustive
Additional $\tau^2$-bench hard-sample results	Robustness / difficulty extension	Harder tasks reduce completion and expose stronger differences in efficiency metrics	That the same metric behavior will hold in every domain
User-model ablation	Sensitivity test	Changing the user proxy model affects the expert/non-expert gap; stronger user models may behave more expert-like even when prompted as non-experts	That persona prompts alone fully control user expertise
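
For readers who want to run the same kind of judge-versus-human check on their own data, Cohen's Kappa is straightforward to compute: observed agreement, corrected for the agreement two raters would reach by chance. The labels below are invented and have nothing to do with the paper's reported values.

```python
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        (list(rater_a).count(label) / n) * (list(rater_b).count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical subgoal verdicts: 1 = achieved, 0 = not achieved.
human = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]

print(round(cohen_kappa(human, judge), 3))  # 0.583
```

Here the raters agree on 8 of 10 items, but chance alone would give 0.52, so the kappa is well below the raw 0.8 agreement rate.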

The user-model ablation is particularly interesting. When the user proxy uses a stronger model, such as gpt-5 in the paper’s setup, the non-expert persona becomes closer to expert behavior. That is not a minor technical footnote. It means simulated user expertise is partly controlled by the persona prompt and partly by the capability of the model playing the user. In other words, even the fake clueless user may be too competent if the model underneath is powerful enough. The benchmark user, like many consultants, may be pretending not to know things.

This does not invalidate TED. It clarifies how to use it. User simulation should be treated as a controllable experimental component, not as a faithful copy of real customer behavior.

The business value is cheaper diagnosis, not prettier benchmarking

For a company deploying agents into real workflows, TED suggests a practical evaluation stack.

Business question	TED-style evaluation response
Does the agent complete the task?	Measure final progress and pass@$k$-style success
Does it work with messy users?	Run the same task under expert and non-expert personas
Does it make progress efficiently?	Compare AUC and PPT, not only final completion
Does it fail consistently or randomly?	Inspect sample-level expectation and variance across trials and judge runs
What should we fix first?	Cluster recurring subgoal failures into operational error categories

This is not just an academic improvement. It maps directly to deployment governance.

For customer service agents, user persona variation tells you whether the bot can handle vague customers without escalating everything. For finance operations agents, grading notes can encode required tool calls and approval-order constraints. For HR or procurement workflows, turn-aware metrics can show whether an agent is creating unnecessary friction. For internal IT agents, error clusters can reveal missing tool usage, wrong arguments, or premature confirmations.

Cognaptus would interpret the paper’s practical pathway like this:

Convert workflow requirements into grading notes before deployment.
Simulate multiple user types, including vague and incomplete users.
Run multiple trajectories, because a single clean demo is a sales artifact, not evidence.
Track progress, early progress, and progress per turn.
Use automated diagnosis to create error categories.
Feed the categories into prompt rules, tool schemas, clarification policies, or guardrails.
Re-run the evaluation and check whether the fix actually improves the target metric.

Notice what is missing from that list: “choose the model with the highest leaderboard score and move on.” That approach remains popular because it is simple. So is buying insurance after the fire.

The limits are real: trajectory-visible success is not the same as system-state truth

TED’s most important limitation is also one of its design strengths. Because grading notes are judged from trajectories, the framework does not always need direct access to the underlying environment. That makes it more portable across domains. It also means it can miss silent failures when the system state changes incorrectly or fails to change, and that failure is not reflected in the trajectory.

If the agent says it updated a database, and the trajectory does not expose the actual database mutation, a judge may not be able to verify the truth. TED can catch many tool-use and response-level failures, but it cannot replace environment-state validation where the state is legally, financially, or operationally decisive.

A serious deployment should therefore combine TED-style trajectory evaluation with direct state checks whenever possible. The division is sensible:

Use grading notes and LLM judging to evaluate conversational progress, tool-call logic, ordering, and user communication.
Use deterministic checks to validate final database state, transaction integrity, permissions, and audit requirements.
Use diagnosis to explain failure patterns after either kind of check exposes a problem.
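
A minimal sketch of how the two kinds of check might be combined, assuming the environment state can be read back as a simple key-value snapshot. The field names, verdict labels, and function names are hypothetical.

```python
from typing import Any, Dict


def state_mismatches(expected: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, dict]:
    """Deterministic check: every expected field must match the real system state."""
    return {
        key: {"expected": value, "actual": actual.get(key)}
        for key, value in expected.items()
        if actual.get(key) != value
    }


def combined_verdict(trajectory_progress: float, mismatches: Dict[str, dict]) -> str:
    """Combine judged trajectory progress with the deterministic state check."""
    if mismatches and trajectory_progress == 1.0:
        return "silent_failure"  # the conversation looked complete, the state says otherwise
    if mismatches:
        return "state_failure"
    if trajectory_progress < 1.0:
        return "incomplete"
    return "pass"


# Hypothetical case: the judge saw every subgoal narrated as done,
# but the reservation was never actually cancelled.
expected_state = {"reservation_status": "cancelled"}
actual_state = {"reservation_status": "confirmed"}

mismatches = state_mismatches(expected_state, actual_state)
print(combined_verdict(1.0, mismatches))  # silent_failure
print(mismatches)
```

The "silent_failure" branch is the one trajectory-only evaluation cannot reach on its own, which is the argument for running both checks.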

Another boundary is simulated users. Expert and non-expert proxies are useful experimental controls, but real users bring accents, impatience, domain misconceptions, emotional pressure, malicious behavior, and organizational context. TED gives a better testing harness. It does not eliminate user research.

Finally, LLM-as-a-judge remains a probabilistic component. The paper mitigates this with multiple judge runs, majority voting, variance analysis, and human validation. That is a solid design. It is not a divine court.

From agreeable agents to accountable agents

The title of this article says “Aligned, or Just Agreeable?” TED is not primarily an alignment paper in the classic safety sense. It does not ask whether a model has internal values aligned with humanity, which is fortunate, because enterprise procurement teams already have enough theology disguised as vendor evaluation.

But it does expose a related operational failure. An agent can be agreeable without being accountable. It can cooperate conversationally while failing procedurally. It can sound aligned with the user’s intent while missing the actual subgoals that define success.

TED’s mechanism-first contribution is to make that gap visible:

Talk shows how user behavior changes agent performance.
Evaluate turns heterogeneous tasks into graded subgoals and turn-aware metrics.
Diagnose converts repeated failure into repairable error categories.

For businesses, that is the difference between asking “Does this model seem smart?” and asking “Under which user conditions, across which workflow steps, with what failure patterns, does this agent actually work?”

The first question gets you a demo.

The second gets you an evaluation system.

The industry has enough demos. It could use more evaluation systems.

Cognaptus: Automate the Present, Incubate the Future.

¹ Penny Chong, Harshavardhan Abichandani, Jiyuan Shen, Atin Ghosh, Min Pyae Moe, Yifan Mai, and Daniel Dahlmeier, “Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis,” arXiv:2603.15483v1, 16 March 2026. https://arxiv.org/abs/2603.15483
