TL;DR for operators
AI reliability is increasingly a process problem, not an answer-checking problem.
Three recent arXiv papers make that point from very different angles. MoCo-EA shows that adversarial examples are not merely isolated malicious pixels lurking in the shrubbery; they can lie along continuous, optimisable paths.[1] ConceptAgent shows that erasing a concept from a diffusion model may disrupt the early text-to-image link while leaving later trajectory dynamics available for concept re-entry.[2] BlueFin shows that LLM agents doing finance spreadsheet work fail in ways that only appear when you inspect formulas, recalculation behaviour, workbook mutations, tool choices, and whether the output helps a human analyst do useful work.[3]
The shared lesson is simple and mildly inconvenient: the final artefact is not enough. A spreadsheet may look polished while formulas are hardcoded. An image model may appear censored while intermediate denoising states still carry exploitable structure. An adversarial attack may look like a point failure while the surrounding path is the real vulnerability.
For managers, this means AI assurance must move from endpoint inspection to trajectory inspection. You need workflow logs, intermediate-state tests, perturbation checks, recalculation gates, cost-quality traces, and red-team probes that ask not only “what did the model produce?” but “how did it get there?”
Naturally, this is harder than looking at a pretty demo. That is usually where the useful work starts.
The endpoint is no longer the evidence
Most business AI evaluations still have a suspiciously simple structure. Give the model a task. Look at the answer. Score the answer. Maybe run a few examples. Declare the system “production-ready”, preferably before lunch.
This made some sense when AI systems mostly behaved like prompt-response engines. It makes less sense when systems are agents, tool users, diffusion samplers, code executors, workbook editors, and adversarial surfaces. The object you see at the end is no longer the whole system. It is the residue of a process.
That distinction matters because modern AI failure is often path-dependent. A model can arrive at the right answer for the wrong reason. A spreadsheet agent can generate the correct number once and fail as soon as the input assumption changes. A safety-edited diffusion model can refuse a direct prompt yet still allow the suppressed concept to reappear through a manipulated denoising trajectory. A classifier can survive one perturbation but remain vulnerable along a connected adversarial corridor.
The three papers in this cluster are not about the same artefact. One is about adversarial attacks on classifiers. One is about concept awakening in diffusion models. One is about benchmarking LLM agents on professional finance spreadsheets. They belong together because they attack the same comforting assumption: that AI reliability can be assessed at the endpoint.
It cannot. Not reliably. The path has become part of the product.
The logic chain: from geometry, to generation, to work
The useful way to read these papers is not as three separate summaries. That would be tidy, and therefore less useful. The stronger structure is a complementary chain.
| Step | Paper role | What the paper shows | Business interpretation |
|---|---|---|---|
| 1 | Hidden geometry of risk | Adversarial perturbations can be connected by optimised Bézier paths that preserve or improve attack effectiveness. | Vulnerability is not always a single bad input; it may be a region or route through model behaviour. |
| 2 | Trajectory-level safety failure | Diffusion concept erasure may suppress early text-semantic alignment while later denoising dynamics still allow concept re-entry. | Safety controls that edit visible mappings may miss how behaviour re-emerges through internal process dynamics. |
| 3 | Enterprise workflow evaluation | Finance spreadsheet agents require evaluation through task trajectories, tool use, dynamic correctness, rubric checks, human utility, and cost. | Real deployment needs process telemetry, not just output scoring. |
This is the chain:
- AI failure can live along paths.
- AI safety controls can fail across trajectories.
- AI business evaluation must therefore inspect workflows.
That is the combined argument. It is not “all models are broken”. It is sharper than that: models may look controlled under static inspection while remaining uncontrolled across the process that produces behaviour.
Step one: adversarial risk has geometry
MoCo-EA begins in adversarial machine learning, where the usual goal is to find small perturbations that cause a classifier to misclassify an input. The paper’s core move is to challenge the idea that adversarial examples should be treated as isolated points. The authors study “adversarial mode connectivity”: the existence of continuous paths between successful adversarial perturbations along which attack effectiveness is preserved.
Their method connects two adversarial perturbations with a quadratic Bézier curve. Instead of randomly mixing two parent perturbations through conventional evolutionary crossover, the method optimises a control point so that sampled points along the curve remain adversarial. In simplified form, a quadratic Bézier path between two endpoint perturbations $\delta_1$ and $\delta_2$ can be written as:
$$\gamma(t) = (1-t)^2\,\delta_1 + 2t(1-t)\,c + t^2\,\delta_2, \qquad t \in [0, 1],$$
where $c$ is the learnable control point. The practical point is not the elegance of the curve. The practical point is that the path can be bent through regions where adversarial effectiveness is preserved.
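A minimal sketch of the idea, in plain Python and under toy assumptions. The "classifier" is a hypothetical predicate, `toy_is_adversarial`, that treats any two-dimensional perturbation outside the unit disc as a successful attack, and the control point is hand-picked rather than optimised as in the paper. The function names are illustrative and are not MoCo-EA's code. Setting the control point to the midpoint of the two endpoints reduces the curve to a straight line, which gives a direct comparison with linear interpolation.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def bezier_point(d1: Vector, c: Vector, d2: Vector, t: float) -> List[float]:
    """Quadratic Bezier point: (1-t)^2 * d1 + 2t(1-t) * c + t^2 * d2."""
    a, b, e = (1 - t) ** 2, 2 * t * (1 - t), t ** 2
    return [a * x + b * y + e * z for x, y, z in zip(d1, c, d2)]


def fraction_adversarial(
    d1: Vector,
    c: Vector,
    d2: Vector,
    is_adversarial: Callable[[Vector], bool],
    samples: int = 11,
) -> float:
    """Share of evenly spaced points on the curve that still count as attacks."""
    ts = [i / (samples - 1) for i in range(samples)]
    hits = sum(is_adversarial(bezier_point(d1, c, d2, t)) for t in ts)
    return hits / samples


def toy_is_adversarial(p: Vector) -> bool:
    """Toy assumption: the attack succeeds anywhere outside the unit disc."""
    return p[0] ** 2 + p[1] ** 2 >= 1.0


delta_1, delta_2 = [-1.5, 0.0], [1.5, 0.0]
midpoint = [(x + y) / 2 for x, y in zip(delta_1, delta_2)]  # yields a straight line
bent_control = [0.0, 4.0]  # hand-picked for the toy; the paper optimises this point

linear_share = fraction_adversarial(delta_1, midpoint, delta_2, toy_is_adversarial)
bent_share = fraction_adversarial(delta_1, bent_control, delta_2, toy_is_adversarial)
print(f"linear path: {linear_share:.2f}, bent path: {bent_share:.2f}")
```

In this toy setup the straight line stays adversarial at only 4 of the 11 sampled points, while the bent path stays adversarial at all 11. Real attack surfaces are not discs, but the testing lesson carries over: sample along the route, not just at its ends.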
The authors compare this geometry-aware crossover with a traditional evolutionary attack baseline. Their reported results show that MoCo-EA achieves near-perfect success across tested perturbation norms on CIFAR-10 and ImageNet, while reducing generations, queries, and runtime substantially relative to the baseline. They also report that optimised Bézier paths preserve adversarial connectivity much better than simple linear interpolation, especially in multi-image and cross-class settings.
For security teams, the useful insight is not “use Bézier curves in your next quarterly risk report”, although one can imagine worse slide decks. The useful insight is that adversarial risk may occupy connected structure. If you only test isolated inputs, you may miss the corridor. If you only check a few endpoints, you may miss the route that transfers better than either endpoint.
This matters for model assurance because many business controls still resemble endpoint sampling. A few test cases. A few adversarial examples. A pass/fail rate. MoCo-EA suggests that such sampling may understate risk when the adversarial space has navigable structure. Attackers do not need to worship the exact bad example you tested. They can move.
Step two: safety can fail inside the trajectory
ConceptAgent moves from classifiers to text-to-image diffusion models. The paper examines concept erasure, a class of techniques intended to remove or suppress unwanted concepts from pretrained diffusion models. A naïve version of the safety story goes like this: remove the text-to-concept mapping, and the model no longer generates the concept.
The paper’s story is less convenient.
The authors argue that diffusion generation should be understood as a trajectory, not a one-step mapping from prompt to image. In their framing, generation is influenced by both text-conditioned estimates and semantic information accumulating in the evolving noisy state. Early in the denoising process, text conditioning is more influential because the noisy state contains little semantic structure. Later, the evolving state becomes more semantically informative, and the process shifts toward refinement of already-established content.
That creates a safety gap. Concept erasure may disrupt early text-semantic alignment, but it may not fully prevent semantic information from propagating or re-entering through the denoising dynamics. ConceptAgent exploits this gap under black-box constraints. It uses surrogate concepts that preserve visual attributes of the erased target, constructs surrogate-guided noisy intermediate states, then steers denoising from those states into regions where the erased concept can be awakened.
The multi-agent packaging is almost theatrical: Strategist, Guesser, Director, Referee. But beneath the naming, the mechanism is clear. The system does not ask the erased model directly for the target concept. It constructs an intermediate state carrying related visual structure, then lets the trajectory do the rest.
The paper reports that ConceptAgent awakens erased concepts across tested erasure methods and multimodal agent backbones, using classification accuracy and CLIP similarity as evaluation metrics. It also extends evaluation to safety-critical concepts, which is where the work becomes less “interesting vulnerability” and more “please check your deployment assumptions before making confident policy statements”.
The business implication is blunt: safety edits that operate on visible mappings may not control the full generative process. A governance team can say “we blocked the prompt” and still fail to control the trajectory. That is not a philosophical distinction. It is a product risk.
If a platform sells brand-safe image generation, medical content filtering, IP-sensitive creative controls, or adult-content suppression, endpoint refusal is not sufficient evidence. The evaluation has to test whether suppressed content can re-enter through intermediate states, surrogate prompts, image conditioning, latent manipulation, tool-mediated workflows, or multi-step composition. Safety is not a label on the front door. It is the locking mechanism in the hallway.
Step three: enterprise agents fail through workflows
BlueFin brings the same process-level argument into the business domain. It benchmarks LLM agents on financial spreadsheet tasks: synthesis, manipulation, and interrogation of workbooks. This is refreshingly practical. Finance people do not need a benchmark where an AI writes a haiku about EBITDA. They need to know whether it can modify a debt schedule without quietly turning formulas into decorative fiction.
BlueFin contains 131 complex tasks and 3,225 granular rubric criteria, created with input from finance-domain contributors. The benchmark uses an agentic spreadsheet harness with tools for reading workbook state, writing cells, formatting, creating or deleting sheets, executing constrained Python over openpyxl, and explicitly recalculating workbooks. The grading agent inspects not only static workbook structure but also dynamic behaviour: for example, whether changing an input causes the correct downstream output to update within tolerance.
That last point is the heart of the paper. Spreadsheet correctness is not only whether a value looks right today. It is whether the workbook remains correct when assumptions change. In finance, a number without a live formula is often not an answer. It is a taxidermied answer.
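A toy illustration of that distinction, not BlueFin's grading harness. The workbook here is a hypothetical dictionary in which a cell is either a constant or a Python function standing in for a formula, and `recalculate` only handles formulas that depend directly on input cells. Cell names, values, and the tolerance are invented for the example.

```python
from typing import Callable, Dict, Union

Cell = Union[float, Callable[[Dict[str, float]], float]]
Workbook = Dict[str, Cell]


def recalculate(workbook: Workbook) -> Dict[str, float]:
    """Evaluate constants first, then formulas that read those constants."""
    values = {name: cell for name, cell in workbook.items() if not callable(cell)}
    formulas = {name: cell for name, cell in workbook.items() if callable(cell)}
    for name, formula in formulas.items():
        values[name] = formula(values)
    return values


def passes_perturbation(
    workbook: Workbook,
    input_cell: str,
    new_value: float,
    output_cell: str,
    expected: float,
    tol: float = 1e-6,
) -> bool:
    """Change one input, recalculate, and compare the output within tolerance."""
    changed = dict(workbook)
    changed[input_cell] = new_value
    return abs(recalculate(changed)[output_cell] - expected) <= tol


# Two workbooks that show the same profit today.
live = {"revenue": 1000.0, "margin": 0.2, "profit": lambda v: v["revenue"] * v["margin"]}
hardcoded = {"revenue": 1000.0, "margin": 0.2, "profit": 200.0}

for name, wb in [("live", live), ("hardcoded", hardcoded)]:
    static_ok = abs(recalculate(wb)["profit"] - 200.0) <= 1e-6
    dynamic_ok = passes_perturbation(wb, "revenue", 1500.0, "profit", 300.0)
    print(f"{name}: static check {static_ok}, perturbation check {dynamic_ok}")
```

Both workbooks pass the static check. Only the one with a live formula survives the change in revenue, which is the failure a value-only review cannot see.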
The results are sobering. The paper reports that no tested frontier model exceeds 50% overall performance on the held-out benchmark. Models perform better on some structural or syntactically checkable aspects, but struggle with output validation and dynamic correctness. The authors identify recurrent issues such as sign errors, date-axis misalignment, rate-versus-amount confusion, discounting timing errors, hardcoded values, and failure to proactively recalculate workbooks.
BlueFin also studies behaviour and cost. Some models are more expensive without being uniformly better. GPT-5.5 often uses Python execution early in manipulation tasks, reducing turn count but creating verification challenges if recalculation is not explicit. Some weaker models fail to call recalculation proactively. Sonnet occasionally performs read-only exploration and then submits an unchanged workbook. The politest possible term for this is “unhelpful”. The less polite term is “intern behaviour with API billing”.
The business point is not that spreadsheet agents are useless. It is that judging them by final appearance is dangerous. A workbook can look professional and still fail the moment an input changes. A model can produce a useful partial build but require human review. A costly model can be dominated by a cheaper one on a given task regime. The value is in the workflow trace: what the agent read, wrote, recalculated, verified, skipped, and broke.
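As a sketch of what reading such a trace can look like, the snippet below checks a list of agent actions for two of the patterns described above: submitting without changing anything, and changing the workbook without recalculating afterwards. The action labels and the `audit_trace` helper are assumptions made for illustration; a real harness will log richer events than a single string per step.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical action labels; a real tool log will have its own vocabulary.
MUTATING_ACTIONS = {"write", "python"}


@dataclass
class Step:
    action: str  # e.g. "read", "write", "python", "recalculate", "submit"


def audit_trace(steps: List[Step]) -> List[str]:
    """Return plain-language findings for two common workflow failures."""
    actions = [step.action for step in steps]
    findings: List[str] = []

    change_positions = [i for i, a in enumerate(actions) if a in MUTATING_ACTIONS]
    if not change_positions:
        findings.append("no workbook changes before submission")
        return findings

    # Any change made after the last recalculation leaves stale values behind.
    after_last_change = actions[change_positions[-1] + 1:]
    if "recalculate" not in after_last_change:
        findings.append("no recalculation after the final change")
    return findings


read_only = [Step("read"), Step("read"), Step("submit")]
stale = [Step("read"), Step("python"), Step("submit")]
clean = [Step("read"), Step("write"), Step("recalculate"), Step("read"), Step("submit")]

for name, trace in [("read_only", read_only), ("stale", stale), ("clean", clean)]:
    print(name, audit_trace(trace))
```

Checks like these are cheap to run on every submission and say nothing about whether the numbers are right. Their job is narrower: to flag runs where the final workbook should not be trusted without a closer look.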
What these papers show, and what they do not
It is worth separating evidence from interpretation.
The papers show:
| Claim | Supported by |
|---|---|
| Adversarial perturbations can be connected through optimised paths that preserve attack effectiveness. | MoCo-EA |
| Intermediate points along adversarial paths can improve transferability relative to endpoints. | MoCo-EA |
| Diffusion concept erasure can leave trajectory-level vulnerabilities that permit concept awakening. | ConceptAgent |
| Black-box surrogate-guided intermediate states can bypass disrupted text-to-concept mappings in tested settings. | ConceptAgent |
| Finance spreadsheet agents should be evaluated through dynamic correctness, formulas, perturbation tests, tool use, human utility, and cost. | BlueFin |
| Frontier LLM agents still struggle on realistic finance workbook tasks. | BlueFin |
The business interpretation is:
| Interpretation | Why it follows |
|---|---|
| Endpoint-only evaluation is inadequate for high-value AI deployment. | All three papers identify important behaviour that appears along paths, trajectories, or workflows. |
| AI governance should include process telemetry. | The relevant failure signals include intermediate states, tool actions, recalculation steps, and trajectory manipulation. |
| Procurement should benchmark task regimes, not model brands. | BlueFin shows cost-performance differences and model-specific behaviours; MoCo-EA and ConceptAgent show that capability and risk depend on interaction mechanics. |
| Safety controls should be tested against process-level bypasses. | ConceptAgent and MoCo-EA both show that manipulating the path can matter as much as manipulating the final input. |
The papers do not prove that one universal mathematical mechanism governs all AI systems. A Bézier path in adversarial perturbation space is not the same object as a denoising trajectory, and neither is the same object as an LLM agent’s tool log. Anyone who pretends otherwise is not synthesising research; they are doing interpretive jazz.
The synthesis is more modest and more useful: across distinct AI systems, endpoint evaluation misses important behaviour. That is enough to change how businesses should test, buy, and govern AI.
The operator’s framework: path-level assurance
A process-aware AI governance programme does not need to become a research lab. It does need to stop treating final outputs as the only evidence.
A practical framework has five layers.
| Layer | Question | Example control |
|---|---|---|
| Output | Is the final artefact correct, safe, and useful? | Rubric scoring, human review, value checks |
| Structure | Does the artefact remain valid under change? | Formula inspection, dependency checks, perturbation testing |
| Trajectory | How did the system reach the output? | Tool-call logs, intermediate-state snapshots, denoising or reasoning traces where available |
| Robustness | What happens under adversarial or off-distribution pressure? | Red-team paths, surrogate prompts, input perturbations, workflow stress tests |
| Economics | Was the quality worth the cost and latency? | Cost-per-success, retry rate, human rework time, escalation rate |
The word “trajectory” will mean different things across systems. For an LLM spreadsheet agent, it means workbook reads, writes, Python calls, recalculation events, and final submission. For a diffusion model, it may mean denoising states, prompt conditioning, image conditioning, or latent interventions. For a classifier, it may mean perturbation paths through the loss landscape. For a business process, it may mean the chain of API calls, retrieved documents, transformations, approvals, and handoffs.
The common requirement is traceability. You cannot govern what you cannot observe. You can only hope it behaves, which is an adorable strategy until procurement signs the contract.
How this changes AI procurement
Many AI buying processes still ask the wrong questions. They ask:
- Which model is best?
- What is the benchmark score?
- Does the demo look good?
- Can it automate this workflow?
- Is it safe?
Those questions are not useless. They are incomplete. A process-aware procurement review asks:
| Old question | Better question |
|---|---|
| Does the output look right? | Does it remain right when assumptions change? |
| Is the model accurate? | Which task regimes expose failure, and why? |
| Is the system safe? | Which bypass paths have been tested? |
| Is the model expensive? | What is the cost per usable output after review and rework? |
| Can it use tools? | Does tool use produce auditable, reversible, and validated state changes? |
| Does it pass a benchmark? | Does the benchmark resemble our workflow, data, constraints, and failure costs? |
BlueFin is especially important here because it makes the cost of realistic evaluation visible. Building meaningful benchmarks for professional workflows is expensive. The paper reports substantial contributor and review effort per task. That is not a flaw. That is the price of measuring work that actually matters.
Cheap benchmarks are often cheap because they measure toy problems, static answers, or conveniently verifiable outputs. Real business work contains domain conventions, brittle integrations, hidden dependencies, tolerance thresholds, formatting expectations, and human usability requirements. This is why “the model got the number right” is not enough. The number may be hardcoded. The number may be right for the current case and wrong for every scenario that matters. The number may be surrounded by a workbook that no analyst wants to touch without gloves.
How this changes AI safety review
The ConceptAgent paper should make safety teams particularly wary of static control claims. “We erased the concept” is a stronger claim than “we suppressed direct prompt activation under our test conditions”. The second is more accurate. The first fits better in marketing copy, which is usually the problem.
A process-aware safety review should test:
- Direct prompts.
- Paraphrases and surrogate concepts.
- Multi-step workflows.
- Image or file conditioning.
- Intermediate representations where accessible.
- Tool-mediated transformations.
- Composition attacks.
- Retry and selection strategies.
For diffusion systems, this means testing not just whether a forbidden prompt is rejected or softened, but whether related visual structure can be introduced through permitted concepts, partial generations, masks, latent states, reference images, or downstream editing. For LLM systems, it means testing whether a prohibited output can be assembled across steps through summaries, transformations, tool calls, or memory. For classifiers, it means testing not only known adversarial examples but structured neighbourhoods and paths.
The underlying control principle is simple: safety claims should be phrased in terms of tested pathways. A system is not “safe”. It is robust against specified classes of attempts under specified access assumptions, tools, and operational constraints. Less glamorous, more true. A fair trade.
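One way to keep claims honest is to store them in the shape they should be stated. The sketch below is an assumed record format, not a standard: the `SafetyEvidence` class, its fields, and the sample results are hypothetical, and the pathway names simply mirror the checklist above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Pathway names mirror the review checklist in this section.
PATHWAYS = [
    "direct prompts",
    "paraphrases and surrogate concepts",
    "multi-step workflows",
    "image or file conditioning",
    "intermediate representations",
    "tool-mediated transformations",
    "composition attacks",
    "retry and selection strategies",
]


@dataclass
class SafetyEvidence:
    access_assumption: str
    # pathway name -> True if the control held under test, False if bypassed
    results: Dict[str, bool] = field(default_factory=dict)

    def untested(self) -> List[str]:
        return [p for p in PATHWAYS if p not in self.results]

    def claim(self) -> str:
        held = [p for p, ok in self.results.items() if ok]
        bypassed = [p for p, ok in self.results.items() if not ok]
        return (
            f"Under {self.access_assumption}: "
            f"held against {', '.join(held) or 'nothing yet'}; "
            f"bypassed via {', '.join(bypassed) or 'nothing found'}; "
            f"untested: {', '.join(self.untested()) or 'none'}."
        )


evidence = SafetyEvidence(
    access_assumption="black-box API access",
    results={"direct prompts": True, "paraphrases and surrogate concepts": False},
)
print(evidence.claim())
```

The generated sentence is clumsy on purpose. It makes the untested pathways as visible as the tested ones, which is the part a one-word "safe" leaves out.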
The hidden metric: rework
One of the most useful ideas implicit in BlueFin is that enterprise AI should be measured by rework, not merely correctness. A spreadsheet agent that completes 60% of a financial model but leaves dynamic formulas broken may be worse than an agent that completes 40% cleanly, depending on review cost. A generated image that is visually acceptable but violates brand safety under alternate conditioning creates downstream legal or reputational rework. A classifier that resists point attacks but fails along connected perturbation paths creates security rework.
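For operators, the key metric is not:
$$\text{value} = \text{benchmark score}$$
It is closer to:
$$\text{value} \approx \frac{\text{benchmark score} \times \text{usefulness of accepted output}}{\text{model cost} + \text{review cost} + \text{rework cost} + \text{failure remediation cost}}$$
This formula is deliberately plain. It reminds us that the model’s benchmark score is only one term. The denominator is where optimistic AI projects go to become “strategic learnings”.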
A process-aware benchmark should therefore record:
- first-pass success rate;
- number of tool calls;
- number of retries;
- human correction time;
- dynamic correctness under perturbation;
- failure severity;
- escalation frequency;
- cost per accepted output;
- auditability of the path;
- reversibility of state changes.
This is not bureaucratic decoration. It is how you distinguish automation from automated mess production.
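A minimal sketch of how a few of these fields could be aggregated once they are logged. The `Run` record, the reviewer rate, and the sample numbers are invented for illustration and are not taken from BlueFin; the other fields in the list would need their own columns.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    accepted: bool         # did a reviewer accept the final output?
    first_pass: bool       # accepted with no retry or human correction?
    retries: int
    tool_calls: int
    model_cost: float      # model spend for this run, in currency units
    review_minutes: float  # human review and correction time


def summarise(runs: List[Run], reviewer_rate_per_minute: float) -> Dict[str, float]:
    """Aggregate per-run records into rates and a cost per accepted output."""
    n = len(runs)
    accepted = sum(r.accepted for r in runs)
    total_cost = sum(
        r.model_cost + r.review_minutes * reviewer_rate_per_minute for r in runs
    )
    return {
        "first_pass_success_rate": sum(r.first_pass for r in runs) / n,
        "mean_retries": sum(r.retries for r in runs) / n,
        "mean_tool_calls": sum(r.tool_calls for r in runs) / n,
        # Failed runs still cost money, so they stay in the numerator.
        "cost_per_accepted_output": total_cost / accepted if accepted else float("inf"),
    }


runs = [
    Run(accepted=True, first_pass=True, retries=0, tool_calls=12, model_cost=0.40, review_minutes=5),
    Run(accepted=True, first_pass=False, retries=2, tool_calls=30, model_cost=1.10, review_minutes=25),
    Run(accepted=False, first_pass=False, retries=3, tool_calls=41, model_cost=1.50, review_minutes=40),
]
summary = summarise(runs, reviewer_rate_per_minute=1.0)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

With these made-up numbers the model spend is 3.00 in total, yet the cost per accepted output comes to 36.50 once review time and the failed run are counted. That gap is the reason to log human correction time alongside the API bill.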
The uncomfortable boundary
There is one important caveat. Path-level assurance does not mean every AI system must expose every internal state. Some models are closed. Some trajectories are inaccessible. Some intermediate representations are not interpretable. Some logs are too large to store indefinitely. Some vendors will assure you that everything is fine, which is kind of them.
The point is not that every organisation needs perfect visibility. The point is that evaluation should move as far upstream into the process as the risk justifies.
For low-risk drafting tasks, final review may be enough. For finance models, medical triage, compliance workflows, credit decisions, infrastructure monitoring, content safety, or security-sensitive classification, endpoint review alone is thin evidence. The more expensive the failure, the more process visibility you need.
A reasonable assurance design matches depth to risk:
| Risk level | Evaluation depth |
|---|---|
| Low | Output review, sample checks, basic logging |
| Medium | Rubric evaluation, workflow traces, retry analysis, human rework tracking |
| High | Perturbation tests, dynamic validation, adversarial red-teaming, process telemetry, escalation gates |
| Critical | Formal controls, independent audits, continuous monitoring, incident replay, restricted autonomy |
The mistake is not using AI. The mistake is buying autonomy while evaluating it like autocomplete.
What managers should do next
The practical actions are straightforward.
First, require workflow traces for agentic systems. For any AI system that changes files, calls tools, edits workbooks, sends messages, executes code, or modifies records, the trace is part of the deliverable. Without it, review becomes archaeology.
Second, evaluate dynamic correctness. In spreadsheets, change assumptions and recalculate. In workflows, alter inputs and verify downstream state. In retrieval systems, change the evidence base and inspect whether conclusions update properly. Static correctness is the demo. Dynamic correctness is the job.
Third, red-team paths, not just prompts. Test sequences, surrogate inputs, partial completions, tool combinations, and perturbation neighbourhoods. Attackers and failure modes both enjoy moving around the furniture.
Fourth, measure cost per usable output. Do not compare model prices in isolation. Compare total cost after retries, review, rework, latency, and failure remediation. A cheaper model that produces auditable partial work may beat an expensive model that confidently creates a beautiful liability.
Fifth, build domain-specific rubrics. BlueFin’s lesson is that realistic evaluation requires domain expertise. Generic benchmarks are useful for market gossip. They are less useful for deciding whether an AI analyst should touch your acquisition model.
The larger conclusion
These papers point toward a practical shift in AI management. The next serious layer of AI evaluation will not be only about bigger benchmarks or better final-answer scores. It will be about process evidence.
MoCo-EA shows that adversarial vulnerability can be path-shaped. ConceptAgent shows that generative safety can fail through denoising trajectories. BlueFin shows that business usefulness depends on the workflow between instruction and artefact. Different systems, same managerial headache.
That headache has a name: path dependence.
If the output is the only thing you inspect, you are not evaluating the system. You are evaluating the system’s alibi.
For business operators, the remedy is not panic. It is instrumentation. Log the path. Test the path. Price the path. Govern the path. Then decide how much autonomy the system deserves.
AI assurance is becoming less like grading a final exam and more like auditing a production line. That may sound less glamorous. Good. Glamour is what vendors use when the formula cells are hardcoded.
Cognaptus: Automate the Present, Incubate the Future.
[1] Hyo Seo Kim, Gang Luo, Can Chen, Binghui Wang, Yue Duan, and Ren Wang, “MoCo-EA: Exploiting Adversarial Mode Connectivity for Efficient Evolutionary Attacks,” arXiv:2605.18919, 2026. https://arxiv.org/html/2605.18919
[2] Mengyu Sun, Ziyuan Yang, Zunlong Zhou, Junxu Liu, Haibo Hu, and Yi Zhang, “Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework,” arXiv:2605.18150, 2026. https://arxiv.org/pdf/2605.18150
[3] Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta, Case Winter, George Fang, John Ling, Emma Strubell, and Zach Kirshner, “BlueFin: Benchmarking LLM Agents on Financial Spreadsheets,” arXiv:2605.30907, 2026. https://arxiv.org/html/2605.30907