TL;DR for operators

Runbooks decay. APIs shift, data schemas mutate, file paths move, and the “expert procedure” that worked last quarter starts quietly steering an agent into a wall. The paper behind this article, SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing, asks a useful operational question: can an agent skill be improved when nobody has provided hidden tests, reference answers, task rewards, or expert labels? [1]

The answer is: sometimes, and the “sometimes” is the important part. SkillAudit runs the same task twice, once with the candidate skill and once without it. It then audits the divergence. If the skill causes better tool use, cleaner artifacts, or more compliant output, the system protects that guidance. If the skill causes a bad path, wrong schema, broken formula, or conflicting workflow, the system edits or removes the offending passage. The clever move is not magic self-improvement. It is controlled comparison.

The reported result is substantial: on 89 containerized SkillsBench tasks, SkillAudit reaches 73.9% average task reward, compared with 40.9% for no-skill execution and 56.7% for the static benchmark-provided skill. That is a 17.2 percentage point gain over the expert-authored static skill, under an evolution loop that does not see the real verifier until after evolution ends.

For enterprise use, the lesson is sharp. This is a credible pattern for maintaining agent instructions when procedures have observable contracts: output files, exact headers, schemas, formulas, library calls, paths, version constraints, runtime errors, and reproducible artifacts. It is not a license to let agents invent domain truth. When the missing knowledge is semantic, procedural, or judgment-heavy and leaves no structural footprint, the auditor can stare heroically at the artifacts and still learn approximately nothing. Very brave. Not very useful.

The managerial translation: treat agent skills less like documentation and more like executable operational contracts. The best skills are not tutorials. They are compact, navigable, verifier-readable procedures that tell the agent exactly what must happen and what the environment can check.

The real problem is not writing the skill once

Most enterprise AI failures are not dramatic. They are boring. A workflow instruction says to use one folder, the workspace uses another. A skill says “generate dist/index.html,” the task demands /root/output/index.html. A procedure contains useful advice, but it is buried under a 7,000-line bundle of adjacent nonsense. The agent is not “confused” in the romantic sense. It has been handed a procedural junk drawer.

Agent skills are meant to solve a real problem: frozen models need task-specific procedural knowledge without fine-tuning. A skill can package natural-language guidance, helper scripts, examples, and workflow constraints. This is attractive because it is cheaper and more portable than updating model weights. It also gives teams a concrete artifact to inspect, version, and reuse.

The difficulty is that skills are not static assets. A skill that was correct when written can become incomplete, over-specific, misaligned, or actively harmful after deployment. The usual response is to refine it using feedback. But most refinement systems quietly assume some privileged signal: hidden tests, validation scores, reward functions, expert references, historical failure logs, support tickets, or a human reviewer who kindly does the hard part.

SkillAudit deliberately removes that comfort blanket. It assumes the practitioner has only three things: the task description, the workspace data, and an initial skill. No hidden answer key. No oracle. No reward during optimization. The final benchmark reward exists for the paper’s evaluation, but the evolution loop itself does not get to see it.

That constraint matters because it resembles a large class of operational reality. Many organizations do not have full evaluation harnesses for every internal agent task. They often have a task description, a messy workspace, a half-reliable runbook, and optimism. The last item is rarely sufficient.

The mechanism: compare the skill against its absence

SkillAudit’s central move is paired trajectory auditing. For each iteration, it runs the task twice under comparable conditions:

  1. with the current skill injected into the agent context;
  2. without that skill.

The purpose is not to ask which final answer is correct. Under the ground-truth-free constraint, the system is not allowed to know that. The purpose is to isolate what the skill changes.

A single trajectory confounds task difficulty, model competence, and skill influence. If the agent fails while using a skill, maybe the task is hard. Maybe the model is weak. Maybe the skill is harmful. Maybe Tuesday is cursed. Paired execution at least separates one variable: what happens when this procedural package is present versus absent.
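A minimal sketch of the paired setup, assuming a hypothetical agent interface that takes a task and an optional skill and returns a trajectory. The names `Trajectory`, `paired_run`, `divergence`, and `toy_agent` are illustrative and do not come from the paper. The point is only that the comparison yields a list of differences, not a verdict on correctness.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Trajectory:
    """What one run leaves behind: tool calls made and files produced."""
    tool_calls: list[str]
    artifacts: dict[str, str]  # path -> content


# Hypothetical agent interface: (task, skill or None) -> Trajectory
Agent = Callable[[str, Optional[str]], Trajectory]


def paired_run(agent: Agent, task: str, skill: str) -> tuple[Trajectory, Trajectory]:
    """Run the same task twice: once with the skill, once without it."""
    return agent(task, skill), agent(task, None)


def divergence(with_skill: Trajectory, without_skill: Trajectory) -> dict[str, list[str]]:
    """List what differs between the two runs. This is evidence, not a diagnosis."""
    w_calls, wo_calls = set(with_skill.tool_calls), set(without_skill.tool_calls)
    w_paths, wo_paths = set(with_skill.artifacts), set(without_skill.artifacts)
    return {
        "calls_only_with_skill": sorted(w_calls - wo_calls),
        "calls_only_without_skill": sorted(wo_calls - w_calls),
        "artifacts_only_with_skill": sorted(w_paths - wo_paths),
        "artifacts_only_without_skill": sorted(wo_paths - w_paths),
    }


def toy_agent(task: str, skill: Optional[str]) -> Trajectory:
    """Stand-in agent: a skill that prescribes dist/ steers output to the wrong place."""
    if skill and "dist/" in skill:
        return Trajectory(["npm run build"], {"dist/index.html": "<html>"})
    return Trajectory(["write_file"], {"/root/output/index.html": "<html>"})


with_run, without_run = paired_run(toy_agent, "build the page", "Write output to dist/index.html")
report = divergence(with_run, without_run)
```

Here `report` shows that only the with-skill run wrote to `dist/`. Deciding whether that difference is harmful still requires the task's visible constraints.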

The output is not yet an edit. A trajectory difference is evidence, not diagnosis. The system still needs to know which passage in the skill caused the behavior, whether the behavior was useful, and whether an edit would violate hard task constraints. SkillAudit solves this with two components: PACE and the Anchor Verifier.

| Component | What it does | Operational translation | Where it breaks |
| --- | --- | --- | --- |
| Paired trajectory auditing | Runs the same task with and without the skill | Compare the runbook’s actual behavioral effect, not its literary quality | Weak when the no-skill run is incoherent or offers no useful contrast |
| PACE | Uses 12 evaluator templates across Process Adherence, Artifact Evidence, Consistency, and Effectiveness Delta | Convert divergence into passage-level edit instructions | LLM-generated judgments can drift or over-read weak evidence |
| Anchor Verifier | A deterministic script compiled once from task-visible constraints | Lock hard constraints such as file existence, headers, schemas, recomputable values, and required companions | Narrow by design; it cannot check hidden semantic correctness |
| Refine pipeline | Removes noise from broadly useful skills | Preserve working procedures, prune bloat | Cannot add much new knowledge |
| Repair pipeline | Replaces or deletes passages that conflict with the task | Fix actively harmful procedural guidance | Still depends on observable evidence or a useful without-skill fragment |

The table is the paper in miniature. SkillAudit is not a synthetic expert. It is an audit loop. It manufactures a weak optimization signal by observing whether the skill helps or hurts execution relative to the model’s unaided behavior.

That is a much humbler claim than “the agent learns autonomously,” and a much more useful one.

PACE turns “something changed” into “edit this paragraph”

PACE stands for Process-Aligned Contrastive Evaluation. The name sounds like it has been processed through the usual acronym refinery, but the role is clear. PACE compares the with-skill and without-skill trajectories across four dimensions:

| PACE dimension | Representative question |
| --- | --- |
| Process Adherence | Did the agent follow the skill’s steps and tool-use pattern? |
| Artifact Evidence | Do the produced files satisfy the declared constraints? |
| Consistency | Does the skill match the task’s schema, output format, and workflow? |
| Effectiveness Delta | Did the skill actually improve behavior at divergence points? |

The important detail is anchoring. PACE does not merely say “the skill is bad.” It emits action signals tied to specific quoted segments of the skill document. It can mark passages as surgery targets or protected segments. That turns evaluation into edit guidance.

This distinction is not decorative. Many feedback systems produce a verdict: pass, fail, better, worse. Verdicts are useful for selection, but poor for repair. A maintainer needs to know whether the harmful content is the hardcoded cache path, the obsolete D3 version, the extra tutorial section, the wrong formula-preservation instruction, or the fact that the skill gives no output contract at all.

PACE is the part that tries to say: this exact paragraph caused that exact bad behavior. No paragraph, no surgery. No surgery, no evolution. Apparently even agent self-improvement needs paperwork.
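One way to picture an anchored action signal is as a small record that carries a verbatim quote from the skill. The sketch below is an assumption about shape, not the paper's format: `ActionSignal`, `locate`, and `anchored` are hypothetical names, and the matching rule (first line containing the quote) is deliberately naive. It drops any signal whose quote cannot be found in the skill text, which is the "no paragraph, no surgery" rule in code.

```python
from dataclasses import dataclass
from typing import Optional

DIMENSIONS = ("process_adherence", "artifact_evidence", "consistency", "effectiveness_delta")


@dataclass(frozen=True)
class ActionSignal:
    dimension: str  # one of DIMENSIONS
    action: str     # "surgery" or "protect"
    quote: str      # verbatim text expected to appear in the skill
    evidence: str   # what was observed in the paired trajectories


def locate(skill_text: str, quote: str) -> Optional[int]:
    """Return the 1-based number of the first line containing the quote, or None."""
    for number, line in enumerate(skill_text.splitlines(), start=1):
        if quote in line:
            return number
    return None


def anchored(skill_text: str, signals: list[ActionSignal]) -> list[tuple[int, ActionSignal]]:
    """Keep only signals that name a known dimension and quote text present in the skill."""
    kept = []
    for signal in signals:
        line = locate(skill_text, signal.quote)
        if signal.dimension in DIMENSIONS and line is not None:
            kept.append((line, signal))
    return sorted(kept, key=lambda pair: pair[0])


SKILL = """Use D3 v5 only.
Write the page to dist/index.html.
Keep column labels exactly as given in the task."""

SIGNALS = [
    ActionSignal("artifact_evidence", "protect", "column labels exactly", "with-skill run kept label casing"),
    ActionSignal("consistency", "surgery", "dist/index.html", "task requires a different output path"),
    ActionSignal("consistency", "surgery", "use webpack", "quote does not occur in the skill"),
]

result = anchored(SKILL, SIGNALS)
```

The third signal is discarded because its quote does not occur in the skill, so it cannot direct an edit.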

The Anchor Verifier is a guardrail, not an oracle

PACE is soft. It is LLM-mediated evaluation, so it can drift. Later rounds may become too lenient. A model judging its own traces can develop the institutional memory of a very tired compliance department.

The Anchor Verifier exists to prevent that. It is generated once from the task specification and then locked. It checks only constraints derivable from the task and workspace without ground truth: required files, exact paths, schemas, headers, enumerated values, numeric fields recomputable from available data, and required companion files. If a proposed edit causes an Anchor regression, the system rolls it back.
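A minimal sketch of what two such checks and the rollback test could look like, using only the Python standard library. The file name `report.csv` and the header columns are hypothetical, chosen for illustration; the paper's verifier is generated per task and its actual checks are not reproduced here.

```python
import csv
import tempfile
from pathlib import Path


def check_anchors(workspace: Path, required_csv: str, expected_header: list[str]) -> dict[str, bool]:
    """Deterministic structural checks: the file exists and its header matches exactly."""
    target = workspace / required_csv
    results = {"file_exists": target.is_file(), "header_exact": False}
    if results["file_exists"]:
        with target.open(newline="", encoding="utf-8") as handle:
            header = next(csv.reader(handle), [])
        results["header_exact"] = header == expected_header
    return results


def is_regression(before: dict[str, bool], after: dict[str, bool]) -> bool:
    """An edit regresses if any check that passed before now fails."""
    return any(before[name] and not after.get(name, False) for name in before)


def demo() -> tuple[dict[str, bool], dict[str, bool], dict[str, bool]]:
    expected = ["Package", "Version", "Severity"]  # hypothetical header
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        empty = check_anchors(workspace, "report.csv", expected)
        (workspace / "report.csv").write_text(
            "Package,Version,Severity\nlodash,4.17.0,HIGH\n", encoding="utf-8"
        )
        good = check_anchors(workspace, "report.csv", expected)
        # Same columns, wrong casing: a structural failure the check can witness.
        (workspace / "report.csv").write_text("package,version,severity\n", encoding="utf-8")
        bad = check_anchors(workspace, "report.csv", expected)
    return empty, good, bad


empty, good, bad = demo()
```

Moving from `good` to `bad` counts as a regression and would trigger a rollback. Nothing in these checks says whether the rows themselves are correct.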

The design is intentionally narrow. A broad verifier would become a disguised oracle, or worse, a hallucinated oracle. A narrow verifier cannot solve everything, but its failures are more interpretable. If it says the output file is missing, the output file is missing. If it says a header is wrong, the header is wrong. This is the sort of boring determinism enterprise systems pretend not to need until they very much need it.

The Anchor Verifier also clarifies SkillAudit’s boundary. The system can guard against structural regressions. It cannot certify hidden correctness. If the task requires “patch the vulnerability properly,” and the only visible checks are “a patch file exists” and “Maven compiles,” then the verifier may bless a structurally acceptable but semantically inadequate result. The paper does not hide this. It names observability as the central limit.

Refine and Repair are different because skills fail differently

SkillAudit uses two edit routes.

Refine is conservative. It assumes the skill is broadly useful but cluttered, noisy, redundant, or slightly misaligned. Its default action is subtraction. It protects helped segments, uses a tighter modification budget, and can exit early. It is the “do not break the working thing” pipeline.

Repair is more aggressive. It assumes the skill’s core workflow conflicts with the task. It can replace harmful passages, delete bad content, and fill diagnosed gaps after removing the conflict. It does not pre-lock the same protected core, because the core may be the problem.

This is not just software architecture taste. A single update policy would fail in opposite ways. If all skills are treated conservatively, truly harmful skills never get fixed. If all skills are treated aggressively, working skills are rewritten into rubble. SkillAudit’s routing step tries to decide whether the skill needs pruning or surgery.
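The difference between the two routes can be sketched as two edit policies applied to the same proposed edits. Everything below is an illustrative assumption: the `Policy` fields, the budget numbers, and the simplification that Refine only deletes are not taken from the paper, which describes subtraction as Refine's default rather than its only action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    line: int
    op: str  # "delete" or "replace"


@dataclass(frozen=True)
class Policy:
    name: str
    allowed_ops: frozenset
    respect_protected: bool
    max_edits: int  # hypothetical modification budget


# Assumed settings for illustration only.
REFINE = Policy("refine", frozenset({"delete"}), respect_protected=True, max_edits=2)
REPAIR = Policy("repair", frozenset({"delete", "replace"}), respect_protected=False, max_edits=5)


def admissible(policy: Policy, edits: list[Edit], protected: set[int]) -> list[Edit]:
    """Filter proposed edits down to what the policy permits, within its budget."""
    kept = []
    for edit in edits:
        if edit.op not in policy.allowed_ops:
            continue
        if policy.respect_protected and edit.line in protected:
            continue
        kept.append(edit)
    return kept[: policy.max_edits]


proposed = [Edit(1, "replace"), Edit(2, "delete"), Edit(3, "delete")]
protected_lines = {3}

refine_edits = admissible(REFINE, proposed, protected_lines)
repair_edits = admissible(REPAIR, proposed, protected_lines)
```

Under these assumed settings, Refine keeps a single deletion and leaves the protected line alone, while Repair accepts all three edits, including the replacement.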

The paper does not yet prove the routing heuristic independently. In fact, the conclusion explicitly calls for future work measuring how often Refine versus Repair assignment matches the post-hoc appropriate choice. That matters. The dual-strategy design is plausible and supported by case evidence, but the component-level necessity remains unproven without ablations.

The main result is a large gain, but read it correctly

The headline result is easy to misread. SkillAudit improves average reward, but the result is not “skills always help.” In some domains, the static skill is weaker than no skill. That is precisely why auditing matters.

| Domain | Tasks | No skill (%) | Static skill (%) | SkillAudit (%) | Gain over static (pp) |
| --- | --- | --- | --- | --- | --- |
| Software Engineering | 16 | 53.8 | 40.3 | 78.8 | +38.5 |
| Office & White Collar | 15 | 60.0 | 60.0 | 86.7 | +26.7 |
| Natural Science | 15 | 41.6 | 73.3 | 83.7 | +10.4 |
| Industrial & Physical Systems | 14 | 21.4 | 61.2 | 71.4 | +10.2 |
| Finance & Economics | 9 | 33.3 | 44.4 | 44.4 | +0.0 |
| Mathematics & OR | 8 | 25.0 | 35.6 | 57.5 | +21.9 |
| Cybersecurity | 6 | 41.7 | 63.8 | 66.7 | +2.9 |
| Media & Content Production | 6 | 33.9 | 79.2 | 83.3 | +4.1 |
| Average | 89 | 40.9 | 56.7 | 73.9 | +17.2 |

The Software Engineering result is especially revealing. The static skill baseline is 40.3, below no-skill execution at 53.8. SkillAudit reaches 78.8. That does not mean the agent became a brilliant software engineer by meditation. It means the skill package contained harmful or distracting guidance, and the audit loop could identify and remove enough of it.

Office & White Collar shows another large gain, from 60.0 to 86.7. Mathematics & OR improves from 35.6 to 57.5, a notable gain from a weak static baseline. Finance & Economics is the exception: SkillAudit matches the static skill at 44.4 rather than improving it.
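The reading above can be checked mechanically from the reported per-domain figures. The short script below recomputes the gain column, flags domains where the static skill scores below no skill, and computes a task-weighted mean. That the weighted mean reproduces the Average row to one decimal is an observation about the numbers, not a statement about how the paper computed it.

```python
# (tasks, no_skill, static_skill, skillaudit) per domain, as reported in the table.
ROWS = {
    "Software Engineering": (16, 53.8, 40.3, 78.8),
    "Office & White Collar": (15, 60.0, 60.0, 86.7),
    "Natural Science": (15, 41.6, 73.3, 83.7),
    "Industrial & Physical Systems": (14, 21.4, 61.2, 71.4),
    "Finance & Economics": (9, 33.3, 44.4, 44.4),
    "Mathematics & OR": (8, 25.0, 35.6, 57.5),
    "Cybersecurity": (6, 41.7, 63.8, 66.7),
    "Media & Content Production": (6, 33.9, 79.2, 83.3),
}

# Gain of SkillAudit over the static skill, in percentage points.
gains = {domain: round(row[3] - row[2], 1) for domain, row in ROWS.items()}

# Domains where the static skill is worse than running with no skill at all.
static_hurts = [domain for domain, row in ROWS.items() if row[2] < row[1]]

total_tasks = sum(row[0] for row in ROWS.values())


def weighted_mean(column: int) -> float:
    """Task-weighted mean of one score column, rounded to one decimal."""
    return round(sum(row[0] * row[column] for row in ROWS.values()) / total_tasks, 1)


averages = (weighted_mean(1), weighted_mean(2), weighted_mean(3))
```

Only Software Engineering lands in `static_hurts`, and Finance & Economics is the only domain with a gain of zero.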

The setup matters. The experiment uses 89 runnable SkillsBench tasks across eight consolidated domains. Each task runs in a Harbor container. Evolution and evaluation are separated: the evolution loop uses a stub container and cannot access the pytest verifier or test content; the real verifier runs only after evolution terminates. The agent model used for evolution and post-evolution evaluation is Claude Code with Claude Opus 4.8.

That gives the result its force and its boundary. The paper shows that this mechanism can improve skill artifacts in a controlled benchmark without privileged feedback during evolution. It does not show that the same gains will transfer unchanged across models, enterprise environments, proprietary tools, or workflows where the task description is vague and the workspace is less obliging.

The evidence is strongest where correctness leaves footprints

The most valuable part of the paper is not the average score. It is the boundary analysis.

SkillAudit protects working skills well. Among 59 tasks where the initial skill reward is at least 0.5, evolution preserves the result on 54 tasks, or 92%, each retaining or improving reward. Weak skills are harder. Among 30 tasks where the initial skill reward is below 0.5, evolution lifts 13, or 43%, to a passing state; the rest remain at reward 0.
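The split is simple enough to verify by hand, and the same arithmetic gives the number of weak-skill tasks that stay at reward 0. The variable names are illustrative.

```python
strong_total, strong_preserved = 59, 54  # initial skill reward >= 0.5
weak_total, weak_lifted = 30, 13         # initial skill reward < 0.5


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


preserved_rate = pct(strong_preserved, strong_total)
lifted_rate = pct(weak_lifted, weak_total)
weak_still_zero = weak_total - weak_lifted
```

The two groups together cover all 89 tasks, and 17 weak-skill tasks are left unrepaired.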

That asymmetry is the system’s personality. It is better at not breaking good skills than at inventing missing knowledge for bad ones. This is exactly what we should expect from a ground-truth-free loop. Removing noise is easier than creating truth.

The paper’s strongest explanatory cut is observability, not domain. Skills based on executable, observable knowledge evolve well: library-API usage reaches 79.2% average reward, mathematical methods reach 80.7%, and the combined average over 61 tasks using either is 79.9%. Skills encoding domain procedure are weaker at 69.2% average reward, and they dominate the failure set. Among tasks left at reward 0 after evolution, 77% carry a domain-procedure label, compared with a 65% base rate in the full benchmark.

Task-type labels support the same point. Formatting and generation reach 100%; transformation reaches 88.9%; optimization reaches 80.0%. Search, planning, and repair perform worse at 25.0%, 57.5%, and 45.0%. The reason is not that “finance is hard” or “cybersecurity is hard.” The reason is that some knowledge leaves a witness and some knowledge does not.

A wrong header leaves a witness. A missing output file leaves a witness. A broken API call leaves a witness. A wrong formula often leaves a witness. A mistaken judgment about where to patch a codebase may not. The agent can only audit what the environment exposes. Everything else becomes vibes with extra steps.

The structural edits reveal what agent skills should look like

The paper’s structural analysis reads like a quiet indictment of how people write procedural documentation. Across the 89 tasks, successful edits cluster into a small vocabulary:

  • pruning off-domain skills bundled with relevant ones;
  • removing tutorial prose while keeping executable cores;
  • de-hardcoding paths, versions, and parameters;
  • moving constraints next to the steps they govern;
  • adding verbatim reminders for copy-sensitive outputs;
  • supplying missing input/output contracts;
  • fixing one-character path or filename errors.

The pattern is consistent. SkillAudit moves skill documents toward what an environment can observe and away from what a human reader might find educational. This is a useful slap on the wrist for enterprise teams that confuse “comprehensive documentation” with “good agent instruction.” Agents do not need a seminar on CVSS when the task needs a CSV schema and an offline Trivy command.

One finding is especially practical: none of the 89 initial skill sets contains a top-level index, but 80 of the 89 evolved sets do, often with a keyword-to-skill routing table. The system effectively invents a navigation layer. Multi-file skills need a dispatcher. The agent needs to know what skill to use, when, and what is missing.
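A keyword-to-skill routing table can be as small as a dictionary plus a lookup. The sketch below is an assumption about what such an index might look like: the keywords and file paths are hypothetical, and whole-word matching is just one simple choice. An empty result is itself useful, because it tells the agent that no bundled skill covers the task.

```python
import re

# Hypothetical top-level index: keyword -> skill file.
INDEX = {
    "trivy": "skills/dependency-audit.md",
    "csv": "skills/csv-schema.md",
    "d3": "skills/d3-charts.md",
}


def route(task_description: str, index: dict[str, str]) -> list[str]:
    """Return the skill files whose keyword appears as a whole word in the task."""
    words = set(re.findall(r"[a-z0-9]+", task_description.lower()))
    return sorted({path for keyword, path in index.items() if keyword in words})


covered = route("Run an offline Trivy scan and write a CSV report", INDEX)
uncovered = route("Score a dice game", INDEX)
```

The first task routes to two skill files; the second routes to none, which marks a gap rather than hiding it.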

The length dynamics are also instructive. Line-count change correlates negatively with initial size, at a reported correlation of -0.378 across 89 tasks. Skills below 300 lines gain a median of 15 lines; skills above 1,000 lines lose a mean of 616 lines. Refine tasks show a heavier deletion tail, with a mean line-count change of -260 and a minimum of -7,375 lines. Repair tasks are more targeted, with a mean change of -165 and a minimum of -2,867 lines.

This is not “shorter is always better.” It is “denser is better.” Starve the agent and it adds the missing contract. Flood the agent and it deletes the noise. A good skill is a compact, navigable, execution-oriented contract.

The case studies show why paired auditing matters

The paper’s case studies are not separate proof of the method; they are mechanism illustrations and implementation detail. They explain how the system uses evidence when the abstract loop becomes concrete.

In software-dependency-audit, the task requires offline Trivy scanning of a 1,282-package package-lock.json and a fixed-schema CSV for high and critical vulnerabilities. The static skill works but is bloated with CVSS tutorial content and a hardcoded cache path. PACE flags the cache mismatch and identifies that the with-skill run uses the full offline Trivy flags while the without-skill run omits critical flags. Refine deletes about 35% of skill lines, protects the load-bearing CVSS source-priority loop, and reaches reward 1.0.

In data-to-d3, the initial skill is actively conflicting. It pins a D3 version, prescribes dist/ paths, and discourages live force simulation, while the task requires a specific output path, D3 behavior, and exact column labels. Repair first triggers a regression revert when the without-skill trajectory passes 13/13 Anchor checks but the with-skill trajectory fails on column-name casing. It then shrinks the skill from 188 to 50 lines and adds a verbatim-label reminder. The benchmark evaluation improves from 0.0 to 1.0.

The appendix adds two more useful examples. lab-unit-harmonization shows why “inert” can be a good verdict: the first iteration offers no high-confidence edit, so the system waits. A decisive contrast appears in iteration two, where the with-skill run passes 8/8 checks and the without-skill run misses one SI conversion. exceltable-in-ppt shows latent harm: a formula recalculation instruction is valid in some contexts but destructive for this task, zeroing out formulas that must be preserved. Repair rolls back, deletes the harmful section, and converges.

These examples share one property: the bad instruction leaves a footprint. Wrong path. Wrong label casing. Missing conversion. Destroyed formulas. The auditor has something to grab. That is why the mechanism works.

What each experiment supports, and what it does not

The paper is disciplined enough to distinguish several evidence types. Operators should do the same.

| Evidence block | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Table 3 main benchmark | Main evidence | SkillAudit improves average reward over no-skill and static-skill baselines across 89 tasks | That every component is necessary, or that gains transfer to other models and enterprises |
| High-quality vs low-quality split | Boundary analysis | The method protects already-working skills better than it repairs failing ones | That recovery failures are always due to missing domain knowledge |
| Knowledge-type and task-type cuts | Explanatory boundary analysis | Observability predicts evolution success better than broad domain labels | That the labels are a complete taxonomy of evolvability |
| Structural diff analysis | Exploratory mechanism analysis | Successful edits converge toward compact, navigable, verifier-observable contracts | That every organization should impose the same line-count targets |
| Main case studies A/B | Mechanism illustration and implementation detail | Refine and Repair behave differently under concrete task evidence | General statistical robustness |
| Appendix cases C/D | Exploratory extension and implementation detail | Inert deferral, rollback, and targeted deletion can work in additional settings | Full sensitivity testing |
| Appendix E implementation details | Implementation detail | Shows how Anchor generation, edit priority, protected segments, and degenerate no-skill runs are handled | That these design choices are optimal |

The missing item is just as important: there is no systematic ablation study. The paper itself lists this as future work, including variants without the Anchor Verifier, with a single pipeline instead of dual routing, or with raw trajectory diffs instead of structured PACE output. So the right interpretation is not “every module has been proven load-bearing.” It is “the full architecture works in this benchmark, and the qualitative evidence makes the design plausible.”

That distinction matters if you are deciding whether to build this into an internal agent platform. Copying the entire architecture may be sensible. Assuming every part is independently validated would be premature. The machine is promising; the receipt is not itemized.

The business value is cheaper diagnosis, not autonomous wisdom

For enterprise agent operations, SkillAudit points toward a useful maintenance loop:

  1. store skills as versioned artifacts;
  2. run paired executions against representative workspaces;
  3. compare behavior at divergence points;
  4. localize edits to specific passages;
  5. protect known-good segments;
  6. gate commits with deterministic structural checks;
  7. separate conservative refinement from conflict repair.
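Steps 1 and 6 of the loop above, versioning plus a deterministic commit gate, can be sketched in a few lines. This is a toy under stated assumptions: `evolve` and `toy_anchor_passes` are hypothetical names, the candidates are supplied up front rather than proposed by an auditor, and the scoring function inspects the skill text directly as a stand-in for running the agent and counting passing structural checks. The cap of five iterations follows the figure mentioned in the paper.

```python
from itertools import islice
from typing import Callable, Iterable


def evolve(
    skill: str,
    candidates: Iterable[str],
    anchor_passes: Callable[[str], int],
    max_iters: int = 5,
) -> tuple[list[str], list[str]]:
    """Accept a candidate only if it does not lower the count of passing checks."""
    accepted = [skill]  # version history, oldest first
    rejected = []       # candidates rolled back by the gate
    best = anchor_passes(skill)
    for candidate in islice(candidates, max_iters):
        score = anchor_passes(candidate)
        if score < best:
            rejected.append(candidate)  # regression: keep the previous version
            continue
        accepted.append(candidate)
        best = score
    return accepted, rejected


def toy_anchor_passes(skill: str) -> int:
    """Stand-in for 'run the agent with this skill, then count passing checks'."""
    return int("/root/output/index.html" in skill) + int("dist/" not in skill)


INITIAL = "Write the page to dist/index.html."
CANDIDATES = [
    "Write the page to /root/output/index.html.",
    "Write the page to /root/output/index.html, then copy it to dist/.",
]

history, rolled_back = evolve(INITIAL, CANDIDATES, toy_anchor_passes)
```

The first candidate is committed and the second is rolled back, leaving an audit trail of which versions passed the gate and which did not.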

That is not glamorous. It is also the sort of thing that actually reduces operational risk.

The immediate ROI is not “we never need subject-matter experts again.” That sentence should be put in a drawer and ignored until civilization improves. The realistic value is cheaper diagnosis of procedural drift. When a skill breaks because an output path changed, an API version was over-specified, a schema requirement was hidden in a note, or a bloated bundle distracts the agent, paired auditing can localize the problem faster than manual spelunking.

The second value is governance. If skills are treated as editable operational assets, every change can be tied to a trajectory divergence, an Anchor check, and a commit decision. That is a better audit trail than “the prompt seemed improved.” It also makes agent performance less dependent on heroic prompt authors who remember every edge case by force of caffeine.

The third value is authoring discipline. SkillAudit’s edits imply a design standard: write skills like tests would read them. Include exact filenames, output schemas, column names, path constraints, formulas, tool calls, method requirements, and routing indexes. Avoid long tutorials, broad persona bundles, and hardcoded defaults that contradict the task. If the agent needs background theory, fine, but do not bury the executable contract under it like a bureaucratic fossil.

The boundary: ground-truth-free does not mean truth-generating

The likely misconception is that a ground-truth-free skill evolution system can discover missing domain truth by comparing itself to itself. It cannot, at least not reliably. SkillAudit can use observable behavioral evidence. It cannot protect knowledge that leaves no trace, and it cannot repair a procedure whose correctness is invisible to the available checks.

The paper’s own examples make this clear. In a CVE repair task, the auditor can confirm that a patch file exists and Maven compiles. It cannot verify that the patch blocks the exploit at runtime. In financial-modeling-qa, the evolved index notes that dice game scoring algorithms are not covered, but the available signal only confirms that answer.txt exists and contains a number. The system recognizes a gap without being able to fill it. This is progress, in the same way that knowing where the hole is helps, but it is still a hole.

This is where enterprise deployment must be blunt. SkillAudit-like loops are strongest for tasks with externally visible procedural contracts:

| Strong fit | Weak fit |
| --- | --- |
| File transformation | Strategic judgment |
| Schema repair | Legal interpretation without testable constraints |
| Spreadsheet formula preservation | Domain procedures with hidden correctness |
| API/tool workflow correction | Vulnerability fixes without exploit validation |
| Output formatting and routing | Planning where intermediate quality is hard to observe |
| Mathematical methods with checkable outputs | Semantic search where relevance is subjective or unscored |

A practical system should route tasks accordingly. Use ground-truth-free auditing where the environment gives you evidence. Add human review, synthetic tests, expert rules, or production feedback where it does not. The trick is not to worship autonomy. The trick is to know when autonomy has a measuring stick.

What remains uncertain

Several uncertainties matter before treating SkillAudit as a production-ready recipe.

First, model dependence is unresolved. The main evaluation uses Claude Opus 4.8 for evolution and evaluation. The paper’s future work explicitly names cross-model transfer as an open question. A skill evolved by one model may encode procedural knowledge that transfers, or it may encode model-specific habits. Enterprises with heterogeneous agent stacks should not assume portability for free.

Second, component necessity is not established. The architecture includes paired execution, PACE, an Anchor Verifier, Refine/Repair routing, protected segments, version control, and rollback logic. The paper argues for each component, but does not run systematic ablations. That leaves open whether a simpler version could achieve much of the gain, or whether one omitted component would collapse performance.

Third, routing reliability is not measured independently. Refine versus Repair is conceptually clean, but the cost of misrouting remains unclear. A harmful skill routed to Refine may be under-fixed. A useful skill routed to Repair may be over-edited. The paper recognizes this as future work.

Fourth, benchmark realism is partial. The tasks are containerized and executable, which is a strength for measurement. But enterprise workflows often involve permissions, changing external systems, partially specified goals, sensitive data constraints, and social judgment. These conditions may reduce the clarity of paired trajectories and Anchor checks.

Finally, the method requires compute and orchestration. Running paired executions for up to five iterations, evaluating traces with multiple PACE templates, managing versions, and building verifiers is not free. For high-value workflows, the cost may be justified. For low-value one-off tasks, manual repair or a simpler test harness may win. Very tragic for anyone hoping one architecture would solve budget allocation.

The enterprise takeaway: make skills observable before making them autonomous

SkillAudit is best understood as a maintenance architecture for agent procedures. Its contribution is not that agents can become experts without feedback. Its contribution is that useful feedback can sometimes be manufactured from contrast: what the agent does with the skill, what it does without the skill, and what observable artifacts reveal about the difference.

That reframes agent-skill engineering. The goal is not to write impressive documentation. The goal is to write procedures whose claims can be witnessed by execution. A skill should say what file to produce, what schema to satisfy, what path to avoid, what formula to preserve, what API call matters, and what routing decision selects which sub-skill. The more a skill resembles an executable contract, the more evolvable it becomes.

The paper’s most commercially relevant sentence is not its score table. It is the implied rule: a skill is only as evolvable as its weakest important claim is observable. That rule is less glamorous than autonomous self-improvement. It is also more likely to survive contact with an actual enterprise.

In agent operations, that is usually the difference between a demo and a system. One applauds the final answer. The other asks which instruction caused the behavior, whether the artifact proves it, and whether the next edit can be rolled back.

Naturally, the second one is less fun. It is also where the money is.

Cognaptus: Automate the Present, Incubate the Future.


  1. Haowen Gao, Haoran Chen, Can Wang, Shasha Guo, Liang Pang, Zhaoyang Liu, Huawei Shen, and Xueqi Cheng, “SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing,” arXiv:2606.14239v1, 12 June 2026. https://arxiv.org/abs/2606.14239