Metric Freedom: When Your AI Gets Smarter by Doing Less

AI teams like committees.

Not human committees, of course. Those are unfashionable. We now prefer committees made of agents: one agent plans, one verifies, one critiques, one searches, one writes code, one supervises the others, and somewhere in the corner a “coordinator” burns tokens making everyone feel aligned.

This architecture is not stupid. Multi-agent systems solve real problems: they divide labor, preserve specialized expertise, and make complicated workflows easier to inspect. But they also bring the usual committee tax: coordination overhead, fragmented context, brittle phase ordering, and the faint smell of process worship.

So the next idea is obvious: compress the committee into one capable agent with the right tools and skills. Keep the expertise. Remove the meeting.

The inconvenience is that this does not reliably work.

In “From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?”, Binyan Xu, Dong Fang, Haitao Li, and Kehuan Zhang argue that the success of this compression is not mainly determined by the task. It is determined by the evaluation metric.¹ That is the useful part of the paper. Not “single agents beat multi-agent systems,” not “skills are magic,” and not “pipelines are dead.” The sharper claim is this: before deciding how much structure to put into an AI workflow, look at how success is scored.

The paper calls the key quantity Metric Freedom, written as $F$. The name is slightly grand, but the idea is refreshingly practical: some metrics punish behavioral variation sharply, while others tolerate many different routes to a good score. If the metric is rigid, structured skill guidance can help. If the metric is free, the same structure can become a leash.

That is the mechanism. The rest of the paper is evidence, architecture, and boundary conditions around that mechanism.

Metric Freedom asks whether different behavior actually changes the score

The easiest way to understand Metric Freedom is to forget agents for a moment and look at grading.

Suppose an agent solves the same kind of problem several times. Its outputs differ. Sometimes it chooses one method; sometimes another. Sometimes its reasoning trace is long and tool-heavy; sometimes it is short and direct. The question is not merely whether the outputs differ. The question is whether those differences matter to the metric.

For each pair of baseline runs, the paper compares two distances:

- behavioral distance: how different the outputs or reasoning traces are;
- score distance: how different their evaluation scores are.

The authors use a Mantel-style rank correlation between those distance matrices. In simplified form:

$$ F_X = 1 - r_M(X) $$

where $r_M(X)$ is the Spearman rank correlation between behavioral distance and score distance, and $X$ can refer to final outputs or reasoning traces.

Low $F$ means behavioral differences strongly predict score differences. This is a knife-edge metric. If the agent chooses the wrong path, the score moves sharply. High $F$ means behavioral differences do not strongly predict score differences. The scoring surface is flatter. The agent can wander without being punished much.
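The formula is small enough to sketch directly. The following is a minimal illustration, not the paper's implementation: it computes only the correlation statistic (a full Mantel test would also run a permutation test for significance), the helper names `average_ranks`, `spearman`, and `metric_freedom` are made up for this example, and the toy behaviors and scores are synthetic. Returning zero correlation when every distance is identical is also an assumption of this sketch.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant distance vector counts as uncorrelated
    return cov / (var_x * var_y) ** 0.5


def metric_freedom(behaviors, scores, behavior_distance):
    """F = 1 - r_M, computed over all unordered pairs of runs."""
    pairs = list(combinations(range(len(behaviors)), 2))
    d_behavior = [behavior_distance(behaviors[i], behaviors[j]) for i, j in pairs]
    d_score = [abs(scores[i] - scores[j]) for i, j in pairs]
    return 1.0 - spearman(d_behavior, d_score)


# Synthetic toy runs: behavior is a position on a single axis.
behaviors = [0.0, 1.0, 2.0, 3.0]
gap = lambda a, b: abs(a - b)

rigid_scores = [0.0, 1.0, 2.0, 3.0]  # score tracks behavior exactly
flat_scores = [0.8, 0.8, 0.8, 0.8]   # every path scores the same

f_rigid = metric_freedom(behaviors, rigid_scores, gap)  # 0.0
f_flat = metric_freedom(behaviors, flat_scores, gap)    # 1.0
print(f_rigid, f_flat)
```

In real use the behavior distance would be something like an edit distance or embedding distance over outputs or traces; that choice is left open here. The two toy results mark the ends of the spectrum: the rigid scorer lands at $F = 0$, the flat one at $F = 1$.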

That distinction matters because a distilled skill is not just “knowledge.” It is also a constraint. A skill nudges the model toward some paths and away from others. Whether that is useful depends on whether the metric rewards staying inside a narrow corridor.

| Metric regime | What the score does | What structure does | Business interpretation |
| --- | --- | --- | --- |
| Low $F$ | Different paths produce different scores | Guides the agent into the narrow success corridor | Skill engineering is likely worth it |
| Mid $F$ | Some differences matter, others do not | Selective structure helps, but full pipelines become risky | Keep tools and knowledge; treat pipelines carefully |
| High $F$ | Many paths score similarly | Heavy structure restricts useful exploration | Prefer lightweight tools and minimal procedural guidance |

This is already a better decision rule than “multi-agent good” or “single-agent cheaper.” Those slogans are architecture-first. Metric Freedom is scoring-first.

The central example: same task, same runs, opposite metric geometry

The cleanest part of the paper is the causal estimation case.

Causal estimation can be judged in more than one way. One metric is Method Selection Accuracy (MSA): did the system choose the correct causal method, such as instrumental variables, difference-in-differences, regression discontinuity, or ordinary regression? This is rigid. Choosing the wrong method is not “almost right.” It is wrong.

Another metric is Mean Relative Error (MRE): how close is the estimated effect size to the target value? This is more forgiving. Different methods may sometimes converge to similar numerical estimates, especially when the data structure is not cruel enough to expose every methodological sin.

The paper uses the same raw-agent run pairs and evaluates them under these two lenses. Under MSA, behavioral distance maps tightly to score distance: if two runs choose different causal methods, they often receive different correctness outcomes. The paper places MSA in the low-freedom regime, with one illustration at $F \approx 0.10$. Under MRE, the same behavioral diversity is much less predictive of score differences; the paper places MRE in the high-freedom regime, with one illustration at $F \approx 1.03$. (Since $F = 1 - r_M$, a value slightly above 1 simply means the correlation came out slightly negative.)

Same task. Same agent behavior. Different metric topology.

That is the article-level point. If we describe the paper as “multi-agent systems can be distilled,” we miss the actual lesson. The lesson is that a task does not have one natural amount of structure. The amount of useful structure depends on what the metric treats as an error.

The result also explains why earlier skill-distillation findings can look erratic. A skill can deliver a large gain on one metric and little or negative normalized lift on another metric attached to the same task. That is not necessarily experimental noise. It may be the metric telling you that structure is useful for one notion of success and unnecessary for another.

Annoying, yes. But useful annoyance is still useful.

AdaSkill is less a compression trick than a structure budget

The paper’s system, AdaSkill, has two stages. The architecture is less important than the policy it encodes: spend structure only where the metric can pay it back.

Stage 1 converts a multi-agent system into a single-agent skill. But it does not faithfully copy the original multi-agent pipeline. That is the whole point. A faithful compiler would preserve exactly the thing that may be causing the waste: rigid ordering, inter-agent coordination, and procedural scaffolding.

Instead, AdaSkill separates MAS components into categories:

| MAS component | AdaSkill treatment | Why |
| --- | --- | --- |
| Callable tools | Keep | Tools expand capability without necessarily constraining reasoning |
| Domain knowledge | Keep, but convert into conditional reference guidance | Knowledge helps, but should not always become mandatory procedure |
| Pipeline structure | Scale with $F$ | Useful in rigid metrics; harmful when exploration is valuable |
| Coordination mechanisms | Discard | A single model does not need simulated message-passing bureaucracy |

This is where the paper improves on the usual “distill the multi-agent workflow” story. It does not treat the MAS as sacred. It treats it as a messy source of assets: some tools, some knowledge, some useful heuristics, and quite a lot of organizational furniture.

For low-$F$ metrics, the converted skill may preserve more step guidance because the success corridor is narrow. For higher-$F$ metrics, the skill strips away more pipeline structure and leaves the model with tools and reference knowledge. In other words, the skill becomes less like an instruction manual and more like a well-organized shelf.

That distinction is not cosmetic. The ablation results identify pipeline structure as the main source of harm at high Metric Freedom. Tools and knowledge provide relatively consistent gains; pipeline imposition is what turns helpful guidance into exploration tax.
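The component policy in this section can be written as a small lookup. This is a sketch under stated assumptions rather than AdaSkill's actual code: the `Kind` and `Component` types, the `treatment` and `plan` functions, the component names, and especially the `LOW_CUT` and `HIGH_CUT` cutoffs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    TOOL = "tool"
    KNOWLEDGE = "knowledge"
    PIPELINE = "pipeline"
    COORDINATION = "coordination"


@dataclass(frozen=True)
class Component:
    name: str
    kind: Kind


# Hypothetical cutoffs for illustration only; they are not taken from the paper.
LOW_CUT = 0.3
HIGH_CUT = 0.7


def treatment(component: Component, f: float) -> str:
    """Map one MAS component to a distillation decision for metric freedom f."""
    if component.kind is Kind.TOOL:
        return "keep"
    if component.kind is Kind.KNOWLEDGE:
        return "keep as conditional reference"
    if component.kind is Kind.COORDINATION:
        return "discard"
    # Pipeline structure is the only category whose treatment depends on f.
    if f < LOW_CUT:
        return "keep step guidance"
    if f < HIGH_CUT:
        return "keep selectively"
    return "drop"


def plan(components, f):
    """Return {component name: decision} for a whole multi-agent system."""
    return {c.name: treatment(c, f) for c in components}


# Hypothetical components of a causal-estimation MAS.
mas = [
    Component("run_regression", Kind.TOOL),
    Component("method_selection_notes", Kind.KNOWLEDGE),
    Component("plan_then_verify_order", Kind.PIPELINE),
    Component("coordinator_messages", Kind.COORDINATION),
]

low_f_plan = plan(mas, 0.10)
high_f_plan = plan(mas, 1.03)
print(low_f_plan)
print(high_f_plan)
```

Only the pipeline entry changes between the two plans, which is the point: tools and knowledge survive at any $F$, coordination never does, and step ordering is the one asset that has to earn its place.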

Stage 2 is not “optimize harder”; it is “optimize where the landscape permits it”

A common enterprise reflex is to iterate whenever the first system is not good enough. Add a review loop. Add an analyzer. Add an evaluator. Add more samples. Add more bureaucracy, but call it “agentic optimization” so the invoice feels modern.

The paper is more disciplined. Stage 2 adds an automated skill iterator with four roles: an Explore agent, a Main agent, a Runner, and a stateless Analyzer. It examines failures, proposes targeted fixes, updates tools or reasoning guidance, and stops on convergence, sufficiency, oscillation, or budget exhaustion.

The important part is not that the iterator exists. We have no shortage of systems that rewrite their own prompts and declare adulthood. The important part is when the paper says to use it.

In the latest version of the paper, Stage 2 is activated for mid- and high-freedom metrics, not for the most rigid metrics. This may sound counterintuitive if one only remembers the simple rule “low $F$ means skills help.” The distinction is between initial structure and iterative refinement.

Low-$F$ metrics benefit from strong initial guidance because the target is narrow. But once you start iteratively patching edge cases, the knife-edge landscape can become unstable: fixing one case breaks another. The appendix illustrates this with causal estimation. The iterator reaches 100% validation MSA by version 2, then later iterations fall back to 80% as one fix disrupts another. That is not failure in the dramatic Hollywood sense. It is the quieter and more common enterprise failure: the system keeps “improving” until it starts undoing itself.

Mid-freedom metrics, by contrast, leave room for targeted improvement without every rule becoming a landmine. Text-to-SQL execution accuracy, for example, sits around the middle of the spectrum. The paper reports steady improvement and plateauing rather than destructive oscillation. Feature engineering, at the high end, shows only marginal gains because the metric is already flat; there is little to refine.
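The four stopping conditions named in this section (convergence, sufficiency, oscillation, budget exhaustion) can be sketched as a check over the validation-score history. The definitions below are assumptions made for illustration, not the paper's: `stop_reason`, `best_version`, and the parameters `target`, `min_gain`, `patience`, and `max_iters` are hypothetical, and in the sample trajectory only the 100% at version 2 and the later 80% come from the article; the first value is a made-up placeholder.

```python
from typing import Optional, Sequence


def stop_reason(
    history: Sequence[float],
    target: Optional[float] = None,
    min_gain: float = 0.01,
    patience: int = 2,
    max_iters: int = 10,
) -> Optional[str]:
    """Return why the iterator should stop, or None to keep going.

    history holds one validation score per skill version, oldest first.
    """
    if not history:
        return None
    latest = history[-1]
    # Sufficiency: the latest version is already good enough.
    if target is not None and latest >= target:
        return "sufficiency"
    # Oscillation (assumed definition): the latest version regressed
    # below the best earlier version by at least min_gain.
    if len(history) >= 2 and max(history[:-1]) - latest >= min_gain:
        return "oscillation"
    # Convergence: the last `patience` changes were all smaller than min_gain.
    recent = history[-(patience + 1):]
    if len(recent) == patience + 1 and all(
        abs(b - a) < min_gain for a, b in zip(recent, recent[1:])
    ):
        return "convergence"
    # Budget: no more iterations allowed.
    if len(history) >= max_iters:
        return "budget"
    return None


def best_version(history: Sequence[float]) -> int:
    """1-based index of the best-scoring version, for rolling back."""
    return max(range(len(history)), key=lambda i: history[i]) + 1


knife_edge = [0.7, 1.0, 0.8]          # 0.7 is a placeholder value
plateau = [0.5, 0.6, 0.605, 0.606]    # synthetic mid-F style trajectory

print(stop_reason(knife_edge), best_version(knife_edge))
print(stop_reason(plateau))
```

With a sufficiency target of 1.0, the same knife-edge run would have stopped at version 2 and never reached the regression. The design choice worth noting is the rollback helper: detecting oscillation is only useful if the system then returns to the best version rather than the latest one.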

So the operational rule is more subtle:

| Decision | Low $F$ | Mid $F$ | High $F$ |
| --- | --- | --- | --- |
| Add initial skill structure | Yes | Selectively | Minimally |
| Preserve full MAS pipeline | Usually no | No | Definitely no |
| Use iterative skill optimization | Carefully, if at all | Often useful | Usually low ROI |
| Main failure mode | Brittle edge-case oscillation | Overfitting if validation is weak | Optimization theater |

This is the mechanism-first reading: Metric Freedom does not say “do more” or “do less.” It says which kind of intervention has a plausible causal path to improvement.

The evidence is strongest where it separates mechanisms from benchmarks

The experiments cover four task families (text-to-SQL, causal estimation, causal discovery, and feature engineering), eleven datasets, and six metrics. The paper compares original multi-agent systems, a raw base agent, a faithful MAS compiler, ablations such as tools-only and knowledge-only, and AdaSkill.

The headline performance numbers are useful but not the main reason to read the paper. The authors report that AdaSkill delivers competitive or better performance while reducing cost by up to 8× and latency by 3–15×. Good. Everyone likes cheaper systems that run faster. Very brave position.

The more important evidence is that the results line up with the proposed mechanism.

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Causal estimation MSA vs MRE | Main mechanism test | The same behavior can be rigid under one metric and free under another | That every causal task will show the same magnitude |
| Global $F$ vs normalized lift | Main predictive evidence | $F_{out}$ and $F_{trace}$ are negatively correlated with skill lift | That the exact thresholds are universal |
| Tools / knowledge / pipeline ablation | Ablation | Pipeline structure is the main source of high-$F$ harm; tools and knowledge are safer | That tools are always beneficial in poorly specified enterprise systems |
| Budget sensitivity for $F$ estimation | Robustness / sensitivity test | A small run budget can estimate $F$ reasonably in the tested setup | That proxy estimation works for open-ended tasks |
| GPT-5.1 replication | Backbone robustness test | The negative relationship persists beyond the main Claude Sonnet 4.6 runs | That all future models will preserve the same slopes |
| Stage 2 trajectories | Implementation behavior | Low-$F$ iteration can oscillate; mid/high-$F$ iteration can plateau more smoothly | That self-optimization is safe without strong validation splits |

The global correlation is especially important. The paper reports $r(F_{out}, \text{lift}_{norm}) = -0.85$ and $r(F_{trace}, \text{lift}_{norm}) = -0.77$. In plain English: as Metric Freedom rises, the useful lift from skill structure falls. A replication using GPT-5.1 preserves the negative trend, although the slope is shallower, consistent with the idea that stronger base models leave less skill-fillable headroom.

That last point matters for business readers. A skill that is valuable today may become redundant tomorrow, not because the task changed, but because the base model internalized enough domain knowledge that external scaffolding no longer adds much. This is where Metric Freedom becomes more than a one-time diagnostic. It can be tracked over model generations as a rough thermometer of whether knowledge still needs to be externalized.

The business value is cheaper diagnosis, not just cheaper inference

The obvious business takeaway is cost reduction. Replace multi-agent workflows with a single skill-equipped agent, get similar performance, pay less, wait less. That is true, but it is also the least interesting version of the argument.

The more useful takeaway is diagnostic.

Before building a complicated AI automation workflow, a team can run a small raw-agent evaluation sample and estimate how tightly behavior differences map to score differences. The paper’s operating point uses six questions and six runs per question, with a reported one-time diagnostic cost of about $6.12 in their setup. The exact number will change by model, task, and vendor pricing. The principle is stable: run a cheap diagnostic before committing to expensive architecture.
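The arithmetic behind that operating point is worth spelling out. The run count and per-run average follow directly from the reported figures; the pair count assumes that run pairs are formed within each question, which is an assumption of this sketch rather than a detail confirmed above.

$$ 6 \ \text{questions} \times 6 \ \text{runs} = 36 \ \text{runs}, \qquad \frac{\$6.12}{36 \ \text{runs}} = \$0.17 \ \text{per run} $$

$$ 6 \times \binom{6}{2} = 6 \times 15 = 90 \ \text{run pairs} $$

So under that assumption, roughly ninety pairwise comparisons feed the estimate of $F$, bought for the price of thirty-six raw-agent runs.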

For enterprise AI, that changes the build sequence.

Most teams do something like this:

1. choose a workflow architecture;
2. build tools and agent roles;
3. evaluate performance;
4. add more structure when results disappoint.

Metric Freedom suggests a better order:

1. define the metric;
2. run diverse raw-agent baselines;
3. estimate whether the metric is rigid or free;
4. decide how much structure to impose;
5. only then build the skill or workflow.

This is not glamorous. It is also exactly why it is useful. Architecture decisions should be boring when the evidence is good.

A rigid enterprise metric might be compliance classification, correct database field selection, method choice, eligibility screening, or exact procedural routing. There, distilled skills and structured guidance can be worth the effort. A freer metric might be exploratory market research, feature ideation, draft generation, or customer insight synthesis. There, forcing a pipeline may make the agent more obedient and less useful, which is a very popular way to fail politely.

What the paper shows, what Cognaptus infers, and what remains uncertain

The paper directly shows that Metric Freedom can be computed from raw runs on structured benchmarks, and that it strongly predicts normalized skill lift across the tested tasks. It also shows that adaptive distillation can match or outperform original MAS baselines while reducing cost and latency in those settings.

Cognaptus would infer three practical rules from this:

| Practical rule | Use it when | Reason |
| --- | --- | --- |
| Diagnose the metric before designing the agent | The workflow has measurable outputs | The metric determines whether structure helps or restricts |
| Keep tools more readily than pipelines | Tools extend capability without forcing a path | Tool access is less likely to impose exploration tax |
| Treat iteration as a separate decision from distillation | The first skill works but leaves measurable headroom | Refinement can improve mid-$F$ cases, but can oscillate on knife-edge metrics |

What remains uncertain is equally important.

First, $F$ is currently easiest to define for auto-graded, structured tasks where output distances are computable. It is much less straightforward for human-evaluated writing quality, conversation usefulness, brand tone, or strategic judgment. Those are exactly the tasks many businesses care about. Excellent, the universe has retained its sense of humor.

Second, the metric requires multiple raw-agent runs. The paper presents this as a low diagnostic cost, and it often will be. But for long workflows with expensive tools, data permissions, or human review, even baseline sampling may not be trivial.

Third, the exact thresholds should not be treated as sacred. The paper’s results are strong but still benchmark-bound. The number $F=0.50$ should not become the new enterprise astrology sign.

Fourth, the paper’s version history itself is a reminder to read carefully. The latest PDF gives a more refined account of Stage 2 than a simple “low freedom means optimize more” rule. Low freedom supports initial structure; it does not automatically justify repeated self-optimization.

Doing less is not minimalism; it is metric-aware restraint

The current AI workflow fashion rewards visible sophistication. More agents look more serious. More steps look more controlled. More validators look more responsible. Somewhere, a dashboard glows.

Metric Freedom cuts through that theater by asking a colder question: does the metric actually reward controlling the path?

If yes, structure is not bureaucracy. It is guidance. Build the skill, encode the knowledge, and constrain the agent enough to hit the narrow target.

If no, structure is not discipline. It is drag. Keep the tools, provide reference knowledge, and let the model explore.

That is why the title of this article is not just a joke. Sometimes your AI gets smarter by doing less because the extra structure was never intelligence. It was a committee costume.

The practical lesson is simple enough to be dangerous: before you distill a multi-agent system, diagnose the freedom of the metric. The architecture should follow the score, not the other way around.

**Cognaptus: Automate the Present, Incubate the Future.**

¹ Binyan Xu, Dong Fang, Haitao Li, and Kehuan Zhang, “From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?”, arXiv:2604.01608v3, 24 Apr 2026, https://arxiv.org/pdf/2604.01608. This article uses the PDF version because it contains the latest revision and several details not reflected in the earlier HTML rendering.
