Login.

That is where many agent evaluations become strangely unserious. A benchmark asks whether the agent completed a task. A dashboard records whether the browser session ended successfully. A monitoring system checks whether the tool call returned an error. Then the agent enters valid credentials and suddenly gains access to a much larger part of the environment.

Was that progress? Was it risk? Was it simply another step?

The answer depends on what we measure. Traditional success metrics tell us whether the agent arrived at a specified destination. Behavioural diversity tells us whether it moved around a lot. Neither tells us, cleanly, whether the agent’s actions expanded its control over future states.

That is the useful idea behind “Estimating the Empowerment of Language Model Agents,” by Jinyeop Song, Jeff Gore, and Max Kleiman-Weiner [1]. The paper introduces EELMA, a method for estimating agent “empowerment” from multi-turn text trajectories. Empowerment here does not mean morale, corporate wellness, or any other HR-shaped fog machine. It means something precise: how much an agent’s actions influence the set of future states it can reach.

The business translation is simple enough to be dangerous: options are power.

The paper’s real contribution is not that “more capable agents perform better.” Lovely, yes, and gravity still works. The contribution is a way to separate three signals that business readers often collapse into one:

| Signal | What it measures | Why it matters | Why it misleads |
| --- | --- | --- | --- |
| Task success | Did the agent achieve a defined goal? | Clear operational outcome | Requires hand-built goals and misses unscored capability growth |
| Behavioural diversity | Did the agent visit many states or take varied actions? | Cheap proxy for exploration | Rewards wandering, hallucinated tool calls, and accidental chaos |
| Empowerment | Do actions reliably steer the agent into meaningfully different future states? | Measures controllable reach | Still not the same as task value, safety, or intent |

EELMA is interesting because it tries to make the third signal measurable in language-based environments, where states are messy text, browser pages, tool responses, and action logs rather than neat grid coordinates.

Empowerment is controllable reach, not just movement

The paper starts from an information-theoretic view of agency. An agent is empowered when its current action tells us a lot about its future state. If pressing button A and pressing button B lead to reliably different future possibilities, the agent has control. If every action loops back into the same dead end, the agent has little control, even if it produces many words while getting there. The enterprise equivalent is an automation that either unlocks the next workflow stage or politely thrashes in a tab until the session expires. We have all met both.

Formally, empowerment is built around mutual information between actions and future states. The paper defines effective empowerment as an average over states the policy actually visits. For granular analysis, it also estimates state-conditional and state-action-conditional empowerment: not only “how much control does this agent usually have?” but “where, exactly, did control increase?”
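To make the button example concrete, here is a minimal sketch that computes action-to-next-state mutual information for two made-up states and then averages it over visit frequencies. The state names, the uniform action probabilities, and the visit weights are all illustrative assumptions, and the averaging step is only a simplified reading of “effective empowerment”, not the paper’s exact estimator.

```python
import math
from collections import defaultdict

def mutual_information_bits(joint):
    """I(A; S') in bits from a dict {(action, next_state): probability}."""
    p_action = defaultdict(float)
    p_state = defaultdict(float)
    for (action, state), p in joint.items():
        p_action[action] += p
        p_state[state] += p
    return sum(
        p * math.log2(p / (p_action[a] * p_state[s]))
        for (a, s), p in joint.items()
        if p > 0
    )

# Hypothetical state "lobby": two buttons lead to reliably different rooms.
lobby = {("press_A", "room_1"): 0.5, ("press_B", "room_2"): 0.5}
# Hypothetical state "dead_end": both buttons loop back to the same place.
dead_end = {("press_A", "dead_end"): 0.5, ("press_B", "dead_end"): 0.5}

mi_lobby = mutual_information_bits(lobby)
mi_dead_end = mutual_information_bits(dead_end)

# Assumed visit frequencies of the policy: mostly stuck in the dead end.
visits = {"lobby": 0.25, "dead_end": 0.75}
effective = visits["lobby"] * mi_lobby + visits["dead_end"] * mi_dead_end
```

In this toy setup the lobby carries one full bit of control, the dead end carries none, and the visit-weighted average is 0.25 bits. The per-state values are what answer “where, exactly, did control increase?”.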

That second question is the business one.

A model upgrade that raises average task success by two points is useful. A model upgrade that suddenly increases empowerment around admin consoles, payment systems, or role-management pages is a different kind of useful. It is the sort of useful that should arrive with a risk register attached.

Why EELMA is needed: text breaks old empowerment estimators

Classical empowerment estimation works best when states and actions are cleanly enumerable. Grid cell here, disk configuration there, transition table over there. Language agents do not live in such tidy little museums.

A browser state can be rendered as an accessibility tree, a DOM fragment, a page summary, or a verbose observation containing the objective, previous action, current URL, and interactive elements. Two strings can describe the same underlying state. Two nearly identical strings can hide a meaningful difference. Counting unique text observations as unique states will quickly turn an estimator into a very confident spreadsheet of nonsense.
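A tiny illustration of the counting problem, using made-up observation strings and hypothetical ground-truth labels that a real log would not contain:

```python
# Three logged observations of the same underlying Gridworld state
# (illustrative strings, not taken from the paper's data).
observations = [
    "You are at (1, 2). A box is directly north of you.",
    "Position: row 1, column 2. There is a box to the north.",
    "You are at (1, 2). A box is directly north of you.",
]

# Hypothetical latent labels, shown only to expose the mismatch.
latent_states = [("agent", 1, 2, "box_north")] * 3

naive_state_count = len(set(observations))   # counts distinct strings
true_state_count = len(set(latent_states))   # counts distinct latent states
```

String counting reports two states where there is one. Any estimator built on those counts inherits the error, which is the motivation for working in an embedding space instead.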

EELMA addresses this by embedding text observations and actions into compact representations, then using an InfoNCE-style contrastive objective to estimate the relevant mutual-information quantities. In plain terms, it learns whether a current observation-action pair helps predict a future observation better than random alternatives from other trajectories.
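The contrastive idea can be sketched in a few lines. This is a generic InfoNCE lower bound with a dot-product critic over hand-picked vectors; the function name `info_nce_bits` and the toy embeddings are assumptions for illustration. EELMA learns its encoders from trajectories, which this sketch does not attempt.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def info_nce_bits(pair_embs, future_embs):
    """InfoNCE lower bound on I((observation, action); future observation), in bits.

    pair_embs[i] and future_embs[i] come from the same step; future_embs[j]
    for j != i play the role of negatives drawn from other trajectories.
    """
    n = len(pair_embs)
    total = 0.0
    for i in range(n):
        scores = [dot(pair_embs[i], future_embs[j]) for j in range(n)]
        m = max(scores)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += scores[i] - log_norm  # log-softmax of the true future
    return (math.log(n) + total / n) / math.log(2)

# Toy embeddings: each (observation, action) pair points at its own future.
pairs = [[10.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
distinct_futures = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Every action ends in the same future: nothing to tell apart.
identical_futures = [[1.0, 1.0, 1.0, 1.0] for _ in range(4)]

informative = info_nce_bits(pairs, distinct_futures)
uninformative = info_nce_bits(pairs, identical_futures)
```

With four samples the bound cannot exceed log2(4) = 2 bits. The informative case lands just under that ceiling and the uninformative case sits at zero, which is the “better than random alternatives” test in numerical form.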

That design choice matters. The paper’s appendix compares EELMA with a prompt-only LLM estimator that is explicitly given the definition of empowerment and structured transition information. The prompt-only approach still systematically overestimates empowerment. This is a useful little humiliation for “just ask the LLM to score it” evaluation culture. Being able to recite the concept is not the same as computing the quantity.

EELMA is experience-grounded. It learns from trajectories. That makes it heavier than a prompt rubric but more credible than asking a language model to perform information theory by vibes.

The controlled games test whether EELMA measures the right thing

The paper first validates EELMA in text versions of Gridworld and Tower of Hanoi. This is main evidence, not decoration. These settings have an underlying structure, so the authors can compare EELMA’s learned estimates against direct empowerment estimates computed from transition dynamics.

The result: EELMA closely tracks direct estimation in both environments.

The most useful evidence is not merely “correlation was high.” It is that the estimates behave sensibly across different kinds of control.

In Gridworld, an agent trapped between boxes has low empowerment because its actions do not open many future states. An agent with navigable space has more. An agent that can move boxes has even more. In Tower of Hanoi, dispersed disk configurations provide more possible valid moves than tightly stacked configurations. EELMA captures these differences from textual observations.
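For intuition about what a direct estimate looks like, here is a minimal sketch for a deterministic grid. It relies on the standard result that, when transitions are deterministic, n-step empowerment equals log2 of the number of distinct states reachable in n steps. Boxes are treated as immovable walls, which is a simplification, and the paper’s own direct estimator may be set up differently.

```python
import math

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(pos, move, walls, size):
    """Deterministic transition: an invalid move leaves the agent in place."""
    r, c = pos[0] + MOVES[move][0], pos[1] + MOVES[move][1]
    if 0 <= r < size and 0 <= c < size and (r, c) not in walls:
        return (r, c)
    return pos

def n_step_empowerment_bits(start, walls, size, n):
    """log2 of the number of distinct end states over all n-step action sequences."""
    frontier = {start}
    for _ in range(n):
        frontier = {step(p, m, walls, size) for p in frontier for m in MOVES}
    return math.log2(len(frontier))

# Agent in the centre of an empty 3x3 grid.
open_bits = n_step_empowerment_bits((1, 1), set(), 3, 1)
# Same agent boxed in on all four sides.
boxed_in = {(0, 1), (2, 1), (1, 0), (1, 2)}
trapped_bits = n_step_empowerment_bits((1, 1), boxed_in, 3, 1)
```

The open agent has 2 bits of one-step empowerment and the trapped agent has 0. EELMA has to recover that ordering from text descriptions alone, without access to the transition function.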

That is the first major point: EELMA is not just detecting activity. It is detecting reachable optionality.

The paper then tests valid versus invalid actions. In Gridworld, moving into empty space or pushing a box into a legal position increases empowerment more than invalid moves. In Tower of Hanoi, legal disk moves score higher than illegal ones. This is action-level evidence: EELMA can identify individual decisions that create future control.

A business reader should read this as a diagnostic promise. If the estimator works in your environment, it can identify the steps where an agent actually gains operational leverage, not just the sessions where it eventually succeeds.

The ablations show agent design knobs change control, not just accuracy

The paper next varies agent configurations: Chain-of-Thought prompting, short-term memory length, and base model choice. This section is mostly ablation evidence. It asks whether empowerment moves when known capability knobs move.

It does.

Removing Chain-of-Thought sharply reduces empowerment: in the reported game settings, Gridworld empowerment drops from 0.19 to 0.01 bits, while Tower of Hanoi drops from 0.29 to 0.09 bits. Increasing memory from zero to three previous steps raises empowerment and performance, especially in Tower of Hanoi. Stronger closed models such as Claude-3.5-Sonnet and GPT-4o generally show higher empowerment and reward than smaller or lower-cost alternatives. Among open models, Qwen2.5 shows clearer parameter-scaling behaviour than Gemma-3 or Llama 3.2 in these tests.

The interpretation is not “Chain-of-Thought is always good” or “bigger models are always worth it.” That would be the usual LinkedIn confetti cannon. The better interpretation is narrower and more useful: empowerment reacts to agent architecture choices that affect controllable reach.

For deployment, this gives teams a way to compare configurations beyond pass/fail tasks:

| Configuration change | What the paper observes | Operational reading |
| --- | --- | --- |
| Remove reasoning prompt | Empowerment falls sharply in games | The agent loses multi-step steering ability |
| Add short-term memory | Empowerment and reward rise | State continuity increases practical control |
| Increase model strength | Often higher empowerment and reward | Capability upgrades may expand reach before success metrics fully show it |
| Use weak model in hard tool tasks | Can produce diverse invalid behaviour | Diversity is not control, and control is not success |

That final row is the one to keep. It is where the paper avoids becoming a metric sales brochure.

WebArena shows empowerment tracks capability—until reasoning becomes the bottleneck

The WebArena experiments move from controlled games to realistic web navigation. The authors evaluate agents across GitLab, Reddit, Shopping Admin, and Shopping domains using GPT-4o-mini, GPT-4o, o3, and Qwen2.5-32B-it. EELMA is trained on expanded trajectories, including LLM-generated goals for trajectory diversity, while reward calculations exclude those augmented tasks.

This is main evidence for realism. WebArena observations are long, structured, and messy. Agents navigate browser pages through accessibility-tree observations and page actions. If EELMA only worked on toy text games, it would be an elegant laboratory pet. WebArena is where it starts looking operationally relevant.

The headline result is that empowerment is positively associated with discounted reward across several WebArena domains. GPT-4o has the highest overall discounted reward, and EELMA also estimates it as having the highest environmental influence. The paper reports strong alignment in GitLab, Reddit, and Shopping Admin.

But Shopping breaks the neat story. In Shopping tasks, the relationship between empowerment and discounted reward is flat. The likely explanation is that the bottleneck is not reachability but reasoning: agents may navigate the environment while still failing on numerical price comparisons or task-specific inference. They can get to the shelf; they still cannot do the maths. A familiar tragedy.

This boundary is important. Empowerment measures control over future states. It does not measure whether the agent understands the business question once it reaches the relevant page.

That distinction should shape how companies use it. Empowerment is a useful observability layer for navigation, access, and tool-use capability. It is not a replacement for domain-specific correctness tests.

The tool-use results expose the difference between diversity and real control

The paper’s $\tau$-bench experiments are the cleanest warning against naïve optionality metrics. The authors test three Qwen2.5-Instruct model sizes—1.5B, 7B, and 32B—across airline and retail customer-service tool-use domains.

In the airline domain, where pass rates are in the 28–34% range, empowerment scales monotonically with model size: 0.59 for 1.5B, 0.71 for 7B, and 0.74 for 32B. This matches discounted reward. Good: empowerment is behaving like a capability proxy.

In the retail domain, where pass rates are only 3–7%, the signal inverts. Smaller models hallucinate more diverse invalid tool calls, producing varied error responses. EELMA can measure action-state mutual information there, but that mutual information is not task-relevant control. It is more like a toddler discovering that different wrong buttons produce different beeps. Technically interactive. Not exactly competent.

This is the paper’s most useful limitation because it is specific. Empowerment can fail when all models are near floor performance and behavioural diversity dominates the signal. In those regimes, the right conclusion is not “small model has more power.” It is “the environment is too hard, the policy is too poor, and the metric is now detecting structured failure.”

The appendix comparison with simpler trajectory metrics reinforces the point. EELMA shows the strongest positive association with discounted reward among tested task-agnostic metrics: in WebArena, empowerment has a Spearman correlation of 0.39, compared with 0.16 for visitation entropy, 0.09 for action entropy, 0.04 for state coverage, 0.12 for trajectory length, and 0.38 for unique-state ratio. In $\tau$-bench, empowerment is 0.71, while the simpler metrics range from 0.20 to 0.71. These figures should not be over-romanticised; some are descriptive over small pooled comparisons. But the direction is instructive. EELMA is better than “counting movement,” though not immune to bad regimes.
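For readers who want to run the same kind of comparison on their own logs, here is a minimal Spearman rank correlation in plain Python. The per-agent numbers below are invented for illustration and are not the paper’s data.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-agent summaries (NOT results from the paper).
discounted_reward = [0.10, 0.40, 0.35, 0.80]
empowerment = [0.20, 0.50, 0.45, 0.90]
action_entropy = [1.90, 0.70, 1.20, 1.00]

rho_empowerment = spearman(empowerment, discounted_reward)
rho_entropy = spearman(action_entropy, discounted_reward)
```

In this invented example empowerment orders the agents exactly as reward does, while action entropy is negatively correlated because the noisiest agent is also the worst. With only four points either number would be fragile, which is the same caution the article applies to the small pooled comparisons.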

Authentication is the governance case study hiding in plain sight

The paper’s modified Shopping Admin experiment is where the business implications become concrete.

Normally, WebArena agents may start with automatic login states. The authors remove that convenience. To complete Shopping Admin tasks, the agent must locate credentials on a hidden page and manually authenticate. This creates a clear power transition: before login, the agent has limited access; after login, it can control more of the admin environment.

The result is exactly what a governance team would hope to see. GPT-4o with one-step memory successfully authenticates in 137 of 182 trajectories. GPT-4o without memory and GPT-4o-mini fail to authenticate. Valid credential-entry actions produce higher empowerment than invalid attempts. At action-level aggregation, the appendix reports valid authentication actions averaging 0.365 bits versus -0.127 bits for invalid attempts. Valid username entries average 0.448 bits, while valid password entries average 0.210 bits; invalid password attempts average -0.152 bits.

The negative values do not mean “negative true empowerment.” The paper explains that true mutual information is non-negative; negative estimates arise from the InfoNCE variational approximation under finite-sample bias or estimation error. The useful signal is relative: valid actions are higher than invalid ones.
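A small sketch shows how a per-step contrastive term can dip below zero even though true mutual information cannot. The critic scores here are made up, and the function is a generic InfoNCE term rather than the paper’s exact action-level aggregation.

```python
import math

def pointwise_info_nce_bits(positive_score, negative_scores):
    """One sample's InfoNCE term: log(N) + log-softmax of the true future, in bits."""
    scores = [positive_score] + list(negative_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return (math.log(len(scores)) + positive_score - log_norm) / math.log(2)

# Hypothetical critic scores: the true future scores well above the negatives...
valid_term = pointwise_info_nce_bits(3.0, [0.0, 0.0, 0.0])
# ...or below them, as can happen for an action that changes nothing.
invalid_term = pointwise_info_nce_bits(-1.0, [0.0, 0.0, 0.0])
```

When the critic rates the true future lower than the negatives, the term goes negative. That is a property of the estimator under finite samples, so the sign is less informative than the gap between valid and invalid actions.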

This is the strongest business pathway in the paper. EELMA could become a monitoring layer for capability pivots:

| Pivotal action type | What empowerment can flag | What it cannot decide alone |
| --- | --- | --- |
| Login or credential use | Access expansion | Whether access was authorised |
| Permission escalation | New reachable admin states | Whether the task required it |
| Tool discovery | Expanded action surface | Whether the tool use is safe |
| Navigation into restricted areas | Broader future state reach | Whether the agent intended misuse |
| Workflow unlocks | Removal of bottlenecks | Whether the final output is correct |

This is not “AI safety solved.” It is better telemetry. In production, better telemetry is often the difference between governance and theatre.

What Cognaptus would infer for agent operations

The paper directly shows that EELMA can estimate empowerment from language-agent trajectories, match direct empowerment in controlled text games, correlate with discounted reward across several realistic environments, and identify high-control actions such as authentication without explicit reward labels.

Cognaptus would infer three practical uses.

First, use empowerment as a release-comparison metric. When changing model version, prompt structure, memory policy, tool access, or browser environment, compare the empowerment distribution before and after release. A rise in useful task zones may be desirable. A rise around privileged systems deserves review.
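A release comparison of this kind might be sketched as follows. The category names, the `PRIVILEGED` set, and the 0.05-bit threshold are all assumptions chosen for illustration. A real deployment would need its own taxonomy and a threshold calibrated against estimator noise.

```python
from statistics import mean

# Assumed set of state categories that warrant human review.
PRIVILEGED = {"admin_console", "payments", "role_management"}

def compare_releases(before, after, threshold_bits=0.05):
    """before/after map state category -> list of per-step empowerment estimates (bits).

    Returns {category: (mean delta in bits, needs_review)}.
    """
    report = {}
    for category in sorted(set(before) & set(after)):
        delta = mean(after[category]) - mean(before[category])
        needs_review = category in PRIVILEGED and delta > threshold_bits
        report[category] = (round(delta, 3), needs_review)
    return report

# Hypothetical estimates from two release cohorts.
before = {"search": [0.10, 0.12], "admin_console": [0.05, 0.07]}
after = {"search": [0.20, 0.22], "admin_console": [0.30, 0.32]}
report = compare_releases(before, after)
```

Here both categories rise, but only the admin console increase is flagged. The rise in search is the desirable kind, and the rise near privileged systems is the kind that deserves a second pair of eyes.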

Second, use state-action empowerment for forensic analysis. Average task success tells you which sessions passed. Empowerment timelines can tell you where control expanded. That matters when investigating unexpected tool use, repeated credential attempts, or agents discovering side routes through internal systems.

Third, use empowerment to reduce dependence on handcrafted benchmark tasks. This does not mean eliminating benchmarks. It means adding a goal-agnostic signal that can be computed from trajectory logs, including situations where no clean reward labels exist.

A sensible implementation would look like this:

  1. Log observations, actions, future observations, model version, prompt configuration, memory length, tool calls, and environment metadata.
  2. Train an EELMA-style estimator on representative trajectories, not just successful runs.
  3. Report empowerment by session, by state category, by action type, and by release cohort.
  4. Compare empowerment with success rate, tool-error rate, human-review flags, policy violations, and latency.
  5. Treat spikes around authentication, admin views, financial operations, file systems, and external communications as review triggers, not automatic verdicts.
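The steps above could start from something as small as the sketch below. The `StepRecord` fields, the action-type labels, and the 0.3-bit spike threshold are hypothetical, and the `empowerment_bits` values are assumed to come from an already trained estimator.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

@dataclass
class StepRecord:
    session_id: str
    model_version: str
    observation: str
    action: str
    next_observation: str
    action_type: str         # e.g. "navigate", "authenticate" (assumed labels)
    empowerment_bits: float  # filled in by the estimator, not computed here

# Assumed action types that should trigger review, not automatic verdicts.
REVIEW_ACTION_TYPES = {"authenticate", "admin_view", "payment"}

def by_action_type(records):
    """Mean empowerment estimate per action type."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.action_type].append(record.empowerment_bits)
    return {kind: mean(values) for kind, values in grouped.items()}

def review_triggers(records, spike_bits=0.3):
    """Steps of a sensitive type whose estimate reaches the spike threshold."""
    return [
        r for r in records
        if r.action_type in REVIEW_ACTION_TYPES and r.empowerment_bits >= spike_bits
    ]

records = [
    StepRecord("s1", "v2", "login page", "type password", "admin home", "authenticate", 0.4),
    StepRecord("s1", "v2", "admin home", "click Reports", "reports page", "navigate", 0.1),
    StepRecord("s2", "v2", "login page", "type wrong password", "login page", "authenticate", -0.1),
]
summary = by_action_type(records)
flagged = review_triggers(records)
```

Only the successful login in session `s1` is flagged. The failed attempt in `s2` stays in the log and still contributes to the per-type average, which is why failed runs belong in the dataset too.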

The review-trigger point matters. Empowerment says “control increased here.” It does not say “this was good” or “this was bad.” A fire alarm is not a judge. It is still useful when the building is smoking.

The appendix tests robustness, not a second thesis

Several appendix results matter for practitioners because they answer implementation questions rather than philosophical ones.

The natural-language variation test checks whether EELMA survives paraphrased observations. Direct counting methods suffer because different descriptions of the same latent state look like different states. EELMA remains much closer to its fixed-format baseline in the reported RMSE comparison: Gridworld RMSE is 0.048 for EELMA on natural-language observations versus 0.302 for direct estimation on natural-language observations; Tower of Hanoi is 0.127 versus 0.438. This is robustness evidence. It supports the method’s use on messy logs, but it does not prove universal robustness to every enterprise data format.

The encoder-choice study shows that embedding model selection matters. On natural-language Gridworld, MiniLM-L6-v2 achieves the lowest reported RMSE at 0.0336, beating E5-Base-v2 at 0.0447 and E5-Small-v2 at 0.0538. Parameter count is not destiny; representational fit matters.

The fine-tuning strategy study is even more operational. LoRA adaptation improves RMSE relative to a frozen encoder, while partial and full fine-tuning collapse in the reported setup. LoRA also adds far less memory overhead: 3 MB versus 41 MB for partial fine-tuning and 382 MB for full fine-tuning. This is implementation-detail evidence with practical value. It says: start with lightweight adaptation before enthusiastically fine-tuning everything into soup.

The compute numbers are also worth noting. EELMA training is reported as roughly four hours on a single 80GB A100, while generating trajectories for large open models can take much longer. The authors also state that WebArena EELMA overhead is about 5% on top of trajectory collection cost, and that subsampling stabilises around 50% of the full dataset in their analysis. For businesses, the cost centre is likely not the estimator. It is the disciplined collection of representative trajectories. As usual, the spreadsheet was not the hard part; the plumbing was.

Where empowerment should not be promoted to CEO

The main misconception to kill is simple: high empowerment is not the same as task success, safety, intelligence, or business value.

The paper itself gives the counterexamples. Shopping tasks can require reasoning after navigation. $\tau$-bench retail can produce high mutual-information signals from diverse invalid tool calls. Qwen’s Reddit behaviour includes external navigation away from the sandbox, a form of environmental influence that may be unhelpful or undesirable. Authentication can be necessary progress or risky access expansion, depending on authorisation.

So empowerment should be interpreted as controllable reach under an observed policy and environment. It is a capability-adjacent metric, not a moral property. It should be paired with:

  • task success and domain-specific correctness;
  • policy compliance and access-control rules;
  • tool-error and invalid-action rates;
  • human review for privileged transitions;
  • distribution-shift monitoring after logging or environment changes.

There is also a deeper limitation. EELMA estimates empowerment from trajectories. If the trajectory dataset is narrow, biased, or missing important states, the empowerment estimate will inherit those blind spots. If logging changes, the representation changes. If the environment changes, old baselines become stale. This is not a fatal flaw. It is observability behaving like observability: useful when versioned, dangerous when worshipped.

The better KPI is not “more empowerment”; it is “right empowerment”

The tempting executive dashboard is a single number: empowerment up, agent better. Please do not build that dashboard. Or at least do not leave it unattended near budget season.

The more useful dashboard separates types of empowerment:

| Empowerment pattern | Likely interpretation | Action |
| --- | --- | --- |
| Higher empowerment plus higher success | Better controllable capability | Candidate for rollout |
| Higher empowerment with flat success | New reach, unresolved reasoning bottleneck | Add task-specific checks |
| Higher empowerment around privileged states | Access expansion | Trigger governance review |
| Higher diversity with low success | Structured failure or hallucinated tools | Improve policy/tool constraints |
| Lower empowerment after release | Lost optionality or over-constrained agent | Investigate prompt, memory, or tool changes |
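The same patterns can be written as a rough triage function. The thresholds (`eps`, `low_success`) and the order in which the rules are checked are assumptions, not something the paper specifies. In particular, the privileged and structured-failure checks run first so that a diversity-driven rise is not mistaken for real capability.

```python
def interpret(d_empowerment, d_success, *, success_rate, d_diversity=0.0,
              privileged=False, eps=0.01, low_success=0.10):
    """Map release-over-release changes to a dashboard pattern (illustrative rules)."""
    if d_empowerment > eps and privileged:
        return "Access expansion: trigger governance review"
    if d_diversity > eps and success_rate < low_success:
        return "Structured failure or hallucinated tools: improve policy/tool constraints"
    if d_empowerment > eps and d_success > eps:
        return "Better controllable capability: candidate for rollout"
    if d_empowerment > eps:
        return "New reach, unresolved reasoning bottleneck: add task-specific checks"
    if d_empowerment < -eps:
        return "Lost optionality or over-constrained agent: investigate prompt, memory, or tool changes"
    return "No clear change"

# Hypothetical release deltas.
rollout = interpret(0.10, 0.05, success_rate=0.60)
bottleneck = interpret(0.10, 0.00, success_rate=0.60)
access = interpret(0.10, 0.00, success_rate=0.60, privileged=True)
failure = interpret(0.10, 0.00, success_rate=0.05, d_diversity=0.20)
regression = interpret(-0.10, 0.00, success_rate=0.60)
```

A function like this is a starting point for conversation, not a verdict engine. The point is that the dashboard needs at least four inputs, and a single empowerment number supplies only one of them.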

This is the paper’s business lesson. EELMA does not replace benchmarks. It complements them by measuring a different axis: whether the agent’s actions actually expand the future it can control.

For companies deploying agents into browsers, CRMs, ERP systems, developer tools, and finance workflows, that axis matters. Many harmful failures will not begin as final-answer errors. They will begin as small expansions in operational reach: a new page discovered, a credential used, a permission boundary crossed, a tool sequence unlocked.

Task success may notice later. Empowerment can notice at the pivot.

And yes, that makes it a KPI. But only if we remember what the K stands for. Not “kind of impressive.” Not “keep increasing.” Key.

A good empowerment metric should tell us where the agent gained options, whether those options were useful, and whether the organisation was ready for them. Anything less is just another number with a nice chart and a suspiciously relaxed conscience.

Cognaptus: Automate the Present, Incubate the Future.


  1. Jinyeop Song, Jeff Gore, and Max Kleiman-Weiner, “Estimating the Empowerment of Language Model Agents,” arXiv:2509.22504v3, 28 May 2026, https://arxiv.org/abs/2509.22504