TL;DR for operators
A smart agent can still be a bad decision-maker. That is the useful, slightly annoying lesson from “LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities” [1]. The paper studies Gemma2 models acting in simple decision environments and finds that they often fail not because they cannot describe the right strategy, but because they do not reliably execute it.
The authors identify three failure modes. First, greediness: the model latches onto the best-looking option it has already tried, leaving much of the action space unexplored. Second, frequency bias: smaller models copy actions that appear often in the context, even when those actions are not high-reward. Third, the knowing-doing gap: a model can compute the Upper Confidence Bound strategy reasonably well and still choose the greedy action instead of the action its own computation recommends. Very clever. Also very operationally inconvenient.
Reinforcement learning fine-tuning (RLFT) on self-generated chain-of-thought rationales helps. It lowers regret in multi-armed bandits, improves contextual bandit performance, and raises Tic-tac-toe returns. But it does not magically turn LLMs into principled exploratory agents. The strongest improvements come when RLFT is paired with extra scaffolding: try-all strategies, exploration bonuses, legal-action context, expert traces, and more reasoning tokens.
For enterprise deployment, the paper is not saying “use RLFT and relax.” It is saying: if an LLM agent controls recommendations, procurement, pricing, routing, trading assistance, or workflow allocation, measure whether it explores alternatives before it exploits early winners. A fluent rationale is not an exploration policy. A bigger model is not an audit trail. And “the agent thought step-by-step” is not the same as “the agent actually followed the step it wrote down.”
The model pressed the shiny button
Imagine a procurement agent testing vendors. Vendor A delivers a decent first quote. Vendor B has not replied yet. Vendor C is unfamiliar. Vendor D looks administratively annoying. The agent writes a neat paragraph about balancing exploration and exploitation, then keeps recommending Vendor A because the first observed reward was good enough.
That is the basic pathology this paper isolates. The setting is deliberately simple: bandit tasks, contextual bandits, and Tic-tac-toe. No messy enterprise integrations, no ambiguous human politics, no spreadsheet from 2017 named final_final_REAL.xlsx. Just actions, rewards, and histories.
That simplicity is the point. If an LLM agent cannot explore properly in a bandit problem, the problem is unlikely to disappear when the same agent is wrapped in a dashboard and asked to optimise sales outreach.
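The authors use multi-armed bandits because they strip decision-making down to its most awkward core: choosing between options when some are known, some are unknown, and the best long-run strategy may require short-run experimentation. A classical algorithm such as UCB assigns high value to uncertain actions, often trying untested arms before settling into exploitation. A rough form is:

$$
\mathrm{UCB}_t(a) = \hat{\mu}_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}
$$

Here $\hat{\mu}_t(a)$ is the average reward observed so far for action $a$, $N_t(a)$ is the number of times that action has been tried, $t$ is the current step, and $c$ is an exploration constant.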
The first term rewards what has performed well. The second term rewards what has not been tried enough. In other words: do not marry the first button that smiles at you.
LLMs, apparently, need that reminder.
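For readers who prefer code to slogans, here is a minimal sketch of a UCB-style selection rule in Python. It assumes the common textbook form (mean reward plus `c * sqrt(ln t / n)`) and treats untried arms as having infinite score; the function names and the default `c=1.0` are illustrative choices, not the paper's implementation.

```python
import math

def ucb_scores(counts, means, t, c=1.0):
    """Score each arm as mean reward plus an uncertainty bonus.

    counts[i]: times arm i was tried; means[i]: its average reward so far;
    t: current step (>= 1); c: exploration constant (illustrative default).
    """
    scores = []
    for n, mu in zip(counts, means):
        if n == 0:
            # An untried arm is maximally uncertain, so it always wins.
            scores.append(math.inf)
        else:
            scores.append(mu + c * math.sqrt(math.log(t) / n))
    return scores

def select_arm(counts, means, t, c=1.0):
    """Return the index of the arm with the highest UCB score."""
    scores = ucb_scores(counts, means, t, c)
    return max(range(len(scores)), key=lambda i: scores[i])

# Arm 0 looks good, arm 2 looks mediocre, arm 1 has never been tried.
counts = [3, 0, 2]
means = [0.9, 0.0, 0.5]
choice = select_arm(counts, means, t=6)
```

In this example `choice` is arm 1: the rule refuses to settle on the strong-looking arm 0 while an arm remains untested.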
The paper’s real contribution is a failure mechanism, not a leaderboard
The authors test Gemma2 instruction-tuned models at 2B, 9B, and 27B scale. They prompt the models to act in text-based environments, often with chain-of-thought instructions, and then evaluate action choices over repeated interaction. The models see previous actions and rewards, then output a final action in a parseable format such as ACTION=green.
The experiments are organised around a useful diagnostic sequence:
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Failure-mode analysis in multi-armed bandits | Main evidence | LLM agents show greediness, frequency bias, and knowing-doing failures even in simple decision tasks | That every frontier model fails identically |
| RLFT on self-generated CoT rationales | Main intervention | Reward-based fine-tuning improves exploration and reduces regret | That RLFT alone solves exploration |
| Low/high-noise and contextual bandit results | Robustness and transfer within controlled tasks | Improvements are not limited to one bandit variant | That results transfer to open-ended business workflows |
| Exploration mechanisms | Ablation / design comparison | Try-all strategies and exploration bonuses materially improve outcomes | That prompting tricks are sufficient for reliable autonomy |
| Tic-tac-toe | Exploratory extension to stateful environments | RLFT helps beyond stateless bandits | That LLM agents can handle complex sequential planning |
| Legal actions, CoT removal, expert data, thinking-time tests | Ablations and implementation guidance | Context, reasoning budget, and expert data strongly affect behaviour | That more tokens or more data always improve ROI |
This is why a mechanism-first reading matters. A summary would say, “RLFT improves decision-making.” Accurate, but slightly too smooth. The sharper claim is that LLM agents fail through identifiable behavioural mechanisms, and RLFT only partially reshapes those mechanisms.
Greediness is premature closure with better grammar
The first failure mode is greediness. In the paper’s bandit setting, the model often over-favours the best-performing action among the small set it has already tried. That sounds rational until one notices the missing part: it stops exploring before it has sampled enough of the unknown space.
The numbers are revealing. In 10-arm bandits over 50 interaction steps, Gemma2 2B with CoT covers about 40% of available actions. Gemma2 9B and 27B do better, covering about 65%, or 6.5 actions out of 10. Without CoT, all models cover only about 25%. When the action space grows to 20 arms, the largest models cover only about 45% of actions.
That is not exploration. That is browsing the menu, ordering the first decent thing, and later calling it strategy.
The key interpretation is not “small models are bad.” The 27B model also remains greedy. Scale helps with some distortions, but it does not automatically create a disciplined exploration policy. CoT helps too, but not enough. The model can talk more thoughtfully and still leave a large fraction of the action space untouched.
For business agents, this is the first operational warning. An agent that improves over random behaviour may still be systematically under-exploring. In recommendation, that means over-serving early winners. In procurement, it means favouring familiar vendors. In workflow routing, it means sending cases to the first team that appears competent. The metric to watch is not just average reward; it is coverage of reasonable alternatives before exploitation begins.
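Coverage is cheap to measure from an action log. The sketch below assumes a log is simply a list of action labels and that the size of the action space is known; `action_coverage` and `coverage_curve` are hypothetical helper names, not functions from the paper.

```python
def action_coverage(actions, num_actions):
    """Fraction of the action space tried at least once."""
    if num_actions <= 0:
        raise ValueError("num_actions must be positive")
    return len(set(actions)) / num_actions

def coverage_curve(actions, num_actions):
    """Coverage after each step, to show when exploration stalls."""
    seen = set()
    curve = []
    for action in actions:
        seen.add(action)
        curve.append(len(seen) / num_actions)
    return curve

# A greedy-looking trace over a 10-action space.
trace = ["green", "green", "blue", "green", "red", "green"]
final_coverage = action_coverage(trace, num_actions=10)
curve = coverage_curve(trace, num_actions=10)
```

A curve that flattens early, well below 1.0, is the signature described above: the agent has stopped looking.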
Frequency bias is the ghost of pre-training
The second failure mode is frequency bias. Here the model repeatedly selects the action that appears most often in the context, even when that action is not the best-rewarded one.
The authors construct histories where a target action is repeated many times, then observe how the model’s action distribution changes. Gemma2 2B is strongly affected: the paper reports a 96% frequent-action fraction in the relevant setting. In plain language, the smaller model sees repeated text and increasingly treats repetition as instruction. The action is common, therefore it must be right. A very human mistake, though humans usually need meetings to achieve it.
Gemma2 27B mostly escapes this specific frequency bias, with a reported 14% frequent-action fraction. But the escape route is not perfect rationality. The larger model shifts towards greediness instead. It becomes less hypnotised by repeated tokens, but still over-commits to the best observed reward among tried actions.
This distinction matters. “Bigger model” is not a single cure. It may trade one behavioural failure for another. In an enterprise setting, a small operational agent may copy repeated workflow patterns from logs, while a larger agent may avoid copying but still prematurely exploit early positive feedback. Both can look sensible in local traces. Both can be wrong globally.
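One way to track these behaviours in your own traces is to label each chosen action as frequent, greedy, or other, then report the fractions. The sketch below is an assumption-laden approximation: it defines "frequent" as the most common action in the context and "greedy" as the action with the highest mean observed reward, and it labels a choice "frequent" when both apply. The paper's exact categorisation may differ.

```python
from collections import Counter, defaultdict

def classify_choice(history, chosen):
    """Label a chosen action relative to a history of (action, reward) pairs."""
    counts = Counter(action for action, _ in history)
    totals = defaultdict(float)
    for action, reward in history:
        totals[action] += reward
    frequent = counts.most_common(1)[0][0]
    greedy = max(counts, key=lambda a: totals[a] / counts[a])
    if chosen == frequent:
        return "frequent"  # assumed tie-break: frequent wins over greedy
    if chosen == greedy:
        return "greedy"
    return "other"

def category_fractions(labels):
    """Share of each label across many decisions."""
    tally = Counter(labels)
    n = len(labels)
    return {k: tally[k] / n for k in ("frequent", "greedy", "other")}

# "blue" is repeated often but pays badly; "green" was tried once and paid well.
history = [("blue", 0.1)] * 5 + [("green", 0.9)]
labels = [classify_choice(history, a) for a in ["blue", "blue", "green", "red"]]
fractions = category_fractions(labels)
```

A small model drifting towards "frequent" and a large model drifting towards "greedy" would both show up in `fractions`, which is the point: the two failures need separate counters.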
The knowing-doing gap is not a metaphor here
The most interesting part of the paper is the knowing-doing experiment. The authors prompt Gemma2 27B to act according to UCB: write down the algorithm, compute the relevant values, and then choose the action according to those values.
This is a clean test because it separates “can the model reason about the policy?” from “does the model actually execute the policy?” The model is not merely asked to guess. It is asked to compute.
The result is uncomfortable. The model produces correct rationales in about 87% of cases in the main figure. Yet when the rationale is correct, it often still selects the greedy action rather than the UCB-optimal one. The paper reports that among correctly computed rationales, the model chooses the greedy action 58% of the time and the optimal action 21% of the time.
The appendix example makes the failure almost theatrical. The model computes that untried buttons have infinite UCB values, explains that they should be prioritised for exploration, and then selects the previously strong green button anyway. It knows the rule, calculates the signal, says the right words, and presses the wrong button.
That should sound familiar to anyone who has reviewed an agent trace where the reasoning looked excellent until the final action quietly betrayed it.
The lesson is brutal but useful: rationales are not guarantees. They are artefacts. They may reveal the model’s accessible knowledge, but they do not prove that the final action is causally governed by that knowledge. For high-stakes agents, the final action needs independent validation against the stated policy.
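Independent validation can be mundane. The sketch below assumes the system, not the model, holds the policy scores (for example, UCB values recomputed from the logged history) and simply checks whether the emitted action is among the top-scoring ones. The function names and the returned dictionary shape are hypothetical.

```python
import math

def policy_optimal_actions(scores):
    """Return every action tied for the highest policy score."""
    best = max(scores.values())
    return {action for action, score in scores.items() if score == best}

def validate_action(scores, emitted_action):
    """Compare the model's final action with what the stated policy implies."""
    allowed = policy_optimal_actions(scores)
    return {
        "ok": emitted_action in allowed,
        "emitted": emitted_action,
        "allowed": sorted(allowed),
    }

# Two untried buttons have infinite score; green was merely good so far.
scores = {"green": 1.2, "blue": math.inf, "red": math.inf}
greedy_result = validate_action(scores, "green")
explore_result = validate_action(scores, "blue")
```

Here `greedy_result["ok"]` is false and `explore_result["ok"]` is true. A gate like this does not make the model follow its rationale; it only ensures that when it does not, someone finds out before the action executes.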
RLFT helps because rewards punish pretty but useless reasoning
The authors then apply reinforcement learning fine-tuning on self-generated CoT rationales. The model interacts with the environment, generates reasoning plus an action, receives reward, and is fine-tuned so that reasoning-action patterns leading to higher reward become more likely.
The main intervention is not supervised imitation of a fixed dataset; it is reward-based adjustment of the model’s own decision traces. The implementation uses a PPO-style clipping objective with a KL constraint to the reference policy. Actions are extracted from generated text using regex patterns, and invalid outputs receive a reward penalty, set to -5 by default. This last detail is boring in exactly the way production details are boring: if the system cannot reliably emit valid actions, the clever policy does not matter.
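The extraction-plus-penalty step can be sketched in a few lines. The `ACTION=<name>` format and the -5 penalty come from the description above; the exact regex, the choice to take the last match, and the helper names are assumptions for illustration.

```python
import re

# Assumed pattern for outputs such as "ACTION=green".
ACTION_RE = re.compile(r"ACTION\s*=\s*([A-Za-z_]+)")
INVALID_ACTION_PENALTY = -5.0

def extract_action(text, valid_actions):
    """Return the last well-formed, valid action in the text, else None."""
    matches = ACTION_RE.findall(text)
    if not matches:
        return None
    action = matches[-1].lower()
    return action if action in valid_actions else None

def step_reward(text, valid_actions, env_reward):
    """Return (action, reward); unparseable or invalid output is penalised."""
    action = extract_action(text, valid_actions)
    if action is None:
        return None, INVALID_ACTION_PENALTY
    return action, env_reward(action)

valid = {"green", "blue"}
parsed = step_reward("Blue is untried. ACTION=blue", valid, lambda a: 0.7)
unparsed = step_reward("I think green is best.", valid, lambda a: 0.7)
```

`parsed` carries the environment reward, while `unparsed` carries only the penalty, however sensible the surrounding prose sounded.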
The main result is positive. RLFT lowers cumulative regret for Gemma2 2B and 9B in Gaussian multi-armed bandits. For 2B, it narrows the gap to larger models and to UCB. The authors also repeat the experiment on contextual MovieLens bandits, where the agent acts as a recommendation system using user descriptions and movie options. There, ICL performance is close to the random baseline, while RLFT-fine-tuned Gemma2 2B performs similarly to UCB.
RLFT also changes the failure modes, not just the headline regret. For Gemma2 2B, action coverage increases by 12 percentage points after 30K updates in the 10-arm setting. The 20-arm appendix result shows a similar 13-point improvement. Frequency bias is also reduced: for low repetition windows, the frequent-action fraction drops from 70% to 35%, while “other” actions rise from 8% to 35%. The caveat is important: at high repetition counts, the frequency bias remains elevated. RLFT counteracts the bias; it does not erase it.
The mechanism is therefore not “RLFT makes LLMs rational.” It is more modest and more useful: RLFT makes the behaviour more reward-sensitive, which can reduce pathological copying and premature exploitation. But if the reward process does not explicitly value exploration, the model may still under-explore.
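Since cumulative regret is the headline metric in this section, a short sketch of how it is typically computed may help. It assumes the true mean reward of every arm is known to the evaluator, which is realistic in a simulator and rarely elsewhere; the function name is illustrative.

```python
def cumulative_regret(arm_means, chosen_arms):
    """Running sum of (best mean reward - mean reward of the chosen arm)."""
    best = max(arm_means)
    total = 0.0
    curve = []
    for arm in chosen_arms:
        total += best - arm_means[arm]
        curve.append(total)
    return curve

# Three arms; the agent tries arm 0, finds arm 2, wobbles once, then settles.
arm_means = [0.2, 0.5, 0.9]
regret = cumulative_regret(arm_means, [0, 2, 1, 2])
```

Regret grows only on steps where a sub-optimal arm is chosen, so a curve that keeps climbing means the agent has locked onto something other than the best option.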
Tic-tac-toe shows transfer, but also the need for state constraints
The Tic-tac-toe experiment is an exploratory extension into a stateful environment. Unlike a bandit, Tic-tac-toe has transitions: an action changes the board, and the next state depends on the current move. The agent receives 1 for winning, 0 for drawing, and -1 for losing.
RLFT substantially improves Gemma2 2B. Against a random opponent, the average return rises from 0.15 to 0.75. Against an optimal MCTS baseline, the agent moves from a heavily losing result towards drawing. That is a meaningful improvement, especially because the setting is still text-based and action extraction remains brittle.
But the legal-action ablation is the part operators should remember. When the legal actions are provided in the context, the RLFT agent reaches about 0.75 average return. Without legal actions, the return drops to about 0.45. The model struggles to identify the currently valid action subset on its own.
That result generalises conceptually to business workflows. Do not merely tell an agent to “choose the next best action.” Provide the valid action set. Remove impossible actions. Encode policy constraints upstream. A model that must infer the action space from prose is being asked to waste cognition on guardrails the system designer should have supplied. Yes, it is less glamorous. So are brakes.
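Supplying the valid action set is usually a few lines of ordinary code. The sketch below assumes a Tic-tac-toe board stored as a list of nine cells with a space for empty squares; the board encoding, the prompt wording, and the helper names are illustrative, not the paper's format.

```python
EMPTY = " "

def legal_actions(board):
    """Indices (0-8) of the squares that are still empty."""
    return [i for i, cell in enumerate(board) if cell == EMPTY]

def describe_state(board):
    """Render the board plus an explicit list of legal moves for the prompt."""
    rows = ["|".join(board[r:r + 3]) for r in (0, 3, 6)]
    moves = ", ".join(str(i) for i in legal_actions(board))
    return "\n".join(rows) + f"\nLegal actions: {moves}"

def is_legal(board, action):
    """Guardrail applied to the model's output before the move is played."""
    return action in legal_actions(board)

board = ["X", "O", " ",
         " ", "X", " ",
         " ", " ", "O"]
prompt_state = describe_state(board)
```

The same helper serves twice: once to put the legal moves in the context, and once to reject an illegal move after the model answers.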
Exploration scaffolds beat motivational speeches
The paper then tests exploration mechanisms: try-all, $\epsilon$-greedy, context randomisation, context summaries, self-correction, self-consistency, and an exploration bonus.
The strongest result is not a baroque agentic ritual. It is simple scaffolding. A try-all strategy, inspired by UCB’s initial sampling of untried actions, gives the model enough information to act much better. Gemma2 27B almost closes the gap to UCB under this setup. The interpretation is direct: once the model has enough information about action quality, it can often select appropriately. Its weakness is generating that information through exploration.
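A try-all scaffold can be sketched as a thin wrapper around whatever policy picks the action. The example assumes integer arm indices and a history of `(arm, reward)` pairs, and it uses a simple greedy policy as a stand-in for the model; none of these names come from the paper.

```python
def greedy_policy(history, num_arms):
    """Stand-in policy: pick the tried arm with the best average reward."""
    totals = [0.0] * num_arms
    counts = [0] * num_arms
    for arm, reward in history:
        totals[arm] += reward
        counts[arm] += 1
    tried = [a for a in range(num_arms) if counts[a] > 0]
    if not tried:
        return 0
    return max(tried, key=lambda a: totals[a] / counts[a])

def try_all_first(policy, num_arms):
    """Force one pull of every arm before the wrapped policy takes over."""
    def wrapped(history):
        tried = {arm for arm, _ in history}
        for arm in range(num_arms):
            if arm not in tried:
                return arm
        return policy(history, num_arms)
    return wrapped

agent = try_all_first(greedy_policy, num_arms=3)
first_three = [agent([]), agent([(0, 0.9)]), agent([(0, 0.9), (1, 0.1)])]
after_sweep = agent([(0, 0.2), (1, 0.1), (2, 0.8)])
```

Without the wrapper, the greedy stand-in would stay on arm 0 after one good reward. With it, the first three choices cover all arms and the later choice lands on arm 2, the arm the agent would otherwise never have met.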
The exploration bonus is also revealing. Adding a +1 reward for selecting an untried action during RLFT increases action coverage from 50% to 70% and lowers regret towards the expert compared with regular RLFT. That is reward shaping doing exactly what reward shaping is supposed to do: making the desired behaviour visible to the optimiser.
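The bonus itself is a very small piece of reward shaping. The sketch below assumes the +1 bonus described above and a per-episode set of tried actions; the function name and signature are hypothetical.

```python
def shaped_reward(env_reward, action, tried, bonus=1.0):
    """Add a bonus the first time an action is selected in an episode.

    `tried` is a set of actions already selected; it is updated in place.
    """
    is_new = action not in tried
    tried.add(action)
    return env_reward + (bonus if is_new else 0.0)

tried = set()
first_pull = shaped_reward(0.3, "green", tried)   # untried: bonus applies
second_pull = shaped_reward(0.3, "green", tried)  # already tried: no bonus
new_arm = shaped_reward(0.0, "blue", tried)       # untried, even with zero reward
```

The bonus only affects the training signal. At evaluation time the agent should still be scored on the unshaped environment reward, otherwise the metric rewards novelty rather than outcomes.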
Self-correction and self-consistency are included, but they are not the central story. The operational lesson is that exploration must be designed into the training and deployment loop. Asking the model to reconsider can help in some settings, but it is not a substitute for a reward function that values information gathering.
The ablations are an engineering checklist, not decorative science
The paper’s ablations are useful because they expose which knobs matter.
First, CoT matters. Removing CoT instructions and forcing short action-only outputs speeds training, but performance suffers. In the 10-arm Gaussian MAB setting, RLFT without CoT barely reaches the performance of ICL with CoT. The reasoning tokens are not just verbal garnish; they appear to support exploration and rationalisation.
Second, expert data is powerful when available. The authors construct UCB expert datasets with 32K rollouts and 1.6M transitions, both with and without CoT. Supervised fine-tuning on those expert traces can mimic UCB and reach comparable regret. That is not surprising, but it is operationally important. If a business already has high-quality decision traces from a trusted policy, imitation may be cheaper and more predictable than learning everything through reward interaction.
Third, more thinking time helps, but it costs. Increasing the generation budget from 256 to 512 tokens improves Gemma2 2B performance to roughly the level of Gemma2 9B with RLFT. Reducing the budget to 16 or 64 tokens hurts. But the cost is not linear in a pleasant spreadsheet-friendly way. In a 50-step task with a 500-token generation budget, the agent may produce up to 25K tokens per episode (50 steps × 500 tokens). During RLFT, rollout generation can dominate training time. Reasoning is not free just because it is text.
Fourth, some implementation details refuse to be ignored. The authors train on 8 H100 GPUs, use an accumulated batch size of 128, fine-tune bandit agents for 30K updates, and report mean results with 95% confidence intervals across three seeds. They also note that LoRA reduced memory but was insufficient for improving decision-making in their RLFT setting. That does not make LoRA useless; it makes adapter choice an empirical question, not a procurement checkbox.
What the paper directly shows, what Cognaptus infers, and what remains uncertain
The cleanest business reading is to separate evidence from inference.
| Layer | Statement | Confidence |
|---|---|---|
| What the paper directly shows | Gemma2 2B/9B/27B agents exhibit greediness, frequency bias, and a measurable knowing-doing gap in controlled decision tasks | High within the tested settings |
| What the paper directly shows | RLFT on self-generated CoT rationales reduces regret and improves exploration, but remains below classical exploration methods in some comparisons | High within the tested settings |
| What the paper directly shows | Explicit scaffolds such as try-all, exploration bonuses, legal-action context, expert data, and larger reasoning budgets can materially improve behaviour | High for the tested ablations |
| What Cognaptus infers | Enterprise agents should be evaluated on exploration coverage, not only final reward, task completion, or quality of rationale | Strong practical inference |
| What Cognaptus infers | Production systems should constrain valid actions, reward information gathering, and verify final actions against the stated decision policy | Strong practical inference |
| What remains uncertain | Whether the same magnitudes hold for frontier proprietary models in long-horizon, tool-rich, multi-user business processes | Open |
| What remains uncertain | Whether RLFT is economically preferable to supervised fine-tuning, rules, retrieval, simulators, or hybrid control in a given enterprise workflow | Context-dependent |
This distinction matters because the temptation is obvious. A vendor will read “RLFT improves decision-making” and turn it into “our agent learns optimal business actions.” Kindly confiscate that slide.
The paper supports a more disciplined view: language models can be components in decision systems, but the decision system still needs exploration design. The model’s text interface does not eliminate the old RL problem. It merely gives the old RL problem a more persuasive voice.
The enterprise failure mode is local success mistaken for global optimisation
Many business agents will be judged on short-term reward proxies. Did the email get a reply? Did the user click? Did the supplier accept? Did the ticket close? These metrics are seductive because they are observable. They are also dangerous because they can reward early exploitation.
A greedy LLM agent may look excellent during early pilots. It finds a workable answer quickly, repeats it, and avoids obviously bad moves. The trouble appears later: under-tested alternatives, stale recommendations, vendor concentration, pricing rigidity, workflow bottlenecks, or systematically ignored customer segments. By then, the agent has not failed dramatically. It has merely narrowed the organisation’s search space while sounding helpful.
The paper’s bandit framing gives operators a better diagnostic vocabulary:
| Operational symptom | Likely agent mechanism | Design response |
|---|---|---|
| Agent keeps choosing the first option that worked | Greediness | Require minimum exploration coverage before exploitation |
| Agent copies common historical actions | Frequency bias | Debias training logs; measure action diversity under repeated contexts |
| Agent writes correct rationale but chooses inconsistent action | Knowing-doing gap | Validate action against policy computation before execution |
| Agent fails in constrained games or workflows | Missing valid-action context | Provide allowed action sets explicitly |
| Agent improves with longer reasoning but becomes expensive | Reasoning-budget dependency | Set token budgets by decision value, not aesthetic preference |
| Agent performs well only with expert traces | Imitation dependence | Use supervised fine-tuning where expert policies already exist |
This is the operational version of the paper: do not ask whether the agent “reasons.” Ask whether the agent’s behaviour preserves enough uncertainty to learn.
The boundary: controlled tasks, not a universal verdict
The limitations are not small print; they define how the result should be used.
The models are Gemma2 2B, 9B, and 27B, not the full frontier model zoo. The environments are controlled: multi-armed bandits, contextual MovieLens bandits, and text-based Tic-tac-toe. The bandit horizon is 50 steps, which the authors consider sufficient for 5 and 10 arms but insufficient for 20 arms. The tasks are useful precisely because they isolate exploration; they do not capture all the messiness of enterprise automation.
The paper also leaves open the economics. RLFT can help, but it requires rollouts, reward design, infrastructure, and careful evaluation. In some business settings, explicit rules, expert demonstrations, constrained action spaces, retrieval, simulators, or classical optimisation may deliver most of the value with less drama. Deep learning is not always improved by adding another training loop and hoping the invoice is character-building.
Finally, the paper tests decision-making in environments where rewards are available. Many business processes have delayed, noisy, political, or misaligned rewards. “Closed the ticket” may not mean “solved the problem.” “User clicked” may not mean “user benefited.” If the reward is wrong, RLFT may faithfully optimise the wrong behaviour. As ever, the machine is not confused; it is obedient in the worst possible way.
The useful conclusion is design discipline
The article’s title says “smart AI gets it wrong,” but the more precise version is this: smart AI can get the action-selection mechanism wrong even when it gets the explanation right.
That is the knowing-doing gap in its most practical form. It is not philosophical hand-wringing about whether models “understand.” It is a testable systems issue. Does the agent choose the action implied by its own decision rule? Does it explore enough before exploiting? Does it confuse repeated context with good evidence? Does it need the valid action set written out because otherwise it wastes itself guessing what moves are allowed?
The paper’s answer is uncomfortable but constructive. RLFT improves behaviour. CoT helps. Expert data helps. Exploration bonuses help. Legal-action context helps. More thinking time helps, until the bill arrives. But none of these removes the need for explicit decision governance.
For business leaders, the message is simple: do not buy autonomy by the paragraph. Evaluate it by the action distribution.
For builders, the checklist is equally plain: measure coverage, reward exploration, constrain actions, verify action-rationale consistency, and compare against classical baselines before declaring victory. The old algorithms may not write charming explanations, but at least UCB knows not to ignore an untried button while saying it should explore it.
Cognaptus: Automate the Present, Incubate the Future.
[1] Thomas Schmied, Jörg Bornschein, Jordi Grau-Moya, Markus Wulfmeier, and Razvan Pascanu, “LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities,” arXiv:2504.16078, 2025, https://arxiv.org/abs/2504.16078.