When Tokens Become Actions: A Policy Gradient Built for Transformers

Tool calls are not tokens. Neither are paragraphs, reasoning blocks, spreadsheet edits, web searches, code executions, or the awkward little detours an agent takes before finally answering the user.

Yet much of reinforcement learning for language models still behaves as if it must choose between two unsatisfying extremes. At one end, every token is treated as a tiny action. At the other, the whole answer is treated as one indivisible action. The first view is mathematically tidy and operationally noisy. The second is practical for verifiable tasks, but it compresses an entire reasoning process into one final score, which is a bit like reviewing an employee only by checking whether the office building is still standing.

The paper “GPG: Generalized Policy Gradient Theorem for Transformer-based Policies” argues that this binary choice is unnecessary.¹ Its central move is simple but important: for Transformer policies, an action does not have to be a token, and it does not have to be the full output. It can be an arbitrary output segment: a tool call, a reasoning phase, a paragraph, or any chunk of generation that has semantic or operational meaning.

That makes the paper more interesting than a new benchmark table. The table matters, but the mechanism matters first. GPG reframes policy optimization as a question of where the learning signal should attach.

The old action unit is too small, too large, or both

Classical policy gradient theory was built for reinforcement learning settings where an agent observes a state, takes an action, receives reward, and moves through an environment. That structure fits games, robotics, and control problems reasonably well.

A Transformer-based language model behaves differently. Its “state” is the current context. Its next state is created by appending the generated token. The transition is not an external stochastic environment in the usual sense; it is the autoregressive construction of the sequence itself.

The paper leans on this distinction. In a Transformer, the next context is deterministically formed by the previous context plus the generated token. That means the policy can be written not only as a product over tokens, but also as a product over larger generated segments.

This is the conceptual hinge.

If every output token is a separate action, we get token-level policy gradient. This is granular, but the reward signal can become noisy or poorly aligned, especially when success depends on a later tool result or final answer.

If the entire output sequence is one action, we get something closer to the GRPO style of whole-trajectory comparison. This is useful when only the final answer is verifiable. But it also means the model receives limited information about which part of the trajectory mattered.

GPG says: stop treating these as separate religions. They are endpoints on the same segmentation spectrum.

GPG turns segmentation into the policy-gradient dial

The paper defines macro-states and macro-actions over a generated sequence. The macro-state is the context before a segment. The macro-action is the segment generated from that context.

In simplified form, the Transformer policy over an output can be decomposed as:

$$ \pi_\theta(MA \mid MS_1) = \prod_{T=1}^{K} \pi_\theta(MA_T \mid MS_T) $$

Here, $K$ is the number of macro-action segments. If $K$ equals the number of output tokens, each macro-action is one token. If $K=1$, the entire output is one macro-action. If $1<K<|\text{output}|$, the model is trained over meaningful intermediate chunks.
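One way to see why the cut points are free is to take logarithms. The step below is added here for illustration, and the symbols $o_t$ (the $t$-th output token) and $\text{seg}(T)$ (the token positions inside segment $T$) are notation assumed for this sketch, not taken from the paper. The chain rule gives a sum over tokens, and any segmentation merely regroups that sum:

$$ \log \pi_\theta(MA \mid MS_1) = \sum_{t=1}^{|\text{output}|} \log \pi_\theta(o_t \mid MS_1, o_{<t}) = \sum_{T=1}^{K} \sum_{t \in \text{seg}(T)} \log \pi_\theta(o_t \mid MS_1, o_{<t}) = \sum_{T=1}^{K} \log \pi_\theta(MA_T \mid MS_T) $$

The last equality holds because $MS_T$ is simply $MS_1$ followed by every token generated before segment $T$, so the inner sum is exactly the log-probability of that segment given its macro-state.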

The generalized policy gradient then takes the form:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{T=1}^{K} \nabla_\theta \log \pi_\theta(MA_T \mid MS_T)\,\Phi_T \right] $$
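In code, an estimator of this shape is usually implemented as a surrogate loss whose gradient matches the bracketed sum. The sketch below is a minimal illustration for one sampled trajectory, not the paper's implementation. The function name `gpg_surrogate_loss`, the example numbers, and the reading of $\Phi_T$ as a per-segment credit value (such as a return or an advantage estimate) are assumptions. In an autodiff framework the log-probabilities would be tensors; plain floats are used here to keep the example self-contained.

```python
def gpg_surrogate_loss(token_logprobs, boundaries, phi):
    """Return -sum_T phi[T] * log pi(MA_T | MS_T) for one sampled trajectory.

    token_logprobs: log-probability of each generated token under the policy.
    boundaries: exclusive end index of each segment, in increasing order;
        the last entry must equal len(token_logprobs).
    phi: one credit value per segment, treated as a constant.
    """
    if len(boundaries) != len(phi):
        raise ValueError("need exactly one credit value per segment")
    if not boundaries or boundaries[-1] != len(token_logprobs):
        raise ValueError("segments must cover the whole output")
    loss, start = 0.0, 0
    for end, credit in zip(boundaries, phi):
        # log pi(MA_T | MS_T) is the sum of the token log-probs in the segment.
        loss -= credit * sum(token_logprobs[start:end])
        start = end
    return loss


# Made-up log-probs for a four-token output.
logps = [-0.2, -1.1, -0.4, -0.9]

# K = |output|: every token carries its own credit value.
token_level = gpg_surrogate_loss(logps, [1, 2, 3, 4], [0.5, 0.5, -1.0, -1.0])
# K = 1: one credit value for the whole output.
sequence_level = gpg_surrogate_loss(logps, [4], [0.3])
# 1 < K < |output|: one credit value per macro-action.
macro_level = gpg_surrogate_loss(logps, [2, 4], [1.0, -0.5])
```

The three calls differ only in `boundaries` and `phi`, so the segmentation acts as a parameter of the loss and does not require a different algorithm.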

The formula is not the dramatic part. The dramatic part is what it allows.

Segmentation choice	What the action means	Method recovered or enabled	Practical interpretation
$K=|\text{output}|$	One token	Standard token-level policy gradient	Maximum granularity, often noisy credit assignment
$K=1$	Full output sequence	GRPO-like sequence-level optimization	Useful when only final answers are scored
$1<K<|\text{output}|$	Meaningful output segment	General GPG case	Tool calls, reasoning blocks, paragraphs, or uncertainty-triggered branches become trainable units

This is why the paper should not be read as “yet another RL algorithm beats GRPO.” That would be the lazy version. The deeper contribution is that GPG provides a shared theoretical frame in which token-level policy gradient and sequence-level GRPO become special cases.

The reader misconception worth correcting is therefore precise: the choice is not token-level versus whole-answer reinforcement learning. The choice is where to cut the trajectory.

Why Transformers make this segmentation natural

The proof depends on two properties that are easy to overlook because they are so familiar.

First, Transformer generation is autoregressive. Each generated token conditions on the previous input and output tokens. Second, the state transition inside generation is deterministic: the next state is the previous state with the newly generated tokens appended.

The paper uses these properties to rewrite the policy over a full output sequence into macro-action probabilities. Once that rewriting is valid, the gradient can be expressed as a sum over macro-actions rather than necessarily as a sum over individual tokens.

This is architecture-aware reinforcement learning. Not in the vague marketing sense where “architecture-aware” means someone remembered the word Transformer. Here it means the optimization theorem explicitly uses the autoregressive structure of the model.

That distinction matters because LLM agents are not merely producing answers. They are increasingly producing structured trajectories:

a reasoning block;
a browser query;
a retrieved passage;
a Python execution;
a partial answer;
a correction;
a final response.

Treating that entire chain as one action wastes structure. Treating every token as its own action ignores structure. GPG says the structure itself should define the action units.

The implementation pipeline is really a segmentation-and-credit pipeline

The paper’s practical implementation section translates the theorem into a four-phase pipeline. This part is best read as an implementation design, not as independent empirical evidence.

Paper component	Likely purpose	What it supports	What it does not prove
Transformer macro-state / macro-action decomposition	Main theoretical mechanism	GPG can express policy gradients over arbitrary output segments	That every segmentation rule will work well in practice
Relation to token-level PG and GRPO	Theoretical comparison with prior methods	Existing methods are special cases of the GPG frame	That GPG automatically dominates them under all tasks
Four-phase pipeline	Implementation detail	How GPG can be instantiated for LLM training	That each phase independently improves results
ARPO benchmark table	Main empirical evidence	Segment-aware agent training can outperform trajectory-level RL baselines on selected tasks	Production reliability, cost efficiency, safety, or enterprise ROI

The four phases are:

Trajectory initialization: sample multiple output trajectories from the policy model.
Macro-action segmentation: split trajectories using marker tokens or semantic boundaries.
Macro-action beaming: generate candidate continuations from macro-states.
Advantage estimation: compute calibrated advantages based on reward differences across related trajectories.

The segmentation examples are revealing. For agentic reasoning, special tags such as tool-use or thinking markers can define boundaries. For document composition, paragraph breaks can act as boundaries. For creative problem-solving, the paper suggests high-entropy tokens as possible split points, because uncertainty may indicate a meaningful branch in generation.
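The two boundary rules described above can be sketched in a few lines. This is an illustrative toy, not the paper's code: the tag names, the helper names `split_on_markers` and `high_entropy_positions`, the probability values, and the threshold are all assumptions.

```python
import math


def split_on_markers(tokens, closing_markers):
    """Cut a token list into segments that end at each closing marker."""
    segments, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in closing_markers:
            segments.append(current)
            current = []
    if current:  # trailing tokens after the last marker form the final segment
        segments.append(current)
    return segments


def high_entropy_positions(next_token_dists, threshold):
    """Return positions whose next-token distribution has entropy above threshold."""
    positions = []
    for index, probs in enumerate(next_token_dists):
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        if entropy > threshold:
            positions.append(index)
    return positions


# Hypothetical agent output with explicit structural tags.
tokens = ["<think>", "plan", "</think>",
          "<search>", "query", "</search>",
          "<result>", "passage", "</result>",
          "final", "answer"]
segments = split_on_markers(tokens, {"</think>", "</search>", "</result>"})

# Hypothetical next-token distributions at four generation steps.
dists = [[0.97, 0.03], [0.5, 0.5], [0.9, 0.1], [0.25, 0.25, 0.25, 0.25]]
candidate_cuts = high_entropy_positions(dists, threshold=0.6)
```

The marker rule needs nothing beyond the output format, which is part of why tool-use trajectories are the easiest case. The entropy rule needs access to the model's token distributions and a threshold that would have to be tuned.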

The agentic case is the cleanest. Tool-using agents already produce explicit structural markers. A tool call is not merely a string. It is an operational step. A method that can attach learning signal around that step is better aligned with what the agent is actually doing.

The beaming step then expands candidate continuations from macro-states. This is not just more sampling for the sake of more sampling. It creates alternative futures from the same partial trajectory, making it easier to compare which continuation after a given segment helped or hurt.
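A minimal sketch of that branching idea follows, under simplifying assumptions. It branches at every macro-state, which may differ from how the paper chooses branch points, and `sample_continuation` is a hypothetical stand-in for the policy model.

```python
import random


def beam_from_macro_states(trajectory, sample_continuation, branches_per_state=2):
    """Branch extra rollouts from every intermediate macro-state of one trajectory.

    trajectory: list of segments; the macro-state before segment i is the
        prefix trajectory[:i].
    sample_continuation: hypothetical callable that takes a prefix and returns
        the list of segments completing the rollout from that prefix.
    """
    rollouts = [list(trajectory)]
    for cut in range(1, len(trajectory)):
        prefix = trajectory[:cut]
        for _ in range(branches_per_state):
            rollouts.append(list(prefix) + list(sample_continuation(prefix)))
    return rollouts


rng = random.Random(0)


def toy_sampler(prefix):
    # Stand-in for the policy: pick a made-up next segment, then finish.
    return [rng.choice(["search", "compute", "answer"]), "final"]


original = ["think", "search", "final"]
rollouts = beam_from_macro_states(original, toy_sampler, branches_per_state=2)
```

Every branched rollout shares an exact prefix with the original trajectory, which is what makes later comparisons between continuations meaningful.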

The advantage estimation step is also important. The paper begins with group-relative reward normalization, then calibrates token-level or segment-level advantages by averaging over trajectories that share a prefix. In plain language: when several trajectories have the same beginning, the method tries to assign credit based on how different continuations behave after that shared point.

That is the heart of the mechanism. It is not merely “reward the good final answer.” It is closer to “find the branch where the answer became better.”
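The shared-prefix idea can be made concrete with a small sketch. This is one plausible reading of the description above, not the paper's exact formula: rewards are first normalized within the group, then each segment's advantage is the mean normalized reward of all trajectories that share the prefix ending at that segment. The function names and the toy trajectories are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prefix_calibrated_advantages(trajectories, rewards):
    """Per-segment advantages averaged over trajectories sharing the same prefix."""
    base = group_relative_advantages(rewards)
    calibrated = []
    for traj in trajectories:
        per_segment = []
        for t in range(1, len(traj) + 1):
            prefix = traj[:t]
            # Collect the advantage of every trajectory that starts with this prefix.
            sharing = [a for other, a in zip(trajectories, base) if other[:t] == prefix]
            per_segment.append(mean(sharing))
        calibrated.append(per_segment)
    return calibrated


# Toy group: all rollouts share "think", then branch on the search step.
trajectories = [
    ["think", "search_a", "answer_x"],
    ["think", "search_a", "answer_y"],
    ["think", "search_b", "answer_z"],
    ["think", "search_b", "answer_w"],
]
rewards = [1.0, 1.0, 0.0, 0.0]
calibrated = prefix_calibrated_advantages(trajectories, rewards)
```

In this toy group the shared `think` segment receives roughly zero credit because it precedes both good and bad outcomes, while `search_a` receives positive credit and `search_b` negative credit. The branch point is where the signal lands.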

ARPO is the paper’s agentic instantiation, not the whole theorem

The paper instantiates GPG as Agentic Reinforced Policy Optimization, or ARPO, for tool-use agents. This is a sensible experimental choice because tool-use trajectories naturally contain segmentation boundaries. A browser search, code interpreter call, or structured reasoning tag gives the method something concrete to cut around.

The experiments cover two task families:

Task family	Benchmarks used	Evaluation style
Mathematical reasoning	GSM8K, MATH, MATH500, AIME2024, AIME2025	LLM-as-judge using Qwen2.5-72B-Instruct for non-knowledge-intensive tasks
Knowledge-intensive reasoning	WebWalker, HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle	F1 score

The evaluated models include Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Baselines include direct instruction-tuned models, tool-integrated reasoning prompting, and trajectory-level RL methods: GRPO, DAPO, and REINFORCE++.

The headline result is straightforward: ARPO achieves the highest average score for each tested backbone.

Backbone	Best trajectory-level baseline average	ARPO average	Absolute gain
Qwen2.5-3B	50.6	52.8	+2.2
Qwen2.5-7B	56.5	58.3	+1.8
Llama3.1-8B	51.1	55.3	+4.2
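The gain column can be reproduced from the two averages in the table. The short check below also computes the relative gain, which is plain arithmetic on the reported averages and not a figure stated in the paper. The variable names are illustrative.

```python
# (best trajectory-level baseline average, ARPO average) as reported in the table.
averages = {
    "Qwen2.5-3B": (50.6, 52.8),
    "Qwen2.5-7B": (56.5, 58.3),
    "Llama3.1-8B": (51.1, 55.3),
}

# Absolute gain in points, rounded to one decimal like the table.
absolute_gain = {name: round(arpo - base, 1) for name, (base, arpo) in averages.items()}

# Relative gain over the baseline, in percent.
relative_gain = {name: round(100 * (arpo - base) / base, 1)
                 for name, (base, arpo) in averages.items()}

for name in averages:
    print(f"{name}: +{absolute_gain[name]} points, +{relative_gain[name]}% relative")
```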

The gains are not cartoonishly large, which is good. Cartoonishly large benchmark gains usually deserve a raised eyebrow and a second coffee. Here the result is more modest and more useful: ARPO improves average performance over strong RL baselines in the precise setting where its mechanism should matter, namely long-horizon tool-augmented reasoning.

The pattern across task types is also more informative than the average alone. Knowledge-intensive tasks such as HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle require multi-step information use. In those settings, the difference between “the final answer was right” and “this search/tool step helped” is operationally meaningful. That is where segment-aware learning has a plausible advantage over whole-trajectory reinforcement.

Still, the table should be interpreted carefully. ARPO does not win every individual cell. For example, some baselines match or exceed it on particular math benchmarks under particular backbones. The result supports the claim that ARPO is stronger on average across the selected evaluation suite, not that segment-aware optimization is magically superior in every local comparison. Magic remains unavailable, despite considerable industry procurement interest.

The evidence supports the package, not every internal component

The experimental section is the main evidence for ARPO as an applied method. It is not a full ablation study of every design choice implied by GPG.

This matters. The paper presents a theoretical framework, an implementation pipeline, and benchmark results. The results support the combined ARPO approach under the reported setup. They do not isolate the independent contribution of:

segmentation boundaries;
macro-action beaming;
calibrated advantage estimation;
the specific tool environment;
the choice of reward/evaluation protocol.

That does not invalidate the paper. It simply tells us what the evidence can and cannot carry.

The comparison with GRPO, DAPO, and REINFORCE++ is a comparison with prior trajectory-level RL methods. It supports the claim that segment-aware optimization can be more effective for the chosen agentic reasoning tasks. It does not yet answer whether the same gains would appear under different tools, proprietary workflows, larger closed models, non-English data, stricter human evaluation, or cost-constrained production settings.

The implementation figures should also be read correctly. They clarify the mechanism. They are not empirical tests. Figure-style diagrams in papers often seduce readers into thinking the pipeline has already been separately validated. It has been explained. That is different.

Business value begins at workflow segmentation

The practical lesson for business users is not “use ARPO tomorrow.” Most companies are not training RL-optimized LLM agents from scratch, and pretending otherwise is how consulting decks get composted.

The more useful lesson is that agent workflows should be designed around trainable and evaluable segments.

Enterprise AI systems already have natural macro-actions:

Business workflow	Possible macro-action unit	Why it matters
Research assistant	Search query, source selection, evidence extraction	Evaluates whether the agent found useful evidence before writing
Spreadsheet automation	Formula creation, data cleaning step, validation check	Separates calculation quality from final report wording
Customer support	Intent classification, policy lookup, escalation decision	Identifies which step caused a wrong response
Coding assistant	Test generation, function edit, runtime execution, bug fix	Rewards working intermediate actions, not only final patch text
Compliance review	Clause extraction, risk classification, citation mapping	Makes auditability part of the learning structure

This is the business relevance pathway from the paper: macro-action segmentation gives organizations a way to think about agent training and evaluation at the level where work actually happens.

What the paper directly shows is that, in selected benchmark settings, ARPO improves average performance over trajectory-level RL baselines for tool-use reasoning. What Cognaptus can infer is broader but more tentative: enterprise agent systems may benefit from defining explicit intermediate action boundaries, because those boundaries can support better credit assignment, monitoring, and workflow diagnosis.

That inference is not the same as proof of ROI. It is a design direction.

The agent stack becomes less like a prompt and more like an operating procedure

A prompt says, “Do the task.” A workflow says, “Search here, verify there, compute this, cite that, stop if uncertain.”

GPG favors the second world.

If macro-actions matter, then the way an agent’s output is structured becomes part of the optimization surface. Tags, tool schemas, paragraph boundaries, intermediate checkpoints, and uncertainty markers are no longer just formatting choices. They are potential credit-assignment boundaries.

This has several operational consequences.

First, tool interfaces should be designed for learnable action boundaries. A messy blob of text that sometimes contains a tool call is harder to optimize than a structured trajectory where tool use is explicit.

Second, evaluation should not only score final answers. A company can evaluate intermediate steps: Was the database query valid? Was the retrieved document relevant? Did the code execution actually test the claim? Did the agent escalate when its confidence fell below the agreed threshold?

Third, logs become training assets. If the agent’s trajectory is structured into meaningful macro-actions, production traces can later support more diagnostic learning. Without structure, logs are just expensive noodles.

Fourth, workflow governance becomes more realistic. Regulators, managers, and users rarely care which token caused a failure. They care which decision step caused it. Segment-level analysis is closer to human audit logic.
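As a rough sketch of what a structured, auditable trace could look like, the record type below stores one entry per macro-action together with the outcome of a step-level check. Everything here is hypothetical: the class name, field names, step kinds, and example trace are invented for illustration and do not come from the paper or any particular logging library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MacroActionRecord:
    kind: str                            # e.g. "db_query", "retrieval", "final_answer"
    content: str                         # raw text of the segment
    passed_check: Optional[bool] = None  # None means no step-level check was run


def first_failed_step(trace: List[MacroActionRecord]) -> Optional[int]:
    """Return the index of the earliest macro-action whose check failed, or None."""
    for index, record in enumerate(trace):
        if record.passed_check is False:
            return index
    return None


# Hypothetical production trace for one request.
trace = [
    MacroActionRecord("db_query", "query text", passed_check=True),
    MacroActionRecord("retrieval", "retrieved passage", passed_check=False),
    MacroActionRecord("code_exec", "script output"),  # no check available
    MacroActionRecord("final_answer", "answer text", passed_check=False),
]

failed_at = first_failed_step(trace)
if failed_at is not None:
    print(f"first failing step: {trace[failed_at].kind}")
```

A trace stored this way answers the audit question directly: the final answer failed, but the first step that went wrong was the retrieval.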

This is where the paper’s theoretical framing quietly touches business process automation. Not because every firm will implement GPG. Most will not. But because the paper formalizes a direction that serious agent systems are already moving toward: from prompt engineering to structured, inspectable, optimizable action traces.

Where the result should not be overextended

The paper’s boundaries are specific.

The empirical evidence is based on selected mathematical and knowledge-intensive reasoning benchmarks. Those are useful stress tests, especially for tool-use and delayed reward, but they are not the same as enterprise production workflows.

The evaluated backbones are open models in the Qwen2.5 and Llama3.1 families at relatively modest sizes. The results may not carry over directly to larger proprietary systems, different training recipes, or more constrained deployment environments.

The evaluation protocol uses F1 for knowledge-intensive tasks and LLM-as-judge with Qwen2.5-72B-Instruct for other tasks. That is acceptable for a research benchmark, but business settings may require human review, deterministic verification, compliance scoring, or cost-sensitive evaluation.

The paper also does not prove that every segmentation strategy is good. A tool boundary is natural. A paragraph break may be useful for document generation. A high-entropy token as a creative boundary is intriguing, but more exploratory. Bad segmentation could easily create bad credit assignment. Cutting a trajectory at the wrong places is not sophistication; it is just a more mathematical way to be confused.

Finally, the paper does not establish lower inference cost, lower training cost, safety improvement, or production reliability. Those may become future advantages, but they are not demonstrated here.

The real contribution is making the action unit negotiable

The most useful way to read GPG is as a generalization of the action unit in Transformer policy optimization.

Before GPG, the practical debate often looked like this:

token-level optimization gives granularity but struggles with credit assignment over long reasoning;
sequence-level optimization gives stable final-outcome comparison but hides intermediate structure.

GPG changes the framing:

define macro-actions;
attach policy gradients to those macro-actions;
recover token-level and sequence-level methods as special cases;
instantiate the middle ground for structured agents.

This is a clean conceptual move. It does not overthrow PPO. It does not make GRPO obsolete. It does not solve all of agent training, because the universe remains inconsiderate. What it does is give researchers and system designers a better vocabulary for the actual problem: deciding which parts of a Transformer-generated trajectory should be treated as decisions.

For business readers, that vocabulary matters. The next generation of useful agents will not be judged only by fluent final answers. They will be judged by whether they can execute structured work: search, calculate, compare, cite, decide, and recover from errors.

When tokens become actions, the real question is no longer whether the model can generate. It is whether the system knows where the meaningful actions begin and end.

Cognaptus: Automate the Present, Incubate the Future.

¹ Hangyu Mao, Guangting Dong, and Zhicheng Dou, “GPG: Generalized Policy Gradient Theorem for Transformer-based Policies,” arXiv:2512.10365, 2025, https://arxiv.org/abs/2512.10365.
