Tool Time, Any Time: Inside RLFactory’s Plug‑and‑Play RL for Multi‑Turn Tool Use

Tool calls are where agent demos stop being cute.

A chatbot can talk through a task all day. A working agent has to search, query, execute, verify, retry, and sometimes discover that the tool it politely called has returned a malformed answer after making everyone wait. That is the difference between “reasoning about work” and doing work. The former gives you fluent paragraphs. The latter gives you latency, interface contracts, timeout handling, reward ambiguity, and a suspicious number of JSON parsing errors. Glamorous, naturally.

RLFactory is aimed at that less glamorous layer. The paper introduces a plug-and-play reinforcement learning post-training framework for LLMs that need to use tools across multiple turns.¹ Its core idea is not that it invents a new autonomous agent species. It is that agent RL should treat tool feedback as part of the training state, run tool calls efficiently, and let different tasks define different reward mechanisms without rewriting the whole training stack.

That sounds like infrastructure because it is. And in this case, that is the point.

The useful reading of RLFactory is mechanism-first: what changes in the training loop when the model’s next decision depends on external observations, not merely on its own previous tokens? Once that is clear, the evidence becomes easier to interpret. The Search-R1 experiment is interesting, but narrow. The architectural pattern is the stronger contribution.

RLFactory is a training framework, not a new agent brain

The easy misconception is to read RLFactory as another “agent” paper: give the model tools, add reinforcement learning, watch it become industrious. That is not quite what is happening.

RLFactory is a post-training framework for tool-use agents. It sits around a base model and a tool environment, organising how multi-turn trajectories are generated, how tool calls are parsed and invoked, how tool outputs are fed back, and how rewards are computed. The paper grounds its implementation in the Search-R1 tool invocation process, uses the veRL training framework, relies on Qwen-Agent for tool construction, and uses MCP-style configuration for unified tool registration.

In plain terms, it tries to make this pipeline less bespoke:

1. The model emits a response that may contain a tool call.
2. The system parses the tool intention and arguments.
3. The tool is invoked.
4. The result is returned to the model.
5. The model decides what to do next.
6. A reward mechanism judges the trajectory.

That loop is familiar to anyone building agentic systems. The paper’s claim is that reinforcement learning for this loop needs a different state representation, different loss handling, better tool invocation efficiency, and more flexible reward logic.

A chatbot-only RL setup can pretend the model’s world is text. A tool-using setup cannot. The tool result changes what the next good action is. If a search tool returns nothing, the model should reformulate. If a code interpreter raises an error, the model should debug. If a database query returns a mismatch, the model should adjust the query rather than confidently narrate victory. This is not philosophical. It is the state transition.

Observation tokens move tool feedback into the MDP

The central technical move in RLFactory is to reconstruct the Markov Decision Process state around both model-generated tokens and externally returned observations.

The paper describes the state at step $t$ as:

$$ s_t = \{(X_0, O_0), (X_1, O_1), \ldots, (X_t, O_t)\} = \{X_{\leq t}, O_{\leq t}\} $$

Here, $X_{\leq t}$ is the model’s text trajectory: prompts, reasoning, intermediate responses, and tool-call instructions. $O_{\leq t}$ is the observation trajectory: search results, execution logs, image outputs, database returns, or other feedback produced by the environment.

This matters because tool feedback is not merely context decoration. It is evidence the policy should condition on. If the model has already called a tool and received an observation, the next action should depend on that observation. In a pure text-only RL formulation, the state is essentially “prompt plus model tokens so far.” RLFactory says that is insufficient for tool-using agents, because the model’s own tokens do not contain the environment’s reply until the framework explicitly appends it.

The difference is small in notation and large in practice. Without observation tokens, the system is training on a partial view of the interaction. With them, the model sees the environment’s response as part of the trajectory before it chooses the next action.
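A minimal sketch of that state as a data structure may make the point concrete. The `Turn` and `State` names, the `<observation>` tags, and the choice to store the prompt separately are illustrative assumptions, not RLFactory’s actual classes or format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """One step t: model text X_t plus the environment observation O_t."""
    model_text: str                    # X_t: reasoning, tool call, or answer
    observation: Optional[str] = None  # O_t: tool feedback, None if no tool ran


@dataclass
class State:
    """s_t = {X_<=t, O_<=t}: everything the policy may condition on."""
    prompt: str
    turns: List[Turn] = field(default_factory=list)

    def render(self) -> str:
        """Flatten the state into the text the policy sees before acting."""
        parts = [self.prompt]
        for turn in self.turns:
            parts.append(turn.model_text)
            if turn.observation is not None:
                # Hypothetical tag format for appended observation tokens.
                parts.append(f"<observation>{turn.observation}</observation>")
        return "\n".join(parts)


state = State(prompt="Who wrote Dune?")
state.turns.append(
    Turn(
        model_text='<tool_call>search("Dune author")</tool_call>',
        observation="Dune is a 1965 novel by Frank Herbert.",
    )
)
print(state.render())
```

Dropping the `observation` field from `render` would reproduce the text-only formulation: the policy would see its own tool call but never the reply.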

That is the first piece of plumbing worth keeping.

Loss masking keeps the model responsible for its own decisions

Observation tokens introduce a second issue: the model did not generate them.

A search result, SQL execution output, compiler log, or image model response is environmental feedback. The training loop should let the model learn from it, but should not punish the model as if it authored that content. RLFactory therefore applies a loss mask to tool-returned observations.

This is less exciting than “agent planning,” which is precisely why it matters. If a framework casually mixes model tokens and environment tokens in the loss, the optimisation target becomes muddy. The model should be trained on the quality of its actions: when to call a tool, which tool to call, what parameters to use, when to stop, and how to use the returned evidence. It should not be trained to reproduce the literal output of a search API or code interpreter as though that output were its own policy action.

The operational implication is straightforward. Tool outputs should stay in the model’s context but be excluded from the loss: only tokens the model itself generated contribute to the optimisation signal. That keeps the training signal aligned with decision-making rather than accidental imitation of the environment.
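As a rough illustration of loss masking, the sketch below tags each token segment by its source and averages a per-token loss over model-authored tokens only. The segment format, the function names, and the loss values are made up for the example; a real trainer would apply the same mask to tensors inside its policy-gradient loss.

```python
from typing import List, Tuple

# (source, token ids); source is "model" or "tool". Hypothetical format.
Segment = Tuple[str, List[int]]


def build_loss_mask(segments: List[Segment]) -> Tuple[List[int], List[int]]:
    """Concatenate segments and mark model tokens with 1, tool tokens with 0."""
    tokens: List[int] = []
    mask: List[int] = []
    for source, ids in segments:
        tokens.extend(ids)
        mask.extend([1 if source == "model" else 0] * len(ids))
    return tokens, mask


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the loss over unmasked (model-generated) tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept


segments = [("model", [11, 12, 13]), ("tool", [90, 91]), ("model", [14])]
tokens, mask = build_loss_mask(segments)
print(mask)  # [1, 1, 1, 0, 0, 1]

# The large values on the tool tokens do not affect the result.
print(masked_mean([0.5, 0.5, 0.5, 9.0, 9.0, 0.5], mask))  # 0.5
```

The tool tokens still sit in `tokens`, so they remain part of the context the model conditions on; the mask only removes them from the quantity being optimised.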

This is one of those details that separates a workable tool-RL framework from a demo stitched together with optimism and duct tape.

Generate, parse, invoke, update: the loop is the product

RLFactory’s multi-turn flow is built around four steps: generate, parse, invoke, update.

Step	What happens	Why it matters
Generate	The model emits a response, which may include a tool call, intermediate reasoning, or a final answer.	The model’s output is treated as an action, not just text.
Parse	A ToolManager extracts tool intentions, names, and arguments from the response.	Tool use becomes structured enough to execute reliably.
Invoke	The relevant tool or tools are called, using asynchronous execution.	Slow tools do not need to block the whole rollout.
Update	Tool results are formatted and appended as observation tokens.	The next model action is conditioned on fresh environment feedback.

This loop is the paper’s real object of study. It is also where many agent systems quietly fail.

Parsing is a brittle boundary. Tool schemas change, models emit malformed arguments, and enterprise APIs rarely line up as neatly as a benchmark environment would like. RLFactory handles this through a ToolManager layer, with default parsing logic and the option to customise for private tool protocols or specialised workflows.
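To show why this boundary is brittle, here is a small parser sketch. It assumes tool calls arrive as JSON inside `<tool_call>` tags with `name` and `arguments` keys; that wire format and the `parse_tool_calls` function are assumptions for illustration, not the ToolManager’s real interface.

```python
import json
import re
from typing import Any, Dict, List

# Assumed wire format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(response: str) -> List[Dict[str, Any]]:
    """Return one record per tool-call block; malformed blocks become errors."""
    calls: List[Dict[str, Any]] = []
    for raw in TOOL_CALL_RE.findall(response):
        try:
            payload = json.loads(raw)
            if not isinstance(payload, dict):
                raise ValueError("tool call must be a JSON object")
            name = payload["name"]
            arguments = payload.get("arguments", {})
            if not isinstance(name, str) or not isinstance(arguments, dict):
                raise ValueError("bad type for name or arguments")
            calls.append({"ok": True, "name": name, "arguments": arguments})
        except (KeyError, ValueError) as exc:
            # json.JSONDecodeError is a ValueError subclass, so it lands here.
            calls.append({"ok": False, "error": f"{type(exc).__name__}: {exc}"})
    return calls


text = (
    'I should look this up. <tool_call>{"name": "search", '
    '"arguments": {"query": "Dune author"}}</tool_call> '
    '<tool_call>{"name": "search", "arguments": </tool_call>'
)
calls = parse_tool_calls(text)
print([call["ok"] for call in calls])  # [True, False]
```

Returning an error record rather than raising lets the framework feed the parse failure back to the model as an observation, which is one plausible way to make malformed calls something the policy can learn from.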

Invocation is another boundary. Tools differ in latency and reliability. Calling three tools serially because the framework is simple may be acceptable in a prototype. It becomes expensive during RL rollouts, where every interaction multiplies across samples, steps, and training iterations. RLFactory uses Python’s asyncio to support parallel tool invocation, so waiting for one slow tool does not stall every other call.
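The effect of parallel invocation can be sketched with the standard library alone. The tools here are simulated with `asyncio.sleep`, and the tool names, delays, and timeout are invented for the example; this is not RLFactory’s invocation code.

```python
import asyncio
import time
from typing import Any, Dict, List


async def fake_tool(name: str, delay: float) -> str:
    """Stand-in for a network-bound tool call."""
    await asyncio.sleep(delay)
    return f"{name} finished"


async def call_with_timeout(name: str, delay: float, timeout: float) -> Dict[str, Any]:
    """Run one tool call, turning a timeout into a result instead of a crash."""
    try:
        result = await asyncio.wait_for(fake_tool(name, delay), timeout=timeout)
        return {"tool": name, "ok": True, "result": result}
    except asyncio.TimeoutError:
        return {"tool": name, "ok": False, "result": "timeout"}


async def invoke_all(calls: Dict[str, float], timeout: float) -> List[Dict[str, Any]]:
    """Start every call at once; gather returns results in submission order."""
    tasks = [call_with_timeout(name, delay, timeout) for name, delay in calls.items()]
    return await asyncio.gather(*tasks)


start = time.perf_counter()
results = asyncio.run(
    invoke_all({"search": 0.05, "sql": 0.10, "slow_api": 1.0}, timeout=0.3)
)
elapsed = time.perf_counter() - start

print([r["ok"] for r in results])  # [True, True, False]
print(f"elapsed: {elapsed:.2f}s (serial sum of delays would be 1.15s)")
```

The batch finishes in roughly the timeout rather than the sum of the delays, and the slow tool degrades into a timeout record that can be handed back to the model as an observation.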

Update is the feedback boundary. The system formats results, appends them as observation tokens, and continues the interaction until the model produces a final answer or reaches the maximum number of tool invocations.
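Put together, the four steps fit in a short loop. The sketch below is a synchronous toy version under stated assumptions: the `rollout` function, the scripted policy, the `<tool_call>` and `<observation>` tags, and the `search` tool are all hypothetical stand-ins, not the framework’s API.

```python
import json
import re
from typing import Callable, Dict, List

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def rollout(
    policy: Callable[[str], str],
    tools: Dict[str, Callable[..., str]],
    prompt: str,
    max_tool_calls: int = 3,
) -> List[Dict[str, str]]:
    """Generate, parse, invoke, update until a final answer or the call budget."""
    trajectory = [{"role": "prompt", "text": prompt}]
    context = prompt
    calls_used = 0
    while True:
        response = policy(context)                       # generate
        trajectory.append({"role": "model", "text": response})
        context += "\n" + response
        match = TOOL_CALL_RE.search(response)            # parse
        if match is None or calls_used >= max_tool_calls:
            return trajectory                            # final answer or budget hit
        try:
            call = json.loads(match.group(1))
            observation = tools[call["name"]](**call["arguments"])  # invoke
        except Exception as exc:                         # surface failures as feedback
            observation = f"tool error: {exc}"
        calls_used += 1
        obs_text = f"<observation>{observation}</observation>"
        trajectory.append({"role": "tool", "text": obs_text})       # update
        context += "\n" + obs_text


def scripted_policy(context: str) -> str:
    """Toy policy: search once, then answer from the observation."""
    if "<observation>" not in context:
        return '<tool_call>{"name": "search", "arguments": {"query": "Dune author"}}</tool_call>'
    return "Final answer: Frank Herbert"


def search(query: str) -> str:
    return "Dune is a 1965 novel by Frank Herbert."


trajectory = rollout(scripted_policy, {"search": search}, "Who wrote Dune?")
print([step["role"] for step in trajectory])  # ['prompt', 'model', 'tool', 'model']
```

The `role` field on each step is what a loss mask would later key on: `model` steps are trained, `tool` steps are only conditioned on.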

There is no mystery here. That is a compliment. The paper’s best contribution is turning a messy agent interaction into a modular training routine.

The tool abstraction is deliberately broad

RLFactory does not restrict “tools” to simple API calls. The paper groups them into three categories:

Tool category	Examples in the paper	Operational consequence
Program tools	Search interfaces, code interpreters, calculators	Extend the model with retrieval, computation, and execution.
Model tools	External language models, image generators such as Stable Diffusion	Let one model call another specialised model as part of the workflow.
Agent tools	Multi-step systems such as a literature research agent combining search, summarisation, and citation parsing	Treat a whole workflow as an invokable component.

The broad abstraction is useful because enterprise agent systems rarely use one clean tool type. A business intelligence assistant may call a SQL database, a chart renderer, a policy document retriever, and a separate model for report rewriting. A software engineering agent may call a codebase search tool, a test runner, a compiler, and a patch verifier. A procurement workflow may call inventory, supplier, compliance, and approval systems.

In RLFactory, tools are registered through an MCP-style configuration file containing metadata such as names, parameter formats, requirements, defaults, and invocation endpoints. That lowers the cost of adding or swapping tools, at least at the framework level.
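To make the idea of registration metadata tangible, here is a sketch of a registry entry and a helper that checks arguments against it. The field names (`endpoint`, `parameters`, `required`, `default`), the placeholder URL, and `prepare_arguments` are hypothetical; the actual configuration schema should be taken from the RLFactory repository, not from this example.

```python
from typing import Any, Dict

# Hypothetical registry shape: one entry per tool, with parameter metadata.
TOOL_REGISTRY: Dict[str, Dict[str, Any]] = {
    "search": {
        "endpoint": "http://localhost:8000/search",  # placeholder URL
        "parameters": {
            "query": {"type": str, "required": True},
            "top_k": {"type": int, "required": False, "default": 3},
        },
    },
}


def prepare_arguments(tool_name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Validate model-supplied arguments and fill in registered defaults."""
    if tool_name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {tool_name}")
    spec = TOOL_REGISTRY[tool_name]["parameters"]
    unknown = set(args) - set(spec)
    if unknown:
        raise ValueError(f"unexpected parameters: {sorted(unknown)}")
    prepared: Dict[str, Any] = {}
    for name, rule in spec.items():
        if name in args:
            if not isinstance(args[name], rule["type"]):
                raise TypeError(f"{name} must be {rule['type'].__name__}")
            prepared[name] = args[name]
        elif rule["required"]:
            raise ValueError(f"missing required parameter: {name}")
        else:
            prepared[name] = rule["default"]
    return prepared


print(prepare_arguments("search", {"query": "Dune author"}))
# {'query': 'Dune author', 'top_k': 3}
```

Swapping a tool then means editing the registry entry rather than the rollout code, which is the cost reduction the paragraph above describes.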

The caveat is obvious but worth stating: a tool registry does not eliminate the hard work of making tools reliable. It standardises how the framework sees them. It does not magically make every internal API well documented, stable, safe, or pleasant. Sadly, YAML cannot save civilisation.

Reward modularity is necessary because tasks disagree about truth

Tool-use tasks do not share one natural reward function. That is why RLFactory supports three reward strategies: rule-based rewards, model-evaluation rewards, and tool-verification rewards.

Reward strategy	Best fit	Mechanism	Business reading	Boundary
Rule-based reward	Tasks with explicit correctness criteria, such as math or NL2SQL	Weighted scores over format validity, task completion, and efficiency	Cheap, deterministic evaluation for structured workflows	Brittle when outputs are open-ended
Model-as-judge reward	Creative planning, research, knowledge graph search, or other tasks difficult to encode as rules	A stronger model scores the full trajectory against a prompt-defined criterion	Useful when human-like judgment is needed at scale	Judge bias, prompt sensitivity, and cost become part of the system
Tool-verification reward	Code generation, SQL, executable tasks	Run the generated action through a verifier tool and compare with expected output	Strong fit for enterprise workflows where correctness can be executed or checked	Requires safe verifier infrastructure and task-specific expected results

The modularity is more important than any single reward type. In real deployments, teams often need hybrids. A SQL agent may receive a rule reward for valid query format, a tool-verification reward for execution correctness, and perhaps a judge reward for whether the final explanation is readable. A code agent may need tests as verifiers, lint rules as structured rewards, and a model judge for documentation quality.
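A hybrid of this kind can be written as a weighted combination of component scores. The sketch below is only an illustration: the weights inside `rule_reward`, the efficiency formula, and the component names are assumptions chosen for the example, not values from the paper.

```python
from typing import Dict


def rule_reward(format_ok: bool, task_done: bool, tool_calls: int, max_calls: int) -> float:
    """Weighted score over format validity, task completion, and efficiency."""
    # Fewer tool calls means higher efficiency; clipped to the call budget.
    efficiency = 1.0 - min(tool_calls, max_calls) / max_calls
    return 0.2 * float(format_ok) + 0.6 * float(task_done) + 0.2 * efficiency


def combine(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Normalised weighted average of whichever reward components are enabled."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one positive weight is required")
    return sum(weights[name] * components[name] for name in weights) / total


components = {
    "rule": rule_reward(format_ok=True, task_done=True, tool_calls=2, max_calls=4),
    "verifier": 1.0,  # e.g. the generated query returned the expected rows
    "judge": 0.5,     # e.g. a judge model rated the explanation as mediocre
}
weights = {"rule": 1.0, "verifier": 2.0, "judge": 1.0}

print(round(components["rule"], 3))             # 0.9
print(round(combine(components, weights), 3))   # 0.85
```

Setting a weight to zero disables a component without touching the others, which is the practical meaning of modularity here.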

RLFactory’s Env interface is designed so reward calculation and tool verification can be implemented within the environment, while the underlying training machinery remains reusable. This is the “factory” part of the name: the framework tries to separate task-specific evaluation from the generic RL loop.
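The separation can be pictured as an abstract environment class that the training side calls without knowing the task. The `ToolEnv` base class, its `score` method, and the substring-matching subclass below are hypothetical names for illustration; they are not the method names of RLFactory’s Env.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

Trajectory = List[Dict[str, str]]


class ToolEnv(ABC):
    """Task-specific evaluation lives here; the trainer only calls score()."""

    @abstractmethod
    def score(self, trajectory: Trajectory, reference: str) -> float:
        """Return a scalar reward for one finished trajectory."""


class SubstringMatchSearchEnv(ToolEnv):
    """Toy search task: reward 1.0 if the final step contains the reference."""

    def score(self, trajectory: Trajectory, reference: str) -> float:
        final = trajectory[-1]["text"] if trajectory else ""
        return 1.0 if reference.lower() in final.lower() else 0.0


def score_batch(env: ToolEnv, batch: List[Tuple[Trajectory, str]]) -> List[float]:
    """Stand-in for the generic training side: it never inspects task details."""
    return [env.score(trajectory, reference) for trajectory, reference in batch]


batch = [
    ([{"role": "model", "text": "Final answer: Frank Herbert"}], "frank herbert"),
    ([{"role": "model", "text": "Final answer: Isaac Asimov"}], "frank herbert"),
]
print(score_batch(SubstringMatchSearchEnv(), batch))  # [1.0, 0.0]
```

A new task would add another `ToolEnv` subclass, while `score_batch` and everything behind it stay unchanged.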

That separation is commercially meaningful. Many companies do not suffer from a shortage of agent demos. They suffer from one-off agent pipelines where every new tool, verifier, or scoring rule becomes a fresh integration project. RLFactory is valuable insofar as it reduces that integration tax.

The evidence is promising, but it is one Search-R1 reproduction

The paper’s main empirical evidence is an experiment on Search-R1 using the Natural Questions (NQ) dataset. It compares Search-R1 variants trained with GRPO-style reinforcement learning on the same hardware class, listed as A100*8 (eight A100 GPUs).

Method	Test Score (NQ)	Convergence Time	Resources
Search-R1-Qwen2.5-3B-Instruct-GRPO	0.421	23h	A100*8
Search-R1-Qwen2.5-7B-Instruct-GRPO	0.473	36h	A100*8
Search-RL-Qwen3-4B-Instruct-GRPO	0.486	5h	A100*8

The headline result is that the Qwen3-4B RLFactory run reaches 0.486 on NQ in 5 hours, compared with 0.421 in 23 hours for Qwen2.5-3B and 0.473 in 36 hours for Qwen2.5-7B. The abstract also states a 6.8× training throughput improvement.
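The ratios implied by the table are easy to work out. The short script below uses only the three reported rows; the dictionary keys are shorthand labels, and the comparison inherits the table’s caveat that the runs use different base models.

```python
# Scores and convergence times as reported in the table above.
runs = {
    "Qwen2.5-3B": {"score": 0.421, "hours": 23},
    "Qwen2.5-7B": {"score": 0.473, "hours": 36},
    "Qwen3-4B (RLFactory)": {"score": 0.486, "hours": 5},
}

target = runs["Qwen3-4B (RLFactory)"]
summary = {}
for name, run in runs.items():
    if run is target:
        continue
    time_ratio = run["hours"] / target["hours"]   # how much longer the baseline took
    score_gap = target["score"] - run["score"]    # absolute NQ score difference
    summary[name] = (time_ratio, score_gap)
    print(f"{name}: {time_ratio:.1f}x longer to converge, {score_gap:+.3f} NQ score gap")

# Qwen2.5-3B: 4.6x longer to converge, +0.065 NQ score gap
# Qwen2.5-7B: 7.2x longer to converge, +0.013 NQ score gap
```

Neither wall-clock ratio equals the 6.8× throughput figure from the abstract, so that number presumably measures something other than these convergence times; the material summarised here does not say how it was computed.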

There is a small but important textual inconsistency. The table and abstract list the Qwen2.5-7B score as 0.473, while one sentence in the experiment section says 0.429. Since the table and abstract agree on 0.473, that is the value to treat as primary, with the prose mismatch flagged rather than silently cleaned up.

The figure showing mean reward score trends is best read as supporting evidence for convergence behaviour, not as a second independent benchmark. It visualises reward trajectories with 95% confidence intervals and shows the Qwen3-4B run improving competitively. Its likely purpose is to illustrate training stability and convergence, consistent with the table. It does not isolate which part of RLFactory caused the improvement.

That distinction matters. The comparison changes both framework and base model family: Qwen3-4B versus Qwen2.5 variants. The result is useful, but it should not be read as a clean ablation proving that observation tokens, async invocation, loss masking, or reward modularity individually caused the gain. The paper does not provide such ablations.

A disciplined interpretation looks like this:

Evidence item	Likely purpose	What it supports	What it does not prove
Table 1 Search-R1/NQ comparison	Main evidence	RLFactory with Qwen3-4B achieves a higher reported NQ score and shorter convergence time than listed Search-R1 baselines	Universal superiority across tasks, tools, models, or enterprise workflows
Mean reward score trend figure	Supporting evidence / convergence illustration	The Qwen3-4B run shows stable improvement and competitive reward behaviour	Causal attribution to a specific framework component
Architecture description	Implementation detail	The framework has modular layers for tool use, training, reward calculation, and WebUI	That every module is robust under messy production tool failures
Reward strategy taxonomy	Implementation detail / design rationale	The framework can support multiple reward types	That open-ended model-judged rewards are solved

This is not a criticism of the paper so much as a reading instruction. The result is encouraging. It is not a blank cheque.

The business value is lower integration cost, not instant autonomy

For business users, the most useful interpretation is not “smaller model beats bigger model.” That is tempting, tidy, and probably too convenient.

The better interpretation is that RLFactory attacks the cost structure of training tool-using agents. Its value lies in making repeated components reusable: tool registration, parsing, invocation, observation handling, reward computation, and verifier integration.

That has several practical consequences.

First, RLFactory points toward standardised tool interfaces. If every internal tool is exposed through a consistent metadata layer, agent training becomes less dependent on bespoke wrappers. That matters for companies with fragmented systems: databases, search indices, ticketing systems, analytics tools, code repositories, and approval workflows.

Second, asynchronous invocation matters for economics. Tool-use RL is wall-clock sensitive because rollouts require interaction with external systems. If tool calls are serial and slow, training becomes painfully expensive before the model has learned anything useful. Parallel invocation and caching are not nice-to-have engineering details. They are throughput multipliers.

Third, modular rewards let teams match evaluation to the workflow. A customer support agent should not be judged like a SQL agent. A code repair agent should not be judged like a travel planner. RLFactory’s structure encourages teams to ask: which parts of this task are rule-checkable, which parts are executable, and which parts require a judge?

Fourth, loss-masked observation handling helps separate policy learning from environment imitation. In regulated or operational settings, this is not just technical neatness. It supports clearer accountability: the model is trained on its decisions, while tool outputs remain external evidence.

The business inference is therefore specific: RLFactory is relevant for teams that already know which tools they want agents to use and now need a more systematic way to train, evaluate, and iterate those agents. It is less relevant for teams still looking for a vague “AI agent strategy,” a phrase that should usually be returned to sender with a polite note attached.

Where this applies first

The strongest near-term use cases are workflows where tool outputs materially determine success and where rewards can be partly automated.

Search and research agents are the most direct fit, because the paper’s experiment is built around Search-R1. The agent must decide when to search, how to reformulate queries, and when retrieved evidence is sufficient.

NL2SQL and analytics assistants fit the reward design well. Format validity can be rule-scored, query execution can be tool-verified, and final explanations can be judged separately.
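For the execution-checked part, a tool-verification reward can be sketched with an in-memory SQLite database. The `sql_reward` function, the toy schema, and the 0.2 partial credit for a query that runs but returns the wrong rows are assumptions for illustration; a production verifier would also need read-only access and resource limits.

```python
import sqlite3


def sql_reward(candidate_sql: str, reference_sql: str, setup_sql: str) -> float:
    """Execute both queries on a throwaway database and compare result sets."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        expected = conn.execute(reference_sql).fetchall()
        try:
            actual = conn.execute(candidate_sql).fetchall()
        except (sqlite3.Error, sqlite3.Warning):
            return 0.0  # the query does not run (or is not a single statement)
        # Order-insensitive comparison; key=repr avoids comparing mixed types.
        if sorted(actual, key=repr) == sorted(expected, key=repr):
            return 1.0
        return 0.2      # assumed partial credit: valid SQL, wrong answer
    finally:
        conn.close()


SETUP = """
CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'EU', 10.0), (2, 'EU', 5.0), (3, 'US', 7.0);
"""
REFERENCE = "SELECT region, SUM(amount) FROM orders GROUP BY region"

good = "SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY total DESC"
wrong = "SELECT region, COUNT(*) FROM orders GROUP BY region"
broken = "SELEC region FROM orders"

print([sql_reward(q, REFERENCE, SETUP) for q in (good, wrong, broken)])
# [1.0, 0.2, 0.0]
```

Because each call builds a fresh database, a candidate query cannot leak state into the next evaluation, at the cost of repeating the setup work per sample.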

Code agents also fit, at least in principle. Test execution, static checks, and runtime logs are natural observation tokens; verifiers can provide strong reward signals. The risk is sandboxing and safe execution, not the conceptual framework.

Workflow automation agents may benefit where enterprise tools are already exposed through stable APIs. The framework’s broad tool abstraction can cover programs, models, and agent-like subflows. The weak point is heterogeneity: internal systems often have inconsistent permissions, unstable schemas, and undocumented edge cases. RLFactory gives a place to plug those systems in; it does not make them clean.

Planning agents, such as travel or procurement planners, are more complicated. They involve multi-step tool use, but rewards are harder to define. A model judge may help, but judge-based reward is itself an engineering and governance problem. If the reward is vague, the agent will learn the vagueness with admirable dedication.

What remains uncertain

The paper is candid that future work includes expanding compatibility with more environments and complex task scenarios, improving reward mechanisms for highly open-ended tasks, and supporting a wider range of LLMs. Those are not minor footnotes; they define the boundary of current usefulness.

The first uncertainty is generalisation. The reported experiment is on Search-R1 and NQ. It does not establish performance across code, SQL, planning, multimodal tasks, or enterprise toolchains.

The second uncertainty is component attribution. We do not yet know how much of the gain comes from Qwen3 as the base model, how much from RLFactory’s rollout efficiency, how much from observation handling, and how much from the surrounding implementation choices. A stronger evidence package would include ablations: async versus sync, observation loss masking versus no masking, reward strategies separately and in combination, and tests across several tool environments.
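The size of such an ablation study can be estimated by enumerating the grid. The factor names below are shorthand invented for this sketch, and it assumes every non-empty combination of the three reward strategies counts as a separate run.

```python
from itertools import combinations, product

REWARDS = ("rule", "judge", "verifier")

# Every non-empty subset of reward strategies: singles, pairs, and all three.
reward_mixes = [
    combo
    for size in range(1, len(REWARDS) + 1)
    for combo in combinations(REWARDS, size)
]

# Cross the reward mixes with the two binary factors named above.
grid = [
    {"async_tools": use_async, "mask_observations": use_mask, "rewards": mix}
    for use_async, use_mask, mix in product((True, False), (True, False), reward_mixes)
]

print(len(reward_mixes), len(grid))  # 7 28
print(grid[0])
# {'async_tools': True, 'mask_observations': True, 'rewards': ('rule',)}
```

That is 28 training runs per tool environment before repeating seeds, which helps explain why a first paper might not include the full set.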

The third uncertainty is model-as-judge reward quality. RLFactory supports it, but support is not the same as validation. Judge rewards can drift, encode hidden preferences, reward verbosity, or miss factual errors. For open-ended tasks, this remains a live problem.

The fourth uncertainty is production robustness. The paper discusses heterogeneous tools and interface issues, but it does not deeply test failure modes such as malformed tool outputs, permission errors, endpoint latency spikes, conflicting observations, partial failures, or adversarial tool responses. Those are exactly the situations enterprise agents meet shortly after the demo applause stops.

None of these issues invalidate the framework. They decide how confidently it should be adopted.

A sensible enterprise pilot would test the plumbing, not the slogan

A practical pilot should not begin with “let’s build a general agent.” That sentence has consumed enough budgets already.

A better pilot would choose one narrow workflow with clear tool dependence and measurable success. For example: a SQL analytics assistant, a code-fix assistant with unit tests, or a search assistant over a controlled knowledge base.

The pilot should instrument four things:

Pilot question	Metric or artefact
Can tool calls be parsed and executed reliably?	Tool-call success rate, schema error rate, retry rate
Does async invocation reduce rollout time?	Calls per minute, wall-clock convergence time, timeout impact
Do observation tokens improve downstream decisions?	Task success after tool feedback, correction rate after failed calls
Which reward mix is actually useful?	Rule-only vs verifier-only vs judge-assisted comparison
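The first row of that table can be computed from a plain log of tool calls. The log schema (`status` and `attempt` fields) and the status labels are assumptions for this sketch; a real pilot would adapt them to whatever its tool layer records.

```python
from collections import Counter
from typing import Any, Dict, List


def tool_call_metrics(log: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compute success, schema-error, and retry rates over logged tool calls."""
    total = len(log)
    if total == 0:
        return {"success_rate": 0.0, "schema_error_rate": 0.0, "retry_rate": 0.0}
    status = Counter(entry["status"] for entry in log)
    # A call counts as a retry when its attempt number is above 1.
    retries = sum(1 for entry in log if entry.get("attempt", 1) > 1)
    return {
        "success_rate": status["ok"] / total,
        "schema_error_rate": status["schema_error"] / total,
        "retry_rate": retries / total,
    }


log = [
    {"tool": "search", "status": "ok"},
    {"tool": "sql", "status": "ok"},
    {"tool": "sql", "status": "schema_error"},
    {"tool": "search", "status": "timeout"},
    {"tool": "search", "status": "ok", "attempt": 2},
]
print(tool_call_metrics(log))
# {'success_rate': 0.6, 'schema_error_rate': 0.2, 'retry_rate': 0.2}
```

Tracking these rates per tool and per training step would show whether the policy is learning to call tools more cleanly or merely calling them more often.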

This is the right level of ambition. RLFactory’s promise is not that one framework turns a model into a tireless digital employee. It is that tool-use RL can become less artisanal. That is a more boring claim, and therefore more likely to survive contact with procurement.

The Cognaptus take

RLFactory is best understood as operations engineering for agent reinforcement learning. Its strongest ideas are not flashy: represent tool feedback explicitly, mask environment outputs from loss, run tool calls asynchronously, decouple tool environments from training, and support multiple reward mechanisms.

The Search-R1 result gives the paper a useful proof point: Qwen3-4B with RLFactory reports 0.486 on NQ in 5 hours on A100*8, compared with 0.421 and 0.473 baselines taking 23 and 36 hours respectively. That is worth attention. It is not yet broad evidence that the framework dominates across tasks or that smaller models generally outperform larger ones.

For builders, the lesson is sharper. If agents are supposed to do real work, the training loop must be organised around the places where work actually happens: tools, observations, verifiers, and latency. RLFactory provides a credible template for that organisation.

Not magic. Better plumbing. In agent systems, that may be the more valuable thing.

Cognaptus: Automate the Present, Incubate the Future.

¹ RLFactory Team, “RLFactory: A Plug-and-Play Reinforcement Learning Post-Training Framework for LLM Multi-Turn Tool-Use,” arXiv:2509.06980, 2025. https://arxiv.org/abs/2509.06980
