Large language models are finally learning to use tools instead of merely talking about them. RLFactory proposes a clean way to post‑train LLMs for multi‑turn tool use by rebuilding the reinforcement learning loop around tool feedback, not just text. The result: faster training, greater stability, and a framework teams can actually adopt.

Why this matters (and where prior setups struggle)

Most RL setups for LLMs treat the environment as pure text: the model thinks, emits tokens, and gets a scalar reward. But real tasks—searching, querying databases, compiling code, booking travel—depend on external tools that return structured results, fail intermittently, and vary in latency and format. Three hard problems emerge:

  • State blindness: vanilla RL ignores rich tool feedback.
  • Latency drag: slow tools stall training throughput.
  • Reward mismatch: closed-form rules work for SQL; they fail for open-ended tasks like research planning.

RLFactory tackles all three.

The core ideas

1) Reconstruct the RL state with Observation Tokens

Instead of modeling state as “prompt + model tokens so far,” RLFactory adds observation tokens sourced from tools (search results, execution logs, images). These are appended to the trajectory but masked from loss, so the model learns from them without being penalized for them. Practically, this closes the loop: model → tool → environment → observation → next decision.
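A minimal sketch of what loss masking over observation tokens could look like in practice. The helper, segment labels, and token IDs below are illustrative assumptions, not RLFactory's actual API; the point is simply that tool output enters the context but contributes no gradient.

```python
from typing import List, Tuple

def build_trajectory_with_mask(
    segments: List[Tuple[str, List[int]]],
) -> Tuple[List[int], List[int]]:
    """Concatenate model-generated and tool-observation token segments.

    Each segment is ("model", token_ids) or ("observation", token_ids).
    Observation tokens stay in the context so the policy can condition on
    them, but receive loss_mask = 0 so they are never trained on.
    """
    token_ids, loss_mask = [], []
    for kind, ids in segments:
        token_ids.extend(ids)
        # 1 = compute loss on this token, 0 = context only (tool output)
        loss_mask.extend([1] * len(ids) if kind == "model" else [0] * len(ids))
    return token_ids, loss_mask

# Example: model emits a tool call, the tool returns an observation,
# and the model continues with that observation in context.
tokens, mask = build_trajectory_with_mask([
    ("model", [101, 102, 103]),        # e.g. a <tool_call>...</tool_call> span
    ("observation", [201, 202, 203]),  # tool result: in context, masked from loss
    ("model", [104, 105]),             # final answer: trained on
])
```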

2) Async multi-tool invocation (no more waiting rooms)

Tool calls run concurrently via asyncio. If one API hangs, others still return; if a task needs three tools (weather, maps, inventory), all can proceed in parallel. This materially raises rollout throughput and reduces wall-clock training time.
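A sketch of what concurrent invocation with asyncio can look like. The helper names, timeout policy, and fake API are assumptions for illustration, not RLFactory's interface; the pattern is simply "fire everything, bound each call with a timeout, gather results."

```python
import asyncio

async def call_tool(name: str, coro, timeout: float = 10.0):
    """Run one tool call with a timeout so a hung endpoint can't stall the batch."""
    try:
        return name, await asyncio.wait_for(coro, timeout)
    except asyncio.TimeoutError:
        return name, {"error": "timeout"}

async def invoke_all(calls: dict):
    """Fire all tool calls concurrently and collect their results."""
    tasks = [call_tool(name, coro) for name, coro in calls.items()]
    return dict(await asyncio.gather(*tasks))

# Example: weather, maps, and inventory queries proceed in parallel.
async def fake_api(payload, delay):
    await asyncio.sleep(delay)
    return {"result": payload}

results = asyncio.run(invoke_all({
    "weather": fake_api("sunny", 0.1),
    "maps": fake_api("12 min drive", 0.2),
    "inventory": fake_api("3 in stock", 0.05),
}))
```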

3) Decoupled architecture you can actually plug into

Tool registry and training are split. Tools are defined via MCP-style metadata (names, params, endpoints), managed by a ToolManager (default: Qwen-compatible). You can add program tools (search, calculators, code), model tools (e.g., SD, GPT), or agent tools (multi-step workflows). Minimal code edits; realistic for enterprise stacks with existing APIs.
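To make the decoupling concrete, here is a hedged sketch of MCP-style tool metadata and a registry kept separate from training code. The class names, fields, and endpoint are assumptions for illustration, not RLFactory's schema.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

@dataclass
class ToolSpec:
    """MCP-style metadata: what the tool is called, what it takes, where it lives."""
    name: str
    description: str
    parameters: Dict[str, Any]                     # JSON-schema-like parameter spec
    endpoint: str = ""                             # HTTP endpoint or local handle
    handler: Optional[Callable[..., Any]] = None   # local implementation, if any

class ToolRegistry:
    """Holds tool definitions independently of the training loop."""
    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def get(self, name: str) -> ToolSpec:
        return self._tools[name]

# Registering a program tool; model tools and agent tools would follow the same shape.
registry = ToolRegistry()
registry.register(ToolSpec(
    name="web_search",
    description="Search the web and return top-k snippets.",
    parameters={"query": {"type": "string"}, "k": {"type": "integer", "default": 5}},
    endpoint="https://search.example.internal/api",   # hypothetical endpoint
))
```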

4) Modular reward strategies—mix and match

| Reward type | Best for | How it works | Pros | Gotchas |
|---|---|---|---|---|
| Rule-based | Verifiable tasks (NL2SQL, math) | Score by format validity, correctness, efficiency | Deterministic, cheap | Narrow; brittle for open tasks |
| Model-as-judge | Open-ended outputs (research, plans) | A strong model (e.g., a VLM) scores trajectories | Flexible, general | Costly; prompt design matters |
| Tool-verified | Code/queries needing execution | Run candidate outputs and compare against expected results | Grounded, task-faithful | Requires sandbox / verifier infra |

Crucially, RLFactory lets you compose these: e.g., format rules + execution checks, or judge scores to shape style while verifiers guard correctness.
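A minimal sketch of what composing these signals could look like. The weights, function signature, and rule check below are hypothetical choices for illustration, not RLFactory's reward API.

```python
def composed_reward(sql_text: str, execution_result, expected_result, judge_score: float) -> float:
    """Blend rule-based, tool-verified, and judge signals into one scalar.

    format_ok:   cheap rule check (does the candidate even look like SQL?)
    exec_match:  tool-verified check (did running it produce the expected rows?)
    judge_score: model-as-judge rating in [0, 1] for clarity and safety
    The 0.2 / 0.6 / 0.2 weights are hypothetical; tune them per task.
    """
    format_ok = 1.0 if sql_text.strip().lower().startswith("select") else 0.0
    exec_match = 1.0 if execution_result == expected_result else 0.0
    return 0.2 * format_ok + 0.6 * exec_match + 0.2 * judge_score

# Example: well-formed SQL, correct execution, and a judge who likes the explanation.
reward = composed_reward("SELECT count(*) FROM orders", [(42,)], [(42,)], 0.9)
```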

The training loop that scales

The interaction follows Generate → Parse → Invoke → Update:

  1. Generate a candidate step (may contain tool calls).
  2. Parse tool intents/params (ToolManager).
  3. Invoke tools asynchronously; collect observations.
  4. Update context with observation tokens; continue or finish.

With loss masks over observations, the learner optimizes policy on its decisions, not the environment’s echoes.
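Put together, one trajectory might look like the sketch below. The generate, parse_tool_calls, and invoke_all arguments are assumed interfaces standing in for the model, the ToolManager, and the async invoker described above; the returned observation segments are the ones the trainer would loss-mask.

```python
async def rollout(prompt, generate, parse_tool_calls, invoke_all, max_turns=8):
    """One multi-turn trajectory: Generate -> Parse -> Invoke -> Update."""
    context = prompt
    segments = []                                  # (kind, text); observations get loss-masked later
    for _ in range(max_turns):
        step = generate(context)                   # 1. Generate a candidate step (may contain tool calls)
        segments.append(("model", step))
        calls = parse_tool_calls(step)             # 2. Parse tool intents/params (ToolManager's job)
        if not calls:                              #    no tool call -> trajectory finished
            break
        observations = await invoke_all(calls)     # 3. Invoke tools asynchronously; collect observations
        obs_text = str(observations)
        segments.append(("observation", obs_text))
        context = context + step + obs_text        # 4. Update context with observations; continue
    return segments
```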

Evidence it works

On Search‑R1 with Qwen3‑4B, RLFactory reports a test score of 0.486 on NQ—beating a larger Qwen2.5‑7B‑Instruct‑GRPO (0.473) trained with similar methods—while boosting training throughput by 6.8×. The key takeaway isn’t just a win on one metric; it’s that the architecture—observation‑aware state, async tools, modular rewards—translates into sample‑ and time‑efficiency.

What this means for builders

If you’re standing up agentic systems at a company:

  • Adopt observation‑aware RL for any workflow where tools dominate outcomes (search, BI, codegen, RPA, booking). Pure-CoT RL leaves performance on the table.
  • Define tools once, reuse everywhere. MCP-style registries make adding or swapping tools operationally sane.
  • Start with hybrid rewards. For BI assistants: rules for SQL validity + tool-verified execution; add judge scores for clarity and safety.
  • Prioritize async and caching. You won’t hit practical throughput otherwise; parallelism + result caches are the hidden multipliers.
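On the caching point, a minimal sketch of an in-memory result cache wrapped around an async tool. Keying and eviction are assumptions; a production setup would add TTLs, size limits, and persistence.

```python
import asyncio
import json

class CachedTool:
    """Memoize tool results so repeated rollouts don't re-pay the latency."""
    def __init__(self, tool_fn):
        self._tool_fn = tool_fn
        self._cache = {}
        self._lock = asyncio.Lock()

    async def __call__(self, **params):
        key = json.dumps(params, sort_keys=True)   # stable key over the tool's parameters
        async with self._lock:
            if key in self._cache:
                return self._cache[key]
        result = await self._tool_fn(**params)     # cache miss: actually call the tool
        async with self._lock:
            self._cache[key] = result
        return result
```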

A pragmatic rollout plan (2 sprints)

  • Sprint 1 (Infra): instrument a small set of critical tools (search, SQL, code-runner) with MCP metadata; stand up async invokers; ship a verifier for SQL and a simple judge prompt.
  • Sprint 2 (Training): fine-tune a compact model (3–7B) on your tasks with loss-masked observations; run ablations: rules vs tool-verify vs judge; measure tokens/sec, calls/sec, task success.
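For the "simple judge prompt" in Sprint 1, one hedged starting point is sketched below. The rubric, scale, and output format are assumptions to adapt to your own tasks, not a prescribed template.

```python
# A hypothetical judge prompt for a BI assistant; fill placeholders with str.format().
JUDGE_PROMPT = """You are grading an assistant's answer to a business-intelligence question.

Question: {question}
Tool observations: {observations}
Answer: {answer}

Score the answer from 0 to 10 on each criterion, then give a total:
1. Faithfulness to the tool observations (no invented numbers).
2. Correct, safe SQL or analysis steps.
3. Clarity for a non-technical reader.

Reply as JSON: {{"faithfulness": 0, "correctness": 0, "clarity": 0, "total": 0}}
"""

def judge_reward(scores: dict) -> float:
    """Normalize the judge's total (0-30) into a [0, 1] reward component."""
    return scores["total"] / 30.0
```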

Where we still need answers

  • Judge drift vs ground truth: how much can model-judged rewards deviate before behaviors go sideways?
  • Long-horizon credit assignment: multi-turn tasks with branching tools still need better return shaping.
  • Robustness under tool heterogeneity: format misalignments and flaky endpoints remain a real-world tax; how resilient is the parser under noise?

Cognaptus verdict

RLFactory feels less like a paper trick and more like operations engineering for agent RL. By moving state, calls, and rewards to where real work happens—the tools—it sets a practical template for enterprise agents. If your agents must do things (not just say things), this is a framework worth piloting.


Cognaptus: Automate the Present, Incubate the Future