Compile Once, Train Later: Offline RL Moves Code-Model Verification Upstream

Code assistants have a small accounting problem. Not the glamorous kind involving model capability, agentic workflows, or yet another dashboard with a glowing neural blob. The ordinary kind: every time a model proposes code during reinforcement learning, someone—or something—has to run it, test it, score it, and feed that score back into training.

For code generation, that “something” is often a verifier: compiler checks, unit tests, runtime limits, and functional correctness signals. This is one reason reinforcement learning with verifiable rewards is attractive. It gives the training loop a cleaner signal than human preference prose. But it also makes training operationally heavy. The model must sample; the sampled code must be executed or checked; the reward must be computed; then the model updates. Repeat until budget fatigue arrives, as it always does.

The paper “Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning” asks a practical question: what if the expensive sampling-and-verification loop is not performed during training at all? [1]

Its answer is not “skip verification.” That would be convenient, and also professionally unserious. The paper’s actual move is more interesting: shift verification upstream into a dataset that already contains many code solutions and correctness metadata, then reuse policy-gradient-style training objectives in a fully offline setup.

That mechanism matters more than the headline result. Offline RL here is not merely cheaper supervised fine-tuning with a more fashionable name. It is an attempt to preserve the comparative signal between good and bad solutions while avoiding the repeated online generation and execution cost that makes code RL expensive.

The mechanism: replace live sampling with verified solution groups

Standard online RL for code models has a familiar shape. A learner model generates candidate code. A verifier evaluates whether the code passes tests or fails in some identifiable way. The resulting reward updates the model. The same model is therefore both the source of new samples and the object being trained, although in practice the gap between inference engines and training systems can introduce mild off-policy behavior.

The paper turns that loop sideways.

Instead of asking the current model to generate fresh candidates, the authors train on pre-collected solutions from CodeNet. CodeNet contains roughly 4,000 programming problems and 14 million submitted solutions across more than 50 programming languages, with metadata including functional correctness, syntactic correctness, runtime, memory usage, and code size. The study restricts itself to Python solutions and uses functional and syntactic correctness as feedback.

That choice creates a different training object. The model is no longer learning only from correct demonstrations, as in conventional SFT. It is learning from grouped solution sets where some answers are better than others. The reward is not a vague preference label; it is tied to code status:

| Code status | Reward |
| --- | --- |
| All test cases passed | $+1.0$ |
| Test cases failed | $-0.1$ |
| Time limit exceeded | $-0.5$ |
| Runtime error | $-0.6$ |
| Compile error | $-1.0$ |

The practical idea is simple enough: if a dataset already tells you which submitted programs passed and which failed, you can train the model to put more probability mass on the better solutions without running a fresh execution loop at every update.
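As a minimal sketch, the status-to-reward mapping can be written as a lookup table. The reward values are copied from the status table above; the `Status` enum, its member names, and the helper functions are hypothetical names chosen for illustration, not identifiers from the paper or from CodeNet.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical labels for the five code statuses listed in the article."""
    ACCEPTED = "all_tests_passed"
    WRONG_ANSWER = "tests_failed"
    TIME_LIMIT = "time_limit_exceeded"
    RUNTIME_ERROR = "runtime_error"
    COMPILE_ERROR = "compile_error"


# Reward values as reported in the article's status table.
REWARD = {
    Status.ACCEPTED: 1.0,
    Status.WRONG_ANSWER: -0.1,
    Status.TIME_LIMIT: -0.5,
    Status.RUNTIME_ERROR: -0.6,
    Status.COMPILE_ERROR: -1.0,
}


def reward_for(status: Status) -> float:
    """Look up the scalar reward for one stored solution."""
    return REWARD[status]


def rewards_for(statuses):
    """Map a list of stored statuses to rewards without executing any code."""
    return [reward_for(s) for s in statuses]


if __name__ == "__main__":
    print(rewards_for([Status.ACCEPTED, Status.WRONG_ANSWER, Status.COMPILE_ERROR]))
```

The point of the sketch is that nothing here runs the submitted program. The verdict was computed once, when the dataset was built, and training only reads it back.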

But the paper’s implementation is not just “reward correct code, punish bad code.” It uses advantage estimates within groups of solutions for the same problem. Each group is required to contain at least one correct and one incorrect solution. This requirement is restrictive: it excludes problems with no correct solution, often the hardest cases, along with problems that have no recorded failures. But it gives the training objective a useful contrast. A correct solution is not rewarded in the abstract; it is rewarded relative to other candidate solutions for the same prompt.

That is the central mechanism: verified historical diversity becomes a substitute for live exploration.
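To make "rewarded relative to other candidates" concrete, here is a rough sketch of two common within-group advantage estimators: a leave-one-out baseline in the RLOO style and a mean/standard-deviation normalization in the GRPO style. These are the textbook forms, assumed here for illustration; the paper's exact formulas, the use of population standard deviation, and the `eps` constant are assumptions, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def rloo_advantages(rewards):
    """Leave-one-out: each reward minus the mean reward of the OTHER group members."""
    n = len(rewards)
    if n < 2:
        raise ValueError("a leave-one-out baseline needs at least two solutions")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def grpo_style_advantages(rewards, eps=1e-8):
    """Group-normalized: (reward - group mean) / group standard deviation.

    Population standard deviation and eps are illustrative choices.
    """
    if len(rewards) < 2:
        raise ValueError("normalization needs at least two solutions")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One problem with a passing, a failing, and a non-compiling submission.
    group = [1.0, -0.1, -1.0]
    print(rloo_advantages(group))
    print(grpo_style_advantages(group))
```

In both versions the passing solution ends up with a positive advantage and the compile error with a negative one, and the sizes depend on what else is in the group. A group where every reward is identical would produce all-zero advantages, which is one way to see why mixed groups are required.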

Why this is not just SFT with a reward sticker

The obvious reader shortcut is to say: if the dataset has correct code, just fine-tune on correct code. This is the kind of shortcut that feels efficient until it loses the information you actually wanted.

Supervised fine-tuning mostly says: imitate this target. It does not naturally encode the difference between “this failed tests,” “this timed out,” “this did not compile,” and “this passed.” Nor does it use the failed examples as structured negative evidence. Offline RL, as used here, can still look at weaker candidates and reduce their contribution without throwing them away entirely.
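A toy comparison may help. The sketch below treats both approaches as a weighted negative log-likelihood over stored solutions and changes only the weights: SFT gives weight 1 to passing solutions and 0 to the rest, while a simple policy-gradient-style surrogate weights each solution by its reward minus the group mean. This is a simplification under stated assumptions. The mean-centered baseline is the plainest possible choice rather than the paper's objective, the log-probabilities are made-up numbers, and real objectives may add terms such as clipping or KL penalties that are not modeled here.

```python
def weighted_nll(logprobs, weights):
    """Surrogate loss: negative mean of weight_i * log pi(y_i | x)."""
    if len(logprobs) != len(weights):
        raise ValueError("need one weight per solution")
    return -sum(w * lp for w, lp in zip(weights, logprobs)) / len(logprobs)


def sft_weights(rewards, pass_reward=1.0):
    """SFT view: imitate passing solutions, give everything else zero weight."""
    return [1.0 if r == pass_reward else 0.0 for r in rewards]


def centered_weights(rewards):
    """RL-style view: reward minus group mean, so failures get negative weight."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]


if __name__ == "__main__":
    rewards = [1.0, -0.1, -1.0]      # passed, failed tests, compile error
    logprobs = [-2.0, -3.0, -4.0]    # hypothetical sequence log-likelihoods
    print("SFT loss:     ", weighted_nll(logprobs, sft_weights(rewards)))
    print("RL-style loss:", weighted_nll(logprobs, centered_weights(rewards)))

    # Make the model MORE confident in the compile-error solution.
    worse = [-2.0, -3.0, -1.0]
    print("SFT loss:     ", weighted_nll(worse, sft_weights(rewards)))
    print("RL-style loss:", weighted_nll(worse, centered_weights(rewards)))
```

In the second pair of calls the SFT loss does not change at all, because the failed solution carries zero weight. The RL-style loss goes up, which is the structured negative evidence the paragraph above describes.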

The paper experiments with several variants of the advantage term:

| Training variant | Likely purpose in the paper | What it tests |
| --- | --- | --- |
| SFT | Baseline comparison | Whether simply imitating available data is enough |
| RLOO | Main offline RL adaptation | Whether leave-one-out policy-gradient training transfers to fully offline solution groups |
| RLOO with GRPO-style advantage | Main mechanism variant | Whether normalized within-group advantage improves the learning signal |
| Exponential advantage | Sensitivity / reward-scaling variant | Whether changing the reward scale helps smaller models learn |

This distinction matters because the best-performing method is not stable across every model and benchmark. The paper is not a heroic “offline RL beats everything” story. It is closer to a training-control story: the value depends on model size, problem difficulty, advantage formulation, and reward scale.

That may sound less cinematic. It is also more useful.

The main evidence: hard problems are where the signal becomes valuable

The authors evaluate Qwen2.5-Coder at two sizes: 0.5B and 7B. The 0.5B model is fully fine-tuned. The 7B model uses LoRA adapters. Training is deliberately compute-constrained: one A100 80GB GPU, under 30 hours for the 7B model, under 5 hours for the 0.5B model, with small batch sizes and a 2048-token cap.

The benchmarks are MBPP and APPS+. MBPP contains simpler Python problems and is scored with pass@1. APPS+ is split into introductory, interview, and competition levels and is scored with both pass@1 and pass@10.

The strongest business-relevant finding appears on the harder APPS+ tasks for the 7B model.

| Qwen2.5-Coder 7B on APPS+ | Intro pass@1 | Interview pass@1 | Competition pass@1 | Intro pass@10 | Interview pass@10 | Competition pass@10 |
| --- | --- | --- | --- | --- | --- | --- |
| Base | 4.50 | 7.01 | 1.42 | 11.90 | 25.70 | 8.22 |
| SFT | 9.21 | 11.30 | 1.38 | 16.50 | 30.30 | 6.29 |
| RLOO | 6.98 | 10.31 | 2.06 | 13.80 | 27.80 | 8.04 |
| RLOO + GRPO-style advantage | 7.72 | 12.14 | 2.67 | 14.70 | 30.20 | 11.54 |
| Exponential advantage | 7.14 | 11.09 | 2.38 | 15.10 | 30.30 | 9.97 |

On introductory tasks, SFT gives the highest score. That is not embarrassing; it is exactly what one might expect. Easier tasks often reward imitation because the mapping from prompt to acceptable program is relatively straightforward.

The interesting reversal happens at competition difficulty. SFT slightly degrades pass@1 from 1.42 to 1.38 and pass@10 from 8.22 to 6.29. The GRPO-style offline RL variant improves pass@1 to 2.67 and pass@10 to 11.54.

Those numbers are still small in absolute terms. Nobody should read 2.67 pass@1 on competition problems and start ordering confetti. But the direction is important. The harder the problem, the less safe it is to assume that imitating correct-looking samples is enough. Relative correctness signals begin to matter.
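A few lines of arithmetic put "small in absolute terms, important in direction" side by side. The scores below are copied from the competition columns of the 7B APPS+ table; the dictionary layout and the `change_vs_base` helper are illustrative names, not anything from the paper.

```python
# Competition-level scores for Qwen2.5-Coder 7B, copied from the APPS+ table.
COMPETITION = {
    "Base": {"pass@1": 1.42, "pass@10": 8.22},
    "SFT": {"pass@1": 1.38, "pass@10": 6.29},
    "RLOO + GRPO-style advantage": {"pass@1": 2.67, "pass@10": 11.54},
}


def change_vs_base(scores, method, metric, base="Base"):
    """Return (absolute change in points, relative change in percent) versus base."""
    b = scores[base][metric]
    m = scores[method][metric]
    return round(m - b, 2), round(100 * (m - b) / b, 1)


if __name__ == "__main__":
    for method in ("SFT", "RLOO + GRPO-style advantage"):
        for metric in ("pass@1", "pass@10"):
            abs_change, rel_change = change_vs_base(COMPETITION, method, metric)
            print(f"{method:28s} {metric:8s} {abs_change:+.2f} pts  {rel_change:+.1f}%")
```

The GRPO-style variant gains 1.25 points of pass@1, which is a large relative jump only because the base is 1.42. SFT loses 1.93 points of pass@10, which is about a quarter of the base score. Both readings are true at once, so it is worth reporting them together.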

Pass@10 is especially revealing. In code generation, pass@1 asks whether the first answer works. Pass@10 asks whether a set of sampled answers contains a working solution. A method that improves pass@10 may be preserving useful diversity rather than collapsing the model into one narrow imitation pattern. The paper explicitly connects offline training to diversity: because the dataset already contains varied historical submissions, the training process can avoid some entropy-collapse concerns associated with online RL.
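For readers who want the metric pinned down, here is a sketch of the widely used unbiased pass@k estimator that was introduced alongside the HumanEval benchmark. This article does not state which estimator the paper uses, so treat the choice as an assumption. The inputs are `n` sampled programs for a problem, of which `c` pass all tests.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated programs of which c are correct, passes."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 20 samples, only one of them correct.
    print(pass_at_k(20, 1, 1))
    print(pass_at_k(20, 1, 10))
```

With one correct program out of twenty, pass@1 is 0.05 while pass@10 is 0.5. A single rare success barely moves pass@1 but shows up clearly in pass@10, which is why the second metric is sensitive to the diversity of what the model can produce.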

For enterprise code assistants, this distinction is not academic decoration. First-answer accuracy matters for developer trust. But candidate-set quality also matters in workflows where the system can propose alternatives, run tests, repair failures, or escalate uncertain cases. A method that improves the pool of plausible solutions may be operationally valuable even before it produces a perfect first attempt.

The small model result is a reward-scaling warning, not a universal win

The 0.5B model tells a more complicated story.

| Qwen2.5-Coder 0.5B on APPS+ | Intro pass@1 | Interview pass@1 | Competition pass@1 | Intro pass@10 | Interview pass@10 | Competition pass@10 |
| --- | --- | --- | --- | --- | --- | --- |
| Base | 0.03 | 0.06 | 0.00 | 0.30 | 0.60 | 0.00 |
| SFT | 1.94 | 1.87 | 0.16 | 5.90 | 8.10 | 0.87 |
| RLOO | 0.06 | 0.12 | 0.05 | 0.50 | 1.10 | 0.35 |
| RLOO + GRPO-style advantage | 0.07 | 0.59 | 0.05 | 0.40 | 3.50 | 0.35 |
| Exponential advantage | 1.67 | 2.15 | 0.10 | 5.40 | 8.60 | 0.70 |

Here, ordinary RLOO and GRPO-style advantage trail SFT at every APPS+ difficulty level, on both pass@1 and pass@10. The exponential advantage variant does much better than the other offline RL variants and reaches the best interview-level scores, with pass@1 at 2.15 and pass@10 at 8.60, although SFT stays ahead on introductory and competition tasks. The authors interpret this as a reward-scale issue: exponentiating the normalized advantage changes the range from $[-1, 1]$ to approximately $[0.37, 2.72]$.
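The range change is easy to check by hand. The sketch below assumes only what the text states, namely that the input is an advantage already normalized to $[-1, 1]$ and that the transform is a plain exponential. How the paper performs that normalization is not described here, and `exp_weight` is a hypothetical name.

```python
import math


def exp_weight(normalized_advantage: float) -> float:
    """Exponentiate an advantage that has already been normalized to [-1, 1]."""
    if not -1.0 <= normalized_advantage <= 1.0:
        raise ValueError("expected an advantage in [-1, 1]")
    return math.exp(normalized_advantage)


if __name__ == "__main__":
    lo, hi = exp_weight(-1.0), exp_weight(1.0)
    print(f"range: [{lo:.2f}, {hi:.2f}]")        # endpoints are 1/e and e
    print(f"best/worst ratio: {hi / lo:.2f}")    # equals e squared
```

Two things follow from the arithmetic. Every weight is now positive, so under this reading even the worst solution is down-weighted rather than pushed away. And the best solution counts about 7.4 times as much as the worst, instead of carrying the opposite sign. Whether that is the reason the small model benefits is the authors' interpretation to defend, not something this sketch can establish.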

This is not a minor hyperparameter footnote. It is an operational warning.

Small models may need stronger or differently shaped training signals before reward-based contrast becomes useful. A signal that is informative for a 7B model may be too weak, too noisy, or too poorly scaled for a 0.5B model. The fix does not transfer cleanly, though: the same exponential advantage that helps the small model on APPS+ hurts it on MBPP, where pass@1 drops from the 36.2 base score to 35.0.

So the paper’s third contribution should be read carefully: reward scaling matters, especially for smaller models. It is not enough to say “use offline RL.” The reward transform itself becomes part of the product engineering surface.

That is the sort of finding that enterprise teams often discover after burning training budget and then naming the folder final_final_v7_really_final. Better to notice it earlier.

MBPP shows where offline RL stops being impressive

The MBPP results are a useful boundary test rather than a second thesis.

| Training method | Qwen2.5-Coder 0.5B MBPP pass@1 | Qwen2.5-Coder 7B MBPP pass@1 |
| --- | --- | --- |
| Base | 36.2 | 64.4 |
| SFT | 35.0 | 63.2 |
| RLOO | 37.4 | 64.4 |
| RLOO + GRPO-style advantage | 39.4 | 64.0 |
| Exponential advantage | 35.0 | 64.4 |

For the 0.5B model, GRPO-style advantage improves MBPP pass@1 from 36.2 to 39.4. That is a real gain. For the 7B model, there is essentially no improvement; the base is already at 64.4, and the variants hover around the same level or slightly below.

This supports a disciplined interpretation: offline RL is more valuable when there is room for reward-guided improvement and when the task is difficult enough for contrastive correctness signals to matter. If a capable model already handles a simple benchmark well, offline RL may become training theater. Everyone is busy, the GPU is warm, and the metric refuses to care.

The paper itself states this boundary clearly. Offline RL provides little to no improvement when the base model already performs strongly, and SFT can outperform offline RL on easier problems.

That boundary is important for adoption. A company should not apply offline RL because it sounds more advanced than SFT. It should apply offline RL where the marginal information in negative and mixed-quality examples is worth more than the simplicity of imitation.

What the paper directly shows, and what business teams may infer

The paper directly shows that online policy-gradient-style methods can be adapted to a fully offline code-generation setting, using pre-collected human-written solutions and correctness metadata. It shows meaningful gains under constrained compute, especially for harder APPS+ tasks with Qwen2.5-Coder 7B. It also shows that advantage formulation and reward scaling materially affect results.

The business inference is narrower but valuable: if an organization already has large archives of code submissions, test outcomes, compiler logs, CI failures, runtime errors, code review decisions, or internal bug-fix histories, those artifacts may become post-training fuel.

The asset is not “code data” in the generic sense. The asset is code paired with verifiable outcome metadata.

| Enterprise artifact | Offline-RL relevance | Practical business meaning | Boundary |
| --- | --- | --- | --- |
| Unit-test results | Functional correctness reward | Train models toward code that passes known behavioral checks | Tests may be incomplete or biased toward legacy assumptions |
| Compiler and runtime logs | Failure-type reward | Distinguish compile errors, runtime errors, timeout failures, and partial failures | Logs must be standardized across languages and environments |
| CI/CD histories | Historical execution metadata | Reuse existing engineering operations as model-training feedback | CI outcomes may reflect infrastructure noise, not only code quality |
| Code review outcomes | Human + procedural quality signal | Add maintainability or security-oriented labels beyond test passing | Review comments are noisier and less directly verifiable |
| Programming challenge archives | Clean grouped problem-solution data | Create within-problem contrast between correct and incorrect solutions | May not match enterprise codebase complexity |

The inference should not be overstretched. The paper does not show that offline RL will improve every code assistant, every language, every model family, or every enterprise repository. It does not test multi-file software engineering tasks, long-horizon agent workflows, security patches, migration work, or production codebase integration.

But it does suggest a direction: the next advantage in enterprise code-model training may come less from collecting more pristine demonstrations and more from preserving the messy outcome structure around attempts, failures, and repairs.

In other words, the failed code is not just trash. It may be training signal with bad posture.

The operational control stack is the real product lesson

For a business team, the paper’s mechanism translates into a control stack:

  1. Collect solution attempts, not only final accepted code.
  2. Attach verifiable outcome metadata, such as test pass, test fail, timeout, runtime error, or compile error.
  3. Group attempts by task, so the model sees relative quality for the same problem.
  4. Choose reward and advantage design, instead of assuming correctness labels automatically produce useful gradients.
  5. Evaluate by task difficulty, because easy-task gains may hide hard-task degradation—or the reverse.
  6. Separate first-answer accuracy from candidate-set quality, using metrics like pass@1 and pass@10 where appropriate.
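The first three steps of the list above can be sketched as a small data pipeline: store every attempt with its outcome, group by task, and keep only the tasks that offer a contrast. The `Attempt` record, its field names, and the status strings are hypothetical. A real archive would carry far more context, such as language, timestamps, and test-suite version.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One stored solution attempt with its verified outcome (illustrative schema)."""
    task_id: str
    code: str
    status: str  # e.g. "passed", "failed", "timeout", "runtime_error", "compile_error"


def group_by_task(attempts):
    """Step 3: collect all attempts that answer the same task."""
    groups = defaultdict(list)
    for attempt in attempts:
        groups[attempt.task_id].append(attempt)
    return dict(groups)


def trainable_groups(groups):
    """Keep tasks with at least one passing and at least one non-passing attempt."""
    return {
        task_id: members
        for task_id, members in groups.items()
        if any(a.status == "passed" for a in members)
        and any(a.status != "passed" for a in members)
    }


if __name__ == "__main__":
    log = [
        Attempt("sum-two", "print(a + b)", "passed"),
        Attempt("sum-two", "print(a - b)", "failed"),
        Attempt("parse-date", "import datetme", "runtime_error"),
        Attempt("hello", "print('hi')", "passed"),
    ]
    usable = trainable_groups(group_by_task(log))
    print(sorted(usable))
```

In this toy log only `sum-two` survives. `parse-date` has no passing attempt and `hello` has no failing one, so neither offers a within-task comparison. That is worth measuring on a real archive before committing to training, because it shows how much of the history is actually usable.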

This is not just a training recipe. It is a data-governance recipe for code AI.

A company that wants to build a serious internal coding assistant should probably ask a less fashionable question than “Which base model should we use?” That question still matters, of course. But after a certain point, the better question is: “Do we have a reliable history of what code worked, what failed, how it failed, and under which task context?”

Without that history, offline RL has little to stand on. With it, the organization may already possess a training asset that was produced accidentally by years of software development operations.

The limits: this is promising, not a deployment guarantee

The paper’s limitations are not decorative. They materially affect how the result should be used.

First, the dataset is imbalanced. CodeNet has a highly uneven number of solutions per problem and a high proportion of correct solutions. The authors cap correct solutions at 50 per problem, but this is a heuristic, not a complete solution. In enterprise settings, imbalance may be worse: popular modules may have abundant logs, while rare failure modes may barely appear.
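As an illustration of the capping heuristic, the sketch below keeps every incorrect solution for a problem and at most 50 correct ones. The article gives only the cap, so the selection rule is an assumption: seeded random sampling is used here, and keeping the first 50 or the shortest 50 would be equally plausible. The function name and the `(code, is_correct)` tuple layout are hypothetical.

```python
import random


def cap_correct(solutions, max_correct=50, seed=0):
    """Keep all incorrect solutions and at most max_correct correct ones.

    solutions: list of (code, is_correct) pairs for a single problem.
    Random selection with a fixed seed is an illustrative assumption.
    """
    correct = [s for s in solutions if s[1]]
    incorrect = [s for s in solutions if not s[1]]
    if len(correct) > max_correct:
        correct = random.Random(seed).sample(correct, max_correct)
    return correct + incorrect


if __name__ == "__main__":
    # A popular problem: 120 accepted submissions, 7 rejected ones.
    problem = [(f"ok_{i}", True) for i in range(120)]
    problem += [(f"bad_{i}", False) for i in range(7)]
    capped = cap_correct(problem)
    print(sum(1 for _, ok in capped if ok), sum(1 for _, ok in capped if not ok))
```

The cap limits how much a single popular problem can dominate the correct side of the data, but it does nothing for problems that have very few attempts in the first place. That is the sense in which it is a heuristic rather than a fix.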

Second, each training group must contain at least one correct and one incorrect solution. This makes the advantage estimate cleaner, but it removes cases where no correct solution exists. Unfortunately, those cases may be exactly where a business most wants help: novel failures, missing tests, migration edge cases, or unresolved bugs.

Third, the experiments are purely offline. That is the point of the paper, but also a boundary. A hybrid setup with a small amount of online interaction may be better when the offline dataset lacks coverage or when the model needs to explore new solution patterns.

Fourth, the authors directly apply online-style policy-gradient methods in an offline setting. This is empirically useful, but the theory and algorithm design are not settled. The paper itself points toward more tailored offline RL methods and value-based approaches such as Q-learning.

Finally, the evaluation is limited to Qwen2.5-Coder models, Python solutions, CodeNet-derived training, and MBPP/APPS+ benchmarks. These are meaningful tests, but they are not the same as enterprise software engineering inside a living codebase with dependencies, style constraints, security policies, and product deadlines. Reality, as usual, refuses to fit inside a benchmark table. Very rude of it.

The takeaway: verification does not disappear; it changes address

The most useful reading of this paper is not that offline RL is a cheaper replacement for online RL. It is that code-model post-training can be reorganized around where verification happens.

Online RL verifies during training. SFT often ignores structured failures and imitates target outputs. Offline RL, in this paper’s form, uses previously verified attempts as grouped evidence. It lowers training-time execution cost by moving verification into the dataset construction stage.

That shift has a practical consequence. The scarce resource is not only GPU time. It is reliable, task-linked, outcome-labeled code history.

For organizations building code assistants, this changes the strategic question. The question is not merely whether they can afford more reinforcement learning. The question is whether their engineering systems already generate the right kind of evidence—and whether that evidence is clean enough to train on.

Offline RL does not make verification optional. It makes verification reusable.

That is less magical than a self-improving coding agent. It is also much closer to something businesses can actually operationalize.

Cognaptus: Automate the Present, Incubate the Future.


  1. Mingze Wu, Abhinav Anand, Shweta Verma, and Mira Mezini, “Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning,” arXiv:2605.28409v1, 2026. https://arxiv.org/abs/2605.28409