Bending the Beam, Not the Brain: What RL with Perfect Rewards Still Can’t Teach LLMs

Beams are honest objects. Push them, load them, move their supports, and they obey equilibrium equations without theatrical ambiguity. Language models, unfortunately, are less well-behaved.

That is what makes BeamPERL a useful paper. It does not test LLM reasoning on a vague benchmark where “correctness” means pleasing a judge, matching a rubric, or sounding sufficiently graduate-school. It asks a compact reasoning model to solve a classical beam statics task: calculate support reactions for a loaded beam. The answers can be checked by a symbolic solver. The reward can be exact. No vibes, no partial credit, no “the answer feels plausible.”¹

The tempting conclusion would be simple: if the reward is perfect, the model should learn the physics.

BeamPERL says: not necessarily.

The model improves. It becomes better at producing correct beam-reaction answers. The best checkpoint raises Pass@1 from 12.50% to 20.83%, a 66.7% relative improvement, and raises Pass@7 from 29.17% to 41.67%. That is not nothing. In a narrow technical domain, a small model, LoRA adapters, and deterministic verification can produce real specialization.

But the important result is not the improvement. The important result is the shape of the improvement.

The model generalizes to some unfamiliar cases, especially beams that carry multiple loads but keep the same end-support structure. Then it struggles when the supports move, even though the same static equilibrium principles still apply. Later in training, it may preserve the required output format while producing incoherent content on certain out-of-distribution cases. The formatting survives. The reasoning does not always come along for the ride. A very modern AI failure, wearing a hard hat.

The experiment is small because the mechanism is the point

BeamPERL is not trying to build a production structural-analysis system. It is closer to a diagnostic probe: what exactly does reinforcement learning with verifiable rewards teach a compact model?

The task is deliberately constrained. The paper focuses on the first step in beam statics: calculating support reactions. A beam has a pin support and a roller support. With no horizontal load, the horizontal reaction is zero. The two vertical reactions are solved using force balance and moment balance:

$$ \sum F_H = 0,\quad \sum F_V = 0,\quad \sum M = 0 $$
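As a worked instance of these equations, take the training-style case: a beam of length $L$ with a pin at $x = 0$, a roller at $x = L$, and one downward point load $P$ at $x = a$. The symbols $R_A$, $R_B$, and $a$ are notation introduced here for illustration and are not taken from the paper.

$$ \sum M_A = 0:\quad R_B L - P a = 0 \;\Rightarrow\; R_B = \frac{a}{L}P, \qquad \sum F_V = 0:\quad R_A + R_B - P = 0 \;\Rightarrow\; R_A = \frac{L-a}{L}P $$

For $a = L/4$, this gives $R_A = 0.75P$ and $R_B = 0.25P$.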

For a human engineer, this is not an exotic reasoning problem. That is precisely why it is useful. The physics is simple enough to verify exactly, but structured enough to reveal whether the model is learning a transferable principle or a procedural routine.

The training data uses beams with supports fixed at the two ends and a single downward point load. The authors generate 189 distinct beam configurations, then use four natural-language formulations per configuration, yielding 756 question–answer pairs. The correct answers are computed through a modified SymBeam/SymPy symbolic pipeline.

The evaluation set is much smaller: 24 examples. At that size, a single problem moves an aggregate pass rate by about 4.17 percentage points, which should temper any overexcited reading of the benchmark numbers. But the set is designed to separate three regimes:

Evaluation group	Likely purpose	What it tests
Held-out in-distribution beams	Main evidence	Whether the model improves on unseen parameter combinations from the same structural regime
Multiple-load beams with end supports	OOD compositional test	Whether the model can extend the single-load template by superposition-like reasoning
Beams with shifted supports or cantilevered ends	OOD topology test	Whether the model understands the governing equilibrium setup when the structural geometry changes

This is the right kind of small experiment: not statistically grand, but mechanistically revealing. It does not answer “Can LLMs do engineering?” It asks a narrower and more useful question: when a reward is exact, does the model learn the invariant structure of the problem, or does it learn the easiest reward-producing procedure available?

The reward checks the final answer, not the model’s private physics

The training setup is intentionally minimalist. BeamPERL starts from DeepSeek-R1-Distill-Qwen-1.5B, the smallest distilled model in the R1 family. The authors avoid a domain-specific supervised fine-tuning stage with teacher-generated reasoning traces. Instead, the model is trained through GRPO-style reinforcement learning with verifiable rewards.

Only LoRA adapters are updated. The base weights remain frozen. That reduces trainable parameters from 1.777 billion to 36.93 million, a 97.9% reduction. In plain business language: the paper is testing whether a compact model can be cheaply specialized without retraining the whole beast.
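The quoted reduction follows directly from the two parameter counts:

$$ \frac{36.93 \times 10^{6}}{1.777 \times 10^{9}} \approx 0.0208, \qquad 1 - 0.0208 \approx 0.979 $$

So roughly 2.1% of the parameters receive gradient updates.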

The reward function is the central mechanism:

$$ R(o_i)=\frac{1}{3}\left(R_{\text{format}}(o_i)+2R_{\text{accuracy}}(o_i)\right) $$

The format reward checks whether the output follows the required structure: a reasoning section enclosed by <think> tags and at least one boxed final expression. The accuracy reward checks whether the extracted reaction-force coefficients match the symbolic ground truth, using multiset matching and a tolerance of $10^{-4}$.
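A minimal sketch of how such a reward could be scored, under stated assumptions. The function names, the regular expressions, the binary 0/1 scoring of each component, and the assumed answer shape (boxed expressions such as `R_A = 0.75P`) are all hypothetical choices made for illustration. Only the weighting, the multiset matching, and the $10^{-4}$ tolerance come from the description above. This is not the authors' implementation.

```python
import re

# Assumed output conventions (hypothetical): a <think>...</think> block and
# \boxed{...} expressions whose right-hand side looks like "0.75P".
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")  # simplification: no nested braces
COEFF_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*\*?\s*P\s*$")


def format_reward(output: str) -> float:
    """1.0 if the output has a think block and at least one boxed expression."""
    has_think = THINK_RE.search(output) is not None
    has_boxed = BOXED_RE.search(output) is not None
    return 1.0 if (has_think and has_boxed) else 0.0


def extract_coefficients(output: str) -> list[float]:
    """Pull the numeric coefficient of P out of every boxed expression."""
    coeffs = []
    for body in BOXED_RE.findall(output):
        rhs = body.split("=")[-1]  # keep only what follows the last "="
        match = COEFF_RE.match(rhs)
        if match:
            coeffs.append(float(match.group(1)))
    return coeffs


def accuracy_reward(output: str, truth: list[float], tol: float = 1e-4) -> float:
    """Multiset match: order of the reactions does not matter."""
    pred = sorted(extract_coefficients(output))
    ref = sorted(truth)
    if len(pred) != len(ref):
        return 0.0
    return 1.0 if all(abs(p - r) <= tol for p, r in zip(pred, ref)) else 0.0


def total_reward(output: str, truth: list[float]) -> float:
    """R = (R_format + 2 * R_accuracy) / 3."""
    return (format_reward(output) + 2.0 * accuracy_reward(output, truth)) / 3.0
```

Under these assumptions, a well-formatted response with meaningless content still earns one third of the maximum reward, while a correct answer without the think block earns two thirds. Nothing in the score inspects the text between the tags.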

This is a clean design, but it contains the entire story.

The model is rewarded for producing a parseable final answer that matches the solver. It is not rewarded for constructing the right free-body diagram. It is not rewarded for choosing the right moment point. It is not rewarded for explicitly representing the governing equations under changed support locations. Those may emerge as useful internal behavior. Or they may not.

That is the misconception BeamPERL quietly kills: verifiable final-answer rewards are not the same as verifiable reasoning.

They are a strong filter on outcomes. They are not a microscope pointed inside the model.

The first training regime is useful: formatting improves, answers improve, token use falls

The early training behavior is encouraging. The total reward rises sharply from roughly 0.2 to 0.8 over the first 120 training examples, then plateaus. Completion length decreases early and then stabilizes, meaning the model is not merely producing longer answers to look more thoughtful. It becomes more efficient at the task.

This matters. A lazy interpretation of the paper would say, “RLVR does not work.” That is not what the paper shows.

RLVR works, within limits.

Metric	Base model	BeamPERL best checkpoint	Interpretation
Pass@1	12.50%	20.83%	Better first-sample correctness
Pass@7	29.17%	41.67%	Better chance that at least one sampled answer is correct
Majority@7	0.00%	4.17%	Slightly better consistency, still weak
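To make the three metrics concrete, here is a small sketch using common definitions: Pass@1 looks at the first sample, Pass@k asks whether any of k samples is correct, and Majority@k asks whether the most frequent answer is correct. These definitions and the toy data are assumptions made for illustration; the paper may compute the metrics differently, for example by averaging over samples.

```python
from collections import Counter

# Each problem is a list of (answer_string, is_correct) samples.
# The data below is a toy example, not taken from the paper.


def pass_at_k(problem, k):
    """True if any of the first k samples is correct."""
    return any(ok for _, ok in problem[:k])


def majority_at_k(problem, k):
    """True if the most frequent answer among the first k samples is correct."""
    votes = Counter(answer for answer, _ in problem[:k])
    top_answer, _ = votes.most_common(1)[0]
    return any(ok for answer, ok in problem[:k] if answer == top_answer)


def rate(problems, metric, k):
    """Percentage of problems for which the metric holds."""
    return 100.0 * sum(metric(p, k) for p in problems) / len(problems)


toy_problems = [
    [("0.75,0.25", True), ("0.5,0.5", False), ("0.5,0.5", False)],
    [("1,0", False), ("1,0", False), ("0.4,0.6", True)],
]

pass1 = rate(toy_problems, pass_at_k, 1)      # first sample only
pass3 = rate(toy_problems, pass_at_k, 3)      # any of three samples
maj3 = rate(toy_problems, majority_at_k, 3)   # plurality answer

# Percentages produced by whole-problem counts on a 24-example set.
counts_out_of_24 = {n: round(100 * n / 24, 2) for n in (1, 3, 5, 7, 10)}
```

The last dictionary maps 1, 3, 5, 7, and 10 problems to 4.17%, 12.5%, 20.83%, 29.17%, and 41.67%. The reported figures are therefore consistent with whole-problem counts out of 24, which is a reminder of how few examples sit behind each percentage.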

The best checkpoint is not a miracle. But it is a real movement in the right direction. The model learns to obey the required format and produce more correct answers. On held-out in-distribution examples, Pass@7 remains around 50%, while Pass@1 and Majority@7 rise toward 50% by the end of training.

That is operationally relevant. Many businesses do not need one general model to become a mechanical engineer, lawyer, accountant, and procurement analyst before lunch. They need narrow copilots that perform repetitive, checkable tasks better than a general model would. BeamPERL supports that path.

But it also shows where the path narrows.

The real evidence is the anisotropy: more loads are easier than moved supports

The paper’s most useful result is not “accuracy went up.” It is that accuracy went up unevenly.

When the evaluation examples contain multiple loads but keep supports at the beam ends, the model improves. That makes sense. Multiple point loads preserve much of the original structural pattern. The model can reuse something like the single-load procedure: sum the loads, compute moments, distribute reactions.

When the support locations move, performance peaks around the best checkpoint and then declines. This is more revealing. Moving supports changes the effective lever arms and the structural configuration. The same equilibrium principles apply, but the familiar end-support template no longer fits cleanly.
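The point that only the lever arms change can be shown with a short sketch. The function below is a plain statics calculation written for this article, not the paper's SymBeam/SymPy pipeline; its name, argument layout, and sign convention (downward loads and upward reactions both positive) are assumptions.

```python
def support_reactions(x_a, x_b, loads):
    """Vertical reactions for a beam with supports at x_a and x_b.

    loads: list of (magnitude, position) point loads, downward positive.
    Returns (r_a, r_b), upward positive.
    """
    if x_a == x_b:
        raise ValueError("supports must be at distinct positions")
    total_load = sum(p for p, _ in loads)
    # Moment balance about support A: r_b * (x_b - x_a) = sum(p * (x - x_a)).
    moment_about_a = sum(p * (x - x_a) for p, x in loads)
    r_b = moment_about_a / (x_b - x_a)
    # Vertical force balance: r_a + r_b = total load.
    r_a = total_load - r_b
    return r_a, r_b


# Training-style case: end supports on a beam of length 4, one unit load.
single = support_reactions(0.0, 4.0, [(1.0, 1.0)])
# Compositional shift: same supports, two loads.
multi = support_reactions(0.0, 4.0, [(1.0, 1.0), (2.0, 3.0)])
# Topological shift: supports moved inward, load on the overhang at x = 0.
shifted = support_reactions(1.0, 3.0, [(1.0, 0.0)])
```

The same two equations cover all three calls. In the shifted case the far reaction turns negative and the near reaction exceeds the applied load, which a routine tuned to end-supported beams would have no reason to expect.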

A human who understands statics does not panic when a support moves inward. The moment equation changes; the principle does not. BeamPERL’s behavior suggests that the model has learned a procedure partly tied to the training topology, not a robust internal representation of equilibrium as a general constraint.

That difference is easy to miss if we only look at aggregate accuracy. Aggregate numbers compress the very thing business users need to know: which variation breaks the model.

Paper observation	What it supports	What it does not prove
Best checkpoint improves Pass@1 and Pass@7	PE-RLVR can specialize a compact distilled model for a narrow engineering calculation	The model has learned general beam mechanics
Multi-load OOD cases improve	The learned procedure transfers along a compositional axis close to training data	Robust transfer to arbitrary structural configurations
Shifted-support OOD cases degrade after the optimal regime	Topological shifts expose brittleness in the learned policy	That all RLVR-trained models must fail similarly
Late checkpoints show incoherence under some OOD cases despite format retention	Format alignment can survive while semantic reasoning deteriorates	The reward function is useless

This is why the paper is more valuable than a leaderboard entry. It distinguishes parametric extension from structural transfer.

For engineering AI, that distinction is not academic. Parametric variation is “same workflow, different numbers.” Structural transfer is “same laws, different setup.” Many business failures occur when teams test the former and deploy into the latter.

The model learns a template before it learns a theory, and maybe never reaches the theory

The authors interpret the behavior as evidence of procedural template learning rather than full internalization of governing equations. That is a strong claim, and it should be read carefully.

The paper does not open the model and directly inspect a symbolic physics representation sitting somewhere in a transformer layer, wearing a tiny lab coat. Instead, it infers from behavior: what improves, what fails, and when.

The evidence is consistent with two regimes.

In the early regime, the model aligns to the desired output structure and becomes better at producing parseable answers. This is useful and expected. The combination of system prompt, format reward, and accuracy reward teaches the model what an acceptable solution looks like.

In the later regime, further optimization pushes the policy farther from the base model. KL divergence increases sharply beyond the best-performing checkpoint. Evaluation performance no longer improves uniformly. Some OOD behavior deteriorates. In one shifted-support example, the final checkpoint preserves the response format but produces semantically meaningless mixed-language content inside it.

That is not “reasoning under pressure.” That is the model keeping the suit and losing the body.

The paper links this to a familiar RL problem: reward hacking, or at least reward over-optimization. The model is optimizing a reward that is correlated with competence only inside a certain regime. Push too far, and the local reward landscape no longer guides the model toward robust reasoning.

The important nuance is that reward hacking here does not mean the model deliberately cheats. It means the training process discovers behaviors that satisfy parts of the reward structure without preserving the deeper capability we hoped the reward would produce. This is why anthropomorphic language is usually a tax on clarity. The model is not cunning. The optimization is.

General math ability survives early training, then erodes later

BeamPERL also evaluates selected checkpoints on AMC23, AIME24, and AIME25 to see whether beam specialization damages broader mathematical reasoning.

At the best-performing BeamPERL checkpoint, the results are modestly reassuring:

Benchmark	Base model	BeamPERL best checkpoint
AMC23	72.5%	75.0%
AIME24	33.3%	40.0%
AIME25	23.3%	23.3%

So the best checkpoint does not immediately destroy general mathematical ability. It may even produce small positive transfer at intermediate stages.

But the time path matters again. After roughly 200 training examples, benchmark performance declines, and the decline becomes more pronounced by the final checkpoint at 360 examples. The same pattern appears in the beam evaluations: the useful training window is not the longest training window.

This is a practical lesson disguised as a research result. Fine-tuning is not a moral journey where more effort creates a better character. Sometimes more training just makes the model more specialized, more brittle, and more confident in the wrong parts of its learned routine.

For companies, the implication is blunt: checkpoint selection is not housekeeping. It is risk management.

The business value is narrow automation, not autonomous engineering judgment

What does the paper directly show?

It shows that a compact distilled reasoning model can be specialized for a narrow, verifiable engineering calculation using parameter-efficient RL, symbolic ground truth, and no teacher-generated reasoning traces. It also shows that the resulting competence is distribution-dependent. The model improves along some axes and fails along others.

What can Cognaptus reasonably infer for business use?

The strongest near-term use case is not a general “AI engineer.” It is a constrained technical copilot for repetitive calculations where:

the input format can be controlled;
the answer can be independently verified;
the task class is narrow enough to validate exhaustively or near-exhaustively;
a symbolic solver, rules engine, or simulation tool can remain in the loop.

That is still valuable. Many businesses have exactly this kind of work: repeated compliance calculations, engineering pre-checks, spreadsheet-like technical assessments, cost-estimation routines, financial formula workflows, and quality-control checks.

The weak business interpretation would be: “Small models can learn engineering reasoning cheaply.”

The stronger interpretation is: “Small models can be shaped into useful interfaces around checkable technical procedures, as long as we do not confuse procedure completion with domain understanding.”

A model like BeamPERL is better viewed as an assistant that proposes structured solutions, not as a final authority. The solver is still the adult in the room. Not glamorous, perhaps, but adults are underrated in production systems.
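One way to keep the solver in charge is to check every proposed answer against the equilibrium equations before accepting it. The sketch below is illustrative only: the function names and sign convention are assumptions, and applying the $10^{-4}$ tolerance to residuals is borrowed loosely from the reward description rather than taken from the paper.

```python
def equilibrium_residuals(x_a, x_b, loads, r_a, r_b):
    """Residuals of vertical force balance and moment balance about support A.

    loads: list of (magnitude, position), downward positive.
    r_a, r_b: proposed reactions, upward positive.
    """
    force = r_a + r_b - sum(p for p, _ in loads)
    moment = r_b * (x_b - x_a) - sum(p * (x - x_a) for p, x in loads)
    return force, moment


def accept(x_a, x_b, loads, r_a, r_b, tol=1e-4):
    """Accept a proposal only if both residuals vanish within tolerance."""
    force, moment = equilibrium_residuals(x_a, x_b, loads, r_a, r_b)
    return abs(force) <= tol and abs(moment) <= tol


loads = [(1.0, 1.0)]                              # unit load at x = 1 on a span of 4
correct = accept(0.0, 4.0, loads, 0.75, 0.25)     # satisfies both equations
plausible = accept(0.0, 4.0, loads, 0.5, 0.5)     # sums to the load, fails moments
```

The second proposal balances the vertical forces and still gets rejected, because the moment equation is checked as well. A plausible-looking answer is not enough.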

A deployment framework: make topology visible, not just accuracy measurable

The paper’s anisotropic generalization result points to a practical evaluation framework. Do not validate only on random held-out examples. Random holdout often measures whether the model handles familiar problem forms with new numbers. That is useful, but insufficient.

For technical AI systems, validation should separate at least three kinds of generalization:

Validation layer	Example in BeamPERL	Business analogue	Deployment question
Linguistic variation	Same beam, differently worded question	Different user phrasing	Does the model parse the request robustly?
Parametric variation	Same support layout, different load magnitude/location	Same workflow, different numeric inputs	Does the model interpolate within the operating regime?
Topological variation	Supports move; cantilevered ends appear	Same laws or rules, different structural setup	Does the model know when the template no longer applies?

Most model evaluations overemphasize the first two. Production failures often live in the third.

For a business workflow, “topology” does not always mean physical geometry. In finance, it may mean a changed contract structure. In compliance, it may mean a different regulatory dependency. In logistics, it may mean a different constraint graph. In accounting, it may mean a transaction that looks numerically familiar but belongs to a different recognition category.

The lesson travels well: identify the structure that governs the calculation, then test whether the model still works when that structure changes.
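In practice this means reporting accuracy per validation layer rather than as one number. The sketch below uses made-up records purely for illustration; the slice names follow the table above, and the helper names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """records: iterable of (slice_name, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [hits, count]
    for slice_name, ok in records:
        totals[slice_name][0] += int(ok)
        totals[slice_name][1] += 1
    return {name: hits / count for name, (hits, count) in totals.items()}


def worst_slice(report):
    """Return the (slice_name, accuracy) pair with the lowest accuracy."""
    return min(report.items(), key=lambda item: item[1])


# Hypothetical evaluation records, not results from the paper.
records = (
    [("linguistic", True)] * 3 + [("linguistic", False)]
    + [("parametric", True)] * 2 + [("parametric", False)] * 2
    + [("topological", False)] * 4
)

report = accuracy_by_slice(records)
aggregate = sum(ok for _, ok in records) / len(records)
weakest = worst_slice(report)
```

Here the aggregate is 5 out of 12, a figure that says nothing about the topological slice sitting at zero. The per-slice report is what a deployment decision should be based on.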

What the paper does not prove, and why that matters

BeamPERL is intentionally narrow. That is a strength for mechanism analysis, but a boundary for business interpretation.

The evaluation set contains only 24 beam examples. It is structured, but small. The task is support-reaction calculation, not full structural analysis with shear diagrams, bending moments, deflections, frames, trusses, load combinations, safety factors, and real engineering codes. The reward weights (one-third format, two-thirds accuracy) were chosen experimentally and not systematically ablated. The setup also uses a distilled reasoning model that already has mathematical priors; pure RL on weaker small base models struggled to get useful reward signals.

So the paper does not prove that RLVR cannot produce general reasoning. It proves something narrower and more operationally useful: exact outcome rewards, by themselves, did not reliably produce transferable beam-mechanics reasoning in this controlled setting.

That narrower claim is enough.

Businesses do not need a metaphysical theory of whether LLMs “understand.” They need to know whether a model will fail when the problem is still governed by the same rule but no longer resembles the training template. BeamPERL says: yes, that failure mode is real, and exact final-answer rewards do not automatically remove it.

The better design pattern: scaffold, verify, stop early, and keep tools close

The paper’s future-work suggestions are not decorative. They point toward a more realistic architecture for domain-specific AI.

A practical system should combine several controls:

Design choice	Why it matters
Structured reasoning scaffolds	They may teach the model how to represent the problem before reward optimization narrows behavior
Process-level checks	They verify intermediate steps, not only final answers
Topology-aware validation sets	They reveal whether the model handles structural shifts, not just new numbers
Early stopping and checkpoint selection	They prevent late-stage policy drift from eroding robustness
Solver or simulator fallback	They keep deterministic verification in the production loop
KL or drift monitoring	They track when specialization becomes behavioral instability

Checkpoint selection and drift monitoring deserve particular attention. In BeamPERL, the best behavior appears before the final checkpoint. This means a production team should not treat fine-tuning as a one-way march toward improvement. It should treat training as a search over trade-offs: task accuracy, general reasoning retention, output stability, and out-of-distribution behavior.
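One simple way to encode that search is to filter checkpoints by drift and retention limits, then pick the one with the best worst-case task score. Everything in the sketch below is hypothetical: the field names, the thresholds, the selection rule, and above all the numbers, which are invented for illustration and are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    step: int
    in_dist: float         # accuracy on held-out in-distribution problems
    ood: float             # accuracy on out-of-distribution problems
    math_retention: float  # score on a general math benchmark
    kl: float              # divergence from the base policy


def select_checkpoint(checkpoints, max_kl, min_math):
    """Pick the eligible checkpoint with the best worst-case task accuracy."""
    eligible = [
        c for c in checkpoints
        if c.kl <= max_kl and c.math_retention >= min_math
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: min(c.in_dist, c.ood))


# Invented training history, for illustration only.
history = [
    Checkpoint(step=100, in_dist=0.30, ood=0.10, math_retention=0.70, kl=0.01),
    Checkpoint(step=200, in_dist=0.45, ood=0.30, math_retention=0.72, kl=0.03),
    Checkpoint(step=300, in_dist=0.50, ood=0.05, math_retention=0.55, kl=0.40),
]

chosen = select_checkpoint(history, max_kl=0.10, min_math=0.65)
```

In this toy history the last checkpoint has the highest in-distribution accuracy and is still rejected, because it breaches both the drift and the retention limits. The intermediate checkpoint is chosen.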

The system should also know when not to answer. A model that recognizes “this beam configuration is outside my validated topology” is more useful than a model that confidently performs template cosplay.

That is the unglamorous discipline of reliable automation: not just making models answer, but teaching systems when answers require tools, escalation, or refusal.
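A minimal sketch of such a gate, assuming the validated regime is "supports at both beam ends" as in the training data described earlier. The function names, the routing labels, and the tolerance are hypothetical.

```python
def is_validated_topology(beam_length, x_a, x_b, tol=1e-9):
    """True only for the regime the model was validated on: end supports."""
    return abs(x_a) <= tol and abs(x_b - beam_length) <= tol


def route(beam_length, x_a, x_b):
    """Send familiar layouts to the model and everything else to the solver."""
    if is_validated_topology(beam_length, x_a, x_b):
        return "model"
    return "solver"


familiar = route(4.0, 0.0, 4.0)  # end supports: inside the validated regime
shifted = route(4.0, 1.0, 3.0)   # supports moved inward: outside it
```

The gate is deliberately dumb. It does not ask whether the model might cope with the shifted beam; it only asks whether that layout was ever validated.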

The real lesson is about reward precision, not reward weakness

BeamPERL is valuable because it uses a strong reward and still finds a limitation.

If the reward were noisy, subjective, or poorly specified, the conclusion would be easy: improve the reward. But here the reward is analytically grounded. It checks exact coefficients against symbolic truth. It is precisely the kind of reward many technical AI builders hope will rescue them from the messiness of human evaluation.

And it helps. It just does not solve the deeper problem.

Reward precision tells the model which outputs win. It does not guarantee the model learns the conceptual machinery we would use to win. Sometimes the shortest path to reward is a reusable procedure. Sometimes that procedure generalizes compositionally. Sometimes it breaks when the topology changes. The reward can be perfect and still under-specify the reasoning we care about.

For business readers, this is the useful replacement for the naive belief:

Verifiable rewards are excellent for building narrow, checkable AI systems. They are not magic solvents for reasoning risk.

That distinction should shape procurement, product design, and internal AI governance. If a vendor says their model is safe because outputs are verified, ask what happens when the problem structure changes. Ask whether intermediate reasoning is checked. Ask how they choose checkpoints. Ask whether the validation set contains topological shifts, not just fresh examples.

It may briefly ruin the sales call. That is acceptable collateral damage.

Conclusion: exact answers are not the same as transferable reasoning

BeamPERL bends a compact language model toward better beam-reaction answers. That alone is promising. It shows that small, open, parameter-efficient models can be adapted for narrow scientific and engineering tasks using symbolic verification and modest compute.

But the more important finding is the boundary: the model’s competence is directional. It improves along familiar and compositional axes, then weakens under structural shifts that should still be governed by the same physics. Later training can preserve format while degrading semantic coherence. Accuracy rises, then robustness becomes negotiable. A familiar software story, now with equilibrium equations.

For Cognaptus readers, the business conclusion is not pessimistic. It is disciplined.

Use small specialized models where the task is narrow, repetitive, and externally checkable. Pair them with solvers, process checks, and topology-aware validation. Stop training when validation says the model is at its best, not when the epoch politely ends. And never mistake a correct final answer for proof that the model has learned the domain.

The beam may be simple. The lesson is not.

Cognaptus: Automate the Present, Incubate the Future.

¹ Tarjei Paule Hage and Markus J. Buehler, “BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning,” arXiv:2603.04124, 2026. https://arxiv.org/abs/2603.04124
