
Plan, Don't Spam: The Goldilocks Rule for Test‑Time Compute

When do you really need a plan? In agentic AI, the answer isn’t “always” (ReAct‑style reasoning at every step) or “never” (greedy next‑action). It’s sometimes, and knowing when is the whole game. A new paper shows that agents that learn to allocate test‑time compute dynamically, planning only when the expected benefit outweighs the cost, beat both extremes on long‑horizon tasks. Why this matters for operators: most enterprise deployments of LLM agents are killed by one of two problems: ...
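
To make “plan only when it pays” concrete, here is a minimal sketch of such a compute gate. Everything in it is an illustrative assumption rather than the paper’s algorithm: the `estimate_benefit` critic, the units of `plan_cost`, and the `plan_then_act` / `act_greedy` agent methods are placeholders.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ComputeGate:
    """Plan only when the estimated benefit of planning exceeds its cost."""
    estimate_benefit: Callable[[Any], float]  # e.g. a learned critic over the current state
    plan_cost: float                          # planning overhead, in the same units as the benefit

    def should_plan(self, state: Any) -> bool:
        return self.estimate_benefit(state) > self.plan_cost


def step(agent: Any, state: Any, gate: ComputeGate) -> Any:
    """Greedy next action by default; deliberate (ReAct-style) only when it pays off."""
    if gate.should_plan(state):
        return agent.plan_then_act(state)  # expensive: multi-step reasoning before acting
    return agent.act_greedy(state)         # cheap: single forward pass
```

The point of the sketch is the decision rule, not the critic: the “sometimes” in the post is exactly this benefit‑versus‑cost comparison, learned rather than hard‑coded.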

September 8, 2025 · 5 min · Zelina

Spin Doctors: Why RL Fine‑Tuning Mostly Rotates, Not Reinvents

The short of it: reinforcement‑learning fine‑tuning (RL‑FT) often looks like magic. You SFT a model until it aces your dataset, panic when it forgets math or coding edge cases, then run PPO and, voilà, generalization returns. A new paper argues the mechanism isn’t mystical at all: RL‑FT mostly rotates a model’s learned directions back toward broadly useful features, rather than unlocking novel capabilities. In practical terms, cheap surgical resets (shallow layers or top‑rank components) can recover much of that OOD skill without running an expensive RL pipeline. ...
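
As a rough illustration of what a surgical reset of “top‑rank components” could look like, the sketch below zeroes the k largest singular directions of a layer’s SFT weight delta, pulling that layer back toward its pre‑SFT base weights. The choice of k, which layers to touch (e.g. shallow ones only), and the use of a plain SVD are assumptions for illustration, not the paper’s exact recipe.

```python
import torch


def reset_top_rank(base_w: torch.Tensor, sft_w: torch.Tensor, k: int) -> torch.Tensor:
    """Remove the k dominant singular directions of the SFT update (sft_w - base_w)."""
    delta = sft_w - base_w
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    s[:k] = 0.0                              # drop the top-rank components of the update
    return base_w + u @ torch.diag(s) @ vh   # base weights plus the remaining low-rank update


# Hypothetical usage: apply per linear layer of the SFT'd model, then re-check OOD tasks.
# restored_w = reset_top_rank(base_layer.weight, sft_layer.weight, k=8)
```

If the rotation story is right, a reset this cheap should claw back a large share of the out‑of‑distribution skill that SFT eroded, which is the paper’s practical claim.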

August 25, 2025 · 5 min · Zelina

Mirror, Mirror in the Model: How MLLMs Learn from Their Own Mistakes

When multimodal large language models (MLLMs) like Gemini or Janus are asked to generate an image and then assess whether that image matches a prompt, you’d expect agreement. But a new study shows this harmony is often missing: the model’s own understanding branch disagrees with what its generation branch creates. This phenomenon—called self-contradiction—isn’t just an embarrassing quirk. As it turns out, it may be the most valuable feedback signal MLLMs have. ...
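
A toy sketch of how that feedback signal could be harvested; `generate_image` and `judge_match` are placeholder callables standing in for the generation and understanding branches, and the threshold is an arbitrary assumption rather than anything from the study.

```python
from typing import Any, Callable, List, Tuple


def collect_self_contradictions(
    prompts: List[str],
    generate_image: Callable[[str], Any],      # generation branch (placeholder)
    judge_match: Callable[[str, Any], float],  # understanding branch, score in [0, 1] (placeholder)
    threshold: float = 0.5,
) -> List[Tuple[str, Any, float]]:
    """Collect cases where the model's understanding branch rejects its own generation."""
    disagreements = []
    for prompt in prompts:
        image = generate_image(prompt)
        score = judge_match(prompt, image)
        if score < threshold:  # the model contradicts itself on this prompt
            disagreements.append((prompt, image, score))
    return disagreements       # candidate self-supervised feedback / preference data
```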

July 23, 2025 · 4 min · Zelina