
Many Minds Make Light Work: Boosting LLM Physics Reasoning via Agentic Verification

If you think AI models are getting too good at math, you’re not wrong. Benchmarks like GSM8K and MATH have been largely conquered. But when it comes to physics, where reasoning isn’t just about arithmetic but about assumptions, abstractions, and real-world alignment, the picture is murkier. A new paper, PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems, takes a bold stride toward closing this gap. It introduces PHYSICSEVAL, a massive benchmark of 19,609 physics problems, and rigorously tests how frontier LLMs fare across topics from thermodynamics to quantum mechanics. Yet the real breakthrough isn’t the dataset alone: it’s the method, multi-agent inference-time critique. ...

August 4, 2025 · 3 min · Zelina

Memory Games: The Data Contamination Crisis in Reinforcement Learning

Reinforcement learning (RL) has recently emerged as the favored path to boosting large language models’ reasoning abilities. The latest headline-grabbing claim? That even random or incorrect reward signals can help models like Qwen2.5 become better reasoners. But a new paper, “Reasoning or Memorization?”, cuts through the hype with scalpel-like precision. It reveals that what looked like emergent reasoning in Qwen2.5 may in fact be a textbook case of data contamination. If true, the implications are serious: much of what we thought we knew about RL-driven reasoning gains could be little more than sophisticated memory retrieval. ...

July 15, 2025 · 3 min · Zelina