When Rewards Learn Back: Evolution, but With Gradients
Rewards are where many agent projects go to become expensive folklore. A team wants an AI agent to complete long workflows: search, reason, call tools, check constraints, recover from mistakes, and produce a useful answer. The model can talk. The tools work. The benchmark demo is acceptable. Then reinforcement learning enters the room, and someone has to decide what “good” means at every step. ...