Training Efficiency

Anchors Away: Rethinking How AI Agents Learn to Use Tools

A tool-using AI agent usually fails in a very ordinary way. It does not announce a philosophical crisis. It calls the wrong tool, calls the right tool too many times, writes malformed code, searches before thinking, or confidently takes a useless action because the training process rewarded motion rather than judgment. This is the unglamorous part of agent deployment. The demo shows the agent booking, searching, calculating, and reporting. The training log shows wasted exploration, unstable optimization, and a strange habit of confusing “using tools” with “thinking better.” Apparently, giving a model a calculator does not automatically make it an accountant. Shocking. ...

Freeze Now, Learn Faster: When Parameter Freezing Meets Pipeline Reality

Freeze. That sounds like the least exciting verb in machine learning. We prefer more heroic verbs: scale, align, reason, distill, orchestrate, agentify. Freeze sounds like something a GPU does right before the invoice becomes spiritually educational. But in large-model training, freezing can be a serious efficiency tool. The idea is simple: if some parameters do not need to be updated at every step, skip their backward computation and save time. The trap is also simple: saving computation is not the same as saving wall-clock time. In pipeline-parallel training, a GPU can compute less and still finish the batch no earlier, because another dependency is blocking the schedule. Congratulations, the model learned less and the training job did not get meaningfully faster. A tiny miracle of systems inefficiency. ...
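To make the compute-versus-wall-clock gap concrete, here is a toy sketch in Python. It is not a real scheduler: it assumes a coarse GPipe-style estimate in which the slowest stage sets the pace for every microbatch slot, and the stage timings, the function names `pipeline_step_time` and `total_compute`, and the microbatch count are all hypothetical values chosen for illustration.

```python
def pipeline_step_time(fwd, bwd, num_microbatches):
    """Coarse wall-clock estimate for one pipeline-parallel training step.

    Assumption: every microbatch slot lasts as long as the slowest stage's
    forward + backward time, and the pipeline needs (M + S - 1) slots.
    """
    num_stages = len(fwd)
    slot = max(f + b for f, b in zip(fwd, bwd))
    return (num_microbatches + num_stages - 1) * slot


def total_compute(fwd, bwd, num_microbatches):
    """Total busy time summed over all stages (what freezing actually reduces)."""
    return num_microbatches * sum(f + b for f, b in zip(fwd, bwd))


# Hypothetical per-microbatch timings for a 4-stage pipeline (arbitrary units).
fwd = [1.0, 1.0, 1.0, 1.0]
bwd = [2.0, 2.0, 3.0, 2.0]  # stage 2 is the bottleneck
microbatches = 8

baseline_time = pipeline_step_time(fwd, bwd, microbatches)
baseline_compute = total_compute(fwd, bwd, microbatches)

# Freeze every parameter in stage 0. Nothing upstream needs gradients,
# so in this toy model its backward cost drops to zero.
bwd_frozen_first = [0.0, 2.0, 3.0, 2.0]
frozen_first_time = pipeline_step_time(fwd, bwd_frozen_first, microbatches)
frozen_first_compute = total_compute(fwd, bwd_frozen_first, microbatches)

# Freeze the bottleneck stage's weights instead. Assumption: it still has to
# pass input gradients upstream, so its backward cost is halved, not removed.
bwd_frozen_bottleneck = [2.0, 2.0, 1.5, 2.0]
frozen_bottleneck_time = pipeline_step_time(fwd, bwd_frozen_bottleneck, microbatches)

print(f"baseline:          time={baseline_time}, compute={baseline_compute}")
print(f"freeze stage 0:    time={frozen_first_time}, compute={frozen_first_compute}")
print(f"freeze bottleneck: time={frozen_bottleneck_time}")
```

Under these assumed numbers, freezing stage 0 lowers total compute while the estimated step time stays the same, because stage 2 still dictates the slot length. Only the freeze that touches the bottleneck stage shortens the step.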

Greedy Enough to Win: When Loss Starts Driving the Learning Rate

Training runs rarely fail with cinematic drama. They do not burst into flames. They simply become expensive, slow, and faintly embarrassing. A fine-tuning job starts with promise, the loss descends, then progress flattens. Another run behaves well for 200 steps, then becomes jumpy after a data shard changes. A third run is rescued by lowering the learning rate, except nobody knows whether the rescue came too early, too late, or by accident. Eventually, the team does what teams do: try cosine decay again, because at least cosine looks mathematically respectable while doing whatever it was going to do anyway. ...