LLM Training

MatchTIR: Stop Paying Every Token the Same Salary

Payroll is a useful metaphor for agent training because it makes the absurdity obvious. Imagine a project team where one employee finds the right database, another enters the correct query, a third repeatedly calls the wrong API, and a fourth finally writes the report. If the report is accepted, everyone receives the same bonus. If it fails, everyone receives the same blame. Very democratic. Also very stupid. ...

When Models Start to Forget: The Hidden Cost of Training LLMs Too Well

Duplicates are supposed to be boring. In data engineering, duplicate records are usually treated as a hygiene problem: remove them, clean the pipeline, reduce noise, move on. In language-model training, repetition is less innocent. Repeated text can help a model learn an underrepresented domain. It can also teach the model to reproduce specific sequences too well. Somewhere between “useful exposure” and “verbatim recall,” a model stops learning only the pattern and starts carrying around the document. ...

Browsing Without the Bloat: Teaching Agents to Think Before They Scroll

An analyst opens a promising webpage. It contains the answer somewhere between a navigation menu, several years of archived material, an interactive table, related articles, legal disclaimers, and enough decorative HTML to keep a language model occupied until lunch. A human scans, clicks, ignores, and moves on. A browser agent is more likely to ingest the entire page, append it to an already swollen context window, and then congratulate itself for having “conducted research.” ...

No Prompt Left Behind: How Shopee’s CompassMax Reinvents RL for Giant MoE Models

Rollouts are expensive little creatures. They consume GPU time, produce long reasoning traces, wait for reward computation, and then—if the reward signal is flat—contribute exactly nothing to learning. The GPU was busy. The training dashboard looked serious. The model learned no usable distinction. Very productive, in the same way a meeting with twelve people and no decision is productive. ...

Weight Watchers for LLMs: Dynamic Dieting Beats Static Selection

TL;DR for operators: Training data is not a warehouse inventory problem. It is closer to nutrition. What helps a model early in pretraining may not be what helps it later, and a sample’s value can depend on the other samples sitting in the same batch. Obvious, perhaps. Operationalised? Less often. The paper behind this article, LLM Data Selection and Utilization via Dynamic Bi-level Optimization, proposes a Data Weighting Model, or DWM, that does not merely decide which data enters training. It assigns weights to samples within each batch, freezes those weights while the language model trains for a stage, then updates the weighting model using validation performance through a bi-level optimisation loop.1 ...
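For readers who think better in code, the stage-wise loop in that teaser can be sketched as a toy. This is not the paper's implementation: the "language model" here is a single scalar `theta`, the weighting model is a single scalar `phi` applied to a made-up per-sample quality feature, and the outer update uses central finite differences as a stand-in for whatever gradient the paper actually computes. Every name (`batch_weights`, `train_stage`, `update_weigher`) and every hyperparameter is hypothetical.

```python
import math

def batch_weights(phi, feats):
    """Toy weighting model: softmax over phi * feature, so weights sum to 1 per batch."""
    scores = [phi * f for f in feats]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_stage(theta, phi, batches, lr=0.1):
    """Inner loop: fit theta on weighted squared error while phi stays frozen."""
    for xs, ys, feats in batches:
        w = batch_weights(phi, feats)
        # Gradient of sum_i w_i * (theta * x_i - y_i)^2 with respect to theta.
        grad = sum(wi * 2.0 * (theta * x - y) * x for wi, x, y in zip(w, xs, ys))
        theta -= lr * grad
    return theta

def val_loss(theta, val):
    """Mean squared error on held-out (x, y) pairs."""
    return sum((theta * x - y) ** 2 for x, y in val) / len(val)

def update_weigher(theta, phi, batches, val, eps=1e-2, outer_lr=0.1):
    """Outer step: move phi downhill on validation loss measured after a trial stage.
    Central finite differences are an assumption made here for brevity."""
    up = val_loss(train_stage(theta, phi + eps, batches), val)
    down = val_loss(train_stage(theta, phi - eps, batches), val)
    return phi - outer_lr * (up - down) / (2.0 * eps)

# Hypothetical data: each batch holds one clean sample (y = 2x, feature +1)
# and one mislabeled sample (y = -2x, feature -1). Validation data is clean.
batches = [([1.0, 1.0], [2.0, -2.0], [1.0, -1.0])] * 5
val = [(1.0, 2.0)]

theta, phi = 0.0, 0.0
for stage in range(3):
    theta = train_stage(theta, phi, batches)        # weights frozen for the stage
    phi = update_weigher(theta, phi, batches, val)  # then the weigher is updated
print(f"theta={theta:.3f} phi={phi:.3f} val_loss={val_loss(theta, val):.3f}")
```

With equal weights the clean and mislabeled samples cancel out and `theta` never moves. Once the outer step pushes `phi` upward, the clean sample dominates each batch and validation loss starts to fall, which is the whole point of letting weights depend on validation feedback rather than on a fixed filter.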