Teach Me Once: How One‑Shot LLM Guidance Reshapes Hierarchical Planning

Teach Me Once, Then Please Stop Calling the API

A familiar enterprise automation story starts with a competent but expensive expert in the loop. At first, the expert is useful. They interpret messy instructions, break tasks into sensible stages, and recover when something goes wrong. Then the workflow scales. Suddenly the expert is being called for every transaction, every exception, every tiny decision that could probably have been handled by a trained local process. What began as intelligence becomes latency, cost, and operational dependency. Very elegant. Very billable. Not always very deployable. ...

December 11, 2025 · 15 min · Zelina

Clipped, Grouped, and Decoupled: Why RL Fine-Tuning Still Behaves Like a Negotiation With Chaos

Training a reasoning model sounds wonderfully modern until the model discovers that “being correct” and “looking correct enough to satisfy the reward” are not the same career path. That is the quiet problem behind reinforcement learning fine-tuning for large language models. The research conversation often treats methods like PPO, GRPO, and DAPO as a sequence of upgrades: first the classic algorithm, then the critic-free group method, then the decoupled-and-dynamically-sampled variant with a nicer acronym. Very tidy. Unfortunately, models do not read product positioning decks. ...

December 9, 2025 · 17 min · Zelina

No Prompt Left Behind: How Shopee’s CompassMax Reinvents RL for Giant MoE Models

Rollouts are expensive little creatures. They consume GPU time, produce long reasoning traces, wait for reward computation, and then—if the reward signal is flat—contribute exactly nothing to learning. The GPU was busy. The training dashboard looked serious. The model learned no usable distinction. Very productive, in the same way a meeting with twelve people and no decision is productive. ...

December 9, 2025 · 18 min · Zelina

Prompt, Probe, Persist: How Multi‑Turn RL Is Rewriting the Jailbreak Playbook

A chatbot rarely fails all at once. In production, failure is usually more boring than cinema. A user asks something borderline. The model refuses. The user rephrases. The model gives a harmless explanation. The user narrows the topic. The model follows the conversation. Then, several turns later, the assistant provides content it should not have provided. No thunder. No villain monologue. Just an interaction history doing what interaction histories do: accumulating context. ...

December 9, 2025 · 14 min · Zelina

Worlds Within Reach: How SIMA 2 Turns Virtual Environments into Training Grounds for Generalist Agents

Games are not toys to an AI lab. They are controlled worlds with messy consequences. A game gives an agent what enterprise software and robotics both struggle to provide at scale: visual ambiguity, delayed goals, menus, navigation, tool use, failure states, and a reset button that does not involve a broken warehouse robot or a furious operations manager. That is why Google DeepMind’s SIMA 2 paper is more interesting than “AI can play games again.” We have had that headline several times. It is getting a little tired, and it should probably hydrate. ...

December 6, 2025 · 16 min · Zelina

Think Fast, Think Slow: How Omni-AutoThink Rewrites Multimodal Reasoning

A customer sends a voice note, a screenshot, and a short complaint: “Why did your app charge me twice?” A weak AI assistant answers too fast and misses the evidence. A reasoning-heavy assistant thinks through everything, slowly, expensively, and occasionally performs a small philosophical opera over a billing issue. Neither is attractive. One is careless; the other is costly. The practical problem is not whether the model can reason. It is whether the model knows when reasoning is worth the bill. ...

December 4, 2025 · 15 min · Zelina

From Building Blocks to Breakthroughs: Why RL Finally Teaches Models to Think

Training an AI model is often sold like a kitchen renovation: add more data, add reinforcement learning, install the shiny reasoning countertop, and suddenly the whole thing looks expensive enough to be intelligent. This paper is useful because it ruins that brochure. The authors of “Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies” ask a deceptively simple question: does reinforcement learning create new reasoning ability, or does it only increase the probability of behaviors the model could already produce?1 Their answer is not the clean slogan either camp wants. RL can synthesize new compositional reasoning, but only when the model has already learned the right underlying atomic skills. Without that foundation, RL mostly polishes whatever behavior already exists. Sometimes that is reasoning. Sometimes it is just a better-trained shortcut wearing a lab coat. ...

December 2, 2025 · 18 min · Zelina

Rules of Attraction: How LLMs Learn to Judge Better Than We Do

Rubrics are supposed to make judgment boring. That is their charm. A good rubric tells a teacher why one essay deserves a 5 instead of a 3, tells a compliance reviewer why one response is acceptable and another is risky, and tells an internal QA team why a generated summary is useful rather than merely confident. In business, boring judgment is valuable. It scales. It can be audited. It survives employee turnover. It does not wake up one morning and decide that “clarity” now means “vibes with a semicolon.” ...

December 2, 2025 · 15 min · Zelina

Mind Over Model: Why Metacognitive Agents May Be the Next Frontier in AI Adaptation

A new employee rarely becomes useful by memorizing the handbook once. They watch the workflow, make mistakes, notice patterns, update their private playbook, and gradually stop asking the same obvious questions. That process is not magic. It is a layered form of learning: one part does the task, another part watches how the task is being done, and a third part turns experience into reusable rules. ...

December 1, 2025 · 17 min · Zelina

Stock, Shock, and Two Smoking Agents: Why Inventory Needs an Autopilot

A shelf goes empty. A buyer blames the forecast. The forecast blames the promotion calendar. The warehouse blames the supplier. The supplier blames the port, the weather, or, if creativity is running low, “unexpected demand.” This little theatre is familiar because inventory failure is rarely one failure. It is a chain reaction. A SKU is not replenished too late simply because someone forgot to click “order.” It is replenished too late because demand sensing, stock monitoring, supplier reliability, lead-time uncertainty, product perishability, warehouse capacity, and purchasing authority are usually handled by separate systems pretending they are coordinated. Very modern. Very expensive. ...

December 1, 2025 · 16 min · Zelina