
Doctor GPT, But Make It Explainable

Triage begins with messy language. A patient does not usually arrive as a clean feature vector. They arrive with “I feel tired,” “my stomach is strange,” “I have fever but not always,” or the classic: “I searched online and now I am either fine or dying.” Traditional diagnostic models are not built for this level of human poetry. They prefer structured fields, stable vocabularies, and the fantasy that symptoms behave like dropdown menus. ...

December 22, 2025 · 15 min · Zelina

Prompt-to-Parts: When Language Learns to Build

The compiler is the interesting part. Blocks are easy to understand. That is why this paper is more interesting than it first looks. On the surface, Prompt-to-Parts: Generative AI for Physical Assembly and Scalable Instructions is a paper about using large language models to generate LEGO-style assemblies from natural language prompts.[1] It shows a medieval castle, an International Space Station model, a modular multitool kit, and an image-to-parts helicopter conversion. Naturally, the tempting summary is: “LLMs can now design LEGO models.” ...

December 20, 2025 · 16 min · Zelina

ID Crisis, Resolved: When Semantic IDs Stop Fighting Hash IDs

Catalogs have a boring problem. Most items are nearly invisible. A platform may have millions of products, posts, videos, restaurants, songs, or ads, but user interaction is never evenly distributed. A small number of head items collect enough clicks, saves, purchases, and dwell time to become statistically legible. The rest live in the long tail, where the system is expected to recommend them intelligently despite barely having seen them. Very democratic. Very inconvenient. ...

December 14, 2025 · 16 min · Zelina

When AI Becomes the Reviewer: Pairwise Judgment at Scale

A committee has one expensive problem before it has any philosophical problem: too many proposals, too little time, and no clean way to know whether Proposal 17 was actually better than Proposal 42. So the usual system does what institutions often do when the task is too large to compare directly. It fragments the work. A few reviewers score a few proposals. Their scores are averaged. A ranked list appears. Everyone pretends the number is more stable than the process that produced it. ...

December 12, 2025 · 16 min · Zelina

Rule of Thumb, Meet Rule of Code: How DeepRule Rewrites Retail Optimization

A store manager does not usually make assortment and pricing decisions inside a clean optimization textbook. More often, the decision lives in a less glamorous place: a sales spreadsheet, a distributor agreement, an approval memo, last month’s exception report, a half-remembered rule about which customer can handle which category, and one person in the room saying, “This SKU always works in that region.” Retail intelligence, in other words, often begins as a pile of semi-structured clues wearing a business-casual disguise. ...

December 4, 2025 · 17 min · Zelina

CLOZE Encounters: When LLMs Start Editing Medical Ontologies

Hospitals already have the raw material for better medical knowledge systems. It is sitting inside discharge summaries, nursing notes, radiology reports, ECG interpretations, and all the other clinical prose that makes electronic health records look deceptively “digital” while still behaving like a very expensive filing cabinet. The awkward part is that clinical notes are both valuable and dangerous. Valuable, because they contain granular observations that structured fields often miss. Dangerous, because they contain protected health information, idiosyncratic phrasing, and enough local context to make naïve automation look clever right up to the moment it quietly corrupts a downstream system. ...

November 23, 2025 · 16 min · Zelina

Evolving Minds: How LLMs Teach Themselves Through Adversarial Cooperation

Training data is the quiet tax on modern AI. Someone has to write the examples, verify the answers, clean the failures, and pretend the spreadsheet is a strategy. Reinforcement learning makes that tax even more visible: if a model is supposed to improve through feedback, then the organisation must either provide ground-truth answers, hire evaluators, or build verifiers that can tell success from nonsense. ...

November 1, 2025 · 14 min · Zelina

Paths, Not Parrots: When RL Makes LLMs Plan—and When It Doesn’t

A workflow agent usually looks clever right up to the moment one service is down, one permission changes, or one customer case arrives with the wrong sort of mess attached. Then the question becomes painfully simple: did the model learn a plan, or did it learn the usual route? That distinction is the centre of Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective, an ICLR 2026 paper by Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, and Wei Chen.[1] The paper is not another victory lap for reinforcement learning. It is more useful than that. It asks what, mechanically, changes when a language model is trained for planning with reinforcement learning rather than supervised fine-tuning. ...

October 3, 2025 · 16 min · Zelina

Pipes by Prompt, DAGs by Design: Why Hybrid Beats Hero Prompts

The demo is easy. The DAG is not. Pipeline automation has a wonderfully deceptive user story. A business analyst writes: “Take this customer file, clean the locations, geocode the addresses, add weather data, then save the enriched output.” An LLM replies with a Python file. The file looks plausible. There are imports. There is an Airflow DAG. There are operators. There are dependencies. A demo audience nods approvingly. ...

October 1, 2025 · 14 min · Zelina

Spin Doctors: Why RL Fine‑Tuning Mostly Rotates, Not Reinvents

TL;DR for operators: If your fine-tuned model gets better on the training task while quietly becoming worse outside it, the problem may not be that the model “lost intelligence”. It may have rotated its useful internal directions away from broadly generalizable behaviour. The paper behind this article studies SFT followed by PPO-style RL on two open LLMs using a controlled arithmetic benchmark, then inspects the weight matrices through singular-value decomposition.[1] The pattern is clean enough to be operationally interesting: OOD performance peaks early during SFT, falls as SFT continues, and can be substantially restored by RL when the SFT checkpoint is only moderately degraded. But if SFT pushes the model too far into a specialized regime, RL is no longer a reliable rescue crew. Apparently even reinforcement learning has limits. Who knew. ...

August 25, 2025 · 14 min · Zelina