Time to Prefer: Why Binary RLHF Feedback Leaves Reward Models Guessing

Thumbs-up feedback looks efficient. It is clean, cheap, easy to store, and friendly to dashboards. One output wins, another output loses, and the reward model learns what humans supposedly want. A tidy little morality market, with all the nuance of a vending machine. ...

June 5, 2026 · 17 min · Zelina

When the Muse Has a GPU: Teaching a Machine to Write Poetry

Poetry is a useful place to test the limits of AI, partly because the task is so easy to misunderstand. A bad poem can be fluent. A decent poem can be vague. A machine can produce both before breakfast, along with a motivational LinkedIn post and three flavors of executive summary. That is not the interesting part. ...

February 19, 2026 · 18 min · Zelina

Rationales Before Results: Teaching Multimodal LLMs to Actually Reason About Time Series

Dashboard work has a familiar little ritual. Someone opens a chart, zooms into the last few points, notices a dip, a rebound, or a suspiciously clean trend line, and then says something that sounds analytical: “Looks like it will continue.” Sometimes that is wisdom. Sometimes it is just a human staring confidently at a squiggle. ...

January 7, 2026 · 15 min · Zelina

Privacy by Proximity: How Nearest Neighbors Made In-Context Learning Differentially Private

Opening: Why this matters now. As large language models (LLMs) weave themselves into every enterprise workflow, a quieter issue looms: the privacy of the data used to prompt them. In-context learning (ICL), the art of teaching a model through examples in its prompt, is fast, flexible, and dangerously leaky. Each query could expose confidential examples from private datasets. Enter differential privacy (DP), the mathematical armor for sensitive data, except that until now DP methods for ICL have been clumsy and utility-poor. ...

November 8, 2025 · 4 min · Zelina

Razor Burn: Why LLMs Nick Themselves on Induction and Abduction

TL;DR: A new synthetic benchmark (INABHYD) tests inductive and abductive reasoning under Occam’s Razor. LLMs handle toy cases but falter as ontologies deepen or when multiple hypotheses are needed. Even when models “explain” observations, they often pick needlessly complex or trivial hypotheses, precisely the opposite of what scientific discovery and root-cause analysis require.

The Big Idea: Most reasoning work on LLMs obsesses over deduction (step-by-step proofs). But the real world demands induction (generalize rules) and abduction (best explanation). The paper introduces INABHYD, a programmable benchmark that builds fictional ontology trees (concepts, properties, subtype links) and hides some axioms. The model sees an incomplete world plus observations, and must propose hypotheses that both explain all observations and do so parsimoniously (Occam’s Razor). The authors score: ...

September 6, 2025 · 4 min · Zelina