
Talking to Yourself, but Make It Useful: Intrinsic Self‑Critique in LLM Planning

Opening — Why this matters now
For years, the received wisdom in AI planning was blunt: language models can’t really plan. Early benchmarks—especially Blocksworld—made that verdict look almost charitable. Models hallucinated invalid actions, violated preconditions, and confidently declared failure states as success. The field responded by bolting on external verifiers, symbolic planners, or human-in-the-loop corrections. ...

January 3, 2026 · 3 min · Zelina

Think First, Grasp Later: Why Robots Need Reasoning Benchmarks

Opening — Why this matters now
Robotics has reached an awkward adolescence. Vision–Language–Action (VLA) models can now describe the world eloquently, name objects with near-human fluency, and even explain why a task should be done a certain way—right before dropping the object, missing the grasp, or confidently picking up the wrong thing. This is not a data problem. It’s a diagnostic one. ...

January 3, 2026 · 5 min · Zelina

When Models Start to Forget: The Hidden Cost of Training LLMs Too Well

Opening — Why this matters now
Large language models are getting better at everything that looks like intelligence — fluency, reasoning, instruction following. But beneath that progress, a quieter phenomenon is taking shape: models are remembering too much. The paper examined in this article does not frame memorization as a moral panic or a privacy scandal. Instead, it treats memorization as a structural side-effect of modern LLM training pipelines — something that emerges naturally once scale, optimization pressure, and data reuse collide. ...

January 3, 2026 · 3 min · Zelina

When Three Examples Beat a Thousand GPUs

Opening — Why this matters now
Neural Architecture Search (NAS) has always had an image problem. It promises automation, but delivers GPU invoices large enough to frighten CFOs and PhD supervisors alike. As computer vision benchmarks diversify and budgets tighten, the question is no longer whether we can automate architecture design — but whether we can do so without burning weeks of compute on redundant experiments. ...

January 3, 2026 · 4 min · Zelina

Big AI and the Metacrisis: When Scaling Becomes a Liability

Opening — Why this matters now
The AI industry insists it is ushering in an Intelligent Age. The paper examined in this article argues something colder: we may instead be engineering a metacrisis accelerator. As climate instability intensifies, democratic trust erodes, and linguistic diversity collapses, Big AI—large language models, hyperscale data centers, and their political economy—is not a neutral observer. It is an active participant. And despite the industry’s fondness for ethical manifestos, it shows little appetite for restraint. ...

January 2, 2026 · 3 min · Zelina

Ethics Isn’t a Footnote: Teaching NLP Responsibility the Hard Way

Opening — Why this matters now
Ethics in AI is having a moment. Codes of conduct, bias statements, safety benchmarks, model cards—our industry has never been more concerned with responsibility. And yet, most AI education still treats ethics like an appendix: theoretically important, practically optional. This paper makes an uncomfortable point: you cannot teach ethical NLP by lecturing about it. Responsibility is not absorbed through slides. It has to be practiced. ...

January 2, 2026 · 4 min · Zelina

LeanCat-astrophe: Why Category Theory Is Where LLM Provers Go to Struggle

Opening — Why this matters now
Formal theorem proving has entered its confident phase. We now have models that can clear olympiad-style problems, undergraduate algebra, and even parts of the Putnam with respectable success rates. Reinforcement learning, tool feedback, and test-time scaling have done their job. And then LeanCat arrives — and the success rates collapse. ...

January 2, 2026 · 4 min · Zelina

MI-ZO: Teaching Vision-Language Models Where to Look

Opening — Why this matters now
Vision-Language Models (VLMs) are everywhere—judging images, narrating videos, and increasingly acting as reasoning engines layered atop perception. But there is a quiet embarrassment in the room: most state-of-the-art VLMs are trained almost entirely on 2D data, then expected to reason about 3D worlds as if depth, occlusion, and viewpoint were minor details. ...

January 2, 2026 · 4 min · Zelina

Planning Before Picking: When Slate Recommendation Learns to Think

Opening — Why this matters now
Recommendation systems have quietly crossed a threshold. The question is no longer what to recommend, but how many things, in what order, and with what balance. In feeds, short-video apps, and content platforms, users consume slates—lists experienced holistically. Yet most systems still behave as if each item lives alone, blissfully unaware of its neighbors. ...

January 2, 2026 · 3 min · Zelina

Question Banks Are Dead. Long Live Encyclo-K.

Opening — Why this matters now
Every time a new benchmark is released, the same ritual follows: models race to the top, leaderboards reshuffle, and a few months later—sometimes weeks—we quietly realize the benchmark has been memorized, gamed, or both. The uncomfortable truth is that static questions are no longer a reliable way to measure rapidly evolving language models. ...

January 2, 2026 · 3 min · Zelina