
Thinking Isn’t Free: Why Chain-of-Thought Hits a Hard Wall

Opening — Why this matters now. Inference-time reasoning has quietly become the dominant performance lever for frontier language models. When benchmarks get hard, we don’t retrain—we let models think longer. More tokens, more scratchpad, more compute. The industry narrative is simple: reasoning scales, so accuracy scales. This paper asks an uncomfortable question: how long must a model think, at minimum, as problems grow? And the answer, grounded in theory rather than vibes, is not encouraging. ...

February 5, 2026 · 3 min · Zelina

When Benchmarks Lie: Teaching Leaderboards to Care About Preferences

Opening — Why this matters now. Benchmarks were supposed to be neutral referees. Instead, they’ve become unreliable narrators. Over the past two years, the gap between benchmark leadership and real-world usefulness has widened into something awkwardly visible. Models that dominate leaderboards frequently underperform in deployment. Smaller, specialized models sometimes beat generalist giants where it actually counts. Yet our evaluation rituals have barely changed. ...

February 5, 2026 · 4 min · Zelina

Ask Once, Query Right: Why Enterprise AI Still Gets Databases Wrong

Opening — Why this matters now. Enterprises love to say they are “data‑driven.” In practice, they are database‑fragmented. A single natural‑language question — How many customers in California? — may be answerable by five internal databases, all structurally different, semantically overlapping, and owned by different teams. Routing that question to the right database is no longer a UX problem. It is an architectural one. ...

February 2, 2026 · 4 min · Zelina

When Benchmarks Forget What They Learned

Opening — Why this matters now. Large language models are getting better at everything — or at least that’s what the leaderboards suggest. Yet beneath the glossy scores lies a quiet distortion: many benchmarks are no longer measuring learning, but recall. The paper dissects this issue with surgical precision, showing how memorization creeps into evaluation pipelines and quietly inflates our confidence in model capability. ...

February 2, 2026 · 3 min · Zelina

Triage by Token: When Context Clues Quietly Override Clinical Judgment

Opening — Why this matters now. Large language models are quietly moving from clerical assistance to clinical suggestion. In emergency departments (EDs), where seconds matter and triage decisions shape outcomes, LLM-based decision support tools are increasingly tempting: fast, consistent, and seemingly neutral. Yet neutrality in language does not guarantee neutrality in judgment. This paper interrogates a subtle but consequential failure mode: latent bias introduced through proxy variables. Not overt racism. Not explicit socioeconomic labeling. Instead, ordinary contextual cues—how a patient arrives, where they live, how often they visit the ED—nudge model outputs in clinically unjustified ways. ...

January 24, 2026 · 4 min · Zelina

Skeletons in the Proof Closet: When Lean Provers Need Hints, Not More Compute

Opening — Why this matters now. Neural theorem proving has entered its industrial phase. With reinforcement learning pipelines, synthetic data factories, and search budgets that would make a chess engine blush, models like DeepSeek‑Prover‑V1.5 are widely assumed to have internalized everything there is to know about formal proof structure. This paper politely disagrees. Under tight inference budgets—no massive tree search, no thousand-sample hail‑Mary—the author shows that simple, almost embarrassingly old‑fashioned structural hints still deliver large gains. Not new models. Not more data. Just better scaffolding. ...

January 23, 2026 · 4 min · Zelina

When Models Remember Too Much: The Quiet Problem of Memorization Sinks

Opening — Why this matters now. Large language models are getting better at everything—writing, coding, reasoning, and politely apologizing when they hallucinate. Yet beneath these broad performance gains lies a quieter, more structural issue: memorization does not happen evenly. Some parts of the training data exert disproportionate influence, acting as gravitational wells that trap model capacity. These are what the paper terms memorization sinks. ...

January 23, 2026 · 3 min · Zelina

When LLMs Read the Room: Predictive Process Monitoring Without the Data Buffet

Opening — Why this matters now. Predictive Process Monitoring (PPM) has always promised operational foresight: knowing how long a case will take, whether a costly activity will happen, or when things are about to go wrong. The catch has been brutally consistent — you need a lot of data. Thousands of traces. Clean logs. Stable processes. ...

January 19, 2026 · 5 min · Zelina

Attention, But Make It Optional

Opening — When more layers stop meaning more intelligence. The scaling era taught us a simple mantra: stack more layers, get better models. Then deployment happened. Suddenly, latency, energy bills, and GPU scarcity started asking uncomfortable questions—like whether every layer in a 40-layer Transformer is actually doing any work. This paper answers that question with unsettling clarity: many attention layers aren’t lazy—they’re deliberately silent. And once you notice that, pruning them becomes less of an optimization trick and more of a design correction. ...

December 27, 2025 · 4 min · Zelina

Guardrails Over Gigabytes: Making LLM Coding Agents Behave

Opening — Why this matters now. AI coding agents are everywhere—and still, maddeningly unreliable. They pass unit tests they shouldn’t. They hallucinate imports. They invent APIs with confidence that would be admirable if it weren’t so destructive. The industry response has been predictable: bigger models, longer prompts, more retries. This paper proposes something less glamorous and far more effective: stop asking stochastic models to behave like deterministic software engineers. ...

December 27, 2025 · 4 min · Zelina