
Grading the Doctor: How Health-SCORE Scales Judgment in Medical AI

Opening — Why this matters now
Healthcare LLMs have a credibility problem. Not because they cannot answer medical questions—many now ace exam-style benchmarks—but because real medicine is not a multiple-choice test. It is open-ended, contextual, uncertain, and unforgiving. In that setting, how a model reasons, hedges, and escalates matters as much as what it says. ...

February 2, 2026 · 4 min · Zelina

Sequential Beats Parallel: When Deep Research Agents Learn to Reflect

Opening — Why this matters now
The last year has been crowded with so-called deep research agents. Everyone parallelizes. Everyone fans out queries. Everyone promises doctoral-level synthesis at web speed. And yet, the leaderboard keeps telling an inconvenient story: throwing more parallel agents at a problem does not reliably buy depth. The paper “Deep Researcher with Sequential Plan Reflection and Candidates Crossover” enters this debate with a pointed thesis: research is not a map-reduce problem. If you want insight, you need memory, reflection, and the ability to change your mind mid-flight. ...
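
The thesis is architectural: run research steps in sequence and let reflection rewrite the remaining plan, rather than fanning out a fixed set of parallel queries. Below is a minimal sketch of that control loop, assuming illustrative placeholder interfaces (planner, executor, reflector); it omits the candidates-crossover step and is not the paper's implementation.

```python
# Minimal sketch of a sequential research loop with plan reflection.
# The planner/executor/reflector objects and their method names are
# illustrative placeholders, not the paper's actual interfaces.

def deep_research(question, planner, executor, reflector, max_steps=10):
    plan = planner.initial_plan(question)      # ordered list of research steps
    memory = []                                # findings accumulated so far
    for _ in range(max_steps):
        if not plan.remaining_steps():
            break
        step = plan.next_step()
        finding = executor.run(step, memory)   # search, read, synthesize
        memory.append(finding)
        # Reflection: revise the *remaining* plan in light of what was just
        # learned, instead of committing to the initial fan-out.
        plan = reflector.revise(plan, memory)
    return planner.synthesize(question, memory)
```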

January 31, 2026 · 4 min · Zelina

Picking Less to Know More: When RAG Stops Ranking and Starts Thinking

Opening — Why this matters now
Retrieval-Augmented Generation has a dirty secret: it keeps retrieving more context while quietly getting no smarter. As context windows balloon to 100K tokens and beyond, RAG systems dutifully shovel in passages—Top‑5, Top‑10, Top‑100—hoping recall will eventually rescue accuracy. It doesn’t. Accuracy plateaus. Costs rise. Attention diffuses. The model gets lost in its own evidence pile. ...

December 17, 2025 · 4 min · Zelina

Benchmarks on Quicksand: Why Static Scores Fail Living Models

Opening — Why this matters now
If you feel that every new model release breaks yesterday’s leaderboard, congratulations: you’ve discovered the central contradiction of modern AI evaluation. Benchmarks were designed for stability. Models are not. The paper dissects this mismatch with academic precision—and a slightly uncomfortable conclusion: static benchmarks are no longer fit for purpose. ...

December 15, 2025 · 3 min · Zelina

Breaking the Tempo: How TempoBench Reframes AI’s Struggle with Time and Causality

Opening — Why this matters now
The age of “smart” AI models has reached an uncomfortable truth: they can ace your math exam but fail your workflow. While frontier systems like GPT‑4o and Claude Sonnet solve increasingly complex symbolic puzzles, they stumble when asked to reason through time—to connect what happened, what’s happening, and what must happen next. In a world shifting toward autonomous agents and decision‑chain AI, this isn’t a minor bug—it’s a systemic limitation. ...

November 5, 2025 · 4 min · Zelina

The Missing Metric: Measuring Agentic Potential Before It’s Too Late

In the modern AI landscape, models are not just talkers—they are becoming doers. They code, browse, research, and act within complex environments. Yet, while we’ve become adept at measuring what models know, we still lack a clear way to measure what they can become. APTBench, proposed by Tencent Youtu Lab and Shanghai Jiao Tong University, fills that gap: it’s the first benchmark designed to quantify a model’s agentic potential during pre-training—before costly fine-tuning or instruction stages even begin. ...

November 2, 2025 · 4 min · Zelina

Fault Lines & Safety Nets: How RAFFLES Finds the First Domino in Agent Failures

TL;DR
Most LLM agent evaluations judge the final answer. RAFFLES flips the lens to where the first causal error actually happened—then iterates with a Judge–Evaluator loop to verify primacy, fault-ness, and non-correction. On the Who&When benchmark, RAFFLES materially outperforms one-shot judges and router-style baselines. For builders, this is a template for root-cause analytics on long-horizon agents, not just scorekeeping.

Why we need decisive-fault attribution (not just pass/fail)
Modern agent stacks—routers, tool-callers, planners, web surfers, coders—fail in cascades. A harmless-looking plan choice at t=3 can doom execution at t=27. Traditional “LLM-as-a-judge”: ...
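
To make the idea concrete, here is a minimal sketch of decisive-fault attribution over an agent trace: walk the trajectory in order and return the earliest step that is faulty, causally decisive for what follows, and never corrected downstream. The three predicate helpers and the single forward scan are illustrative assumptions; RAFFLES itself runs an iterative Judge–Evaluator loop rather than one pass.

```python
# Illustrative sketch of decisive-fault attribution over an agent trace.
# The three predicate functions stand in for LLM-based checks; their names
# and the single forward scan are assumptions, not RAFFLES's actual design.

def find_decisive_fault(trace, is_faulty, is_causal, is_corrected_later):
    """Return (step_index, step) of the first decisive fault, or None."""
    for t, step in enumerate(trace):
        if not is_faulty(step):
            continue                               # fault-ness: the step itself is wrong
        if not is_causal(step, trace[t + 1:]):
            continue                               # primacy: it actually dooms what follows
        if is_corrected_later(step, trace[t + 1:]):
            continue                               # non-correction: nothing downstream fixed it
        return t, step
    return None
```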

September 12, 2025 · 4 min · Zelina

Model Portfolio: When LLMs Sit the CFA

If your firm is debating whether to trust an LLM on investment memos, this study is a gift: 1,560 questions from official CFA mock exams across Levels I–III, run on three model archetypes—multimodal generalist (GPT‑4o), deep-reasoning specialist (GPT‑o1), and lightweight cost‑saver (o3‑mini)—both zero‑shot and with a domain‑reasoning RAG pipeline. Below is what matters for adoption, not just leaderboard bragging rights.

What the paper really shows
Reasoning beats modality for finance. The reasoning‑optimized model (GPT‑o1) dominates across levels; the generalist (GPT‑4o) is inconsistent, especially on math‑heavy Level II.
RAG helps where context is long and specialized. Gains are largest at Level III (portfolio cases) and in Fixed Income/Portfolio Management, modest at Level I.
Retrieval cannot fix arithmetic. Most errors are knowledge gaps, not reading problems. Readability barely moves accuracy; the bottleneck is surfacing the right curriculum facts and applying them.
Cost–accuracy has a sweet spot. o3‑mini + targeted RAG is strong enough for high‑volume workflows; o1 should be reserved for regulated, high‑stakes analysis.

Executive snapshot
CFA Level | GPT‑4o (ZS → RAG) | GPT‑o1 (ZS → RAG) | o3‑mini (ZS → RAG) | Takeaway
I | 78.6% → 79.4% | 94.8% → 94.8% | 87.6% → 88.3% | Foundations already in‑model; RAG adds little
II | 59.6% → 60.5% | 89.3% → 91.4% | 79.8% → 84.3% | Level II exposes math + integration gaps; RAG helps smaller models most
III | 64.1% → 68.6% | 79.1% → 87.7% | 70.9% → 76.4% | Case‑heavy; RAG is decisive, especially for o1
ZS = zero‑shot. Accuracies are from the paper’s aggregated results. ...
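
Reading the snapshot as deltas makes the "RAG is decisive at Level III" takeaway easy to verify. The toy calculation below simply re-derives the zero-shot → RAG uplift from the table's numbers; it is arithmetic on the reported aggregates, not a re-run of the study.

```python
# Re-derive zero-shot -> RAG uplift (percentage points) from the table above.
results = {
    # level: {model: (zero_shot_acc, rag_acc)}
    "I":   {"GPT-4o": (78.6, 79.4), "GPT-o1": (94.8, 94.8), "o3-mini": (87.6, 88.3)},
    "II":  {"GPT-4o": (59.6, 60.5), "GPT-o1": (89.3, 91.4), "o3-mini": (79.8, 84.3)},
    "III": {"GPT-4o": (64.1, 68.6), "GPT-o1": (79.1, 87.7), "o3-mini": (70.9, 76.4)},
}

for level, models in results.items():
    uplift = {m: round(rag - zs, 1) for m, (zs, rag) in models.items()}
    print(level, uplift)
# Level I uplifts stay under 1 point; Level III gives +4.5 (GPT-4o),
# +8.6 (GPT-o1), and +5.5 (o3-mini), matching the "RAG is decisive" row.
```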

September 11, 2025 · 4 min · Zelina

Precepts over Predictions: Can LLMs Play Socrates?

TL;DR
Most LLM ethics tests score the verdict. AMAeval scores the reasoning. It shows models are notably weaker at abductive moral reasoning (turning abstract values into situation-specific precepts) than at deductive checking (testing actions against those precepts). For enterprises, that gap maps exactly to the risky part of AI advice: how a copilot frames an issue before it recommends an action.

Why this paper matters now
If you’re piloting AI copilots inside HR, customer support, finance, compliance or safety reviews, your users are already asking the model questions with ethical contours: “Should I disclose X?”, “Is this fair to the customer?”, “What’s the responsible escalation?” ...
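
The abductive/deductive split is easier to see on a toy case. The sketch below separates the two stages the paper distinguishes: deriving a situation-specific precept from an abstract value (abduction) and checking a proposed action against that precept (deduction). The scenario, the precept wording, and the helper function are invented for illustration.

```python
# Toy illustration of the two reasoning stages AMAeval distinguishes.
# The scenario and precept text are invented; only the abduction/deduction
# split mirrors the paper's framing.

value = "honesty"
situation = "A support agent knows the refund will take 3 weeks, not the promised 5 days."

# Abductive step: turn the abstract value into a situation-specific precept.
# This is the step the benchmark finds models are weakest at.
precept = "Proactively tell the customer the realistic 3-week timeline."

# Deductive step: check a proposed action against that precept.
def complies(action: str, precept: str) -> bool:
    # Stand-in for an LLM judgment; here just a trivial keyword check.
    return "3-week" in action or "three weeks" in action.lower()

proposed_action = "Apologize and confirm the refund is being processed."
print(complies(proposed_action, precept))  # False: the action dodges the derived precept
```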

August 19, 2025 · 4 min · Zelina

FAITH in Numbers: Stress-Testing LLMs Against Financial Hallucinations

Financial AI promises speed and scale — but in finance, a single misplaced digit can be the difference between compliance and catastrophe. The FAITH (Framework for Assessing Intrinsic Tabular Hallucinations) benchmark tackles this risk head‑on, probing how well large language models can faithfully extract and compute numbers from the dense, interconnected tables in 10‑K filings.

From Idea to Dataset: Masking With a Purpose
FAITH reframes hallucination detection as a context‑aware masked span prediction task. It takes real S&P 500 annual reports, hides specific numeric spans, and asks the model to recover them — but only after ensuring three non‑negotiable conditions: ...
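
To see what a task instance looks like, here is a minimal sketch of a context-aware masked numeric span. The table row, mask token, and field names are invented for illustration, and the three conditions the excerpt cuts off before listing are not encoded here.

```python
# Minimal sketch of a masked-numeric-span instance in the style FAITH describes.
# The table row, mask token, and dict fields are illustrative inventions.

row = "Net revenue | FY2023: $1,234M | FY2022: $1,180M"
masked_row = row.replace("$1,234M", "[MASK]")

instance = {
    "context": masked_row,           # table text with one numeric span hidden
    "question": "Fill in the masked FY2023 net revenue.",
    "answer": "$1,234M",             # gold span the model must recover
}

# The model hallucinates on this instance if its prediction differs from the
# gold span, e.g. a plausible-looking but wrong figure.
print(instance["context"])  # -> "Net revenue | FY2023: [MASK] | FY2022: $1,180M"
```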

August 8, 2025 · 3 min · Zelina