Search Me If You Can: Why AI Agent Discovery Needs Receipts

Opening — Why this matters now The AI agent market is beginning to look like an overconfident airport duty-free shop: everything claims to be premium, every label promises capability, and somehow the thing you need is still hard to find. That matters because the next phase of business automation will not be built from one general chatbot sitting politely in a browser tab. It will involve agent ecosystems: finance agents, customer-support agents, coding agents, compliance agents, research agents, scheduling agents, procurement agents, and a thousand microscopic “I can do that” assistants wrapped in glossy product pages. ...

April 28, 2026 · 13 min · Zelina

Where to Go Deeper Beyond This Academy

A curated guide to textbooks, authors, websites, and papers for readers who want to study transformer internals, attention math, fine-tuning, GPU optimization, and benchmarking in more depth.

April 23, 2026 · 8 min · Michelle

The Art of Interrupting AI: When Knowing Isn’t Talking

Opening — Why this matters now The current generation of AI models can see, hear, and respond. In theory, they should also be able to participate. In practice, they often behave like that one person in a meeting who either interrupts too early—or never speaks at all. This gap is no longer academic. As omni-modal models move into real-time assistants, customer service agents, and even trading copilots, the question is shifting from “Can the model understand?” to something more uncomfortable: ...

March 18, 2026 · 4 min · Zelina

AIRS-Bench: When AI Starts Doing the Science, Not Just Talking About It

Opening — Why this matters now For years, AI progress has been narrated through a familiar ritual: introduce a new benchmark, top it with a new model, declare victory, repeat. But as large language models graduate from single-shot answers to multi-step agentic workflows, that ritual is starting to crack. If AI systems are now expected to design experiments, debug failures, iterate on ideas, and judge their own results, then accuracy on static datasets is no longer the right yardstick. ...

February 9, 2026 · 3 min · Zelina

Benchmarks Lie, Rooms Don’t: Why Embodied AI Fails the Moment It Enters Your House

Opening — Why this matters now Embodied AI is having its deployment moment. Robots are promised for homes, agents for physical spaces, and multimodal models are marketed as finally “understanding” the real world. Yet most of these claims rest on benchmarks designed far away from kitchens, hallways, mirrors, and cluttered tables. This paper makes an uncomfortable point: if you evaluate agents inside the environments they will actually operate in, much of that apparent intelligence collapses. ...

February 7, 2026 · 4 min · Zelina

First Proofs, No Training Wheels

Opening — Why this matters now AI models are now fluent in contest math, symbolic manipulation, and polished explanations. That’s the easy part. The harder question—the one that actually matters for science—is whether these systems can do research when the answer is not already in the training set. The paper First Proof arrives as a deliberately uncomfortable experiment: ten genuine research-level mathematics questions, all solved by humans, none previously public, and all temporarily withheld from the internet. ...

February 7, 2026 · 3 min · Zelina

AgenticPay: When LLMs Start Haggling for a Living

Opening — Why this matters now Agentic AI has moved beyond polite conversation. Increasingly, we expect language models to act: negotiate contracts, procure services, choose suppliers, and close deals on our behalf. This shift quietly transforms LLMs from passive tools into economic actors. Yet here’s the uncomfortable truth: most evaluations of LLM agents still resemble logic puzzles or toy auctions. They test reasoning, not commerce. Real markets are messy—private constraints, asymmetric incentives, multi-round bargaining, and strategic patience all matter. The paper behind AgenticPay steps directly into this gap. ...

February 6, 2026 · 4 min · Zelina

When Papers Learn to Draw: AutoFigure and the End of Ugly Science Diagrams

Opening — Why this matters now AI can already write papers, review papers, and in some cases get papers accepted. Yet one stubborn artifact has remained conspicuously human: the scientific figure. Diagrams, pipelines, conceptual schematics—these are still hand-crafted, visually inconsistent, and painfully slow to produce. For AI-driven research agents, this isn’t cosmetic. It’s a structural failure. ...

February 4, 2026 · 4 min · Zelina

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

Opening — Why this matters now Multimodal AI is having its cinematic moment. Video generation, image rollouts, and interleaved vision–language reasoning are being marketed as steps toward models that can think visually. The implicit promise is seductive: if models can generate images while reasoning, perhaps they can finally reason with them. This paper delivers a colder verdict. When tested under controlled conditions, today’s strongest multimodal models fail at something deceptively basic: maintaining and manipulating internal visual representations over time. In short, they can see—but they cannot mentally imagine in any robust, task‑reliable way. ...

February 3, 2026 · 4 min · Zelina

When Benchmarks Forget What They Learned

Opening — Why this matters now Large language models are getting better at everything — or at least that’s what the leaderboards suggest. Yet beneath the glossy scores lies a quiet distortion: many benchmarks are no longer measuring learning, but recall. The paper under review dissects this issue with surgical precision, showing how memorization creeps into evaluation pipelines and quietly inflates our confidence in model capability. ...

February 2, 2026 · 3 min · Zelina