Echoes Without Clicks: How EchoLeak Turned Copilot Into a Data Drip

Email is boring. That is its superpower. A message arrives. It looks like business sludge: compliance wording, project references, perhaps a polite request that nobody asked for. It contains no executable attachment, no obvious malware, no urgent invoice from a suspicious cousin. In a normal security review, it is background noise. EchoLeak makes that boring object more interesting. The paper examines CVE-2025-32711, a reported zero-click indirect prompt-injection exploit against Microsoft 365 Copilot, where a crafted external email could allegedly cause Copilot to leak internal information without the user clicking a malicious link.[1] The central lesson is not that Copilot was uniquely careless, nor that prompt injection has suddenly become cyberpunk magic. The lesson is more uncomfortable: enterprise copilots are becoming data-flow infrastructure, and data-flow infrastructure fails when content, instructions, rendering, and network access are allowed to melt into one warm productivity soup. ...

September 20, 2025 · 14 min · Zelina

Hook, Line, and Import: How RAG Lets Attackers Snare Your Code

Imports look harmless until they become procurement. A developer asks an AI assistant for a plotting snippet. The assistant returns clean-looking Python, a few lines of explanation, and an import statement for matplotlib_safe. The name sounds prudent. Safer is good. Safer is what the security team keeps asking for, usually in meetings that could have been static analysis. ...

September 13, 2025 · 17 min · Zelina

Stop, Verify, and Listen: HALT‑RAG Brings a ‘Reject Option’ to RAG

RAG systems usually fail in a very business-like way: not with drama, but with confident paperwork. The retriever finds something. The generator writes something. The user sees an answer that looks plausible, well formatted, and sufficiently certain to be dangerous. Then someone asks the dull but expensive question: did the answer actually follow from the source? ...

September 13, 2025 · 11 min · Zelina

Branching Out of the Middle: How a ‘Tree of Agents’ Fixes Long-Context Blind Spots

Contracts are not polite. They hide the important clause on page 83, define the crucial exception on page 17, and bury the fatal cross-reference in an appendix nobody wanted to read. Annual reports behave similarly. So do medical SOPs, litigation files, policy manuals, technical logs, and most documents produced by institutions that have discovered both Microsoft Word and committees. ...

September 12, 2025 · 16 min · Zelina

Graph and Circumstance: Maestro Conducts Reliable AI Agents

A broken AI agent often looks deceptively close to working. It answers most questions. It calls the right tool sometimes. It follows the instruction until the conversation gets long, the retrieval query gets vague, or the arithmetic becomes just difficult enough for the model to start doing spreadsheet theatre. The usual repair is prompt editing. Add a stern sentence. Add a role. Add an example. Add “think step by step,” because apparently the machine needed a motivational poster. ...

September 11, 2025 · 15 min · Zelina

Model Portfolio: When LLMs Sit the CFA

Exams are useful because they are rude. They do not care that a model sounds polished, cites the right buzzwords, or can produce a gorgeous paragraph about duration risk. They ask for A, B, or C. Then they mark the answer wrong. That is why a new CFA-based benchmark is more useful than another misty-eyed essay about AI “transforming finance.” The paper evaluates GPT-4o, GPT-o1, and o3-mini on 1,560 official CFA mock multiple-choice questions across Levels I, II, and III, both zero-shot and with a domain-reasoning RAG pipeline built from official CFA curriculum materials.[1] The result is not a single leaderboard. It is closer to a routing manual. ...

September 11, 2025 · 13 min · Zelina

Fusion Cuisine for RAG: Z‑Scores, Rankers, and the Two‑Source Diet

A RAG system usually fails in one of two annoyingly familiar ways. It retrieves documents that are factually relevant but gives the model no clue about the task’s decision boundary. Or it retrieves labelled examples that show the decision pattern but are too parochial to help when the topic drifts. One source knows the world. The other knows the exam rubric. Naturally, many systems pick one and then pretend the compromise was strategy. ...

September 6, 2025 · 15 min · Zelina

Judgment Day for RAG: How L‑MARS Cuts Legal Hallucinations by Design

TL;DR for operators: Legal AI does not fail only because models “hallucinate”. That word has become the industry’s favourite fog machine. The more operational diagnosis is sharper: models fail when they answer current legal questions from stale internal memory and then dress the error in confident reasoning. The L-MARS paper is useful because it separates two tasks that vendors often blend together for convenience: retrieving current legal facts and reasoning over stable legal principles.[1] On LegalSearchQA, a new 50-question benchmark built around recent U.S. legal facts verified in March 2026, L-MARS reaches 96.0% accuracy. Zero-shot GPT-4o-mini reaches 58.0%. Chain-of-thought falls to 30.0%, because step-by-step reasoning from outdated premises merely creates a more articulate mistake. ...

September 4, 2025 · 14 min · Zelina

Numbers Need Narration: Making LLMs Do Reasoning‑Intensive Regression

TL;DR for operators: Many AI workflows do not need a yes-or-no judgment. They need a number: how well did this answer follow the instruction, how far did this reasoning trace remain valid, how much better is answer A than answer B, how strong is this essay, how risky is this case, how close is this support call to escalation? ...

September 1, 2025 · 19 min · Zelina

Assert Less, Observe More: AICL and the New QA Stack for LLM Apps

TL;DR for operators: LLM application testing should stop pretending that the whole product behaves like ordinary software. The database connector, retry logic, API wrapper, and schema validator still deserve normal unit, integration, and load tests. Fine. Keep those. They are not the problem. The problem starts when the product becomes a stateful language system: prompts are assembled dynamically, retrieval changes the context, tool calls modify the execution path, memory leaks across turns, and a model update can improve one workflow while quietly breaking another. At that point, exact-match assertions become less like QA and more like theatre with a YAML file. ...

August 31, 2025 · 17 min · Zelina