Tool Wars, Protocol Peace: What MCP‑AgentBench Really Measures

A procurement team does not buy an AI agent because it can recite the word “interoperability” with theatrical confidence. It buys the agent because the thing can use tools, collect data, combine results, and stop before it bankrupts the token budget. That is the useful way to read MCP-AgentBench, a new benchmark for evaluating language agents inside the Model Context Protocol ecosystem. The paper is not just another leaderboard with a fresh coat of protocol paint. Its more interesting result is harsher: MCP gives agents a common integration layer, but it does not make them competent tool users. Compatibility is plumbing. Competence is orchestration. ...

September 19, 2025 · 14 min · Zelina

Agency Check, Please: What a New Benchmark Says About LLMs That Actually Empower Users

A customer asks your AI assistant to choose between two mortgage options. An employee asks whether to quit. A student says, very politely, “Please guide me, but don’t give me the answer.” A lonely user suggests the chatbot feels like a best friend. The easy product answer is: be helpful. The harder answer is: helpful to what? ...

September 14, 2025 · 16 min · Zelina

From Blobs to Blocks: Componentizing LLM Output for Real Work

Every office has the same tiny tragedy. Someone asks an AI system for a useful draft. The model produces five decent paragraphs and one mildly deranged sentence that sounds as if it escaped from a conference keynote. The user wants to fix only that sentence. Instead, the interface offers the usual bargain: copy everything into another editor and lose the live connection to the conversation, or ask the model to revise the answer and watch it “helpfully” disturb the parts that were already fine. ...

September 14, 2025 · 16 min · Zelina

Branching Out of the Middle: How a ‘Tree of Agents’ Fixes Long-Context Blind Spots

Contracts are not polite. They hide the important clause on page 83, define the crucial exception on page 17, and bury the fatal cross-reference in an appendix nobody wanted to read. Annual reports behave similarly. So do medical SOPs, litigation files, policy manuals, technical logs, and most documents produced by institutions that have discovered both Microsoft Word and committees. ...

September 12, 2025 · 16 min · Zelina

HyFedRAG: Caching Privacy into Federated RAG

Hospital search is rarely a search problem in the clean, consumer-internet sense. The useful information is not sitting in one tidy index, wearing a name badge, waiting to be embedded. It is scattered across clinical notes, relational databases, knowledge graphs, departmental systems, hospital networks, and legal boundaries. Naturally, this is where people decide to add a large language model and call it “modernisation.” Brave. ...

September 12, 2025 · 15 min · Zelina

Mind the Gap: How OSC Turns Agent Chatter into Compound Intelligence

Teams fail quietly before they fail visibly. The procurement analyst missed a constraint. The legal reviewer assumed a definition. The finance model used a different baseline. Everyone produced competent work. The final report still wobbled because the collaboration layer never asked the obvious question: who knows what, who misunderstands what, and which disagreement is worth resolving before the answer is assembled? ...

September 11, 2025 · 16 min · Zelina

Parallel Minds, Shorter Time: ParaThinker’s Native Thought Width

A familiar enterprise AI failure looks less like stupidity and more like stubbornness. Ask a model to solve a hard problem, and it may begin confidently in the wrong direction. Then it keeps going. It adds details. It self-reflects. It spends tokens. It may even apologise to itself internally, which is apparently what we call progress now. But the core path does not change. The model is not merely short on compute. It is trapped inside its own first guess. ...

September 11, 2025 · 15 min · Zelina

Fusion Cuisine for RAG: Z‑Scores, Rankers, and the Two‑Source Diet

A RAG system usually fails in one of two annoyingly familiar ways. It retrieves documents that are factually relevant but gives the model no clue about the task’s decision boundary. Or it retrieves labelled examples that show the decision pattern but are too parochial to help when the topic drifts. One source knows the world. The other knows the exam rubric. Naturally, many systems pick one and then pretend the compromise was strategy. ...

September 6, 2025 · 15 min · Zelina

Razor Burn: Why LLMs Nick Themselves on Induction and Abduction

Diagnosis is where AI systems start to look clever, then suddenly start charging consultancy rates. Give a model a handful of symptoms, incident logs, customer complaints, or audit traces, and ask it what explains them. It will usually produce something plausible. Sometimes several plausible things. Occasionally an entire decorative shrubbery of plausible things. The practical question is not whether the model can invent an explanation. That bar is underground. The harder question is whether it can find the simplest explanation that accounts for the evidence without adding unnecessary machinery. ...

September 6, 2025 · 15 min · Zelina

Cache Me If You Can: Designing Databases for Swarms of AI Agents

A data analyst asks a database a question. An AI agent interrogates it. That distinction sounds theatrical until the query logs arrive. The human analyst usually knows roughly where to look, asks a small number of targeted questions, waits for answers, adjusts, and eventually presents a result. The agent is less graceful. It checks schemas, samples columns, guesses joins, inspects distinct values, tries partial SQL, abandons it, starts again, validates, retries, and occasionally recruits more agents to repeat the exercise in parallel. It is not being stupid. It is compensating for a missing sense of the underlying data. ...

September 4, 2025 · 16 min · Zelina