
When Empathy Needs a Map: Benchmarking Tool‑Augmented Emotional Support

Opening — Why this matters now. Emotional support from AI has quietly moved from novelty to expectation. People vent to chatbots after work, during grief, and in moments of burnout—not to solve equations, but to feel understood. Yet something subtle keeps breaking trust. The responses sound caring, but they are often wrong in small, revealing ways: the time is off, the location is imagined, the suggestion doesn’t fit reality. Empathy without grounding turns into polite hallucination. ...

February 1, 2026 · 4 min · Zelina

TowerMind: When Language Models Learn That Towers Have Consequences

Opening — Why this matters now. Large Language Models have become fluent planners. Ask them to outline a strategy, decompose a task, or explain why something should work, and they rarely hesitate. Yet when placed inside an environment where actions cost resources, mistakes compound, and time does not politely pause, that fluency often collapses. ...

January 12, 2026 · 4 min · Zelina

Think Before You Sink: Streaming Hallucinations in Long Reasoning

Opening — Why this matters now. Large language models have learned to think out loud. Chain-of-thought (CoT) reasoning has become the default solution for math, planning, and multi-step decision tasks. The industry applauded: more transparency, better answers, apparent interpretability. Then reality intervened. Despite elegant reasoning traces, models still reach incorrect conclusions—sometimes confidently, sometimes catastrophically. Worse, the mistakes are no longer obvious. They creep in quietly, spread across steps, and survive superficial self-corrections. What we call “hallucination” has grown up. And our detection methods have not. ...

January 6, 2026 · 4 min · Zelina

The Gospel of Faithful AI: How FaithAct Rewrites Reasoning

Opening — Why this matters now. Hallucination has become the embarrassing tic of multimodal AI — a confident assertion untethered from evidence. In image–language models, this manifests as phantom bicycles, imaginary arrows, or misplaced logic that sounds rational but isn’t real. The problem is not stupidity but unfaithfulness — models that reason beautifully yet dishonestly. ...

November 12, 2025 · 3 min · Zelina

Fork, Fuse, and Rule: XAgents’ Multipolar Playbook for Safer Multi‑Agent AI

TL;DR: XAgents pairs a multipolar task graph (diverge with SIMO, converge with MISO) with IF‑THEN rule guards to plan uncertain tasks and suppress hallucinations. In benchmarks spanning knowledge and logic QA, it outperforms SPP, AutoAgents, TDAG, and AgentNet while using ~29% fewer tokens and ~45% less memory than AgentNet on a representative task. For operators, the practical win is a recipe to encode SOPs as rules on top of agent teams—without giving up adaptability. ...

September 19, 2025 · 4 min · Zelina
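A minimal sketch of the IF-THEN rule-guard idea in the XAgents summary above: each SIMO branch's draft passes through explicit rules before a MISO-style fusion step. The `Rule`, `guard`, and `fuse` names and the citation check are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: IF-THEN guards over diverging agent drafts, then a converging fusion step.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    """IF `condition` holds for a draft, THEN apply `action` (e.g. rewrite or drop it)."""
    condition: Callable[[str], bool]
    action: Callable[[str], str]

def guard(draft: str, rules: List[Rule]) -> str:
    """Run every IF-THEN rule over one agent's draft (a SIMO branch output)."""
    for rule in rules:
        if rule.condition(draft):
            draft = rule.action(draft)
    return draft

def fuse(drafts: List[str]) -> str:
    """MISO-style convergence; here, naively keep the longest surviving draft."""
    survivors = [d for d in drafts if d]
    return max(survivors, key=len) if survivors else ""

# Example guard: suppress drafts that cite no source at all.
rules = [Rule(condition=lambda d: "source:" not in d.lower(),
              action=lambda d: "")]  # drop unsupported drafts
drafts = ["Answer A. Source: doc-3", "Answer B with no citation"]
print(fuse([guard(d, rules) for d in drafts]))  # -> "Answer A. Source: doc-3"
```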

Stop, Verify, and Listen: HALT‑RAG Brings a ‘Reject Option’ to RAG

The big idea. RAG pipelines are only as reliable as their weakest link: generation that confidently asserts things the sources don’t support. HALT‑RAG proposes an unusually pragmatic fix: don’t fine‑tune a big model—ensemble two strong, frozen NLI models, add lightweight lexical features, train a tiny task‑adapted meta‑classifier, and calibrate it so you can abstain when uncertain. The result isn’t just accuracy; it’s a governable safety control you can dial to meet business risk. ...

September 13, 2025 · 4 min · Zelina
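A hedged sketch of the "reject option" pattern the HALT-RAG summary describes: frozen-scorer outputs plus lexical features feed a small meta-classifier whose calibrated probability decides between accept, reject, and abstain. The feature layout, thresholds, and toy data below are assumptions for illustration, not HALT-RAG's actual pipeline.

```python
# Sketch only: a tiny meta-classifier with an abstention band, standing in for
# the ensemble-plus-calibration design described in the post excerpt.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [nli_score_model_1, nli_score_model_2, lexical_overlap] (illustrative features)
X_train = np.array([[0.95, 0.90, 0.80],   # supported claim
                    [0.10, 0.20, 0.15],   # unsupported claim
                    [0.85, 0.88, 0.70],
                    [0.05, 0.15, 0.10]])
y_train = np.array([1, 0, 1, 0])          # 1 = faithful, 0 = hallucinated

meta = LogisticRegression().fit(X_train, y_train)

def decide(features, low=0.35, high=0.65):
    """Return 'accept', 'reject', or 'abstain' based on the meta-classifier's confidence."""
    p_faithful = meta.predict_proba([features])[0, 1]
    if p_faithful >= high:
        return "accept"
    if p_faithful <= low:
        return "reject"
    return "abstain"   # route to a human or a retrieval retry

print(decide([0.50, 0.55, 0.40]))  # a borderline case will usually land in 'abstain'
```

The abstention band is the "dial" mentioned in the excerpt: widening it trades coverage for lower risk.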

FAITH in Numbers: Stress-Testing LLMs Against Financial Hallucinations

Financial AI promises speed and scale — but in finance, a single misplaced digit can be the difference between compliance and catastrophe. The FAITH (Framework for Assessing Intrinsic Tabular Hallucinations) benchmark tackles this risk head‑on, probing how well large language models can faithfully extract and compute numbers from the dense, interconnected tables in 10‑K filings. From Idea to Dataset: Masking With a Purpose. FAITH reframes hallucination detection as a context‑aware masked span prediction task. It takes real S&P 500 annual reports, hides specific numeric spans, and asks the model to recover them — but only after ensuring three non‑negotiable conditions: ...

August 8, 2025 · 3 min · Zelina
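A minimal sketch of the masked numeric-span setup the FAITH summary describes: hide one figure in a filing-style table row, ask a model to recover it from context, and score exact match. The row text, regex, and `ask_model` stub are illustrative placeholders, not the benchmark's data or code.

```python
# Sketch only: context-aware masked span prediction over a numeric table cell.
import re

ROW = "Net interest expense | 2023: $1,482M | 2022: $1,301M | 2021: $1,275M"

def mask_numeric_span(text: str, index: int = 0):
    """Replace the index-th numeric span with [MASK]; return the masked text and the answer."""
    spans = list(re.finditer(r"\$[\d,]+M", text))
    target = spans[index]
    masked = text[:target.start()] + "[MASK]" + text[target.end():]
    return masked, target.group()

def ask_model(prompt: str) -> str:
    """Stand-in for an LLM call; a real benchmark would query the model under test here."""
    return "$1,482M"

masked_row, gold = mask_numeric_span(ROW, index=0)
prediction = ask_model(f"Fill in [MASK] using the surrounding context:\n{masked_row}")
print("exact match:", prediction.strip() == gold)  # faithful only if every digit matches
```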

Seeing Is Deceiving: Diagnosing and Fixing Hallucinations in Multimodal AI

“I See What I Want to See.” Modern multimodal large language models (MLLMs)—like GPT-4V, Gemini, and LLaVA—promise to “understand” images. But what happens when their eyes lie? In many real-world cases, MLLMs generate fluent, plausible-sounding responses that are visually inaccurate or outright hallucinated. That’s a problem not just for safety, but for trust. A new paper titled “Understanding, Localizing, and Mitigating Hallucinations in Multimodal Large Language Models” introduces a systematic approach to this growing issue. It moves beyond just counting hallucinations and instead offers tools to diagnose where they come from—and more importantly, how to fix them. ...

August 5, 2025 · 3 min · Zelina

Don't Trust. Verify: Fighting Financial Hallucinations with FRED

When ChatGPT makes up a statistic or misstates a date, it’s annoying. But when a financial assistant claims the wrong interest expense or misattributes a revenue source, it could move markets or mislead clients. This is the stark reality FRED confronts head-on. FRED—short for Financial Retrieval-Enhanced Detection and Editing—is a framework fine-tuned to spot and fix factual errors in financial LLM outputs. Developed by researchers at Pegasi AI, it isn’t just another hallucination detection scheme. It’s an auditor with a domain-specific brain. ...

July 29, 2025 · 3 min · Zelina

Mirage Agents: When LLMs Act on Illusions

As large language models evolve into autonomous agents, their failures no longer stay confined to text—they materialize as actions. Clicking the wrong button, leaking private data, or falsely reporting success aren’t just hypotheticals anymore. They’re happening now, and MIRAGE-Bench is the first benchmark to comprehensively measure and categorize these agentic hallucinations. Unlike hallucinations in chatbots, which may be amusing or embarrassing, hallucinations in LLM agents operating in dynamic environments can lead to real-world consequences. MIRAGE—short for Measuring Illusions in Risky AGEnt settings—provides a long-overdue framework to elicit, isolate, and evaluate these failures. And the results are sobering: even top models like GPT-4o and Claude hallucinate at least one-third of the time when placed under pressure. ...

July 29, 2025 · 4 min · Zelina