Mirage Agents: When LLMs Act on Illusions

As large language models evolve into autonomous agents, their failures no longer stay confined to text—they materialize as actions. Clicking the wrong button, leaking private data, or falsely reporting success aren’t just hypotheticals anymore. They’re happening now, and MIRAGE-Bench is the first benchmark to comprehensively measure and categorize these agentic hallucinations. Unlike hallucinations in chatbots, which may be amusing or embarrassing, hallucinations in LLM agents operating in dynamic environments can lead to real-world consequences. MIRAGE—short for Measuring Illusions in Risky AGEnt settings—provides a long-overdue framework to elicit, isolate, and evaluate these failures. And the results are sobering: even top models like GPT-4o and Claude hallucinate at least one-third of the time when placed under pressure. ...

July 29, 2025 · 4 min · Zelina

RAG in the Wild: When More Knowledge Hurts

Retrieval-Augmented Generation (RAG) is often hailed as a cure-all for domain adaptation and factual accuracy in large language models (LLMs). By injecting external context at inference time, RAG systems promise to boost performance on knowledge-intensive tasks. But a new paper, RAG in the Wild (Xu et al., 2025), reveals that this promise is brittle when we leave the sanitized lab environment and enter the real world of messy, multi-source knowledge. ...

July 29, 2025 · 4 min · Zelina

Seeing is Believing? Not Quite — How CoCoT Makes Vision-Language Models Think Before They Judge

Vision-language models (VLMs) may describe what they see, but do they truly understand what they’re looking at, especially in social contexts? A recent paper introduces Cognitive Chain-of-Thought (CoCoT), a deceptively simple yet remarkably effective prompting strategy that helps these models reason like humans: through layered cognition, not flat logic.

The Problem with Flat Reasoning

Traditional Chain-of-Thought (CoT) prompting, while powerful for math and symbolic tasks, falls short when it comes to social or moral interpretation. Consider a scene where a person wears a mask indoors, and another says, “Hiding from the paparazzi, huh?” CoT may recognize the mask, but often misfires in guessing intent: is it a joke? A warning? An instruction? ...

July 29, 2025 · 3 min · Zelina

When Your AI Disagrees with Your Portfolio

What happens when your AI co-pilot thinks it’s the pilot? In financial decision-making, autonomy isn’t always a virtue. A striking new study titled “Your AI, Not Your View” reveals that even the most advanced Large Language Models (LLMs) may quietly sabotage your investment strategy, not by hallucinating facts, but by overriding your intent with stubborn preferences baked into their training.

Hidden Hands Behind the Recommendations

The paper introduces a systematic framework to identify and measure confirmation bias in LLMs used for investment analysis. Instead of just summarizing news or spitting out buy/sell signals, the study asks: what if the model already has a favorite? More specifically: ...

July 29, 2025 · 4 min · Zelina

Graft and Go: How Knowledge Grafting Shrinks AI Without Shrinking Its Brain

If you’ve ever tried to run a powerful AI model on a modest device (say, a drone, a farm robot, or even a Raspberry Pi), you’ve likely hit the wall of hardware limitations. Today’s most accurate models are big, bloated, and brittle when it comes to efficiency. Enter knowledge grafting, a refreshingly biological metaphor for a novel compression technique that doesn’t just trim the fat; it transfers the muscle.

Rethinking Compression: Not What to Cut, But What to Keep

Traditional model optimization methods (quantization, pruning, and distillation) all try to make the best of a difficult trade-off: shrinking the model while limiting the damage to performance. These methods often fall short, especially when you push compression past 5–6x. ...

July 28, 2025 · 3 min · Zelina

Mind the Earnings Gap: Why LLMs Still Flunk Financial Decision-Making

In the race to turn language models into financial analysts, a new benchmark is calling the hype’s bluff. FinanceBench, introduced by a team of researchers from Amazon and academia, aims to test LLMs not just on text summarization or sentiment analysis, but on their ability to think like Wall Street professionals. The results? Let’s just say GPT-4 may ace the chatroom, but it still struggles in the boardroom.

The Benchmark We Actually Needed

FinanceBench isn’t your typical leaderboard filler. Unlike prior datasets, which mostly rely on news headlines or synthetic financial prompts, this one uses real earnings call transcripts from over 130 public companies. It frames the task like a genuine investment analyst workflow: ...

July 28, 2025 · 3 min · Zelina

Rollout Renaissance: How Pareto-NRPA Revives Monte Carlo for Multi-Objective Optimization

Monte Carlo search algorithms rarely make the shortlist in multi-objective optimization (MOO). Traditionally, the field has belonged to evolutionary algorithms like NSGA-II and SMS-EMOA. But a paper from Paris Dauphine-PSL and Thales upends that hierarchy with an audacious twist: what if we generalized NRPA — a niche but powerful single-objective method — to handle multiple objectives, constraints, and diversity, all in one elegant framework? ...

July 28, 2025 · 3 min · Zelina

The Sims Get Smart? Why LLM-Driven Social Simulations Need a Reality Check

Social simulations are entering their uncanny valley. Fueled by generative agents powered by Large Language Models (LLMs), recent frameworks like Smallville, AgentSociety, and SocioVerse simulate thousands of lifelike agents forming friendships, spreading rumors, and planning parties. But do these simulations reflect real social processes, or merely replay the statistical shadows of the internet?

When Simulacra Speak Fluently

LLMs have demonstrated striking abilities to mimic human behaviors. GPT-4 has passed Theory-of-Mind (ToM) tests at levels comparable to 6- to 7-year-olds. In narrative contexts, it can detect sarcasm, understand indirect requests, and generate empathetic replies. But all of this arises not from embodied cognition or real-world goals; it is just next-token prediction trained on massive corpora. ...

July 28, 2025 · 4 min · Zelina

Tool Up or Tap Out: How Multi-TAG Elevates Math Reasoning with Smarter LLM Workflows

Most tool-augmented LLMs approach math reasoning like they’re wielding a hammer: good for hitting one nail at a time, but ill-equipped when the problem requires a wrench, a compass, and a soldering iron all at once. Enter Multi-TAG, a clever, finetuning-free framework that aggregates the strengths of multiple tools at each reasoning step. Think of it as an LLM with a toolbox, not just a single tool. And it doesn’t just work, it wins, posting 6.0% to 7.5% accuracy gains over top baselines across MATH500, AIME, AMC, and OlympiadBench, with both open and closed LLMs. ...

July 28, 2025 · 4 min · Zelina

All Eggs, One Basket: When Diversification Backfires in Risk Modeling

“Don’t put all your eggs in one basket” has long been gospel in finance and risk management. But what if, sometimes, the basket is the safer place? In a surprising twist on conventional wisdom, Léonard Vincent’s latest paper presents the one-basket theorem: a theoretical framework that proves diversification can increase risk under certain extreme but relevant conditions. Specifically, when dealing with heavy-tailed risks that have infinite mean, such as those found in insurance, operational risk, and even crypto markets, putting all your eggs in one basket may be the rational choice. ...

July 27, 2025 · 3 min · Zelina