Longer Yet Dumber: Why LLMs Fail at Catching Their Own Coding Mistakes

When a junior developer misunderstands your instructions, they might still write code that compiles and runs—but does the wrong thing. This is exactly what large language models (LLMs) do when faced with faulty premises. A new paper, Refining Critical Thinking in LLM Code Generation, unveils FPBench, a benchmark that probes an overlooked blind spot: whether AI models can detect flawed assumptions before they generate a single line of code. Spoiler: they usually can’t. ...

August 6, 2025 · 3 min · Zelina

Reasoning with Both Eyes Open: Why Multimodal Chain-of-Thought Still Trips Up LLMs

If today’s AI models can ace bar exams, explain astrophysics, and generate functional code from a napkin sketch, why do they still fail at seemingly simple questions that require looking and thinking? A new benchmark called MCORE (Multimodal Chain-of-Reasoning Evaluation) answers that question with a resounding verdict: reasoning across modalities is hard—and we’re not as far along as we thought.

Beyond Pattern Matching: What MCORE Tests

The majority of multimodal evaluations today rely on either: ...

August 6, 2025 · 3 min · Zelina

Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

When training Large Language Models (LLMs) to reason, reinforcement learning has proven to be a powerful yet blunt instrument. Most methods reduce the entire model output to a single pass/fail reward, applying that verdict to every token—regardless of whether it contributed to success or failure. This makes credit assignment vague, verifiability weak, and learning inefficient. Enter CAPO (Credit Assignment Policy Optimization), a method that shifts the paradigm: it brings verifiable, fine-grained credit assignment to the token level, using LLMs themselves as judgment agents. ...
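To make that contrast concrete, here is a minimal sketch of the difference between broadcasting one outcome reward to every token and assigning per-token credit from a judge’s verdicts. It is an illustration only, not the CAPO algorithm itself; the `judge_labels` input is a hypothetical stand-in for whatever an LLM critic would produce.

```python
# Toy contrast between outcome-level and token-level credit assignment.
# Illustrative sketch only, not the CAPO algorithm from the paper.
from typing import List


def broadcast_outcome_reward(tokens: List[str], passed: bool) -> List[float]:
    """The blunt baseline: every token inherits the single pass/fail verdict."""
    reward = 1.0 if passed else -1.0
    return [reward] * len(tokens)


def token_level_rewards(tokens: List[str], judge_labels: List[bool]) -> List[float]:
    """Fine-grained credit: each token gets the verdict for its own step.

    `judge_labels` is assumed to come from an LLM judge that marks which
    tokens actually contributed to a correct solution.
    """
    assert len(tokens) == len(judge_labels)
    return [1.0 if ok else -1.0 for ok in judge_labels]


if __name__ == "__main__":
    tokens = ["x", "=", "3", ";", "y", "=", "x", "*", "0"]  # last step is the bug
    labels = [True] * 8 + [False]                           # hypothetical judge output

    print(broadcast_outcome_reward(tokens, passed=False))  # every token punished
    print(token_level_rewards(tokens, labels))             # only the faulty token punished
```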

August 5, 2025 · 4 min · Zelina

Echo Chambers or Stubborn Minds? Simulating Social Influence with LLM Agents

Large language models aren’t just prompt-completion machines anymore. In controlled simulations, they can behave like people in a group discussion: yielding to peer pressure, sticking to their beliefs, or becoming more extreme over time. But not all LLMs are socially equal. A recent paper titled “Towards Simulating Social Influence Dynamics with LLM-based Multi-agents” explores how different LLMs behave in a forum-style discussion, capturing three phenomena familiar to any political science researcher or Reddit moderator: conformity, group polarization, and fragmentation. The twist? These aren’t real people. They’re fully scripted LLM agents with fixed personas, engaged in asynchronous multi-round debates. ...

July 31, 2025 · 3 min · Zelina

Mind the Gap: How AI Papers Misuse Psychology

It has become fashionable for AI researchers to pepper their papers with references to psychology: System 1 and 2 thinking, Theory of Mind, memory systems, even empathy. But according to a recent meta-analysis titled “The Incomplete Bridge: How AI Research (Mis)Engages with Psychology”, these references are often little more than conceptual garnish. The authors analyze 88 AI papers from NeurIPS and ACL (2022-2023) that cite psychological concepts. Their verdict is sobering: while 78% use psychology as inspiration, only 6% attempt to empirically validate or challenge psychological theories. Most papers cite psychology in passing — using it as window dressing to make AI behaviors sound more human-like. ...

July 31, 2025 · 3 min · Zelina

Beyond Words: Teaching AI to See and Fix Charts with ChartM3

When you tell an AI, “make the third bar blue,” what does it actually see? If it’s a typical large language model (LLM), it doesn’t really see anything. It parses your instruction, guesses what “third bar” means, and fumbles to write chart code—often missing the mark. ChartM3 (Multimodal, Multi-level, Multi-perspective) changes the game. It challenges AIs not only to read and write code but also to visually comprehend what a user points at. With 1,000 human-curated chart editing tasks and 24,000 training examples, this new benchmark sets a higher bar—one that demands both verbal and visual fluency. ...

July 30, 2025 · 4 min · Zelina

Fraud, Trimmed and Tagged: How Dual-Granularity Prompts Sharpen LLMs for Graph Detection

In the escalating arms race between fraudsters and detection systems, recent advances in Graph-Enhanced LLMs hold enormous promise. But they face a chronic problem: too much information. Take graph-based fraud detection. It’s common to represent users and their actions as nodes and edges on a heterogeneous graph, where each node may contain rich textual data (like reviews) and structured features (like ratings). To classify whether a node (e.g., a user review) is fraudulent, models like GraphGPT or HiGPT transform local neighborhoods into long textual prompts. But here’s the catch: real-world graphs are dense. Even two hops away, the neighborhood can balloon to millions of tokens. ...
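To see the blow-up mechanically, here is a toy sketch that flattens a node’s k-hop neighborhood into one textual prompt and counts how quickly it grows. The graph, serialization format, and token estimate are all hypothetical illustrations, not the actual GraphGPT or HiGPT pipeline.

```python
# Why neighborhood-to-prompt serialization explodes on dense graphs.
# Hypothetical toy graph and prompt format, not the GraphGPT/HiGPT pipeline.
import random
from collections import defaultdict

random.seed(0)
edges = defaultdict(set)
node_text = {}

# Toy bipartite review graph: 200 users x 300 products, 30 reviews per user.
for u in range(200):
    user = f"user_{u}"
    node_text[user] = f"{user}: account metadata and rating history"
    for p in random.sample(range(300), 30):
        product = f"product_{p}"
        node_text[product] = f"{product}: description and aggregate rating"
        edges[user].add(product)
        edges[product].add(user)


def k_hop_prompt(start: str, k: int) -> str:
    """Flatten every node within k hops of `start` into one long textual prompt."""
    frontier, seen = {start}, {start}
    for _ in range(k):
        frontier = {n for v in frontier for n in edges[v]} - seen
        seen |= frontier
    return "\n".join(node_text[v] for v in sorted(seen))


for k in (1, 2):
    prompt = k_hop_prompt("user_0", k)
    n_nodes = len(prompt.splitlines())
    n_tokens = int(len(prompt.split()) * 1.3)  # rough estimate: ~1.3 tokens per word
    print(f"{k}-hop neighborhood: {n_nodes} nodes, ~{n_tokens} tokens")
```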

July 30, 2025 · 4 min · Zelina

Don't Trust. Verify: Fighting Financial Hallucinations with FRED

When ChatGPT makes up a statistic or misstates a date, it’s annoying. But when a financial assistant claims the wrong interest expense or misattributes a revenue source, it could move markets or mislead clients. This is the stark reality FRED confronts head-on. FRED—short for Financial Retrieval-Enhanced Detection and Editing—is a framework fine-tuned to spot and fix factual errors in financial LLM outputs. Developed by researchers at Pegasi AI, it isn’t just another hallucination detection scheme. It’s an auditor with a domain-specific brain. ...

July 29, 2025 · 3 min · Zelina

When Your AI Disagrees with Your Portfolio

What happens when your AI co-pilot thinks it’s the pilot? In financial decision-making, autonomy isn’t always a virtue. A striking new study titled “Your AI, Not Your View” reveals that even the most advanced Large Language Models (LLMs) may quietly sabotage your investment strategy — not by hallucinating facts, but by overriding your intent with stubborn preferences baked into their training.

Hidden Hands Behind the Recommendations

The paper introduces a systematic framework to identify and measure confirmation bias in LLMs used for investment analysis. Instead of just summarizing news or spitting out buy/sell signals, the study asks: what if the model already has a favorite? More specifically: ...

July 29, 2025 · 4 min · Zelina

The Sims Get Smart? Why LLM-Driven Social Simulations Need a Reality Check

Social simulations are entering their uncanny valley. Fueled by generative agents powered by Large Language Models (LLMs), recent frameworks like Smallville, AgentSociety, and SocioVerse simulate thousands of lifelike agents forming friendships, spreading rumors, and planning parties. But do these simulations reflect real social processes — or merely replay the statistical shadows of the internet?

When Simulacra Speak Fluently

LLMs have demonstrated striking abilities to mimic human behaviors. GPT-4 has passed Theory-of-Mind (ToM) tests at levels comparable to those of 6- to 7-year-olds. In narrative contexts, it can detect sarcasm, understand indirect requests, and generate empathetic replies. But all of this arises not from embodied cognition or real-world goals — it’s just next-token prediction trained on massive corpora. ...

July 28, 2025 · 4 min · Zelina