Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

When training Large Language Models (LLMs) to reason, reinforcement learning has proven to be a powerful yet blunt instrument. Most methods reduce the entire model output to a single pass/fail reward, applying that verdict to every token—regardless of whether it contributed to success or failure. This makes credit assignment vague, verifiability weak, and learning inefficient. Enter CAPO (Credit Assignment Policy Optimization), a method that shifts the paradigm: it brings verifiable, fine-grained credit assignment to the token level, using LLMs themselves as judgment agents. ...

August 5, 2025 · 4 min · Zelina

Echo Chambers or Stubborn Minds? Simulating Social Influence with LLM Agents

Large language models aren’t just prompt-completion machines anymore. In controlled simulations, they can behave like people in a group discussion: yielding to peer pressure, sticking to their beliefs, or becoming more extreme over time. But not all LLMs are socially equal. A recent paper titled “Towards Simulating Social Influence Dynamics with LLM-based Multi-agents” explores how different LLMs behave in a forum-style discussion, capturing three phenomena familiar to any political science researcher or Reddit moderator: conformity, group polarization, and fragmentation. The twist? These aren’t real people. They’re fully scripted LLM agents with fixed personas, engaged in asynchronous multi-round debates. ...

July 31, 2025 · 3 min · Zelina

Mind the Gap: How AI Papers Misuse Psychology

It has become fashionable for AI researchers to pepper their papers with references to psychology: System 1 and 2 thinking, Theory of Mind, memory systems, even empathy. But according to a recent meta-analysis titled “The Incomplete Bridge: How AI Research (Mis)Engages with Psychology”, these references are often little more than conceptual garnish. The authors analyze 88 AI papers from NeurIPS and ACL (2022-2023) that cite psychological concepts. Their verdict is sobering: while 78% use psychology as inspiration, only 6% attempt to empirically validate or challenge psychological theories. Most papers cite psychology in passing — using it as window dressing to make AI behaviors sound more human-like. ...

July 31, 2025 · 3 min · Zelina

Beyond Words: Teaching AI to See and Fix Charts with ChartM3

When you tell an AI, “make the third bar blue,” what does it actually see? If it’s a typical large language model (LLM), it doesn’t really see anything. It parses your instruction, guesses what “third bar” means, and fumbles to write chart code—often missing the mark. ChartM$^3$ (Multimodal, Multi-level, Multi-perspective) changes the game. It challenges AIs to not only read and write code but also visually comprehend what a user points at. With 1,000 human-curated chart editing tasks and 24,000 training examples, this new benchmark sets a higher bar—one that demands both verbal and visual fluency. ...

July 30, 2025 · 4 min · Zelina

Fraud, Trimmed and Tagged: How Dual-Granularity Prompts Sharpen LLMs for Graph Detection

In the escalating arms race between fraudsters and detection systems, recent advances in Graph-Enhanced LLMs hold enormous promise. But they face a chronic problem: too much information. Take graph-based fraud detection. It’s common to represent users and their actions as nodes and edges on a heterogeneous graph, where each node may contain rich textual data (like reviews) and structured features (like ratings). To classify whether a node (e.g., a user review) is fraudulent, models like GraphGPT or HiGPT transform local neighborhoods into long textual prompts. But here’s the catch: real-world graphs are dense. Even two hops away, the neighborhood can balloon to millions of tokens. ...

July 30, 2025 · 4 min · Zelina

Don't Trust. Verify: Fighting Financial Hallucinations with FRED

When ChatGPT makes up a statistic or misstates a date, it’s annoying. But when a financial assistant claims the wrong interest expense or misattributes a revenue source, it could move markets or mislead clients. This is the stark reality FRED confronts head-on. FRED—short for Financial Retrieval-Enhanced Detection and Editing—is a framework fine-tuned to spot and fix factual errors in financial LLM outputs. Developed by researchers at Pegasi AI, it isn’t just another hallucination detection scheme. It’s an auditor with a domain-specific brain. ...

July 29, 2025 · 3 min · Zelina

When Your AI Disagrees with Your Portfolio

What happens when your AI co-pilot thinks it’s the pilot? In financial decision-making, autonomy isn’t always a virtue. A striking new study titled “Your AI, Not Your View” reveals that even the most advanced Large Language Models (LLMs) may quietly sabotage your investment strategy, not by hallucinating facts, but by overriding your intent with stubborn preferences baked into their training.

Hidden Hands Behind the Recommendations

The paper introduces a systematic framework to identify and measure confirmation bias in LLMs used for investment analysis. Instead of just summarizing news or spitting out buy/sell signals, the study asks: what if the model already has a favorite? More specifically: ...

July 29, 2025 · 4 min · Zelina

The Sims Get Smart? Why LLM-Driven Social Simulations Need a Reality Check

Social simulations are entering their uncanny valley. Fueled by generative agents powered by Large Language Models (LLMs), recent frameworks like Smallville, AgentSociety, and SocioVerse simulate thousands of lifelike agents forming friendships, spreading rumors, and planning parties. But do these simulations reflect real social processes, or merely replay the statistical shadows of the internet?

When Simulacra Speak Fluently

LLMs have demonstrated striking abilities to mimic human behaviors. GPT-4 has passed Theory-of-Mind (ToM) tests at levels comparable to 6- to 7-year-olds. In narrative contexts, it can detect sarcasm, understand indirect requests, and generate empathetic replies. But all of this arises not from embodied cognition or real-world goals but from next-token prediction trained on massive corpora. ...

July 28, 2025 · 4 min · Zelina

Tool Up or Tap Out: How Multi-TAG Elevates Math Reasoning with Smarter LLM Workflows

Most tool-augmented LLMs approach math reasoning like they’re wielding a hammer—good for hitting one nail at a time, but ill-equipped when the problem requires a wrench, a compass, and a soldering iron all at once. Enter Multi-TAG, a clever, finetuning-free framework that aggregates the strengths of multiple tools per reasoning step. Think of it as an LLM with a toolbox, not just a single tool. And it doesn’t just work—it wins, posting 6.0% to 7.5% accuracy gains across MATH500, AIME, AMC, and OlympiadBench against top baselines, using both open and closed LLMs. ...

July 28, 2025 · 4 min · Zelina

Steering by the Token: How GRAINS Turns Attribution into Alignment

Fine-tuning is the hammer; steering is the scalpel. In an era where models are increasingly opaque and high-stakes, we need tools that guide behavior without overhauling the entire architecture. That’s precisely what GRAINS (Gradient-based Attribution for Inference-Time Steering) delivers: a powerful, interpretable, and modular way to shift the behavior of LLMs and VLMs by leveraging the most fundamental unit of influence, the token.

The Problem with Global Steering

Traditional inference-time steering approaches often rely on global intervention vectors: a blunt, one-size-fits-all shift in hidden activations derived from paired desirable and undesirable examples. But these methods are insensitive to which specific tokens caused bad behavior. It’s like adjusting a recipe because the dish tastes bad without checking whether the salt or the sugar was at fault. ...

July 26, 2025 · 3 min · Zelina