From Wallets to Warlords: How AI Agents Are Colonizing Web3

When ChatGPT meets Ethereum, something stranger than fiction emerges: self-improving wallets, token-trading bots with personality, and agents that vote in DAOs like digital lobbyists. A recent systematic study of 133 Web3-AI agent projects has finally mapped this chaotic frontier — and the findings suggest we’re just witnessing the first skirmishes of a much bigger transformation.

The Two Poles of the Web3-AI Ecosystem

The paper identifies four major project categories:

| Category | Project Count | Avg Market Cap | Example Projects |
|---|---|---|---|
| AI Agent Incubation | 56 | $88M | Singularity, Eliza OS |
| Infrastructure | 34 | $188M | NEAR, Fetch.ai |
| Financial Services | 55 | $57M | Nexo, Griffain, Wayfinder |
| Creative & Virtual | 28 | $85M | Botto, Hytopia |

Two clear dynamics emerge: ...

August 6, 2025 · 4 min · Zelina

Longer Yet Dumber: Why LLMs Fail at Catching Their Own Coding Mistakes

When a junior developer misunderstands your instructions, they might still write code that compiles and runs—but does the wrong thing. This is exactly what large language models (LLMs) do when faced with faulty premises. The latest paper, Refining Critical Thinking in LLM Code Generation, unveils FPBench, a benchmark that probes an overlooked blind spot: whether AI models can detect flawed assumptions before they generate a single line of code. Spoiler: they usually can’t. ...
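
To make that blind spot concrete, here is a minimal sketch of a critique-before-generate gate in the spirit of FPBench. The `query_llm` helper and its canned replies are illustrative assumptions; the benchmark’s actual protocol may differ.

```python
# Minimal premise-check gate: critique the task's assumptions before coding.

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real chat-completion call. Canned replies
    # let the sketch run end to end; swap in a real model client in practice.
    if "flawed premises" in prompt:
        return "Python lists have no .push() method; the intended call is .append()."
    return "def push(stack, item):\n    stack.append(item)"

def generate_with_premise_check(task: str) -> str:
    # Step 1: ask the model to critique the task's premises first.
    critique = query_llm(
        "List any flawed premises in this coding task. "
        "Reply 'NONE' if the premises are sound.\n\n" + task
    )
    # Step 2: only generate code once the premises pass inspection.
    if critique.strip().upper() != "NONE":
        return "Premise problem detected:\n" + critique
    return query_llm("Write code for this task:\n\n" + task)

# A task with a faulty premise: Python lists have no .push() method.
print(generate_with_premise_check(
    "Use Python's built-in list.push() method to implement a stack."
))
```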

August 6, 2025 · 3 min · Zelina

Open-Source, Open Risk? Testing the Limits of Malicious Fine-Tuning

When OpenAI released the open-weight model gpt-oss, it did something rare: before letting the model into the wild, its researchers pretended to be bad actors. This wasn’t an ethical lapse. It was a safety strategy. The team simulated worst-case misuse by fine-tuning gpt-oss to maximize its dangerous capabilities in biology and cybersecurity. They called this process Malicious Fine-Tuning (MFT). And the results offer something the AI safety debate sorely lacks: empirical grounding. ...

August 6, 2025 · 4 min · Zelina

Reasoning with Both Eyes Open: Why Multimodal Chain-of-Thought Still Trips Up LLMs

If today’s AI models can ace bar exams, explain astrophysics, and generate functional code from a napkin sketch, why do they still fail at seemingly simple questions that require looking and thinking? A new benchmark called MCORE (Multimodal Chain-of-Reasoning Evaluation) answers that question with a resounding: because reasoning across modalities is hard—and we’re not as far along as we thought.

Beyond Pattern Matching: What MCORE Tests

The majority of multimodal evaluations today rely on either: ...

August 6, 2025 · 3 min · Zelina

Thinking in Circles: How Self-Questioning LLMs Learn Without Labels

What if an LLM could learn not by reading more, but by thinking harder? That’s the radical premise behind Self-Questioning Language Models (SQLM), a framework that transforms large language models from passive learners into active generators of their own training data. No curated datasets. No labeled answers. Just a prompt — and a model that gets smarter by challenging itself.

From Self-Play in Robotics to Reasoning in Language

The inspiration for SQLM comes from asymmetric self-play, a technique used in robotics where one agent proposes tasks and another learns to solve them. Here, that paradigm is adapted to LLMs: ...
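
To make the proposer/solver loop concrete, here is a minimal sketch of asymmetric self-play adapted to a language model. The `sample` helper is a hypothetical stand-in for an LLM call, and the majority-vote reward is an illustrative proxy rather than necessarily the paper’s exact signal.

```python
# One round of asymmetric self-play: the same model proposes a problem,
# then tries to solve it; agreement across attempts is a label-free reward.

def sample(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM sampling call.
    return "42" if prompt.startswith("Solve") else "What is 6 * 7?"

def self_play_round(topic: str, n_attempts: int = 4) -> dict:
    # Proposer role: invent a fresh problem in the target domain.
    problem = sample(f"Pose a challenging {topic} problem.")
    # Solver role: attempt it several times; majority agreement serves as
    # a proxy reward when no ground-truth label exists.
    attempts = [sample(f"Solve: {problem}") for _ in range(n_attempts)]
    majority = max(set(attempts), key=attempts.count)
    reward = attempts.count(majority) / n_attempts
    # A full implementation would feed `reward` into RL updates for both roles.
    return {"problem": problem, "answer": majority, "reward": reward}

print(self_play_round("arithmetic"))
# {'problem': 'What is 6 * 7?', 'answer': '42', 'reward': 1.0}
```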

August 6, 2025 · 3 min · Zelina

Add to Cart, Add to Power: What Happens When AI Shops for You

When humans stop shopping and AI takes over, the cart becomes a new battleground. A recent study titled “What Is Your AI Agent Buying?” introduces a benchmark framework called ACES to simulate AI-mediated e-commerce environments, and the results are far more consequential than a simple switch from user clicks to agent decisions.

The ACES Sandbox: Agentic E-Commerce Under the Microscope

ACES (Agentic e-Commerce Simulator) offers a controlled environment that pairs state-of-the-art vision-language-model (VLM) agents with a mock shopping website. This setup enables causal measurement of how different product attributes (price, rating, reviews) and platform levers (position, tags, sponsorship) influence agentic decision-making. ...
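
To illustrate the kind of causal measurement ACES enables, below is a toy randomize-and-measure experiment for position bias. The stub agent and its choice weights are assumptions for demonstration; a real run would substitute a VLM agent browsing the mock storefront.

```python
# Randomize which product sits in the top slot, then measure how often the
# top slot wins. A share above 0.5 indicates a causal position effect.

import random

def agent_chooses(products: list[dict]) -> int:
    # Stub agent with a mild built-in position bias, for demonstration only.
    weights = [2.0 if p["slot"] == 0 else 1.0 for p in products]
    return random.choices(range(len(products)), weights=weights)[0]

def position_effect(trials: int = 10_000) -> float:
    catalog = [{"name": "A", "price": 19.9}, {"name": "B", "price": 19.9}]
    top_picks = 0
    for _ in range(trials):
        random.shuffle(catalog)                      # randomized assignment
        products = [dict(p, slot=i) for i, p in enumerate(catalog)]
        choice = agent_chooses(products)
        top_picks += products[choice]["slot"] == 0   # did the top slot win?
    return top_picks / trials

print(f"top-slot choice share: {position_effect():.3f}")  # ~0.667 here
```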

August 5, 2025 · 4 min · Zelina

Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

When training Large Language Models (LLMs) to reason, reinforcement learning has proven to be a powerful yet blunt instrument. Most methods reduce the entire model output to a single pass/fail reward, applying that verdict to every token—regardless of whether it contributed to success or failure. This makes credit assignment vague, verifiability weak, and learning inefficient. Enter CAPO (Credit Assignment Policy Optimization), a method that shifts the paradigm: it brings verifiable, fine-grained credit assignment to the token level, using LLMs themselves as judgment agents. ...
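
To see why token-level credit matters, compare a single pass/fail verdict broadcast across every token with per-step credit from a judge. This is a minimal sketch; the toy judge below stands in for the LLM-as-judge that CAPO actually employs.

```python
# Outcome-level reward vs. token-level credit assignment.

def outcome_reward(tokens: list[str], passed: bool) -> list[float]:
    # Baseline: every token inherits the same verdict, right or wrong.
    return [1.0 if passed else -1.0] * len(tokens)

def token_level_credit(tokens: list[str], faulty_steps: set[int]) -> list[float]:
    # CAPO-style idea: a judge marks which steps actually went wrong, so
    # only those tokens are penalized. The judge's output is assumed here.
    return [-1.0 if i in faulty_steps else 1.0 for i in range(len(tokens))]

steps = ["parse", "plan", "compute", "verify"]
print(outcome_reward(steps, passed=False))          # [-1.0, -1.0, -1.0, -1.0]
print(token_level_credit(steps, faulty_steps={2}))  # [1.0, 1.0, -1.0, 1.0]
```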

August 5, 2025 · 4 min · Zelina

Graphs, Gains, and Guile: How FinKario Outruns Financial LLMs

In the world of financial AI, where speed meets complexity, most systems are either too slow to adapt or too brittle to interpret the nuanced messiness of real-world finance. Enter FinKario, a new system that combines event-enhanced financial knowledge graphs with a graph-aware retrieval strategy — and outperforms both specialized financial LLMs and institutional strategies in real-world backtests.

The Retail Investor’s Dilemma

Retail traders drown in information overload, while professional research reports contain rich insights — but those reports are long, unstructured, and hard to parse. Most LLM-based tools don’t fully exploit them: they either extract static attributes (e.g., stock ticker, sector, valuation) or respond to isolated queries without contextual awareness. ...

August 5, 2025 · 3 min · Zelina

Love in the Time of Context: Why LLMs Still Don't Get You

Personalization is the love language of AI. But today’s large language models (LLMs) are more like well-meaning pen pals than mind-reading confidants. They remember your name, maybe your writing style — but the moment the context shifts, they stumble. The CUPID benchmark, introduced in a recent COLM 2025 paper, shows just how wide the gap still is between knowing the user and understanding them in context.

Beyond Global Preferences: The Rise of Contextual Alignment

Most LLMs that claim to be “personalized” assume you have stable, monolithic preferences. If you like bullet points, they’ll always give you bullet points. If you once asked for formal tone, they’ll keep things stiff forever. ...

August 5, 2025 · 4 min · Zelina

Seeing Is Deceiving: Diagnosing and Fixing Hallucinations in Multimodal AI

“I See What I Want to See”

Modern multimodal large language models (MLLMs)—like GPT-4V, Gemini, and LLaVA—promise to “understand” images. But what happens when their eyes lie? In many real-world cases, MLLMs generate fluent, plausible-sounding responses that are visually inaccurate or outright hallucinated. That’s a problem not just for safety, but for trust.

A new paper titled “Understanding, Localizing, and Mitigating Hallucinations in Multimodal Large Language Models” introduces a systematic approach to this growing issue. It moves beyond just counting hallucinations and instead offers tools to diagnose where they come from—and more importantly, how to fix them. ...

August 5, 2025 · 3 min · Zelina