
FAITH in Numbers: Stress-Testing LLMs Against Financial Hallucinations

Financial AI promises speed and scale — but in finance, a single misplaced digit can be the difference between compliance and catastrophe. The FAITH (Framework for Assessing Intrinsic Tabular Hallucinations) benchmark tackles this risk head‑on, probing how well large language models can faithfully extract and compute numbers from the dense, interconnected tables in 10‑K filings.

From Idea to Dataset: Masking With a Purpose

FAITH reframes hallucination detection as a context‑aware masked span prediction task. It takes real S&P 500 annual reports, hides specific numeric spans, and asks the model to recover them — but only after ensuring three non‑negotiable conditions: ...
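The masked-span setup above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual pipeline: the toy table, the `mask_cell` and `is_faithful` helpers, and the tolerance parameter are all our own assumptions. The point is the "context-aware" property: the hidden number remains recoverable from the surrounding cells.

```python
# Illustrative sketch of FAITH-style masked span prediction (hypothetical
# helpers; the benchmark's real masking pipeline and filters are not shown).
# We mask one numeric cell in a toy 10-K-like table and check whether a
# candidate completion is arithmetically consistent with the context.

def mask_cell(table, row, col):
    """Replace one numeric cell with a [MASK] token; return masked table and gold value."""
    masked = {r: dict(cells) for r, cells in table.items()}
    gold = masked[row][col]
    masked[row][col] = "[MASK]"
    return masked, gold

def is_faithful(prediction, gold, tol=0.0):
    """A prediction hallucinates if it deviates from the recoverable gold value."""
    return abs(float(prediction) - float(gold)) <= tol

# Toy revenue table: the total equals the sum of its segments, so the
# masked span is recoverable from context alone.
table = {
    "Segment A": {"2023": 120.0},
    "Segment B": {"2023": 80.0},
    "Total":     {"2023": 200.0},
}
masked, gold = mask_cell(table, "Total", "2023")

# A context-aware model can recompute the masked value from the other rows:
recovered = masked["Segment A"]["2023"] + masked["Segment B"]["2023"]
print(is_faithful(recovered, gold))  # True: the span was recovered faithfully
```

A hallucinating model would emit a number that fails this consistency check even though the context fully determines the answer.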

August 8, 2025 · 3 min · Zelina

From Zero to Reasoning Hero: How R-Zero Teaches Itself Without Human Data

In AI development, removing humans from the training loop has long been a holy grail — not because people aren’t valuable, but because human labeling is expensive, slow, and fundamentally limited. R-Zero, a new framework from Tencent AI Seattle Lab, takes a decisive step in that direction: no seed dataset, no human annotations, and no external verifier. Just two AI roles — Challenger and Solver — locked in an evolutionary arms race. ...

August 8, 2025 · 3 min · Zelina

Mind the Gap: How Tool Graph Retriever Fixes LLMs’ Missing Links

In enterprise AI automation, the devil isn’t in the details—it’s in the dependencies. As LLM-powered agents gain access to hundreds or thousands of external tools, they face a simple but costly problem: finding all the right tools for the job. Most retrieval systems focus on semantic similarity—matching user queries to tool descriptions—but ignore a crucial fact: some tools can’t work without others. The result? A task that seems perfectly matched to a retrieved tool still fails, because a prerequisite tool never made it into the context window. Tool Graph Retriever (TGR) aims to solve this by making dependencies first-class citizens in retrieval. ...
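The gap described above can be sketched in two steps: a plain similarity retrieval, then a closure over the dependency graph. Everything here is a hypothetical stand-in (the word-overlap scorer, the tool names, the `depends_on` map); TGR's actual retriever and graph construction are not shown.

```python
# Sketch of dependency-aware tool retrieval in the spirit of TGR
# (illustrative only; names and the toy similarity scorer are our own).

def retrieve(query_terms, tool_descriptions, k=2):
    """Naive semantic match: rank tools by word overlap with the query."""
    scored = sorted(
        tool_descriptions,
        key=lambda t: -len(query_terms & set(tool_descriptions[t].split())),
    )
    return set(scored[:k])

def close_over_dependencies(tools, depends_on):
    """Add every (transitive) prerequisite of the retrieved tools."""
    result, stack = set(tools), list(tools)
    while stack:
        for dep in depends_on.get(stack.pop(), []):
            if dep not in result:
                result.add(dep)
                stack.append(dep)
    return result

descriptions = {
    "send_invoice": "send customer invoice email",
    "render_pdf": "render invoice document as pdf",
    "fetch_customer": "fetch customer record from crm",
}
depends_on = {"send_invoice": ["render_pdf"], "render_pdf": ["fetch_customer"]}

hits = retrieve({"send", "customer", "invoice"}, descriptions, k=1)
full = close_over_dependencies(hits, depends_on)
# Similarity alone finds only send_invoice; the dependency closure pulls in
# render_pdf and fetch_customer, so the task can actually run end to end.
```

Without the second step, `send_invoice` lands in the context window but fails at execution time because its prerequisites were never retrieved.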

August 8, 2025 · 3 min · Zelina

The Diligent but Brittle Student Inside Every LLM

If you put a large language model in a classroom for a year, what kind of student would it become? According to Simulating Human-Like Learning Dynamics with LLM-Empowered Agents, the answer isn’t flattering: most base LLMs act like “diligent but brittle surface learners”—hardworking, seemingly capable, but unable to generalize deeply.

From Psych Lab to AI Lab

Educational psychology has spent decades classifying learners into profiles like deep learners (intrinsically motivated, reflective, conceptual) and surface learners (extrinsically motivated, test-oriented, shortcut-prone). The authors built LearnerAgent, a multi-agent framework grounded in these theories, and dropped four AI ‘students’ into a simulated high school English class: ...

August 8, 2025 · 3 min · Zelina

When AI Plays Lawmaker: Lessons from NomicLaw’s Multi-Agent Debates

Large Language Models are increasingly touted as decision-making aides in policy and governance. But what happens when we let them loose together in a legislative sandbox? NomicLaw — an open-source multi-agent simulation inspired by the self-amending game Nomic — offers a glimpse into how AI agents argue, form alliances, and shape collective rules without human scripts.

The Experiment

NomicLaw pits LLM agents against legally charged vignettes — from self-driving car collisions to algorithmic discrimination — in a propose → justify → vote loop. Each agent crafts a legal rule, defends it, and votes on a peer’s proposal. Scoring is simple: 10 points for a win, 5 for a tie. Two configurations were tested: ...
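The vote-counting rule from the excerpt (10 points for a win, 5 for a tie) is simple enough to sketch. The agents' propose/justify behavior is abstracted away here, and the function name and vote encoding are our own assumptions, not NomicLaw's actual code.

```python
# Toy tally for a NomicLaw-style round: 10 points to a sole winner,
# 5 points each when proposals tie for the most votes (hypothetical helper).
from collections import Counter

def score_round(votes, win=10, tie=5):
    """votes: mapping voter -> proposal voted for. Returns points per winning proposal."""
    tally = Counter(votes.values())
    top = max(tally.values())
    winners = [p for p, n in tally.items() if n == top]
    points = win if len(winners) == 1 else tie
    return {p: points for p in winners}

# Three agents vote on each other's proposals; rule_B wins outright:
print(score_round({"A": "rule_B", "B": "rule_C", "C": "rule_B"}))
```

Running rounds of this loop over many vignettes is what lets alliance and blocking patterns emerge in the full simulation.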

August 8, 2025 · 3 min · Zelina

Forecast First, Ask Later: How DCATS Makes Time Series Smarter with LLMs

When it comes to forecasting traffic patterns, weather, or financial activity, the prevailing wisdom in machine learning has long been: better models mean better predictions. But a new approach flips this assumption on its head. Instead of chasing ever-more complex architectures, the DCATS framework (Data-Centric Agent for Time Series), developed by researchers at Visa, suggests we should first get our data in order—and let a language model do it.

The Agentic Turn in AutoML

DCATS builds on the trend of integrating Large Language Model (LLM) agents into AutoML pipelines, but with a twist. While prior systems like AIDE focus on automating model design and hyperparameter tuning, DCATS delegates a more fundamental task to its LLM agent: curating the right data. ...

August 7, 2025 · 3 min · Zelina

From GUI Novice to Digital Native: How SEAgent Teaches Itself Software Autonomously

If you’ve ever tried to automate your own software workflows using AI, you’ll know the hard part isn’t reasoning — it’s clicking the right button in a sea of ambiguous icons, drop-downs, and obscure UIs. For agents tasked with navigating GUIs like humans do, the real challenge isn’t logic — it’s context. Enter SEAgent: a self-evolving computer-use agent that doesn’t just learn to operate software — it teaches itself how to learn, using nothing but screenshots, feedback from its own past mistakes, and a clever curriculum. ...

August 7, 2025 · 4 min · Zelina

Scalpels Not Sledgehammers: A New Era of Precision Editing for LLMs

Most LLM editing approaches operate like sledgehammers—bluntly rewriting model weights and praying generalization holds. But a new method, Latent Knowledge Scalpel (LKS), dares to be surgical. Rather than changing the model itself, it targets how the model thinks—rewriting entity representations in the hidden layers, like swapping memories without touching the brain.

From Entities to Knowledge Blocks

The authors begin with a provocative observation: the internal representation (embedding) of an entity like “Alfred Nobel” doesn’t just encode a name, but a structured, meaningful knowledge block (KB). These latent vectors reflect factual associations like birthplace or occupation, and remarkably, they retain semantic and syntactic structures. For instance, swapping Nobel’s KB with that of “Shelley” shifts the model’s predicted birthplace from Sweden to England—even though the prompt wasn’t changed. ...
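The swap described in the excerpt can be mimicked with a toy stand-in. Everything below is hypothetical: the two-dimensional "KB" vectors and the lookup-style `predict_birthplace` head are placeholders for real hidden states and a frozen transformer; only the edit-by-swapping idea carries over.

```python
# Hypothetical sketch of the LKS idea: instead of editing weights, swap the
# latent "knowledge block" of one entity for another's and observe how a
# frozen readout changes. The toy vectors and readout are stand-ins.
import numpy as np

entity_kb = {  # toy latent vectors; real KBs come from hidden layers
    "Alfred Nobel": np.array([0.9, 0.1]),  # dim 0 ~ "Sweden", dim 1 ~ "England"
    "Shelley":      np.array([0.1, 0.9]),
}

def predict_birthplace(kb_vector):
    """Stand-in for the frozen model head reading the entity representation."""
    return ["Sweden", "England"][int(np.argmax(kb_vector))]

edited = dict(entity_kb)
edited["Alfred Nobel"] = entity_kb["Shelley"]  # the surgical swap; no weights touched

print(predict_birthplace(entity_kb["Alfred Nobel"]))  # before swap: Sweden
print(predict_birthplace(edited["Alfred Nobel"]))     # after swap: England
```

The appeal is locality: the prompt and the weights are untouched, yet the model's factual output changes, which is exactly the behavior the authors exploit.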

August 7, 2025 · 4 min · Zelina

Shattering the Spectrum: How PRISM Revives Signal Processing in Time-Series AI

In the race to conquer time-series classification, most modern models have sprinted toward deeper Transformers and wider convolutional architectures. But what if the real breakthrough came not from complexity—but from symmetry? Enter PRISM (Per-channel Resolution-Informed Symmetric Module), a model that merges classical signal processing wisdom with deep learning, and in doing so, delivers a stunning blow to overparameterized AI.

PRISM’s central idea is refreshingly simple: instead of building a massive model to learn everything from scratch, start by decomposing the signal like a physicist would—using symmetric FIR filters at multiple temporal resolutions, applied independently per channel. Like a prism splitting light into distinct wavelengths, PRISM separates time-series data into spectral components that are clean, diverse, and informative. ...
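The front end described above (symmetric FIR filters, multiple resolutions, per-channel application) can be sketched with NumPy. The kernel values, the strided downsampling, and the function names are illustrative assumptions, not PRISM's actual parameterization.

```python
# Minimal sketch of a PRISM-style front end as we read it: symmetric
# (linear-phase) FIR filters applied per channel at several temporal
# resolutions. Kernel and downsampling scheme are illustrative.
import numpy as np

def symmetric_fir(half):
    """Build a symmetric FIR kernel from its half, e.g. [1, 2, 3] -> [1, 2, 3, 2, 1]."""
    half = np.asarray(half, dtype=float)
    kernel = np.concatenate([half, half[-2::-1]])
    return kernel / kernel.sum()

def per_channel_multires(x, half, resolutions=(1, 2, 4)):
    """Filter each channel of x (channels x time) at each temporal resolution."""
    kernel = symmetric_fir(half)
    out = []
    for r in resolutions:
        sub = x[:, ::r]  # coarser temporal resolution via striding
        out.append(np.stack([np.convolve(ch, kernel, mode="same") for ch in sub]))
    return out

x = np.random.default_rng(0).normal(size=(3, 64))  # 3 channels, 64 time steps
feats = per_channel_multires(x, half=[1.0, 2.0, 3.0])
# Three spectral views of the same signal, one per resolution, each channel
# filtered independently -- the "prism" splitting light into components.
```

Symmetry buys linear phase (no temporal distortion), and per-channel application keeps the parameter count tiny compared with learning a full convolutional stack.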

August 7, 2025 · 3 min · Zelina

The Forest Within: How Galaxy Reinvents LLM Agents with Self-Evolving Cognition

In a field where many agents act like well-trained dogs, obediently waiting for commands, Galaxy offers something more radical: a system that watches, thinks, adapts, and evolves—without needing to be told. It’s not just an intelligent personal assistant (IPA); it’s an architecture that redefines what intelligence means for LLM-based agents. Let’s dive into why Galaxy is a leap beyond chatty interfaces and into cognition-driven autonomy.

🌳 Beyond Pipelines: The Cognition Forest

At the heart of Galaxy lies the Cognition Forest, a structured semantic space that fuses cognitive modeling and system design. Each subtree represents a facet of agent understanding: ...

August 7, 2025 · 4 min · Zelina