
Atom by Atom, Better Research: How Fine-Grained Rewards Make Agentic Search Smarter

If you’ve ever watched a web agent swing from elegant reasoning to face‑plants on basic facts, you’ve met the limits of outcome‑only training. Atom‑Searcher proposes a simple but radical fix: stop treating the whole reasoning trace as one monolith. Instead, break it down into Atomic Thoughts—the minimal, functional units of reasoning—and supervise them directly with a Reasoning Reward Model (RRM). Then blend those process‑level rewards with the final answer score using a decaying curriculum. The result? More stable training, deeper search behavior, and better generalization across in‑ and out‑of‑domain QA. ...
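
To make the reward mix concrete, here is a minimal Python sketch of a decaying process/outcome blend in the spirit described above; the linear schedule, the per-atom averaging, and the function names are illustrative assumptions, not Atom-Searcher's exact formulation.

```python
# Minimal sketch of a decaying process/outcome reward blend (illustrative only).

def blended_reward(atomic_scores, outcome_score, step, total_steps):
    """Mix per-Atomic-Thought RRM scores with the final-answer reward.

    atomic_scores : list of RRM scores, one per Atomic Thought (0..1)
    outcome_score : final-answer correctness reward (e.g. 0 or 1)
    step, total_steps : training progress, used to decay the process weight
    """
    # Process weight starts high and decays toward zero, so early training is
    # guided by reasoning quality and later training by the final answer alone.
    process_weight = max(0.0, 1.0 - step / total_steps)

    process_reward = sum(atomic_scores) / max(len(atomic_scores), 1)
    return process_weight * process_reward + (1.0 - process_weight) * outcome_score


# Example: mid-training, three atomic thoughts scored by the RRM, correct answer.
print(blended_reward([0.9, 0.4, 0.7], outcome_score=1.0, step=500, total_steps=1000))
```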

August 19, 2025 · 5 min · Zelina

When Collusion Cuts Prices: The Counterintuitive Economics of Algorithmic Bidding

Most warnings about algorithmic collusion tell the same story: sellers using AI to set prices end up coordinating—without explicit communication—to keep prices higher than competition would allow. This is what regulators fear: supra-competitive prices, reduced consumer welfare, and harder-to-detect anti-competitive behavior. A new study, however, turns that narrative on its head. By analyzing multi-dimensional decision-making—where reinforcement learning (RL) agents set both prices and advertising bids on a platform like Amazon—the authors uncover a surprising outcome: in markets with high consumer search costs, algorithmic “collusion” can push prices below competitive benchmarks. ...

August 13, 2025 · 3 min · Zelina

Search When It Hurts: How UR² Teaches Models to Retrieve Only When Needed

Most “smart” RAG stacks are actually compulsive googlers: they fetch first and think later. UR² (“Unified RAG and Reasoning”) flips that reflex. It trains a model to reason by default and retrieve only when necessary, using reinforcement learning (RL) to orchestrate the dance between internal knowledge and external evidence. Why this matters for builders: indiscriminate retrieval is the silent cost center of LLM systems—extra latency, bigger bills, brittle answers. UR² shows a way to make retrieval selective, structured, and rewarded, yielding better accuracy on exams (MMLU‑Pro, MedQA), real‑world QA (HotpotQA, Bamboogle, MuSiQue), and even math. ...
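
As a rough illustration of what “retrieve only when necessary” means in control-flow terms, the sketch below gates retrieval behind a confidence check; `answer_with_confidence`, `search`, and the threshold are hypothetical stand-ins, not UR²'s actual RL-trained policy.

```python
# Illustrative "retrieve only when needed" control flow (not UR²'s exact policy).
# `answer_with_confidence` and `search` are hypothetical helpers standing in for
# the model's internal attempt and an external retriever.

def answer_query(query, answer_with_confidence, search, threshold=0.75):
    # 1. Try to answer from parametric knowledge first.
    draft, confidence = answer_with_confidence(query, context=None)

    # 2. Only pay the retrieval cost when the model is unsure.
    if confidence >= threshold:
        return draft  # no search call, no extra latency or cost

    # 3. Fall back to retrieval-augmented answering with external evidence.
    evidence = search(query)
    refined, _ = answer_with_confidence(query, context=evidence)
    return refined
```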

August 11, 2025 · 5 min · Zelina

From Zero to Reasoning Hero: How R-Zero Teaches Itself Without Human Data

In AI development, removing humans from the training loop has long been a holy grail — not because people aren’t valuable, but because human labeling is expensive, slow, and fundamentally limited. R-Zero, a new framework from Tencent AI Seattle Lab, takes a decisive step in that direction: no seed dataset, no human annotations, and no external verifier. Just two AI roles — Challenger and Solver — locked in an evolutionary arms race. ...
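
A toy rendering of that arms race is sketched below; the `challenger`/`solver` interfaces and the self-consistency pseudo-labeling step are assumptions for illustration, not R-Zero's exact training recipe.

```python
# Toy Challenger/Solver loop (illustrative interfaces, not R-Zero's actual code).

def self_evolve(challenger, solver, rounds=3, batch=32):
    for _ in range(rounds):
        # Challenger proposes problems pitched near the Solver's current frontier.
        problems = [challenger.propose() for _ in range(batch)]

        # Solver attempts each problem several times; agreement among samples acts
        # as a pseudo-label, since there is no human annotator or external verifier.
        attempts = [solver.solve(p, n_samples=8) for p in problems]

        # Both sides update: the Solver on problems it can almost solve, the
        # Challenger toward problems that are neither trivial nor impossible.
        solver.update(problems, attempts)
        challenger.update(problems, attempts)
```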

August 8, 2025 · 3 min · Zelina

From GUI Novice to Digital Native: How SEAgent Teaches Itself Software Autonomously

If you’ve ever tried to automate your own software workflows using AI, you’ll know the hard part isn’t reasoning — it’s clicking the right button in a sea of ambiguous icons, drop-downs, and obscure UIs. For agents tasked with navigating GUIs like humans do, the real challenge isn’t logic — it’s context. Enter SEAgent: a self-evolving computer-use agent that doesn’t just learn to operate software — it teaches itself how to learn, using nothing but screenshots, feedback from its own past mistakes, and a clever curriculum. ...

August 7, 2025 · 4 min · Zelina

Thinking in Circles: How Self-Questioning LLMs Learn Without Labels

What if an LLM could learn not by reading more, but by thinking harder? That’s the radical premise behind Self-Questioning Language Models (SQLM), a framework that transforms large language models from passive learners into active generators of their own training data. No curated datasets. No labeled answers. Just a prompt — and a model that gets smarter by challenging itself. From self-play in robotics to reasoning in language: the inspiration for SQLM comes from asymmetric self-play, a technique used in robotics where one agent proposes tasks and another learns to solve them. Here, that paradigm is adapted to LLMs: ...

August 6, 2025 · 3 min · Zelina

Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

When training Large Language Models (LLMs) to reason, reinforcement learning has proven to be a powerful yet blunt instrument. Most methods reduce the entire model output to a single pass/fail reward, applying that verdict to every token—regardless of whether it contributed to success or failure. This makes credit assignment vague, verifiability weak, and learning inefficient. Enter CAPO (Credit Assignment Policy Optimization), a method that shifts the paradigm: it brings verifiable, fine-grained credit assignment to the token level, using LLMs themselves as judgment agents. ...
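
The contrast is easiest to see in code: below is a hedged sketch of outcome-only credit versus token-level credit from an LLM judge, with the `judge.critique` interface as a hypothetical placeholder rather than CAPO's actual implementation.

```python
# Outcome-only credit vs. token-level credit from an LLM judge (illustrative).
import torch

def outcome_only_advantages(num_tokens, passed):
    # Every token inherits the same pass/fail verdict, helpful or not.
    return torch.full((num_tokens,), 1.0 if passed else -1.0)

def token_level_advantages(tokens, passed, judge):
    # A hypothetical LLM judge flags which tokens actually contributed to the
    # outcome; only those tokens receive the (positive or negative) credit.
    flags = judge.critique(tokens)          # e.g. [1, 0, 0, 1, ...] per token
    base = 1.0 if passed else -1.0
    return torch.tensor([base * f for f in flags], dtype=torch.float32)
```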

August 5, 2025 · 4 min · Zelina

From Charts to Circuits: How TINs Rewire Technical Analysis for the AI Era

In a field where LSTMs, transformers, and black-box agents often dominate the conversation, a new framework dares to ask: What if our old tools weren’t wrong, just under-optimized? That’s the central premise behind Technical Indicator Networks (TINs) — a novel architecture that transforms traditional technical analysis indicators into interpretable, trainable neural networks. Indicators, meet neural networks: rather than discarding hand-crafted indicators like MACD or RSI, the TIN approach recasts them as neural network topologies. A Moving Average becomes a linear layer. MACD? A cascade of two EMAs with a subtractive node and a smoothing layer. RSI? A bias-regularized division circuit. The resulting neural networks aren’t generic function approximators; they’re directly derived from the mathematical structure of the indicators themselves. ...
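
As a flavor of the idea, here is a minimal sketch of MACD written as a cascade of two EMAs, a subtractive node, and a smoothing layer; the class layout and the 12/26/9 initial periods are illustrative choices, not the paper's code.

```python
# MACD as a small "circuit" of EMA units: the indicator's structure fixes the
# topology, while the smoothing factors would become trainable parameters in a TIN.

class EMA:
    def __init__(self, period):
        self.alpha = 2.0 / (period + 1)   # the candidate trainable parameter
        self.state = None

    def __call__(self, x):
        # Standard exponential smoothing, initialized on the first observation.
        self.state = x if self.state is None else self.alpha * x + (1 - self.alpha) * self.state
        return self.state

class MACD:
    """Two EMAs, a subtractive node, and a smoothing layer on top."""
    def __init__(self, fast=12, slow=26, signal=9):
        self.fast, self.slow, self.signal = EMA(fast), EMA(slow), EMA(signal)

    def __call__(self, price):
        macd_line = self.fast(price) - self.slow(price)
        signal_line = self.signal(macd_line)
        return macd_line, macd_line - signal_line   # MACD line and histogram

# Example usage on a short price stream.
macd = MACD()
for p in [101.0, 102.5, 101.8, 103.2]:
    print(macd(p))
```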

August 3, 2025 · 3 min · Zelina

Judo, Not Armor: Strategic Deflection as a New Defense Against LLM Jailbreaks

Large language models have come a long way in learning to say “no.” When asked to give instructions for illegal acts or harmful behavior, modern LLMs are generally aligned to refuse. But a new class of attacks—logit manipulation—sidesteps this safety net entirely. Instead of tricking the model through prompts, it intervenes after the prompt is processed, modifying token probabilities during generation. This paper introduces Strategic Deflection (SDeflection), a defense that doesn’t rely on refusal at all. Instead, it teaches the model to elegantly pivot: providing a safe, semantically adjacent answer that appears cooperative but never fulfills the malicious intent. Think of it not as a shield, but as judo—redirecting the force of the attack instead of resisting it head-on. ...

July 31, 2025 · 3 min · Zelina

Stacking Alpha: How HARLF's Three-Tier Reinforcement Learner Beats the Market

The idea of merging language models and financial algorithms isn’t new — but HARLF takes it a step further by embedding them in a hierarchical reinforcement learning (HRL) framework that actually delivers. With a stunning 26% annualized ROI and a Sharpe ratio of 1.2, this isn’t just another LLM-meets-finance paper. It’s a blueprint for how sentiment and structure can be synergistically harnessed. From FinBERT to fortune, integrating text with tickers: most financial LLM pipelines stop at score generation, classifying sentiment and calling it a signal. But HARLF builds a full sentiment pipeline using FinBERT, generating monthly sentiment scores from scraped Google News articles for each of 14 assets. These scores aren’t just inputs — they form a complete observation vector that includes: ...
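
For the sentiment leg alone, a hedged sketch might look like the snippet below, mapping a month of headlines for one asset to a score in [-1, 1]; the `ProsusAI/finbert` checkpoint and the signed-average aggregation are assumptions, since the paper's exact scoring may differ.

```python
# Monthly per-asset sentiment from FinBERT (illustrative aggregation, not HARLF's).
from transformers import pipeline

finbert = pipeline("text-classification", model="ProsusAI/finbert")

def monthly_sentiment(headlines):
    """Map one asset's news headlines for a month to a score in [-1, 1]."""
    signed = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}
    results = finbert(headlines)
    scores = [signed.get(r["label"].lower(), 0.0) * r["score"] for r in results]
    return sum(scores) / len(scores) if scores else 0.0

# Example usage for one asset-month (headlines are made up for illustration).
print(monthly_sentiment([
    "Company beats earnings expectations",
    "Regulators open probe into accounting practices",
]))
```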

July 27, 2025 · 3 min · Zelina