
Thinking in Circles: How Self-Questioning LLMs Learn Without Labels

What if an LLM could learn not by reading more, but by thinking harder? That’s the radical premise behind Self-Questioning Language Models (SQLM), a framework that transforms large language models from passive learners into active generators of their own training data. No curated datasets. No labeled answers. Just a prompt — and a model that gets smarter by challenging itself.

From Self-Play in Robotics to Reasoning in Language

The inspiration for SQLM comes from asymmetric self-play, a technique used in robotics where one agent proposes tasks and another learns to solve them. Here, that paradigm is adapted to LLMs: ...
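To make the proposer/solver split concrete, here is a minimal sketch of an asymmetric self-play loop. All names are illustrative, and the reward is a simple self-consistency proxy rather than the paper's exact objective:

```python
import random
from collections import Counter

# Minimal sketch of an asymmetric self-play loop (illustrative names, not the paper's API).
# `propose` and `solve` stand in for calls to the same underlying LLM in two roles.

def propose(topic: str) -> str:
    """Proposer role: invent a new problem from a bare topic prompt."""
    return f"Compute {random.randint(2, 9)} * {random.randint(2, 9)} given the topic '{topic}'."

def solve(question: str) -> str:
    """Solver role: attempt an answer (a real system would sample the LLM here)."""
    return str(random.randint(4, 81))

def self_consistency_reward(question: str, n_samples: int = 8) -> float:
    """Label-free reward proxy: agreement among independent solver samples."""
    answers = [solve(question) for _ in range(n_samples)]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / n_samples  # high agreement -> confident, trainable signal

for step in range(3):
    q = propose("multiplication word problems")
    r = self_consistency_reward(q)
    print(f"step {step}: reward={r:.2f} for proposed task: {q}")
    # A full system would use rewards like this to update both roles; omitted here.
```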

August 6, 2025 · 3 min · Zelina

Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

When training Large Language Models (LLMs) to reason, reinforcement learning has proven to be a powerful yet blunt instrument. Most methods reduce the entire model output to a single pass/fail reward, applying that verdict to every token—regardless of whether it contributed to success or failure. This makes credit assignment vague, verifiability weak, and learning inefficient. Enter CAPO (Credit Assignment Policy Optimization), a method that shifts the paradigm: it brings verifiable, fine-grained credit assignment to the token level, using LLMs themselves as judgment agents. ...
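A toy contrast shows why token-level credit matters. The step-level verdicts below are hard-coded for illustration; in CAPO they come from LLM judges:

```python
from typing import List

# Illustrative contrast between outcome-level and token-level credit assignment.
# The step verdicts here are stubs; CAPO obtains them from LLMs acting as judges.

def outcome_level_rewards(tokens: List[str], passed: bool) -> List[float]:
    """Baseline: every token inherits the same pass/fail verdict."""
    r = 1.0 if passed else -1.0
    return [r] * len(tokens)

def token_level_rewards(tokens: List[str], step_verdicts: List[bool]) -> List[float]:
    """Fine-grained idea: a per-step judge marks which reasoning steps were valid,
    so credit (or blame) lands only on the tokens that earned it."""
    return [1.0 if ok else -1.0 for ok in step_verdicts]

steps = ["let x = 3", "then 2x = 5", "so answer = 5"]   # second step is wrong
print(outcome_level_rewards(steps, passed=False))        # [-1.0, -1.0, -1.0]
print(token_level_rewards(steps, [True, False, False]))  # [1.0, -1.0, -1.0]
```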

August 5, 2025 · 4 min · Zelina

From Charts to Circuits: How TINs Rewire Technical Analysis for the AI Era

In a field where LSTMs, transformers, and black-box agents often dominate the conversation, a new framework dares to ask: What if our old tools weren’t wrong, just under-optimized? That’s the central premise behind Technical Indicator Networks (TINs) — a novel architecture that transforms traditional technical analysis indicators into interpretable, trainable neural networks.

Indicators, Meet Neural Networks

Rather than discarding hand-crafted indicators like MACD or RSI, the TIN approach recasts them as neural network topologies. A Moving Average becomes a linear layer. MACD? A cascade of two EMAs with a subtractive node and a smoothing layer. RSI? A bias-regularized division circuit. The resulting neural networks aren’t generic function approximators; they’re directly derived from the mathematical structure of the indicators themselves. ...
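To see what that recasting looks like, here is a small NumPy sketch of MACD as a circuit of EMA layers. Variable names and the 12/26/9 periods are the standard indicator defaults, used only for illustration; a TIN-style network would make the decay weights trainable:

```python
import numpy as np

# Sketch of the TIN idea for MACD: each EMA is a recurrent linear unit, and the
# full indicator is a tiny circuit of such units with a subtractive node.

def ema_layer(x: np.ndarray, period: int) -> np.ndarray:
    """EMA as a recurrent linear unit: y_t = alpha * x_t + (1 - alpha) * y_{t-1}."""
    alpha = 2.0 / (period + 1.0)        # the weight a TIN would expose for training
    out = np.empty_like(x)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1 - alpha) * out[t - 1]
    return out

def macd_network(prices: np.ndarray) -> np.ndarray:
    """MACD as a circuit: two EMA layers, a subtractive node, a smoothing layer."""
    fast, slow = ema_layer(prices, 12), ema_layer(prices, 26)
    macd_line = fast - slow             # subtractive node
    signal = ema_layer(macd_line, 9)    # smoothing layer
    return macd_line - signal           # histogram output

prices = np.cumsum(np.random.default_rng(0).normal(0, 1, 200)) + 100
print(macd_network(prices)[-5:])
```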

August 3, 2025 · 3 min · Zelina

Judo, Not Armor: Strategic Deflection as a New Defense Against LLM Jailbreaks

Large language models have come a long way in learning to say “no.” When asked to give instructions for illegal acts or harmful behavior, modern LLMs are generally aligned to refuse. But a new class of attacks—logit manipulation—sidesteps this safety net entirely. Instead of tricking the model through prompts, it intervenes after the prompt is processed, modifying token probabilities during generation. This paper introduces Strategic Deflection (SDeflection), a defense that doesn’t rely on refusal at all. Instead, it teaches the model to elegantly pivot: providing a safe, semantically adjacent answer that appears cooperative but never fulfills the malicious intent. Think of it not as a shield, but as judo—redirecting the force of the attack instead of resisting it head-on. ...
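As a toy picture of the attack surface (made-up vocabulary and scores), logit manipulation edits next-token scores at decode time, after any prompt-level alignment has already run:

```python
import numpy as np

# Toy illustration of logit manipulation: the attacker edits the model's
# next-token scores after the prompt is processed, so prompt-level refusal
# training never gets a chance to act. Vocabulary and values are invented.

vocab = ["Sure", "I", "cannot", "help", "Sorry"]
logits = np.array([1.0, 2.0, 4.0, 1.5, 3.5])      # aligned model favors a refusal

def manipulate(logits: np.ndarray, suppress: list, boost: list) -> np.ndarray:
    edited = logits.copy()
    edited[suppress] -= 10.0                        # push refusal tokens out of reach
    edited[boost] += 10.0                           # pull compliant tokens forward
    return edited

def sample_greedy(logits: np.ndarray) -> str:
    return vocab[int(np.argmax(logits))]

print(sample_greedy(logits))                                          # "cannot"
print(sample_greedy(manipulate(logits, suppress=[2, 4], boost=[0])))  # "Sure"
```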

July 31, 2025 · 3 min · Zelina

Stacking Alpha: How HARLF's Three-Tier Reinforcement Learner Beats the Market

The idea of merging language models and financial algorithms isn’t new — but HARLF takes it a step further by embedding them in a hierarchical reinforcement learning (HRL) framework that actually delivers. With a stunning 26% annualized ROI and a Sharpe ratio of 1.2, this isn’t just another LLM-meets-finance paper. It’s a blueprint for how sentiment and structure can be synergistically harnessed.

From FinBERT to Fortune: Integrating Text with Tickers

Most financial LLM pipelines stop at score generation: classify sentiment and call it a signal. But HARLF builds a full sentiment pipeline using FinBERT, generating monthly sentiment scores from scraped Google News articles for each of 14 assets. These scores aren’t just inputs — they form a complete observation vector that includes: ...
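A rough sketch of the sentiment leg of that pipeline might look like the following; the scorer and headlines are stubs, and the truncated feature list above is not reproduced:

```python
from collections import defaultdict
from statistics import mean

# Sketch of the monthly sentiment-aggregation step described above.
# `finbert_score` is a stub; a real pipeline would run the FinBERT classifier
# and map its positive/negative probabilities to a signed score.

def finbert_score(headline: str) -> float:
    """Stand-in for FinBERT: return a sentiment score in [-1, 1]."""
    return 0.4 if "beats" in headline.lower() else -0.2

headlines = {
    ("GLD", "2024-05"): ["Gold beats expectations", "Dollar strength weighs on gold"],
    ("SPY", "2024-05"): ["Index slips on rate fears"],
}

monthly_sentiment = defaultdict(float)
for (asset, month), texts in headlines.items():
    monthly_sentiment[(asset, month)] = mean(finbert_score(t) for t in texts)

print(dict(monthly_sentiment))
# In HARLF, per-asset monthly scores like these feed into the observation
# vector described above.
```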

July 27, 2025 · 3 min · Zelina

When Learning Goes Rogue: Fixing RL Biases in Economic Simulations

Reinforcement Learning (RL) has become a seductive tool for economists seeking to simulate adaptive behavior in dynamic, uncertain environments. But when it comes to modeling firms in equilibrium labor markets, this computational marriage reveals some serious incompatibilities. In a recent paper, Zhang and Chen expose two critical mismatches that emerge when standard RL is naively applied to simulate economic models — and offer a principled fix that merges the best of RL and economic theory. ...

July 27, 2025 · 4 min · Zelina

Can You Spot the Bot? Why Detectability, Not Deception, Is the New AI Frontier

In an age where generative models can ace SATs, write novels, and mimic empathy, it’s no longer enough to ask, “Can an AI fool us?” The better question is: Can we still detect it when it does? That’s the premise behind the Dual Turing Test, a sharp reframing of the classic imitation game. Rather than rewarding AI for successfully pretending to be human, this framework challenges judges to reliably detect AI—even when its responses meet strict quality standards. ...

July 26, 2025 · 4 min · Zelina

Think Twice, Then Speak: Deliberative Searcher and the Future of Reliable LLMs

When a large language model (LLM) answers your question with a high degree of confidence, do you trust it? What if it’s wrong—but still confident? The stakes are high in real-world applications, from legal guidance to enterprise decision support. Yet today’s LLMs remain notoriously unreliable in aligning their confidence with correctness. The paper Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints (Yin et al., 2025) offers a bold response: rewire LLMs to be reasoning-primary and information-secondary. Instead of front-loading search and passively absorbing evidence, Deliberative Searcher acts more like a prudent investigator: it thinks, self-assesses, retrieves external information only when needed, and calibrates its confidence step-by-step. Crucially, it learns this behavior through a custom constrained reinforcement learning regime. ...
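A minimal control-loop sketch of that behavior, with stub functions standing in for the LLM and the search API, might look like this:

```python
# Sketch of the reasoning-primary, retrieval-secondary loop described above.
# All functions are stubs; confidence here is a toy formula, not a trained estimate.

def reason(question: str, evidence: list) -> tuple:
    """Return a draft answer and the model's self-assessed confidence."""
    confidence = min(0.5 + 0.2 * len(evidence), 0.95)   # toy: more evidence, more confidence
    return f"draft answer using {len(evidence)} documents", confidence

def retrieve(question: str) -> str:
    """Stand-in for an external search call, used only when confidence is low."""
    return f"retrieved passage about: {question}"

def deliberative_answer(question: str, threshold: float = 0.8, max_steps: int = 3):
    evidence = []
    for _ in range(max_steps):
        answer, confidence = reason(question, evidence)
        if confidence >= threshold:            # confident enough: stop and answer
            return answer, confidence
        evidence.append(retrieve(question))    # otherwise fetch more information
    return answer, confidence

print(deliberative_answer("Who proposed constrained RL for retrieval?"))
```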

July 23, 2025 · 3 min · Zelina

Simulate First, Invest Later: How Diffusion Models Are Reinventing Portfolio Optimization

What if you could simulate thousands of realistic futures for the market, all conditioned on what’s happening today—and then train an investment strategy on those futures? That’s the central idea behind a bold new approach to portfolio optimization that blends score-based diffusion models with reinforcement learning, and it’s showing results that beat classic benchmarks like the S&P 500 and traditional Markowitz portfolios. ...
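In skeleton form, the simulate-first loop looks something like this; the conditional generator below is a stub where the paper's score-based diffusion model would sit:

```python
import numpy as np

# Sketch of the simulate-first, invest-later loop: draw many synthetic return paths
# conditioned on today's state, then evaluate (or train) an allocation policy on them.

rng = np.random.default_rng(1)

def sample_paths(today_state: np.ndarray, n_paths: int, horizon: int, n_assets: int) -> np.ndarray:
    """Stand-in conditional generator: returns an array of shape (n_paths, horizon, n_assets)."""
    drift = 0.0005 * today_state          # toy conditioning on today's features
    return drift + rng.normal(0, 0.01, size=(n_paths, horizon, n_assets))

def evaluate_weights(weights: np.ndarray, paths: np.ndarray) -> float:
    """Average terminal log-wealth of a fixed-weight portfolio across simulated futures."""
    portfolio_returns = paths @ weights                   # (n_paths, horizon)
    growth = np.log1p(portfolio_returns).sum(axis=1)      # log-wealth per path
    return float(growth.mean())

today = np.ones(3)
paths = sample_paths(today, n_paths=1000, horizon=21, n_assets=3)
equal_weight = np.ones(3) / 3
print("avg simulated log-growth:", evaluate_weights(equal_weight, paths))
# An RL agent would adjust the weights (or a full policy) to maximize this objective.
```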

July 20, 2025 · 4 min · Zelina

Fine-Tuning Isn’t Just Supervised: Why SFT Is Really RL in Disguise

In the arms race to align large language models (LLMs), supervised fine-tuning (SFT) and reinforcement learning (RL) are often painted as competing paradigms. SFT is praised for its stability and simplicity; RL is heralded for its theoretical soundness and alignment fidelity. But what if this dichotomy is an illusion? A recent preprint from Chongli Qin and Jost Tobias Springenberg makes a bold and elegant claim: SFT on curated data is not merely supervised learning—it is actually optimizing a lower bound on the RL objective. ...
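One way to see the claim, in simplified notation of my own (the preprint handles the general case): for a non-negative reward $R(x,y)$ and curated data drawn from $\mu(y\mid x)$, importance sampling plus Jensen's inequality give

$$
\log \mathbb{E}_{y\sim\pi_\theta}\big[R(x,y)\big]
= \log \mathbb{E}_{y\sim\mu}\!\left[\frac{\pi_\theta(y\mid x)}{\mu(y\mid x)}\,R(x,y)\right]
\;\ge\; \mathbb{E}_{y\sim\mu}\!\big[\log \pi_\theta(y\mid x)\big]
+ \mathbb{E}_{y\sim\mu}\!\left[\log \frac{R(x,y)}{\mu(y\mid x)}\right].
$$

The second expectation does not depend on $\theta$, so maximizing the SFT log-likelihood on reward-filtered data pushes up a lower bound on the log of the RL objective.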

July 18, 2025 · 4 min · Zelina