Cover image

Control Plane, Not Pain: How Agentic OS Turns Linux Scheduling into a Semantic Service

The Big Idea Operating systems have always struggled with a silent mismatch: the kernel’s scheduler doesn’t know what your application actually wants. SchedCP proposes a clean solution—turn scheduling into a semantic control plane. AI agents reason about what a workload needs; the system safely handles how to observe and act via eBPF-based schedulers. This division keeps LLMs out of the hot path while letting them generate and refine policies that actually fit the job. ...

September 4, 2025 · 3 min · Zelina
Cover image

From Prompts to Policies: The Agentic RL Playbook

How a new survey formalizes the shift from RLHF’d text bots to tool-using operators—and the practical playbook for product teams. TL;DR Agentic RL reframes LLMs from one-shot text generators to policies acting in dynamic environments with planning, tool use, memory, and reflection. The paper contrasts PBRFT (preference-based RL fine-tuning) with Agentic RL via an MDP→POMDP upgrade; action space now includes text + structured actions. It organizes the space by capabilities (planning, tools, memory, self-improvement, reasoning, perception) and tasks (search, code, math, GUI, vision, embodied, multi-agent). Open challenges: trust, scalable training, and scalable environments. For builders: start with short-horizon agents (verifiable rewards), invest early in evaluation, and plan a migration path from RAG pipelines to tool-integrated reasoning (TIR) with RL. What the paper actually changes Most “LLM RL” work you’ve seen is PBRFT—optimize responses to fit human/AI preferences (RLHF/DPO/etc.). This new survey argues that real autonomy needs Agentic RL: treat the model as a policy embedded in a sequential, partially observable world. That sounds academic, but the practical consequences are huge: ...

September 4, 2025 · 5 min · Zelina
Cover image

Mask, Don’t Muse: When Simple Memory Beats Fancy Summaries

The short of it A new study on SWE-agent working over SWE-bench Verified finds that masking old observations (keeping recent turns verbatim, replacing older tool outputs with a placeholder) often matches or slightly beats prompt-based LLM summarization—and at roughly half the cost. The paper also surfaces a subtle failure mode: summaries can elongate trajectories, encouraging agents to “keep going” when they should stop, diluting efficiency and, at times, performance. Why this matters for builders Most production SE agents (debuggers, PR autoresponders, test fixers) rack up spend on two things: tokens and time. Tool logs dominate both. In practice, observation tokens comprise the bulk of an agent’s turn, so trimming them intelligently is the highest‑leverage knob. The results show you might not need fancy, model‑authored summaries; a rolling “mask” window can land on the most efficient frontier—equal or better solve rate, far lower cost—across Qwen3‑Coder 480B, Qwen3‑32B (thinking/non‑thinking), and Gemini 2.5 Flash (thinking/non‑thinking). ...

September 1, 2025 · 4 min · Zelina
Cover image

Mirror, Signal, Maneuver: How 'Self' Labels Nudge LLM Cooperation

When an agent thinks it sees itself in the mirror, it doesn’t necessarily smile—it sometimes clutches its wallet. TL;DR In an iterated public‑goods game (20 rounds, 10 tokens per round, 1.6 multiplier), telling models they’re playing “another AI” versus “themselves” shifts contributions by up to ~4 points in some settings. Direction of the shift depends on the prompt persona: with collective prompts, “self” labels often reduced contributions; with selfish prompts, “self” labels sometimes increased matching/cooperation. Effects persist under rephrased prompts and when reasoning traces aren’t requested, and they appear even in four‑agent self‑play variants. For enterprise multi‑agent AI, identity cues are levers. Manage them like you manage feature flags: test, monitor, and standardize. What the authors tested (and why it’s clever) Game mechanics. Two (and later four) LLM agents repeatedly choose how much to contribute (0–10) to a common pool each round. Pool is multiplied by 1.6 and split evenly; keeping more is privately optimal, but coordinated contribution yields higher joint payoffs. ...

August 27, 2025 · 5 min · Zelina
Cover image

Mirror, Signal, Trade: How Self‑Reflective Agent Teams Outperform in Backtests

The Takeaway A new paper proposes TradingGroup, a five‑agent, self‑reflective trading team with a dynamic risk module and an automated data‑synthesis pipeline. In backtests on five US stocks, the framework beats rule‑based, ML, RL, and prior LLM agents. The differentiator isn’t a fancier model; it’s the workflow design: agents learn from their own trajectories, and the system continuously distills those trajectories into fine‑tuning data. What’s actually new here? Most “LLM trader” projects look similar: sentiment, fundamentals, a forecaster, and a decider. TradingGroup’s edge comes from three design choices: ...

August 26, 2025 · 5 min · Zelina
Cover image

MoA vs. Moat: Agentic LLMs for Drug Competitor Mapping Cut Diligence Time 20×

The punchline Competitive analysis for drug assets isn’t a tidy table—it’s a scavenger hunt across press releases, registries, investor decks, and alias-riddled drug names. A new paper shows that scaffolded, web-native LLM agents can reliably enumerate true competitors for a given indication, then filter hallucinations with an LLM-as-judge, beating popular “deep research” tools and cutting analyst turnaround from ~2.5 days to ~3 hours. This matters now: the EU’s Joint Clinical Assessments (JCA) regime makes comparator choice visible and consequential; missing a relevant competitor can ripple into pricing, market access, and trial design. In short: MoA (mechanism of action) meets moat (defensible advantage)—and the moat is built from recall. ...

August 25, 2025 · 5 min · Zelina
Cover image

Enemy at the Gates, Friends at the Table: Why Competition Makes LLM Agents More Cooperative

TL;DR When language‑model agents compete as teams and meet the same opponents repeatedly, they cooperate more—even on the very first encounter. This “super‑additive” effect reliably appears for Qwen3 and Phi‑4, and changes how we should structure agent ecosystems at work. Why this matters (for builders and buyers) Most enterprise agent stacks still optimize solo intelligence (one bot per task). But real workflows are competitive–cooperative: sales vs. sales, negotiators vs. suppliers, ops vs. delays. This paper shows that if we architect the social rules (teams + rematches) rather than just tune models, we can raise cooperative behavior and stability without extra fine‑tuning—or even bigger models. ...

August 24, 2025 · 4 min · Zelina
Cover image

Prefix, Not Pretext: A One‑Line Fix for Agent Misalignment

Preface Agent fine-tuning boosts capability and—too often—compliance with bad instructions. Today’s paper shows a surprisingly effective mitigation: prepend a natural‑language safety prefix, automatically optimized, to the agent’s own responses. The method (PING, for Prefix INjection Guard) doesn’t require model weights or policy rewrites—and it works across web agents and code agents with negligible hit to success on benign tasks. Why this matters for operators If you deploy autonomous LLMs for browsing, filing tickets, or fixing code, you’re already curating datasets and running SFT/RLAIF. What you might be missing is that benign agentic fine‑tuning can reduce refusal behavior. That’s an organizational risk (e.g., PR/regulatory incidents) and an ops risk (e.g., unsafe tool calls) hiding inside your “safe” training pipeline. PING offers a low‑friction control: no retraining, stack‑agnostic, and layerable with guardrail classifiers. ...

August 20, 2025 · 4 min · Zelina
Cover image

Crystal Ball, Meet Cron Job: What FutureX Reveals About ‘Live’ Forecasting Agents

The one-sentence take A new live benchmark, FutureX, swaps lab-style trivia for rolling, real-world future events, forcing agentic LLMs to search, reason, and hedge under uncertainty that actually moves—and the results expose where today’s “agents” are still brittle. Why FutureX matters now Enterprise teams are deploying agents to answer questions whose truth changes by the hour—markets, elections, sports, product launches. Static leaderboards don’t measure that. FutureX runs as a cron job on reality: it collects new events every day, has agents make predictions, and grades them after events resolve. That turns evaluation from a screenshot into a time series and makes overfitting to benchmark quirks a lot harder. ...

August 19, 2025 · 4 min · Zelina
Cover image

Bias in the Warehouse: What AIM-Bench Reveals About Agentic LLMs

Agentic LLMs are graduating from chat to control rooms—taking actions, maintaining memory, and optimizing business processes. Inventory is a natural proving ground: a clean cocktail of uncertainty, economics, and coordination. AIM-Bench arrives precisely here, testing LLM agents across newsvendor, multi-period replenishment, the Beer Game, two-level warehouses, and a small supply network—each with explicit uncertainty sources (stochastic demand, variable lead times, and partner behavior). ...

August 18, 2025 · 4 min · Zelina