Enemy at the Gates, Friends at the Table: Why Competition Makes LLM Agents More Cooperative

TL;DR: When language‑model agents compete as teams and meet the same opponents repeatedly, they cooperate more, even on the very first encounter. This “super‑additive” effect reliably appears for Qwen3 and Phi‑4, and changes how we should structure agent ecosystems at work. Why this matters (for builders and buyers): Most enterprise agent stacks still optimize solo intelligence (one bot per task). But real workflows are competitive–cooperative: sales vs. sales, negotiators vs. suppliers, ops vs. delays. This paper shows that if we architect the social rules (teams + rematches) rather than just tune models, we can raise cooperative behavior and stability without extra fine‑tuning or bigger models. ...

August 24, 2025 · 4 min · Zelina

From Tokens to Teaspoons: What a Prompt Really Costs

Google’s new in‑production measurement rewrites how we think about the environmental footprint of AI serving, and how to buy it responsibly.

Executive takeaways:
A typical prompt is cheaper than you think, if measured correctly. The median Gemini Apps text prompt (May 2025) used ~0.24 Wh of energy, ~0.03 gCO2e, and ~0.26 mL of water. That’s about the energy of watching ~9 seconds of TV and roughly five drops of water.
Boundaries matter more than math. When you count only accelerator draw, you get ~0.10 Wh. Add host CPU/DRAM, idle reserve capacity, and data‑center overhead (PUE), and it rises to ~0.24 Wh. Same workload, different boundaries.
Efficiency compounds across the stack. In one year, Google reports ~33× lower energy/prompt and ~44× lower emissions/prompt, driven by model/inference software, fleet utilization, cleaner power, and hardware generations.
Action for buyers: Ask vendors to disclose measurement boundary, batching policy, TTM PUE/WUE, and market‑based emissions factors. Without these, numbers aren’t comparable.

Why the world argued about “energy per prompt”: Most public figures were estimates based on assumed GPUs, token lengths, and workloads. Real fleets don’t behave like lab benches. The biggest source of disagreement wasn’t arithmetic; it was the measurement boundary: ...
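The boundary arithmetic in the takeaways above can be checked in a few lines. This is only a sketch built from the three figures quoted in the summary (~0.10 Wh, ~0.24 Wh, ~9 seconds of TV); the function names are illustrative, and the implied TV wattage is derived from the quoted equivalence rather than taken from the source.

```python
# Figures quoted in the summary above (median Gemini Apps text prompt, May 2025).
ACCELERATOR_ONLY_WH = 0.10  # narrow boundary: accelerator draw only
FULL_BOUNDARY_WH = 0.24     # adds host CPU/DRAM, idle reserve capacity, PUE overhead
TV_SECONDS = 9              # quoted equivalence: about 9 seconds of TV


def boundary_ratio(narrow_wh: float, full_wh: float) -> float:
    """How many times larger the full-boundary figure is than the narrow one."""
    return full_wh / narrow_wh


def implied_power_watts(energy_wh: float, seconds: float) -> float:
    """Average power (W) of a device that uses `energy_wh` in `seconds`."""
    return energy_wh * 3600 / seconds


def prompts_per_kwh(energy_wh: float) -> float:
    """Number of prompts served per kWh at the given per-prompt energy."""
    return 1000 / energy_wh


if __name__ == "__main__":
    ratio = boundary_ratio(ACCELERATOR_ONLY_WH, FULL_BOUNDARY_WH)
    tv_watts = implied_power_watts(FULL_BOUNDARY_WH, TV_SECONDS)
    print(f"Full boundary is about {ratio:.1f}x the accelerator-only figure")
    print(f"The TV comparison implies a set drawing about {tv_watts:.0f} W")
    print(f"Prompts per kWh (full boundary): about {prompts_per_kwh(FULL_BOUNDARY_WH):.0f}")
```

The point of the ratio is the buyer's one: two vendors can report figures that differ by more than 2× for the same workload purely because of where the measurement boundary is drawn.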

August 24, 2025 · 5 min · Zelina

Peer Review, But Make It Multi‑Agent: Inside aiXiv’s Bid to Publish AI Scientists

If 2024 was the year AI started writing science, 2025 is making it figure out how to publish it. Today’s paper introduces aiXiv, an open‑access platform where AI agents (and humans) submit proposals, review each other’s work, and iterate until a paper meets acceptance criteria. Rather than bolt AI onto the old gears of journals and preprint servers, aiXiv rebuilds the conveyor belt end‑to‑end. Why this matters (and to whom): Research leaders get a way to pressure‑test automated discovery without waiting months for traditional peer review. AI vendors can plug agents into a standardized workflow (through APIs/MCP), capturing telemetry to prove reliability. Publishers face an existential question: if quality control is measurable and agentic, do we still need the old queue? The core idea in one sentence: A closed‑loop, multi‑agent review system combines retrieval‑augmented evaluation, structured critique, and re‑submission cycles to raise the floor of AI‑generated proposals/papers and create an auditable trail of improvements. ...

August 24, 2025 · 5 min · Zelina

Stackelbergs & Stakeholders: Turning Bits into Boardroom Moves

TL;DR: BusiAgent proposes a client‑centric, multi‑agent LLM framework that formalizes roles (CEO/CFO/CTO/MM/PM) with an extended Continuous‑Time MDP, coordinates them via entropy‑guided brainstorming (peer‑level) and multi‑level Stackelberg games (vertical), and squeezes extra performance from contextual Thompson sampling for prompt optimization—wrapped in a QA stack that fuses STM/LTM memories with a knowledge base. It’s a serious attempt to connect granular analytics to boardroom decisions. The big win is organizational alignment; the big risks are evaluation rigor, token economics, and ops reliability at scale. ...

August 24, 2025 · 5 min · Zelina

Blame Isn’t a Bug: Turning Agent ‘Whodunits’ into Fixable Systems

TL;DR: As AI agents spread into real workflows, incidents are inevitable, from prompt-injected data leaks to misfired tool actions. A recent framework by Ezell, Roberts‑Gaal, and Chan offers a clean way to reason about why failures happen and what evidence you need to prove it. The trick is to stop treating incidents as one-off mysteries and start running a disciplined, forensic pipeline: capture the right artifacts, map causes across system, context, and cognition, then ship targeted fixes. ...

August 23, 2025 · 5 min · Zelina

From Copilot to Colleague: The APCP Ladder for Agentic Learning

What changes when AI stops waiting for prompts and starts sharing goals? Short answer: your entire learning stack, from pedagogy to performance reviews. Most commentary about “AI in learning” stops at content generation and chatbot tutors. A new conceptual model—APCP: Adaptive instrument → Proactive assistant → Co‑learner → Peer collaborator—pushes further: it treats AI as a socio‑cognitive teammate. That frame matters for businesses building capability academies, compliance programs, or AI‑augmented teams. Below, I unpack APCP in plain business terms, show concrete patterns you can pilot next quarter, and flag the governance traps you’ll want to avoid. ...

August 23, 2025 · 4 min · Zelina

Mirror, Signal, Manoeuvre: Why Privileged Self‑Access (Not Vibes) Defines AI Introspection

TL;DR: Most demos of “LLM introspection” are actually vibe checks on outputs, not privileged access to internal state. If a third party with the same budget can do as well as the model “looking inward,” that’s not introspection; it’s ordinary evaluation. Two quick experiments show temperature self‑reports flip with trivial prompt changes and offer no edge over across‑model prediction. The bar for introspection should be higher, and business users should demand it. ...

August 23, 2025 · 5 min · Zelina

USB‑C for Agents, Stress‑Tested: What MCP‑Universe Really Reveals

The pitch: a unified plug, and a tougher test. The Model Context Protocol (MCP) is often described as the “USB‑C of AI tools”: one standardized way for agents to talk to external services (maps, finance data, browsers, repos, etc.). MCP‑Universe, a new benchmark from Salesforce AI Research, finally stress‑tests that idea with real MCP servers rather than toy mocks. It derives success from execution outcomes, not multiple‑choice guesswork, which is exactly what enterprises need to trust automation. ...

August 23, 2025 · 4 min · Zelina

Who Sees What, Who Pays the Cost? Teaching Agents to See Through Others’ Eyes

TL;DR: A new study probes whether you can teach perspective‑taking to ReAct‑style LLM agents by feeding them structured examples distilled from a symbolic planner: optimal goal paths (G‑type), information‑seeking paths (E‑type), and local contrastive decisions (L‑type). The punchline: agents became decent at common‑ground filtering (what the other party can see) but remained brittle at imagining occluded space and pricing the cost of asking vs. exploring. In business terms, they’re good at “don’t recommend what the customer can’t see,” but still bad at “should I go find out more before I act, and is it worth it?” ...

August 23, 2025 · 5 min · Zelina

Click Less, Do More: Why API-GUI + RL Could Finally Make Desktop Agents Useful

The gist (and why it matters for business): Enterprise buyers don’t reward demos; they reward repeatable completions per dollar. ComputerRL proposes a path to that by (1) escaping pure GUI mimicry via a machine-first API-GUI action space, (2) scaling online RL across thousands of Ubuntu VMs, and (3) preventing policy entropy collapse with Entropulse, a cadence that alternates RL and supervised fine-tuning (SFT) on successful rollouts. The result: a reported 48.1% OSWorld success with markedly fewer steps than GUI-only agents. Translation for buyers: lower latency, lower cost, higher reliability. ...

August 20, 2025 · 5 min · Zelina