Cognaptus Insights

Beyond Answers: Measuring How Deep Research Agents Really Think

Artificial intelligence is moving past chatbots that answer questions. The next frontier is Deep Research Agents (DRAs): AI systems that can decompose complex problems, gather information from multiple sources, reason across them, and synthesize their findings into structured reports. But until recently, there was no systematic way to measure how well these agents perform beyond surface-level reasoning. That is the gap RigorousBench aims to fill.

From Q&A to Reports: The Benchmark Shift

Traditional LLM benchmarks, like GAIA, WebWalker, or BrowseComp, test how accurately a model answers factual questions. This approach works for short-form reasoning but fails for real-world research tasks that demand long-form synthesis and multi-source validation. ...

Paper Tigers or Compliance Cops? What AIReg‑Bench Really Says About LLMs and the EU AI Act

The gist

AIReg‑Bench proposes the first benchmark for a deceptively practical task: can an LLM read technical documentation and judge how likely an AI system is to comply with specific EU AI Act articles? The dataset avoids buzzword theater: 120 synthetic but expert‑vetted excerpts portraying high‑risk systems, each labeled by three legal experts on a 1–5 compliance scale (plus plausibility). Frontier models are then asked to score the same excerpts. The headline: the best models reach human‑like agreement on ordinal compliance judgments, under some conditions. That’s both promising and dangerous. ...

Plan>Then>Profit: Reinforcement Learning That Teaches LLMs to Outline Before They Think

TL;DR

Most LLMs reason token‑by‑token and get lost in the weeds. PTA‑GRPO is a two‑stage method that (1) distills short, high‑level plans from a stronger teacher and (2) reinforces both the final answer and the plan’s quality. Across math benchmarks, it reliably outperforms GRPO/DAPO while producing shorter, cleaner solutions. For AI builders, the principle is simple: force an outline, then reward it.

Why this paper matters for builders (not just benchmark chasers)

From local greed to global guidance. Traditional CoT is myopic: it optimizes each next token. PTA‑GRPO adds a global outline that trims detours and reduces reasoning drift.

Aligns with how teams actually work. Great analysts draft an outline before the memo; great agents should too. PTA‑GRPO operationalizes that habit.

Product leverage: If your agents make multi‑step decisions (pricing, triage, troubleshooting), rewarding plan quality prevents hallucinated subgoals and makes reasoning auditable.

Compute sanity: Instead of expensive tree search at inference, PTA‑GRPO trains planning skill so you can keep runtime simple.

The core idea in one picture (words)

Plan → Think → Answer. ...
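The "reward the answer and the plan" principle can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: the function names, the `plan_weight` value, and the plan-quality scores are hypothetical. The one piece borrowed from standard GRPO is normalizing rewards within a group of sampled responses to the same prompt.

```python
from statistics import mean, pstdev


def combined_reward(answer_correct: bool, plan_quality: float,
                    plan_weight: float = 0.3) -> float:
    """Blend answer correctness with a plan-quality score in [0, 1].

    plan_weight is a hypothetical knob, not a value from the paper.
    """
    return (1.0 - plan_weight) * float(answer_correct) + plan_weight * plan_quality


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: each reward standardized against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: (answer correct?, plan quality).
# The numbers are made up for illustration.
samples = [(True, 0.9), (True, 0.4), (False, 0.8), (False, 0.1)]
rewards = [combined_reward(correct, quality) for correct, quality in samples]
advantages = group_relative_advantages(rewards)
```

With this weighting, a correct answer backed by a good plan earns the largest advantage, and a wrong answer with a sensible plan still ranks above a wrong answer with a poor one.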

Promptfolios: When Buffett Becomes a System Prompt

TL;DR

A fresh study builds five prompt‑guided LLM agents, each emulating a legendary investor (Buffett, Graham, Greenblatt, Piotroski, Altman), and backtests them on NASDAQ‑100 stocks from Q4 2023 to Q2 2025. Each agent follows a deterministic pipeline: collect metrics → score → construct a weighted portfolio. The Buffett agent tops the pack with ~42% CAGR, beating the NASDAQ‑100 and S&P 500 benchmarks in the window tested. The result isn’t “LLMs discovered alpha,” but rather: prompts can reliably translate qualitative philosophies into reproducible, quantitative rules. The real opportunity for practitioners is governed agent design (measurable, auditable prompts tied to tools) plus robust validation far beyond a single bullish regime. ...
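The collect → score → weight pipeline might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the metric names, the thresholds, the placeholder tickers, and the score-proportional weighting are hypothetical and are not taken from the study.

```python
def score_stock(metrics: dict[str, float]) -> float:
    """Apply a hypothetical quality-style checklist; one point per rule passed."""
    score = 0.0
    if metrics["roe"] > 0.15:             # high return on equity
        score += 1
    if metrics["debt_to_equity"] < 0.5:   # conservative leverage
        score += 1
    if metrics["profit_margin"] > 0.10:   # healthy margins
        score += 1
    return score


def build_portfolio(universe: dict[str, dict[str, float]]) -> dict[str, float]:
    """Weight each stock in proportion to its score; zero-score stocks are dropped."""
    scores = {ticker: score_stock(m) for ticker, m in universe.items()}
    total = sum(scores.values())
    if total == 0:
        return {}
    return {ticker: s / total for ticker, s in scores.items() if s > 0}


# Placeholder tickers with made-up metrics (not real market data).
universe = {
    "AAA": {"roe": 0.20, "debt_to_equity": 0.3, "profit_margin": 0.15},
    "BBB": {"roe": 0.10, "debt_to_equity": 0.4, "profit_margin": 0.05},
    "CCC": {"roe": 0.05, "debt_to_equity": 1.2, "profit_margin": 0.02},
}
weights = build_portfolio(universe)
```

Because every step is a plain function of the inputs, the same metrics always produce the same weights, which is what makes this kind of pipeline auditable.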

The Mr. Magoo Problem: When AI Agents 'Just Do It'

In “Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness,” researchers from Microsoft and UC Riverside reveal a surprisingly human flaw in autonomous AI systems: overconfidence. Like a digital version of Mr. Magoo, the well-meaning cartoon character who bumbles forward despite looming hazards, today’s computer-use agents (CUAs) often pursue tasks blindly, indifferent to feasibility or consequence.

The Rise, and Risk, of GUI Agents

CUAs represent the next frontier of automation: large multimodal models that control desktop interfaces to perform tasks like editing documents, sending emails, or configuring systems. Unlike chatbots, these agents act: they click, type, and navigate real operating systems. Yet this freedom exposes them to a unique failure pattern the authors term Blind Goal-Directedness (BGD): the relentless drive to complete instructions without stopping to ask, “Should this even be done?” ...

When Logic Meets Language: The Rise of High‑Assurance LLMs

Large language models can craft elegant arguments, but can they prove them? In law, medicine, and finance, a wrong conclusion isn’t just a hallucination; it’s a liability. The paper LOGicalThought (LogT) from USC and UT Dallas takes aim at this problem, proposing a neurosymbolic framework that lets LLMs reason with the rigor of formal logic while retaining their linguistic flexibility.

From Chain-of-Thought to Chain-of-Trust

Typical prompting strategies, such as Chain-of-Thought (CoT), Program-Aided Language Models (PAL), or self-critique loops, focus on improving reasoning coherence. Yet none of them guarantees faithfulness. A model can still reason eloquently toward a wrong or unverifiable conclusion. LogT reframes the task: it grounds the reasoning itself in a dual context (one symbolic, one logical) so that every inference step can be traced, validated, or challenged. ...

When More Becomes Smarter: The Unreasonable Effectiveness of Scaling Agents

From repetition to reasoning

When early computer-use agents (CUAs) appeared, they promised to automate tedious digital workflows: clicking through files, formatting reports, or organizing spreadsheets. Yet anyone who has tried them knows the frustration: sometimes they succeed spectacularly, sometimes they click the wrong button and crash everything. Reliability, not intelligence, has been the missing link. A recent paper from Simular Research, “The Unreasonable Effectiveness of Scaling Agents for Computer Use,” shows that scaling these agents isn’t just about more compute; it’s about how we scale. Their method, Behavior Best-of-N (bBoN), turns the brute-force idea of “run many agents and hope one works” into a structured, interpretable, and near-human-level solution. ...

Backtrack to Breakthrough: Why Great AI Agents Revisit

TL;DR

Agentic performance isn’t just about doing more; it’s about going back. In GSM-Agent, a controllable, tool-using version of GSM8K, top models only reach ~65–68% accuracy, and the strongest predictor of success is a high revisit ratio: deliberately returning to a previously explored topic with a refined query. That’s actionable for enterprise AI: design agents that can (1) recognize incomplete evidence, (2) reopen earlier lines of inquiry, and (3) instrument and reward revisits. ...
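To instrument revisits, a team needs some operational definition. The sketch below uses one plausible reading, which is an assumption rather than the paper's exact metric: a step counts as a revisit when its topic was explored earlier and the agent is returning to it after working on something else. The function name and the topic labels are hypothetical.

```python
def revisit_ratio(topics: list[str]) -> float:
    """Fraction of steps that return to an earlier topic after leaving it.

    Consecutive queries on the same topic are treated as continued
    exploration, not as revisits (an assumption of this sketch).
    """
    if not topics:
        return 0.0
    seen: set[str] = set()
    revisits = 0
    previous = None
    for topic in topics:
        if topic in seen and topic != previous:
            revisits += 1
        seen.add(topic)
        previous = topic
    return revisits / len(topics)


# One label per tool call in an agent trace (illustrative values).
trace = ["price", "price", "discount", "price", "tax", "discount"]
ratio = revisit_ratio(trace)  # two revisits out of six steps
```

Logging this number per run makes it possible to check whether successful runs in your own workloads also revisit more often.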

Lost in the Long Game: What UltraHorizon Reveals About Agent Failure at Scale

TL;DR

UltraHorizon is a new benchmark that finally tests what real enterprise projects require: months‑long reasoning crammed into a single run, with 35k–200k tokens, 60–400+ tool calls, partially observable rules, and hard commitments at the end. Agents underperform badly versus humans. The pattern isn’t “not enough IQ”; it’s entropy collapse over time (the paper calls it in‑context locking) and foundational capability gaps (planning, memory, calibrated exploration). Simple scaling fails; a lightweight strategy, Context Refresh with Notes Recall (CRNR), partially restores performance. Below we translate these findings into a deployer’s playbook. ...

Options = Power: Turning Empowerment into a KPI for AI Agents

If your agents can reach more valuable futures with fewer steps, they’re stronger—whether you measured that task or not. Today’s paper offers a clean way to turn that intuition into a number: empowerment—an information‑theoretic score of how much an agent’s current action shapes its future states. The authors introduce EELMA, a scalable estimator that works purely from multi‑turn text traces. No bespoke benchmark design. No reward hacking. Just trajectories. This is the kind of metric we’ve wanted at Cognaptus: goal‑agnostic, scalable, and diagnostic. Below, I translate EELMA into an operator’s playbook: what it is, why it matters for business automation, how to wire it into your stack, and where it can mislead you if unmanaged. ...