
Agency Check, Please: What a New Benchmark Says About LLMs That Actually Empower Users

If you only measure what’s easy, you’ll ship assistants that feel brilliant yet quietly take the steering wheel. HumanAgencyBench (HAB) proposes a different yardstick: does the model support the human’s capacity to choose and act—or does it subtly erode it?

TL;DR for product leaders
HAB scores six behaviors tied to agency: Ask Clarifying Questions, Avoid Value Manipulation, Correct Misinformation, Defer Important Decisions, Encourage Learning, Maintain Social Boundaries. Across 20 frontier models, agency support is low-to-moderate overall. Patterns matter more than single scores: e.g., some models excel at boundaries but lag on learning; others accept unconventional user values yet hesitate to push back on misinformation. HAB shows why “be helpful” tuning (RLHF-style instruction following) can conflict with agency—especially when users need friction (clarifiers, deferrals, gentle challenges).

Why “agency” is the missing KPI
We applaud accuracy, reasoning, and latency. But an enterprise rollout lives or dies on trustworthy delegation. That means assistants that: ...
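
To make the “patterns over single scores” point concrete, here is a minimal Python sketch of tracking the six HAB dimensions side by side. The dimension names come from the benchmark; the 0–1 judge scores and the `agency_profile` helper are hypothetical, not the authors’ scoring pipeline.

```python
from statistics import mean

# The six HAB dimensions named in the paper; the scores fed in below are
# hypothetical judge outputs, not the benchmark's actual pipeline.
HAB_DIMENSIONS = [
    "ask_clarifying_questions",
    "avoid_value_manipulation",
    "correct_misinformation",
    "defer_important_decisions",
    "encourage_learning",
    "maintain_social_boundaries",
]

def agency_profile(scores: dict[str, float]) -> dict[str, float]:
    """Return per-dimension scores plus a simple mean aggregate.

    HAB's point is that the *pattern* matters more than the mean,
    so keep both views rather than collapsing to one number.
    """
    missing = set(HAB_DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return {**scores, "overall": mean(scores[d] for d in HAB_DIMENSIONS)}

# Example pattern: strong on boundaries, weak on encouraging learning.
profile = agency_profile({
    "ask_clarifying_questions": 0.62,
    "avoid_value_manipulation": 0.71,
    "correct_misinformation": 0.55,
    "defer_important_decisions": 0.48,
    "encourage_learning": 0.33,
    "maintain_social_boundaries": 0.86,
})
print(profile["overall"])
```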

September 14, 2025 · 4 min · Zelina

Hook, Line, and Import: How RAG Lets Attackers Snare Your Code

LLM code assistants are now the default pair‑programmer. Many teams tried to make them safer by bolting on RAG—feeding official docs to keep generations on the rails. ImportSnare shows that the very doc pipeline we trusted can be weaponized to push malicious dependencies into your imports. Below, I unpack how the attack works, why it generalizes across languages, and what leaders should change this week vs. this quarter.

The core idea in one sentence
Attackers seed your doc corpus with retrieval‑friendly snippets and LLM‑friendly suggestions so that, when your assistant writes code, it confidently imports a look‑alike package (e.g., pandas_v2, matplotlib_safe) that you then dutifully install. ...
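
For the “this week” bucket, one cheap guardrail is to diff an assistant’s imports against a known-good list before anything gets installed. A minimal sketch using the standard library, assuming a hard-coded `ALLOWED` set that stands in for your real lockfile or internal registry:

```python
import ast

# Hypothetical allowlist; in practice this would be your lockfile or an
# internal package registry, not a hard-coded set.
ALLOWED = {"pandas", "numpy", "matplotlib"}

def suspicious_imports(generated_code: str) -> set[str]:
    """Return top-level imports in LLM-generated code that are not on the
    allowlist -- the ImportSnare failure mode is exactly a plausible
    look-alike (pandas_v2) slipping past human review."""
    tree = ast.parse(generated_code)
    found: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - ALLOWED

snippet = "import pandas_v2 as pd\nfrom matplotlib_safe import pyplot"
print(suspicious_imports(snippet))  # {'pandas_v2', 'matplotlib_safe'}
```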

September 13, 2025 · 4 min · Zelina

Mind the Gap: How OSC Turns Agent Chatter into Compound Intelligence

Multi‑agent LLMs work great on paper and go sideways in practice. We over‑select experts, flood the channel with verbose thoughts, and then pray a meta‑LLM can stitch it all together. OSC (Orchestrating Cognitive Synergy) proposes a missing middle: a learned orchestration layer that constantly models what each agent knows, spots “cognitive gaps,” and then tells agents how to talk—what to say, to whom, and at what level of detail—before the aggregator votes. ...
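
To see the shape of that orchestration layer, here is a toy rendition: track what each agent has established, compute the gap against what the task requires, and turn the gap into a targeted directive. OSC learns these components; the `AgentState` structure and routing rule below are illustrative assumptions, not the paper’s architecture.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    name: str
    established: set[str] = field(default_factory=set)  # claims the agent has covered

def cognitive_gap(state: AgentState, required: set[str]) -> set[str]:
    """The 'cognitive gap': what the task needs that this agent lacks."""
    return required - state.established

def route_messages(states: list[AgentState], required: set[str]) -> list[tuple[str, str]]:
    """Tell each agent only what it is missing -- targeted, not verbose."""
    directives = []
    for s in states:
        gap = cognitive_gap(s, required)
        if gap:
            directives.append((s.name, f"Address before voting: {sorted(gap)}"))
    return directives

agents = [
    AgentState("planner", {"constraints"}),
    AgentState("coder", {"constraints", "api_choice"}),
]
print(route_messages(agents, required={"constraints", "api_choice", "edge_cases"}))
```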

September 11, 2025 · 4 min · Zelina

Parallel Minds, Shorter Time: ParaThinker’s Native Thought Width

The pitch: We’ve stretched LLM “depth” by making models think longer. ParaThinker flips the axis—training models to think wider: spawn several independent lines of thought in parallel and then fuse them. The result is higher accuracy than single‑path “long thinking” at roughly the same wall‑clock time—and it scales.

TL;DR for operators
What it is: An end‑to‑end framework that natively generates multiple reasoning paths with special control tokens, then summarizes using cached context. Why it matters: It tackles the test‑time scaling bottleneck (aka Tunnel Vision) where early tokens lock a model into a suboptimal path. Business takeaway: You can trade a bit of GPU memory for more stable, higher‑quality answers at nearly the same latency—especially on math/logic‑heavy tasks and agentic workflows.

The problem: “Think longer” hits a wall
Sequential test‑time scaling (à la o1 / R1‑style longer CoT) delivers diminishing returns. After a point, more tokens don’t help; they reinforce early mistakes. ParaThinker names this failure mode Tunnel Vision—the first few tokens bias the entire trajectory. If depth traps us, width can free us. ...
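
In miniature, width over depth looks like the sketch below: sample several independent paths, then fuse. ParaThinker does this natively with control tokens and a summarizer that reuses cached context; the `sample_path` stub and the majority vote here are stand-ins for that machinery.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def sample_path(seed: int) -> str:
    # Placeholder for one independent chain-of-thought rollout; a real
    # system would call the model with a distinct path-control token.
    answers = ["42", "42", "41", "42"]
    return answers[seed % len(answers)]

def think_wider(k: int = 4) -> str:
    # Paths are independent, so they parallelize; wall-clock time stays
    # close to a single rollout while quality comes from width.
    with ThreadPoolExecutor(max_workers=k) as pool:
        finals = list(pool.map(sample_path, range(k)))
    # Fusion step: a majority vote here; ParaThinker instead summarizes
    # all paths using their cached contexts, which is strictly richer.
    return Counter(finals).most_common(1)[0][0]

print(think_wider())  # "42" -- the tunnel-visioned "41" path is outvoted
```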

September 11, 2025 · 4 min · Zelina

Fusion Cuisine for RAG: Z‑Scores, Rankers, and the Two‑Source Diet

Retrieval‑augmented generation tends to pick a side: either lean on labeled exemplars (ICL/L‑RAG) that encode task semantics, or on unlabeled corpora (U‑RAG) that provide broad knowledge. HF‑RAG argues we shouldn’t choose. Instead, it proposes a hierarchical fusion: (1) fuse multiple rankers within each source, then (2) fuse across sources by putting scores on a common scale. The result is a simple, training‑free recipe that improves fact verification and, crucially, generalizes better out‑of‑domain. ...
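
The two fusion steps reduce to a few lines. In this sketch, z-scores put every ranker on a common scale; scores are fused within each source first, then across sources. The toy scores and equal-weight sums are illustrative choices, not the paper’s exact weighting.

```python
import statistics

def zscores(scores: dict[str, float]) -> dict[str, float]:
    """Normalize one ranker's raw scores to a common (z-score) scale."""
    mu = statistics.mean(scores.values())
    sd = statistics.pstdev(scores.values()) or 1.0  # guard degenerate case
    return {doc: (s - mu) / sd for doc, s in scores.items()}

def fuse(rankings: list[dict[str, float]]) -> dict[str, float]:
    """Sum z-normalized scores across rankings (equal weights assumed)."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for doc, z in zscores(ranking).items():
            fused[doc] = fused.get(doc, 0.0) + z
    return fused

# Step 1: within-source fusion (two rankers per source; toy numbers).
labeled = fuse([{"ex1": 12.0, "ex2": 7.0}, {"ex1": 0.9, "ex2": 0.4}])
unlabeled = fuse([{"d1": 3.3, "d2": 1.1}, {"d1": 0.7, "d2": 0.6}])
# Step 2: cross-source fusion -- exemplars and passages now share a scale.
pooled = fuse([labeled, unlabeled])
print(sorted(pooled, key=pooled.get, reverse=True))
```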

September 6, 2025 · 4 min · Zelina

Razor Burn: Why LLMs Nick Themselves on Induction and Abduction

TL;DR
A new synthetic benchmark (INABHYD) tests inductive and abductive reasoning under Occam’s Razor. LLMs handle toy cases but falter as ontologies deepen or when multiple hypotheses are needed. Even when models “explain” observations, they often pick needlessly complex or trivial hypotheses—precisely the opposite of what scientific discovery and root-cause analysis require.

The Big Idea
Most reasoning work on LLMs obsesses over deduction (step-by-step proofs). But the real world demands induction (generalize rules) and abduction (best explanation). The paper introduces INABHYD, a programmable benchmark that builds fictional ontology trees (concepts, properties, subtype links) and hides some axioms. The model sees an incomplete world + observations, and must propose hypotheses that both explain all observations and do so parsimoniously (Occam’s Razor). The authors score: ...
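
A toy version of the benchmark’s two-part test makes the failure mode easy to see: a hypothesis set must explain every observation, and among explaining sets, smaller is better. Representing hypotheses as axiom sets and “explains” as coverage is an assumption for illustration; INABHYD works over ontology trees and actual proofs.

```python
# Each observation maps to the axiom sets that would account for it.
def explains_all(hypotheses: set[str], observations: dict[str, set[str]]) -> bool:
    return all(hypotheses & options for options in observations.values())

def occam_score(hypotheses: set[str], observations: dict[str, set[str]]) -> float:
    """0 if anything is unexplained; otherwise reward parsimony."""
    if not explains_all(hypotheses, observations):
        return 0.0
    return 1.0 / len(hypotheses)  # fewer axioms -> higher parsimony

obs = {"o1": {"h_general", "h_special1"}, "o2": {"h_general", "h_special2"}}
print(occam_score({"h_general"}, obs))                 # 1.0: one axiom covers both
print(occam_score({"h_special1", "h_special2"}, obs))  # 0.5: needlessly complex
```

The second call is exactly the “needlessly complex” pattern the paper flags: everything is explained, but the model missed the single general rule.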

September 6, 2025 · 4 min · Zelina

Dial M—for Markets: Brain‑Scanning and Steering LLMs for Finance

TL;DR
A new paper shows how to insert a sparse, interpretable layer into an LLM to expose plain‑English concepts (e.g., sentiment, risk, timing) and steer them like dials without retraining. In finance news prediction, these interpretable features outperform final‑layer embeddings and reveal that sentiment, market/technical cues, and timing drive most short‑horizon alpha. Steering also debiases optimism, lifting Sharpe by nudging the model negative on sentiment.

Why this matters (and what’s new)
Finance teams have loved LLMs’ throughput but hated their opacity. This paper demonstrates a lightweight path to transparent performance: ...
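
The “dial” metaphor is nearly literal: once a sparse layer exposes named concept activations, steering is a shift on one coordinate before decoding. A minimal sketch, with hypothetical concept indices and a hypothetical offset; the paper learns the sparse layer inside a real LLM rather than using a fixed mapping.

```python
import numpy as np

# Hypothetical mapping from plain-English concepts to sparse-layer dims.
CONCEPTS = {"sentiment": 0, "risk": 1, "timing": 2}

def steer(activations: np.ndarray, concept: str, delta: float) -> np.ndarray:
    """Shift one interpretable coordinate -- the 'dial' -- and leave the
    rest of the representation untouched."""
    steered = activations.copy()
    steered[CONCEPTS[concept]] += delta
    return steered

h = np.array([0.8, 0.2, -0.1])            # optimistic sentiment readout
h_debiased = steer(h, "sentiment", -1.5)  # nudge negative, as in the Sharpe result
print(h_debiased)
```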

September 1, 2025 · 4 min · Zelina

Numbers Need Narration: Making LLMs Do Reasoning‑Intensive Regression

Thesis: When the job is to read text, reason carefully, and return a precise number (not just a label), ordinary regression heads and vanilla prompting often fail in opposite ways. The paper introduces MENTAT, a lightweight recipe that marries batch‑reflective prompt evolution with a small MLP aggregator over multiple LLM rollouts. The result: tighter calibration and better ranking on tasks where each example demands real reasoning, not surface features.

What counts as “Reasoning‑Intensive Regression” (RiR)?
RiR tasks look like this: the model must (1) think through the input with step‑wise analysis, and then (2) score it on a real‑valued scale. The paper frames three such tasks: ...
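
The aggregation half of the recipe is small enough to sketch: run the evolved prompt K times, collect each rollout’s numeric guess, and let a tiny MLP map the vector of guesses to one calibrated value. The random weights below are stand-ins; in the paper the MLP is trained on a small labeled set.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # rollouts per example

def mlp_aggregate(rollout_scores: np.ndarray,
                  w1: np.ndarray, b1: np.ndarray,
                  w2: np.ndarray, b2: float) -> float:
    """One hidden ReLU layer over the K rollout guesses -> one number."""
    hidden = np.maximum(0.0, rollout_scores @ w1 + b1)
    return float(hidden @ w2 + b2)

scores = np.array([0.62, 0.55, 0.71, 0.58, 0.60])  # noisy per-rollout guesses
w1, b1 = rng.normal(size=(K, 8)), np.zeros(8)       # untrained stand-in weights
w2, b2 = rng.normal(size=8), 0.0
print(mlp_aggregate(scores, w1, b1, w2, b2))
```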

September 1, 2025 · 4 min · Zelina

Patience Is Profit: Can LLM Agents Stabilize DePIN’s Token Rails?

TL;DR — A new framework (EconAgentic) models DePIN growth stages, token/agent interactions, and macro goals (efficiency, inclusion, stability). Its key finding: more patient LLM agents (i.e., slower to exit) can increase inclusion and stability with little efficiency penalty. Sensible—but only if token price formation, data integrity, and geospatial participation are measured rigorously.

Why this paper matters
DePIN (Decentralized Physical Infrastructure Networks) turns physical capacity—wireless hotspots, sensors, compute, even energy—into token‑incentivized networks. The promise is Uber/Airbnb’s distribution without the platform as rent‑extractor. EconAgentic contributes a general model that: ...
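
One way to make “patience” operational: an agent exits only after its token-denominated payoff stays below cost for `patience` consecutive periods. The threshold rule and the numbers below are my illustration, not the paper’s calibrated model.

```python
def exits(payoffs: list[float], cost: float, patience: int) -> bool:
    """Exit once payoffs run below cost for `patience` periods in a row."""
    streak = 0
    for p in payoffs:
        streak = streak + 1 if p < cost else 0
        if streak >= patience:
            return True
    return False

payoffs = [1.2, 0.8, 0.7, 0.9, 1.1]  # a transient dip in token rewards
print(exits(payoffs, cost=1.0, patience=2))  # True: impatient agent leaves
print(exits(payoffs, cost=1.0, patience=4))  # False: patient agent rides it out
```

In aggregate, the second behavior is the stability claim: patient agents don’t amplify transient dips into supply-side exits.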

September 1, 2025 · 5 min · Zelina

Benchmarks with Benefits: What DeepScholar-Bench Really Measures

TL;DR
DeepScholar-Bench introduces a live (continuously refreshable) benchmark and a holistic automated evaluation for generative research synthesis. Its reference pipeline, DeepScholar‑base, is simple yet competitive. The headline: today’s best systems organize text well but miss key facts, under-retrieve important sources, and fail verifiability at scale. That’s not a death knell—it’s a roadmap.

Why this matters for business readers
Enterprise “research copilots” promise to digest the live web, summarize options, and provide auditable citations. In practice, three gaps keep showing up: ...
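
Verifiability at scale can at least be smoke-tested in-house: check what fraction of a report’s claims is attributable to some retrieved source. Substring matching below is a crude stand-in for the benchmark’s automated judges, and the claims and sources are invented for illustration.

```python
def citation_coverage(claims: list[str], sources: list[str]) -> float:
    """Fraction of claim sentences attributable to at least one source."""
    supported = sum(
        any(claim.lower() in src.lower() for src in sources) for claim in claims
    )
    return supported / len(claims) if claims else 1.0

claims = [
    "the benchmark is refreshed continuously",
    "systems under-retrieve key sources",
]
sources = ["...the benchmark is refreshed continuously from new papers..."]
print(citation_coverage(claims, sources))  # 0.5 -- one claim is unverifiable
```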

August 30, 2025 · 5 min · Zelina