Cognaptus Insights

Bias in the Warehouse: What AIM-Bench Reveals About Agentic LLMs

Agentic LLMs are graduating from chat to control rooms—taking actions, maintaining memory, and optimizing business processes. Inventory is a natural proving ground: a clean cocktail of uncertainty, economics, and coordination. AIM-Bench arrives precisely here, testing LLM agents across newsvendor, multi-period replenishment, the Beer Game, two-level warehouses, and a small supply network—each with explicit uncertainty sources (stochastic demand, variable lead times, and partner behavior). ...

Consent, Coaxing, and Countermoves: Simulating Privacy Attacks on LLM Agents

When organizations deploy LLM-based agents to email, message, and collaborate on our behalf, privacy threats stop being static. The attacker is now another agent able to converse, probe, and adapt. Today’s paper proposes a simulation-plus-search framework that discovers these evolving risks—and the countermeasures that survive them. The result is a rare, actionable playbook: how attacks escalate in multi-turn dialogues, and how defenses must graduate from rules to identity-verified state machines. ...

Keys to the Kingdom: How LLMs Can Audit Crypto Logic Before It Breaks

We’ve gotten good at spotting API misuse in crypto code (think “don’t use ECB,” “don’t hardcode IVs”). But many production failures don’t come from the obvious API call—they’re born in the logic that surrounds it: the parameter checks, corner-case math, and brittle “optimizations.” That’s where CryptoScope steps in: an LLM-powered framework that reads crypto code like a human auditor, guided by a domain corpus and structured prompts, to uncover logic-level vulnerabilities without executing the code. ...

Knows the Facts, Misses the Plot: LLMs’ Knowledge–Reasoning Split in Clinical NLI

The gist A new clinical natural language inference (NLI) benchmark isolates what models know from how they reason—and the results are stark. State‑of‑the‑art LLMs ace targeted fact checks (≈92% accuracy) but crater on the actual reasoning tasks (≈25% accuracy). The collapse is most extreme in compositional grounding (≈4% accuracy), where a claim depends on multiple interacting clinical constraints (e.g., drug × dose × diagnosis × schedule). Scaling yielded fluent prose, not reliable inference. ...

Paging Dr. Model: When AI Runs the Workup

What if the AI didn’t just answer a question—it ordered the right tests, asked for the right observations, and stopped when it had enough to call the case? A new paper introduces DxDirector-7B, a 7B-parameter medical LLM trained to act as the director of care, not the assistant. Instead of waiting for a physician to assemble clean inputs, the model starts from the patient’s vague chief complaint (e.g., “tummy pain and tired”) and then plans the diagnostic pathway, requesting only those clinician actions that software cannot perform (physical exams, labs, imaging). The goal is twofold: maximize diagnostic accuracy and minimize human workload. ...

Patch Tuesday for the Law: Hunting Legal Zero‑Days in AI Governance

TL;DR: Legal zero‑days are previously unnoticed faults in how laws interlock. When triggered, they can invalidate decisions, stall regulators, or nullify safeguards immediately—no lawsuit required. A new evaluation finds current AI models only occasionally detect such flaws, but the capability is measurable and likely to grow. Leaders should treat statutory integrity like cybersecurity: threat model, red‑team, patch. What’s a “legal zero‑day”? Think of a software zero‑day, but in law. It’s not a vague “loophole,” nor normal jurisprudential drift. It’s a precise, latent defect in how definitions, scope clauses, or cross‑references interact such that real‑world effects fire at once when someone notices—e.g., eligibility rules void an officeholder, or a definitional tweak quietly de‑scopes entire compliance obligations. ...

Skip or Split? How LLMs Can Make Old-School Planners Run Circles Around Complexity

TL;DR Classical planners crack under scale. You can rescue them with LLMs in two ways: (1) Inspire the next action, or (2) Predict an intermediate state and split the search. On diverse benchmarks (Blocks, Logistics, Depot, Mystery), the Predict route generally solves more cases with fewer LLM calls, except when domain semantics are opaque. For enterprise automation, this points to a practical recipe: decompose → predict key waypoints → verify with a trusted solver—and only fall back to “inspire” when your domain model is thin. ...

Therapy, Explained: How Multi‑Agent LLMs Turn DSM‑5 Screens into Auditable Logic

TL;DR DSM5AgentFlow uses three cooperating LLM agents—Therapist, Client, and Diagnostician—to simulate DSM‑5 Level‑1 screenings and then generate step‑by‑step diagnoses tied to specific DSM criteria. Experiments across four LLMs show a familiar trade‑off: dialogue‑oriented models sounded more natural, while a reasoning‑oriented model scored higher on diagnostic accuracy. For founders and PMs in digital mental health, the win is auditability: every symptom claim can be traced to a quoted utterance and an explicit DSM clause. The catch: results are built on synthetic dialogues, so ecological validity and real‑world safety remain open. ...

Three’s Company: When LLMs Argue Their Way to Alpha

TL;DR A role‑based, debate‑driven LLM system—AlphaAgents—coordinates three specialist agents (fundamental, sentiment, valuation) to screen equities, reach consensus, and build a simple, equal‑weight portfolio. In a four‑month backtest starting 2024‑02‑01 on 15 tech names, the risk‑neutral multi‑agent portfolio outperformed the benchmark and single‑agent baselines; risk‑averse variants underperformed in a bull run (as expected). The real innovation isn’t the short backtest—it’s the explainable process: constrained tools per role, structured debate, and explicit risk‑tolerance prompts. ...

$Cover image$

Count Us In: How Dual‑Agent LLMs Turn Math Slips into Teachable Moments

Large language models can talk through a solution like a star pupil—and still get the answer wrong. A new study of four modern LLMs across arithmetic, algebra, and number theory shows where they stumble (mostly procedural slips), when they recover (with a second agent), and how teams should redesign AI tutors and graders to be trustworthy in the real world. TL;DR for builders Single models still flub arithmetic. Even strong general models mis-add partial products or mis-handle carries. Reasoning-tuned models help—but not always. OpenAI o1 was consistently best; DeepSeek‑R1 “overthought” and missed basics. Two agents beat one. Peer‑review style “dual agents” dramatically raised accuracy, especially on Diophantine equations. Most errors are procedural, not conceptual. Think slips and symbolic manipulations—not deep misunderstandings. Step‑labeling works. A simple rubric (Correct / Procedural / Conceptual / Impasse) localizes faults and boosts formative feedback. What the paper really tested (and why that matters) Most benchmarks hide easy leakage and memorized patterns. Here, the authors build three item models—templates that generate many variants—to stress the models beyond memorization: ...