
Mind the Gap: How AI Papers Misuse Psychology

It has become fashionable for AI researchers to pepper their papers with references to psychology: System 1 and 2 thinking, Theory of Mind, memory systems, even empathy. But according to a recent meta-analysis titled “The Incomplete Bridge: How AI Research (Mis)Engages with Psychology”, these references are often little more than conceptual garnish. The authors analyze 88 AI papers from NeurIPS and ACL (2022–2023) that cite psychological concepts. Their verdict is sobering: while 78% use psychology as inspiration, only 6% attempt to empirically validate or challenge psychological theories. Most papers cite psychology in passing — using it as window dressing to make AI behaviors sound more human-like. ...

July 31, 2025 · 3 min · Zelina

Agents, Not Tasks: Rethinking Business Processes in the Age of AI

In the quest for smarter automation, businesses have long leaned on rigid workflow engines and task-centric diagrams. But in an increasingly dynamic, AI-powered world, these static pipelines are starting to show their cracks. A new paper, “An Agentic AI for a New Paradigm in Business Process Development,” proposes a compelling shift: reframe business processes not as sequences of tasks, but as networks of autonomous, goal-driven agents.

From Flowcharts to Ecosystems

Traditional business process management (BPM) operates like a production line: each step is predefined, and systems pass the baton from one task to the next. This works well for predictable operations but falters in environments requiring adaptability, exception handling, or dynamic goal reconfiguration. ...

July 30, 2025 · 4 min · Zelina

Beyond Words: Teaching AI to See and Fix Charts with ChartM3

When you tell an AI, “make the third bar blue,” what does it actually see? If it’s a typical large language model (LLM), it doesn’t really see anything. It parses your instruction, guesses what “third bar” means, and fumbles to write chart code—often missing the mark. ChartM³ (Multimodal, Multi-level, Multi-perspective) changes the game. It challenges AIs not only to read and write code but also to visually comprehend what a user points at. With 1,000 human-curated chart editing tasks and 24,000 training examples, this new benchmark sets a higher bar—one that demands both verbal and visual fluency. ...

July 30, 2025 · 4 min · Zelina

Circuits of Understanding: A Formal Path to Transformer Interpretability

Can we prove that we understand how a transformer works? Not just describe it heuristically, or highlight patterns—but actually trace its computations with the rigor of a mathematical proof? That’s the ambition behind the recent paper Mechanistic Interpretability for Transformers: A Formal Framework and Case Study on Indirect Object Identification. The authors propose the first comprehensive mathematical framework for mechanistic interpretability, and they use it to dissect how a small transformer solves the Indirect Object Identification (IOI) task. The result is not just a technical tour de force, but a conceptual upgrade for the interpretability field. ...

July 30, 2025 · 3 min · Zelina

Fraud, Trimmed and Tagged: How Dual-Granularity Prompts Sharpen LLMs for Graph Detection

In the escalating arms race between fraudsters and detection systems, recent advances in Graph-Enhanced LLMs hold enormous promise. But they face a chronic problem: too much information. Take graph-based fraud detection. It’s common to represent users and their actions as nodes and edges on a heterogeneous graph, where each node may contain rich textual data (like reviews) and structured features (like ratings). To classify whether a node (e.g., a user review) is fraudulent, models like GraphGPT or HiGPT transform local neighborhoods into long textual prompts. But here’s the catch: real-world graphs are dense. Even two hops away, the neighborhood can balloon to millions of tokens. ...

July 30, 2025 · 4 min · Zelina

OneShield Against the Storm: A Smarter Firewall for LLM Risks

As businesses embrace large language models (LLMs) across sectors like healthcare, finance, and customer support, a pressing concern has emerged: how do we guard against hallucinations, toxicity, and data leaks without killing performance or flexibility? Enter OneShield, IBM’s next-generation guardrail framework. Think of it not as a rigid moral compass baked into the model, but as an external, modular firewall — capable of custom rules, parallel scanning, and jurisdiction-aware policy enforcement. The design principle is simple but powerful: separate safety from generation. ...

July 30, 2025 · 3 min · Zelina

The User Is Present: Why Smart Agents Still Don't Get You

If today’s AI agents are so good with tools, why are they still so bad with people? That’s the uncomfortable question posed by UserBench, a new gym-style benchmark from Salesforce AI Research that evaluates LLM-based agents not just on what they do, but on how well they collaborate with a user who doesn’t say exactly what they want. At first glance, UserBench looks like yet another travel planning simulator. But dig deeper, and you’ll see it flips the standard script of agent evaluation. Instead of testing models on fully specified tasks, it mimics real conversations: the user’s goals are vague, revealed incrementally, and often expressed indirectly. Think “I’m traveling for business, so I hope to have enough time to prepare” instead of “I want a direct flight.” The agent’s job is to ask, interpret, and decide—with no hand-holding. ...

July 30, 2025 · 3 min · Zelina

Too Nice to Be True? The Reliability Trade-off in Warm Language Models

AI is getting a personality makeover. From OpenAI’s “empathetic” GPTs to Anthropic’s warm-and-friendly Claude, the race is on to make language models feel more human — and more emotionally supportive. But as a recent study from the Oxford Internet Institute warns, warmth might come at a cost: when language models get too nice, they also get less accurate.

The warmth-reliability trade-off

In this empirical study titled Training language models to be warm and empathetic makes them less reliable and more sycophantic, researchers fine-tuned five LLMs — including LLaMA-70B and GPT-4o — to produce warmer, friendlier responses using a curated dataset of over 3,600 transformed conversations. Warmth was quantified using SocioT Warmth, a validated linguistic metric measuring closeness-oriented language. Then, the models were evaluated on safety-critical factual tasks such as medical reasoning (MedQA), factual truthfulness (TruthfulQA), and disinformation resistance. ...

July 30, 2025 · 4 min · Zelina

Don't Trust. Verify: Fighting Financial Hallucinations with FRED

When ChatGPT makes up a statistic or misstates a date, it’s annoying. But when a financial assistant claims the wrong interest expense or misattributes a revenue source, it could move markets or mislead clients. This is the stark reality FRED confronts head-on. FRED—short for Financial Retrieval-Enhanced Detection and Editing—is a framework fine-tuned to spot and fix factual errors in financial LLM outputs. Developed by researchers at Pegasi AI, it isn’t just another hallucination detection scheme. It’s an auditor with a domain-specific brain. ...

July 29, 2025 · 3 min · Zelina

From Molecule to Mock Human: Why Programmable Virtual Humans Could Rewrite Drug Discovery

The AI hype in pharma has mostly yielded faster failures. Despite generative models for molecules and AlphaFold for protein folding, the fundamental chasm remains: what works in silico or in vitro still too often flops in vivo. A new proposal — Programmable Virtual Humans (PVHs) — may finally aim high enough: modeling the entire cascade of drug action across human biology, not just optimizing isolated steps.

🧬 The Translational Gap Isn’t Just a Data Problem

Most AI models in drug discovery focus on digitizing existing methods. Target-based models optimize binding affinity; phenotype-based approaches predict morphology changes in cell lines. But both ignore the reality that molecular behavior in humans is emergent — shaped by multiscale interactions between genes, proteins, tissues, and organs. ...

July 29, 2025 · 4 min · Zelina