LLMs | Cognaptus

Forgetting by Design: Turning GDPR into a Systems Problem for LLMs

TL;DR for operators: A deletion request is not a prompt. It is not a “please forget” instruction, a fine-tuning vibe, or a compliance-flavoured model apology. The useful idea in Unlearning at Scale: Implementing the Right to be Forgotten in Large Language Models is much less mystical: make training reproducible enough that deletion can be executed like systems recovery.[1] The paper treats training as a deterministic program, logs the minimal control inputs needed to replay that program, and then removes the requested data during replay. Under strict preconditions, the resulting parameters are bit-identical, in the training dtype, to the model that would have been produced if the forgotten examples had never been included. ...

Keys to the Kingdom: How LLMs Can Audit Crypto Logic Before It Breaks

TL;DR for operators: CryptoScope is not “ChatGPT, please audit my cryptography”. That would be a splendid way to generate confident nonsense with Greek letters. The paper’s useful idea is more disciplined: make the model behave less like a wandering code reviewer and more like a junior cryptographic analyst with a library card, a checklist, and a supervisor. CryptoScope does this by combining three components: a curated cryptographic knowledge base of more than 12,000 entries, a pre-detection step that summarises code and checks algorithm compliance, and a retrieval-augmented final analysis that grounds the model’s reasoning in known failure patterns and implementation guidance.[1] ...


Count Us In: How Dual‑Agent LLMs Turn Math Slips into Teachable Moments

TL;DR for operators: Math tutoring is not a place where “sounds right” is a harmless product feature. The paper behind this article tests four LLMs (GPT-4o, OpenAI o1, DeepSeek-V3, and DeepSeek-R1) on generated arithmetic, algebra, and Diophantine tasks, then inspects not only final answers but the intermediate steps where mistakes appear.[1] The useful lesson is not “LLMs are bad at math.” That is now almost a decorative sentence. The useful lesson is sharper: some models fail by calculation, some by concept, some by excessive reasoning, and some improve dramatically when another agent challenges the work. For builders of AI tutors, graders, and formative assessment systems, this means reliability should be engineered as a workflow, not purchased as a model label. ...

Speaking Fed with Confidence: How LLMs Decode Monetary Policy Without Guesswork

TL;DR for operators: Fedspeak classification is not the same thing as sentiment analysis with better stationery. A sentence about “strong employment” can be dovish in one macro regime and hawkish in another. The paper behind this article tackles that problem by giving an LLM a structured reasoning scaffold: extract economic entities, map their relations, reason through monetary-policy transmission paths, then classify the stance as hawkish, dovish, or neutral.[1] ...

Breaking the Question Apart: How Compositional Retrieval Reshapes RAG Performance

TL;DR for operators: A standard RAG system often retrieves the most individually relevant chunks. That is useful until the question needs several different pieces of evidence that must work together. Then the system may return five near-duplicates of the most obvious fact and miss the less obvious fact that actually completes the answer. Excellent. We have reinvented the meeting where everyone brings the same slide. ...

From Ballots to Budgets: Can LLMs Be Trusted as Social Planners?

TL;DR for operators: This paper asks a deceptively operational question: can an LLM act as a social planner when it must allocate a fixed budget across competing public projects? Not in the inspirational LinkedIn sense. In the literal sense: choose project IDs, stay within budget, maximise community utility, and return a valid allocation. ...

Scalpels Not Sledgehammers: A New Era of Precision Editing for LLMs

TL;DR for operators: Large language models age badly. Product names change, policies expire, executives move, medical or legal guidance becomes stale, and some facts in pre-training were never right in the first place. The usual repair options are clumsy: retrain the model, fine-tune it, hide updated facts in prompts, or bolt on retrieval and hope the model behaves. All useful. All annoying in different ways. ...

Longer Yet Dumber: Why LLMs Fail at Catching Their Own Coding Mistakes

TL;DR for operators: Code review usually starts after code exists. FPBench argues that this is already too late. The paper behind FPBench tests whether large language models can detect faulty premises in code-generation requests before obediently producing code from them.[1] The answer is awkward. Many models can identify the flaw when explicitly told to check the question first, but most do not do so proactively. They behave less like careful engineers and more like very fast interns with a tragic respect for bad tickets. ...

Mind the Gap: How AI Papers Misuse Psychology

TL;DR for operators: AI teams love borrowing psychology. It gives messy model behaviour a tidy name: “reasoning,” “empathy,” “Theory of Mind,” “bias,” “motivation,” “attention.” The problem is that a borrowed label is not the same as a valid construct. A new paper, The Incomplete Bridge: How AI Research (Mis)Engages with Psychology, studies this borrowing directly by mapping 1,006 LLM-related papers from major AI venues and the 2,544 psychology papers they cite.[1] ...

Beyond Words: Teaching AI to See and Fix Charts with ChartM3

TL;DR for operators: ChartM3 is useful because it reframes chart editing as a four-step control problem: identify the visual target, connect that target to code, apply the edit, and avoid damaging everything else. That sounds obvious until one watches a multimodal model obediently edit the wrong pie slice with great confidence. A familiar little tragedy, now with bounding boxes. ...