LLM | Cognaptus

Forgetting That Never Happened: The Shallow Alignment Trap

Forgetfulness is an expensive diagnosis. When an internal AI system performs well on last month’s support taxonomy, then underperforms after being fine-tuned on this month’s compliance cases, the obvious story is simple: the model forgot. That story usually triggers an equally obvious response: replay old data, retrain more broadly, freeze more parameters, or panic politely in a meeting while calling it “model lifecycle management.” ...

Doctor GPT, But Make It Explainable

Triage begins with messy language. A patient does not usually arrive as a clean feature vector. They arrive with “I feel tired,” “my stomach is strange,” “I have fever but not always,” or the classic: “I searched online and now I am either fine or dying.” Traditional diagnostic models are not built for this level of human poetry. They prefer structured fields, stable vocabularies, and the fantasy that symptoms behave like dropdown menus. ...

Prompt-to-Parts: When Language Learns to Build

The compiler is the interesting part. Blocks are easy to understand. That is why this paper is more interesting than it first looks. On the surface, Prompt-to-Parts: Generative AI for Physical Assembly and Scalable Instructions is a paper about using large language models to generate LEGO-style assemblies from natural language prompts.1 It shows a medieval castle, an International Space Station model, a modular multitool kit, and an image-to-parts helicopter conversion. Naturally, the tempting summary is: “LLMs can now design LEGO models.” ...

ID Crisis, Resolved: When Semantic IDs Stop Fighting Hash IDs

Catalogs have a boring problem. Most items are nearly invisible. A platform may have millions of products, posts, videos, restaurants, songs, or ads, but user interaction is never evenly distributed. A small number of head items collect enough clicks, saves, purchases, and dwell time to become statistically legible. The rest live in the long tail, where the system is expected to recommend them intelligently despite barely having seen them. Very democratic. Very inconvenient. ...

When AI Becomes the Reviewer: Pairwise Judgment at Scale

A committee has one expensive problem before it has any philosophical problem: too many proposals, too little time, and no clean way to know whether Proposal 17 was actually better than Proposal 42. So the usual system does what institutions often do when the task is too large to compare directly. It fragments the work. A few reviewers score a few proposals. Their scores are averaged. A ranked list appears. Everyone pretends the number is more stable than the process that produced it. ...

Rule of Thumb, Meet Rule of Code: How DeepRule Rewrites Retail Optimization

A store manager does not usually make assortment and pricing decisions inside a clean optimization textbook. More often, the decision lives in a less glamorous place: a sales spreadsheet, a distributor agreement, an approval memo, last month’s exception report, a half-remembered rule about which customer can handle which category, and one person in the room saying, “This SKU always works in that region.” Retail intelligence, in other words, often begins as a pile of semi-structured clues wearing a business-casual disguise. ...

CLOZE Encounters: When LLMs Start Editing Medical Ontologies

Hospitals already have the raw material for better medical knowledge systems. It is sitting inside discharge summaries, nursing notes, radiology reports, ECG interpretations, and all the other clinical prose that makes electronic health records look deceptively “digital” while still behaving like a very expensive filing cabinet. The awkward part is that clinical notes are both valuable and dangerous. Valuable, because they contain granular observations that structured fields often miss. Dangerous, because they contain protected health information, idiosyncratic phrasing, and enough local context to make naïve automation look clever right up to the moment it quietly corrupts a downstream system. ...

Evolving Minds: How LLMs Teach Themselves Through Adversarial Cooperation

Training data is the quiet tax on modern AI. Someone has to write the examples, verify the answers, clean the failures, and pretend the spreadsheet is a strategy. Reinforcement learning makes that tax even more visible: if a model is supposed to improve through feedback, then the organisation must either provide ground-truth answers, hire evaluators, or build verifiers that can tell success from nonsense. ...

Paths, Not Parrots: When RL Makes LLMs Plan—and When It Doesn’t

A workflow agent usually looks clever right up to the moment one service is down, one permission changes, or one customer case arrives with the wrong sort of mess attached. Then the question becomes painfully simple: did the model learn a plan, or did it learn the usual route? That distinction is the centre of Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective, an ICLR 2026 paper by Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, and Wei Chen.1 The paper is not another victory lap for reinforcement learning. It is more useful than that. It asks what, mechanically, changes when a language model is trained for planning with reinforcement learning rather than supervised fine-tuning. ...

Pipes by Prompt, DAGs by Design: Why Hybrid Beats Hero Prompts

The demo is easy. The DAG is not. Pipeline automation has a wonderfully deceptive user story. A business analyst writes: “Take this customer file, clean the locations, geocode the addresses, add weather data, then save the enriched output.” An LLM replies with a Python file. The file looks plausible. There are imports. There is an Airflow DAG. There are operators. There are dependencies. A demo audience nods approvingly. ...