LLMs | Cognaptus

No Structure, No Glory: Why AI Cognition Has to Be Shown, Not Named

TL;DR for operators: AI systems are now sold with labels that sound increasingly cognitive: reasoning, planning, agency, memory, autonomy, sometimes even the more theatrical hints of machine consciousness. Lovely. The marketing department has discovered philosophy. The useful question is not whether the label feels exciting. It is whether the system realizes an internal organization that could actually support the claimed capability. ...

The Code Agent Wasn’t Self-Correcting. The Test Harness Was.

TL;DR for operators: Code agents do not become reliable because they are asked politely to “fix the bug.” They become more useful when they are placed inside a loop that can run their output, return structured failure evidence, and decide how many further attempts are worth buying. That is the practical point of Zhang and Kothari’s paper, Unlocking LLM Code Correction with Iterative Feedback Loops.1 The authors evaluate four LLMs across Python and Java using LeetCode problems, then move from ordinary one-shot performance to an automated correction loop: generate code, execute it, feed back compiler/runtime/testcase information, and repeat up to ten iterations. ...
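The loop described above (generate, execute, feed back, repeat up to ten times) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's harness: `run_tests`, `correction_loop`, and `fake_model` are hypothetical names, the "model" is a hard-coded stub, and in-process `exec` stands in for the sandboxed runner a real setup would need.

```python
from typing import Callable, List, Optional, Tuple

MAX_ITERATIONS = 10  # the excerpt states the loop repeats up to ten iterations

TestCase = Tuple[tuple, object]  # (positional arguments, expected result)


def run_tests(source: str, tests: List[TestCase], func_name: str) -> str:
    """Run candidate code against test cases.

    Returns an empty string on success, otherwise a short failure message.
    Assumption: exec() in-process replaces a real sandboxed executor.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)  # surfaces syntax errors and top-level failures
    except Exception as exc:
        return f"compile/runtime error: {type(exc).__name__}: {exc}"

    func = namespace.get(func_name)
    if func is None:
        return f"function {func_name!r} is not defined"

    for args, expected in tests:
        try:
            actual = func(*args)
        except Exception as exc:
            return f"runtime error on input {args}: {type(exc).__name__}: {exc}"
        if actual != expected:
            return f"wrong answer on input {args}: expected {expected!r}, got {actual!r}"
    return ""


def correction_loop(
    generate: Callable[[str], str],
    tests: List[TestCase],
    func_name: str,
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[Optional[str], int]:
    """Generate, execute, feed the failure back, and retry within a budget."""
    feedback = ""
    for attempt in range(1, max_iterations + 1):
        source = generate(feedback)
        feedback = run_tests(source, tests, func_name)
        if not feedback:
            return source, attempt  # passing code and the attempts it cost
    return None, max_iterations  # budget exhausted without a passing solution


def fake_model(feedback: str) -> str:
    """Hypothetical stand-in for an LLM: buggy first, fixed once it sees feedback."""
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


TESTS: List[TestCase] = [((1, 2), 3), ((0, 0), 0)]
solution, attempts = correction_loop(fake_model, TESTS, "add")
```

The point of the sketch is the division of labour: the generator never judges its own output. The harness produces the failure evidence, and the iteration cap is the explicit decision about how many retries are worth paying for.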

Feedback, Not Freefall: Why LLM Writing Tools Need a Teacher in the Loop

Feedback is expensive. Anyone who has managed a classroom, a content team, a training programme, or a junior analyst cohort knows the pattern. The first draft is rarely the problem. The problem is the second draft, because the second draft requires specific feedback, delivered in language the learner can act on, without exhausting the person giving it. Multiply that by thirty students, ten assignments, uneven ability levels, and a calendar that refuses to become more generous. Suddenly “just give everyone personalised feedback” becomes one of those ideas beloved by people who do not have to do it. ...

Small Moves, Big Models: The Quiet Discipline of Bounded AI

Everyone wants the grand AI replacement story. The model eats the stack, digests the workflow, and emits profit. Very tidy. Also, usually nonsense. The more interesting pattern emerging in applied AI is smaller, less theatrical, and considerably more useful: the model is not the system. It is an intervention inside the system. It edits one field. It predicts one missing signal. It routes one candidate generator. It enters through a side door, preferably wearing a badge. ...

Fine-Tuned, Fine Print: Why Post-Training Teaches Models What to Trust

Enterprise AI has entered its “sure, but can it use the evidence?” phase. That is progress, technically. It is also where many deployment stories begin to get expensive. The first generation of business LLM adoption was satisfied if a model could produce a fluent answer. The next generation asks something more demanding: can the model use retrieved documents, compliance policies, tool outputs, customer records, analyst notes, and human feedback in the right way? ...

OCR and the City: Why Document AI Still Needs Eyes

A document lands in an intake queue. It might be an invoice, a memo, a form, a résumé, or one of those corporate artifacts whose layout says more than the words do. Someone wants the system to classify it instantly, because every downstream workflow—routing, extraction, compliance, archiving—depends on that first label. The fashionable answer is: send it to a large language model. Extract the text, paste it into a prompt, ask for one label, and let the machine be clever. This is attractive because it feels general. It is also how many automation projects quietly turn a visual problem into a text problem, then act surprised when the system starts calling file folders “proposals” because the word proposal appeared somewhere on the page. ...

Talk Is Cheap, Until It Trains ASR

Call centers are very good at producing audio. They are much worse at producing clean, labeled, domain-matched, multi-speaker training data. That distinction matters. A business may have thousands of hours of customer calls, branch conversations, medical consultations, field-service recordings, or internal support audio. But most of it is noisy, consent-constrained, poorly transcribed, unevenly distributed across accents and topics, and inconveniently full of humans doing human things: interrupting, pausing, talking over each other, drifting off-topic, and using domain-specific shorthand as if the ASR model had attended the onboarding session. ...

Expert Witness: How MoE Translation Models Can Lose Weight Without Losing the Plot

Translation is one of those AI workloads where scale is both a blessing and a tax. A large language model can translate with impressive robustness, follow instructions, preserve formatting, and handle messy inputs better than many older systems. Then the bill arrives. The model is not only carrying translation ability; it is also carrying mathematical reasoning, factual memory, coding patterns, roleplay habits, tool-use affordances, and several other things that are not exactly required to turn German into English. ...

Think Meter, Not Think Bigger: The New Control Layer for AI Reasoning

Most companies do not actually want an AI system that “thinks longer.” They want one that knows when extra thinking is worth the bill. That distinction is becoming more important. Reasoning models are moving from demo-stage math puzzles into document review, financial research, compliance analysis, customer support escalation, and agentic workflows. In these settings, reasoning has three costs: latency, compute, and misplaced confidence. A model that spends 30 seconds producing an elegant wrong answer has not reasoned. It has performed expensive theatre. Very fluent theatre, admittedly. ...

Do the Math, Not the Mime: Why LLM Reasoning Needs a Verification Pipeline

A spreadsheet error rarely announces itself with dramatic music. It usually arrives politely. A pricing model gives a clean answer. A compliance calculator writes a confident explanation. A financial assistant produces a neat derivation with enough intermediate steps to look reassuring. The result is formatted, fluent, and possibly wrong. That is the uncomfortable business lesson behind Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges, a 2026 survey of roughly 120 studies on LLM mathematical reasoning.1 The paper is not introducing one new benchmark, one heroic model, or one more leaderboard trophy to place on the already overcrowded mantelpiece. Its useful contribution is more structural: it connects datasets, representations, training methods, tool use, verifiers, and evaluation metrics into one reasoning pipeline. ...