RAG | Cognaptus

Hook, Line, and Import: How RAG Lets Attackers Snare Your Code

Imports look harmless until they become procurement. A developer asks an AI assistant for a plotting snippet. The assistant returns clean-looking Python, a few lines of explanation, and an import statement for matplotlib_safe. The name sounds prudent. Safer is good. Safer is what the security team keeps asking for, usually in meetings that could have been static analysis. ...

Stop, Verify, and Listen: HALT‑RAG Brings a ‘Reject Option’ to RAG

RAG systems usually fail in a very business-like way: not with drama, but with confident paperwork. The retriever finds something. The generator writes something. The user sees an answer that looks plausible, well formatted, and sufficiently certain to be dangerous. Then someone asks the dull but expensive question: did the answer actually follow from the source? ...

Branching Out of the Middle: How a ‘Tree of Agents’ Fixes Long-Context Blind Spots

Contracts are not polite. They hide the important clause on page 83, define the crucial exception on page 17, and bury the fatal cross-reference in an appendix nobody wanted to read. Annual reports behave similarly. So do medical SOPs, litigation files, policy manuals, technical logs, and most documents produced by institutions that have discovered both Microsoft Word and committees. ...

Graph and Circumstance: Maestro Conducts Reliable AI Agents

A broken AI agent often looks deceptively close to working. It answers most questions. It calls the right tool sometimes. It follows the instruction until the conversation gets long, the retrieval query gets vague, or the arithmetic becomes just difficult enough for the model to start doing spreadsheet theatre. The usual repair is prompt editing. Add a stern sentence. Add a role. Add an example. Add “think step by step,” because apparently the machine needed a motivational poster. ...

Model Portfolio: When LLMs Sit the CFA

Exams are useful because they are rude. They do not care that a model sounds polished, cites the right buzzwords, or can produce a gorgeous paragraph about duration risk. They ask for A, B, or C. Then they mark the answer wrong. That is why a new CFA-based benchmark is more useful than another misty-eyed essay about AI “transforming finance.” The paper evaluates GPT-4o, GPT-o1, and o3-mini on 1,560 official CFA mock multiple-choice questions across Levels I, II, and III, both zero-shot and with a domain-reasoning RAG pipeline built from official CFA curriculum materials.1 The result is not a single leaderboard. It is closer to a routing manual. ...

Fusion Cuisine for RAG: Z‑Scores, Rankers, and the Two‑Source Diet

A RAG system usually fails in one of two annoyingly familiar ways. It retrieves documents that are factually relevant but gives the model no clue about the task’s decision boundary. Or it retrieves labelled examples that show the decision pattern but are too parochial to help when the topic drifts. One source knows the world. The other knows the exam rubric. Naturally, many systems pick one and then pretend the compromise was strategy. ...

Judgment Day for RAG: How L‑MARS Cuts Legal Hallucinations by Design

TL;DR for operators Legal AI does not fail only because models “hallucinate”. That word has become the industry’s favourite fog machine. The more operational diagnosis is sharper: models fail when they answer current legal questions from stale internal memory and then dress the error in confident reasoning. The L-MARS paper is useful because it separates two tasks that vendors often blend together for convenience: retrieving current legal facts and reasoning over stable legal principles.1 On LegalSearchQA, a new 50-question benchmark built around recent U.S. legal facts verified in March 2026, L-MARS reaches 96.0% accuracy. Zero-shot GPT-4o-mini reaches 58.0%. Chain-of-thought falls to 30.0%, because step-by-step reasoning from outdated premises merely creates a more articulate mistake. ...

Numbers Need Narration: Making LLMs Do Reasoning‑Intensive Regression

TL;DR for operators Many AI workflows do not need a yes-or-no judgment. They need a number: how well did this answer follow the instruction, how far did this reasoning trace remain valid, how much better is answer A than answer B, how strong is this essay, how risky is this case, how close is this support call to escalation? ...

Assert Less, Observe More: AICL and the New QA Stack for LLM Apps

TL;DR for operators LLM application testing should stop pretending that the whole product behaves like ordinary software. The database connector, retry logic, API wrapper, and schema validator still deserve normal unit, integration, and load tests. Fine. Keep those. They are not the problem. The problem starts when the product becomes a stateful language system: prompts are assembled dynamically, retrieval changes the context, tool calls modify the execution path, memory leaks across turns, and a model update can improve one workflow while quietly breaking another. At that point, exact-match assertions become less like QA and more like theatre with a YAML file. ...

Put It on the GLARE: How Agentic Reasoning Makes Legal AI Actually Think

TL;DR for operators GLARE is useful because it attacks the boring but expensive failure mode in legal AI: the model jumps to the familiar label, decorates the guess with legal-sounding prose, and hopes nobody asks whether a nearby charge would have fit better. The paper proposes an agentic legal judgment prediction framework that does three things in sequence: it expands the set of candidate charges, retrieves precedents with explicit reasoning paths rather than just similar facts, and performs targeted legal search when the model detects a knowledge gap.1 That mechanism matters more than the branding. GLARE is not “RAG, but with legal documents.” It is closer to a small operating procedure for legal reasoning: widen the hypothesis space, compare alternatives, then fetch the missing premise. ...