AI Governance

When Tokens Remember: Graphing the Ghosts in LLM Reasoning

Audit is easy when the answer is a single lookup. A customer asks, “What is your refund policy?” The model quotes the policy paragraph. We check whether the quoted paragraph came from the right source. Very civilized. Everyone goes home early. But real enterprise LLM work is rarely that tidy. A compliance assistant reads a contract, extracts obligations, compares them with internal policy, reasons through exceptions, and writes a recommendation. A research assistant reads multiple sources, builds an intermediate summary, then answers a question from that summary. A support agent reads a user history, infers the likely issue, then proposes the next action. In these cases, the final sentence may depend on prompt evidence and on earlier generated text. ...

Model First, Think Later: Why LLMs Fail Before They Reason

The schedule looked reasonable. That was the problem. Imagine asking an AI agent to build a weekly medical schedule. It produces a neat plan. The steps are numbered. The tone is confident. The explanation is calm enough to sedate a committee. Then someone checks the details. A medication interval is violated. A resource is assigned twice. A prerequisite appears after the action that depends on it. Nothing looks absurd sentence by sentence, but the plan is broken as a system. ...

NeuralFOMO: When LLMs Care About Being Second

Losing is not the problem. Being seen losing is. Put two AI agents in the same workflow and the design immediately stops being a simple productivity question. One agent writes code. Another reviews it. A third ranks alternatives. A fourth routes the next task to whoever looks most competent. At the slide-deck level, this is “multi-agent collaboration.” In the logs, it is often a scoreboard with better manners. ...

When Reasoning Needs Receipts: Graphs Over Guesswork in Medical AI

Diagnosis is not a magic word. In medicine, the answer matters, but the path to the answer matters almost as much. A model that says the correct disease name after skipping the decisive evidence is not “reasoning efficiently.” It is guessing with bedside manner. That is the problem addressed by MedCEG: Reinforcing Verifiable Medical Reasoning with Critical Evidence Graph.[1] The paper’s core claim is not simply that a medical LLM can score higher on benchmarks. That would be useful, but not especially surprising. The more interesting move is architectural: the authors try to make clinical reasoning trainable by turning it into a graph of required evidence, then rewarding the model for following that graph. ...

Green Is the New Gray: When ESG Claims Meet Evidence

Greenwashing usually begins with a sentence that sounds harmless enough. “We reduced emissions.” “Our operations are greener.” “This product supports a sustainable future.” Very nice. Also very convenient. The problem is that none of these claims can be judged by grammatical confidence, public relations polish, or the warm glow of the word sustainable. A serious reviewer has to ask uglier questions: reduced compared with what year? Which scope of emissions? Which facility? Which product line? Is the claim about a target, an initiative, or actual measured performance? ...

When LLMs Get Fatty Liver: Diagnosing AI-MASLD in Clinical AI

A patient walks into a clinic and tells the doctor several things at once: chest tightness, shortness of breath, leg swelling, leg pain, maybe a history of walking too much, maybe some anxiety, maybe something that sounds more obviously cardiac. The dangerous part is not the word “chest.” The dangerous part is the chain: leg swelling and pain may suggest deep vein thrombosis; shortness of breath may suggest pulmonary embolism; pulmonary embolism can kill. ...

Who Gets Flagged? When AI Detectors Learn Our Biases

Classroom. A student submits an essay. A detector returns a score. Someone in authority reads that score as evidence. The student now has to prove that their own words are, in fact, their own. This is the point where AI-text detection stops being a technical widget and becomes an institutional decision system. The question is no longer just “Can this model distinguish AI-generated text from human writing?” It is “Which humans does it fail to recognize as human?” ...

When LLMs Stop Guessing and Start Arguing: A Two-Stage Cure for Health Misinformation

A clinic does not convene a committee every time a thermometer reads 37°C. It checks the reading, compares it with context, and escalates only when the situation becomes ambiguous. That simple operating habit is often missing from AI systems. Give a language model a health claim, and many modern pipelines immediately reach for the big machinery: web search, retrieval, reasoning chains, multiple agents, judge models, and a small theatre production in prompt form. ...

You Know It When You See It—But Can the Model?

Review queue. Someone has to decide whether an image is “unsafe,” “misleading,” “healthy,” “premium,” “clickbait,” “brand-safe,” or “not really our vibe.” The label sounds simple until the first borderline case appears. A salad with too much cream. A gaming ad that hints at easy money but never quite says it. A before-and-after photo where the “achievement” is visible only if one is feeling generous. ...

Crowds, Codes, and Consensus: When AI Learns the Language of Science

A lab has data. Lots of data. Spectra, simulations, microscopy images, code outputs, experimental notes, model prompts, maybe three versions of a spreadsheet called final_final_revised.xlsx, because civilization remains fragile. Then someone asks a simple question: what does this variable mean? That is when the machinery slows down. The word looked obvious when one team wrote it. It becomes less obvious when another team tries to reuse it. It becomes actively annoying when a model retrieves the wrong dataset because two groups used the same term differently, or different terms for the same concept. At that point, metadata stops being administrative wallpaper and becomes infrastructure. ...