AI Auditing

Silent Errors, Loud Consequences: ASMR-Bench and the Coming Era of AI Auditors

Code review is supposed to be the sober adult in the room. A researcher writes code. A reviewer checks the code. A suspicious bug gets caught before it becomes a chart, a memo, a product decision, or—if everyone is having a particularly expensive week—a board presentation. That model works reasonably well when the failure is accidental and the reviewer has more patience than the author. It becomes less reassuring when the author is an AI research agent, the codebase is messy, the experiment is expensive to rerun, and the suspicious line looks less like a bug than a perfectly normal design choice. ...

Audit the Bots: When AI Judges the Work of Other AI

A bot finishes a task on a computer. It says the file was downloaded, the form was submitted, the setting was changed, or the report was edited. Now comes the awkward part. Do we believe it? For traditional automation, the answer was usually procedural. Check a database field. Inspect a log. Verify an API response. Confirm that a rule fired. Robotic process automation was brittle, yes, but at least its brittleness often left a trail. The machine followed a script; the script touched known systems; the success condition could usually be hard-coded by someone patient enough to suffer through enterprise software. ...

Stack Overflow for Ethics: Governing AI with Feedback, Not Faith

Dashboards are where good intentions go to look responsible. A company launches an AI triage assistant, lending model, recommender, or eligibility system. The governance slide deck is very respectable. Fairness is mentioned. Transparency is mentioned. Human oversight is mentioned, usually beside a tasteful icon of a person holding a clipboard. Everyone nods. Six months later, users have learned to rubber-stamp the recommendation, one subgroup’s error rate has drifted, appeals are piling up, and nobody can say whether the system is still operating inside the boundaries that were promised at launch. ...

When Democracy Meets the Algorithm: Auditing Representation in the Age of LLMs

Agenda-setting is where participation quietly becomes power. Anyone can invite hundreds of people to submit questions. That part is now cheap. The difficult part arrives ten minutes later, when an expert panel has time to answer only seven of them, and someone has to decide which seven. This is the small administrative hinge on which democratic legitimacy loves to swing. A moderator chooses. A platform ranks. An LLM summarises. Everyone else is told, usually with a straight face, that their concerns were “reflected”. ...

Graphing the Invisible: How Community Detection Makes AI Explanations Human-Scale

Auditors like lists. Models, inconveniently, do not behave like lists. A credit model may tell you that income mattered, education mattered, job type mattered, age mattered, and postcode-adjacent variables mattered. A fraud model may produce the same kind of feature ranking, only with device fingerprints and transaction timings instead of employment history. The dashboard looks satisfyingly crisp: bars, scores, explanations, probably a tasteful shade of corporate blue. Then the real question arrives: which of these variables are acting together? ...

Automate All the Things? Mind the Blind Spots

A research report lands on your desk. It has a neat abstract, respectable tables, clean code attached, and just enough methodological language to sound like someone suffered through the usual academic rituals. Except this time, no one did. An AI scientist system generated the idea, wrote the code, ran the experiments, selected the result, and drafted the paper. ...