Model Introspection

Audit. That is the word enterprises prefer when they want something to sound measurable, serious, and safely boring. You audit model outputs. You audit prompts. You audit logs. You audit whether the assistant said the forbidden thing, leaked the private thing, or hallucinated the regulatory thing. The problem is that models are not only output machines. They are also representation machines. Between the input and the final answer, they build intermediate signals, suppress some of them, amplify others, and then hand management a neat little sentence pretending the whole internal mess never happened. ...

Model Introspection

Mind Reading Machines: When AI Knows Something Is Wrong (But Not What)

The Model That Knows It Knows: When Introspection Hides in the Logits