Token Probabilities

Mistakes are easy to audit after the fact. That is why most AI evaluation still behaves like a mildly disappointed teacher: wait for the final answer, mark it right or wrong, and pretend the interesting part happened at the end. But in real LLM workflows, the damage often starts earlier. A model begins with a plausible line of reasoning, then drifts. It changes route without noticing. It over-explains a wrong intermediate step. It doubles back, patches the logic, and sometimes recovers. Other times it gracefully walks into a wall, with the confidence of a consultant holding a laser pointer. ...