Competency Gaps: When Benchmarks Lie by Omission

Scores are comforting. That is their main commercial advantage. A vendor can say its model reaches a certain accuracy on a benchmark, a leaderboard can rank systems neatly, and an internal AI team can report that the new model is “better” than the old one. Everyone gets a number. The procurement slide looks tidy. The risk committee, if mercifully sleepy, moves on. ...

December 27, 2025 · 16 min · Zelina

Dial M—for Markets: Brain‑Scanning and Steering LLMs for Finance

TL;DR for operators: This paper is not mainly about whether an LLM can forecast stock moves from news. That storyline is already crowded, noisy, and full of people discovering that backtests look unusually handsome when nobody has yet met execution costs. The more useful contribution is different: it shows a way to inspect and adjust the internal concepts an LLM activates while processing financial text. ...

September 1, 2025 · 17 min · Zelina

How Sparse is Your Thought? Cracking the Inner Logic of Chain-of-Thought Prompts

TL;DR for operators: Chain-of-thought prompting is often sold as a window into model reasoning. This paper is more useful because it treats CoT as something less mystical and more testable: a prompt-induced change in internal representations. The researchers train sparse autoencoders on hidden activations from two Pythia models solving GSM8K math problems under CoT and NoCoT prompts. They then patch CoT-derived sparse features into NoCoT runs and ask a sharper question: does inserting those internal features increase the log-probability of the correct answer? ...

August 1, 2025 · 16 min · Zelina