Triage by Token: When Context Clues Quietly Override Clinical Judgment
How proxy-variable testing exposes a quiet failure mode in LLM-based emergency triage: models can change acuity judgments when non-clinical context enters the prompt.
A mechanism-first reading of LLM-in-Sandbox, showing why giving models a minimal computer environment may matter more than adding another clever prompt.
A case-first reading of RCORE shows why video models can still confuse actions when object priors overpower temporal evidence.
A mechanism-first reading of how explicit state dynamics can make LLM agents more temporally coherent, and why too much stability becomes its own failure mode.
A mechanism-first reading of Cosmos Policy, showing how latent frame injection turns a video diffusion model into a robot policy, world model, and planner.
DeepBound shows how a neural node selector can help branch-and-bound solvers find strong feasible solutions earlier without replacing exact MILP machinery.
A tournament-style prompt evaluation study shows why educational AI teams need evidence, not just elegant prompt wording.
A practical reading of why multimodal climate fact-checking needs evidence orchestration, not just a larger vision-language model with a browser attached.
A diagnostic study of RL-trained Lean provers shows that more inference samples can repeat the same failed strategy, while tactic-level structural hints recover proofs that random sampling misses.
A mechanism-first reading of why LLM unlearning can look successful at the output layer while membership traces remain detectable inside model representations.