Mind the Markov Gap: How a Lightweight Agent Outsmarts Heavy LLMs in Open-Vocabulary Vision
A camera on a factory line does not need to write an essay before deciding whether a part is cracked. That sounds obvious. Yet a surprising amount of recent AI architecture quietly assumes the opposite: when a vision system becomes uncertain, bring in a large language model, ask it to generate richer descriptions, then run the detector again. Sometimes this works. It also turns a detection problem into a small committee meeting, and committee meetings are rarely known for real-time throughput. ...