When Agents Browse Back: Why Multimodal Search Still Fails the Real Web
Search looks easy until the answer is hiding in a caption, a cropped image region, a second web page, and one annoyingly necessary intermediate clue. That is the problem BrowseComp-V3 is trying to measure.[1] Not whether a multimodal model can recognize an object in an image. Not whether a chatbot can summarize the first search result. Not whether a web agent can click around long enough to look busy. The benchmark asks a more operationally relevant question: can an AI system browse the open web, combine text and visual evidence across multiple steps, and still arrive at the right answer? ...