
Add to Cart, Add to Power: What Happens When AI Shops for You

When humans stop shopping and AI takes over, the cart becomes a new battleground. A recent study titled “What Is Your AI Agent Buying?” introduces a benchmark framework called ACES to simulate AI-mediated e-commerce environments, and the results are far more consequential than a simple switch from user clicks to agent decisions.

The ACES Sandbox: Agentic E-Commerce Under the Microscope

ACES (Agentic e-Commerce Simulator) offers a controlled environment that pairs state-of-the-art vision-language-model (VLM) agents with a mock shopping website. This setup enables causal measurement of how different product attributes (price, rating, reviews) and platform levers (position, tags, sponsorship) influence agentic decision-making (a toy version of this randomized design is sketched below). ...

August 5, 2025 · 4 min · Zelina
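To make the causal-measurement idea concrete, here is a minimal sketch of the kind of randomized intervention an ACES-style sandbox enables. Everything in it is an assumption for illustration: `agent_choose` merely stands in for the paper's VLM shopping agent, and the catalog fields, slot count, and trial budget are placeholders rather than the benchmark's actual protocol.

```python
import random
from statistics import mean

def agent_choose(products):
    """Hypothetical stand-in for the VLM shopping agent.

    The real agent sees a rendered mock storefront and picks a product;
    here we return a random slot so the sketch runs end to end.
    """
    return random.randrange(len(products))

def make_catalog(sponsored_idx=None):
    # Four product slots with randomized attributes; one optional
    # platform lever (a "sponsored" tag) applied to a chosen slot.
    products = [
        {"name": f"item-{i}",
         "price": round(random.uniform(10, 50), 2),
         "rating": round(random.uniform(3.0, 5.0), 1),
         "sponsored": False}
        for i in range(4)
    ]
    if sponsored_idx is not None:
        products[sponsored_idx]["sponsored"] = True
    return products

def sponsored_tag_effect(trials=1000):
    # Randomized intervention: flip exactly one lever (the tag on slot 0)
    # and compare the agent's pick rate for that slot across the two arms.
    treated = mean(agent_choose(make_catalog(sponsored_idx=0)) == 0
                   for _ in range(trials))
    control = mean(agent_choose(make_catalog()) == 0
                   for _ in range(trials))
    return treated - control  # uplift attributable to the sponsored tag

if __name__ == "__main__":
    print(f"Estimated sponsored-tag uplift: {sponsored_tag_effect():+.3f}")
```

With the random stub the uplift hovers near zero by construction; the interesting measurement is how far a real VLM agent deviates from that baseline when the same lever is flipped.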

Seeing is Believing? Not Quite — How CoCoT Makes Vision-Language Models Think Before They Judge

Vision-language models (VLMs) may describe what they see, but do they truly understand what they’re looking at — especially in social contexts? A recent paper introduces Cognitive Chain-of-Thought (CoCoT), a deceptively simple yet remarkably effective prompting strategy that helps these models reason like humans: through layered cognition, not flat logic (a minimal version of the layered prompt is sketched below).

The Problem with Flat Reasoning

Traditional Chain-of-Thought (CoT) prompting, while powerful for math and symbolic tasks, falls short when it comes to social or moral interpretation. Consider a scene where a person wears a mask indoors, and another says, “Hiding from the paparazzi, huh?” CoT may recognize the mask, but often misfires in guessing intent — is it a joke? A warning? An instruction? ...

July 29, 2025 · 3 min · Zelina
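As a rough illustration of what layered cognition can look like in practice, the sketch below chains three prompt stages so that each answer conditions the next. Treat it as a minimal sketch, not the paper's actual prompts: the stage names (perception, situation, norm) and their wording are assumptions here, and `query_vlm` is a hypothetical wrapper around any chat-style VLM API.

```python
# Minimal CoCoT-style layered prompting. Stage names and wording are
# illustrative assumptions; `query_vlm` is a hypothetical callable with
# the signature query_vlm(image=..., prompt=...) -> str.
COCOT_STAGES = [
    ("perception", "Describe only what is literally visible in the image."),
    ("situation",  "Given that description, infer what is happening and why."),
    ("norm",       "Given the situation, judge the speaker's intent and whether "
                   "the remark is appropriate in this social context."),
]

def cocot(image, utterance, query_vlm):
    context = f'Utterance: "{utterance}"'
    for stage, instruction in COCOT_STAGES:
        answer = query_vlm(image=image, prompt=f"{context}\n[{stage}] {instruction}")
        context += f"\n[{stage}] {answer}"  # each layer conditions the next
    return context  # the final [norm] answer carries the social judgment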

Steering by the Token: How GRAINS Turns Attribution into Alignment

Fine-tuning is the hammer; steering is the scalpel. In an era where models are increasingly opaque and high-stakes, we need tools that guide behavior without overhauling the entire architecture. That’s precisely what GRAINS (Gradient-based Attribution for Inference-Time Steering) delivers: a powerful, interpretable, and modular way to shift the behavior of LLMs and VLMs by leveraging the most fundamental unit of influence—the token (a rough sketch of the mechanism follows below).

The Problem with Global Steering

Traditional inference-time steering approaches often rely on global intervention vectors: a blunt, one-size-fits-all shift in hidden activations derived from paired desirable and undesirable examples. But these methods are insensitive to which specific tokens caused the bad behavior. It’s like adjusting a recipe because the dish tastes bad—without checking whether the salt or the sugar was at fault. ...

July 26, 2025 · 3 min · Zelina
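Directionally, the token-level idea can be sketched in a few lines of PyTorch. This is a hedged reconstruction under stated assumptions, not the authors' implementation: it assumes gradient × activation attributions over a single layer's hidden states, a top-k token contrast to build the steering vector, and injection via a forward hook; the layer choice, `k`, the scale `alpha`, and the loss definitions are all placeholders.

```python
import torch

def token_attributions(hidden, loss):
    """Gradient x activation: one influence score per token.

    `hidden` is the [tokens, d] activation at the chosen layer (must
    require grad); `loss` scores the behavior of interest.
    """
    (grad,) = torch.autograd.grad(loss, hidden, retain_graph=True)
    return (grad * hidden).sum(-1)

def steering_vector(hid_good, loss_good, hid_bad, loss_bad, k=8):
    # Contrast the mean activation of the k most influential tokens
    # in the desirable run against the undesirable run.
    top_g = token_attributions(hid_good, loss_good).topk(k).indices
    top_b = token_attributions(hid_bad, loss_bad).topk(k).indices
    return hid_good[top_g].mean(0) - hid_bad[top_b].mean(0)

def make_steering_hook(v, alpha=4.0):
    # Attach with layer.register_forward_hook(make_steering_hook(v)).
    # Assumes the layer returns a plain tensor; unpack tuples if needed.
    def hook(module, inputs, output):
        return output + alpha * v.detach()  # shift activations toward "good"
    return hook
```

The design choice that matters is the top-k selection: by contrasting only the tokens that attribution flags as influential, the vector targets the source of the behavior instead of shifting every activation indiscriminately, which is exactly the salt-versus-sugar distinction the global methods miss.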

Tunnel Vision: Why Vision-Language Models Still Miss the Bigger Picture

It’s no secret that Vision-Language Models (VLMs) have dazzled us with their prowess—excelling at image captioning, chart understanding, and even medical diagnostics. But beneath the glitter of benchmark wins, a deeper flaw lurks: these models often suffer from what Berman and Deng (Princeton) have sharply diagnosed as “tunnel vision.” Their new paper, VLMs Have Tunnel Vision, introduces a battery of tasks that humans can breeze through but that leading VLMs—from Gemini 2.5 Pro to Claude Vision 3.7—fail to solve at rates even marginally above chance. These tasks aren’t edge cases or contrived puzzles. They simulate basic human visual competencies like comparing two objects, following a path, and making discrete visual inferences from spatially distributed evidence. The results? A sobering reminder that state-of-the-art perception doesn’t equate to understanding. ...

July 21, 2025 · 4 min · Zelina