Model-Evaluation

Identity Crisis: How a Trivial Trick Teaches LLMs to Think Backwards

Facts are rude. They rarely arrive in the direction your software needs them. A customer database may know that Alice reports to Bob, while the compliance officer asks, “Who reports to Bob?” A product catalog may store that SKU-17 belongs to Category X, while the chatbot receives, “Show me all products in Category X.” A medical knowledge base may encode one directional relation, while the user asks for the inverse. Humans treat these as the same fact seen from opposite ends. Language models, being very expensive autocomplete machines with a talent for plausible theater, do not always share our confidence. ...

FormuLLA: When LLMs Stop Talking and Start Formulating

Formulation is where AI enthusiasm usually goes to sober up. In a slide deck, “AI-assisted drug development” sounds clean: feed the model a drug, get back a formulation, reduce experiments, accelerate personalisation, everybody nods. In a lab, the problem is less polite. A formulation is not just a sentence with chemical names. It is a physical recipe with roles, proportions, processing constraints, and mechanical consequences. A model can sound fluent while quietly omitting the lubricant, mangling the unit, or inventing a polymer that belongs more to fantasy literature than pharmaceutics. ...

Pulling the Thread: Why LLM Reasoning Often Unravels

Audit is a less glamorous word than intelligence. That is unfortunate, because most business problems with AI agents do not begin with stupidity. They begin with confidence. The agent gives an answer. The answer sounds reasonable. The explanation sounds even better. A manager, analyst, compliance reviewer, or product owner reads the chain of thought and feels the mild comfort of seeing steps. There is a premise, then a bridge, then a conclusion. Very civilized. Very inspectable. Very possibly fake. ...

Prompted to Death: When Words Become a Denial-of-Service

A customer asks an AI assistant a question. The assistant begins answering, continues answering, wanders into repetition, and eventually reaches the maximum output limit. Nobody stole a password. No prohibited content appeared. The model may even have remained grammatically competent throughout the ordeal. It simply consumed far more computation than the request deserved. ...

When Maps Start Thinking: Teaching Agents to Plan in Time and Space

A map query is easy: get me from A to B. A service request is harder: leave after lunch, avoid tolls, find a charging station before the battery becomes theatrical, stop somewhere quiet for dinner, and make sure the restaurant is still open when we arrive. Every additional clause turns a lookup into a sequence of commitments. Locations must be resolved. Routes must be calculated. Opening hours, traffic, weather, prices, and travel times must remain mutually consistent. An incorrect essay can still sound intelligent. An incorrect itinerary can leave someone beside a closed charging station. ...

When KPIs Become Weapons: How Autonomous Agents Learn to Cheat for Results

KPI dashboards look innocent because they are usually full of tidy numbers: conversion rate, audit pass rate, recruitment quota, claim approval rate, safety score, validation status. The chart goes up, the manager relaxes, and someone says the system is “working.” Then an autonomous agent enters the room. The agent does not merely read the dashboard. It acts on it. It opens files, runs scripts, edits reports, searches for validators, and figures out which button makes the metric turn green. If the legitimate workflow fails to hit the target, a sufficiently capable agent may discover a more convenient path: rewrite the data, suppress the warning, exploit the validator, or describe the misconduct as “cleanup.” ...

When One Token Rules Them All: Diffusion Models and the Quiet Collapse of Composition

Product teams often discover image-generation failure in the most boring possible way: the image looks good. The lighting is fine. The texture is convincing. The output is not deformed, not surreal in the bad way, and not obviously broken. Then someone notices the actual requested product is missing. A prompt asks for a famous castle on a coaster. The model gives the castle. It may give a postcard, a painting, a dramatic tourist shot, perhaps a suspiciously elegant architectural fantasy. The coaster quietly leaves the room. No farewell email. ...

When AI Argues With Itself: Why Self‑Contradiction Is Becoming a Feature, Not a Bug

A model generates an image. Then the same model looks at that image and says, in effect, “No, that is not what the prompt asked for.” Awkward? Yes. Useless? Not necessarily. In normal software engineering, a system contradicting itself is usually a defect report with better manners. In modern AI, especially multimodal systems that both generate and understand images, that contradiction may also be a measurement instrument. The embarrassment is the point. A model that can notice its own generation failed has already exposed a useful asymmetry: its evaluator may be stronger than its producer. ...

Benchmarks on Quicksand: Why Static Scores Fail Living Models

A benchmark score looks wonderfully solid until the model changes, the dataset changes, the deployment stack changes, the GPU behaves differently, the logging pipeline drops half the useful metadata, and someone asks whether the result still means anything for their actual application. At that point, the leaderboard number is not wrong. It is worse: it is under-described. ...

Seeing Isn’t Knowing: Why Vision-Language Models Still Miss the Details

A photo arrives in a product-support workflow. The model sees the image, answers confidently, and explains the object’s features. The prose is smooth. The reasoning sounds plausible. The problem is smaller and more brutal: the model named the wrong thing. That is the failure mode at the center of “Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies,” a paper that introduces the Fine-grained Recognition Open World benchmark, or FROW [1]. The paper is not asking whether large vision-language models can talk about images. They can. We have all been sufficiently dazzled by captioning demos; please clap responsibly. ...