LLM Evaluation

Trace Elements: Why Multimodal Reasoning Needs Its Own Safety Net

An answer can look safe and still leave fingerprints. That is the uncomfortable point behind GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision.1 The paper is not merely saying that multimodal models can be unsafe. We knew that. Congratulations, the fire is hot. Its sharper claim is architectural: once a model reasons over both images and text, the safety problem no longer lives only at the input or the final answer. It also lives in the middle. ...

Hook, Line, and Synthesized: When Phishing Meets the Age of LLMs

Email looks simple until money is involved. A suspicious invoice arrives. The subject line is dull, the body is polite, the sender domain looks almost right, and the attachment name is just credible enough to avoid comedy. A traditional filter may look for bad words, suspicious links, known domains, or old campaign signatures. A human may look for tone. An LLM may read the whole thing and decide whether the message is phishing, spam, or valid. ...

Agents Assemble: When Multi‑Agent LLMs Stop Hallucinating and Start Doing Science

A scientist does not usually fail because they cannot ask the right question. More often, they fail because the useful answer is buried behind five separate systems: a biomedical knowledge graph, a disease-module algorithm, a drug-prioritization method, a literature database, and a visualization tool that looks innocent until someone has to configure it. ...

When RAG Meets the Law: Building Trustworthy Legal AI for a Moving Target

Legal teams do not usually ask for AI that sounds clever. They ask for AI that does not accidentally invent a statute, misread a precedent, or confidently advise someone into a procedural ditch. That makes legal AI an awkward domain for large language models. The model may be fluent. The law, inconveniently, is not graded on fluency. It is graded on source, jurisdiction, timing, interpretation, and traceability. A beautiful answer with the wrong legal basis is not “almost useful”. It is professionally radioactive. ...

Breaking the Tempo: How TempoBench Reframes AI’s Struggle with Time and Causality

A failed deployment usually produces two questions. The first is easy enough to ask: what happened? The second is where the room goes quiet: what actually caused it? Most AI systems are now quite comfortable with the first question. Give them logs, traces, workflows, tool calls, or transition histories, and they can often produce a plausible reconstruction. They can narrate the incident in confident sequence. They can point to every condition that was present. They can provide a tidy post-mortem, ideally before the humans have finished opening the dashboard. ...

The Missing Metric: Measuring Agentic Potential Before It’s Too Late

Procurement teams love a leaderboard. It is tidy, numeric, comparable, and therefore dangerously comforting. A model scores well on MMLU, looks respectable on GSM8K, passes a coding benchmark, and suddenly someone in a meeting says it is “agent-ready.” Lovely. By that logic, a person who passes a written driving test should be handed the keys to a forklift in a crowded warehouse. ...

Paper Tigers or Compliance Cops? What AIReg‑Bench Really Says About LLMs and the EU AI Act

Audit queues have a special talent for turning urgency into fog. A product team wants to ship. Legal wants assurance. Governance wants evidence. The vendor has supplied a beautifully formatted technical document, full of dataset sizes, risk controls, model validation steps, and the usual confidence perfume. Somewhere inside that document may be a real compliance gap. Or it may simply be written by someone who knows how to sound compliant. Naturally, someone asks the modern executive question: can we let an LLM take the first pass? ...

Bracket Busters: When Agentic LLMs Turn Law into Code (and Catch Their Own Mistakes)

TL;DR Tax law is full of brackets, caps, cliffs, phase-outs, and exceptions. Conveniently, those are also the places where software quietly breaks. The paper behind this article introduces Synedrion, a multi-agent LLM framework for translating legal tax documents into executable software.1 Its most useful idea is not “use agents” in the vague conference-demo sense. It is more specific: split legal interpretation, code generation, senior review, and behavioural testing into separate roles, then use higher-order metamorphic testing to catch systematic errors that normal test cases and pairwise comparisons can miss. ...

Tool Wars, Protocol Peace: What MCP‑AgentBench Really Measures

A procurement team does not buy an AI agent because it can recite the word “interoperability” with theatrical confidence. It buys the agent because the thing can use tools, collect data, combine results, and stop before it bankrupts the token budget. That is the useful way to read MCP-AgentBench, a new benchmark for evaluating language agents inside the Model Context Protocol ecosystem.1 The paper is not just another leaderboard with a fresh coat of protocol paint. Its more interesting result is harsher: MCP gives agents a common integration layer, but it does not make them competent tool users. Compatibility is plumbing. Competence is orchestration. ...

Agency Check, Please: What a New Benchmark Says About LLMs That Actually Empower Users

A customer asks your AI assistant to choose between two mortgage options. An employee asks whether to quit. A student says, very politely, “Please guide me, but don’t give me the answer.” A lonely user suggests the chatbot feels like a best friend. The easy product answer is: be helpful. The harder answer is: helpful to what? ...