
Scan You Believe It? Why RadAgent Makes Medical AI Show Its Work

Opening — Why this matters now
Healthcare AI has enjoyed a profitable habit: making bold claims while hiding the reasoning. In radiology, that is especially awkward. A chest CT is not a toy benchmark—it is a dense 3D diagnostic object where missed findings carry real costs. Yet many vision-language systems still behave like confident interns who misplaced their notes. ...

April 20, 2026 · 4 min · Zelina

Turning Heads: Why AI Still Gets Lost When It Turns Around

Opening — Why this matters now
AI vendors increasingly market “reasoning” systems as if cognition were a solved procurement category. Yet many real business workflows—from robotics and warehousing to field service routing, digital twins, CAD copilots, and autonomous navigation—depend on something more primitive than eloquence: spatial consistency. A recent paper asks a delightfully inconvenient question: can large language models (LLMs) and vision-language models (VLMs) mentally track a viewpoint rotating around a room using only text descriptions? The answer, in short: often no. Humans scored 100%. Many frontier models did not come close. ...

April 20, 2026 · 4 min · Zelina

When AI Gets the Joke: Why Reasoning Beats Scale in Multimodal Humor

Opening — Why this matters now
Everyone wants AI that can reason. Few can define it. Fewer still can measure it. That becomes awkward when models ace benchmarks yet fail at tasks any mildly caffeinated human handles instinctively: irony, nuance, timing, taste, and humor. If a system cannot tell why something is funny, it probably struggles with subtler forms of judgment too—sales messaging, negotiation tone, brand voice, executive communication, customer empathy. ...

April 20, 2026 · 5 min · Zelina

When AI Knows the Map but Gets Lost on the Journey

Opening — Why this matters now
Everyone wants AI agents that can plan, reason, and execute multi-step work. Fewer people ask the impolite question: can they keep doing it when the task gets longer? A new ICLR 2026 paper studies this with unusual discipline. Instead of another benchmark made of messy internet text and leaderboard optimism, the authors use shortest-path planning in synthetic maps to isolate one brutal truth: many models can transfer skills to new environments, yet still collapse when the sequence of decisions extends too far. ...

April 20, 2026 · 4 min · Zelina

When the Judge Needs Judging: LLM Evaluators Under Cross-Examination

Opening — Why this matters now
Everyone wants AI to grade AI. It is faster, cheaper, and does not ask for lunch breaks. From summarization benchmarks to model leaderboards, LLM-as-judge systems now sit quietly inside many evaluation pipelines, handing out scores with bureaucratic confidence. There is only one minor complication: no one has been checking whether the judge is reliable on any given case. ...

April 20, 2026 · 4 min · Zelina

When the Referee Wants to Be Nice: Hidden Bias in AI Judges

Opening — Why this matters now
Everyone wants AI that can evaluate AI. It is cheaper than humans, faster than humans, and—according to many slide decks—more scalable than reality itself. Modern AI pipelines increasingly rely on LLM-as-a-judge systems to rate safety, quality, policy compliance, and readiness for deployment. These judges decide whether a model is helpful, harmful, safe, or suspect. Conveniently, they do so without lunch breaks. ...

April 20, 2026 · 4 min · Zelina

Grid Guardians: Why AI Needs a Safety Chaperone Before Running the Power Grid

Opening — Why this matters now
Electric grids are becoming less predictable, more distributed, and less forgiving. Renewables fluctuate, demand spikes move faster, and operators must make decisions across sprawling networks under hard physical constraints. Meanwhile, everyone would like AI to optimize infrastructure—preferably yesterday. There is one awkward detail: power grids are not ad-click systems. When recommendation engines fail, users get odd suggestions. When grid control fails, cities get darkness. ...

April 16, 2026 · 4 min · Zelina

Reviewer, Reviewed: When AI Starts Grading the Graders

Opening — Why this matters now
Every industry has a bottleneck disguised as tradition. In academia, it is peer review: noble in theory, overloaded in practice, and increasingly powered by caffeine and resentment. The paper AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot reports something more consequential than a conference experiment. It documents a live deployment where 22,977 submissions each received an official AI-generated review in under 24 hours. No sandbox. No toy benchmark. Real papers, real authors, real consequences. ...

April 16, 2026 · 5 min · Zelina

Rewarding Bad Physics Habits: What VLMs Learn When You Pay Them to Reason

Opening — Why this matters now
Everyone wants AI that can reason. Preferably about things that matter: machinery, logistics, engineering diagrams, medical imaging, factory operations. Unfortunately, many systems marketed as “reasoning models” are still glorified pattern matchers with a flair for confident prose. This paper, Reward Design for Physical Reasoning in Vision-Language Models, asks a sharper question: if we reward an AI differently, what kind of reasoning behavior do we get? The answer is refreshingly inconvenient. There is no universal reward signal that makes models smarter. There are only trade-offs, incentives, and consequences. Rather like management. ...

April 16, 2026 · 4 min · Zelina

Benchmarking the Benchmarks: When AI Safety Metrics Stop Meaning Anything

Opening — Why this matters now
The AI industry has quietly entered a dangerous phase: we are measuring everything, and understanding very little. If you ask five vendors whether their model is “safe,” you will likely get five confident “yes” answers—each backed by benchmarks, metrics, and charts. The problem is not the lack of evaluation. It is that the evaluations no longer agree on what they are measuring. ...

April 15, 2026 · 5 min · Zelina