Signal Over Noise: Why Multimodal RL Needs to Know What to Ignore
Opening — Why this matters now

Multimodal models have become the new default. Text, audio, video—feed it all in and let the transformer figure it out. The assumption is elegant: more signals, more intelligence. Reality is less polite. In production systems, signals are often missing, delayed, degraded, or irrelevant. Yet most RL post-training pipelines treat multimodal trajectories as if they were drawn from a single, homogeneous distribution. Every rollout is mixed together. Every reward is normalized together. Every gradient update assumes the model needed all modalities. ...
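To make the "every reward is normalized together" failure concrete, here is a minimal sketch. It assumes a simplified setup (synthetic Gaussian rewards, two hypothetical trajectory groups, z-score advantage normalization); none of these names or numbers come from a real pipeline. It shows how pooling rewards across modality mixes with different baseline scores makes the group gap, not within-group quality, dominate the advantages.

```python
import random
import statistics

random.seed(0)

# Hypothetical rollout rewards from two modality mixes with different
# baseline scales: text-only trajectories happen to score higher on
# average than audio+video ones. (Synthetic data for illustration.)
rewards = {
    "text_only": [random.gauss(0.8, 0.1) for _ in range(64)],
    "audio_video": [random.gauss(0.3, 0.1) for _ in range(64)],
}

def zscore(xs):
    """Standard advantage normalization: (x - mean) / std."""
    mu, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]

# Pooled normalization: one mean/std across all trajectories. The gap
# between modality mixes dominates, so audio+video rollouts get strongly
# negative advantages regardless of how good they were within their group.
pooled = zscore(rewards["text_only"] + rewards["audio_video"])
av_pooled = pooled[64:]  # the audio+video slice
print(f"audio+video mean advantage, pooled norm: "
      f"{statistics.mean(av_pooled):.2f}")  # well below zero

# Per-group normalization: compare each trajectory only against rollouts
# with the same modality mix, so the gradient signal reflects within-group
# quality rather than which modalities happened to be present.
per_group = {k: zscore(v) for k, v in rewards.items()}
print(f"audio+video mean advantage, per-group norm: "
      f"{statistics.mean(per_group['audio_video']):.2f}")
```

Under pooled normalization the lower-scoring modality mix is pushed down wholesale; per-group normalization recenters each mix at zero, which is one simple way a pipeline could stop treating heterogeneous trajectories as one distribution.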