Multimodal AI

Playing Both Sides: How Multi-Agent Scripts Teach AI to Lie, Detect, and Decide

A meeting goes wrong in a familiar way. One team has the dashboard. Another has the client history. Legal has the contract clause nobody read until Friday afternoon. Sales knows what was promised, but not what can be delivered. Everyone is technically telling the truth, except when they are not, and the final decision depends on stitching together partial evidence from people with different incentives. ...

When Physics Meets Pixels: Rethinking Post-Blast Damage Assessment

Explosion response has a brutally simple bottleneck: before anyone can allocate rescue teams, close roads, prioritize inspections, or estimate losses, someone has to answer a basic question — which buildings are damaged, and how badly? That sounds like a vision problem. Take satellite images before and after the event, run a damage model, produce a map. Clean. Scalable. Very AI-demo friendly. ...

Spatial-Gym and the Illusion of Thinking: Why AI Can’t Walk Before It Runs

Agents are supposed to act. That is the promise hiding behind most enterprise AI demos: the model will not merely answer a question, but inspect a system, choose the next step, correct itself, and reach a useful outcome. The interface changes from chat box to workflow loop, and suddenly everyone starts using the word “agent” with the confidence of a person who has never watched a model get lost in a four-by-four grid. ...

Phantasia and the Illusion of Safety: When AI Lies Without Looking Wrong

Safety checks usually look for the model doing something strange. That sounds reasonable. A compromised model should produce a strange phrase, repeat a suspicious payload, ignore the image, or behave in a way that feels obviously detached from the input. This is the comforting version of AI security: attackers leave fingerprints, defenders look for fingerprints, and everyone goes home after filling out a procurement checklist. ...

Seeing the Trees, Not Just the Forest: Why Instance-Aware AI Changes Everything

A camera sees a warehouse aisle. A worker reaches for a box. A forklift passes behind him. A package shifts on the shelf. A normal vision-language model can probably describe the scene. It may say, quite reasonably, that a worker is handling inventory while a vehicle moves nearby. That is not useless. It is also not enough. ...

From Search to Synthesis: Why AI’s Next Leap Requires Structured Thinking

Spreadsheet. That is where many impressive AI research reports quietly go to die. A model can browse twenty web pages, produce a polished executive memo, cite three market reports, and still fail at the boring part: comparing numbers, checking whether a table supports a claim, generating the right chart, and then explaining what the chart actually means. The output looks like research. The mechanism underneath is closer to literary confidence with a browser tab. ...

Claw-Eval — When Agents Game the System, the System Needs Claws

The agent finished the task. That is not the same as doing the task. Inbox sorted. Calendar updated. Report generated. Customer record changed. Dashboard refreshed. For a demo, that is usually enough. The screen shows a plausible answer, the final artifact looks tidy, and everyone politely pretends the agent must have followed the correct path because the output did not immediately burst into flames. ...

From Seeing to Doing: Why Agentic AI Still Trips Over Reality

Tools do not make an agent; they make the failure more interesting. Camera. Browser. Crop tool. Search engine. Python sandbox. That sounds like the beginning of an intelligent workflow. Give a multimodal model these tools, and it should move from merely seeing the world to actually doing something with it: zoom into the blurry sign, search the extracted clue, cross-check the result, and produce the answer. ...

From Pixels to Python: Teaching AI to Fix Its Own Charts

Charts are supposed to make business communication clearer. In practice, they also create a quiet operational tax: screenshots trapped in PDFs, plots copied from old decks, dashboards whose original code has vanished, and reports where one small visual change requires an analyst to rebuild the chart by hand. That is the mundane setting behind a technically interesting paper. MM-ReCoder asks whether a multimodal model can look at a chart image, write Python code to reproduce it, execute the code, inspect the rendered result, and then fix its own mistakes. ...

Targeted Forgetting: Why AI Can’t Just ‘Unlearn’ — And What TRU Fixes

Delete is a comforting word. A user deletes an account. A marketplace removes a product. A shopper corrects a preference history because the recommendation engine has decided, with touching confidence, that one accidental click reveals a permanent love of baby strollers, golf gloves, or suspiciously ugly jackets. In a normal database, deletion sounds like a row-level operation. Remove the row, update the index, move on with life. In a trained recommender model, deletion is less tidy. The deleted data may already have shaped user embeddings, item popularity, image-text fusion layers, and ranking behavior. The row is gone, but its ghost may still be politely recommending itself. ...