AI Agents

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

A model can generate a pretty sequence of images. Good. So can a slide deck. The harder question is whether those images actually help it think. That is the uncomfortable point behind “MentisOculi: Revealing the Limits of Reasoning with Mental Imagery,” a new benchmark paper that tests whether frontier multimodal models can do something closer to human mental imagery: form a visual state, keep it stable, transform it step by step, and use the transformed state to decide what to do next.[1] Not merely “look at an image and answer a question.” Not “draw a plausible intermediate picture.” Actual visual reasoning, with consequences. ...

When LLMs Meet Time: Why Time-Series Reasoning Is Still Hard

Dashboard numbers are seductive because they look obedient. Revenue goes up, traffic dips, latency spikes, inventory turns over, temperature drifts, volatility clusters. Put the sequence into a chart and the pattern seems almost polite. Then someone asks an LLM what happened. The model answers fluently. It may even sound like an analyst who has seen too many quarterly review decks and has developed a protective layer of confidence. But fluency is not temporal understanding. A model can describe a curve, name a trend, and still fail to understand which segment comes next, whether a transformation is correct, or whether a discontinuity is an error or a legitimate feature of the process. ...

FadeMem: When AI Learns to Forget on Purpose

Memory is easy to sell. Give an AI agent a bigger context window. Add a vector database. Store every user preference, meeting note, support ticket, and half-correct instruction that ever passed through the system. Then call it “persistent memory,” because apparently a drawer full of old receipts is now intelligence. The problem is that agents do not fail only because they forget. They also fail because they remember too much, too flatly, and too obediently. Old facts compete with new ones. Repeated but trivial details crowd out rare but important constraints. Retrieval brings back something semantically similar but temporally wrong. The agent sounds confident because the database found something. Very helpful. Very dangerous. ...

When Empathy Needs a Map: Benchmarking Tool‑Augmented Emotional Support

Empathy is easy to fake for one sentence. A chatbot can say “that sounds exhausting” without knowing anything about you, your situation, your city, your time zone, or whether the advice it is about to give is physically possible. That is the awkward part of emotional support AI: the tone can be soft while the facts are made of air. A very caring assistant can still recommend a midnight walk at 3 p.m., suggest a closed café, or confidently invent local details because it wants to be helpful. The kindness is real enough in style. The grounding is not. ...

MemCtrl: Teaching Small Models What Not to Remember

A robot assistant walks through a room. It sees a chair from the front. Then from the side. Then from a slightly worse angle. Then the same chair again, because the camera moved while the robot hesitated. In theory, all of this is “context.” In practice, it is mostly noise wearing a productivity badge. ...

Sequential Beats Parallel: When Deep Research Agents Learn to Reflect

A research request usually begins with a deceptively harmless sentence: “Can you give me the full picture?” Then comes the usual enterprise ritual. Someone breaks the topic into pieces. One person checks competitors. Another checks regulation. Another reads technical reports. Another searches recent news. Everyone works quickly. Everyone returns with fragments. Then one unlucky analyst stitches the fragments into a report and pretends the seams are a design choice. ...

CAR-bench: When Agents Don’t Know What They Don’t Know

A car assistant sounds simple until it touches the car. “Turn on the fan.” “Open the sunroof.” “Change my destination to Barcelona.” “Send an email before I arrive.” None of these requests looks philosophically difficult. They are not graduate-level math problems. They do not require poetic reasoning, legal interpretation, or a 128k-token context window stuffed with PDFs. They require the assistant to do something much less glamorous: check the state of the world, follow a few policies, use the right tools, and avoid pretending when something is missing. ...

Optimizing Agentic Workflows: When Agents Learn to Stop Thinking So Much

The most expensive sentence in agentic AI is “Let me think.” Every enterprise agent has a little theatre inside it. A user asks for something routine: find a customer record, check a document, submit a form, update a profile, send a message. The agent pauses, reasons, chooses a tool, receives an observation, reasons again, chooses another tool, receives another observation, and continues until the task is finished or the budget is quietly set on fire. ...

When Rewards Learn to Think: Teaching Agents How They’re Wrong

An agent fails a task. It searched the web twice, opened the wrong page, trusted a noisy snippet, wrote a plausible final answer, and lost the point. Traditional reinforcement learning sees one thing: wrong. That is brutally clean, and also rather unhelpful. The agent may have performed three useful steps before collapsing at the fourth. Or it may have wandered confidently through nonsense from the beginning. Sparse final-answer rewards flatten these cases into the same training signal. The scoreboard says “0.” Very educational, in the same way a fire alarm teaches architecture. ...

World Models Meet the Office From Hell

Office software has a special talent: it says “success” at the exact moment something has gone wrong somewhere else. A ticket is updated. A role is assigned. An asset is transferred. The API returns a cheerful confirmation. The agent, bless its silicon heart, declares victory. Then a background workflow fires. A user’s clearance changes. Another workflow reacts to that clearance change. A different record is silently updated. A constraint is now violated. The agent does not notice, because the agent saw the office equivalent of a green checkmark and mistook it for reality. ...