CAR-bench: When Agents Don’t Know What They Don’t Know

A car assistant sounds simple until it touches the car. “Turn on the fan.” “Open the sunroof.” “Change my destination to Barcelona.” “Send an email before I arrive.” None of these requests looks philosophically difficult. They are not graduate-level math problems. They do not require poetic reasoning, legal interpretation, or a 128k-token context window stuffed with PDFs. They require the assistant to do something much less glamorous: check the state of the world, follow a few policies, use the right tools, and avoid pretending when something is missing. ...

January 30, 2026 · 17 min · Zelina

Optimizing Agentic Workflows: When Agents Learn to Stop Thinking So Much

The most expensive sentence in agentic AI is “Let me think.” Every enterprise agent has a little theatre inside it. A user asks for something routine: find a customer record, check a document, submit a form, update a profile, send a message. The agent pauses, reasons, chooses a tool, receives an observation, reasons again, chooses another tool, receives another observation, and continues until the task is finished or the budget is quietly set on fire. ...

January 30, 2026 · 16 min · Zelina

When Interfaces Guess Back: Implicit Intent Is the New GUI Bottleneck

The problem starts with a very ordinary sentence: “Order my usual lunch.” For a human assistant, this sentence is not empty. It carries history. It points to an app, a restaurant, a branch, a meal, maybe a delivery address, maybe a payment method. For a conventional GUI agent, it is a trap wearing casual clothes. ...

January 15, 2026 · 15 min · Zelina

Click, Fail, Learn: Why BEPA Might Be the First GUI Agent That Actually Improves

Clicking is easy. Clicking correctly, after the screen has changed, after a pop-up appears, after the previous attempt failed, and after the agent has only fifteen steps before the evaluator gives up: that is where GUI automation stops looking like a demo and starts looking like work. This is the problem behind BEPA, short for Bi-Level Expert-to-Policy Assimilation, introduced in the arXiv paper “From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation.” The paper is about training end-to-end GUI agents, but its practical message is broader: expert workflows are not automatically useful training data. They have to be translated into something the learner can actually perform. ...

January 12, 2026 · 18 min · Zelina

When Your Agent Knows It’s Lying: Detecting Tool-Calling Hallucinations from the Inside

The expensive part of an AI agent making things up is not always the sentence it writes. Sometimes it is the API call it sends. A chatbot can hallucinate a policy clause and embarrass itself. An agent can hallucinate a function call and move money, query the wrong data, calculate the wrong dose, bypass an audit trail, or quietly pretend it used a tool when it actually guessed. That is a different species of failure. The output may still look tidy. The JSON may still parse. The function name may even exist. The problem is that the agent has selected the wrong action in a system that treats actions as real. ...

January 9, 2026 · 15 min · Zelina

Agents Gone Rogue: Why Multi-Agent AI Quietly Falls Apart

A workflow looks stable on Monday. The planner assigns tasks. The research agent gathers evidence. The calculator checks numbers. The compliance agent says no to the obviously bad idea, which is rude but useful. The whole multi-agent system feels less like a chatbot and more like a small digital department with unusually poor lunch habits. ...

January 8, 2026 · 17 min · Zelina

When the Chain Watches the Brain: Governing Agentic AI Before It Acts

Approval is boring. That is why most automation diagrams hide it. A user request arrives, a sensor emits a signal, an AI agent reasons through the situation, a tool call fires, and something in the real world changes. A stock level is replenished. A traffic light is adjusted. A healthcare alert is escalated. In the clean version of the diagram, the agent looks wonderfully autonomous. In the operational version, someone eventually asks the unpleasant question: who allowed this thing to act? ...

December 28, 2025 · 19 min · Zelina

Breaking Rules, Not Systems: How Penalties Make Autonomous Agents Behave

Emergency is a terrible product requirement. It sounds simple in a meeting: “The agent should follow policy, except when the situation is urgent.” Wonderful. Very human. Also almost useless. A delivery robot should not enter a restricted zone. Unless the package is critical medicine. A warehouse agent should not skip safety checks. Unless a fire alarm requires rerouting. A self-driving system should obey traffic norms. Unless an emergency trip makes delay costly. But “unless urgent” does not tell the agent which rule can bend, which rule must hold, and which shortcut turns the system from flexible into reckless. ...

December 4, 2025 · 15 min · Zelina

Cutting Through the Noise: How Programmatic Pruning Turns Web Agents into Real Operators

Clicking the right button should not be an intelligence test. For humans, a webpage is usually manageable. We scan the visible screen, ignore the footer, dismiss the newsletter trap, and find the search box without treating every hidden <div> as a philosophical object. Web agents are less lucky. They see a modern page as a swollen mixture of visible text, invisible attributes, nested containers, event handlers, accessibility metadata, layout debris, cookie banners, product cards, promotional links, and enough frontend residue to make “just use the DOM” sound like a mild punishment. ...

November 28, 2025 · 18 min · Zelina

Tentacles of Thought: Why Six Is the New One in Multimodal AI

Maps are easy until someone asks the system to reason over them. A person looking at a maze does not merely “see” it. They clean up the visual clutter, identify obstacles, locate the start and goal, infer the grid structure, compute a path, and then translate that path into actions. Some of this is perception. Some is spatial reasoning. Some is symbolic logic. Some is visual transformation. The sequence matters. The order matters. And no, asking one large multimodal model to “think carefully” is not quite the same thing, however confidently the demo smiles. ...

November 21, 2025 · 13 min · Zelina