AI Agents

From Prompt Engineering to Context Engineering: Why Typed Graphs Beat Chatty Agents in the Lab

A lab workflow is a terrible place to discover that your AI agent has been “remembering” chemistry as a conversation. That sounds unkind. It is also the point. In a casual chatbot, losing track of context means an awkward answer. In computational chemistry, losing track of context can mean a wrong molecular geometry, a missing imaginary-frequency check, an invalid charge or multiplicity, or a pKa estimate that looks numerically confident while being scientifically useless. The model did not necessarily become stupid. The workflow around it treated state as text. ...

Peak Performance: Why Alignment Needs a Sense of Timing

A support ticket does not usually fail because every message was bad. More often, it fails because one reply arrived at exactly the wrong moment: the bot misunderstood a frustrated customer, repeated a stale answer, missed the escalation point, and then ended the interaction with something sterile enough to pass a benchmark but useless enough to make the customer leave. The average quality may look acceptable. The experience still feels broken. ...

Agents That Hire Themselves: Why OpenSage Signals the End of Hand-Crafted AI Workflows

Workflow diagrams age badly. A process that looked clean in January usually becomes a small archaeological site by March: one more exception, one more conditional branch, one more “temporary” manual approval that survives longer than the intern who added it. This is how many AI-agent projects quietly become ordinary software projects with a chatbot sitting on top, smiling politely while humans keep repairing the plumbing. ...

Lost in the Links: When World Knowledge Isn’t Enough

Links look harmless. One click from one Wikipedia page to another. Then another. Then another. No robotics. No messy browser UI. No customer database. No procurement workflow with three inconsistent Excel files and one person named Mike who “usually knows where that form is.” Just hyperlinks. That is why LLM-WikiRace is useful. It strips agentic AI down to a small, irritating question: when a model knows a lot about the world, can it use that knowledge step by step without getting lost? ...

Mind the Drift: Why Stateful AI Guardrails Beat Bigger Models

A chatbot rarely fails in one clean dramatic explosion. More often, it is nudged. First, the user asks for a harmless explanation. Then a role-play frame. Then a historical analogy. Then a translation. Then a “purely fictional” operational detail. By the time the final request arrives, the model has already been walked across the room. The last prompt is not the attack. It is the receipt. ...

The Reliability Gap: Why Smarter AI Agents Still Fail When It Matters

A customer service agent gets the refund policy right on Monday, wrong on Tuesday, and confidently wrong on Wednesday. A coding agent passes the benchmark, then casually rewrites the wrong file in production. A workflow agent behaves perfectly in a demo, then becomes confused when the API returns the same fields in a different order. ...

When the Muse Has a GPU: Teaching a Machine to Write Poetry

Poetry is a useful place to test the limits of AI, partly because the task is so easy to misunderstand. A bad poem can be fluent. A decent poem can be vague. A machine can produce both before breakfast, along with a motivational LinkedIn post and three flavors of executive summary. That is not the interesting part. ...

Hunt Globally, Miss Nothing: Why Tree-Based AI Agents Beat ‘Run-It-Longer’ Research

Deals are not usually lost because nobody wrote a beautiful market summary. They are lost because the right asset sat in a regional announcement, under a local-language alias, attached to a company page, trial registry, conference PDF, or corporate filing that nobody searched properly. Then, six months later, the same asset appears in a large-pharma partnership press release, and everyone acts surprised. The surprise is often very well-formatted. That does not make it useful. ...

It Takes Two to Think: Why AI’s Future May Be Social Before It’s Smart

Conversation is usually treated as the interface layer of AI. The user asks. The model answers. The chatbot smiles politely, perhaps too politely, and everyone pretends that a slightly longer prompt is the same thing as a better thinking system. This is convenient, measurable, and occasionally profitable. It is also probably too shallow. ...

Potential Energy: What Chain-of-Thought Is Really Doing Inside Your LLM

The familiar ritual: ask it to think longer

When an LLM gives a weak answer, the standard reflex is now almost ceremonial: ask it to think step by step. The model writes more. The answer often improves. The benchmark number rises. Everyone feels temporarily reassured. This habit has become so normal that many teams treat chain-of-thought as if it were a small reasoning engine bolted onto the model: more intermediate steps, more deliberate thought, more correctness. A comforting story. Also, like many comforting stories in AI, not quite what the evidence says. ...