LLM-Agents

When Agents Ask for Help: Teaching LLMs the Art of Expert Collaboration

A help desk ticket is rarely solved by the first sentence. Someone says, “The report is wrong.” Then comes the real work: wrong where, compared with what, after which data refresh, under which permission level, and whether “wrong” means mathematically false or merely politically inconvenient. The expert does not just hand over an answer. The expert asks questions, reconstructs context, and turns a vague failure into a useful diagnosis. ...

Gamma Rays and Toolboxes: Why Superintelligence May Be a Systems Engineering Problem

Toolboxes are not glamorous. Nobody gives a keynote about the screwdriver. Nobody writes breathless think-pieces about the socket wrench. But when a complicated system fails, the difference between “genius” and “expensive confusion” is often whether the operator had the right tool, used it at the right moment, and trusted it to do the part humans should not pretend to do mentally. ...

Agents in Lab Coats: When LLMs Try to Become Data Scientists

Spreadsheet first. Not the model. Not the agent. Not the impressive diagram with seven tiny boxes labeled “planner,” “executor,” “critic,” “memory,” “tool user,” “reflection,” and, inevitably, “orchestrator.” In most companies, data science automation begins with something less glamorous: a messy spreadsheet, a half-documented database table, a recurring report, a manager asking why last month’s number changed, and one unlucky analyst trying to remember whether “customer_id” means account, user, buyer, household, or whatever the CRM vendor believed in 2019. ...

Don’t Prompt Harder — Engineer Smarter: Inside CEDAR’s Agentic Data Scientist

Dataset. That is where many “AI data scientist” demos quietly stop being impressive. A tidy CSV, a small notebook, a polite prompt, and a model that produces a confident answer: this is enough for a video clip. It is not enough for data science. Real data science is not a single question answered by a single model response. It is a sequence of choices: load this file, inspect these columns, define this metric, split the data this way, train this baseline, handle this error, explain this plot, revise the next step. ...

From SQL Copilot to Autonomous Data Scientist: The L0–L5 Reality Check

A dashboard fails. The sales team says the numbers changed overnight. The data engineer checks the pipeline. The analyst checks the SQL. The BI vendor says its “agent” can help. The executive hears “agent” and imagines a small autonomous data scientist quietly fixing the mess before breakfast. Usually, no. Usually it is a chatbot with access to SQL, a tool wrapper with better manners, or a workflow assistant that still depends on human supervision at the awkward parts. Useful, yes. Autonomous, no. The distinction is not academic hair-splitting; it determines who owns the error when the agent rewrites a query, changes a pipeline, or confidently explains a metric built on dirty data. ...

Death by a Thousand Prompts: Why Long-Horizon Attacks Break AI Agents

Email is a boring place to start an AI security article. That is exactly why it is useful. A modern enterprise agent is not merely answering questions about email. It can search messages, summarize attachments, update calendars, create rules, contact colleagues, write to Slack, edit files, and remember what it learned for next time. In demo videos, this looks like productivity. In security reviews, it looks like a small software system that accepts natural language as both instruction and evidence. Wonderful. We have reinvented workflow automation, except now the workflow engine reads every suspicious paragraph with a helpful attitude. ...

From PDE to Pipeline: When LLMs Become Numerical Architects

Simulation has an awkward little secret: the hard part is often not writing code. It is choosing the right numerical method before the code exists. Anyone can ask an LLM to produce a solver for an advection equation, a heat equation, or a Navier–Stokes toy problem. The result may even run. That is not the same as being numerically sane. A PDE solver can be syntactically valid, computationally impressive, and mathematically ridiculous at the same time. In scientific computing, this is not a charming personality flaw. It is how bad answers acquire nice plots. ...

Consistency Is Not a Coincidence: When LLM Agents Disagree With Themselves

A support ticket arrives. The agent reads the same customer history, sees the same policy document, and has access to the same tools. On Monday, it searches for the refund rule, retrieves the correct clause, and gives a clean answer. On Tuesday, with the same input, it searches for a different phrase, retrieves a less relevant document, wanders through two extra steps, and ends with a confident answer that is only approximately useful. ...

Mind the Gap: When Clinical LLMs Learn from Their Own Mistakes

Mistakes are usually treated as waste. In clinical AI, they are treated even more nervously: logged, redacted, escalated, converted into a slide deck, and then politely buried under the next benchmark table. Understandable. Nobody wants a medical agent whose product roadmap reads like “learning through patient-adjacent embarrassment.” But the paper “Closing Reasoning Gaps in Clinical Agents with Differential Reasoning Learning” makes a useful move: it treats mistakes not as isolated failures, but as structured raw material for improving future reasoning [1]. The core idea is not that a clinical LLM should “reflect” harder, nor that we should throw more guidelines into the prompt until the context window starts whimpering. The idea is more surgical: compare the model’s reasoning with a better reference reasoning trace, locate the precise gap, convert that gap into a reusable instruction, and retrieve that instruction when a similar case appears later. ...
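The compare, locate, convert, retrieve loop described in this teaser can be sketched in a few lines. This is a toy illustration only, not the paper's implementation: the names `find_gaps` and `InstructionStore`, the exact-match step comparison, the word-overlap (Jaccard) similarity, the instruction template, and the example cases are all assumptions made up for the sketch, and the sample data is not clinical guidance.

```python
from dataclasses import dataclass, field


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real embedding."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_gaps(model_steps: list[str], reference_steps: list[str]) -> list[str]:
    """Return reference steps that the model's trace never covered (exact match, assumed)."""
    seen = {s.strip().lower() for s in model_steps}
    return [s for s in reference_steps if s.strip().lower() not in seen]


@dataclass
class InstructionStore:
    # Each entry pairs the case text with the instruction derived from a gap.
    entries: list[tuple[str, str]] = field(default_factory=list)

    def add(self, case: str, gap: str) -> None:
        # Hypothetical template for turning a gap into a reusable instruction.
        self.entries.append((case, f"Remember to: {gap}"))

    def retrieve(self, case: str, k: int = 1, min_sim: float = 0.2) -> list[str]:
        """Return up to k instructions from sufficiently similar past cases."""
        query = tokens(case)
        scored = [(jaccard(query, tokens(c)), instr) for c, instr in self.entries]
        scored = [p for p in scored if p[0] >= min_sim]
        scored.sort(key=lambda p: p[0], reverse=True)
        return [instr for _, instr in scored[:k]]


# Toy usage with invented data.
store = InstructionStore()
past_case = "elderly patient with chest pain and shortness of breath"
model_trace = ["order ECG", "give aspirin"]
reference_trace = ["order ECG", "check troponin", "give aspirin"]

for gap in find_gaps(model_trace, reference_trace):
    store.add(past_case, gap)

hints = store.retrieve("patient with chest pain after exertion")
unrelated = store.retrieve("child with rash")
```

In this sketch the stored instruction is returned for the similar case and nothing is returned for the unrelated one. A real system would presumably replace the word-overlap score with a learned similarity and the exact-match comparison with something more tolerant of paraphrase.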

From Features to Actions: Why Agentic AI Needs a New Explainability Playbook

A customer-service agent rebooks a flight, checks a policy, calls an API, updates the passenger record, apologizes politely, and still gets the outcome wrong. The old explainability question would be: which input tokens influenced the final answer? That question is not useless. It is just late to the crime scene. When an AI system only predicts, explanation can focus on a single input-output decision. When an AI system acts, explanation has to follow the behavior across time: the state it maintained, the tool it selected, the observations it received, the recovery move it attempted, and the point where the run quietly became unrecoverable. A nice feature-importance chart does not tell you that. It tells you what mattered to a prediction, not how a workflow failed. ...