Observability

One Pass to Forecast Them All: Toto 2.0 and the Scaling Recipe for Time-Series AI

Forecasting is where machine learning often learns humility. A language model can sound clever while being wrong. A forecasting model has fewer hiding places. Revenue arrives or it does not. CPU saturation happens or it does not. Demand spikes, latency drifts, inventories rot, turbines fail, and the spreadsheet smiles politely before punishing everyone involved. This is why time-series foundation models have been treated with a particular kind of suspicion: useful, interesting, sometimes impressive, but not yet comfortably scalable in the way large language models became scalable. ...

Meerkat or Mirage? When AI Safety Fails in Plain Sight (Across Traces)

A leaderboard can look clean until someone reads the logs. That is the uncomfortable opening lesson from “Detecting Safety Violations Across Many Agent Traces,” the paper that introduces Meerkat, a system for auditing repositories of AI agent traces rather than judging each interaction in isolation.[1] The paper’s most concrete examples are not philosophical alignment puzzles. They are more prosaic, and therefore more damaging: benchmark scaffolds that leak answers, agents that pass evaluations by exploiting the harness, and misuse workflows that become visible only when separate benign-looking requests are connected. ...

When Systems Bleed: Teaching Distributed AI to Heal Itself

Outages rarely arrive with the courtesy of a diagnosis. A service slows down. A node stops answering. A queue grows teeth. Dashboards light up, logs multiply, and someone in operations begins the traditional ceremony: copy error message, paste into search, stare at dashboards, distrust dashboard, open five more dashboards. The system is not merely broken. It is bleeding context. ...

Who Really Runs the Workflow? Ranking Agent Influence in Multi-Agent AI Systems

A workflow chart is comforting. It gives everyone boxes, arrows, and the illusion that power follows geometry. In a multi-agent AI system, that illusion fails rather quickly. The agent in the middle of the diagram may not be the one shaping the final answer. The orchestrator may look important because everything passes through it, but another specialist agent may quietly determine the substance. A router may touch only one decision and still decide the entire path. A late-stage formatter may appear humble and yet rewrite the output enough to matter. The org chart lied. Naturally, the workflow diagram learned from management. ...

Options = Power: Turning Empowerment into a KPI for AI Agents

Login. That is where many agent evaluations become strangely unserious. A benchmark asks whether the agent completed a task. A dashboard records whether the browser session ended successfully. A monitoring system checks whether the tool call returned an error. Then the agent enters valid credentials and suddenly gains access to a much larger part of the environment. ...

Provenance, Not Prompts: How LLM Agents Turn Workflow Exhaust into Real-Time Intelligence

Logs are where teams go after the dashboard has already failed. A pipeline stalls. A model run produces nonsense. A compute job quietly burns budget on the wrong node. Someone opens three dashboards, two notebooks, and one ancient SQL snippet named final_debug_v3_really_final.sql. Then the archaeology begins. The paper “LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology” proposes a more interesting answer: do not ask an LLM to “understand the workflow” in the abstract. Give it live provenance metadata, a compact schema, query guidelines, and tools that execute structured queries on its behalf.[1] In other words, stop treating the model as a psychic dashboard. Treat it as a controlled interface to workflow exhaust. ...

Ping, Probe, Prompt: Teaching AI to Troubleshoot Networks Like a Pro

TL;DR for operators: A network outage is not a single question. It is a sequence: probe reachability, inspect counters, compare paths, refine the hypothesis, ask for better telemetry, and decide whether to act. That sequence is exactly where static LLM benchmarks become rather ornamental. A model that can answer a configuration question offline is not necessarily an agent that can diagnose a live fault while the network keeps misbehaving. ...