AI Automation

The Reward Is in the Room: Why AI Automation Needs Better Judgment, Not Just Bigger Models

Opening: Why this matters now. AI adoption has entered its second, less glamorous phase. The first phase was easy to explain: make the model generate things. Emails, reports, code, dashboards, summaries, customer replies, compliance drafts, market notes, training content. Give the machine a prompt, admire the fluent output, and pretend the future has arrived because the paragraphs are well-spaced. ...

Synthesize, but Verify: The Data Flywheel Behind Useful AI Automation

Opening: Why this matters now. The easiest AI demo in the world is a model producing something plausible. A product description. A support reply. A defect image. A peer-review report. A compliance explanation. A benchmark answer. The output looks competent enough to be shown in a slide deck, which is often where corporate AI strategy goes to enjoy a short but well-lit life. ...

Graph Expectations: Why Context Compression Needs Structure, Not Just Similarity

Opening: Why this matters now. The AI industry has developed a charmingly expensive habit: when models struggle with long documents, we buy them larger windows and pretend the problem has been solved. It has not. Long-context LLMs are useful, but longer context is not the same as better context. A model can accept a very large input and still miss the crucial paragraph buried in the middle, over-attend to duplicated evidence, or lose the argumentative spine of a document. The result is familiar to anyone building AI tools for legal review, finance research, policy analysis, procurement, consulting, compliance, or enterprise knowledge work: the model has “read” everything, yet somehow understands the wrong thing. Very modern. Very expensive. ...

Two Million Agents Walk Into a Forum, Nobody Builds a Mind

Opening: Why this matters now. The AI industry has a small addiction to the word agent. Add another agent, then another, then a few hundred more, and the slide deck begins to smell faintly of civilization. Somewhere between “workflow automation” and “digital society,” we are invited to believe that scale itself becomes intelligence. ...

Evolve or Die Trying: When LLMs Stop Writing Code and Start Designing Algorithms

A developer asks an LLM to “write a better algorithm.” The LLM obliges. It writes code. The code runs, perhaps after a few rounds of apologetic debugging. The result is slightly better than the baseline, or at least sufficiently mysterious to be called “novel.” Everyone nods politely. Another benchmark table is born. ...

Seeing Charts Like a Quant: When RL Teaches Vision Models to Actually Reason

Charts look harmless. A bar chart sits in a dashboard, a line chart appears in a quarterly report, a scatter plot claims there is a relationship, and everyone pretends the machine only needs to “read the image.” This is the polite fiction behind a large share of enterprise AI demos. In practice, chart understanding is not OCR with prettier fonts. A model has to identify the marks, map colors to legends, recover values, decide which numbers matter, perform arithmetic, interpret trends, and then answer the actual question rather than the easier question it secretly substituted. That last step is where many systems go from impressive to quietly expensive. ...

From Pixels to Python: Teaching AI to Fix Its Own Charts

Charts are supposed to make business communication clearer. In practice, they also create a quiet operational tax: screenshots trapped in PDFs, plots copied from old decks, dashboards whose original code has vanished, and reports where one small visual change requires an analyst to rebuild the chart by hand. That is the mundane setting behind a technically interesting paper. MM-ReCoder asks whether a multimodal model can look at a chart image, write Python code to reproduce it, execute the code, inspect the rendered result, and then fix its own mistakes.[1] ...

Metric Freedom: When Your AI Gets Smarter by Doing Less

AI teams like committees. Not human committees, of course. Those are unfashionable. We now prefer committees made of agents: one agent plans, one verifies, one critiques, one searches, one writes code, one supervises the others, and somewhere in the corner a “coordinator” burns tokens making everyone feel aligned. This architecture is not stupid. Multi-agent systems solve real problems: they divide labor, preserve specialized expertise, and make complicated workflows easier to inspect. But they also bring the usual committee tax: coordination overhead, fragmented context, brittle phase ordering, and the faint smell of process worship. ...

From YouTube to Execution: How GUIDE Teaches AI Agents to Actually Use Software

Tutorials are where software knowledge goes to become useful, messy, and mildly unbearable. A human trying to learn GIMP, LibreOffice Calc, Thunderbird, or VS Code can survive this mess. We search YouTube, skim a video, ignore the creator’s life story, watch the cursor, and remember that the menu item we need is not where our intuition said it would be. A GUI agent, even a strong vision-language model, has a harder time. It may see the screen. It may understand the instruction. It may even know the general category of action. Then it clicks the wrong menu because the software has its own local customs. Software, regrettably, has culture. ...

Autoresearch²: When AI Starts Debugging Its Own Brain

Search is where many AI systems become embarrassingly human. They try one move. It fails. They try a nearby move. It fails. Then, with the serene confidence of a spreadsheet macro wearing a lab coat, they try the first move again. That is the real problem behind many “autonomous research” demonstrations. The issue is not always that the model cannot propose useful ideas. It is that the loop around the model is fixed: propose a change, run an experiment, evaluate the result, keep or discard. Once this loop gets stuck, the system often has no way to ask the more important question: is my search process itself badly designed? ...
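The fixed loop described above (propose a change, run an experiment, evaluate the result, keep or discard) can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of any system mentioned here: the names `fixed_loop`, `score`, and `propose` are hypothetical, and the one-dimensional scoring landscape is a toy stand-in for a real experiment.

```python
import random


def score(x):
    # Hypothetical landscape: a small peak at x=3 and a higher peak at x=12.
    return -abs(x - 3) if x < 8 else 10 - abs(x - 12)


def propose(current, rng):
    # Hypothetical proposal rule: only ever try an immediate neighbor.
    return current + rng.choice([-1, 1])


def fixed_loop(score, propose, start, steps=50, seed=0):
    """Propose, evaluate, keep or discard. The loop itself never changes."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    tried = []
    for _ in range(steps):
        candidate = propose(best, rng)
        tried.append(candidate)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            # Keep the change only if it strictly improves the result.
            best, best_score = candidate, candidate_score
        # Otherwise discard it; nothing records that this move already failed.
    repeated = len(tried) - len(set(tried))
    return best, best_score, repeated


best, best_score, repeated = fixed_loop(score, propose, start=0)
print(f"best={best} score={best_score} repeated_proposals={repeated}")
```

In this toy setup the loop cannot get past the small peak at x=3, because both neighbors score worse, so it keeps re-proposing moves it has already rejected and never reaches the higher peak at x=12. Nothing in the loop can notice that pattern, which is the gap the teaser points at: the search process has no way to inspect itself.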