Search Party in a Notebook: JUPITER Turns Data Analysis into a Tree Game

TL;DR

Why this paper matters: It shows that how you search matters more than how big your model is for multi‑step, tool‑using analytics. With a notebook‑grounded dataset (NbQA) and value‑guided search, mid‑size open models rival GPT‑4o–based agents on a leading data‑analysis benchmark.

What’s new: (1) NbQA, a large corpus of real Jupyter tasks with executable multi‑step solutions; (2) JUPITER, a planner that treats analysis as a tree search over “thought → code → output” steps, guided by a learned value model.

Why you should care (operator’s view): This blueprint turns flaky “Code Interpreter”-style sessions into repeatable playbooks—fewer dead ends, more auditable steps, and better generalization without paying for the biggest model.

The core idea: analytics as a search tree

Most LLM data‑analysis failures come from branching mistakes: choosing the wrong intermediate step, compounding errors, and wasting tool calls. JUPITER reframes the whole exercise as search over notebook states. Each node is a concrete state—accumulated thoughts, code, and execution outputs. The system expands only a few promising branches and prunes the rest using a value model trained on successful and failed trajectories. ...
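To make the search concrete, here is a minimal sketch of value‑guided tree search over notebook states. `propose`, `execute`, and `value` are stand‑ins for the LLM step generator, a sandboxed kernel, and the learned value model; the beam width, depth cap, pruning threshold, and terminal‑state convention are illustrative assumptions, not the paper’s settings:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    neg_value: float  # heapq is a min-heap, so we store the negated value estimate
    steps: list = field(compare=False, default_factory=list)  # (thought, code, output) triples

def solve(question, propose, execute, value, beam=3, max_depth=8):
    """Best-first search over notebook states, pruned by a value model.

    propose(question, steps) -> candidate (thought, code) pairs
    execute(code)            -> captured execution output
    value(question, steps)   -> estimated success probability in [0, 1]
    """
    frontier = [Node(0.0, [])]
    while frontier:
        node = heapq.heappop(frontier)  # expand the most promising state first
        if len(node.steps) >= max_depth:
            continue
        for thought, code in propose(question, node.steps)[:beam]:
            output = execute(code)
            steps = node.steps + [(thought, code, output)]
            if "FINAL ANSWER" in output:        # assumed terminal-state convention
                return steps
            v = value(question, steps)
            if v > 0.2:                         # illustrative pruning threshold
                heapq.heappush(frontier, Node(-v, steps))
    return None                                 # search exhausted without an answer
```

The point of the structure: each node carries the full notebook state, so a pruned branch wastes at most `beam` tool calls rather than an entire session.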

September 17, 2025 · 5 min · Zelina

Small Gains, Long Games: Why Tiny Accuracy Bumps Explode into Big Execution Wins

The quick take

Most debates about “diminishing returns” fixate on single‑step metrics. This paper flips the lens: if your product’s value depends on how long a model can execute without slipping, then even small per‑step gains can produce super‑linear increases in the task length a model can finish. The authors isolate execution (not planning, not knowledge) and uncover a failure mode—self‑conditioning—where models become more likely to err after seeing their own past errors. Reinforcement‑learned “thinking” models largely bypass this and stretch single‑turn execution dramatically. ...
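The super‑linear effect follows from simple compounding. Under the standard simplifying assumption that each step succeeds independently with probability p, a task of H steps completes with probability p^H, so the longest task finished at success rate s is H = ln(s) / ln(p). The sketch below works one example (note that the paper’s self‑conditioning result means real horizons for non‑thinking models degrade even faster than this independence bound suggests):

```python
import math

def horizon(p: float, s: float = 0.5) -> float:
    """Longest task length completed with probability >= s,
    assuming each step succeeds independently with probability p."""
    return math.log(s) / math.log(p)

# A sub-point bump in per-step accuracy stretches the 50%-success
# horizon roughly tenfold:
print(f"{horizon(0.99):.0f} steps")    # ~69 steps at 99% per-step accuracy
print(f"{horizon(0.999):.0f} steps")   # ~693 steps at 99.9%
```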

September 17, 2025 · 5 min · Zelina

Titles, Not Tokens: Making Job Matching Explainable with STR + KGs

The big idea

Job titles are messy: “Managing Director” and “CEO” share zero tokens yet often mean the same thing, while “Director of Sales” and “VP Marketing” are different but related. Traditional semantic textual similarity (STS) rewards look‑alikes; real hiring needs semantic textual relatedness (STR)—associations that capture hierarchy, function, and context. A recent study proposes a hybrid pipeline that pairs fine‑tuned Sentence‑BERT embeddings with a skill‑level Knowledge Graph (KG), then evaluates models by region of relatedness (low/medium/high) instead of only global averages. The punchline: this KG‑augmented approach is both more accurate where it matters (high‑STR) and explainable—it can show which skills link two titles. ...
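A minimal sketch of the hybrid idea: blend embedding similarity with KG skill overlap, and return the shared skills as the explanation. The model name, the toy skill graph, the Jaccard overlap, and the blending weight `alpha` are all illustrative assumptions, not the study’s pipeline:

```python
from sentence_transformers import SentenceTransformer, util

# Toy skill-level "knowledge graph": title -> linked skills (illustrative data).
SKILLS = {
    "Managing Director": {"strategy", "p&l ownership", "leadership"},
    "CEO":               {"strategy", "p&l ownership", "leadership", "fundraising"},
    "Director of Sales": {"sales", "forecasting", "leadership"},
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the fine-tuned SBERT

def relatedness(a: str, b: str, alpha: float = 0.6):
    """Hybrid STR score: embedding similarity blended with KG skill overlap."""
    emb = util.cos_sim(model.encode(a), model.encode(b)).item()
    sa, sb = SKILLS.get(a, set()), SKILLS.get(b, set())
    overlap = len(sa & sb) / len(sa | sb) if sa | sb else 0.0  # Jaccard on skills
    shared = sorted(sa & sb)                                   # the explanation
    return alpha * emb + (1 - alpha) * overlap, shared

score, why = relatedness("Managing Director", "CEO")
print(f"STR={score:.2f}, linked by skills: {why}")
```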

September 17, 2025 · 4 min · Zelina

Agency Check, Please: What a New Benchmark Says About LLMs That Actually Empower Users

If you only measure what’s easy, you’ll ship assistants that feel brilliant yet quietly take the steering wheel. HumanAgencyBench (HAB) proposes a different yardstick: does the model support the human’s capacity to choose and act—or does it subtly erode it?

TL;DR for product leaders

- HAB scores six behaviors tied to agency: Ask Clarifying Questions, Avoid Value Manipulation, Correct Misinformation, Defer Important Decisions, Encourage Learning, Maintain Social Boundaries.
- Across 20 frontier models, agency support is low-to-moderate overall.
- Patterns matter more than single scores: e.g., some models excel at boundaries but lag on learning; others accept unconventional user values yet hesitate to push back on misinformation.
- HAB shows why “be helpful” tuning (RLHF-style instruction following) can conflict with agency—especially when users need friction (clarifiers, deferrals, gentle challenges).

Why “agency” is the missing KPI

We applaud accuracy, reasoning, and latency. But an enterprise rollout lives or dies on trustworthy delegation. That means assistants that: ...

September 14, 2025 · 4 min · Zelina

Automate All the Things? Mind the Blind Spots

Automation is a superpower—but it’s also a blindfold. New AI “scientist” stacks promise to go from prompt → idea → code → experiments → manuscript with minimal human touch. Today’s paper shows why that convenience can quietly erode scientific integrity—and, by extension, the credibility of any product decisions built on top of it.

The punchline: the more you automate, the less you see—unless you design for visibility from day one. ...

September 14, 2025 · 4 min · Zelina

From Blobs to Blocks: Componentizing LLM Output for Real Work

TL;DR

Most LLM tools hand you a blob. Componentization treats an answer as parts—headings, paragraphs, code blocks, steps, or JSON subtrees—with stable IDs and links. You can edit, switch on/off, or regenerate any part, then recompose the final artifact. In early tests, this aligns with how teams actually work: outline first, keep the good bits, surgically fix the bad ones, and reuse components across docs. It’s a small idea with big downstream benefits for control, auditability, and collaboration. ...
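A minimal sketch of what componentized output could look like as a data structure: parts with stable IDs that can be toggled, edited, or regenerated, then recomposed. The schema below is an illustration under my own assumptions, not the paper’s format:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Component:
    kind: str                 # "heading" | "paragraph" | "code" | ...
    text: str
    enabled: bool = True
    id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])  # stable ID, assigned once

def recompose(components: list[Component]) -> str:
    """Assemble the final artifact from the enabled parts, in order."""
    return "\n\n".join(c.text for c in components if c.enabled)

doc = [
    Component("heading", "# Quarterly Report"),
    Component("paragraph", "Revenue grew 12% quarter over quarter."),
    Component("code", "print(compute_growth(q1, q2))"),
]

doc[2].enabled = False  # switch a part off instead of re-prompting the whole answer
doc[1].text = "Revenue grew 12% QoQ, driven by enterprise renewals."  # surgical edit
print(recompose(doc))
```

Because each part keeps its ID across edits, diffs and audits operate on components rather than on one opaque blob.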

September 14, 2025 · 5 min · Zelina

Guardrails Before Gas: Secure Plan‑Then‑Execute Agents for Real Work

TL;DR

Plan‑then‑Execute (P‑t‑E) agents separate strategy from action: a Planner writes a machine‑readable plan; an Executor carries it out. This simple split dramatically improves predictability, cost control, and—crucially—security. Hardened correctly (least‑privilege tools, sandboxed code, human sign‑offs), P‑t‑E becomes an enterprise‑grade pattern rather than a lab demo.

Why today’s agents need a spine, not vibes

Reactive patterns like ReAct feel nimble because they “think, act, observe, repeat.” But that short‑horizon loop is exactly what makes them fragile in production: they meander, retry the same failing step, and are easy to hijack by indirect prompt injection embedded in web pages or PDFs. P‑t‑E locks the control flow before the agent ingests untrusted data. The plan becomes an auditable artifact and the execution stage can be cheap, parallel, and tightly permissioned. ...
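A minimal sketch of the pattern: a machine‑readable plan fixed up front, executed step by step against a least‑privilege tool allowlist. The JSON schema, the `input_from` wiring, and the stub tools are assumptions for illustration; a real deployment would add sandboxing and human sign‑offs:

```python
import json

ALLOWED_TOOLS = {"search_docs", "summarize"}  # least-privilege allowlist for this plan

# The Planner emits a machine-readable plan *before* untrusted data is ingested.
plan = json.loads("""
[
  {"step": 1, "tool": "search_docs", "args": {"query": "Q3 churn drivers"}},
  {"step": 2, "tool": "summarize",   "args": {"input_from": 1}}
]
""")

def execute(plan, tools):
    results = {}
    for step in plan:
        if step["tool"] not in ALLOWED_TOOLS:   # refuse any off-plan tool call
            raise PermissionError(f"tool {step['tool']!r} not in allowlist")
        args = {k: results[v] if k == "input_from" else v
                for k, v in step["args"].items()}
        results[step["step"]] = tools[step["tool"]](**args)
    return results

# Stub tools for the sketch; real ones would run sandboxed and permissioned.
tools = {
    "search_docs": lambda query: f"3 docs about {query}",
    "summarize":   lambda input_from: f"summary of: {input_from}",
}
print(execute(plan, tools))
```

Note what injection in a retrieved document cannot do here: it can poison a step’s *data*, but it cannot add steps or tools, because the control flow was fixed before retrieval.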

September 14, 2025 · 5 min · Zelina

Repo, Meet Your Agent: Turning GitHub into a Workforce with EnvX

Why this matters: Most “AI + devtools” still treats repos as documentation you read and code you copy. EnvX flips the model: it agentizes a repository so it can understand your request, set up its own environment (deps, data, checkpoints), run tasks end‑to‑end, verify results, and even talk to other repo‑agents. That’s a step change—from “NL2Code” to “NL2Working System.”

The core shift in one line

Instead of you integrating a repo, the repo integrates itself into your workflow—and can collaborate with other repos when the task spans multiple systems. ...
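For intuition, here is a toy sketch of the repo‑agent loop described above. Every stage and name below is a hypothetical stand‑in, not EnvX’s actual interface:

```python
from dataclasses import dataclass

@dataclass
class RepoAgent:
    repo: str

    def setup(self) -> None:
        # Resolve deps, fetch data and checkpoints (stubbed in this sketch).
        print(f"[{self.repo}] environment ready")

    def run(self, request: str) -> str:
        # Translate the natural-language request into repo operations (stubbed).
        return f"[{self.repo}] completed: {request}"

    def verify(self, output: str) -> bool:
        # Check results against the repo's own tests or examples (stubbed).
        return output.startswith(f"[{self.repo}]")

agent = RepoAgent("some-org/some-repo")
agent.setup()
out = agent.run("transcribe meeting.wav to text")
assert agent.verify(out), "verification failed"
print(out)
```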

September 14, 2025 · 4 min · Zelina

Confidence, Not Confidence Tricks: Statistical Guardrails for Generative AI

Generative AI still ships answers without warranties. Edgar Dobriban’s new review, “Statistical Methods in Generative AI,” argues that classical statistics is the fastest route to reliability—especially under black‑box access. It maps four leverage points: (1) changing model behavior with guarantees, (2) quantifying uncertainty, (3) evaluating models under small data and leakage risk, and (4) intervening and experimenting to probe mechanisms.

The executive takeaway

If you manage LLM products, your reliability roadmap isn’t just RLHF and prompt magic—it’s quantiles, confidence intervals, calibration curves, and causal interventions. Wrap these around any model (open or closed) to control refusal rates, surface uncertainty that matters, and measure performance credibly when eval budgets are tight. ...
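As a taste of leverage point (3), here is a minimal sketch of putting a confidence interval around a small‑budget eval, using an exact Clopper‑Pearson interval via SciPy. The 41‑of‑50 numbers are made up for illustration:

```python
from scipy.stats import binomtest

# 41 correct out of 50 eval questions: what can we actually claim?
res = binomtest(k=41, n=50)
ci = res.proportion_ci(confidence_level=0.95, method="exact")  # Clopper-Pearson
print(f"point estimate: {41 / 50:.2f}")
print(f"95% CI: [{ci.low:.2f}, {ci.high:.2f}]")  # roughly [0.69, 0.92]
```

The sobering part is the width: on 50 samples, an “82% accurate” model is statistically compatible with anything from roughly 69% to 92%, which is exactly why small‑data evaluation needs intervals, not point scores.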

September 13, 2025 · 5 min · Zelina

Hook, Line, and Import: How RAG Lets Attackers Snare Your Code

LLM code assistants are now the default pair‑programmer. Many teams tried to make them safer by bolting on RAG—feeding official docs to keep generations on the rails. ImportSnare shows that the very doc pipeline we trusted can be weaponized to push malicious dependencies into your imports. Below, I unpack how the attack works, why it generalizes across languages, and what leaders should change this week vs. this quarter.

The core idea in one sentence

Attackers seed your doc corpus with retrieval‑friendly snippets and LLM‑friendly suggestions so that, when your assistant writes code, it confidently imports a look‑alike package (e.g., pandas_v2, matplotlib_safe) that you then dutifully install. ...
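One “this week” mitigation the attack suggests: gate generated code behind an import allowlist before anything gets installed. A minimal sketch for Python, where `KNOWN_GOOD` stands in for your lockfile or internal package index:

```python
import ast

KNOWN_GOOD = {"pandas", "numpy", "matplotlib"}  # in practice: your lockfile / internal index

def suspicious_imports(generated_code: str) -> set[str]:
    """Flag top-level imports that aren't on the approved list."""
    found = set()
    for node in ast.walk(ast.parse(generated_code)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - KNOWN_GOOD

code = "import pandas_v2 as pd\nfrom matplotlib_safe import pyplot"
print(suspicious_imports(code))  # {'pandas_v2', 'matplotlib_safe'}
```

A static gate like this is cheap to wire into CI or a pre‑commit hook, and it catches exactly the look‑alike packages the attack relies on.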

September 13, 2025 · 4 min · Zelina