Opening — Why This Matters Now

Everyone wants an “AI data scientist.” Few are prepared for what that actually entails.

Over the past two years, LLMs have been upgraded from chatty copilots to so-called agentic systems capable of reading files, writing code, training models, and producing forecasts. In theory, they can autonomously execute end-to-end machine learning workflows. In practice, they frequently forget to pass a filename to a tool call.

The paper “DARE-Bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science” (ICLR 2026) offers something the industry urgently needs: a reality check.

Not another leaderboard. Not another “LLM solves Kaggle” headline. But a benchmark designed to answer a harder question:

Can LLM agents follow instructions faithfully and build correct machine learning pipelines under real-world constraints?

The short answer? Not yet.


Background — The Missing Piece in AI Benchmarking

Most existing benchmarks test one of two things:

  1. Final answer correctness (e.g., coding benchmarks like HumanEval)
  2. Open-ended reasoning ability (math or planning tasks)

Data science, however, is neither a short function nor a philosophical essay.

It is:

  • Multi-step
  • Tool-augmented
  • Process-sensitive
  • Vulnerable to subtle implementation errors
  • Dependent on reproducibility

The DARE-bench authors identify two structural gaps in prior benchmarks:

| Gap | Why It Matters in Practice |
|---|---|
| No process-aware evaluation | Agents can produce correct outputs for the wrong reasons |
| Scarcity of verifiable training data | Hard to fine-tune for reproducible ML workflows |

DARE-bench addresses both.

It introduces 6,300 Kaggle-derived tasks split across classification, regression, and time-series forecasting, with two major evaluation modes:

  • IF (Instruction Following) — Strictly reproduce a specified ML workflow
  • MM (ML Modeling) — Optimize predictive performance under constraints

This dual framing is subtle but important.

One tests obedience. The other tests competence.


What the Paper Actually Does — Architecture of a Verifiable Benchmark

DARE-bench is built on a key insight:

Data science workflows are reproducible if randomness and execution environments are controlled.

The benchmark enforces:

  • Fixed random seeds
  • Controlled sandbox execution
  • Explicit file inputs/outputs
  • Runtime limits
  • Turn limits

This allows evaluation to be fully programmatic.
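These controls can be sketched in a few lines. The sandboxing below is deliberately simplified (a subprocess with a timeout standing in for the paper's isolated sandbox), and `script_path` is a hypothetical argument, not the benchmark's actual interface:

```python
import os
import subprocess
import sys

def run_in_sandbox(script_path: str, timeout_s: int = 200, seed: int = 42):
    """Run an agent-written script with a fixed seed and a runtime limit.

    Simplified sketch: the real benchmark uses a controlled sandbox with
    explicit file inputs/outputs; here a subprocess timeout stands in.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))  # deterministic hashing
    try:
        result = subprocess.run(
            [sys.executable, script_path],
            env=env,
            timeout=timeout_s,            # enforce the runtime limit
            capture_output=True,
            text=True,
        )
        return result.returncode == 0, result.stdout
    except subprocess.TimeoutExpired:
        return False, "execution limit exceeded"
```

A run that overshoots the budget surfaces as a failure rather than hanging the evaluator, which is exactly what makes the scoring programmatic.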

For instruction-following tasks:

  • A reference solution generates deterministic predictions.
  • The agent must produce exactly the same output.
  • Score = 1 if match, 0 otherwise.
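Concretely, the all-or-nothing check reduces to comparing the agent's output file against the reference. A byte-level comparison is one simple way to enforce exact reproduction; the file-path interface here is an assumption, not the paper's actual harness:

```python
def instruction_following_score(agent_path: str, reference_path: str) -> int:
    """Binary IF score: 1 only if the agent's output file exactly matches
    the reference solution's deterministic output, else 0.

    Sketch of the all-or-nothing check; a byte comparison is the strictest
    possible reading of "exactly the same output".
    """
    with open(agent_path, "rb") as f_agent, open(reference_path, "rb") as f_ref:
        return int(f_agent.read() == f_ref.read())  # exact match or nothing
```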

For modeling tasks:

  • Predictions are compared to ground truth.
  • Metrics: Macro-F1 (classification) and clipped $R^2$ (regression/time-series).

Clipped regression metric:

$$ R^2_{\text{clipped}} = \max(R^2, 0) $$

This ensures all scores lie in $[0,1]$, making comparison consistent.
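The clipped metric is nearly a one-liner. Below is a numpy sketch of the standard $R^2$ definition with the clip applied; edge cases such as multi-output targets are not covered and the paper's exact implementation may differ:

```python
import numpy as np

def clipped_r2(y_true, y_pred) -> float:
    """Clipped R^2 = max(R^2, 0), so regression scores lie in [0, 1]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return max(1.0 - ss_res / ss_tot, 0.0)           # clip at zero
```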

This design removes subjective judging and enables reinforcement learning with verifiable rewards — a major operational advantage.


Evaluation — How Bad Is It Really?

Under a balanced configuration (5 turns, 200s sandbox limit), performance results are sobering.

Selected Results (Test Set)

| Model | Class-IF | Class-MM | Reg-IF | Reg-MM | Time-XF | Time-CF |
|---|---|---|---|---|---|---|
| GPT-4o | 32.88 | 40.45 | 20.28 | 40.60 | 35.54 | 4.77 |
| GPT-5 | 69.81 | 43.40 | 57.24 | 56.29 | 36.83 | 10.13 |
| Claude-3.7 | 61.48 | 61.03 | 46.37 | 63.20 | 49.88 | 13.70 |
| Qwen3-32B | 17.11 | 30.71 | 15.21 | 35.86 | 26.96 | 0.00 |

Several observations emerge:

  1. Instruction-following is fragile. Even frontier models frequently fail strict reproducibility tests.
  2. Time-series forecasting remains a major weakness. Scores in canonical forecasting are near-zero for many models.
  3. Open-source models collapse under tool constraints. Execution limits and API misuse dominate failure modes.

Failure analysis (Table 9 in the paper) identifies four dominant categories:

| Failure Type | What It Reveals |
|---|---|
| Instruction non-adherence | Weak procedural discipline |
| Code errors | Poor pipeline robustness |
| Execution limit exceeded | Inefficient exploration |
| Token limit exceeded | Over-decomposition |

In short: LLMs are competent coders, but unreliable ML engineers.


The Real Contribution — Training, Not Just Testing

The benchmark’s most strategic contribution is not evaluation. It is trainability.

Using DARE-bench training data:

Supervised Fine-Tuning (Qwen3-32B)

  • Total score improved by ~1.83×
  • Significant gains in both IF and MM tasks

Reinforcement Learning (Qwen3-4B)

  • Total score improved from 4.39 → 37.40
  • Code errors reduced by nearly half

This matters for one reason:

It proves that failure modes are learnable — not fundamental capability ceilings.

The ablation study is particularly telling:

| Training Data | IF Performance | MM Performance |
|---|---|---|
| IF only | ↑ | ↓ |
| MM only | ↓ | ↑ |
| IF + MM | ↑ (balanced) | ↑ (balanced) |

Process fidelity and modeling competence are complementary skills. You cannot train one and expect the other to emerge automatically.

This has direct implications for enterprise AI strategy.


Implications — What This Means for Business AI

1. “Agentic” Does Not Mean Reliable

If your AI agent is deployed in analytics pipelines, finance dashboards, or forecasting systems, you are not evaluating it correctly unless you test:

  • Instruction adherence
  • Reproducibility
  • Execution efficiency
  • Output alignment under constraints

Final accuracy is not enough.
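What does testing beyond final accuracy look like in practice? One lightweight pattern is to run the same agent twice under a fixed seed and check both determinism and runtime. The `run_agent(task, seed)` callable below is a hypothetical stand-in for whatever agent interface you deploy:

```python
import time

def evaluate_agent(run_agent, task, seed: int = 42, time_budget_s: int = 200):
    """Check an agent along several axes, not just final accuracy.

    `run_agent(task, seed)` is a hypothetical callable returning predictions.
    """
    t0 = time.monotonic()
    out1 = run_agent(task, seed)
    elapsed = time.monotonic() - t0
    out2 = run_agent(task, seed)  # second run, same seed
    return {
        "reproducible": out1 == out2,               # same seed, same output?
        "within_budget": elapsed <= time_budget_s,  # execution efficiency
        "output": out1,
    }
```

An agent that fails the reproducibility check with a fixed seed is leaking nondeterminism somewhere in its pipeline, which is precisely the failure mode final-accuracy testing never catches.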

2. Verifiable Rewards Unlock Scalable Alignment

DARE-bench demonstrates a powerful pattern:

  • Structured tasks
  • Deterministic evaluation
  • Programmatic reward

This enables reinforcement learning without human labeling.
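The whole pattern compresses into a reward function an RL loop can call directly. The task schema below is a hypothetical illustration of the idea, not the paper's actual data format:

```python
def verifiable_reward(task: dict, agent_output) -> float:
    """Programmatic reward in [0, 1]: no human judge in the loop.

    Hypothetical schema: IF tasks carry a deterministic reference output;
    MM tasks carry ground truth plus a deterministic metric function.
    """
    if task["mode"] == "IF":
        # Instruction following: binary, exact reproduction or nothing.
        return float(agent_output == task["reference_output"])
    # ML modeling: a deterministic metric already bounded in [0, 1].
    return task["metric_fn"](task["ground_truth"], agent_output)
```

Because every reward is computed, not judged, the same function scales from evaluation to training data generation without a single human label.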

For businesses building domain-specific AI agents, this is a blueprint.

3. Time-Series Is the Next Capability Frontier

Near-zero performance in canonical forecasting suggests that:

  • Temporal reasoning is undertrained
  • Format adherence is brittle
  • Models rely on naive heuristics

If your industry depends on forecasting (finance, supply chain, energy), generic LLMs are not production-ready without targeted fine-tuning.


Strategic Takeaway — Benchmarks as Infrastructure

DARE-bench is not merely another dataset.

It represents a shift toward:

  • Executable evaluation
  • Process-aware supervision
  • Outcome-verifiable reinforcement learning
  • Domain-specialized training

In other words:

Benchmarks are becoming alignment infrastructure.

For organizations serious about deploying AI agents in operational environments, building similar verifiable task frameworks internally may be more valuable than chasing the next frontier model release.

Models improve. Infrastructure compounds.


Conclusion

DARE-bench forces the industry to confront an uncomfortable truth:

LLMs can write pipelines. They cannot reliably follow them.

Yet the paper also offers optimism. When given structured, verifiable supervision, even mid-sized models improve dramatically.

The future of agentic AI will not be decided by larger context windows or better marketing copy.

It will be decided by whether we can teach models to:

  • Respect constraints
  • Handle real data noise
  • Execute reproducibly
  • Optimize under time limits

In short: to behave like disciplined engineers.

Until then, your AI data scientist may still need adult supervision.

Cognaptus: Automate the Present, Incubate the Future.