Opening — Why This Matters Now
Everyone wants an “AI data scientist.” Few are prepared for what that actually entails.
Over the past two years, LLMs have been upgraded from chatty copilots to so-called agentic systems capable of reading files, writing code, training models, and producing forecasts. In theory, they can autonomously execute end-to-end machine learning workflows. In practice, they frequently forget to pass a filename to a tool call.
The paper “DARE-Bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science” (ICLR 2026) offers something the industry urgently needs: a reality check.
Not another leaderboard. Not another “LLM solves Kaggle” headline. But a benchmark designed to answer a harder question:
Can LLM agents follow instructions faithfully and build correct machine learning pipelines under real-world constraints?
The short answer? Not yet.
Background — The Missing Piece in AI Benchmarking
Most existing benchmarks test one of two things:
- Final answer correctness (e.g., coding benchmarks like HumanEval)
- Open-ended reasoning ability (math or planning tasks)
Data science, however, is neither a short function nor a philosophical essay.
It is:
- Multi-step
- Tool-augmented
- Process-sensitive
- Vulnerable to subtle implementation errors
- Dependent on reproducibility
The DARE-bench authors identify two structural gaps in prior benchmarks:
| Gap | Why It Matters in Practice |
|---|---|
| No process-aware evaluation | Agents can produce correct outputs for the wrong reasons |
| Scarcity of verifiable training data | Hard to fine-tune for reproducible ML workflows |
DARE-bench addresses both.
It introduces 6,300 Kaggle-derived tasks split across classification, regression, and time-series forecasting, with two major evaluation modes:
- IF (Instruction Following) — Strictly reproduce a specified ML workflow
- MM (ML Modeling) — Optimize predictive performance under constraints
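To make the dual framing concrete, here is a minimal sketch of what one task record might look like. The field names and `DareTask` class are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical task record; field names are illustrative, not DARE-bench's real schema.
@dataclass
class DareTask:
    task_id: str
    domain: str            # "classification" | "regression" | "time_series"
    mode: str              # "IF" (reproduce a workflow) | "MM" (optimize a metric)
    input_files: list[str] = field(default_factory=list)
    instructions: str = ""
    seed: int = 42         # fixed seed so reference outputs are deterministic

task = DareTask("clf-0001", "classification", "IF",
                ["train.csv", "test.csv"],
                "Train the specified model and write predictions to submission.csv")
print(task.mode)  # IF
```

The same dataset and files can back both modes; only `mode` and the grading rule change.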
This dual framing is subtle but important.
One tests obedience. The other tests competence.
What the Paper Actually Does — Architecture of a Verifiable Benchmark
DARE-bench is built on a key insight:
Data science workflows are reproducible if randomness and execution environments are controlled.
The benchmark enforces:
- Fixed random seeds
- Controlled sandbox execution
- Explicit file inputs/outputs
- Runtime limits
- Turn limits
This allows evaluation to be fully programmatic.
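The reproducibility contract boils down to pinning every randomness source before a pipeline step runs. A minimal sketch of that idea, assuming NumPy-based pipelines (the real harness also handles sandboxing and runtime limits, which are omitted here):

```python
import random
import numpy as np

def run_deterministic(pipeline, seed=42):
    """Pin all randomness sources, then execute a pipeline step.

    Minimal sketch of the reproducibility contract: with the seed fixed,
    the same pipeline must produce byte-identical outputs on every run.
    """
    random.seed(seed)
    np.random.seed(seed)
    return pipeline()

# Two runs with the same seed yield identical outputs.
out1 = run_deterministic(lambda: np.random.rand(3))
out2 = run_deterministic(lambda: np.random.rand(3))
assert np.array_equal(out1, out2)
```

This is what makes exact-match grading feasible at all: without seed control, even a correct pipeline would fail reproduction.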
For instruction-following tasks:
- A reference solution generates deterministic predictions.
- The agent must produce exactly the same output.
- Score = 1 if match, 0 otherwise.
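The binary IF score described above can be sketched in a few lines (the function name `if_score` is mine, not the paper's):

```python
import numpy as np

def if_score(agent_preds, reference_preds):
    """Binary instruction-following score: 1 only on an exact match."""
    agent = np.asarray(agent_preds)
    ref = np.asarray(reference_preds)
    return 1 if agent.shape == ref.shape and np.array_equal(agent, ref) else 0

print(if_score([0, 1, 1], [0, 1, 1]))  # 1
print(if_score([0, 1, 0], [0, 1, 1]))  # 0
```

Note how unforgiving this is: a single mismatched prediction, or even a wrong output shape, zeroes the task.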
For modeling tasks:
- Predictions are compared to ground truth.
- Metrics: Macro-F1 (classification) and clipped $R^2$ (regression/time-series).
Clipped regression metric:
$$ R^2_{\text{clipped}} = \max(R^2,\ 0) $$
This ensures all scores lie in $[0,1]$, making comparison consistent.
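A NumPy-only sketch of the clipped metric, assuming a non-constant `y_true` (so the total sum of squares is nonzero):

```python
import numpy as np

def clipped_r2(y_true, y_pred):
    """R^2 floored at 0 so every score lands in [0, 1]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return max(1.0 - ss_res / ss_tot, 0.0)

# A mean-only predictor scores R^2 = 0; anything worse is also clipped to 0
# rather than going unboundedly negative.
print(clipped_r2([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5]))  # 0.0
```

Without the clip, one catastrophically bad regression run could drag an aggregate score below zero and dominate the average.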
This design removes subjective judging and enables reinforcement learning with verifiable rewards — a major operational advantage.
Evaluation — How Bad Is It Really?
Under a balanced configuration (5 turns, 200s sandbox limit), performance results are sobering.
Selected Results (Test Set)
| Model | Class-IF | Class-MM | Reg-IF | Reg-MM | Time-XF | Time-CF |
|---|---|---|---|---|---|---|
| GPT-4o | 32.88 | 40.45 | 20.28 | 40.60 | 35.54 | 4.77 |
| GPT-5 | 69.81 | 43.40 | 57.24 | 56.29 | 36.83 | 10.13 |
| Claude-3.7 | 61.48 | 61.03 | 46.37 | 63.20 | 49.88 | 13.70 |
| Qwen3-32B | 17.11 | 30.71 | 15.21 | 35.86 | 26.96 | 0.00 |
Several observations emerge:
- Instruction-following is fragile. Even frontier models frequently fail strict reproducibility tests.
- Time-series forecasting remains a major weakness. Scores in canonical forecasting are near-zero for many models.
- Open-source models collapse under tool constraints. Execution limits and API misuse dominate failure modes.
Failure analysis (Table 9 in the paper) identifies four dominant categories:
| Failure Type | What It Reveals |
|---|---|
| Instruction non-adherence | Weak procedural discipline |
| Code errors | Poor pipeline robustness |
| Execution limit exceeded | Inefficient exploration |
| Token limit exceeded | Over-decomposition |
In short: LLMs are competent coders, but unreliable ML engineers.
The Real Contribution — Training, Not Just Testing
The benchmark’s most strategic contribution is not evaluation. It is trainability.
Using DARE-bench training data:
Supervised Fine-Tuning (Qwen3-32B)
- Total score improved by ~1.83×
- Significant gains in both IF and MM tasks
Reinforcement Learning (Qwen3-4B)
- Total score improved from 4.39 → 37.40
- Code errors reduced by nearly half
This matters for one reason:
It proves that failure modes are learnable — not fundamental capability ceilings.
The ablation study is particularly telling:
| Training Data | IF Performance | MM Performance |
|---|---|---|
| IF only | ↑ Instruction | ↓ Modeling |
| MM only | ↓ Instruction | ↑ Modeling |
| IF + MM | ↑ Instruction | ↑ Modeling |
Process fidelity and modeling competence are complementary skills. You cannot train one and expect the other to emerge automatically.
This has direct implications for enterprise AI strategy.
Implications — What This Means for Business AI
1. “Agentic” Does Not Mean Reliable
If your AI agent is deployed in analytics pipelines, finance dashboards, or forecasting systems, you are not evaluating it correctly unless you test:
- Instruction adherence
- Reproducibility
- Execution efficiency
- Output alignment under constraints
Final accuracy is not enough.
2. Verifiable Rewards Unlock Scalable Alignment
DARE-bench demonstrates a powerful pattern:
- Structured tasks
- Deterministic evaluation
- Programmatic reward
This enables reinforcement learning without human labeling.
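The three ingredients above compose into a reward function with no human judge in the loop. A hedged sketch of the pattern, not DARE-bench's exact implementation (the MM branch uses the clipped-$R^2$ rule for regression-style tasks; classification would swap in Macro-F1):

```python
import numpy as np

def verifiable_reward(task_mode, agent_output, reference, y_true=None):
    """Programmatic reward for RL: deterministic, label-free at training time.

    Illustrative sketch; function and argument names are assumptions.
    """
    if task_mode == "IF":
        # Reward exact reproduction of the reference workflow's predictions.
        agent = np.asarray(agent_output)
        ref = np.asarray(reference)
        return 1.0 if agent.shape == ref.shape and np.array_equal(agent, ref) else 0.0
    # "MM" (regression-style): clipped R^2 against ground truth.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(agent_output, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return max(1.0 - ss_res / ss_tot, 0.0)
```

Because the reward is computed from files the agent actually produced, every rollout is gradeable automatically, which is exactly what makes RL at benchmark scale economical.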
For businesses building domain-specific AI agents, this is a blueprint.
3. Time-Series Is the Next Capability Frontier
Near-zero performance in canonical forecasting suggests that:
- Temporal reasoning is undertrained
- Format adherence is brittle
- Models rely on naive heuristics
If your industry depends on forecasting (finance, supply chain, energy), generic LLMs are not production-ready without targeted fine-tuning.
Strategic Takeaway — Benchmarks as Infrastructure
DARE-bench is not merely another dataset.
It represents a shift toward:
- Executable evaluation
- Process-aware supervision
- Outcome-verifiable reinforcement learning
- Domain-specialized training
In other words:
Benchmarks are becoming alignment infrastructure.
For organizations serious about deploying AI agents in operational environments, building similar verifiable task frameworks internally may be more valuable than chasing the next frontier model release.
Models improve. Infrastructure compounds.
Conclusion
DARE-bench forces the industry to confront an uncomfortable truth:
LLMs can write pipelines. They cannot reliably follow them.
Yet the paper also offers optimism. When given structured, verifiable supervision, even mid-sized models improve dramatically.
The future of agentic AI will not be decided by larger context windows or better marketing copy.
It will be decided by whether we can teach models to:
- Respect constraints
- Handle real data noise
- Execute reproducibly
- Optimize under time limits
In short: to behave like disciplined engineers.
Until then, your AI data scientist may still need adult supervision.
Cognaptus: Automate the Present, Incubate the Future.