AIRS-Bench: When AI Starts Doing the Science, Not Just Talking About It

A benchmark is supposed to be a ruler. In AI, it often becomes a trophy shelf.

A model gets a higher score, a chart moves up and to the right, and everyone politely pretends the hard part has been settled. That ritual works when the task is narrow: classify an image, answer a question, pass a coding test, retrieve a document. But it becomes much less comforting when the system being evaluated is no longer just answering. It is planning experiments, writing code, debugging failures, training models, interpreting results, and deciding what to try next.

That is the uncomfortable territory AIRS-Bench enters.¹ The paper does not ask whether an AI model can talk convincingly about research. We already have a surplus of that. It asks whether an LLM-based agent can carry out a machine learning research workflow well enough to compete with published state-of-the-art results.

The answer is not “no.” That would be tidy. The answer is worse for simple narratives: sometimes yes, usually no, and very rarely with the reliability that would let a business treat the system as an autonomous research unit rather than a powerful but moody intern with GPU access.

The headline result is easy to remember. AIRS-Bench contains 20 tasks drawn from state-of-the-art machine learning papers. Across the benchmark, agents exceeded reported human SOTA in 4 tasks and failed to match it in 16. Even in the successful cases, the systems did not hit the theoretical ceiling of the underlying tasks. This is not a benchmark about the arrival of fully automated science. It is a benchmark about making that absence measurable.

That makes it more useful.

The evidence says “capable,” not “reliable”

AIRS-Bench evaluates 14 agents, where an “agent” means a base LLM paired with a scaffold. The base models include CWM, GPT-4o, gpt-oss-20b, gpt-oss-120b, o3-mini, and Devstral-Small 24B. The scaffolds include One-Shot, Greedy, and ReAct-style approaches. Each run lasts 24 hours and has access to one H200 GPU. Each task is launched at least 10 times.

This matters because the paper is not benchmarking a single model response. It is benchmarking an operational loop.

The agent receives a task specification, data, metric, and evaluation setup. It must design a solution, generate code, train or fine-tune models where appropriate, produce a submission file, and survive the small humiliations of real machine learning work: mismatched file paths, broken outputs, overfitting, forgotten artifacts, bad validation logic, and context overflow. In other words, it has to do the work, not narrate the work.

The aggregate results are sobering:

Evidence from AIRS-Bench	What it means	What it does not mean
Agents beat reported human SOTA on 4 of 20 tasks	AI research agents can sometimes discover competitive or better solutions	They are generally superior to human ML researchers
Only 1.58% of agent-task combinations exceed SOTA	Peak successes are rare across all runs	The benchmark is saturated
Average normalized score across all runs and agents is 24.1%	Most systems remain far below the SOTA reference point	The systems are useless
Average valid submission rate is roughly 55% to 59%; the paper’s text and figure caption report slightly different aggregates	Producing a valid artifact is itself a bottleneck	Benchmark scores are only about model intelligence
Human SOTA has an Elo rating of 1674; the top agent is far lower at 1146	Even the best evaluated agent is still far behind the human reference	Agents can never close the gap

The most important line in that table is not the 4-out-of-20 result. It is the submission-rate result.

For business readers, invalid submissions are not a footnote. The submitted artifact is the product. A research agent that sometimes produces a clever method but often fails to submit a valid file is not an autonomous scientist. It is a search process with occasional brilliance and weak operational discipline. Useful, yes. Hire it as Head of R&D, no. Even LinkedIn would struggle to make that title credible.

AIRS-Bench evaluates a workflow, not a prompt

The benchmark design is the paper’s first major contribution.

AIRS-Bench is built from 20 tasks sourced from 17 machine learning papers with published SOTA results. The tasks span seven categories: code, math, molecules and proteins ML, question answering, text classification, text extraction and matching, and time series forecasting.

Each task is organized around a simple triplet:

Component	Role in the benchmark
Problem	The research or prediction objective
Dataset	The concrete data source and split
Metric	The score used to evaluate the agent’s output

That looks ordinary until the key constraint appears: the agent does not receive baseline code.

This is not a “complete this notebook” exercise. The agent receives a task description, dataset information, and evaluation script. It must decide how to solve the task. It can use the provided training data and must generate predictions for the test split in the required format, usually a submission.csv file. The evaluation script then scores the submission against hidden or withheld test labels.

The benchmark files are deliberately structured to make this reproducible across harnesses. Each task includes components such as project_description.md, prepare.py, evaluate_prepare.py, evaluate.py, metadata.yaml, optional utilities, and prepared train/test data. That sounds mundane. It is not. In agent evaluation, mundane infrastructure is often where comparability goes to die quietly.
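
As a rough illustration of what a standardized layout buys an evaluator, the sketch below checks that a task folder contains the files named above before any agent is launched. The file names come from the list in this section; the helper `missing_task_files` and the idea of a pre-flight check are assumptions made for illustration, not something the paper describes.

```python
from pathlib import Path
import tempfile

# File names taken from the task layout described above.
REQUIRED_FILES = (
    "project_description.md",
    "prepare.py",
    "evaluate_prepare.py",
    "evaluate.py",
    "metadata.yaml",
)


def missing_task_files(task_dir):
    """Return the required files absent from task_dir (hypothetical helper)."""
    root = Path(task_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Usage: build a throwaway task folder that lacks evaluate.py.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_FILES:
        if name != "evaluate.py":
            (Path(tmp) / name).write_text("")
    demo_missing = missing_task_files(tmp)

print(demo_missing)
```

A check like this says nothing about agent quality. It only ensures that two harnesses are at least looking at the same task before their scores are compared.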

AIRS-Bench is therefore not merely a list of tasks. It is a task format.

That distinction matters for firms evaluating research automation tools. A demo can show an agent solving one impressive problem under friendly conditions. A benchmark format asks whether different agents can be compared under controlled constraints, with the same task definition, same evaluation logic, similar resource budgets, and repeated runs.

The business version is simple: do not buy the story; inspect the harness.

The benchmark is hard because it removes the crutches

The paper positions AIRS-Bench against other agentic research benchmarks, including MLE-Bench, MLGym-Bench, ML-Agent-Bench, PaperBench, CORE-Bench, SciReplicate-Bench, and RE-Bench. The comparison is useful because it reveals what AIRS-Bench is trying to isolate.

Many benchmarks test pieces of the research process. Some focus on reproducing code, some on repository tasks, some on Kaggle-style competitions, some on hypothesis generation, some on rubrics around research papers. AIRS-Bench tries to evaluate the longer loop: hypothesis generation, implementation, experimentation, and analysis.

The absence of baseline code is central. If an agent can start from a working repository, the evaluation becomes partly about repair, adaptation, and execution. Those are valuable skills. But they are not the same as building a solution strategy from scratch. AIRS-Bench makes the agent walk farther before receiving applause.

The task selection process also matters. The authors first created and evaluated a larger pool of roughly 100 tasks, then selected a representative subset of 20 to reduce GPU cost and make benchmarking tractable. The subset was chosen to preserve three properties of the larger pool: agent performance, category distribution, and relative ranking fidelity. In the appendix, the authors report that the selected subset preserved the full-pool score structure closely, with the difference in average score between subset and full benchmark never exceeding 0.02 in absolute value.

That appendix result is not a second thesis. It is a robustness check on the benchmark’s representativeness. Its purpose is to defend the 20-task suite against the obvious criticism: “Maybe these 20 tasks are just a weird sample.” The paper’s answer is not perfect universality. It is more modest and more useful: this subset appears to preserve the ranking and difficulty structure of the larger task pool well enough to support comparative evaluation.

For business interpretation, this is important. A benchmark does not need to cover every possible research task to be useful. It needs to be hard, standardized, and representative enough that failure modes become visible.

AIRS-Bench succeeds at that.

The agents’ best results come from search, not magic

The strongest agents tend to be those using Greedy scaffolds. In the reported average normalized scores, the top results are:

Agent	Average normalized score
Greedy gpt-oss-120b	0.402
Greedy gpt-oss-20b	0.400
Greedy o3-mini	0.391
Greedy GPT-4o	0.309
ReAct CWM	0.302
Greedy CWM	0.287

The pattern is not “bigger model wins.” Greedy gpt-oss-120b and Greedy gpt-oss-20b are essentially tied in normalized score. The more important variable is how the system explores candidate solutions.

The paper separates three scaffolding styles. One-Shot allows the agent to attempt the problem only once. ReAct operates sequentially in a linear reasoning-and-action loop. Greedy search, implemented through AIRA-dojo, explores multiple candidate solutions through a tree-like process using operators such as drafting, debugging, and improving.
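
To make the contrast with a single attempt concrete, here is a toy sketch of a greedy tree search built from the three operators named above. Everything in it is an assumption for illustration: the `Node` class, the random stand-ins for LLM calls, and the arbitrary score scale. It is not AIRA-dojo's API or the paper's implementation.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One candidate solution in the search tree."""
    plan: str
    score: Optional[float]  # None means the candidate crashed or was invalid
    depth: int = 0


def draft(rng):
    """Stand-in for an LLM call that writes a fresh solution."""
    works = rng.random() > 0.3
    return Node("draft", rng.uniform(0.2, 0.5) if works else None)


def debug(node, rng):
    """Stand-in for an LLM call that repairs a failed candidate."""
    fixed = rng.random() > 0.5
    score = rng.uniform(0.2, 0.5) if fixed else None
    return Node(node.plan + ">debug", score, node.depth + 1)


def improve(node, rng):
    """Stand-in for an LLM call that refines a working candidate."""
    new_score = node.score + rng.uniform(-0.05, 0.10)
    return Node(node.plan + ">improve", new_score, node.depth + 1)


def greedy_search(n_drafts=3, n_steps=20, seed=0):
    """Return the best valid node found, or None if nothing ever ran."""
    rng = random.Random(seed)
    nodes = [draft(rng) for _ in range(n_drafts)]
    for _ in range(n_steps):
        valid = [n for n in nodes if n.score is not None]
        if valid:
            # Greedy step: expand the best-scoring working candidate.
            nodes.append(improve(max(valid, key=lambda n: n.score), rng))
        else:
            # Nothing runs yet: try to repair the most recent failure.
            nodes.append(debug(nodes[-1], rng))
    valid = [n for n in nodes if n.score is not None]
    return max(valid, key=lambda n: n.score) if valid else None


single_attempt = greedy_search(n_drafts=1, n_steps=0)
searched = greedy_search(n_drafts=3, n_steps=20)
print(single_attempt, searched)
```

With the same seed, the searched run contains the single attempt as its first node, so it can only match or beat it. That is the whole argument for test-time search in miniature: more candidates, kept under a selection rule, at the price of more compute.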

That difference shows up in performance. Greedy scaffolds generally move closer to SOTA than One-Shot scaffolds. The obvious interpretation is that test-time search matters. The less obvious interpretation is that “the model” is becoming the wrong procurement unit.

A company buying or building AI research agents should not ask only, “Which LLM are we using?” That is like evaluating a laboratory by looking only at the microscope brand. The scaffold determines how hypotheses are generated, how failures are handled, how alternative paths are explored, and how much useful work is extracted from the model under a fixed compute budget.

AIRS-Bench makes this visible because it treats the LLM and scaffold as a combined agent. That is the correct unit of measurement.

The four SOTA wins are real, but narrow

The paper gives a closer inspection of tasks where agents exceeded reported human SOTA in at least one run. The successful cases are revealing because they show what current agents are good at.

Task	Human SOTA	Agent score	Agent pattern
TextualClassificationSickAccuracy	0.905	0.931	Fine-tuned RoBERTa-large and DeBERTa-v3-large; stacked logits with logistic regression meta-learner
TextualSimilaritySickSpearmanCorrelation	0.85	0.89	Combined fine-tuned RoBERTa models with frozen Sentence-BERT similarity; cross-validation learned weights
CoreferenceResolutionWinograndeAccuracy	0.85	0.88	Fine-tuned DeBERTa-v3-large with classifier head
TimeSeriesForecastingRideshareMAE	1.185	1.153	Trained a bidirectional GRU

The most interesting case is TextualClassificationSickAccuracy. The reported human SOTA fine-tunes RoBERTa on the SICK dataset and reaches 90.5% accuracy. The Greedy gpt-oss-120b agent instead builds a stacked ensemble: RoBERTa-large and DeBERTa-v3-large are fine-tuned independently, their logits are combined using out-of-fold predictions, and a logistic regression meta-learner learns how to combine them. The final system reaches 93.1%.

This is not just parameter-count brute force. It is a recognizable machine learning move: ensemble complementary models, use cross-validation to avoid leakage, combine logits with a simple meta-learner, and average test predictions. Nothing mystical. Very Kaggle, in the best and worst sense.
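
The recipe is easier to judge once its moving parts are visible. The sketch below reproduces the shape of the method on synthetic data using only the standard library: two toy threshold models stand in for the fine-tuned transformers, out-of-fold scores feed a small logistic regression, and test scores are averaged across folds. The data, the base models, and every name in the code are assumptions for illustration, not the agent's actual code.

```python
import math
import random


def fit_base(rows, labels, feature):
    """Toy base model: threshold halfway between the class means of one feature."""
    pos = [r[feature] for r, y in zip(rows, labels) if y == 1]
    neg = [r[feature] for r, y in zip(rows, labels) if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda r: r[feature] - threshold  # signed score, used like a logit


def fit_meta(features, labels, steps=500, lr=0.5):
    """Logistic regression meta-learner trained with plain gradient descent."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1 / (1 + math.exp(-z))
            for j, xj in enumerate(x):
                gw[j] += (p - y) * xj
            gb += p - y
        w = [wi - lr * g / len(labels) for wi, g in zip(w, gw)]
        b -= lr * gb / len(labels)
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b > 0


rng = random.Random(0)


def make_data(n):
    """Two noisy features, each weakly informative about a binary label."""
    labels = [i % 2 for i in range(n)]
    rows = [(rng.gauss(2 * y - 1, 1.0), rng.gauss(2 * y - 1, 1.0)) for y in labels]
    return rows, labels


train_x, train_y = make_data(200)
test_x, test_y = make_data(200)

K = 5
oof = [None] * len(train_x)                  # out-of-fold scores per training row
test_scores = [[0.0, 0.0] for _ in test_x]   # fold-averaged scores per test row
for fold in range(K):
    held = [i for i in range(len(train_x)) if i % K == fold]
    kept = [i for i in range(len(train_x)) if i % K != fold]
    models = [
        fit_base([train_x[i] for i in kept], [train_y[i] for i in kept], f)
        for f in (0, 1)
    ]
    for i in held:
        # Each training row is scored only by models that never saw it.
        oof[i] = [m(train_x[i]) for m in models]
    for row, acc in zip(test_x, test_scores):
        for j, m in enumerate(models):
            acc[j] += m(row) / K

meta = fit_meta(oof, train_y)
accuracy = sum(meta(s) == bool(y) for s, y in zip(test_scores, test_y)) / len(test_y)
print(f"stacked accuracy: {accuracy:.3f}")
```

The design choice that matters is the out-of-fold step. If the meta-learner were trained on scores from models that had already seen those rows, it would learn to trust overconfident inputs, which is the leakage the cross-validation is there to prevent.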

That is precisely why the result is interesting. The agent did not invent a new theory of language understanding. It found a stronger engineering recipe under the benchmark constraints. For many business R&D settings, that is enough to matter. A system that can autonomously discover better modeling recipes on well-specified tasks can reduce experimentation cost.

But the boundary is equally important. These wins are not evenly distributed across the benchmark. They are concentrated in tasks where effective known ingredients are available: pretrained models, fine-tuning, ensembling, cross-validation, and relatively direct metrics. The agent is strong when the problem can be converted into disciplined search over familiar ML tactics. It is weaker when the task requires deeper abstraction, harder reasoning, or robust long-horizon execution.

So the correct lesson is not “AI can now do science.” The correct lesson is: AI agents are beginning to automate pieces of empirical ML search, especially when the objective is clear, the data is available, the metric is executable, and the solution space contains reusable patterns.

That is still a big deal. It is just not the same big deal as the press release version.

The difficulty curve exposes where autonomy breaks

AIRS-Bench ranks tasks by normalized score and groups them into easy, medium, hard, and expert buckets. The difficulty analysis is one of the more business-relevant parts of the paper because it separates two bottlenecks that are often confused: skill and reliability.

In the easiest bucket, average normalized scores are meaningfully higher, and scaffolding differences are large. Greedy agents average around 0.67 in the easy group, compared with 0.41 for ReAct and 0.11 for One-Shot. In the medium bucket, Greedy is around 0.40, ReAct around 0.31, and One-Shot around 0.16. In hard tasks, Greedy and ReAct both sit around 0.21, while One-Shot falls to around 0.07. In expert tasks, everyone is close to the floor: Greedy around 0.03, ReAct around 0.04, and One-Shot around 0.01.

That progression is worth reading slowly.

On easier tasks, scaffolding unlocks better results. On expert tasks, scaffolding barely matters because the underlying challenge overwhelms the current agentic loop. Search helps when the search space contains reachable improvements. It does not magically convert a weak research process into a strong one.

For business use, this suggests a practical segmentation of AI R&D automation:

Task type	Current agent fit	Practical implication
Well-specified modeling task with clear metric	Strongest fit	Use agents for parallel experimentation, baseline generation, and model selection
Known domain with reusable pretrained models	Good fit	Let agents search combinations, fine-tuning recipes, and validation schemes
Messy workflow with fragile submission requirements	Risky but improvable	Invest in harness design, artifact checks, and validation gates
Open-ended scientific reasoning with unclear objectives	Weak fit	Keep humans in charge of framing, judgment, and interpretation
High-stakes research decisions	Not autonomous-ready	Use agents as assistants, not decision owners

This is where AIRS-Bench becomes more than an academic benchmark. It gives organizations a way to ask: where in our R&D pipeline is the problem actually benchmark-like? If the task can be expressed as problem, dataset, metric, evaluation script, and resource budget, an agent may be useful. If the task cannot be expressed that way, the agent may still help, but the evaluation problem becomes more fragile.

The tool is only as autonomous as the environment is legible.

Valid submission is an underrated capability

The paper’s valid submission metric deserves more attention than it will probably receive.

A valid submission means the agent produced an output that meets the task requirements and yields a numerical score. This is not glamorous. It is also the difference between “research automation” and “a folder full of almost-working scripts.”

On average, only a little over half of submissions are valid, with figure-level reporting showing an overall valid submission rate of 59.3% and the nearby text reporting 55.1%. The exact discrepancy is less important than the order of magnitude. A substantial portion of runs fail before performance even matters.

The paper identifies several reasons: formatting failures, missing or incorrectly saved intermediate results, context overflow, accumulated code-editing errors, and longer traces drifting into misaligned behavior. Anyone who has worked with agentic coding systems will recognize this species of failure. It is not that the agent cannot reason. It is that the agent cannot always finish cleanly.
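
Several of these failures can be caught mechanically before scoring. The sketch below is one possible validation gate for a submission file, assuming a two-column CSV with `id` and `prediction` headers. That schema, the function name `check_submission`, and the specific checks are hypothetical; the paper does not prescribe them, and real tasks will have their own formats.

```python
import csv
import io
import math


def check_submission(csv_text, expected_ids, columns=("id", "prediction")):
    """Return a list of problems; an empty list means the file passes the gate."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or tuple(reader.fieldnames) != tuple(columns):
        return [f"expected header {list(columns)}, got {reader.fieldnames}"]
    problems, seen = [], set()
    for line_no, row in enumerate(reader, start=2):
        seen.add(row[columns[0]])
        try:
            ok = math.isfinite(float(row[columns[1]]))
        except (TypeError, ValueError):
            ok = False  # missing cell or text where a number should be
        if not ok:
            problems.append(f"line {line_no}: bad prediction {row[columns[1]]!r}")
    missing = set(expected_ids) - seen
    if missing:
        problems.append(f"missing ids: {sorted(missing)}")
    return problems


good = "id,prediction\n1,0.5\n2,0.7\n"
bad = "id,prediction\n1,0.5\n2,oops\n"
print(check_submission(good, ["1", "2"]))
print(check_submission(bad, ["1", "2", "3"]))
```

A gate like this does not make the agent smarter. It converts a silent failure at scoring time into an explicit error the agent can still act on while it has budget left.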

For businesses, this is the difference between a model benchmark and a production benchmark.

A model benchmark asks, “Can the system achieve a high score?” A production benchmark asks, “Can the system repeatedly produce the right artifact, in the right place, under the right constraints, without a human quietly cleaning up the crime scene afterward?”

AIRS-Bench includes both. That is why its results feel less flattering and more useful.

The normalization choice is a feature, not statistical decoration

AIRS-Bench uses three main aggregate metrics: valid submission rate, average normalized score, and Elo rating.

The normalized score is necessary because the benchmark contains heterogeneous tasks. Some use accuracy, some use mean absolute error, some use Spearman correlation, some use retrieval metrics. A raw average would be meaningless. The paper maps the worst observed valid score to 0 and human SOTA to 1, then uses a “march of nines” transformation to reflect nonlinear progress toward the optimal score.

The intuition is straightforward. Moving from 90% to 99% accuracy is not the same kind of progress as moving from 50% to 59%. Both are nine percentage points. Only one is a tenfold reduction in the remaining error gap. The transformation tries to respect that.
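
The arithmetic can be checked directly. The sketch below assumes one plausible form of the transformation: count the "nines" as the negative base-10 logarithm of the remaining gap to the optimal score, then rescale so the worst valid score maps to 0 and SOTA maps to 1. The function names and the exact formula are assumptions; the paper's expression may differ in detail.

```python
import math


def nines(score, optimum=1.0):
    """Number of 'nines': -log10 of the remaining gap to the optimal score.

    Undefined when score equals optimum (the gap is zero).
    """
    return -math.log10(abs(optimum - score))


def normalized(score, worst, sota, optimum=1.0):
    """Map worst -> 0 and SOTA -> 1 on the log-gap scale (assumed form)."""
    span = nines(sota, optimum) - nines(worst, optimum)
    return (nines(score, optimum) - nines(worst, optimum)) / span


gain_high = nines(0.99) - nines(0.90)  # 90% -> 99% accuracy
gain_low = nines(0.59) - nines(0.50)   # 50% -> 59% accuracy
print(round(gain_high, 3), round(gain_low, 3))
```

Under this assumed form, the move from 90% to 99% is worth one full nine, while the move from 50% to 59% is worth about 0.086 of one. For error metrics such as MAE, the same functions apply with `optimum=0.0`.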

The appendix also reports an identity-transform version as a sensitivity check. Its purpose is not to replace the main result but to show how the difficulty ranking changes under a simpler linear interpretation. Under the march-of-nines transform, the easiest tasks include SICK classification and SICK similarity. Under the identity transform, molecular property prediction tasks rise to the top. This is not a contradiction. It shows that “difficulty” depends partly on how progress is measured relative to both SOTA and the theoretical ceiling.

For a business audience, the lesson is methodological. If you evaluate research agents across different task types, do not hide metric aggregation under a single leaderboard score. Define what counts as progress. Decide whether you care about closing the last-mile gap or merely moving away from the worst baseline. Those are different managerial questions.

The benchmark’s Elo rating adds another perspective by treating agents and human SOTA as players in pairwise comparisons. Human SOTA sits at 1674. The top agent, Greedy o3-mini, is at 1146; Greedy gpt-oss-120b and Greedy gpt-oss-20b follow closely at 1122 and 1116. That gap is the paper’s quiet antidote to overinterpretation. A few SOTA wins do not erase the aggregate distance.
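
One way to read that gap is through the standard Elo expected-score formula, which converts a rating difference into an expected head-to-head result. The sketch below applies it to the two ratings quoted above. It assumes the conventional 400-point scale; the paper's exact rating procedure is not described here, so treat the output as an interpretation aid rather than a reported result.

```python
def elo_expected(rating_a, rating_b):
    """Standard Elo expected score of A against B on the 400-point scale."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


human_sota = 1674
top_agent = 1146  # Greedy o3-mini, as quoted above

p_agent = elo_expected(top_agent, human_sota)
print(f"{p_agent:.3f}")
```

On that convention, a 528-point deficit corresponds to an expected score of roughly 0.05 for the agent in a pairwise comparison with human SOTA.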

What AIRS-Bench directly shows

The paper directly supports several claims.

First, AI research agents can sometimes produce solutions that exceed reported SOTA on benchmarked ML tasks. This is not hypothetical. The paper documents four such tasks and inspects the generated methods.

Second, current agents remain far from consistently matching human SOTA across a diverse suite of research tasks. The dominant result is not victory; it is variability.

Third, scaffolding matters. Greedy, search-based approaches generally outperform One-Shot setups, and often outperform or compete with ReAct-style sequential scaffolds. This supports the broader view that test-time search and orchestration are central to agent performance.

Fourth, valid artifact production remains a hard requirement. A system that cannot reliably submit valid outputs is not ready for autonomous deployment, no matter how impressive its best run looks.

Fifth, benchmark design itself is now part of the frontier. The paper’s portable task format, evaluation scripts, task metadata, and harness conversion logic are not administrative details. They are infrastructure for comparing AI research agents in a reproducible way.

That last point may sound dry. It is not. The history of machine learning is partly the history of better measurement creating better systems. AIRS-Bench is an attempt to make AI research automation measurable before the marketing department declares it inevitable.

A bold strategy, admittedly.

What Cognaptus infers for business use

The business implication is not that firms should replace data scientists with research agents. The paper gives no basis for that. The stronger inference is that firms should begin treating agentic R&D systems as benchmarked workflow engines rather than chat interfaces.

That changes procurement and product design.

A serious AI research-agent evaluation should ask:

Evaluation question	Why it matters
Can the agent produce valid artifacts repeatedly?	Reliability precedes performance
Does the scaffold explore multiple solution paths?	Search strategy affects outcome quality
How does performance vary across seeds?	One lucky run is not a product capability
Does the harness enforce output checks?	Many failures are operational, not conceptual
What compute budget is required?	Research automation can become expensive automation theater
Are tasks close to our actual workflow?	Benchmark transfer is not automatic
Can humans inspect intermediate decisions?	Research governance requires traceability

For AI product teams, AIRS-Bench suggests that the next competitive layer may not be the LLM alone. It may be the harness: how the system manages memory, generates variants, evaluates intermediate results, prevents file-format failure, handles context overflow, and decides when to stop.

For enterprise users, the practical question is not “Can this agent beat SOTA?” It is “Can this agent reduce the cost of reaching a strong baseline, with enough reliability that my human experts spend time judging research direction rather than babysitting broken scripts?”

That is a less glamorous question. It is also the one that matters.

Boundaries that should not be hand-waved away

AIRS-Bench is rigorous, but it is not a universal measure of AI science.

The tasks are machine learning tasks, not the whole of scientific research. They are structured around available datasets, defined metrics, and executable evaluation scripts. That is already a high level of formalization. Many real research problems begin before the dataset is clean, before the metric is settled, and before anyone agrees what success means.

The resource setup is also specific. Each run gets 24 hours and one H200 GPU, and each task is repeated across multiple seeds. That is a serious but bounded compute regime. Different time budgets, broader internet access, newer cached models, or different restrictions could change results. The paper itself notes the role of restrictions and the tradeoff between controlled evaluation and letting agents behave more flexibly.

The cached model environment is another boundary. The authors cache 193 HuggingFace models to reduce rate-limit problems, but the cache does not include frontier models; the newest listed cached model is DeBERTa-v3-large from 2021. This makes the setup more controlled, but it also shapes which strategies are available.

Finally, SOTA itself is a moving and sometimes messy reference point. The authors put effort into validating SOTA scores, but the paper also notes that tracking up-to-date SOTA is increasingly difficult because of submission volume, reproduction cost, and the absence of unified machine-readable result infrastructure. That is not a minor inconvenience. It is part of the benchmark problem.

These boundaries do not weaken the paper. They clarify what kind of evidence it provides: controlled, comparative evidence about AI research agents operating on structured ML tasks under fixed resources. That is enough to be valuable. It is not enough to declare the automation of science.

The real contribution is a map of failure

The existing public conversation about AI agents often oscillates between two lazy positions. One says agents are basically toy demos. The other says they are about to run the laboratory. AIRS-Bench makes both positions harder to defend.

The agents do sometimes find strong solutions. In one case, a search-based agent builds a sensible stacked ensemble and beats the reported SOTA by a few points. That is not a toy.

But across the full benchmark, most tasks remain below human SOTA, valid submissions are inconsistent, expert-level tasks remain near the floor, and the top agent’s Elo rating is still far below the human SOTA reference. That is not an autonomous research department.

The most useful conclusion is therefore not about replacement. It is about diagnosis.

AIRS-Bench tells us where AI research agents currently fail: not only in intelligence, but in orchestration, artifact discipline, long-horizon robustness, validation design, and task-specific strategy selection. These are engineering targets. They are also business targets. The companies that benefit from research agents will not be the ones that simply plug a frontier model into a terminal and hope for science. They will be the ones that build controlled workflows around search, evaluation, reliability, and human review.

AIRS-Bench is valuable because it moves the discussion from “Can AI do research?” to “Under what task design, scaffold, metric, compute budget, and reliability threshold does an AI agent produce useful research work?”

That question is less dramatic. It is also much harder to fake.

And in AI, harder-to-fake measurement is progress.

Cognaptus: Automate the Present, Incubate the Future.

¹ Alisia Lupidi et al., “AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents,” arXiv:2602.06855, 2026.
