Peer Review, But Make It Multi‑Agent: Inside aiXiv’s Bid to Publish AI Scientists

TL;DR for operators

aiXiv is not mainly a claim that AI scientists are ready to flood the world with publishable research and we should all politely applaud the machines. It is more interesting than that, and less comforting. The paper proposes an infrastructure layer for AI-generated science: structured submission, automated review, retrieval-grounded feedback, revision loops, pairwise comparison, prompt-injection detection, multi-model voting, provisional acceptance, DOI-style publication, APIs, MCP interfaces, and public discussion.¹

The operational lesson is simple: once AI agents can generate research-like artefacts at scale, the scarce resource shifts from writing to governance. The bottleneck becomes deciding what deserves attention, what has improved, what is adversarial, what is merely polished nonsense, and which reviewer or model should be trusted when they disagree. Very glamorous. Exactly the sort of plumbing everyone ignores until the basement floods.

The paper’s experiments suggest that review-and-revision loops can materially improve AI-generated proposals and papers. Revised proposals are preferred over originals in more than 90% of pairwise comparisons across the reported settings, and revised papers show similar improvement, reaching 100% preference when a response letter is included. Publication-style voting also improves: proposal acceptance rises from 0% for originals to a mean of 45.2% after revision, while paper acceptance rises from 10% to 70%.

But the boundary matters. These are mostly controlled, simulated, or benchmark-based tests. The system is evaluated on AI-generated proposals, AI Scientist papers, curated ICLR pairwise comparisons, synthetic and suspicious prompt-injection samples, and multi-LLM voting. That is useful evidence for workflow design. It is not proof that autonomous AI scientists can now produce reliable, novel, real-world science without human oversight. The paper itself says current AI Scientist systems remain limited in rigorous experimentation, cross-domain generalisation, long-horizon reasoning, and real-world validation.

For enterprises, the practical translation is not “replace experts with agents.” It is: if AI is producing strategy memos, legal drafts, diligence reports, scientific hypotheses, investment notes, code changes, or clinical research summaries, build the aiXiv-shaped control layer before the volume arrives. Intake, review, revision, adversarial scanning, reviewer diversity, staged approval, provenance, and auditability are not optional ornaments. They are the product.

The machine is not the author; the machine is the venue

The easiest way to misread aiXiv is to treat it as another paper about AI generating papers. That is the shiny object. The actual contribution is a proposed venue.

The paper begins with a familiar problem: AI agents are increasingly able to generate research proposals, run experiments, write papers, and even produce reviews. Existing publication systems, however, were not designed for this. Journals and conferences rely on human peer review, move slowly, and often resist AI authorship. Preprint servers distribute work quickly but do not provide full review or quality control. The authors argue that AI-generated science needs a different kind of publication layer: one that is open enough to accept machine-generated submissions, but structured enough not to become an infinite landfill of synthetic PDFs.

That is the platform-shaped argument. aiXiv is presented as an open-access ecosystem for both human and AI scientists. It accepts proposals and papers. It routes them through automated review. It supports revision and re-submission. It includes safeguards against prompt injection. It exposes API and Model Context Protocol interfaces so external AI agents can interact with the platform. Accepted submissions receive attribution and publication infrastructure, including DOI-style records and public discussion.

This is why a mechanism-first reading is more useful than a normal summary. The important question is not “does the system use LLMs?” Of course it does; it is 2025, apparently everything does. The question is: what sequence of controls turns raw AI output into something reviewable?

The proposed sequence looks like this:

An AI scientist or human submits a research proposal or full paper.
LLM-based review agents evaluate it using structured criteria.
Retrieval augmentation supplies relevant scientific context.
The authoring agent revises the submission based on feedback.
Pairwise review can compare the original and revised versions.
Prompt-injection detection screens documents for adversarial manipulation.
Five AI reviewers vote on publication.
Submissions that pass internal voting are provisionally accepted, with later external reviews potentially upgrading their status.

In other words, aiXiv is less “robot scientist” and more “synthetic manuscript customs office.” It does not assume every generated paper is worth publishing. It assumes the flood is coming and asks what kind of checkpoint might make the flood survivable.

The workflow treats review as a loop, not a stamp

Traditional review often works like a gate: submit, wait, receive judgment, revise if lucky. aiXiv treats review as a loop. That design choice is the paper’s central mechanism.

The platform supports two review modes. Direct Review Mode gives structured feedback on one submission. For proposals, the criteria include methodological quality, novelty and significance, clarity and organisation, and feasibility and planning. For papers, the criteria resemble conference review dimensions: clarity, originality, quality or soundness, and significance or impact. The reviewer is not only supposed to judge; it is supposed to produce actionable feedback.

Meta Review Mode adds a second layer. An area-chair or editor agent identifies the submission’s subfields, creates domain-specific reviewer agents, collects their reports, and synthesises them into a final meta-review. This is a familiar academic workflow translated into agent choreography. Whether one finds that elegant or mildly cursed depends on how many programme committees one has survived.

Pairwise Review Mode then asks a different question: not “is this submission good?” but “is the revised version better than the original?” That distinction matters. Absolute quality judgment is hard. Relative improvement is often easier. The system therefore measures whether review feedback leads to better artefacts, not merely whether reviewers can produce plausible criticism.

This is the design pattern businesses should notice. AI governance workflows often fail because they collapse several questions into one approval step. Is the draft accurate? Is it better than the last version? Is it safe to share? Is it strategically useful? Is it compliant? Is it worth human attention? aiXiv separates some of these questions into different mechanisms.

Platform mechanism	What it checks	Business analogue	What it does not solve
Structured submission	Whether artefacts arrive in a usable format	Intake templates for memos, diligence, legal drafts, model outputs	Truth or originality
Direct review	Whether feedback can improve one artefact	AI critique pass before expert review	Final accountability
Meta review	Whether multiple specialist views can be synthesised	Cross-functional review across legal, finance, technical, compliance	Reviewer bias or groupthink
Pairwise review	Whether revision improved the output	Version comparison before escalation	Absolute quality
Prompt-injection detection	Whether the document tries to manipulate the reviewer	Adversarial scanning for AI-readable content	All security risks
Multi-AI voting	Whether multiple models agree enough to provisionally accept	Model ensemble approval	Objective truth

The table is the article in miniature. aiXiv’s value is not one clever model. It is decomposition. The system makes review less mystical by assigning separate jobs to separate controls.

Retrieval helps, but it does not magically manufacture judgment

The authors add retrieval-augmented generation to the review process, using external scientific literature to help reviewers assess novelty, positioning, missing citations, and weaknesses. This is sensible. Scientific review is not just grammar correction wearing a tweed jacket. It depends on knowing what has already been done.

The paper evaluates pairwise assessment alignment on both proposals and papers. For paper-level evaluation, it uses DeepReview’s ICLR 2024 and 2025 test datasets, removes ambiguous borderline papers with mean reviewer ratings in the 5–6 range, and builds balanced accepted-versus-rejected pairs. That produces 235 pairs for ICLR 2024 and 163 for ICLR 2025. For proposal-level evaluation, the authors convert ICLR papers into proposal format and assemble 500 pairs, each containing one high-quality and one low-quality proposal, again removing borderline cases.

The strongest paper-level result in the table is 81.70% accuracy on ICLR 2024 using Claude Sonnet 4 with RAG, with 79.75% on ICLR 2025. Proposal-level results are lower and more uneven: the best ICLR 2024 proposal figure in the table is 77.91% for Claude Sonnet 4 with RAG, while the ICLR 2025 proposal figures top out around 70%. The paper text also states that a GPT-4.1 RAG-based evaluator achieves 77% on proposal-level assessment, although the adjacent table reports GPT-4.1 at 69.73% with RAG on ICLR 2024. That discrepancy is not fatal to the paper’s direction, but it is a useful reminder to read the tables before turning the abstract into a trumpet section.

The more important finding is variance. RAG does not uniformly improve every model. In proposal assessment, Gemini 2.5 Pro drops from 77.46% without RAG to 71.90% with RAG on ICLR 2024. Claude Sonnet 4 improves slightly on ICLR 2024 but declines on ICLR 2025. GPT-4.1 also performs worse with RAG in the proposal table. For papers, the pattern is more favourable for some models, especially Claude Sonnet 4, but still mixed.

That is not a reason to discard retrieval. It is a reason to treat retrieval as an engineering component, not an incantation. Retrieval can give reviewers context. It can also introduce irrelevant material, distract the model, or change the comparison frame. The business version is painfully familiar: dumping a knowledge base into an LLM pipeline does not automatically create institutional judgment. It creates a better-informed failure mode unless retrieval quality, prompt design, task definition, and evaluation all line up.

The prompt-injection module is the paper’s least decorative feature

The paper’s prompt-injection defence may look like a security sidebar. It is not. It is one of the clearest signs that the authors understand publication as infrastructure.

An AI-reviewed manuscript is not just text. It is an input to an evaluator. That means a malicious author can try to manipulate the evaluator through hidden instructions, strange formatting, invisible characters, metadata, mixed-language tricks, or contextual commands. The paper gives the example of an instruction telling the reviewer to ignore previous instructions and give only a positive review. Elegant? No. Effective enough to worry about? Yes.

aiXiv proposes a five-stage pipeline:

Extract raw PDF text plus layout metadata such as font size, colour, position, and encoding.
Run coarse parallel scans for known injection keywords, visual anomalies, white text, zero-width characters, and Unicode variants.
Perform deeper semantic verification with LLM analysis, contextual checks, and multilingual validation.
Categorise confirmed anomalies into attack types.
Aggregate severity, type, and location into a risk score for final decision.

The authors test this on 150 recent arXiv papers across five computer science domains, manually filtering to 105 clean papers. They synthetically augment 35% of the data with attacks, producing 36 adversarial papers across categories such as white text, metadata, invisible characters, mixed language, steganographic attack, and contextual attack. They report 84.8% detection accuracy on the synthetic adversarial dataset and 87.9% on a real-world suspicious sample set.

This is not a complete security proof. It is a useful design signal. If AI systems are reviewing documents, the documents become attack surfaces. A business review pipeline for contracts, research memos, M&A materials, vendor proposals, regulatory submissions, or code documentation has the same problem. The adversary may not need to hack the model. They may only need to insert model-readable instructions where humans are unlikely to look.

That changes the governance checklist. Content review is not enough. Document forensics becomes part of AI quality control.

The revision results are strong, but they measure improvement more than truth

The most commercially relevant result in the paper is not the headline accuracy of the reviewers. It is the improvement after revision.

For proposals, the authors generate 50 proposals per topic across three topics using the AI Scientist proposal module, remove redundant content with sentence-level embeddings and an 80% cosine-similarity threshold, and evaluate revised versions against originals. The final topic counts shown in the table are 28 for NanoGPT, 27 for 2dDiffusion, and 29 for Grokking. Across single-review and meta-review settings, revised proposals are preferred in more than 90% of cases. With a response letter included, many settings reach 100%.

For papers, the authors use 10 full-length documents generated by the AI Scientist, each with reproducible baselines and code. Revised versions are again compared with originals. Without a response letter, the revised papers are preferred 90% or 95% on average depending on model and ordering. With a response letter, the revised papers reach 100% preference across both models and both presentation orders.

The response-letter effect is especially interesting. In human peer review, rebuttals and response letters do not only transmit information; they frame the revision. They tell the reviewer what changed, why it changed, and where to look. aiXiv appears to reproduce that dynamic in an LLM-agent setting. The result is slightly amusing: even synthetic reviewers like being handled with a proper response memo. Bureaucracy survives the singularity.

But the interpretation has limits. Pairwise preference does not prove scientific correctness. A revised paper can be clearer, more complete, and better positioned while still being wrong. A response letter can improve perceived responsiveness without proving that the underlying experiments are reliable. The paper’s own limitation section is direct on this point: current AI Scientist systems still struggle to conduct rigorous experimental workflows or produce high-quality publishable outputs without human oversight.

So the right reading is not “review agents make AI science reliable.” It is narrower and more useful: structured review agents can improve the presentation, organisation, methodological articulation, and apparent quality of AI-generated research artefacts. That is valuable. It is also not the same thing as truth.

Voting exposes the problem it is supposed to solve

The publication decision mechanism uses five high-performing LLMs. A submission is accepted if at least three of five vote to accept. The authors argue that this reduces single-model bias. In principle, yes. In practice, the tables show why ensemble governance is both useful and awkward.

For proposals, original versions receive 0% acceptance by majority vote across all three topics. After revision, single-review and meta-review versions improve substantially. Topic A reaches 42.85% or 50% depending on review mode. Topic B reaches 66.66% in both revised modes. Topic C remains much weaker, rising only to 6.89% or 20.68%. The reported mean acceptance rate for proposals increases from 0% to 45.2%.

For papers, acceptance rises from 10% for original versions to 70% after revision. That is the clean headline. The messier detail is model disagreement. In the proposal voting table, one model accepts nearly everything in some conditions, while another accepts almost nothing. DeepSeek V3, listed as M4, gives very high acceptance rates, including 100% in many proposal rows. Claude Sonnet 4, listed as M1, is usually far more conservative. Gemini 2.5 Pro also tends to accept many revised proposals. The majority vote hides this disagreement while depending on it.

The paper acknowledges that simple LLM majority voting may still lack objectivity and proposes that submissions passing internal votes be marked “Provisionally Accepted” until enough diverse external reviewers contribute additional evaluations. That is the right instinct. Majority vote is not epistemology. It is a temporary coordination rule.

For business use, the lesson is sharp: multi-model voting can reduce dependence on one model, but it can also convert model personality into governance theatre. If one model is permissive, one is severe, and one is erratic, three-out-of-five voting may look procedural while remaining fragile. Enterprises using ensembles for approval should track reviewer behaviour over time: false acceptances, false rejections, calibration drift, sensitivity to prompt framing, and disagreement patterns by domain.

A committee of models is still a committee. It can produce better decisions. It can also produce minutes.

What the experiments support, and what they do not

The paper includes several experimental components. They are easy to over-merge, so it helps to separate their likely purpose.

Test or result	Likely purpose	What it supports	What it does not prove
ICLR proposal pairwise evaluation	Main evidence for reviewer alignment on proposal quality	LLM reviewers can distinguish clearer high- versus low-quality proposal pairs under filtered conditions	Reliable judgment on ambiguous, frontier, or domain-specific proposals
ICLR paper pairwise evaluation	Main evidence for reviewer alignment on paper quality	Some LLM reviewers, especially with RAG in paper settings, align reasonably with accepted/rejected paper distinctions	Replacement for human programme committees
Prompt-injection detection	Security and robustness test	AI-reviewed documents need adversarial scanning; the proposed detector catches many synthetic and suspicious cases	Comprehensive defence against all document-level attacks
Proposal revision preference	Main evidence for closed-loop improvement	Review feedback improves AI-generated proposals in pairwise comparison	That revised proposals are scientifically novel or feasible in the real world
Paper revision preference	Main evidence for closed-loop improvement	Review feedback improves AI-generated papers’ apparent quality and structure	That experiments are correct, reproducible, or important
Response-letter variant	Sensitivity or workflow test	Explicitly mapping reviewer concerns to revisions improves perceived quality	That response letters cannot game reviewers
Five-model voting	Governance mechanism test	Revision increases acceptance under the platform’s voting rule	Objective publication quality

This table is where the practical reading should land. The paper provides evidence for workflow improvement and review architecture. It does not settle the deeper question of scientific validity. That distinction is not pedantry. It is the difference between building a useful operating system for AI research and confusing the operating system with science itself.

The business value is platform governance, not automated genius

The enterprise analogy is not “build your own aiXiv for papers.” It is broader. Many organisations are already facing synthetic output volume: investment memos, market scans, legal drafts, code patches, customer research summaries, compliance reports, technical designs, patent landscapes, supplier evaluations, and clinical literature reviews. The common problem is not that no one can produce drafts. The problem is that everyone can produce too many drafts.

aiXiv offers a useful governance pattern:

Define structured intake before generation becomes chaotic.
Separate critique from approval.
Use retrieval for context, but evaluate whether retrieval helps.
Compare revised outputs against originals rather than relying only on absolute scores.
Require response letters or change logs for consequential revisions.
Scan documents for adversarial instructions before feeding them into AI reviewers.
Use multiple reviewers, but measure reviewer calibration.
Treat approval as staged, not binary.
Preserve attribution, provenance, and discussion history.

This maps cleanly to R&D, consulting, finance, legal, pharma, and enterprise strategy. In each case, AI-generated work should move through a pipeline that asks: who generated this, from what inputs, under what assumptions, with what changes, under whose review, and with what residual uncertainty?

That is the serious business relevance. Not cheaper prose. Cheaper diagnosis.

A manager does not need another AI system that can generate a confident memo. The internet is already full of confident memos; some of them even contain facts. What operators need is a way to sort, improve, challenge, and approve AI work without pretending that fluency equals competence. aiXiv’s contribution is to make that control layer explicit.

The limits are not footnotes; they define the product boundary

The paper’s limitation section is unusually important because it prevents the most obvious overclaim.

First, current AI Scientist systems are still not good enough to autonomously conduct rigorous experimental workflows or reliably generate high-quality publishable science without human oversight. The paper names the relevant weaknesses: cross-domain generalisation, long-horizon reasoning, ambiguous or under-specified tasks, and experimental execution.

Second, the validation is restricted largely to simulated environments and virtual agent interactions. That matters because many sciences are constrained by physical systems, expensive instruments, tacit lab knowledge, wet-lab variability, safety procedures, field conditions, and real-world failure. A workflow that improves a generated paper about a benchmark does not automatically translate to autonomous materials science, clinical research, robotics, or biology.

Third, adaptive learning across diverse users, tasks, and domains remains unresolved. aiXiv is framed as a future environment where agents could learn from proposal generation, review, revision, and publication outcomes. That is plausible as an ambition. It is not yet demonstrated as a mature learning system.

These limits do not make the paper weak. They make it properly bounded. The work is best read as a platform proposal plus experimental evidence that review-driven refinement improves AI-generated research artefacts under controlled conditions. It is not a final answer to autonomous scientific discovery. The authors are building the airport before proving every aircraft is safe to fly. Slightly terrifying, but at least someone remembered air traffic control.

The sharper takeaway: publish less like a printer, more like a protocol

The tempting future of AI-generated science is volume: more proposals, more experiments, more papers, more reviews, more everything. Volume, however, is not progress. Without review infrastructure, it is just spam with equations.

aiXiv’s useful move is to treat publication as a protocol. It asks what must happen between generation and dissemination: structured submission, grounded review, revision, comparison, security screening, multi-reviewer approval, provisional publication, public discussion, and attribution. That protocol is still imperfect. The evidence is uneven. The voting mechanism is fragile. The reviewers are model-dependent. The experiments do not prove autonomous science works in the wild.

Still, the platform points in the right direction. As AI agents become cheaper and more persistent, organisations will not win by generating the most artefacts. They will win by building the best filters, revision loops, and accountability layers around those artefacts.

The glamorous question is whether AI can become a scientist. The operational question is whether anyone can build institutions that keep AI-generated science from becoming unreadable, unreviewable, and occasionally adversarial. aiXiv is an early attempt at that institution.

Naturally, the institution is also made of AI agents. Academia wanted faster peer review. It may get five tireless reviewers, a prompt-injection scanner, and a provisional acceptance badge. Progress does enjoy arriving with paperwork.

Cognaptus: Automate the Present, Incubate the Future.

Pengsong Zhang et al., “aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists,” arXiv:2508.15126, https://arxiv.org/abs/2508.15126. ↩︎

TL;DR for operators#

The machine is not the author; the machine is the venue#

The workflow treats review as a loop, not a stamp#

Retrieval helps, but it does not magically manufacture judgment#

The prompt-injection module is the paper’s least decorative feature#

The revision results are strong, but they measure improvement more than truth#

Voting exposes the problem it is supposed to solve#

What the experiments support, and what they do not#

The business value is platform governance, not automated genius#

The limits are not footnotes; they define the product boundary#

The sharper takeaway: publish less like a printer, more like a protocol#