When AI Becomes the Reviewer: Pairwise Judgment at Scale

A committee has one expensive problem before it has any philosophical problem: too many proposals, too little time, and no clean way to know whether Proposal 17 was actually better than Proposal 42.

So the usual system does what institutions often do when the task is too large to compare directly. It fragments the work. A few reviewers score a few proposals. Their scores are averaged. A ranked list appears. Everyone pretends the number is more stable than the process that produced it.

The paper behind today’s article, LLMs Can Assist with Proposal Selection at Large User Facilities, does not merely ask whether a large language model can imitate a scientific reviewer.¹ That would be the obvious headline, and also the least interesting one. The sharper idea is operational: if an LLM makes pairwise comparisons cheap enough, a review system can stop asking every reviewer to produce isolated absolute scores and instead build a ranking from many direct head-to-head judgments.

That sounds small. It is not.

Absolute scoring asks: “How good is this proposal?” Pairwise preference asks: “Between these two proposals, which is stronger?” The second question is often cognitively easier, statistically cleaner, and organizationally more useful. It was also historically impractical, because comparing every pair scales as $N(N-1)/2$. A pool of 70 proposals means 2,415 pairwise comparisons. Charming, if your reviewers are immortal.
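The quadratic growth is easy to check directly. The short Python sketch below enumerates every unordered pair in a pool; the function name `all_pairs` and the proposal IDs are made up for illustration.

```python
from itertools import combinations

def all_pairs(proposal_ids):
    """Return every unordered pair of proposals in one review cycle."""
    return list(combinations(proposal_ids, 2))

# Hypothetical IDs for a 70-proposal pool.
pool = [f"P{k:02d}" for k in range(1, 71)]
pairs = all_pairs(pool)

# N(N-1)/2 with N = 70 gives 2,415 head-to-head comparisons.
print(len(pairs))
```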

The paper’s contribution is to show a practical version of that workflow for historical proposal data from three beamlines at the Spallation Neutron Source at Oak Ridge National Laboratory: EQ-SANS, CNCS, and POWGEN. The authors convert proposal PDFs into markdown, use Gemini-2.5-flash to judge every proposal pair within a cycle, aggregate the win-loss outcomes with a Bradley-Terry model, compare the resulting ranks with historical human ranks, examine publication-output metrics, estimate cost, and add an embedding-based similarity layer for resubmissions or overlapping proposals.

The misconception to avoid is simple: this is not a paper saying “fire the reviewers.” It is a paper saying “stop wasting reviewer attention on the parts of ranking machinery that machines can make cheap, repeatable, and inspectable.” The committee is not gone. Ideally, it becomes less blind.

The mechanism is not “AI scoring”; it is pairwise preference plus ranking aggregation

Most AI-review discussions begin with the wrong mental model. They imagine a model reading a proposal and assigning a score, as if the goal were to replace one human score with one machine score. That would reproduce the old weakness with a shinier vendor invoice.

This paper instead changes the shape of the task.

In the current individual-scoring approach, proposals are distributed across reviewers. Each proposal receives only a small number of expert reviews, and each reviewer sees only a subset of the pool. The authors note that at SNS each proposal is typically reviewed by no more than three experts, while each expert may handle up to nine proposals. That structure creates a comparability problem: Reviewer A’s “excellent” is not necessarily Reviewer B’s “excellent,” and proposals in different reviewer subsets are connected only indirectly.

Pairwise preference attacks that weakness directly. Every proposal is compared with every other proposal in the same cycle. The model decides whether Proposal A, Proposal B, or a tie is preferred. Those outcomes become a win-loss matrix. The Bradley-Terry model then estimates each proposal’s latent strength from the head-to-head results.

The core probability model is:

$$ P(i > j) = \frac{s_i}{s_i + s_j} $$

where $s_i$ and $s_j$ represent the estimated strength of proposals $i$ and $j$. Ties are treated as half-wins for each side, and scores are normalized during iterative estimation.
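To make the aggregation step concrete, here is a minimal Python sketch of a Bradley-Terry fit. The article only says that ties count as half-wins and that scores are normalized during iterative estimation. The specific update rule below (the standard minorization-maximization iteration), the function name `bradley_terry`, the iteration cap, and the toy win matrix are assumptions for illustration, not the authors' code.

```python
def bradley_terry(wins, n_iter=1000, tol=1e-12):
    """Estimate Bradley-Terry strengths from a square win matrix.

    wins[i][j] is how often proposal i beat proposal j. A tie is assumed
    to be recorded already as 0.5 in both wins[i][j] and wins[j][i].
    """
    n = len(wins)
    s = [1.0 / n] * n  # start from equal strengths
    for _ in range(n_iter):
        new_s = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (number of comparisons) / (s_i + s_j).
            denom = sum((wins[i][j] + wins[j][i]) / (s[i] + s[j])
                        for j in range(n) if j != i)
            value = total_wins / denom if denom > 0 else s[i]
            new_s.append(max(value, 1e-12))  # keep strengths strictly positive
        norm = sum(new_s)
        new_s = [v / norm for v in new_s]  # normalize so strengths sum to 1
        shift = max(abs(a - b) for a, b in zip(new_s, s))
        s = new_s
        if shift < tol:
            break
    return s

# Toy cycle with four proposals A, B, C, D (indices 0..3):
# A beats B and C, D beats A, B beats C and D, C ties D.
toy_wins = [
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.5],
    [1.0, 0.0, 0.5, 0.0],
]
strengths = bradley_terry(toy_wins)
ranking = sorted(range(4), key=lambda k: strengths[k], reverse=True)
print(strengths, ranking)
```

In this toy round robin, A and B each collect two wins and end up with equal strength, D follows with 1.5, and C comes last with 0.5. A proposal that wins every comparison has no finite maximum-likelihood strength, so a real pipeline would need a convention for undefeated entries; the small positive floor used here is one simple choice.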

That matters because the final ranking is not a mysterious one-shot opinion from the LLM. It is an aggregation of many local judgments. The model may still be wrong. Of course it may. We are not conducting a religious conversion. But the errors are now distributed across a structured comparison matrix, which means disagreement, instability, and outliers can be inspected instead of buried inside a single averaged score.

For business readers, this is the first transferable lesson: the value is not merely using an LLM to “review.” The value is using an LLM to make a previously unaffordable decision protocol affordable.

The workflow has four distinct layers, and only one of them is “the model judges”

The paper’s system is easier to understand as a pipeline than as a model benchmark.

| Layer | What happens in the paper | Operational meaning |
| --- | --- | --- |
| Document conversion | Proposal PDFs are converted into markdown using OCR | Review automation begins with document intake, not model magic |
| Pairwise judgment | Gemini-2.5-flash compares every proposal pair within a cycle | The model performs many relative judgments rather than one absolute score |
| Ranking aggregation | Bradley-Terry converts win-loss results into proposal strength scores | The organization gets a ranked list with a formal aggregation mechanism |
| Similarity analysis | Qwen3-embedding-8b maps proposals into vectors for cosine similarity | Committees can flag resubmissions, overlaps, and same-topic submissions |

This structure is why the paper is more useful than a generic “LLMs are promising for peer review” article. The proposal ranking problem is not just a natural-language task. It is a resource allocation process. Resource allocation needs intake, comparison, ranking, exception handling, and audit trails.

In that sense, the LLM is only one component. The system’s actual product is a decision-support layer.

The authors also make an implementation choice worth noticing. The model is prompted to summarize each proposal, compare the two, reason through the comparison, and only then decide the winner in JSON. That prompt design is not a proof of correctness. But it shows the authors are not merely asking for a naked label. They are trying to force the judgment through a repeatable response structure.

In production terms, that matters because a committee does not only need a winner. It needs a reason to inspect.
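One way to picture the hand-off from the model's structured answer to the win-loss matrix is the Python sketch below. The JSON field names (`winner`, `reasoning`), the labels `"A"`, `"B"`, and `"tie"`, and the function `record_verdict` are hypothetical; the article does not give the paper's exact response schema.

```python
import json

def record_verdict(wins, i, j, raw_response):
    """Parse one pairwise verdict and update the win matrix in place.

    Proposal i was shown as "A" and proposal j as "B".
    Returns the parsed verdict so the reasoning can be stored for audit.
    """
    verdict = json.loads(raw_response)
    winner = verdict["winner"]
    if winner == "A":
        wins[i][j] += 1.0
    elif winner == "B":
        wins[j][i] += 1.0
    elif winner == "tie":
        # A tie counts as half a win for each side.
        wins[i][j] += 0.5
        wins[j][i] += 0.5
    else:
        raise ValueError(f"unexpected winner label: {winner!r}")
    return verdict

# Hypothetical model outputs for a three-proposal pool.
wins = [[0.0] * 3 for _ in range(3)]
audit_log = [
    record_verdict(wins, 0, 1, '{"winner": "A", "reasoning": "clearer science case"}'),
    record_verdict(wins, 1, 2, '{"winner": "tie", "reasoning": "comparable impact"}'),
]
print(wins)
```

Keeping the parsed verdicts alongside the matrix is what gives a committee something to inspect later, beyond the bare label.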

The main evidence is agreement with historical human rankings, not proof of truth

The first major result compares LLM-generated rankings with historical human rankings over past run cycles. The paper reports positive correlation across all analyzed cycles. Spearman’s $\rho$ ranges roughly from 0.2 to 0.8, and the abstract states that correlation improves to at least 0.5 after removing the top 10% outliers.

This is the main evidence, but it must be read carefully.

A positive correlation means the LLM-based pairwise system tends to recover something similar to the historical human ranking. It does not prove that humans were correct. Nor does it prove that the LLM discovered “true scientific merit,” a lovely phrase that tends to evaporate under committee lighting. The paper’s comparison baseline is historical human judgment because that is what exists.
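For readers who want to see what the correlation check involves, here is a small pure-Python sketch of Spearman's $\rho$ with and without outliers. How the paper defines an outlier is not spelled out in this article, so dropping the proposals with the largest absolute rank gap is an assumption, as are the function names and the toy ranks.

```python
from math import sqrt

def to_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = to_ranks(x), to_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)

def spearman_trimmed(llm, human, drop_fraction=0.10):
    """Drop the largest |rank gap| cases (assumed outlier rule), then recompute."""
    n = len(llm)
    by_gap = sorted(range(n), key=lambda k: abs(llm[k] - human[k]), reverse=True)
    n_drop = int(drop_fraction * n + 1e-9)
    keep = sorted(by_gap[n_drop:])
    return spearman([llm[k] for k in keep], [human[k] for k in keep])

# Toy ranks for ten proposals; the fifth proposal is a large disagreement.
llm_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
human_rank = [2, 1, 3, 4, 10, 6, 7, 8, 9, 5]
rho_all = spearman(llm_rank, human_rank)
rho_trimmed = spearman_trimmed(llm_rank, human_rank)
print(round(rho_all, 3), round(rho_trimmed, 3))
```

In the toy data, removing a single large disagreement raises the correlation noticeably, which is the same qualitative effect the paper reports when outliers are excluded.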

The more interesting point is how the authors use disagreement. Figure 4 compares normalized LLM ranks and human ranks for EQ-SANS and then shows how Spearman correlation changes as outliers are excluded across EQ-SANS, CNCS, and POWGEN. The authors interpret divergent cases as useful signals for review committees.

That is the correct operational reading. If an AI ranking simply agrees with humans, it may be redundant. If it disagrees randomly, it is noise. But if disagreement clusters in specific proposals, it becomes an audit queue.

A practical review workflow could therefore treat LLM-human rank gaps as triage signals:

| Signal | Interpretation | Committee action |
| --- | --- | --- |
| High agreement between human and LLM ranking | The proposal’s relative position is stable across methods | Lower review burden, unless policy requires extra scrutiny |
| Moderate disagreement | The proposal may depend on criteria the model or reviewers weigh differently | Ask a domain expert to compare the competing rationale |
| Extreme rank gap | Potential human inconsistency, model misunderstanding, OCR failure, or genuinely ambiguous proposal | Escalate for manual adjudication |
| Many outliers in one cycle | Review criteria may be unstable or proposal pool may be unusually heterogeneous | Revisit rubric calibration before final ranking |
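The triage idea can be sketched as a simple rule on the gap between normalized ranks. The thresholds (0.15 and 0.40), the labels, and the function name `triage` below are purely illustrative assumptions; neither the paper nor this article proposes specific cut-offs.

```python
def triage(llm_rank, human_rank, moderate=0.15, extreme=0.40):
    """Map the gap between two normalized ranks (0 = best, 1 = worst) to an action."""
    gap = abs(llm_rank - human_rank)
    if gap >= extreme:
        return "escalate_for_adjudication"
    if gap >= moderate:
        return "ask_domain_expert"
    return "stable"

# Hypothetical (LLM rank, human rank) pairs for three proposals.
queue = {
    "P07": triage(0.10, 0.12),
    "P19": triage(0.20, 0.50),
    "P33": triage(0.05, 0.90),
}
print(queue)
```

Any real deployment would have to calibrate such thresholds against its own pool sizes and tolerance for manual review.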

This is where the mechanism-first interpretation pays off. The paper is not only about whether the AI rank is “close.” It is about turning rank distance into review intelligence.

The publication metric is a useful reality check, but not a ground truth oracle

The second evidence layer examines whether the rankings identify proposals with higher publication potential. The authors link accepted proposals to publication records and use a discounted publication count, $N_{dpub}$, where a publication associated with $K$ proposals counts as $1/K$ for each proposal.

They then define a publication metric:

$$ M_{dpub} = \frac{ \sum_{i=1}^{N}(1 - R_i)N_{dpub,i} }{ \sum_{i=1}^{N}N_{dpub,i} } $$

Here, $R_i$ is the normalized rank, where a better rank is closer to 0. If high-publication proposals are ranked higher, the metric should be larger.
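A worked example helps here. The Python sketch below computes the discounted publication counts and then $M_{dpub}$ exactly as defined above; the function names, proposal IDs, ranks, and publication links are invented for illustration.

```python
def discounted_counts(publications, proposal_ids):
    """Compute N_dpub: a publication linked to K proposals adds 1/K to each."""
    counts = {pid: 0.0 for pid in proposal_ids}
    for linked in publications:
        share = 1.0 / len(linked)
        for pid in linked:
            if pid in counts:
                counts[pid] += share
    return counts

def m_dpub(normalized_rank, n_dpub):
    """Rank-weighted share of discounted publications (higher is better)."""
    total = sum(n_dpub.values())
    if total == 0:
        raise ValueError("no publications linked to this pool")
    weighted = sum((1 - normalized_rank[pid]) * n_dpub[pid] for pid in n_dpub)
    return weighted / total

# Toy pool: normalized rank 0.0 is the best-ranked proposal, 1.0 the worst.
ranks = {"A": 0.0, "B": 0.5, "C": 1.0}
# Three publications: one from A alone, one shared by A and B, one from C.
pubs = [["A"], ["A", "B"], ["C"]]

n_dpub = discounted_counts(pubs, ranks)
score = m_dpub(ranks, n_dpub)
print(n_dpub, round(score, 4))
```

Here A collects 1.5 discounted publications, B 0.5, and C 1.0, so the metric is $(1 \cdot 1.5 + 0.5 \cdot 0.5 + 0 \cdot 1.0)/3 \approx 0.583$. A ranking that placed C first instead would score lower, which is the behaviour the metric is designed to reward.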

The results are cautious. The paper reports no statistically significant difference between the LLM ranking and human ranking on this metric. For EQ-SANS, the LLM ranking has $M_{dpub}=0.481 \pm 0.079$, while the human ranking has $0.474 \pm 0.110$. For CNCS, both means are reported as 0.516, with standard deviations of 0.069 for the LLM ranking and 0.077 for the human ranking. For POWGEN, the LLM metric is $0.542 \pm 0.106$ versus $0.514 \pm 0.093$ for humans, but the difference is not statistically significant.

This result supports comparability, not superiority.

There is a trap here. A lazy reading would say: “The LLM finds publishable science as well as humans.” The better reading is narrower: among accepted proposals for which publication outcomes are observable, this metric does not show the LLM ranking performing worse than the historical human ranking.

That boundary matters for two reasons.

First, rejected proposals do not generate experimental output at the facility, so their publication potential is largely unobserved. Second, accepted proposals were accepted under the human process, which biases the available outcome data toward the human ranking regime. The authors explicitly note this bias.

So the publication test is best interpreted as a sanity check. It asks whether the LLM ranking obviously fails when compared with a downstream productivity signal. It does not prove that the LLM would make better funding decisions in a counterfactual world where rejected proposals were also tested.

Useful? Yes. Final proof? No. Please do not put it on a sales slide as “AI reviewer validated by publications,” unless your compliance team enjoys recreational danger.

The cost result is the business hinge

The strongest business implication comes from cost, because pairwise comparison is only attractive if the quadratic workload stops being fatal.

The authors estimate that a human proposal review costs about USD 54.9, while one LLM pairwise comparison costs about USD 0.0046 at Gemini-2.5-flash pricing and the observed token usage. Their token statistics are roughly 4,869 input tokens and 1,255 output tokens per pair. For typical SNS proposal-pool sizes of $N \in [30,70]$, they estimate that the human individual-scoring approach costs 346 to 823 times as much as LLM pairwise preference. Put differently, LLM pairwise ranking costs about 0.12% to 0.29% of the human individual-scoring approach.
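The arithmetic behind those ratios can be reproduced in a few lines of Python. Two things in this sketch are assumptions rather than statements from the article: that the human cost scales as one 54.9 USD unit per proposal (which happens to reproduce the quoted 346 to 823 range), and the per-million-token prices used to sanity-check the per-pair figure.

```python
HUMAN_COST_PER_PROPOSAL = 54.9   # USD, figure quoted in the article
LLM_COST_PER_PAIR = 0.0046       # USD, figure quoted in the article

# ASSUMPTION: list prices in USD per million tokens; not stated in the article.
PRICE_IN_PER_M, PRICE_OUT_PER_M = 0.30, 2.50
INPUT_TOKENS, OUTPUT_TOKENS = 4869, 1255  # per pair, as quoted

def estimated_pair_cost():
    """Token-based estimate of one pairwise comparison, in USD."""
    return (INPUT_TOKENS * PRICE_IN_PER_M + OUTPUT_TOKENS * PRICE_OUT_PER_M) / 1e6

def cost_ratio(n):
    """Human individual-scoring cost divided by LLM all-pairs cost for n proposals."""
    human = n * HUMAN_COST_PER_PROPOSAL          # assumed: one cost unit per proposal
    llm = n * (n - 1) / 2 * LLM_COST_PER_PAIR    # every unordered pair judged once
    return human / llm

for n in (30, 70):
    ratio = cost_ratio(n)
    print(n, round(ratio), f"{100 / ratio:.2f}%")
print(round(estimated_pair_cost(), 4))
```

Because the LLM cost grows quadratically while the human cost grows linearly, the advantage shrinks as the pool grows: the ratio falls from roughly 823 at 30 proposals to roughly 346 at 70.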

This is the paper’s real strategic hinge. Pairwise preference is not new. Bradley-Terry is not new. Committees have always known that direct comparison can be valuable. The blocker was workload.

LLMs change the economics of the protocol.

| Review design | Human feasibility | Machine feasibility | Main weakness |
| --- | --- | --- | --- |
| Individual scoring | High | High | Weak cross-proposal consistency |
| Human pairwise comparison | Low for large pools | Not applicable | Quadratic workload |
| LLM pairwise comparison | Not human-limited | High for normal proposal pools | Needs validation, monitoring, and escalation |
| Sparse or active pairwise comparison | Medium if carefully designed | High | Requires sampling strategy |

The cost analysis should not be read as “AI is cheap, therefore use it.” Cheap mistakes can still be expensive. The better reading is: AI makes richer comparison structures economically possible, so organizations can spend human attention on exceptions, policy judgment, and final accountability.

That is a different design philosophy.

In a grant agency, accelerator program, procurement process, venture-screening funnel, or internal innovation committee, the expensive part is not only reading. It is maintaining fairness and consistency across a crowded decision space. Pairwise LLM ranking can create a first-pass map of that space. Humans can then inspect suspicious borders.

Similarity analysis may be the quiet killer feature

The paper’s final extension uses embeddings for proposal similarity analysis. Each proposal is converted into a high-dimensional vector using Qwen3-embedding-8b, and proposal-pair similarity is computed through cosine similarity.

This is an exploratory extension, not the main ranking validation. But it may be the most immediately useful feature in real institutions.

Why? Because committees often struggle not only with “which proposal is better?” but with “are these two proposals basically the same thing?” That question appears in several forms:

- a revised resubmission across cycles;
- two proposals from different principal investigators on nearly the same topic;
- overlapping methods or data requests;
- duplicated operational burden for the facility;
- possible strategic fragmentation of one research program into multiple applications.

The paper shows heatmaps for EQ-SANS run cycles 25A and 25B. In one inter-cycle case, the highest-similarity point corresponds to a revised resubmission. In an intra-cycle case, the highest-similarity point corresponds to two proposals on the same topic from different principal investigators.

That is not merely a nice dashboard. It changes committee attention.

A similarity system can flag proposal pairs before the final meeting. The LLM can then summarize similarities and differences. Human experts can verify whether overlap is benign, duplicative, or strategically important. The committee does not need to remember every two-page proposal across cycles like a sleep-deprived archive clerk. A machine can do the remembering. Humans can do the judging.
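The flagging step can be sketched as follows, assuming the embedding vectors have already been produced by some embedding model. The three-dimensional toy vectors, the proposal IDs, and the function names are hypothetical; real embeddings would have far more dimensions.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_pairs(embeddings, top_k=3):
    """Return the top_k (similarity, id_a, id_b) tuples, most similar first."""
    scored = [
        (cosine(embeddings[a], embeddings[b]), a, b)
        for a, b in combinations(sorted(embeddings), 2)
    ]
    scored.sort(reverse=True)
    return scored[:top_k]

# Toy embeddings: P1 and P2 point in nearly the same direction.
toy_embeddings = {
    "P1": [1.0, 0.0, 0.0],
    "P2": [0.9, 0.1, 0.0],
    "P3": [0.0, 1.0, 0.0],
}
flagged = most_similar_pairs(toy_embeddings, top_k=1)
print(flagged)
```

The output is only a shortlist of pairs worth a second look. Whether a flagged pair is a resubmission, a benign overlap, or something else remains a human call.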

For many organizations, this similarity layer may be easier to deploy than automated ranking because it is less politically explosive. Flagging overlap is decision support. Declaring winners is power. Institutions tend to notice the difference.

What the paper directly shows, and what business users should infer

The paper directly shows that, on historical proposal data from three SNS beamlines, an LLM-enabled pairwise preference workflow produces rankings positively correlated with human rankings, shows no statistically significant disadvantage on the publication metric used, has much lower estimated cost than human individual scoring for typical proposal-pool sizes, and can support embedding-based similarity analysis.

Cognaptus would infer a broader design pattern, but with boundaries.

| Paper result | Direct support | Business interpretation | Boundary |
| --- | --- | --- | --- |
| Pairwise LLM rankings correlate with human rankings | Main evidence from historical run cycles | LLMs can provide a second ranking lens for committees | Human ranking is the comparison baseline, not ground truth |
| Publication metric shows no significant disadvantage | Main evidence using accepted proposals and publication records | AI ranking does not obviously fail on a downstream productivity proxy | Rejected proposals lack observed facility-output outcomes |
| LLM pairwise comparison is much cheaper | Cost analysis | Richer comparison protocols become economically feasible | Assumptions depend on model pricing, token use, review-time estimates, and governance overhead |
| Embeddings flag similar proposals | Exploratory extension | Similarity screening can reduce duplicate, overlapping, or resubmission blind spots | Thresholds and final interpretation require human verification |
| Outlier disagreement can be inspected | Robustness/sensitivity-style use of rank gaps and excluded outliers | Disagreement becomes an audit queue | Requires a process for escalation, not just a plot |

The most important business lesson is not “automate review.” It is “separate the review system into routine comparison, statistical aggregation, exception discovery, and final accountable judgment.”

That architecture applies beyond neutron-scattering facilities. It maps naturally onto:

- grant and fellowship selection;
- university admissions shortlisting;
- startup accelerator screening;
- vendor proposal evaluation;
- internal R&D project selection;
- procurement bid comparison;
- enterprise AI-use-case prioritization.

In each case, the organization faces many written applications, scarce capacity, inconsistent evaluators, and a need to explain why some items were selected. Pairwise LLM ranking can provide an additional layer of structured comparison. It should not be the sole authority, especially when decisions affect careers, funding, public resources, or regulated procurement.

The governance problem does not disappear; it becomes more explicit

If an institution deployed this workflow, the hard questions would not be “Can the model read?” They would be more specific.

What criteria should the model apply? Should feasibility be excluded because instrument scientists already screened it, as in the paper’s prompt? Should diversity of research portfolio matter? Should early-career investigators be handled differently? Should the model see investigator identities, or should those be masked? What happens when the model strongly disagrees with expert reviewers? Who owns the final explanation?

These are not annoying edge cases. They are the review system.

The paper’s design makes some of them easier to manage because pairwise outputs, Bradley-Terry rankings, rank gaps, and similarity matrices are inspectable artifacts. But governance must still be designed around them.

A serious deployment would need at least five controls:

| Control | Purpose |
| --- | --- |
| OCR and document-quality checks | Prevent corrupted proposal text from producing bad judgments |
| Prompt and rubric versioning | Ensure the ranking criteria remain stable across cycles |
| Human-AI disagreement review | Convert rank gaps into structured committee discussion |
| Bias and identity audits | Test whether model preferences shift with names, institutions, geography, or writing style |
| Appeal and documentation process | Preserve accountability for high-stakes decisions |

The model can make pairwise comparison cheap. It cannot decide what fairness means for the institution. Sorry. We still need adults in the room.

Where the evidence stops

The paper is useful because it is concrete, but the boundaries are also concrete.

First, the study uses three representative beamlines at one large user facility. The proposal format, scientific domain, historical database quality, and review culture may differ in other settings.

Second, the proposal text is private and cannot be shared. That is understandable, but it limits independent replication. The scripts are available, but the most important data are not.

Third, publication output is an imperfect metric. Some excellent facility experiments may produce delayed publications, non-publication impact, negative results, industrial value, or foundational data that do not show up cleanly in publication counts. Conversely, publication count is not the same as scientific quality.

Fourth, the ranking comparison uses historical human decisions as the practical benchmark. That is reasonable, but it means the system is evaluated against the existing process, not against a fully objective truth.

Fifth, the similarity analysis is promising but threshold-dependent. High cosine similarity can flag overlap, but it does not by itself prove duplication, misconduct, or lack of novelty. It points to a question. It does not answer it.

These limits do not weaken the paper’s core contribution. They clarify how it should be used: as a decision-support architecture for committees, not a vending machine for merit.

The better reviewer is probably a system, not a person or a model

The old review process depends on scarce expert attention and imperfect score calibration. The naive AI alternative replaces expert scores with machine scores and calls it transformation. That is not transformation. That is spreadsheet cosplay with an API key.

The more interesting future is a review system where machines do what machines are now economically suited to do: compare at scale, aggregate consistently, detect outliers, remember similar cases, and prepare structured evidence for humans.

Humans then do what humans should still do: set criteria, interpret context, resolve contested cases, weigh institutional priorities, and own the decision.

This paper is valuable because it shifts the question from “Can LLMs be reviewers?” to “What review protocol becomes possible when pairwise judgment is cheap?”

That is the question business leaders should take seriously. Not because AI judgment is magically fair. It is not. But because many current review systems are already inconsistent, expensive, and difficult to audit. Pairwise LLM workflows offer a way to make the machinery more visible.

And in committee work, visibility is already progress. A small miracle, really, considering how often committees behave like fog with minutes.

Cognaptus: Automate the Present, Incubate the Future.

¹ Lijie Ding, Janell Thomson, Jon Taylor, and Changwoo Do, “LLMs Can Assist with Proposal Selection at Large User Facilities,” arXiv:2512.10895, 2025, https://arxiv.org/abs/2512.10895.
