Every R&D team has a shelf of papers that are theoretically useful and practically booby-trapped.
The abstract is promising. The method is relevant. The results look transferable. Then reality arrives wearing a conda error message: the repository has three setup paths, two notebooks, one undocumented dependency, and a tutorial that assumes you already know the answer. The paper has been published. The method has not, in any serious operational sense, been delivered.
Paper2Agent is interesting because it attacks that gap directly. It does not merely ask a language model to summarise a paper, explain a figure, or write some hopeful wrapper code while everyone pretends that counts as reproducibility. It proposes a pipeline that turns a methodological paper and its associated codebase into a Model Context Protocol server: a callable, tested interface containing tools, resources, and prompts that an AI agent can use through natural language.[1]
That framing matters. The product is not “ChatGPT for PDFs”, the most durable software category in the history of investor fatigue. The product is closer to an executable contract: here are the paper’s usable methods, here is the environment they run in, here are the datasets and resources they depend on, here are the workflows that should be followed, and here are tests showing the exposed tools reproduce reference behaviour.
A paper stops being a passive artefact. It becomes something closer to a junior research assistant with a lab notebook, a tool belt, and, crucially, a leash.
The core shift is from reading interface to execution interface
The usual AI-paper interaction is linguistic. A user uploads a PDF and asks: “What does this paper say?” That is useful, but shallow. It improves comprehension, not adoption. A method paper still has to travel from prose to environment, from environment to functions, from functions to workflow, and from workflow to new data. This is where the bodies are buried, politely, under supplementary materials.
Paper2Agent shifts the target. Instead of making the paper easier to read, it makes the paper easier to run.
The framework represents each paper as an MCP server. That server exposes three classes of assets:
| MCP component | What it captures | Why it matters operationally | What it does not magically solve |
|---|---|---|---|
| Tools | Executable functions distilled from the paper’s code and tutorials | Turns methodological steps into callable interfaces with clearer inputs and outputs | Bad original code remains bad raw material |
| Resources | Manuscript text, repositories, figures, datasets, supplementary files, and metadata | Gives the agent traceable context and programmatic access to paper-linked assets | Resource access is not equivalent to scientific validity |
| Prompts | Encoded multi-step workflows inferred from the paper and codebase | Helps the agent run methods in the correct order without the user hand-holding every step | A workflow prompt is only as reliable as the tests around it |
That third column is the business story. Most enterprises do not have a “we cannot read enough PDFs” problem. They have a “we cannot reliably convert technical knowledge into reusable operating procedures fast enough” problem. Paper2Agent is valuable because it treats a paper as a knowledge asset that can be packaged, tested, versioned, and reused.
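To make the three asset classes concrete, here is a minimal sketch of a paper server as three registries. It is a stdlib-only illustration, not the MCP SDK and not the authors' code: `PaperServer`, `normalise_counts`, the resource entry, and the prompt text are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class PaperServer:
    """Hypothetical in-memory stand-in for a paper-specific MCP server."""
    name: str
    tools: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    resources: Dict[str, Dict[str, str]] = field(default_factory=dict)
    prompts: Dict[str, str] = field(default_factory=dict)

    def tool(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        """Register an executable function under its own name."""
        self.tools[fn.__name__] = fn
        return fn

    def call(self, tool_name: str, **kwargs: Any) -> Any:
        """Invoke a registered tool; unknown names fail loudly."""
        if tool_name not in self.tools:
            raise KeyError(f"no such tool: {tool_name}")
        return self.tools[tool_name](**kwargs)


server = PaperServer(name="example-paper")


@server.tool
def normalise_counts(counts: list, target_sum: float = 1.0) -> list:
    """Toy tool: rescale counts so they sum to target_sum."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts sum to zero")
    return [c * target_sum / total for c in counts]


# Resources carry traceable context, not executable logic (placeholder entry).
server.resources["manuscript"] = {"kind": "text", "location": "paper.pdf"}

# Prompts encode the order in which tools should be used (placeholder text).
server.prompts["basic_workflow"] = "1. Load counts. 2. Call normalise_counts."

result = server.call("normalise_counts", counts=[2, 2, 4], target_sum=1.0)
print(result)  # [0.25, 0.25, 0.5]
```

The design point is the explicit signature: the agent calls a named function with typed inputs rather than rediscovering the repository on every request.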
The authors implement the system with an orchestrator agent coordinating specialised sub-agents. An environment manager creates a clean, reproducible setup. A tutorial scanner identifies useful examples. A tool extractor turns tutorial logic into parameterised functions. A test-verifier-improver runs the functions against reference behaviour and iterates until they work, or removes tools that fail repeatedly.
This last step is easy to underappreciate. The system does not simply generate wrappers and hope. It uses the original tutorials as behavioural anchors. If a tool cannot reproduce expected numerical or visual outputs, it does not graduate into the MCP server. That is not perfect assurance, but it is a useful design principle: agentic access should be earned by executable evidence, not granted because a model sounded confident in a terminal.
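The gating idea can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the list of successive revisions stands in for the improve-and-retry loop, the numeric tolerance is arbitrary, and every name (`passes_reference`, `gate_tools`, the toy tools) is hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

# A reference case pairs tool inputs with the output the tutorial produced.
ReferenceCase = Tuple[dict, float]


def passes_reference(tool: Callable[..., float],
                     cases: List[ReferenceCase],
                     rel_tol: float = 1e-6) -> bool:
    """True only if the tool reproduces every reference output within tolerance."""
    for kwargs, expected in cases:
        try:
            actual = tool(**kwargs)
        except Exception:
            return False  # a crashing tool counts as a failed tool
        if not math.isclose(actual, expected, rel_tol=rel_tol):
            return False
    return True


def gate_tools(candidates: Dict[str, List[Callable[..., float]]],
               references: Dict[str, List[ReferenceCase]],
               max_attempts: int = 3) -> Dict[str, Callable[..., float]]:
    """Keep the first passing revision of each tool; drop tools that never pass.

    candidates[name] lists successive revisions of one tool (an assumed
    representation of the iterate-until-it-works loop).
    """
    accepted = {}
    for name, revisions in candidates.items():
        for revision in revisions[:max_attempts]:
            if passes_reference(revision, references[name]):
                accepted[name] = revision
                break
    return accepted


# Toy example: the first draft of `mean` has a bug, the second is correct.
def mean_v1(values):
    return sum(values) / (len(values) + 1)


def mean_v2(values):
    return sum(values) / len(values)


def broken(values):
    raise RuntimeError("missing dependency")


candidates = {"mean": [mean_v1, mean_v2], "broken": [broken, broken, broken]}
references = {
    "mean": [({"values": [1.0, 2.0, 3.0]}, 2.0)],
    "broken": [({"values": [1.0]}, 1.0)],
}

accepted = gate_tools(candidates, references)
print(sorted(accepted))  # ['mean']
```

Only `mean` graduates. The tool that never reproduces its reference output is simply absent from the server, which is the behaviour the paragraph above describes.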
Paper2Agent is really a reproducibility pipeline with a conversational front end
The paper describes Paper2Agent as a multi-agent workflow, but the practical sequence is more prosaic and more important:
- Locate the official codebase and associated resources.
- Build a clean software environment.
- Discover tutorials and examples.
- Execute and audit those tutorials.
- Extract reusable tools from tutorial logic.
- Assemble the MCP server with tools, resources, prompts, versioning, and basic security defaults.
This is not glamorous. It is also where most method adoption fails. Anyone who has tried to reuse a computational biology paper, a quantitative finance model, or an internal machine-learning pipeline knows the pattern: the conceptual method is clear long before the runnable method is.
Paper2Agent’s mechanism-first contribution is that it turns this adoption process into an automated pipeline. The agent is downstream of that pipeline. It is not the whole point.
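The six steps listed above can be sketched as an ordered pipeline that stops at the first failure and records where it stopped. This is a toy model only: each stage body is a placeholder, and the stage names, the `State` dictionary, and `run_pipeline` are assumptions made for illustration rather than anything taken from the Paper2Agent codebase.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Stage = Tuple[str, Callable[[State], State]]


def run_pipeline(stages: List[Stage], state: State) -> State:
    """Run stages in order; stop at the first failure and record where it happened."""
    state = dict(state, completed=[], failed_at=None)
    for name, stage in stages:
        try:
            state = stage(state)
        except Exception as exc:
            state["failed_at"] = name
            state["error"] = str(exc)
            break
        state["completed"] = state["completed"] + [name]
    return state


# Placeholder stages: each returns a copy of the state with one new key.
def locate_codebase(state):
    return dict(state, repo="<repository located>")


def build_environment(state):
    return dict(state, env="<clean environment>")


def discover_tutorials(state):
    return dict(state, tutorials=["tutorial_1", "tutorial_2"])


def execute_tutorials(state):
    if not state["tutorials"]:
        raise RuntimeError("no runnable tutorials found")
    return dict(state, audited=list(state["tutorials"]))


def extract_tools(state):
    return dict(state, tools=[f"tool_from_{t}" for t in state["audited"]])


def assemble_server(state):
    server = {"tools": state["tools"], "resources": [], "prompts": []}
    return dict(state, server=server)


STAGES = [
    ("locate_codebase", locate_codebase),
    ("build_environment", build_environment),
    ("discover_tutorials", discover_tutorials),
    ("execute_tutorials", execute_tutorials),
    ("extract_tools", extract_tools),
    ("assemble_server", assemble_server),
]

final = run_pipeline(STAGES, {"paper": "example"})
print(final["completed"])
print(final["failed_at"])  # None
```

Recording `failed_at` matters more than the happy path: a repository with no runnable tutorials should fail at a named stage, not produce a server with nothing trustworthy inside it.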
That distinction prevents a common misreading. Paper2Agent is not claiming that a general-purpose AI agent can read any paper and become a flawless principal investigator by lunchtime. The framework works best for methodological papers with public, usable codebases. If the repository is incomplete, undocumented, brittle, or simply wrong, the system cannot alchemise it into reliability. It can expose the failure more quickly. In research software, that is already a public service.
The AlphaGenome benchmark tests whether the wrapper beats raw repo access
The strongest quantitative evidence comes from the AlphaGenome case study. AlphaGenome is a genomics foundation model for predicting the impact of DNA variants across regulatory modalities. It is powerful, but its practical use involves environment setup, API keys, object hierarchies, modality choices, tissue and cell-type identifiers, and non-trivial interpretation. In other words: a beautiful method, surrounded by enough plumbing to keep casual users outside the building.
Paper2Agent generated 22 AlphaGenome MCP tools in around three hours on a personal laptop. Those tools covered variant scoring, sequence-level prediction, tissue ontology exploration, batch analysis, and visualisation.
The authors then evaluated the resulting AlphaGenome agent on two manually curated benchmark sets:
| Test | Likely purpose | Result | What it supports | What it does not prove |
|---|---|---|---|---|
| 15 tutorial-based queries | Main reproducibility evidence | AlphaGenome agent: 15/15; Claude + Repo: 9/15; Biomni: 6/15 | The MCP tools can reproduce known example-style usage more reliably than giving an agent repo access | It does not prove general scientific correctness beyond the tested tasks |
| 15 novel queries | Robustness/generalisation check | AlphaGenome agent: 15/15; Claude + Repo: 12/15; Biomni: 9/15 | The agent can handle related but unseen variants, substitutions, and tissue contexts | It is still a small, expert-curated benchmark |
| Runtime comparison | Efficiency comparison | Median speedups of 1.8x and 3.1x over Claude + Repo and Biomni on tutorial queries; 3.2x and 4.6x on novel queries | Pre-built tools reduce the overhead of repeated code discovery and execution | Runtime gains may vary by task, implementation, infrastructure, and model stack |
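For readers who prefer percentages, the counts in the table above convert as follows. The snippet only restates the reported counts out of 15 queries; the dictionary layout and variable names are illustrative.

```python
# Correct-answer counts as reported in the table above (15 queries per benchmark).
TOTAL_QUERIES = 15
correct = {
    "tutorial": {"AlphaGenome agent": 15, "Claude + Repo": 9, "Biomni": 6},
    "novel": {"AlphaGenome agent": 15, "Claude + Repo": 12, "Biomni": 9},
}

# Accuracy as a percentage of the 15 queries in each benchmark.
accuracy = {
    benchmark: {system: 100 * n / TOTAL_QUERIES for system, n in counts.items()}
    for benchmark, counts in correct.items()
}

for benchmark, scores in accuracy.items():
    for system, pct in scores.items():
        print(f"{benchmark:8s} {system:18s} {pct:5.1f}%")
```

With only 15 queries per set, each single answer moves a score by roughly 6.7 percentage points, which is one reason to read these gaps as indicative rather than precise.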
The comparison is well chosen. A sceptic might say: why build a paper-specific MCP at all? Why not give Claude Code the repository and let it figure things out?
The answer, at least in this case, is that raw repo access is not the same as a tested interface. Claude + Repo has more freedom, but also more room to wander. It must inspect files, infer API usage, write code, execute code, and extract the answer every time. The Paper2Agent output narrows the action space. The agent calls tools that have already been extracted, parameterised, and tested.
That is the quiet engineering lesson. Sometimes the smartest agent is the one with fewer choices.
The AlphaGenome case also includes a more scientifically interesting demonstration. The agent is asked to interpret a known LDL cholesterol-associated variant, chr1:109274968:G>T. It prioritises SORT1 as a likely causal gene, while the original AlphaGenome paper emphasised CELSR2 and PSRC1. The authors then manually check GTEx liver eQTL evidence and find the variant significantly associated with SORT1 expression, while also noting that CELSR2 and PSRC1 have strong AlphaGenome scores and significant eQTL associations too.
The point is not that the agent “solved” the locus. Complex GWAS loci rarely hand over causal genes like receipts. The point is that an agentified method can be used to revisit a published interpretation in a traceable way. That is a second-order use case: not just running the method, but using the method to interrogate the paper’s own claims.
There is a business analogue here. In enterprise settings, a reusable method interface does not merely reduce onboarding time. It lets teams re-run, stress-test, and reinterpret prior analytical conclusions when new data, new assumptions, or new regulatory constraints appear. The PDF cannot do that. It just sits there, full of promise and absolutely no API.
TISSUE and Scanpy show that workflows matter as much as tools
The TISSUE and Scanpy cases are less numerically dramatic than AlphaGenome, but they are important because they test a different part of the mechanism.
TISSUE is a method for uncertainty-aware single-cell spatial transcriptomics. Paper2Agent generated six MCP tools covering spatial gene expression prediction, prediction interval construction, and downstream uncertainty-aware analysis. The TISSUE agent could answer practical input/output questions, run a prediction interval workflow on mouse somatosensory cortex data, and reproduce outputs generated by human researchers following the original tutorial.
The likely purpose here is not competitive benchmarking against other agents. It is workflow reproducibility. The authors are showing that a paper agent can execute an end-to-end analysis, not just expose a few isolated function calls.
The resource layer also becomes visible. Paper2Agent turns the data availability section into a structured registry, including metadata such as species, tissue type, modality, and data URL. That lets a user ask for the relevant mouse spatial transcriptomics data and have the agent filter, fetch, and apply the method. This is not glamorous either. It is exactly the sort of unglamorous glue work that consumes research teams and quietly destroys project velocity.
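A structured registry of this kind might look like the sketch below. The metadata fields (species, tissue, modality, URL) come from the description above; the entries, dataset names, `example.org` URLs, and the `find_datasets` helper are placeholders invented for illustration and do not correspond to the paper's actual data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    species: str
    tissue: str
    modality: str
    url: str


# Hypothetical entries: names and URLs are placeholders, not the paper's data.
REGISTRY = [
    DatasetEntry("dataset_a", "mouse", "somatosensory cortex",
                 "spatial transcriptomics", "https://example.org/dataset_a"),
    DatasetEntry("dataset_b", "human", "liver",
                 "scRNA-seq", "https://example.org/dataset_b"),
    DatasetEntry("dataset_c", "mouse", "hippocampus",
                 "scRNA-seq", "https://example.org/dataset_c"),
]


def find_datasets(registry: List[DatasetEntry],
                  species: Optional[str] = None,
                  tissue: Optional[str] = None,
                  modality: Optional[str] = None) -> List[DatasetEntry]:
    """Return entries matching every filter that was supplied."""
    wanted = {"species": species, "tissue": tissue, "modality": modality}
    return [
        entry for entry in registry
        if all(value is None or getattr(entry, key) == value
               for key, value in wanted.items())
    ]


matches = find_datasets(REGISTRY, species="mouse",
                        modality="spatial transcriptomics")
print([m.name for m in matches])  # ['dataset_a']
```

Once the availability section is data rather than prose, "the relevant mouse spatial transcriptomics data" becomes a filter instead of an afternoon of reading supplementary notes.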
Scanpy tests another variant: not an entire new method paper, but a focused workflow inside a widely used software ecosystem. Paper2Agent generated seven tools for preprocessing and clustering single-cell RNA-seq data, including quality control and normalisation. More importantly, it generated MCP prompts encoding the correct sequence: inspect the data, run quality control, normalise, select highly variable genes, reduce dimensions, build a neighbourhood graph, cluster, and annotate.
That is a prompt, yes. But it is not a random user prompt floating in Slack. It is a workflow object derived from the paper and codebase and connected to tested tools. The distinction matters. Enterprises already have many “standard operating prompts”. Most are folk artefacts: copied, tweaked, forgotten, and eventually contradicted by three other prompts with slightly different defaults. Paper2Agent points toward something more durable: prompts as versioned workflow interfaces, coupled to tools and resources.
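One way to picture a prompt as a versioned workflow interface is the sketch below. The step order follows the sequence described above; the step identifiers, the version string, and the `WorkflowPrompt` class are hypothetical and are not Scanpy function names or Paper2Agent output.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WorkflowPrompt:
    name: str
    version: str
    steps: Tuple[str, ...]

    def render(self) -> str:
        """Produce the text an agent would receive."""
        lines = [f"Workflow {self.name} (v{self.version}). Run these steps in order:"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.steps, start=1)]
        return "\n".join(lines)

    def follows_order(self, executed: List[str]) -> bool:
        """True if `executed` is a prefix of the workflow: no skips, no reordering."""
        return tuple(executed) == self.steps[:len(executed)]


# Step identifiers are illustrative labels for the sequence in the text.
clustering = WorkflowPrompt(
    name="preprocess_and_cluster",
    version="0.1.0",
    steps=(
        "inspect_data",
        "quality_control",
        "normalise",
        "select_highly_variable_genes",
        "reduce_dimensions",
        "build_neighbourhood_graph",
        "cluster",
        "annotate",
    ),
)

print(clustering.render())
print(clustering.follows_order(["inspect_data", "quality_control"]))  # True
print(clustering.follows_order(["inspect_data", "normalise"]))        # False
```

Because the object is immutable and carries a version, two teams arguing about defaults are at least arguing about the same artefact.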
The multi-paper case is an exploratory extension, not a victory lap
The paper’s most ambitious demonstration connects an AlphaGenome method MCP with an ADHD GWAS data MCP. The combined AI co-scientist generates hypotheses, designs analyses, and uses AlphaGenome to prioritise variants in ADHD-associated loci.
This is where the article practically begs for breathless language. Resist. That way lies “AI scientist discovers biology”, and the office should have a jar where everyone deposits five dollars when they say that too casually.
What the paper directly shows is narrower and still useful. Paper2Agent converts GWAS-associated datasets and supplementary materials into MCP resources. It then connects those resources with the AlphaGenome MCP so an agent can propose and run analyses across both. In one highlighted analysis, the agent prioritises rs1626703 among 209 candidate variants from fine-mapping credible sets. It predicts that this intronic variant alters MPHOSPH9 splicing and expression in glutamatergic neurons, with reported AlphaGenome quantile scores of 1.000 for splice junction impact and 0.963 for RNA-seq impact. The agent also extends the workflow across 39 loci within two hours, producing a markdown report of top variants, target genes, molecular effects, and biological interpretation.
The purpose of this case is exploratory extension. It is not the same evidentiary category as the AlphaGenome benchmark. The AlphaGenome benchmark asks: can the agent answer known and novel executable queries accurately against ground truth? The ADHD-GWAS case asks: if method and data papers become interoperable agent resources, can an AI system generate plausible, structured, testable hypotheses faster than manual review?
That is a legitimate question. It is not a final answer.
For business readers, the useful analogy is not “replace scientists”. It is “connect method libraries with data rooms”. Many organisations have analytical methods in one silo, datasets in another, and domain interpretation scattered across documents, people, and tacit habits. A Paper2Agent-like architecture suggests a more composable model: methods become tools, datasets become resources, workflows become prompts, and agents become orchestrators across them.
The value is not magic cognition. It is lower coordination cost.
The business lesson is to agentify high-friction methods, not every document
The obvious but wrong conclusion is that every PDF should become an agent. This is how organisations create expensive knowledge gardens full of decorative bots that nobody trusts.
Paper2Agent’s better lesson is more selective: agentify the technical artefacts where reuse is valuable, execution is difficult, and validation is possible.
That points to a practical enterprise filter:
| Candidate asset | Good Paper2Agent-style fit? | Why |
|---|---|---|
| A computational method with public code and tutorials | High | Tools can be extracted, tested, and reused |
| A recurring analytics workflow with stable inputs and outputs | High | Prompts and tools can encode the standard path |
| A dataset paper with structured supplementary files | Medium to high | Resources can be indexed and exposed, though interpretation may be harder to validate |
| A conceptual strategy paper | Low | Useful for summarisation, but weak as an execution interface |
| A messy internal codebase with no examples | Low until cleaned | The agentification attempt will expose reproducibility debt, not remove it |
For enterprises, the immediate opportunity is internal. Do not begin with the global scientific literature. Begin with the methods your own organisation repeatedly fails to reuse: risk models, pricing pipelines, regulatory screening procedures, lab workflows, customer analytics scripts, due diligence templates and notebooks, claims-processing heuristics, procurement scoring models. Anywhere the organisation has “a method” trapped in code, documents, and a few people’s heads, the same pattern applies.
The implementation question is not: can an LLM understand the document?
The sharper question is: can the organisation expose the method as a tested, permissioned, observable interface?
That is where ROI lives. A paper-specific or method-specific agent can reduce time spent on setup, onboarding, repeated explanation, workflow sequencing, and basic reproducibility checks. It can also make method adoption auditable. Instead of asking who copied which notebook and changed which parameter, the organisation can ask which tool version was called, with which inputs, against which resource registry, under which workflow prompt.
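An audit record answering those four questions could be as small as the sketch below. It is one possible shape, not a prescribed format: the field names, the tool name `score_variant`, and the version strings are assumptions. The inputs are hashed with sorted keys so that the same call always yields the same digest.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class CallRecord:
    tool: str
    tool_version: str
    inputs_digest: str
    registry_version: str
    workflow_prompt: str


def digest_inputs(inputs: dict) -> str:
    """SHA-256 of the inputs; sorted keys make the digest independent of key order."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_call(log: List[str], tool: str, tool_version: str, inputs: dict,
                registry_version: str, workflow_prompt: str) -> CallRecord:
    """Append one JSON line per tool call and return the structured record."""
    record = CallRecord(tool, tool_version, digest_inputs(inputs),
                        registry_version, workflow_prompt)
    log.append(json.dumps(asdict(record), sort_keys=True))
    return record


log: List[str] = []

# Hypothetical tool name and versions; the variant identifier is the one in the text.
first = record_call(log, "score_variant", "1.2.0",
                    {"variant": "chr1:109274968:G>T", "tissue": "liver"},
                    "2025-01", "variant_scoring@3")
second = record_call(log, "score_variant", "1.2.0",
                     {"tissue": "liver", "variant": "chr1:109274968:G>T"},
                     "2025-01", "variant_scoring@3")

print(first.inputs_digest == second.inputs_digest)  # True
print(len(log))  # 2
```

Hashing rather than storing the inputs is a design choice for sensitive data; an organisation that can retain raw inputs may prefer to log them directly.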
This is less cinematic than “AI co-scientist”. It is also far more likely to survive procurement.
What the paper shows, what we can infer, and what remains unresolved
A clean business interpretation should separate evidence from extrapolation. Otherwise the article becomes a press release wearing glasses.
| Layer | What the paper directly shows | Cognaptus inference for business use | Remaining uncertainty |
|---|---|---|---|
| Mechanism | Paper2Agent can convert selected method papers and codebases into MCP servers with tools, resources, and prompts | Research and analytics assets can be packaged as reusable execution interfaces | Generality across messier, proprietary, poorly documented assets is not established |
| Reliability | The AlphaGenome agent scored 100% on two 15-query benchmarks and beat Claude + Repo and Biomni in those tests | Pre-tested tools can outperform raw repository access for repeated method use | Benchmark scale is small and expert-curated |
| Workflow capture | TISSUE and Scanpy agents reproduced human-run outputs on selected workflows | Standard workflows can be encoded as agent-callable procedures | “Matches human outputs” depends on the chosen datasets, defaults, and evaluation criteria |
| Cross-paper use | AlphaGenome and ADHD GWAS MCPs were combined to prioritise variants and generate mechanistic hypotheses | Method and data assets can become composable across organisational silos | Hypotheses still require independent expert and empirical validation |
| Governance | Failed tools can be excluded; resources and code references can be traced | Agentification can become a governance and reproducibility standard | Security, permissions, version drift, and auditability need production-grade infrastructure |
That last row deserves attention. MCP is useful because it standardises how agents access external tools and resources. It is also a new integration surface. In enterprise deployment, a paper agent that can run code, fetch data, call APIs, and generate reports must sit behind proper authentication, logging, sandboxing, cost controls, and data-access policies. Otherwise the organisation has not built a research assistant. It has built a charming little exfiltration intern.
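As a rough illustration of two of those controls, the sketch below gates tool calls by role and by a call budget, and logs every decision. It is deliberately simplistic and entirely hypothetical: real deployments would rely on an identity provider, sandboxed execution, and proper cost accounting rather than an in-memory counter, and the roles and tool names are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


class PolicyError(Exception):
    """Raised when a call is not permitted."""


@dataclass
class CallPolicy:
    allowed_tools: Dict[str, Set[str]]  # role -> tool names that role may call
    max_calls: int                      # crude stand-in for a cost control
    calls_made: int = 0
    audit: List[Tuple[str, str, str]] = field(default_factory=list)

    def authorise(self, role: str, tool: str) -> None:
        """Allow the call or raise; every decision is logged either way."""
        allowed = tool in self.allowed_tools.get(role, set())
        within_budget = self.calls_made < self.max_calls
        decision = "allow" if allowed and within_budget else "deny"
        self.audit.append((role, tool, decision))
        if not allowed:
            raise PolicyError(f"role {role!r} may not call {tool!r}")
        if not within_budget:
            raise PolicyError("call budget exhausted")
        self.calls_made += 1


# Hypothetical roles and tool names.
policy = CallPolicy(
    allowed_tools={"analyst": {"score_variant"},
                   "admin": {"score_variant", "fetch_external_data"}},
    max_calls=2,
)

policy.authorise("analyst", "score_variant")
policy.authorise("analyst", "score_variant")

denied = []
for role, tool in [("analyst", "score_variant"), ("analyst", "fetch_external_data")]:
    try:
        policy.authorise(role, tool)
    except PolicyError as exc:
        denied.append(str(exc))

print(denied)
print(policy.audit)
```

The first denial is the budget running out and the second is a permission failure; both leave a trace in `audit`, which is the part that matters when someone asks what the agent was allowed to do.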
The Paper2Agent paper is primarily a research demonstration, not a full enterprise governance blueprint. That is fine. But buyers should not confuse an impressive Hugging Face-hosted MCP demo with production readiness in regulated environments.
Agent availability may become a stronger reproducibility signal than code availability
One of the paper’s more useful provocations is the idea of an “agent availability” section, analogous to data and code availability. That sounds futuristic until one remembers how low the current bar often is. A code availability statement can mean anything from a polished, versioned package to a repository last updated by someone who has since changed institutions, laptops, and probably emotional priorities.
Agent availability would raise the bar because it asks whether the contribution has been embodied as an interactive, tested interface. Not merely: is there code? But: can the method be invoked, reproduced, and applied by someone who was not in the original lab?
This could become a practical quality signal:
| Availability layer | Question it answers | Why the next layer is stronger |
|---|---|---|
| Paper availability | Can I read the claim? | Reading does not imply reuse |
| Data availability | Can I inspect the evidence? | Data without workflow still leaves adoption friction |
| Code availability | Can I theoretically run the method? | Code may be brittle, undocumented, or environment-bound |
| Agent availability | Can I call the method through a tested interface? | Execution, context, and workflow become packaged together |
The strongest version of this future is not a universe where every paper has a chatbot. It is a universe where methodological contributions are published with machine-usable interfaces, test suites, environment manifests, and resource registries. The agent is the user experience layer. The deeper reform is that research outputs become operational artefacts.
That would change incentives. Authors would structure tutorials more clearly. Repositories would become more modular. Inputs and outputs would be made explicit. Figures would become regression targets. Supplementary tables would become structured resources. The paper would remain the narrative unit, but the agent would become the adoption unit.
Academia may resist this because academia resists everything until it becomes a grant requirement. Enterprise R&D will not wait as long.
The boundary is code quality, not model eloquence
The most important limitation is not that the agent might occasionally say something silly. That is true, but generic. The more precise limitation is that Paper2Agent depends on the quality of the source artefacts. Complete, documented, executable codebases are suitable candidates. Broken, under-specified, or highly tacit workflows are not.
The authors acknowledge this directly: not every paper can be turned seamlessly into a robust agent. If the codebase is incomplete or contains unresolved errors, Paper2Agent cannot reliably expose it as a functioning tool. That is not a minor caveat. It defines the adoption boundary.
The evaluation boundary also matters. AlphaGenome’s benchmark is strong relative to many agent demos because it uses manually curated ground truth and compares against alternatives. But it is still 30 queries in one domain-specific case. TISSUE and Scanpy demonstrate reproducible workflows, but not broad generalisation across all possible datasets or edge cases. The ADHD-GWAS analysis is a compelling example of cross-paper orchestration, but its biological hypotheses remain computational hypotheses.
None of this weakens the paper. It makes the result interpretable. Paper2Agent is not a universal paper-to-scientist compiler. It is a serious step toward converting well-structured technical research into reusable agent infrastructure.
That is enough. Frankly, it is more useful than the universal claim would have been.
From papers as memory to papers as machinery
The deeper shift in Paper2Agent is not conversational access. It is mechanical access.
A PDF stores knowledge as prose. A repository stores knowledge as code. A tutorial stores knowledge as example behaviour. Paper2Agent tries to bind those layers into something an agent can operate: tools for action, resources for context, prompts for procedure, and tests for trust.
That pattern will travel beyond biomedical research. Any organisation with high-friction technical knowledge should pay attention. The near-term opportunity is not replacing experts; it is reducing the distance between expert method and repeatable execution. Experts should spend less time explaining setup rituals and more time judging whether outputs make sense.
In other words, the productive paper agent is not the principal investigator. Not yet, and certainly not unsupervised.
It is the lab assistant who knows where the code is, remembers the tutorial, runs the workflow the same way twice, fetches the right dataset, logs what happened, and does not improvise a dependency because it felt inspired.
In modern research operations, that is already a promotion.
Cognaptus: Automate the Present, Incubate the Future.
-
[1] Jiacheng Miao, Joe R. Davis, Yaohui Zhang, Jonathan K. Pritchard, and James Zou, “Reimagining Research Papers As Interactive and Reliable AI Agents,” arXiv:2509.06917, 2025, https://arxiv.org/abs/2509.06917.