Agents Without Prompts: When LLMs Finally Learn to Check Their Own Homework

Instructions are usually treated as the beginning of an AI workflow.

A user, developer, or system designer writes a prompt. The model produces an output. Then, if the output looks wrong, someone writes another prompt telling the model how to check it, another prompt telling it how to repair it, and eventually a small mountain of prompt glue accumulates around what was supposed to be an automated system.

Very elegant. In the same way that taping a flashlight to a Roomba is “home robotics.”

The paper “Enhancing Automated Paper Reproduction via Prompt-Free Collaborative Agents” proposes a cleaner trick: do not write extra verification and refinement prompts at all. Reuse the original system prompt as the standard for checking and improving each output. [1] The system prompt becomes not only the instruction for generation, but also the rubric for verification and the guide for repair.

That is the real contribution. Not “LLMs can reproduce papers better,” although the paper does report better scores. Not “agents can self-reflect,” which is already a crowded neighborhood. The sharper idea is this: in a multi-step agent workflow, the original task contract may already contain enough structure to become a quality-control contract.

For companies building agentic workflows, this matters because prompt maintenance is one of the least glamorous forms of technical debt. It is invisible when the demo works, expensive when the workflow changes, and embarrassing when the system confidently follows yesterday’s workaround into today’s production incident.

This paper is about paper-to-code reproduction. The broader lesson is about workflow governance.

“Prompt-Free” Does Not Mean Promptless

The first misconception is easy to catch.

“Prompt-free” does not mean the system runs without prompts. The method still uses prompts. It uses the original workflow system prompts from Paper2Code. What it removes is the need for additional human-designed prompts for checking and refinement.

That distinction matters because otherwise the paper sounds more magical than it is. The authors are not claiming that agents spontaneously discover evaluation standards. They are saying the standards are already inside the workflow instructions, and the system can reuse them.

In the paper’s setting, automated paper reproduction follows a staged pipeline. A research paper is converted into planning artifacts, design artifacts, logic specifications, configuration files, and then executable code. Each stage already has a system prompt that describes what the output should contain. Traditional refinement methods then add separate prompts to critique and repair the output.

The authors ask: why create a second instruction layer?

If the planning prompt says the output must include methodology, experimental setup, architecture, dependencies, and configuration details, then the verification agent can check whether those things are present. If the coding prompt says the implementation must be modular, complete, faithful to the paper, and free of TODO placeholders, then the refinement agent can use those same requirements to repair the code.
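
As a toy illustration of that reuse, the sketch below treats the requirement list inside a planning prompt as the checklist. The prompt wording, the function names, and the substring matching are all assumptions made for illustration; the paper's verification agent is an LLM reading the prompt, not a keyword matcher.

```python
# Hypothetical planning prompt; the wording is illustrative, not Paper2Code's.
PLANNING_PROMPT = (
    "Write an overall plan. The plan must include: methodology, "
    "experimental setup, architecture, dependencies, configuration."
)


def requirements_from_prompt(prompt: str) -> list[str]:
    """Pull the comma-separated requirement list that follows 'must include:'."""
    _, _, tail = prompt.partition("must include:")
    items = (item.strip(" .").lower() for item in tail.split(","))
    return [item for item in items if item]


def missing_requirements(prompt: str, output: str) -> list[str]:
    """Return requirements named in the prompt that the output never mentions."""
    text = output.lower()
    return [req for req in requirements_from_prompt(prompt) if req not in text]


draft_plan = """
Methodology: contrastive pre-training followed by fine-tuning.
Experimental setup: three seeds per dataset.
Architecture: encoder, projection head, classifier.
"""

print(missing_requirements(PLANNING_PROMPT, draft_plan))
# ['dependencies', 'configuration']
```

The point of the toy is only that the checklist is derived from the generation prompt rather than written separately.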

The mechanism is simple enough to fit in one table:

| Workflow role | Traditional use of system prompt | Prompt-free collaborative-agent use |
|---|---|---|
| Generation | Tell the model what to produce | Tell the model what to produce |
| Verification | Usually needs a separate critique prompt | Reuse the original system prompt as the checklist |
| Refinement | Usually needs a separate repair prompt | Reuse the original system prompt plus verification report |
| Consistency | Depends on how well prompts align | Built around a shared requirement source |

This is the paper’s main conceptual move. The system prompt stops being a one-way instruction and becomes a reusable contract.

The business translation is equally simple: if your workflow already contains task requirements, do not immediately build another layer of hand-written audit prompts. First ask whether the original requirements can be reused as the audit standard.

Less prompt theater. More contract reuse. Civilization advances in small ways.

The Core Loop: Check Against the Contract, Then Repair Against the Same Contract

The paper introduces two collaborative agents: a verification agent and a refinement agent.

The verification agent receives three inputs: the original paper content, the system prompt for the current workflow step, and the current output. It then produces a structured review report. The report includes missing information and action items. In other words, it turns vague dissatisfaction into a repair list.

The refinement agent then receives the paper content, the original system prompt, the original output, the verification report, and previously refined outputs. It updates the output while preserving correct parts and addressing the detected gaps.

The important part is not that there are two agents. Two-agent systems are cheap to describe and expensive to make useful. The important part is that both agents are grounded in the same source of requirements.
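
A minimal sketch of the two agents, assuming a generic `llm(system_prompt, user_message)` callable. The `LLM` type, the `VerificationReport` class, the short wrapper sentences, and the `stub_llm` stand-in are hypothetical and not taken from the paper; the only detail that mirrors the described method is that both calls pass the same step-level system prompt and that the report carries missing information and action items.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable

# Hypothetical LLM interface: (system_prompt, user_message) -> reply text.
LLM = Callable[[str, str], str]


@dataclass
class VerificationReport:
    missing_information: list[str]
    action_items: list[str]


def verify(llm: LLM, paper: str, system_prompt: str, output: str) -> VerificationReport:
    """Check the output against the step's own system prompt (no extra rubric)."""
    user = (
        "Review OUTPUT against the requirements in the system prompt. "
        'Reply as JSON with keys "missing_information" and "action_items".\n'
        f"PAPER:\n{paper}\n\nOUTPUT:\n{output}"
    )
    data = json.loads(llm(system_prompt, user))
    return VerificationReport(data["missing_information"], data["action_items"])


def refine(llm: LLM, paper: str, system_prompt: str, output: str,
           report: VerificationReport, previous_refined: list[str]) -> str:
    """Repair the output under the same system prompt, guided by the report."""
    user = (
        "Revise OUTPUT so it satisfies the system prompt. "
        "Keep correct parts and address every action item.\n"
        f"PAPER:\n{paper}\n\nOUTPUT:\n{output}\n\n"
        f"REPORT:\n{json.dumps(asdict(report))}\n\n"
        "PREVIOUSLY REFINED:\n" + "\n".join(previous_refined)
    )
    return llm(system_prompt, user)


def stub_llm(system_prompt: str, user: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    if user.startswith("Review"):
        return json.dumps({
            "missing_information": ["learning rate"],
            "action_items": ["State the learning rate."],
        })
    return "Plan with learning rate 1e-3."


step_prompt = "The plan must include methodology and hyperparameters."
draft = "Plan without hyperparameters."
report = verify(stub_llm, "paper text", step_prompt, draft)
fixed = refine(stub_llm, "paper text", step_prompt, draft, report, [])
print(report.action_items, "->", fixed)
```

A real deployment would also need to handle replies that are not valid JSON; that is left out here to keep the shape of the two calls visible.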

A rough flow looks like this:

Paper content
     +
System prompt for current step
     +
Current output
     ↓
Verification agent
     ↓
Structured report:
- missing_information
- action_items
     ↓
Refinement agent
     ↓
Improved output aligned with the same system prompt

The alignment claim is the heart of the mechanism. Generation, verification, and refinement all refer back to the same task requirements. That reduces one common failure mode in agentic systems: the critic judges the output by standards that differ from the generator’s instructions.

Anyone who has built multi-agent workflows has seen this. The generator writes a concise answer because the system prompt requested concision. The critic complains that the answer lacks depth. The refiner then expands it. A later checker complains it is too long. Congratulations, you have created a tiny bureaucracy.

This paper’s method tries to prevent that by keeping the instruction source stable.

Where the Loop Enters Paper2Code

The authors integrate the method into Paper2Code, a framework for generating code repositories from machine learning papers. They apply the verification-refinement loop to two stages: planning and coding.

The planning stage contains four artifacts:

| Planning artifact | What it does in the workflow |
|---|---|
| Overall Plan | Captures methodology and experimental setup |
| Architecture Design | Defines file organization and system structure |
| Logic Design | Decomposes tasks and dependencies |
| Configuration | Specifies hyperparameters and training settings |

The method processes these artifacts sequentially. First it verifies and refines the overall plan. Then the refined overall plan becomes context for architecture design. Then architecture and plan inform logic design. Finally, the configuration is refined using the previous refined artifacts.
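
The sequencing can be sketched as a loop in which each artifact is refined with only the already-refined predecessors as context. The artifact keys, `refine_planning`, and `demo_step` are hypothetical names; `demo_step` stands in for a full verify-then-refine pass and simply records what context it saw.

```python
from typing import Callable

# Order taken from the description above; the identifiers are illustrative.
PLANNING_ORDER = ["overall_plan", "architecture_design", "logic_design", "configuration"]

# (artifact_name, draft_text, refined_context) -> refined_text
RefineFn = Callable[[str, str, dict[str, str]], str]


def refine_planning(drafts: dict[str, str], refine_step: RefineFn) -> dict[str, str]:
    """Refine artifacts in order; each step sees only already-refined predecessors."""
    refined: dict[str, str] = {}
    for name in PLANNING_ORDER:
        # Pass a copy so a step cannot mutate the shared record.
        refined[name] = refine_step(name, drafts[name], dict(refined))
    return refined


def demo_step(name: str, draft: str, context: dict[str, str]) -> str:
    """Stand-in for verify-then-refine: records which artifacts were visible."""
    seen = ", ".join(context) or "nothing"
    return f"{draft} [refined with: {seen}]"


drafts = {name: f"draft {name}" for name in PLANNING_ORDER}
result = refine_planning(drafts, demo_step)
for text in result.values():
    print(text)
# draft overall_plan [refined with: nothing]
# draft architecture_design [refined with: overall_plan]
# draft logic_design [refined with: overall_plan, architecture_design]
# draft configuration [refined with: overall_plan, architecture_design, logic_design]
```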

This sequence is not a minor implementation detail. It is how the method tries to prevent early planning errors from contaminating later outputs. Paper-to-code reproduction is path-dependent. A poor architecture plan can distort the logic design. A vague logic design can produce incomplete code. A wrong configuration can quietly ruin reproducibility while everything still “looks” professional.

The coding stage follows the dependency order specified in the logic design. Each Python file is checked against coding requirements such as completeness, modularity, faithfulness to the paper, consistency with the planning artifacts, and correct use of config.yaml. Refined files are written back so later files can build on improved code.
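
One way to picture the coding stage is as a walk over a file dependency graph, refining each file only after the files it depends on. The sketch below uses Python's standard `graphlib.TopologicalSorter`; the dependency map, the file names, and `demo_refine` are invented for illustration, and the paper only says that the order comes from the logic design, not that it is computed this way.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Hypothetical dependency map: each file maps to the files it builds on.
DEPENDENCIES: dict[str, set[str]] = {
    "utils.py": set(),
    "dataset.py": {"utils.py"},
    "model.py": {"utils.py"},
    "train.py": {"dataset.py", "model.py"},
}

# (path, current_code, refined_upstream_files) -> refined_code
RefineFile = Callable[[str, str, dict[str, str]], str]


def refine_in_dependency_order(
    files: dict[str, str],
    dependencies: dict[str, set[str]],
    refine_file: RefineFile,
) -> tuple[list[str], dict[str, str]]:
    """Refine files so that every file sees the refined versions of its dependencies."""
    order = list(TopologicalSorter(dependencies).static_order())
    refined: dict[str, str] = {}
    for path in order:
        upstream = {dep: refined[dep] for dep in dependencies.get(path, set())}
        # Writing the result back is what lets later files build on improved code.
        refined[path] = refine_file(path, files[path], upstream)
    return order, refined


def demo_refine(path: str, code: str, upstream: dict[str, str]) -> str:
    """Stand-in for the LLM pass: notes which refined files were available."""
    return f"{code}  # refined after {sorted(upstream)}"


files = {path: f"# {path}" for path in DEPENDENCIES}
order, refined = refine_in_dependency_order(files, DEPENDENCIES, demo_refine)
print(order)
print(refined["train.py"])
```

The relative order of `dataset.py` and `model.py` is not fixed, since neither depends on the other; only the constraint that dependencies come first is guaranteed.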

This is where the method becomes more interesting than a generic self-reflection loop. It does not merely say, “Please improve this.” It inserts quality control at points where errors actually propagate.

That is also where the business analogy becomes obvious. In enterprise workflows, errors rarely stay inside the step where they were born. A bad requirement summary becomes a bad design. A bad design becomes bad implementation. Bad implementation becomes a weekend incident. The expensive part is not the first error. The expensive part is the inheritance chain.

What the Evidence Actually Shows

The experiments use two benchmarks: PaperBench Code-Dev and Paper2CodeBench.

PaperBench Code-Dev contains 20 ICML 2024 papers with human-annotated grading rubrics for LLM-based functional correctness verification. Paper2CodeBench contains 90 papers from ICLR, ICML, and NeurIPS 2024, evaluated against official author-released implementations with LLM-based judges on a five-point scale.

The paper compares several variants:

| Test or variant | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Paper2Code baseline | Main baseline | Shows the starting performance of the workflow without added verification-refinement agents | Does not isolate which workflow step causes failures |
| Self-Refine in planning | Comparison with prior refinement style | Tests whether hand-crafted critique/refinement prompts generalize | Does not represent every possible Self-Refine prompt design |
| Auto-plan optimized | Stage-specific ablation | Tests whether prompt-free refinement helps planning artifacts alone | Does not show full code-level correction |
| Auto-code optimized | Stage-specific ablation | Tests whether prompt-free refinement helps generated code alone | Does not fix upstream planning errors by itself |
| Auto-plan and code optimized | Main combined result | Tests the full proposed intervention across planning and coding | Does not guarantee production-grade code |
| RePro comparison | Efficiency and prior-work comparison | Compares improvement per iteration against another paper-to-code refinement method | Uses different underlying model settings, so it is not a perfectly controlled model-to-model comparison |

This separation is useful because not every table in a paper carries the same evidentiary weight. Table I and Table II provide the main performance evidence. The plan-only and code-only rows function as ablations across workflow placement. The Self-Refine comparison tests robustness against a hand-crafted-prompt alternative. The RePro comparison is partly an efficiency comparison and partly a prior-work positioning exercise, but it is less clean because RePro results are cited from the original paper and use a different model.

Now the numbers.

On PaperBench Code-Dev, the Paper2Code baseline using GPT-4.1 scores 0.682. Adding prompt-free optimization to planning alone raises the average to 0.723. Applying it to coding alone raises the average to 0.747. Applying it to both planning and coding gives the best result: 0.786.

| Method | PaperBench Code-Dev average | Win rate vs. baseline | Average improvement |
|---|---|---|---|
| Paper2Code baseline | 0.682 | | |
| Paper2Code + Self-Refine in plan | 0.655 | 50.0% | -3.96% |
| Paper2Code + auto-plan optimized | 0.723 | 55.0% | +6.01% |
| Paper2Code + auto-code optimized | 0.747 | 80.0% | +9.53% |
| Paper2Code + auto-plan and code optimized | 0.786 | 85.0% | +15.25% |
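
The "average improvement" figures are consistent with the relative change of each average score against the 0.682 baseline. The quick check below assumes that reading; it does not establish how the authors actually computed the column.

```python
BASELINE = 0.682

# Average PaperBench Code-Dev scores as reported above.
scores = {
    "Self-Refine in plan": 0.655,
    "auto-plan optimized": 0.723,
    "auto-code optimized": 0.747,
    "auto-plan and code optimized": 0.786,
}


def relative_change_pct(new: float, base: float) -> float:
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100


for name, score in scores.items():
    print(f"{name}: {relative_change_pct(score, BASELINE):+.2f}%")
# Self-Refine in plan: -3.96%
# auto-plan optimized: +6.01%
# auto-code optimized: +9.53%
# auto-plan and code optimized: +15.25%
```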

The most important row is not merely the final 0.786. It is the contrast between planning-only, coding-only, and combined optimization.

Planning helps. Coding helps more. Both together help most.

That supports the authors’ argument that the method is not tied to one artifact type. It can be applied across workflow stages without designing a new refinement prompt for each stage.

On Paper2CodeBench, the prompt-free variants fall in the same order on average:

| Method | ICLR 2024 | ICML 2024 | NeurIPS 2024 | Average |
|---|---|---|---|---|
| Paper2Code baseline | 3.85 | 4.09 | 3.60 | 3.84 |
| Paper2Code + Self-Refine in plan | 4.14 | 4.26 | 3.75 | 4.05 |
| Paper2Code + auto-plan optimized | 4.01 | 4.00 | 3.80 | 3.94 |
| Paper2Code + auto-code optimized | 4.28 | 4.39 | 4.18 | 4.29 |
| Paper2Code + auto-plan and code optimized | 4.35 | 4.43 | 4.23 | 4.34 |

Here Self-Refine improves the baseline from 3.84 to 4.05, while prompt-free plan optimization reaches 3.94. But auto-code optimization reaches 4.29, and combined optimization reaches 4.34.

That nuance matters. The paper’s strongest claim is not that prompt-free planning alone beats Self-Refine everywhere. It does not. The stronger claim is that the prompt-free method is more robust across benchmark distributions and extends more naturally to multiple workflow stages.

The method wins by being boring in a useful way: reuse the same requirements everywhere, rather than inventing custom critique prompts that may travel poorly.

The Self-Refine Comparison Is a Robustness Story, Not Just a Score Story

The Self-Refine comparison is easy to misread.

On Paper2CodeBench, adding Self-Refine lifts the average score from the baseline's 3.84 to 4.05. That is a meaningful gain. On PaperBench Code-Dev, however, adding Self-Refine lowers the score from 0.682 to 0.655, with only a 50% win rate against the baseline. The authors argue that this suggests the hand-crafted Self-Refine prompts were tuned to Paper2CodeBench and did not generalize as well to PaperBench.

That interpretation is plausible, though it should be handled carefully. The result does not prove that Self-Refine as a general method is weak. It shows that this specific Self-Refine implementation, using manually designed planning prompts from the Paper2Code setting, behaves inconsistently across these two benchmark distributions.

The practical lesson is still valuable. Hand-crafted refinement prompts can become hidden overfitting devices. They may look like “expert guidance,” but they can also encode assumptions about one dataset, one output style, or one grading regime.

This is a familiar business problem in a different costume. A team tunes an agent workflow until it works well on internal examples. Then a client changes document format, product category, region, or compliance language. Suddenly the beautiful prompt stack starts making strange choices. Everyone blames the model. Sometimes the model deserves it. Often, the workflow’s hidden assumptions are doing the damage.

The paper’s prompt-free method tries to reduce that risk by using the task’s own system prompt as the checking standard. That does not make it universally robust. It makes the source of evaluation less dependent on manually crafted secondary rubrics.

The Per-Task Results Reveal the Method’s Bias

The paper’s per-task analysis is especially useful because it shows where the method struggles.

According to the authors, Self-Refine shows negative improvements on 10 PaperBench tasks, including severe drops on tasks such as what-will-my-model-forget (-39.6%) and fre (-28.2%). The prompt-free planning method is more stable and outperforms Self-Refine on 16 out of 20 tasks. But it is not flawless. It also has drops on several tasks, including mechanistic-understanding (-26.6%) and fre (-15.8%).

The authors’ explanation is revealing: planning-stage verification tends to prioritize paper completeness and high-level design correctness over fine-grained implementation details such as hyperparameters and model-loading procedures. That helps architecture-level coherence but can hurt tasks where small implementation details determine evaluation success.

This is not a footnote. It tells us what the method is biased toward.

A verifier that checks against a high-level system prompt will naturally favor requirements expressed in that prompt. If the prompt emphasizes completeness, architecture, and methodology, the verifier will likely catch missing conceptual elements. If the scoring rubric depends on obscure implementation details not emphasized in the prompt, the verifier may miss them.

That is why coding-stage optimization matters. The combined method performs best because planning-level correction and code-level correction cover different failure surfaces.

Here the paper quietly teaches a useful design principle:

| Error type | Better intervention point | Why |
|---|---|---|
| Missing methodology or experiment structure | Planning verification | The issue concerns what the repository should contain |
| Inconsistent architecture or dependency ordering | Planning and logic refinement | The issue affects downstream file relationships |
| Incomplete code or TODO placeholders | Coding verification | The issue is visible inside generated implementation files |
| Wrong hyperparameters or model-loading details | Coding and configuration refinement | The issue may be too fine-grained for high-level planning checks |
| Cross-file inconsistency | Sequential refinement with previous outputs | Later files need access to already-refined earlier files |

For enterprise users, the message is direct: do not place verification only where it is convenient. Place it where the error becomes observable.

A contract-review agent should check clauses at the clause level, not only the executive summary. A financial-reporting agent should validate formulas inside the spreadsheet, not only the narrative commentary. A customer-support agent should check policy compliance at the final response level, not only the retrieved knowledge snippet.

The paper is about code reproduction, but the placement logic travels.

The RePro Comparison: Similar Gains, Fewer Iterations, Messier Control

The paper also compares against RePro, a reflective paper-to-code reproduction method that extracts a paper “fingerprint” as supervisory criteria and uses iterative verification-refinement.

The reported comparison is:

| Method | Model | Iterations | Original score | Final score | Average improvement |
|---|---|---|---|---|---|
| RePro | o3-mini-high | 5 | 0.528 | 0.614 | +16.29% |
| Paper2Code + auto-plan and code optimized | GPT-4.1 | 1 | 0.682 | 0.786 | +15.25% |

The authors highlight that their method achieves comparable improvement in one iteration, while RePro uses five. That is meaningful because each iteration involves LLM calls, and LLM calls are not free. Even when token prices fall, latency, orchestration complexity, failure handling, and audit logging still cost something.

But this comparison should not be overplayed. The models differ. RePro uses o3-mini-high in the cited result, while the proposed method uses GPT-4.1. The starting scores also differ. So this is not a clean controlled comparison of algorithms under identical model conditions.

The safer interpretation is: the proposed method appears computationally efficient within the authors’ setup and reaches strong absolute performance, but the RePro comparison is better read as prior-work positioning than as a definitive head-to-head victory.

Still, the operational idea is important. A method that improves outputs with one verification-refinement pass is easier to deploy than one that requires repeated loops. Multi-iteration systems tend to look powerful in papers and annoying in production. They need stopping rules, budget controls, retry logic, observability, and error handling when the loop “improves” an output into a different kind of failure.
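
To make the stopping-rule point concrete, here is a small sketch of a verify-refine loop with a hard pass budget and an early exit when the verifier flags nothing. `bounded_refine`, `demo_verify`, and `demo_refine` are hypothetical names, and the toy verifier just looks for a `TODO` marker; a real loop would call a model at both steps and would need retry and logging logic on top.

```python
from typing import Callable


def bounded_refine(
    output: str,
    verify: Callable[[str], list[str]],
    refine: Callable[[str, list[str]], str],
    max_passes: int = 1,
) -> tuple[str, int]:
    """Run up to max_passes verify-refine rounds; stop early when nothing is flagged."""
    passes = 0
    while passes < max_passes:
        action_items = verify(output)
        if not action_items:  # stopping rule: the verifier has no complaints
            break
        output = refine(output, action_items)
        passes += 1
    return output, passes


def demo_verify(text: str) -> list[str]:
    """Toy verifier: flags any remaining TODO placeholder."""
    return ["remove TODO placeholder"] if "TODO" in text else []


def demo_refine(text: str, action_items: list[str]) -> str:
    """Toy refiner: fixes one placeholder per pass."""
    return text.replace("TODO", "implemented", 1)


print(bounded_refine("TODO; TODO", demo_verify, demo_refine, max_passes=1))
# ('implemented; TODO', 1)
print(bounded_refine("TODO; TODO", demo_verify, demo_refine, max_passes=5))
# ('implemented; implemented', 2)
```

With a budget of one pass the loop behaves like the single-iteration setting discussed above; raising the budget buys more repairs at the cost of more calls.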

The most useful production workflow is often not the cleverest loop. It is the loop that produces enough improvement before the budget officer notices.

The Business Value Is Prompt Maintenance Reduction, Not Just Better Code

The obvious business takeaway is that this could help automated research assistants generate better code from papers. That is true, but narrow.

The more valuable interpretation is that system prompts can become reusable workflow assets. If a company already writes structured instructions for agent tasks, those instructions can serve three functions:

  1. Generation contract — what the model should produce.
  2. Verification rubric — what the checker should inspect.
  3. Refinement guide — what the repair step should improve.

That has consequences for how agentic workflows should be designed.

First, system prompts become more important. A vague prompt produces vague generation, but it also produces vague verification. If the original contract is weak, reusing it as a rubric does not magically create precision. The method rewards well-specified instructions.

Second, prompt governance becomes less fragmented. Instead of maintaining separate generation prompts, critique prompts, and repair prompts, teams can invest in a smaller number of stronger task contracts.

Third, auditability improves. If generation, verification, and refinement all reference the same requirement source, it becomes easier to explain why a correction was made. The verification report can point to missing requirements, and the refinement step can address them explicitly.

Fourth, workflow scaling becomes less painful. When a new task stage is added, the team does not necessarily need to design a new pair of critique-and-repair prompts. The new stage’s system prompt can become the basis for its own checking loop.

This is where the paper connects to Cognaptus-style business automation. The value is not that “AI agents are now autonomous.” Please. Autonomy remains the favorite word of systems that still need six dashboards and a human holding a fire extinguisher.

The value is that existing workflow instructions can become quality-control infrastructure.

A Practical Framework for Enterprise Agent Design

A business team applying this idea should not start by copying the paper’s architecture mechanically. The paper’s domain is paper-to-code reproduction. Enterprise workflows vary. The right question is not “Can we add a verifier agent?” It is “Do our task instructions contain enough structure to verify against?”

A useful implementation checklist would look like this:

| Design question | Good sign | Bad sign |
|---|---|---|
| Does the system prompt define required output components? | The prompt lists concrete fields, files, checks, or criteria | The prompt says “produce a high-quality response” |
| Can missing requirements be detected from the output? | The verifier can identify absent sections or inconsistent references | The requirement depends entirely on subjective taste |
| Does refinement have enough context to repair errors? | It receives the original input, output, verification report, and prior refined artifacts | It only receives “please improve this” |
| Are workflow stages sequentially dependent? | Later artifacts consume earlier outputs | Each step is isolated and low-risk |
| Is manual prompt maintenance becoming costly? | Many task-specific critique prompts exist | The workflow has only one simple task |

This framework also shows where the paper’s method may fail in business settings.

If the original system prompt is badly written, the verifier inherits the weakness. If the prompt omits an important compliance rule, the verification agent may not check for it. If the workflow requires external validation against databases, policies, laws, or live systems, the system prompt alone is insufficient. If the output quality depends on tacit human judgment, the loop may produce tidy reports without meaningful improvement.

The method is not a substitute for ground truth. It is a way to reuse existing task requirements more efficiently.

That distinction saves everyone time.

What the Paper Directly Shows, and What We Can Infer

The paper directly shows that, in the tested paper-to-code setting, prompt-free collaborative verification and refinement improves Paper2Code performance on PaperBench Code-Dev and Paper2CodeBench. The strongest variant applies the method to both planning and coding stages, raising PaperBench Code-Dev from 0.682 to 0.786 and Paper2CodeBench from 3.84 to 4.34. It also shows that the tested Self-Refine planning setup performs inconsistently across the two benchmarks.

Cognaptus can infer a broader workflow-design principle: when an agentic system already has structured task prompts, those prompts may be reusable as verification and refinement criteria. This can reduce the need for secondary hand-crafted prompts, improve alignment among workflow stages, and make quality-control logic easier to maintain.

What remains uncertain is equally important. The evidence is limited to automated paper reproduction. The experiments use GPT-4.1 for the authors’ method. The evaluation involves benchmark scoring and LLM-based judging. The RePro comparison is not perfectly controlled because model settings differ. The method’s performance depends on the quality and completeness of the original system prompts.

So the business conclusion should be neither timid nor inflated.

This is not proof that agents can fully check their own work in arbitrary enterprise settings. It is evidence that, in structured multi-step workflows, the original instructions can be reused more intelligently than many current prompt stacks allow.

That is a smaller claim. It is also much more useful.

The Real Lesson: Treat Prompts as Contracts, Not Decorations

Most prompt engineering still treats prompts as disposable interface text. Write a few instructions, adjust tone, add examples, hope the model behaves.

Agentic workflows need a more serious view. A system prompt is not just a request. It is a contract between stages. If written well, it defines what success looks like. If it defines what success looks like, it can also support verification. If it supports verification, it can guide refinement.

That is the paper’s best idea.

The benchmark gains are nice. The planning-and-coding combination is useful. The Self-Refine comparison is informative. But the durable lesson is architectural: good workflow instructions should not die after generation. They should stay alive as standards for checking and repair.

In human organizations, this would be obvious. A project brief is not only used to start the project. It is used to review whether the project delivered what it promised. The surprise is that many agent systems forgot this very basic management practice and replaced it with custom critique prompts taped onto the side.

Now the agents are being asked to check their own homework.

The trick is making sure they check against the assignment, not against whatever the last prompt engineer happened to write at 2 a.m.

Cognaptus: Automate the Present, Incubate the Future.


  1. Zijie Lin, Qilin Cai, Liang Shen, and Mingjun Xiao, “Enhancing Automated Paper Reproduction via Prompt-Free Collaborative Agents,” arXiv:2512.02812, 2025. https://arxiv.org/abs/2512.02812