Agents Without Prompts: When LLMs Finally Learn to Check Their Own Homework
Opening — Why this matters now
Reproducing machine learning research has become the academic equivalent of assembling IKEA furniture without the manual: possible, but unnecessarily traumatic. With papers ballooning in complexity and code availability hovering around a charitable 20%, the industry is grasping for automation. If LLMs can write papers, reason over them, and generate code — surely they can also reproduce experiments without melting down.
Until now, that hope has run into a wall: automated reproduction systems were clever but fragile. They followed instructions, but they didn’t check themselves, and when they attempted self-correction, they required stacks of manually crafted prompts. Fine for demos; unusable at scale.
The paper at hand proposes a simple but surprisingly effective fix: prompt-free collaborative agents that validate and refine outputs using only the system prompts already present in the workflow. No handcrafted heuristics. No human-authored critique templates. Just the model using the instructions it’s already been given.
Astonishingly, this removes the most expensive bottleneck in automated paper-to-code pipelines: human prompt engineering.
Background — The brittle workflows of paper reproduction
Automated paper reproduction frameworks such as Paper2Code operate through a staged workflow (sketched in code below):
- Planning: Extract structure, methodology, and design.
- Architecture: Define the code file tree and dependencies.
- Logic: Detail the internal behaviors of each component.
- Coding: Produce a complete implementation.
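To make the stages concrete, here is a minimal Python sketch of that flow. The `call_llm` helper, the `reproduce_paper` function, and the `prompts` dictionary of per-stage system prompts are illustrative assumptions of this post, not Paper2Code's actual API; the structural point is simply that every stage is driven by its own system prompt and consumes the previous stage's output.

```python
def call_llm(system_prompt: str, user_input: str) -> str:
    """Placeholder for the underlying LLM call used by every stage."""
    raise NotImplementedError


def reproduce_paper(paper_text: str, prompts: dict[str, str]) -> str:
    """Run the four stages in order; each stage's output feeds the next one."""
    plan = call_llm(prompts["planning"], paper_text)        # structure, methodology, design
    architecture = call_llm(prompts["architecture"], plan)  # code file tree and dependencies
    logic = call_llm(prompts["logic"], architecture)        # per-component behavior
    return call_llm(prompts["coding"], logic)               # complete implementation
```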
The problem is not the workflow — it is what happens between the steps. Errors introduced early propagate downward. Refinement systems like Self-Refine attempt to critique and correct, but demand eight separate hand-written prompts just to review the four planning components.
Worse, those prompts don’t generalize. As the paper’s evaluations show (Tables I and II, page 4), Self-Refine performs well on the dataset it was tuned for (Paper2CodeBench) and falls apart elsewhere (PaperBench Code-Dev), degrading performance by as much as 39.6% on some tasks.
The culprit is obvious: handcrafted rubrics don’t travel well.
Analysis — The elegant hack: agents that reuse the system prompt
The authors propose a neat inversion: instead of writing new prompts to verify and refine outputs, use the existing system prompt as the reference standard.
Two agents orchestrate each step:
- Verification Agent — Checks if the output satisfies the system prompt. Produces a structured JSON report with missing elements and concrete action items.
- Refinement Agent — Fixes the output using three ingredients: the system prompt, the original output, and the verification report.
No extra prompt engineering. No manual supervision. The system prompts — previously one-way instructions — become the ground truth schema for quality control.
This design yields a crucial advantage: perfect objective alignment between generation, verification, and refinement.
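Here is a minimal sketch of that loop. The `call_llm` stub, the fixed agent role strings, and the JSON keys (`satisfied`, `missing`, `action_items`) are assumptions of this post, not quoted from the paper; what the sketch is meant to show is that the only per-step quality standard either agent consults is the system prompt that produced the output in the first place.

```python
import json


def call_llm(system_prompt: str, user_input: str) -> str:
    """Placeholder for the underlying LLM call (same stub as the earlier sketch)."""
    raise NotImplementedError


# Fixed, generic role instructions for the two agents. They never change per
# task or dataset; the per-step system prompt is the only standard that varies.
VERIFIER_ROLE = (
    "Check whether the OUTPUT satisfies the SYSTEM PROMPT. Respond as JSON with "
    'keys "satisfied" (bool), "missing" (list of str), and "action_items" (list of str).'
)
REFINER_ROLE = (
    "Rewrite the OUTPUT so it fully satisfies the SYSTEM PROMPT, addressing "
    "every item in the VERIFICATION REPORT."
)


def verify(system_prompt: str, output: str) -> dict:
    """Verification agent: grade the output against its own system prompt."""
    raw = call_llm(VERIFIER_ROLE, f"SYSTEM PROMPT:\n{system_prompt}\n\nOUTPUT:\n{output}")
    return json.loads(raw)


def refine(system_prompt: str, output: str, report: dict) -> str:
    """Refinement agent: repair the output using the prompt, the output, and the report."""
    return call_llm(
        REFINER_ROLE,
        f"SYSTEM PROMPT:\n{system_prompt}\n\nOUTPUT:\n{output}\n\n"
        f"VERIFICATION REPORT:\n{json.dumps(report, indent=2)}",
    )


def verify_and_refine(system_prompt: str, output: str, max_rounds: int = 2) -> str:
    """Loop until the verifier reports satisfaction or the round budget runs out."""
    for _ in range(max_rounds):
        report = verify(system_prompt, output)
        if report.get("satisfied", False):
            break
        output = refine(system_prompt, output, report)
    return output
```

Capping the number of rounds is one plausible way to keep the loop cheap, which is consistent with the paper's claim of reaching comparable improvements with far fewer refinement iterations.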
A simple schema, powerful implications
Because each step uses only the system prompt as its standard (see the usage sketch after this list), the agents:
- Avoid overfitting to handcrafted rubrics.
- Maintain consistency across planning → architecture → logic → code.
- Scale across datasets without modification.
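As a usage sketch, building on the hypothetical `call_llm` and `verify_and_refine` helpers defined above (again, illustrative names rather than the paper's code), the same two agents slot between every stage with no new prompt engineering:

```python
def reproduce_with_checks(paper_text: str, prompts: dict[str, str]) -> str:
    """Run the staged pipeline, checking each draft against its own system prompt."""
    artifact = paper_text
    for stage in ("planning", "architecture", "logic", "coding"):
        draft = call_llm(prompts[stage], artifact)           # generate against the stage's system prompt
        artifact = verify_and_refine(prompts[stage], draft)  # verify and repair against that same prompt
    return artifact                                          # final implementation from the coding stage
```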
And in practice? Performance jumps.
Findings — Results with visualization
Across PaperBench Code-Dev and Paper2CodeBench, the improvements are large and consistent.
Performance Summary
| Method | Benchmark | Avg. Score | Improvement |
|---|---|---|---|
| Paper2Code baseline | PaperBench | 0.682 | — |
| Auto-plan | PaperBench | 0.723 | +6.01% |
| Auto-code | PaperBench | 0.747 | +9.53% |
| Auto-plan + Auto-code | PaperBench | 0.786 | +15.25% |
Based on Table I (page 4) and Table III (page 4) of the paper.
Win Rate Across Tasks
| Optimization | Win Rate |
|---|---|
| Self-Refine | 50% |
| Auto-plan | 55% |
| Auto-code | 80% |
| Auto-plan + Auto-code | 85% |
Per-task consistency
The chart on page 5 shows a per-task comparison: Self-Refine swings wildly (−39.6% to +32%), while the prompt-free method demonstrates narrower variance and more stable gains.
In other words: The prompt-free agents behave like sober adults; Self-Refine behaves like a crypto meme token.
Implications — Why this matters beyond reproduction
This work slots neatly into a broader trend: agentic LLM workflows need governance, self-verification, and predictable repair mechanisms. Handwritten prompts are fragile, expensive, and domain-specific. By contrast, using system prompts as structured ground truth is:
- Scalable — No per-domain engineering.
- Consistent — Verification aligns exactly with instructions.
- Generalizable — Works across datasets and tasks.
- Cost-efficient — Achieves RePro-level improvements with 5× fewer iterations.
For businesses building automated research assistants, agentic software engineering systems, or compliance-oriented LLM workflows, this approach is attractive because it:
- Reduces operational overhead.
- Increases reproducibility.
- Improves auditability.
- Reduces reliance on bespoke prompt craftsmanship.
The deeper implication: LLMs can use instructions not only to produce output, but also to evaluate their own adherence to those instructions.
That universality makes prompt-free verification an important building block for safe autonomous systems.
Conclusion — Closing the loop
The authors’ framework replaces brittle prompt engineering with a self-contained validation loop grounded in the original system prompt. The result is a more robust, more generalizable, and more efficient pipeline for automated paper reproduction.
The lesson is simple but profound: you can eliminate entire classes of errors by teaching LLMs to check their own homework — as long as you force them to check against the same instructions they were given in the first place.
Cognaptus: Automate the Present, Incubate the Future.