Agents Without Prompts: When LLMs Finally Learn to Check Their Own Homework
Opening — Why this matters now
Reproducing machine learning research has become the academic equivalent of assembling IKEA furniture without the manual: possible, but unnecessarily traumatic. With papers ballooning in complexity and code availability hovering around a charitable 20%, the industry is grasping for automation. If LLMs can write papers, reason over them, and generate code — surely they can also reproduce experiments without melting down.
Until now, that hope has run into a wall: automated reproduction systems were clever but fragile. They followed instructions, but they didn’t check themselves, and when they attempted self-correction, they required stacks of manually crafted prompts. Fine for demos; unusable at scale.
The paper at hand proposes a simple but surprisingly effective fix: prompt-free collaborative agents that validate and refine outputs using only the system prompts already present in the workflow. No handcrafted heuristics. No human-authored critique templates. Just the model using the instructions it’s already been given.
Astonishingly, this removes the most expensive bottleneck in automated paper-to-code pipelines: human prompt engineering.
Background — The brittle workflows of paper reproduction
Automated paper reproduction frameworks such as Paper2Code operate through a staged workflow (sketched in code below):
- Planning: Extract structure, methodology, and design.
- Architecture: Define the code file tree and dependencies.
- Logic: Detail the internal behaviors of each component.
- Coding: Produce a complete implementation.
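To make the stages concrete, here is a minimal Python sketch of that flow. The `call_llm` helper, the `reproduce_paper` function, and the `prompts` dictionary of per-stage system prompts are illustrative assumptions of this post, not Paper2Code's actual API; the structural point is simply that every stage is driven by its own system prompt and consumes the previous stage's output.

```python
def call_llm(system_prompt: str, user_input: str) -> str:
    """Placeholder for the underlying LLM call used by every stage."""
    raise NotImplementedError


def reproduce_paper(paper_text: str, prompts: dict[str, str]) -> str:
    """Run the four stages in order; each stage's output feeds the next one."""
    plan = call_llm(prompts["planning"], paper_text)        # structure, methodology, design
    architecture = call_llm(prompts["architecture"], plan)  # code file tree and dependencies
    logic = call_llm(prompts["logic"], architecture)        # per-component behavior
    return call_llm(prompts["coding"], logic)               # complete implementation
```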
The problem is not the workflow — it is what happens between the steps. Errors introduced early propagate downward. Refinement systems like Self-Refine attempt to critique and correct, but demand eight separate hand-written prompts just to review the four planning components.
Worse, those prompts don’t generalize. As the paper’s evaluations show (Tables I and II, page 4), Self-Refine performs well on the dataset it was tuned for (Paper2CodeBench) and falls apart elsewhere (PaperBench Code-Dev), degrading performance by as much as 39.6% on some tasks.
The culprit is obvious: handcrafted rubrics don’t travel well.
Analysis — The elegant hack: agents that reuse the system prompt
The authors propose a neat inversion: instead of writing new prompts to verify and refine outputs, use the existing system prompt as the reference standard.
Two agents orchestrate each step:
- Verification Agent — Checks if the output satisfies the system prompt. Produces a structured JSON report with missing elements and concrete action items.
- Refinement Agent — Fixes the output using three ingredients: the system prompt, the original output, and the verification report.
No extra prompt engineering. No manual supervision. The system prompts — previously one-way instructions — become the ground truth schema for quality control.
This design yields a crucial advantage: perfect objective alignment between generation, verification, and refinement.
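Here is a minimal sketch of that loop. The `call_llm` stub, the fixed agent role strings, and the JSON keys (`satisfied`, `missing`, `action_items`) are assumptions of this post, not quoted from the paper; what the sketch is meant to show is that the only per-step quality standard either agent consults is the system prompt that produced the output in the first place.

```python
import json


def call_llm(system_prompt: str, user_input: str) -> str:
    """Placeholder for the underlying LLM call (same stub as the earlier sketch)."""
    raise NotImplementedError


# Fixed, generic role instructions for the two agents. They never change per
# task or dataset; the per-step system prompt is the only standard that varies.
VERIFIER_ROLE = (
    "Check whether the OUTPUT satisfies the SYSTEM PROMPT. Respond as JSON with "
    'keys "satisfied" (bool), "missing" (list of str), and "action_items" (list of str).'
)
REFINER_ROLE = (
    "Rewrite the OUTPUT so it fully satisfies the SYSTEM PROMPT, addressing "
    "every item in the VERIFICATION REPORT."
)


def verify(system_prompt: str, output: str) -> dict:
    """Verification agent: grade the output against its own system prompt."""
    raw = call_llm(VERIFIER_ROLE, f"SYSTEM PROMPT:\n{system_prompt}\n\nOUTPUT:\n{output}")
    return json.loads(raw)


def refine(system_prompt: str, output: str, report: dict) -> str:
    """Refinement agent: repair the output using the prompt, the output, and the report."""
    return call_llm(
        REFINER_ROLE,
        f"SYSTEM PROMPT:\n{system_prompt}\n\nOUTPUT:\n{output}\n\n"
        f"VERIFICATION REPORT:\n{json.dumps(report, indent=2)}",
    )


def verify_and_refine(system_prompt: str, output: str, max_rounds: int = 2) -> str:
    """Loop until the verifier reports satisfaction or the round budget runs out."""
    for _ in range(max_rounds):
        report = verify(system_prompt, output)
        if report.get("satisfied", False):
            break
        output = refine(system_prompt, output, report)
    return output
```

Capping the number of rounds is one plausible way to keep the loop cheap, which is consistent with the paper's claim of reaching comparable improvements with far fewer refinement iterations.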
A simple schema, powerful implications
Because each step uses only the system prompt as its standard (see the usage sketch after this list), the agents:
- Avoid overfitting to handcrafted rubrics.
- Maintain consistency across planning → architecture → logic → code.
- Scale across datasets without modification.
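As a usage sketch, building on the hypothetical `call_llm` and `verify_and_refine` helpers defined above (again, illustrative names rather than the paper's code), the same two agents slot between every stage with no new prompt engineering:

```python
def reproduce_with_checks(paper_text: str, prompts: dict[str, str]) -> str:
    """Run the staged pipeline, checking each draft against its own system prompt."""
    artifact = paper_text
    for stage in ("planning", "architecture", "logic", "coding"):
        draft = call_llm(prompts[stage], artifact)           # generate against the stage's system prompt
        artifact = verify_and_refine(prompts[stage], draft)  # verify and repair against that same prompt
    return artifact                                          # final implementation from the coding stage
```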
And in practice? Performance jumps.
Findings — Results with visualization
Across PaperBench Code-Dev and Paper2CodeBench, the improvements are large and consistent.
Performance Summary
| Method | Benchmark | Avg. Score | Improvement |
|---|---|---|---|
| Paper2Code baseline | PaperBench | 0.682 | — |
| Auto-plan | PaperBench | 0.723 | +6.01% |
| Auto-code | PaperBench | 0.747 | +9.53% |
| Auto-plan + Auto-code | PaperBench | 0.786 | +15.25% |
Based on Table I (page 4) and Table III (page 4) of the paper.
Win Rate Across Tasks
| Optimization | Win Rate |
|---|---|
| Self-Refine | 50% |
| Auto-plan | 55% |
| Auto-code | 80% |
| Auto-plan + Auto-code | 85% |
Per-task consistency
The chart on page 5 shows a per-task comparison: Self-Refine swings wildly (−39.6% to +32%), while the prompt-free method demonstrates narrower variance and more stable gains.
In other words: The prompt-free agents behave like sober adults; Self-Refine behaves like a crypto meme token.
Implications — Why this matters beyond reproduction
This work slots neatly into a broader trend: agentic LLM workflows need governance, self-verification, and predictable repair mechanisms. Handwritten prompts are fragile, expensive, and domain-specific. By contrast, using system prompts as structured ground truth is:
- Scalable — No per-domain engineering.
- Consistent — Verification aligns exactly with instructions.
- Generalizable — Works across datasets and tasks.
- Cost-efficient — Achieves RePro-level improvements with 5× fewer iterations.
For businesses building automated research assistants, agentic software engineering systems, or compliance-oriented LLM workflows, this approach is attractive because it:
- Reduces operational overhead.
- Increases reproducibility.
- Improves auditability.
- Reduces reliance on bespoke prompt craftsmanship.
The deeper implication: LLMs can use instructions not only to produce output, but also to evaluate their own adherence to those instructions.
That universality makes prompt-free verification an important building block for safe autonomous systems.
Conclusion — Closing the loop
The authors’ framework replaces brittle prompt engineering with a self-contained validation loop grounded in the original system prompt. The result is a more robust, more generalizable, and more efficient pipeline for automated paper reproduction.
The lesson is simple but profound: you can eliminate entire classes of errors by teaching LLMs to check their own homework — as long as you force them to check against the same instructions they were given in the first place.
Cognaptus: Automate the Present, Incubate the Future.