When Three Examples Beat a Thousand GPUs

A GPU bill is usually treated as a hardware problem. Buy faster accelerators, shorten training runs, negotiate a better cloud contract.

Less often asked is whether the expensive part of the pipeline began with a badly calibrated prompt.

An LLM generating neural-network architectures can create thousands of candidates before training begins. If the prompt provides too little context, the model may repeatedly produce shallow variations of the same familiar design. Add more examples, and it may combine useful ideas across architectural families. Add still more, and the output can become worse, incomplete, or invalid.

That creates an unusual optimization problem. The cheapest prompt may waste training compute. The richest prompt may fail to generate trainable code. Somewhere between imitation and collapse sits a useful operating zone.

A study by Raghuvir Duvvuri and colleagues examines that zone by generating and evaluating 1,900 neural architectures while varying the number of supporting examples supplied to a code-generating LLM.¹ Its headline result is memorable: prompts with three supporting examples achieved the highest dataset-balanced early-performance score.

No, the paper does not literally pit three examples against a thousand GPUs. The title describes the operational leverage. Better prompt calibration and cheap duplicate rejection can prevent expensive experiments from entering the training queue in the first place.

The prompt is part of the search algorithm

Traditional Neural Architecture Search, or NAS, treats architecture design as an optimization problem. A system explores a defined search space using reinforcement learning, evolutionary methods, gradient-based optimization, or related techniques. These approaches can find strong architectures, but the search itself often requires substantial compute and carefully designed constraints.

LLM-based architecture generation changes the interface. Instead of selecting operations from a predefined search space, the model generates complete PyTorch architecture code from a prompt.

The paper builds on the NNGPT and LEMUR ecosystem. Its generator is a LoRA-fine-tuned DeepSeek Coder 7B model. Each prompt contains:

a task and dataset description;
a main reference architecture and its accuracy;
a variable number of supporting architectures and their accuracies;
explicit instructions to combine the best elements;
strict output requirements so the generated code can enter an automated training pipeline.

One detail matters before interpreting the results: the paper labels variants by their number of supporting examples. An alt-nn3 prompt contains a main reference model plus three supporting models. “Three examples” is convenient shorthand, but the generator actually sees four code blocks.

The controlled variable is the supporting-example count, denoted here as $n$. Supporting models are sampled randomly, rather than selected through an evolutionary or performance-optimizing procedure. Other generation settings remain fixed.
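To make the prompt structure concrete, here is a minimal sketch of how such a prompt could be assembled. It is not the paper's template: the `ReferenceModel` type, the `build_prompt` function, and the exact wording of each section are assumptions for illustration. Only the overall shape (one main reference, $n$ randomly sampled supporting models, accuracies attached, an instruction to combine) follows the description above.

```python
import random
from dataclasses import dataclass


@dataclass
class ReferenceModel:
    name: str        # hypothetical identifier for the architecture
    code: str        # PyTorch source of the architecture
    accuracy: float  # reported accuracy, as a fraction in [0, 1]


def build_prompt(task: str, main: ReferenceModel, pool: list[ReferenceModel],
                 n: int, rng: random.Random) -> str:
    """Assemble a prompt with one main reference and n random supporting models."""
    candidates = [m for m in pool if m.name != main.name]
    # Supporting models are drawn at random, not ranked by performance.
    supporting = rng.sample(candidates, n)
    parts = [
        f"Task: {task}",
        f"Main reference (accuracy {main.accuracy:.3f}):\n{main.code}",
    ]
    for i, m in enumerate(supporting, start=1):
        parts.append(
            f"Supporting example {i} (accuracy {m.accuracy:.3f}):\n{m.code}"
        )
    # Illustrative wording only; the real template is not reproduced here.
    parts.append("Combine the best elements of the architectures above.")
    parts.append("Output only a complete PyTorch module.")
    return "\n\n".join(parts)
```

Passing a seeded `random.Random` keeps the sampling reproducible, so a sweep over $n$ changes only the example count.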

This design turns prompt context into an experimental hyperparameter.

Three operating zones appear as context grows

The paper’s most useful contribution is not simply that $n=3$ produced the highest average. It is the mechanism suggested by the entire response curve.

The results divide prompt behavior into three operating zones:

Limited context                 Moderate context                 Excessive context
Template anchoring   →          Cross-pattern synthesis   →      Signal dilution and failure
n = 1                           n = 3                            n = 4, 5, and especially 6

Each zone creates a different failure mode and therefore requires a different operational response.

Zone one: limited context encourages imitation

With only one supporting example, the model has a narrow architectural vocabulary available inside the prompt.

The representative alt-nn1 architecture examined by the authors follows a largely sequential AlexNet-like structure: convolution, activation, pooling, then more of the same. It changes dimensions and implementation details, but it does not introduce the modular abstractions found in the richer outputs. Residual shortcuts, dual-path structures, and reusable architectural blocks are absent from the example selected for qualitative analysis.

This is consistent with template anchoring. When the model sees a small number of architectural patterns, the safest completion is a local variation of what it has already been shown.

That does not make limited context useless. The alt-nn1 group produced 1,268 successfully trained architectures and achieved a dataset-balanced mean one-epoch accuracy of 51.5%. On several individual datasets, it remained competitive with or slightly better than the three-example configuration.

The limitation is not basic functionality. It is the tendency to explore locally.

For a generation pipeline, this distinction matters. A system that reliably produces valid but structurally similar candidates may appear productive while contributing little genuine search diversity. The GPUs stay busy. The explored design space barely grows.

Zone two: moderate context enables architectural synthesis

At $n=3$, the model has enough examples to draw from multiple architectural patterns without yet losing prompt coherence.

The authors’ qualitative examples show generated architectures combining ideas associated with different network families:

ResNet-style residual units paired with large AlexNet-like fully connected heads;
DPN-inspired bottleneck blocks combined with an unusual progressive convolutional backbone;
aggressive early spatial reduction combined with hierarchical residual feature extraction.

These examples support a plausible mechanism: moderate context gives the LLM enough contrasting material to perform structural crossover rather than shallow mutation.

The prompt’s wording reinforces this behavior. It explicitly instructs the model to “combine best elements,” while attaching accuracy values to each architecture. The model is not merely asked to continue code. It is given a small, labelled design portfolio and told to synthesize.

The qualitative evidence is informative, but it should not be promoted beyond its role. Representative hybrid architectures demonstrate that synthesis occurred. They do not constitute a formal diversity metric, nor do they prove that three supporting examples always produce more novel or better-designed networks.

The quantitative results provide the stronger evidence.

| Variant | Supporting examples | Successfully trained models | Dataset-balanced mean one-epoch accuracy |
| --- | --- | --- | --- |
| `alt-nn1` | 1 | 1,268 | 51.5% |
| `alt-nn2` | 2 | 306 | 49.8% |
| `alt-nn3` | 3 | 103 | 53.1% |
| `alt-nn4` | 4 | 102 | 47.3% |
| `alt-nn5` | 5 | 121 | 43.0% |

The three-supporting-example configuration achieved the highest dataset-balanced mean: 53.1%, compared with 51.5% for the one-example baseline.

A 1.6-percentage-point improvement is useful, but hardly a reason to replace a research department with a prompt template. Its importance becomes clearer when the results are separated by dataset.

On CIFAR-100, alt-nn3 achieved a mean one-epoch accuracy of 26.1%, compared with 14.5% for alt-nn1. The difference was 11.6 percentage points, with $p=0.001$ and Cohen’s $d=0.73$.

That is the paper’s strongest performance result. It suggests that richer architectural context may be particularly useful when the task offers enough complexity for architectural choices to matter during early training.

On MNIST, by contrast, most variants performed well. Supplying a broader architectural vocabulary provides little advantage when the underlying task is already forgiving. There are only so many ways to impress a handwritten digit.

The balanced average prevents easy datasets from winning the argument

The study does not calculate its overall result by pooling every generated model into one large average.

That would be misleading because the number of successful architectures differs sharply across datasets and prompt variants. Easier datasets and variants with more valid outputs could dominate the result simply by contributing more observations.

Instead, the authors first calculate a mean for each prompt-variant and dataset pair, then average those dataset-level means:

$$ \bar{A}_{\text{balanced}}(n) = \frac{1}{D} \sum_{d=1}^{D} \bar{A}_{n,d} $$

Here, $\bar{A}_{n,d}$ is the mean accuracy for example-count configuration $n$ on dataset $d$, and $D$ is the number of quantitatively compared datasets.

This macro-averaging gives each dataset equal weight. The paper reports that failing to balance across datasets could create evaluation bias of up to 15.7%.

That methodological choice is not decorative. Without it, a prompt configuration might appear superior because it generated many models on an easy benchmark, not because it behaved better across tasks.
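The gap between pooling and balancing is easy to reproduce. The sketch below uses invented numbers, not the paper's data, and the function names are placeholders. It shows how a variant with many models on an easy dataset looks far stronger under a pooled mean than under the dataset-balanced mean defined above.

```python
from collections import defaultdict
from statistics import mean


def pooled_mean(records):
    """Average over every model, so datasets with more models dominate."""
    return mean(acc for _, acc in records)


def balanced_mean(records):
    """Average of per-dataset means, so each dataset counts once."""
    by_dataset = defaultdict(list)
    for dataset, acc in records:
        by_dataset[dataset].append(acc)
    return mean(mean(accs) for accs in by_dataset.values())


# Invented data: 90 models on an easy dataset, 10 on a hard one.
records = [("easy", 0.98)] * 90 + [("hard", 0.20)] * 10

print(round(pooled_mean(records), 3))    # 0.902
print(round(balanced_mean(records), 3))  # 0.59
```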

The study also uses within-dataset Welch’s t-tests for statistical comparisons. That is important because the sample sizes are highly unequal: alt-nn1 includes 1,268 trained models, while alt-nn3 includes only 103. The macro-level confidence intervals overlap, partly because the smaller alt-nn3 sample produces a much wider interval. The meaningful significance tests therefore occur within datasets, not through a simplistic comparison of the two overall means.
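For readers who want to run the same kind of within-dataset comparison, here is a standard-library sketch of Welch's t statistic and Cohen's $d$. The pooled-standard-deviation form of $d$ is an assumption, since the article does not state which variant the paper uses. Turning the statistic into a p-value needs the t distribution, which SciPy provides through `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)
```

Unlike Student's t-test, Welch's version does not assume equal variances, which is why it suits groups as unequal as 1,268 and 103 models.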

The best example count changes with the task

“Three examples is optimal” is the memorable interpretation. “Example count behaves like a task-dependent hyperparameter” is the more accurate one.

The detailed results do not show a universal victory for alt-nn3.

On CelebA-Gender, the two-supporting-example configuration achieved 82.3%, compared with 75.8% for the one-example baseline and 74.4% for alt-nn3. The alt-nn2 improvement over the baseline was statistically significant.

On SVHN, alt-nn2 also recorded the highest mean among the reported variants. On CIFAR-10 and ImageNette, larger contexts produced statistically significant deterioration.

The evidence is better read as a response surface than a winning number.

| Test | Likely experimental purpose | Result | What it supports |
| --- | --- | --- | --- |
| Dataset-balanced mean | Main overall comparison | `alt-nn3`: 53.1%; `alt-nn1`: 51.5% | Three supporting examples provide the strongest overall early-performance regime in this pipeline |
| CIFAR-100 comparison | Main per-dataset evidence | `alt-nn3` improves by 11.6 points; $p=0.001$, $d=0.73$ | Moderate context can materially help on a difficult fine-grained task |
| CelebA-Gender comparison | Task-sensitivity evidence | `alt-nn2` improves by 6.5 points; $p=0.038$, $d=0.41$ | The best context size varies by dataset |
| CIFAR-10 and ImageNette comparisons | Negative sensitivity evidence | `alt-nn4` significantly underperforms the baseline | Adding examples can actively reduce early performance |
| Representative architecture code | Qualitative mechanism evidence | Three-example outputs contain cross-family hybrids | Moderate context can enable structural synthesis |
| Six-example generation test | Stability and capacity evidence | Only 7 valid models from 3,394 queries | Excessive context can collapse generation reliability |
| Hash-validation benchmark | Implementation and efficiency evidence | Approximately 1 ms per code sample | Cheap rejection can prevent redundant training |

The business implication is not to standardize all prompts at three examples. It is to stop treating example count as harmless formatting.

Zone three: excessive context first degrades, then collapses

The common intuition behind few-shot prompting is monotonic: more examples should provide more knowledge, more constraints, and therefore better outputs.

The experiment shows why that intuition fails for long-form code generation.

At four supporting examples, the dataset-balanced mean falls to 47.3%. At five, it drops further to 43.0%. On ImageNette, alt-nn4 underperforms alt-nn1 by 14.5 percentage points, with a large negative effect size. On CIFAR-10, it falls by 8.4 points.

The paper proposes three interacting mechanisms:

Attention dilution: the model must distribute attention across more heterogeneous architectural examples.
Conflicting generation signals: additional designs may provide patterns that do not combine cleanly.
Insufficient output capacity: long input examples consume context that would otherwise support complete generated code.

The third mechanism becomes difficult to ignore at six supporting examples. From 3,394 generation queries, the system produced only seven valid models—a 99.8% failure rate.
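As a quick check on that figure, using only the two counts reported above:

$$ 1 - \frac{7}{3394} \approx 1 - 0.0021 = 0.9979 $$

Put differently, the pipeline yielded roughly one valid model for every 485 queries.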

This result changes the optimization target. Once generation reliability collapses, architecture quality is no longer the immediate problem. The pipeline cannot consistently produce candidates worth evaluating.

For this specific model and prompt format, six supporting examples are not a slightly inefficient setting. They are an operational failure boundary.

Calling that boundary universal would be careless. A larger-context model, shorter architecture representation, different prompt template, or alternative code-generation method could shift it. But every production system has some version of this limit. The expensive mistake is assuming it does not.

Generation quality must include validity, novelty, and cost

Architecture-generation systems are often evaluated as though every output enters training cleanly. In practice, an output pipeline has several gates:

Prompt
  ↓
Generated code
  ↓
Syntactically and operationally valid?
  ↓
Already generated before?
  ↓
Worth early-stage training?
  ↓
Worth full training?

An example-count setting that improves mean accuracy but sharply reduces valid-output rates may still be economically inferior. Similarly, a configuration that generates hundreds of valid architectures may offer little value if many are formatting-level duplicates.

This is where the paper’s second contribution becomes operationally important.

A one-millisecond filter protects multi-hour training runs

LLMs frequently produce code that is functionally or textually identical except for whitespace and indentation. A raw text hash treats these variants as different strings, allowing duplicated architectures to proceed to training.

The paper’s solution is deliberately simple:

remove whitespace from the generated code;
calculate an MD5 hash of the normalized string;
check the identifier against an indexed database;
reject matches before training.
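The four steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the `DuplicateFilter` class, the SQLite table, and the choice of an in-memory database are all hypothetical stand-ins for whatever indexed store the real pipeline uses.

```python
import hashlib
import re
import sqlite3


def normalized_hash(code: str) -> str:
    """MD5 of the code with every whitespace character removed."""
    stripped = re.sub(r"\s+", "", code)
    # MD5 serves here as a cheap identifier, not as a security measure.
    return hashlib.md5(stripped.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Stores seen hashes in SQLite; the primary key acts as the index."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS seen (hash TEXT PRIMARY KEY)"
        )

    def is_new(self, code: str) -> bool:
        """Record the code's hash; return True only the first time it is seen."""
        try:
            with self.db:  # commits on success, rolls back on error
                self.db.execute(
                    "INSERT INTO seen (hash) VALUES (?)",
                    (normalized_hash(code),),
                )
            return True
        except sqlite3.IntegrityError:
            return False  # primary-key clash: this hash is already stored
```

A candidate would enter the training queue only when `is_new` returns `True`, so a tab-indented copy of an already-seen architecture is rejected before any GPU time is spent.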

The reported processing time is approximately 1 millisecond per code sample, measured across 4,033 generated architectures. The paper compares this with approximately 10–100 milliseconds for AST parsing and 50–200 milliseconds for GraphCodeBERT-based comparison.

Those differences are trivial beside a multi-hour training run. That is precisely the point.

The authors estimate that rejecting one duplicate before training saves approximately two to three GPU-hours. A lightweight check does not need to understand the architecture’s semantics perfectly to produce a high operational return. It only needs to catch a sufficiently common and expensive class of redundancy.

| Deduplication method | Approximate time per sample | Detects formatting variants | Detects semantic equivalence |
| --- | --- | --- | --- |
| Raw string hash | ~1 ms | No | No |
| Winnowing | 2–5 ms | Partially | No |
| AST parsing | 10–100 ms | Yes | Partially |
| GraphCodeBERT-based comparison | 50–200 ms | Yes | Yes |
| Whitespace-normalized hash | ~1 ms | Yes | No |

Whitespace-normalized hashing is not a semantic duplicate detector. Two differently implemented programs that produce the same computation will pass through as distinct. The opposite error is also possible: in Python, where indentation carries meaning, removing all whitespace can map two programs with different behavior to the same hash. A match therefore signals a probable duplicate, not a proven one.
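A small, contrived example shows the false-positive case. The two snippets below are invented for illustration. They differ only in the indentation of the last line, which changes what the program does but not its whitespace-stripped text.

```python
import hashlib
import re


def normalized_hash(code: str) -> str:
    """MD5 of the code with every whitespace character removed."""
    return hashlib.md5(re.sub(r"\s+", "", code).encode("utf-8")).hexdigest()


# print(x) runs on every iteration.
inside_loop = "for i in range(3):\n    x = i\n    print(x)\n"
# print(x) runs once, after the loop ends.
after_loop = "for i in range(3):\n    x = i\nprint(x)\n"

# Different behavior, identical normalized hash.
print(normalized_hash(inside_loop) == normalized_hash(after_loop))  # True
```

Whether this matters in practice depends on how often generated architectures differ only by indentation, which the article does not quantify.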

Within this PyTorch generation pipeline, however, the method serves as an effective first-pass filter. More sophisticated semantic checks can be reserved for the smaller set of candidates that survive.

The broader lesson is not that every code-generation problem needs MD5. It is that validation should be layered according to the cost of the next step.

What the paper directly shows—and what businesses can infer

The study directly investigates LLM-generated neural architectures. Most businesses are not generating thousands of vision models. The practical value therefore lies in the pipeline logic rather than the narrow task alone.

| Level | Conclusion |
| --- | --- |
| What the paper directly shows | In the tested NNGPT/LEMUR pipeline, three supporting examples produced the highest dataset-balanced one-epoch accuracy; excessive context reduced performance and eventually collapsed valid generation; whitespace-normalized hashing cheaply removed formatting-level duplicates. |
| What Cognaptus infers for business use | Demonstration count should be calibrated like a model or workflow hyperparameter. Generated outputs should pass through cheap rejection gates before expensive execution, review, simulation, or training. |
| What remains uncertain | The best example count, context ceiling, and value of syntactic deduplication will vary across models, tasks, prompt formats, programming languages, and evaluation objectives. |

This pattern applies beyond neural architecture generation.

An LLM producing SQL queries, workflow definitions, compliance controls, advertising variants, or simulation code faces a related trade-off. Too little context can produce generic imitation. Moderate context can support useful recombination. Excessive context can introduce contradiction, distraction, truncation, or invalid output.

The correct question is not, “How many examples can fit?”

It is, “How many examples improve the downstream objective before reliability and cost begin to deteriorate?”

A practical calibration workflow

Teams deploying generative systems into costly downstream pipelines can translate the paper into a compact operating procedure.

1. Treat demonstration count as a tunable parameter

Begin with a small number of representative examples. The paper suggests three supporting examples as a defensible starting point for similar architecture-generation settings, not as a universal final setting.

Evaluate several nearby configurations on a held-out task set. A useful initial sweep might compare one, two, three, and four supporting examples rather than jumping directly to the model’s context limit.

2. Measure more than output quality

For each prompt configuration, track at least four dimensions:

valid-output rate;
downstream performance;
duplicate or near-duplicate rate;
generation and evaluation cost.

A prompt that produces slightly stronger surviving outputs but fails most requests may be unsuitable for production.
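One way to keep those four dimensions side by side is a small record per prompt configuration. The sketch below is an assumption about how a team might organize the bookkeeping. The `ConfigStats` name, its fields, and the numbers in the usage example are invented, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ConfigStats:
    n_examples: int        # supporting examples in the prompt
    queries: int           # generation requests sent
    valid: int             # outputs that passed validity checks
    unique: int            # valid outputs left after deduplication
    mean_score: float      # downstream metric over the unique outputs
    cost_per_query: float  # generation cost, in arbitrary units

    @property
    def valid_rate(self) -> float:
        return self.valid / self.queries

    @property
    def duplicate_rate(self) -> float:
        return 1 - self.unique / self.valid if self.valid else 0.0

    @property
    def cost_per_unique(self) -> float:
        """Generation cost spent per candidate that is worth training."""
        if not self.unique:
            return float("inf")
        return self.queries * self.cost_per_query / self.unique


# Invented numbers, for illustration only.
stats = ConfigStats(n_examples=3, queries=1000, valid=200, unique=150,
                    mean_score=0.5, cost_per_query=0.01)
print(stats.valid_rate, stats.duplicate_rate, stats.cost_per_unique)
```

Comparing `cost_per_unique` alongside `mean_score` makes the trade-off visible: a configuration with a higher score can still lose once its failed requests are priced in.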

3. Set an empirical context ceiling

Once validity or performance deteriorates sharply, stop expanding the prompt. Context capacity is not the same as useful context capacity.

The model may technically accept a long prompt while becoming operationally incapable of completing the requested output.
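A ceiling rule can be as simple as the sketch below, which assumes valid-output rates have already been measured for each example count. The function name, the threshold, and the sample rates are all hypothetical. A real pipeline would likely also watch downstream performance, not validity alone.

```python
from typing import Optional


def context_ceiling(valid_rate_by_n: dict[int, float],
                    min_valid_rate: float) -> Optional[int]:
    """Largest example count reached before validity first drops below the floor."""
    ceiling = None
    for n in sorted(valid_rate_by_n):
        if valid_rate_by_n[n] < min_valid_rate:
            break  # stop at the first failure, even if a larger n recovers
        ceiling = n
    return ceiling


# Invented rates: validity holds through n = 3, then falls off.
rates = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.3, 5: 0.6}
print(context_ceiling(rates, min_valid_rate=0.5))  # 3
```

Stopping at the first drop is a deliberately conservative choice: an apparent recovery at a larger count is treated as noise until it is confirmed.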

4. Use a validation cascade

Apply the cheapest reliable checks first:

Exact duplicate check
  ↓
Formatting-normalized duplicate check
  ↓
Syntax and contract validation
  ↓
Selective semantic comparison
  ↓
Expensive execution or training

The purpose is not to build the most intellectually impressive validator. It is to prevent avoidable costs before they occur.
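The first three gates of that cascade fit in one short function. This is a sketch under assumptions: the `validate` function, the gate names, and the contract (a top-level class named `Net`) are hypothetical, and the selective semantic comparison stage is left out because it depends on tooling the article does not specify.

```python
import ast
import hashlib
import re


def validate(code: str, seen_raw: set, seen_norm: set) -> str:
    """Return 'ok' or the name of the first gate that rejected the code."""
    # Gate 1: exact duplicate of something already accepted.
    raw = hashlib.md5(code.encode("utf-8")).hexdigest()
    if raw in seen_raw:
        return "exact_duplicate"
    # Gate 2: duplicate once all whitespace is removed.
    stripped = re.sub(r"\s+", "", code)
    norm = hashlib.md5(stripped.encode("utf-8")).hexdigest()
    if norm in seen_norm:
        return "formatting_duplicate"
    # Gate 3: syntax, then a hypothetical contract check.
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return "syntax_error"
    has_net = any(
        isinstance(node, ast.ClassDef) and node.name == "Net"
        for node in tree.body
    )
    if not has_net:
        return "contract_violation"
    # Only accepted code is remembered for future duplicate checks.
    seen_raw.add(raw)
    seen_norm.add(norm)
    return "ok"
```

The hash checks come before parsing because they are cheaper, so a duplicate never pays for the slower gates.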

5. Balance evaluation across task categories

When outputs are evaluated across datasets, markets, customer segments, or workflow types, aggregate results can be distorted by whichever category produces the largest number of successful samples.

Calculate category-level performance first, then construct an appropriately balanced overall measure. Otherwise, the system may optimize for the easiest or most frequently represented work while appearing broadly effective.

The boundaries are narrower than the slogan

The result is useful because it is concrete. Its limitations are equally concrete.

First, the quantitative study uses one primary code model: DeepSeek Coder 7B, fine-tuned with LoRA on the LEMUR ecosystem. Other models may integrate examples differently and may tolerate much larger prompts.

Second, the performance results come from six computer-vision classification benchmarks. A seventh, Places365, is used for deduplication validation but excluded from the main quantitative comparison because of its longer training time.

Third, each architecture is trained for only one epoch. The resulting accuracies are early-screening signals, not estimates of final converged performance. An architecture that learns quickly in its first epoch may not remain superior after full training.

Fourth, the successful-model counts differ substantially among prompt variants. The study addresses this through dataset-balanced means and within-dataset Welch’s t-tests, but the imbalance still limits how confidently the overall 1.6-point difference should be interpreted.

Fifth, the architectural-synthesis evidence is qualitative. The examples strongly illustrate the proposed mechanism, but the paper does not provide a comprehensive formal measure of architectural novelty or diversity across every generated model.

Finally, whitespace-normalized hashing solves one narrow duplication problem. It cannot identify semantically equivalent programs with different implementations.

These boundaries do not erase the paper’s contribution. They define the conditions under which it is useful.

The expensive part may begin before training

The study’s most important lesson is easy to miss because its two main contributions look almost too simple.

The first is to vary the number of examples in a prompt.

The second is to remove whitespace and calculate a hash.

Neither resembles a grand replacement for conventional NAS. Together, however, they reveal where practical efficiency often comes from: controlling what enters the expensive part of the system.

Three supporting examples created enough contextual variety to enable richer architectural combinations while preserving output coherence. Four and five reduced performance. Six almost stopped the pipeline entirely. A millisecond-scale filter then prevented formatting duplicates from consuming multi-hour training runs.

The number three is not the principle.

The principle is calibration: provide enough context to broaden the model’s design vocabulary, stop before context becomes interference, and reject cheap mistakes before they become expensive experiments.

In an industry inclined to answer every performance problem with more compute, that is a pleasantly inconvenient result.

Cognaptus: Automate the Present, Incubate the Future.

¹ Raghuvir Duvvuri, Chandini Vysyaraju, Avi Goyal, Dmitry Ignatov, and Radu Timofte, “Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design,” arXiv:2512.24120, https://arxiv.org/html/2512.24120.
