Divide, Cache, and Conquer: How Mixture-of-Agents is Rewriting Hardware Design

Hardware design has a rather unforgiving relationship with “almost right”.

A chatbot can produce a slightly clumsy paragraph and survive the incident. A Verilog module that mishandles reset logic, races a signal, or politely misunderstands concurrency does not get partial credit from physics. It fails simulation, or worse, passes the wrong simulation and then becomes a very expensive archaeology project later in the design flow.

That is the practical tension behind VeriMoA, a new Mixture-of-Agents framework for spec-to-HDL generation.¹ The paper’s headline result is attractive enough: 15–30% Pass@1 improvements across VerilogEval 2.0 and RTLLM 2.0, without fine-tuning the underlying models. But the more interesting lesson is not “use more agents”. That is the sort of conclusion one reaches after reading the abstract, closing the PDF, and declaring victory over comprehension.

The actual mechanism is sharper. VeriMoA argues that multi-agent generation only helps hardware design when the system can do two things at once: preserve the best candidate artefacts across layers, and explore genuinely different solution paths rather than producing six mildly rearranged versions of the same mistake. In other words, collaboration needs memory and taste. A room full of agents is not automatically an engineering team. Sometimes it is just a meeting.

The failure mode is not weak generation; it is bad propagation

The paper starts from a familiar problem in LLM-based RTL generation. General-purpose models are much better exposed to Python and C++ than to Verilog or VHDL. They can write plausible HDL syntax, but HDL is not just another programming language with semicolons. It encodes concurrent hardware behaviour, timing assumptions, signal ownership, reset discipline, and synthesis constraints. The surface resemblance to software is helpful until it becomes actively misleading.

Prior work has attacked this from three directions. Prompting tries to coax latent HDL knowledge out of a general model. Fine-tuning injects more domain data into the model. Multi-agent systems split the work across generators, debuggers, critics, or debaters. VeriMoA sits in the third category, but its critique is aimed at the way these systems move information.

A linear pipeline can propagate errors. A debate can amplify noise. A standard Mixture-of-Agents setup can lose a good candidate if the next aggregation layer mishandles it. The paper frames this as two linked deficiencies: noise propagation and constrained exploration. One poisons the accumulated context; the other traps the system inside a narrow design neighbourhood.

VeriMoA’s answer is mechanism-first:

Problem in ordinary multi-agent HDL generation	VeriMoA mechanism	Practical interpretation
Good intermediate HDL may disappear in later layers	Global quality-guided cache stores all intermediate HDL candidates and ranks them	The system treats candidates as reusable engineering assets, not disposable chat turns
Bad outputs can contaminate the next generation layer	Later agents receive top-ranked candidates from all previous layers, not only the immediate predecessor	Information flow is filtered rather than merely forwarded
Multiple agents may explore the same reasoning path	Parallel direct-HDL, C++-guided, and Python-guided paths	Diversity is structured around different representations, not just random sampling
HDL-specific failure signals are hard for general LLMs	Simulation-based scoring plus rule-based fallback checks	Domain feedback becomes part of orchestration, not model weights

The cache is the conceptual centre. Each generated HDL candidate is evaluated and stored with a quality score. Deeper layers are then prompted with the best previous candidates selected globally. That breaks the fragile dependency chain where layer $i+1$ inherits whatever layer $i$ happened to produce. The paper describes this as monotonic knowledge accumulation: the top-ranked reference set should not get worse as the cache grows.
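The cache behaviour described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Candidate` and `QualityCache` names, the 0-to-1 score scale, and the toy scores are all assumptions made for the example. It shows why a globally ranked reference set cannot get worse (by its own score) when a weak layer comes along.

```python
from dataclasses import dataclass, field
import heapq


@dataclass(frozen=True)
class Candidate:
    hdl: str      # generated HDL source text
    score: float  # quality score, higher is better (assumed 0..1 scale)
    layer: int    # proposer layer that produced the candidate


@dataclass
class QualityCache:
    """Hypothetical global cache: keeps every candidate, ranks on demand."""
    items: list = field(default_factory=list)

    def add(self, candidate: Candidate) -> None:
        self.items.append(candidate)

    def top_k(self, k: int) -> list:
        # Rank across ALL layers seen so far, not only the latest one.
        return heapq.nlargest(k, self.items, key=lambda c: c.score)


cache = QualityCache()
best_scores = []  # total score of the reference set after each layer

# Toy scores per layer; layer 2 is deliberately worse than layer 1.
toy_layers = [[0.2, 0.6], [0.1, 0.3], [0.9, 0.4]]
for layer, scores in enumerate(toy_layers, start=1):
    for i, s in enumerate(scores):
        cache.add(Candidate(hdl=f"// layer {layer}, candidate {i}", score=s, layer=layer))
    refs = cache.top_k(2)
    best_scores.append(sum(c.score for c in refs))
    print(layer, [(c.layer, c.score) for c in refs])
```

In the toy run, layer 2 produces nothing better than layer 1's best candidate, so that candidate stays in the reference set instead of being overwritten. A pipeline that only forwards the previous layer's output would have lost it.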

There is a caveat, and it matters. The guarantee is over the system’s quality score, not over the metaphysical truth of circuit correctness. If the scoring function is noisy, incomplete, or misaligned with production requirements, the cache will confidently preserve the wrong thing. Still, as an architecture for benchmarked HDL generation, the design is sensible: do not ask the model to remember quality implicitly when the system can rank and store it explicitly. Revolutionary? No. Useful? Annoyingly often, yes.

The C++ and Python paths are not decoration

The second mechanism is multi-path generation. VeriMoA does not simply ask several agents to produce HDL directly. It uses three agent types in each proposer layer: base agents generate HDL directly; C++ agents first translate the specification into C++-like intermediate code and then into HDL; Python agents do the same through Python.

This matters because the intermediate languages are not treated as cute explanatory notes. They are alternative reasoning scaffolds. C++ gives bit-level operations, explicit types, and control structures that align with hardware reasoning. Python gives high-level expressiveness and compact algorithmic decomposition. The direct HDL path preserves hardware idioms. Together they widen the search space.
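As a rough sketch of the three paths, assuming a generic `llm` callable that maps a prompt to a completion: the prompt wording, the function names, and the `echo_llm` stand-in are all hypothetical and are not taken from the paper. The point is only the shape of the control flow, where the C++ and Python agents make two calls each and the base agent makes one.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def direct_path(spec: str, llm: LLM) -> str:
    # Base agent: specification straight to HDL.
    return llm(f"Write Verilog for this specification:\n{spec}")


def two_stage_path(spec: str, llm: LLM, language: str) -> str:
    # Stage 1: model the behaviour in an intermediate language.
    intermediate = llm(f"Write {language} code that models this specification:\n{spec}")
    # Stage 2: translate that model into HDL, keeping the spec in view.
    return llm(
        f"Translate this {language} model into Verilog.\n"
        f"Specification:\n{spec}\n"
        f"{language} model:\n{intermediate}"
    )


def proposer_layer(spec: str, llm: LLM) -> Dict[str, str]:
    # One candidate per agent type; a real layer would run several of each.
    return {
        "base": direct_path(spec, llm),
        "cpp": two_stage_path(spec, llm, "C++"),
        "python": two_stage_path(spec, llm, "Python"),
    }


def echo_llm(prompt: str) -> str:
    # Offline stand-in so the sketch runs without a model behind it.
    return f"<completion for: {prompt.splitlines()[0]}>"


candidates = proposer_layer("2-to-1 multiplexer", echo_llm)
for path, text in candidates.items():
    print(path, "->", text)
```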

The paper’s controlled multi-path experiment is useful here because it tests a tempting objection: perhaps the benefit comes merely from having differently worded prompts. To probe that, the authors compare a C++ + Python two-path setup against a two-Python setup where one Python agent is prompted toward bit-level style and the other toward high-level style. With GPT-4o-mini and four layers, the C++ + Python configuration reaches 64.52% Pass@1 on VerilogEval 2.0 and 57.89% on RTLLM 2.0. The two-Python variant reaches 59.47% and 53.22%. That gap supports the authors’ claim that language properties themselves contribute to diversity, not just prompt variance.

This is where the business lesson becomes clearer. In many enterprise agent deployments, “diversity” is implemented as multiple calls to the same model with slightly different instructions. That can help, but it is a weak form of diversity. VeriMoA’s stronger claim is that useful diversity is representational. Different intermediate artefacts expose different failure modes and recovery paths.

For hardware design, this is especially relevant because the final artefact is not prose. It must satisfy executable constraints. A candidate can be syntactically plausible and functionally useless. The value of an intermediate path is not that it sounds more reasoned; it is that it helps the system generate HDL that passes verification.

The main evidence says orchestration can rival training, but not magically replace it

The experiments use two benchmarks: VerilogEval 2.0, with 156 combinational and sequential circuit problems, and RTLLM 2.0, with 50 more complex real-world design tasks. The default VeriMoA configuration uses four proposer layers with six parallel agents per layer: two direct HDL agents, two C++ agents, and two Python agents. Icarus Verilog is used for simulation, and the evaluation metric is Pass@k, with $n=10$ sampled outputs per problem.
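For readers less familiar with the metric: Pass@k is commonly computed with the unbiased estimator popularised by the HumanEval benchmark, $1 - \binom{n-c}{k} / \binom{n}{k}$, where $c$ of the $n$ samples pass. The summary here does not say which estimator the paper uses, so treat the sketch below as the usual convention rather than a confirmed detail.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn per problem, c: samples that pass, k: budget of attempts.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n = 10 samples, a problem solved by 3 of them:
for k in (1, 3, 5):
    print(k, round(pass_at_k(10, 3, k), 4))
```

For $k=1$ the estimator reduces to $c/n$, the fraction of samples that pass. The benchmark score is the mean of this quantity over all problems.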

The main comparison is against non-training baselines: direct generation, Chain-of-Thought prompting, HDLCoRe, and VeriMaAS. Across Qwen2.5, Qwen2.5-Coder, GPT-4o-mini, and GPT-4o backbones, VeriMoA consistently performs best.

A few numbers show the scale:

Backbone	Benchmark	Direct Pass@1	Strong non-training baseline	VeriMoA Pass@1	Interpretation
Qwen2.5-7B	VerilogEval 2.0	22.90%	VeriMaAS: 32.81%	56.44%	Small models benefit heavily from orchestration
Qwen2.5-Coder-32B	VerilogEval 2.0	46.93%	VeriMaAS: 56.67%	73.31%	Gains persist even with stronger code models
GPT-4o-mini	RTLLM 2.0	46.60%	VeriMaAS: 57.25%	64.23%	Improvement remains visible on harder tasks
GPT-4o	VerilogEval 2.0	64.74%	VeriMaAS: 71.34%	84.97%	Strong commercial models still benefit from system design

The smaller-model result is the most commercially interesting. VeriMoA with Qwen2.5-7B reaches 56.44% Pass@1 on VerilogEval 2.0, exceeding VeriMaAS with Qwen2.5-32B at 53.57%. Likewise, VeriMoA with Qwen2.5-Coder-7B reaches 60.96%, above VeriMaAS with Qwen2.5-Coder-32B at 56.67%. That does not mean a small model becomes a large model. It means a well-designed inference-time process can recover value that a larger, poorly orchestrated workflow leaves on the table.

The comparison with fine-tuned models is also worth reading carefully. Qwen2.5-Coder-32B plus VeriMoA reaches 73.31% Pass@1 on VerilogEval 2.0 and 65.49% on RTLLM 2.0, beating the listed fine-tuned 7B models. But the most revealing row is not “VeriMoA versus fine-tuning”. It is “VeriMoA on top of fine-tuning”. When applied to VeriRL-CodeQwen2.5, VeriMoA improves Pass@1 from 66.28% to 82.47% on VerilogEval 2.0 and from 61.53% to 74.45% on RTLLM 2.0.

That is the right interpretation: orchestration and training are complements. Fine-tuning improves the model’s prior knowledge. VeriMoA improves the generation process around it. Enterprises often treat those as competing budget lines. The paper suggests they are different levers.

The ablation makes caching the protagonist

The paper’s ablation study is the strongest support for the mechanism-first framing. It compares direct generation, standard MoA, two-stage intermediate generation, quality-guided caching, the combination of caching and two-stage generation, and the full version with simulator-based self-refinement.

The key result: quality-guided caching contributes more than two-stage generation alone.

On Qwen2.5-7B, standard MoA reaches 28.86% Pass@1 on VerilogEval 2.0. Adding two-stage generation alone raises this only to 31.46%, a 2.60-point improvement. Adding quality-guided caching raises it to 40.79%, an 11.93-point improvement. Then adding two-stage generation on top of quality-guided caching lifts performance to 52.06%. The full configuration with simulator-based refinement reaches 56.44%.
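The arithmetic is worth laying out, because the interaction between the two mechanisms is easy to miss in prose. The snippet below uses only the Pass@1 figures quoted above; the configuration labels are shorthand for this example.

```python
# Pass@1 (%) on VerilogEval 2.0 with Qwen2.5-7B, as quoted above.
ablation = [
    ("Standard MoA", 28.86),
    ("MoA + two-stage generation", 31.46),
    ("MoA + quality-guided cache", 40.79),
    ("Cache + two-stage generation", 52.06),
    ("Full (simulator-based refinement)", 56.44),
]

baseline = ablation[0][1]
for name, score in ablation:
    print(f"{name:<36} {score:6.2f}  {score - baseline:+6.2f} vs standard MoA")

# Marginal value of two-stage generation, without and with the cache.
two_stage_without_cache = round(31.46 - 28.86, 2)
two_stage_with_cache = round(52.06 - 40.79, 2)
print(two_stage_without_cache, two_stage_with_cache)
```

Two-stage generation is worth 2.60 points on its own and 11.27 points once the cache is in place. Same mechanism, very different payoff, depending on whether anything downstream can keep what it finds.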

That sequence matters. The intermediate representations are powerful only after the system has a reliable way to preserve and reuse good outputs. Without quality-guided caching, diversity is mostly noise with a nicer résumé.

The case study on the RTLLM 2.0 “LIFObuffer” task tells the same story in miniature. Standard MoA and MoA plus two-stage generation maintain high diversity but achieve poor pass rates. MoA plus Q-Cache improves quality monotonically and reaches a 50% pass rate. Q-Cache plus two-stage generation gets both quality growth and useful diversity, reaching an 80% pass rate across the reported trials.

This is a useful correction to a common agent-system misconception: diversity is not inherently good. Randomness expands the search space; it does not tell you which regions are worth searching. In technical workflows, diversity without selection is just a very expensive way to become confused.

The cost story is promising, but not free

VeriMoA is training-free, not cost-free. The distinction is not decorative.

The full configuration consumes roughly 9.58× to 11.03× the baseline token budget across the reported settings. On VerilogEval 2.0 with GPT-4o-mini, direct generation uses 0.52k tokens per problem and reaches 48.97% Pass@1. Full VeriMoA uses 5.51k tokens and reaches 72.43%. On RTLLM 2.0 with Qwen2.5-Coder-32B, baseline generation uses 0.76k tokens and reaches 40.40%, while VeriMoA uses 7.28k tokens and reaches 65.49%.
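A quick back-of-envelope calculation puts those figures on one scale. It uses only the numbers quoted above. "Extra tokens per Pass@1 point" is a convenience metric introduced for this example, not one the paper reports, and the token counts are rounded, so the ratios are approximate.

```python
settings = [
    # (label, baseline k-tokens, baseline Pass@1 %, VeriMoA k-tokens, VeriMoA Pass@1 %)
    ("GPT-4o-mini on VerilogEval 2.0", 0.52, 48.97, 5.51, 72.43),
    ("Qwen2.5-Coder-32B on RTLLM 2.0", 0.76, 40.40, 7.28, 65.49),
]


def cost_summary(base_tok, base_pass, moa_tok, moa_pass):
    ratio = moa_tok / base_tok    # token multiplier over direct generation
    gain = moa_pass - base_pass   # Pass@1 gain in percentage points
    extra_tokens_per_point = (moa_tok - base_tok) * 1000 / gain
    return ratio, gain, extra_tokens_per_point


summaries = {label: cost_summary(*nums) for label, *nums in settings}
for label, (ratio, gain, per_point) in summaries.items():
    print(f"{label}: {ratio:.2f}x tokens, +{gain:.2f} points, "
          f"~{per_point:.0f} extra tokens per point")
```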

That trade-off may be attractive in hardware design, where engineer time and downstream verification failures are expensive. But it is still a trade-off. Token cost, latency, orchestration complexity, and simulator calls all accumulate. The paper’s VeriMoA-Lite setting is therefore important: with three agents per layer, it uses around 5.33× to 5.92× baseline tokens and still outperforms VeriMaAS under comparable token budgets.

The sensitivity results add another practical hint. Wider layers help more than deeper ones at the same total agent count. With 12 total agents, a 3-layer, 4-width setup reaches 47.6% Pass@1, while a 4-layer, 3-width setup reaches 46.4%; a 2-layer, 6-width setup reaches 48.5%. The implication is operational rather than philosophical: parallel diversity is often more valuable than a longer chain of increasingly self-referential reasoning. A useful sentence for AI architects, and possibly for committee chairs.

Robustness tests show graceful degradation, not production readiness

The robustness section asks what happens when golden testbenches are unavailable during generation. This is important because real teams do not always have complete, high-quality verification infrastructure at the earliest design stage.

The authors replace golden testbenches with LLM-generated testbenches during generation, while retaining golden testbenches for final evaluation. VeriMoA degrades by 2.83 to 4.59 percentage points across Pass@1, Pass@3, and Pass@5 on the two benchmarks. With GPT-4o-mini, Pass@1 on VerilogEval 2.0 falls from 72.43% to 67.84%; on RTLLM 2.0, it falls from 64.23% to 60.51%.

That is encouraging, but it should not be over-read. The final scoring still depends on golden testbenches. The experiment shows that the internal ranking process can tolerate noisier generated testbenches better than one might expect. It does not show that production verification can be replaced by LLM-generated tests. Please do not make the verification team discover this interpretation in a post-silicon review.

The intermediate-language failure analysis is similarly nuanced. On RTLLM 2.0 with GPT-4o-mini, 94.25% of intermediate C++/Python code is functionally correct. When the intermediate code is correct, 67.94% of final Verilog passes functional verification. When the intermediate code is wrong, only 3.48% passes. This tells us two things at once: the first stage is usually reliable, and errors in that stage are highly damaging when they occur.
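The three percentages can be combined with the law of total probability to see how much of the final pass rate flows through correct intermediate code. The variable names are illustrative, and whether these figures are measured over the same samples as the headline Pass@1 is not clear from this summary, so the comparison in the final comment is suggestive rather than exact.

```python
# Figures quoted above for GPT-4o-mini on RTLLM 2.0.
p_intermediate_ok = 0.9425  # intermediate C++/Python code is functionally correct
p_pass_given_ok = 0.6794    # final Verilog passes, given correct intermediate code
p_pass_given_bad = 0.0348   # final Verilog passes, given wrong intermediate code

via_ok = p_intermediate_ok * p_pass_given_ok
via_bad = (1 - p_intermediate_ok) * p_pass_given_bad
overall = via_ok + via_bad  # law of total probability

print(f"overall pass rate: {overall:.4f}")
print(f"share of passes reached via wrong intermediate code: {via_bad / overall:.2%}")
# The overall figure lands close to the 64.23% Pass@1 quoted for this setting.
```

Almost every passing design comes through a correct intermediate model, which is the quantitative version of "errors in that stage are highly damaging".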

For business use, this points toward a practical design rule: intermediate artefacts should be validated, logged, and inspectable. They are not just hidden reasoning traces. They are part of the engineering workflow.

What this means for chip-design teams

The direct business interpretation is not that AI will now “rewrite hardware design”, despite the title doing its best to behave. The more precise reading is that inference-time orchestration can improve early RTL drafting when three conditions hold.

First, the task has executable feedback. VeriMoA benefits from simulation-based scoring and HDL-specific checks. Agent frameworks become much more useful when outputs can be evaluated, ranked, and reused.

Second, candidate diversity must be structured. C++ and Python are not arbitrary detours; they map the natural-language specification into different computational representations before HDL generation. That is very different from asking six agents to “think step by step” and hoping one of them becomes an electrical engineer.

Third, the organisation must accept higher inference cost in exchange for reduced training burden. For teams without the data, budget, or governance appetite to fine-tune a specialised RTL model, a training-free framework may be attractive. For teams already using fine-tuned models, VeriMoA-style orchestration may still add value on top.
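The first condition, executable feedback, is the one the other two rest on, so here is a rough sketch of what "evaluated, ranked, and reused" might look like. Everything in it is an assumption for illustration: the `Simulator` hook, the specific static checks, the 0.25 fallback weight, and the toy simulator are invented for this example and are not the paper's scoring rules. A real setup would call an actual simulator such as Icarus Verilog behind the hook.

```python
import re
from typing import Callable, List, Optional

# Hypothetical hook: fraction of testbench checks passed, or None when the
# design could not be simulated at all (for example, it failed to compile).
Simulator = Callable[[str], Optional[float]]

FALLBACK_WEIGHT = 0.25  # assumed cap so static checks never beat a clean pass


def rule_based_score(hdl: str) -> float:
    """Illustrative static checks only; not the paper's actual rules."""
    opens = len(re.findall(r"\bmodule\b", hdl))
    closes = len(re.findall(r"\bendmodule\b", hdl))
    begins = len(re.findall(r"\bbegin\b", hdl))
    ends = len(re.findall(r"\bend\b", hdl))
    checks = [opens > 0, opens == closes, begins == ends]
    return sum(checks) / len(checks)


def score(hdl: str, simulate: Simulator) -> float:
    result = simulate(hdl)
    if result is not None:
        return result  # simulation evidence wins when it exists
    return FALLBACK_WEIGHT * rule_based_score(hdl)


def rank(candidates: List[str], simulate: Simulator) -> List[str]:
    return sorted(candidates, key=lambda h: score(h, simulate), reverse=True)


def toy_simulator(hdl: str) -> Optional[float]:
    # Stand-in for a real simulator run on a 2-input AND gate testbench.
    if "endmodule" not in hdl:
        return None  # treat as a compile failure
    return 1.0 if "a & b" in hdl else 0.5  # a | b matches AND on 2 of 4 inputs


good = "module and2(input a, input b, output y);\n  assign y = a & b;\nendmodule"
wrong = "module and2(input a, input b, output y);\n  assign y = a | b;\nendmodule"
broken = "module and2(input a, input b, output y);\n  assign y = a & b;"

ranked = rank([broken, wrong, good], toy_simulator)
print([round(score(h, toy_simulator), 3) for h in ranked])
```

The design choice worth noticing is the ordering of evidence: simulation results override static checks whenever they exist, and the fallback only keeps unsimulatable candidates comparable with each other. How to weight the two is a judgment call that depends on how much the testbench is trusted.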

A realistic adoption pathway would look less like autonomous chip design and more like assisted module generation:

Stage	Plausible use	What remains human-owned
Early RTL drafting	Generate candidate modules from structured specifications	Specification quality, architecture choices
Design exploration	Compare alternative implementations from direct, C++, and Python paths	Area, timing, and power trade-offs
Verification triage	Use simulation feedback to rank candidates and identify recurring failure modes	Testbench authority and coverage strategy
Workflow integration	Feed selected candidates into existing EDA flows	Signoff, synthesis constraints, security, IP review

The ROI case is therefore not “replace RTL engineers”. It is “reduce iteration waste in constrained generation tasks”. Less glamorous. More likely to survive contact with procurement.

The boundary is small-scale RTL, not full-chip autonomy

The authors are clear that VerilogEval 2.0 and RTLLM 2.0 are still relatively small-scale benchmarks. That limitation matters because context overhead grows with the number of cached candidates, intermediate artefacts, and module dependencies. Carrying multiple C++ snippets, Python sketches, HDL candidates, scores, and feedback across large designs could become unwieldy.

The paper suggests hierarchical generation as a future direction: a planner decomposes a complex design into modules, VeriMoA generates modules independently, and the outputs are integrated. That aligns with standard hardware methodology, but it also introduces harder problems: interface consistency, cross-module timing, shared state assumptions, verification coverage, and integration failures that do not appear in isolated benchmark tasks.

There are also production constraints outside the paper’s benchmark scope. Passing functional simulation is not the same as satisfying synthesis, timing closure, area constraints, power budgets, security policies, or internal coding standards. Proprietary design environments will need stronger testbenches, audit trails, access controls, and EDA integration. The paper shows a promising generation framework; it does not eliminate the rest of hardware engineering. Hardware engineers may now exhale through the nose.

The broader lesson: agent systems need selective memory

VeriMoA is interesting because it makes a general point through a demanding domain. Multi-agent AI systems fail when they confuse interaction with progress. More messages, more agents, more debate, and more layers do not automatically create better work. They create more material. Some of it is useful. Some of it is junk wearing a confident syntax.

The paper’s contribution is to make the system selective. Cache candidates. Score them. Reuse the best. Explore through meaningfully different paths. Let diversity generate options, but let quality decide what survives.

For business leaders, the lesson travels beyond HDL. In any workflow where AI outputs can be tested—code generation, data transformation, configuration management, compliance checks, simulation, optimisation—the winning architecture may not be a larger model alone. It may be a process that treats intermediate outputs as assets, evaluation as memory, and orchestration as a first-class design problem.

That is less cinematic than “AI designs chips”. It is also more useful.

Cognaptus: Automate the Present, Incubate the Future.

¹ Heng Ping et al., “VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation,” arXiv:2510.27617v2, 2026.
