Agent Factories: When More AI Means Better Hardware

Button.

That was the promise of High-Level Synthesis: write a high-level program, push it through the toolchain, and receive efficient hardware without spending the afternoon whispering to pragmas like a medieval engineer negotiating with silicon spirits.

The button never quite arrived.

HLS did raise the abstraction level from RTL to C/C++. But performance still depends on expert choices: where to pipeline, where not to pipeline, which arrays to partition, which loops to unroll, which memory access pattern is quietly sabotaging the whole design. The code looks like software; the reasoning remains hardware.

That is why the paper “Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?” is interesting.¹ Not because it proves AI agents can replace hardware engineers. It does not. The more useful reading is narrower and more operational: a general-purpose coding-agent system can reduce the cost of exploring HLS design spaces by combining program rewriting, synthesis feedback, integer programming, and parallel agent search.

The paper’s central idea is not “one genius agent understands hardware.” The central idea is a factory: many imperfect agents generate variants, a synthesis pipeline rejects the broken ones, an ILP solver imposes resource discipline, and more agents explore full-design refinements that local sub-kernel optimization misses.

That mechanism matters. Without it, the article becomes the usual AI bedtime story: more agents, better results, everyone claps, procurement calls it innovation. With the mechanism visible, the result is more practical and less magical.

The problem is not choosing a pragma; it is choosing a hardware story

HLS optimization is hard for three linked reasons.

First, the search space grows quickly. A design can contain many loops, arrays, and functions. Each candidate configuration may require an HLS synthesis run, which is slow enough that exhaustive search becomes impractical.
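To make the growth concrete, here is a minimal back-of-the-envelope sketch. The directive sites, the option counts, and the five-minute synthesis time are all illustrative assumptions, not figures from the paper.

```python
from math import prod

# Hypothetical design: number of directive options at each tunable site.
# All counts here are illustrative assumptions.
options_per_site = {
    "loop_a_unroll": 4,      # e.g. factors 1, 2, 4, 8
    "loop_a_pipeline": 2,    # on / off
    "loop_b_unroll": 4,
    "loop_b_pipeline": 2,
    "array_x_partition": 3,  # e.g. none / cyclic / complete
    "array_y_partition": 3,
}


def configuration_count(options):
    """Size of the full cross-product of directive choices."""
    return prod(options.values())


def exhaustive_hours(options, minutes_per_synthesis):
    """Hours needed to synthesize every configuration one after another."""
    return configuration_count(options) * minutes_per_synthesis / 60


total = configuration_count(options_per_site)
hours = exhaustive_hours(options_per_site, minutes_per_synthesis=5)
print(total, "configurations,", hours, "hours of serial synthesis")
```

Two loops and two arrays already give 576 configurations in this toy setup. Each extra tunable site multiplies the total, which is why exhaustive search stops being an option so quickly.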

Second, optimization is not local. Improving one function can consume area that another function needs. A fast sub-kernel may be a bad global choice if it burns too much FPGA resource for too little system-level latency gain.

Third, HLS directives are not linear “more is better” knobs. The paper gives a simple example from Needleman–Wunsch: fully unrolling a reverse_string loop increased latency from 26 to 71 cycles because memory port contention got worse. That is the type of result that makes junior engineers suspicious and senior engineers tired.

Traditional design space exploration often treats the problem as structured search over a predefined parameter space. That is useful, but it is also a constraint. It can search among allowed pragmas. It usually cannot decide to restructure code, reorganize memory, fuse loops, or apply a cross-function transformation that was never in the parameter menu.

The Agent Factory approach is a response to that limitation. It does not abandon structured optimization. It wraps it inside a broader workflow.

Stage 1 makes local search tractable, then refuses to trust it too much

The first stage decomposes the input design into sub-functions. A coordinator extracts the call graph, then launches an optimizer agent for each sub-kernel. Each optimizer explores variants such as conservative pragma choices, pipeline settings, aggressive unrolling, array partitioning, inlining, and occasional code rewrites.

Each candidate must pass two gates:

It must preserve functional correctness.
It must synthesize successfully and produce measurable latency and area.
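The two gates can be sketched as a simple filter. This is a toy under stated assumptions: `Candidate`, `SynthesisReport`, and `passes_gates` are hypothetical names, the "design" is a plain Python function, and the synthesis step is a stub that returns canned numbers instead of calling a real HLS tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SynthesisReport:
    latency_cycles: int
    area: int


@dataclass
class Candidate:
    name: str
    run: Callable[[int], int]                            # stand-in for the design's behavior
    synthesize: Callable[[], Optional[SynthesisReport]]  # None means synthesis failed


def passes_gates(candidate, reference, test_inputs):
    """Return the synthesis report if both gates pass, otherwise None."""
    # Gate 1: functional correctness against a reference implementation.
    if any(candidate.run(x) != reference(x) for x in test_inputs):
        return None
    # Gate 2: synthesis must succeed and yield latency and area.
    return candidate.synthesize()


def reference(x):
    return 2 * x + 1


# Hypothetical candidates with made-up synthesis results.
candidates = [
    Candidate("shift_add", lambda x: (x << 1) + 1, lambda: SynthesisReport(12, 40)),
    Candidate("off_by_one", lambda x: 2 * x, lambda: SynthesisReport(8, 30)),
    Candidate("fails_synthesis", lambda x: 2 * x + 1, lambda: None),
]

accepted = {}
for c in candidates:
    report = passes_gates(c, reference, range(16))
    if report is not None:
        accepted[c.name] = report
print(accepted)
```

Note that the fastest candidate on paper, `off_by_one`, is rejected before its latency is ever considered. Ordering the gates this way keeps incorrect designs out of the later selection step.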

That verification loop is important. The agents are not being trusted because they sound plausible. Their output is pushed through the toolchain. The toolchain is the adult in the room.

After valid variants are collected, the system faces a composition problem. A good variant for one function may be a poor system-level choice if it consumes too much area. So the factory builds an Integer Linear Programming model that chooses one variant per sub-kernel while satisfying the global area budget and minimizing composed latency.
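The selection problem can be illustrated on a toy instance. The sketch below swaps the ILP solver for plain exhaustive enumeration, which finds the same optimum on small inputs but does not scale the way a solver would. The sub-kernel names, latencies, and areas are invented, and latency is assumed to compose as a simple sum.

```python
from itertools import product

# Hypothetical variants per sub-kernel: (variant name, latency, area).
variants = {
    "load":    [("base", 100, 10), ("pipelined", 60, 25)],
    "compute": [("base", 400, 20), ("unrolled", 150, 70), ("partitioned", 220, 40)],
    "store":   [("base", 80, 10), ("pipelined", 50, 20)],
}


def select_variants(variants, area_budget):
    """Pick one variant per sub-kernel, minimizing total latency within the area budget.

    Returns (latency, area, {kernel: variant name}), or None if nothing fits.
    """
    kernels = list(variants)
    best = None
    for combo in product(*(variants[k] for k in kernels)):
        area = sum(v[2] for v in combo)
        if area > area_budget:
            continue  # violates the global area constraint
        latency = sum(v[1] for v in combo)  # assumes purely sequential composition
        if best is None or latency < best[0]:
            best = (latency, area, dict(zip(kernels, (v[0] for v in combo))))
    return best


print(select_variants(variants, area_budget=100))
```

With a budget of 100, the selection keeps the slow `load` variant so that it can afford the unrolled `compute` variant. That is the point of the composition step: the best local choice for one function is not automatically part of the best global choice.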

This stage is not glamorous, but it is the backbone of the paper. The ILP step turns a pile of agent-generated experiments into a disciplined set of candidate designs. Without it, “agent factory” would mostly mean “parallel chaos with a dashboard.”

The paper’s ILP stage also reflects the call graph. Sequential paths accumulate latency. Parallel regions are modeled by the dominant branch. Loop multipliers matter. The authors validate this formulation on two synthetic benchmarks and two real kernels, showing that the agent can infer useful latency structures for different dataflow graphs.
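The three composition rules in that paragraph can be written as a short recursive function. This is a minimal sketch of the idea, not the paper's formulation: the dictionary-based tree encoding and the node kinds `leaf`, `seq`, `par`, and `loop` are assumptions made for illustration.

```python
def composed_latency(node):
    """Estimate latency of a call-graph region from its parts.

    Rules (mirroring the prose above):
      seq  -> children run one after another, latencies add
      par  -> children run concurrently, the slowest branch dominates
      loop -> body latency is multiplied by the trip count
    """
    kind = node["kind"]
    if kind == "leaf":
        return node["latency"]
    child_latencies = [composed_latency(child) for child in node["children"]]
    if kind == "seq":
        return sum(child_latencies)
    if kind == "par":
        return max(child_latencies)
    if kind == "loop":
        return node["trips"] * sum(child_latencies)
    raise ValueError(f"unknown node kind: {kind}")


# Hypothetical design: setup, then 4 iterations of two parallel branches, then teardown.
design = {
    "kind": "seq",
    "children": [
        {"kind": "leaf", "latency": 10},
        {"kind": "loop", "trips": 4, "children": [
            {"kind": "par", "children": [
                {"kind": "leaf", "latency": 30},
                {"kind": "leaf", "latency": 50},
            ]},
        ]},
        {"kind": "leaf", "latency": 5},
    ],
}

print(composed_latency(design))  # 10 + 4 * max(30, 50) + 5 = 215
```

In this toy tree, speeding up the 30-cycle branch changes nothing, because the 50-cycle branch dominates the parallel region. A latency model that ignores graph structure would spend area on the wrong sub-kernel.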

But the authors are clear about the limit: this ILP model is built from isolated sub-kernel synthesis results. It cannot fully capture cross-function memory reuse, global pipeline scheduling, or interactions that appear only when the full design is assembled.

That limitation is not a footnote. It is the reason Stage 2 exists.

Stage 2 turns “top-ranked” into “good starting point,” not final answer

The factory does not take the single best ILP solution and stop. Instead, it asks for the top N feasible ILP candidates and assigns one exploration agent to each.
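As a rough illustration of "top N feasible candidates", the sketch below enumerates every combination, drops the ones over budget, and keeps the N with the lowest estimated latency. A real ILP workflow would get these from the solver rather than from enumeration. The data and the function name `top_n_candidates` are hypothetical.

```python
import heapq
from itertools import product

# Hypothetical variants per sub-kernel: (variant name, latency, area).
variants = {
    "a": [("base", 100, 10), ("fast", 40, 40)],
    "b": [("base", 200, 20), ("fast", 90, 60)],
}


def top_n_candidates(variants, area_budget, n):
    """Return up to n feasible combinations, ordered by estimated latency."""
    kernels = list(variants)
    feasible = []
    for combo in product(*(variants[k] for k in kernels)):
        area = sum(a for _, _, a in combo)
        if area <= area_budget:
            latency = sum(lat for _, lat, _ in combo)  # assumes sequential composition
            names = tuple(name for name, _, _ in combo)
            feasible.append((latency, area, names))
    # Tuples sort by latency first, then area, then variant names.
    return heapq.nsmallest(n, feasible)


# Each returned entry would seed one Stage 2 exploration agent.
for latency, area, names in top_n_candidates(variants, area_budget=80, n=2):
    print(latency, area, names)
```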

This changes the role of ranking. The ILP ranking is no longer treated as truth. It becomes a triage system.

Each Stage 2 agent receives a full design instantiated from one ILP-selected candidate. Then it explores design-wide transformations that Stage 1 could not reach easily:

Stage 2 path	What it allows	Why Stage 1 may miss it
Pragma composition	Joint pragma choices across multiple functions	Local sub-kernel search sees only part of the interaction
Code restructuring	Loop reordering, fusion, or function inlining	These transformations alter the full program structure
Memory optimization	Cross-function array partitioning and memory access changes	Memory bottlenecks often appear globally, not locally
Compute optimization	Algebraic or closed-form transformations spanning modules	The rewrite may not belong to a single isolated function

This is the paper’s most business-relevant mechanism. The factory uses cheaper local exploration to generate candidates, then spends additional inference-time compute where local search is likely to be blind.

The useful mental model is not “AI agent writes better hardware.” It is closer to this:

agents propose; synthesis disposes; ILP disciplines; scaled agents reopen the global search.

A little less cinematic. A lot more deployable.

The evidence is positive, but not the fairy-tale version

The evidence should not be read as “more agents always means better hardware.” The paper itself gives a more uneven picture.

The experiments use twelve HLS kernels: six HLS-Eval benchmarks — AES, DES, KMP, NW, PRESENT, and SHA256 — and six Rodinia-HLS kernels — lavamd, kmeans, hotspot, leukocyte, cfd, and streamcluster. Results are averaged over five runs. The toolchain is Claude Code using Opus 4.5/4.6 with AMD Vitis HLS.

For HLS-Eval, the baseline is a bounded exhaustive pragma search over common loop directives, followed by ILP selection. That is a meaningful comparison for pragma-level optimization, though not a proof against the best possible DSE framework. For Rodinia-HLS, the comparison is against reference implementations that are already optimized with techniques such as tiling, pipelining, and double-buffering.

The paper reports that Stage 1 generally improves latency over baseline while respecting area budgets, except SHA256, where performance remains comparable. It also observes two patterns that should sound familiar to hardware engineers: agents frequently discover ARRAY_PARTITION as a high-impact directive, and they learn that PIPELINE alone may be useless or harmful unless memory bandwidth and loop-carried dependencies are addressed.

That is not trivial. The agents are not trained specifically for HLS. Yet through iteration and tool feedback, they recover domain patterns that human experts already know. This is one of the paper’s strongest practical findings: general coding agents can sometimes rediscover hardware optimization heuristics when embedded in the right evaluation loop.

But the scaling result is conditional.

Table II reports mean speedup over baseline as the number of agents increases. The sequence in the Results section is not monotonic at the end: the reported mean speedup rises from 4.31× to 4.92× to 7.07×, then falls to 6.53× when moving from 8 to 10 agents. In plain English: more agents helped a lot up to a point, then the aggregate metric slipped.
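The step-to-step changes are easy to check by hand. The snippet below uses only the four mean speedups quoted above and computes the relative change between consecutive agent budgets.

```python
# Mean speedups over baseline as quoted above, in order of increasing agent count.
speedups = [4.31, 4.92, 7.07, 6.53]

# Relative change between consecutive agent budgets.
steps = [(after - before) / before for before, after in zip(speedups, speedups[1:])]

# True only if every step is an improvement or flat.
is_monotonic = all(step >= 0 for step in steps)

for step in steps:
    print(f"{step * 100:+.1f}%")
print("monotonic:", is_monotonic)
```

The steps come out at roughly +14%, +44%, and −8%. The last step is a real dip in the aggregate, but it is small next to the gain that preceded it.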

That does not kill the result. It makes it more useful.

The scaling pattern depends on workload richness and resource pressure

The paper groups the scaling behavior into three categories, and this categorization is more important than the headline average.

Workload pattern	What the paper shows	Business reading	Boundary
Rich optimization landscapes	Streamcluster reaches roughly 5–6×; cfd, hotspot, and kmeans reach roughly 7–10×; lavamd reaches roughly 8× at moderate area	More agents are valuable when there are many plausible design paths	Inference cost must still be justified by design importance
Intermediate or non-monotonic behavior	PRESENT, NW, and DES improve overall, but some 8-agent designs are not dominated by 10-agent designs	Parallel exploration is stochastic; more budget does not guarantee a better frontier point	Agent count should be tuned, not blindly maximized
Saturation or resource pressure	KMP saturates early around 10–12×; AES is similar at 8 and 10 agents; NW is near an area limit	Additional agents may spend resource or tokens without meaningful latency gain	Stop rules and area-aware selection matter

The non-monotonic behavior is the lesson. Agent scaling is not a law of nature. It is a search budget. Search budgets help when the design space contains undiscovered options. They waste money when the problem is simple, constrained, or already near the best feasible trade-off.

This is where a business team should resist the tempting slide title “Scale agents, scale performance.” A better version would be:

Scale agents when synthesis feedback says the frontier is still moving.

Less exciting. More likely to survive a finance review.
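One way to make "the frontier is still moving" operational is to track the area–latency Pareto front and check whether a new batch of agent results changes it. The sketch below is an assumption about how such a check could look, not something the paper specifies. The points are invented, and lower is better on both axes.

```python
def pareto_front(points):
    """Return the (area, latency) points not dominated by any other point."""
    front = []
    for point in sorted(set(points)):  # ascending area, then ascending latency
        # Keep a point only if it beats the latency of every cheaper point.
        if not front or point[1] < front[-1][1]:
            front.append(point)
    return front


def frontier_moved(existing, new_batch):
    """True if the new batch adds at least one non-dominated design."""
    return pareto_front(existing + new_batch) != pareto_front(existing)


# Hypothetical (area, latency) results from earlier agents.
existing = [(50, 300), (80, 200), (100, 190)]

dominated_batch = [(90, 250), (100, 195)]  # worse than designs already found
useful_batch = [(70, 180)]                 # cheaper and faster than (80, 200)

print(frontier_moved(existing, dominated_batch))
print(frontier_moved(existing, useful_batch))
```

A stop rule could then be as simple as "halt after k consecutive batches that do not move the front", with k chosen per project. That value is a tuning decision, not something this sketch can supply.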

The winning design does not always come from the best local candidate

One of the paper’s more subtle findings is that final winning designs do not always originate from the top-ranked ILP variant.

That sounds like a small technical detail. It is not.

It means the first ranking is an incomplete predictor of final potential. A candidate that looks second-best or third-best after sub-kernel composition may offer better room for global transformation. The ILP sees isolated variant metrics and call-graph composition. Stage 2 sees whole-design interactions.

This is the real argument for multi-agent exploration. The value is not merely that ten agents try ten versions of the same thing. The value is that different starting points expose different transformation neighborhoods.

For businesses building agentic engineering workflows, this matters beyond hardware. If the scoring function is incomplete — and in serious engineering work it usually is — the best immediate answer may not be the best refinement path. A good system keeps several promising candidates alive long enough for later evidence to arrive.

That is also why “agent factory” is a better metaphor than “agent genius.” The factory structure preserves optionality.

The ASIC section is a useful extension, not a full portability proof

The paper includes an ASIC-oriented extension using ABC logic synthesis to examine whether FPGA-guided improvements have some relationship to ASIC-mapped results.

This section should be read carefully. Its purpose is not to prove that the workflow is production-ready for ASIC design. It is a generalization check.

The authors compare HLS-reported area against logic area from ABC across six HLS-Eval benchmarks. The correlations vary: SHA256, KMP, and NW show strong relationships; AES is moderate-to-strong; DES is moderate; PRESENT is weak. The weak PRESENT result is important because memory-heavy designs can make HLS area estimates less reliable.
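For readers who want to run the same kind of check on their own designs, here is a minimal sketch of a correlation between two area measurements. It uses the Pearson coefficient, which is an assumption: this article does not say which correlation measure the paper reports. The area values are invented placeholders.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical per-design areas for one benchmark: HLS-reported vs. logic-synthesis area.
hls_area = [100, 150, 220, 300]
mapped_area = [90, 160, 210, 310]

print(round(pearson(hls_area, mapped_area), 3))
```

A value near 1 means the HLS estimate ranks designs the same way the mapped result does. A weak value, as reported for PRESENT, means area-based selection upstream may be choosing on unreliable numbers.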

The latency improvement factors in the ASIC-mapped view range from 1.4× to 14.5×, with improvements generally increasing with agent count and often plateauing after four agents.

The business interpretation is simple: the FPGA-based workflow may point in useful directions for ASIC-oriented exploration, but it is not a substitute for downstream synthesis, verification, timing closure, power analysis, or signoff. Anyone selling it as that should be asked to step away from the slide deck.

The appendix is a cost warning, not a decorative afterthought

The token-usage appendix changes how the result should be deployed.

Across valid runs, the method consumes a median of 5.82 million tokens and a mean of 7.67 million tokens per run. The 25th percentile is 3.09 million, the 90th percentile is 13.39 million, and the observed range spans 1.14 million to 45.33 million tokens.

That distribution is right-skewed. Some runs become much more expensive than typical runs.
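The skew shows up directly in the ratios between the reported statistics. The snippet below uses the token figures quoted above. The price per million tokens is a made-up placeholder for illustration, not a real rate.

```python
# Token usage per run, in millions of tokens, as quoted above.
tokens = {"min": 1.14, "p25": 3.09, "median": 5.82, "mean": 7.67, "p90": 13.39, "max": 45.33}

# How far the mean and the tail sit above a typical run.
ratios = {
    "mean/median": tokens["mean"] / tokens["median"],
    "p90/median": tokens["p90"] / tokens["median"],
    "max/median": tokens["max"] / tokens["median"],
}


def run_cost(million_tokens, usd_per_million):
    """Cost of one run at a flat blended price (a simplifying assumption)."""
    return million_tokens * usd_per_million


HYPOTHETICAL_USD_PER_MILLION = 10.0  # placeholder, not an actual price

for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
for key in ("median", "p90", "max"):
    print(key, round(run_cost(tokens[key], HYPOTHETICAL_USD_PER_MILLION), 2))
```

The mean sits about 1.3× above the median, and the most expensive run is nearly 8× a typical one. A budget built on the median alone would be caught out by the tail.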

This is not automatically bad. Hardware optimization is expensive even before AI enters the room. Engineer time, synthesis time, schedule delay, and missed performance targets are not free just because they arrive through payroll instead of an API bill.

But the economics cannot be hand-waved. A production workflow needs controls:

Control	Operational question	Why it matters
Agent-count policy	When do we run 1, 4, 8, or 10 agents?	Prevents token spending on saturated kernels
Frontier monitoring	Is the area–latency Pareto frontier still improving?	Turns scaling into evidence-based allocation
Early stopping	Are new agents producing dominated designs?	Cuts off low-value exploration
Candidate diversity	Are agents starting from genuinely different ILP solutions?	Avoids redundant search trajectories
Human review gates	Which transformations require engineer approval?	Keeps correctness, maintainability, and integration risk visible

The paper shows that inference-time compute can become an optimization resource. It also shows, indirectly, that this resource needs budgeting. “Run more agents” is not a strategy. “Run more agents only where the frontier is still moving” is closer.

What this means for AI engineering businesses

The immediate business value is not “fire the hardware engineers.” That interpretation is both lazy and commercially dangerous.

The better value proposition is cheaper diagnosis and broader exploration.

A company working with HLS, FPGA acceleration, or hardware/software co-design could use an agent-factory workflow to generate candidate transformations, surface surprising bottlenecks, compare area–latency trade-offs, and give engineers a richer set of options earlier in the design cycle.

That changes the role of human expertise. It does not remove it.

The engineer’s job shifts toward setting budgets, validating transformations, interpreting why certain variants win, and deciding which optimized design is maintainable enough to ship. In other words, less time manually poking every pragma combination; more time deciding which machine-generated possibilities are real engineering assets.

For AI product builders, the paper suggests a broader design pattern:

Technical contribution	Operational consequence	ROI relevance
Multi-agent variant generation	Larger candidate pool with diverse search paths	Reduces missed optimization opportunities
Correctness + synthesis feedback	Objective filtering of agent output	Prevents plausible nonsense from entering the pipeline
ILP-based selection	Resource-aware candidate composition	Keeps area budgets explicit
Stage 2 global refinement	Captures interactions missed by local search	Improves results on richer workloads
Token-cost reporting	Makes inference-time compute measurable	Enables budget-aware deployment

This is the part that generalizes beyond HLS. Agentic systems become more useful when they are embedded inside evaluation loops that are harder than language. Synthesis reports, unit tests, simulators, backtests, constraint solvers, and production telemetry can all play the same role: they turn agents from persuasive text generators into proposal engines inside measurable workflows.

The paper is not really about agents becoming brilliant. It is about agents becoming disposable enough to run in parallel, and constrained enough that bad ideas are filtered quickly.

That is less romantic than artificial general intelligence. It is also closer to how businesses actually get value from automation.

Boundaries that matter before deployment

Several limitations affect how far this result should be taken.

The benchmark set is small: twelve kernels, several of which are well-studied. Real-world HLS projects can involve messier codebases, integration constraints, verification burdens, and non-performance requirements that benchmarks do not capture.

The model and toolchain scope is also narrow. The experiments use one model family (Opus 4.5/4.6, driven through Claude Code) and one HLS tool, AMD Vitis HLS. Different LLMs, HLS tools, target devices, or coding styles may behave differently.

The baseline is meaningful but not definitive. For HLS-Eval, the baseline is a bounded exhaustive pragma search with ILP selection; for Rodinia-HLS, it is optimized reference code. The paper itself notes that it does not yet compare against advanced DSE frameworks such as AutoDSE across broader suites.

The ASIC extension is encouraging but limited. Correlation with ABC-mapped area varies across designs, especially when memory instances complicate area estimation. That is a warning against overgeneralizing FPGA-guided results to silicon cost.

Finally, the economics are unresolved. Token usage is reported, but a full ROI model would also need synthesis runtime, engineer review time, failure rates, design criticality, cloud/API pricing, and schedule value. The paper gives enough information to start that discussion, not enough to close it.

The practical lesson: factories beat heroes when feedback is strong

The strongest version of this paper is not the dramatic version.

It does not say that a general-purpose coding agent understands hardware better than hardware engineers. It does not say that more agents always win. It does not say that HLS optimization is now solved.

It says something more useful: when a domain has expensive search, measurable outputs, hard constraints, and reliable feedback tools, an agent factory can explore more of the design space than a single trajectory. The factory works because it combines generative diversity with mechanical judgment.

That distinction matters. A lone agent can hallucinate. A factory with correctness checks, synthesis results, ILP selection, and Pareto-front monitoring can turn hallucination-prone proposal generation into structured experimentation.

For HLS, that means better area–latency exploration. For businesses, it points to a more general rule for AI deployment:

Do not ask whether the agent is smart enough. Ask whether the workflow is strict enough to make many imperfect agents useful.

That is the uncomfortable edge of the paper. The future may not belong to one elegant AI assistant that knows everything. It may belong to noisy fleets of agents, each making attempts, most being discarded, a few improving the frontier.

Less oracle. More factory.

And in hardware, apparently, that can be enough.

Cognaptus: Automate the Present, Incubate the Future.

¹ Abhishek Bhandwaldar, Mihir Choudhury, Ruchir Puri, and Akash Srivastava, “Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?”, arXiv:2603.25719v2, 2026. https://arxiv.org/html/2603.25719
