A prototype begins innocently enough: a product team wants a small machine, a vehicle, a tool, a fixture, perhaps a mechanism that throws something across a room because medieval engineering apparently never left the group chat. The modern AI pitch says the agent can design it. Give it parts, constraints, and a goal; let it reason; let it test; let it improve.

The more interesting question is not whether the agent can draw something that looks mechanical. That is the easy parlour trick. The harder question is whether it can assemble parts so that the resulting object does something under physics.

That is where “Agentic Design of Compositional Machines” becomes useful [1]. The paper is not a declaration that LLMs are now autonomous mechanical engineers. It is a controlled study of a narrower, more diagnostic problem: can language-model agents compose standardised mechanical parts, express those assemblies in structured code, run them in simulation, and improve them through feedback?

The answer is: sometimes, partially, and with enough failure modes to keep a safety engineer caffeinated for a decade. Good. That is precisely why the paper matters.

The paper studies machine composition, not full engineering design

The central move is abstraction. The authors do not attempt full mechanical engineering, with manufacturing constraints, materials, tolerance stacks, safety certification, maintainability, legal liability, procurement, corrosion, field repair, and the delightful little surprises that appear when physical systems encounter users.

Instead, they define compositional machine design: a machine is assembled from standardised parts, represented as a structured construction plan, and evaluated by whether it achieves a functional goal in a simulator.

That abstraction strips the problem down to four linked steps:

  1. Choose parts with useful functional semantics.
  2. Attach them in a valid spatial structure.
  3. Simulate the resulting machine under physics.
  4. Score the behaviour against a task objective.

This is a clever framing because it avoids two unhelpful extremes. On one side, pure 3D generation can produce shapes that look plausible but do not necessarily function. On the other, industrial CAD and simulation pipelines are too complex for a first diagnostic benchmark of LLM design ability. The paper’s chosen middle ground asks whether LLM agents can reason over parts, geometry, connections, and behaviour at the same time.

That combination is nastier than it sounds. A text model may know what a catapult is. It may describe a lever, a frame, a projectile container, and a tension mechanism. It may even produce a coherent high-level plan. But a catapult is not a paragraph with wheels. A few misplaced blocks and the machine becomes sculpture. Possibly avant-garde sculpture, but sculpture nonetheless.

BesiegeField turns design into an executable loop

To make the problem testable, the authors introduce BesiegeField, an environment adapted from the machine-building game Besiege. The environment supports part-based construction, physical simulation, reward-driven evaluation, and state feedback from the simulated machine.

This matters because the paper is less about one model generating one object and more about a loop:

Design prompt
→ LLM-generated construction plan
→ Structured machine representation
→ Physics simulation
→ Performance and state feedback
→ Refinement or reinforcement learning
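
The loop above can be sketched in a few lines of Python. This is a toy illustration, not the BesiegeField API: `simulate`, `refine`, and the single `arm_length` parameter are hypothetical stand-ins for the physics simulator, the LLM refiner, and a full construction plan, and the scoring rule is invented.

```python
def simulate(arm_length):
    """Toy stand-in for a physics simulator: returns (score, feedback)."""
    # Invented behaviour: the throw score peaks when arm_length == 6.
    score = max(0.0, 10.0 - (arm_length - 6) ** 2)
    if arm_length < 6:
        feedback = "arm too short"
    elif arm_length > 6:
        feedback = "arm too long"
    else:
        feedback = "ok"
    return score, feedback


def refine(arm_length, feedback):
    """Toy stand-in for the refiner: adjusts the design using the feedback text."""
    if feedback == "arm too short":
        return arm_length + 1
    if feedback == "arm too long":
        return arm_length - 1
    return arm_length


def design_loop(initial, rounds=10):
    """Simulate, keep the best design seen so far, refine, and repeat."""
    current = initial
    best, best_score = initial, simulate(initial)[0]
    for _ in range(rounds):
        score, feedback = simulate(current)
        if score > best_score:
            best, best_score = current, score
        if feedback == "ok":
            break
        current = refine(current, feedback)
    return best, best_score


print(design_loop(2))
```

In this sketch the refiner acts on the feedback string rather than the score, which is the part of the loop the simulator makes possible.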

The simulator gives the agent a form of reality check. Not full reality, obviously; nobody should certify a bridge because a game engine nodded politely. But it is a meaningful step beyond text-only self-critique. The outcome is observable: the generated plan parses or it does not, the machine assembles without invalid collisions or it does not, and once running it moves, throws, breaks, or simply fails.

The paper uses two main benchmark tasks:

| Task | What it mainly tests | Evaluation signal |
| --- | --- | --- |
| Car | Static relational reasoning: symmetry, wheel placement, orientation, stability | Travel distance and speed |
| Catapult | Dynamic relational reasoning: leverage, projectile motion, timing, interaction among parts | Boulder throw distance and height |

The split is useful. A car mostly asks whether the model can place and orient parts into a stable rolling system. A catapult asks whether the model can coordinate mechanical causality over time. The latter is a better trap for fake understanding. It is easy to say “use leverage.” It is harder to attach the lever so that it actually launches the boulder.

The representation is part of the intelligence

One of the paper’s quietly important points is that machine representation changes what the LLM can understand.

BesiegeField can represent machines using global positions: each block has a 3D pose, and connections are recovered afterwards. The authors instead propose a construction tree representation, closer to how the machine is actually built. Each block is represented by its type, identifier, parent, and attachment face; special parts with two parents can record both connections.

That sounds like formatting trivia. It is not.

The construction tree makes mechanical composition legible as a sequence of local attachment decisions. Instead of asking the model to infer structure from floating coordinates, the representation tells it: this block attaches to that block, on this face, in this construction order. For an LLM, that is the difference between reading a bill of materials scattered across a warehouse and reading an assembly manual.
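
A minimal sketch of what such a construction tree might look like as a data structure. The field names, block types, and face labels below are assumptions made for illustration; they are not the paper's actual schema or the Besiege file format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Block:
    block_id: int
    block_type: str                      # hypothetical type name, e.g. "wheel"
    parent: Optional[int]                # id of the block it attaches to; None for the root
    face: Optional[str]                  # attachment face on the parent (made-up labels)
    second_parent: Optional[int] = None  # only for parts that connect two blocks


def validate(tree: List[Block]) -> List[str]:
    """Check that every block attaches to a block placed earlier in the build order."""
    errors: List[str] = []
    placed = set()
    for block in tree:
        if block.block_id in placed:
            errors.append(f"block {block.block_id}: duplicate id")
        if block.parent is None:
            if placed:
                errors.append(f"block {block.block_id}: only the first block may lack a parent")
        elif block.parent not in placed:
            errors.append(f"block {block.block_id}: parent {block.parent} not placed yet")
        if block.second_parent is not None and block.second_parent not in placed:
            errors.append(f"block {block.block_id}: second parent {block.second_parent} not placed yet")
        placed.add(block.block_id)
    return errors


cart = [
    Block(0, "core", None, None),
    Block(1, "beam", 0, "front"),
    Block(2, "wheel", 1, "left"),
    Block(3, "wheel", 1, "right"),
    Block(4, "spring", 2, "top", second_parent=3),
]
broken = [Block(0, "core", None, None), Block(1, "wheel", 5, "left")]

print(validate(cart))
print(validate(broken))
```

Because each record names its parent and face, a structural check like `validate` needs no geometry at all. With global coordinates, the same check would first have to infer which blocks touch.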

The ablation supports this interpretation. In the appendix, the construction-tree representation generally performs better than a global position-based representation. In the baseline construction-tree setting for the catapult designer stage, Gemini 2.5 Pro produces 8/8 file-valid outputs and 5/8 final-valid machines, with a mean score of 8.49. Under the global position-based representation, several models fail completely at file validity, and even Gemini’s mean score drops to 4.96 despite still producing 5/8 final-valid machines.

The business reading is straightforward: in AI-assisted design, the interface is not neutral. A poor representation forces the model to spend cognition reconstructing the problem. A good representation embeds the workflow’s structure into the promptable object itself. The glamour is in the agent; the leverage is in the schema. Naturally, the schema will not be invited to the launch event.

The agent workflow is a design organisation in miniature

The paper benchmarks three broad agentic approaches.

The first is a single-agent setting. The model receives the environment, available components, assembly syntax, and functional requirement, then produces a chain of reasoning and a construction plan. This is the cleanest test of baseline capability.

The second is iterative editing. A designer creates an initial plan. An inspector critiques it. A refiner modifies it. An environment querier runs simulation and summarises feedback. Search, including Monte Carlo tree search, helps select stronger candidates.

The third is hierarchical construction. A meta-designer first creates a high-level blueprint of functional blocks, then builder agents construct the machine block by block.

This is the mechanism-first heart of the paper. The authors are not simply asking which model wins a leaderboard. They are asking how much of machine design can be decomposed into agent roles: planner, critic, simulator reader, refiner, and blueprint generator.

The answer is uneven. Hierarchy helps when the top-level blueprint is reliable and the downstream builders can translate it into precise geometry. It hurts when the additional stage compounds errors. That is a familiar business lesson hiding inside a game simulator: process decomposition improves performance only when handoffs preserve the right information. Otherwise, it creates a beautifully organised failure pipeline.
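
One way to see both sides of that trade-off is a toy reliability model. It is not taken from the paper: it assumes each stage preserves the design intent independently with some fixed probability, and every number below is invented for illustration.

```python
from typing import Sequence


def pipeline_success(stage_rates: Sequence[float]) -> float:
    """Probability that every stage preserves intent, assuming independent stages."""
    p = 1.0
    for rate in stage_rates:
        p *= rate
    return p


# Invented numbers: one hard monolithic step versus two-stage decompositions.
monolithic = pipeline_success([0.5])
clean_handoff = pipeline_success([0.8, 0.8])   # easier stages, intent preserved
lossy_handoff = pipeline_success([0.6, 0.6])   # stages barely easier, errors compound

print(round(monolithic, 2), round(clean_handoff, 2), round(lossy_handoff, 2))
```

Under these assumptions, decomposition pays off only if splitting the task makes each stage substantially more reliable than the original single step.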

The benchmark results show capability, but also brittleness

The main quantitative results are best read as diagnostic evidence, not as a product benchmark.

In the catapult task, Gemini 2.5 Pro improves from a single-agent mean score of 2.30 to 9.83 under hierarchical design, while iterative editing reaches a higher maximum score of 21.95. OpenAI o3 performs better under iterative editing, with a mean score of 9.14, but drops to a mean of 2.00 under hierarchical design. For cars, Gemini is strong across settings: 33.96 mean as a single agent, 34.34 with iterative editing, and 29.96 with hierarchical design. OpenAI o3 improves sharply under hierarchy for cars, from 15.28 single-agent mean to 28.39 hierarchical mean.

That pattern matters more than any single number. Different workflows expose different bottlenecks.
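
The direction of the hierarchy effect is easier to see when the quoted means are lined up. The short script below uses only the mean scores given in the paragraph above and leaves a value as `None` where the text does not state it; the layout and function name are illustrative choices.

```python
# Mean scores quoted in the text; None marks a value the text does not give.
MEAN_SCORES = {
    ("Gemini 2.5 Pro", "catapult"): {"single": 2.30, "iterative": None, "hierarchical": 9.83},
    ("OpenAI o3", "catapult"): {"single": None, "iterative": 9.14, "hierarchical": 2.00},
    ("Gemini 2.5 Pro", "car"): {"single": 33.96, "iterative": 34.34, "hierarchical": 29.96},
    ("OpenAI o3", "car"): {"single": 15.28, "iterative": None, "hierarchical": 28.39},
}


def delta(scores, baseline, variant):
    """Change in mean score from baseline to variant, or None if either is missing."""
    a, b = scores[baseline], scores[variant]
    if a is None or b is None:
        return None
    return round(b - a, 2)


hierarchy_effect = {
    key: delta(scores, "single", "hierarchical") for key, scores in MEAN_SCORES.items()
}

for (model, task), change in hierarchy_effect.items():
    label = "n/a" if change is None else f"{change:+.2f}"
    print(f"{model:15s} {task:9s} single -> hierarchical: {label}")
```

On these figures, hierarchy adds 7.53 points for Gemini on catapults and 13.11 for o3 on cars, but costs Gemini 4.00 points on cars. The same workflow change pushes different model and task pairs in opposite directions.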

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Main model/workflow table | Main evidence | Some frontier models can produce non-trivial machines in simulation; workflow choice matters | Autonomous real-world engineering capability |
| Stage-by-stage refinement tables | Ablation / mechanism check | Environment feedback and refinement can improve valid machine performance | That feedback always improves all models |
| Representation ablation | Implementation-sensitive evidence | Construction-tree structure makes the task more LLM-legible | That this schema generalises to industrial CAD |
| Meta-designer ablation | Workflow ablation | Hierarchy helps only when the model can preserve spatial/mechanical intent | That more agents are automatically better |
| RL finetuning results | Exploratory extension | Verifiable simulation rewards can improve validity and best-case performance | That RL solves exploration, diversity, or precision |

The failure patterns are equally revealing. The paper identifies incorrect part orientations, wrong parent attachments, instruction-following failures, and flawed high-level physical reasoning. These are not random blemishes. They map onto the core capability stack required for design agents:

Mechanical concept
→ Spatial plan
→ Attachment sequence
→ Executable structure
→ Simulated behaviour

A model can succeed at the first layer and fail at the third. It can explain a catapult and still build nonsense. This is not hypocrisy; it is the difference between verbal knowledge and operational geometry.

Simulation feedback helps most when the model can ask the right questions

The environment querier is one of the more business-relevant components. It does not simply dump all simulator state into the model. That would be expensive, noisy, and context-hostile. Instead, it summarises basic feedback and can selectively query details such as position, orientation, velocity, or spring length for particular blocks and time intervals.

This is a subtle but important design principle: AI agents need diagnostic feedback, not just scores.

The appendix ablation on the environment querier shows why. Removing the querier causes only a slight average drop in some cases, but giving models only reward scores markedly degrades performance across most LLMs. For Gemini 2.5 Pro on the relevant refiner setting, the baseline mean is 15.73, compared with 14.89 without the environment querier and 9.68 with score-only feedback. For Qwen3-Coder-480B-A35B, the mean falls from 5.21 to 2.81 under score-only feedback.
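
To make the idea of selective querying concrete, here is a rough sketch. The log layout, field names, and both functions are hypothetical; the paper's querier is an LLM agent working against simulator output, not this code.

```python
from typing import Dict, List, Tuple

# Hypothetical log format: one record per (time step, block) with a few state fields.
Record = Dict[str, float]


def query(log: List[Record], block_id: int, field: str,
          t_start: float, t_end: float) -> List[Tuple[float, float]]:
    """Return (time, value) pairs for one block and one field within a time window."""
    return [
        (r["t"], r[field])
        for r in log
        if r["block_id"] == block_id and t_start <= r["t"] <= t_end
    ]


def summarise(log: List[Record], block_id: int, field: str,
              t_start: float, t_end: float) -> str:
    """Compress a query result into one line that fits comfortably in a prompt."""
    values = [v for _, v in query(log, block_id, field, t_start, t_end)]
    if not values:
        return f"block {block_id}: no {field} data in [{t_start}, {t_end}]"
    return (f"block {block_id} {field} in [{t_start}, {t_end}]: "
            f"min={min(values):.2f}, max={max(values):.2f}")


LOG = [
    {"t": 0.0, "block_id": 3, "height": 1.0, "speed": 0.0},
    {"t": 0.5, "block_id": 3, "height": 2.5, "speed": 4.0},
    {"t": 1.0, "block_id": 3, "height": 0.2, "speed": 6.0},
    {"t": 0.5, "block_id": 7, "height": 0.5, "speed": 0.0},
]

print(summarise(LOG, 3, "speed", 0.0, 1.0))
```

The contrast with score-only feedback is that the model receives a targeted slice of state it can reason about, rather than a single number or the entire trajectory.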

The lesson travels well beyond medieval siege toys. In enterprise automation, many teams still treat feedback as a scalar: pass/fail, score, KPI, approval, rejection. That is useful for ranking but weak for repair. Design agents need stateful evidence about why something failed. The diagnosis is where improvement lives.

The paper’s best warning is that appearance lies

The paper’s appendix includes a useful observation: a machine that looks intuitively correct may fail, while an awkward-looking one may perform better. That is not an aesthetic footnote. It is a warning about how humans will evaluate AI-generated designs.

Business users are especially vulnerable to plausible visuals. A generated mechanism with wheels, beams, and symmetry may feel “basically right.” But physics is not a vibes-based stakeholder. If a container is attached slightly wrong, a catapult does not throw. If gears do not align, a gear train does not transmit rotation. If a structural member collides with the frame, the machine may self-destruct before achieving anything useful.

This is why the paper’s simulator-first methodology is more serious than its game setting might suggest. It shifts evaluation from visual plausibility to executable behaviour.

That shift is also where near-term business value sits. Not in replacing engineers, but in producing candidate designs that can be filtered, diagnosed, and refined before they consume expensive human review time.

Reinforcement learning improves validity, but narrows exploration

The final major contribution is exploratory RL finetuning. The authors curate a cold-start dataset by using Gemini 2.5 Pro to generate pairs of machines and chain-of-thought (CoT) reasoning from 100 design objectives, yielding 9,984 valid examples after filtering. They then train Qwen2.5-14B-Instruct and run reinforcement learning with verifiable rewards (RLVR) in BesiegeField.

The direct result is modest but meaningful. For the catapult task, the base Qwen2.5-14B-Instruct model has 11/50 valid outputs, mean score 0.06, and max score 2.41. With cold-start plus RL, it reaches 11/50 valid outputs, mean score 0.14, and max score 7.14. For the car task, cold-start plus RL reaches 42/50 valid outputs, mean score 5.05, and max score 45.72, compared with the base model’s 46/50 validity, 4.97 mean, and 19.10 max.

The useful interpretation is not “RL fixes machine design.” It does not. The mean improvements are small in some cases, validity does not uniformly improve, and the system still struggles with precision. The more interesting finding is that RL can raise the best-case ceiling, particularly when paired with cold-start data, but tends to refine details within a strategy rather than explore radically different strategies.
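
The split between validity, mean, and ceiling can be checked directly from the figures quoted above. The snippet uses only those numbers; the dictionary layout and function name are illustrative.

```python
# Figures quoted in the text for Qwen2.5-14B-Instruct:
# (valid outputs out of 50, mean score, max score).
RESULTS = {
    "catapult": {"base": (11, 0.06, 2.41), "cold_start_rl": (11, 0.14, 7.14)},
    "car": {"base": (46, 4.97, 19.10), "cold_start_rl": (42, 5.05, 45.72)},
}


def compare(task):
    """Summarise how cold-start plus RL changes validity, mean, and best-case score."""
    base_valid, base_mean, base_max = RESULTS[task]["base"]
    rl_valid, rl_mean, rl_max = RESULTS[task]["cold_start_rl"]
    return {
        "valid_change": rl_valid - base_valid,
        "mean_change": round(rl_mean - base_mean, 2),
        "max_ratio": round(rl_max / base_max, 2),
    }


for task in RESULTS:
    print(task, compare(task))
```

On these numbers the mean moves by 0.08 in both tasks and validity is flat or lower, while the best score rises roughly 3.0x for catapults and 2.4x for cars. The gain is in the ceiling.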

The authors observe that generated machines during finetuning often keep the same high-level design approach while shifting part positions. That is exactly the exploration problem. A design agent should not merely learn to polish one catapult. It should discover multiple families of mechanisms.

This matters commercially because design is not theorem proving. In theorem proving, one valid proof may be enough. In product design, one valid candidate is rarely enough. Businesses need options: cheaper variants, safer variants, manufacturable variants, more attractive variants, variants that survive different operating conditions, and variants that do not make the procurement team stare into the middle distance.

Reward optimisation without diversity is not design intelligence. It is hill-climbing with a hard hat.

The real business pathway is simulator-backed ideation

The business relevance of this paper is not “LLMs will design your machines.” That sentence belongs in a fundraising deck, preferably near a hockey-stick graph and an unearned photograph of a factory floor.

The practical pathway is narrower and more credible:

  1. Encode design spaces using standardised parts, constraints, and executable representations.
  2. Let agents generate candidate assemblies.
  3. Use simulation to reject invalid or weak candidates early.
  4. Provide diagnostic feedback to refine promising designs.
  5. Route shortlisted candidates to human engineers and higher-fidelity tools.
  6. Over time, use verified outcomes to train domain-specialised models.
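
Steps 2 to 5 of that pathway amount to a filter-and-rank stage in front of human review. A minimal sketch, assuming a `simulate` callable that returns `None` for designs that fail to assemble and a numeric score otherwise; the candidate names and scores are invented.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Design = TypeVar("Design")


def shortlist(candidates: Sequence[Design],
              simulate: Callable[[Design], Optional[float]],
              min_score: float,
              k: int) -> List[Design]:
    """Drop invalid or weak candidates, then return the top k by simulated score."""
    scored = []
    for design in candidates:
        score = simulate(design)
        if score is None or score < min_score:
            continue  # invalid assembly or below the quality bar
        scored.append((score, design))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [design for _, design in scored[:k]]


# Invented scores standing in for simulator output; None means "did not assemble".
TOY_SCORES = {"A": 12.0, "B": None, "C": 3.0, "D": 20.0, "E": 9.5}

for_review = shortlist(list("ABCDE"), TOY_SCORES.get, min_score=5.0, k=2)
print(for_review)
```

Only the shortlist reaches engineers and higher-fidelity tools, so the threshold and `k` are where a team decides how much review time it is prepared to spend.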

That is a useful workflow for domains where simulation is cheap relative to physical prototyping and where designs can be represented compositionally. It may apply to early-stage mechanical ideation, robotics fixtures, toy mechanisms, educational design, game-engine prototyping, warehouse tooling concepts, or simplified industrial subassemblies.

But the uncertainty boundary is strict. BesiegeField is a game-derived rigid-body environment. The experiments exclude manufacturing constraints, material behaviour, tolerance analysis, fatigue, safety certification, cost optimisation, regulatory compliance, supply-chain feasibility, maintenance, and robust control policies. The paper also focuses on pure LLM-based reasoning rather than full multimodal CAD interaction.

So the near-term ROI is not autonomous invention. It is cheaper exploration and faster diagnosis under controlled abstraction.

| Paper result | Business interpretation | Boundary |
| --- | --- | --- |
| LLMs can generate non-trivial simulated machines | Agents can support early candidate ideation | Only in simplified design spaces |
| Construction-tree representation improves performance | Workflow schemas are strategic assets | Schema design must be domain-specific |
| Environment feedback improves refinement | Simulators should return diagnostic state, not only scores | Feedback must be compact and relevant |
| Hierarchical agents sometimes help | Role decomposition can improve complex design workflows | More agents can compound errors |
| RLVR improves some best-case outcomes | Verified simulation rewards can train specialised design behaviour | Diversity and precision remain unresolved |

The limitation is not the simulator; it is the missing bridge

It would be easy to dismiss the work because it uses Besiege. That would be lazy. Simulators have always been simplifications; the relevant question is whether the simplification isolates a real capability. Here, it does. The task forces LLMs to connect language, discrete parts, spatial attachment, and functional behaviour.

The deeper limitation is the bridge from simulator competence to engineering reliability. A game-like environment can tell us whether agents can begin to reason about compositional mechanisms. It cannot tell us whether they can satisfy the layered demands of real product development.

The paper is disciplined about this. It explicitly treats compositional machine design as a tractable subproblem, not the whole of engineering. That distinction should survive any business interpretation.

For Cognaptus readers, the most useful takeaway is architectural: if an AI design system is going to matter, it will not be a chatbot that “has ideas.” It will be a looped system with representations, simulators, feedback channels, search, specialised roles, and eventually reward-based learning. The intelligence is not in one prompt. It is in the machinery around the model.

Yes, the irony is sitting right there: to design machines, the agent first needs to become one.

From blueprints to agency

The old story of agentic AI was planning plus tool use. This paper suggests a more demanding version: agency as constructive experimentation. The agent must not only decide what to do; it must build an artefact, test it against the world, interpret the failure, and adjust the structure.

That is a more serious standard. It is also less flattering to today’s models. They can reason verbally about mechanisms, but still fail at part orientation. They can improve with feedback, but only when the feedback is informative. They can benefit from hierarchy, but only when the hierarchy preserves spatial intent. They can learn from reward, but risk collapsing into narrow design habits.

This is not disappointment. It is progress becoming measurable.

The paper’s value lies in giving AI design a workbench: a place where claims about creativity, reasoning, and embodied function can crash into simulated physics before crashing into anything more expensive. For business, that means the next wave of useful design agents will look less like magical inventors and more like disciplined junior engineers trapped inside a simulator: tireless, occasionally clever, frequently wrong, and much more useful when supervised by systems that know how to test them.

That is not the end of engineering. It is the beginning of a better design loop.

Cognaptus: Automate the Present, Incubate the Future.


  1. Wenqian Zhang, Weiyang Liu, and Zhen Liu, “Agentic Design of Compositional Machines,” arXiv:2510.14980, 2025, https://arxiv.org/abs/2510.14980