A bracket looks simple until someone has to manufacture it.
On a screen, a generated part can look almost right: the flange appears round, the bolt holes seem evenly spaced, and the central bore is visible enough to satisfy a casual glance. Then a machinist opens the file, measures it, and discovers the inconvenient details: the wall thickness is wrong, a boolean cut failed, two solids merely touch instead of joining, or the bounding box is off by a few millimeters.
In image generation, “almost right” can be charming. In CAD, “almost right” is just a more expensive way to be wrong.
That is the useful starting point for reading CADSmith, a paper that proposes a multi-agent system for generating CadQuery models from natural language prompts.[1] The paper is not interesting because it says large language models can write CAD code. We already know LLMs can write plausible code in specialized libraries. The harder question is whether the generated code produces a valid, dimensionally correct, manufacturing-relevant object.
CADSmith’s answer is not a better prompt. It is a workflow: decompose the task, execute the code, measure the geometry, inspect the rendered shape, and refine the model only when the evidence says it should. A little less magic, a little more engineering. Annoying for demos, excellent for reality.
The misconception: CAD is not text-to-3D with stricter taste
The obvious reader mistake is to treat text-to-CAD as a cousin of text-to-3D. Type a description, get a shape, inspect whether the shape looks reasonable. That framing is attractive because it lets us reuse the mental model of image generation: semantic intent goes in, visual plausibility comes out.
CAD is less forgiving. It is not only a visual artifact. A CAD model is a technical object with dimensional commitments, topology, coordinate frames, feature counts, and downstream consequences. A hole that appears in a render but has the wrong diameter is not a “minor visual discrepancy.” It is a failed specification. A part that has the right silhouette but contains invalid geometry is not a near-success. It is a broken file wearing a nice jacket.
CADSmith is built around this correction. The system assumes that natural language generation alone is not enough. It also assumes that visual feedback alone is not enough. The paper’s central mechanism is the combination of programmatic geometric validation with visual judgment, placed inside an iterative agent pipeline.
That distinction matters because many AI automation projects fail in a similar way. They improve the fluency of the generator while leaving the verification layer vague. The result is impressive output with no reliable way to know whether the output can be used. CAD merely exposes this weakness faster than marketing copy, because steel and tolerances have less patience than readers.
CADSmith splits one hard job into five accountable jobs
CADSmith uses five specialized agents: Planner, Coder, Executor, Validator, and Refiner. The names sound ordinary, but the architecture is doing something important. It turns a single ambiguous request into a sequence of checkable responsibilities.
| Stage | What it does | Why it matters operationally |
|---|---|---|
| Planner | Converts the user prompt into a structured design plan: components, dimensions, constraints, and notes | Prevents the coding model from improvising the specification while writing code |
| Coder | Produces executable CadQuery code using retrieved documentation and examples | Grounds generation in the current API rather than relying only on model memory |
| Executor | Runs the code in a sandbox and exports geometry while extracting kernel metrics | Separates “the code looks plausible” from “the code actually runs and produces a solid” |
| Validator | Uses exact kernel measurements plus three rendered views to judge correctness | Combines numerical precision with high-level shape awareness |
| Refiner | Revises the code based on validation feedback and prior attempts | Turns failure into targeted correction rather than another blind generation attempt |
This is mechanism-first design. The point is not that agents are fashionable. The point is that each agent owns a different failure mode.
The Planner reduces ambiguity. The Coder handles implementation. The Executor catches syntax errors, API misuse, timeouts, and invalid outputs. The Validator checks whether the resulting object matches the prompt. The Refiner tries to repair specific defects instead of asking the model to “try again,” which is the AI equivalent of tapping the vending machine and calling it maintenance.
The paper also avoids a common trap in agent discussions: pretending that agentic structure alone is a guarantee of reliability. CADSmith’s agents matter because they are connected to external evidence. The Executor is deterministic. The geometric measurements come from the OpenCASCADE kernel. The rendered views come from the generated STL. The Validator receives these artifacts, not just a self-description of success.
That is the difference between a role-played workflow and an engineering workflow.
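One way to picture the hand-offs is as typed records passed from stage to stage. The sketch below is illustrative only: the class and field names (`DesignPlan`, `ExecutionResult`, `Verdict`) are assumptions made for this article, not the schema used in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class DesignPlan:
    """Planner output: what should be built (hypothetical schema)."""
    components: list[str]
    dimensions_mm: dict[str, float]
    constraints: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)


@dataclass
class ExecutionResult:
    """Executor output: whether the script ran, and what the kernel measured."""
    ok: bool
    traceback: str = ""
    metrics: dict[str, float] = field(default_factory=dict)


@dataclass
class Verdict:
    """Validator output: pass/fail plus concrete discrepancies for the Refiner."""
    passed: bool
    issues: list[str] = field(default_factory=list)


# Illustrative values only.
plan = DesignPlan(
    components=["flange", "central bore", "bolt circle"],
    dimensions_mm={"flange_diameter": 50.0, "bore_diameter": 12.0},
    constraints=["4 bolt holes, evenly spaced"],
)
result = ExecutionResult(ok=True, metrics={"bbox_x": 48.0, "bbox_y": 48.0})
verdict = Verdict(
    passed=False,
    issues=["flange diameter measures 48.0 mm, plan says 50.0 mm"],
)
```

The useful property is that each record can be logged and inspected on its own, so a failure can be traced to the stage that introduced it.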
The inner loop fixes code; the outer loop fixes geometry
CADSmith has two correction loops, and this is where the system becomes more than a chain of prompts.
The inner loop handles execution errors. If the generated CadQuery script fails, the system captures the traceback and sends it to an Error Refiner. That refiner receives the failing code, the error, and retrieved context from two small knowledge bases: CadQuery API documentation and common error-solution patterns. The paper describes 155 Workplane method entries, 28 worked examples, and 25 error-solution patterns covering issues such as fillet radius violations, boolean failures, wire closure problems, extrusion crashes, and selector misuse.
This is a practical choice. The authors do not fine-tune a model on a large CAD corpus. They use retrieval over documentation and known failure cases. At this scale, the retrieval is keyword-based rather than embedding-based, which is less glamorous but easier to maintain. Again, tragically useful.
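To make "keyword-based" concrete, here is a minimal sketch of retrieval by keyword overlap. The entries, the scoring rule, and the function names are assumptions for illustration; the paper's actual knowledge-base format and ranking are not described in this article.

```python
import re

# Hypothetical error-solution entries; the wording is not quoted from the paper.
ERROR_PATTERNS = [
    {"keywords": {"fillet", "radius"},
     "advice": "Reduce the fillet radius below the shortest adjacent edge."},
    {"keywords": {"wire", "closed"},
     "advice": "Close the sketch before extruding."},
    {"keywords": {"boolean", "cut", "union"},
     "advice": "Check that the solids overlap before the boolean operation."},
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, entries: list[dict], top_k: int = 2) -> list[dict]:
    """Rank entries by how many of their keywords appear in the query."""
    tokens = tokenize(query)
    scored = [(len(entry["keywords"] & tokens), entry) for entry in entries]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


hits = retrieve("fillet failed: radius too large for edge", ERROR_PATTERNS)
```

With a few hundred entries, something this simple is easy to audit: when retrieval returns the wrong pattern, the reason is visible in the keyword sets.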
The outer loop begins only after the code executes. That matters because runnable code is not the same thing as correct geometry. A script can successfully generate a part that is structurally wrong, dimensionally wrong, or missing features. The outer loop asks a different question: does the produced solid satisfy the original design intent?
This is the heart of CADSmith. The system does not stop at “valid Python.” It moves from execution correctness to geometric correctness.
| Loop | Trigger | Feedback source | Failure type addressed |
|---|---|---|---|
| Inner execution loop | Code fails to run | Traceback plus retrieved API/error context | Syntax errors, API misuse, construction crashes |
| Outer geometry loop | Code runs but geometry may be wrong | Kernel metrics, rendered views, Judge feedback, prior attempts | Wrong dimensions, missing features, malformed shapes, false convergence |
The outer loop can run up to five refinement iterations. The paper reports that, in the full pipeline, 88 of 100 benchmark entries converged immediately at iteration 0, and the average number of refinement iterations was only 0.13. That number is useful because it prevents a naive interpretation: CADSmith is not succeeding by endlessly hammering every prompt until something works. Most cases pass quickly, while the loop remains available for the cases where validation catches a problem.
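The control flow of the two loops can be sketched in a few lines. This is a toy under stated assumptions: the four callables stand in for the Executor, Error Refiner, Validator, and Refiner, the inner-loop budget `MAX_REPAIRS` is invented here, and only the outer cap of five refinements comes from the paper.

```python
from typing import Callable

MAX_REFINEMENTS = 5  # outer-loop cap reported in the paper
MAX_REPAIRS = 3      # assumption: the inner-loop budget is not stated in this article


def run_pipeline(
    code: str,
    execute: Callable[[str], tuple[bool, str]],    # code -> (ran_ok, traceback or metrics)
    repair: Callable[[str, str], str],             # (code, traceback) -> new code
    validate: Callable[[str], tuple[bool, str]],   # metrics -> (passed, feedback)
    refine: Callable[[str, str, list[str]], str],  # (code, feedback, history) -> new code
) -> tuple[str, int, bool]:
    """Return (final code, refinement iterations used, success flag)."""
    history: list[str] = []
    for iteration in range(MAX_REFINEMENTS + 1):   # iteration 0 is the first attempt
        # Inner loop: repair until the script runs or the budget is spent.
        ran_ok, output = execute(code)
        repairs = 0
        while not ran_ok and repairs < MAX_REPAIRS:
            code = repair(code, output)
            ran_ok, output = execute(code)
            repairs += 1
        if not ran_ok:
            return code, iteration, False
        # Outer loop: the script ran, so now judge the geometry it produced.
        passed, feedback = validate(output)
        if passed:
            return code, iteration, True
        if iteration < MAX_REFINEMENTS:
            history.append(feedback)
            code = refine(code, feedback, history)
    return code, MAX_REFINEMENTS, False


# Toy stand-ins: the "script" is a string and its only metric is a width.
def toy_execute(code):
    if "oops" in code:
        return False, "SyntaxError: unexpected token 'oops'"
    return True, code.split("=")[1]


def toy_repair(code, traceback):
    return code.replace("oops", "")


def toy_validate(metrics):
    width = float(metrics)
    return width == 50.0, f"width is {width} mm, expected 50.0 mm"


def toy_refine(code, feedback, history):
    return "width=50"


final_code, iterations, ok = run_pipeline(
    "width=48oops", toy_execute, toy_repair, toy_validate, toy_refine
)
```

In the toy run, the inner loop removes the syntax error first, and only then does the outer loop notice that the width is wrong. The ordering is the point: geometry is never judged on a script that did not run.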
The Validator needs numbers and images because each catches a different lie
The Validator receives four inputs: the original prompt, the generated CadQuery code, exact kernel measurements, and a three-view render of the generated part. The kernel metrics include volume, bounding box dimensions, center of mass, face/edge/vertex counts, and solid validity. The three rendered views cover an isometric angle, a high-angle rear view, and a front profile.
The important detail is not merely that a vision-language model is used as a Judge. The important detail is what the Judge is asked to cross-check.
Kernel measurements catch numerical errors. If a prompt asks for a 50 mm flange and the generated part measures 48 mm, a render may not reveal the discrepancy clearly, but the kernel measurement can. If a solid is not watertight, the deterministic validity check fails it regardless of the Judge’s opinion. This is the adult supervision portion of the system.
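A minimal sketch of that numeric gate, assuming the measurements have already been extracted into a dictionary. The function name, the dictionary keys, and the 0.1 mm tolerance are assumptions for illustration, not values from the paper.

```python
def check_measurements(expected_mm, measured_mm, is_valid_solid, tol_mm=0.1):
    """Return concrete discrepancies; an empty list means the numeric checks pass."""
    issues = []
    if not is_valid_solid:
        # Hard failure: no amount of visual plausibility overrides this.
        issues.append("solid failed the kernel validity check")
    for name, target in expected_mm.items():
        actual = measured_mm.get(name)
        if actual is None:
            issues.append(f"{name}: not measured")
        elif abs(actual - target) > tol_mm:
            issues.append(
                f"{name}: measured {actual:.2f} mm, expected {target:.2f} mm"
            )
    return issues


issues = check_measurements(
    expected_mm={"flange_diameter": 50.0, "thickness": 6.0},
    measured_mm={"flange_diameter": 48.0, "thickness": 6.0},
    is_valid_solid=True,
)
```

Returning a list of specific discrepancies, rather than a bare pass/fail, is what lets a refinement step aim at the 2 mm error instead of regenerating the whole part.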
Visual inspection catches global shape errors. A part can have a plausible bounding box and volume while still being constructed in the wrong way. Features may be misplaced. Holes may be missing. A shape may satisfy some aggregate metrics while violating the design structure. The paper calls attention to this “false convergence” problem: numerical metrics can look acceptable even when the object is fundamentally wrong.
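A small worked case shows how aggregate metrics can agree while the part is wrong. The example is constructed for this article and is not taken from the paper: a plate centred on the origin with two through-holes placed symmetrically at plus and minus an x-offset. The face count assumes one cylindrical face per hole.

```python
import math


def plate_with_two_holes(length, width, thickness, hole_d, hole_offset_x):
    """Aggregate metrics for a centred plate with holes at (+/-hole_offset_x, 0)."""
    hole_volume = math.pi * (hole_d / 2) ** 2 * thickness
    return {
        "volume": length * width * thickness - 2 * hole_volume,
        "bbox": (length, width, thickness),
        # Symmetric holes leave the centre of mass on the plate's centre.
        "center_of_mass": (0.0, 0.0, thickness / 2),
        # Six planar faces plus one cylindrical wall per hole (assumed).
        "faces": 6 + 2,
    }


specified = plate_with_two_holes(80, 40, 5, 6, hole_offset_x=30)
generated = plate_with_two_holes(80, 40, 5, 6, hole_offset_x=10)
indistinguishable = specified == generated
```

The holes sit 20 mm away from where the prompt put them, yet volume, bounding box, centre of mass, and face count are identical. A rendered view, or a feature-level position check, is what separates the two parts.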
Neither channel is sufficient alone. Visual feedback without measurements is too vague for millimeter-level repair. Numerical feedback without visual context can miss structural mismatch. CADSmith’s Validator is valuable because it forces these two kinds of evidence into the same decision process.
The authors also assign the Judge role to a stronger model than the one used for generation. The generation agents use Claude Sonnet, while the Judge uses Claude Opus. The purpose is to reduce self-confirmation bias: the system should not rely on the same model to generate a part and then confidently approve its own work. Anyone who has reviewed their own spreadsheet at 1 a.m. understands the problem.
The benchmark tests explicit, dimensioned CAD tasks rather than vague shape prompts
The paper’s benchmark contains 100 natural-language prompts paired with hand-written CadQuery reference scripts. The authors divide the benchmark into three tiers:
| Tier | Count | Description | Typical complexity |
|---|---|---|---|
| T1 | 50 | Basic primitives | Boxes, cylinders, cones, tori, prisms, domes; one to three operations |
| T2 | 25 | Engineering parts | Brackets, flanges, gears, shafts, plates with hole patterns; three to eight operations |
| T3 | 25 | Complex parts | Workplane changes, lofts, sweeps, shells, revolves, multi-body unions; five to fifteen operations |
This benchmark design deserves attention because it clarifies what the result does and does not prove. The prompts specify explicit millimeter dimensions, orientations, origins, and coordinates for features such as holes and slots. The reference scripts are hand-written, executed, and visually inspected.
That makes the benchmark more controlled than a casual “generate me a cool part” task. It also makes the result narrower. CADSmith is being tested on explicit, single-part, dimensioned prompts with reference geometries. The paper is not claiming general mastery of every industrial CAD workflow, every assembly constraint, or every manufacturing process plan. Good. The industry has enough grand claims taped over narrow experiments.
The evaluation metrics are also deliberately absolute rather than normalized. The paper uses Chamfer Distance, F1 Score, and volumetric IoU, computed in millimeter space. This matters because normalized metrics can erase dimensional accuracy: a 10 mm box and a 100 mm box can look identical after normalization, even though they are not the same part.
In CAD, scale is not decoration. It is the contract.
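The normalization problem is easy to reproduce. The sketch below uses one common definition of Chamfer Distance (mean nearest-neighbour distance, summed over both directions) on the eight corners of two cubes; the paper's exact formulation and sampling may differ, and the helper names are invented here.

```python
import math
from itertools import product


def chamfer(a, b):
    """Symmetric Chamfer Distance: mean nearest-neighbour distance, both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def normalize(points):
    """Centre the cloud and scale its longest bounding-box side to 1."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    scale = max(hi - lo for lo, hi in zip(mins, maxs))
    centre = [(lo + hi) / 2 for lo, hi in zip(mins, maxs)]
    return [tuple((p[i] - centre[i]) / scale for i in range(3)) for p in points]


def box_corners(side_mm):
    """Corner points of an axis-aligned cube with one corner at the origin."""
    return [tuple(c * side_mm for c in corner) for corner in product((0, 1), repeat=3)]


small, large = box_corners(10), box_corners(100)
absolute = chamfer(small, large)                         # large: the parts differ
normalized = chamfer(normalize(small), normalize(large))  # zero: the difference is erased
```

After normalization the 10 mm cube and the 100 mm cube are the same point cloud, so any metric computed on them reports a perfect match. In millimeter space the mismatch is impossible to miss.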
The headline result is not just better median quality; it is fewer catastrophic failures
The overall results compare three configurations: a zero-shot baseline, a no-vision ablation, and the full CADSmith pipeline with vision.
| Configuration | Execution rate | Median Chamfer Distance | Mean Chamfer Distance | Median F1 | Median IoU |
|---|---|---|---|---|---|
| Zero-shot | 95% | 0.55 | 28.37 | 0.9707 | 0.8085 |
| No vision | 99% | 0.48 | 18.19 | 0.9792 | 0.9563 |
| Full pipeline | 100% | 0.48 | 0.74 | 0.9846 | 0.9629 |
The easy reading is “the full pipeline improves the scores.” True, but too shallow.
The more useful reading is that the median Chamfer Distance barely moves between the no-vision ablation and the full pipeline: both are reported as 0.48. The mean Chamfer Distance, however, collapses from 18.19 in the no-vision setting to 0.74 in the full pipeline. Against the zero-shot baseline, the mean drops from 28.37 to 0.74.
That gap between median and mean is the story. The median says typical cases are already fairly good. The mean exposes outlier disasters. CADSmith’s full pipeline is valuable because it reduces catastrophic mismatches, not merely because it polishes already-good outputs.
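The effect is simple to demonstrate with made-up numbers. The values below are synthetic and chosen only to show the pattern; they are not the paper's per-part results.

```python
from statistics import mean, median

# Synthetic Chamfer Distances in mm, NOT data from the paper.
typical = [0.4, 0.5, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5, 0.5]
with_disaster = typical + [250.0]  # one part is badly wrong
repaired = typical + [1.2]         # the same part after a successful refinement

summary = {
    "with_disaster": (median(with_disaster), mean(with_disaster)),
    "repaired": (median(repaired), mean(repaired)),
}
```

Fixing one outlier leaves the median where it was and moves the mean by more than an order of magnitude. A results table with a flat median and a collapsing mean is describing exactly this kind of repair.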
This matters for business use. A workflow that produces 90 acceptable files and 10 silent disasters is not a 90% success. It is an expensive inspection problem. In engineering automation, the worst cases often dominate the operational cost: rework, delays, manual review, supplier confusion, and the quiet humiliation of discovering that the “AI-generated part” cannot actually be manufactured.
The execution rate also moves from 95% in zero-shot to 100% in the full pipeline. That is not the most conceptually interesting result, but it is operationally important. Code that does not execute cannot enter a CAD workflow. The inner loop’s value is basic and therefore easy to underappreciate: it turns broken scripts into runnable artifacts before geometric validation even begins.
The ablation shows that vision matters mainly when geometry becomes structurally complex
The no-vision ablation is not a decorative experiment. Its likely purpose is to isolate whether kernel metrics alone can guide refinement. The result is nuanced.
For simpler parts, the no-vision system performs comparably. This makes sense. If a task is a box, cylinder, cone, or simple bracket, bounding box dimensions, volume, face counts, and validity checks often provide enough information. You do not need a philosophical discussion with a vision model to determine whether a cylinder has the right radius.
The picture changes in T3.
For complex parts, removing the rendered image from the Judge increases mean Chamfer Distance from 1.42 to 49.68 and lowers mean F1 from 0.85 to 0.74. That is not a small penalty. It means the kernel-only version can be confidently wrong when the geometry involves multiple features, workplane changes, sweeps, shells, or interacting parts.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Zero-shot baseline | Main comparison | One-pass prompting is weaker than closed-loop generation and validation | That CADSmith beats all possible prompting or fine-tuned systems |
| No-vision ablation | Component ablation | Kernel metrics alone are insufficient for complex geometries | That the chosen three-view rendering is the best possible visual feedback design |
| Per-tier breakdown | Difficulty sensitivity test | Performance degrades as part complexity increases | That the system is ready for arbitrary assemblies or production CAD standards |
| Failure-mode analysis | Boundary diagnosis | Some near-miss manufacturing failures still evade both metrics and Judge | That the whole approach is unreliable |
This is where the mechanism-first interpretation pays off. If we only summarize the paper, we say “vision helps.” If we read the mechanism, we can say more precisely: vision helps when aggregate geometric measurements fail to represent structural intent.
That sentence is much more useful for anyone designing an AI engineering assistant.
The strongest result still leaves a manufacturing-shaped hole
The paper is refreshingly explicit about a subtle failure. In one T3 case, a quadcopter frame achieved strong metrics: F1 = 0.963 and IoU = 0.985. The Validator accepted it on the first iteration. Yet the generated part contained small gaps between the arms and the central hub.
This is the uncomfortable lesson: even combined kernel metrics and visual judging can miss near-miss defects.
The failure is not a contradiction of the paper’s thesis. It refines it. CADSmith shows that programmatic and visual feedback substantially improve reliability, but it also shows that fixed-view visual inspection is not equivalent to manufacturability validation. A few small gaps may be hard to see in the rendered views. Aggregate metrics may remain high because most of the surface aligns with the reference. The file can be close, valid-looking, and still wrong in the place that matters.
For practical deployment, this points to the next layer of validation. Complex interfaces need local inspections, adaptive camera views, joint-level checks, boolean-union diagnostics, section cuts, and perhaps feature-specific rules. A system that can generate a quadcopter frame should not merely ask whether the frame looks like a frame. It should ask whether every arm actually connects to the hub.
That sounds painfully specific. It is also exactly what separates engineering automation from content automation.
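A joint-level check of that kind might look like the sketch below. It is a deliberately crude stand-in: each body is reduced to an axis-aligned bounding box, where a real implementation would query the CAD kernel for the minimum distance between solids. The function names, the box coordinates, and the tolerance are all assumptions, and nothing here is taken from the paper.

```python
def aabb_gap(a, b):
    """Distance between two boxes given as (min_xyz, max_xyz); 0.0 if they touch or overlap."""
    (amin, amax), (bmin, bmax) = a, b
    per_axis = [max(bmin[i] - amax[i], amin[i] - bmax[i], 0.0) for i in range(3)]
    return sum(g * g for g in per_axis) ** 0.5


def disconnected_arms(hub, arms, tol_mm=1e-3):
    """Names of arms whose bounding box sits more than tol_mm away from the hub."""
    return [name for name, box in arms.items() if aabb_gap(hub, box) > tol_mm]


# Hypothetical frame: a square hub and two arms along the x axis.
hub = ((-20, -20, 0), (20, 20, 5))
arms = {
    "arm_east": ((20, -5, 0), (120, 5, 5)),      # reaches the hub face
    "arm_west": ((-120, -5, 0), (-20.3, 5, 5)),  # stops 0.3 mm short
}
floating = disconnected_arms(hub, arms)
```

A 0.3 mm gap barely changes volume or IoU on a frame this size, but a per-joint distance check flags it immediately. Note that a zero gap is necessary, not sufficient: two solids that merely touch still need to be fused into one body, which is a separate kernel-level check.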
What the paper directly shows, and what Cognaptus infers for business use
The business relevance of CADSmith is not “AI will replace CAD designers.” That is the lazy version, and like most lazy versions, it is both louder and less useful.
The more grounded interpretation is that agentic CAD systems can reduce the cost of moving from explicit design intent to executable draft geometry, especially when the workflow includes measurable validation. This is most relevant in settings where the part specification is already clear enough to be written down: rapid prototyping, quote-to-CAD workflows, internal engineering assistants, educational CAD tools, and manufacturing design automation for repeatable part families.
| Paper result | Business interpretation | Boundary |
|---|---|---|
| Five-agent pipeline decomposes planning, coding, execution, validation, and refinement | CAD automation should be organized as a checked workflow, not a single model response | Adds orchestration complexity and requires careful logging/debugging |
| RAG over CadQuery API documentation and error patterns avoids fine-tuning | Domain tools can be kept current by maintaining knowledge bases rather than retraining models | Keyword retrieval works at this corpus size; larger libraries may need stronger retrieval infrastructure |
| Full pipeline reaches 100% execution on the benchmark | Execution repair can reduce one class of manual cleanup | Execution success does not imply design correctness |
| Mean Chamfer Distance drops from 28.37 to 0.74 versus zero-shot | Closed-loop validation reduces catastrophic geometry failures | Benchmark is custom and limited to 100 prompts |
| Vision is crucial for T3 complex parts | Visual review is most valuable when structural intent cannot be captured by simple metrics | Fixed views still miss small gaps and local joint defects |
| Near-miss quadcopter failure passes validation | Manufacturing-readiness needs more than global similarity metrics | Local manufacturability checks remain future work |
The ROI pathway is therefore not “replace every CAD operator.” It is narrower and more plausible: shorten early drafting cycles, reduce repetitive script-writing, catch obvious failures before human review, and turn design intent into a first-pass parametric model that engineers can inspect and modify.
That is still valuable. In many businesses, the bottleneck is not genius-level design. It is the long tail of ordinary geometry: brackets, plates, adapters, fixtures, housings, shafts, templates, mounts, and variants thereof. If a system can generate and validate first drafts for those parts, the gain is not creative brilliance. It is reduced friction.
And reduced friction, unlike hype, actually invoices well.
How to read CADSmith as an AI architecture lesson
CADSmith is a CAD paper, but the architecture generalizes to other operational AI systems that must produce usable artifacts under constraints.
The pattern is simple:
- Convert ambiguous user intent into a structured specification.
- Generate an artifact using domain-specific tools.
- Execute or simulate the artifact in a controlled environment.
- Extract objective measurements.
- Use an independent evaluator for aspects that metrics miss.
- Refine based on concrete discrepancies, not vibes.
- Preserve failure histories so the system does not repeat the same mistake politely.
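The last item in the list is the easiest to neglect. A minimal sketch of a failure history follows, assuming artifacts are text (such as generated scripts) and that matching on whitespace-normalized content is good enough. The class and its methods are hypothetical and do not describe how CADSmith stores prior attempts.

```python
import hashlib


class FailureHistory:
    """Remembers rejected artifacts and the reason each one was rejected."""

    def __init__(self):
        self._seen = {}

    @staticmethod
    def _key(artifact: str) -> str:
        # Collapse whitespace so trivially reformatted retries still match.
        normalized = " ".join(artifact.split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def record(self, artifact: str, reason: str) -> None:
        self._seen[self._key(artifact)] = reason

    def already_failed(self, artifact: str):
        """Return the earlier rejection reason, or None if the artifact is new."""
        return self._seen.get(self._key(artifact))

    def reasons(self) -> list[str]:
        """All recorded reasons, suitable for passing to the next refinement step."""
        return list(self._seen.values())


history = FailureHistory()
history.record("box(10, 20,  30)", "height is 30 mm, expected 25 mm")
repeat = history.already_failed("box(10, 20, 30)")  # same script, different spacing
fresh = history.already_failed("box(10, 20, 25)")
```

Checking a candidate against the history before executing it saves a validation round, and handing `reasons()` to the next attempt gives the generator something more specific than "try again."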
This pattern applies beyond CAD: financial models, compliance documents, data pipelines, legal drafting support, industrial process plans, and code generation all face the same basic problem. A fluent model can produce something that looks correct to a non-expert. The organization still needs a way to know whether it is correct enough to use.
The key is that verification must be native to the workflow. It cannot be pasted on at the end as a performance dashboard. CADSmith’s validation is inside the loop, which means the system can act on failures while the artifact is still being formed.
That is the difference between evaluation as a report card and evaluation as a steering wheel.
The boundary: this is promising, not production-complete
CADSmith should not be read as a finished industrial CAD platform. The paper’s strongest evidence comes from a custom benchmark of 100 explicit prompts, mainly focused on single-part geometry. The T3 tier is meaningfully more complex than primitives, but it is still not the same as multi-part assemblies with tolerances, materials, mating constraints, load requirements, cost constraints, machining processes, supplier standards, or version-controlled engineering change orders.
The system also depends on model behavior, Judge reliability, and prompt-specific clarity. It uses fixed rendered views, which the paper itself shows can miss small local defects. And although absolute-space metrics are more appropriate than normalized metrics for dimensioned CAD, mesh similarity still does not fully answer the manufacturing question.
These limitations do not weaken the paper's main lesson. They prevent us from overstating it.
The practical conclusion is this: CADSmith demonstrates a credible architecture for AI-assisted CAD drafting, not a turnkey replacement for engineering review. It shows that closed-loop geometric validation can make LLM-generated CAD far more reliable. It also shows that reliability is layered. Execution validation, dimensional validation, visual validation, and manufacturability validation are related, but they are not the same thing.
Businesses considering this kind of system should therefore ask a more precise question than “Can AI generate CAD?”
A better question is: which parts of our CAD workflow can be specified clearly, validated automatically, and escalated safely when validation fails?
That question is less exciting. It is also far more likely to save money.
Conclusion: the blueprint is the loop
The title of the paper could tempt us into a familiar agent narrative: more agents, better results. That is not quite the lesson.
CADSmith works because the agents are embedded in evidence loops. The Planner structures intent. The Coder uses retrieved CAD knowledge. The Executor turns code into measurable geometry. The Validator compares numbers and views against the prompt. The Refiner repairs specific failures. The system’s intelligence is not located in any single model call. It is distributed across the checks between them.
That is why the paper matters for Cognaptus readers. The future of useful AI in business will not be built only from larger models or more theatrical prompts. It will be built from workflows that know how to verify themselves.
CAD makes this obvious because the penalty for being wrong is physical. But the principle is broader: wherever AI outputs become operational inputs, correctness needs architecture.
A prompt can sketch an idea.
A loop can build a system.
Cognaptus: Automate the Present, Incubate the Future.
[1] Jesse Barkley, Rumi Loghmani, and Amir Barati Farimani, “CADSmith: Multi-Agent CAD Generation with Programmatic Geometric Validation,” arXiv:2603.26512, 2026, https://arxiv.org/html/2603.26512.