Measure Twice, Generate, Then Look Again

TL;DR for operators

A CAD assistant that writes code once and hopes for the best is not an engineering workflow. It is a raffle with syntax highlighting.

IterCAD is interesting because it treats CAD generation and editing as an iterative operating loop: read the drawing, generate CadQuery code, execute it in a sandbox, inspect compiler and geometric feedback, revise, and stop only when the model has evidence that the shape is right.¹ The paper’s practical contribution is not “AI can design parts now.” That would be the usual confetti cannon, and mercifully not the correct lesson. The better lesson is that useful CAD automation needs closed-loop verification, localized visual grounding, and evaluation metrics that count failures instead of quietly hiding them in the basement.

The strongest result is on IterCAD-Draw, the paper’s drawing-to-code benchmark. In the agentic workflow setting, IterCAD reports an invalid ratio of 0.30%, AUC-TR of 0.61, and mean Chamfer Distance of 5.09, outperforming the compared proprietary and open-source systems on joint executability and geometric fidelity. The editing result is more qualified: IterCAD is far better than its Qwen3.5-4B backbone and has very low invalidity, but GPT-5 leads on AUC-TR and geometric error in IterCAD-Edit. That distinction matters. The paper supports a workflow thesis, not a universal model-supremacy thesis.

For engineering and manufacturing teams, the business implication is clear but bounded: the value is in automating draft-inspect-repair cycles for single-part parametric CAD, especially where drawings are dimensioned, standards are controlled, and generated code can be executed automatically. The open questions are equally important: assemblies, proprietary drawing standards, manufacturing intent, feature semantics, and maintainable human-readable design trees are not solved by making the loop prettier.

CAD generation fails because the first answer is treated as the final answer

Most business readers will be tempted to frame IterCAD as another “text or image to CAD” paper. That is technically adjacent and strategically wrong.

The paper’s opening diagnosis is that existing CAD automation often follows an open-loop, one-shot pattern: a model receives a prompt, image, or drawing and emits a complete program. This can work when the object is simple and the evaluation is forgiving. But CAD is not poetry. A part can look plausible while hiding a wrong radius, a missing cut, an invalid topology, or a parameter choice that makes the next edit miserable. The model does not merely need to speak CAD. It needs to survive contact with geometry.

IterCAD’s first move is therefore organizational rather than cosmetic. It formulates CAD generation as an interaction between a multimodal agent and an executable CAD sandbox. The agent does not just produce code. It acts, observes, and revises. At each turn, it emits a structured reasoning trace and either a CadQuery code block or a completion token. The sandbox then returns feedback: compiler errors, execution status, rendered views, and geometry-derived dimensional information.

That changes the problem. The model is no longer being asked to perform a miracle in one pass. It is being asked to participate in a controlled inspection loop. How refreshing: a machine-learning paper has rediscovered that engineers check their work.

The paper calls this philosophy “Look and Loop.” The “Look” component uses multi-view engineering drawings with dimensions as persistent spatial anchors. The “Loop” component uses execution and visual feedback to drive iterative correction. The distinction matters because many agent systems already loop in a superficial sense. They retry. They patch. They hallucinate with stamina. IterCAD tries to make the loop less blind by grounding it in CAD-specific visual and dimensional evidence.
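
The control flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `Turn`, `Feedback`, `look_and_loop`, and both stubs are hypothetical names, the "code" strings are placeholders rather than real CadQuery, and the stub sandbox only pattern-matches text instead of executing anything.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    """What a sandbox hands back after one execution (hypothetical shape)."""
    ok: bool                # did the program run and yield a valid solid?
    error: Optional[str]    # compiler or runtime message, if any
    report: str             # stand-in for rendered views and dimensional annotations


@dataclass
class Turn:
    reasoning: str
    code: Optional[str]     # None stands for the agent's completion token


History = List[Tuple[Turn, Feedback]]


def look_and_loop(drawing: str,
                  agent: Callable[[str, History], Turn],
                  sandbox: Callable[[str], Feedback],
                  max_turns: int = 5) -> Tuple[Optional[str], History]:
    """Alternate agent actions and sandbox feedback until the agent stops."""
    history: History = []
    last_code = None
    for _ in range(max_turns):
        turn = agent(drawing, history)
        if turn.code is None:        # the agent decided the shape is right
            break
        feedback = sandbox(turn.code)
        history.append((turn, feedback))
        last_code = turn.code
    return last_code, history


# Toy stand-ins so the loop can be exercised without a model or a CAD kernel.
def stub_agent(drawing: str, history: History) -> Turn:
    if not history:
        return Turn("first attempt", "plate = box(40, 20, 5).unsupported_op()")
    _, feedback = history[-1]
    if not feedback.ok:
        return Turn("repair: drop the unsupported call", "plate = box(40, 20, 5)")
    return Turn("feedback matches the drawing; stop", None)


def stub_sandbox(code: str) -> Feedback:
    if "unsupported_op" in code:
        return Feedback(False, "AttributeError: unsupported_op", "")
    return Feedback(True, None, "bounding box 40 x 20 x 5")


final_code, history = look_and_loop("drawing.png", stub_agent, stub_sandbox)
print(final_code, len(history))
```

The design point is that stopping is an action the agent takes after seeing feedback, not a side effect of running out of retries.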

The mechanism is the product: drawing, sandbox, feedback, revision

A mechanism-first reading is the right one because IterCAD’s gains are not attributable to a single architectural charm. The paper assembles a pipeline whose parts work together:

Mechanism	What it does	Operational meaning
Multi-view engineering drawings	Provide orthographic views and dimensions as input anchors	The model has structured references, not just vague images or prose
CadQuery code generation	Produces executable parametric CAD scripts	Outputs can be inspected, edited, and rerun
CAD sandbox	Executes programs and catches syntax, runtime, and geometric failures	The system gets evidence instead of relying on model confidence
Visual and dimensional feedback	Projects generated solids into standard views and computes annotations	Errors become more local than a raw point-cloud distance
Progressive SFT	Teaches basic generation, editing, and refinement behaviors from curated trajectories	The model learns the workflow before RL starts pushing it
Geometry-aware RL	Rewards executable, geometrically closer outputs	Training optimizes for the loop’s actual target, not just imitation
Geometry-Viable Prefix Masking	Avoids punishing viable early turns for later corrupted suffixes	Multi-turn credit assignment becomes less stupid, which is a low bar but a useful one
CD-TR and AUC-TR	Count failed generations in the denominator	Evaluation stops rewarding systems for only measuring survivors

The core business insight is that CAD automation is not just a generation problem. It is a verification architecture problem. A company that deploys a generic model on top of a CAD library may get demos. A company that deploys a model inside an execution-and-feedback loop has a chance at workflow utility.

This is the recurring pattern across practical AI systems: capability is often less valuable than recoverability. A model that gets the first answer right 70% of the time may still be operationally weak if the remaining 30% are hard to detect. A model that gets the first answer partly wrong but can diagnose and repair its own output inside a governed environment may be more useful. Not glamorous. Useful.

The training recipe teaches the model to repair, not merely to recite

The paper’s data strategy is built around three kinds of paired CAD tasks.

First, Drawing-Code pairs are built by synthesizing CAD programs and converting the resulting STEP files into standard-compliant multi-view engineering drawings through the SolidWorks COM interface. These drawings include orthographic views and dimensional annotations, which are much closer to engineering practice than a pretty render floating in space.

Second, Text-Code pairs are built from Text2CAD descriptions. The paper refines raw descriptions into standardized instructions and converts them into CadQuery programs, filtering by execution and Chamfer Distance.

Third, Edit-Code pairs are produced by starting from valid CAD scripts and applying controlled degradations: parameter perturbations, misplaced vertices, feature substitutions, and similar defects. The model then receives an edit instruction and learns to recover the intended target while preserving unaffected code.
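
A parameter perturbation is the easiest of those degradations to picture. The sketch below is an illustration only: `perturb_parameter`, the scaling range, the instruction wording, and the toy script (including its `make_plate` call) are all assumptions, not the paper's pipeline. It scales one named numeric assignment and returns the degraded script together with the instruction that would restore it.

```python
import random
import re


def perturb_parameter(script, name, scale_range=(1.2, 1.8), rng=None):
    """Scale one `name = <number>` assignment; return (degraded_script, instruction).

    Assumes the parameter is a non-zero literal alone on its own line.
    """
    rng = rng or random.Random(0)
    pattern = re.compile(
        rf"^({re.escape(name)}\s*=\s*)([0-9]+(?:\.[0-9]+)?)[ \t]*$",
        re.MULTILINE,
    )
    match = pattern.search(script)
    if match is None:
        raise ValueError(f"no numeric assignment found for {name!r}")
    target_text = match.group(2)
    wrong = round(float(target_text) * rng.uniform(*scale_range), 2)
    # Only the matched line is rewritten; every other line is left untouched.
    degraded = pattern.sub(lambda m: f"{m.group(1)}{wrong}", script, count=1)
    instruction = f"Set {name} back to {target_text}; it is currently {wrong}."
    return degraded, instruction


# Placeholder script: the assignments matter here, not the final call.
script = (
    "length = 40\n"
    "width = 20\n"
    "height = 5\n"
    "result = make_plate(length, width, height)\n"
)
degraded, instruction = perturb_parameter(script, "height")
print(degraded)
print(instruction)
```

The training pair is then (degraded script plus instruction) as input and the original script as target, which is what makes "preserve unaffected code" a checkable property rather than a hope.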

This matters because many AI systems are trained on final answers and then expected to perform processes. IterCAD trains on process. The authors synthesize multi-turn trajectories using a stronger teacher model, then filter them for format compliance, reasoning-action coherence, and geometric correctness. The model is not only learning what correct CAD code looks like. It is learning the choreography of correction: inspect, diagnose, edit locally, rerun, and decide whether to stop.

The supervised stage uses 28K trajectories: 20K expert trajectories for the first phase and 8K on-policy refinement trajectories for the second. The reinforcement learning stage then focuses on 2K hard Drawing-to-Code samples, which is a sensible choice because this is where multi-view reasoning and geometric self-correction are most stressed.

The training sequence is not decorative. It answers a practical question: how do you prevent an agent from turning multi-turn freedom into multi-turn chaos?

The paper’s answer is: first teach the behavior by imitation, then reinforce the objective under executable feedback, then prevent bad suffixes from corrupting credit assignment.

Geometry-Viable Prefix Masking is a small mechanism with a large managerial lesson

The most operationally interesting component may be Geometry-Viable Prefix Masking, or GVPM. The name is not warm and inviting. It sounds like something found inside a compliance database. But the idea is important.

In multi-turn CAD generation, an early turn can be useful even if a later turn ruins the trajectory. Suppose the agent produces a valid base plate in turn one, then introduces an invalid operation in turn two, then spirals into repair attempts that compound the error. A sequence-level reinforcement learning method can assign the whole trajectory a bad outcome. If handled naively, training may punish the early good behavior along with the later failure.

GVPM tries to avoid that. It monitors execution cascades and geometry stalls. If the trajectory hits consecutive runtime errors or stops improving geometrically while remaining above a quality gate, the method identifies a boundary and masks later tokens from the loss. It also applies an advantage clamp so the viable prefix can still be reinforced but is not punished for downstream collapse.
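
One way to picture the boundary logic is the sketch below. It is a reading of the description above, not the paper's code: the thresholds, the interpretation of the quality gate as "Chamfer Distance still above `cd_gate`", the choice to cut at the first turn of the bad run, and every function name are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TurnRecord:
    n_tokens: int          # tokens the agent generated in this turn
    executed: bool         # did the sandbox run the code successfully?
    cd: Optional[float]    # Chamfer Distance to the target; None if execution failed


def viable_prefix_length(turns: List[TurnRecord],
                         max_consecutive_errors: int = 2,
                         stall_patience: int = 2,
                         cd_gate: float = 1.0,
                         min_gain: float = 1e-3) -> int:
    """Return how many leading turns stay in the loss (all thresholds are assumed)."""
    error_run, error_start = 0, 0
    stall_run, stall_start = 0, 0
    best_cd = float("inf")
    for i, turn in enumerate(turns):
        if not turn.executed:
            if error_run == 0:
                error_start = i
            error_run += 1
            if error_run >= max_consecutive_errors:   # execution cascade
                return error_start
            continue
        error_run = 0
        if turn.cd < best_cd - min_gain:              # geometry still improving
            best_cd = turn.cd
            stall_run = 0
        elif best_cd > cd_gate:                       # no gain and still too far off
            if stall_run == 0:
                stall_start = i
            stall_run += 1
            if stall_run >= stall_patience:           # geometry stall
                return stall_start
        else:
            stall_run = 0                             # already inside the gate
    return len(turns)


def token_loss_mask(turns: List[TurnRecord], keep_turns: int) -> List[int]:
    """1 for tokens in the viable prefix, 0 for tokens after the boundary."""
    mask: List[int] = []
    for i, turn in enumerate(turns):
        mask.extend([1 if i < keep_turns else 0] * turn.n_tokens)
    return mask


def clamp_advantage(advantage: float, truncated: bool) -> float:
    """A truncated trajectory may reinforce its prefix but never punish it."""
    return max(advantage, 0.0) if truncated else advantage


cascade = [
    TurnRecord(3, True, 4.0),     # valid base plate
    TurnRecord(2, False, None),   # invalid operation
    TurnRecord(2, False, None),   # failed repair
    TurnRecord(1, False, None),   # failed repair
]
keep = viable_prefix_length(cascade)
mask = token_loss_mask(cascade, keep)
print(keep, mask, clamp_advantage(-0.7, truncated=keep < len(cascade)))
```

In this toy trajectory only the first turn's three tokens remain in the loss, and the negative trajectory-level advantage is clamped to zero so the valid base plate is not penalized for what came after it.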

Translated into business language: do not throw away the good part of a workflow because a later step failed. Diagnose the failure boundary.

That idea applies far beyond CAD. In enterprise agent systems, multi-step failures often arise after several correct intermediate actions. A procurement agent may parse a contract correctly and then choose the wrong approval path. A finance agent may identify the right account and then apply the wrong period. A design agent may build the right base geometry and then destroy it with a bad edit. The governance problem is not merely “did the final answer pass?” It is “where did the process leave the valid path?”

IterCAD’s GVPM is a technical implementation of that principle. The model is trained not only to get a final reward but to preserve useful partial progress.

The main result is not “bigger model wins”

The paper’s most important quantitative evidence comes from IterCAD-Draw, its drawing-to-code benchmark. This is the main evidence for the closed-loop thesis because the task directly tests whether a model can convert dimensioned multi-view drawings into executable CadQuery programs and improve through sandbox interaction.

In direct inference, IterCAD reports an invalid ratio of 6.50% and AUC-TR of 0.54. Its Qwen3.5-4B backbone, by contrast, reports a 95.30% invalid ratio and AUC-TR of 0.03. That is not a gentle improvement. That is the difference between a workflow and a recycling bin.

The agentic workflow setting is more revealing. IterCAD reaches a 0.30% invalid ratio, AUC-TR of 0.61, mean Chamfer Distance of 5.09, and an average of 2.48 turns. GPT-5, in the same table, reports a 4.70% invalid ratio, AUC-TR of 0.50, mean Chamfer Distance of 12.18, and 2.44 turns. Gemini-3-flash-lite reaches AUC-TR of 0.56 and mean Chamfer Distance of 5.79, but with a much higher invalid ratio of 12.70%.

The interpretation is not that IterCAD is a generally stronger model than every frontier system. The paper does not show that. The interpretation is narrower and more useful: a smaller, domain-trained model inside a CAD-specific closed loop can outperform larger general systems on a specialized engineering workflow metric.

That is the point executives often miss. The question is not “which model is smartest?” It is “which model, feedback loop, tool environment, and evaluation standard form the most reliable operating system for this task?”

The answer, in this paper, is not the biggest model alone.

The editing result is strong, but it is not the headline victory lap

IterCAD-Edit is a different test. Here, the model receives an existing source program and a natural-language instruction for a localized modification. The task is less about reconstructing a part from a drawing and more about changing a program without damaging the rest of the geometry.

On this benchmark, IterCAD performs very well relative to its backbone: Qwen3.5-4B reports a 63.00% invalid ratio and AUC-TR of 0.18, while IterCAD reports a 1.00% invalid ratio and AUC-TR of 0.54. It also reduces average turns to 2.34 versus 4.49 for the backbone.

But GPT-5 leads the table on AUC-TR at 0.79, with a 0.50% invalid ratio, mean Chamfer Distance of 2.14, and median Chamfer Distance of 0.05. IterCAD is not the best editing system in the comparison. It is the best evidence that the IterCAD training process dramatically improves a small open model’s editing reliability.

That distinction should be preserved. Overclaiming here would flatten the paper into marketing soup. The actual result is more interesting: domain-specific closed-loop training can make a compact model competitive and highly executable, but frontier general models may still dominate certain code-editing and refactoring regimes.

For businesses, this suggests a procurement decision rather than a religious war. If the task is drawing-driven CAD reconstruction under a controlled pipeline, specialized domain training may offer a strong efficiency-reliability trade-off. If the task is broad program editing, complex refactoring, or ambiguous natural-language changes, frontier general models may still be valuable components.

The operational architecture may eventually combine both: a specialized CAD agent for structured reconstruction and verification, plus a stronger general model for complex intent interpretation. Yes, this is more complicated than “just use the best model.” Reality often is.

CD-TR matters because failed CAD code should not disappear from the scorecard

One of the paper’s most useful contributions is evaluative rather than generative.

Traditional CAD generation metrics often compute mean or median geometric error only on successfully executed programs. This is convenient. It is also a wonderful way to lie politely. If a model fails on difficult cases and only its successful outputs are scored, the reported geometry metrics can make it look more precise than it is.

IterCAD introduces the Chamfer Distance Tolerance-Recall curve, or CD-TR, and its scalar summary AUC-TR. The metric keeps invalid generations in the denominator. A failed case receives zero recall across thresholds. As the tolerance varies, the curve shows what fraction of the entire test set both executes successfully and falls within geometric error limits.
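
A small sketch makes the denominator effect concrete. The tolerance grid, the trapezoidal integration, and the normalization to [0, 1] are assumptions chosen for illustration; the paper's exact thresholds and scaling are not reproduced here, and the four Chamfer Distance values are invented.

```python
def tolerance_recall(cds, tau):
    """Fraction of ALL cases that executed and landed within tolerance tau.

    `cds` holds one Chamfer Distance per test case, or None for an invalid generation.
    """
    hits = sum(1 for cd in cds if cd is not None and cd <= tau)
    return hits / len(cds)


def auc_tr(cds, taus):
    """Trapezoidal area under the tolerance-recall curve, divided by the tolerance span."""
    recalls = [tolerance_recall(cds, tau) for tau in taus]
    area = sum(
        (taus[i + 1] - taus[i]) * (recalls[i] + recalls[i + 1]) / 2
        for i in range(len(taus) - 1)
    )
    return area / (taus[-1] - taus[0])


taus = [0.0, 0.5, 1.0, 1.5, 2.0]           # assumed tolerance grid
all_cases = [0.1, 0.4, None, 2.0]          # one generation failed to execute
survivors_only = [cd for cd in all_cases if cd is not None]

honest = auc_tr(all_cases, taus)           # failure stays in the denominator
flattering = auc_tr(survivors_only, taus)  # failure quietly dropped
print(honest, flattering)                  # 0.46875 0.625
```

Same system, same outputs: dropping the one failed case lifts the score from about 0.47 to about 0.63 without a single shape getting better.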

This is not a minor measurement tweak. It changes what counts as performance.

Mean Chamfer Distance can answer: “Among the cases that ran, how close were the shapes?” AUC-TR answers a harsher and more operationally relevant question: “Across all assigned cases, how often did the system produce executable geometry within tolerance?”

That is the question businesses actually care about. A production workflow cannot simply exclude the cases where the model produced broken CAD code. The CNC machine, the supplier, and the design review meeting will not accept a footnote saying “invalid generations omitted for elegance.”

The Text2CAD results illustrate why this distinction matters. IterCAD reports an invalid ratio of 0.64% and median Chamfer Distance of 0.10. CAD-Coder reports an invalid ratio of 1.45%, mean Chamfer Distance of 6.54, and median Chamfer Distance of 0.17. IterCAD is stronger on validity and median precision, while CAD-Coder has a lower mean Chamfer Distance among successful generations. That is not a contradiction. It is exactly why multiple metrics are needed.
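
The mean-versus-median split is easy to reproduce with made-up numbers. The values below are invented for illustration and are not taken from the paper: one system is tighter on typical cases but has a single large outlier, the other is slightly looser everywhere with a shorter tail.

```python
from statistics import mean, median

# Invented Chamfer Distances over successful generations only.
system_a = [0.05, 0.08, 0.10, 0.12, 40.0]   # precise, with one bad outlier
system_b = [0.15, 0.16, 0.17, 5.0, 6.0]     # looser, with a shorter tail

summary = {
    "A": {"median": median(system_a), "mean": mean(system_a)},
    "B": {"median": median(system_b), "mean": mean(system_b)},
}
print(summary)
```

System A wins on the median and loses on the mean, so neither number alone says which system to trust, and neither says anything about the cases that never executed.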

AUC-TR does not make mean and median Chamfer Distance useless. It makes them harder to misuse.

The ablation explains where reliability actually comes from

The ablation study is not decorative. It is the paper’s mechanism audit.

The ablation starts from the Qwen3.5-4B backbone in the agentic IterCAD-Draw setting, where the invalid ratio is 62.30%, AUC-TR is 0.21, and mean Chamfer Distance is 13.04. After the first supervised fine-tuning stage on expert trajectories, the invalid ratio drops to 7.50% and AUC-TR rises to 0.52. This tells us that process supervision alone teaches a substantial amount of executable CAD behavior.

Adding on-policy refinement data pushes the invalid ratio further down to 0.80%, but mean Chamfer Distance remains relatively high at 12.44. That is a useful separation: imitation can improve executability without fully solving geometric precision.

Introducing GSPO reinforcement learning improves geometric quality. The GSPO variant reports AUC-TR of 0.58 and mean Chamfer Distance of 8.00. This supports the paper’s claim that geometry-aware RL contributes more than supervised imitation when fine-grained shape convergence matters.

The full model with GVPM reports the best overall combination: invalid ratio 0.30%, AUC-TR 0.61, and mean Chamfer Distance 5.09. The appendix adds training-dynamics evidence: GSPO alone tends to collapse toward single-turn behavior, while GSPO with GVPM maintains a higher and more stable turn count. That supports the idea that GVPM helps preserve substantive refinement rather than rewarding premature stopping.

The pattern is clean:

Component or analysis	Likely purpose of the test	What it supports	What it does not prove
Expert SFT	Main mechanism evidence	CAD syntax and basic generation behavior can be taught from curated trajectories	That geometric accuracy is solved
On-policy refinement SFT	Ablation	Teacher-corrected failures improve executability and repair behavior	That imitation alone optimizes shape fidelity
GSPO RL	Ablation	Geometry-aware reward improves convergence	That RL is sufficient without credit-assignment controls
GVPM	Ablation plus training-dynamics support	Masking bad suffixes improves multi-turn refinement and final performance	That all agent workflows need this exact masking design
Easy/Hard split	Robustness and sensitivity test	Hard cases meaningfully stress multi-operation reasoning	That the benchmark covers industrial assemblies or all CAD standards
Case studies	Exploratory qualitative extension	The loop can recover from concrete failures and preserve design state	That the behavior is reliable across all production workflows

This is how an ablation should be used in a business reading: not as a trophy cabinet, but as a map of which engineering investments produced which reliability gains.

The hard cases show why the loop is doing real work

The appendix stratifies IterCAD-Draw into Easy and Hard subsets. Easy examples are closer to basic sketch-and-extrude patterns. Hard examples involve holes, blind cuts, Boolean combinations, shelling, fillets, chamfers, through-cuts, and multi-operation chaining.

Every model degrades on the Hard subset. That is expected. If a benchmark does not get harder when the geometry gets harder, it is probably not measuring geometry.

The more important observation is that IterCAD retains a strong margin on Hard cases. In the agentic setting, IterCAD reports 0.40% invalid ratio and AUC-TR of 0.47 on Hard examples. GPT-5 reports 9.40% invalid ratio and AUC-TR of 0.35. Gemini-3-flash-lite reports 22.40% invalid ratio and AUC-TR of 0.40. Qwen3.5-4B reports 85.00% invalid ratio and AUC-TR of 0.06.

The practical meaning is that closed-loop CAD grounding appears most valuable where the geometry stops being a toy. Simple extrusion patterns can often be handled by broad generative ability. Complex feature composition requires constraint tracking, valid topology, local repair, and evidence that the generated object still matches the drawing.

This is also why the benchmark construction matters. The paper explicitly tries to move beyond datasets dominated by basic sketch-and-extrude sequences by including advanced operations such as fillets, chamfers, shells, and pattern arrays. That does not make the benchmark industrially complete. But it does move evaluation closer to the failure modes that matter.

The qualitative cases are useful, but they are not proof of reliability

The paper includes case studies showing IterCAD correcting a failed drawing-to-code attempt, fixing a text-conditioned CAD error, performing localized editing, and maintaining a continuous generation-and-editing session. In one example, the first attempt fails due to an unsupported CadQuery operation; later turns replace it with explicit trigonometric placement and refine the topology. In another, the agent fixes an offset cylinder and missing concentric through-hole. The unified session shows the model reconstructing a base plate, increasing thickness, adding fillets, undoing an operation, and inserting a chamfered boss.

These examples help readers understand the intended behavior: IterCAD is not merely regenerating from scratch after every failure. It can use feedback, preserve state, and apply successive edits.

But qualitative cases should be held in their proper place. They are explanatory evidence, not statistical evidence. They show what the loop can look like when it works. The benchmark tables tell us how often the loop works under the tested conditions.

That separation matters because demos are especially seductive in CAD. A generated part looks convincing until the dimensions are wrong, the topology is invalid, the edit history is unusable, or the next engineer tries to modify it and begins questioning the moral direction of civilization.

Business value comes from cheaper correction, not magical design replacement

The direct business interpretation is not that IterCAD replaces mechanical engineers. That would be both premature and boring.

The more credible opportunity is reducing the cost of repetitive CAD reconstruction and localized editing. Many engineering workflows involve converting drawings into editable parametric models, adjusting dimensions, creating variants, repairing scripts, or translating design intent into executable geometry. These are not trivial tasks, but they are structured enough to benefit from a model that can operate inside a sandbox and receive geometry-specific feedback.

The value pathway looks like this:

What the paper directly shows	Cognaptus business inference	Boundary
Closed-loop drawing-to-code improves invalid ratio and AUC-TR on IterCAD-Draw	CAD assistants should be evaluated as repair-capable workflows, not one-shot generators	Strongest for controlled single-part CAD tasks
Progressive SFT plus geometry-aware RL improves reliability over the Qwen3.5-4B backbone	Domain training and tool feedback can substitute for some general model scale	Does not eliminate the value of frontier models in broader editing
CD-TR/AUC-TR count invalid outputs in the denominator	Procurement tests should measure all assigned cases, including failures	Metric still abstracts away manufacturing semantics
GVPM improves the full model in ablations	Multi-step agent training needs failure-boundary logic, not just final rewards	This specific masking method may not transfer unchanged to all domains
Case studies show multi-turn stateful editing	AI CAD tools can support iterative design sessions, not only isolated generation	Reliability across long real production sessions remains unproven

The first practical use case is likely not autonomous design from vague intent. It is structured conversion and repair: take a dimensioned drawing, produce editable CadQuery code, run it, compare it, patch it, and hand a human something closer to usable.

That may sound less glamorous than “AI designs products.” Good. Glamour is often just uncertainty wearing a press release.

The boundary is engineering intent, not geometry alone

The paper’s limitations are unusually relevant to business adoption.

First, IterCAD focuses on single-part parametric modeling in CadQuery and specific drawing standards. This is already useful, but it is not the same as assembly-level design, proprietary CAD environments, supplier-specific drafting conventions, or collaborative engineering workflows with versioned constraints and dependencies.

Second, the evaluation relies heavily on geometric metrics such as Chamfer Distance. Geometry is necessary, but it is not sufficient. A model can produce the right shape while missing the purpose of a feature. A keyway is not just a slot. A thread is not just a helix-like absence. A bearing fit is not just a hole with a diameter. Manufacturing intent lives in tolerances, materials, interfaces, load assumptions, and process constraints. Chamfer Distance does not know any of that. It is a ruler, not an engineer.

Third, the paper notes that IterCAD prioritizes immediate geometric convergence over long-term program maintainability. This is not a footnote. It is a major adoption issue. Engineering teams do not only need a final solid. They need a design tree or parametric program that another human can understand, modify, audit, and maintain. Hard-coded geometry may pass a benchmark and still be a nuisance in production.

This is where business buyers should be especially disciplined. A CAD agent that produces executable code is not automatically producing enterprise-grade CAD assets. The output must be assessed for downstream editability, naming conventions, feature hierarchy, constraint logic, tolerance representation, and compatibility with human workflows.

The model may generate the part. The engineering organization still has to live with it.

What to test before putting this kind of system near production

A company evaluating CAD agents should not run a beauty contest on ten attractive examples. It should build a workflow test that resembles the paper’s stricter logic.

The test should include invalid outputs in the denominator. It should separate executability from geometric accuracy. It should measure first-pass performance and post-feedback performance. It should include easy and hard cases. It should test edits that preserve existing geometry, not just from-scratch generation. And it should evaluate code maintainability, because a generated script that no one can responsibly modify is not automation; it is technical debt with a 3D preview.

A practical evaluation framework would include:

Evaluation dimension	Question to ask
Executability	Does the generated CAD program run without syntax, runtime, or geometric errors?
Geometric fidelity	Does the resulting solid match the target within meaningful tolerance bands?
Recovery	Does feedback improve the model, or does the loop drift into random patching?
Edit locality	Can the model change the requested feature while preserving unaffected geometry?
Standards fit	Does the system understand the company’s drawing conventions and CAD environment?
Maintainability	Is the generated model structured for future human edits?
Manufacturing semantics	Does the model capture intent, tolerances, interfaces, and functional constraints?
Governance	Are failures logged, categorized, and routed to humans rather than silently accepted?

IterCAD contributes strongly to the first four dimensions. It gestures toward the others but does not solve them. That is not a criticism. It is simply the boundary between a research contribution and an industrial system.
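
The first two rows of that framework, split by difficulty and by first-pass versus post-feedback output, can be tallied with very little machinery. The record layout, field names, tolerance, and sample values below are hypothetical; the point is only that every assigned case stays in the denominator.

```python
from collections import defaultdict


def scorecard(cases, tolerance):
    """Invalid ratio and within-tolerance rate per (difficulty, stage).

    Each case is a dict with 'difficulty', 'first_pass_cd', and 'final_cd';
    a Chamfer Distance of None means the generated program was invalid.
    """
    buckets = defaultdict(list)
    for case in cases:
        buckets[case["difficulty"]].append(case)

    report = {}
    for difficulty, group in buckets.items():
        n = len(group)  # all assigned cases, including failures
        for stage in ("first_pass_cd", "final_cd"):
            cds = [case[stage] for case in group]
            report[(difficulty, stage)] = {
                "invalid_ratio": sum(cd is None for cd in cds) / n,
                "within_tolerance": sum(
                    cd is not None and cd <= tolerance for cd in cds
                ) / n,
            }
    return report


# Invented evaluation log for illustration.
cases = [
    {"difficulty": "easy", "first_pass_cd": 0.2, "final_cd": 0.1},
    {"difficulty": "easy", "first_pass_cd": None, "final_cd": 0.3},
    {"difficulty": "hard", "first_pass_cd": None, "final_cd": 0.4},
    {"difficulty": "hard", "first_pass_cd": 3.0, "final_cd": 0.8},
    {"difficulty": "hard", "first_pass_cd": None, "final_cd": None},
    {"difficulty": "hard", "first_pass_cd": 2.0, "final_cd": 2.5},
]
report = scorecard(cases, tolerance=1.0)
for key, row in sorted(report.items()):
    print(key, row)
```

The last hard case is worth a look: feedback made it worse, which is exactly the kind of drift the Recovery row is meant to catch and a single final-score column would hide.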

The real lesson is that CAD agents need operational memory of their own mistakes

IterCAD is valuable because it makes the middle of the workflow visible. The model does not merely receive a prompt and emit a design. It reads structured evidence, acts in a tool environment, receives feedback, and revises under constraints. Its training process also respects that middle: SFT teaches the procedure, RL optimizes the outcome, and GVPM tries to avoid punishing recoverable progress when later turns go wrong.

That is the larger pattern businesses should notice. Reliable AI systems are increasingly less about isolated model outputs and more about controlled cycles of action, inspection, and correction. In CAD, the inspection object is executable geometry. In other domains, it may be a ledger entry, a contract clause, a forecast, a compliance rule, or a software build. The form changes. The operating principle does not.

The misconception is that CAD automation mainly needs a stronger one-shot generator. IterCAD’s answer is sharper: the generator is only one part of the machine. The rest is feedback, execution, evaluation, and repair.

For once, the lesson from an AI paper is comfortably old-fashioned. Measure the part. Run the code. Check the failure. Preserve what worked. Fix what did not. Then, and only then, pretend the system is intelligent.

Cognaptus: Automate the Present, Incubate the Future.

¹ Tao Hu et al., “IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing,” arXiv:2606.13368, submitted June 11, 2026. https://arxiv.org/abs/2606.13368
