Opening — Why this matters now
There’s a quiet mismatch in the current AI narrative. We celebrate models that can draft essays, generate images, and even write code—but then expect them to design engineering-grade objects with millimeter precision. That’s not ambition. That’s wishful thinking.
CAD is not forgiving. A model that is “almost correct” is, in practice, entirely useless. A missing face, a slightly wrong dimension, or an invalid solid is not an aesthetic flaw—it is a production failure.
This is where most text-to-CAD systems quietly break. They treat engineering as a language problem. The paper under review makes a different argument: CAD is a verification problem.
And verification, inconveniently, requires structure.
Background — From pretty shapes to usable geometry
Prior approaches to text-to-CAD fall into two camps:
| Approach | Strength | Failure Mode |
|---|---|---|
| Single-pass LLM generation | Fast, simple | Hallucinates dimensions, invalid geometry |
| Vision-based refinement | Captures overall shape | Misses precise measurements |
The tension is obvious:
- Visual feedback → good for “does it look right?”
- Numerical feedback → good for “is it correct?”
Most systems pick one. Engineering requires both.
The paper identifies the deeper issue: LLMs are fluent but not grounded. They generate plausible CAD scripts but lack a mechanism to verify whether those scripts produce valid, manufacturable objects.
Which leads to a predictable outcome: elegant code that produces unusable parts.
Analysis — CADSmith’s architecture: thinking in loops, not outputs
The proposed system, CADSmith, does something deceptively simple: it stops treating generation as a one-shot task.
Instead, it decomposes the workflow into five specialized agents:
| Agent | Role | Business Analogy |
|---|---|---|
| Planner | Converts text → structured design spec | Product manager |
| Coder | Generates CAD code | Engineer |
| Executor | Runs code and extracts metrics | QA system |
| Validator (Judge) | Evaluates correctness | Auditor |
| Refiner | Fixes errors iteratively | Senior reviewer |
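The five-agent handoff can be sketched as a simple pipeline. The roles follow the table above, but everything else here is an illustrative assumption: in the real system each stage would wrap an LLM call, while these function bodies are hypothetical stubs.

```python
# Minimal sketch of the five-agent handoff. Each "agent" is a plain
# function; the names follow the paper's roles, but the bodies and
# return shapes are illustrative stand-ins, not the actual system.

def planner(prompt: str) -> dict:
    # Text -> structured design spec (dimensions, features, constraints).
    return {"shape": "box", "width_mm": 40, "height_mm": 20, "depth_mm": 10}

def coder(spec: dict) -> str:
    # Spec -> CAD script (a string of code to execute).
    return f"make_box({spec['width_mm']}, {spec['height_mm']}, {spec['depth_mm']})"

def executor(script: str) -> dict:
    # Run the script and extract metrics; stubbed here.
    return {"ran": True, "volume_mm3": 8000.0}

def validator(spec: dict, metrics: dict) -> bool:
    # Judge: compare measured metrics against what the spec implies.
    expected = spec["width_mm"] * spec["height_mm"] * spec["depth_mm"]
    return metrics["ran"] and abs(metrics["volume_mm3"] - expected) < 1e-6

def run_pipeline(prompt: str) -> bool:
    spec = planner(prompt)
    script = coder(spec)
    metrics = executor(script)
    return validator(spec, metrics)
```

The point of the decomposition is visible even in the stub: the Validator checks the Executor's measurements against the Planner's spec, so no single stage grades its own work.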
This is not just modularity for elegance. It introduces something most AI systems lack: accountability between steps.
The real innovation: dual-loop correction
CADSmith introduces two nested feedback loops:
- Inner loop (execution correctness)
  - Fixes syntax errors and API misuse
  - Ensures the code runs
- Outer loop (geometric correctness)
  - Uses exact measurements (volume, bounding box, topology)
  - Combines them with visual inspection from a separate model
  - Ensures the output is geometrically correct
If that sounds obvious, it’s because it mirrors how humans work.
We don’t just write something and hope. We test, measure, and revise.
The difference is that here, the feedback is programmatic and quantifiable.
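The nesting can be sketched as two retry loops, one inside the other. The control flow mirrors the paper's description; the `refine_*` callbacks and result dictionary are hypothetical placeholders for the LLM-driven repair steps.

```python
# Sketch of the dual-loop structure: the inner loop retries until the
# script executes; the outer loop retries until geometry checks pass.
# All callbacks are hypothetical placeholders, not the system's API.

def dual_loop(script, execute, check_geometry, refine_syntax, refine_geometry,
              max_inner=3, max_outer=3):
    for _ in range(max_outer):
        result = None
        for _ in range(max_inner):
            result = execute(script)
            if result["ok"]:
                break                      # inner loop: the code runs
            script = refine_syntax(script, result["error"])
        if result is None or not result["ok"]:
            return None                    # could not even execute
        if check_geometry(result):
            return script                  # outer loop: geometry checks pass
        script = refine_geometry(script, result)
    return None                            # budget exhausted
```

Note the asymmetry: the inner loop only sees execution errors, while the outer loop sees measured geometry, which is exactly the separation between "does it run?" and "is it right?".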
Findings — What actually improves (and by how much)
The results are not subtle. They are structural.
Overall performance
| Configuration | Execution Rate | Median F1 | Median IoU | Mean Chamfer Distance (lower is better) |
|---|---|---|---|---|
| Zero-shot | 95% | 0.9707 | 0.8085 | 28.37 |
| No vision | 99% | 0.9792 | 0.9563 | 18.19 |
| Full pipeline | 100% | 0.9846 | 0.9629 | 0.74 |
The key number is not F1. It’s Chamfer Distance.
Why?
Because it exposes catastrophic failures.
A drop from 28.37 → 0.74 is not incremental improvement. It’s the difference between:
- “mostly correct”
- and “actually usable”
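For readers unfamiliar with the metric: one common definition of Chamfer distance is the symmetric mean nearest-neighbour distance between two point sets sampled from the surfaces being compared. A brute-force sketch (real evaluations typically sample many points and use k-d trees; the variant and point counts here are assumptions):

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets.

    Mean nearest-neighbour distance from a to b, plus from b to a.
    O(n*m) brute force for clarity; some variants use squared
    distances or average the two directions instead of summing.
    """
    def nn(p, pts):
        return min(math.dist(p, q) for q in pts)
    return (sum(nn(p, b) for p in a) / len(a)
            + sum(nn(q, a) for q in b) / len(b))
```

Because every point contributes, a single missing or misplaced feature inflates the score dramatically, which is why the metric exposes catastrophic failures that F1 and IoU can smooth over.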
Complexity matters
| Tier | Description | F1 Score (Full Pipeline) |
|---|---|---|
| T1 | Simple primitives | 0.998 |
| T2 | Engineering parts | 0.998 |
| T3 | Complex multi-step parts | 0.886 |
Here’s the interesting part:
- Removing vision barely affects simple tasks
- But breaks complex ones entirely
In T3, removing visual feedback increases the Chamfer Distance from 1.42 to 49.68.
That’s not degradation. That’s collapse.
Implications — This is not about CAD
It’s tempting to treat this as a niche engineering paper. That would be a mistake.
What CADSmith demonstrates is a broader principle:
LLMs don’t need to be smarter. They need to be constrained.
1. Agents outperform monolithic intelligence
A single model tries to do everything—and fails quietly.
A multi-agent system:
- decomposes responsibility
- introduces checkpoints
- enables targeted correction
In business terms, this is the difference between:
- hiring one “genius generalist”
- vs building a functional organization
The latter scales. The former produces impressive demos.
2. Measurement beats intuition
The system works because it replaces vague feedback with:
- exact dimensions
- explicit constraints
- measurable discrepancies
This is what most AI deployments lack.
They optimize for fluency, not correctness.
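What "measurable discrepancies" means in practice is simple: compare measured geometry against the spec with explicit tolerances and report every violation. A minimal sketch, where the field names and the 0.1 mm tolerance are illustrative assumptions:

```python
# Turn vague feedback into measurable discrepancies: diff measured
# dimensions against the spec under an explicit tolerance. The field
# names and default tolerance are illustrative, not from the paper.

def check_against_spec(spec: dict, measured: dict, tol_mm: float = 0.1):
    """Return a list of (dimension, expected, actual) violations."""
    violations = []
    for key, expected in spec.items():
        actual = measured.get(key)
        if actual is None or abs(actual - expected) > tol_mm:
            violations.append((key, expected, actual))
    return violations
```

The output is not "looks wrong" but "hole diameter is 6.5 mm, spec says 6.0 mm", which is something a Refiner agent can actually act on.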
3. Independent evaluation matters
Using a separate, stronger model as a Judge avoids self-confirmation bias.
This is quietly critical.
A system that generates and evaluates itself will almost always pass its own work.
Which is efficient—until it fails in production.
4. RAG over fine-tuning: a pragmatic choice
Instead of retraining models, the system retrieves:
- API documentation
- known error patterns
This keeps the system:
- up-to-date
- cheaper to maintain
- easier to extend
In enterprise terms, this is closer to knowledge orchestration than model engineering.
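The retrieval side can be surprisingly lightweight. A toy sketch, scoring knowledge-base entries by word overlap with the failure message; real systems would use embeddings, and the example entries are invented:

```python
# Toy retrieval over a knowledge base of API docs and known error
# patterns: score each entry by word overlap with the query. Real
# systems would use embedding search; these entries are made up.

def retrieve(query: str, kb: list[str], k: int = 2) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(kb, key=lambda doc: -len(q & set(doc.lower().split())))
    return scored[:k]
```

Updating the system then means editing the knowledge base, not retraining a model, which is where the maintenance savings come from.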
Conclusion — From generation to verification
The real contribution of this paper is not better CAD models.
It’s a shift in mindset:
AI systems should not be judged by what they generate, but by how well they correct themselves.
CAD just happens to make this painfully obvious.
In less strict domains, we tolerate errors. In engineering, we cannot.
Which is precisely why engineering workflows may become the blueprint for the next generation of AI systems.
Not creative.
Not conversational.
But correct.
Cognaptus: Automate the Present, Incubate the Future.