TL;DR for operators
AI systems are getting better at producing outputs that look structured: code, CAD, diagrams, workflows, compliance memos, procurement recommendations, and decision traces. That is not the same as keeping the structure right.
Two recent arXiv papers make this point from opposite ends of the problem. One looks inside language models and finds evidence for a compact retrieval-conditioned rebinding mechanism: the model does not necessarily rewrite its whole internal world after a state change; it can preserve old representations and redirect retrieval when the answer is needed.[1] The other builds an engineering benchmark for Text-to-CAD and shows that models can pass earlier surface gates (executable code, plausible geometry) while still failing the practical tests of functionality, manufacturability, and assemblability.[2]
The combined lesson is simple enough to be annoying: structured AI is not mainly about fluent generation. It is about relational integrity. The system has to preserve the right links between requirements, parts, constraints, evidence, and final actions. Otherwise it will produce something that looks respectable, runs in a sandbox, and quietly forgets what it was supposed to be doing. A familiar corporate format, really.
For business deployment, this means evaluation must move beyond “does the output look right?” toward “did the system preserve the operational relationships that make the output useful?”
The shared problem: AI is learning to look structured before it reliably is structured
Most AI adoption failures are not dramatic. The model does not usually burst into flames, announce its contempt for governance, and delete the procurement database. It does something more tedious: it produces an output that is close enough to pass a casual review and wrong enough to cause downstream trouble.
That failure mode becomes more important as companies move from conversational AI to structured AI. The work is no longer just drafting an email or summarising a policy. It is generating CAD code, linking claims to evidence, mapping purchase orders to constraints, tracking state across multi-step workflows, or recommending operational decisions after a change in context.
In these settings, the output is only the visible end of a relational system. A design is not just a shape. A workflow is not just a sequence. A compliance answer is not just a paragraph with policy-ish vocabulary sprinkled on top like regulatory parsley. The output is useful only if the right entities remain bound to the right attributes, constraints, and consequences.
That is the connection between the two papers. The first paper asks how a model internally handles dynamic entity tracking when objects swap locations. The second asks whether models can generate CAD assemblies that satisfy not only syntax and geometry, but engineering intent. One is mechanistic; the other is applied. Together they form a useful chain:
| Layer | What the paper contributes | Operator translation |
|---|---|---|
| Internal mechanism | LLMs can perform dynamic rebinding through retrieval-time pointer updates rather than full global state reconstruction. | Do not assume the model has a coherent updated world model just because it answers correctly. |
| Applied stress test | Text-to-CAD models fail in stages: code execution, geometry validity, then engineering intent. | Do not assume executable or plausible artifacts satisfy the real task. |
| Business conclusion | Structured AI needs evaluation of bindings, constraints, and downstream intent. | Measure relational integrity, not just output polish. |
This is not a serial summary of two papers. The useful reading is the chain: first, how a model may route relational state internally; second, how fragile that becomes when the external task requires many relationships to survive into a real artifact.
Paper 1: the model may not update the world; it may update the pointer
The rebinding paper studies a deceptively small task. A prompt assigns objects to boxes, then swaps two boxes, then asks what a particular box contains. For example: Box R contains the rabbit, Box S contains the sock, Box T contains the toy. Swap the items of Box S and Box R. Which item does Box R contain?
A human can solve this by mentally updating the world: after the swap, Box R contains the sock. The intuitive assumption is that a model does something similar. It reads the context, constructs a little internal table, updates the table after the swap, then reads off the answer.
The paper tests a different hypothesis. Instead of reconstructing a full updated state, the model may preserve the original object representations and perform a local rebinding at retrieval time. In the authors’ terms, it uses binding IDs: abstract identifiers that connect an entity with its associated object. After a swap, the model does not necessarily rewrite every box-object relation. It can remap the binding ID associated with the queried box and then retrieve the object originally stored under that binding.
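The contrast between the two hypotheses can be sketched in plain Python. This is a toy analogy only, not the attention-head circuit the paper describes; the function names and the `redirect` lookup are assumptions made for illustration.

```python
# Toy analogy: two ways to answer "what does Box R contain?" after a swap.
# Neither function is the mechanism from the paper; both are illustrative.

def global_update(contents: dict[str, str], a: str, b: str, query: str) -> str:
    """Rewrite the whole box -> object table, then read the answer off it."""
    updated = dict(contents)
    updated[a], updated[b] = contents[b], contents[a]
    return updated[query]


def retrieval_time_rebinding(contents: dict[str, str], a: str, b: str, query: str) -> str:
    """Leave the original table untouched; redirect the lookup at query time."""
    # Each box keeps a binding ID that points at its original slot.
    binding_id = {box: i for i, box in enumerate(contents)}
    slots = list(contents.values())  # objects stay where they were first written
    # The swap only changes which binding the queried box resolves to.
    redirect = {a: b, b: a}
    source_box = redirect.get(query, query)
    return slots[binding_id[source_box]]


boxes = {"R": "rabbit", "S": "sock", "T": "toy"}
print(global_update(boxes, "S", "R", "R"))             # sock
print(retrieval_time_rebinding(boxes, "S", "R", "R"))  # sock
```

Both functions give the same answer on this prompt, which is the point: correct behaviour alone does not tell you which strategy produced it.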
That sounds like a technical distinction because it is one. It is also the difference between “the system has updated its world model” and “the system has found a clever way to answer this readout request.” Those are not the same claim. One is stronger, more general, and more comforting. Naturally, it is also the one executives tend to prefer hearing.
The researchers use causal interventions to test where answer-relevant information flows through the model. They report that the answer object is retrieved directly from its original contextual occurrence, while box information is mediated through intermediate token positions after the swap. That pattern supports retrieval-conditioned rebinding over a full global state update. In plainer terms: the model appears to keep the object where it was and redirect the retrieval path when the question arrives.
They then localise the mechanism through path patching and identify a compact attention-head circuit with functional groups such as answer retrievers, dereferencers, position updaters, swap position transmitters, and binding anchors. In Gemma-9B, 38 of 672 attention heads, or 5.7%, recover much of the model’s rebinding behaviour, reaching 0.89 candidate accuracy against 1.00 for the full model and 0.34 for a matched random-head baseline. Across Gemma and Llama models, the general strategy appears, but the implementation differs: Gemma models show clearer query/key binding-ID matching, while Llama models are more key-dominated.
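The reported figures can be checked with a few lines of arithmetic. The "gap recovered" normalisation below is an assumption added here for intuition; it is not a metric taken from the paper.

```python
# Figures quoted above for the Gemma circuit.
heads_in_circuit = 38
total_heads = 672
share = heads_in_circuit / total_heads

full_model, circuit, random_baseline = 1.00, 0.89, 0.34

# Assumption: share of the full-vs-random gap that the circuit closes.
# This normalisation is ours, not a number reported by the authors.
gap_recovered = (circuit - random_baseline) / (full_model - random_baseline)

print(f"{share:.1%} of heads")                       # 5.7% of heads
print(f"{gap_recovered:.0%} of the gap over random")  # 83% of the gap over random
```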
That matters because it gives us a more precise picture of LLM competence. Correct behaviour in a state-tracking task does not imply the model has built a stable, inspectable, globally updated representation of the world. It may be using a compact, local, retrieval-time routing strategy.
This is not a criticism of the model. Local routing can be efficient and effective. Humans also use shortcuts, though we dress them up with much better excuses. The issue is deployment interpretation. A model that answers a controlled swap task correctly may still be brittle when the task shifts from one swap among single-token objects to a messy business process with multiple revisions, exceptions, approvals, and constraints.
The authors are careful about this. Their task is controlled: alphabetically named boxes, single-token objects, and one swap operation. They explicitly note that future work should test multiple swaps, larger entity sets, other state transitions, and more naturalistic narratives. That limitation is not a footnote-level nuisance. It is precisely where the business relevance lies.
If the mechanism is local and retrieval-conditioned, then the operational question becomes: when the system faces a richer environment, does it still retrieve the right thing from the right binding at the right time?
That question leads directly to CAD.
Paper 2: CAD is where relational slippage becomes expensive
The MUSE paper moves from the inside of the model to the outside of the artifact. It asks whether LLMs can generate CAD models that are not merely syntactically executable or visually plausible, but functionally valid, manufacturable, and assemblable.
This is a useful shift because CAD is unforgiving in a way that chat demos are not. A plausible paragraph can survive on vibes. A chair cannot. Chairs, in a charmingly old-fashioned constraint imposed by physics, are expected to hold weight, fit together, and not become abstract sculpture when sat upon.
The paper argues that existing Text-to-CAD benchmarks often focus on single-part models and geometric similarity. That misses the practical design problem. A generated model may resemble a reference shape and still fail as a design if it is unstable, hard to manufacture, or impossible to assemble. Conversely, a visually different model may satisfy the same functional intent. Shape similarity is therefore an inadequate proxy for design quality.
MUSE reframes the task around structured design specifications. Each design instance includes a high-level design goal, a physical assembly graph, valid parameter ranges, a manufacturing plan, and assembly-related information. The benchmark contains 106 design instances spanning manufacturing processes, materials, and connection methods. The evaluation is staged:
| Stage | What is checked | Why it is insufficient alone |
|---|---|---|
| Code validity | Does the generated CadQuery script execute and export a STEP file? | Executable code can still produce bad geometry. |
| Geometric validity | Is the model watertight, manifold, self-intersection free, and overlap free? | Valid geometry can still ignore design intent. |
| Design-intent alignment | Does it satisfy functionality, manufacturability, and assemblability rubrics? | This is closer to the actual engineering task. |
The paper’s main empirical finding is a failure cascade. Models degrade from code execution to geometric validity to design-intent alignment. Closed-source models outperform open-source models across stages, but even the stronger models show limited success on fine-grained engineering criteria. The authors also report that code-oriented capability does not reliably imply CAD geometry capability: a model can execute code relatively well and still fail geometric or design-intent checks.
That is the applied version of the relational-state problem. CAD generation is full of bindings. The seat panel must bind to the intended load-bearing function. The leg geometry must bind to stability. The joint type must bind to assembly direction and tolerance. The material must bind to manufacturing limits. The dimensions must stay inside a valid parameter space. The generated script must preserve those relationships when it becomes geometry.
MUSE shows that the current generation pipeline can lose those bindings at multiple points. The code may run, but the components overlap. The geometry may be valid, but the object may not function. The assembly graph may be topologically close, but the parameter choices may shift the object into a different design category. A chair becomes a bench, a kids’ chair, or a bed depending on parameter settings. Apparently, even furniture has identity issues.
The benchmark’s evaluation design is therefore more important than any single leaderboard number. It separates surface success from operational success. That is exactly the separation business AI needs.
The combined conclusion: structured AI needs relational integrity
Read together, these papers point to a practical concept: relational integrity.
Relational integrity is the ability of an AI system to preserve and apply the right relationships among entities, attributes, constraints, and actions throughout a task. It is not the same as factual accuracy. It is not the same as code execution. It is not the same as visual similarity. It is the connective tissue that makes structured outputs usable.
A simple way to express the deployment problem is:
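As a hedged sketch of that multiplicative reading, with term names that are illustrative labels assumed here rather than notation from either paper, and each term taken to lie between 0 and 1:

$$
\text{Operational value} \;\approx\; \text{Surface quality} \times \text{Binding preservation} \times \text{Constraint satisfaction}
$$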
This is not a measurement formula from the papers. It is a business interpretation. The point is that the terms behave like gates. If an output looks good but loses the relationship between the requirement and the generated artifact, the operational value drops. If code runs but violates the assembly constraint, the operational value drops. If a workflow completes but applies the wrong approval state to the wrong transaction, the operational value drops. Multiplication is cruel that way. Management dashboards often prefer averages because averages are kinder to failure.
The rebinding paper shows that even in a controlled setting, the model’s internal success may rely on local retrieval-time routing rather than a robust global state update. The MUSE paper shows that in a realistic engineering setting, outputs can pass earlier checks and still fail the deeper relational task.
The chain looks like this:
- LLMs must bind entities to attributes and update or reinterpret those bindings when context changes.
- In controlled dynamic tracking, models may solve this through retrieval-conditioned rebinding rather than a full updated world model.
- In CAD generation, useful output requires coherent preservation of many bindings: components, interfaces, dimensions, materials, tolerances, functions, and assembly constraints.
- Standard success signals — executable code, plausible shape, geometric validity — do not prove those bindings survived.
- Therefore, business-facing AI systems need evaluations that test relational state at the point of use.
This is the uncomfortable part: the model can be “right” in a narrow behavioural sense while still being unready for operational autonomy. It may answer the query, generate the file, or complete the workflow, but the user still has to ask whether the right relationships were preserved.
That is not pessimism. It is inventory control for assumptions.
Relational drift: the failure mode operators should actually worry about
The common AI risk discussion often focuses on hallucination, bias, security, and cost. Those are real. But for structured enterprise use, relational drift deserves its own category.
Relational drift occurs when the output remains plausible while the underlying relationship between entities, constraints, and actions shifts or breaks. It is dangerous because it does not necessarily look like failure at the surface.
| Domain | Surface success | Relational failure | Better evaluation question |
|---|---|---|---|
| CAD generation | The script executes and creates a model. | Components violate assembly constraints or manufacturing limits. | Does the artifact satisfy the design specification across function, process, and assembly? |
| Compliance review | The memo cites relevant policy language. | The cited rule is bound to the wrong jurisdiction, product, or transaction type. | Are claims linked to the correct governing source and case facts? |
| Procurement | The recommendation ranks suppliers neatly. | Supplier constraints are mismatched to delivery region, certification, or contract terms. | Are vendor attributes bound to the correct operational requirement? |
| Finance workflow | The model produces a clean variance explanation. | The explanation binds the variance to the wrong driver or reporting period. | Can the system trace each claim to the right account, period, and assumption? |
| Customer operations | The agent completes a case workflow. | The resolution step is applied to the wrong customer state or entitlement. | Was the action valid for this specific user, plan, and history? |
This is why “AI generated it successfully” is not a control. It is a statement of artifact production. Controls need to ask what survived the transformation.
In the MUSE setting, the transformation is from design specification to CadQuery code to STEP geometry to engineering assessment. In business workflows, it may be from policy to decision, from contract to obligation, from ticket history to action, or from forecast assumptions to investment recommendation. The general problem is the same: every transformation creates a chance for bindings to slip.
The model may preserve the label and lose the relation. It may preserve the shape and lose the function. It may preserve the step and lose the condition under which that step was valid.
And because the output looks organised, people trust it faster. Humans are generous to tables, diagrams, and code blocks. We see structure and assume discipline. This is how PowerPoint became a civilisation.
What the papers show, and what they do not show
The papers are useful because they avoid the lazy version of the argument. They do not simply say “LLMs are bad at reasoning” or “AI design is hard.” They show more specific mechanisms and failure points.
The rebinding paper shows that state-tracking behaviour can be supported by a compact attention-head circuit using binding-ID routing. It does not show that all complex reasoning in LLMs works this way. It does not show that models never build global state representations. It studies a controlled task where causal interventions are tractable.
The MUSE paper shows that Text-to-CAD evaluation must move beyond code execution and geometry resemblance toward engineering intent. It does not prove that all AI-generated CAD is commercially useless. It also does not physically manufacture the benchmark outputs. The authors note that designer validation and rubric-based VLM judging may miss real-world manufacturing issues, and that assembly order and physics-based robustness are not fully modelled.
Those boundaries matter. The business conclusion is not “LLMs cannot do structured work.” That would be too broad, and frankly too easy. The better conclusion is:
LLMs can produce structured outputs before they can reliably preserve the operational relationships that make those outputs safe, buildable, auditable, or useful.
That sentence is less dramatic. It is also more deployable.
A practical evaluation framework: test the bindings, not just the artifact
The immediate business response is not to ban structured AI systems. It is to evaluate them at the right layer.
A useful evaluation framework has five parts.
1. Define the entity map
Before testing the model, define the entities that matter. In CAD, these are components, joints, dimensions, materials, and manufacturing methods. In compliance, they are rules, jurisdictions, products, transactions, customers, and dates. In operations, they are cases, owners, approvals, states, and escalation paths.
If the entities are not explicit, the model’s binding failures will be invisible. You cannot test whether the right thing stayed attached to the right thing if nobody agreed what the things were. A shocking discovery, yes, but one worth writing down.
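A minimal sketch of an explicit entity map, assuming a simple kind/name scheme. The `Entity` and `EntityMap` classes and the chair example are hypothetical; a real map would come from the design specification or process documentation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    kind: str  # e.g. "component", "joint", "material"
    name: str


@dataclass
class EntityMap:
    entities: set[Entity] = field(default_factory=set)

    def add(self, kind: str, name: str) -> Entity:
        entity = Entity(kind, name)
        self.entities.add(entity)
        return entity

    def unknown(self, mentioned: set[Entity]) -> set[Entity]:
        """Entities an output mentions that nobody declared up front."""
        return mentioned - self.entities


chair = EntityMap()
seat = chair.add("component", "seat_panel")
leg = chair.add("component", "leg")
chair.add("material", "plywood")

# Hypothetical entities extracted from a model output.
model_mentions = {seat, leg, Entity("component", "armrest")}
undeclared = chair.unknown(model_mentions)
print(undeclared)
```

Here the undeclared `armrest` surfaces immediately, because the agreed entities were written down before the model was tested.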
2. Define the binding rules
Next, define which relationships must hold. This is where many AI evaluations become too shallow. They test whether the final answer is acceptable, but not whether the internal relationships required by the task survived.
For CAD, the binding rules may include:
- each component must match its role in the assembly graph;
- each parameter must remain inside the valid design space;
- each material choice must match the manufacturing method;
- each joint must support the stated assembly behaviour.
For enterprise workflows, the equivalent rules might include:
- each recommendation must bind to the correct customer state;
- each approval must bind to the right authority level;
- each policy citation must bind to the applicable jurisdiction and date;
- each action must bind to a permitted next step.
These are not model preferences. They are task invariants.
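Task invariants of this kind can be written as named checks over a candidate output. The sketch below assumes a flat dictionary record; the parameter ranges, the material-to-process table, and the field names are all hypothetical values chosen for illustration.

```python
from typing import Callable

Design = dict  # hypothetical flat record of a generated design

# Hypothetical rule values; real ones would come from the design specification.
PARAM_RANGES = {"seat_height_mm": (400, 500), "leg_count": (3, 4)}
MATERIAL_PROCESSES = {"plywood": {"cnc_routing", "laser_cutting"}}


def params_in_range(d: Design) -> bool:
    return all(lo <= d["params"][k] <= hi for k, (lo, hi) in PARAM_RANGES.items())


def material_matches_process(d: Design) -> bool:
    return d["process"] in MATERIAL_PROCESSES.get(d["material"], set())


INVARIANTS: dict[str, Callable[[Design], bool]] = {
    "parameters inside design space": params_in_range,
    "material matches manufacturing method": material_matches_process,
}


def violated(d: Design) -> list[str]:
    """Return the names of every invariant the candidate breaks."""
    return [name for name, check in INVARIANTS.items() if not check(d)]


candidate = {
    "params": {"seat_height_mm": 450, "leg_count": 4},
    "material": "plywood",
    "process": "injection_moulding",
}
print(violated(candidate))  # ['material matches manufacturing method']
```

Naming each invariant means a failure report says which relationship broke, not just that the output was rejected.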
3. Test after state changes
The rebinding paper is valuable because it focuses on state change. Many evaluations test static recall: given a context, can the model answer a question? Real operations involve updates. A shipment is delayed. A contract is amended. A customer changes plan. A design constraint is revised. A transaction crosses a threshold. A policy expires.
The key test is not whether the system understood the initial state. It is whether it preserves or correctly reinterprets the relevant bindings after change.
For business systems, build test cases around revisions, swaps, exceptions, cancellations, overrides, and conflicting updates. These are where relational drift hides.
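One way to build such cases is to generate update sequences programmatically and compute the expected answer with a reference implementation. The sketch below covers only swaps, and the ticket and owner names are hypothetical; the system under test would be compared against `expected` for each case.

```python
import itertools


def apply_swaps(state: dict[str, str], swaps: list[tuple[str, str]]) -> dict[str, str]:
    """Reference implementation: the ground truth a system under test must match."""
    current = dict(state)
    for a, b in swaps:
        current[a], current[b] = current[b], current[a]
    return current


def make_cases(state: dict[str, str], n_swaps: int):
    """Yield (swaps, query, expected) for every swap sequence of length n_swaps."""
    pairs = list(itertools.combinations(state, 2))
    for swaps in itertools.product(pairs, repeat=n_swaps):
        final = apply_swaps(state, list(swaps))
        for query in state:
            yield list(swaps), query, final[query]


# Hypothetical initial state: which owner each ticket is assigned to.
initial = {"ticket_A": "ana", "ticket_B": "ben", "ticket_C": "chloe"}
cases = list(make_cases(initial, n_swaps=2))
print(len(cases))  # 27
```

Raising `n_swaps` or adding entities moves the test beyond the single-swap setting the rebinding paper studies.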
4. Separate the funnel
MUSE’s three-stage protocol is a useful pattern beyond CAD. Do not collapse early and late success into one score. Split evaluation into gates:
| Gate | Generic version | Example |
|---|---|---|
| Syntax / execution | Can the system produce a valid artifact? | Code runs, form is complete, workflow executes. |
| Structural validity | Is the artifact internally coherent? | Geometry is valid, fields align, dependencies resolve. |
| Intent alignment | Does it satisfy the real-world purpose and constraints? | The design can be built; the decision follows policy; the recommendation fits the user’s situation. |
This prevents a common measurement error: allowing strong surface performance to compensate for weak operational alignment. In production, a beautifully formatted wrong answer is still a wrong answer. It just has better stationery.
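A gated evaluation can be sketched as an ordered list of checks that stops at the first failure instead of averaging. The gate names follow the table above; the checks themselves are hypothetical placeholders that read precomputed flags from an artifact record.

```python
from typing import Callable, Optional

Gate = tuple[str, Callable[[dict], bool]]


def first_failed_gate(artifact: dict, gates: list[Gate]) -> Optional[str]:
    """Run gates in order and stop at the first failure; None means all passed."""
    for name, check in gates:
        if not check(artifact):
            return name
    return None


# Hypothetical checks: a real pipeline would run code, validate structure,
# and score the artifact against rubrics instead of reading flags.
GATES: list[Gate] = [
    ("execution", lambda a: a["runs"]),
    ("structural_validity", lambda a: a["geometry_valid"]),
    ("intent_alignment", lambda a: a["meets_spec"]),
]

artifact = {"runs": True, "geometry_valid": True, "meets_spec": False}
print(first_failed_gate(artifact, GATES))  # intent_alignment
```

Because later gates never run after an earlier failure, strong surface performance cannot offset a miss on intent.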
5. Track where failures enter
The most useful part of a staged evaluation is not the final score. It is the failure pattern.
If models fail at execution, the problem is generation reliability. If they pass execution but fail structural validity, the problem is spatial, logical, or schema consistency. If they pass structure but fail intent, the problem is domain grounding and constraint preservation.
Different failures require different controls. A prompt tweak may help code formatting. It will not necessarily fix manufacturability. A retrieval layer may improve policy citation. It will not automatically bind the correct rule to the correct customer state. A human review step may catch final errors, but only if the review interface exposes the relationships that need checking.
The control must match the failure layer. Otherwise the organisation is just adding process theatre, which is compliance’s less entertaining cousin.
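Tracking where failures enter can be as simple as tallying the first gate each output fails. In the sketch below the result records are made-up illustrative data, not benchmark numbers, and the diagnosis strings paraphrase the mapping described above.

```python
from collections import Counter

GATE_ORDER = ["execution", "structural_validity", "intent_alignment"]

# Diagnosis per failure layer, paraphrasing the text above.
DIAGNOSIS = {
    "execution": "generation reliability",
    "structural_validity": "spatial, logical, or schema consistency",
    "intent_alignment": "domain grounding and constraint preservation",
}


def failure_entry_point(result: dict[str, bool]) -> str:
    """Return the first gate that failed, or 'passed_all'."""
    for gate in GATE_ORDER:
        if not result[gate]:
            return gate
    return "passed_all"


# Made-up results purely for illustration.
results = [
    {"execution": False, "structural_validity": False, "intent_alignment": False},
    {"execution": True, "structural_validity": False, "intent_alignment": False},
    {"execution": True, "structural_validity": True, "intent_alignment": False},
    {"execution": True, "structural_validity": True, "intent_alignment": False},
    {"execution": True, "structural_validity": True, "intent_alignment": True},
]

pattern = Counter(failure_entry_point(r) for r in results)
for gate in GATE_ORDER:
    print(f"{gate}: {pattern[gate]} failures -> look at {DIAGNOSIS[gate]}")
```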
The design principle: make relationships inspectable
This article’s central thesis is not that models need to become perfect internal world simulators. That may be unnecessary, and in many cases unrealistic. The better design principle is that important relationships must be inspectable and testable.
For structured AI systems, this means the product should expose:
- the entities the system believes are relevant;
- the relationships it is using;
- the constraints it considers binding;
- the state changes it detected;
- the evidence supporting each relation;
- the final action that depends on those relations.
In other words, do not only show the output. Show the binding contract behind the output.
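A binding contract of this kind could be a small structured record emitted alongside the output. The sketch below is one possible shape, assuming hypothetical field names and a hypothetical compliance example; it is not a schema from either paper.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Binding:
    subject: str
    relation: str
    obj: str
    evidence: str  # pointer to the source supporting this relation


@dataclass
class BindingContract:
    entities: list[str]
    bindings: list[Binding]
    constraints: list[str]
    state_changes: list[str]
    final_action: str

    def unsupported(self) -> list[Binding]:
        """Bindings the system asserted without citing any evidence."""
        return [b for b in self.bindings if not b.evidence.strip()]


# Hypothetical contract for a compliance answer.
contract = BindingContract(
    entities=["rule_7", "product_X", "jurisdiction_UK"],
    bindings=[
        Binding("rule_7", "applies_to", "product_X", "policy.pdf, section 3"),
        Binding("rule_7", "governed_by", "jurisdiction_UK", ""),
    ],
    constraints=["rule must be in force on the transaction date"],
    state_changes=["product_X reclassified"],
    final_action="approve_with_conditions",
)

print(json.dumps(asdict(contract), indent=2))
print([b.relation for b in contract.unsupported()])  # ['governed_by']
```

A reviewer looking at this record can see that the jurisdiction link has no supporting evidence before reading a word of the memo.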
This is especially important for agentic systems. An agent that calls tools, updates memory, changes files, sends messages, or creates operational artifacts can create relational drift across steps. The output may be locally valid at each step but globally inconsistent across the workflow. A procurement agent may correctly parse a supplier document, correctly summarise a delivery constraint, and correctly draft a recommendation — while binding the constraint to the wrong SKU. Each local action looks competent. The assembled result is not.
That is the same logic as the MUSE failure cascade. Execution is not geometry. Geometry is not design intent. Local completion is not global task success.
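The procurement example can be turned into a cross-step check: record which SKU each step attaches a constraint to, and flag any disagreement. The trace format, step names, and SKU identifiers below are hypothetical.

```python
def cross_step_conflicts(steps: list[dict]) -> list[str]:
    """Flag constraints that are attached to different SKUs in different steps."""
    seen: dict[str, tuple[str, str]] = {}  # constraint -> (sku, step name)
    conflicts = []
    for step in steps:
        for constraint, sku in step["bindings"].items():
            if constraint in seen and seen[constraint][0] != sku:
                first_sku, first_step = seen[constraint]
                conflicts.append(
                    f"{constraint}: {first_sku} in {first_step}, {sku} in {step['name']}"
                )
            # Keep the first binding seen as the reference point.
            seen.setdefault(constraint, (sku, step["name"]))
    return conflicts


# Hypothetical agent trace: every step is locally plausible.
trace = [
    {"name": "parse", "bindings": {"delivery<=14d": "SKU-100"}},
    {"name": "summarise", "bindings": {"delivery<=14d": "SKU-100"}},
    {"name": "recommend", "bindings": {"delivery<=14d": "SKU-200"}},
]
print(cross_step_conflicts(trace))
# ['delivery<=14d: SKU-100 in parse, SKU-200 in recommend']
```

No single step in the trace is wrong in isolation; the drift only appears when the bindings are compared across steps.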
What this means for AI adoption
For managers, the practical lesson is to stop asking only whether AI can generate a class of artifact. Ask whether it can preserve the relationships that make that artifact operationally meaningful.
That changes procurement questions:
- not “Can the model generate CAD?” but “Can it preserve assembly, tolerance, material, and manufacturing constraints through generation?”
- not “Can the model summarise contracts?” but “Can it bind obligations, exceptions, dates, parties, and governing clauses correctly?”
- not “Can the agent complete workflows?” but “Can it maintain valid state across updates, approvals, and tool calls?”
- not “Can the system produce a recommendation?” but “Can it trace the recommendation to the right evidence, assumptions, and constraints?”
This also changes ROI analysis. AI value in structured work does not come from output volume alone. It comes from reducing the cost of correct structured transformation. A thousand plausible CAD files are not useful if the engineering team must inspect every joint from scratch. A thousand compliance memos are not useful if legal still has to rebuild the fact-rule mapping. A thousand workflow completions are not useful if operations must audit whether the right state was changed.
The economic bottleneck is not generation. It is verification. More precisely, it is verification of relationships.
That is why benchmarks like MUSE are directionally important. They move evaluation closer to the real unit of value: not whether the artifact exists, but whether it works under the constraints that define the task.
The quiet warning
The two papers do not say the same thing, and that is why they work well together.
The first paper shows that LLMs may solve dynamic tracking through clever, compact retrieval-time rebinding. The second shows that when structured generation moves into engineering territory, surface success breaks down unless the system preserves function, manufacturability, and assembly relations. One gives the internal caution. The other gives the external bill.
Together, they suggest that the next phase of AI evaluation should be less impressed by polished outputs and more interested in relational survival. Did the right object stay bound to the right entity? Did the parameter stay inside the valid design space? Did the manufacturing constraint survive the code generation step? Did the workflow action still apply after the state changed?
That is not glamorous. It is, however, where useful automation lives.
AI systems are now good enough to generate artifacts that deserve serious evaluation. That is progress. It also means the old demo question — “Can it make something plausible?” — has expired. Plausible is cheap now. Structured correctness is the expensive part.
And in business, expensive parts have a habit of becoming the actual product.
Footnotes
1. Soyoung Oh and Vera Demberg, “A retrieval conditioned rebinding circuit for dynamic entity tracking in large language models,” arXiv:2606.08644v1, 7 June 2026, https://arxiv.org/html/2606.08644.
2. Xiaoyu Dong, Zhi Li, and Xiao-Ming Wu, “MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation,” arXiv:2605.28579v2, 4 June 2026, https://arxiv.org/html/2605.28579.