From Yarn to Code: What CrochetBench Reveals About AI’s Procedural Blind Spot

A pattern is not a caption.

That sounds obvious until a multimodal model looks at a finished object, produces a confident set of instructions, and everyone in the room quietly rounds “looks plausible” up to “can build it.” This is one of the industry’s more expensive habits: mistaking descriptive competence for operational competence. The model can say what is there. Therefore, surely, it can infer how to make it. Very neat. Very wrong.

CrochetBench is useful because it makes that mistake measurable.¹ The paper asks whether vision-language models can move from recognising a finished crochet item to generating the step-by-step procedure required to construct it. On the surface, that may sound charmingly niche. In practice, crochet is a compact test case for a much larger problem: converting visual understanding into executable, stateful, structure-preserving action.

The lesson is not that AI cannot crochet. That would be adorable, and also too small. The lesson is that current multimodal systems still struggle when the output must obey a grammar, preserve counts, track state across steps, and produce something that can be executed rather than merely admired. The yarn is just where the benchmark catches the bug.

Crochet is a craft, but the failure is computational

Crochet patterns are not free-form prose. They are compressed programs written for humans.

A line such as “Rnd 1: ch 4, 6 sc in ring” is not just a sentence. It encodes operations, ordering, stitch counts, topology, and state transitions. Later rounds depend on earlier ones. Repeats must balance. Increases and decreases change geometry. Turning at the wrong time breaks the object. A small symbolic error can propagate until the final artefact no longer resembles the intended design.
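The bookkeeping described above can be sketched in a few lines. This is a toy model under stated assumptions: the operation names and their consumed/produced costs in `STITCH_EFFECTS` are simplifications chosen for illustration, and the functions `run_round` and `run_pattern` are hypothetical helpers, not CrochetPARADE syntax or anything taken from the paper.

```python
# Toy model of round-by-round stitch bookkeeping. The operation names and
# their (consumed, produced) costs are simplifying assumptions for
# illustration; this is not CrochetPARADE syntax.
STITCH_EFFECTS = {
    "sc": (1, 1),   # single crochet: uses one stitch, leaves one
    "inc": (1, 2),  # increase: two stitches worked into one
    "dec": (2, 1),  # decrease: two stitches joined into one
}


def run_round(prev_count, ops):
    """Apply (op, repeats) pairs to a round and return the new stitch count.

    Raises ValueError if an operation is unknown or if the round does not
    consume exactly the stitches left by the previous round.
    """
    consumed = produced = 0
    for op, repeats in ops:
        if op not in STITCH_EFFECTS:
            raise ValueError(f"undefined stitch: {op}")
        uses, makes = STITCH_EFFECTS[op]
        consumed += uses * repeats
        produced += makes * repeats
    if consumed != prev_count:
        raise ValueError(
            f"round consumes {consumed} stitches but {prev_count} are available"
        )
    return produced


def run_pattern(start_count, rounds):
    """Run all rounds in order, returning the stitch count after each one."""
    counts = [start_count]
    for ops in rounds:
        counts.append(run_round(counts[-1], ops))
    return counts


# Start from 6 stitches in the ring, then increase evenly.
good = [[("inc", 6)], [("sc", 6), ("inc", 6)]]
print(run_pattern(6, good))  # [6, 12, 18]
```

Swap the second round for `[("sc", 10)]` and `run_pattern` raises, because the round tries to consume 10 stitches when 12 are available. One wrong count in one line stops everything after it, which is the propagation the paragraph above describes.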

That is exactly why the domain is so revealing. Most procedural domains are hard to evaluate at scale. To check whether a recipe works, someone has to cook. To check whether an assembly instruction works, someone has to assemble. To check whether a robot policy works, someone has to run the robot, preferably away from the expensive glass.

Crochet offers a cleaner compromise. The paper uses CrochetPARADE, a domain-specific language and renderer for crochet patterns, so model outputs can be compiled, validated, and rendered. Instead of asking whether generated instructions sound convincing, the benchmark can ask a sharper question: does the procedure actually execute?

That shift matters. A benchmark that rewards resemblance tests whether a model can imitate the surface of expertise. A benchmark that compiles the output tests whether the model has produced something structurally usable. The former is cheap theatre. The latter is where the invoices begin.

CrochetBench is designed as a ladder from seeing to doing

The paper introduces CrochetBench, a dataset of 6,085 crochet patterns across 55 project categories, built from publicly available crochet pattern sources. The dataset pairs finished-product images with structured pattern information, including natural-language instructions and stitch metadata. The authors then define four tasks of increasing procedural demand.

Task	What the model must do	Likely purpose in the paper	What it reveals
Stitch Recognition	Identify stitch types from an image	Main evidence for visual grounding	Whether the model can see local crochet primitives
Instruction Selection	Choose the correct instruction from similar options	Main evidence for image-text grounding	Whether the model can match visual structure to procedural text
Instruction Generation	Generate natural-language crochet instructions from an image	Main evidence for procedural language synthesis	Whether plausible text preserves multi-step construction logic
Instruction-to-DSL Translation	Produce executable CrochetPARADE code	Main evidence for structural correctness	Whether the output survives grammar, state, and topology constraints

The ladder is the point. CrochetBench is not merely asking, “Can the model recognise crochet?” It is asking where recognition stops helping.

Tasks A and B (stitch recognition and instruction selection) test prerequisites: visual grounding and instruction matching. Task C (instruction generation) moves into free-form procedural text. Task D (instruction-to-DSL translation) removes the polite ambiguity of natural language and requires machine-checkable structure. That final step is where fluency gets audited.

And, as audits tend to do, it finds things the presentation deck left out.

The models can see parts of the object, but not reliably infer the process

The first results are not catastrophic. On stitch recognition, the best systems reach moderate performance. Claude Sonnet 4 achieves the highest F1 score at 60.94%, while DeepSeek-VL 7B leads the open models at 60.60%. That is not mastery, but it is also not random guessing.

Instruction selection is similarly limited but informative. Models choose among four candidate instructions, with distractors drawn from the same project category. This matters because the task is not solved by broad category recognition. A blanket-like image must be matched against blanket-like instructions. Qwen2-VL 72B achieves the strongest reported instruction-selection accuracy at 68.85%, while several large and closed-source models cluster in the mid-to-high 50s.
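To make the task design concrete, here is a minimal sketch of how one such four-way item might be assembled. The record keys (`id`, `category`, `instructions`) and the function `build_selection_item` are hypothetical; the paper's actual sampling procedure may differ. With four options, random guessing lands at 25%, which is the baseline the reported accuracies should be read against.

```python
import random


def build_selection_item(target, pool, num_options=4, seed=0):
    """Build one multiple-choice item with same-category distractors.

    `target` and the entries of `pool` are dicts with the hypothetical keys
    "id", "category", and "instructions".
    """
    rng = random.Random(seed)
    # Distractors must share the target's category, so broad category
    # recognition alone cannot identify the right answer.
    same_category = [
        p for p in pool
        if p["category"] == target["category"] and p["id"] != target["id"]
    ]
    if len(same_category) < num_options - 1:
        raise ValueError("not enough same-category patterns for distractors")
    options = [target] + rng.sample(same_category, num_options - 1)
    rng.shuffle(options)
    return {
        "options": [o["instructions"] for o in options],
        "answer_index": options.index(target),
    }


# Made-up pool: ids 0-4 are blankets, ids 5-7 are hats.
pool = [
    {"id": i, "category": "blanket" if i < 5 else "hat",
     "instructions": f"pattern {i}"}
    for i in range(8)
]
item = build_selection_item(pool[0], pool)
print(item["options"], item["answer_index"])
```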

So the first stage of the story is not “models are blind.” They can extract some useful visual and textual signals. They can identify local stitch vocabulary and sometimes match an image to a plausible procedural description.

The failure arrives when they must synthesize the procedure.

For natural-language instruction generation, the scores drop sharply. Gemini 2.5 Flash-Lite, the strongest base model on the reported text-generation metrics, reaches only 4.93% BLEU and 30.50% ChrF. Those numbers are not just low; they signal that the generated procedures have limited overlap with reference instructions even before executable correctness is considered.

The paper’s supervised finetuning experiment adds an important wrinkle. Finetuning Qwen2-VL 7B on the instruction-generation task improves surface metrics substantially: BLEU rises from 1.67% to 5.64%, ROUGE-L from 21.10% to 25.10%, and ChrF from 15.99% to 22.39%. On BLEU, the finetuned model even edges past the 4.93% of the strongest base model, although its ChrF stays below that model’s 30.50%.

That sounds impressive until one asks what improved. The paper’s qualitative analysis indicates that finetuning makes outputs more crochet-like: better formatting, more conventional stitch vocabulary, more plausible pattern structure. It does not show that the model has acquired robust procedural correctness. The model becomes more fluent in the genre. It does not necessarily become better at making the object.
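A small experiment shows why overlap metrics cannot settle the question. The function `simple_chrf` below is a simplified character n-gram F-score written for illustration; it is not the reference ChrF implementation and will not reproduce the paper's numbers. The two pattern lines are made-up examples.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after collapsing whitespace."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis, reference, max_n=3, beta=2.0):
    """Average character n-gram F-beta score over n = 1..max_n.

    A simplified stand-in for ChrF, not the reference implementation.
    """
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        scores.append(
            (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        )
    return sum(scores) / len(scores) if scores else 0.0


reference = "Rnd 2: 2 sc in each st around (12 sts)"
wrong_count = "Rnd 2: 2 sc in each st around (10 sts)"
print(simple_chrf(reference, reference))    # 1.0
print(simple_chrf(wrong_count, reference))  # still above 0.9
```

The second line differs by a single character, so it keeps a score above 0.9, yet it states a stitch count that the round cannot produce. A metric that counts matching characters has no way of knowing which character was the important one.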

This is a familiar enterprise failure mode. A system is trained until its outputs look more professional, then deployed as if professionalism and correctness were the same property. They are not. A neatly formatted wrong procedure is still wrong. It just wastes less time being obviously embarrassing.

Fluent instructions can still render the wrong object

The paper’s case study is the cleanest illustration of the mechanism.

The ground-truth pattern is a seven-point star with alternating blue and brown yarn and tassels attached to each point. Several models produce instructions that look structured and domain-aware. They include materials, rounds, colours, and crochet-like operations. Some correctly identify local motifs or colour relationships.

Then the instructions are rendered.

Gemini recognises the seven-point star more explicitly than the others, yet still fails to produce the correct topology. GPT-4o and Claude generate coherent-looking instructions, but misconstruct the global shape: the outputs drift toward circular or differently pointed motifs. Qwen2-VL and DeepSeek-VL degrade further, producing distorted or collapsed structures.

This is not a small formatting issue. It is the central cognitive distinction CrochetBench exposes:

Model capability	What it can look like	Why it is insufficient
Local visual recognition	“This uses blue and brown yarn; there are bobbles and star-like motifs”	Local details do not determine construction order
Procedural style imitation	“Round 1… Round 2… repeat around…”	Pattern-like prose can violate counts and topology
Symbolic execution	Valid DSL that compiles and renders	Even compilation may not preserve intended geometry
Global procedural fidelity	Rendered output resembles the target object	This is the actual capability businesses want when procedures matter

The difference between the second and fourth rows is where many AI automation claims quietly live. The system sounds like it knows what to do. But the world does not execute tone.

Compilation turns plausible text into measurable failure

Task D is where CrochetBench becomes more than another multimodal benchmark.

The authors define two versions of instruction-to-DSL translation. In the step-level task, the model receives a prefix of correct natural-language and DSL pairs, then must translate the next instruction. In the project-level task, the model receives the full natural-language pattern and image, then must generate an entire CrochetPARADE program.

The likely purpose of the step-level test is diagnostic. It asks whether models can update procedural state incrementally. The early, middle, and late prefixes reveal whether more context stabilises the model. The answer is: somewhat, but not enough.

In early steps, most models achieve a valid-pattern rate under 15%. Performance improves with more context and reaches roughly 55–65% in later steps. That pattern is revealing. Later steps are not necessarily “easier” in a human sense; they are easier for the model because the program state has already been partially established. If the early structure is valid, the model can continue it. If the initial state is wrong, failure propagates.

That is not robust procedural reasoning. It is fragile continuation under favourable context.

The project-level DSL results are harsher. Even the strongest models produce executable programs only 5–8% of the time, and most models stay below 3%. The dominant failure categories include undefined stitches, unbalanced brackets, multiple-reference errors, non-adjacent label problems, and runtime failures.

These are not exotic crochet mistakes. They are the procedural equivalent of a model inventing an API call, misplacing a loop bracket, referencing a variable that does not exist, or losing track of object identity over a long workflow. In other words: Tuesday, but with yarn.
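Two of those failure categories are simple enough to sketch as a checker. Everything here is an assumption made for illustration: the mini-DSL, the `KNOWN_STITCHES` vocabulary, and the `validate` function are invented, and real CrochetPARADE validation involves far more than this.

```python
import re

# Hypothetical stitch vocabulary and bracket pairs for a made-up mini-DSL.
KNOWN_STITCHES = {"ch", "sc", "hdc", "dc", "sl"}
PAIRS = {")": "(", "]": "["}


def validate(program):
    """Return a list of error strings; an empty list means the toy checks pass."""
    errors = []
    stack = []
    for char in program:
        if char in "([":
            stack.append(char)
        elif char in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[char]:
                errors.append(f"unbalanced bracket: {char}")
    if stack:
        errors.append(f"unclosed bracket: {stack[-1]}")
    # Every alphabetic token must be a stitch the vocabulary defines.
    for token in re.findall(r"[A-Za-z]+", program):
        if token not in KNOWN_STITCHES:
            errors.append(f"undefined stitch: {token}")
    return errors


print(validate("[sc, dc]*6"))   # []
print(validate("[sc, puff*6"))  # ['unclosed bracket: [', 'undefined stitch: puff']
```

The second program reads like crochet and fails twice. Neither failure needs a judgment call, which is the property that makes this kind of check cheap to run at scale.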

The scale result is especially useful because it disrupts an easy assumption. Larger models do not reliably solve the executable-structure problem. In some cases, they make a specific class of error worse. The paper reports substantially higher undefined-stitch error rates for larger variants: Qwen2-VL 72B at 72.0% versus Qwen2-VL 7B at 13.6%, and Gemma 3 27B at 42.6% versus Gemma 3 4B at 27.4%.

That does not mean smaller models are generally better. It means the relationship between scale and procedural validity is not monotonic. More capacity can produce richer symbolic invention, which is charming in poetry and less charming in a strict executable grammar.

DINO similarity checks whether valid code still makes the right thing

Compilation is necessary, but the paper correctly treats it as insufficient.

A program can compile and still produce the wrong artefact. To address that, the authors render executable DSL outputs and compare them with the target product image using DINO similarity. This is best understood as a semantic-fidelity check, not a separate grand thesis. It asks: among the rare outputs that execute, do they visually resemble the intended crochet item?

The answer is mostly no. Similarity scores remain uniformly low, around 0.10–0.17 across models, below the paper’s approximate threshold for good visual matches.
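The text above does not spell out how the similarity score is computed. A common choice for embedding-based comparison, assumed here, is cosine similarity between the two image embeddings. In this sketch the embeddings are plain lists of floats standing in for the output of a DINO encoder (not shown), and `threshold` is a hypothetical parameter rather than the paper's value.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for degenerate all-zero vectors
    return dot / (norm_a * norm_b)


def fraction_above(pairs, threshold):
    """Share of (rendered, target) embedding pairs scoring at least `threshold`."""
    if not pairs:
        return 0.0
    hits = sum(1 for r, t in pairs if cosine_similarity(r, t) >= threshold)
    return hits / len(pairs)


# Placeholder vectors; real embeddings would come from an image encoder.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
print(fraction_above(pairs, threshold=0.5))  # 0.5
```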

This matters because it prevents a second easy misconception. One might argue that if the model can produce valid DSL, the problem is mostly solved. CrochetBench shows otherwise. Validity is only the first gate. A pattern can obey the grammar while constructing the wrong geometry.

For business use, that distinction is crucial. A workflow engine can accept a generated procedure. A simulator can run it. A robot can execute it. None of that guarantees the procedure achieves the intended outcome. Executability is not the same as task success. It is merely the point at which failure becomes more expensive.

The paper’s evidence is a staircase, not a leaderboard

The temptation is to turn CrochetBench into a model ranking. That would be the least interesting reading.

The stronger reading is that each experiment isolates a different failure boundary.

Evidence	Likely purpose	What it supports	What it does not prove
Stitch recognition and instruction selection	Main evidence for perceptual grounding	Models can recover some local visual and image-text signals	They understand the construction process
Instruction generation metrics	Main evidence for procedural text synthesis	Free-form procedure generation remains weak	Low BLEU alone proves non-executability
Case-study rendering	Qualitative diagnostic illustration	Fluent instructions can produce wrong geometry	All failure modes are captured by one example
Supervised finetuning	Training variant / sensitivity test	Surface metrics and formatting can improve	Finetuning solves structural reasoning
Step-level DSL translation	Main evidence with context-depth sensitivity	Early state initialisation is fragile; context helps continuation	Later-step improvement equals robust reasoning
Project-level DSL errors	Main evidence for long-range symbolic failure	Full-program synthesis is brittle; scale does not reliably help	Crochet errors map one-to-one to industrial errors
DINO similarity on rendered outputs	Robustness / semantic-fidelity check	Compilable outputs still often miss the intended structure	DINO is a perfect measure of crochet equivalence

This staircase is what gives the paper its business relevance. It does not merely say “models are bad at crochet.” It shows where capability decays as evaluation becomes more operational.

A descriptive benchmark asks: can the model say something reasonable?

A procedural benchmark asks: can the model preserve the steps?

An executable benchmark asks: can the model produce something a system can run?

A semantic execution benchmark asks: does the run produce the intended artefact?

Most enterprise AI evaluation still stops around the first or second question, then acts surprised when the fourth one sends a bill.
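The four questions can be wired together as ordered gates, reporting the first one an output fails. This is a minimal sketch: the `Stage` class, the `first_failure` function, and the four toy checks are all hypothetical stand-ins. In a real harness each check would call a proper validator, compiler, or renderer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stage:
    name: str
    check: Callable[[str], bool]


def first_failure(output: str, stages: List[Stage]) -> Optional[str]:
    """Return the name of the first stage the output fails, or None if all pass."""
    for stage in stages:
        if not stage.check(output):
            return stage.name
    return None


# Toy checks, ordered from cheapest and most forgiving to strictest.
stages = [
    Stage("descriptive", lambda out: bool(out.strip())),
    Stage("procedural", lambda out: "Rnd 1" in out),
    Stage("executable", lambda out: out.count("[") == out.count("]")),
    Stage("semantic", lambda out: out.endswith("(12 sts)")),
]

print(first_failure("Rnd 1: [sc, inc*6 (12 sts)", stages))   # executable
print(first_failure("Rnd 1: [sc, inc]*6 (10 sts)", stages))  # semantic
print(first_failure("Rnd 1: [sc, inc]*6 (12 sts)", stages))  # None
```

Recording where each output falls off the staircase is more informative than a single pass rate, because it separates outputs that merely read well from outputs that run.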

The business implication is evaluation design, not crochet automation

The practical lesson is not that every company should add crochet tests to its AI procurement process. Please do not make the vendor demo a craft circle unless morale is already beyond saving.

The lesson is that any business deploying multimodal or agentic AI into procedural domains needs executable evaluation wherever possible.

This applies to several areas:

Robotics and physical automation: A model that can describe a workspace may still fail to generate safe, executable manipulation plans. The relevant test is not whether its plan sounds sensible, but whether it works in simulation, respects constraints, and transfers under bounded uncertainty.
CAD, design, and manufacturing workflows: Generating a design description is not the same as generating a manufacturable object. Geometry, tolerances, materials, and process constraints need validators. A pretty specification is not a production plan.
Software agents and workflow automation: Agentic systems often generate multi-step plans across APIs, files, databases, and user permissions. CrochetBench’s undefined-stitch errors have a direct software analogue: invented functions, missing references, malformed calls, invalid state transitions.
Procedural content generation: Games, simulations, training environments, and digital twins require generated structures that are internally consistent. Textual plausibility is insufficient when downstream systems must render, simulate, or interact with the output.
Compliance and regulated operations: In regulated workflows, a model must do more than produce a plausible checklist. Steps must be complete, ordered, auditable, and valid under rules. That is closer to DSL execution than to captioning.
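The software analogue from the list above is easy to make executable. The sketch below checks a generated plan for invented tools and for references used before they are defined. The plan format (dicts with `tool`, `inputs`, and `output` keys), the tool names, and the `validate_plan` function are all hypothetical.

```python
def validate_plan(plan, allowed_tools):
    """Check a multi-step plan for unknown tools and undefined references.

    `plan` is a list of dicts with the hypothetical keys "tool", "inputs"
    (names of earlier outputs), and "output" (the name this step defines).
    """
    errors = []
    defined = set()
    for i, step in enumerate(plan, start=1):
        if step["tool"] not in allowed_tools:
            errors.append(f"step {i}: unknown tool {step['tool']!r}")
        for ref in step.get("inputs", []):
            if ref not in defined:
                errors.append(
                    f"step {i}: reference {ref!r} used before definition"
                )
        if step.get("output"):
            defined.add(step["output"])
    return errors


allowed = {"fetch_order", "issue_refund"}
plan = [
    {"tool": "fetch_order", "inputs": [], "output": "order"},
    {"tool": "refund_order", "inputs": ["order", "payment"], "output": "receipt"},
]
for error in validate_plan(plan, allowed):
    print(error)
```

The second step sounds reasonable and is wrong twice: `refund_order` is not a registered tool, and `payment` was never produced by an earlier step. That is the undefined stitch, wearing a lanyard.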

The Cognaptus inference is straightforward: businesses should evaluate procedural AI with validators, simulators, compilers, renderers, or rule engines before trusting human review of fluent outputs. Human review still matters, but humans are notoriously forgiving when text has the right shape. Machines are less polite, which is why they are useful.

The real blind spot is state

The paper’s results point toward a mechanism: current multimodal models can often identify fragments, but struggle to maintain executable state over long horizons.

In crochet, state includes the number of stitches in a round, where the next stitch attaches, whether a repeat is open or closed, whether a label refers to an adjacent structure, and whether the geometry is expanding, contracting, or turning. In business workflows, state includes inventory counts, permissions, object IDs, dependencies, exception conditions, and process history.

These are not decorative details. They are the procedure.

A model that fails to track state can still be very persuasive because language allows approximation. Execution does not. A missing bracket, an invented stitch, or an impossible attachment is not a matter of taste. It is a failed operation.

That is why “vision to action” remains much harder than “vision to description.” The final image underdetermines the procedure. Many procedures can produce similar artefacts; many plausible-looking procedures produce invalid artefacts. Inferring the right construction path requires constraints that are not visible as surface pixels. The model must connect perception, symbolic structure, temporal order, and executable validation.

CrochetBench’s uncomfortable contribution is showing that the chain breaks even in a small, formalised craft domain. The industrial versions will not be easier merely because the PowerPoint says “agentic.”

Where the evidence should not be overstretched

CrochetBench is a strong diagnostic benchmark, but it has boundaries.

First, crochet is one domain. It is symbolically rich and structurally useful, but it is not robotics, mechanical assembly, chemistry, construction, logistics, or surgery. The paper supports cautious transfer of the evaluation lesson, not direct claims about failure rates in every procedural domain.

Second, CrochetPARADE is an abstraction. It enables automated validation, rendering, and error analysis, but it is not identical to physical crocheting. Some valid human-written patterns may not map cleanly into the DSL. Multiple procedures can produce visually similar outputs. Rendering and DINO similarity are useful proxies, not perfect arbiters of craft truth.

Third, the finetuning evidence is limited. The paper tests supervised finetuning on a single architecture for natural-language instruction generation. It does not exhaust alternatives such as execution-guided learning, program repair, constrained decoding, verifier-in-the-loop training, or neuro-symbolic planning. The negative result is not “training cannot help.” It is “surface-metric improvement is not enough.”

Fourth, the dataset itself is built from existing crochet patterns and processed through an automated extraction pipeline. That makes scale possible, but also means benchmark quality depends on parsing, normalisation, and the representational choices used to structure the data. The authors address this through schema design and executable validation, but the boundary remains.

These limitations do not weaken the central point. They keep it honest. CrochetBench is not a universal theory of procedural intelligence. It is a well-aimed instrument for exposing a capability gap that ordinary captioning benchmarks are too polite to detect.

What better systems will need

The paper does not propose a full solution, but its failures imply what future systems will need.

They will need constrained generation when the output language has a grammar. They will need external validators that catch invalid intermediate states before errors compound. They will need planning representations that separate “what the object looks like” from “what sequence of operations can produce it.” They will need execution feedback, not just next-token imitation. They may need domain-specific compilers, simulators, and repair loops that turn fluent drafts into checked procedures.
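One of these ingredients, the repair loop, fits in a short sketch. The function `generate_with_repair` is hypothetical, and `stub_generate` is a stand-in for a model call that happens to fix its bracket once told about it. Real systems are unlikely to be this cooperative, and the paper does not evaluate this approach.

```python
def generate_with_repair(generate, validate, max_attempts=3):
    """Call `generate` until `validate` reports no errors or attempts run out.

    `generate(feedback)` returns a candidate program given the previous
    attempt's errors; `validate(program)` returns a list of error strings.
    Returns (program, attempts_used), or (None, max_attempts) on failure.
    """
    feedback = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        errors = validate(candidate)
        if not errors:
            return candidate, attempt
        feedback = errors  # hand the validator's complaints to the next attempt
    return None, max_attempts


def stub_generate(feedback):
    # Stand-in for a model call: closes the bracket only after being told.
    return "[sc, inc]*6" if feedback else "[sc, inc*6"


def stub_validate(program):
    balanced = program.count("[") == program.count("]")
    return [] if balanced else ["unbalanced bracket"]


print(generate_with_repair(stub_generate, stub_validate))  # ('[sc, inc]*6', 2)
```

The design choice worth noting is that the validator, not a second model, decides when the loop stops. A loop that ends when the output merely looks right has the same blind spot the benchmark measures.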

In other words, the next step is not simply bigger multimodal models. It is tighter coupling between models and executable structure.

That is less glamorous than a demo where an AI looks at a picture and confidently writes a plan. It is also much closer to how reliable automation actually gets built.

The benchmark is about trust at the point of execution

CrochetBench is memorable because the domain is disarming. Yarn, stitches, stars, tassels. Nothing screams enterprise risk quite like a malformed granny square.

But the softness of the domain is deceptive. The benchmark tests a hard capability: whether a model can convert visual evidence into a correct, stateful, executable procedure. Current systems can recognise some parts, choose plausible instructions, and write fluent procedural text. When asked to compile the logic and preserve the final structure, they mostly fail.

That is the procedural blind spot.

For businesses, the message is simple enough to be inconvenient. Do not evaluate action-oriented AI by how well it talks about action. Evaluate it where the action becomes executable: in validators, simulators, compilers, renderers, test harnesses, and constrained environments where errors cannot hide behind fluent prose.

The future of useful multimodal AI will not be decided by whether a model can describe the sweater. It will be decided by whether the pattern works.

Cognaptus: Automate the Present, Incubate the Future.

¹ Peiyu Li, Xiaobao Huang, Ting Hua, and Nitesh V. Chawla, “CrochetBench: Can Vision-Language Models Move from Describing to Doing in Crochet Domain?”, arXiv:2511.09483, 2025. https://arxiv.org/html/2511.09483
