Opening — Why this matters now

The AI industry is celebrating multimodal models as if they can already do things. Look at a picture, generate a plan, and—supposedly—convert visual understanding into executable action. But when you swap the glossy demos for a domain that demands fine-grained, symbolic precision—like crochet—the illusion cracks.

CrochetBench, a new benchmark evaluating whether vision‑language models can move from describing to doing, is far more than a quirky dataset. It is a stress test for the kind of procedural reasoning that underpins robotics, manufacturing automation, and any AI system meant to execute real-world workflows.

And the results? Let’s just say the models tug the yarn… but don’t quite weave the fabric.

Background — Context and prior art

Most multimodal benchmarks reward superficial alignment: match an image to text, generate a caption, retrieve a recipe. Nice, but fundamentally passive. Domains like cooking or assembly eventually push toward procedural reasoning, but validating correctness requires real-world execution—expensive, slow, and noisy.

Crochet, of all things, fixes this. Each stitch is a symbolic unit; each pattern is, in effect, a program. Better yet, the CrochetPARADE DSL lets patterns be compiled and rendered automatically, so models can be judged not just on linguistic resemblance but on structural validity. That makes crochet a surprisingly powerful proxy for evaluating:

  • Symbolic grounding
  • 3D-aware procedural synthesis
  • Long-range state tracking
  • Executable correctness under strict grammars

It is program synthesis wearing a cozy sweater.
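
To make "pattern as program" concrete, here is a minimal sketch in Python. The stitch vocabulary and the consume/produce bookkeeping are illustrative assumptions, not CrochetPARADE syntax; the point is only that a row either satisfies its stitch-count invariant or it does not, and that check needs no human judge.

```python
# Illustrative only: NOT CrochetPARADE syntax, just a toy stitch vocabulary
# with (stitches consumed, stitches produced) per operation.
# sc = single crochet (1 -> 1), inc = increase (1 -> 2), dec = decrease (2 -> 1).
STITCH_IO = {"sc": (1, 1), "inc": (1, 2), "dec": (2, 1)}

def check_row(row, stitches_available):
    """Return the stitch count after a row, or raise if the row is impossible."""
    consumed = produced = 0
    for stitch in row:
        if stitch not in STITCH_IO:
            raise ValueError(f"undefined stitch: {stitch}")
        used, made = STITCH_IO[stitch]
        consumed += used
        produced += made
    if consumed != stitches_available:
        raise ValueError(f"row uses {consumed} stitches but {stitches_available} exist")
    return produced

# A valid increase round: 6 stitches become 12.
print(check_row(["inc"] * 6, stitches_available=6))  # 12
# An invalid row raises instead of being graded on how fluent it sounds:
# check_row(["sc"] * 5, stitches_available=6)  -> ValueError
```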

Analysis — What the paper does

CrochetBench formalizes the evaluation ladder across four task types:

  1. Recognition — Identify stitches from images (multi-label classification).
  2. Comprehension — Select correct instructions from distractors that look deceptively similar.
  3. Generation — Produce human-like crochet instructions from images.
  4. Formalization — Translate instructions into executable CrochetPARADE DSL code.

The real pivot is the last step. Once models must produce compilable code with numerical, symbolic, and topological consistency, performance collapses. Commercial VLMs perform better on surface tasks, but open-source Qwen2-VL unexpectedly takes the lead on project-level DSL validity—hinting that procedural competence doesn’t necessarily scale with model size.

Findings — Results with visualization

Here’s a distilled snapshot of model performance across the ladder:

Model Performance by Task Type

| Task | What it Tests | Best Closed-Source | Best Open-Source | Key Takeaway |
|---|---|---|---|---|
| Stitch Recognition | Fine-grained visual grounding | Claude Sonnet 4 | DeepSeek-VL | Vision matters, but not enough. |
| Instruction Selection | Visual–text structural alignment | GPT‑4o | Qwen2‑VL | Still mostly perceptual. |
| Instruction Generation | Procedural NL text | Gemini 2.5 | DeepSeek-VL | Fluency ≠ fidelity. |
| DSL Translation (Step) | Local symbolic correctness | Claude | Qwen2‑VL | Context helps, but errors compound. |
| DSL Translation (Project) | Full pattern executability | — | Qwen2‑VL | Long-horizon state tracking is the true bottleneck. |

Drop-off Across the Pipeline

Recognition → Comprehension → Generation → DSL Execution
High → Medium → Low → Near Failure

The decline is not linear; it’s a cliff.

Why models fail

| Error Category | Description | Frequency |
|---|---|---|
| Syntax/Brackets | Missing or unbalanced symbols | Extremely high |
| Undefined Stitches | Fabricated or invalid symbols | High (open-source) |
| Label/Reference Errors | Referring to non-existent stitches | High (closed-source) |
| Structural Errors | Incorrect topology or impossible sequences | Universal |

Models hallucinate, miscount, misalign, or forget stitch state. When outputs are forced to compile, “creative freedom” becomes “structural incoherence.”
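
For intuition, here is a hedged sketch of how failure categories like those in the table could be detected mechanically. The token format, the stitch vocabulary, and the `rowN` label convention are all hypothetical, not the benchmark's actual grammar.

```python
# Hypothetical output format: lowercase stitch tokens, bracketed repeats,
# numeric repeat counts, and back-references like "row3" to earlier rows.
import re

KNOWN_STITCHES = {"ch", "sc", "dc", "inc", "dec", "sl"}

def categorize_errors(tokens, defined_labels):
    """Bucket problems in a token stream into the error categories above."""
    errors = []
    depth = 0
    for tok in tokens:
        if tok == "[":
            depth += 1
        elif tok == "]":
            depth -= 1
            if depth < 0:                       # closing bracket with no opener
                errors.append(("syntax/brackets", tok))
                depth = 0
        elif re.fullmatch(r"row\d+", tok):
            if tok not in defined_labels:       # reference to a row that never existed
                errors.append(("label/reference", tok))
        elif not tok.isdigit() and tok not in KNOWN_STITCHES:
            errors.append(("undefined stitch", tok))
    if depth != 0:
        errors.append(("syntax/brackets", "unclosed ["))
    return errors

# A fluent-looking but broken output: invented stitch, stray bracket,
# and a reference to a row that was never defined.
output = ["[", "sc", "inc", "magicloop", "]", "]", "row9"]
print(categorize_errors(output, defined_labels={"row1", "row2"}))
```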

Implications — Why this matters for business and industry

CrochetBench’s lesson reaches far beyond fiber arts.

1. Procedural automation will hit symbolic bottlenecks

Enterprise automation, robotics, manufacturing—even RPA—all require models that can emit correct, stateful, executable steps. Today’s VLMs simply cannot maintain symbolic consistency across long horizons.
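
A toy illustration of why the long horizon is the hard part, with made-up stitch counts: one off-by-one early in a pattern invalidates everything downstream, no matter how fluent the remaining rows look.

```python
# Each row is (stitches it consumes, stitches it produces); counts are illustrative.
def run_pattern(rows, start_count):
    """Simulate rows in order; stop at the first count mismatch."""
    count = start_count
    for i, (consumed, produced) in enumerate(rows, start=1):
        if consumed != count:
            return f"fails at row {i}: expects {consumed} stitches, has {count}"
        count = produced
    return f"compiles: finishes with {count} stitches"

# Twenty consistent rows...
rows = [(6 + i, 6 + i + 1) for i in range(20)]
print(run_pattern(rows, start_count=6))   # compiles: finishes with 26 stitches
# ...then a single miscount at row 3 poisons rows 4 through 20.
rows[2] = (10, 10)
print(run_pattern(rows, start_count=6))   # fails at row 3
```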

2. Closed-source ≠ unbeatable

Qwen2‑VL outperforming commercial models in DSL correctness indicates:

  • scaling laws differ for procedural tasks;
  • smaller, domain-attuned models may outperform giants;
  • open ecosystems may excel where correctness > charisma.

3. “Image → Action” is harder than expected

Tech demos showing a robot folding laundry are carefully curated exceptions, not general capabilities. If a model can’t output correct crochet, expecting it to reliably generate safe industrial procedures is wishful thinking.

4. Evaluation must shift from similarity to executability

BLEU and ROUGE flatter models by measuring overlap, not correctness. DSL-based evaluation is a necessary evolution for any automation where errors propagate.
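
A small sketch of the difference, using a crude unigram-overlap score as a stand-in for BLEU/ROUGE (neither metric is actually computed here) and a bracket-balance check as a stand-in for a real DSL compiler; both are assumptions for illustration only.

```python
def token_overlap(candidate, reference):
    """Crude precision-style overlap, in the spirit of n-gram metrics."""
    cand, ref = candidate.split(), reference.split()
    return sum(1 for t in cand if t in ref) / max(len(cand), 1)

def compiles(pattern):
    """Stand-in for a DSL compiler: here, just require balanced brackets."""
    depth = 0
    for ch in pattern:
        depth += {"[": 1, "]": -1}.get(ch, 0)
        if depth < 0:
            return False
    return depth == 0

reference = "row 2 : [ sc inc ] x 6"
candidate = "row 2 : [ sc inc x 6"      # fluent, nearly identical, missing one bracket

print(token_overlap(candidate, reference))  # 1.0: overlap rewards the broken output
print(compiles(candidate))                  # False: executability does not
```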

Conclusion — Wrap-up

CrochetBench is a polite but firm reminder: generating descriptions is easy. Generating correct procedures—especially executable ones—is brutally hard. Until multimodal models can stitch symbolic logic, memory, and spatial coherence into a unified fabric, true “AI agents that act” will remain mostly marketing yarn.

Cognaptus: Automate the Present, Incubate the Future.