Opening — Why this matters now

The AI industry is celebrating multimodal models as if they can already do things. Look at a picture, generate a plan, and—supposedly—convert visual understanding into executable action. But when you swap the glossy demos for a domain that demands fine-grained, symbolic precision—like crochet—the illusion cracks.

CrochetBench, a new benchmark evaluating whether vision‑language models can move from describing to doing, is far more than a quirky dataset. It is a stress test for the kind of procedural reasoning that underpins robotics, manufacturing automation, and any AI system meant to execute real-world workflows.

And the results? Let’s just say the models tug the yarn… but don’t quite weave the fabric.

Background — Context and prior art

Most multimodal benchmarks reward superficial alignment: match an image to text, generate a caption, retrieve a recipe. Nice, but fundamentally passive. Domains like cooking or assembly eventually push toward procedural reasoning, but validating correctness requires real-world execution—expensive, slow, and noisy.

Crochet, of all things, fixes this. Each stitch is a symbolic unit; each pattern is, in effect, a program. Better yet, the CrochetPARADE DSL lets patterns be compiled and rendered automatically, so models can be judged not just on linguistic resemblance but on structural validity. That makes crochet a surprisingly powerful proxy for evaluating:

  • Symbolic grounding
  • 3D-aware procedural synthesis
  • Long-range state tracking
  • Executable correctness under strict grammars

It is program synthesis wearing a cozy sweater.
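
To make "pattern as program" concrete, here is a minimal sketch in Python. The stitch vocabulary and the consume/produce bookkeeping are illustrative assumptions, not CrochetPARADE syntax; the point is only that a row either satisfies its stitch-count invariant or it does not, and that check needs no human judge.

```python
# Illustrative only: NOT CrochetPARADE syntax, just a toy stitch vocabulary
# with (stitches consumed, stitches produced) per operation.
# sc = single crochet (1 -> 1), inc = increase (1 -> 2), dec = decrease (2 -> 1).
STITCH_IO = {"sc": (1, 1), "inc": (1, 2), "dec": (2, 1)}

def check_row(row, stitches_available):
    """Return the stitch count after a row, or raise if the row is impossible."""
    consumed = produced = 0
    for stitch in row:
        if stitch not in STITCH_IO:
            raise ValueError(f"undefined stitch: {stitch}")
        used, made = STITCH_IO[stitch]
        consumed += used
        produced += made
    if consumed != stitches_available:
        raise ValueError(f"row uses {consumed} stitches but {stitches_available} exist")
    return produced

# A valid increase round: 6 stitches become 12.
print(check_row(["inc"] * 6, stitches_available=6))  # 12
# An invalid row raises instead of being graded on how fluent it sounds:
# check_row(["sc"] * 5, stitches_available=6)  -> ValueError
```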

Analysis — What the paper does

CrochetBench formalizes the evaluation ladder across four task types:

  1. Recognition — Identify stitches from images (multi-label classification).
  2. Comprehension — Select correct instructions from distractors that look deceptively similar.
  3. Generation — Produce human-like crochet instructions from images.
  4. Formalization — Translate instructions into executable CrochetPARADE DSL code.

The real pivot is the last step. Once models must produce compilable code with numerical, symbolic, and topological consistency, performance collapses. Commercial VLMs perform better on surface tasks, but open-source Qwen2-VL unexpectedly takes the lead on project-level DSL validity—hinting that procedural competence doesn’t necessarily scale with model size.

Findings — Results with visualization

Here’s a distilled snapshot of model performance across the ladder:

Model Performance by Task Type

| Task | What it Tests | Best Closed-Source | Best Open-Source | Key Takeaway |
|---|---|---|---|---|
| Stitch Recognition | Fine-grained visual grounding | Claude Sonnet 4 | DeepSeek-VL | Vision matters, but not enough. |
| Instruction Selection | Visual–text structural alignment | GPT‑4o | Qwen2‑VL | Still mostly perceptual. |
| Instruction Generation | Procedural NL text | Gemini 2.5 | DeepSeek-VL | Fluency ≠ fidelity. |
| DSL Translation (Step) | Local symbolic correctness | Claude | Qwen2‑VL | Context helps, but errors compound. |
| DSL Translation (Project) | Full pattern executability | — | Qwen2‑VL | Long-horizon state tracking is the true bottleneck. |

Drop-off Across the Pipeline

Recognition → Comprehension → Generation → DSL Execution
High → Medium → Low → Near Failure

The decline is not linear; it’s a cliff.

Why models fail

| Error Category | Description | Frequency |
|---|---|---|
| Syntax/Brackets | Missing or unbalanced symbols | Extremely high |
| Undefined Stitches | Fabricated or invalid symbols | High (open-source) |
| Label/Reference Errors | Referring to non-existent stitches | High (closed-source) |
| Structural Errors | Incorrect topology or impossible sequences | Universal |

Models hallucinate, miscount, misalign, or forget stitch state. When outputs are forced to compile, “creative freedom” becomes “structural incoherence.”
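
For intuition, here is a hedged sketch of how failure categories like those in the table could be detected mechanically. The token format, the stitch vocabulary, and the `rowN` label convention are all hypothetical, not the benchmark's actual grammar.

```python
# Hypothetical output format: lowercase stitch tokens, bracketed repeats,
# numeric repeat counts, and back-references like "row3" to earlier rows.
import re

KNOWN_STITCHES = {"ch", "sc", "dc", "inc", "dec", "sl"}

def categorize_errors(tokens, defined_labels):
    """Bucket problems in a token stream into the error categories above."""
    errors = []
    depth = 0
    for tok in tokens:
        if tok == "[":
            depth += 1
        elif tok == "]":
            depth -= 1
            if depth < 0:                       # closing bracket with no opener
                errors.append(("syntax/brackets", tok))
                depth = 0
        elif re.fullmatch(r"row\d+", tok):
            if tok not in defined_labels:       # reference to a row that never existed
                errors.append(("label/reference", tok))
        elif not tok.isdigit() and tok not in KNOWN_STITCHES:
            errors.append(("undefined stitch", tok))
    if depth != 0:
        errors.append(("syntax/brackets", "unclosed ["))
    return errors

# A fluent-looking but broken output: invented stitch, stray bracket,
# and a reference to a row that was never defined.
output = ["[", "sc", "inc", "magicloop", "]", "]", "row9"]
print(categorize_errors(output, defined_labels={"row1", "row2"}))
```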

Implications — Why this matters for business and industry

CrochetBench’s lesson reaches far beyond fiber arts.

1. Procedural automation will hit symbolic bottlenecks

Enterprise automation, robotics, manufacturing—even RPA—all require models that can emit correct, stateful, executable steps. Today’s VLMs simply cannot maintain symbolic consistency across long horizons.
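
A toy illustration of why the long horizon is the hard part, with made-up stitch counts: one off-by-one early in a pattern invalidates everything downstream, no matter how fluent the remaining rows look.

```python
# Each row is (stitches it consumes, stitches it produces); counts are illustrative.
def run_pattern(rows, start_count):
    """Simulate rows in order; stop at the first count mismatch."""
    count = start_count
    for i, (consumed, produced) in enumerate(rows, start=1):
        if consumed != count:
            return f"fails at row {i}: expects {consumed} stitches, has {count}"
        count = produced
    return f"compiles: finishes with {count} stitches"

# Twenty consistent rows...
rows = [(6 + i, 6 + i + 1) for i in range(20)]
print(run_pattern(rows, start_count=6))   # compiles: finishes with 26 stitches
# ...then a single miscount at row 3 poisons rows 4 through 20.
rows[2] = (10, 10)
print(run_pattern(rows, start_count=6))   # fails at row 3
```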

2. Closed-source ≠ unbeatable

Qwen2‑VL outperforming commercial models in DSL correctness indicates:

  • scaling laws differ for procedural tasks;
  • smaller, domain-attuned models may outperform giants;
  • open ecosystems may excel where correctness > charisma.

3. “Image → Action” is harder than expected

Tech demos showing a robot folding laundry are carefully curated exceptions, not general capabilities. If a model can’t output correct crochet, expecting it to reliably generate safe industrial procedures is wishful thinking.

4. Evaluation must shift from similarity to executability

BLEU and ROUGE flatter models by measuring overlap, not correctness. DSL-based evaluation is a necessary evolution for any automation where errors propagate.
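
A small sketch of the difference, using a crude unigram-overlap score as a stand-in for BLEU/ROUGE (neither metric is actually computed here) and a bracket-balance check as a stand-in for a real DSL compiler; both are assumptions for illustration only.

```python
def token_overlap(candidate, reference):
    """Crude precision-style overlap, in the spirit of n-gram metrics."""
    cand, ref = candidate.split(), reference.split()
    return sum(1 for t in cand if t in ref) / max(len(cand), 1)

def compiles(pattern):
    """Stand-in for a DSL compiler: here, just require balanced brackets."""
    depth = 0
    for ch in pattern:
        depth += {"[": 1, "]": -1}.get(ch, 0)
        if depth < 0:
            return False
    return depth == 0

reference = "row 2 : [ sc inc ] x 6"
candidate = "row 2 : [ sc inc x 6"      # fluent, nearly identical, missing one bracket

print(token_overlap(candidate, reference))  # 1.0: overlap rewards the broken output
print(compiles(candidate))                  # False: executability does not
```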

Conclusion — Wrap-up

CrochetBench is a polite but firm reminder: generating descriptions is easy. Generating correct procedures—especially executable ones—is brutally hard. Until multimodal models can stitch symbolic logic, memory, and spatial coherence into a unified fabric, true “AI agents that act” will remain mostly marketing yarn.

Cognaptus: Automate the Present, Incubate the Future.