Opening — Why this matters now
The AI industry is celebrating multimodal models as if they can already act: look at a picture, generate a plan, and, supposedly, convert visual understanding into executable action. But swap the glossy demos for a domain that demands fine-grained, symbolic precision, like crochet, and the illusion cracks.
CrochetBench, a new benchmark evaluating whether vision‑language models can move from describing to doing, is far more than a quirky dataset. It is a stress test for the kind of procedural reasoning that underpins robotics, manufacturing automation, and any AI system meant to execute real-world workflows.
And the results? Let’s just say the models tug the yarn… but don’t quite weave the fabric.
Background — Context and prior art
Most multimodal benchmarks reward superficial alignment: match an image to text, generate a caption, retrieve a recipe. Nice, but fundamentally passive. Domains like cooking or assembly eventually push toward procedural reasoning, but validating correctness requires real-world execution—expensive, slow, and noisy.
Crochet, ironically, fixes this. Each stitch is a symbolic unit; each pattern is a program. Better yet, the CrochetPARADE DSL allows automatic compilation and rendering. Models can be judged not just by linguistic resemblance but by structural validity. That makes crochet a surprisingly powerful proxy for evaluating:
- Symbolic grounding
- 3D-aware procedural synthesis
- Long-range state tracking
- Executable correctness under strict grammars
It is program synthesis wearing a cozy sweater.
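To see why, treat a pattern as a program whose state is the live stitch count. Here is a minimal sketch in Python, using a toy row format invented for this post (not the actual CrochetPARADE syntax): each row consumes some number of existing stitches and produces new ones, and a simulator rejects rows that are structurally impossible.

```python
# Toy illustration of "each pattern is a program".
# The row format here is invented for this sketch, NOT real CrochetPARADE.

def simulate(rows: list[tuple[int, int]]) -> list[int]:
    """Track the live stitch count row by row; fail on impossible rows."""
    stitches, counts = 0, []
    for i, (consumed, produced) in enumerate(rows, start=1):
        if consumed > stitches:
            raise ValueError(f"row {i} needs {consumed} stitches; only {stitches} exist")
        stitches = produced
        counts.append(stitches)
    return counts

# Classic amigurumi opening: 6 sc in a ring, then two increase rounds.
print(simulate([(0, 6), (6, 12), (12, 18)]))  # [6, 12, 18]
```

A model that loses track of that running count, which is exactly the long-range state tracking item above, produces patterns a check like this rejects immediately.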
Analysis — What the paper does
CrochetBench formalizes the evaluation ladder across four task types:
- Recognition — Identify stitches from images (multi-label classification; a scoring sketch follows this list).
- Comprehension — Select correct instructions from distractors that look deceptively similar.
- Generation — Produce human-like crochet instructions from images.
- Formalization — Translate natural-language instructions into executable DSL code.
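For the first rung, recognition reduces to set comparison. A minimal sketch of per-image precision, recall, and F1 over predicted stitch labels; the abbreviations are standard crochet shorthand, but the exact metric used in the paper is an assumption here:

```python
def multilabel_prf(pred: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one image's stitch labels."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# The model spots single and double crochet but hallucinates a treble.
print(multilabel_prf({"sc", "dc", "tr"}, {"sc", "dc", "hdc"}))  # ≈ (0.67, 0.67, 0.67)
```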
The real pivot is the last step. Once models must produce compilable code with numerical, symbolic, and topological consistency, performance collapses. Commercial VLMs perform better on surface tasks, but open-source Qwen2-VL unexpectedly takes the lead on project-level DSL validity—hinting that procedural competence doesn’t necessarily scale with model size.
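Scoring that last rung is conceptually simple: try to compile every generated pattern and count the survivors. A hedged sketch; `compile_pattern` is a placeholder for a real DSL front end such as CrochetPARADE's, whose actual interface this post does not assume:

```python
def executability_rate(patterns: list[str], compile_pattern) -> float:
    """Fraction of generated patterns that actually compile.

    `compile_pattern` is a stand-in for a real DSL compiler; all it
    must do here is raise on syntactically or semantically invalid input.
    """
    if not patterns:
        return 0.0
    passed = 0
    for source in patterns:
        try:
            compile_pattern(source)  # raises on invalid DSL
            passed += 1
        except Exception:
            pass  # any failure counts against the model
    return passed / len(patterns)
```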
Findings — Results with visualization
Here’s a distilled snapshot of model performance across the ladder:
Model Performance by Task Type
| Task | What it Tests | Best Closed-Source | Best Open-Source | Key Takeaway |
|---|---|---|---|---|
| Stitch Recognition | Fine-grained visual grounding | Claude Sonnet 4 | DeepSeek-VL | Vision matters, but not enough. |
| Instruction Selection | Visual–text structural alignment | GPT‑4o | Qwen2‑VL | Still mostly perceptual. |
| Instruction Generation | Procedural NL text | Gemini 2.5 | DeepSeek-VL | Fluency ≠ fidelity. |
| DSL Translation (Step) | Local symbolic correctness | Claude | Qwen2‑VL | Context helps, but errors compound. |
| DSL Translation (Project) | Full pattern executability | — | Qwen2‑VL | Long-horizon state tracking is the true bottleneck. |
Drop-off Across the Pipeline
Recognition → Comprehension → Generation → DSL Execution
High → Medium → Low → Near Failure
The decline is not linear; it’s a cliff.
Why models fail
| Error Category | Description | Frequency |
|---|---|---|
| Syntax/Brackets | Missing or unbalanced symbols | Extremely high |
| Undefined Stitches | Fabricated or invalid symbols | High (open-source) |
| Label/Reference Errors | Referring to non-existent stitches | High (closed-source) |
| Structural Errors | Incorrect topology or impossible sequences | Universal |
Models hallucinate, miscount, misalign, or forget stitch state. When outputs are forced to compile, “creative freedom” becomes “structural incoherence.”
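The first two rows of that table are mechanically checkable before a compiler even runs. A toy linter sketch, with an invented stitch vocabulary and whitespace tokenization (a real checker would be grammar-driven):

```python
KNOWN_STITCHES = {"ch", "sc", "hdc", "dc", "tr", "sl", "inc", "dec"}  # toy vocabulary
OPEN, CLOSE = {"(", "["}, {")": "(", "]": "["}

def lint(tokens: list[str]) -> list[str]:
    """Flag unbalanced brackets and undefined (hallucinated) stitch symbols."""
    errors, stack = [], []
    for tok in tokens:
        if tok in OPEN:
            stack.append(tok)
        elif tok in CLOSE:
            if not stack or stack.pop() != CLOSE[tok]:
                errors.append(f"unbalanced bracket: {tok}")
        elif tok not in KNOWN_STITCHES and not tok.isdigit():
            errors.append(f"undefined stitch: {tok}")
    errors.extend(f"unclosed bracket: {b}" for b in stack)
    return errors

# One mismatched bracket, one hallucinated stitch:
print(lint("( sc inc ] 3 blorp".split()))
```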
Implications — Why this matters for business and industry
CrochetBench’s lesson reaches far beyond fiber arts.
1. Procedural automation will hit symbolic bottlenecks
Enterprise automation, robotics, manufacturing, and even RPA all require models that can emit correct, stateful, executable steps. Today’s VLMs simply cannot maintain symbolic consistency across long horizons.
2. Closed-source ≠ unbeatable
Qwen2‑VL outperforming commercial models in DSL correctness indicates:
- scaling laws differ for procedural tasks;
- smaller, domain-attuned models may outperform giants;
- open ecosystems may excel where correctness > charisma.
3. “Image → Action” is harder than expected
Tech demos showing a robot folding laundry are carefully curated exceptions, not general capabilities. If a model can’t output correct crochet, expecting it to reliably generate safe industrial procedures is wishful thinking.
4. Evaluation must shift from similarity to executability
BLEU and ROUGE flatter models by measuring overlap, not correctness. DSL-based evaluation is a necessary evolution for any automation where errors propagate.
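A five-line demonstration of the problem. Both candidates below share every token with the reference, so a toy unigram-overlap score (standing in for BLEU/ROUGE; the patterns are invented) rates them identically, yet one of them can never compile:

```python
def unigram_overlap(candidate: str, reference: str) -> float:
    """Toy stand-in for BLEU/ROUGE: fraction of candidate tokens found in the reference."""
    cand, ref = candidate.split(), set(reference.split())
    return sum(tok in ref for tok in cand) / len(cand)

reference = "( sc inc ) 6"   # repeat (sc, inc) six times
valid     = "( sc inc ) 6"   # compiles: brackets balance
broken    = "( sc inc 6"     # never compiles: bracket left open

print(unigram_overlap(valid, reference))   # 1.0
print(unigram_overlap(broken, reference))  # 1.0 -- overlap cannot see the bug
```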
Conclusion — Wrap-up
CrochetBench is a polite but firm reminder: generating descriptions is easy. Generating correct procedures—especially executable ones—is brutally hard. Until multimodal models can stitch symbolic logic, memory, and spatial coherence into a unified fabric, true “AI agents that act” will remain mostly marketing yarn.
Cognaptus: Automate the Present, Incubate the Future.