Thinking in Panels: Why Comics Might Beat Video for Multimodal Reasoning

Opening — Why this matters now

Multimodal reasoning has quietly hit an efficiency wall. We taught models to think step by step with text, then asked them to imagine with images, and finally to reason with videos. Each step added expressive power—and cost. Images freeze time. Videos drown signal in redundancy. Somewhere between the two, reasoning gets expensive fast.

The paper “Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling” proposes a surprisingly mundane fix: comics. Not as a novelty, but as a deliberately engineered intermediate reasoning medium. The argument is simple and sharp—comics preserve temporal logic without paying the video tax.

Background — From Chain-of-Thought to visual overload

Chain-of-Thought (CoT) unlocked reasoning gains by externalizing intermediate steps. Multimodal variants extended this idea:

Thinking with Images: generate or reference images as visual scratchpads.
Thinking with Video: encode temporal reasoning via short generated clips.

Both approaches work—but imperfectly. Static images collapse time. Videos introduce linear cost growth, redundancy, and brittle generation. The paper reframes the problem: reasoning does not need continuous time—it needs structured time.

Analysis — What the paper actually does

The authors introduce Thinking with Comics (TwC), treating multi-panel comics as high-density reasoning carriers. Each panel corresponds to a reasoning state; the sequence encodes causality and temporal flow.

Two implementation paths are explored:

Path	Description	Role of Comics
Path I	End-to-End Visualized Reasoning	Comic is the reasoning process
Path II	Comic-as-Context Reasoning	Comic conditions a downstream VLM

In Path I, an image generator (Gemini-3 Pro Image) produces a full comic whose final panel contains the answer. In Path II, the comic becomes an explicit intermediate variable, paired with the original question and passed into a multimodal LLM.

Crucially, comics are generated globally, not incrementally—avoiding error accumulation and preserving cross-panel coherence.
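The two paths can be sketched as a small pipeline. This is a minimal illustration, not the authors' code: `ComicGenerator`, `AnswerReader`, `VisionLanguageModel`, `path_one`, and `path_two` are hypothetical names, and the image generator and VLM are passed in as plain callables so that no real API is assumed. The one property taken from the text is that the comic comes from a single global generation call rather than a panel-by-panel loop.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins (not from the paper): plug in any image generator
# and any vision-language model that match these shapes.
ComicGenerator = Callable[[str], bytes]             # prompt -> comic image bytes
AnswerReader = Callable[[bytes], str]               # comic -> answer read off the final panel
VisionLanguageModel = Callable[[str, bytes], str]   # (question, comic) -> answer


@dataclass
class TwCResult:
    comic: bytes
    answer: str


def path_one(question: str, generate: ComicGenerator, read_answer: AnswerReader) -> TwCResult:
    """Path I: the comic is the reasoning; the answer sits in its final panel."""
    comic = generate(question)  # one global call, no incremental panel loop
    return TwCResult(comic=comic, answer=read_answer(comic))


def path_two(question: str, generate: ComicGenerator, vlm: VisionLanguageModel) -> TwCResult:
    """Path II: the comic is an intermediate variable passed to a downstream VLM."""
    comic = generate(question)
    # The original question travels with the comic, as described above.
    return TwCResult(comic=comic, answer=vlm(question, comic))
```

Keeping the generator and the reader as separate arguments makes it easy to swap either component, which is the sense in which the comic acts as a model-agnostic intermediate.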

Findings — Performance, structure, and cost

1. Reasoning accuracy improves

Across benchmarks (MATH500, GSM8K, MathVista, DocVQA, CulturalBench), TwC consistently outperforms Thinking-with-Images and matches or exceeds video-based reasoning—especially on multi-step and long-context tasks.

2. Narrative structure matters

Different comic styles induce different reasoning behaviors:

Narrative Style	Best Use Case	Accuracy Gain
Documentary	Factual grounding	Baseline
Slice-of-life	Commonsense reasoning	+19 pts
Detective	Logical / causal inference	+28 pts

Narrative framing acts as a visual system prompt, not decoration.
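One way to picture the "visual system prompt" idea is a prompt builder that prepends a style hint to the question. The sketch below is an assumption throughout: the style names mirror the table above, but the hint wording, the `STYLE_HINTS` table, and `build_comic_prompt` are invented for illustration and are not the prompts used in the paper.

```python
# Hypothetical style hints; the wording is illustrative, not taken from the paper.
STYLE_HINTS = {
    "documentary": "Present each step as a factual, neutral record.",
    "slice-of-life": "Show everyday characters working through the situation.",
    "detective": "Show an investigator collecting clues and drawing inferences.",
}


def build_comic_prompt(question: str, style: str = "documentary", n_panels: int = 4) -> str:
    """Compose a single prompt asking for a whole comic in the chosen narrative style."""
    if style not in STYLE_HINTS:
        raise ValueError(f"unknown style: {style!r}")
    if n_panels < 1:
        raise ValueError("n_panels must be at least 1")
    return (
        f"Draw a {n_panels}-panel comic in a {style} style. "
        f"{STYLE_HINTS[style]} "
        "Each panel is one reasoning step; the final panel states the answer. "
        f"Question: {question}"
    )
```

For example, `build_comic_prompt("Who took the key?", style="detective", n_panels=5)` yields one prompt that fixes both the narrative frame and the panel budget before any image is generated.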

3. Panel scaling behaves like CoT

Accuracy saturates around 4–6 panels, forming a Pareto frontier between information density and compute cost. More panels yield diminishing returns—mirroring textual CoT depth.
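The saturation point can be located mechanically once accuracy has been measured at several panel counts. The helper below is a sketch under stated assumptions: `smallest_saturating_panel_count` and the `min_gain` threshold are hypothetical, and the caller must supply real measurements, since none are reproduced here.

```python
from typing import Mapping


def smallest_saturating_panel_count(
    accuracy_by_panels: Mapping[int, float], min_gain: float = 0.01
) -> int:
    """Return the first panel count after which the next step adds less than min_gain."""
    counts = sorted(accuracy_by_panels)
    if not counts:
        raise ValueError("need at least one measurement")
    for current, larger in zip(counts, counts[1:]):
        gain = accuracy_by_panels[larger] - accuracy_by_panels[current]
        if gain < min_gain:
            return current  # adding panels beyond this point barely helps
    return counts[-1]  # no saturation observed in the measured range
```

With made-up numbers such as `{2: 0.50, 4: 0.70, 6: 0.705, 8: 0.71}`, the function returns 4, because moving from 4 to 6 panels adds less than one point of accuracy.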

4. Temporal order is not optional

Shuffling or deleting panels degrades performance. The model is not treating panels as independent images; it relies on their ordered causal structure.
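The shuffle and delete ablations are simple to reproduce on any list of panels. The helpers below are a hypothetical sketch (`shuffle_panels` and `drop_panel` are not names from the paper) and assume the panels are available as separate, distinct items, for instance cropped images or file paths.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shuffle_panels(panels: Sequence[T], seed: int = 0) -> List[T]:
    """Return a reordered copy of the panels; the input is left untouched."""
    rng = random.Random(seed)  # seeded so an ablation run is repeatable
    shuffled = list(panels)
    rng.shuffle(shuffled)
    if len(shuffled) > 1 and shuffled == list(panels):
        # The shuffle happened to keep the original order; rotate by one instead.
        shuffled = shuffled[1:] + shuffled[:1]
    return shuffled


def drop_panel(panels: Sequence[T], index: int) -> List[T]:
    """Return a copy of the panels with the one at `index` removed."""
    if not 0 <= index < len(panels):
        raise IndexError("panel index out of range")
    return [panel for i, panel in enumerate(panels) if i != index]
```

Feeding the original, shuffled, and reduced sequences to the same downstream model and comparing accuracy is the comparison this finding describes.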

5. Textual anchoring is decisive

Speech bubbles and narration significantly reduce visual ambiguity. Removing embedded text causes double-digit accuracy drops on cultural and visual math tasks.

6. Cost efficiency dominates video

Using standard API pricing assumptions:

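Modality	Cost Model	Cost Scaling
Video	$C(t)=\alpha t$	Grows linearly with duration
Comics	$C=\beta$	Constant

Here $t$ is the clip duration, $\alpha$ is the video price per unit of duration, and $\beta$ is the flat price of one generated comic.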

For a 10-second reasoning task, comics cost roughly 13 to 14% of video (a reduction of about 86%), while preserving most temporal information.
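The two cost models above imply a simple ratio, $\beta / (\alpha t)$, and a break-even duration, $\beta / \alpha$, beyond which the comic is always cheaper. The sketch below works through that arithmetic. The function names are hypothetical, and the example values of `alpha` and `beta` are placeholder units chosen only to reproduce a ratio near the figure quoted above; they are not real API prices.

```python
def video_cost(seconds: float, alpha: float) -> float:
    """C(t) = alpha * t: cost grows linearly with clip duration."""
    return alpha * seconds


def comic_cost(beta: float) -> float:
    """C = beta: one comic costs the same regardless of the time span it depicts."""
    return beta


def cost_ratio(seconds: float, alpha: float, beta: float) -> float:
    """Comic cost as a fraction of video cost for a clip of the given duration."""
    if seconds <= 0 or alpha <= 0:
        raise ValueError("seconds and alpha must be positive")
    return comic_cost(beta) / video_cost(seconds, alpha)


def break_even_seconds(alpha: float, beta: float) -> float:
    """Clip duration at which video and comic cost the same."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return beta / alpha
```

With placeholder values `alpha=1.0` and `beta=1.3`, `cost_ratio(10, 1.0, 1.3)` is 0.13 and `break_even_seconds(1.0, 1.3)` is 1.3. Under those assumed units, any clip longer than about 1.3 seconds costs more than the comic, and the gap widens linearly with duration.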

Implications — Why this matters beyond benchmarks

This paper is not about comics per se. It is about representation efficiency.

Comics succeed because they:

Compress temporal trajectories into discrete, meaningful states
Align with existing training distributions (low domain shift)
Enable global planning rather than stepwise accumulation
Serve as a reusable, model-agnostic reasoning scaffold

For agent frameworks, long-horizon planning, and cost-sensitive inference, this suggests a new design pattern:

Don’t simulate reality continuously. Summarize it structurally.

Conclusion — A quiet but structural shift

Thinking with Comics reframes multimodal reasoning as an information economics problem. When time matters but continuity does not, panels beat frames.

If Chain-of-Thought taught models how to think, comics suggest where thinking should live: in structured, interpretable, high-density representations that scale gracefully.

Expect to see this idea resurface, not as a comics UI, but inside agent planners, simulators, and reasoning middleware.

Cognaptus: Automate the Present, Incubate the Future.
