Thinking in Panels: Why Comics Might Beat Video for Multimodal Reasoning

A dashboard screenshot is often too little. A video walkthrough is often too much. Somewhere between the two sits a strangely old-fashioned interface: panels, captions, arrows, speech bubbles, and a sequence that tells the machine what happened before what.

Yes, comics.

That sounds unserious only if we think comics are a decoration layer: something added after the reasoning is complete to make the output friendlier. The paper Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling makes a more interesting claim: comics can act as the reasoning medium itself, not merely the illustration of reasoning after the fact.¹

The idea is simple enough to be dangerous. Static images are dense but temporally flat. Videos preserve time but waste a lot of computation on near-duplicate frames. Comics compress a process into selected states, ordered panels, and embedded text. They are not as visually continuous as video, but they may carry the parts of continuity that matter for reasoning. In business terms, they are a candidate format for “show me the process” without paying for “render every second of the process.” A small mercy for anyone who has watched a three-minute AI-generated video explain something that needed six frames and one arrow.

The paper calls this paradigm Thinking with Comics (TwC). Its contribution is not merely that models can generate comic strips. The contribution is that comics are tested as an intermediate representation for multimodal reasoning: a structured visual form between single-image reasoning and video-based reasoning.

The useful comparison is not text versus comics, but image versus video versus panels

The paper’s strongest editorial frame is comparative. It is not asking whether comics replace chain-of-thought, text reasoning, or frontier multimodal models. That would be an overclaim. The more precise question is this:

When a task needs visual structure and temporal progression, what representation gives the model enough process information without drowning it in redundant media?

That produces three candidate formats.

Representation	What it preserves well	What it loses or wastes	Practical interpretation
Static image	Spatial layout, object relations, compact visual signal	Temporal order, causal progression, explicit step sequence	Good for “what is here?” Weak for “how did this unfold?”
Video	Temporal continuity, motion, changing state	Redundant frames, generation cost, longer processing overhead	Good when motion itself matters. Expensive when only key states matter.
Comics	Key states, ordered sequence, embedded text, narrative cues	Fine-grained motion, continuous dynamics	Useful when reasoning needs selected temporal structure, not full video fidelity.

The business relevance starts here. Many enterprise tasks are not true video tasks. A document workflow, insurance claim review, compliance escalation, factory SOP, customer onboarding path, or procurement approval chain usually does not require continuous motion. It requires state changes: first this information appears, then this rule applies, then this exception changes the decision.

That is naturally panel-shaped.

TwC has two paths: the comic can answer, or it can brief another model

The paper defines two implementation paths.

The first path is end-to-end visualized reasoning. A visual generation model receives the original question and generates a multi-panel comic. The reasoning process is embedded directly in the panels, and the answer is extracted from the final panel. In the paper’s setup, Gemini-3 Pro Image generates the comic, and GPT-5.2 is used as an external answer reader for answer extraction. The authors report a human verification check on 20% of the Path I samples, where human judgment and the GPT-5.2 extractor agreed fully. That does not prove universal extraction reliability, but it reduces one immediate concern: that the reported results are merely an artifact of unreadable final panels.

The second path is comics-as-context. The comic is not the final answer. It is an intermediate representation passed, together with the original question, into a vision-language model. The downstream model then produces the final textual answer. This is operationally more interesting. It treats comics as a reusable reasoning scaffold: first compress the task into structured panels, then let another model reason over that structure.

For enterprise systems, Path II is the more natural starting point. A business does not necessarily want an AI system to “answer inside a comic.” It wants the system to create a compact, auditable, visual briefing that another model, or a human reviewer, can inspect.

That makes TwC less like cartoon generation and more like visual process distillation.
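The two paths differ only in who produces the final answer, which a few lines of code make concrete. The sketch below is a minimal illustration, not the paper's implementation: `ComicGenerator`, `AnswerReader`, `VisionLanguageModel`, `TwCResult`, and the `fake_*` stubs are all hypothetical names standing in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces; none of these names come from the paper or a real SDK.
ComicGenerator = Callable[[str], bytes]             # question -> rendered multi-panel comic
AnswerReader = Callable[[bytes], str]               # comic -> answer read off the final panel
VisionLanguageModel = Callable[[str, bytes], str]   # (question, comic) -> textual answer


@dataclass
class TwCResult:
    answer: str
    comic: bytes  # kept so a reviewer can inspect the visual trace


def path_one(question: str, generate: ComicGenerator, read_answer: AnswerReader) -> TwCResult:
    """Path I: the comic carries the reasoning; the answer is extracted from it."""
    comic = generate(question)
    return TwCResult(answer=read_answer(comic), comic=comic)


def path_two(question: str, generate: ComicGenerator, vlm: VisionLanguageModel) -> TwCResult:
    """Path II: the comic is context; a downstream model writes the answer."""
    comic = generate(question)
    return TwCResult(answer=vlm(question, comic), comic=comic)


# Toy stubs so the sketch runs without model access.
def fake_generator(question: str) -> bytes:
    return f"[comic for: {question}]".encode()


def fake_reader(comic: bytes) -> str:
    return "answer read from final panel"


def fake_vlm(question: str, comic: bytes) -> str:
    return f"answer to '{question}' using comic context"
```

In both functions the comic is returned alongside the answer. That is a design choice of this sketch: it keeps the intermediate representation available for audit rather than discarding it once the answer exists.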

The main results are strongest where visual/context structure matters

The paper evaluates TwC across reasoning benchmarks and context-understanding benchmarks: MATH-500, GSM8K, MathVista, DocVQA, and CulturalBench. The main table compares direct frontier multimodal models, reasoning LLMs, image-based approaches, video-based thinking, and TwC variants.

The headline results are striking but need careful reading.

Benchmark	TwC Img & Txt result	What it suggests	What it does not prove
MathVista	85.8%	Strong gain on visually grounded mathematical reasoning	Not proof that comics help all math tasks
DocVQA	99.4%	Very strong result on document visual question answering	Not proof of robust enterprise document QA across messy internal documents
MATH-500	92.3%	Competitive but below direct frontier models in the table	Comics are not automatically better for pure text math
GSM8K	95.4%	Solid, but again not clearly superior to direct frontier models	No reason to turn every arithmetic word problem into manga, thank heavens
CulturalBench	88.3% easy / 82.2% hard	Useful evidence for cultural/contextual grounding	Not state-of-the-art against every direct frontier baseline in the table

This distinction matters. The paper’s most convincing business message is not “comics beat everything.” It is that comics appear especially helpful when the task benefits from visual grounding plus structured progression.

MathVista and DocVQA are therefore more important than pure text math results. MathVista asks for visually grounded mathematical reasoning. DocVQA asks the model to locate and reason over document information. These are closer to business settings where a model must inspect a form, chart, slide, scanned page, or process artifact and explain how it reached an answer.

Pure text math is less compelling as a business case for TwC. If a frontier model already solves the task directly, forcing it to generate panels may add latency and complexity. A panel is useful when it carries missing structure, not when it simply dresses up an equation.

The ablations tell us comics are doing work, not wearing costumes

The paper’s analysis experiments are more useful than the main leaderboard because they ask a sharper question: which parts of the comic representation matter?

Here is the cleanest way to read the evidence.

Experiment	Likely purpose	What it supports	What it does not prove
Narrative style comparison	Ablation / prompt sensitivity test	Style changes reasoning performance; detective style beats documentary and slice-of-life on tested math tasks	That one style is universally best
Panel scaling	Sensitivity test	Accuracy plateaus around 4–6 panels on the tested MATH-500 setup	A universal panel count for all domains
Panel distribution by task difficulty	Exploratory resource-allocation analysis	Harder tasks tend to use more panels; MathVista shifts toward higher panel counts	That panel count alone measures task difficulty
Shuffle/deletion perturbation	Causal structure test	Ordered panels matter; disrupting sequence hurts performance	Full explanation of how models internally use temporal order
Textual anchoring ablation	Ablation	Embedded text, bubbles, and narration improve accuracy across tested tasks	That pure visual reasoning is unnecessary
Cross-model generalization	Robustness test	Comics-as-context transfers reasonably across several VLMs	Model-agnostic reliability in production
Cost comparison with video	Efficiency analysis	Comics can reduce media generation cost versus video for tasks where panels are sufficient	End-to-end total cost across every deployment stack

The narrative-style experiment is particularly revealing. The paper reports that on MathVista and GSM8K respectively, a documentary-style baseline scores 60.0 and 68.0, slice-of-life improves to 80.0 and 86.3, and detective style reaches 85.0 and 100.0. The authors interpret this as evidence that role-playing narrative style can act as a “Visual System Prompt.”

That phrase is a little dramatic, but the point is useful. The comic format is not neutral packaging. The narrative frame shapes which details the model emphasizes and how it organizes the reasoning path. A detective frame encourages clue collection and deduction. A documentary frame may describe rather than infer. Slice-of-life may contextualize but not always sharpen the logical chain.

For business workflows, this implies something uncomfortable: visual reasoning systems may need genre design, not just prompt design. A compliance review comic should not look like customer onboarding. A maintenance diagnosis comic should not use the same visual grammar as a training tutorial. Style is not decoration; it is a control surface.

Four to six panels is not a law; it is a cost-performance clue

The panel-scaling experiment varies the number of generated panels. The paper reports that reasoning accuracy reaches a visible plateau around 4–6 panels, while token cost stays in a relatively narrow range around 1100–1300 in the described MATH-500 setup.

This should not be read as “every enterprise task needs six panels.” That would be spreadsheet thinking, and spreadsheet thinking already has enough crimes to answer for.

The better interpretation is that comics offer a budgeted reasoning canvas. One panel collapses back toward ordinary image-based thinking. Too many panels introduce diminishing returns. Somewhere in the middle, the model has enough room to represent problem setup, key transition, intermediate inference, and answer.

The paper’s panel-distribution analysis strengthens this reading. GSM8K has many samples solved with one panel, but also many using four. MathVista peaks around four panels and extends more strongly toward six, with 30.41% of samples using six panels. The model appears to allocate more visual steps to harder visually grounded tasks.

For businesses, this suggests a practical design principle: do not fix the number of panels because a template says so. Let the task type and uncertainty level determine panel budget.

A lightweight customer-service explanation may need three panels: issue, cause, resolution. A compliance exception may need five: document, rule, conflict, escalation, decision. A maintenance diagnosis may need six or more if the sequence of symptoms matters.

The panel is a unit of reasoning budget.
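One way to make "panel budget" operational is a small lookup with an uncertainty adjustment. The sketch below is illustrative only. The base counts echo the three examples above, while `DEFAULT_PANELS`, `MAX_PANELS`, the `uncertainty` score, and its thresholds are assumptions that a pilot would need to tune.

```python
# Base budgets mirror the examples in the text; everything else is assumed.
BASE_PANELS = {
    "customer_support": 3,       # issue, cause, resolution
    "compliance_exception": 5,   # document, rule, conflict, escalation, decision
    "maintenance_diagnosis": 6,  # "six or more" when symptom order matters
}
DEFAULT_PANELS = 4  # assumed fallback, inside the 4-6 plateau reported for MATH-500
MAX_PANELS = 8      # assumed ceiling to keep generation cost bounded


def panel_budget(task_type: str, uncertainty: float) -> int:
    """Return a panel count for a task.

    `uncertainty` is a hypothetical score in [0, 1]; higher values buy up to
    two extra panels. The thresholds are illustrative, not from the paper.
    """
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty must be between 0 and 1")
    base = BASE_PANELS.get(task_type, DEFAULT_PANELS)
    if uncertainty >= 0.8:
        extra = 2
    elif uncertainty >= 0.5:
        extra = 1
    else:
        extra = 0
    return min(base + extra, MAX_PANELS)
```

The point of the cap is cost discipline: a budget that can only grow is not a budget.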

Temporal order matters more when panels are actual reasoning states

One of the paper’s most important tests perturbs the panel sequence. The authors shuffle or delete intermediate panels and observe performance degradation. Reported accuracy declines from 75.0% to 71.5% under these perturbations, with deletion harming reasoning more than shuffling.

The absolute drop is not enormous, but its direction matters. If the model were treating panels as a pile of independent visual hints, order would matter less. A decline after shuffling and deletion suggests that the model is using cross-panel structure at least to some extent.

Deletion being worse than shuffling is also intuitive. A shuffled process can sometimes be reconstructed, especially if captions or visual cues imply order. A missing intermediate state may remove the bridge entirely. In business terms, this resembles audit trails. A messy sequence can still be interpreted; a missing approval step is harder to recover.

This is where comics differ from a collage. A collage says, “Here are relevant things.” A comic says, “Here is how one state leads to the next.” That is a much more useful representation for tasks involving diagnosis, explanation, escalation, or root-cause analysis.
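A team that wants to rerun this kind of check on its own panel sequences needs only two perturbation helpers. The sketch below is an assumption about how such a test could be set up, not the authors' protocol. It keeps the first and last panels fixed and perturbs only the intermediate ones, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shuffle_intermediate(panels: Sequence[T], seed: int = 0) -> List[T]:
    """Reorder the intermediate panels; first and last stay in place."""
    items = list(panels)
    middle = items[1:-1]
    if len(middle) < 2:
        return items  # nothing to reorder
    shuffled = middle[:]
    random.Random(seed).shuffle(shuffled)
    if shuffled == middle:
        # The shuffle happened to be a no-op; rotate so the order really changes.
        shuffled = middle[1:] + middle[:1]
    return [items[0], *shuffled, items[-1]]


def delete_intermediate(panels: Sequence[T], seed: int = 0) -> List[T]:
    """Drop one randomly chosen intermediate panel."""
    items = list(panels)
    if len(items) <= 2:
        return items  # no intermediate panel to remove
    drop = random.Random(seed).randrange(1, len(items) - 1)
    return [p for i, p in enumerate(items) if i != drop]
```

Running the downstream model on the original, shuffled, and deleted variants of the same comic gives three accuracy numbers per task. The gap between them is a rough measure of how much the model relies on cross-panel structure.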

Textual anchoring is the boring part that makes the visual part useful

The textual anchoring ablation is probably the most business-relevant result in the paper.

The authors compare pure visual panels against comics with embedded text such as speech bubbles, narration, and symbols. Textual anchoring improves accuracy by 18.1 points on CulturalBench-Easy, 8.3 points on CulturalBench-Hard, and 13.2 points on MathVista.

This result should not surprise anyone who has ever looked at an unlabeled process diagram and quietly wished the designer a long career in a different department.

Images are powerful but ambiguous. Text reduces the search space. A panel can show a document, a highlighted number, and a character pointing to a field; the caption can say which field matters and why. The visual layer locates attention. The textual layer disambiguates meaning.

For enterprise AI, that combination is valuable because many operational failures are not failures of perception alone. They are failures of mapping perception to policy. The model sees the invoice date, the vendor name, the clause, or the chart. The harder question is what that object means in the current workflow.

Textual anchoring turns a panel from “a picture of evidence” into “evidence with interpretation attached.”
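For teams that want to test anchoring in their own domain, the bookkeeping can be as simple as tracking which panels carry text. The sketch below assumes panel text is stored as separate fields next to an image reference. `Panel`, `strip_text_anchors`, and `unanchored` are hypothetical names. In a real pipeline the text is rendered into the image, so the pure-visual arm would have to be regenerated rather than derived by clearing fields.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Panel:
    image_ref: str                 # hypothetical pointer to the rendered panel
    caption: Optional[str] = None  # narration that anchors the image
    speech: Optional[str] = None   # speech-bubble text, if any


def strip_text_anchors(panels: List[Panel]) -> List[Panel]:
    """Describe the pure-visual ablation arm: same images, no text fields."""
    return [replace(p, caption=None, speech=None) for p in panels]


def unanchored(panels: List[Panel]) -> List[int]:
    """Indices of panels that carry no text at all."""
    return [i for i, p in enumerate(panels) if not (p.caption or p.speech)]


comic = [
    Panel("p1.png", caption="Invoice dated 3 March"),
    Panel("p2.png"),
    Panel("p3.png", speech="Payment terms are 30 days"),
]
```

Here `unanchored(comic)` flags the second panel, which is the kind of gap a reviewer would want filled before the comic is used as evidence.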

The appendix quietly answers a practical implementation question: generate globally, not step by step

The appendix includes two useful checks that should not be treated as decorative extras.

First, the paper compares comic prompts against non-comic realistic storyboard prompts. On 20 samples each from MATH-500 and MathVista, comic prompts produce better panel consistency: 95.0% versus 70.0% on MATH-500 and 90.0% versus 65.0% on MathVista. Reasoning accuracy is also 15 points higher for comics in both benchmark subsets.

The likely purpose of this appendix test is robustness: it asks whether the effect comes from “multiple images” in general or from the comic format specifically. The result supports the latter interpretation. Comics are a familiar visual manifold for generative models: panel boundaries, captions, characters, and progression are all part of the format. A “four-step realistic storyboard” is more ad hoc and apparently less stable.

Second, the appendix compares global comic generation with incremental image chaining. Global generation creates the entire multi-panel comic in one pass. Incremental generation creates each panel sequentially, conditioning on previous outputs. Human evaluation scores favor global generation across accuracy, logic, state consistency, and visual-textual quality. Average accuracy is 90.0% for global generation versus 65.0% for incremental generation in the reported setting.

That matters operationally. A naive implementation might generate panel one, then panel two, then panel three, and so on. The paper’s evidence suggests that this can accumulate errors and break cross-panel coherence. If the comic is supposed to represent a reasoning trajectory, the model needs a global plan. Otherwise, the story may drift. In an enterprise workflow, “drift” is a polite word for “the audit trail now contradicts itself.”
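The difference between the two strategies is easiest to see as call patterns. The sketch below is a minimal illustration with hypothetical interfaces (`GlobalGenerator`, `StepGenerator`) and toy stubs. It shows the control flow only and says nothing about how any real image model behaves.

```python
from typing import Callable, List

# Hypothetical model interfaces, with strings standing in for rendered panels.
GlobalGenerator = Callable[[str, int], List[str]]  # (question, n_panels) -> all panels
StepGenerator = Callable[[str, List[str]], str]    # (question, earlier panels) -> next panel


def generate_global(question: str, n_panels: int, model: GlobalGenerator) -> List[str]:
    """One call plans and renders the whole comic."""
    panels = model(question, n_panels)
    if len(panels) != n_panels:
        raise ValueError("model returned an unexpected number of panels")
    return panels


def generate_incremental(question: str, n_panels: int, model: StepGenerator) -> List[str]:
    """One call per panel, each conditioned on the panels produced so far."""
    panels: List[str] = []
    for _ in range(n_panels):
        panels.append(model(question, list(panels)))
    return panels


# Toy stubs so the sketch runs without model access.
def stub_global(question: str, n_panels: int) -> List[str]:
    return [f"panel {i + 1}/{n_panels}: {question}" for i in range(n_panels)]


def stub_step(question: str, previous: List[str]) -> str:
    return f"panel {len(previous) + 1}: {question} (saw {len(previous)} earlier panels)"
```

In the incremental loop every call sees only what came before, so an early mistake becomes part of the conditioning for every later panel. The global call at least has the chance to plan the ending before drawing the beginning.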

The cost argument is not “comics are cheap”; it is “video is often wasteful”

The paper compares media generation cost between TwC and Thinking with Video. Using standard industrial pricing assumptions cited by the authors, a 10-second video reasoning instance costs $1.00, while TwC costs $0.134. The reported reduction is 86.6% in media generation cost.

This number should be interpreted narrowly. It is about media generation cost under the paper’s assumptions, not necessarily total system cost after orchestration, retries, model calls, storage, human review, and integration. Still, the direction is credible: if the task only needs selected states, video pays for redundant continuity.
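The arithmetic is worth doing once by hand. The sketch below reproduces the reported reduction from the two reported prices, then shows how the saving shrinks once a fixed per-instance overhead is added to both sides. The overhead figure `ASSUMED_OVERHEAD_USD` is invented purely for illustration and does not come from the paper.

```python
VIDEO_MEDIA_USD = 1.00   # reported cost of a 10-second video reasoning instance
COMIC_MEDIA_USD = 0.134  # reported cost of a TwC instance


def reduction(baseline: float, alternative: float) -> float:
    """Relative saving of `alternative` compared with `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - alternative) / baseline


media_saving = reduction(VIDEO_MEDIA_USD, COMIC_MEDIA_USD)

# Hypothetical: identical non-media overhead (retrieval, review, logging) on both sides.
ASSUMED_OVERHEAD_USD = 0.50
total_saving = reduction(
    VIDEO_MEDIA_USD + ASSUMED_OVERHEAD_USD,
    COMIC_MEDIA_USD + ASSUMED_OVERHEAD_USD,
)

print(f"media-only saving: {media_saving:.1%}")
print(f"saving with assumed overhead: {total_saving:.1%}")
```

With these inputs the media-only saving prints as 86.6%, matching the paper, and the saving with the assumed overhead drops to 57.7%. The exact second number is meaningless; the shrinkage is the lesson.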

The deeper point is not that comics are always cheap. It is that comics match the information structure of many reasoning tasks. A process explanation often requires:

the initial state;
the relevant evidence;
the rule or transformation;
the exception or conflict;
the final decision.

That is a panel sequence. Rendering the same logic as continuous video can be impressive, but impressiveness is not a KPI unless one is selling conference demos.
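The five-step list above maps directly onto a small data model. The sketch below is one possible encoding, with hypothetical names (`Role`, `missing_roles`, `is_ordered`). It checks whether a planned panel sequence covers every step and keeps the steps in order.

```python
from enum import IntEnum
from typing import List


class Role(IntEnum):
    """Panel roles, numbered in the order the list above gives them."""
    INITIAL_STATE = 1
    EVIDENCE = 2
    RULE = 3
    EXCEPTION = 4
    DECISION = 5


def missing_roles(sequence: List[Role]) -> List[Role]:
    """Roles from the checklist that no panel covers."""
    return [role for role in Role if role not in sequence]


def is_ordered(sequence: List[Role]) -> bool:
    """True if panels never step backwards through the process."""
    return all(a <= b for a, b in zip(sequence, sequence[1:]))


plan = [Role.INITIAL_STATE, Role.EVIDENCE, Role.RULE, Role.DECISION]
```

For this plan, `missing_roles(plan)` returns only `Role.EXCEPTION`. That may be fine for a routine case and a red flag for a compliance one.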

Where this maps to business use

The paper directly shows benchmark performance and controlled analyses. Cognaptus’ business inference is more specific: comic-style intermediate representations may be useful when companies need compact, reviewable, multimodal reasoning traces.

Several use cases follow naturally.

Business setting	Why panels may help	What must still be validated
Document QA	Panels can highlight source regions, extraction steps, and answer logic	Faithfulness to source documents; handling messy scans and tables
Compliance review	Panels can show evidence, rule, exception, and decision path	Legal reliability; escalation thresholds; audit acceptance
Customer support	Panels can turn troubleshooting into a visible sequence	Whether users trust generated visual explanations
Process training	Panels can compress SOPs into memorable sequences	Accuracy under local procedures and changing policies
Cross-cultural localization	Panels can show context, norms, and action choices	Cultural stereotyping risk; local validation
Operations diagnosis	Panels can represent symptom progression and decision points	Sensor fidelity; causal correctness

The near-term value is not “AI-generated comics for the workplace,” which sounds like a startup pitch produced after too much cold brew. The value is structured visual reasoning traces. Comics are simply one efficient format for those traces.

This distinction matters because businesses should not adopt the aesthetic blindly. A comic format for internal audit may look more like a clean panelized evidence board than a colorful strip. The design principle is panelized temporal reasoning with textual anchors, not mascot characters explaining invoice fraud.

Boundaries: promising, but not production-proof

The paper is useful, but the practical boundaries are real.

First, the evidence is benchmark-based. MathVista and DocVQA are informative, but enterprise documents and workflows are uglier: inconsistent templates, partial scans, handwritten notes, multilingual fields, and policy exceptions that were invented by someone in 2018 and never documented properly.

Second, the strongest experiments rely heavily on specific frontier models, especially Gemini-3 Pro Image and Gemini-3 Pro in the main TwC setup. The cross-model analysis suggests some transferability, but production systems would need validation across the exact models, prompts, and document types used in deployment.

Third, comics can improve interpretability without guaranteeing faithfulness. A panel can look coherent and still misrepresent the underlying evidence. This is the classic danger of generated explanations: the explanation may be easier to trust precisely because it is easier to read. The prettier the audit trail, the more suspicious the audit team should become. Healthy paranoia remains underrated.

Fourth, narrative style is a control surface and therefore also a risk surface. If detective framing improves logical reasoning, what does it do in a sensitive HR review, legal interpretation, or cultural judgment task? A narrative frame can focus attention, but it can also smuggle assumptions.

Finally, the cost comparison is about media generation cost, not total return on investment. A real deployment must include retrieval, model calls, verification, logging, governance, error handling, and human review. TwC may reduce one expensive component while adding a new orchestration layer.

These are not reasons to dismiss the paper. They are reasons to test the right thing.

A practical evaluation checklist for teams

A company experimenting with comic-style reasoning should avoid the childish version of the idea: “Let’s make our AI outputs look like comics.” That is backwards. The test should begin from workflow structure.

A useful pilot would ask:

Evaluation question	Why it matters
Does the task require temporal or causal steps?	If not, a static annotated image may be enough.
Does textual anchoring improve accuracy or user understanding?	The paper suggests it matters, but each domain needs proof.
Can the panel sequence be traced back to source evidence?	Without traceability, the comic is explanation theater.
Does global generation outperform incremental generation?	The appendix suggests global planning preserves coherence.
What panel budget gives the best accuracy-cost tradeoff?	Four to six panels may be a clue, not a universal rule.
Do users trust the output appropriately?	Over-trust is as dangerous as under-trust.
Does the method beat simpler baselines?	The real competitor may be annotated screenshots, not video.

That last point is important. In practice, TwC should not only be compared with video. It should also be compared with well-designed annotated screenshots, structured markdown explanations, and ordinary retrieval-augmented answers. If a two-column table solves the problem, do not summon a visual storytelling paradigm. AI systems already have enough hobbies.
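A pilot can encode that comparison as a gate. The sketch below is a minimal harness under stated assumptions: answers are scored by exact match, `min_gain` is an arbitrary threshold, and the run names and toy answers are made up for illustration.

```python
from typing import Dict, List


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of exact-match answers."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("need equal-length, non-empty lists")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def beats_simpler_baselines(
    runs: Dict[str, List[str]],
    gold: List[str],
    candidate: str = "comic",
    min_gain: float = 0.05,
) -> bool:
    """True if `candidate` beats every other run by at least `min_gain`."""
    scores = {name: accuracy(preds, gold) for name, preds in runs.items()}
    others = [score for name, score in scores.items() if name != candidate]
    if candidate not in scores or not others:
        raise ValueError("need the candidate run and at least one baseline")
    return scores[candidate] - max(others) >= min_gain


# Toy data, invented for illustration only.
gold = ["a", "b", "c", "d"]
runs = {
    "comic": ["a", "b", "c", "d"],
    "annotated_screenshot": ["a", "b", "c", "x"],
    "markdown_explanation": ["a", "x", "c", "x"],
}
```

The gate compares the comic against the best simpler baseline, not the average one. If the annotated screenshot is within the margin, the screenshot wins on simplicity.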

The real insight: comics are compressed process models

The paper’s title invites a playful reading, but the serious contribution is about representation design. A comic is a compressed process model. It selects states, orders them, attaches text, and preserves enough narrative continuity for reasoning.

That puts TwC in a broader family of enterprise AI techniques: not just prompting, not just retrieval, not just multimodal input, but intermediate representation engineering. The model’s performance depends on how the task is represented before final reasoning happens.

For static visual tasks, a single image may be enough. For continuous physical motion, video may still be necessary. But for many business processes, the relevant world is neither snapshot nor movie. It is a sequence of discrete states with rules connecting them.

That is exactly where panels earn their keep.

The paper does not prove that comics will become the default interface for enterprise reasoning. It does show that the format deserves more respect than the word “comic” usually receives in professional rooms. A panel sequence can be a reasoning scaffold, a cost-control mechanism, and an explanation surface at the same time.

Not bad for a medium many executives still mentally file next to childhood entertainment. Then again, corporate slide decks have been comics without jokes for decades. TwC simply asks whether we can make the panels do actual reasoning.

Cognaptus: Automate the Present, Incubate the Future.

¹ Andong Chen, Wenxin Zhu, Qiuyu Ding, Yuchen Song, Muyun Yang, and Tiejun Zhao, “Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling,” arXiv:2602.02453, 2026. https://arxiv.org/abs/2602.02453
