The joke is not the punchline
Humor is a useful humiliation device for artificial intelligence.
A model can summarize earnings calls, draft policy memos, and explain SQL joins with the confidence of a very expensive intern. Then it looks at a cartoon, reads five captions, and selects the one that sounds plausible but misses the joke entirely. Not because the grammar is hard. Not because the image has too many pixels. Because humor requires the model to notice that something is off, infer why it is off, and decide which caption resolves that mismatch in a way humans actually find satisfying.
That is a more interesting problem than “can AI be funny?” The better question is whether AI can learn the reasoning structure behind subjective judgment.
A recent arXiv paper, “Learning to Think Like a Cartoon Captionist: Incongruity–Resolution Supervision for Multimodal Humor Understanding,” uses the New Yorker Cartoon Caption Contest as its testbed and introduces a framework called IRS: Incongruity–Resolution Supervision [1]. Despite the unfortunate tax-season acronym, the idea is sharp. Instead of training a multimodal model only to select the right caption, the authors train it to reason like a captionist: reconstruct the scene, identify the visual mismatch, resolve it into an interpretation, and then align that interpretation with human preference.
The paper’s business relevance is not that your customer-service bot will soon become a nightclub act. Please spare everyone. The stronger lesson is that subjective AI tasks can improve when the hidden expert process is made explicit, supervised, and then reward-aligned. Humor is simply the cleanest way to expose the failure mode because bad judgment becomes visible immediately. The punchline does not land. Everyone knows. Even the spreadsheet people.
Humor is difficult because the answer is not enough
Most enterprise AI evaluation still behaves as if good output can be reduced to answer selection. Did the model classify the ticket correctly? Did it choose the right policy paragraph? Did it recommend the approved product? Did it pass the benchmark?
That works when the task has a stable answer key. It works less well when the output must satisfy layered human judgment: tone, timing, cultural context, implicit contradiction, brand fit, narrative economy, and preference among several “not wrong” options.
Cartoon humor sits exactly in that uncomfortable territory. A caption is not funny merely because it describes the image. It works when it resolves a visual or social incongruity. A giant amoeba on an airplane is strange. A caption that uses “single-celled” to reframe the seating problem turns the strangeness into a coherent joke. The answer is not just a label; it is the end of a reasoning path.
This is why the New Yorker Cartoon Caption Contest is a useful benchmark. It combines images, language, expert curation, and crowd preference. It also contains a nasty distinction that many AI systems dislike: identifying the relevant scene is not the same as choosing the best humorous interpretation. A model may see the objects. It may understand the words. It may still fail to connect the incongruity to the caption that resolves it best.
The paper’s central claim is that this gap is not solved by scale alone. It needs process supervision.
IRS turns caption judgment into a teachable chain
IRS decomposes multimodal humor understanding into three trainable stages:
| IRS stage | What it teaches | Practical analogue |
|---|---|---|
| Incongruity Modeling | Notice the mismatch in the visual or social setup | Spot anomalies, contradictions, friction, weak signals |
| Resolution Modeling | Build a coherent reinterpretation of the mismatch | Diagnose the cause, explain the insight, frame the decision |
| Preference Alignment | Choose the interpretation humans prefer | Optimize tone, ranking, creative fit, brand judgment |
This mechanism-first structure matters. A normal summary would say, “The authors trained Qwen2.5-VL models and improved accuracy on humor benchmarks.” True, but not the point.
The point is that the authors translated a human interpretive practice into a machine-training pipeline. First, they adapt the model to captionist discourse. Then they teach it structured reasoning traces. Finally, they use reinforcement learning with rewards that evaluate not only correctness and formatting, but also visual grounding and captionist style.
That is the reusable idea. Humor is the domain. Reasoning supervision is the method.
Incongruity Modeling gives the model better priors, not better answers
The first IRS stage is Incongruity Modeling. The authors continue pretraining the model on a curated corpus of humor-relevant material: caption contest discussions, editorial commentary, caption-writing guidance, books, and general-purpose corpora. The corpus is designed to expose the model to how captionists talk about what makes a cartoon work.
Operationally, this is domain adaptation. Conceptually, it is prior shaping.
The model is not being handed the answer to a caption-ranking task. It is being nudged toward the concepts that matter in the domain: visual oddity, narrative setup, word economy, conversational punchline, cultural reference, and the difference between a caption that merely describes and one that resolves.
The appendix makes this stage more concrete. The humor-specific corpus contains podcast transcripts, panel discussions, CartoonStock commentaries, and books such as Bob Mankoff’s The Naked Cartoonist and Lawrence Wood’s Your Caption Has Been Selected. It is supplemented with FineWeb and OLMo-Mix material to avoid making the model too narrow. The paper also notes a reproducibility boundary: some corpus materials are copyrighted and cannot be released directly. That matters. If a business wants to copy the method, the recipe is visible, but not every ingredient is downloadable.
The ablation results also prevent an easy misunderstanding. Incongruity Modeling by itself does not reliably improve performance. On the 7B backbone, adding IM alone actually lowers matching accuracy from 42.67% to 41.00% and ranking accuracy from 55.06% to 51.69%. It helps when paired with the later stages, especially Resolution Modeling. This is a useful warning: domain-flavored text is not magic seasoning. Without a structured task process, it may simply make the model sound more familiar with the domain while not judging better.
A depressing but common enterprise outcome, in other words.
Resolution Modeling is the main engine
The core of IRS is Resolution Modeling. Here the authors build “captionist reasoning traces”: structured explanations showing how to move from a cartoon image and candidate captions to the correct or preferred answer.
These traces are not generic chain-of-thought decorations. They follow a domain-specific pattern: reconstruct the scene, identify salient incongruities, infer speaker context, compare alternatives, and justify why one caption resolves the mismatch better. DeepSeek-R1 is used to generate traces from structured cartoon annotations, and GPT-4o is used to rephrase them into a more concise, image-grounded captionist style. The traces are then verified under human supervision.
This is where the paper becomes more useful than another “let’s prompt the model to think step by step” exercise. IRS does not merely ask the model to think more. It tells the model what kind of thinking the task requires.
That difference matters. Generic reasoning traces can become verbose rituals: restate the prompt, list options, make a plausible guess, wrap it in confidence. Captionist reasoning traces are narrower. They force the model to connect image evidence, incongruity, caption mechanics, and preference.
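To make the shape of such a trace concrete, here is a minimal sketch of a container for one captionist reasoning trace. The class name, field names, and the completeness check are assumptions made for illustration; the paper's actual trace format is not reproduced here. The example values paraphrase the amoeba cartoon discussed earlier, and caption "B" is invented as a foil.

```python
from dataclasses import dataclass


@dataclass
class CaptionistTrace:
    """Hypothetical container for one reasoning trace (all field names are assumptions)."""
    scene: str                    # reconstruction of what the cartoon shows
    incongruities: list[str]      # salient mismatches in the setup
    speaker_context: str          # who is speaking, and to whom
    comparisons: dict[str, str]   # candidate label -> why it does or does not resolve the mismatch
    chosen: str                   # label of the selected caption
    justification: str            # why the chosen caption resolves the mismatch best

    def missing_steps(self) -> list[str]:
        """Return the names of reasoning steps that were left empty."""
        steps = {
            "scene": self.scene,
            "incongruities": self.incongruities,
            "speaker_context": self.speaker_context,
            "comparisons": self.comparisons,
            "justification": self.justification,
        }
        missing = [name for name, value in steps.items() if not value]
        # A trace that picks a caption it never compared has skipped a step.
        if self.chosen not in self.comparisons:
            missing.append("chosen_not_compared")
        return missing


trace = CaptionistTrace(
    scene="A giant amoeba occupies a seat on an airplane.",
    incongruities=["A single-celled organism is treated as an ordinary passenger."],
    speaker_context="A passenger or crew member commenting on the seating.",
    comparisons={
        "A": "Plays on 'single-celled' to reframe the seating problem.",
        "B": "Describes the scene without resolving it.",  # invented foil
    },
    chosen="A",
    justification="Only A turns the visual oddity into a coherent joke.",
)
print(trace.missing_steps())
```

A structure like this makes the difference from generic chain-of-thought checkable: a trace that restates the prompt and guesses will leave the incongruity or comparison fields empty.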
The ablation table supports this interpretation. On the 7B backbone, Resolution Modeling alone improves matching from 42.67% to 47.00%, ranking from 55.06% to 56.88%, 30-vs-300 from 47.99% to 50.86%, and keeps the model close on 10-vs-1000. It is not the largest single jump in every column, but it is the stage that makes the rest of the framework coherent. When IM and RM are combined, performance rises further on matching and 10-vs-1000. The paper’s own interpretation is clear: RM is the main source of improvement, while IM is most useful when it supports a structured reasoning process.
The broader lesson is almost too practical to be fashionable: if experts solve a task through a repeatable mental process, teach the process. Do not only collect labels and hope the model reverse-engineers expertise from the final answer. Hope is not a training architecture.
Preference Alignment makes “correct” less shallow
The third stage, Preference Alignment, uses reinforcement learning with GRPO and a composite reward. The ordinary components are accuracy and format. Accuracy checks whether the model selected the right caption. Format keeps the output parseable.
Those are necessary. They are also insufficient.
The authors add two humor-aware reward components. A visual perception reward checks whether the reasoning is grounded in salient visual details and incongruities, using curated reference descriptions and a Qwen2.5-7B-Instruct judge. A style reward evaluates whether the caption and reasoning follow captionist-relevant qualities such as natural phrasing, punctuation, wordplay, metaphor, and punchline placement.
This is where IRS becomes an alignment framework rather than only a supervised fine-tuning method. The reward is not just “did you pick B?” It is closer to “did your reasoning attend to the right evidence, and did your explanation resemble the standards of this creative practice?”
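A rough sketch of how a four-part reward of this kind could feed a GRPO-style update follows. Everything specific in it is an assumption: the weights, the `<answer>` tag format, and the idea that the perception and style judges return scores in [0, 1] are illustrative, not taken from the paper. The group-relative step standardizes rewards across several sampled responses to the same prompt, which is the core idea of GRPO; the population standard deviation used here is one common choice.

```python
import re
from statistics import mean, pstdev

# Illustrative weights only; the paper's actual weighting is not stated in this article.
WEIGHTS = {"accuracy": 1.0, "format": 0.5, "perception": 0.5, "style": 0.5}

# Hypothetical output format: free-form reasoning followed by a single-letter answer tag.
ANSWER_PATTERN = re.compile(r"<answer>\s*([A-E])\s*</answer>")


def composite_reward(output: str, gold: str, perception: float, style: float) -> float:
    """Combine rule-based checks and judge scores into one scalar.

    `perception` and `style` are assumed to be judge scores already scaled to [0, 1].
    """
    match = ANSWER_PATTERN.search(output)
    format_score = 1.0 if match else 0.0
    accuracy = 1.0 if match and match.group(1) == gold else 0.0
    return (
        WEIGHTS["accuracy"] * accuracy
        + WEIGHTS["format"] * format_score
        + WEIGHTS["perception"] * perception
        + WEIGHTS["style"] * style
    )


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to one cartoon; judge scores are made up for the example.
samples = [
    ("The amoeba is the odd element. <answer>B</answer>", 0.9, 0.8),
    ("Caption A mentions the plane. <answer>A</answer>", 0.4, 0.6),
    ("I think it is B.", 0.7, 0.5),  # right idea, unparseable format
]
rewards = [composite_reward(text, "B", p, s) for text, p, s in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

The third sample shows why format is scored separately: a response that names the right caption but cannot be parsed earns only its judge scores, so the policy is pushed toward outputs that are both correct and usable.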
The paper reports that when either perception or style reward exceeds 80% of its maximum, more than 70% of cases are correct. That does not prove the judges are perfect. It does suggest the reward signals are not completely detached from task success. In alignment work, this is already more than one should casually assume.
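The check behind that kind of claim is simple to state. The sketch below computes accuracy among cases whose reward exceeds a fraction of its maximum; the function name, record layout, and toy records are assumptions for illustration and are not the paper's data.

```python
def accuracy_above_threshold(records, key, max_value=1.0, fraction=0.8):
    """Share of correct answers among records whose `key` reward exceeds fraction * max_value.

    Returns None when no record clears the threshold.
    """
    selected = [r for r in records if r[key] > fraction * max_value]
    if not selected:
        return None
    return sum(r["correct"] for r in selected) / len(selected)


# Toy records, invented purely to exercise the function.
records = [
    {"perception": 0.90, "style": 0.50, "correct": True},
    {"perception": 0.85, "style": 0.90, "correct": True},
    {"perception": 0.95, "style": 0.30, "correct": False},
    {"perception": 0.40, "style": 0.85, "correct": True},
    {"perception": 0.20, "style": 0.10, "correct": False},
]
for key in ("perception", "style"):
    print(key, accuracy_above_threshold(records, key))
```

Running the same check on held-out data is a cheap way to notice when a judge's high scores stop tracking correct answers.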
The reward ablation is useful because it clarifies what the extra judges are doing. Starting from Base + IM + RM, adding preference alignment with accuracy and format gives strong gains in matching. Adding perception and style rewards improves ranking-oriented behavior, especially where the task requires fine discrimination among plausible captions. That pattern fits the mechanism: caption ranking depends less on detecting a possible joke and more on deciding which resolution is stronger, cleaner, and more human-preferred.
The headline results: reasoning supervision beats raw Qwen scaling on this task
The main evidence is Table 1 of the paper, which compares IRS models against human baselines, text-only reasoning, closed multimodal models, open multimodal reasoning models, and the underlying Qwen2.5-VL backbones.
A compressed version of the relevant numbers tells the story:
| Model | Matching | Ranking | 10-vs-1000 | 30-vs-300 |
|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 42.67% | 55.06% | 50.57% | 47.99% |
| IRS-7B | 59.67% | 64.42% | 56.29% | 53.14% |
| Qwen2.5-VL-32B-Instruct | 46.67% | 49.87% | 52.00% | 44.57% |
| IRS-32B | 62.67% | 68.05% | 62.86% | 53.14% |
| Qwen2.5-VL-72B-Instruct | 56.00% | 55.58% | 53.71% | 50.29% |
| IRS-72B | 69.33% | 76.10% | 62.57% | 50.86% |
| o3 | 83.33% | 62.85% | 69.05% | 54.57% |
| Human non-expert | 53.03% | 65.61% | 54.70% | 52.27% |
| Human expert | 100.00% | 100.00% | 60.00% | 40.00% |
Two interpretations are worth separating.
First, IRS improves the Qwen2.5-VL backbone at every tested scale. This is the cleanest comparison because it isolates the contribution of the training framework from the base model family. IRS-7B beats Qwen2.5-VL-7B by 17.00 points on matching and 9.36 points on ranking. IRS-72B beats Qwen2.5-VL-72B by 13.33 points on matching and 20.52 points on ranking.
Second, scale alone is not reliable. The base Qwen2.5-VL-72B improves over the 7B model on matching, but not in a way that solves the task, and its ranking score (55.58%) remains close to the 7B base (55.06%). The 32B base actually ranks worse than the 7B, at 49.87%. IRS-32B and IRS-72B are more impressive because they combine capacity with task-specific reasoning structure.
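The two readings can be checked directly from the table. The short script below copies the Qwen2.5-VL and IRS rows and computes the per-task gain at each scale, plus the ranking change from scaling the base model alone. The variable and function names are just for this example.

```python
# Scores copied from the table above, in percent.
TASKS = ("matching", "ranking", "10-vs-1000", "30-vs-300")
BASE = {
    "7B": (42.67, 55.06, 50.57, 47.99),
    "32B": (46.67, 49.87, 52.00, 44.57),
    "72B": (56.00, 55.58, 53.71, 50.29),
}
IRS = {
    "7B": (59.67, 64.42, 56.29, 53.14),
    "32B": (62.67, 68.05, 62.86, 53.14),
    "72B": (69.33, 76.10, 62.57, 50.86),
}


def gains(scale):
    """Percentage-point improvement of IRS over its base model at one scale."""
    return {task: round(i - b, 2) for task, b, i in zip(TASKS, BASE[scale], IRS[scale])}


for scale in BASE:
    print(scale, gains(scale))

# Ranking change from scaling the base model alone, 7B to 72B.
scaling_only = round(BASE["72B"][1] - BASE["7B"][1], 2)
print("base 7B -> 72B ranking change:", scaling_only)
```

Run as written, every gain is positive, but the 72B gain on 30-vs-300 is only 0.57 points, and moving the base model from 7B to 72B shifts ranking by just 0.52 points.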
The best closed model, o3, still dominates matching at 83.33%. The paper is careful about why this comparison is not perfectly clean: closed models may benefit from data and pretraining advantages unavailable to open-weight models, including possible familiarity with New Yorker-style content through licensed corpora. That does not invalidate the benchmark. It just means “beat o3 everywhere” is not the right burden of proof.
The more interesting result is ranking. IRS-72B reaches 76.10% on the Hessel et al. ranking task, outperforming the listed model baselines including o3. For subjective preference selection, the trained reasoning path seems to matter more than generic multimodal power.
The 30-vs-300 setting is the stubborn one. IRS improves clearly over its Qwen base models at 7B and 32B, but IRS-72B reaches only 50.86%, barely above its 50.29% base, below o3, and close to non-expert human performance. This task asks models to distinguish between captions ranked 30–39 and 300–309, where both options may be stylistically competent. The paper treats this as a fine-grained preference problem, and that is the correct reading. It is not an embarrassment hidden in the table. It is the table reminding everyone that “human preference” is not a single clean variable politely waiting to be optimized.
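For readers who want to see what a rank-band comparison looks like mechanically, here is a hedged sketch. It assumes a best-first list of crowd-ranked captions and builds pairs from the two bands; the function names, the sampling scheme, and the shuffling of presentation order are assumptions, not the benchmark's actual construction.

```python
import random


def build_pairs(ranked_captions, high=(30, 40), low=(300, 310), n_pairs=10, seed=0):
    """Pair one caption from the higher rank band with one from the lower band.

    `ranked_captions` is ordered best-first, so index 0 holds rank 1. Bands are
    1-based and exclude the upper bound, so (30, 40) covers ranks 30 to 39.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        better = ranked_captions[rng.randrange(high[0], high[1]) - 1]
        worse = ranked_captions[rng.randrange(low[0], low[1]) - 1]
        # Shuffle presentation order so position carries no signal.
        if rng.random() < 0.5:
            pairs.append((better, worse, 0))
        else:
            pairs.append((worse, better, 1))
    return pairs


def pairwise_accuracy(pairs, choose):
    """`choose(a, b)` returns 0 or 1 for the caption it prefers."""
    hits = sum(choose(a, b) == label for a, b, label in pairs)
    return hits / len(pairs)


# Placeholder captions whose text encodes their rank, so an oracle can be written.
ranked = [f"caption-{rank}" for rank in range(1, 401)]
pairs = build_pairs(ranked)


def oracle(a, b):
    """Prefers the caption with the better (smaller) rank number."""
    return 0 if int(a.split("-")[1]) < int(b.split("-")[1]) else 1


print(pairwise_accuracy(pairs, oracle))
```

With a two-way choice, chance is 50%, which is why scores in the low fifties on this setting signal how little separates the two bands.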
The ablations say “teach the path,” not “add humor data”
The component ablation on the 7B model is the most useful section for builders. It tells us which part of the method deserves attention.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Full model comparison | Main evidence and comparison with prior work | IRS improves Qwen2.5-VL across scales and is competitive with strong baselines | That IRS universally beats closed models or humans |
| IRS component ablation | Ablation | RM is the central mechanism; IM and PA are complementary | That domain pretraining alone is sufficient |
| Reward-function ablation | Ablation inside PA | Perception/style rewards help especially in ranking and ambiguity | That LLM judges perfectly represent human humor preference |
| Cross-dataset tests | Robustness/generalization test | Learned reasoning transfers to YesBut and DeepEval tasks | That the model generalizes across all cultures or humor forms |
| Caption generation appendix | Exploratory extension | IRS may support plausible zero-shot caption generation | That the model is trained or evaluated as a production caption generator |
| Training and prompt appendices | Implementation detail and reproducibility support | The pipeline is concrete enough to inspect and partially reproduce | Full reproduction when copyrighted corpus materials are unavailable |
This table matters because it prevents the usual “new benchmark, new state of the art, everyone clap” reading. The paper is strongest as a mechanism study.
The component ablation shows that IM alone can hurt. RM alone helps. PA alone helps strongly on matching but is not enough to produce the best overall model. The full IRS combination gives the strongest 7B performance across the four reported tasks: 59.67% matching, 64.42% ranking, 56.29% on 10-vs-1000, and 53.14% on 30-vs-300.
That pattern is consistent with a layered process. Domain adaptation prepares the model. Reasoning traces teach the transformation from incongruity to resolution. Preference alignment sharpens the output against human-facing criteria. Remove the middle layer, and the system loses its spine.
For enterprise AI, this is the part to underline. Many companies already have domain corpora. Many also have labels: approved responses, successful campaigns, resolved tickets, analyst recommendations. What they often lack is the expert reasoning trace between evidence and decision.
That missing middle is expensive. It is also where much of the value lives.
The transfer tests are promising, but they are not a passport to every culture
The appendix evaluates zero-shot transfer to two external benchmarks: YesBut and DeepEval. These tests matter because a model could improve on NYCC by learning contest-specific quirks rather than a transferable reasoning pattern.
On YesBut, the IRS-7B model improves from 43.19% to 74.90% on the Philosophy task and from 29.11% to 63.32% on the Title task. On the humor subset of DeepEval, it improves from 10.34% to 54.43% on DeepSemantics, from 24.13% to 100.00% on Description, and from 34.48% to 63.18% on Title.
Those are large gains. They are also narrow evidence.
YesBut is contrastive visual humor. DeepEval includes only a small humor subset: 29 images, representing 2.9% of the benchmark. The tests support the claim that IRS learns something beyond NYCC finalist-caption matching. They do not prove that the model can navigate multilingual humor, regional sarcasm, political satire, workplace irony, or brand-safe comedy across cultures.
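One way to see how narrow a 29-image test is: compute how many points a single image is worth, and how wide a standard confidence interval is at that sample size. The sketch below uses the Wilson score interval. This is an illustration added here, not an analysis from the paper, and the count of 15 correct is made up for the example.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


n = 29  # size of the DeepEval humor subset cited above
print(f"one image = {100 / n:.2f} points")

low, high = wilson_interval(15, n)  # 15 of 29 is a hypothetical count
print(f"interval: {100 * low:.1f}% to {100 * high:.1f}%")
```

At this sample size each image moves the score by about 3.45 points, and the interval for the illustrative count is more than 30 points wide. Large jumps on such a subset are encouraging, but they are not precise.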
That distinction is not pedantry. It is deployment hygiene.
A business system trained on U.S.-centric creative judgment may improve at U.S.-centric creative judgment. It may also confidently mishandle local references, taboo boundaries, idioms, or social hierarchies elsewhere. Humor is one of the fastest ways to discover that “global model” sometimes means “monolingual American taste wearing a passport.”
The qualitative examples show process improvement, not just score movement
The appendix includes qualitative examples showing how model reasoning changes after training. These examples are not the main evidence; they are mechanism inspection. Used carefully, they help explain what the numbers mean.
In one example, an RM-only model produces fluent but shallow analysis. It notices surface themes but does not fully connect them to the central incongruity. After Preference Alignment, the model grounds its reasoning in more salient visual cues, filters distractors, and highlights humor mechanisms such as wordplay, speaker roles, and irony.
In another ranking example, the RM-only model fixates on literal associations, while the RM+PA model integrates the broader visual setup and chooses the caption that better fits the gag.
This is exactly what one would expect if PA is doing its job. It does not merely push the final answer toward the label. It changes the texture of the explanation: less generic, more grounded, more economical, and more aligned with how an expert might discuss caption quality.
There is also an example where the model and an expert choose different captions but follow similar reasoning steps. This is a subtle but important point. In subjective domains, disagreement does not always mean reasoning failure. Sometimes it means two competent judges weight originality, accessibility, and punchline differently. A good AI system in such domains should not only match outcomes; it should expose its judgment process well enough for humans to inspect and contest it.
That is a much better product goal than pretending subjective work can be reduced to a green checkmark.
The business lesson: subjective AI needs process supervision
The paper directly shows that IRS improves multimodal humor understanding on NYCC-style caption matching and ranking, with supporting evidence from ablations and transfer tests. Cognaptus’ business inference is broader but bounded: many high-value AI tasks resemble humor more than classification.
Consider these examples:
| Business task | Why labels are not enough | What IRS suggests |
|---|---|---|
| Brand voice review | “Approved” or “rejected” hides the tone judgment | Train on traces explaining audience, tone, risk, and intended effect |
| Customer support escalation | The right action depends on emotional and operational context | Teach agents how experts infer urgency and dissatisfaction |
| Sales-message selection | Several messages may be factually correct | Align outputs to buyer intent, timing, and persuasion quality |
| Analyst-report drafting | The conclusion matters less without the reasoning bridge | Supervise evidence-to-thesis reasoning, not only final recommendation |
| Creative screening | Preference is subjective and contextual | Use domain-specific judges for grounding, style, and audience fit |
The paper’s deeper operational recipe is this:
- Define the expert reasoning chain. Do not start with the model. Start with the human process.
- Collect or generate structured traces. Labels tell the model what won. Traces teach why it won.
- Separate perception from resolution. Seeing the evidence is not the same as interpreting it.
- Reward domain-specific quality. Generic correctness is too thin for tasks involving taste, tone, or judgment.
- Evaluate transfer carefully. A model that works in one subjective domain may still fail in another.
This is not cheap, but it may be cheaper than brute-force scale. More importantly, it is more inspectable. If a model fails, the trace reveals where: perception, incongruity detection, interpretation, preference weighting, or style. That diagnostic structure is valuable in production. “The model was wrong” is a complaint. “The model misread the scene, then optimized the wrong preference signal” is an engineering agenda.
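That diagnostic idea can be sketched as a small tally. The stage names follow the list in the paragraph above; the review format, in which a human marks each stage of a failed trace as acceptable or not, is an assumption made for illustration.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Stages in pipeline order; the order matters for `first_failure`."""
    PERCEPTION = "perception"
    INCONGRUITY = "incongruity detection"
    INTERPRETATION = "interpretation"
    PREFERENCE = "preference weighting"
    STYLE = "style"


def first_failure(review: dict) -> Optional[Stage]:
    """Return the earliest stage a reviewer marked as failed, or None if all passed.

    `review` maps Stage -> bool; stages missing from the dict count as passed.
    """
    for stage in Stage:
        if not review.get(stage, True):
            return stage
    return None


def failure_profile(reviews) -> Counter:
    """Count failed traces by the earliest stage at which they went wrong."""
    return Counter(s for s in map(first_failure, reviews) if s is not None)


# Hypothetical reviews of three traces.
reviews = [
    {Stage.PERCEPTION: False, Stage.STYLE: False},  # misread scene; later faults follow from it
    {Stage.PREFERENCE: False},
    {},  # nothing flagged
]
print(failure_profile(reviews))
```

Attributing each failure to its earliest stage is a design choice: a misread scene usually contaminates everything downstream, so counting the later symptoms separately would overstate how many things are broken.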
What this paper does not prove
The boundaries are clear enough to state in one place rather than sprinkle through the text like compliance parsley.
First, visual perception remains a bottleneck. The paper includes examples where the model misidentifies visual entities, such as confusing a Viking warrior with the Grim Reaper because of superficial visual similarity. If the model sees the wrong scene, better reasoning can elegantly march in the wrong direction.
Second, cultural grounding remains shallow. NYCC humor is Anglophone, U.S.-centric, and shaped by a particular editorial tradition. The authors explicitly note that humor conventions, references, and linguistic structures may not transfer uniformly across cultures. For business deployment, this means a framework trained on one market’s taste should not be casually exported into another market’s customer interaction layer.
Third, automated preference rewards approximate human judgment. The perception and style judges are useful training signals, but they are still proxies. They can reward visible grounding and stylistic features; they cannot fully capture the diversity of human response. In subjective work, the danger is not only being wrong. It is being consistently wrong in a way the metric congratulates.
Fourth, reproducibility has a material boundary. The paper promises train/test splits, preprocessing code, and prompt templates, but parts of the Incongruity Modeling corpus are copyrighted and cannot be directly released. That is understandable. It also means replication will require substitute corpora or licensing work.
Finally, the paper does not show a production-ready humor generator. The caption-generation appendix is exploratory. It shows that IRS-trained models can produce plausible captions without task-specific generation training, but the central evidence is selection and ranking, not open-ended creative deployment.
The real punchline: reasoning can be smaller than scale but larger than prompting
For the last few years, the lazy strategic answer to model weakness has been scale. Bigger model. More data. Longer context. More expensive leaderboard screenshot. The usual spiritual journey of AI procurement.
This paper does not say scale is irrelevant. IRS benefits from larger backbones, and the 72B model is the strongest IRS variant on matching and ranking. But the paper does show that scale without task-specific reasoning structure leaves performance on the table. The base Qwen2.5-VL-72B model does not simply grow into expert caption judgment. IRS teaches it a process.
That is the more mature thesis: subjective intelligence is not just latent capacity. It is capacity organized by the right intermediate representation and aligned to the right evaluative criteria.
For businesses, this points away from two weak strategies. The first is buying a bigger generic model and hoping it inherits your best employee’s judgment. The second is prompt-engineering a shallow checklist and calling it “domain expertise.” IRS suggests a third path: model the expert process, supervise the reasoning trace, and align outputs with domain-specific reward signals.
The paper happens to teach AI how to get the joke.
The useful part is that the same recipe may help AI understand why a message feels off, why a customer sounds ready to churn, why a policy memo reads as evasive, or why one analyst narrative is more persuasive than another.
Apparently, cartoon captions were not a toy problem after all.
They were a warning label.
Cognaptus: Automate the Present, Incubate the Future.
[1] Hatice Merve Vural, Doga Kukul, Ege Erdem Ozlu, Demir Ekin Arikan, Bob Mankoff, Erkut Erdem, and Aykut Erdem, “Learning to Think Like a Cartoon Captionist: Incongruity–Resolution Supervision for Multimodal Humor Understanding,” arXiv:2604.15210, 2026, https://arxiv.org/abs/2604.15210.