A document lands in an intake queue. It might be an invoice, a memo, a form, a résumé, or one of those corporate artifacts whose layout says more than the words do. Someone wants the system to classify it instantly, because every downstream workflow—routing, extraction, compliance, archiving—depends on that first label.
The fashionable answer is: send it to a large language model. Extract the text, paste it into a prompt, ask for one label, and let the machine be clever. This is attractive because it feels general. It is also how many automation projects quietly turn a visual problem into a text problem, then act surprised when the system starts calling file folders “proposals” because the word proposal appeared somewhere on the page.
The paper “Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis” is useful because it does not treat document classification as a beauty contest among model names.[1] It builds a cleaner comparison around two questions that matter operationally:
- Is the model a specialized Transformer or a general-purpose LLM?
- Does it process the document image directly, or does it depend on OCR-extracted text?
That gives the paper a simple 2×2 frame:
| Architecture family | OCR-free | OCR-dependent |
|---|---|---|
| Specialized Transformer | Donut | LayoutLMv3 |
| General-purpose LLM | Qwen3-VL-32B-Instruct | Qwen3-32B |
This is the right comparison. Not because these four models exhaust the field—they do not—but because they force the discussion away from “LLMs versus old AI” and toward the real engineering question: what information reaches the classifier?
For visually rich documents, the answer is not just text. It is layout, typography, spatial grouping, handwriting, tables, blocks, margins, forms, headings, and the awkward truth that business documents are often designed to be understood by looking, not by reading a flattened transcript.
The study compares four routes into the same document queue
The paper evaluates the four representative systems on RVL-CDIP, a widely used document type classification benchmark containing 400,000 grayscale scanned document images across 16 categories, with a predefined 320k / 40k / 40k train-validation-test split. The authors use the full test set except for one corrupted image that could not be processed.
The four routes are not just different models. They represent different assumptions about what “document understanding” means.
| Model | Family | Input strategy | Adaptation style | What the model is really being asked to trust |
|---|---|---|---|---|
| Donut | Transformer | OCR-free image input | Fine-tuned on RVL-CDIP | The document image itself |
| LayoutLMv3 | Transformer | OCR text + layout boxes + image patches | Fine-tuned checkpoint | Text, position, and visual patches together |
| Qwen3-VL-32B-Instruct | LLM | OCR-free image + prompt | Prompting, no task fine-tuning | Vision-language instruction following |
| Qwen3-32B | LLM | OCR text inserted into prompt | Prompting, no task fine-tuning | OCR text as a linear transcript |
The headline experiment is the paper's main evidence, and it is a comparison, not an ablation study. It does not remove one component from the same model to isolate causal effects; it compares representative architectures under a harmonized pipeline. The class-wise metrics and confusion matrices are diagnostic evidence: they explain where each architecture fails. The preliminary LayoutLMv3 checkpoint screening is an implementation detail with practical consequences, because it shows that “fine-tuned Transformer” is not a magic phrase; the checkpoint quality matters. The dataset discussion near the end is a boundary condition, not a second thesis.
This matters because the numbers are easy to misread. The paper is not saying “Donut is always best” or “LLMs are bad.” It is saying that document type classification has information bottlenecks, and different architectures lose different parts of the document.
The scoreboard is clear, but the ranking is not the whole lesson
The headline result is straightforward:
| Model | Accuracy | Mean end-to-end inference time per image |
|---|---|---|
| Donut | 0.95 | 307 ms |
| LayoutLMv3 | 0.90 | 554 ms |
| Qwen3-VL-32B-Instruct | 0.75 | 239 ms |
| Qwen3-32B | 0.55 | 600 ms |
At first glance, the story is simple: Donut wins on accuracy, Qwen3-VL is fastest, OCR-dependent systems are slower, and OCR-only text classification performs poorly.
That summary is true. It is also too shallow to be useful.
The more important pattern is that image access beats text-only access when the document type is visually encoded. Donut and Qwen3-VL both process document images directly. LayoutLMv3 also uses visual and positional information, although through an OCR-dependent pipeline. Qwen3-32B receives OCR text only. Unsurprisingly, the text-only LLM struggles on classes where the “type” is not fully expressed as continuous prose.
This is the part many enterprise AI discussions still skip. OCR is not the document. OCR is a lossy textual extraction from the document. It preserves words, sometimes imperfectly. It does not preserve the full visual grammar of the page. A table becomes a sequence. A form becomes fragments. A handwritten note becomes noise. A folder cover with sparse text becomes almost nothing. The system has not “read the document”; it has read a transcript produced by another machine, and then pretended the missing visual structure was not needed. Elegant, in the same way a map without roads is elegant.
The LLM misconception: general reasoning does not replace the missing modality
The most relevant misconception is not that LLMs are useless. The paper does not support that claim. Qwen3-VL reaches 0.75 accuracy without task-specific fine-tuning, and it performs strongly on some classes, especially résumé, email, advertisement, and scientific publication. That is not trivial.
The misconception is more specific: a strong general-purpose LLM can classify enterprise documents well enough from OCR text or a simple prompt, making specialized document AI unnecessary.
The Qwen3-32B result is the correction. As an OCR-dependent text-only LLM, it reaches 0.55 accuracy. Its better classes tend to be those with distinctive lexical patterns or continuous text. It performs relatively well on résumé and scientific publication, but struggles on file folder, form, handwritten, and presentation. The pattern is not random. It reflects what the model can see.
Qwen3-VL improves the picture by using the image directly. Its accuracy rises to 0.75. That comparison is the cleanest practical signal in the paper: among the two LLM-based systems, direct visual input is markedly better than relying only on OCR text. The model family is similar; the available evidence is not.
So the replacement rule should be revised:
| Common belief | Better replacement |
|---|---|
| “Use an LLM; it can reason over the document.” | “Use an LLM only after checking whether the evidence it receives contains the document features needed for the label.” |
| “OCR converts the document into machine-readable form.” | “OCR converts part of the document into text and may discard the very structure that defines the document type.” |
| “Prompting is flexible, so it solves closed-set classification cheaply.” | “Prompting is flexible, but closed-set classification also needs output control, label discipline, and stable evidence.” |
That last point appears sharply in the invalid-label behavior. LayoutLMv3 uses a classification head, so its outputs are restricted to the predefined label set. Donut uses an autoregressive decoder, but fine-tuning keeps its outputs aligned with the target labels. The LLMs, by contrast, rely on prompting alone. They generate labels outside the allowed set: 37 invalid labels for Qwen3-VL and 972 for Qwen3-32B.
For business workflows, this is not a minor formatting issue. A closed-set classifier is supposed to choose from the operational taxonomy. If it invents a new category because some phrase looks salient, the downstream pipeline now has an exception-handling problem. The model may sound helpful, but the workflow wanted obedience. Very unfashionable, obedience. Also very useful.
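One way to make the closed-set requirement concrete is a small guard between the LLM and the router. The sketch below is illustrative, not the paper's pipeline: the function names and the normalization rules are assumptions, and the label tuple is the standard RVL-CDIP category list. The rate helper simply divides the reported invalid-label counts by the 39,999 test images actually processed (40,000 minus the one corrupted file).

```python
import unicodedata
from typing import Optional, Sequence

# The 16 standard RVL-CDIP categories.
RVL_CDIP_LABELS = (
    "letter", "form", "email", "handwritten", "advertisement",
    "scientific report", "scientific publication", "specification",
    "file folder", "news article", "budget", "invoice",
    "presentation", "questionnaire", "resume", "memo",
)


def normalize(raw: str) -> str:
    """Fold accents, case, quotes, and separators so trivial variants match.

    These rules are an assumption for illustration; tune them to your outputs.
    """
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.strip().strip("\"'`.").lower()
    text = text.replace("_", " ").replace("-", " ")
    return " ".join(text.split())


def to_closed_set(raw: str, labels: Sequence[str] = RVL_CDIP_LABELS) -> Optional[str]:
    """Return the allowed label the output maps to, or None if it is invalid."""
    candidate = normalize(raw)
    return candidate if candidate in labels else None


def invalid_rate(invalid: int, total: int) -> float:
    """Share of model outputs that fall outside the allowed label set."""
    return invalid / total
```

With these rules, `to_closed_set("Résumé")` maps to `"resume"`, while `to_closed_set("proposal")` returns `None` and can be sent to an exception queue instead of a routing rule. Applied to the reported counts, `invalid_rate(37, 39_999)` is roughly 0.09% and `invalid_rate(972, 39_999)` roughly 2.4%: small as percentages, but at intake volume each one is a document with nowhere to go.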
The class-level failures explain the mechanism
The best part of the paper is not the accuracy table. It is the class-level analysis, because it shows why the models fail.
Some categories are easy because they are visually or lexically distinctive. Donut reaches F1 scores between 0.97 and 0.99 for email, résumé, specification, handwritten, and file folder. LayoutLMv3 reaches 0.99 F1 for email and résumé, and around 0.93 for several other categories. Qwen3-VL also performs strongly on résumé, email, advertisement, and scientific publication.
Other categories are hard because the label boundaries are weak. The form category causes trouble across all models. Its F1 scores are 0.90 for Donut, 0.80 for LayoutLMv3, 0.51 for Qwen3-VL, and 0.27 for Qwen3-32B. This is not just a model weakness. It is also a taxonomy problem. “Form” is a broad, visually diverse category that overlaps with invoices, questionnaires, letters, and other semi-structured documents.
The business lesson is slightly uncomfortable: better models cannot fully repair vague categories. If a company’s document taxonomy was created by committee, inherited from a legacy filing system, or designed around human convenience rather than machine separability, model selection alone will not fix the intake pipeline. It may only make the confusion faster.
Presentation documents show a different failure mode. Donut and LayoutLMv3 handle them moderately well, but Qwen3-VL has high precision and very low recall: it predicts presentation only in high-confidence cases and misses most true presentation samples. Qwen3-32B performs even worse, with presentation F1 at 0.16. The authors suggest that presentation-style layouts may be underrepresented in the models’ training data, and that textual content alone does not provide a strong enough signal.
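The “high precision, very low recall” pattern is easier to see with the arithmetic spelled out. The following is a minimal sketch with toy data: the pairs are invented for illustration and are not taken from the paper, and `per_class_metrics` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Tuple


def per_class_metrics(pairs: Iterable[Tuple[str, str]], label: str) -> Dict[str, float]:
    """Precision, recall, and F1 for one class from (true, predicted) pairs."""
    tp = fp = fn = 0
    for true, pred in pairs:
        if pred == label and true == label:
            tp += 1  # correctly predicted as this class
        elif pred == label:
            fp += 1  # predicted this class, but it was something else
        elif true == label:
            fn += 1  # was this class, but the model said something else
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data (NOT from the paper): the model labels only 2 of 10 true
# presentations as "presentation", and never uses the label elsewhere.
toy_pairs = (
    [("presentation", "presentation")] * 2
    + [("presentation", "scientific report")] * 8
    + [("memo", "memo")] * 5
)
toy_scores = per_class_metrics(toy_pairs, "presentation")
```

In this toy case precision is 1.0 and recall is 0.2, so F1 lands near 0.33. A model that is never wrong when it says “presentation” can still miss most presentations, which is why per-class recall on local samples deserves a look before any routing rule depends on that label.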
This is important for companies using foundation models in internal workflows. If a document type is common inside your organization but underrepresented in general pretraining data, prompting may not be enough. The model may know what a presentation is in the everyday sense and still fail to classify scanned slide-like documents reliably. Apparently “general intelligence” still benefits from seeing the thing before being tested on it. Shocking development.
OCR is useful, but OCR-only is fragile
The paper’s comparison does not support a simple anti-OCR conclusion. LayoutLMv3 is OCR-dependent and still performs strongly. It combines OCR-derived word tokens, bounding boxes, and image patches inside a multimodal Transformer encoder. In other words, OCR helps when it is one modality among several.
The weak case is OCR-only classification with Qwen3-32B. That pipeline extracts text using Tesseract, inserts the text into a prompt, and asks the LLM to classify the document. This removes layout, typography, handwriting cues, and visual structure before the model even begins.
The distinction is critical:
| OCR role | Operational meaning | Risk |
|---|---|---|
| OCR as one signal among image and layout features | Useful for text-rich and semi-structured documents | More pipeline complexity and OCR error propagation |
| OCR as the only signal | Cheap conceptual pipeline: extract text, prompt LLM | Loses layout-driven evidence and performs poorly on visually encoded classes |
| OCR-free image processing | Preserves visual structure and avoids external OCR overhead | Requires models capable of extracting enough visual and textual cues from the image |
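The OCR-only route can be sketched in a few lines, which is part of its appeal. The snippet below is an assumption-heavy illustration: the prompt wording, the `max_chars` truncation, and the whitespace collapsing are hypothetical choices, not the prompt used in the paper, and `ocr_text` stands in for whatever an OCR engine such as Tesseract returns.

```python
from typing import Sequence


def build_ocr_only_prompt(
    ocr_text: str,
    labels: Sequence[str],
    max_chars: int = 4000,  # hypothetical cap to keep the prompt bounded
) -> str:
    """Build a text-only classification prompt from an OCR transcript.

    Whitespace is collapsed here (an assumption) to make the loss explicit:
    the model receives a word sequence, with no coordinates, fonts, or boxes.
    """
    snippet = " ".join(ocr_text.split())[:max_chars]
    allowed = ", ".join(labels)
    return (
        f"Classify the document into exactly one of these labels: {allowed}.\n"
        "Answer with the label only.\n\n"
        f"OCR text:\n{snippet}"
    )


labels = ("invoice", "form", "letter")

# Two pages that look nothing alike can yield the same transcript.
table_like = "Item      Qty    Price\nWidget    2      9.00"
prose_like = "Item Qty Price Widget 2 9.00"

prompt_a = build_ocr_only_prompt(table_like, labels)
prompt_b = build_ocr_only_prompt(prose_like, labels)
```

Here `prompt_a == prompt_b`: a grid and a run-on line arrive at the model as the same string. A real pipeline may preserve line breaks and recover some of that signal, but the spatial grouping that a layout-aware model receives as bounding boxes or pixels is still absent.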
The paper’s runtime measurements also matter here. The OCR-dependent models are slower end-to-end: LayoutLMv3 averages 554 ms per image and Qwen3-32B averages 600 ms. Donut averages 307 ms, and Qwen3-VL averages 239 ms. Since the reported times include the OCR stage for OCR-dependent models, the comparison is not merely about model forward-pass speed. It is about the full operational pipeline.
For large-scale intake systems, OCR is not free. It introduces compute time, dependencies, maintenance, quality monitoring, and error propagation. This does not mean companies should remove OCR. It means OCR should be justified by the document class and downstream task, not included reflexively because “documents contain text.” Some documents contain layout first and text second.
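To see how the per-image latencies compound at volume, a rough back-of-the-envelope conversion helps. This is a naive extrapolation under stated assumptions: one worker processing images sequentially with no batching, hardware comparable to the paper's setup, and a hypothetical backlog size. Only the four mean latencies come from the paper.

```python
# Mean end-to-end inference time per image, as reported in the paper (ms).
MEAN_LATENCY_MS = {
    "Donut": 307,
    "LayoutLMv3": 554,
    "Qwen3-VL-32B-Instruct": 239,
    "Qwen3-32B": 600,
}

MS_PER_HOUR = 3_600_000


def docs_per_hour(latency_ms: float, workers: int = 1) -> float:
    """Sequential throughput, assuming latency does not change under load."""
    return workers * MS_PER_HOUR / latency_ms


def hours_for_backlog(n_docs: int, latency_ms: float, workers: int = 1) -> float:
    """Wall-clock hours to clear a backlog under the same naive assumption."""
    return n_docs * latency_ms / MS_PER_HOUR / workers


# Hypothetical backlog of one million pages on a single worker.
backlog_hours = {
    name: round(hours_for_backlog(1_000_000, ms), 2)
    for name, ms in MEAN_LATENCY_MS.items()
}
```

Under these assumptions, a million-page backlog takes roughly 85 hours with Donut and about 167 hours with the Qwen3-32B OCR pipeline on one sequential worker. Real deployments batch, parallelize, and run on different hardware, so the absolute numbers will move; the ratio between pipelines is the part worth measuring on your own documents.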
Fine-tuned Transformers win, but fine-tuning is not a button
The strongest accuracy comes from specialized Transformer models, especially Donut. But the paper is careful about a practical caveat: fine-tuning quality matters.
The official LayoutLMv3 paper reports 0.9544 accuracy on RVL-CDIP, but the official fine-tuned checkpoint was not available for this study. The authors therefore tested public LayoutLMv3 checkpoints and selected the strongest one, which reached 0.90 accuracy. Other public variants performed worse.
This is not a footnote for engineers to ignore. It changes procurement and implementation planning.
A company cannot simply say, “We will use LayoutLMv3,” and assume benchmark-level performance appears. It needs a labeled dataset, training expertise, resource budget, preprocessing alignment, validation discipline, and monitoring under domain shift. Fine-tuning is the price paid for control and performance. Sometimes that price is worth paying. Sometimes it is not.
Prompt-based LLMs offer the opposite trade-off. They are easier to adapt across domains because the task can be specified in language. But the paper reports that their output quality is highly sensitive to prompt formulation, especially when OCR text is included. They also require more memory and compute capacity than the specialized Transformer models, which limits practical experimentation on standard hardware and complicates on-premise deployment.
The decision is therefore not “old specialized model versus modern LLM.” It is:
| Decision axis | Specialized Transformer | Prompt-based LLM |
|---|---|---|
| Accuracy on layout-heavy classification | Stronger when well tuned | Weaker in this study |
| Output control | Stronger for closed label sets | Weaker without extra constraints |
| Adaptation | Requires fine-tuning | Prompt-based, faster to try |
| Resource profile | Smaller than 32B LLMs in this setup | Larger memory and compute demands |
| Pipeline risk | Fine-tuning and preprocessing quality | Prompt sensitivity and invalid outputs |
| Best fit | High-volume, stable taxonomy, strict routing | Exploration, low-label settings, flexible document categories |
The useful conclusion is not that every company should deploy Donut tomorrow. It is that the architecture should match the stability of the workflow. Stable, high-volume intake with strict categories rewards specialized models. Exploratory or low-volume classification may tolerate prompt-based systems—especially if a human review layer remains in the loop.
A practical decision matrix for document intake automation
The paper can be translated into a simple model-selection framework.
| Document situation | Better starting point | Reason |
|---|---|---|
| Scanned forms, invoices, folders, handwritten pages, tables, layout-heavy documents | OCR-free or layout-aware multimodal model | The label depends on spatial and visual evidence, not only text |
| Linear, text-dominant documents such as emails or prose-heavy reports | OCR text plus text model may be acceptable | Continuous text carries much of the class signal |
| Strict closed-set routing with low tolerance for invented labels | Fine-tuned classifier or constrained decoding layer | Prompt-only LLMs may produce invalid labels |
| Rapid prototype with uncertain taxonomy | Vision-language LLM | Fast to test, useful for discovering taxonomy problems |
| High-volume production intake | Specialized model with measured end-to-end latency | OCR and large-model serving costs compound quickly |
| Domain-specific or language-specific document sets | Fine-tune or evaluate on local samples before deployment | Public benchmark performance may not transfer |
This is Cognaptus’s inference from the paper, not something the authors directly test as a business deployment framework. The paper directly shows model performance on RVL-CDIP under a controlled experimental setup. The business inference is that document AI procurement should begin with evidence flow: what parts of the document must the model see to make the correct label?
That question is more useful than asking whether the model is “AI-powered,” which, in 2026, is a little like asking whether a bank uses spreadsheets.
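For teams that prefer a checklist in code, the decision matrix can be written down as a small rule function. This is only a restatement of the table above as a sketch: the `IntakeProfile` fields and the function name are hypothetical, the rules are independent rather than prioritized, and none of it is something the paper evaluates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IntakeProfile:
    """Hypothetical description of a document intake workload."""
    layout_heavy: bool        # forms, tables, folders, handwriting
    strict_closed_set: bool   # invented labels are not tolerated
    taxonomy_uncertain: bool  # categories are still being worked out
    high_volume: bool         # production-scale intake
    domain_specific: bool     # local formats or languages


def starting_points(profile: IntakeProfile) -> List[str]:
    """Map a workload profile to the starting points from the matrix."""
    suggestions = []
    if profile.layout_heavy:
        suggestions.append("OCR-free or layout-aware multimodal model")
    else:
        suggestions.append("OCR text plus text model may be acceptable")
    if profile.strict_closed_set:
        suggestions.append("fine-tuned classifier or constrained decoding layer")
    if profile.taxonomy_uncertain:
        suggestions.append("vision-language LLM for a rapid prototype")
    if profile.high_volume:
        suggestions.append("specialized model with measured end-to-end latency")
    if profile.domain_specific:
        suggestions.append("fine-tune or evaluate on local samples before deployment")
    return suggestions


invoice_intake = IntakeProfile(
    layout_heavy=True,
    strict_closed_set=True,
    taxonomy_uncertain=False,
    high_volume=True,
    domain_specific=False,
)
plan = starting_points(invoice_intake)
```

The value of writing it this way is less the output than the inputs: filling in the profile forces a team to state what the classifier must see and how strictly it must answer before anyone picks a model.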
What the paper directly shows, and what it does not
The paper directly shows four things.
First, under its experimental setup, specialized multimodal Transformers outperform the evaluated prompt-based LLM approaches on RVL-CDIP document type classification. Donut reaches 0.95 accuracy, LayoutLMv3 reaches 0.90, Qwen3-VL reaches 0.75, and Qwen3-32B reaches 0.55.
Second, direct image processing matters. The OCR-free vision-language LLM outperforms the OCR-text-only LLM by a large margin. The class-level failures suggest that layout, handwriting, sparse text, and visual structure are not optional extras.
Third, OCR-dependent pipelines are slower end-to-end in this setup, because OCR adds processing overhead. OCR can still help when combined with layout and image features, as in LayoutLMv3, but OCR-only text is fragile for visually rich classification.
Fourth, label design matters. Broad or visually overlapping categories such as form, presentation, invoice, and budget create confusion across architectures. A classifier cannot cleanly learn categories that the dataset itself makes ambiguous.
What the paper does not show is equally important.
It does not prove that Donut will beat every modern multimodal LLM in every enterprise setting. It tests specific representative models. It does not fine-tune the LLMs. It does not evaluate multi-page enterprise workflows. It does not test private company taxonomies, multilingual document streams, recent born-digital PDFs, or industry-specific forms. It also uses RVL-CDIP, a benchmark with known issues, including label noise, ambiguous documents, duplicate or template-matched samples, older document sources, scanning artifacts, and low resolution.
Those limits do not weaken the practical message. They make it sharper. If a model struggles under a fixed, well-studied benchmark taxonomy, it may not magically behave better when confronted with messy operational archives, missing pages, rotated scans, stamped copies, and labels invented by an operations department in 2014.
The real business value is better routing discipline
The paper is about document type classification, but its broader value is diagnostic. It gives automation teams a way to ask better questions before building a pipeline.
Do not start with: “Which model is best?”
Start with:
- Which document classes are visually separable?
- Which classes depend on text, layout, handwriting, tables, or sparse visual cues?
- Is OCR a helper signal or the only evidence source?
- Is the taxonomy clean enough for a machine to learn?
- Does the workflow require closed-set obedience or flexible reasoning?
- What is the acceptable latency per document at production volume?
- Can the organization fine-tune and maintain a specialized model, or does it need prompt-based adaptability?
This is less glamorous than saying “we use a 32B model.” It is also more likely to work.
The main operational lesson is that document intake automation is not a language task with screenshots attached. It is a multimodal routing problem. Sometimes the text is enough. Sometimes the page layout is the label. Sometimes the model’s failure is not caused by model weakness but by a document taxonomy that humans understand socially and machines experience as overlapping patterns.
A strong LLM can help, especially when used with images and guardrails. But for high-precision, closed-set document classification, the paper’s evidence favors specialized multimodal systems that preserve visual structure and constrain outputs. The future of document AI may still include LLMs. It just should not require pretending that OCR text is the whole document.
The city was always there. The OCR transcript only gave us the street names.
Cognaptus: Automate the Present, Incubate the Future.
[1] Catyana Heyne, Jürgen Frikel, and Filippo Riccio, “Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis,” arXiv:2606.02162, 2026. The HTML version was unavailable at drafting time, so this article uses the PDF version: https://arxiv.org/pdf/2606.02162