The Right Tool for the Thought: How LLMs Solve Research Problems in Three Acts

TL;DR for operators

Generative AI is useful for data processing when the work is painfully simple for a human and painfully awkward for software. That sounds like a joke until you meet the actual enterprise data stack: PDFs with shifting layouts, scanned documents with OCR scars, multilingual reports, product descriptions pretending to be industry classifications, and a graveyard of “temporary” spreadsheets that somehow became critical infrastructure.

The paper by Mitra et al. is valuable because it does not ask whether LLMs are magical research assistants. Mercifully. It asks a narrower and much more useful question: when can generative AI process large volumes of research text without a human checking every single item?¹

The answer is comparative:

Use case	What the paper directly shows	Operational meaning	Boundary
Historical seedlists	Claude 3 Opus extracted all species names correctly from four evaluated seedlist pages, repeated consistently across three runs.	LLMs can work well for objective extraction from heterogeneous formats when the answer is present in the text.	The sample is very small; the result is promising, not a population-level guarantee.
HTA reimbursement documents	Across three documents and three runs, 11 of 14 target data points were extracted accurately and consistently; some variation was semantic rather than substantive, and one error came from prompt ambiguity.	LLMs can extract structured facts from complex documents, but “understanding” introduces ambiguity that prompt wording must control.	The evidence covers one drug-indication combination across three HTA bodies.
Kickstarter NAICS classification	The best AI-human match was 53% (across 145 projects), while the best human-human match was 60% (across 63 projects), each measured on the projects both raters had coded.	For subjective classification, the standard is not perfect accuracy but comparability to human judgement.	There is no objective ground truth; plausibility replaces correctness, which is a dangerous trade if governance is weak.

The operator’s lesson is simple: do not begin with “Can we use AI here?” Begin with “What kind of uncertainty does this task contain?” If the task is objective extraction, evaluate against ground truth. If it is document interpretation, test prompt ambiguity and semantic stability. If it is classification without ground truth, measure agreement with human raters and decide whether that level of disagreement is acceptable. Very glamorous. Also the difference between automation and expensive nonsense.

Messy documents are where automation goes to die

Every organisation has a version of this problem. The files are digital, technically. The content is structured, spiritually. A person can read them and know what matters. A script cannot, because the table moved, the label changed, the language switched, the scan is imperfect, or the category system was designed by someone who enjoyed pain.

This is the gap the paper targets. It is not about using ChatGPT to polish a paragraph or brainstorm a literature review. The authors are interested in “unsupervised mode”: using generative AI to process large quantities of textual data without a human reviewing every sample. That distinction matters. A tool can be impressive when watched closely and still be unsuitable when scaled across 300,000 rows.

The paper studies three research engineering tasks:

extracting plant species names from historical seedlists;
extracting policy-relevant data points from Health Technology Assessment documents;
assigning NAICS industry codes to Kickstarter projects.

The shared feature is not that all three tasks are “AI tasks”. The shared feature is that rule-based approaches are awkward. The formats vary. The language is unstructured. The classification space is large. A human can usually make sense of an individual sample, but writing a clean deterministic rule for every future case is another matter entirely.

That is the paper’s most useful contribution: it converts enthusiasm for generative AI into a suitability test. The question becomes: is the task large, textual, heterogeneous, hard to rule-code, and still evaluable enough that errors can be detected or bounded?

If yes, generative AI is worth testing. If no, the model is probably just a very articulate way to create audit problems.

The three acts are not three examples; they are three kinds of uncertainty

The paper is best read as a comparison, not as a case-study tour. The three use cases differ in the kind of uncertainty they expose.

Act	Task type	Main uncertainty	Evaluation logic
Seedlists	Information extraction	Can the model recover exact entities from messy layouts and OCR text?	Compare against manually constructed ground truth.
HTA documents	Natural language understanding	Can the model extract and summarise data points when information is implicit, multilingual, or differently framed?	Compare against human interpretation and inspect semantic consistency.
Kickstarter	Text classification	Can the model assign plausible industry codes where even humans disagree?	Compare AI-human agreement with human-human agreement.

That comparison is the article hiding inside the paper. In Act I, the answer is supposed to exist in the document. In Act II, the answer may require interpretation. In Act III, there may not be a single correct answer at all.

This distinction is where many automation projects quietly fail. They treat all “text processing” as one category. It is not one category. Extraction, interpretation, and judgement behave differently. They need different metrics, different review processes, and different tolerance for error.

Act I: seedlists show LLMs at their most useful

The seedlist task is almost comically well suited to generative AI. Botanical gardens have published catalogues of seeds for centuries. Utrecht University Botanic Gardens has an archive containing seedlists from around the world over the last 200 years. The pilot phase alone involves 2,000 seedlists and 45,000 pages; the full archive contains 30,000 seedlists.

The task is to extract plant species names. That sounds straightforward until the documents arrive wearing all the usual disguises: one-column lists, two-column lists, tables, PDFs, Excel files, older paper scans, typewritten pages, handwritten material, and OCR errors. The information is present, but not consistently placed or formatted.

This is the sweet spot: objective output, heterogeneous input.

The authors evaluate four seedlist pages. The manually identified numbers of plant species names are 42, 28, 23, and 32. Claude 3 Opus extracted all species names correctly from these pages. The authors ran the process three times and obtained consistent results. For this limited sample, recall, precision, and accuracy were all 100%.
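For readers who want to reproduce this kind of check on their own pages, here is a minimal sketch of set-based precision and recall. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `extraction_scores` and the species names are invented for the example, and treating names as a set ignores duplicates, which may or may not suit a given seedlist.

```python
def extraction_scores(predicted, truth):
    """Return (precision, recall) for extracted names against a ground-truth list."""
    pred, gold = set(predicted), set(truth)
    true_positives = len(pred & gold)
    # Guard against empty inputs so the ratios are always defined.
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative names only; these are not taken from the paper's seedlist pages.
truth = ["Acer campestre", "Betula pendula", "Quercus robur"]
predicted = ["Acer campestre", "Betula pendula", "Quercus rubra"]

precision, recall = extraction_scores(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one wrong name costs both metrics, so each comes out at two thirds. A page where the model returns every ground-truth name and nothing else scores 1.0 on both, which is what the paper reports for its four pages.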

That sounds dramatic, so it needs immediate resizing. This is not a large benchmark. The authors explicitly frame the result as a limited sample, not proof that the method will generalise perfectly across the archive. But the result is still meaningful because it demonstrates a pattern operators should care about: when the target is objective and visible in the source text, LLMs can absorb layout variation that would make rule-writing brittle.

The seedlist example also shows why LLMs are not merely replacing OCR or parsing. OCR converts scanned documents into text. It does not necessarily know that a distorted botanical name is still a botanical name. In one older seedlist example, the OCR contained errors, and the model was able to correct some of them. That is useful, but also slightly dangerous: correction is helpful when correct and hallucination-adjacent when wrong. The model is not just copying; it is interpreting.
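One cheap way to keep that interpretation visible is to flag every extracted name that does not appear verbatim in the OCR text, so that silent corrections are routed to a reviewer rather than trusted by default. The sketch below is an assumption about how one might do this, not something the paper describes; `flag_non_verbatim`, the sample text, and the `Betu1a` OCR error are all invented for illustration, and a plain substring check is deliberately crude.

```python
def flag_non_verbatim(extracted, source_text):
    """Split extracted names into those found verbatim in the source and those that are not."""
    # Collapse whitespace and lowercase so line breaks and casing do not cause false flags.
    normalised_source = " ".join(source_text.split()).lower()
    verbatim, needs_review = [], []
    for name in extracted:
        key = " ".join(name.split()).lower()
        if key in normalised_source:
            verbatim.append(name)
        else:
            needs_review.append(name)
    return verbatim, needs_review


# Hypothetical OCR output with one damaged name ("Betu1a" instead of "Betula").
ocr_text = "12. Acer campestre L.\n13. Betu1a pendula Roth"
names = ["Acer campestre", "Betula pendula"]

verbatim, needs_review = flag_non_verbatim(names, ocr_text)
print("verbatim:", verbatim)
print("needs review:", needs_review)
```

A flagged name is not necessarily wrong. Here the model's correction is the right one. The point is only that a correction and a hallucination look identical in the output, so the difference has to be surfaced somewhere.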

The paper includes an earlier attempt with OpenAI’s Assistants API using gpt-4-0125-preview, where the temperature could not be specified at the time of the experiment. This should not be treated as a controlled ablation. It is better read as an exploratory comparison and sensitivity example. The API extracted the species names and corrected some OCR errors, but it also introduced different errors across three runs. Some were harmless abbreviation or expansion of author names. Others were more serious, including substitutions, unwanted inclusion of text, or omission of parts of a species name.

The lesson is not “Claude good, OpenAI bad”. That would be the sort of benchmark theatre we already have enough of. The lesson is that even when the task is objective, the model configuration and interface matter. Temperature control, prompt design, file handling, and output processing are not accessories. They are part of the method.

Act II: HTA documents expose the cost of “understanding”

The second use case is harder because the target is not just an entity sitting neatly in text. The project extracts data points from Health Technology Assessment reimbursement documents published by bodies such as NICE in the UK, HAS in France, and ZIN in the Netherlands.

The data points include the drug name, brand name, indication, final recommendation, comparator, relative effectiveness, cost-effectiveness, budget impact, managed entry agreements, and clinical restrictions. There are 14 target fields in total.

This task adds three complications.

First, documents come from different organisations with different formats. Second, documents appear in different languages. Third, some information is not directly or concisely stated. The model may need to synthesise, summarise, or infer from several parts of a document.

That is where extraction becomes interpretation.

The authors test one drug-indication combination: Ivabradine for chronic heart failure, assessed by NICE, HAS, and ZIN. They create a human ground truth and run the model three times per document.

The result is encouraging but less clean than the seedlists case. Out of 14 data points, 11 are extracted accurately and consistently across all three documents and runs. In the ZIN document, two fields — final recommendation and budget impact outcome — are not lexically identical across runs, but are semantically consistent. In the HAS document, one field is consistent but not the desired answer: the model reports the evaluating committee, Commission de la Transparence, rather than the parent organisation, HAS. The authors trace this to ambiguity in the question asking which HTA body performed the assessment.

That single error is more useful than a dozen success stories. It shows where LLM workflows break in practice: not only because the model “hallucinates”, but because the instruction under-specifies the desired interpretation.

In a business setting, this distinction matters. If a model extracts an invoice number incorrectly, the error is factual. If it answers the wrong organisational level because the prompt says “body” and the document contains both a committee and parent agency, the problem is partly semantic governance. The model may be doing a reasonable job on an unreasonable instruction.

The HTA case therefore changes the engineering question. For seedlists, the main question is whether the model can preserve exact entities across messy formats. For HTA documents, the main question is whether the workflow can force ambiguous concepts into stable definitions.

This is why prompt engineering is not a cosmetic exercise here. The paper notes that crafting a clear, unambiguous prompt is often the most time-consuming part of building the pipeline. That is not because researchers need better incantations. It is because the prompt becomes the task specification. If the specification is vague, the model’s consistency may simply mean it is consistently answering the wrong question. Splendidly efficient, in the worst possible way.
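If the prompt is the specification, it helps to write it like one. The sketch below builds a prompt from explicit field definitions, so that an ambiguity such as "which body?" is resolved in one reviewable place. It is a hypothetical illustration: the field names, the wording of the definitions, and `build_prompt` are assumptions made for this example and are not the prompt used in the paper.

```python
# Hypothetical field definitions; the wording is an assumption, not the paper's prompt.
FIELD_DEFINITIONS = {
    "hta_body": (
        "The parent organisation that published the assessment (for example HAS), "
        "not the evaluating committee within it."
    ),
    "final_recommendation": "The overall reimbursement recommendation as stated in the document.",
}


def build_prompt(document_text, field_definitions):
    """Combine field definitions and the document into a single extraction prompt."""
    lines = [
        "Extract the following fields from the document.",
        "Return one JSON object with exactly these keys; use null when a field is not stated.",
        "",
    ]
    for name, definition in field_definitions.items():
        lines.append(f"- {name}: {definition}")
    lines += ["", "Document:", document_text]
    return "\n".join(lines)


prompt = build_prompt("(document text goes here)", FIELD_DEFINITIONS)
print(prompt)
```

Keeping definitions in a data structure rather than in free prose means a change to one field can be reviewed, versioned, and re-tested on its own.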

Act III: Kickstarter shows what happens when ground truth disappears

The Kickstarter use case looks like classification, but its real contribution is about evaluation.

The research goal is to study whether crowdfunding contributes to local economic growth in the United States. To do that, the project needs to assign 2017 NAICS industry codes to Kickstarter projects from 2014 to 2023. The dataset contains about 300,000 projects. Each project includes fields such as name, blurb, Kickstarter category, and subcategory.

This is a large classification task with 311 possible four-digit NAICS codes. Manual labelling is not realistic. Rule-based mapping is also weak because Kickstarter categories do not map neatly to NAICS industry categories. A project might be music, publishing, manufacturing, retail, food service, or several of those depending on what exactly is being funded.

Unlike the seedlist case, there may be no single right answer. The paper states this directly: assessment of a NAICS code is inherently subjective, and different human raters may assign different codes to the same project.

So the evaluation changes. The authors use interrater reliability logic. They select a representative sample of 540 projects, with roughly equal numbers from the 15 Kickstarter categories. Six human raters assign NAICS codes to partially overlapping subsets, so each project receives codes from two independent human raters. The model’s codes are then compared with human codes.

The highest AI-human match is 53% across 145 projects. The highest human-human match is 60% across 63 projects. The authors conclude that, for this task, generative AI is broadly comparable to a human rater.

This is not a conventional accuracy result. It says something subtler: when humans themselves disagree, model performance has to be judged against the disagreement structure of the task. If two competent people do not reliably produce the same label, demanding perfect model agreement is intellectually tidy and operationally useless.
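The comparison itself is simple arithmetic: for any two raters, count the projects both of them coded and the share on which they chose the same code. The sketch below assumes an exact match on the code string, which may differ from how the paper scored matches; the project ids and four-digit codes are hypothetical, and `match_rate` is a name invented for this example.

```python
def match_rate(labels_a, labels_b):
    """Return (share of agreeing labels, number of shared items) for two raters."""
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        return None, 0
    matches = sum(labels_a[item] == labels_b[item] for item in shared)
    return matches / len(shared), len(shared)


# Hypothetical project ids and four-digit codes, for illustration only.
human_1 = {"p1": "7111", "p2": "5111", "p3": "3114", "p4": "7225"}
human_2 = {"p1": "5122", "p2": "5111", "p3": "3114"}
model = {"p1": "7111", "p2": "5111", "p3": "7225", "p4": "7225"}

human_human = match_rate(human_1, human_2)
ai_human = match_rate(model, human_1)
print("human-human:", human_human)
print("AI-human:", ai_human)
```

Reporting the number of shared items next to each rate matters, because the two headline figures in the paper rest on different sample sizes.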

But the result also imposes a sharp business boundary. “Comparable to a human rater” is not the same as “correct”. It may be enough for exploratory economic research, portfolio-level sector analysis, or pre-labelling workflows. It may not be enough for compliance, credit decisions, regulated reporting, or anything where one label creates a binding consequence.

The paper’s examples of disagreements make the ambiguity concrete. A jazz album project can plausibly be classified as performing arts or sound recording. A poetry book can be independent artists or publishing. A curry ketchup project can be food manufacturing or restaurants. Both labels can be defensible depending on whether one emphasises the creator, product, business model, or intended economic activity.

This is exactly the sort of ambiguity executives like to ignore until it appears in a dashboard with two decimal places.

The pipeline is deliberately boring, which is good

The paper’s engineering pipeline is simple:

preprocess and chunk the input so it fits model token limits;
combine each chunk with task instructions;
send the prompt to the model;
retry if the call fails because of timeout, token limits, or similar issues;
post-process the output into usable structured data;
repeat until all chunks are processed.

The output is requested in JSON because the result must be machine-readable. Temperature is set to 0 because the desired outputs are objective or at least intended to be stable. Claude 3 Opus is used because, by trial and error, the authors found it more accurate and consistent than the GPT-3.5 and GPT-4 models they tried for these tasks.
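The steps above can be sketched in a few dozen lines. This is a minimal illustration under several assumptions, not the authors' implementation: `call_model` is a hypothetical callable standing in for whichever API client is used (and is where temperature 0 would be set), chunking is done by characters as a rough proxy for token limits, and the model is assumed to return a JSON list per chunk.

```python
import json
import time


def chunk_text(text, max_chars=4000):
    """Split text on line boundaries into chunks of roughly max_chars characters."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


def call_with_retry(call_model, prompt, attempts=3, base_delay=1.0):
    """Call the model, retrying with exponential backoff if the call raises."""
    for attempt in range(attempts):
        try:
            return call_model(prompt)
        except Exception:  # a real client would catch its specific timeout or limit errors
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


def run_pipeline(text, instructions, call_model, max_chars=4000):
    """Chunk the input, prompt the model per chunk, and merge the parsed JSON lists."""
    results = []
    for chunk in chunk_text(text, max_chars):
        raw = call_with_retry(call_model, f"{instructions}\n\n{chunk}")
        results.extend(json.loads(raw))  # post-processing: parse the JSON list
    return results


def fake_model(prompt):
    """Stand-in for a real API call: returns the non-empty chunk lines as a JSON list."""
    chunk = prompt.split("\n\n", 1)[1]
    return json.dumps([line.strip() for line in chunk.splitlines() if line.strip()])


items = run_pipeline("alpha\nbeta\ngamma\n", "Return the lines as a JSON list.", fake_model, max_chars=8)
print(items)
```

Passing the model call in as a function keeps the scaffolding testable without network access. A production version would also need to decide what happens when the returned text is not valid JSON, which this sketch leaves as an unhandled error.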

There is nothing flamboyant here. That is the point.

Most useful AI pipelines are not cinematic agent swarms marching through a corporate PowerPoint. They are input preparation, prompt discipline, retries, structured output, post-processing, and evaluation. The operational value comes from reducing the number of places where a stochastic model can behave like a bored intern with excellent grammar.

The paper’s pipeline should be read as an implementation detail that supports the larger claim: generative AI only becomes a data processing method when it is wrapped in engineering controls. A bare chat interface is not a pipeline. A prompt pasted into a web app is not a method. A JSON instruction is not a quality system.

Temperature zero and JSON are guardrails, not a magic oath

One of the most useful things the paper does is correct the misconception that deterministic settings make LLMs deterministic enough for production.

The authors set temperature to 0 across the main use cases to minimise randomness and variability. They also ask for JSON output to make downstream processing possible. Both choices are sensible. Neither is sufficient.

The paper explicitly notes that temperature 0 minimises non-determinism but does not eliminate it. This matters because many teams treat temperature as a superstition dial: set it to zero, declare the model “stable”, and proceed directly to procurement.

The paper says: not so fast.

Accuracy and consistency are related but separate. A model can be consistent and wrong. It can be accurate on average but inconsistent across repeated runs. It can produce semantically equivalent answers with different wording, which may be acceptable for interpretation tasks but annoying for automated processing. It can also produce outputs that look structured but contain an answer to a subtly different question.

For operators, this creates three tests:

Test	Question	Failure mode
Accuracy	Does the model match a trusted ground truth?	It produces plausible but incorrect data.
Consistency	Does the same setup produce stable outputs across runs?	It gives different answers under nominally identical conditions.
Specification fit	Is the model answering the intended question?	It answers a reasonable but wrong interpretation of the prompt.
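The first two tests can be partly automated per field. The sketch below is one possible way to do it, assuming each run returns a dictionary of field values: `normalise` and `field_report` are hypothetical helpers, and the run outputs are invented to mirror the pattern the paper describes for the HAS document. Specification fit is not computed; it shows up indirectly as a field that is consistent but inaccurate, which a person then has to investigate.

```python
def normalise(value):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(value or "").split()).lower()


def field_report(runs, truth):
    """For each ground-truth field, report whether runs agree with each other and with the truth."""
    report = {}
    for field, expected in truth.items():
        answers = [normalise(run.get(field)) for run in runs]
        report[field] = {
            "consistent": len(set(answers)) == 1,
            "accurate": all(answer == normalise(expected) for answer in answers),
        }
    return report


# Invented run outputs for illustration; only the overall pattern follows the paper.
truth = {"hta_body": "HAS", "drug_name": "Ivabradine"}
runs = [
    {"hta_body": "Commission de la Transparence", "drug_name": "Ivabradine"},
    {"hta_body": "Commission de la Transparence", "drug_name": "ivabradine"},
    {"hta_body": "Commission de la Transparence", "drug_name": "Ivabradine "},
]

report = field_report(runs, truth)
for field, result in report.items():
    print(field, result)
```

String comparison after light normalisation will still mark semantically equivalent answers with different wording as inconsistent. For interpretation-heavy fields that flag should prompt a human look rather than count as a failure.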

The third test is the one most likely to be missed. The “HTA body” error in the HAS document is not just an extraction failure. It is a specification failure revealed through model behaviour.

The evidence is illustrative, not benchmark theatre

The paper is careful about the kind of evidence it provides. It is qualitative and example-driven. The authors do not present a large statistical evaluation using standard metrics across thousands of cases. They explicitly plan that for future work.

That does not make the paper weak. It makes it appropriately scoped.

Here is how the evidence should be read:

Evidence item	Likely purpose	What it supports	What it does not prove
Shared chunking, retry, JSON, post-processing pipeline	Implementation detail	LLM data processing requires engineering scaffolding.	It does not prove the pipeline is optimal.
Claude 3 Opus seedlist extraction on four pages, repeated three times	Main illustrative evidence	Objective extraction can be accurate and consistent on heterogeneous text samples.	It does not guarantee archive-wide performance.
OpenAI Assistants API three-run seedlist example	Exploratory comparison / sensitivity example	Model interface and configuration can affect variability and error patterns.	It is not a controlled model benchmark.
HTA extraction for Ivabradine across NICE, HAS, and ZIN, three runs each	Main illustrative evidence	LLMs can extract many structured data points from complex multilingual policy documents.	It does not establish general performance across all HTA documents or therapeutic areas.
Kickstarter sample of 540 projects with human-rater comparison	Main evidence for ambiguous classification	LLM classification can be broadly comparable to human labelling where humans also disagree.	It does not establish objective correctness because no ground truth exists.

This is useful because it prevents the common overreading. The paper is not saying “LLMs achieve X% accuracy on research data processing.” It is saying: here are the conditions under which the method looks promising, here are the patterns of failure, and here is how evaluation must change by task type.

That is a better contribution than another leaderboard number with a suspiciously heroic decimal.

What Cognaptus infers for business use

The business implication is not “replace data teams with LLMs”. It is more specific: use generative AI as a triage tool for high-volume text workflows where rule-based automation breaks under format variation.

The best candidates have five properties:

the input is textual or document-like;
the volume makes full manual review uneconomic;
the layout, language, or category system is too heterogeneous for simple rules;
a human can usually perform the task on an individual sample;
quality can be evaluated through ground truth, repeated runs, or human agreement.
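The five properties above can be turned into a screening checklist that names what is missing rather than returning a bare yes or no. This is a hypothetical sketch: the `TaskProfile` fields are paraphrases of the list, and the two example tasks are invented rather than drawn from the paper.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskProfile:
    """One boolean per property in the list; the field names are paraphrases."""
    textual_input: bool
    volume_too_large_for_manual_review: bool
    too_heterogeneous_for_rules: bool
    human_can_do_single_sample: bool
    quality_can_be_evaluated: bool


def missing_properties(profile):
    """Return the names of the properties the task does not satisfy."""
    return [f.name for f in fields(profile) if not getattr(profile, f.name)]


# Invented examples: a messy archive task and a stable-format task that rules can handle.
messy_archive = TaskProfile(True, True, True, True, True)
stable_format_task = TaskProfile(True, True, False, True, True)

print(missing_properties(messy_archive))
print(missing_properties(stable_format_task))
```

An empty list means the task is worth piloting, not that the pilot will succeed. A single missing property, as in the second example, usually points to a cheaper tool.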

That last condition is not optional. Without evaluation, the model is not a data processing method. It is a confident narrator.

For businesses, the paper suggests a practical decision framework:

Task condition	Use generative AI?	Recommended control
Objective extraction with visible answers	Yes, strong candidate	Build a labelled sample, test precision/recall, repeat runs, inspect edge cases.
Complex document interpretation	Yes, but cautiously	Define fields tightly, test prompt ambiguity, compare semantic consistency across runs.
Subjective classification without ground truth	Possibly	Compare against multiple human raters and decide whether disagreement is acceptable for the business purpose.
High-stakes regulated decision	Not without stronger governance	Require audit trails, human review, legal assessment, and model/version control.
Simple structured extraction from stable formats	Probably not necessary	Use rules, parsers, or conventional automation. Do not bring a dragon to open a jar.

The ROI logic is also narrower than the usual AI sales pitch. The value is not just lower labour cost. It is cheaper diagnosis of where automation is feasible. A disciplined LLM pilot can tell an organisation whether the bottleneck is format heterogeneity, ambiguous definitions, missing ground truth, or genuinely hard domain judgement.

That diagnostic value is underrated. Many automation failures happen because teams automate the interface before understanding the task.

Where the business inference stops

The paper has three important boundaries.

First, the evidence is illustrative. The seedlist result is excellent but based on four evaluated pages. The HTA result covers one drug-indication combination across three organisations. The Kickstarter comparison is broader, but it evaluates agreement rather than objective truth. These are credible demonstrations, not final performance guarantees.

Second, model choice is empirical but not exhaustively benchmarked. Claude 3 Opus performed best among the models the authors tried, but the paper does not provide a systematic model comparison across current alternatives. That matters because model performance, API behaviour, and availability change over time. A workflow built on a proprietary model inherits versioning and reproducibility risk.

Third, the authors explicitly do not assess legal and ethical appropriateness. They focus on technical suitability. In business settings, that distinction is non-negotiable. Sending sensitive documents to a public API may be unacceptable even if the extraction quality is excellent. Bias, privacy, data retention, auditability, and regulatory obligations do not vanish because the JSON parsed correctly.

These boundaries do not weaken the paper’s practical value. They prevent misuse. A tool can be technically appropriate and still institutionally inappropriate. Annoying, but reality has never been known for its elegant API.

The real lesson is task typing

The paper’s central lesson is that generative AI should not be evaluated as one generic capability. It should be evaluated by task type.

Seedlists ask: can the model find the right entities in messy text?

HTA documents ask: can the model extract and stabilise meaning from complex documents?

Kickstarter asks: can the model behave like a plausible human classifier when even humans disagree?

Those are different questions. They need different evidence. They create different business risks.

The right tool for the thought is not always an LLM. Sometimes it is a parser. Sometimes it is a database constraint. Sometimes it is a human expert and an uncomfortable meeting about data definitions. But when the problem is large-scale, heterogeneous text that humans can interpret and rules cannot easily capture, generative AI deserves a disciplined trial.

Not a miracle. Not a toy. A tool — which is already a high enough bar.

Cognaptus: Automate the Present, Incubate the Future.

¹ Modhurita Mitra, Martine G. de Vos, Nicola Cortinovis, and Dawa Ometto, “Generative AI for Research Data Processing: Lessons Learnt From Three Use Cases,” arXiv:2504.15829, 2025.
