Generative AI is often praised for its creativity—composing symphonies, painting surreal scenes, or offering quirky new business ideas. But in some contexts, especially research and data processing, consistency and accuracy are far more valuable than imagination. A recent exploratory study by Utrecht University demonstrates exactly where Large Language Models (LLMs) like Claude 3 Opus shine—not as muses, but as meticulous clerks.
When AI Becomes the Analyst
The research project explores three different use cases in which generative AI was employed to perform highly structured research data tasks:
- Extracting plant species names from historical seedlists: These documents came from over 200 years of botanical records, scanned or exported from various formats. The LLM was able to identify genus, epithet, and even synonym variants despite inconsistencies in layout and OCR noise. This task would be infeasible for a rule-based parser because the documents lack a unified structure (a sketch of the target output appears after this list).
- Extracting specific medical data from HTA reports: Health Technology Assessment (HTA) documents are written in various languages and styles across European nations. The LLM extracted key information like drug name, indication, and recommendation from paragraphs of dense regulatory language. It outperformed any static rule system due to its ability to understand semantic intent.
- Classifying Kickstarter projects into NAICS codes: With over 300,000 creative project descriptions, assigning industry codes was a classic example of subjective classification. Generative AI matched human-level performance, demonstrating usefulness even in ambiguous, label-sparse tasks. Its capacity to generalize across brief blurbs gave it an edge.
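To make the seedlist task concrete, here is a minimal sketch of the kind of structured output such a prompt could request. The field names and the two records are illustrative assumptions, not the study's published schema.

```python
import json

# Illustrative target structure for extracted seedlist entries.
# Field names ("genus", "epithet", "author", "synonym_of") are assumptions for this sketch.
example_output = [
    {"genus": "Acer", "epithet": "pseudoplatanus", "author": "L.", "synonym_of": None},
    {"genus": "Quercus", "epithet": "robur", "author": "L.", "synonym_of": None},
]

print(json.dumps(example_output, indent=2))
```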
Why Not Use Rules?
In structured environments, traditional code excels. But in these use cases, the input data had inconsistent formats (scanned seedlists, multilingual HTA reports, brief Kickstarter blurbs). For instance, extracting plant names from OCR'd seedlists riddled with recognition errors, or parsing French HTA documents with embedded tables, defies rule-based tokenization. Another example is the Kickstarter NAICS labeling, where “poetry” could relate to independent artistry or book publishing depending on nuanced wording, a distinction a regex or tree parser can't handle.
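To see why, consider a naive rule-based attempt at the seedlist case. The pattern and the sample lines below are illustrative assumptions, but they show how quickly OCR noise defeats a fixed regex.

```python
import re

# Naive binomial pattern: a capitalized genus followed by a lowercase epithet.
# An illustrative sketch, not the parser the study evaluated.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]+)\b")

clean_line = "Acer pseudoplatanus L."
noisy_line = "ACER pseudo-platanus L."  # all-caps genus and a hyphen introduced by OCR

print(BINOMIAL.findall(clean_line))  # [('Acer', 'pseudoplatanus')] -- the rule works
print(BINOMIAL.findall(noisy_line))  # [] -- the same plant slips through entirely
```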
Engineering Consistency in a Creative Tool
To convert LLMs from poets into processors, the researchers combined three controls (a minimal call sketch follows the list):
- Temperature = 0, minimizing randomness
- Well-engineered prompts, specifying exact output format (usually JSON)
- Chunked input to avoid token limits and ensure clarity
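Put together, a single extraction call might look like the sketch below. It uses the Anthropic Python SDK as a stand-in since the study worked with Claude 3 Opus, but the exact prompt wording, model string, and sample input are assumptions for illustration; the study's own prompts live in its repository.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

ocr_chunk = "Hortus Botanicus 1894: Acer pseudoplatanus L., Quercus robur L., ..."  # stand-in OCR text

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    temperature=0,  # minimize randomness so repeated runs give the same answer
    system=(
        "You are a botanical data extraction assistant. "
        "Extract all plant species names and return only valid JSON: "
        "a list of objects with keys genus, species, and author."
    ),
    messages=[{"role": "user", "content": ocr_chunk}],
)

print(response.content[0].text)  # expected: a JSON array, nothing else
```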
Here’s a table showing how model outputs vary with temperature, using the same prompt “Extract all plant species names from the following OCR text.”
| Temperature | Output Quality | Notes |
|---|---|---|
| 0.0 | ✅ Accurate and consistent | Strict JSON, no hallucination |
| 0.5 | ✅ Mostly accurate, slight drift | Occasionally adds explanations |
| 0.7 | ⚠️ Adds alternate interpretations | Inserts variations and extra formatting |
| 0.9 | ❌ Hallucinates names and facts | Multiple formats, verbose, inconsistent structure |
Examples of prompts:
- ✅ Good: “Extract all plant species names in JSON format. Each entry should include genus, species, and author if present.”
- ❌ Bad: “Can you tell me the plants mentioned here?”
Example of chunking:
Given a 200-page OCR'd PDF, the text is broken into chunks of roughly 500 words, one chunk per prompt. Each prompt includes a static header: “You are a botanical data extraction assistant. Process the following chunk and output results in structured JSON.”
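A minimal sketch of that chunking step, assuming a plain word-count split (the study may segment differently, for example on page or record boundaries):

```python
HEADER = (
    "You are a botanical data extraction assistant. "
    "Process the following chunk and output results in structured JSON.\n\n"
)

def chunk_text(text: str, max_words: int = 500) -> list[str]:
    """Split long OCR text into ~max_words-word chunks, each prefixed with the static header."""
    words = text.split()
    return [
        HEADER + " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

sample_text = " ".join(["Acer pseudoplatanus L."] * 400)  # stand-in for a long OCR'd seedlist
prompts = chunk_text(sample_text)
print(len(prompts), "prompts, each at most 500 words plus the header")
```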
Lessons for Broader Business Process Automation
Though these cases are research-oriented, the same logic applies to Business Process Automation (BPA) workflows:
Here’s a comparative table of where LLMs outperform and underperform traditional logic:
| Case | Task Description | Better with LLM? | Why? |
|---|---|---|---|
| 1 | Parsing unstructured resumes | ✅ | Varied layouts and keywords, LLM adapts flexibly |
| 2 | Classifying email intent (support vs. complaint) | ✅ | Requires semantic understanding |
| 3 | Summarizing meeting transcripts | ✅ | LLM can abstract long conversations |
| 4 | Extracting product specs from web content | ✅ | HTML structures are inconsistent |
| 5 | Translating colloquial customer messages | ✅ | Nuance and idioms handled better |
| 6 | Calculating invoice totals from tables | ❌ | Rule-based script is deterministic and reliable |
| 7 | Validating insurance form formats | ❌ | Regex and schema checkers excel |
| 8 | Sorting entries into fixed taxonomy | ❌ | LLM may misclassify edge cases |
| 9 | Running SQL queries based on dropdown inputs | ❌ | Direct mapping preferred over natural language |
| 10 | Monitoring server logs for error codes | ❌ | Pattern-matching scripts are faster and more accurate |
LLMs excel when ambiguity, natural language, or formatting irregularities exist. Traditional logic dominates when the structure is known, deterministic, and performance-critical.
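Conversely, case 6 in the table needs no model at all. A few lines of deterministic code (with illustrative line items) already give an exact, auditable answer:

```python
from decimal import Decimal

# Line items already parsed from a structured invoice table: (description, quantity, unit price).
line_items = [
    ("Widget A", 3, Decimal("19.99")),
    ("Widget B", 1, Decimal("4.50")),
]

total = sum(qty * price for _, qty, price in line_items)
print(f"Invoice total: {total}")  # 64.47 -- exact, repeatable, no model call needed
```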
Closing Thought
As Cognaptus sees it, generative AI is not a universal replacement for rule-based systems, but a powerful tool when matched to the right type of problem. If a task is too subtle for scripts but too dull for humans, it just might be the perfect job for a well-tuned LLM.
Explore the full research here: github.com/UtrechtUniversity/generative-ai
Reference: Mitra, M., de Vos, M. G., Cortinovis, N., & Ometto, D. (2025). Generative AI for Research Data Processing: Lessons Learnt From Three Use Cases. arXiv:2504.15829 [cs.AI]. Submitted to 2024 IEEE 20th International Conference on e-Science, Osaka, Japan.
Cognaptus: Automate the Present, Incubate the Future.