When the AI Becomes the Agronomist: Can Chatbots Really Replace the Literature Review?

A farmer does not need a literature review. She needs to know what works.

That simple sentence is why AI agronomy is so tempting. Somewhere inside thousands of papers are useful answers: which microbial agents suppress whitefly, whether botanicals work outside the lab, how much pest control disappears when a method leaves a greenhouse and meets weather, soil, and actual insects with their own little business plans. The evidence exists, but it is fragmented, multilingual, paywalled, and written in the soothing dialect of “further research is warranted.”

So the promise is obvious: ask a chatbot to read the science and turn it into actionable crop-protection advice.

The uncomfortable question is equally obvious: when the chatbot becomes the agronomist, is it reading the literature—or remixing a very confident hallucination?

A recent arXiv paper by Kris A. G. Wyckhuys gives that question a useful stress test. It compares two free-tier general-purpose AI systems—web-grounded DeepSeek-R1 and non-grounded ChatGPT-4o—on agroecological crop-protection knowledge synthesis across nine major insect pests, plant diseases, and weeds.¹ The comparison is not merely “which model is better.” That would be too easy, and therefore probably wrong. The real lesson is sharper: web-grounding greatly improves evidence coverage and internal coherence, but it does not magically turn an LLM into a trustworthy scientific review engine. Agriculture gets a librarian, not an oracle. Admittedly, a librarian who sometimes invents insects.

The study compares two AI behaviors, not just two brands

The paper sets up a practical comparison between two modes of AI knowledge work.

ChatGPT-4o, in the free-tier version tested, is treated as a non-grounded model: fluent, general-purpose, and dependent on what it learned in training rather than on live search. DeepSeek-R1 is treated as web-grounded: able to search across online material and report from a larger live corpus. The distinction matters because crop-protection science is not a neat encyclopedia entry. It is a messy evidence field, where relevant information may sit in country-specific trials, older taxonomy, regional databases, and non-English sources.

The author submitted structured prompts between June 20 and July 5, 2025. The prompts ask both systems to screen peer-reviewed literature, extract efficacy data, report laboratory, greenhouse, and field performance, and list biological control agents or non-chemical management solutions. The targets include three insect pests (Bemisia tabaci, Helicoverpa armigera, and Plutella xylostella) plus three major plant diseases and three major weed groups. The management tactics cover microbial agents, predators, parasitoids, botanical extracts, and agroecological measures.

This is not a casual chatbot quiz. The prompts use inclusion and exclusion criteria resembling systematic-review instructions: peer-reviewed sources, geographic restrictions, quantitative efficacy metrics, controls, replication, and reported variability. Supplementary Table 1 is best read as an implementation detail: it shows the prompt design used to force the models into something closer to a structured evidence-synthesis workflow. It does not prove the workflow is valid. It tells us what the models were being asked to approximate.

That distinction is important. The paper does not validate either AI output against a complete human-run systematic review. Instead, it compares AI-reported literature breadth, internal consistency, and obvious factual reliability. It is a benchmark of machine-generated evidence synthesis under practical conditions, not a declaration that machine-only reviews are ready for adult employment.

Coverage: DeepSeek brought a library; ChatGPT brought selected clippings

The first result is the easiest to understand and the easiest to misuse.

DeepSeek reportedly screened a much larger literature corpus than ChatGPT. Across the study, the author summarizes DeepSeek’s coverage as 4.8 to 49.7 times larger and its reported set of biological control agents or solutions as 1.6 to 2.4 times larger. For insect pests, this gap appears repeatedly.

For B. tabaci, DeepSeek and ChatGPT reported data for about 13.8 versus 5.8 agents, extracts, or methods across tactics. The corresponding publication counts were about 774.4 versus 17.4. That makes ChatGPT’s reported literature base 97.8% smaller.

For H. armigera, DeepSeek reported around 14.8 agents or methods versus ChatGPT’s 9.0, based on about 2,668.6 versus 550.8 publications. ChatGPT’s base was 79.4% smaller.

For P. xylostella, the contrast widened again: DeepSeek reported around 11.0 agents or methods versus ChatGPT’s 6.2, based on about 1,938.8 versus 39.0 publications. ChatGPT’s base was 98.0% smaller.
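The "percent smaller" figures follow directly from the publication counts quoted above. A minimal sketch of that arithmetic, using only the counts reported in this article; the function and variable names are illustrative and do not come from the paper.

```python
# Publication counts as quoted in this article: (DeepSeek, ChatGPT).
# The names are illustrative; only the numbers come from the text.
REPORTED_PUBLICATIONS = {
    "Bemisia tabaci": (774.4, 17.4),
    "Helicoverpa armigera": (2668.6, 550.8),
    "Plutella xylostella": (1938.8, 39.0),
}


def coverage_gap(deepseek: float, chatgpt: float) -> tuple[float, float]:
    """Return (fold difference, percent by which ChatGPT's base is smaller)."""
    fold = deepseek / chatgpt
    percent_smaller = (1 - chatgpt / deepseek) * 100
    return fold, percent_smaller


for pest, (deepseek, chatgpt) in REPORTED_PUBLICATIONS.items():
    fold, pct = coverage_gap(deepseek, chatgpt)
    print(f"{pest}: {fold:.1f}x larger, ChatGPT base {pct:.1f}% smaller")
```

The H. armigera and P. xylostella rows give the two ends of the 4.8 to 49.7 range cited for overall coverage.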

A blunt table helps:

Comparison point	What the paper reports	Interpretation	Boundary
Literature coverage	DeepSeek screened 4.8-49.7x more literature	Web-grounding substantially expands the evidence base	Bigger retrieval is not the same as verified retrieval
Agents or solutions listed	DeepSeek listed 1.6-2.4x more	Broader coverage reduces omission risk	It may also increase hallucinated entities
ChatGPT free-tier corpus	Often 79-98% smaller in reported insect-pest examples	Free-tier access may shape knowledge quality	The study does not test paid ChatGPT, plugins, or custom retrieval
Country-level evidence	DeepSeek drew more strongly from China and regional databases	Local evidence matters in agriculture	Country-level outputs can still be sparse or inconsistent

This is the first business lesson. In technical domains, model comparison is not only about reasoning quality. It is about evidence access. A chatbot that cannot retrieve enough of the relevant literature may sound useful while quietly narrowing the decision space.

In agriculture, that narrowing can be operationally serious. If the model misses entire predator guilds, ignores relevant microbial agents, or under-represents regional studies, the user may not see a weak recommendation. They may see a plausible but incomplete menu.

That is the dangerous part. Bad AI does not always look wrong. Sometimes it looks tidy.

Consistency: the better model understood that fields are not laboratories

Coverage alone would be a weak result. A model can retrieve more and still synthesize nonsense at scale. The paper therefore asks a second question: do the AI-reported numbers behave in an agronomically plausible way?

Here DeepSeek looks stronger.

Across pests and tactics, DeepSeek’s AI-reported field efficacy was strongly consistent with corresponding laboratory efficacy. For average efficacy values, the relationship had an $R^2$ of 0.838; for maximum values, $R^2$ was 0.810. ChatGPT showed the same broad direction but much weaker explanatory power: $R^2$ of 0.219 for averages and 0.135 for maxima.

This matters because the lab-to-field gap is one of the most basic realities in crop protection. Laboratory performance usually overstates field performance. Controlled temperature, humidity, exposure, and target density do not survive contact with weather, crop architecture, farmer practice, or pests that inconveniently refuse to behave like spreadsheet rows.

DeepSeek captured this decline more coherently. Its average field efficacy values were reported as 15.8-25.9% lower than corresponding laboratory values; maximum field efficacy values were 7.8-15.6% lower. ChatGPT also showed lower field than lab estimates, but with wider variability.
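For readers who want to run the same kind of plausibility check on their own model outputs, here is a rough sketch. It assumes a simple least-squares line between lab and field values and treats the decline as a relative percentage; the article does not specify the paper's exact regression method or whether its decline figures are relative or in percentage points. The efficacy numbers below are made up for illustration and are not taken from the paper.

```python
from statistics import mean


def r_squared(x: list[float], y: list[float]) -> float:
    """R^2 of a simple least-squares line through (x, y), i.e. squared Pearson r."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def mean_relative_decline(lab: list[float], field: list[float]) -> float:
    """Average percentage by which field efficacy falls below lab efficacy."""
    return mean((l - f) / l for l, f in zip(lab, field)) * 100


# Made-up efficacy percentages for four hypothetical agents (NOT from the paper).
lab_efficacy = [90.0, 80.0, 70.0, 60.0]
field_efficacy = [72.0, 66.0, 54.0, 50.0]

print(f"R^2 = {r_squared(lab_efficacy, field_efficacy):.3f}")
print(f"mean lab-to-field decline = "
      f"{mean_relative_decline(lab_efficacy, field_efficacy):.1f}%")
```

A high R² with a consistent positive decline is what agronomically coherent output should look like. A low R² suggests the field numbers were not derived from the same evidence as the lab numbers.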

The paper’s Figure 2 is core evidence, not decorative plotting. It tests whether each model’s outputs preserve a plausible relationship between laboratory and field performance. DeepSeek’s panels show tighter, more interpretable lab-field relationships; ChatGPT’s are visibly noisier, with some dashed non-significant trendlines. The visual point is simple: DeepSeek’s numbers behave more like agronomic evidence. ChatGPT’s numbers behave more like a model trying to satisfy a format.

That difference is not just academic. A decision-support system needs consistency more than eloquence. If the system cannot preserve the basic relationship between controlled trials and field outcomes, then every downstream recommendation becomes suspect. It may still be grammatically polished. Crops, regrettably, do not respond to grammar.

The models disagree exactly where users would want confidence

The most revealing result is not that DeepSeek beats ChatGPT on coverage. It is that the two models disagree at the level of individual agents and solutions.

When the paper compares DeepSeek-reported field efficacy values with ChatGPT-reported values for specific pest-management agents, the agreement is essentially absent. For average field efficacy, the relationship is non-significant with $R^2 = 0.026$. For maximum field efficacy, it is similarly non-significant with $R^2 = 0.027$.

Figure 3 is therefore not a side chart. It is a warning label.

At the coarse level, both systems can identify that non-chemical pest-management approaches—microbials, botanicals, predators, parasitoids, agroecological practices—can be valid pesticide alternatives. At the detailed level, the models do not reliably agree on how well specific agents work.

This creates a practical hierarchy of trust:

Level of question	AI appears more useful for	AI remains risky for
Strategic direction	“Are there credible non-chemical alternatives?”	Replacing agronomic policy design
Evidence discovery	“Which agents or tactics should experts inspect?”	Treating listed agents as verified
Coarse pattern	“Does field performance tend to lag lab performance?”	Precise efficacy claims
Farmer communication	Plain-language summaries of broad options	Autonomous prescription
Scientific synthesis	Screening and triage	Machine-only review conclusions

This hierarchy is the article’s central point. The LLM can be a useful evidence triage layer. It should not be the final evidentiary authority.

A bad procurement decision would ask, “Which chatbot gives the best answer?” A better procurement decision asks, “At what resolution does the chatbot remain useful?”

For this paper, the answer is: coarse trends, broader discovery, and early synthesis look promising. Detailed decision support still needs expert verification. The boring sentence is also the correct one. Annoying how often that happens.

Hallucination is not one failure; it is several different failures wearing the same hat

The paper’s limitation discussion is unusually useful because it does not merely say “hallucinations happened.” It gives several types.

First, both models periodically listed implausible ecological interactions. The paper notes that both reported high B. tabaci control efficacy from Nosema bombycis and Spodoptera exigua nucleopolyhedrosis virus, even though neither microorganism has been reported to affect that pest. DeepSeek also reportedly hallucinated non-existent agents and associated references, such as a B. tabaci nucleopolyhedrosis virus.

Second, the models described biologically or experimentally impossible settings. ChatGPT reportedly described reduced tillage suppressing H. armigera populations in the laboratory; DeepSeek reportedly described hedgerow deployment doing the same. These are not simply wrong numbers. They are category errors. A hedgerow is not a petri-dish treatment unless the laboratory has become aggressively ambitious.

Third, the models confused nomenclature. DeepSeek treated older and newer names of biological control agents, such as Paecilomyces fumosoroseus and Isaria fumosorosea, as distinct species. In biology, names change. A system that cannot reconcile taxonomy can double-count evidence, split records incorrectly, or recommend a “new” agent that is merely an old name wearing a new badge.
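Taxonomy normalization is one of the easier failure modes to guard against mechanically. A minimal sketch, assuming a hand-curated synonym table: the `SYNONYMS` mapping, the function names, and the `study-*` identifiers are hypothetical placeholders, and only the single name pair comes from the article. A real system would load a maintained taxonomic reference rather than a hard-coded dictionary.

```python
from collections import defaultdict

# Hypothetical synonym table: older name (lower-cased) -> name treated as canonical.
# Only this one pair comes from the article.
SYNONYMS = {
    "paecilomyces fumosoroseus": "Isaria fumosorosea",
}


def canonical_name(name: str) -> str:
    """Collapse whitespace and case, then map known synonyms to one name."""
    key = " ".join(name.split()).lower()
    if key in SYNONYMS:
        return SYNONYMS[key]
    return key.capitalize()  # binomial style: capitalized genus, lower-case epithet


def merge_records(records: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Group source identifiers under one canonical agent name."""
    merged: dict[str, set[str]] = defaultdict(set)
    for agent, source_id in records:
        merged[canonical_name(agent)].add(source_id)
    return dict(merged)


# Placeholder records: (agent name as reported by the model, source identifier).
records = [
    ("Paecilomyces fumosoroseus", "study-A"),
    ("Isaria fumosorosea", "study-B"),
    ("isaria  fumosorosea", "study-A"),
]
print(merge_records(records))
```

Without the synonym step, the same organism would appear as two agents and `study-A` would be counted twice.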

Fourth, omission remained severe. ChatGPT sometimes missed whole categories. For example, the paper reports that for global field-level control of P. xylostella, ChatGPT covered only one microorganism, Bacillus thuringiensis, and omitted invertebrate predators altogether. Narrower retrieval did not just produce fewer citations. It distorted the structure of the solution space.

These failures have different operational meanings.

Failure mode	What it looks like	Operational risk
Fabricated agent	Non-existent organism or virus	False recommendation enters product knowledge base
Implausible interaction	Wrong pest-agent relationship	Decision support invents efficacy
Impossible setting	Field practice tested “in lab”	Evidence labels become meaningless
Taxonomic confusion	Old and new names counted separately	Duplicate or fragmented evidence
Omission	Entire solution category missing	User never sees viable alternatives

This is why “hallucination rate” is an insufficient governance metric. A fabricated reference, a nomenclature error, and an omitted predator guild are not the same failure. They require different safeguards.

An agritech company building AI crop advisory tools should therefore design validation layers around failure type, not around generic “AI accuracy.” Entity resolution, citation verification, taxonomy normalization, country-specific source checks, and expert review are separate controls. Put them all under a single “fact-checking” label and you have produced governance theater. Very popular. Not very helpful.
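One way to picture "validation by failure type" is a checker that returns a separate flag for each kind of problem instead of a single pass/fail score. The sketch below is an assumption-laden illustration: the `Claim` fields, the flag names, and the tiny reference sets are all hypothetical stand-ins for curated databases, populated only with examples mentioned in this article.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    solution: str            # organism or practice, as reported by the model
    is_organism: bool
    pest: str
    setting: str             # "laboratory", "greenhouse", or "field"
    citation_verified: bool  # set by a separate citation-checking step


# Hypothetical reference data; a real system would use curated databases.
KNOWN_AGENTS = {"Bacillus thuringiensis"}
KNOWN_PAIRS = {("Bacillus thuringiensis", "Plutella xylostella")}
FIELD_ONLY_PRACTICES = {"reduced tillage", "hedgerow deployment"}


def review_flags(claim: Claim) -> list[str]:
    """Return one flag per failure type so each can route to its own control."""
    flags = []
    if claim.is_organism:
        if claim.solution not in KNOWN_AGENTS:
            flags.append("unknown_agent")
        elif (claim.solution, claim.pest) not in KNOWN_PAIRS:
            flags.append("unsupported_interaction")
    elif claim.solution in FIELD_ONLY_PRACTICES and claim.setting == "laboratory":
        flags.append("impossible_setting")
    if not claim.citation_verified:
        flags.append("unverified_citation")
    return flags


claims = [
    Claim("Bemisia tabaci nucleopolyhedrosis virus", True, "Bemisia tabaci", "field", False),
    Claim("reduced tillage", False, "Helicoverpa armigera", "laboratory", True),
    Claim("Bacillus thuringiensis", True, "Plutella xylostella", "field", True),
]
for claim in claims:
    print(claim.solution, "->", review_flags(claim))
```

Per-claim checks like these cannot detect omission. Catching a missing predator guild requires a separate coverage check against an expected list of solution categories.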

The disease and weed results are exploratory extensions, not the main comparison

After showing stronger consistency for DeepSeek in the insect-pest comparison, the paper uses DeepSeek alone to assess disease and weed management efficacy. This is best read as an exploratory extension. It expands the scope of the question, but it does not provide the same two-model comparison.

For diseases, DeepSeek identified botanicals and agroecological measures as slightly outperforming microbial fungicides or bactericides under field conditions. It reported botanical mixtures and combined agroecological preventative measures reaching about 90-93% efficacy for potato late blight, wheat rust, and Fusarium head blight.

For weeds, DeepSeek reported microbial mixtures as especially strong for Echinochloa spp. and Erigeron canadensis, with field efficacy up to 95%. For Amaranthus spp., botanical mixtures and allelopathic intercrops reportedly performed slightly better than microorganisms, reaching about 93-95% field efficacy.

The paper’s Tables 2 and 3, which hold these results, are useful only if handled carefully. Both explicitly warn that all listed data and agents are AI-generated and may be fictitious. That caveat should not be treated as legal padding. It changes how the tables should be used.

The business interpretation is not “DeepSeek has identified the best disease and weed controls.” The stronger interpretation is narrower: a web-grounded LLM can quickly generate a candidate map of non-chemical interventions across pests, diseases, and weeds, which experts can then verify.

That is still valuable. In neglected or under-funded knowledge domains, even a candidate map can reduce search costs. But it is not equivalent to a validated treatment guide.

The PRISMA exercise tests retrieval breadth, not systematic-review legitimacy

The paper includes a PRISMA-style exercise for B. tabaci management in China. Both models were asked to disclose their literature review process and draw up a flow diagram. The result again favors DeepSeek.

For Chinese B. tabaci literature, DeepSeek considered about 837.4 initial records and 135.6 final records, while ChatGPT considered about 381.0 initial records and 22.6 final records. Across tactics, ChatGPT’s final results were based on 83% fewer literature sources than DeepSeek. Table 1 also shows DeepSeek more often reporting use of Chinese-language databases such as Wanfang and CQVIP, as well as national repositories.

The likely purpose of this test is not to claim that the AI performed a proper systematic review. It is a retrieval-breadth and process-disclosure probe. It asks whether, when forced into a PRISMA-like format, the systems report comparable screening behavior.

They do not.

But this also exposes a governance problem. A PRISMA-shaped output can look authoritative even when the underlying process is not independently verifiable. The format borrows trust from systematic review practice. The model may or may not have earned that trust.

For business use, this matters because interfaces influence user confidence. If an AI advisory system shows a neat flow diagram, a list of screened databases, and a table of excluded studies, many users will read that as methodological rigor. The product may be displaying structure, not verification.

The safe design principle is simple: never let a format imply a level of validation the pipeline has not actually performed.

What this means for agritech decision-support systems

The paper’s practical relevance is strongest for agritech companies, agricultural advisory platforms, sustainability programs, and agri-food value-chain actors trying to make scientific knowledge usable.

The opportunity is real. Farmers often receive crop-protection advice through input sellers, consultants, or informal networks. Non-chemical alternatives may be under-communicated, especially when evidence is scattered across journals and countries. A web-grounded LLM can reduce the first search cost: it can surface candidate biological agents, summarize broad efficacy patterns, translate technical evidence into simpler language, and identify where regional studies may exist.

That is useful in three business workflows.

First, AI can support evidence triage. Before an agronomist or research team runs a full review, the system can generate a candidate evidence map: pests, tactics, agents, geographies, and likely source clusters. The output is not final advice. It is a structured queue for verification.

Second, AI can support advisory content generation. Once experts verify the evidence, the model can help turn it into farmer-readable guidance, training materials, FAQ documents, and localized advisory scripts.

Third, AI can support portfolio discovery. Biocontrol companies, agri-input firms, and sustainability programs can use AI-generated maps to identify underexplored solution categories, regional evidence gaps, or promising non-chemical alternatives requiring validation.

The ROI logic is therefore not “replace agronomists.” It is “reduce the cost of getting agronomists to the right evidence faster.”

That distinction matters because replacing the expert is where the risk explodes. The paper documents fabricated agents, unverifiable efficacy claims, impossible experimental contexts, and omissions. Those are not minor formatting bugs. In a farm-level decision-support system (DSS), they can become wrong product recommendations, bad procurement decisions, or misplaced confidence in methods that do not work under local conditions.

The safe business model is human-machine collaboration:

LLM retrieval and synthesis
        ↓
taxonomy and citation verification
        ↓
expert agronomic review
        ↓
localized recommendation rules
        ↓
farmer-facing explanation
        ↓
feedback and field monitoring

Remove the middle layers, and the product becomes a fluent rumor engine with a sustainability logo. The market already has enough of those.
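The layered flow above can be enforced in software by making each candidate carry an explicit stage and refusing to show farmers anything that has not cleared every gate. This is a hypothetical sketch rather than a design taken from the paper: the `Stage` names mirror the first four layers of the diagram, and `Candidate`, `advance`, and `farmer_facing` are illustrative names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Stage(IntEnum):
    RETRIEVED = 1        # LLM retrieval and synthesis
    VERIFIED = 2         # taxonomy and citation verification
    EXPERT_REVIEWED = 3  # expert agronomic review
    LOCALIZED = 4        # localized recommendation rules


@dataclass(frozen=True)
class Candidate:
    name: str
    stage: Stage = Stage.RETRIEVED


def advance(candidate: Candidate, passed: bool) -> Candidate:
    """Move exactly one stage forward, and only if the current check passed."""
    if passed and candidate.stage < Stage.LOCALIZED:
        return Candidate(candidate.name, Stage(candidate.stage + 1))
    return candidate


def farmer_facing(candidates: list[Candidate]) -> list[str]:
    """Only candidates that cleared every gate may reach the explanation layer."""
    return [c.name for c in candidates if c.stage == Stage.LOCALIZED]


# Placeholder candidates: one clears all gates, one fails citation verification.
approved = Candidate("candidate A")
for _ in range(3):
    approved = advance(approved, passed=True)
rejected = advance(Candidate("candidate B"), passed=False)

print(farmer_facing([approved, rejected]))
```

Because `advance` moves one stage at a time, there is no code path from raw retrieval to farmer-facing output that bypasses verification or expert review.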

The boundary: this paper does not prove autonomous AI agronomy

The study is exploratory and its boundaries are material.

It compares free-tier versions of general-purpose systems under a specific prompting design. It does not test paid ChatGPT, specialized academic-search tools, custom retrieval-augmented systems, domain-specific fine-tuning, or professionally curated knowledge graphs. A production agritech DSS could perform better if it used verified databases, entity normalization, citation retrieval, and expert-supervised evaluation.

It also does not benchmark outputs against a full human systematic review. The paper checks consistency, breadth, and obvious factual problems, and it compares outputs between models. That is useful, but it is not the same as proving factual completeness.

Finally, the AI-reported efficacy values themselves must be handled cautiously. The paper repeatedly notes that listed data may be fictitious. Some reported patterns align with known empirical evidence, but the existence of plausible aggregate trends does not validate every listed agent or number.

These limitations do not weaken the article’s main business interpretation. They sharpen it.

The paper shows that general-purpose LLMs can help discover and summarize coarse patterns in agroecological crop protection. It also shows why autonomous decision support is not ready. The frontier is not “AI versus agronomist.” It is evidence infrastructure: retrieval, verification, taxonomy, provenance, expert review, and feedback.

The real benchmark is not intelligence; it is accountable evidence flow

The DeepSeek-versus-ChatGPT comparison is useful because it turns a vague AI question into a concrete operational framework.

Coverage first. Did the system reach enough of the relevant literature, including local and non-English sources?

Consistency second. Do the extracted numbers behave in ways domain experts expect, such as lower field efficacy than laboratory efficacy?

Verification always. Are the agents real, the citations traceable, the taxonomy normalized, and the efficacy claims tied to actual studies?

This is the standard that matters for business adoption. Not whether the answer sounds confident. Not whether the table looks professional. Not whether the model can say “integrated pest management” without hurting itself.

For agritech, the near-term value is not a fully automated agronomist. It is a research assistant that can widen the evidence funnel and speed up expert work. Used that way, LLMs can help bring neglected agroecological knowledge into practical decision systems. Used blindly, they can automate exactly the misinformation that sustainable agriculture is trying to escape.

So yes, let the AI read the literature.

Just do not let it become the only adult in the field.

Cognaptus: Automate the Present, Incubate the Future.

¹ Kris A. G. Wyckhuys, “General-purpose AI models can generate actionable knowledge on agroecological crop protection,” arXiv:2512.11474, 2025.
