When More Explanation Hurts: The Early‑Stopping Paradox of Agentic XAI

A farmer does not need ninety-three charts before deciding what to do next.

That sounds obvious. Unfortunately, “obvious” is where many agentic AI workflows go to die. Give an LLM a model explanation, ask it to improve the explanation, let it generate more analysis, feed the results back, and repeat. The process feels responsible. More checks. More plots. More reasoning. More “depth.” Somewhere in the background, a product manager begins to hear the soft music of enterprise automation.

The paper behind today’s article is useful because it interrupts that music.¹ It does not merely ask whether an LLM can translate SHAP outputs into plain-language recommendations. That part is almost the polite appetizer. The sharper question is whether an agentic explanation loop keeps getting better as it iterates.

The answer is: not for long.

In this rice-yield case study, recommendation quality improves in the early rounds, peaks around Rounds 3–4, and then declines. The decline is not simply because the model becomes lazy or incoherent. It becomes too busy. It adds analysis, generates more figures, incorporates economic language, and gradually moves away from the kind of concise, grounded, actionable guidance that a real decision-maker can use.

So the practical lesson is not “agentic XAI works” or “agentic XAI fails.” That would be too convenient, and therefore suspicious.

The lesson is stranger and more operational: explanation quality has its own early-stopping problem.

The experiment begins with a familiar XAI problem: SHAP explains the model, not the decision

The technical starting point is ordinary enough. The authors train a Random Forest model to predict rice yield using field-level data from Fukushima, Japan. The dataset covers 26 fields across three years, producing 66 field-year observations. It combines rice yield with soil properties, rice varieties, cultivation practices, and meteorological variables across growth stages.

The model is not the main novelty, but it matters because the entire explanation workflow rests on it. The optimized Random Forest uses 100 estimators and maximum depth 10. It achieves a leave-one-out cross-validation $R^2$ of 0.749, while the final training fit reaches $R^2 = 0.962$. That gap suggests possible overfitting, which the authors acknowledge, but the model is used here mainly as the base for SHAP-driven explanation rather than as a deployed yield predictor.
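To make the gap between training fit and leave-one-out fit concrete, here is a minimal sketch of how the two $R^2$ values are computed. It is an illustration under loud assumptions: the data are synthetic, and a one-nearest-neighbour predictor stands in for the authors' Random Forest because it memorizes its training set in an obvious way. None of the names below come from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def predict_1nn(train_x, train_y, x):
    """Toy stand-in model: return the target of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def training_r2(xs, ys):
    """Score the model on the same points it was fitted on."""
    preds = [predict_1nn(xs, ys, x) for x in xs]
    return r_squared(ys, preds)


def loocv_r2(xs, ys):
    """Leave-one-out: predict each point from a model that never saw it."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        preds.append(predict_1nn(train_x, train_y, xs[i]))
    return r_squared(ys, preds)


# Synthetic data (assumption): a linear trend plus fixed, made-up noise.
XS = [float(i) for i in range(12)]
NOISE = [0.4, -0.6, 0.3, -0.2, 0.7, -0.5, 0.1, -0.4, 0.6, -0.3, 0.2, -0.1]
YS = [x + n for x, n in zip(XS, NOISE)]

if __name__ == "__main__":
    print("training R^2:", training_r2(XS, YS))
    print("LOOCV R^2:   ", loocv_r2(XS, YS))
```

The training score is perfect because each point is its own nearest neighbour, while the leave-one-out score is lower. The paper's 0.962 versus 0.749 is a milder version of the same effect.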

The initial XAI layer uses SHAP to identify influential yield drivers. This is where many “AI explainability” workflows would stop: produce a feature importance plot, perhaps add a beeswarm chart, and call it transparency. Very comforting. Also not especially useful to a farmer.

A SHAP plot can tell a technically trained reader which variables influence predicted yield and in which direction. It does not automatically answer the more human question: “What should I do differently, and why should I trust that advice?”

That gap between model explanation and practical recommendation is exactly where the paper introduces Agentic XAI.
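For readers who have not looked inside a SHAP value, the sketch below computes exact Shapley attributions for a toy model by brute force. It is not the authors' pipeline: the toy function, the three unnamed features, and the all-zero baseline are assumptions made for illustration, and practical SHAP tooling relies on faster algorithms and a background dataset rather than this enumeration.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley attribution: average each feature's marginal
    contribution over every order in which features can be switched
    from their baseline value to their actual value."""
    n = len(instance)
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous_output = model(current)
        for feature in order:
            current[feature] = instance[feature]
            output = model(current)
            totals[feature] += output - previous_output
            previous_output = output
    return [t / len(orders) for t in totals]


def toy_model(x):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


INSTANCE = [1.0, 2.0, 3.0]
BASELINE = [0.0, 0.0, 0.0]  # assumption: zeros stand in for "feature absent"

if __name__ == "__main__":
    print(shapley_values(toy_model, INSTANCE, BASELINE))
```

The attributions sum to the difference between the model's output for the instance and for the baseline, and the interaction term is split evenly between the two features involved. That is a faithful account of the model. It still says nothing about what a farmer should change on Monday.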

Round 0: the agent can explain, but the explanation is thin

The authors define Agentic XAI as an approach where an AI agent explores better explanations of modeled patterns through iterative natural-language refinement. In this implementation, the workflow has three parts:

Layer	What it does	Why it matters
XAI analysis	A Random Forest predicts rice yield; SHAP identifies important features	Creates the technical explanation substrate
Multimodal LLM refinement	Claude Sonnet 4 interprets the SHAP plot, generates code for extra analyses, reads the resulting outputs, and revises recommendations	Turns explanation into an iterative analytical workflow
Evaluation	Human crop scientists and LLM judges score all rounds across seven metrics	Tests whether iteration actually improves usefulness

Round 0 is deliberately limited. The multimodal LLM receives the SHAP beeswarm plot and the list of variables, then generates farmer recommendations from that information. There are no additional statistics, no new plots, no generated code outputs yet. It is a translation of XAI into advice, not yet an agentic investigation.

This baseline is important because it represents a common business pattern: take a model explanation, hand it to an LLM, ask for a readable summary, and ship the result into a dashboard. It will often look better than raw SHAP. It may also be under-specified.

The evaluators agree. The Round 0 average score from crop scientists is 3.679 on a 1–7 scale. LLM evaluators score it higher, at 4.776, but still not near peak quality. The initial explanation is useful enough to begin the story, not strong enough to end it.

The bias side of the paper’s analogy lives here. Early explanations are simpler, but also incomplete. They compress too much. They miss operational nuance. The user gets clarity, but not enough actionable depth.

Rounds 1–4: the agent earns its keep by adding just enough analysis

After Round 0, the workflow becomes agentic. In each iteration, the LLM reviews its previous recommendation, identifies analytical gaps, generates Python code to compute new statistics or visualizations, receives those outputs as a PDF, and rewrites the recommendation using both the original SHAP plot and the new evidence.

This is the part agentic AI vendors like to show in demos, and for once the demo logic has empirical support.
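The loop described above can be sketched in a few lines. This is a structural outline only, not the authors' code: the four callables (`draft`, `propose_code`, `run_code`, `revise`) are hypothetical placeholders for the LLM calls and the code-execution step, and the paper's actual prompts and PDF hand-off are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoundRecord:
    round_index: int
    recommendation: str
    analysis_code: str = ""
    analysis_output: str = ""


def run_refinement(
    shap_summary: str,
    draft: Callable[[str], str],
    propose_code: Callable[[str], str],
    run_code: Callable[[str], str],
    revise: Callable[[str, str, str], str],
    n_rounds: int,
) -> List[RoundRecord]:
    """Round 0 drafts advice from the SHAP summary alone.
    Each later round proposes new analysis, runs it, and revises."""
    history = [RoundRecord(0, draft(shap_summary))]
    for k in range(1, n_rounds + 1):
        previous = history[-1].recommendation
        code = propose_code(previous)   # agent identifies gaps, writes analysis code
        output = run_code(code)         # executed outside the model (sandbox assumed)
        revised = revise(shap_summary, previous, output)
        history.append(RoundRecord(k, revised, code, output))
    return history


if __name__ == "__main__":
    # Trivial stand-ins so the sketch runs without any LLM.
    rounds = run_refinement(
        "shap beeswarm summary",
        draft=lambda s: "baseline advice",
        propose_code=lambda prev: "# analysis for: " + prev,
        run_code=lambda code: "stub output",
        revise=lambda s, prev, out: prev + " (refined)",
        n_rounds=2,
    )
    for record in rounds:
        print(record.round_index, record.recommendation)
```

Note what the sketch does not contain: any condition for stopping other than `n_rounds`. That omission is the subject of the rest of this article.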

During the early rounds, the agent adds useful structure. Round 1 introduces correlation matrices and several basic plot types. Round 2 adds PCA-related analysis, Q-Q plots, histograms, and pie charts. Round 3 introduces response-curve or binning analysis, helping quantify interactions and build a management timeline. By Round 4, the workflow adds partial dependence and return-on-investment analysis.

The key point is not that every new chart is magical. The key point is that the first few rounds help the system move from “these factors matter” toward “these factors matter in this context, in this approximate way, and here is how they connect to management decisions.”

That is where the scores peak.

Evaluator group	Round 0 average	Observed peak	Peak round	Round 10	What the curve says
Crop scientists	3.679	4.905	3	2.643	Strong early improvement, then severe decline
LLM evaluators	4.776	6.214	4	5.184	Early improvement, then milder but real decline

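The human crop scientists see a gain of 1.226 points from Round 0 to the observed peak. LLM evaluators see a gain of about 1.44 points. Relative to each group’s own baseline, that is about 33% for the crop scientists and about 30% for the LLM evaluators, which matches the authors’ summary of roughly a 30–33% improvement over baseline.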

This is the strongest evidence that Agentic XAI is not just “LLM decoration on top of SHAP.” The agentic loop does create value, at least in this setting. It helps translate technical explanation into recommendation quality, and both human domain experts and LLM judges identify the improvement.
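The gains quoted above can be rechecked from the table. The sketch below uses only the rounded averages reported in this article, so the last digit may differ slightly from figures computed on unrounded scores. The dictionary layout and function name are mine, not the paper's.

```python
# Rounded averages as reported in the table above.
SCORES = {
    "crop_scientists": {"round0": 3.679, "peak": 4.905, "round10": 2.643},
    "llm_evaluators": {"round0": 4.776, "peak": 6.214, "round10": 5.184},
}


def summarize(s):
    """Absolute gain, relative gain, and post-peak drop for one evaluator group."""
    gain = s["peak"] - s["round0"]
    return {
        "gain": round(gain, 3),
        "gain_pct": round(100 * gain / s["round0"], 1),
        "drop_from_peak": round(s["peak"] - s["round10"], 3),
    }


if __name__ == "__main__":
    for group, values in SCORES.items():
        print(group, summarize(values))
```

The drop from peak to Round 10 is the number to linger on. For crop scientists it is larger than the gain that preceded it, so ten rounds of refinement end below the Round 0 baseline.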

But the useful window is short. The best observed round is not Round 10. It is not even Round 7. For humans, the best average score appears at Round 3. For LLMs, Round 4.

Apparently, wisdom arrives early and then starts adding appendices.

Rounds 5–10: the agent keeps thinking after the useful explanation is already gone

The paper’s most business-relevant result begins after the peak.

From Round 5 onward, the agent continues expanding its analytical scope. It adds field typology classification, sustainability assessment, decision trees, rank correlations, risk analysis, Monte Carlo simulation, value-at-risk framing, distributional tests, and eventually beta distribution simulation. By the end of Round 10, the accumulated workflow has generated 93 figures.

That number is not automatically bad. A serious analytical system may need many artifacts behind the scenes. The problem is that the output quality perceived by evaluators declines while analytical complexity continues to increase.

For crop scientists, the average score falls from 4.905 at Round 3 to 2.643 at Round 10. That is not a small diminishing return. That is the explanation walking confidently past the exit door.

For LLM evaluators, the decline is less dramatic but still visible: from 6.214 at Round 4 to 5.184 at Round 10. The difference between human and LLM judgments is itself worth noticing. LLMs may tolerate, or even reward, certain forms of elaborate reasoning more than human domain experts do. The authors use multiple LLMs as judges, which is useful for scalability, but the crop scientists’ harsher late-round judgment is the more operationally sobering signal.

The statistical analysis supports the shape of the story. The authors fit generalized additive models and compare them against linear baselines using AIC. For the overall average score, both evaluator groups show an inverted-U pattern. The estimated peak occurs at Round 2.50 for crop scientists and Round 3.53 for LLMs. In both cases, the nonlinear model outperforms the linear baseline by more than the authors’ stated threshold.

This matters because the paper is not merely eyeballing a curve and declaring a paradox. It tests whether the trajectory behaves like an early improvement followed by decline. For the overall quality score, it does.
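The logic of that test can be shown with a simplified stand-in. The sketch below swaps the authors' generalized additive model for a plain quadratic fit, and it runs on synthetic scores generated from a known inverted-U curve, not on the paper's data. It only demonstrates the comparison: fit a linear and a curved model, compare AIC, and read off the estimated peak.

```python
import math


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via normal equations."""
    m = degree + 1
    # Augmented normal-equation matrix [X^T X | X^T y].
    a = [
        [sum(x ** (i + j) for x in xs) for j in range(m)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][m] / a[i][i] for i in range(m)]


def rss(xs, ys, coefs):
    """Residual sum of squares for a fitted polynomial."""
    return sum(
        (y - sum(c * x ** k for k, c in enumerate(coefs))) ** 2
        for x, y in zip(xs, ys)
    )


def aic(xs, ys, coefs):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    n = len(xs)
    k = len(coefs) + 1  # polynomial coefficients plus the noise variance
    return n * math.log(rss(xs, ys, coefs) / n) + 2 * k


# Synthetic scores (assumption): an inverted U peaking at round 3, plus a fixed wiggle.
ROUNDS = list(range(11))
WIGGLE = [0.05, -0.04, 0.03, -0.02, 0.04, -0.05, 0.02, -0.03, 0.05, -0.04, 0.03]
SCORES = [4.0 + 0.42 * r - 0.07 * r * r + w for r, w in zip(ROUNDS, WIGGLE)]

LINEAR = polyfit(ROUNDS, SCORES, 1)
QUADRATIC = polyfit(ROUNDS, SCORES, 2)
PEAK_ROUND = -QUADRATIC[1] / (2 * QUADRATIC[2])  # vertex of the fitted parabola

if __name__ == "__main__":
    print("AIC linear:   ", aic(ROUNDS, SCORES, LINEAR))
    print("AIC quadratic:", aic(ROUNDS, SCORES, QUADRATIC))
    print("estimated peak round:", PEAK_ROUND)
```

A lower AIC for the curved model means the extra parameter pays for itself. A GAM is more flexible than a quadratic, since it does not force the curve to be symmetric around the peak, but the model-comparison step works the same way.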

The metric split shows why “more explanation” breaks differently across qualities

The average curve is memorable, but the metric-level breakdown is where the mechanism becomes clearer.

The paper evaluates seven dimensions: Clarity, Conciseness, Contextual Relevance, Cost Consideration, Crop Science Credibility, Practicality, and Specificity. These do not move together.

Metric pattern	Metrics	Interpretation
Inverted U-shaped	Specificity, Practicality, Contextual Relevance, Crop Science Credibility, and mostly Clarity	Early analysis helps, but later elaboration dilutes grounding and actionability
Monotonic decline	Conciseness	Each additional round tends to make the explanation heavier
Monotonic increase	Cost Consideration	The agent keeps adding economic reasoning even when the original dataset lacks economic parameters

This table is the heart of the paper.

Five metrics behave like the overall curve: they improve early and then decline. These are not cosmetic metrics. Specificity, practicality, contextual relevance, and crop science credibility are exactly the qualities a decision-support system needs if it is supposed to help real users. Their decline means the late-round system is not merely longer; it is less useful in the dimensions that matter.

Conciseness behaves more brutally. It declines nearly monotonically. Crop scientists score conciseness at 4.417 in Round 0, with observed peaks of 4.833 at Rounds 1 and 3, then only 2.333 by Round 10. LLM evaluators place the peak at Round 0 itself: 5.929, declining to 5.143 by Round 10.

This is a useful reminder for anyone building AI copilots for professionals: the user did not ask for a literature review disguised as advice. Busy users experience verbosity as cost.

Cost Consideration is the strangest metric. It keeps rising well after the other metrics have turned down, peaking late: Rounds 7–8 for crop scientists and Round 9 for LLMs. At first glance, that sounds good. Surely business users want cost reasoning.

Except the original dataset does not contain economic parameters. The agent becomes increasingly willing to discuss cost, ROI, and financial feasibility without direct economic grounding in the data. That is not necessarily hallucination in the crude sense; it may be plausible managerial reasoning. But plausibility is not the same as evidence.

This is why the authors interpret late-round cost improvement as a warning sign. The agent is optimizing a desirable dimension using insufficient grounding. Very enterprise. Very dangerous.

The paper’s bias–variance analogy is not decorative; it explains the failure mode

The authors frame explanation quality using a bias–variance analogy.

In early rounds, explanations are biased in the sense that they are too simple. They omit important relationships, lack supporting analysis, and fail to provide enough specificity. The recommendation is digestible but underpowered.

In late rounds, explanations become high-variance in the sense that they overreact to the expanding analytical process. They add detail, abstraction, and speculative structure. The recommendation may look sophisticated, but its connection to actionable, data-grounded advice weakens.

This is not the same as the classic statistical bias–variance trade-off in model prediction, of course. The paper is using it as an analogy for explanation comprehensiveness. But the analogy is useful because it changes the design question.

The question is not:

How do we make the agent explain more?

The better question is:

How do we stop the agent once added explanation starts reducing practical utility?

That shift matters for business workflows. Many agentic systems are designed as if iteration were inherently virtuous. Review the answer. Improve it. Add more context. Critique it. Refine it again. This paper shows that, in at least one real decision-support setting, the value curve is not monotonic.

Iteration is not a moral good. It is a treatment with dosage.

The test design is stronger than a demo, but not the same as deployment evidence

The paper’s evaluation design is more serious than many agentic AI demos. The authors use 12 human crop scientists and 14 LLM evaluators. Each evaluator scores recommendations from all 11 rounds across seven metrics. The round order is randomized, and evaluators are blind to iteration number. The LLM evaluations are conducted in isolated contexts to reduce contamination from prior conversation history.

That design supports the main claim: recommendation quality differs across refinement rounds, and the pattern is not simply “more rounds are better.”
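The randomization and blinding described above are cheap to replicate. A minimal sketch, assuming one recommendation text per round: each evaluator gets a differently shuffled sheet with neutral IDs, and the mapping back to round numbers is kept separately. The function name, ID format, and seeding scheme are illustrative choices, not the authors' protocol.

```python
import random


def build_blinded_sheet(round_texts, evaluator_id, seed=0):
    """Return (items shown to the evaluator, private key from blind ID to round)."""
    rng = random.Random(f"{seed}-{evaluator_id}")  # reproducible per evaluator
    order = list(range(len(round_texts)))
    rng.shuffle(order)
    items, key = [], {}
    for position, round_index in enumerate(order):
        blind_id = f"REC-{position + 1:02d}"  # ID reveals position, not round
        items.append({"id": blind_id, "text": round_texts[round_index]})
        key[blind_id] = round_index
    return items, key


if __name__ == "__main__":
    texts = [f"recommendation text for round {i}" for i in range(11)]
    sheet, answer_key = build_blinded_sheet(texts, evaluator_id="evaluator-01")
    print([item["id"] for item in sheet])
    print(answer_key)
```

In practice the recommendation texts themselves must also be free of round hints, which is harder to guarantee than shuffling the order.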

Still, we should separate what the paper directly shows from what business readers may want it to show.

Evidence element	Likely purpose	What it supports	What it does not prove
Round-by-round human expert scoring	Main evidence	Domain experts prefer early-to-mid refined recommendations over both baseline and late-round outputs	Farmers would adopt the advice or improve yield outcomes
LLM-as-judge scoring across 14 models	Scalable corroborating evaluation	The inverted-U pattern is not unique to one evaluator model	LLM judgments are independent of shared training or alignment biases
GAM and AIC comparison	Statistical trajectory test	Overall quality follows a nonlinear early-rise-then-decline pattern	The exact peak round generalizes to other models, domains, or datasets
Metric-level analysis	Mechanism evidence	Different quality dimensions respond differently to iteration	One universal stopping metric can optimize all dimensions
Archive of prompts, generated code, figures, and recommendations	Implementation transparency	The agentic process can be inspected and reproduced	The generated analyses are automatically valid or sufficient for deployment

This distinction is important because the result is genuinely useful, but not yet a deployment rulebook.

The study uses one crop system, one location, one generation model for the agentic workflow, and a static dataset. The evaluation measures recommendation quality at generation time, not real-world farmer adoption, implementation fidelity, yield changes, or economic returns. Those limitations do not kill the insight. They define its operating boundary.

The paper is strong enough to warn against infinite refinement. It is not strong enough to tell every enterprise system to stop at exactly four rounds. Please do not turn “Round 4” into a corporate KPI. We have enough rituals already.

The business lesson is observability, not prettier explanation

For business use, the most actionable interpretation is not that every AI explanation system needs SHAP, agriculture data, or Claude Sonnet 4. The broader lesson is that agentic explanation workflows need governance at the level of the workflow itself.

A conventional model-risk review asks whether the model is accurate, whether inputs are valid, whether explanations are faithful, and whether outputs are monitored. Agentic XAI adds another layer: the explanation process can generate its own code, plots, assumptions, and intermediate claims. That makes the explanation pipeline itself a system requiring oversight.

Four design implications follow.

First, define stopping rules before deployment. Stopping should not depend on the agent’s self-confidence or the elegance of its final prose. The paper suggests that quality can decline while complexity rises, so the workflow needs external signals: human review, metric monitoring, output-length constraints, grounding checks, or validation against held-out decision scenarios.
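One simple form such a rule could take is patience-based early stopping on an externally supplied score, borrowed from model training. The sketch below is an assumption about how a team might implement it, not something the paper prescribes, and the example scores are made up.

```python
def best_round_with_patience(scores, patience=2, min_delta=0.0):
    """Scan external quality scores in round order and stop once `patience`
    consecutive rounds fail to beat the best score by more than `min_delta`.

    Returns (best_round, stopped_at_round).
    """
    best_round, best_score = 0, scores[0]
    stale = 0
    for r in range(1, len(scores)):
        if scores[r] > best_score + min_delta:
            best_round, best_score, stale = r, scores[r], 0
        else:
            stale += 1
            if stale >= patience:
                return best_round, r
    return best_round, len(scores) - 1


if __name__ == "__main__":
    # Hypothetical reviewer scores per round, not taken from the paper.
    example = [3.0, 3.8, 4.4, 4.1, 3.9, 3.5]
    print(best_round_with_patience(example, patience=2))
```

The important design choice is where `scores` comes from. If the agent grades its own output, the rule inherits the agent's enthusiasm. The scores should come from reviewers or checks outside the refinement loop.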

Second, monitor multiple quality metrics separately. A single average score hides trade-offs. In this paper, Cost Consideration improves while Conciseness declines and practical metrics peak early. In a business setting, the same pattern could appear as “risk coverage improves while user actionability collapses” or “legal completeness improves while operational clarity dies quietly in a conference room.”
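A small sketch shows how an average can hide exactly this kind of divergence. The three metric trajectories below are hypothetical numbers chosen to make the point, not values from the paper, and the report format is my own.

```python
def peak_round(series):
    """Index of the first maximum in a per-round score series."""
    return max(range(len(series)), key=lambda r: series[r])


def metric_report(history):
    """history maps metric name to [score at round 0, round 1, ...].
    Reports peak round and net change per metric and for the plain average."""
    n_rounds = len(next(iter(history.values())))
    average = [
        sum(series[r] for series in history.values()) / len(history)
        for r in range(n_rounds)
    ]
    report = {
        name: {"peak_round": peak_round(s), "net_change": round(s[-1] - s[0], 3)}
        for name, s in history.items()
    }
    report["_average"] = {
        "peak_round": peak_round(average),
        "net_change": round(average[-1] - average[0], 3),
    }
    return report


# Hypothetical trajectories (assumption), one value per round.
HISTORY = {
    "practicality": [3.0, 4.0, 4.5, 4.0, 3.0],
    "conciseness": [5.0, 4.5, 4.0, 3.0, 2.0],
    "cost_consideration": [2.0, 3.0, 3.5, 4.5, 5.0],
}

if __name__ == "__main__":
    for name, stats in metric_report(HISTORY).items():
        print(name, stats)
```

In this toy case the average ends exactly where it started, while conciseness has lost three points and cost reasoning has gained three. A dashboard that shows only the average would report nothing to see.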

Third, archive intermediate artifacts. The authors’ repository includes prompts, generated code, visual outputs, and recommendations for each round. That is not academic neatness. It is the minimum viable audit trail for agentic systems. If the agent produces a questionable recommendation, teams need to know which generated analysis introduced the questionable claim.
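A minimal audit trail does not need much machinery. The sketch below writes each round's prompt, generated code, and recommendation to disk and appends a content-hashed entry to a manifest. The directory layout and file names are assumptions for illustration and do not mirror the authors' repository.

```python
import hashlib
import json
from pathlib import Path


def archive_round(archive_dir, round_index, prompt, code, recommendation,
                  figure_paths=()):
    """Write one round's artifacts to disk and append a hashed manifest entry."""
    root = Path(archive_dir)
    round_dir = root / f"round_{round_index:02d}"
    round_dir.mkdir(parents=True, exist_ok=True)

    entry = {"round": round_index, "files": {}}
    artifacts = (
        ("prompt.txt", prompt),
        ("analysis.py", code),
        ("recommendation.md", recommendation),
    )
    for name, text in artifacts:
        (round_dir / name).write_text(text, encoding="utf-8")
        entry["files"][name] = hashlib.sha256(text.encode("utf-8")).hexdigest()

    # Figures are hashed in place rather than copied.
    entry["figures"] = {
        str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest()
        for p in figure_paths
    }

    # Append-only manifest: one JSON object per line, one line per round.
    with (root / "manifest.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry
```

With hashes in the manifest, a reviewer can later confirm that the code on disk is the code that produced a given recommendation, and trace a questionable claim back to the round that introduced it.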

Fourth, require domain review where grounding is thin. The Cost Consideration result is the warning label. When the agent starts reasoning about variables not actually present in the data, it may produce plausible-looking advice that exceeds the evidence base. In business domains, that could mean unsupported ROI estimates, invented customer constraints, speculative compliance interpretations, or market assumptions wrapped in polished prose.

The danger is not that the system sounds stupid. The danger is that it sounds mature.
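A first-pass grounding check can be embarrassingly simple and still useful. The sketch below flags economic vocabulary in a recommendation when no dataset column covers it. It is a crude keyword heuristic under stated assumptions: the term list and the column names are hypothetical, and a match says only that a human should look, not that the claim is wrong.

```python
import re

# Hypothetical vocabulary for one topic the data may not cover.
ECONOMIC_TERMS = ("cost", "roi", "return on investment", "profit", "price", "revenue")


def ungrounded_topics(recommendation, dataset_columns, topic_terms=ECONOMIC_TERMS):
    """Return topic terms that appear in the text but in no dataset column name."""
    text = recommendation.lower()
    columns = " ".join(c.lower().replace("_", " ") for c in dataset_columns)
    flagged = []
    for term in topic_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text) and not re.search(pattern, columns):
            flagged.append(term)
    return flagged


if __name__ == "__main__":
    # Hypothetical column names; the paper's actual variable list is not reproduced here.
    columns = ["yield", "soil_ph", "nitrogen_rate", "mean_temp_heading"]
    advice = "Split nitrogen application improves ROI and lowers cost per hectare."
    print(ungrounded_topics(advice, columns))
```

Anything this check flags is a candidate for the domain review described above. Anything it misses is a reminder that keyword matching is a tripwire, not a judge.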

Early stopping is a product feature, not a research footnote

The paper’s conclusion argues for strategic early stopping, hybrid human-LLM evaluation, observability protocols, and recognition that quality metrics cannot all be optimized through iteration alone. That may sound like standard responsible-AI language, but the evidence gives it sharper teeth.

In ordinary software, “more processing” usually sounds like a performance cost. In agentic decision support, more processing can also be a quality risk. The system may continue to produce artifacts after the useful evidence has already been extracted. It may chase completeness, widen the analytical frame, and gradually lose the user’s actual decision.

This is particularly relevant for enterprise AI products that turn internal analytics into recommendations: sales copilots, credit review assistants, compliance summarizers, operational planning agents, clinical decision-support tools, procurement advisors, and financial analysis copilots. These systems do not fail only by hallucinating spectacular nonsense. They can fail by over-explaining themselves into weaker decisions.

The practical design pattern is therefore simple, but not easy:

Let the agent refine enough to overcome shallow explanation.
Measure whether added refinement improves the qualities users actually need.
Stop when marginal explanation begins to trade actionability for sophistication.
Escalate to more data or human review instead of asking the same agent to think harder inside the same evidence box.

That last point matters. The paper suggests future systems might extend the productive refinement window by adding new grounding sources, such as retrieval systems, external databases, or domain-specific knowledge repositories. That is plausible. But it also changes the system. If the agent has exhausted the current evidence base, another iteration over the same material is not intelligence. It is pacing.

What this paper directly shows, and what Cognaptus infers

To keep the business interpretation clean, here is the boundary.

The paper directly shows that, in a Japanese rice-yield recommendation case, Agentic XAI improves recommendation quality in early rounds, peaks around Rounds 3–4, and then declines. It shows this with human crop scientists, LLM evaluators, seven evaluation metrics, and statistical tests for nonlinear trajectories. It also shows that different metrics behave differently: practical and credibility-related metrics peak early, conciseness deteriorates, and cost reasoning improves despite missing economic variables.

Cognaptus infers that similar agentic workflows in business should be governed as iterative decision pipelines, not treated as automatic quality escalators. This means early stopping, metric-specific monitoring, audit trails for generated code and analysis, and domain review when the agent expands beyond grounded evidence.

What remains uncertain is how general the exact curve is. Other domains may peak later or earlier. Other generation models may behave differently. Workflows with stronger retrieval or structured validation may delay degradation. End users may judge usefulness differently from domain experts or LLM evaluators. And real-world outcomes may diverge from recommendation-quality scores.

But uncertainty about the exact stopping point does not weaken the central warning. It strengthens the operational requirement: measure the curve in your own workflow.

The quiet lesson: good agents need brakes

Agentic AI is usually sold through motion: planning, reasoning, tool use, self-correction, iteration. The implied story is that a system that keeps working on a problem must be getting closer to the answer.

This paper shows the missing half of that story. An agent can improve an explanation, then overwork it. It can add analysis that makes the recommendation look more complete while making it less practical. It can optimize one quality dimension while damaging another. It can sound more sophisticated precisely when it is drifting away from grounded decision support.

The best version of Agentic XAI, then, is not the agent that explains forever.

It is the agent that knows when the explanation has become useful enough, when the next chart is likely to hurt, and when the right move is not another refinement round but a human check, a better dataset, or a very unfashionable act of restraint.

That is not a glamorous message. But in enterprise AI, glamorous messages are usually where the incident report begins.

Cognaptus: Automate the Present, Incubate the Future.

¹ Tomoaki Yamaguchi, Yutong Zhou, Masahiro Ryo, and Keisuke Katsura, “Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation,” arXiv:2512.21066, 2025, https://arxiv.org/pdf/2512.21066.
