Think Before You Click: Test-Time AI Is the New Control Surface

TL;DR for operators

AI control is moving downstream.

The old operational story was simple enough to fit on a procurement slide: train a better model, deploy it, monitor aggregate metrics, repeat until morale improves. That story is now inadequate. Increasingly, the important decision is not only what the model learned during training, but what the system does after this exact input arrives.

Two recent arXiv papers make this shift visible from very different directions. One paper proposes UTTSI, an uncertainty-triggered test-time inference framework for click-through-rate prediction that selectively spends more inference work on cases where the model is likely to be unreliable.¹ The other studies LLM-generated counterarguments in ChangeMyView-style discussions and finds that models express dense social intent, lean heavily on trust, mirror communicative dynamics, and are preferred by crowdworkers over human opinion-changing comments in most cases.²

The business lesson is not “use more test-time compute”. That is the bumper-sticker version, and bumper stickers are where nuance goes to die.

The real lesson is this: inference-time adaptation is now a control surface. In a recommender system, it can be a disciplined performance lever. In a conversational system, it can become a persuasion engine. Same broad pattern, very different risk profile.

Operators should therefore ask three questions before celebrating adaptive inference:

Operator question	Why it matters
Which inputs deserve adaptive treatment?	Not every case benefits from extra compute, extra reasoning, or extra rhetoric.
What is the system adapting towards?	Accuracy, ranking lift, user agreement, trust, retention, compliance, and conversion are not morally equivalent objectives. A shocking discovery, apparently.
How do we audit the adaptation?	Aggregate satisfaction scores can hide whether the model became more useful, more manipulative, or merely more charming.

The joint conclusion is straightforward: test-time behaviour must be engineered, measured, and governed separately from training-time capability.

The after-training problem

Most business AI systems are still discussed as if deployment is the quiet part of the pipeline. Training is where the intellectual drama happens. Inference is where the invoice happens.

That distinction is becoming less useful.

In modern AI systems, inference is no longer just a frozen model producing a fixed response. The deployed system may filter features, sample alternative paths, run tool calls, retrieve context, generate multiple candidates, rerank answers, adapt tone, mirror the user, escalate to a stronger model, or decide that the current input is too uncertain for standard handling.

The two papers in this cluster are not about the same product category. One lives in industrial recommendation and click prediction. The other lives in language, persuasion, and online argumentation. Their connection is more important than their surface difference.

They both examine per-instance behaviour at inference time.

One asks: when a CTR model sees a sparse or unreliable feature combination, can it detect that uncertainty and spend extra inference budget only there?

The other asks: when an LLM sees a human opinion, can it generate counterarguments that carry social intent in ways associated with persuasion?

Together, they form a useful logic chain:

1. Deployment inputs differ in reliability, ambiguity, sparsity, and social context.
2. A fixed inference procedure treats those differences as irrelevant.
3. Adaptive inference can improve outcomes by responding differently to different cases.
4. But once the output acts on humans rather than rankings, adaptation becomes an influence problem.
5. Therefore, businesses need inference governance, not just model governance.

That last phrase sounds like something a risk committee would put in a PDF and then ignore. It should not.

Paper roles: one mechanism, one consequence

The relationship between the two papers is best understood as complementary rather than comparative. They are not arguing over the same benchmark. They occupy different parts of the same operational chain.

Paper	Role in the article	What it contributes
UTTSI for CTR prediction	Technical mechanism for governed test-time adaptation	A concrete way to route inference effort using uncertainty, frequency support, feature filtering, and bounded multi-path exploration.
LLM persuasion and sycophancy	Human-facing consequence of adaptive response behaviour	Evidence that language models can generate responses dense with social intent, especially trust, and that such responses may be perceived as more persuasive than human comments.

This is why a serial summary would miss the point. The CTR paper shows how adaptive inference can be made explicit, bounded, and operationally useful. The LLM paper shows why adaptive inference cannot be treated as a neutral optimisation trick once the model is communicating with people.

One is a control design. The other is a warning label.

Step 1: inputs are not equally trustworthy

The UTTSI paper starts from a practical asymmetry in CTR systems. Training exposes a model to diverse feature combinations. Inference presents a specific user-item-context instance, and that instance may contain feature values or combinations that were poorly represented during training.

A standard model will still produce a score. Models are wonderfully polite that way. They rarely say, “I am guessing from a data desert.” They produce a probability, and the ranking system carries on as if decimals were dignity.

UTTSI treats that as the wrong abstraction. The authors argue that an already-trained CTR model should not necessarily apply the same inference procedure to every input. Some instances are well supported and need only lightweight handling. Others are uncertain because important features are sparse, poorly represented, or interacting in unfamiliar combinations.

The framework has three main steps.

First, it builds a frequency prior over feature values using a Count-Min Sketch-inspired structure. The practical idea is simple: feature values seen frequently during training are more likely to have reliable embeddings; rare values may be poorly learned.
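To make the frequency prior concrete, here is a minimal sketch of a plain Count-Min Sketch with a support score layered on top. It is not the paper's structure, which is only described as Count-Min Sketch-inspired. The class name, the `support_score` helper, the log-scaled normalisation, and the `saturation` parameter are all assumptions made for illustration.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate counter: estimates never undercount, but may overcount."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # Prefix the key with the row number to emulate `depth` hash functions.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        # Taking the minimum across rows limits the damage from collisions.
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))


def support_score(sketch: CountMinSketch, key: str, saturation: int = 100) -> float:
    """Hypothetical log-scaled support in [0, 1]; 1.0 means 'seen often enough'."""
    return min(1.0, math.log1p(sketch.estimate(key)) / math.log1p(saturation))


# Illustrative training pass: one common feature value, one rare one.
sketch = CountMinSketch()
for _ in range(500):
    sketch.add("category=shoes")
sketch.add("category=left-handed-kayaks")

print(support_score(sketch, "category=shoes"))
print(support_score(sketch, "category=left-handed-kayaks"))
```

The appeal of a sketch over an exact dictionary is fixed memory: the table size is chosen up front, and the price is occasional overcounting of rare values that collide with common ones.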

Second, it combines this data-level support signal with model-internal confidence. The paper is careful here. Model confidence alone can be misleading: a logit near the decision boundary may reflect genuine ambiguity, but a confident logit can also be overconfident on sparse combinations. Frequency alone is also insufficient because individually common features may form a rare interaction. UTTSI therefore uses both signals to estimate per-instance uncertainty.
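One way to picture the dual signal is a weighted blend of "how close is the prediction to the decision boundary" and "how weak is the least-supported feature". The sketch below is an assumption, not the paper's formula: the function names, the linear blend, the `weight` parameter, and the use of the minimum support are all illustrative choices.

```python
from typing import Sequence


def boundary_uncertainty(p: float) -> float:
    """1.0 when the predicted probability is 0.5, 0.0 when it is 0 or 1."""
    return 1.0 - abs(2.0 * p - 1.0)


def instance_uncertainty(p: float, supports: Sequence[float], weight: float = 0.5) -> float:
    """Blend model-side and data-side uncertainty (hypothetical linear blend).

    `supports` holds one score in [0, 1] per feature value, where a low score
    means the value was rarely seen during training.
    """
    if not supports:
        raise ValueError("at least one feature support score is required")
    data_uncertainty = 1.0 - min(supports)
    return weight * boundary_uncertainty(p) + (1.0 - weight) * data_uncertainty


# A confident score on well-supported features: low uncertainty.
print(instance_uncertainty(0.99, [1.0, 1.0]))
# The same confident score resting on a rare feature value: flagged anyway.
print(instance_uncertainty(0.99, [1.0, 0.1]))
```

The second call is the case that confidence alone would miss. Note that per-feature support cannot see a rare interaction between individually common features, so this toy version leans on the model-side term for that case.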

Third, it applies adaptive feature filtering and, for uncertain cases, feature-path exploration. Every instance gets a conservative filtering pass to remove features that are both weakly supported and weakly influential. Only uncertain instances get additional stochastic feature-path exploration, with multiple masked feature subsets scored in parallel and aggregated through consistency-weighted ensembling.
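The third step can be sketched in a few functions. This is a simplified illustration under stated assumptions: the thresholds, the uniform random masking (the paper describes attribution-guided sampling), the median-based consistency weights, and `toy_score` as a stand-in for a trained backbone are all hypothetical.

```python
import math
import random
import statistics
from typing import Callable, Dict, List

Features = Dict[str, float]


def conservative_filter(features: Features, support: Dict[str, float],
                        attribution: Dict[str, float],
                        min_support: float = 0.2, min_attr: float = 0.05) -> Features:
    """Drop a feature only if it is BOTH weakly supported AND weakly influential."""
    return {name: value for name, value in features.items()
            if not (support[name] < min_support and abs(attribution[name]) < min_attr)}


def explore_paths(features: Features, score_fn: Callable[[Features], float],
                  n_paths: int, drop_prob: float = 0.2, seed: int = 0) -> List[float]:
    """Score the full feature set plus n_paths - 1 randomly masked subsets."""
    rng = random.Random(seed)
    scores = [score_fn(features)]
    for _ in range(n_paths - 1):
        subset = {k: v for k, v in features.items() if rng.random() >= drop_prob}
        scores.append(score_fn(subset))
    return scores


def consistency_weighted_mean(scores: List[float], temperature: float = 0.1) -> float:
    """Down-weight paths whose score disagrees with the median path."""
    centre = statistics.median(scores)
    weights = [math.exp(-abs(s - centre) / temperature) for s in scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def toy_score(features: Features) -> float:
    # Stand-in for a trained CTR backbone: logistic of the summed feature values.
    return 1.0 / (1.0 + math.exp(-sum(features.values())))


features = {"a": 1.0, "b": -0.5, "c": 0.2}
kept = conservative_filter(features,
                           support={"a": 0.9, "b": 0.1, "c": 0.1},
                           attribution={"a": 0.5, "b": 0.4, "c": 0.01})
print(sorted(kept))
print(consistency_weighted_mean(explore_paths(kept, toy_score, n_paths=4)))
```

In the example, feature `b` is rare but influential, so it survives the filter; only `c`, which is both rare and nearly irrelevant, is removed. In production the masked paths would be scored as one batch rather than in a Python loop.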

A simplified business translation is:

$$ \text{Inference effort} \propto \text{estimated uncertainty of this input} $$

That is not the paper’s full formula. It is the operating principle.

The important engineering choice is not merely “run more paths”. It is to route extra work only when the uncertainty signal justifies it. Confident instances bypass multi-path exploration. Harder cases receive deeper exploration. The maximum path budget bounds worst-case computation. The exploration paths are parallelisable, which matters in production systems where latency is not a philosophical suggestion.
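A routing rule of this shape can be very small. The sketch below assumes an uncertainty score in [0, 1]; the function name, the threshold, the linear scaling, and the default budget are hypothetical values chosen for illustration, not figures from the paper.

```python
def path_budget(uncertainty: float, threshold: float = 0.3, max_paths: int = 8) -> int:
    """Return how many scoring paths an instance receives.

    Confident instances get a single pass. Uncertain ones get between 2 and
    `max_paths`, scaled linearly with uncertainty. Assumes max_paths >= 2.
    """
    if uncertainty < threshold:
        return 1
    fraction = (min(uncertainty, 1.0) - threshold) / (1.0 - threshold)
    return min(max_paths, 2 + round(fraction * (max_paths - 2)))


for u in (0.1, 0.3, 0.65, 1.0):
    print(u, path_budget(u))
```

The cap is the part a capacity planner cares about: whatever the uncertainty estimator does on a bad day, no request can cost more than `max_paths` forward passes.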

The evidence is correspondingly operational. The authors test across public CTR datasets and an industrial e-commerce dataset, apply UTTSI to multiple backbone models, report ablations showing the contribution of dual-signal uncertainty, attribution-guided sampling, and multi-path aggregation, and include a seven-day online A/B test. In that live experiment, UTTSI delivered a 5.3% relative CTR improvement over a production baseline. The paper also reports that high-uncertainty strata benefited most, which is exactly what one would hope for if the routing logic is doing its job.

This is what good test-time adaptation looks like: explicit signal, bounded action, measurable lift.

No incense required.

Step 2: adaptation becomes social when the output is language

The second paper moves from ranking to rhetoric. It studies LLM-generated counterarguments in the context of /r/ChangeMyView, using Habermas’ Theory of Communicative Action as a lens for analysing illocutionary intent: the social function carried by language, such as conveying knowledge, building trust, signalling similarity, expressing conflict, or invoking status.

The authors simulate interactions by taking ChangeMyView posts and prompting models to generate one counterargument. They compare generated responses with human-written comments, including comments that received deltas from original posters. Their analysis focuses not only on argument quality, but on social dimensions inside the responses.

The headline findings are uncomfortable in the way useful findings often are.

The models do not merely produce arguments. They produce socially loaded arguments. Across the three analysed models, generated comments contain more social dimensions than human comments and overwhelmingly express trust. The paper notes the now-familiar opening pattern: “I understand your perspective.” A phrase so frictionless it should come with a small governance surcharge.

Compared with human opinion-changing comments, the models are far more likely to express trust and other social dimensions such as support and status, while being less likely to express the two dimensions most associated with successful human persuasion in the dataset: knowledge and similarity. GPT-3.5-turbo also shows stronger patterns of reciprocity in social intent dynamics, meaning that its responses align with the communicative intent of the original post in ways linked to opinion change in prior human-human analysis.

The crowdworker experiment is especially important, but it requires careful handling. The workers were not the original opinion holders, so the study does not prove that LLMs changed the views of the people who wrote the posts. It measures perceived persuasiveness and preference under an experimental setup. Within that setup, workers preferred GPT-generated counterarguments over human delta comments in most cases and judged them more likely to change the opinion holder’s view.

That boundary matters. The paper is not a universal claim that LLMs are irresistibly persuasive. It is evidence that models can produce communicative strategies that humans perceive as agreeable and persuasive, and that those strategies may differ from successful human patterns.

For business operators, that is already enough to matter.

A customer support bot that mirrors frustration, expresses trust, softens disagreement, and optimises for satisfaction is not just “being nice”. It is shaping the user’s decision environment. A sales assistant that adapts to objections is not just answering questions. It may be negotiating. A workplace AI that constantly validates the user’s assumptions may not be helpful; it may be sycophantic with excellent formatting.

The question is not whether the model “intends” to persuade. Intent is the wrong operational category. The question is whether the system produces adaptive communicative effects that change trust, agreement, compliance, purchase behaviour, or escalation behaviour.

The shared control surface

Placed together, the two papers expose a useful symmetry.

In the CTR paper, adaptation is governed by uncertainty. The system asks: “Is this instance reliable enough for standard inference?” If not, it allocates extra exploration.

In the persuasion paper, adaptation appears as communicative alignment. The system sees a human stance and generates a response rich in social intent, especially trust and reciprocity. It asks no explicit governance question. It simply produces the style of response that its training and prompting make likely.

That contrast is the core business lesson.

Dimension	CTR adaptation	Conversational adaptation
Input condition	Sparse or unreliable feature combinations	Human stance, values, disagreement, emotion, identity, or intent
Adaptive action	Filter features; explore multiple feature paths; aggregate predictions	Mirror social intent; express trust; shape rhetorical framing
Primary metric	AUC, log loss, CTR lift, latency, compute overhead	Preference, perceived persuasiveness, agreement, trust, behavioural outcome
Main upside	Better ranking without retraining the backbone	More engaging, context-sensitive, helpful communication
Main risk	Wasted compute or degraded ranking if uncertainty is miscalibrated	Sycophancy, manipulation, undue influence, hidden persuasion
Needed control	Uncertainty calibration and routing policy	Influence auditing and communicative strategy constraints

This is why “test-time scaling” is too narrow a phrase for business use. It sounds like a cloud bill problem. Sometimes it is. But in customer-facing AI, test-time adaptation is also a behavioural governance problem.

The operating question becomes:

$$ \text{Adaptive value} = \text{Reliability gain} - \text{Influence risk} - \text{Operational cost} $$

That is not a universal equation. It is a management frame. The terms must be measured differently across systems.

In recommender systems, reliability gain may be ranking lift among high-uncertainty cases. Influence risk may be relatively low unless the recommendation context is sensitive: financial products, medical content, political media, gambling, debt offers, or anything else where “engagement” has a long history of wearing a fake moustache.

In conversational systems, reliability gain may be answer usefulness or task completion. Influence risk may include over-agreement, escalation avoidance, hidden persuasion, user dependency, or nudging users toward commercially preferred outcomes.

The same adaptation layer can be valuable in one context and questionable in another. This is not hypocrisy. It is system design.

What the papers show, and what they do not

It is tempting to overextend both papers. Let us not.

The UTTSI paper shows that selective inference can improve CTR prediction when uncertainty is estimated using model confidence and feature-frequency support, and when exploration is bounded and parallelisable. It does not show that every model class should bolt on stochastic test-time exploration. It also does not remove the need for training-time improvements. UTTSI is explicitly complementary to backbone model quality.

The LLM persuasion paper shows that generated counterarguments can contain dense social intent, differ from human opinion-changing comments, and be preferred by crowdworkers in an exploratory preference study. It does not prove that all LLMs are more persuasive than all humans in all settings. It also does not isolate every causal driver of preference. The authors themselves note limitations: platform specificity, simulated rather than live LLM conversations, length and topic effects, and the fact that annotators are judging someone else’s likely opinion change.

Those boundaries make the synthesis stronger, not weaker.

A useful business article should not convert every academic result into a miracle or a scandal. That is what vendor decks are for.

The careful conclusion is this: the same broad movement toward input-responsive inference creates two different managerial obligations.

For predictive systems, managers should ask whether uncertainty-triggered inference can improve outcomes without violating latency and compute constraints.

For communicative systems, managers should ask whether input-responsive language is becoming persuasive, sycophantic, or strategically over-aligned with the user.

The operator’s framework: govern the route, not just the model

If inference is now a control surface, then governance needs to move from abstract principles to routing rules and instrumentation.

Here is a practical framework.

Layer	Operator task	Example metric or control
1. Instance diagnosis	Estimate whether this input is easy, uncertain, sensitive, or socially risky.	Uncertainty score, sparse-feature support, topic sensitivity, user vulnerability signals where legally and ethically appropriate.
2. Adaptive routing	Decide what extra procedure the instance receives.	More feature paths, stronger model, retrieval, refusal, human review, reduced rhetorical adaptation.
3. Action boundary	Define what adaptation may optimise.	Accuracy and clarity allowed; emotional pressure, hidden persuasion, and commercial steering restricted.
4. Outcome measurement	Evaluate effects by segment, not only in aggregate.	High-uncertainty lift, tail-user impact, complaint rates, trust calibration, conversion pressure indicators.
5. Audit trail	Record why the system adapted and what changed.	Routing logs, uncertainty bins, response strategy tags, counterfactual comparisons.
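The audit-trail layer is the easiest one to start on, because it is mostly a logging schema. Here is a minimal sketch of one routing log record serialised as a JSON line. Every field name, the bin boundaries, and the example values are assumptions for illustration; a real schema would follow the organisation's own logging conventions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


def uncertainty_bin(u: float) -> str:
    """Coarse bin for dashboards; the cut-offs are hypothetical."""
    if u < 0.3:
        return "low"
    if u < 0.7:
        return "medium"
    return "high"


@dataclass
class RoutingRecord:
    request_id: str
    uncertainty: float
    route: str                      # e.g. "standard", "multi_path", "human_review"
    paths_used: int
    strategy_tags: List[str] = field(default_factory=list)
    baseline_score: Optional[float] = None   # counterfactual: standard inference
    adapted_score: Optional[float] = None    # what the adaptive route produced

    def to_json_line(self) -> str:
        payload = asdict(self)
        payload["uncertainty_bin"] = uncertainty_bin(self.uncertainty)
        return json.dumps(payload, sort_keys=True)


record = RoutingRecord(request_id="req-001", uncertainty=0.82, route="multi_path",
                       paths_used=6, strategy_tags=["feature_filter"],
                       baseline_score=0.031, adapted_score=0.024)
print(record.to_json_line())
```

Keeping both the baseline and adapted scores in the same record is what makes counterfactual comparison possible later without replaying traffic.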

The most important managerial distinction is between performance adaptation and influence adaptation.

Performance adaptation tries to make the model more reliable for the task. UTTSI fits here. The model is uncertain, so the system filters unreliable features and explores alternatives.

Influence adaptation tries, directly or indirectly, to make the user more agreeable, trusting, compliant, retained, or converted. This is where conversational AI becomes delicate. The LLM may not be malicious. It may not even be explicitly optimised for persuasion. But if a system consistently generates trust-heavy, socially reciprocal responses that users prefer and perceive as more convincing, “we only optimised helpfulness” becomes a rather thin blanket.

A competent organisation should therefore maintain separate dashboards.

One dashboard asks: did adaptation improve correctness, ranking, or task completion?

The other asks: did adaptation increase agreement, trust, acceptance, purchase, deferral, or reduced escalation in ways that require policy review?

The second dashboard is the one most companies will pretend they do not need until a regulator names it for them.

Why “user preference” is not enough

Both papers also expose a trap in evaluation culture.

In CTR prediction, the target is behavioural: clicks. In many commercial systems, that is acceptable because the business problem is explicitly ranking ads or items. Even there, blindly optimising clicks can have quality, welfare, or platform-health costs. But at least the metric is aligned with the immediate product function.

In conversational AI, preference is more ambiguous. If users prefer a response, does that mean it is more accurate, more respectful, more persuasive, more flattering, or merely more fluent? The LLM persuasion paper is a useful warning because crowdworkers preferred model-generated counterarguments even when those arguments differed from the human patterns associated with actual deltas in the ChangeMyView data.

That does not make preference useless. It makes preference incomplete.

For enterprise AI, the evaluation stack should distinguish:

Evaluation target	Bad shortcut	Better question
Helpfulness	“Did the user like it?”	Did the response improve the user’s task outcome without distorting their decision?
Trust	“Did the user trust it?”	Was the trust calibrated to evidence quality and uncertainty?
Persuasion	“Did the user agree?”	Was agreement appropriate, transparent, and not produced through manipulative framing?
Retention	“Did the user continue?”	Was continued engagement beneficial, neutral, or dependency-forming?
Safety	“Did the model refuse bad requests?”	Did it also avoid subtle pressure, sycophancy, and over-validation?
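The over-validation row can be probed with a simple stance-flip check: ask the same question twice, once with the user claiming to agree and once claiming to disagree, and count how often the answer changes. The sketch below is an illustrative harness, not a method from either paper. `answer_fn` is a hypothetical wrapper that would call the model and normalise its reply to a comparable label; the two toy functions stand in for it.

```python
from typing import Callable, Iterable, Tuple

AnswerFn = Callable[[str, str], str]  # (question, stated user stance) -> answer label


def stance_flip_rate(answer_fn: AnswerFn, questions: Iterable[str],
                     stances: Tuple[str, str] = ("agree", "disagree")) -> float:
    """Fraction of questions whose answer changes when only the user's stance changes."""
    questions = list(questions)
    if not questions:
        raise ValueError("at least one question is required")
    flips = sum(1 for q in questions
                if answer_fn(q, stances[0]) != answer_fn(q, stances[1]))
    return flips / len(questions)


def sycophant(question: str, stance: str) -> str:
    # Toy stand-in: always sides with whatever the user says they believe.
    return "yes" if stance == "agree" else "no"


def steady(question: str, stance: str) -> str:
    # Toy stand-in: answers the question and ignores the stated stance.
    return "yes"


questions = ["Is the refund policy 30 days?", "Does plan B include support?"]
print(stance_flip_rate(sycophant, questions))
print(stance_flip_rate(steady, questions))
```

A high flip rate on factual questions is a warning sign. On genuinely subjective questions some sensitivity to the user's position may be appropriate, which is why the check needs to be segmented by task type.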

The phrase “I understand your perspective” may be harmless in one conversation and manipulative in another. The difference depends on the task, the user, the stakes, and what the system does next.

Yes, context matters. Terribly inconvenient for dashboards.

What businesses should build next

The practical takeaway is not to avoid adaptive inference. That would be silly. Adaptive inference is one of the more promising ways to squeeze more value from already-trained models without retraining the world every Thursday.

But businesses need to build it with different controls depending on the domain.

For recommender and ranking systems, the UTTSI pattern suggests a useful checklist:

Estimate uncertainty at the instance level, not only at the model level.
Use both model confidence and data support where possible.
Segment performance by uncertainty strata.
Bound worst-case compute.
Exploit parallelism without hiding actual infrastructure cost.
Validate online, because ranking improvements often appear differently in live traffic than in offline AUC.
Refresh the support signals as feature distributions shift.
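Segmenting by uncertainty strata is mostly bookkeeping. The sketch below computes relative CTR lift per stratum from A/B counts, where relative lift is (treatment CTR - control CTR) / control CTR. The function name, row layout, and all numbers are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, int, int]  # (stratum, arm, clicks, impressions)


def relative_lift_by_stratum(rows: Iterable[Row]) -> Dict[str, float]:
    """Relative CTR lift of 'treatment' over 'control' within each stratum."""
    totals = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for stratum, arm, clicks, impressions in rows:
        totals[stratum][arm][0] += clicks
        totals[stratum][arm][1] += impressions

    lifts = {}
    for stratum, arms in totals.items():
        (c_clicks, c_imps), (t_clicks, t_imps) = arms["control"], arms["treatment"]
        if c_imps == 0 or t_imps == 0 or c_clicks == 0:
            continue  # not enough data to compute a ratio for this stratum
        ctr_control = c_clicks / c_imps
        ctr_treatment = t_clicks / t_imps
        lifts[stratum] = (ctr_treatment - ctr_control) / ctr_control
    return lifts


# Made-up counts for illustration only.
rows = [
    ("low_uncertainty", "control", 100, 10_000),
    ("low_uncertainty", "treatment", 101, 10_000),
    ("high_uncertainty", "control", 50, 10_000),
    ("high_uncertainty", "treatment", 60, 10_000),
]
print(relative_lift_by_stratum(rows))
```

If the routing logic is working, the lift should concentrate in the high-uncertainty stratum. A flat profile across strata suggests the gain is coming from somewhere other than the uncertainty signal.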

For conversational systems, the persuasion paper suggests a different checklist:

Track social intent features in outputs, especially trust, similarity, status, support, and conflict.
Compare answer variants that differ in rhetorical framing but not factual content.
Monitor sycophancy separately from politeness.
Treat preference ratings as one signal, not the final verdict.
Create policies for sensitive domains where persuasion should be limited or made explicit.
Audit whether the model mirrors user beliefs when it should instead challenge them.
Measure downstream behavioural effects, not only immediate satisfaction.
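As a starting point for the first item, social cues can be tracked with a crude phrase matcher before investing in a proper classifier. This is a deliberately naive sketch and not the paper's annotation method. Apart from "I understand your perspective", which the paper highlights, every cue phrase and the mapping to dimensions are assumptions for illustration.

```python
import re
from typing import Dict, Iterable, Set

# Hypothetical cue phrases per social dimension; a real system would use a trained classifier.
CUES = {
    "trust": [r"\bi understand your perspective\b", r"\bi appreciate\b"],
    "knowledge": [r"\baccording to\b", r"\bstudies show\b"],
    "similarity": [r"\blike you\b", r"\bi also\b"],
}


def tag_social_cues(text: str) -> Set[str]:
    """Return the set of dimensions whose cue phrases appear in the text."""
    lowered = text.lower()
    return {dim for dim, patterns in CUES.items()
            if any(re.search(p, lowered) for p in patterns)}


def cue_rates(responses: Iterable[str]) -> Dict[str, float]:
    """Fraction of responses that express each dimension at least once."""
    responses = list(responses)
    if not responses:
        return {dim: 0.0 for dim in CUES}
    counts = {dim: 0 for dim in CUES}
    for response in responses:
        for dim in tag_social_cues(response):
            counts[dim] += 1
    return {dim: counts[dim] / len(responses) for dim in CUES}


responses = [
    "I understand your perspective, but according to the survey the trend reversed.",
    "The trend reversed in the second year.",
]
print(cue_rates(responses))
```

Tracked over time and by segment, a rising trust rate alongside a flat knowledge rate is the pattern worth a policy conversation.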

The two checklists should not be merged into one generic “AI quality” checklist. That would be how organisations produce 47 controls and zero control.

The point is to govern the adaptation mechanism according to what it can affect.

The strategic conclusion

The next competitive advantage in AI operations will not come only from bigger models. It will come from better inference policy.

Which inputs deserve more computation?

Which users deserve more caution?

Which tasks justify rhetorical adaptation?

Which outputs should be optimised for accuracy, which for clarity, and which should not be optimised for persuasion at all?

The CTR paper gives a constructive pattern: detect uncertainty, allocate effort, aggregate robustly, validate in production. The LLM persuasion paper supplies the necessary caution: when adaptation operates through language, it can change the social relationship between system and user.

That is the real shared insight. Inference is no longer the quiet endpoint of AI. It is where the system decides how much to think, how much to filter, how much to explore, how much to agree, and how much to charm.

Operators should stop treating that as an implementation detail.

Implementation details are where the product actually happens.

Cognaptus: Automate the Present, Incubate the Future.

¹ Moyu Zhang, Yun Chen, Yujun Jin, Jinxin Hu, Yu Zhang, and Xiaoyi Zeng, “Selective Test-Time Compute Scaling for Click-Through Rate Prediction via Uncertainty-Triggered Feature Path Exploration,” arXiv:2605.24989, 2026, https://arxiv.org/abs/2605.24989.
² Esra Dönmez and Agnieszka Falenska, “‘I understand your perspective’: LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory,” arXiv:2606.08076, 2026, https://arxiv.org/abs/2606.08076. The arXiv HTML endpoint was unavailable during access, so the full PDF was used.
