Pay to Think: Incentive Design Is the Hidden Variable in Human–AI Research

Payment sounds like the boring part of a user study.

Recruit participants. Estimate task time. Set a base rate. Add a small bonus if the budget allows. Put the number in the methods section, preferably somewhere readers can skim past with dignity. Then move on to the interesting material: trust, reliance, explanations, fairness, error rates, cognitive load, and all the other variables that make human–AI decision-making sound like a serious field rather than a procurement spreadsheet.

The paper behind today’s article argues that this is exactly the mistake.¹

Incentives are not just how participants are compensated. In crowdsourced human–AI decision-making studies, incentives are part of the experimental environment. They can change what participants attend to, how much effort they invest, whether they follow AI advice, whether they treat a task as consequential, and whether a measured “human–AI collaboration” result is really about the interface, the model, the task, or the payment rule quietly steering the entire performance.

That is the paper’s useful irritation. It does not offer another benchmark number. It does not claim that one bonus structure magically fixes human–AI research. Instead, it reviews 97 crowdsourced empirical studies and asks a more uncomfortable methodological question: when researchers study how humans work with AI, do they actually know what they are paying humans to do?

The answer is: sometimes. Often not well enough. And very rarely with the transparency needed for others to interpret the result cleanly.

Incentives are a behavioral interface, not accounting noise

The easiest misconception is that participant payment sits outside the experiment. The AI system is the treatment. The explanation interface is the treatment. The task framing is the treatment. Payment is merely logistics.

In human–AI decision-making, that separation is too neat. A participant who receives a flat completion payment is not in the same behavioral world as a participant whose bonus depends on accuracy. A participant who can lose part of a bonus for a wrong decision is not in the same world as one who is told only to “try their best.” A participant who understands the reward rule is not in the same world as one who vaguely knows there is a bonus somewhere behind the curtain.

The mechanism is simple enough to be dangerous:

Design element	Behavioral pressure	Possible research consequence
Flat base pay	Finish the task acceptably	Measures may reflect minimum-effort completion rather than careful decision-making
Accuracy bonus	Optimize correct answers	Participants may rely on AI more or less depending on expected payoff
Penalty or loss framing	Avoid mistakes	May simulate stakes, but may also create stress, gaming, or ethical concerns
Bonus for comprehension checks	Read instructions carefully	Better task understanding, but not necessarily better real-world judgment
Unclear communication	Uncertain motivation	Harder to interpret whether incentives influenced behavior

This is why the paper’s “incentive-tuning” language matters. The payment scheme is not just a number. It is a control surface.

A business reader can translate this immediately. In a workplace AI pilot, the “incentive scheme” may not be literal task pay. It may be productivity targets, supervisor expectations, sales commissions, error penalties, dashboard rankings, or the social reward of being seen as an early adopter. Same logic, less explicit envelope. If those incentives are misaligned, the pilot result may say less about the AI tool and more about what employees believed they were being rewarded for.

The paper stays within the research setting: crowdsourced empirical human–AI decision-making studies, with a focus on monetary incentives. The business extension is an inference, not a direct result. But it is a highly practical inference, because organizations routinely treat internal AI pilots as evidence for deployment decisions while barely documenting the behavioral pressures surrounding the pilot. Very scientific. Very vibes-based.

What the paper actually reviews

The authors conduct a semi-structured literature review and reflexive thematic analysis of empirical human–AI decision-making studies using crowd workers as decision-makers. Their scope is deliberately narrower than “all AI user studies.” They focus on studies where participants perform decision-making tasks in collaboration with, or in the presence of, AI assistance.

The final dataset contains 97 papers. The authors extract information about study goals, task domains, perceived stakes, participant details, incentive schemes, task completion time, and textual excerpts discussing incentives or participant motivation. Those excerpts become the basis for the thematic analysis.

This matters because the paper is not experimentally testing whether a specific bonus improves AI reliance. It is mapping how the field currently designs, reports, and reasons about incentives. The evidence is methodological and descriptive. Its value is not “here is the best payment formula.” Its value is “the field is treating a behavior-shaping variable as if it were a receipt.”

The review identifies five major themes:

Theme	Count reported in the paper	What it reveals
Components of incentive schemes	88/97 papers	Most papers mention some payment component, but with uneven detail
Base pay	83/97 papers	Base pay is common, but often heuristically chosen
Bonuses	52/97 papers	Bonus structures vary widely across type, metric, and payout mapping
Incentive manipulation	36/97 papers	Researchers sometimes use incentives to simulate stakes or improve ecological validity
Impact of incentives	12/97 papers	Few papers explicitly discuss how incentives may affect outcomes
Communication of incentives	5/97 papers	Very few discuss how payment rules were explained to participants
No incentive mention	8/97 papers	Some studies omit incentives entirely from reporting

The counts are not just trivia. They show a field with partial awareness. Researchers frequently report that participants were paid. Many mention bonuses. Some deliberately manipulate incentives. But only 12 of the 97 papers discuss the impact of those incentives on results, and only 5 discuss how incentive schemes are communicated.
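To make the proportions easier to compare, here is a minimal Python sketch that converts the reported counts into shares of the 97 reviewed papers. The counts come from the table above; the variable and function names are illustrative only.

```python
# Counts as reported in the table above; the denominator is the 97 reviewed papers.
TOTAL_PAPERS = 97

theme_counts = {
    "Components of incentive schemes": 88,
    "Base pay": 83,
    "Bonuses": 52,
    "Incentive manipulation": 36,
    "Impact of incentives": 12,
    "Communication of incentives": 5,
    "No incentive mention": 8,
}


def share_of_papers(count: int, total: int = TOTAL_PAPERS) -> float:
    """Return the percentage of reviewed papers, rounded to one decimal place."""
    return round(100 * count / total, 1)


theme_shares = {theme: share_of_papers(n) for theme, n in theme_counts.items()}

for theme, pct in theme_shares.items():
    print(f"{theme:<35} {pct:>5}%")
```

Seen as percentages, roughly nine in ten papers mention some payment component, while about one in eight discusses its impact and about one in twenty discusses how it was communicated.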

That gap is the hinge of the paper.

Base pay solves recruitment, not interpretation

Base pay is the most visible part of incentive design. In the reviewed studies, 83 of 97 papers mention it. Sometimes the paper gives a flat amount. Sometimes it gives an hourly rate. Sometimes it gives both. Occasionally the rate is justified by minimum wage, platform recommendations, pilot studies, or expected task duration.

The authors’ concern is not that everyone must use the same rate. That would be too easy, and therefore probably wrong. Human–AI decision tasks vary in time, difficulty, domain knowledge, cognitive load, and emotional burden. A short image classification task and a high-stakes legal-risk assessment simulation are not interchangeable just because both happen inside a browser.

The deeper issue is that base pay is often determined heuristically and reported inconsistently. A reader may know that participants received $2.50, but not whether that was fair relative to task time, whether it was based on a pilot, whether the task required expertise, or whether participants felt the pay justified careful effort.
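A flat amount only becomes interpretable once it is paired with task time. The sketch below shows that arithmetic, assuming a study reports a median completion time. The $2.50 figure is the example amount from the paragraph above; the completion times are invented for illustration and do not come from any reviewed study.

```python
def effective_hourly_rate(flat_payment_usd: float, median_minutes: float) -> float:
    """Convert a flat payment into an implied hourly rate."""
    if median_minutes <= 0:
        raise ValueError("median_minutes must be positive")
    return flat_payment_usd * 60 / median_minutes


# Hypothetical completion times: the same $2.50 implies very different rates.
for minutes in (10, 15, 25):
    rate = effective_hourly_rate(2.50, minutes)
    print(f"{minutes} min -> ${rate:.2f}/hour")
```

The same payment implies $15.00, $10.00, or $6.00 per hour depending on how long the task actually takes, which is why a flat amount reported alone tells readers little.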

Base pay mostly answers one question: will people show up and complete the task?

It does not automatically answer a second, more important question: will participants behave like the decision-makers the study claims to simulate?

That distinction matters for human–AI research because decision-making tasks are cognitively richer than ordinary microtasks. A participant labeling five images for a flat fee may behave differently from a participant asked to weigh an AI recommendation in a simulated medical, legal, financial, or moderation context. The payment structure cannot fully recreate real-world stakes, but it can make the simulation more or less plausible.

The paper’s mechanism-first lesson is that compensation is not merely an ethical floor. It is also part of the behavioral frame. Fair pay is necessary, but not sufficient. A study can pay fairly and still create incentives that make the measured “human judgment” oddly thin.

Bonus schemes quietly define the task

Bonuses are where incentive design becomes even more interesting, because bonuses tell participants which behavior is valuable.

The review finds that 52 of the 97 papers mention bonuses. Among the 88 papers that mention some incentive component, 47 mention both base pay and bonus. The authors identify several bonus types: performance-based, completion-based, and luck-based. Performance-based bonuses are the most common. These may reward accuracy, custom performance metrics, expert judgments, in-game currency, or other task-specific scores.
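The three counts above fit together by simple inclusion-exclusion. The sketch below checks that arithmetic. The "only" breakdowns are derived here, not quoted from the paper, and they assume that every one of the 88 papers mentions base pay, a bonus, or both.

```python
# Counts quoted in the text.
base_pay = 83   # papers mentioning base pay
bonus = 52      # papers mentioning bonuses
both = 47       # papers mentioning both

# Inclusion-exclusion: papers mentioning at least one of the two components.
any_component = base_pay + bonus - both

# Derived breakdown (an inference under the assumption stated above).
base_only = base_pay - both
bonus_only = bonus - both

print(f"any component: {any_component}")   # 88
print(f"base pay only: {base_only}")       # 36
print(f"bonus only:    {bonus_only}")      # 5
```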

This is where an apparently minor design choice becomes a substantive research choice.

Suppose a study examines whether users appropriately rely on AI advice. The researcher can reward final decision accuracy. That sounds sensible. But if the participant discovers that the AI is usually strong, accuracy incentives may encourage reliance. If the AI is unreliable in certain cases, the same incentive may encourage careful override. If the bonus rewards “correct when the AI is wrong,” it may directly encourage appropriate reliance. But then the study may no longer be measuring natural reliance; it may be measuring behavior under a training-like reward condition.

The paper’s own case-study discussion captures this trade-off clearly. In one applied example, the authors consider whether to reward “appropriate reliance” itself or overall accuracy. Rewarding appropriate reliance would align the bonus with the desired behavior, but it could also contaminate the measurement of that behavior. Rewarding overall accuracy avoids directly instructing the target construct, but may permit overreliance on the AI.
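A toy calculation makes the trade-off concrete. The Python sketch below compares two hypothetical bonus rules, one paying for every correct final decision and one paying only for a correct decision on trials where the AI was wrong, against two simplified participant strategies. The trial data, the 10-cent payout, and all names are assumptions made for illustration; none of this is taken from the paper or its case studies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Trial:
    truth: str         # correct answer
    ai_advice: str     # what the AI recommended
    own_judgment: str  # what the participant would decide unaided


Policy = Callable[[Trial], str]


def always_follow_ai(trial: Trial) -> str:
    return trial.ai_advice


def always_own_judgment(trial: Trial) -> str:
    return trial.own_judgment


def accuracy_bonus_cents(trials: List[Trial], policy: Policy, cents_per_hit: int = 10) -> int:
    """Pay for every correct final decision."""
    return sum(cents_per_hit for t in trials if policy(t) == t.truth)


def override_bonus_cents(trials: List[Trial], policy: Policy, cents_per_hit: int = 10) -> int:
    """Pay only for correct decisions on trials where the AI advice was wrong."""
    return sum(
        cents_per_hit
        for t in trials
        if t.ai_advice != t.truth and policy(t) == t.truth
    )


# Hypothetical session: the AI is right on 8 of 10 trials, unaided judgment on 6 of 10.
trials = (
    [Trial("A", "A", "A")] * 5    # AI right, participant right
    + [Trial("A", "A", "B")] * 3  # AI right, participant wrong
    + [Trial("A", "B", "A")] * 1  # AI wrong, participant right
    + [Trial("A", "B", "B")] * 1  # AI wrong, participant wrong
)

for name, policy in [("follow AI", always_follow_ai), ("own judgment", always_own_judgment)]:
    print(
        f"{name:<13} accuracy bonus: {accuracy_bonus_cents(trials, policy)}c, "
        f"override bonus: {override_bonus_cents(trials, policy)}c"
    )
```

Under the accuracy rule, blind deference earns 80 cents against 60 cents for unaided judgment, so the rule pays for overreliance. Under the override rule, blind deference earns nothing and unaided judgment earns 10 cents, so the rule pays for exactly the behavior the study is trying to observe.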

That is not a technicality. It is the experimental design problem in miniature.

A bonus metric can turn the task into something else. Accuracy bonuses make the study about correctness under monetary pressure. Speed bonuses would make it about throughput. A penalty makes it about loss avoidance. A bonus for passing comprehension checks makes it about instruction processing. Each may be defensible. None is neutral.

For businesses, this maps cleanly onto pilot design. If a company gives employees a new AI assistant and evaluates the pilot using “tickets closed per day,” employees will optimize tickets closed per day. If managers praise adoption rates, employees may overuse the tool. If compliance teams punish AI-assisted errors, employees may avoid the tool even when it helps. The organization then announces that “users trust AI too much” or “users resist AI,” when the more boring explanation is that the incentive scheme was yelling in the background.

Boring explanations are often underrated. They are also cheaper.

The strongest evidence is the pattern of omission

The paper’s most important finding is not that incentives exist. Everyone knows people like money. Thank you, economics.

The important finding is that incentive design is often reported without enough context to interpret its role. The field has many studies where incentives are present, fewer where they are intentionally manipulated, fewer still where their impact is discussed, and only a small number where communication is described.

The 12/97 figure for “impact of incentives” is especially telling. Some papers reflect on whether incentives may affect generalizability. Some identify incentives as a possible explanation for observed behavior. A few intentionally manipulate incentives and report whether they affect outcomes. But the authors note that these cases involve different tasks, domains, variables, and conditions. There is no single stable conclusion such as “bonuses increase trust” or “higher stakes reduce overreliance.”

That uncertainty is precisely the point. If incentive effects vary by task and context, then documentation becomes more important, not less. Without it, readers cannot know whether a null result means the interface failed, the AI was not useful, the task was too easy, the bonus was too weak, the reward rule was misunderstood, or the participants simply optimized a different objective.

The communication theme is even sharper. Only 5 of 97 papers discuss how incentive schemes were communicated to participants. That is a small number for a variable whose effect depends heavily on what participants believe they are being rewarded for.

A hidden incentive is not necessarily a behavioral incentive. A misunderstood bonus is not the same as a clear bonus. A reward rule buried in instructions may affect careful participants and miss everyone else. In one reviewed example, participants were explicitly taught the reward structure and tested on comprehension. In another, participants were told one thing about bonus allocation while all ultimately received the bonus. These are not equivalent designs.

Communication turns a payment rule into a psychological treatment. If the participant does not understand the rule, the mechanism may not activate. If the participant understands the rule too well, the study may invite gaming. If the paper does not report the communication strategy, readers are left admiring the results through frosted glass.

The framework is less a recipe than a design discipline

The authors respond with the Incentive-Tuning Framework. Its value is not that it finds the universal optimal incentive. It explicitly does not. Its value is that it forces researchers to state the behavioral logic they are otherwise tempted to leave implicit.

The framework has five steps:

1. Identify the purpose of the incentive scheme.
2. Determine an appropriate base pay.
3. Design a bonus structure where relevant.
4. Gather participant feedback.
5. Reflect on design implications.

The first step is the most important. Before choosing numbers, researchers must ask what the incentive is meant to do. Is it only fair compensation? Is it meant to motivate careful effort? Simulate stakes? Encourage attention? Improve ecological validity? Support a specific behavior beyond task completion?

Once the purpose is clear, base pay can be reasoned about in relation to task complexity, required expertise, cognitive load, time estimates, platform standards, minimum wage, prior studies, and pilot feedback. Bonuses can then be designed around type, amount, performance metric, and reward mapping policy. Participant feedback can reveal whether the scheme was perceived as fair, motivating, confusing, stressful, or irrelevant. Reflection then ties those choices back to interpretation.

This sounds obvious in the way all good methodology sounds obvious after someone writes it down.

The framework also comes with a reporting template and a public repository intended to make incentive schemes more transparent and comparable across studies.² That is a practical contribution. A field cannot accumulate knowledge about incentive design if the relevant design decisions are scattered, underspecified, or absent.
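As a rough picture of what structured reporting could look like, here is a minimal Python sketch of an incentive record that flags undocumented fields. The class and field names are hypothetical, loosely modeled on the five steps and the communication theme discussed above. They are not the authors' reporting template and do not reflect the contents of their repository.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IncentiveReport:
    """Hypothetical reporting record; field names are illustrative only."""
    purpose: str = ""                      # step 1: what the scheme is meant to do
    base_pay_usd: Optional[float] = None   # step 2: amount
    base_pay_rationale: str = ""           # step 2: why this amount
    bonus_type: str = ""                   # step 3: e.g. performance, completion, luck, none
    bonus_metric: str = ""                 # step 3: what is rewarded
    bonus_mapping: str = ""                # step 3: how scores map to payouts
    communication: str = ""                # how the rule was explained to participants
    participant_feedback: str = ""         # step 4
    design_reflection: str = ""            # step 5

    def missing_fields(self) -> List[str]:
        """Return the names of fields that were left undocumented."""
        return [
            f.name
            for f in fields(self)
            if getattr(self, f.name) in (None, "")
        ]


# Example: a study that reports the payment but little of the reasoning behind it.
report = IncentiveReport(
    purpose="motivate careful effort",
    base_pay_usd=2.50,
    bonus_type="performance",
    bonus_metric="final decision accuracy",
)
print(report.missing_fields())
```

The example record would pass as a typical methods-section sentence, yet five of its nine fields are empty, including the rationale, the payout mapping, and how the rule was communicated.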

The paper’s case studies show how the framework can be applied to existing research. These should be read as demonstrations, not proof that the framework improves outcomes. The case studies walk through trade-offs: whether a study should use bonuses, whether task accuracy should be rewarded, whether comprehension checks are a better target than final judgments, whether penalties are ethically appropriate, and how participant feedback should be gathered.

In evidence terms, the case studies are implementation demonstrations. They help readers see how the framework changes the design conversation. They do not establish that the resulting incentive scheme would necessarily produce better data. That remains an empirical question.

The business version: pilot incentives are KPIs in disguise

The paper directly studies academic crowdsourced research. It does not directly study enterprise AI pilots. Still, the business relevance is unusually strong because the same mechanism appears in a different costume.

A company testing an AI copilot is also running a human–AI decision-making study. The participants are employees rather than crowd workers. The incentives are often managerial rather than monetary. The outcomes may include productivity, accuracy, satisfaction, adoption, compliance, cost savings, or retention. But the methodological problem is the same: if the surrounding incentives are not documented, the pilot result can be badly misread.

Consider three common enterprise scenarios.

First, a customer-support team tests an AI drafting assistant. Management measures average handling time. Agents quickly learn that shorter conversations look good. The AI appears to improve productivity. But perhaps it also encourages shorter, less careful responses. Unless quality outcomes are measured and incentives are documented, the company may mistake speed optimization for AI value.

Second, analysts test an AI research assistant. They are told the goal is “better insight,” but their actual performance reviews still reward report volume and deadline compliance. The assistant gets used mostly for summarization and first drafts. Leadership concludes that the tool is “not strategic.” Maybe. Or maybe the incentive system trained people not to spend time on strategic use.

Third, a compliance function pilots AI risk triage. Errors are punished harshly. Staff avoid AI recommendations unless they already agree with them. The pilot shows low reliance. The easy story is algorithm aversion. The better story may be rational self-protection under asymmetric downside.
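The third scenario can be sketched as a small expected-penalty calculation. Every number below is invented, and the key assumption is that an AI-assisted error draws a harsher penalty than an unassisted one. It is a toy model of self-protection, not a result from the paper.

```python
def expected_penalty(error_rate: float, penalty_per_error: float) -> float:
    """Expected penalty per decision for a given error rate and penalty size."""
    return error_rate * penalty_per_error


# Hypothetical numbers: the AI is more accurate than unaided staff ...
ai_error_rate = 0.10
own_error_rate = 0.20

# ... but errors made while following the AI are blamed three times as heavily.
penalty_ai_assisted_error = 3.0
penalty_unassisted_error = 1.0

follow_ai_cost = expected_penalty(ai_error_rate, penalty_ai_assisted_error)
own_cost = expected_penalty(own_error_rate, penalty_unassisted_error)

print(f"expected penalty when following the AI: {follow_ai_cost:.2f}")
print(f"expected penalty when deciding alone:   {own_cost:.2f}")
print("rational to avoid the AI:", follow_ai_cost > own_cost)
```

With these assumed numbers, following the more accurate tool carries the higher expected penalty (0.30 against 0.20), so low reliance would be the sensible response to the incentive environment rather than evidence of distrust in the model.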

In each case, the lesson is not “change the incentive and all will be well.” The lesson is more disciplined: record the incentive environment before interpreting the human–AI behavior.

A useful business translation of the paper looks like this:

Research concept	Enterprise analogue	Practical question
Base pay	Baseline workload and role expectation	Are users given enough time and permission to use the tool properly?
Performance bonus	KPI, commission, ranking, promotion signal	What behavior does the organization actually reward?
Penalty	Error blame, compliance risk, managerial escalation	Does downside risk discourage AI use or encourage defensive behavior?
Incentive communication	Pilot instructions, training, manager messaging	Do users understand what success means?
Participant feedback	Employee interviews, surveys, usage logs	Did users feel the tool changed effort, risk, or accountability?
Reflection	Post-pilot interpretation	Could the observed behavior be an incentive artifact?

This is where the article’s mechanism-first framing pays off. Incentive design is not a small appendix to AI evaluation. It is part of the causal path from tool introduction to observed behavior.

A minimal incentive audit for AI pilots

The paper’s framework can be adapted into a lightweight audit for organizations running AI pilots. Not a 40-page governance ceremony. Those reproduce very well in slide decks and very poorly in reality. A practical version needs only a few questions.

Audit question	Why it matters
What behavior do we want users to exhibit during the pilot?	Without this, the pilot may reward whatever is easiest to measure
What behavior are users already rewarded or punished for?	Existing workplace incentives may dominate the pilot instructions
Are we measuring speed, quality, judgment, adoption, or risk reduction?	Different metrics create different behavioral pressure
What have users been told about how the pilot will be evaluated?	Communication shapes whether the incentive is understood
Could users maximize the pilot metric while reducing actual business value?	This detects gaming and proxy optimization
Have we collected user feedback about motivation, workload, and perceived risk?	Usage logs alone rarely explain why people behaved as they did
When interpreting results, have we separated tool effects from incentive effects?	Prevents overclaiming success or failure

This audit does not require sophisticated modeling. It requires intellectual hygiene.
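For teams that want the audit in a form they can track, here is a minimal Python sketch that stores the seven questions from the table and lists the ones a pilot team has not yet answered. The function name and the example answers are hypothetical.

```python
from typing import Dict, List

# The seven audit questions from the table above.
AUDIT_QUESTIONS = [
    "What behavior do we want users to exhibit during the pilot?",
    "What behavior are users already rewarded or punished for?",
    "Are we measuring speed, quality, judgment, adoption, or risk reduction?",
    "What have users been told about how the pilot will be evaluated?",
    "Could users maximize the pilot metric while reducing actual business value?",
    "Have we collected user feedback about motivation, workload, and perceived risk?",
    "When interpreting results, have we separated tool effects from incentive effects?",
]


def unanswered(answers: Dict[str, str]) -> List[str]:
    """Return the audit questions with no answer or only a blank answer."""
    return [q for q in AUDIT_QUESTIONS if not answers.get(q, "").strip()]


# Hypothetical pilot: only two questions have been addressed so far.
answers = {
    AUDIT_QUESTIONS[0]: "Agents should review AI drafts before sending.",
    AUDIT_QUESTIONS[2]: "Average handling time only.",
}

for question in unanswered(answers):
    print("OPEN:", question)
```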

For example, if an AI coding assistant pilot shows strong adoption among junior developers but weak adoption among senior developers, the incentive explanation should be checked before the cultural one. Junior developers may be rewarded for output volume and learning speed. Senior developers may be rewarded for architectural judgment, code review quality, and avoiding downstream maintenance risk. The same tool meets different incentive fields.

Similarly, if an AI sales assistant produces more outreach but weaker conversion, the issue may not be the model. It may be that the incentive system rewarded message volume. The AI faithfully amplified the wrong objective. Machines are very helpful that way: they reveal management confusion at scale.

What the paper does not prove

The paper is careful about scope, and the business reader should be too.

First, this is a review and thematic analysis, not a controlled experiment comparing incentive schemes. It maps practices, identifies gaps, and proposes a framework. It does not prove that using the framework will improve accuracy, reduce overreliance, or increase ecological validity in every setting.

Second, the paper focuses on monetary incentives in crowdsourced human–AI decision-making studies. Real organizations also involve intrinsic motivation, professional identity, peer norms, promotion systems, legal accountability, fatigue, fear, curiosity, and office politics. A tragedy, but a realistic one.

Third, incentive effects are likely context-dependent. The reviewed literature includes examples where incentives appear to affect outcomes and examples where stake size did not show an effect under the studied conditions. This means the correct conclusion is not “incentives always matter in the same way.” The correct conclusion is “if incentives might matter, and you do not document them, interpretation becomes weaker.”

Fourth, better incentive design can create new risks. Performance bonuses can motivate effort, but they can also encourage gaming. Penalties can simulate stakes, but they can also create stress or ethical problems. High rewards may improve engagement, but they may also pressure participants. The framework helps surface those trade-offs; it does not abolish them.

Finally, the paper’s framework itself may face adoption barriers. Researchers and practitioners may see it as extra work. Smaller studies may not have the budget to pilot multiple incentive schemes or gather extensive participant feedback. This is a real limitation. But the minimum standard is still achievable: state the purpose, document the scheme, explain the rationale, gather some feedback, and reflect on implications.

That is not bureaucracy. That is reading the label on the instrument before trusting the measurement.

The real contribution is interpretability for human behavior

Human–AI research spends enormous energy making AI systems interpretable. This paper asks for a humbler symmetry: make the human side interpretable too.

When a participant accepts AI advice, rejects it, overrelies on it, underuses it, reports trust, speeds through the task, or carefully revises a recommendation, that behavior did not emerge in a vacuum. It emerged inside a task environment. Incentives are part of that environment.

The paper’s contribution is therefore not only about payment design. It is about causal legibility. If researchers want to claim that an AI explanation improved trust calibration, that a model confidence display changed reliance, or that a decision-support tool improved human performance, they need to show what participants were being motivated to do. Otherwise, the incentive scheme remains a hidden variable.

For business leaders, the same principle is even more direct. Before declaring an AI pilot successful, failed, underused, overtrusted, or “culturally resisted,” ask what the organization paid, praised, measured, punished, or implied. The answer may be less glamorous than a grand theory of human–AI collaboration.

But it will probably be more useful.

Incentive design is not the whole story. It is just the part of the story that keeps pretending to be administrative.

That disguise is no longer very convincing.

**Cognaptus: Automate the Present, Incubate the Future.**

¹ Simran Kaur, Sara Salimzadeh, and Ujwal Gadiraju, “Incentive-Tuning: Understanding and Designing Incentives for Empirical Human-AI Decision-Making Studies,” arXiv:2601.15064, 2026. https://arxiv.org/abs/2601.15064
² The paper describes supplementary materials, a reporting template, and a public repository for incentive-design information. Repository: https://github.com/simrankaur1509/IncentiveTuning
