Meeting.
That is where many AI demos go to die.
A model receives a tidy prompt, produces a tidy answer, and everyone nods. Then the real work begins: the client clarifies a requirement, the dataset has a missing column, the UI screenshot does not match the written description, the user contradicts themselves, and the model has to decide whether to ask, revise, infer, test, or gracefully admit that it is flying blind.
Static benchmarks rarely measure that moment. They ask the model to answer. Real work asks the model to reduce uncertainty.
That is the core argument of Interactive Benchmarks, a new paper that proposes evaluating large language models through budgeted multi-turn interaction rather than one-shot answer generation.[1] The paper’s contribution is not simply another benchmark suite. We have enough leaderboards already; some of them are beginning to look like airport departure boards during a storm. The more useful idea here is a mechanism: treat intelligence partly as the ability to decide what information to acquire, when to acquire it, and how to use sparse feedback to improve the next action.
This matters because the common business misconception is still alive and well: if a model scores highly on static reasoning tests, it must also be good at interactive work. That assumption is convenient, vendor-friendly, and often wrong.
The missing capability is not reasoning alone, but information acquisition under budget
Most familiar LLM benchmarks use a fixed input and ask for a final answer. This works when the deployed task also looks like a fixed exam question. Many business tasks do not.
A support agent must ask for the missing account detail. A coding assistant must infer what the user means by “make it look like Stripe, but more playful,” which is exactly the sort of sentence that should make a frontend engineer stare into the middle distance. A financial analysis agent must decide whether to query more data or act on the signal it already has. A negotiation or trading agent must behave coherently across many turns, not merely produce one plausible paragraph about strategy.
The paper formalizes this distinction by treating evaluation as a sequential interaction. At each round, the model observes the previous history, chooses an action, receives a response from the environment, and continues until it submits an answer or exhausts its budget. In simple terms:
$$ \text{Model performance} = f(\text{questions asked}, \text{feedback used}, \text{budget spent}, \text{final outcome}) $$
That is the shift. The benchmark no longer asks only, “Did the model know the answer?” It asks, “Could the model find its way toward the answer when the initial prompt was incomplete?”
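The loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the paper's harness: the `ask:` and `submit:` action strings, the `Episode` record, the integer-guessing judge, and the rule that an episode ends at the first submission are all invented here for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (action, feedback) pairs


@dataclass
class Episode:
    history: History = field(default_factory=list)
    solved: bool = False
    turns_used: int = 0


def run_episode(player: Callable[[History], str],
                judge: Callable[[str], str],
                budget: int) -> Episode:
    """Alternate player actions and judge feedback until a submission or budget exhaustion."""
    ep = Episode()
    for _ in range(budget):
        action = player(ep.history)
        feedback = judge(action)
        ep.history.append((action, feedback))
        ep.turns_used += 1
        if action.startswith("submit:"):
            # Assumption: the first final submission ends the episode.
            ep.solved = feedback == "correct"
            break
    return ep


def make_judge(secret: int) -> Callable[[str], str]:
    """Hypothetical judge: holds a hidden integer and gives restricted feedback."""
    def judge(action: str) -> str:
        kind, _, value = action.partition(":")
        if kind == "ask":  # "ask:N" means "is the secret >= N?"
            return "yes" if secret >= int(value) else "no"
        return "correct" if int(value) == secret else "incorrect"
    return judge


def bisecting_player(low: int, high: int) -> Callable[[History], str]:
    """Hypothetical player: replays the history to narrow the range, then bisects."""
    def player(history: History) -> str:
        lo, hi = low, high
        for action, feedback in history:
            kind, _, value = action.partition(":")
            if kind != "ask":
                continue
            if feedback == "yes":
                lo = max(lo, int(value))
            else:
                hi = min(hi, int(value) - 1)
        if lo == hi:
            return f"submit:{lo}"
        return f"ask:{(lo + hi + 1) // 2}"
    return player


episode = run_episode(bisecting_player(1, 16), make_judge(11), budget=20)
print(episode.solved, episode.turns_used)
```

With a budget of 20 the toy player needs four questions and one submission; with a budget of 3 the same player runs out of turns before it can submit. The same model can therefore look capable or incapable depending on the budget the protocol grants it.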
The paper separates this into two regimes:
| Regime | What the model interacts with | Goal | Business analogue |
|---|---|---|---|
| Interactive Proofs | A judge that holds hidden ground truth and gives restricted feedback | Converge on the correct answer | Debugging, requirements clarification, math/coding verification, diagnostic workflows |
| Interactive Games | An environment and other agents with uncertain behavior | Maximize long-horizon utility | Negotiation, trading, pricing games, strategic customer interaction |
This distinction is useful because “interactive AI” is otherwise too broad to mean much. A chatbot asking “Can you clarify?” and a poker-playing model managing risk over thousands of hands are both interactive, but they test different muscles. One is about converging on truth. The other is about adapting under strategic uncertainty.
Interactive proofs test whether the model can turn weak feedback into better hypotheses
The first regime, interactive proofs, is the cleaner one. There is a hidden target. The model cannot see it directly. It must ask questions, receive restricted answers, and decide when it has enough evidence to submit a final answer.
The paper instantiates this in three domains: Logic, UI2Html, and Math.
In the Logic task, the benchmark uses Situation Puzzles. These are short paradoxical stories where the surface description is deliberately underspecified. The model has 20 rounds to ask yes/no-style questions or submit an explanation. The judge answers intermediate questions with one of four labels: yes, no, both, or irrelevant. Final submissions receive correct or incorrect.
The authors built a dataset of 46 curated puzzles. A useful diagnostic appears before the main experiment: when models are forced to answer directly without interaction, every evaluated model scores 0%. That matters. It means the benchmark is not merely asking whether the model remembers a puzzle trope or can hallucinate a plausible explanation. The task is designed so that interaction is necessary.
Under the 20-turn interactive protocol, the scores remain modest. Gemini-3-flash reaches 30.4% accuracy, GPT-5-mini reaches 17.4%, Grok-4.1-fast and DeepSeek-v3.2 each reach 15.2%, Kimi-k2 reaches 6.5%, and Qwen3-max reaches 4.3%. The efficiency metric adds another layer: Kimi-k2 solves its few successful cases in the fewest turns on average, while DeepSeek-v3.2 takes the most turns among solved cases.
The point is not that any one model is crowned king of riddles. Please do not build a procurement policy around who best solves morbid bedroom puzzles. The point is that the benchmark separates two abilities that static tests usually blur: generating a final answer and using feedback to narrow uncertainty.
UI2Html brings the same idea closer to business software. The model receives an intentionally incomplete textual description of a target webpage, writes a complete HTML file, asks one yes/no clarification question per round, and revises across 20 rounds. The final output is scored on layout, component, style, text, and polish.
Here, all six models improve with interaction. GPT-5-mini achieves the best final score under the full setting at 57.62, followed by Grok-4.1-fast at 57.12 and Gemini-3-flash at 55.46. DeepSeek-v3.2 scores 53.72, Qwen3-max 51.35, and Kimi-k2 49.03. But the more interesting result is the gain from interaction: Grok-4.1-fast improves from 53.19 to 57.12, GPT-5-mini from 54.57 to 57.62, while Kimi-k2 barely moves from 48.88 to 49.03.
That pattern is exactly what enterprise evaluators should care about. In real implementation work, the first version is rarely final. The useful question is not merely “Can the model produce HTML?” It is “Can the model use sparse user feedback to make the implementation closer to the target?” Some models can. Some mostly just keep typing. The difference is expensive.
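A quick worked check on the gains, using only the three before-and-after pairs quoted above. The dictionary layout and the `interaction_gain` helper are illustrative names, not anything from the paper.

```python
# (starting score, final score under the full 20-round setting), as quoted in the text
ui2html_scores = {
    "Grok-4.1-fast": (53.19, 57.12),
    "GPT-5-mini": (54.57, 57.62),
    "Kimi-k2": (48.88, 49.03),
}


def interaction_gain(before: float, after: float) -> float:
    """Points gained through interaction, rounded to the two decimals the scores use."""
    return round(after - before, 2)


gains = {model: interaction_gain(*pair) for model, pair in ui2html_scores.items()}
for model, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: +{gain:.2f}")
```

Ranked by gain rather than by final score, Grok-4.1-fast moves ahead of GPT-5-mini, and Kimi-k2's improvement is close to zero. That is a different ordering from the final-score leaderboard, which is the whole point of measuring the delta.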
Math gives the sharpest comparison with static evaluation. The authors use 52 expert-selected hard problems from the mathematics portion of Humanity’s Last Exam. In the interactive setting, the model can ask the judge whether intermediate claims are valid, again receiving restricted feedback, or submit a final solution.
The authors compare this with a budget-matched pass@k baseline. The matching is what makes the comparison fair: without it, an interactive protocol could look better simply because it spends more tokens. The paper chooses $k$ so that repeated full-solution attempts roughly match the player-side token budget of the interactive run. The resulting $k$ values are small: 1 for Grok-4.1-fast, 5 for Gemini-3-flash, 2 for GPT-5-mini, 2 for Kimi-k2, 4 for DeepSeek-v3.2, and 8 for Qwen3-max.
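One way to picture budget matching is the sketch below. The rounding rule, the function names, and the token figures in the example are assumptions made for illustration; the paper's exact procedure for choosing $k$ may differ.

```python
from typing import List


def budget_matched_k(interactive_tokens: int, tokens_per_attempt: int) -> int:
    """Number of full-solution attempts that roughly fit the interactive token budget."""
    if tokens_per_attempt <= 0:
        raise ValueError("tokens_per_attempt must be positive")
    # Assumption: round to the nearest whole attempt, with a floor of one attempt.
    return max(1, round(interactive_tokens / tokens_per_attempt))


def pass_at_k(attempt_results: List[List[bool]], k: int) -> float:
    """Fraction of problems solved by at least one of the first k attempts."""
    solved = sum(any(results[:k]) for results in attempt_results)
    return solved / len(attempt_results)


# Hypothetical numbers: an interactive run costing 42,000 player-side tokens,
# against single-shot attempts that cost about 10,000 tokens each.
k = budget_matched_k(interactive_tokens=42_000, tokens_per_attempt=10_000)

# Hypothetical per-problem attempt outcomes (True means the attempt was correct).
outcomes = [
    [False, True, False, False],
    [False, False, False, False],
    [True, False, False, True],
]
print(k, pass_at_k(outcomes, k))
```

A model that is verbose in single-shot mode gets a small $k$, and a model that is terse gets a larger one, which would explain why the matched values differ so much across models.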
Under this comparison, interactive evaluation substantially outperforms pass@k. Grok-4.1-fast reaches 76.9% interactive accuracy versus 25.0% under budget-matched pass@k. GPT-5-mini reaches 73.1% versus 34.6%. Gemini-3-flash reaches 61.5% versus 25.0%. DeepSeek-v3.2 reaches 48.1% versus 25.0%. Qwen3-max reaches 46.2% versus 25.0%. Kimi-k2 reaches 34.6% versus 9.6%.
This is the paper’s strongest business-relevant result. Repeated sampling is a brute-force way to improve accuracy: ask the model to solve the whole problem again and hope one attempt works. Interactive verification is more surgical: test a lemma, discard a bad branch, keep the useful partial work, and continue. The result suggests that when feedback is available, a model’s practical capability may be underestimated by static pass@k evaluation.
For enterprise workflows, that is not a small detail. It changes how we should test AI systems that operate with review loops, human feedback, validators, linters, simulators, or tool responses. The real question is not “How good is answer number one?” It is “How quickly does the system improve after being told what is wrong?”
Interactive games test coherence when there is no oracle waiting politely in the corner
The second regime removes the judge-as-truth-verifier. In interactive games, there is no hidden correct answer to recover. The model must act in an environment, deal with uncertainty, and optimize long-horizon reward.
The paper uses Texas Hold’em and the Trust Game.
Texas Hold’em tests imperfect-information strategic reasoning. Each agent sees private cards, public cards, stack sizes, pot odds, and recent actions. It must output legal actions such as fold, check, call, raise, or all-in. The benchmark runs 5,000 hands across 10 independent tables, each with the same six LLM agents.
Gemini-3-flash performs best on average winnings per hand, with 31.8 ± 42.4. Grok-4.1-fast follows with 27.9 ± 53.5, and GPT-5-mini with 22.2 ± 71.3. The behavioral metrics are also revealing. GPT-5-mini has the highest VPIP at 23.7% ± 1.1% and the lowest fold rate at 71.4% ± 1.9%, meaning it participates more actively. DeepSeek-v3.2 is much tighter, with VPIP of 9.0% ± 2.0% and fold rate of 90.5% ± 1.4%.
The interpretation is not “Gemini is better at poker, therefore buy Gemini.” That would be the kind of conclusion one writes after reading only the bar chart and possibly after losing money at a casino.
The stronger interpretation is that interactive game benchmarks reveal strategic style. A model can be aggressive, tight, risk-seeking, slow, profitable, or brittle. Those traits matter for business agents that make repeated decisions under uncertainty. Customer retention negotiation, ad bidding, inventory pricing, and trading simulations are not one-shot exams. They are sequences. Style is part of capability.
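For readers who do not live in poker forums, the style metrics can be computed from hand logs roughly as follows. VPIP is the standard "voluntarily put money in pot" rate: the share of hands in which a player calls or raises before the flop, with forced blinds excluded. The `HandRecord` layout is hypothetical, and treating fold rate as "folded at any point in the hand" is an assumption; the paper may define it differently.

```python
from dataclasses import dataclass
from typing import List

# Preflop actions that count as voluntarily putting chips in the pot.
VOLUNTARY_ACTIONS = {"call", "raise", "all-in"}


@dataclass
class HandRecord:
    preflop_actions: List[str]  # this player's own preflop actions, in order
    folded: bool                # assumption: True if the player folded at any street


def vpip(hands: List[HandRecord]) -> float:
    """Percentage of hands with at least one voluntary preflop bet."""
    voluntary = sum(
        any(action in VOLUNTARY_ACTIONS for action in hand.preflop_actions)
        for hand in hands
    )
    return 100.0 * voluntary / len(hands)


def fold_rate(hands: List[HandRecord]) -> float:
    """Percentage of hands the player folded."""
    return 100.0 * sum(hand.folded for hand in hands) / len(hands)


sample = [
    HandRecord(["fold"], folded=True),
    HandRecord(["call"], folded=False),
    HandRecord(["raise", "call"], folded=True),   # entered the pot, folded later
    HandRecord(["check"], folded=False),          # big blind checking is not voluntary
]
print(vpip(sample), fold_rate(sample))
```

Neither number says whether a style is profitable. They describe how the agent plays, which is why they are worth reading next to the winnings column rather than instead of it.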
The Trust Game pushes this further. It is modeled as a repeated Prisoner’s Dilemma with random horizon. Models repeatedly choose cooperate or defect, with payoffs depending on both players’ actions. The benchmark includes two simple baselines: Grim Trigger and Tit-for-Tat.
Qwen3-max achieves the highest average payoff per round at 1.867, while DeepSeek-v3.2 is lowest at 1.648. Only Qwen3-max and GPT-5-mini, at 1.836, outperform both heuristic baselines. The appendix adds behavioral measures such as cooperation rate and betrayal rate, because payoff alone does not describe how the model plays. A model that earns well through stable reciprocity is not the same as one that opportunistically defects at the right moment.
That distinction matters in deployed systems. An enterprise agent that maximizes immediate payoff while damaging trust may look good in a short simulation and terrible in a customer relationship. Conversely, an agent that cooperates too naïvely may be exploitable. The Trust Game is stylized, but the diagnostic lens is valuable: evaluate policy behavior, not only aggregate score.
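A minimal simulation makes the setup concrete. The two baseline strategies are the textbook versions of Tit-for-Tat and Grim Trigger. The payoff matrix (5, 3, 1, 0) is the conventional Prisoner's Dilemma example and is an assumption here, as are the `play` function and the continuation probability; the paper's actual payoffs and horizon parameter are not reproduced.

```python
import random
from typing import Callable, List, Tuple

# Assumed textbook payoffs: (row player's payoff, column player's payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

Strategy = Callable[[List[str], List[str]], str]  # (own history, opponent history) -> move


def tit_for_tat(own: List[str], opp: List[str]) -> str:
    """Cooperate first, then copy the opponent's previous move."""
    return opp[-1] if opp else "C"


def grim_trigger(own: List[str], opp: List[str]) -> str:
    """Cooperate until the opponent defects once, then defect forever."""
    return "D" if "D" in opp else "C"


def always_defect(own: List[str], opp: List[str]) -> str:
    return "D"


def play(a: Strategy, b: Strategy, continue_prob: float,
         rng: random.Random) -> Tuple[float, float]:
    """Play a repeated game with a random horizon; return average payoff per round."""
    hist_a: List[str] = []
    hist_b: List[str] = []
    total_a = total_b = 0
    while True:
        move_a, move_b = a(hist_a, hist_b), b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        total_a += pay_a
        total_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
        if rng.random() >= continue_prob:  # the game ends with probability 1 - continue_prob
            break
    rounds = len(hist_a)
    return total_a / rounds, total_b / rounds


print(play(tit_for_tat, grim_trigger, continue_prob=0.9, rng=random.Random(0)))
print(play(tit_for_tat, always_defect, continue_prob=0.9, rng=random.Random(1)))
```

Two reciprocal strategies cooperate throughout and earn the mutual-cooperation payoff every round, whatever the horizon turns out to be. Against a constant defector, Tit-for-Tat loses only the first round and then matches defection, so the per-round average depends on how long the game happens to run. That is the kind of behavioral difference a single payoff number hides.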
The appendix is mostly calibration, not a second thesis
The appendix does useful work, but it should be read correctly.
Some results are main evidence. The core experiments in Logic, UI2Html, Math, Poker, and Trust Game support the claim that interactive evaluation reveals capabilities that static or single-shot protocols miss.
Some results are ablations or sensitivity tests. They do not create a separate argument; they stress-test the measurement protocol.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Logic judge sensitivity | Judge calibration ablation | Absolute scores shift by judge, but broad model ordering remains relatively stable | That one fixed judge is universally reliable |
| Logic interaction budget | Budget sensitivity test | More rounds help stronger models more clearly because they can use sparse feedback | That more interaction always helps every model |
| UI2Html judge stack | Evaluator sensitivity test | Visual judge choice materially affects score scale | That UI2Html scores are judge-invariant |
| Math judge heatmap | Robustness check | Player-level differences remain visible across judges | That math feedback is independent of judge quality |
| Poker deterministic baselines | Sanity-check ablation | LLM agents outperform degenerate all-in and always-fold strategies | That the poker benchmark captures all real strategic skill |
| Trust Game continuation probability | Protocol sensitivity test | Intermediate horizons work best; longer horizons change rankings | That one horizon parameter represents all trust dynamics |
This is where the paper is better than a simple leaderboard. It admits that interactive evaluation is itself an engineered environment. Change the judge, the feedback vocabulary, the budget, the horizon, or the opponent pool, and the measured behavior may change.
That is not a fatal flaw. It is the reality of evaluating agents. Static benchmarks hide their assumptions behind fixed datasets. Interactive benchmarks expose the assumptions more visibly. Slightly less convenient, much more honest.
What this changes for enterprise AI procurement
The practical lesson is not that every company should immediately reproduce these five benchmarks. Most will not, and most should not. A logistics company does not need a poker table to evaluate an AI planner. A bank does not need Situation Puzzles unless its compliance department has become unusually theatrical.
The lesson is that enterprise evaluation should move from answer quality to interaction quality.
A useful procurement benchmark for AI agents should measure at least four things:
| Evaluation dimension | What to measure | Why it matters |
|---|---|---|
| Clarification quality | Does the model ask questions that reduce uncertainty rather than performatively “checking in”? | Bad clarification wastes user time and still produces wrong output |
| Feedback conversion | Does the model revise meaningfully after negative feedback? | Many systems accept feedback linguistically but ignore it operationally |
| Budget discipline | How many turns, tokens, tool calls, or human interventions are needed before success? | Accuracy without cost control is just expensive optimism |
| Strategy stability | Does the model behave coherently across long sequences? | Multi-step agents fail through drift, overreaction, or inconsistent risk posture |
This is where Cognaptus would infer a broader business pathway from the paper. The direct evidence shows that interactive protocols reveal model differences across five controlled settings. The business inference is that companies should design internal task simulations that resemble their actual workflows: ambiguous requirements, incomplete data, validators, review loops, and time or cost budgets.
For example:
- A coding assistant should be tested on whether it asks useful implementation questions and improves after linting, unit tests, and user feedback.
- A UI generation tool should be tested on iterative reconstruction, not only first-shot visual appeal.
- A financial research agent should be tested on whether it seeks missing data before producing investment commentary.
- A customer-service agent should be tested on whether it asks for the right missing constraint without becoming a bureaucratic form in chatbot clothing.
- A trading or negotiation agent should be tested through repeated environments where risk posture and adaptation matter.
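Whatever the workflow, the logs from such simulations can be reduced to a small scorecard. The sketch below is one possible shape, not a standard: the `EpisodeLog` fields, the metric names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class EpisodeLog:
    solved: bool
    turns: int
    clarifying_questions: int
    tokens: int


def scorecard(logs: List[EpisodeLog], turn_budget: int) -> Dict[str, Optional[float]]:
    """Summarize interaction quality; a success only counts if it lands within the turn budget."""
    solved = [e for e in logs if e.solved and e.turns <= turn_budget]
    return {
        "success_rate": len(solved) / len(logs),
        # Efficiency among solved episodes only, so failures do not distort it.
        "mean_turns_when_solved": mean(e.turns for e in solved) if solved else None,
        "mean_tokens_per_episode": mean(e.tokens for e in logs),
        # All questions asked, including those in failed episodes, per success.
        "questions_per_success": (
            sum(e.clarifying_questions for e in logs) / len(solved) if solved else None
        ),
    }


logs = [
    EpisodeLog(solved=True, turns=4, clarifying_questions=2, tokens=1200),
    EpisodeLog(solved=True, turns=8, clarifying_questions=5, tokens=2600),
    EpisodeLog(solved=False, turns=20, clarifying_questions=9, tokens=5200),
    EpisodeLog(solved=True, turns=25, clarifying_questions=6, tokens=7000),  # over budget
]
print(scorecard(logs, turn_budget=20))
```

Charging the questions from failed episodes to the successful ones is a deliberate choice in this sketch: a model that asks many questions and still fails has spent user time, and the metric should show it.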
The paper does not prove that interactive benchmark scores directly predict enterprise ROI. That link still has to be built task by task. But it gives a better template for evaluation: measure how the model behaves while uncertainty is being reduced.
The limits are real, but they are not the usual decorative caution paragraph
There are several boundaries that materially affect interpretation.
First, judge design matters. In interactive-proof settings, the player model is evaluated through a judge, and the judge’s behavior can shift absolute scores. The UI2Html appendix makes this especially clear: qwen-vl-max, gpt-4.1-mini, and gemini-2.5-flash produce meaningfully different score scales. This means the benchmark measures performance under a protocol, not pure capability floating in the Platonic cloud.
Second, the datasets are moderate in size: 46 logic puzzles, 50 UI screenshots, and 52 math problems. The game tasks are also specific: one poker engine configuration and one repeated Trust Game design. These are useful testbeds, not a complete map of interactive intelligence.
Third, interaction is entangled with domain skill. UI2Html requires both clarification and HTML/CSS competence. Poker requires both interaction and game knowledge. Math requires both proof search and mathematical ability. This is not a scalar “interaction IQ” meter. It is interactive reasoning measured through concrete tasks.
Fourth, cost is only partially represented. The math comparison matches player-side token budget but excludes judge-side tokens, latency, system overhead, and the messiness of real human feedback. In deployment, feedback may be delayed, inconsistent, political, or simply wrong. A benchmark judge, mercifully, does not have a vice president asking whether the button can “feel more premium.”
Finally, model drift and contamination remain possible. The 0% no-interaction accuracy on the Logic task argues against direct-answer shortcuts there, but some tasks derive from existing public sources. API model behavior can also change over time while model names remain stable. Anyone using these results as fixed vendor rankings is already misusing the paper.
The useful future benchmark looks more like work
The most important idea in Interactive Benchmarks is simple: capable AI systems should not merely answer; they should investigate.
That investigation has a cost. Every question, tool call, revision, simulation, and validation step consumes budget. So the relevant evaluation question becomes: how much uncertainty does the model remove per unit of interaction?
This is a more realistic lens for enterprise AI than static accuracy alone. It rewards models that ask discriminative questions, use feedback efficiently, revise without losing context, and stay strategically coherent across repeated decisions. It also exposes models that look impressive in first-shot demos but become strangely ornamental when the task requires actual adaptation.
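"Uncertainty removed per unit of interaction" can be made concrete with a small information-theoretic sketch. It assumes a uniform prior over a finite set of candidate explanations and a truthful yes/no judge, neither of which comes from the paper; the function name is illustrative.

```python
import math


def expected_bits_removed(n_total: int, n_yes: int) -> float:
    """Expected entropy reduction, in bits, from one yes/no question.

    n_total: candidate hypotheses still consistent with the feedback so far.
    n_yes:   how many of those candidates would make the judge answer "yes".
    Assumes every candidate is equally likely and the judge answers truthfully.
    """
    if n_total <= 0 or not 0 <= n_yes <= n_total:
        raise ValueError("need 0 <= n_yes <= n_total and n_total > 0")
    p_yes = n_yes / n_total
    remaining = 0.0
    for prob, size in ((p_yes, n_yes), (1.0 - p_yes, n_total - n_yes)):
        if size > 0:
            remaining += prob * math.log2(size)  # entropy left after this answer
    return math.log2(n_total) - remaining


for n_yes in (8, 4, 1, 0):
    print(n_yes, round(expected_bits_removed(16, n_yes), 3))
```

A question that splits sixteen candidates evenly removes one full bit. A question that singles out one candidate removes about a third of a bit in expectation, and a question every candidate answers the same way removes nothing. A discriminative question, in this framing, is simply one that sits close to the even split.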
Static benchmarks are not obsolete. They remain useful for standardized measurement, regression testing, and quick comparisons. But for agentic systems, they are incomplete. The next meaningful evaluation layer will look less like an exam and more like a workflow.
And that is the uncomfortable part for AI vendors: in real work, the correct answer is often not available at the start. The model has to earn it.
Cognaptus: Automate the Present, Incubate the Future.
[1] Baoqing Yue et al., “Interactive Benchmarks,” arXiv:2603.04737v4, 16 May 2026, https://arxiv.org/abs/2603.04737.