Opening — Why this matters now
Agentic AI has moved beyond polite conversation. Increasingly, we expect language models to act: negotiate contracts, procure services, choose suppliers, and close deals on our behalf. This shift quietly transforms LLMs from passive tools into economic actors.
Yet here’s the uncomfortable truth: most evaluations of LLM agents still resemble logic puzzles or toy auctions. They test reasoning, not commerce. Real markets are messy—private constraints, asymmetric incentives, multi-round bargaining, and strategic patience all matter. The paper behind AgenticPay steps directly into this gap.
Background — From game theory to language-mediated markets
Classical bargaining theory (Rubinstein, Myerson–Satterthwaite, etc.) models negotiation as numeric exchanges over well-defined utility functions. Elegant, provable—and largely irrelevant once language enters the loop.
Recent LLM benchmarks improve on this only marginally. Auctions replace dialogue. Bilateral bargaining dominates. Horizon lengths remain short. Most importantly, language is treated as surface form, not as the mechanism that creates economic outcomes.
AgenticPay reframes negotiation as a language-grounded market game. Dialogue is no longer decoration; it is the control surface.
Analysis — What AgenticPay actually does
At its core, AgenticPay is a simulation framework where buyer and seller agents—powered by LLMs—negotiate through natural language rather than numeric bids.
Market structure
Each episode defines:
- Buyers with private maximum willingness-to-pay
- Sellers with private minimum acceptable prices
- Products with structured and textual attributes
- Market context shared across agents
Agents exchange multi-turn messages, and a parser extracts structured actions (prices, acceptance) from the dialogue. A deal succeeds only if the agreed price lands inside the bargaining zone: at or above the seller's private floor and at or below the buyer's private ceiling.
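To make the mechanics concrete, here is a minimal Python sketch of that feasibility check. The class and field names (`Buyer`, `max_willingness_to_pay`, and so on) are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Buyer:
    name: str
    max_willingness_to_pay: float  # private ceiling, never revealed

@dataclass
class Seller:
    name: str
    min_acceptable_price: float    # private floor, never revealed

def in_bargaining_zone(price: float, buyer: Buyer, seller: Seller) -> bool:
    """A deal is feasible only if the agreed price respects both private
    constraints: seller floor <= price <= buyer ceiling."""
    return seller.min_acceptable_price <= price <= buyer.max_willingness_to_pay

# A zone of [80, 100] exists, so a parsed offer of 90 closes; 105 does not.
buyer = Buyer("b1", max_willingness_to_pay=100.0)
seller = Seller("s1", min_acceptable_price=80.0)
assert in_bargaining_zone(90.0, buyer, seller)
assert not in_bargaining_zone(105.0, buyer, seller)
```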
Task scaling
The benchmark spans 111 tasks of increasing complexity along three dimensions:
| Dimension | Variants |
|---|---|
| Buyers | Single → Multiple |
| Sellers | Single → Multiple |
| Products | Single → Multiple |
This yields scenarios from simple 1-to-1 bargaining to full many-to-many matching markets. Importantly, the same business scenario (e.g., SaaS procurement) can be tested under different market structures, isolating reasoning skill from structural luck.
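As a toy illustration of how those axes compose, the snippet below enumerates a small task grid; the specific counts are assumptions and do not reproduce the paper's 111 tasks:

```python
from itertools import product

# Hypothetical axis values: single vs. multiple participants/products.
buyer_counts, seller_counts, product_counts = [1, 3], [1, 3], [1, 2]

for n_buyers, n_sellers, n_products in product(
    buyer_counts, seller_counts, product_counts
):
    # The same business scenario (e.g., SaaS procurement) would be
    # instantiated under each structure to isolate skill from luck.
    print(f"{n_buyers}-buyer / {n_sellers}-seller / {n_products}-product market")
```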
Evaluation logic
AgenticPay does not reward agreement at any cost. Scores combine:
- Feasibility (deal must respect private constraints)
- Efficiency (fewer rounds is better)
- Welfare (balanced surplus beats lopsided wins)
The GlobalScore peaks when surplus is split evenly—an explicit bias toward economically reasonable outcomes rather than rhetorical dominance.
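The paper's exact formula is not reproduced here, but a hypothetical scoring function with the same shape makes the incentives explicit: infeasible deals score zero, a welfare term peaks at an even surplus split, and an efficiency term rewards fast closure (the weights are assumptions):

```python
def global_score(price: float, buyer_max: float, seller_min: float,
                 rounds: int, max_rounds: int = 20) -> float:
    """Illustrative sketch of a feasibility/efficiency/welfare score,
    NOT the paper's actual GlobalScore formula."""
    zone = buyer_max - seller_min
    if zone <= 0 or not (seller_min <= price <= buyer_max):
        return 0.0                       # infeasible deals score nothing
    buyer_surplus = buyer_max - price
    seller_surplus = price - seller_min
    balance = 1.0 - abs(buyer_surplus - seller_surplus) / zone  # peaks at even split
    efficiency = max(0.0, 1.0 - rounds / max_rounds)            # fewer rounds is better
    return 100.0 * (0.7 * balance + 0.3 * efficiency)           # weights assumed

# An even split at the zone midpoint maximizes the welfare term:
print(global_score(price=90.0, buyer_max=100.0, seller_min=80.0, rounds=4))
```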
Findings — What breaks when LLMs negotiate
1. Proprietary models dominate—decisively
Across all tasks, frontier models consistently outperform open-weight ones.
| Model | GlobalScore | Deal Rate | Avg Rounds |
|---|---|---|---|
| Claude Opus 4.5 | ~87 | 100% | ~3.7 |
| GPT-5.2 | ~82 | 100% | ~3.8 |
| Gemini-3-Flash | ~82 | 100% | ~4.8 |
| Qwen3-14B | ~64 | 79% | ~7.8 |
| Llama-3.1-8B | ~33 | 51% | ~15.0 |
Language fluency scales. Strategic closure does not.
2. Buyers are systematically worse than sellers
Every model—without exception—performs better as a seller than as a buyer. Even top-tier models extract more surplus when selling.
This is not random noise. It suggests a structural bias in LLM training: persuasion and justification are overrepresented; disciplined concession-making is not.
If you plan to deploy an AI purchasing agent, this should worry you.
3. More competition improves outcomes
Counterintuitively, models perform better in multi-agent markets than in bilateral ones. With more buyers or sellers present, deal quality improves and negotiations converge faster.
Why? Liquidity.
More alternatives reduce the need for fragile long-horizon reasoning. The market, not the model, does part of the optimization.
4. Open models fail at the “last mile”
Failure analysis reveals something subtle: many failed negotiations end very close to agreement.
Over 40% of timeouts occur when the final buyer and seller offers differ by less than 5 price units. The agents understand the zone; they simply lack the strategic patience or timing to close.
This is not ignorance. It’s control failure.
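The same diagnostic is easy to run on your own negotiation logs. The sketch below classifies timed-out episodes as near-misses; the record fields and threshold handling are assumptions, not the paper's code:

```python
NEAR_MISS_THRESHOLD = 5.0  # price units, matching the finding above

def near_miss_rate(timeouts: list[dict]) -> float:
    """Fraction of timed-out negotiations whose final buyer offer and
    seller ask were within the threshold of each other."""
    if not timeouts:
        return 0.0
    near = sum(
        1 for t in timeouts
        if abs(t["buyer_last_offer"] - t["seller_last_ask"]) < NEAR_MISS_THRESHOLD
    )
    return near / len(timeouts)

logs = [
    {"buyer_last_offer": 92.0, "seller_last_ask": 95.0},   # near miss
    {"buyer_last_offer": 60.0, "seller_last_ask": 110.0},  # genuine gap
]
print(near_miss_rate(logs))  # 0.5
```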
Implications — What this means for agentic commerce
Evaluation must move past single-agent reasoning
AgenticPay shows that reasoning benchmarks miss the point. Economic agency demands:
- Persistent memory
- Role-conditioned incentives
- Long-horizon convergence strategies
Without these, fluent language becomes expensive noise.
Market design can compensate for weak agents
Multi-agent settings stabilize outcomes even with weaker models. For practitioners, this suggests a design principle:
If your agent is weak, surround it with structure.
Competition, parallel negotiation, and choice sets act as external reasoning scaffolds.
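In code, the scaffold can be as simple as negotiating with several counterparties in parallel and keeping the best feasible deal. The `negotiate` stub below is a stand-in for a real LLM negotiation loop, not anything from the paper:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def negotiate(seller_id: str) -> dict | None:
    """Stand-in for an LLM negotiation: sometimes closes, sometimes fails."""
    price = random.uniform(80, 120)
    return {"seller": seller_id, "price": price} if price < 110 else None

def best_deal(seller_ids: list[str]) -> dict | None:
    # Parallel negotiations widen the choice set; the buyer keeps the
    # cheapest closed deal, so the market does part of the optimization.
    with ThreadPoolExecutor() as pool:
        deals = [d for d in pool.map(negotiate, seller_ids) if d]
    return min(deals, key=lambda d: d["price"], default=None)

print(best_deal(["s1", "s2", "s3", "s4"]))
```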
Buyer-side agents need deliberate correction
The buyer disadvantage is too consistent to ignore. Future systems will need explicit counter-bias mechanisms—reward shaping, policy constraints, or post-negotiation audits—to avoid systematic overpayment.
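As one illustration of such a policy constraint (an assumption, not a mechanism from the paper), a hard buyer-side guardrail can refuse any acceptance that leaves less than a target share of surplus on the table:

```python
def buyer_guardrail(proposed_price: float, max_willingness_to_pay: float,
                    target_margin: float = 0.10) -> bool:
    """Reject any acceptance above (1 - target_margin) * ceiling, forcing
    the agent to hold out for surplus instead of closing early."""
    budget = (1.0 - target_margin) * max_willingness_to_pay
    return proposed_price <= budget

# With a ceiling of 100 and a 10% target margin, 92 is blocked, 88 passes.
print(buyer_guardrail(92.0, 100.0))  # False
print(buyer_guardrail(88.0, 100.0))  # True
```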
Conclusion — Language is not strategy
AgenticPay is not just a benchmark. It is a warning.
LLMs can talk their way through a negotiation, but talking is not the same as closing well. Until agents learn patience, role awareness, and surplus discipline, autonomous commerce will remain asymmetrical—and occasionally costly.
AgenticPay gives the field a mirror. The reflection is sharp.
Cognaptus: Automate the Present, Incubate the Future.