Opening — Why this matters now
Agentic AI has moved beyond polite conversation. Increasingly, we expect language models to act: negotiate contracts, procure services, choose suppliers, and close deals on our behalf. This shift quietly transforms LLMs from passive tools into economic actors.
Yet here’s the uncomfortable truth: most evaluations of LLM agents still resemble logic puzzles or toy auctions. They test reasoning, not commerce. Real markets are messy—private constraints, asymmetric incentives, multi-round bargaining, and strategic patience all matter. The paper behind AgenticPay steps directly into this gap.
Background — From game theory to language-mediated markets
Classical bargaining theory (Rubinstein, Myerson–Satterthwaite, etc.) models negotiation as numeric exchanges over well-defined utility functions. Elegant, provable—and largely irrelevant once language enters the loop.
Recent LLM benchmarks improve on this only marginally. Auctions replace dialogue. Bilateral bargaining dominates. Horizon lengths remain short. Most importantly, language is treated as surface form, not as the mechanism that creates economic outcomes.
AgenticPay reframes negotiation as a language-grounded market game. Dialogue is no longer decoration; it is the control surface.
Analysis — What AgenticPay actually does
At its core, AgenticPay is a simulation framework where buyer and seller agents—powered by LLMs—negotiate through natural language rather than numeric bids.
Market structure
Each episode defines:
- Buyers with private maximum willingness-to-pay
- Sellers with private minimum acceptable prices
- Products with structured and textual attributes
- Market context shared across agents
Agents exchange multi-turn messages, and a parser extracts structured actions (prices, acceptance) from the dialogue. A deal succeeds only if the agreed price lands inside the bargaining zone: at or above the seller's private floor and at or below the buyer's private ceiling.
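To make the mechanics concrete, here is a minimal Python sketch of that feasibility check. The class and field names (`Buyer`, `max_willingness_to_pay`, and so on) are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Buyer:
    name: str
    max_willingness_to_pay: float  # private ceiling, never revealed

@dataclass
class Seller:
    name: str
    min_acceptable_price: float    # private floor, never revealed

def in_bargaining_zone(price: float, buyer: Buyer, seller: Seller) -> bool:
    """A deal is feasible only if the agreed price respects both private
    constraints: seller floor <= price <= buyer ceiling."""
    return seller.min_acceptable_price <= price <= buyer.max_willingness_to_pay

# A zone of [80, 100] exists, so a parsed offer of 90 closes; 105 does not.
buyer = Buyer("b1", max_willingness_to_pay=100.0)
seller = Seller("s1", min_acceptable_price=80.0)
assert in_bargaining_zone(90.0, buyer, seller)
assert not in_bargaining_zone(105.0, buyer, seller)
```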
Task scaling
The benchmark spans 111 tasks of increasing complexity along three dimensions:
| Dimension | Variants |
|---|---|
| Buyers | Single → Multiple |
| Sellers | Single → Multiple |
| Products | Single → Multiple |
This yields scenarios from simple 1-to-1 bargaining to full many-to-many matching markets. Importantly, the same business scenario (e.g., SaaS procurement) can be tested under different market structures, isolating reasoning skill from structural luck.
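As a toy illustration of how those axes compose, the snippet below enumerates a small task grid; the specific counts are assumptions and do not reproduce the paper's 111 tasks:

```python
from itertools import product

# Hypothetical axis values: single vs. multiple participants/products.
buyer_counts, seller_counts, product_counts = [1, 3], [1, 3], [1, 2]

for n_buyers, n_sellers, n_products in product(
    buyer_counts, seller_counts, product_counts
):
    # The same business scenario (e.g., SaaS procurement) would be
    # instantiated under each structure to isolate skill from luck.
    print(f"{n_buyers}-buyer / {n_sellers}-seller / {n_products}-product market")
```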
Evaluation logic
AgenticPay does not reward agreement at any cost. Scores combine:
- Feasibility (deal must respect private constraints)
- Efficiency (fewer rounds is better)
- Welfare (balanced surplus beats lopsided wins)
The GlobalScore peaks when surplus is split evenly—an explicit bias toward economically reasonable outcomes rather than rhetorical dominance.
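The paper's exact formula is not reproduced here, but a hypothetical scoring function with the same shape makes the incentives explicit: infeasible deals score zero, a welfare term peaks at an even surplus split, and an efficiency term rewards fast closure (the weights are assumptions):

```python
def global_score(price: float, buyer_max: float, seller_min: float,
                 rounds: int, max_rounds: int = 20) -> float:
    """Illustrative sketch of a feasibility/efficiency/welfare score,
    NOT the paper's actual GlobalScore formula."""
    zone = buyer_max - seller_min
    if zone <= 0 or not (seller_min <= price <= buyer_max):
        return 0.0                       # infeasible deals score nothing
    buyer_surplus = buyer_max - price
    seller_surplus = price - seller_min
    balance = 1.0 - abs(buyer_surplus - seller_surplus) / zone  # peaks at even split
    efficiency = max(0.0, 1.0 - rounds / max_rounds)            # fewer rounds is better
    return 100.0 * (0.7 * balance + 0.3 * efficiency)           # weights assumed

# An even split at the zone midpoint maximizes the welfare term:
print(global_score(price=90.0, buyer_max=100.0, seller_min=80.0, rounds=4))
```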
Findings — What breaks when LLMs negotiate
1. Proprietary models dominate—decisively
Across all tasks, frontier models consistently outperform open-weight ones.
| Model | GlobalScore | Deal Rate | Avg Rounds |
|---|---|---|---|
| Claude Opus 4.5 | ~87 | 100% | ~3.7 |
| GPT-5.2 | ~82 | 100% | ~3.8 |
| Gemini-3-Flash | ~82 | 100% | ~4.8 |
| Qwen3-14B | ~64 | 79% | ~7.8 |
| Llama-3.1-8B | ~33 | 51% | ~15.0 |
Language fluency scales. Strategic closure does not.
2. Buyers are systematically worse than sellers
Every model—without exception—performs better as a seller than as a buyer. Even top-tier models extract more surplus when selling.
This is not random noise. It suggests a structural bias in LLM training: persuasion and justification are overrepresented; disciplined concession-making is not.
If you plan to deploy an AI purchasing agent, this should worry you.
3. More competition improves outcomes
Counterintuitively, models perform better in multi-agent markets than in bilateral ones. With more buyers or sellers present, deal quality improves and negotiations converge faster.
Why? Liquidity.
More alternatives reduce the need for fragile long-horizon reasoning. The market, not the model, does part of the optimization.
4. Open models fail at the “last mile”
Failure analysis reveals something subtle: many failed negotiations end very close to agreement.
Over 40% of timeouts occur when the final buyer and seller offers differ by less than 5 price units. The agents understand the zone; they simply lack the strategic patience or timing to close.
This is not ignorance. It’s control failure.
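The same diagnostic is easy to run on your own negotiation logs. The sketch below classifies timed-out episodes as near-misses; the record fields and threshold handling are assumptions, not the paper's code:

```python
NEAR_MISS_THRESHOLD = 5.0  # price units, matching the finding above

def near_miss_rate(timeouts: list[dict]) -> float:
    """Fraction of timed-out negotiations whose final buyer offer and
    seller ask were within the threshold of each other."""
    if not timeouts:
        return 0.0
    near = sum(
        1 for t in timeouts
        if abs(t["buyer_last_offer"] - t["seller_last_ask"]) < NEAR_MISS_THRESHOLD
    )
    return near / len(timeouts)

logs = [
    {"buyer_last_offer": 92.0, "seller_last_ask": 95.0},   # near miss
    {"buyer_last_offer": 60.0, "seller_last_ask": 110.0},  # genuine gap
]
print(near_miss_rate(logs))  # 0.5
```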
Implications — What this means for agentic commerce
Evaluation must move past single-agent reasoning
AgenticPay shows that reasoning benchmarks miss the point. Economic agency demands:
- Persistent memory
- Role-conditioned incentives
- Long-horizon convergence strategies
Without these, fluent language becomes expensive noise.
Market design can compensate for weak agents
Multi-agent settings stabilize outcomes even with weaker models. For practitioners, this suggests a design principle:
If your agent is weak, surround it with structure.
Competition, parallel negotiation, and choice sets act as external reasoning scaffolds.
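In code, the scaffold can be as simple as negotiating with several counterparties in parallel and keeping the best feasible deal. The `negotiate` stub below is a stand-in for a real LLM negotiation loop, not anything from the paper:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def negotiate(seller_id: str) -> dict | None:
    """Stand-in for an LLM negotiation: sometimes closes, sometimes fails."""
    price = random.uniform(80, 120)
    return {"seller": seller_id, "price": price} if price < 110 else None

def best_deal(seller_ids: list[str]) -> dict | None:
    # Parallel negotiations widen the choice set; the buyer keeps the
    # cheapest closed deal, so the market does part of the optimization.
    with ThreadPoolExecutor() as pool:
        deals = [d for d in pool.map(negotiate, seller_ids) if d]
    return min(deals, key=lambda d: d["price"], default=None)

print(best_deal(["s1", "s2", "s3", "s4"]))
```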
Buyer-side agents need deliberate correction
The buyer disadvantage is too consistent to ignore. Future systems will need explicit counter-bias mechanisms—reward shaping, policy constraints, or post-negotiation audits—to avoid systematic overpayment.
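As one illustration of such a policy constraint (an assumption, not a mechanism from the paper), a hard buyer-side guardrail can refuse any acceptance that leaves less than a target share of surplus on the table:

```python
def buyer_guardrail(proposed_price: float, max_willingness_to_pay: float,
                    target_margin: float = 0.10) -> bool:
    """Reject any acceptance above (1 - target_margin) * ceiling, forcing
    the agent to hold out for surplus instead of closing early."""
    budget = (1.0 - target_margin) * max_willingness_to_pay
    return proposed_price <= budget

# With a ceiling of 100 and a 10% target margin, 92 is blocked, 88 passes.
print(buyer_guardrail(92.0, 100.0))  # False
print(buyer_guardrail(88.0, 100.0))  # True
```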
Conclusion — Language is not strategy
AgenticPay is not just a benchmark. It is a warning.
LLMs can talk their way through a negotiation, but talking is not the same as closing well. Until agents learn patience, role awareness, and surplus discipline, autonomous commerce will remain asymmetrical—and occasionally costly.
AgenticPay gives the field a mirror. The reflection is sharp.
Cognaptus: Automate the Present, Incubate the Future.