Procurement looks boring until the software starts spending money.

A human buyer can be slow, inconsistent, and occasionally allergic to spreadsheets. But at least we know what failure looks like: overpaying, accepting bad terms, walking away too late, or trusting the wrong supplier. When the buyer is an LLM agent, the failure mode becomes more polished. It can overpay in fluent English. It can miss a deal while sounding reasonable. It can keep bargaining after the answer is already visible. Progress, apparently, now comes with better punctuation.

That is why AgenticPay is useful.[1] The paper does not ask whether language models can produce negotiation-shaped text. They obviously can. The harder question is whether they can act as economically disciplined negotiators when they must manage private prices, multi-round dialogue, multiple counterparties, and explicit deal constraints.

The paper’s answer is not “no.” That would be too easy. The answer is sharper: frontier proprietary models can close deals reliably in this benchmark, but negotiation quality is uneven, seller-side behavior is consistently stronger than buyer-side behavior, financial-asset scenarios expose weakness, and smaller open-weight models often fail not because they cannot enter the bargaining zone, but because they cannot finish the last mile.

That distinction matters for business. If AI agents are going to purchase software, negotiate creator contracts, select vendors, or represent customers in service markets, fluency is the wrong minimum standard. The real standard is closer to this: can the agent protect private constraints, converge efficiently, avoid invalid prices, allocate surplus reasonably, and know when to stop talking?

AgenticPay gives us a controlled way to ask that question.

AgenticPay Tests Negotiation as a Market, Not as Chat Theater

Most LLM evaluations still treat agency as a single-agent problem: answer the question, call the tool, follow the instruction, complete the task. Commerce is different. A transaction is not solved by one agent thinking clearly in isolation. It is shaped by incentives, incomplete information, bargaining zones, timing, and the unpleasant fact that the other side also wants the surplus.

AgenticPay formalizes that setting as a natural-language buyer-seller market. Buyers receive private maximum willingness-to-pay values. Sellers receive private minimum acceptable prices. Neither side is supposed to reveal its reservation value. They exchange multi-turn natural-language messages, and the system parses structured offers from those messages.
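To make the parsing step concrete, here is a minimal sketch of pulling a structured offer out of a free-form message. It is not the paper's parser: the regex, the `extract_offer` name, and the "last dollar amount wins" heuristic are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern: a dollar sign, then digits with optional
# thousands separators, then optional cents.
_PRICE = re.compile(r"\$\s*(\d{1,3}(?:,\d{3})+|\d+)(\.\d+)?")


def extract_offer(message: str) -> Optional[float]:
    """Return the last dollar amount mentioned in a message, or None.

    Assumption: when a message quotes several prices, the last one is
    the speaker's current offer. A real system would need something
    stricter than this heuristic.
    """
    matches = _PRICE.findall(message)
    if not matches:
        return None
    whole, cents = matches[-1]
    return float(whole.replace(",", "") + cents)
```

A message with no recognizable price returns `None`, which gives the surrounding system something explicit to log or reject instead of a guessed number.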

A deal is valid only if the final price sits inside the bargaining zone:

$$ p^{min} \leq p \leq p^{max} $$

where $p^{min}$ is the seller’s minimum acceptable price and $p^{max}$ is the buyer’s maximum acceptable price.
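The validity condition is simple enough to write down directly. The sketch below checks it and splits the surplus using the textbook definitions (buyer surplus is the buyer's maximum minus the price; seller surplus is the price minus the seller's minimum). These are standard economics definitions, not AgenticPay's scoring formulas, and the function names are hypothetical.

```python
from typing import Tuple


def is_valid_deal(price: float, seller_min: float, buyer_max: float) -> bool:
    """True if the price sits inside the bargaining zone [seller_min, buyer_max]."""
    return seller_min <= price <= buyer_max


def surplus_split(price: float, seller_min: float, buyer_max: float) -> Tuple[float, float]:
    """Return (buyer_surplus, seller_surplus) for a valid deal.

    Textbook surplus definitions, used here only for illustration.
    """
    if not is_valid_deal(price, seller_min, buyer_max):
        raise ValueError("price is outside the bargaining zone")
    return buyer_max - price, price - seller_min
```

With a seller minimum of 800 and a buyer maximum of 900, a price of 850 is valid and splits the 100-unit zone evenly; a price of 950 is rejected before any surplus is computed.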

This design matters because it turns dialogue into measurable economic behavior. The language is not decorative. It is the control surface through which the agent proposes, concedes, accepts, stalls, or fails.

The benchmark covers 111 tasks: 31 basic tasks and 80 realistic tasks across 10 business scenarios. Those scenarios include used smartphones, used cars, vacation rentals, website development, commercial photography, home renovation, SaaS software, raw materials, luxury watches, and business acquisitions. Prices range from hundreds of dollars to $120,000. The task structures move from simple bilateral bargaining to multi-buyer, multi-seller, multi-product, sequential, and parallel negotiation settings.

This is the first important contrast: AgenticPay is not merely asking whether an LLM can bargain over a jacket. It asks whether the same policy can survive when the market structure changes.

| Design element | What it tests | Business interpretation |
| --- | --- | --- |
| Private buyer and seller reservation prices | Whether agents can negotiate without leaking constraints | Useful for procurement, sales, and marketplace agents handling confidential limits |
| Structured price extraction | Whether free-form language can be converted into auditable actions | Necessary for systems that must log, approve, or block transactions |
| GlobalScore, BuyerScore, SellerScore | Whether the deal is feasible, efficient, and surplus-aware | Prevents “deal at any cost” behavior from looking successful |
| Multiple buyers, sellers, and products | Whether agents handle liquidity, alternatives, and matching | Closer to real vendor selection than one-off chat bargaining |
| Sequential and parallel modes | Whether orchestration changes negotiation quality | Relevant for agents comparing suppliers one by one or simultaneously |

The scoring system is also disciplined. AgenticPay does not reward agreement alone. GlobalScore rewards balanced outcomes where both parties benefit, while BuyerScore and SellerScore measure role-specific surplus. Faster agreements receive higher efficiency rewards. Invalid or failed negotiations are penalized.

That is a sensible choice. A procurement agent that closes every deal by paying too much is not “high-performing.” It is just expensive.

Frontier Models Close Deals; Smaller Open Models Often Wander

The main result is a capability gap, and it is not subtle.

Across all 111 tasks, Claude Opus 4.5 achieves the highest GlobalScore at 86.9, followed by Gemini-3-Flash at 82.2 and GPT-5.2 at 81.7. All three proprietary models achieve a 100% deal rate with zero timeouts. GPT-5.2 and Claude Opus 4.5 also record zero price overflow; Gemini-3-Flash has a small overflow rate of 2.7%.

The open-weight models are much weaker. Qwen3-14B reaches a GlobalScore of 63.9 with a 79.3% deal rate and 20.7% timeout rate. Llama-3.1-8B falls to a GlobalScore of 32.5, with only a 51.4% deal rate, a 48.6% timeout rate, and a 10.8% overflow rate.

| Model | GlobalScore | SellerScore | BuyerScore | Deal rate | Timeout rate | Overflow rate | Avg. rounds |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.5 | 86.9 | 76.1 | 63.5 | 100.0% | 0.0% | 0.0% | 3.7 |
| Gemini-3-Flash | 82.2 | 73.3 | 61.1 | 100.0% | 0.0% | 2.7% | 4.8 |
| GPT-5.2 | 81.7 | 81.1 | 58.5 | 100.0% | 0.0% | 0.0% | 3.8 |
| Qwen3-14B | 63.9 | 58.9 | 47.6 | 79.3% | 20.7% | 1.8% | 7.8 |
| Llama-3.1-8B | 32.5 | 26.3 | 25.2 | 51.4% | 48.6% | 10.8% | 15.0 |

This is not just a “bigger model better” result. The operational meaning is more specific.

Frontier models appear capable of maintaining negotiation state, moving toward acceptable prices, and recognizing when agreement is reachable. Smaller open-weight models can participate in the dialogue, but they struggle with convergence and constraint discipline. That difference matters because production negotiation systems fail less like chatbots and more like process controls. They do not merely say something wrong. They miss windows, violate limits, or continue a conversation after the useful work is done.

Average rounds tell the same story. Claude Opus 4.5 terminates in 3.7 rounds on average; GPT-5.2 in 3.8. Llama-3.1-8B averages 15 rounds. A slow negotiator is not automatically bad, but in a finite-horizon system, slowness becomes risk. The agent burns turns, exposes more opportunities for inconsistency, and may never cross the final acceptance threshold.

The business implication is straightforward: before letting an agent negotiate anything real, measure not only whether it reaches agreement, but how long it takes, how often it times out, and whether it ever proposes prices outside allowed bounds.
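Those measurements are cheap to compute once negotiations are logged. The sketch below assumes a hypothetical `Episode` record with three fields; it also assumes every negotiation that does not close counts as a timeout, which matches how deal rate and timeout rate sum to 100% in the table above but may not hold in other setups.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Episode:
    """One logged negotiation (hypothetical schema, not from the paper)."""
    closed: bool               # did the negotiation end in a deal?
    rounds: int                # rounds used before closing or timing out
    out_of_bounds_offer: bool  # did the agent ever propose a price outside its limits?


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Compute deal, timeout, and overflow rates plus average rounds."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    deal_rate = sum(e.closed for e in episodes) / n
    return {
        "deal_rate": deal_rate,
        # Assumption: every non-closed episode is a timeout.
        "timeout_rate": 1 - deal_rate,
        "overflow_rate": sum(e.out_of_bounds_offer for e in episodes) / n,
        "avg_rounds": mean(e.rounds for e in episodes),
    }
```

Running `summarize` on a batch of test negotiations before deployment gives a small version of the table above for whatever model and prompt a team actually plans to use.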

A friendly demo conversation is not enough. The demo always closes. The invoice arrives later.

Sellers Keep Beating Buyers, Which Should Make Procurement Teams Nervous

The most commercially uncomfortable result is role asymmetry.

Across the benchmark, SellerScores are consistently higher than BuyerScores. GPT-5.2, for example, records a SellerScore of 81.1 but a BuyerScore of 58.5. Claude Opus 4.5 records 76.1 as seller and 63.5 as buyer. Qwen3-14B records 58.9 as seller and 47.6 as buyer.

The paper also runs a cross-play analysis in the simple 1-buyer, 1-product, 1-seller setting. This is useful because it tests models against each other rather than only in self-play. The asymmetry remains. Claude Opus 4.5 achieves a SellerScore of 83.6 but a BuyerScore of 57.6. Gemini-3-Flash: 84.5 versus 56.4. GPT-5.2: 81.7 versus 54.8. Qwen3-14B shows the largest gap, with 82.4 as seller and only 39.2 as buyer.
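The size of the asymmetry is easier to see as a subtraction. The snippet below uses only the cross-play scores quoted above; the helper name `seller_buyer_gaps` is hypothetical.

```python
from typing import Dict, Tuple

# Cross-play (SellerScore, BuyerScore) pairs as quoted in the text above.
CROSS_PLAY: Dict[str, Tuple[float, float]] = {
    "Claude Opus 4.5": (83.6, 57.6),
    "Gemini-3-Flash": (84.5, 56.4),
    "GPT-5.2": (81.7, 54.8),
    "Qwen3-14B": (82.4, 39.2),
}


def seller_buyer_gaps(scores: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Seller score minus buyer score per model, rounded to one decimal."""
    return {model: round(seller - buyer, 1) for model, (seller, buyer) in scores.items()}


gaps = seller_buyer_gaps(CROSS_PLAY)
widest = max(gaps, key=gaps.get)  # model with the largest seller-buyer gap
```

The gaps come out at 26.0, 28.1, 26.9, and 43.2 points respectively, so the three proprietary models cluster in the high twenties while Qwen3-14B's buyer side falls much further behind.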

There are several possible explanations. Seller prompts may naturally align with persuasive language patterns common in training data: justify value, defend price, offer moderate concessions, close confidently. Buyer behavior requires a different discipline: resist anchoring, maintain budget secrecy, search alternatives, time concessions, and walk away when necessary. That is less glamorous material. There are many more public examples of persuasive selling than rigorous purchasing.

The paper does not prove the training-data explanation. It suggests it as a plausible hypothesis. For business readers, the more important point is not the cause but the control requirement.

If a company deploys an AI purchasing agent, it should assume buyer-side behavior needs special hardening. The agent must be tested for overpayment, premature agreement, leakage of willingness-to-pay, and failure to exploit alternatives. A generic “negotiate politely and strategically” prompt is not a procurement policy. It is a greeting card with budget authority.

More Market Structure Can Help Weak Reasoning

One counterintuitive finding is that performance often improves when there are more buyers and sellers.

In the paper’s Table 2, GlobalScore generally rises as the market moves from the single-buyer-single-seller setting (1B1S) toward the multi-buyer-multi-seller setting (MBMS). Claude Opus 4.5 increases from 83.4 in 1B1S to 89.8 in MBMS. Gemini-3-Flash rises from 77.5 to 87.4. GPT-5.2 moves from 79.1 to 84.0. Qwen3-14B jumps from 63.2 to 77.6. Even Llama-3.1-8B improves from 27.9 to 37.5.

At first glance, this looks strange. More agents should mean more complexity. More complexity should hurt. But market structure is not only a burden; it is also a scaffold.

When agents have alternatives, they do not need to solve every bilateral negotiation perfectly. A buyer can find a more compatible seller. A seller can accept a better-matched buyer. Competition disciplines unreasonable offers. Liquidity does part of the optimization that the model would otherwise have to perform internally.

This is an important business lesson: agent design and market design are substitutes to some extent. If an agent is weak, surrounding it with structured alternatives may improve outcomes more than asking it to “think harder.”

Cognaptus inference: firms should not evaluate AI negotiation agents only in isolated one-to-one chats. They should test them in the actual decision structure they plan to deploy: vendor pools, quote comparisons, fallback suppliers, internal approval thresholds, and walk-away rules. A weak agent in a well-designed market may outperform a stronger-sounding agent in a badly designed one.

The boundary is equally important. This does not mean complexity is always good. It means that some kinds of market complexity add liquidity and alternatives, which can improve matching. Other kinds of complexity may add cognitive load without adding useful choice. The difference must be tested, not assumed.

Financial Assets Punish Generic Bargaining Scripts

AgenticPay’s scenario breakdown shows that not all markets are equally forgiving.

Professional services are relatively strong across models. Claude Opus 4.5 scores 93.4, GPT-5.2 scores 89.8, Gemini-3-Flash scores 88.3, Qwen3-14B scores 72.5, and Llama-3.1-8B scores 41.1. Financial assets are weaker for most models: Claude Opus 4.5 drops to 85.7, GPT-5.2 to 79.9, Gemini-3-Flash to 68.1, and Qwen3-14B to 60.9. Llama-3.1-8B is already weak across the board, with especially poor performance in business procurement at 18.1.

The paper hypothesizes that financial negotiations demand stronger reasoning about risk, market dynamics, and adversarial pressure. That interpretation is plausible, though it should be handled carefully. The benchmark does not prove that models understand professional services “well” or financial assets “badly” in the real world. It shows that, under this task design, financial-asset negotiation produces lower GlobalScores for most evaluated models.

That distinction matters. A website development negotiation can often be handled with visible scope, timeline, price anchors, and service concessions. A business acquisition or luxury watch negotiation may involve valuation uncertainty, authenticity risk, asymmetric information, timing pressure, and market comparables. Generic bargaining scripts become thinner there.

For businesses, the practical takeaway is not “ban agents from financial negotiation.” It is narrower and more useful: classify negotiation domains by risk and reasoning depth. Low-stakes, quote-comparison tasks may be suitable for greater autonomy. High-stakes asset transactions need stronger human review, domain-specific valuation tools, and hard transaction limits.

The agent should not be trusted more just because the conversation sounds more professional. The most expensive mistakes are often written in excellent prose.

Personality Is an Experimental Variable, Not a Brand Voice Exercise

The paper also tests personality-based negotiation using Claude Opus 4.5 in the 1B1P1S (one buyer, one product, one seller) setting. This is best read as an exploratory extension, not the main evidence of the paper. Still, the result is useful because it shows how surface-level persona choices can materially affect outcomes.

The tested buyer personalities include Budget-Conscious, Experienced Bargain Hunter, and Busy Professional. Seller personalities include Friendly, Professional, and Aggressive. The best GlobalScore appears when a Budget-Conscious buyer negotiates with an Aggressive seller: 92.7. The weakest appears when a Busy Professional buyer negotiates with an Aggressive seller: 44.1.

That is not a small style effect. It suggests that negotiation “personality” changes concession timing, patience, and willingness to defend surplus. A busy professional persona may be efficient in language but weak in bargaining. It wants closure. Closure is expensive when the other side is aggressive.

For business deployment, this matters because many companies treat agent personality as branding: friendly, concise, helpful, professional. In negotiation, personality is not just brand tone. It is policy behavior. It changes how hard the agent pushes, how quickly it concedes, and whether it prioritizes speed over price.

A procurement agent should not be “busy” unless the company is intentionally paying for convenience. A sales agent should not be “aggressive” unless compliance and customer relationship risk have been modeled. Tone is not a skin on top of strategy. In autonomous negotiation, tone becomes part of strategy.

Sequential Versus Parallel Is Not the Main Bottleneck for Strong Models

AgenticPay compares sequential and parallel negotiation modes. Sequential mode handles negotiations one at a time. Parallel mode conducts multiple negotiations simultaneously.

For proprietary models, the difference is modest. Claude Opus 4.5, Gemini-3-Flash, and GPT-5.2 maintain perfect deal rates and zero overflow in both modes. Open-weight models show some improvement in parallel mode: Qwen3-14B GlobalScore rises from 54.5 to 58.9, and Llama-3.1-8B rises from 24.8 to 29.3. But Llama-3.1-8B’s overflow rate also roughly doubles, from 0.08 to 0.17.

This is a robustness-style test. It does not define the paper’s central thesis, but it clarifies a deployment question: should agents negotiate sequentially or in parallel?

For strong models, orchestration mode is not the binding constraint in this benchmark. For weaker models, parallelism can improve throughput and deal rate, but may worsen constraint adherence. That is exactly the kind of trade-off businesses should expect in production. Parallel negotiation can create better market coverage, but it also increases state-management burden.

The practical rule is simple: do not parallelize negotiation just because the API allows it. Parallelism should be gated by constraint checks, budget locks, and per-counterparty memory separation. Otherwise, the agent may become very efficient at making mistakes in several conversations at once.
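One way to picture a budget lock is a shared reservation ledger that every parallel negotiation must go through before committing to a price. The sketch below is an assumption about how such a control could look, not something described in the paper; the `BudgetLock` class and its methods are hypothetical.

```python
import threading


class BudgetLock:
    """Shared budget that parallel negotiations must reserve against.

    Hypothetical control: each negotiation thread calls reserve() before
    accepting a price, so concurrent deals cannot jointly exceed the budget.
    """

    def __init__(self, total: float) -> None:
        self._remaining = total
        self._lock = threading.Lock()

    def reserve(self, amount: float) -> bool:
        """Atomically reserve budget; return False if it would overspend."""
        with self._lock:
            if amount <= 0 or amount > self._remaining:
                return False
            self._remaining -= amount
            return True

    def release(self, amount: float) -> None:
        """Return a reservation when a negotiation falls through."""
        with self._lock:
            self._remaining += amount

    @property
    def remaining(self) -> float:
        with self._lock:
            return self._remaining
```

The point of the design is that the check and the deduction happen under one lock, so two conversations that each look affordable on their own cannot both be accepted if together they break the limit.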

The Last Mile Is Where Open Models Fail Quietly

The failure analysis is one of the paper’s most valuable parts because it separates ignorance from convergence failure.

For Qwen3-14B and Llama-3.1-8B, all reported failures are timeouts. The failures are not concentrated in one task structure; no configuration accounts for more than 22% of failures. That suggests the problem is not simply that one market type is impossible. It is a broader model-capability issue.

The near-miss analysis is even more revealing. Among failed negotiations, 43.5% of Qwen3-14B failures and 46.3% of Llama-3.1-8B failures occur when the minimum buyer-seller price gap is within 5 units. Within 10 units, the near-miss rates are 52.2% and 55.6%, respectively.

In plain language: many failed negotiations were close enough to close. The agents had discovered the neighborhood of agreement, but could not execute the final concession or acceptance.

That is not a knowledge failure. It is a control failure.

This distinction should shape business evaluation. A negotiation agent can look competent for 90% of the interaction and still fail at the economically decisive moment. It can discuss the product, cite alternatives, counteroffer, narrow the spread, and then miss the final step. Human managers may read the transcript and think the system “basically worked.” The ledger will disagree.

A serious deployment test should therefore include last-mile metrics:

| Metric | Why it matters |
| --- | --- |
| Minimum price gap before timeout | Shows whether the agent was close but failed to close |
| Final concession timing | Measures whether the agent can move when delay becomes costly |
| Walk-away discipline | Distinguishes strategic refusal from indecision |
| Price-bound violations | Captures whether the agent breaks hard constraints |
| Round-to-close distribution | Reveals whether success depends on long, fragile conversations |
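The first of these metrics, and the near-miss rate built on it, can be sketched in a few lines. The code assumes each failed negotiation is logged as a list of (buyer bid, seller ask) pairs per round and treats "within 5 units" as a gap of at most 5; both are assumptions for illustration, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def min_price_gap(history: Iterable[Tuple[float, float]]) -> float:
    """Smallest (seller ask - buyer bid) seen across the rounds of one negotiation."""
    return min(ask - bid for bid, ask in history)


def near_miss_rate(failed_gaps: List[float], threshold: float) -> float:
    """Share of failed negotiations whose minimum gap was at most `threshold`."""
    if not failed_gaps:
        raise ValueError("need at least one failed negotiation")
    return sum(1 for gap in failed_gaps if gap <= threshold) / len(failed_gaps)
```

For example, a negotiation whose offers went (700, 900), (780, 850), (810, 814) before timing out has a minimum gap of 4: the kind of failure that looks fine in a transcript and still produces no deal.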

Closing is a separate skill. AgenticPay makes that visible.

What Businesses Can Use From This Paper Now

AgenticPay is a research benchmark, not a ready-made procurement product. Still, it offers a practical evaluation template for firms thinking about agentic commerce.

The paper directly shows that current LLMs vary substantially in negotiation reliability under a controlled simulation. It also shows a persistent seller-side advantage, domain sensitivity, and convergence failure among weaker models.

Cognaptus inference: companies should treat AI negotiation as a controlled workflow problem, not merely a model-selection problem. The model matters, but so do market structure, role-specific prompts, approval gates, fallback options, and post-negotiation audits.

| Paper result | Business meaning | Control to add |
| --- | --- | --- |
| Frontier models close deals reliably in the benchmark | Model capability matters for negotiation autonomy | Require model-specific negotiation testing before deployment |
| BuyerScores lag SellerScores | AI purchasing agents may overpay or concede poorly | Add buyer-side reward shaping, budget locks, and overpayment audits |
| Multi-agent markets improve many outcomes | Alternatives can scaffold weaker reasoning | Use supplier pools, quote comparison, and fallback counterparties |
| Financial assets are harder for most models | Domain risk changes negotiation reliability | Use stricter human review for high-uncertainty or high-stakes assets |
| Personality affects outcomes | “Brand voice” changes economic behavior | Treat persona settings as policy parameters, not cosmetic choices |
| Near-miss failures are common in open models | Agents may understand the zone but fail to close | Track last-mile convergence and add acceptance heuristics |

The safest early use cases are not fully autonomous deal-making. They are structured negotiation support tasks: draft counteroffers, summarize counterparties’ positions, recommend concession ranges, detect violations of internal policy, and compare supplier quotes. Limited autonomy can come later, bounded by price caps, approval thresholds, and audit logs.
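As a rough picture of what "bounded by price caps, approval thresholds, and audit logs" could mean in code, here is a minimal guard that sits between the agent's proposed price and any commitment. Everything in it is hypothetical: the `PurchaseGuard` name, the three decision labels, and the example limits are assumptions, not a recommended policy.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PurchaseGuard:
    """Hard bounds applied to every price an agent wants to accept."""
    hard_cap: float            # never accept above this price
    approval_threshold: float  # above this, a human must approve
    audit_log: List[Tuple[float, str]] = field(default_factory=list)

    def review(self, price: float) -> str:
        """Classify a proposed price and record the decision."""
        if price <= 0 or price > self.hard_cap:
            decision = "block"
        elif price > self.approval_threshold:
            decision = "escalate"
        else:
            decision = "auto_approve"
        # Every decision is logged, including blocked ones.
        self.audit_log.append((price, decision))
        return decision
```

The guard does not make the agent a better negotiator. It only ensures that a bad negotiation ends in a blocked or escalated transaction rather than a payment.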

The dangerous use case is letting a fluent agent negotiate with real money, no hard bounds, and only transcript review. That is not innovation. That is delegation without accounting controls.

Boundaries: What AgenticPay Does Not Prove

The limitations are not decorative; they affect interpretation.

First, each task instance is executed once per model. That makes the benchmark useful as an initial comparative signal, but it does not fully measure variance across repeated runs, prompt variants, or stochastic decoding settings.

Second, the experiments use deterministic decoding with temperature 0 and a fixed seed. That improves comparability, but production agents may operate under different sampling, retrieval, tool-use, and memory conditions.

Third, the benchmark is simulated. The scenarios are realistic, but they are not live markets with legal obligations, reputational consequences, fraud risk, supplier relationships, or compliance review. A model that performs well in AgenticPay is not automatically safe to deploy as an autonomous commercial representative.

Fourth, the scoring function encodes a preference for balanced surplus and faster agreement. That is reasonable for evaluating cooperative deal quality, but not every business negotiation has the same objective. Some companies may prioritize margin, speed, supplier reliability, relationship preservation, or risk transfer.

Finally, the tested models are a snapshot. The specific leaderboard will age quickly. The evaluation logic will age more slowly.

That is the durable contribution: not that Claude Opus 4.5 beats Llama-3.1-8B on this benchmark, but that negotiation agents need to be evaluated as economic actors with constraints, incentives, and failure modes.

The Agent That Can Talk Is Not Yet the Agent That Should Trade

AgenticPay’s central lesson is simple enough to be uncomfortable: language fluency is not negotiation competence.

A capable commercial agent must protect private information, reason over counterparties, make timely concessions, avoid invalid offers, close when the bargaining zone is reached, and understand when the domain itself requires human review. These are not optional refinements. They are the difference between a helpful assistant and an expensive liability.

The paper is most useful when read as a deployment checklist. If a firm wants AI agents to buy, sell, procure, or negotiate, it should stop asking whether the model sounds persuasive. Persuasion is cheap. Surplus discipline is harder.

The future of agentic commerce will not be decided by whether LLMs can haggle. They can. The question is whether they can haggle without quietly donating your margin to the other side.

Notes

**Cognaptus: Automate the Present, Incubate the Future.**


  1. Xianyang Liu, Shangding Gu, and Dawn Song, “AgenticPay: A Multi-Agent LLM Negotiation System for Buyer–Seller Transactions,” arXiv:2602.06008, 2026. https://arxiv.org/abs/2602.06008