When you think of AI in public decision-making, you might picture chatbots handling service requests or predictive models flagging infrastructure risks. But what if we let large language models (LLMs) actually allocate resources—acting as digital social planners? That’s exactly what this new study tested, using Participatory Budgeting (PB) both as a practical decision-making task and a dynamic benchmark for LLM reasoning.

Why Participatory Budgeting Is the Perfect Testbed

PB is more than a budgeting exercise. Citizens propose and vote on projects—parks, public toilets, community centers—and decision-makers choose a subset to fund within a fixed budget. It’s a constrained optimization problem with a human twist: budgets, diverse preferences, and sometimes mutually exclusive projects.
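
Stripped to its core, and setting conflicts aside for a moment, the funding decision is a 0/1 knapsack. In illustrative notation (ours, not necessarily the paper's), let x_p say whether project p is funded, u_p be its aggregate utility (for example, total approval votes), c_p its cost, and B the budget:

    \max_{x} \; \sum_{p} u_p \, x_p
    \text{subject to} \quad \sum_{p} c_p \, x_p \le B, \qquad x_p \in \{0, 1\}

The human twist is everything this formulation hides: where u_p comes from, how preferences are expressed, and which projects can coexist.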

For LLMs, PB offers:

  • Clear constraints (budget, conflicts)
  • Multiple input formats (numerical votes, natural language, inferred preferences)
  • Dynamic complexity that static benchmarks lack

This is a richer, more realistic way to probe whether LLMs can reason under constraints without “cheating” via data memorization.

Three Ways to Frame the Same Problem

The researchers tested three PB setups:

  1. PPI – All data is structured: numerical votes, costs, and budgets.
  2. VRPI – No votes, just project details and voter demographics; the LLM must infer preferences.
  3. NLVPI – Preferences expressed in natural language rather than numbers.
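
To see how different these framings are in practice, here is one toy instance rendered in each format. This is a hypothetical sketch in Python; the field names and values are illustrative, not the paper's actual schema.

    # One toy PB instance, presented the three ways the study distinguishes.
    budget = 100_000
    projects = [
        {"id": "park", "cost": 60_000},
        {"id": "public_toilets", "cost": 30_000},
        {"id": "library_wing", "cost": 50_000},
    ]

    # PPI: everything is structured, including numerical (approval) votes.
    ppi_input = {
        "budget": budget,
        "projects": projects,
        "votes": {"park": 412, "public_toilets": 305, "library_wing": 368},
    }

    # VRPI: no votes at all; the model must infer preferences from project
    # details plus a description of who the voters are.
    vrpi_input = {
        "budget": budget,
        "projects": projects,
        "voter_profile": "Mostly young families; many residents commute by bike.",
    }

    # NLVPI: preferences arrive as free-text statements instead of numbers.
    nlvpi_input = {
        "budget": budget,
        "projects": projects,
        "voter_statements": [
            "Our kids have nowhere to play after school.",
            "Public toilets near the market would help everyone.",
        ],
    }

The allocation task is identical in all three cases; only the evidence the model gets about preferences changes.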

And three prompting strategies:

  • Greedy – Mimic the Utilitarian Greedy algorithm, picking projects by the highest utility-to-cost ratio until the budget is spent (sketched in code after this list).
  • Optimization – Frame the allocation explicitly as a knapsack problem, encouraging more global trade-off thinking.
  • Hill Climb – Build candidate subsets and iteratively improve them.
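
For reference, the Utilitarian Greedy baseline that the Greedy prompt mimics is simple to state: rank projects by utility-to-cost ratio, fund what fits, and keep scanning rather than stopping at the first project that does not fit. The sketch below is an illustrative implementation, not the authors' code.

    def utilitarian_greedy(projects, budget):
        """projects: list of (name, utility, cost) tuples; returns funded names."""
        funded, remaining = [], budget
        # Highest utility per unit of cost first.
        ranked = sorted(projects, key=lambda p: p[1] / p[2], reverse=True)
        for name, utility, cost in ranked:
            if cost <= remaining:  # skip what doesn't fit; don't stop early
                funded.append(name)
                remaining -= cost
        return funded

    print(utilitarian_greedy(
        [("park", 412, 60_000), ("public_toilets", 305, 30_000),
         ("library_wing", 368, 50_000)],
        budget=100_000,
    ))  # -> ['public_toilets', 'library_wing']

Those last two steps, skipping what does not fit and continuing until the budget is exhausted, are exactly where the study found LLMs tended to go off-script.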

What Surprised the Researchers

One might expect that structured numeric votes (PPI) would produce the best results. Instead, richer inputs often beat bare numbers:

  • In many cases, natural language preferences (NLVPI) and metadata-based inference (VRPI) led to higher average utility than the fully structured vote data (PPI).
  • Qwen2.5-72B consistently balanced instruction-following (producing feasible, budget-respecting allocations) and utility maximization.
  • Greedy often underperformed—not because the idea is bad, but because LLMs applied it loosely, skipping viable projects or stopping early.

In other words: giving the model more context—even unstructured—can be more valuable than giving it perfect spreadsheets.

Beyond the Basic Budget

To test adaptability, the team added project conflicts (think: can’t build a library and a sports complex on the same site). This created a Disjunctively Constrained Knapsack Problem. Even here, the framework held up, and some LLMs outperformed the modified greedy baseline.
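
In knapsack terms, each conflicting pair (i, j) just adds one constraint: at most one of the two can be funded (x_i + x_j <= 1). The paper's modified greedy baseline is not reproduced here, but one natural adaptation, sketched below, is to skip any project that clashes with something already funded.

    def greedy_with_conflicts(projects, budget, conflicts):
        """projects: (name, utility, cost) tuples; conflicts: set of frozenset pairs."""
        funded, remaining = [], budget
        ranked = sorted(projects, key=lambda p: p[1] / p[2], reverse=True)
        for name, utility, cost in ranked:
            clashes = any(frozenset((name, other)) in conflicts for other in funded)
            if cost <= remaining and not clashes:
                funded.append(name)
                remaining -= cost
        return funded

    # The library and the sports complex compete for the same site.
    print(greedy_with_conflicts(
        [("library", 400, 50_000), ("sports_complex", 380, 45_000),
         ("park", 300, 40_000)],
        budget=100_000,
        conflicts={frozenset(("library", "sports_complex"))},
    ))  # -> ['sports_complex', 'park'], total utility 680

In this toy case greedy is no longer safe: funding the library and the park instead yields utility 700 within the same budget. That gap is the kind of global trade-off the Optimization and Hill Climb prompts encourage, and it is what leaves room for an LLM to beat a greedy baseline at all.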

This matters because real-world allocation is rarely just “pick the top items until the money’s gone”—constraints can be political, spatial, or strategic.

Why This Matters for Business and Policy

The study’s deeper message isn’t about PB per se—it’s about evaluating LLMs in realistic decision-making environments:

  • Richer context improves performance – Unstructured, narrative, or inferred signals may make AI reasoning more human-aligned.
  • Prompt strategy matters – Optimization and iterative refinement outperform naive algorithm mimicry.
  • Adaptive benchmarks are critical – Static leaderboards miss whether a model can adapt to evolving constraints.

For organizations exploring AI in strategic planning, procurement, or grant allocation, this research suggests that:

  1. You can start with unstructured stakeholder input—AI can still produce viable allocations.
  2. Test multiple reasoning prompts, not just the obvious algorithmic mimic.
  3. Incorporate evolving constraints into evaluation to simulate real-world volatility.

Cognaptus: Automate the Present, Incubate the Future