From Ballots to Budgets: Can LLMs Be Trusted as Social Planners?

TL;DR for operators

This paper asks a deceptively operational question: can an LLM act as a social planner when it must allocate a fixed budget across competing public projects? Not in the inspirational LinkedIn sense. In the literal sense: choose project IDs, stay within budget, maximise community utility, and return a valid allocation.

The authors use participatory budgeting as the testbed. That matters because participatory budgeting is already a mechanism for turning messy human preferences into funded projects. The paper therefore avoids the usual toy benchmark problem where the model “reasons” inside a sterile puzzle box. Here the model faces a recognisable management problem: competing initiatives, scarce capital, voter or stakeholder preferences, and feasibility constraints.¹

The headline result is not that LLMs now deserve the finance department’s approval stamp. Please, no. The useful result is narrower and more interesting: stronger models, especially Qwen2.5-72B, can produce reasonably high-utility allocations under budget constraints, and they sometimes perform better when preferences are expressed in natural language or inferred from metadata than when preferences are handed over as clean structured matrices. That is a useful crack in the wall between optimisation software and human preference gathering.

The catch is equally important. Models still fail in operationally familiar ways: they stop too early, leave large chunks of budget unused, follow a “loose” version of an algorithm they were explicitly told to execute, and sometimes produce explanations that do not match the final allocation. The final JSON is what gets actioned; the beautiful explanation does not buy the playground.

For business use, the correct architecture is not “LLM as autonomous allocator.” It is “LLM as preference interpreter and candidate generator, surrounded by deterministic constraint checks, optimisation baselines, audit logs, and human approval.” Less magical. More useful. A recurring theme, unfortunately for magic.

Budget allocation is where AI competence stops being theatrical

A budget meeting is a useful antidote to vague AI claims.

Everyone has preferences. Nobody has enough money. Some projects are cheap but marginal. Some are expensive but politically necessary. Some pairs should not be funded together because they compete for the same land, equipment, channel, staff, or executive patience. The decision-maker must turn partial, noisy, often badly expressed preferences into an allocation that is both acceptable and feasible.

That is why participatory budgeting is a better test for LLM decision-making than another riddle about coloured objects and imaginary people with hats. Participatory budgeting already has a mechanism: citizens or stakeholders submit projects, voters express preferences, and a planner selects a subset of projects under a budget constraint. The computational version is crisp enough to evaluate, but realistic enough to hurt.

The paper by Sankarshan Damle and Boi Faltings uses this mechanism in two ways. First, it treats participatory budgeting as a practical resource-allocation task for LLM agents. Second, it treats the same task as an adaptive reasoning benchmark. That second move is the more strategic one. If a benchmark can vary projects, budgets, preferences, formats, and constraints, it becomes harder for models to coast on memorised benchmark artefacts. It also lets evaluators ask questions that resemble actual deployment questions: does the model respect constraints, infer preferences, handle natural language, and degrade gracefully when the input is incomplete?

The paper’s contribution is therefore not “LLMs can do budgeting.” That would be a dangerously tidy summary. The contribution is that mechanism design gives us a way to test whether an LLM can sit inside a constrained decision process without embarrassing everyone in the room.

The mechanism: from votes to feasible project portfolios

The underlying participatory budgeting setup is simple enough to state and strict enough to expose model weakness.

There is a set of projects. Each project has a cost. There is a total budget. Voters express support for projects, and the planner must choose a subset of projects whose total cost does not exceed the budget. The allocation should maximise utility, where utility is derived from voter support. In the paper’s main setup, the reference point is the Utilitarian Greedy algorithm, which repeatedly selects the feasible project with the best utility-to-cost ratio until no further project can be added.

That greedy baseline is important because it plays two roles at once.

First, it is a practical allocation rule. It gives the model a procedural target: calculate utility per cost, rank projects, add feasible projects, skip infeasible ones, stop only when no more feasible additions exist.
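The procedural target described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Project` record, the `utilitarian_greedy` name, the alphabetical tie-break, and the toy numbers are all assumptions, and utility is taken to be a simple count of supporting votes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    pid: str
    cost: float   # assumed to be strictly positive
    votes: int    # assumed utility: number of voters supporting the project


def utilitarian_greedy(projects, budget):
    """Scan projects by votes-per-cost and fund each one that still fits."""
    # Highest ratio first; ties broken alphabetically (an assumption).
    ranked = sorted(projects, key=lambda p: (-p.votes / p.cost, p.pid))
    selected, remaining = [], budget
    for p in ranked:
        if p.cost <= remaining:      # feasible: add it
            selected.append(p.pid)
            remaining -= p.cost
        # infeasible: skip it and keep scanning the rest of the ranking
    return selected


# Hypothetical toy instance.
projects = [
    Project("park", 60, 90),
    Project("library", 50, 60),
    Project("bike_lane", 30, 40),
    Project("mural", 10, 5),
]
allocation = utilitarian_greedy(projects, budget=100)
print(allocation)  # ['park', 'bike_lane', 'mural']
```

Note that the loop never breaks early: the library is skipped because it no longer fits, but the cheap mural further down the ranking is still funded. That skip-and-continue behaviour is the step models tend to drop.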

Second, it is an evaluation baseline. The paper reports normalised average utility, meaning the utility of the LLM’s allocation divided by the utility of the greedy allocation. A score of 1 would match the greedy baseline. A score below 1 means the model left utility on the table relative to that baseline.
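As a rough sketch of that ratio for a single instance, assuming utility is the sum of votes for the selected projects (the function names and toy data are hypothetical, and the averaging across instances and runs is not shown):

```python
def utility(allocation, votes):
    """Total voter support for the selected project IDs."""
    return sum(votes[pid] for pid in allocation)


def normalised_utility(llm_allocation, greedy_allocation, votes):
    """Utility of the LLM allocation divided by the greedy baseline's utility."""
    baseline = utility(greedy_allocation, votes)
    if baseline == 0:
        raise ValueError("greedy baseline has zero utility")
    return utility(llm_allocation, votes) / baseline


# Hypothetical toy data: the model stopped one project short of greedy.
votes = {"park": 90, "library": 60, "bike_lane": 40, "mural": 5}
score = normalised_utility(
    ["park", "bike_lane"], ["park", "bike_lane", "mural"], votes
)
print(round(score, 3))  # 0.963
```

Because the denominator is a heuristic rather than the true optimum, a score above 1 is possible in principle whenever greedy itself is suboptimal.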

The model is also evaluated on instruction-following. This is not etiquette. It is deployment safety in boring clothing. Instruction-following checks whether the output is in the requested format and whether the allocation respects the budget. Later, when the authors introduce project conflicts, instruction-following also includes the requirement not to select conflicting projects together.
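A deterministic check of this kind might look like the sketch below. The JSON shape (an object with an `allocation` list of project IDs) is an assumption for illustration; the paper's exact output schema is not reproduced here, and `follows_instructions` is a hypothetical name.

```python
import json


def follows_instructions(raw_output, costs, budget):
    """True only if the output parses, names known projects once each,
    and the selected projects fit within the budget."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False                      # not valid JSON at all
    if not isinstance(payload, dict):
        return False
    allocation = payload.get("allocation")
    if not isinstance(allocation, list):
        return False                      # missing or malformed field
    if any(not isinstance(pid, str) or pid not in costs for pid in allocation):
        return False                      # unknown or non-string project ID
    if len(set(allocation)) != len(allocation):
        return False                      # same project listed twice
    return sum(costs[pid] for pid in allocation) <= budget
```

The function looks only at the final output object and ignores any accompanying reasoning text, which is the standard the paper applies when it scores instruction-following.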

This is the paper’s central evaluation frame:

Mechanism component	What the model must do	Why it matters operationally
Preference input	Read or infer stakeholder preferences	Most organisations do not have perfectly structured preference matrices lying around, because life is inconsiderate
Budget constraint	Select only projects whose total cost fits the budget	A clever infeasible allocation is still infeasible
Utility objective	Approximate high-social-welfare selection	The model must do more than produce plausible prose
Output discipline	Return valid project IDs in the required format	Downstream systems need parsable, auditable decisions
Constraint extension	Handle conflicts between projects	Real portfolios contain interactions, not just independent line items

The mechanism-first view matters because it changes the article’s question. We are not asking whether an LLM “understands budgeting.” We are asking where, inside a known allocation mechanism, language models add value, and where they need adult supervision, preferably deterministic and humourless.

The real test is not clean votes. It is messy preference input

The paper tests three versions of the participatory budgeting instance.

The first is the Plain PB Instance, or PPI. This is the clean version. The model receives project details, numerical voter preferences, and budget constraints.

The second is the Vote-Removed PB Instance, or VRPI. Here the votes are removed. The model receives project descriptions, budget information, and voter metadata such as age, sex, and education. It must infer likely preferences from that sparse context.

The third is the Natural Language Votes PB Instance, or NLVPI. Here the original numerical votes are translated into natural-language descriptions. The model has preference information, but it arrives in a format closer to what humans might actually submit.

This design directly challenges a common assumption: that LLMs are most useful only after humans have already structured the world into neat tables. For conventional optimisation, that assumption is often true. If you already have a clean objective function and a clean constraint set, use an optimiser. It will not hallucinate a municipal park.

But LLMs may be valuable one step earlier: at the boundary where preferences are still text, metadata, comments, justifications, stakeholder interviews, proposal descriptions, and survey fragments. The paper’s most interesting result appears there. Across the main 70B/72B-class models, natural-language and inferred-preference setups often match or outperform the structured-input version on average utility. The average normalised utility across all models and methods is 0.67 for PPI, 0.68 for VRPI, and 0.70 for NLVPI.

That does not mean natural language is magically better than structured data. The paper’s smaller-model ablations push against that lazy conclusion. In the 7B–14B experiments, smaller models generally struggle more with preference inference and natural-language variants, while doing relatively better when explicit voter preferences are provided. In other words: the “messy input advantage” is not universal. It appears more plausible for stronger models, under this experimental setup, with these instances.

That is a useful distinction for operators. If the model is weak, structured input may still be the safer route. If the model is strong, the value may lie in reducing the cost of preference elicitation.

Three prompting strategies, three different failure surfaces

The authors test three prompt strategies: Greedy, Optimization, and Hill Climb.

Greedy asks the model to mimic the Utilitarian Greedy algorithm directly. This sounds easy: compute utility-to-cost ratios, sort, add feasible projects, stop when no more can fit. Of course, “sounds easy” is how many LLM evaluation failures begin.

Optimization frames the task as a knapsack-style constrained optimisation problem. It still uses utility-to-cost logic, but the prompt encourages a broader optimisation framing.

Hill Climb asks the model to generate smaller candidate subsets, merge them into a candidate pool, and then build the final allocation through a greedy local-search-like process.

The performance pattern is instructive. Greedy prompting often underperforms because models do not reliably execute the greedy algorithm as specified. They loosely follow the order, omit projects, reorder candidates, or stop early. The paper includes an example where LLaMA-3.1-70B selects projects costing only USD 137,000 despite a USD 500,000 budget, while the greedy baseline uses USD 481,400. That is not a rounding error. That is leaving the meeting before the agenda is finished.
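In the paper's example, that is 137,000 / 500,000 = 27.4% of the budget spent, against 481,400 / 500,000 = 96.3% for the greedy baseline. Premature stopping is cheap to detect mechanically. The sketch below is one hypothetical way to flag it: report how much budget was used and which unselected projects would still have fit. The function name and toy costs are assumptions.

```python
def unused_budget_report(allocation, costs, budget):
    """Summarise budget use and list unselected projects that still fit."""
    spent = sum(costs[pid] for pid in allocation)
    remaining = budget - spent
    still_affordable = sorted(
        pid for pid, cost in costs.items()
        if pid not in allocation and cost <= remaining
    )
    return {
        "utilisation": spent / budget,
        "remaining": remaining,
        "still_affordable": still_affordable,
    }


# Hypothetical toy instance: the model stopped after one project.
costs = {"a": 40, "b": 35, "c": 20, "d": 90}
report = unused_budget_report(["a"], costs, budget=100)
print(report["still_affordable"])  # ['b', 'c']
```

A non-empty `still_affordable` list means the stopping condition of the greedy rule was not met, whatever the model's explanation says.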

Optimization and Hill Climb generally improve results, especially for the weaker of the large models. Qwen2.5-72B under Optimization reaches a normalised utility of 0.896 on PPI, 0.820 on VRPI, and 0.792 on NLVPI. Under Hill Climb, Qwen2.5-72B reaches 1.000 instruction-following on NLVPI, with a normalised utility of 0.774. LLaMA-3.1-70B also benefits from Hill Climb, reaching 0.842 normalised utility on NLVPI.

The mechanism-level interpretation is that prompting does not merely “explain the task.” It changes the model’s search behaviour. A prompt that says “follow this greedy algorithm” may not cause the model to execute the algorithm faithfully. A prompt that encourages optimisation or staged refinement may produce better feasible allocations because it gives the model a more forgiving reasoning scaffold.

That is not an argument for prompt poetry. It is an argument for treating prompt design as part of the control system.

Prompt strategy	Likely purpose in the paper	What it supports	What it does not prove
Greedy	Test whether models can imitate a known allocation algorithm	Models can sometimes approximate procedural allocation, but often execute it loosely	That LLMs can reliably run algorithms from instructions
Optimization	Test whether optimisation framing improves allocation quality	Stronger models can achieve higher utility under explicit constrained optimisation framing	That the model has solved the underlying optimisation problem exactly
Hill Climb	Test whether staged candidate generation improves search	Decomposition can improve utility and instruction-following in several settings	That local search is inherently superior across larger or harder PB instances
Conflict extension	Test benchmark adaptability under additional constraints	PB can be modified to test richer real-world feasibility	That LLMs are ready for unsupervised multi-constraint allocation

The obvious summary would say “Optimization and Hill Climb perform better.” The useful interpretation is sharper: LLM allocation quality depends heavily on how the decision procedure is represented at inference time. That is exactly the kind of dependency that production systems must measure, not assume.

Qwen2.5-72B is strongest, but the pattern is not a coronation

Across the main experiments, Qwen2.5-72B offers the best trade-off between utility and instruction-following. The paper’s instruction-following and average-utility tables show it performing strongly across PPI, VRPI, and NLVPI, particularly under Optimization and Hill Climb.

The model’s advantage is clearest when the task requires both feasible output and useful allocation. For example, in the paper’s figure plotting instruction-following (IF) against average utility (AU), Qwen2.5-72B under Optimization and Hill Climb shows high combined values, especially in NLVPI. This matters because utility without feasibility is theatre, and feasibility without utility is bureaucratic tidiness. Operators need both.

But the model ranking is not the only lesson.

LLaMA-3.1-70B-NV sometimes performs very well on instruction-following, including 0.96 on PPI under Optimization and 0.97 on PPI under Hill Climb. Yet high instruction-following does not guarantee high utility. Under Optimization on PPI, LLaMA-3.1-70B-NV has strong IF but only 0.505 normalised utility. The model obeys the envelope while leaving value inside it. Very corporate, but not ideal.

The reverse pattern also appears: a model may produce high-utility allocations in some settings while being less consistent on format or constraints. That is why the paper’s paired use of AU and IF is valuable. It prevents the evaluator from being seduced by a single metric.

For business readers, this is the procurement lesson: model selection for decision support cannot be reduced to leaderboard rank. You need task-specific tests that measure both outcome quality and operational validity. If the model gives a high-value recommendation that violates budget, conflict, compliance, or approval constraints, the recommendation is not high-value. It is a nicely formatted problem.

The faithfulness problem: the explanation can be right while the allocation is wrong

One of the paper’s most practically relevant observations is about faithfulness.

The authors ask models to return both an allocation and a reasoning trace. In one sample output, LLaMA-3.1-70B initially selects projects whose total cost exceeds the budget. The model then explains that it adjusted the selection to fit the budget. But the final allocation field still contains the over-budget selection.

This is a particularly dangerous failure mode because the reasoning looks corrective while the action remains invalid. The narrative says, in effect, “I fixed it.” The final object says, “No, you didn’t.”

The paper treats this correctly. Its instruction-following metric marks such cases as failures because the final allocation is what would be executed in a real system. That is exactly the right operational standard. No procurement system should care that the model’s chain of reasoning had a moment of self-awareness if the final purchase order still exceeds budget.

This has a broader implication for AI governance. Explanations are not controls. They are evidence, and sometimes bad evidence. A reasoning trace can help debug behaviour, but it should not be trusted as proof that the output satisfies constraints. Hard validation must operate on the final output object.

For organisations building LLM-assisted planning workflows, this means every allocation recommendation needs at least four post-processing checks:

Check	Purpose
Budget validation	Confirm total selected cost is within the available budget
Constraint validation	Confirm project conflicts, eligibility rules, and capacity limits are respected
Objective comparison	Compare the recommendation against deterministic baselines or exact solvers where available
Explanation-output consistency	Check whether the rationale supports the final selected set, not merely something nearby

The last check is optional only if nobody reads the explanation. Since people do read explanations, and worse, believe them, it is not optional.

Preference inference is promising, but it is not mind-reading

The VRPI setup is where the paper makes its most provocative move. It removes explicit votes and gives the model only project information plus voter metadata. The model must infer likely preferences and then allocate projects accordingly.

The authors evaluate this using fraction overlap: how much the model’s inferred high-preference projects overlap with the actual high-vote projects from the original structured instance. They report top-75%, top-50%, and top-25% overlap variants. The main text focuses on top-75% overlap.
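One plausible reading of that metric is sketched below. The paper's exact definition may differ; this version assumes the model's inferred preferences are available as one score per project, takes the top fraction of projects under each ranking, and reports the share of the actual top set that the inferred top set recovers. All names and numbers are hypothetical.

```python
import math


def top_fraction(scores, fraction):
    """IDs of the highest-scoring projects, keeping ceil(fraction * n) of them."""
    k = math.ceil(fraction * len(scores))
    ranked = sorted(scores, key=lambda pid: (-scores[pid], pid))
    return set(ranked[:k])


def fraction_overlap(inferred_scores, actual_votes, fraction):
    """Share of the actual top-fraction projects also in the inferred top set."""
    actual_top = top_fraction(actual_votes, fraction)
    inferred_top = top_fraction(inferred_scores, fraction)
    if not actual_top:
        return 0.0
    return len(actual_top & inferred_top) / len(actual_top)


# Hypothetical toy data.
actual_votes = {"park": 90, "library": 60, "bike_lane": 40, "mural": 5}
inferred = {"park": 0.9, "library": 0.2, "bike_lane": 0.7, "mural": 0.5}
print(fraction_overlap(inferred, actual_votes, 0.50))  # 0.5
```

In this toy case the model ranks the park correctly but underrates the library, so half of the actual top-50% set is recovered.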

The top-75% results are mixed but meaningful. LLaMA-3.1-70B-NV under Optimization reaches 0.731 overlap with 0.656 normalised utility. Qwen2.5-72B under Optimization reaches lower top-75% overlap, 0.653, but higher normalised utility, 0.820. That difference is important. Correctly identifying popular projects and producing the highest-utility feasible portfolio are related, but not identical. Costs and constraints mediate the translation from preferences to allocation.

The paper links this to Theory of Mind inference: the ability to infer another agent’s preferences from context. For business use, I would phrase it less grandly. The model is not reading minds. It is performing preference imputation from limited metadata and project descriptions. That is still useful. It is just not telepathy, despite what a vendor deck may imply after two espressos.

The right business application is not to replace stakeholder consultation with demographic guessing. That would be both ethically dubious and operationally lazy. The better application is to use LLMs to fill gaps, flag likely preference clusters, summarise unstructured input, and generate candidate allocations when preference collection is incomplete or costly.

This distinction matters because metadata-based inference can encode bias, stereotype, or spurious correlation. The paper’s metadata is limited to age, sex, and education. More fine-grained information might improve prediction, as the authors note, but it could also increase governance burden. The more personal the inference layer becomes, the more careful the audit trail must be.

Project conflicts make the benchmark more realistic—and expose the baseline

The paper extends the participatory budgeting setup by adding project conflicts. Two projects may be individually attractive but impossible to fund together. This resembles many real portfolio decisions: mutually exclusive product features, incompatible infrastructure choices, overlapping real estate uses, duplicate vendor systems, or two executives trying to claim the same budget line. Nature is healing.

To handle conflicts, the authors modify the greedy baseline so that it selects projects only if they fit within budget and do not conflict with already selected projects. This creates a richer benchmark because instruction-following now requires three things: correct format, budget feasibility, and conflict feasibility.
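The conflict part of that check is simple to make deterministic. A minimal sketch, assuming conflicts are supplied as pairs of project IDs (the representation and the `conflict_violations` name are assumptions, not the paper's data format):

```python
def conflict_violations(allocation, conflicts):
    """Return every conflicting pair whose two projects were both selected."""
    chosen = set(allocation)
    return sorted(
        tuple(sorted(pair)) for pair in conflicts
        if set(pair) <= chosen
    )


# Hypothetical toy instance: a park and a car park compete for the same plot.
conflicts = [("park", "car_park"), ("library", "gym")]
violations = conflict_violations(["park", "car_park", "library"], conflicts)
print(violations)  # [('car_park', 'park')]
```

An empty list means the allocation is conflict-free. Returning the offending pairs rather than a bare boolean makes the failure easier to audit.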

Qwen2.5-72B again performs best in this extended setting. In PPI with conflicts, it reaches 0.865 normalised AU and 0.778 IF. In VRPI with conflicts, it again reaches 0.865 normalised AU, though IF drops to 0.681. The stricter constraint environment narrows the interpretation: the model can produce allocations close to the modified greedy baseline, but consistency remains imperfect.

The theoretical appendix adds a useful twist. Once project conflicts are introduced, the modified greedy baseline is no longer as strong. The authors prove that its welfare can be unboundedly worse than the optimal allocation, and that it is not optimal up to one project. That means the conflict experiment is not simply “LLM versus strong algorithm.” It is partly a demonstration that the benchmark itself can be adapted, and that baseline choice becomes more delicate as constraints become more realistic.
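To make the intuition concrete, here is a small constructed instance. It is an illustration of the effect, not the authors' proof, and every name in it is hypothetical. One "star" project has the best votes-per-cost ratio but conflicts with n rivals; a conflict-aware greedy rule takes the star and is then locked out of everything else, while a brute-force search funds all the rivals.

```python
from itertools import combinations


def feasible(subset, costs, budget, conflicts):
    """Within budget and no conflicting pair selected together."""
    chosen = set(subset)
    within_budget = sum(costs[p] for p in chosen) <= budget
    conflict_free = not any(a in chosen and b in chosen for a, b in conflicts)
    return within_budget and conflict_free


def greedy_with_conflicts(costs, utils, budget, conflicts):
    """Add projects by utility-per-cost if they stay feasible."""
    ranked = sorted(costs, key=lambda p: (-utils[p] / costs[p], p))
    selected = []
    for p in ranked:
        if feasible(selected + [p], costs, budget, conflicts):
            selected.append(p)
    return selected


def brute_force_optimum(costs, utils, budget, conflicts):
    """Exhaustive search; only viable for very small instances."""
    best, best_utility = [], 0
    ids = sorted(costs)
    for r in range(len(ids) + 1):
        for subset in combinations(ids, r):
            total = sum(utils[p] for p in subset)
            if total > best_utility and feasible(subset, costs, budget, conflicts):
                best, best_utility = list(subset), total
    return best


def welfare_gap(n):
    """Optimal utility divided by greedy utility on the star instance (n >= 1)."""
    costs = {"star": 1, **{f"r{i}": 1 for i in range(n)}}
    utils = {"star": 3, **{f"r{i}": 2 for i in range(n)}}
    conflicts = [("star", f"r{i}") for i in range(n)]
    greedy = greedy_with_conflicts(costs, utils, n, conflicts)
    best = brute_force_optimum(costs, utils, n, conflicts)
    return sum(utils[p] for p in best) / sum(utils[p] for p in greedy)


print(welfare_gap(3), welfare_gap(6))  # 2.0 4.0
```

In this construction greedy always earns 3 while the optimum earns 2n, so the gap grows without limit as rivals are added. A normalised utility near 1 against such a baseline says little about distance from the true optimum.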

This is a valuable lesson for enterprise evaluation. When you add real-world constraints, your old baseline may stop being a reliable gold standard. A heuristic that works well for independent projects may become weak when projects interact. If an LLM beats a weak heuristic, that is interesting, but it is not victory over optimisation. It may simply mean the heuristic brought a spoon to a knife fight.

What the paper directly shows

The paper directly supports four claims.

First, participatory budgeting can function as a practical LLM resource-allocation task and as an adaptive benchmark. The same framework can vary preference format, project set, budget, and constraints.

Second, prompt strategy materially affects performance. Greedy prompting often causes loose procedural imitation and premature stopping. Optimization and Hill Climb prompts generally improve normalised utility and sometimes instruction-following.

Third, stronger models can use unstructured or incomplete preference signals. In the main large-model experiments, natural-language votes and vote-removed metadata cases often perform as well as or better than the structured PPI setting on average utility.

Fourth, constraint extensions reveal both model behaviour and benchmark fragility. Adding project conflicts makes instruction-following stricter and demonstrates that a greedy baseline can become theoretically weak.

Those are already useful findings. They are enough. We do not need to inflate them into “AI can govern cities now,” a sentence that should cause immediate procurement review.

What Cognaptus infers for business use

The business inference is that LLMs are more plausible as translation and mediation layers than as final allocators.

In many organisations, resource allocation does not fail because nobody owns Excel. It fails because preferences are fragmented across emails, proposals, workshops, stakeholder calls, customer comments, Jira tickets, board memos, and political realities nobody wants to put in the deck. Classic optimisation methods need structured inputs. LLMs may help create those inputs or produce candidate portfolios from them.

A sensible enterprise architecture would look like this:

Unstructured stakeholder input
        ↓
LLM preference extraction / clustering / imputation
        ↓
Structured preference and constraint representation
        ↓
Deterministic optimisation or benchmark heuristics
        ↓
LLM-generated explanation and scenario comparison
        ↓
Hard validation, audit log, human approval

The LLM is useful at the language-heavy edges: extracting preferences, translating proposals into project attributes, generating candidate allocations, explaining trade-offs, and comparing scenarios for decision-makers. The deterministic layer remains responsible for feasibility, exact constraint checking, baseline comparison, and auditability.

This division of labour is not glamorous. That is why it might work.

Where the result should not be overextended

Several boundaries matter.

The experiments use 24 Mechanical Turk PB instances from Pabulib, selected partly because they are in English, small enough for context windows, and include metadata. This makes the setup appropriate for controlled evaluation, but not automatically representative of every municipal, corporate, or humanitarian allocation environment.

The natural-language vote representation is generated from structured votes using GPT-4o. That helps simulate natural-language preference input, but it is not the same as raw citizen comments, messy workshop notes, or adversarial stakeholder submissions. Real human preference text is less polite, less consistent, and often far more strategic. A tragedy for parsers, but a fact.

The VRPI metadata is limited. Age, sex, and education may be enough to test preference inference, but they are a thin basis for real decisions. Using such attributes in actual allocation workflows would raise fairness, privacy, and governance questions. The paper’s result should be read as evidence of technical capability, not a recommendation to allocate services based on demographic inference.

The prompt space is also not exhausted. The authors note that prompts may not be fully optimised. Better representations, different decoding settings, larger context windows, retrieval, tool use, or solver-integrated workflows could change performance.

Finally, the paper evaluates a single mechanism family. Participatory budgeting is a strong testbed, but resource allocation includes many other structures: scheduling, procurement, capacity planning, grant review, portfolio optimisation, claims triage, and emergency logistics. The mechanism-first lesson transfers more safely than the exact numbers.

The practical governance rule: never action the allocation before the validator approves it

The most operationally useful rule from this paper is simple:

Do not let the language model be the final constraint checker.

The model can suggest. It can infer. It can translate. It can compare. It can explain. It can produce a candidate allocation that is surprisingly reasonable. But the system should still validate the final project set against budget, conflicts, eligibility, and any hard business rules.

This is not distrust for theatrical effect. It follows directly from the paper’s failure cases. A model may say it followed the greedy rule when it only loosely followed it. It may say it adjusted the selection to fit the budget when the final allocation remains over budget. It may stop early and leave value unused. It may perform well on average but fail on the instance that happens to be yours.

A production workflow should therefore log at least the following:

Artefact	Why it should be logged
Original stakeholder input	To preserve the source of inferred preferences
Extracted preference representation	To audit how text became structure
Candidate allocation	To compare model recommendations across runs or models
Deterministic validation result	To prove feasibility before action
Baseline allocation	To know whether the LLM improved anything
Human override or approval	To maintain accountability where stakes justify it
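One way to keep those artefacts together is a single append-only record per recommendation. The sketch below is a hypothetical shape, not a prescribed schema: the field names, the `pending` default, and the JSON-lines serialisation are all assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AllocationAuditRecord:
    stakeholder_input: str          # original text the preferences came from
    extracted_preferences: dict     # how text became structure
    candidate_allocation: list      # what the model recommended
    validation_passed: bool         # result of the deterministic checks
    baseline_allocation: list       # e.g. the greedy allocation, for comparison
    human_decision: str = "pending" # approval or override, filled in later
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_line(self):
        """Serialise to one JSON line suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example record.
record = AllocationAuditRecord(
    stakeholder_input="Residents asked for safer cycling routes.",
    extracted_preferences={"bike_lane": 40, "park": 90},
    candidate_allocation=["park", "bike_lane"],
    validation_passed=True,
    baseline_allocation=["park", "bike_lane", "mural"],
)
line = record.to_json_line()
```

Recording the validation result as its own field, separate from the model's explanation, keeps the proof of feasibility independent of anything the model wrote.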

If that sounds like governance overhead, yes. That is what serious deployment looks like after the demo ends.

Conclusion: the value is not autonomous budgeting, but disciplined preference machinery

This paper is valuable because it refuses to test LLMs only where language is easy and consequences are imaginary. Participatory budgeting forces the model into a constrained mechanism: preferences in, projects out, budget respected. There is nowhere to hide except in an invalid JSON object, and the paper checks for that too.

The strongest insight is not that LLMs beat algorithms. They generally do not, and when the baseline weakens under conflicts, interpretation becomes more delicate. The stronger insight is that LLMs may help bridge the gap between human preference expression and formal allocation machinery. That is a real operational gap. It shows up in civic planning, enterprise portfolio management, grant allocation, procurement, product roadmapping, and any setting where many people want many things and the budget remains stubbornly finite.

So, can LLMs be trusted as social planners?

Not as autonomous planners. Not yet, and probably not in the clean end-to-end form vendors will enjoy drawing on slides.

But as preference interpreters, candidate generators, and explanation layers inside a validated allocation system? Yes, cautiously—and more importantly, testably. Participatory budgeting gives us a way to measure that usefulness before the model touches real money.

Which is good. Budgets are where AI optimism goes to meet arithmetic.

Cognaptus: Automate the Present, Incubate the Future.

¹ Sankarshan Damle and Boi Faltings, “LLMs for Resource Allocation: A Participatory Budgeting Approach to Inferring Preferences,” arXiv:2508.06060, full version published in the Proceedings of the 28th European Conference on Artificial Intelligence (ECAI 2025), https://arxiv.org/abs/2508.06060.
