When Democracy Meets the Algorithm: Auditing Representation in the Age of LLMs
Agenda-setting is where participation quietly becomes power.
Anyone can invite hundreds of people to submit questions. That part is now cheap. The difficult part arrives ten minutes later, when an expert panel has time to answer only seven of them, and someone has to decide which seven. This is the small administrative hinge on which democratic legitimacy loves to swing. A moderator chooses. A platform ranks. An LLM summarises. Everyone else is told, usually with a straight face, that their concerns were “reflected”.
Reflected where, exactly?
The paper Question the Questions: Auditing Representation in Online Deliberative Processes tackles that hinge directly [1]. Its contribution is not another optimistic sermon about AI rescuing civic life from the tragic burden of reading. The useful part is sharper: it defines a way to audit whether a selected slate of questions leaves a large, coherent group of participants effectively uncovered.
That distinction matters. The paper is not asking whether AI-generated questions sound fluent. They do. This is 2025; fluent prose is no longer a moat, it is a default setting. The harder question is whether those questions represent the structure of participant concerns better than a human moderator, a random draw, or an optimisation procedure that selects from the original participant questions.
The answer is interesting precisely because it is not the usual “LLMs beat humans” theatre. Algorithmic methods often improve representativeness over historical human selection. LLM-generated slates can sometimes match or outperform extractive optimisation. But they do not uniformly dominate it. The business lesson is therefore not “replace moderators with models”. It is “audit the agenda before you call it representative”.
Annoying, perhaps. Also progress.
The bottleneck is not participation; it is compression
The setting is deliberative democracy: citizens deliberate in small groups, propose questions, and then an expert panel answers a limited number of them. In the paper’s data, this comes from twelve sessions across three deliberative polls conducted by the Stanford Deliberative Democracy Lab: eight America in One Room democratic reform panels, two 2023 Meta Community Forum sessions on AI chatbots, and two 2024 Meta Community Forum sessions on AI agents.
The workflow is simple enough. Participants discuss a topic, submit questions, and then a final slate of $k$ questions is selected from a much larger set of $m$ proposed questions. The expert panel answers the slate. The selected questions become the bridge between public input and expert response.
This is where representation can fail without looking like failure.
A moderator may choose questions that are clear, urgent, balanced, politically safe, or simply easier to merge. An LLM may produce a polished synthesis that feels broad while quietly smoothing away minority concerns. An optimisation algorithm may select awkward participant-written questions that cover the space well but sound less elegant than a conference brochure. Each method has a different failure mode.
The paper’s mechanism-first insight is that representation should be audited at the point of compression. If hundreds of inputs must become a handful of outputs, the central question is not whether the final slate is attractive. It is whether any sizeable group of participants with a shared concern would have been better served by an unchosen question.
That is a measurable claim.
Justified representation turns “being heard” into an audit condition
The paper borrows from social choice theory, specifically justified representation, or JR. In its intuitive form, JR says: if there are $n$ participants and $k$ questions can be asked, then any group of roughly $n/k$ participants with a shared concern is large enough to deserve at least one representative question in the slate.
This is not equality of vibes. It is proportionality with teeth.
Suppose 56 participants submit questions and the panel can answer 7. A group of 8 participants is large enough to correspond to one slot. If those 8 participants all care about a similar unselected question, and none of the selected questions gives them comparable value, the slate has a representation problem.
The paper extends this into a quantitative audit. Instead of only asking whether JR passes or fails, it computes a JR value, $\alpha_{JR}(W)$, for a slate $W$. The easiest way to read it is:
- values below 1 mean the slate satisfies JR;
- lower values mean stronger representation, because even smaller coherent groups are protected;
- values above 1 mean some group large enough to deserve a slot may be left uncovered.
The audit searches for a “blocking coalition”: a group of participants who would all prefer some unselected proposed question over the selected slate. The largest such dissatisfied coherent group determines the score.
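To make the reading of the score concrete, here is a minimal sketch. It assumes (the text above does not spell out the formula) that $\alpha_{JR}$ is the size of the largest blocking coalition divided by the $n/k$ threshold. The function names are hypothetical, and the numbers reuse the 56-participant, 7-slot example.

```python
def jr_threshold(n_participants: int, k_slots: int) -> float:
    """Group size that corresponds to one slot on the slate (n / k)."""
    return n_participants / k_slots


def alpha_jr(largest_blocking_coalition: int, n_participants: int, k_slots: int) -> float:
    """Assumed score: largest blocking coalition relative to the n / k threshold."""
    return largest_blocking_coalition / jr_threshold(n_participants, k_slots)


# 56 participants and 7 slots: a group of 8 corresponds to one slot.
print(jr_threshold(56, 7))   # 8.0
# Largest uncovered coherent group has 4 members: half the threshold.
print(alpha_jr(4, 56, 7))    # 0.5
# Largest uncovered coherent group has 12 members: well past the threshold.
print(alpha_jr(12, 56, 7))   # 1.5
```

Under that assumption, a score of 0.5 says the biggest group the slate left behind is only half the size needed to claim a slot.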
This is a useful reframing. Representation is no longer treated as a post-hoc narrative. It becomes a diagnostic: find the group the slate failed to cover.
That diagnostic is especially valuable when AI is involved, because LLM output can look inclusive while being very good at hiding what it omitted. Smooth language is not evidence of fair coverage. It is evidence that the model owns a thesaurus.
The utility model is deliberately simple, and that is both strength and weakness
To audit a slate, the authors need to estimate how much utility a participant receives from any selected question. They do this with a transparent proxy: each participant is associated with the question they proposed, and their utility for another question is based on semantic similarity to that original question. In implementation, the paper uses cosine similarity between question embeddings.
So if my original question is about voter education for ranked-choice voting, I receive higher utility from selected questions that are semantically close to that concern, and lower utility from questions about unrelated electoral reforms.
The participant’s utility for the whole slate is treated as unit-demand: the value of the slate is the value of the best-matching question in it. That choice is important. The authors note that additive utilities make JR too easy to satisfy in this setting, because several weakly similar questions can accumulate into apparent coverage even if no selected question really represents the participant’s concern. Unit-demand forces the slate to contain at least one meaningfully relevant item.
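The difference between unit-demand and additive utility is easy to see on toy data. The sketch below uses made-up two-dimensional vectors in place of real question embeddings, and the function names are hypothetical. It is an illustration of the modelling choice, not the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def unit_demand_utility(own_question, slate):
    """Value of a slate = similarity to its single best-matching question."""
    return max(cosine(own_question, q) for q in slate)


def additive_utility(own_question, slate):
    """Value of a slate = sum of similarities to every question in it."""
    return sum(cosine(own_question, q) for q in slate)


# Toy embeddings (assumed, not real model output).
own = [1.0, 0.0]
weak_slate = [[0.6, 0.8], [0.6, -0.8], [0.6, 0.8]]      # three vaguely related questions
focused_slate = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]   # one exact match, two unrelated

for name, slate in [("weak", weak_slate), ("focused", focused_slate)]:
    print(name,
          "unit-demand:", round(unit_demand_utility(own, slate), 2),
          "additive:", round(additive_utility(own, slate), 2))
```

On this toy input the additive score prefers the slate of three weak matches, while the unit-demand score prefers the slate that contains the participant's actual concern.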
This is one of the paper’s better modelling decisions. In deliberation, being “a little bit represented” by many unrelated questions is not the same as having one relevant question asked. Anyone who has watched a committee “address” an issue by mentioning three adjacent topics knows the difference.
The embedding proxy is tested in two ways. First, the authors validate cosine similarity on the Quora Question Pairs dataset, where the task is to distinguish duplicate from non-duplicate questions. Three embedding models—Qwen3 0.6B, all-MiniLM-L6-v2, and OpenAI’s text-embedding-3-small—achieve AUC scores between 0.868 and 0.876. This is not proof that embeddings capture democratic preference. It is evidence that the similarity signal is not random noise wearing a lab coat.
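For readers who want to see what an AUC check of this kind measures, here is a small sketch. The labels and similarity scores are invented for illustration and are not Quora Question Pairs data. The function computes AUC directly as the share of (duplicate, non-duplicate) pairs in which the duplicate gets the higher similarity, which is the standard rank interpretation of the metric.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical question pairs: 1 = duplicate, 0 = not duplicate,
# with a made-up cosine similarity for each pair.
labels = [1, 1, 1, 0, 0, 0]
similarities = [0.91, 0.78, 0.55, 0.62, 0.40, 0.35]

print(round(roc_auc(labels, similarities), 3))  # 0.889
```

An AUC of 0.5 would mean similarity is no better than a coin flip at spotting duplicates; 1.0 would mean it separates them perfectly.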
Second, the authors run a cross-model robustness check: slates optimised using one embedding model are evaluated using other embedding models. The resulting JR values remain generally low, suggesting that the audit outcome is not wholly dependent on one embedding provider.
That said, the boundary is real. A participant’s question is only a proxy for their concern. Cosine similarity is only a proxy for utility. The study is retrospective, so participants were not asked whether the final slates actually made them feel represented. For business use, this is not fatal. It means the metric should be treated as an auditable signal, not as revealed human satisfaction.
A dashboard is not democracy. It is just a better place to start arguing.
The algorithmic contribution is about making the audit usable
The paper also contributes the first algorithms for auditing JR in the general utility setting. In approval voting, JR verification is relatively straightforward: check whether a large enough group approves an unselected item while approving none of the selected items. With continuous utilities, the problem becomes harder because one must reason over thresholds: at what utility level does a group become dissatisfied enough to form a blocking coalition?
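The approval-voting check described above can be sketched in a few lines. This is the textbook JR test under the assumption that each voter approves a set of candidate questions; the function name and the toy ballots are hypothetical.

```python
from collections import Counter


def approval_jr_violation(approvals, slate, k):
    """Return (candidate, group_size) if a group of at least n/k voters who approve
    nothing on the slate all approve the same unselected candidate, else None."""
    n = len(approvals)
    chosen = set(slate)
    # Voters who approve none of the selected items.
    unrepresented = [ballot for ballot in approvals if not (ballot & chosen)]
    # How many unrepresented voters back each (necessarily unselected) candidate.
    support = Counter(c for ballot in unrepresented for c in ballot)
    if not support:
        return None
    candidate, size = support.most_common(1)[0]
    return (candidate, size) if size * k >= n else None


# Toy ballots: 8 voters, 2 slots, so a group of 4 corresponds to one slot.
approvals = [{"a"}, {"a"}, {"a"}, {"a"}, {"b"}, {"b"}, {"c"}, {"c"}]

print(approval_jr_violation(approvals, ["b", "c"], k=2))  # ('a', 4)
print(approval_jr_violation(approvals, ["a", "b"], k=2))  # None
```

With continuous utilities there is no clean "approves" or "does not approve", which is why the general audit has to search over thresholds.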
The naïve algorithm checks every proposed question against every relevant participant utility threshold. It runs in $O(mn^2)$ time.
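A minimal sketch of that brute-force check follows. It assumes a coalition blocks at threshold θ when every member gets at least θ from the alternative question and strictly less than θ from the slate, and that $\alpha_{JR}$ is the largest coalition size divided by $n/k$. Both are working assumptions for illustration; the function name and the utilities are invented.

```python
def largest_blocking_coalition_naive(cand_utils, slate_utils):
    """cand_utils[c][i]: utility of participant i for proposed question c.
    slate_utils[i]: utility of participant i for the slate (its best match).
    Tries every candidate and every participant utility as a threshold: O(m * n^2)."""
    best = 0
    for utils_c in cand_utils:                      # m proposed questions
        for theta in utils_c:                       # n candidate thresholds
            size = sum(1 for u_c, u_w in zip(utils_c, slate_utils)
                       if u_c >= theta and u_w < theta)   # n participants
            best = max(best, size)
    return best


# Toy data: 6 participants, 2 proposed questions, 3 slots (n / k = 2).
slate_utils = [0.9, 0.8, 0.3, 0.2, 0.25, 0.7]
cand_utils = [
    [0.1, 0.2, 0.6, 0.7, 0.65, 0.1],   # an unselected question three people care about
    [0.9, 0.8, 0.3, 0.2, 0.25, 0.7],   # a question already matched by the slate
]

size = largest_blocking_coalition_naive(cand_utils, slate_utils)
print(size)            # 3
print(size * 3 / 6)    # 1.5
```

Only a participant's own utility values need to be tried as thresholds, because the coalition can only shrink when θ moves past one of them.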
The improved algorithm exploits sorting. For each alternative question, it sorts participants by their utility for that question and compares this against sorted utility thresholds for the selected slate. This identifies the largest blocking coalition in a single pass over sorted values, giving a runtime of $O(mn \log n)$.
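One way to get the sorted speed-up is sketched below. It is not necessarily the paper's exact procedure: it uses binary search over two sorted lists rather than a single merged pass, under the same working assumption that a coalition blocks at threshold θ when every member gets at least θ from the alternative and strictly less than θ from the slate. Each such participant then corresponds to an interval (slate utility, alternative utility], and the largest coalition is the largest set of intervals sharing a point.

```python
from bisect import bisect_left


def largest_blocking_coalition_sorted(cand_utils, slate_utils):
    """Same result as the brute-force check, in O(m * n log n).
    cand_utils[c][i]: utility of participant i for proposed question c.
    slate_utils[i]: utility of participant i for the slate."""
    best = 0
    for utils_c in cand_utils:
        # Only participants who strictly prefer the alternative can ever block.
        pairs = [(u_w, u_c) for u_c, u_w in zip(utils_c, slate_utils) if u_c > u_w]
        lows = sorted(u_w for u_w, _ in pairs)    # slate utilities
        highs = sorted(u_c for _, u_c in pairs)   # utilities for the alternative
        for theta in highs:
            # Members satisfy u_w < theta <= u_c:
            # (count with u_w < theta) minus (count with u_c < theta).
            size = bisect_left(lows, theta) - bisect_left(highs, theta)
            best = max(best, size)
    return best


slate_utils = [0.9, 0.8, 0.3, 0.2, 0.25, 0.7]
cand_utils = [
    [0.1, 0.2, 0.6, 0.7, 0.65, 0.1],
    [0.9, 0.8, 0.3, 0.2, 0.25, 0.7],
]

print(largest_blocking_coalition_sorted(cand_utils, slate_utils))  # 3
```

The subtraction works because, among participants who prefer the alternative, anyone whose alternative utility is below θ also has a slate utility below θ.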
That matters operationally. An audit that works only after the meeting is a post-mortem, not a moderation tool. The paper’s platform integration points toward real-time use: generate candidate slates, compute representation scores, inspect heatmaps showing which participant questions are covered, and let moderators compare options before the expert panel begins.
This is the difference between ethical decoration and operational governance. A principle that cannot enter the workflow tends to become a PDF. We already have enough PDFs bravely defending accountability from folders nobody opens.
What the experiments are actually testing
The empirical section compares five approaches to question slates:
| Method | Type | Role in the paper |
|---|---|---|
| Random | Extractive | Baseline for what happens without intelligent selection |
| Human | Extractive/historical | Actual moderator-selected questions used in prior deliberations |
| IP | Extractive | Integer-program selection from participant questions, optimised for JR |
| LLM | Abstractive | GPT-4o-generated summary questions, averaged across 100 sampled slates |
| LLMbest | Abstractive | Best-of-100 LLM slate according to the audit score |
This setup is important because it separates two questions that are often lazily merged.
The first question is whether algorithmic selection can improve representativeness over human moderation. The answer is generally yes in this dataset.
The second question is whether abstractive LLM generation is better than selecting original participant questions. The answer is: sometimes, not always.
The table of results is the paper’s main evidence. Across twelve sessions, the integer-program extractive method produces JR values between 0.244 and 0.525. Since all are below 1, these slates satisfy JR. That is expected, because the integer program is directly optimising the representation criterion, though the authors note the exact optimisation problem is NP-hard and use a practical integer-programming formulation.
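To show what "optimised for JR" means on a tiny instance, here is an exhaustive-search stand-in. It is explicitly not the paper's integer program: it simply tries every slate of size $k$ and keeps the one with the smallest blocking coalition, which is only feasible for toy inputs. The utility matrix, the threshold-based coalition rule, and the score formula (coalition size times $k/n$) are assumptions made for illustration.

```python
from itertools import combinations


def largest_blocking_coalition(util, slate):
    """util[i][c]: utility of participant i for proposed question c.
    slate: indices of selected questions. Unit-demand slate utility."""
    n, m = len(util), len(util[0])
    slate_utils = [max(row[c] for c in slate) for row in util]
    best = 0
    for c in range(m):
        for j in range(n):
            theta = util[j][c]
            size = sum(1 for i in range(n)
                       if util[i][c] >= theta and slate_utils[i] < theta)
            best = max(best, size)
    return best


def jr_score(util, slate, k):
    """Assumed score: largest blocking coalition divided by n / k."""
    return largest_blocking_coalition(util, slate) * k / len(util)


def best_extractive_slate(util, k):
    """Brute force over all size-k slates (exponential; toy sizes only)."""
    m = len(util[0])
    slate = min(combinations(range(m), k), key=lambda s: jr_score(util, s, k))
    return slate, jr_score(util, slate, k)


# Four participants in two clusters; each proposed one question (utility 1.0 for their own).
util = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]

print(jr_score(util, (0, 1), k=2))       # 1.0
print(best_extractive_slate(util, k=2))  # ((0, 2), 0.5)
```

Picking both questions from one cluster leaves the other cluster as a two-person blocking coalition; picking one from each cuts the score in half.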
Human-selected slates range from 0.738 to 2.234. Half of the human slates fail JR by exceeding 1. This is not an indictment of moderators as careless humans doing human things, though naturally that remains an available interpretation for the impatient. It shows that unaudited selection can miss coherent groups even in professionally run deliberations.
LLM-generated slates are more nuanced. The average LLM slate across 100 runs ranges from 0.545 to 1.052. The best-of-100 LLM slate ranges from 0.280 to 0.583. That means audited LLM generation can produce strong representative slates. But the “best” part is doing work here: the paper samples multiple LLM outputs and selects using the JR audit. The audit is not a footnote to the model. It is how the model becomes reliable enough to use.
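The gap between the average and best-of-100 rows is just the gap between a mean and a minimum over sampled slates. A minimal sketch, in which `audit` is any function returning a JR value for a slate and the slate names and scores are invented:

```python
from statistics import mean


def pick_best_slate(candidate_slates, audit):
    """Score every sampled slate and return (best_slate, best_score, average_score).
    Lower audit scores are better."""
    scored = [(audit(slate), slate) for slate in candidate_slates]
    best_score, best_slate = min(scored, key=lambda pair: pair[0])
    average_score = mean(score for score, _ in scored)
    return best_slate, best_score, average_score


# Hypothetical audit results for three sampled LLM slates (made-up numbers).
toy_scores = {"sample A": 1.05, "sample B": 0.60, "sample C": 0.45}

best, best_score, average = pick_best_slate(list(toy_scores), toy_scores.get)
print(best, best_score, round(average, 2))  # sample C 0.45 0.7
```

In this toy run one sample fails the audit and the average looks mediocre, yet the selected slate is strong. Without the audit there is no way to tell which sample is which.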
A simple reading of the results looks like this:
| Finding | Evidence | Business meaning | Boundary |
|---|---|---|---|
| Human moderation often beats random selection, but not reliably enough | Human slates are generally better than random, yet 6 of 12 fail JR | Professional judgement benefits from audit support | Historical moderators may have optimised for other legitimate goals |
| Extractive optimisation gives strong representation guarantees | IP slates all score below 1 | Selecting original user input can be transparent and auditable | Exact optimisation can be computationally heavy at larger scale |
| LLMs can generate representative and polished summaries | LLMbest slates all score below 1 and sometimes match or beat IP | Abstractive synthesis may improve readability without sacrificing coverage | Requires sampling and auditing; a single LLM output is not guaranteed |
| Embedding-based audit is plausible but not definitive | QQP AUC around 0.868–0.876 and cross-model checks are stable | Similarity-based coverage can be operationalised today | Similarity is not the same as participant-validated utility |
The most useful result is not that LLMs can help. We knew they could help produce plausible text. The useful result is that LLMs need a representation audit because their advantage is conditional.
In some sessions, abstractive slates match or surpass the best extractive slates, particularly where fewer questions are proposed. In larger sessions, LLM generation may be more feasible at interactive speeds than solving for the best extractive slate. But the integer-program approach remains a critical benchmark because it selects from the participants’ own words. That has trust value. People may accept an awkward question they recognise as theirs more readily than a beautifully phrased synthesis that quietly removed the point.
The paper’s Table 2 makes this visible. For one America in One Room session, the IP slate and LLMbest slate both achieve a JR value of 0.421, while the human slate scores 0.842. The IP slate contains participant-style questions, sometimes rough and specific. The LLM slate is cleaner, more generalised, and easier to read. Same audit score, different democratic texture.
That difference is not cosmetic. In civic or enterprise settings, transparency can be part of representation. A selected question copied from an employee, customer, or citizen has provenance. A generated synthesis has elegance. Elegance is useful. Provenance is safer. A sensible workflow may need both.
The platform integration is where the paper becomes more than theory
The authors integrate the auditing and question-generation workflow into an online deliberation platform used in hundreds of deliberations across more than 50 countries. The platform can generate LLM summary questions, show which participant questions are most similar to each generated question, and export similarity data for auditing.
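The "most similar participant questions" view can be approximated with a nearest-neighbour lookup over embeddings. The sketch below is not the platform's code: the vectors are made-up three-dimensional stand-ins for real embeddings, and the function names and ids are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_inputs(generated_vec, participant_vecs, top=2):
    """Return the ids of the participant questions closest to a generated question,
    with their similarity, most similar first."""
    ranked = sorted(participant_vecs.items(),
                    key=lambda item: cosine(generated_vec, item[1]),
                    reverse=True)
    return [(qid, round(cosine(generated_vec, vec), 3)) for qid, vec in ranked[:top]]


# Toy embeddings for three participant questions and one generated summary question.
participant_vecs = {
    "p1": [0.9, 0.1, 0.0],   # e.g. voter education for ranked-choice voting
    "p2": [0.8, 0.3, 0.1],   # e.g. funding for voter education
    "p3": [0.0, 0.2, 0.9],   # e.g. an unrelated electoral reform
}
generated = [1.0, 0.2, 0.0]

for qid, similarity in nearest_inputs(generated, participant_vecs):
    print(qid, similarity)
```

Showing these neighbours next to each generated question gives a synthesis some of the provenance that an extractive slate has by construction.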
This matters because representation audits are only valuable if they appear before the final decision, not after the press release.
A practical version of the workflow looks like this:
- collect participant questions;
- generate candidate slates using human selection, extractive optimisation, LLM synthesis, or hybrids;
- compute $\alpha_{JR}$ for each slate;
- inspect which participant clusters are covered or uncovered;
- select a final slate with both representation and communication quality in view;
- disclose the logic, at least internally, so the process has an audit trail.
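The middle steps of that list (score each slate, then inspect who is uncovered) can be sketched as a small report. Everything here is illustrative: the coalition rule (every member gets at least θ from an alternative and strictly less than θ from the slate), the score formula (coalition size times $k/n$), the slate names, and the utilities are all assumptions.

```python
def worst_blocking_coalition(cand_utils, slate_utils):
    """Return (question index, member indices) for the largest blocking coalition.
    cand_utils[c][i]: utility of participant i for proposed question c."""
    best_question, best_members = None, []
    for c, utils_c in enumerate(cand_utils):
        for theta in utils_c:
            members = [i for i, (u_c, u_w) in enumerate(zip(utils_c, slate_utils))
                       if u_c >= theta and u_w < theta]
            if len(members) > len(best_members):
                best_question, best_members = c, members
    return best_question, best_members


def audit_report(cand_utils, slates, k):
    """Score each named slate and list who it leaves uncovered, best slate first."""
    n = len(cand_utils[0])
    rows = []
    for name, slate_utils in slates.items():
        question, members = worst_blocking_coalition(cand_utils, slate_utils)
        rows.append({"slate": name, "alpha_jr": len(members) * k / n,
                     "rallying_question": question, "uncovered": members})
    return sorted(rows, key=lambda row: row["alpha_jr"])


# Six participants, two proposed questions, three slots.
cand_utils = [
    [0.1, 0.2, 0.6, 0.7, 0.65, 0.1],
    [0.9, 0.8, 0.3, 0.2, 0.25, 0.7],
]
# Each participant's best-match utility under two hypothetical slates.
slates = {
    "human": [0.9, 0.8, 0.3, 0.2, 0.25, 0.7],
    "llm": [0.8, 0.8, 0.6, 0.5, 0.7, 0.7],
}

for row in audit_report(cand_utils, slates, k=3):
    print(row)
```

The list of uncovered participants is the part a moderator can act on: it names the group the slate failed, and the question that would have served them.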
That workflow applies beyond civic deliberation. Any organisation that compresses many inputs into a smaller agenda faces the same problem.
A board collecting shareholder questions before an annual meeting. A company summarising employee concerns from an engagement survey. A regulator clustering consultation responses. A product team choosing which customer complaints to escalate. A hospital network triaging patient advisory feedback. In each case, the danger is not that some topics are excluded. Exclusion is inevitable when attention is limited. The danger is that sizeable coherent groups are excluded without anyone noticing.
This is where the paper’s business relevance sits. It does not offer a universal fairness engine. It offers a way to audit compression.
For business leaders, this is agenda-risk management
The commercial analogue of deliberative failure is not always scandal. Often it is quieter: a stakeholder group concludes that consultation was performative, a minority customer segment sees its problem disappear into a summary, or an internal AI tool produces executive-ready synthesis that optimises for readability over representational fidelity.
The paper suggests a practical governance pattern:
| Governance layer | What it asks | How the paper informs it |
|---|---|---|
| Representation audit | Which coherent groups are uncovered? | Use JR-style blocking-coalition detection |
| Model comparison | Does the LLM improve on simpler methods? | Compare LLM slates against extractive and human baselines |
| Provenance check | Are outputs traceable to real inputs? | Show nearest participant questions for each selected or generated item |
| Human oversight | What should moderators decide after the audit? | Use scores and heatmaps as decision support, not automatic authority |
| Boundary review | What does the metric fail to capture? | Validate utility assumptions where stakes justify it |
The return on investment is not only faster summarisation. Faster summarisation is the obvious benefit, and obvious benefits are often where strategy goes to nap. The deeper value is reducing agenda risk: the risk that a process claims inclusion while structurally overlooking a group large enough to matter.
That risk is especially relevant for LLM deployments because generated summaries can be persuasive even when they are incomplete. A human reviewer may see a coherent list of questions and assume coverage. The JR audit asks a different question: is there a group of people, large enough to deserve a slot, whose own proposed questions are all closer to an unselected alternative than to anything in the final slate?
This is a much better governance question than “does the summary look balanced?” The answer to that one is usually “yes, in Arial”.
The limitations are not boilerplate; they define proper use
The paper’s most important limitation is the utility proxy. Participants are represented by the question they proposed, and utility is inferred through embedding similarity. This is transparent and scalable, but it cannot capture everything a participant values. Two questions may be semantically close but politically different. One question may contain a tone, premise, or institutional concern that embeddings flatten. A participant may also care about multiple issues, while the model associates them with one submitted question.
The second limitation is retrospective evaluation. The audit is applied to historical slates. The authors validate embedding similarity using external similarity judgments and cross-model robustness checks, but they do not directly ask participants whether the audited slates felt representative. Future work could integrate participant feedback during live deliberations, which would help distinguish semantic similarity from perceived representation.
The third boundary is that JR is only one representation axiom. It is meaningful because it identifies large coherent groups that deserve coverage, but stronger proportionality concepts exist, including variants discussed in the broader social choice literature. The paper itself notes that future work could audit stronger guarantees such as BJR.
Finally, the comparison between extractive and abstractive approaches is not a final verdict on LLMs. The LLM setup uses GPT-4o, a specific prompt, temperature-one sampling, shuffled input order, and best-of-100 selection. Different models, prompts, languages, topics, or deliberative cultures may shift performance. The study covers 12 sessions, not the entire moral universe. A pity, but science remains inconveniently finite.
These limitations do not weaken the paper’s practical message. They sharpen it. Use the audit as a diagnostic layer. Do not confuse it with ground truth. Combine it with human judgement, provenance, participant validation where possible, and clear disclosure about how final questions were chosen.
The future is not AI moderation; it is audited moderation
The tempting headline is that LLMs can help democracy scale. The better headline is that scaled participation without audited compression is just a larger funnel into a smaller room.
This paper offers a concrete way to inspect that funnel. It turns representation from an aspiration into a measurable condition: if a group is large enough, coherent enough, and uncovered by the final slate, the process should know before the questions reach the expert panel.
For civic platforms, that means AI-generated summaries should be scored, compared, and traced back to participant input. For companies, it means stakeholder summarisation systems should not be evaluated only on fluency, speed, or executive friendliness. For AI governance teams, it means the audit layer may matter more than the generation layer.
The strongest version of this workflow is hybrid. Let humans bring judgement. Let extractive optimisation preserve participant provenance. Let LLMs produce readable synthesis. Then audit all of it.
Because the real democratic question is not whether AI can write better questions.
It is whether we can prove whose questions survived the compression.
Cognaptus: Automate the Present, Incubate the Future.
[1] Soham De, Lodewijk Gelauff, Ashish Goel, Smitha Milli, Ariel Procaccia, and Alice Siu, “Question the Questions: Auditing Representation in Online Deliberative Processes,” arXiv:2511.04588, November 2025, https://arxiv.org/abs/2511.04588.