Agents are easy to create. That is now the boring part.
Give one LLM a persona, give another LLM a conflicting persona, add a shared task, let them talk, and suddenly the demo looks like a little society. A farmer argues with a conservationist. A rural teacher argues with an urban parent. A policy maker tries to sound balanced, because apparently even simulated bureaucracy has survival instincts.
The harder question is not whether these agents can talk. They can. The harder question is whether we can steer the dialogue process itself without retraining the model, rewriting the whole system, or pretending that “be thoughtful and evidence-based” is an engineering architecture.
The paper “Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts” tackles exactly that question.[1] Its core move is simple but useful: treat the prompt not as a decorative instruction, but as an action generated by a lightweight policy. In other words, instead of thinking “we wrote a better prompt,” the paper asks: what if prompt construction is the control surface for a multi-agent system?
That distinction matters. A prompt is usually discussed as a sentence-level craft problem. A policy is a system-level control problem. The first makes people argue about wording. The second makes them ask what state the agent observes, what action it takes, what parameters govern that action, and how the resulting behavior is measured.
The paper does not prove that LLM agents are realistic humans. It does not show that synthetic societies can replace surveys, fieldwork, negotiation research, or actual customers. Thank goodness. We have enough imaginary people in dashboards already.
What it does show is narrower and more operationally useful: by changing prompt components, rule templates, and weights, we can measurably shift how LLM agents respond, rebut, cite evidence, repeat themselves, and maintain stance across multi-round dialogue.
That makes the paper interesting not because of the land-use or education-policy debates it simulates, but because of the mechanism it proposes.
The paper’s real object is not a conversation, but a control loop
The authors formalize a multi-agent dialogue as a state-action process. Each agent has a state built from task/persona information, dialogue memory, and retrieved external knowledge. A policy maps that state into a prompt. The LLM executes the prompt and produces the next utterance. The utterance then enters the shared dialogue memory, shaping the next round.
That sounds more complicated than “agent says something,” but the extra structure is doing real work.
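The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `AgentState`, `policy`, `run_dialogue`, and the `echo_llm` stand-in are all hypothetical names, and a real system would replace `echo_llm` with an actual model call and the `knowledge` lookup with retrieval.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentState:
    task_persona: str      # task and persona text for this agent
    memory: List[str]      # shared dialogue history
    knowledge: List[str]   # retrieved snippets for this agent


def policy(state: AgentState, max_turns: int = 3) -> str:
    """Map the observed state to a prompt; the prompt is the 'action'."""
    recent = "\n".join(state.memory[-max_turns:]) or "(no dialogue yet)"
    evidence = "\n".join(state.knowledge) or "(no evidence retrieved)"
    return (
        f"[Role]\n{state.task_persona}\n"
        f"[Recent dialogue]\n{recent}\n"
        f"[Evidence]\n{evidence}"
    )


def run_dialogue(
    personas: Dict[str, str],
    knowledge: Dict[str, List[str]],
    llm: Callable[[str], str],
    rounds: int,
) -> List[str]:
    """Each round, every agent observes, builds a prompt, speaks, and updates memory."""
    memory: List[str] = []
    for _ in range(rounds):
        for name, persona in personas.items():
            state = AgentState(persona, memory, knowledge.get(name, []))
            utterance = llm(policy(state))
            memory.append(f"{name}: {utterance}")  # the utterance shapes the next state
    return memory


def echo_llm(prompt: str) -> str:
    """Stand-in for a model call (assumption): just reports the prompt size."""
    return f"(reply to a {len(prompt)}-character prompt)"
```

The point of writing it this way is that `policy` is the only place where steering happens. The model, the memory, and the turn order stay fixed.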
The paper decomposes the constructed prompt into five components:
| Component | Meaning in the paper | Operational role |
|---|---|---|
| $T$ | Task and persona description | Defines the agent’s role, stance, and goal |
| $M$ | Dialogue history memory | Gives the agent recent conversational context |
| $D$ | Retrieved external knowledge | Supplies evidence through RAG-style retrieval |
| $R$ | Rule template | Specifies response structure and interaction style |
| $W$ | Weights | Controls how strongly $T$, $M$, and $D$ influence the prompt |
The most useful idea is not any single component. Most agent systems already have personas, memory, and retrieval. The useful idea is that these components become adjustable policy parameters rather than loose prompt ingredients.
The authors define three rule-template settings. None gives no explicit structural rule beyond the concatenated information blocks. Light asks the agent to answer the question, provide one or two evidence items from retrieved knowledge, respond to memory if necessary, and stay within a length limit. Struct forces a more detailed response skeleton: extract supporting points, opposing points, unresolved conflicts, and opportunities for cooperation before generating a concise response.
That rule ladder is important because it separates two different steering strategies.
A light rule nudges the agent toward evidence and relevance. A structured rule forces the agent to process the dialogue through a categorical frame. The first is like giving a meeting participant a speaking guideline. The second is like forcing everyone to fill out a mini-consulting template before opening their mouth. Sometimes that improves discipline. Sometimes it makes the conversation sound like it was trapped inside a strategy workshop and never returned.
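To make the rule ladder concrete, here is one way the three settings could be encoded. The template wording is a paraphrase of the descriptions above, not the paper's actual prompt text, and `RULE_TEMPLATES`, `apply_rule`, and the 120-word default are assumptions for illustration.

```python
# Paraphrased rule templates (assumption: the paper's exact wording differs).
RULE_TEMPLATES = {
    "none": "",
    "light": (
        "Answer the question. Cite one or two evidence items from the retrieved "
        "knowledge. Respond to the dialogue memory if necessary. "
        "Keep the reply under {max_words} words."
    ),
    "struct": (
        "Before replying, extract: supporting points, opposing points, "
        "unresolved conflicts, and opportunities for cooperation. "
        "Then write a concise response."
    ),
}


def apply_rule(blocks: str, rule: str, max_words: int = 120) -> str:
    """Append the chosen rule template to the concatenated information blocks."""
    template = RULE_TEMPLATES[rule]
    if not template:
        return blocks  # "none": information blocks only, no structural rule
    return blocks + "\n[Rules]\n" + template.format(max_words=max_words)
```

Keeping the templates in a single table makes the rule setting a named, swappable parameter rather than text scattered through a prompt string.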
Weights turn prompt ingredients into behavioral levers
The second steering mechanism is the weight vector $W$.
The weights do not numerically multiply hidden activations inside the model. They are translated into prompt instructions. A low persona weight tells the agent to keep the role implicit and focus on arguments. A high persona weight tells it to explicitly speak from the assigned role and state its stance first. A low memory weight lets the agent respond without summarizing prior dialogue. A high memory weight tells it to recap recent turns and resolve pending points. A low evidence weight makes retrieved evidence optional; a high evidence weight asks the agent to provide concrete evidence before concluding.
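A rough sketch of that translation step follows. The instruction wording paraphrases the descriptions above. The `LOW` and `HIGH` cut-offs and the choice to emit nothing for mid-range weights are assumptions, since the text here only describes the low and high cases.

```python
LOW, HIGH = 0.75, 1.25  # hypothetical cut-offs; not taken from the paper

INSTRUCTIONS = {
    "persona": {
        "low": "Keep your role implicit and focus on the arguments.",
        "high": "Speak explicitly from your assigned role and state your stance first.",
    },
    "memory": {
        "low": "You may respond without summarizing the prior dialogue.",
        "high": "Recap recent turns and resolve pending points.",
    },
    "evidence": {
        "low": "Using the retrieved evidence is optional.",
        "high": "Provide concrete evidence before concluding.",
    },
}


def band(weight: float) -> str:
    """Bucket a numeric weight into low / mid / high."""
    if weight < LOW:
        return "low"
    if weight > HIGH:
        return "high"
    return "mid"


def weight_instructions(w_t: float, w_m: float, w_d: float) -> list:
    """Turn (persona, memory, evidence) weights into prompt instructions."""
    lines = []
    for lever, weight in (("persona", w_t), ("memory", w_m), ("evidence", w_d)):
        text = INSTRUCTIONS[lever].get(band(weight))  # mid-range adds nothing (assumption)
        if text:
            lines.append(text)
    return lines
```

For example, `weight_instructions(1.5, 1.0, 0.5)` yields the high-persona line and the low-evidence line, and nothing for memory.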
This is not mathematically deep in the way reinforcement learning is mathematically deep. But it is operationally clever. It converts vague prompt-tuning into a small set of named levers:
| Lever | When raised, the agent should become more… | Likely business use |
|---|---|---|
| Persona weight $w_T$ | Role-consistent, stance-explicit, sometimes more conflict-oriented | Simulating stakeholder pressure, negotiation positions, sales personas |
| Memory weight $w_M$ | Context-aware and responsive to prior turns | Long-running support, project agents, research discussions |
| Evidence weight $w_D$ | Grounded in retrieved documents | Compliance, research, analyst workflows, expert-assistant systems |
| Rule structure $R$ | Formatted, constrained, less free-form | Auditable deliberation, meeting summaries, structured decision support |
The paper also adds adaptive weights. Over time, agents rely less on retrieved external knowledge and more on dialogue memory, while behavior-based corrections increase evidence or memory weights if the previous response failed to use evidence or respond to memory.
That adaptive scheduler is best read as an implementation detail plus an exploratory control extension. It is not the paper’s strongest proof. The strongest idea is simpler: once prompt components are parameterized, a system designer can adjust dialogue behavior without touching model weights.
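The scheduler described above has two parts: a scheduled drift and a behavior-based correction. A minimal sketch, assuming a linear drift and a fixed correction size (`step`, `boost`, and the clamp range are hypothetical values, not the paper's update rule):

```python
def clamp(x: float, lo: float = 0.5, hi: float = 1.5) -> float:
    """Keep a weight inside an assumed working range."""
    return max(lo, min(hi, x))


def update_weights(
    w_m: float,
    w_d: float,
    used_evidence: bool,
    responded_to_memory: bool,
    step: float = 0.05,
    boost: float = 0.2,
) -> tuple:
    """One round of adaptive weighting for memory (w_m) and evidence (w_d)."""
    # Scheduled drift: rely more on dialogue memory, less on retrieved knowledge.
    w_m += step
    w_d -= step
    # Behavior-based correction: push a weight up if the last reply neglected it.
    if not used_evidence:
        w_d += boost
    if not responded_to_memory:
        w_m += boost
    return clamp(w_m), clamp(w_d)
```

Whatever the exact update rule, the structure is the same: the schedule sets the long-run direction, and the correction reacts to what the previous utterance actually did.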
The experiments test behavioral steering, not “truth”
The experimental setup uses two public-policy scenarios: land-resource use and educational-resource allocation. Each scenario has three agents with distinct roles. In the land scenario, the agents are a farmer, a conservationist, and a community representative. In the education scenario, they are a rural teacher, an urban parent, and a policy maker.
The agents are powered by different open models: Qwen3-8B, Llama3-8B, and Mistral-7B. Each agent has role-specific external knowledge. The dialogue lasts ten rounds. For each scenario, the authors use five topic queries and five independent runs.
The evaluation uses five metrics:
| Metric | What it tries to capture | Interpretation boundary |
|---|---|---|
| Responsiveness | Whether the current utterance addresses recent dialogue | A relevance signal, not deep understanding |
| Rebuttal | Whether the agent explicitly opposes a recent utterance | Argument activity, not argument quality |
| Non-repetition | Novelty relative to the agent’s previous utterance | Less repetition, not necessarily better reasoning |
| Evidence usage | Whether phrases from retrieved knowledge appear in output | Surface grounding, not full factual correctness |
| Stance shift | Similarity between output and persona description | Role alignment, not human-like belief formation |
This matters because the paper is not measuring whether the agents reached wise conclusions about land policy or education funding. It is measuring whether prompt-policy parameters change observable dialogue dynamics.
That is the correct target. If the claim is “policy-parameterized prompts can steer multi-agent dialogue,” then the relevant evidence is not whether the farmer won the debate. It is whether the same system behaves differently when the rule template or weight settings change.
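To see why these metrics are proxies, it helps to look at how cheap a surface-level version can be. The sketch below uses phrase matching and token overlap as stand-ins. The paper's own metrics rely on embedding models and judge models, so these functions are simplified assumptions, not a reproduction.

```python
import re


def tokens(text: str) -> set:
    """Lowercased word tokens; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def evidence_usage(utterance: str, key_phrases: list) -> float:
    """Share of key phrases from retrieved knowledge that appear in the output."""
    if not key_phrases:
        return 0.0
    text = utterance.lower()
    hits = sum(1 for phrase in key_phrases if phrase.lower() in text)
    return hits / len(key_phrases)


def non_repetition(current: str, previous: str) -> float:
    """Novelty relative to the agent's previous utterance (1 - Jaccard overlap)."""
    a, b = tokens(current), tokens(previous)
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / len(a | b)
```

An utterance can score well on `evidence_usage` by quoting a phrase it then misinterprets, which is exactly the interpretation boundary noted in the table.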
Light rules increase evidence use; structured rules reduce repetition
The main experiment compares rule templates while holding the weights fixed at $w_T = 1.0$, $w_M = 1.0$, and $w_D = 1.0$.
The overall pattern is clear enough, though not always as clean as a product slide would prefer.
| Scenario | Rule | Responsiveness | Rebuttal | Non-repetition | Evidence usage | Stance |
|---|---|---|---|---|---|---|
| Land | None | 0.85 | 0.28 | 0.43 | 0.10 | 0.48 |
| Land | Light | 0.85 | 0.31 | 0.33 | 0.28 | 0.47 |
| Land | Struct | 0.80 | 0.22 | 0.62 | 0.20 | 0.47 |
| Education | None | 0.85 | 0.15 | 0.47 | 0.16 | 0.48 |
| Education | Light | 0.88 | 0.13 | 0.38 | 0.30 | 0.47 |
| Education | Struct | 0.89 | 0.07 | 0.62 | 0.20 | 0.51 |
The strongest result is evidence usage. In both scenarios, the Light rule produces the highest overall evidence-use score: 0.28 in Land versus 0.10 under None, and 0.30 in Education versus 0.16 under None. That is a meaningful behavioral shift. It suggests that mild structural instruction can make agents use retrieved knowledge more often.
The second strong result is non-repetition. Struct produces the highest non-repetition in both scenarios: 0.62 in Land and 0.62 in Education. This makes intuitive sense. If an agent must extract categories of points before responding, it is less likely to simply rephrase itself.
But the trade-off is visible. In Land, Struct lowers responsiveness from 0.85 to 0.80 and lowers rebuttal from 0.28 to 0.22. In Education, Struct keeps responsiveness high but reduces rebuttal to 0.07. A highly structured agent may sound cleaner, but also less naturally engaged in direct disagreement.
This is the first business lesson: structured prompts do not simply “improve” agents. They reallocate behavior.
A customer-support agent may benefit from Light rules if the goal is to cite policy documents without becoming robotic. A compliance-review agent may benefit from Struct if the goal is auditability and non-repetition. A negotiation simulator may not want Struct if it suppresses direct rebuttal. Prompt policy is not a universal medicine. It is a steering wheel. Sadly, steering wheels can also turn into walls.
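The reallocation is easier to see as differences against the None baseline. The snippet below only re-expresses the table above; `RESULTS` and `delta` are illustrative names.

```python
METRICS = ("responsiveness", "rebuttal", "non_repetition", "evidence", "stance")

# Values copied from the rule-template table above.
RESULTS = {
    ("Land", "None"): (0.85, 0.28, 0.43, 0.10, 0.48),
    ("Land", "Light"): (0.85, 0.31, 0.33, 0.28, 0.47),
    ("Land", "Struct"): (0.80, 0.22, 0.62, 0.20, 0.47),
    ("Education", "None"): (0.85, 0.15, 0.47, 0.16, 0.48),
    ("Education", "Light"): (0.88, 0.13, 0.38, 0.30, 0.47),
    ("Education", "Struct"): (0.89, 0.07, 0.62, 0.20, 0.51),
}


def delta(scenario: str, rule: str, baseline: str = "None") -> dict:
    """Per-metric change of a rule setting relative to the baseline rule."""
    base = RESULTS[(scenario, baseline)]
    new = RESULTS[(scenario, rule)]
    return {m: round(n - b, 2) for m, n, b in zip(METRICS, new, base)}
```

In Land, `delta("Land", "Light")` shows evidence up by 0.18 but non-repetition down by 0.10, while `delta("Land", "Struct")` shows non-repetition up by 0.19 with rebuttal down by 0.06 and responsiveness down by 0.05. No rule setting improves every column.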
Weight sensitivity shows interaction effects, not isolated magic knobs
The paper then varies individual weights in the Land scenario. This is best interpreted as a sensitivity test: it asks whether the system’s behavior changes when persona, memory, or evidence emphasis is raised or lowered.
Several findings are useful.
First, responsiveness stays above 0.8 across weight configurations in the main sensitivity table. That suggests the system remains broadly responsive even when the emphasis among persona, memory, and evidence changes.
Second, increasing the persona weight $w_T$ to 1.5 increases rebuttal strongly. Under None, rebuttal rises from 0.28 in the baseline to 0.45. Under Light, it rises from 0.31 to 0.45. Under Struct, it rises from 0.22 to 0.47. This is exactly what a persona weight should do: stronger role identity creates more visible disagreement.
Third, evidence use depends on both evidence weight and rule template. When $w_D = 1.5$, the None condition reaches evidence usage of 0.30, up from 0.10 at baseline. But under Light, baseline evidence usage is already 0.28 and remains 0.28 when $w_D = 1.5$. In contrast, when $w_D = 0.5$, Light still reaches 0.39 evidence usage, much higher than None at 0.10.
That cross-over is more interesting than a simple “higher evidence weight means more evidence” story. It suggests that rules and weights can substitute for each other in some cases. A Light rule can force evidence use even when the evidence weight is low. Without that rule, the system may need stronger evidence emphasis to produce similar behavior.
For business systems, this means you should not tune prompt levers one at a time and assume independent effects. Persona, memory, evidence, and structure interact. A retrieval-heavy prompt with weak response rules may behave differently from a lightly weighted retrieval prompt with a strong evidence-use rule.
In practical terms: prompt governance needs experiments, not folklore.
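One way to turn that into practice is to enumerate configurations explicitly. The sketch below contrasts a one-lever-at-a-time sweep with a full grid. The three weight levels mirror the 0.5, 1.0, and 1.5 values discussed above, but the function names and the design itself are assumptions, not the paper's protocol.

```python
from itertools import product

RULES = ("none", "light", "struct")
LEVELS = (0.5, 1.0, 1.5)
WEIGHTS = ("w_T", "w_M", "w_D")


def one_at_a_time(base: float = 1.0) -> list:
    """Vary a single weight while the others stay at the baseline."""
    configs = []
    for rule in RULES:
        configs.append({"rule": rule, **{w: base for w in WEIGHTS}})
        for key in WEIGHTS:
            for level in LEVELS:
                if level != base:
                    cfg = {"rule": rule, **{w: base for w in WEIGHTS}}
                    cfg[key] = level
                    configs.append(cfg)
    return configs


def full_grid() -> list:
    """Every combination of rule and weight levels, so interactions are observable."""
    return [
        {"rule": rule, "w_T": t, "w_M": m, "w_D": d}
        for rule, t, m, d in product(RULES, LEVELS, LEVELS, LEVELS)
    ]
```

The one-at-a-time sweep needs 21 configurations and the full grid needs 81. The extra runs are the price of seeing effects such as a rule template substituting for a weight.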
Adaptive weights regulate trajectories more than averages
The adaptive-weight experiment is easy to overread, so let’s not.
The authors initialize weights and then update them over the conversation. The memory weight rises over rounds, while the evidence weight falls over rounds; behavior-based correction increases a weight if the previous response failed to use evidence or memory. In the Land scenario, the overall averages under adaptive weights are similar to the non-adaptive baseline:
| Rule | Responsiveness | Rebuttal | Non-repetition | Evidence usage | Stance |
|---|---|---|---|---|---|
| None | 0.86 | 0.26 | 0.45 | 0.18 | 0.48 |
| Light | 0.85 | 0.39 | 0.37 | 0.29 | 0.45 |
| Struct | 0.82 | 0.13 | 0.52 | 0.16 | 0.48 |
The authors note that adaptive weights do not substantially alter mean performance. The more relevant effect appears in round-wise trajectories. Early evidence use can increase when the evidence weight is corrected upward from too low a setting. Later evidence use can decline as the scheduled evidence weight decreases.
So the adaptive scheduler is not a magic performance booster. It is a way to shape when certain behaviors occur.
That distinction matters. In business workflows, average behavior is often less important than phase behavior. A research agent may need heavy evidence use early, then more synthesis later. A negotiation simulator may need strong persona commitment at the opening, then more memory sensitivity as concessions accumulate. A customer-service escalation agent may need to become less script-bound and more context-aware as the conversation lengthens.
Adaptive weights are useful if the desired behavior changes across stages. If the desired behavior is constant, the scheduler may simply add complexity with a nicer name.
The ablation study explains why each component matters
The ablation study removes or combines $T$, $M$, and $D$ to isolate their roles. This is the paper’s most useful diagnostic section because it explains the mechanism behind the main results.
With no components, evidence usage is essentially zero and stance is low. With only $T$, rebuttal rises and stance alignment rises. With only $D$, evidence usage rises. With only $M$, non-repetition drops sharply, because the agent has prior dialogue available and may reuse or respond around it.
That last point is important. Memory does not automatically make agents better. It makes them more context-dependent. Sometimes that means coherence. Sometimes it means recycling. Anyone who has watched an LLM say “as mentioned earlier” while mentioning the same thing for the fourth time will recognize the phenomenon.
A compact reading of the ablation is:
| Component pattern | What the result supports | What it does not prove |
|---|---|---|
| Only $T$ raises rebuttal and stance | Persona/task drives role-consistent disagreement | Persona makes the agent socially realistic |
| Only $D$ raises evidence usage | Retrieval can ground dialogue in supplied materials | Retrieved evidence is used correctly or completely |
| Only $M$ lowers non-repetition | Memory changes continuity and can increase repetition | Memory always improves discussion quality |
| Combined components produce mixed effects | Prompt components interact and partially offset each other | More prompt ingredients always improve the system |
For enterprise use, this is probably the most practical lesson in the paper. If an agent swarm behaves badly, do not simply add more context, more documents, more persona, and more rules. Diagnose which component is producing which behavior.
The boring debugging question—“which block of the prompt caused this?”—is actually the useful question.
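That debugging question can be set up mechanically by switching prompt blocks on and off. A minimal sketch, where `ablation_conditions` and `build_prompt` are hypothetical helpers and the block contents are placeholders:

```python
from itertools import combinations

COMPONENTS = ("T", "M", "D")  # task/persona, memory, retrieved knowledge


def ablation_conditions() -> list:
    """All subsets of the components, from none to all three."""
    return [
        combo
        for size in range(len(COMPONENTS) + 1)
        for combo in combinations(COMPONENTS, size)
    ]


def build_prompt(blocks: dict, active: tuple) -> str:
    """Concatenate only the active blocks, in a fixed T, M, D order."""
    return "\n".join(blocks[c] for c in COMPONENTS if c in active)
```

Running the same dialogue under each of the eight conditions and comparing metrics is what lets a team attribute repetition to memory or disagreement to persona, rather than guessing.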
Backbone variation is robustness evidence, not the main thesis
The paper also tests different backbone model configurations. In the main setup, the three agents use different models. The authors compare this with a homogeneous all-Qwen3 setup and alternative heterogeneous assignments.
The homogeneous setup generally produces lower responsiveness, rebuttal, and non-repetition than the heterogeneous setup. In the all-Qwen3 condition, overall Land responsiveness ranges from 0.59 to 0.74 depending on rule, compared with 0.80 to 0.85 in the main Land setup. Non-repetition is also lower in the homogeneous condition.
The likely purpose of this test is robustness and implementation sensitivity. It supports the idea that rule effects are not purely an artifact of one exact model assignment, while also suggesting that model diversity contributes to richer dialogue.
It does not prove that heterogeneous agent systems are always better. It shows that in this controlled setup, using different backbone LLMs produced more dynamic conversational behavior. That is a useful design hint, not a law of artificial sociology.
The appendix also extends the Land scenario with additional farmer agents. This is best treated as an exploratory extension. The authors show that policy-parameterized prompts remain effective when more agents are added, but the experiment is still small and scenario-specific.
What business teams can actually take from this
The paper’s business relevance is not “simulate society and ask the bots what customers want.” That is the lazy version, and it deserves to be left outside in the rain.
The stronger interpretation is that policy-parameterized prompts offer a control layer for multi-agent systems. This matters wherever organizations are building agent teams that must deliberate, critique, retrieve evidence, or maintain role discipline.
Potential applications include:
| Business setting | Useful prompt-policy control | Why it matters |
|---|---|---|
| Research-agent teams | Raise evidence weight, use Light rules | Forces more document-grounded claims without over-structuring dialogue |
| Synthetic stakeholder review | Raise persona weight selectively | Makes agents represent conflicting operational interests more clearly |
| Negotiation simulation | Tune persona and rebuttal behavior | Creates more realistic pressure without retraining models |
| Customer-service escalation | Increase memory weight over turns | Helps agents respond to unresolved issues rather than restart the script |
| Compliance or audit workflows | Use Struct rules and high evidence weight | Improves traceability and reduces repetitive free-form answers |
| Product strategy debate agents | Mix backbone models and role prompts | Encourages diversity of critique instead of one-model groupthink |
The ROI relevance is not mainly lower model-training cost, though avoiding training is useful. The deeper value is cheaper behavioral diagnosis.
If an agent team is too repetitive, inspect memory and rule structure. If it ignores documents, adjust evidence rules and evidence weight. If it refuses to disagree, increase persona emphasis or modify the rule template. If it becomes theatrical and ungrounded, reduce persona intensity and raise evidence requirements.
This is a more mature way to think about prompt engineering. Not “write a better prompt,” but “define the behavioral levers, measure the output, and tune the control policy.”
Still not glamorous. Usually the useful engineering isn’t.
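The diagnostic habits above can be written down as a small lookup from symptom to lever. The floor values here are hypothetical and would need calibration per system. Only the advice strings come from the discussion above.

```python
# Hypothetical floors; a metric below its floor triggers the matching advice.
FLOORS = {"non_repetition": 0.40, "evidence_usage": 0.20, "rebuttal": 0.15}

ADVICE = {
    "non_repetition": "Too repetitive: inspect memory and rule structure.",
    "evidence_usage": "Ignores documents: adjust evidence rules and evidence weight.",
    "rebuttal": "Refuses to disagree: increase persona emphasis or modify the rule template.",
}


def diagnose(metrics: dict) -> list:
    """Return the advice for every metric that falls below its floor."""
    return [
        ADVICE[name]
        for name, floor in FLOORS.items()
        if metrics.get(name, 1.0) < floor  # missing metrics are treated as healthy
    ]
```

Fed the Land results under the Light rule (non-repetition 0.33, evidence usage 0.28, rebuttal 0.31), this flags only repetition. Fed the Education results under Struct (0.62, 0.20, 0.07), it flags only the lack of disagreement.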
Where the evidence stops
The paper’s boundaries are clear.
First, the experiments use two public-policy scenarios: Land and Education. That is enough to demonstrate mechanism, not enough to generalize to all enterprise settings.
Second, each dialogue runs for ten rounds with a small number of agents. Long-running business agents may face context drift, tool failures, retrieval degradation, user interruptions, and conflicting operational goals.
Third, several metrics are proxies. Evidence usage checks whether key phrases from retrieved knowledge appear in the output. That is useful, but it is not the same as verifying factual correctness, legal validity, or faithful use of evidence. Rebuttal measures explicit opposition, not argument quality. Stance similarity measures alignment with persona description, not stable belief.
Fourth, the judge model and embedding model shape the evaluation. The paper addresses this partly by noting that several metrics are embedding-based and by testing independent judge models in a controlled setting, but production systems still need domain-specific evaluation.
Fifth, the knowledge bases were collected from public materials and summarized/supplemented for the experiment. That is acceptable for research, but in business settings the quality of the document base will often dominate the quality of the dialogue. A bad evidence base with a high evidence weight is not governance. It is very confident paperwork.
These limitations do not weaken the core mechanism. They define its proper use.
The paper should be read as a framework for controllable dialogue behavior, not as proof that LLM societies are faithful replicas of human institutions.
The real lesson: prompt policy is cheap governance
The paper gives us a useful vocabulary for agent design.
A persona is not just a role card. It is a stance-control lever. Memory is not just chat history. It is a continuity-and-repetition lever. Retrieval is not just RAG decoration. It is an evidence-use lever. Rule templates are not formatting preferences. They are behavioral constraints. Weights are not mathematical sophistication for its own sake. They are a way to make these choices inspectable and tunable.
This is why the mechanism-first reading matters. If we summarize the paper as “structured prompts improve multi-agent dialogue,” we miss the actual contribution. The paper is not about finding one good prompt. It is about making prompt construction itself into a policy.
For Cognaptus-style business automation, that is the practical bridge. Multi-agent systems will not become useful merely because we give them charming job titles and let them chat. They become useful when their behavior can be configured, measured, debugged, and revised.
The next stage of agent engineering will probably not be “more agents.” It will be better control over the agents we already know how to create.
Tiny policies, large behavioral effects. Prompt politics, basically.
The machines have not become a society. But they have become a meeting.
And now, unfortunately, we need governance for the meeting.
Cognaptus: Automate the Present, Incubate the Future.
[1] Hongbo Bo, Jingyu Hu, and Weiru Liu, “Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts,” arXiv:2603.09890, 2026. https://arxiv.org/abs/2603.09890