Prompts are now office furniture.
Everyone has them. Everyone complains about them. Nobody is quite sure who owns the standard version. One team keeps a Notion page of “best prompts.” Another hides theirs in a spreadsheet. A third tells new staff to “just ask clearly,” which is not a method, but it does have the administrative elegance of doing nothing.
The paper behind this article, Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect, argues for a more disciplined view: the problem is not merely prompt quality, but intent transmission.[1] In other words, the user has a goal, the model receives a prompt, and somewhere between those two things the goal is compressed, distorted, or politely murdered.
That framing matters because it moves the discussion away from prompt cleverness and toward communication infrastructure. A good enterprise AI workflow should not depend on whether one employee has learned the correct incantation from a viral prompt thread. It should give the model a stable representation of what the user wants: task, purpose, audience, context, method, constraints, format, and tone.
The paper tests that idea across three structured frameworks: PPS/5W3H, CO-STAR, and RISEN. The headline result is not “5W3H beats everything.” That would be a convenient marketing sentence, and therefore suspicious. The more useful result is subtler: structured intent decomposition appears to be the active ingredient, while the specific framework matters less under the paper’s current evaluation scale.
That is the part worth understanding.
The mechanism: structure reduces what the model has to guess
A normal prompt asks the model to infer too much.
“Analyze China’s EV market competition” looks clear because humans are charitable readers. But the instruction hides several unresolved questions. Who is the analysis for? Investors, executives, students, policy analysts, or a general reader? Should it emphasize pricing, supply chains, technology, consumer demand, geopolitics, or financial performance? Is the output meant to be a briefing, memo, slide outline, report, or argument? Should it be cautious, assertive, comparative, or explanatory? Should it include numbers? Should it discuss 2024, 2026, or a five-year outlook?
The model can still produce something fluent. Fluency is not the problem. The problem is that the output may satisfy a plausible version of the task rather than the user’s actual version of the task.
Structured intent frameworks attack this ambiguity by forcing the missing dimensions into the open. PPS/5W3H uses eight dimensions: What, Why, Who, When, Where, How-to-do, How-much, and How-feel. CO-STAR uses Context, Objective, Style, Tone, Audience, and Response. RISEN uses Role, Instructions, Steps, End goal, and Narrowing constraints.
They differ in vocabulary, coverage, and length. But they share the same operating principle: decompose the request before asking the model to execute it.
| Framework | What it makes explicit | What the paper’s results suggest |
|---|---|---|
| PPS / 5W3H | Task, purpose, audience, time, environment, method, quantity, tone | Broad intent coverage, but possibly more processing overhead in some settings |
| CO-STAR | Context, objective, style, tone, audience, response format | Strong alignment under the study’s goal-alignment metric |
| RISEN | Role, instructions, steps, end goal, narrowing constraints | Strong alignment under the same metric, despite fewer explicit dimensions |
The mechanism is therefore not mystical. Structure works because it converts implicit intent into explicit constraints. Less mind-reading, fewer default assumptions, less room for the model to wander off into what it thinks the task probably means. A revolutionary discovery, apparently: telling systems what you want improves the chance they do it.
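To make "implicit intent into explicit constraints" concrete, here is a minimal sketch of the eight 5W3H dimensions as a data structure. The field names follow the dimensions listed above; the class name `Intent5W3H` and the `unresolved` helper are illustrative assumptions, not anything defined by the paper.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Intent5W3H:
    """Hypothetical container for the eight PPS/5W3H dimensions."""
    what: Optional[str] = None       # task
    why: Optional[str] = None        # purpose
    who: Optional[str] = None        # audience
    when: Optional[str] = None       # time frame
    where: Optional[str] = None      # environment or context
    how_to_do: Optional[str] = None  # method
    how_much: Optional[str] = None   # quantity, length, depth
    how_feel: Optional[str] = None   # tone

    def unresolved(self) -> List[str]:
        """Return the dimensions the model would still have to guess."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# The one-line prompt from the example above fills only one dimension.
bare = Intent5W3H(what="Analyze China's EV market competition")
print(bare.unresolved())
```

Run on the one-sentence EV prompt, the helper lists the seven dimensions that are left to the model's defaults.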
The experiment compares intent formats, not just prompt polish
The controlled experiment uses 3,240 model outputs: 3 models, 3 languages, 6 prompt conditions, 3 domains, and 20 tasks per domain. The models are Claude Sonnet 4, GPT-4o, and Gemini 2.5 Pro. The languages are Chinese, English, and Japanese. The task domains are travel, business, and technical writing.
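The 3,240 figure follows directly from the factorial design. A short sketch that enumerates the cells, using only the factor levels named above (the single-letter condition labels are the ones used in this article):

```python
from itertools import product

models = ["Claude Sonnet 4", "GPT-4o", "Gemini 2.5 Pro"]
languages = ["Chinese", "English", "Japanese"]
conditions = ["A", "B", "C", "D", "E", "F"]
domains = ["travel", "business", "technical"]
tasks_per_domain = 20

# One cell per (model, language, condition, domain) combination.
cells = list(product(models, languages, conditions, domains))
total_outputs = len(cells) * tasks_per_domain

print(len(cells), total_outputs)  # 162 3240
```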
The six conditions are important because they separate several ideas that are often lazily bundled together:
| Condition | Meaning | Likely purpose in the experiment |
|---|---|---|
| A: Simple prompt | One-sentence task description | Main unstructured baseline |
| B: Raw JSON | PPS specification in raw JSON | Tests whether machine-readable structure alone helps |
| C: Manual 5W3H | Expert-written structured prompt | Tests manually crafted structured intent |
| D: AI-expanded 5W3H | One-sentence prompt expanded into 5W3H by lateni.com | Tests AI-assisted intent expansion |
| E: CO-STAR | Prompt structured with CO-STAR | Framework comparison |
| F: RISEN | Prompt structured with RISEN | Framework comparison |
Outputs are judged by DeepSeek-V3 on a 1–5 goal-alignment score. This is not a perfect evaluation design, and the paper is unusually explicit about that. But it is aimed at the right construct: not whether the answer sounds nice, but whether it satisfies the user’s likely intent.
That distinction is not academic hair-splitting. In business use, a beautifully written answer to the wrong task is still wrong. It is just wrong with better typography.
The first result: structured frameworks converge near the top
The most important table in the paper shows the grand mean goal-alignment scores across all models, languages, and domains:
| Prompt condition | Grand mean goal-alignment score |
|---|---|
| A: Simple prompt | 4.463 |
| B: Raw JSON | 4.141 |
| C: Manual 5W3H | 4.683 |
| D: AI-expanded 5W3H | 4.930 |
| E: CO-STAR | 4.978 |
| F: RISEN | 4.983 |
The simple business reading is clear: structured, readable prompts outperform simple prompts and raw JSON. Raw JSON performs worst overall, which is a useful corrective for teams that confuse “structured” with “dumped into a machine-looking format.” A schema that humans and models do not comfortably process can become friction, not discipline.
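The contrast between "structured" and "machine-looking" is easy to see side by side. The sketch below renders one specification two ways. The field values, the `LABELS` mapping, and both function names are hypothetical, and neither rendering is claimed to match the paper's actual templates for conditions B or C.

```python
import json

# Hypothetical specification; only the task sentence comes from the article.
spec = {
    "what": "Analyze China's EV market competition",
    "who": "Institutional investors",
    "how_much": "One-page briefing",
    "how_feel": "Cautious, comparative",
}

# Assumed human-readable labels for each key.
LABELS = {"what": "Task", "who": "Audience", "how_much": "Length", "how_feel": "Tone"}


def as_raw_json(spec: dict) -> str:
    """Dump the specification as a single JSON string."""
    return json.dumps(spec, ensure_ascii=False)


def as_readable(spec: dict) -> str:
    """Render the same fields as labelled lines of plain text."""
    return "\n".join(f"{LABELS[key]}: {value}" for key, value in spec.items())


print(as_raw_json(spec))
print(as_readable(spec))
```

Both outputs carry identical information. The difference is only in how comfortably a reader, human or model, can process it.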
The more delicate interpretation is about the three structured frameworks. CO-STAR and RISEN slightly outperform AI-expanded 5W3H in the aggregate, but the differences are tiny. The paper reports a TOST equivalence test with a ±0.2 margin on the 1–5 scale and finds D, E, and F statistically equivalent under that criterion.
This does not prove that CO-STAR, RISEN, and 5W3H are truly identical. It means they all saturate the current measurement instrument. When scores are compressed between 4.930 and 4.983, the evaluation scale may no longer be sensitive enough to detect meaningful differences among already-strong methods.
That is the first misconception to avoid. The paper is not a tournament where one prompt framework wins a medal. It is evidence that dimensional decomposition itself is doing a large share of the work.
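For readers unfamiliar with TOST (two one-sided tests), the logic can be sketched in a few lines. This version uses a large-sample normal approximation from the standard library and made-up score lists; the paper's exact test statistic and data are not reproduced here, and only the ±0.2 margin is taken from the text above.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def tost_equivalent(a, b, margin=0.2, alpha=0.05):
    """Two one-sided tests for equivalence of means (normal approximation)."""
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    # Test 1: reject "diff <= -margin" when diff sits well above -margin.
    p_lower = 1 - NormalDist().cdf((diff + margin) / se)
    # Test 2: reject "diff >= +margin" when diff sits well below +margin.
    p_upper = NormalDist().cdf((diff - margin) / se)
    p_value = max(p_lower, p_upper)
    return diff, p_value, p_value < alpha


# Hypothetical 1-5 scores, chosen only to resemble near-ceiling means.
scores_e = [5] * 98 + [4] * 2   # mean 4.98
scores_d = [5] * 93 + [4] * 7   # mean 4.93

diff, p_value, equivalent = tost_equivalent(scores_e, scores_d)
print(round(diff, 3), equivalent)
```

Equivalence is declared only when both one-sided tests reject, which is why the larger of the two p-values is the one compared against `alpha`.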
The second result: language becomes less of a moving target
The paper’s cross-language result is stronger than the framework-ranking story.
Unstructured prompts show a much wider spread of scores across languages. Readable structured prompts shrink that spread sharply, while raw JSON widens it:
| Condition | Cross-language standard deviation across all models |
|---|---|
| A: Simple prompt | 0.470 |
| B: Raw JSON | 0.824 |
| C: Manual 5W3H | 0.378 |
| D: AI-expanded 5W3H | 0.127 |
| E: CO-STAR | 0.020 |
| F: RISEN | 0.019 |
The paper describes the drop from 0.470 (simple prompt) to roughly 0.020 (CO-STAR and RISEN) as a reduction of up to 24×. For a company operating across regions, that is the operationally interesting part.
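The "up to 24×" figure can be checked against the table. A small sketch that divides the simple-prompt standard deviation by each other condition's value:

```python
# Cross-language standard deviations from the table above.
sd_by_condition = {
    "A": 0.470, "B": 0.824, "C": 0.378,
    "D": 0.127, "E": 0.020, "F": 0.019,
}

baseline = sd_by_condition["A"]

# A ratio above 1 means less spread than the simple prompt; below 1 means more.
reduction = {
    condition: baseline / sd
    for condition, sd in sd_by_condition.items()
    if condition != "A"
}

for condition, ratio in reduction.items():
    print(f"{condition}: {ratio:.1f}x")
```

CO-STAR and RISEN land at roughly 23.5× and 24.7×, while raw JSON comes out below 1×, meaning its spread is larger than the baseline's.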
A multilingual AI workflow often fails in boring ways. The same internal request produces different levels of completeness in English, Chinese, and Japanese. Local teams adapt prompts differently. Evaluation becomes inconsistent. The central team then announces a “prompt governance initiative,” and everyone quietly continues using their private templates because the official one is unreadable.
Structured intent offers a cleaner approach: standardize the request dimensions, not the exact prose. The surface language can change; the intent schema remains stable.
What the paper directly shows is that structured formats reduce language-dependent variation in a controlled benchmark. What Cognaptus infers is that organizations should treat multilingual prompt design as a schema problem before treating it as a translation problem. Translate the words, yes. But first decide which intent fields must exist.
The third result: weaker models benefit more from explicit intent
The paper’s “weak-model compensation effect” is the most commercially tempting result, so it needs the most careful interpretation.
The gain from simple prompt A to AI-expanded 5W3H condition D differs sharply by model:
| Model | Mean D–A gain |
|---|---|
| Claude | +0.217 |
| GPT-4o | +0.178 |
| Gemini | +1.006 |
Gemini’s gain is roughly 4.6 times Claude’s (+1.006 versus +0.217). The paper interprets this as evidence that structure compensates for weaker baseline inference. A stronger model can infer missing intent dimensions from minimal cues. A weaker model cannot. So when the prompt externalizes those dimensions, the weaker model catches up. Part of the gap may also be headroom: a model that starts lower on a 5-point scale has more room to rise.
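The ratio behind the compensation claim is simple arithmetic on the table above:

```python
# Mean D minus A gains from the table above.
gain = {"Claude": 0.217, "GPT-4o": 0.178, "Gemini": 1.006}

ratio_vs_claude = gain["Gemini"] / gain["Claude"]
ratio_vs_gpt4o = gain["Gemini"] / gain["GPT-4o"]

print(f"{ratio_vs_claude:.1f}x vs Claude, {ratio_vs_gpt4o:.1f}x vs GPT-4o")
```

The same calculation against GPT-4o gives an even larger multiple, roughly 5.7×, because GPT-4o has the smallest gain of the three.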
There is a practical implication here, but it is not the lazy version.
The lazy version says: “Use structured prompts and cheaper models will become just as good as premium models.” That is procurement poetry. Please do not put it in a board memo without adult supervision.
The more defensible version is this: for tasks where failure is caused by missing intent rather than missing knowledge or reasoning capacity, structured prompts may allow weaker or cheaper models to perform much closer to stronger models. That distinction matters. If the task requires deep reasoning, domain expertise, tool use, or high factual precision, structure cannot magically create capability. But if the task is failing because the model does not know the audience, format, constraints, or success criteria, structure can do real work.
This is why the result matters for AI operations. Many enterprise AI tasks are not frontier-intelligence tasks. They are specification tasks: summarize this for a CFO, rewrite this for clients, classify this according to our policy, draft a response with these constraints, prepare a first version in this format. In such workflows, better intent encoding may reduce the need to route everything to the most expensive model.
That is not glamorous. It is also where ROI tends to hide.
The user study supports usability, but not clean causality
The paper adds a 50-person user study. Participants used their own tasks and habitual AI models under three sequential conditions: a simple prompt, an unmodified AI-generated 5W3H prompt, and a user-modified 5W3H prompt.
The results are directionally useful:
| User-study measure | Simple prompt | 5W3H modified | Interpretation |
|---|---|---|---|
| Satisfaction score | 3.16 | 4.04 | Higher satisfaction after structured intent and user calibration |
| Weighted mean interaction rounds | 4.05 | 1.62 | About 60% fewer rounds |
| Users needing 0–1 rounds | 28% | 60% | More users reach usable output quickly |
| Users needing 5+ rounds | 32% | 8% | Fewer long correction loops |
The paper also reports that 82% of users needed to adjust at most two of the eight AI-generated dimensions, and no users reported that the expansion was broadly inaccurate.
For product design, this is useful. It suggests that AI-assisted intent expansion can reduce the blank-page problem: users do not need to know all the dimensions in advance; the system can propose them, and the user can correct them.
But the user study is not a clean causal proof. It used a fixed sequence: A → D_raw → D_mod. Participants performed the same self-selected task repeatedly. Naturally, by the third attempt they may understand their own task better, learn from earlier outputs, and judge with different expectations. The paper acknowledges this as a fundamental design limitation, not a decorative footnote.
So the right reading is: the user study supports practical usability and directional efficiency gains. It does not isolate the exact causal share attributable to structured intent versus repetition, learning, and task refinement.
That is still valuable. Business tools are allowed to work through helping users clarify their own requests. In fact, that may be the point.
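One way a follow-up study could soften the fixed-order problem is to rotate where the simple prompt appears. The sketch below is an assumption about how such a design might be set up, not something the paper proposes. It keeps only orders in which `D_raw` precedes `D_mod`, since a user can only modify an expansion that already exists, and it uses a hypothetical panel of 48 participants.

```python
from collections import Counter
from itertools import permutations

CONDITIONS = ("A", "D_raw", "D_mod")

# Keep orders where the raw expansion comes before its user-modified version.
valid_orders = [
    order for order in permutations(CONDITIONS)
    if order.index("D_raw") < order.index("D_mod")
]


def assign_orders(participant_ids):
    """Cycle participants through the valid orders so each is used equally."""
    return {
        pid: valid_orders[i % len(valid_orders)]
        for i, pid in enumerate(participant_ids)
    }


assignment = assign_orders(range(1, 49))  # 48 hypothetical participants
print(Counter(assignment.values()))
```

Rotation balances practice effects across conditions. It does not remove them, because every participant still repeats the same task.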
The boundary condition: too much structure can become overhead
The paper’s most interesting negative result is the GPT-4o Japanese anomaly.
In most cells, structured prompts improve goal alignment. But for GPT-4o in Japanese under condition D, the score falls to 4.600, below the simple-prompt baseline of 4.950. The paper’s domain breakdown shows the problem is concentrated in complex tasks:
| Domain | GPT-4o Japanese D | Claude Japanese D | Difference |
|---|---|---|---|
| Travel | 5.000 | 5.000 | 0.000 |
| Business | 4.450 | 4.950 | -0.500 |
| Technical | 4.350 | 5.000 | -0.650 |
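The domain rows are consistent with the 4.600 overall figure quoted above. A quick check, assuming equal weighting because each domain has the same number of tasks:

```python
from statistics import mean

# Condition D, Japanese, by domain (from the table above).
gpt4o_ja_d = {"travel": 5.000, "business": 4.450, "technical": 4.350}
claude_ja_d = {"travel": 5.000, "business": 4.950, "technical": 5.000}

overall_gpt4o = mean(gpt4o_ja_d.values())
gaps = {
    domain: round(gpt4o_ja_d[domain] - claude_ja_d[domain], 3)
    for domain in gpt4o_ja_d
}

print(round(overall_gpt4o, 3), gaps)
```

Travel contributes nothing to the shortfall; the whole drop comes from the business and technical rows.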
The paper calls this “encoding overhead.” When the structure becomes too dense for the model-language-task combination, the model may fail to execute all requirements. The prompt has more complete intent, but the receiver cannot process it cleanly.
This is exactly the kind of result enterprises should pay attention to because it prevents a stupid implementation rule: “Always use the fullest prompt template.”
No. Use enough structure to remove ambiguity, not so much structure that the model spends its limited execution budget juggling fields. A short customer-service rewrite may not need eight dimensions. A legal-risk memo may need more. A multilingual technical report may need structure, but it may also need staged execution, intermediate outlines, or a stronger model.
The paper’s own conceptual relationship is useful:
This is not a formal fitted model. Treat it as a practical design heuristic. Structure has benefits, but those benefits are not monotonic. More fields are not always better. Anyone who has filled out a 14-column project intake form already knew this, but it is nice when the models agree.
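As a reading aid, the non-monotonic trade-off described in this section can be written schematically. The notation below is an illustrative assumption and should not be taken as the paper's own expression: $s$ stands for the amount of structure in the prompt, and $m$, $\ell$, $t$ for the model, language, and task.

```latex
\[
\text{NetBenefit}(s) \;=\;
\underbrace{R(s)}_{\text{ambiguity removed}}
\;-\;
\underbrace{O(s \mid m, \ell, t)}_{\text{encoding overhead}}
\]
```

On this reading, adding fields helps while $R$ grows faster than $O$, and hurts once the overhead for a given model, language, and task combination starts to dominate.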
What this means for enterprise AI design
The paper’s results point toward a shift from prompt engineering as individual craft to intent design as organizational infrastructure.
That does not mean every employee should write in 5W3H. It means the system should help users supply the dimensions that matter for a task.
A practical enterprise workflow could look like this:
| Layer | Design question | Operational consequence |
|---|---|---|
| User interface | Which intent fields should the user see? | Reduce blank-page prompting without overwhelming users |
| Intent schema | Which fields are mandatory for this task type? | Standardize requests across departments and languages |
| Expansion layer | Can AI infer missing fields for user review? | Lower the skill requirement for non-expert users |
| Model routing | Which model is sufficient after intent is structured? | Avoid paying premium-model prices for specification failures |
| Evaluation | Did the output satisfy the declared intent fields? | Make quality review less subjective |
| Versioning | Which schema version produced the output? | Support auditability and reproducibility |
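The routing, evaluation, and versioning rows can be sketched as a single step that turns a completed intent into a job record. Everything named here is hypothetical: the schema version string, the model tier names, the `REASONING_HEAVY` set, and the routing rule itself are assumptions chosen for illustration.

```python
import hashlib
import json

SCHEMA_VERSION = "intent-schema/0.1"    # hypothetical version label
REASONING_HEAVY = {"legal_risk_memo"}   # hypothetical task types needing a stronger model


def build_job(task_type: str, intent: dict) -> dict:
    """Attach routing, audit, and review metadata to a completed intent."""
    # Sorting keys makes the hash independent of field order.
    canonical = json.dumps(intent, sort_keys=True, ensure_ascii=False)
    return {
        "model": "premium-model" if task_type in REASONING_HEAVY else "standard-model",
        "schema_version": SCHEMA_VERSION,
        "intent_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        # One review item per declared field, so evaluation checks stated intent.
        "review_checklist": [
            f"Output satisfies '{name}': {value}" for name, value in intent.items()
        ],
    }


job = build_job(
    "cfo_summary",
    {"task": "Summarize Q3 results", "audience": "CFO", "format": "Five bullet points"},
)
print(job["model"], job["schema_version"], len(job["review_checklist"]))
```

Storing the schema version and a hash of the intent alongside each output is one simple way to answer "which request produced this?" during an audit.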
This is where the “protocol” language becomes useful, as long as we do not overstate it. The paper does not establish PPS as an engineering protocol in the TCP/IP sense. It supports a weaker and more practical claim: structured intent can behave like a protocol-like layer because it standardizes what is transmitted between user and model.
For business systems, that is enough to matter. A protocol-like intent layer can make AI interactions more testable. It gives product teams something to validate. It gives compliance teams something to inspect. It gives operations teams something to improve without rewriting every prompt by hand.
The object of governance becomes the intent schema, not the employee’s phrasing style. That is a healthier place to be.
What the paper shows, what we infer, and what remains uncertain
| Category | Claim | Status |
|---|---|---|
| Directly shown | Structured prompts outperform simple prompts and raw JSON on the paper’s goal-alignment metric | Supported by the controlled experiment |
| Directly shown | D, E, and F score very similarly under the current 1–5 GA scale | Supported, but possibly ceiling-limited |
| Directly shown | Cross-language variance falls sharply under structured frameworks | Supported by reported variance comparisons |
| Directly shown | Gemini benefits much more from AI-expanded 5W3H than Claude or GPT-4o | Supported; interpretation still partly affected by headroom |
| Directly shown | User satisfaction rises and interaction rounds fall in the user study | Supported directionally, but sequence effects limit causal attribution |
| Cognaptus inference | Structured intent can serve as a reusable internal request protocol | Plausible operational interpretation |
| Cognaptus inference | Better intent schemas may reduce reliance on premium models for some workflows | Plausible for specification-heavy tasks, not all tasks |
| Still uncertain | Which structured framework is truly best under finer evaluation | Not resolved by this paper |
| Still uncertain | How results generalize across broader user populations and task types | Requires larger, counterbalanced studies |
The most useful business lesson is therefore not “adopt PPS,” “adopt CO-STAR,” or “adopt RISEN.” The useful lesson is: decide which intent dimensions your workflow cannot afford to leave implicit.
Different functions need different schemas. A sales email generator needs audience, tone, product context, objection handling, and call-to-action. A financial analysis assistant needs time horizon, data source assumptions, audience, methodology, and risk boundaries. A legal summarizer needs jurisdiction, document type, user role, confidence level, and escalation conditions. A customer support assistant needs policy scope, customer history, tone, allowed remedies, and refusal rules.
The framework label matters less than the discipline of making the missing fields visible.
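Making missing fields visible can be as plain as a lookup table and a check. The sketch below encodes the four example functions from this section; the snake_case keys, the `REQUIRED_FIELDS` name, and the sample draft are illustrative assumptions.

```python
from typing import Dict, List

# Field lists taken from the examples above, written as identifiers.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "sales_email": ["audience", "tone", "product_context",
                    "objection_handling", "call_to_action"],
    "financial_analysis": ["time_horizon", "data_source_assumptions", "audience",
                           "methodology", "risk_boundaries"],
    "legal_summary": ["jurisdiction", "document_type", "user_role",
                      "confidence_level", "escalation_conditions"],
    "customer_support": ["policy_scope", "customer_history", "tone",
                         "allowed_remedies", "refusal_rules"],
}


def missing_fields(function: str, request: dict) -> List[str]:
    """List required fields that are absent or blank in the request."""
    missing = []
    for name in REQUIRED_FIELDS[function]:
        value = request.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


draft = {"audience": "existing clients", "tone": "warm", "call_to_action": "  "}
print(missing_fields("sales_email", draft))
```

A request with an empty list is ready to send; anything else goes back to the user with the named gaps.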
The uncomfortable part: prompt governance is product design
Many organizations still treat prompts as text assets. They collect them, polish them, and paste them into tools. That approach works for demos. It becomes fragile in operations.
A prompt library answers the question, “What words should we use?”
An intent protocol answers a better question: “What information must be present before the model is allowed to act?”
That difference changes ownership. Prompt libraries can be maintained by enthusiastic users. Intent protocols require product managers, domain experts, compliance reviewers, and workflow owners. Annoying, yes. Also how serious systems are usually built.
This paper is valuable because it gives empirical support to something many AI builders have learned by bruising their foreheads against production workflows: the model is often not the only weak link. The request interface is weak. The user’s intent is underspecified. The evaluation target is fuzzy. The organization then blames the model for being inconsistent when the input channel was basically vibes with punctuation.
Boundaries before adoption
The paper’s limitations are not minor, and they affect how the results should be used.
First, the evaluation scale is coarse. When structured frameworks all score above 4.9 on a 5-point metric, the comparison among them is compressed. A more sensitive evaluation might show that one framework handles certain tasks better than another.
Second, there is no external gold-intent document created independently before the prompt conditions. Richer structured prompts may give the judge more explicit criteria to reward. That does not make the result useless; it does mean the experiment partly measures the value of explicit specification as judged from the available prompt-output pair.
Third, the judge is a single LLM. DeepSeek-V3 is independent of the tested models and the expansion model, but LLM-as-judge evaluation remains imperfect. Multi-judge evaluation or blinded human expert evaluation would strengthen the findings.
Fourth, the user study is small, skewed toward technical participants, and fixed-order. Its results are promising for usability, not definitive for causality.
Fifth, the author created PPS and the lateni.com expansion platform used for condition D. The paper discloses this conflict. The open dataset helps, but independent replication would be especially valuable.
These boundaries do not erase the findings. They tell us how to implement them: pilot structured intent schemas in real workflows, compare against current prompt practices, measure task success and correction rounds, and avoid declaring one universal framework winner before the evaluation instrument can actually detect the difference.
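The single-judge limitation suggests one cheap addition to any pilot: score each output with more than one judge and flag disagreements for human review. A minimal sketch, with hypothetical judge names, made-up scores, and an assumed disagreement threshold of more than one point:

```python
from statistics import mean


def combine_judges(scores_by_judge: dict, max_spread: int = 1) -> list:
    """Average per-item scores across judges and flag large disagreements."""
    judges = list(scores_by_judge)
    n_items = len(scores_by_judge[judges[0]])
    combined = []
    for i in range(n_items):
        item_scores = [scores_by_judge[judge][i] for judge in judges]
        combined.append({
            "mean": mean(item_scores),
            # Flag when the highest and lowest judge differ by more than max_spread.
            "needs_review": max(item_scores) - min(item_scores) > max_spread,
        })
    return combined


# Hypothetical 1-5 goal-alignment scores for three outputs.
scores = {
    "judge_a": [5, 4, 5],
    "judge_b": [5, 4, 3],
    "judge_c": [5, 5, 4],
}

result = combine_judges(scores)
print([item["needs_review"] for item in result])
```

Only the third output is flagged here, because its scores range from 3 to 5.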
Conclusion: the next prompt is probably a form
The paper’s quiet message is that prompt engineering is maturing into interface design.
The early phase of generative AI rewarded clever phrasing. The operational phase rewards reliable intent capture. That means prompts will increasingly look less like magical text and more like structured forms, reusable schemas, validation rules, model-routing inputs, and audit logs.
This sounds less romantic than “talk naturally to the AI.” It is also more likely to work.
For businesses, the strategic question is not whether employees should learn yet another prompt acronym. The question is which parts of intent are too important to leave as inference. Once that is clear, the framework becomes a tool, not a religion.
Structure is not the whole answer. It will not rescue weak reasoning, missing data, or bad governance. But it can reduce ambiguity, stabilize multilingual workflows, and make lower-cost models more useful in specification-heavy tasks.
In other words: the prompt is no longer the unit of strategy. The protocol is.
Cognaptus: Automate the Present, Incubate the Future.
[1] Peng Gang, “Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect,” arXiv:2603.29953, 2026. https://arxiv.org/pdf/2603.29953