Review queue.
Someone has to decide whether an image is “unsafe,” “misleading,” “healthy,” “premium,” “clickbait,” “brand-safe,” or “not really our vibe.” The label sounds simple until the first borderline case appears. A salad with too much cream. A gaming ad that hints at easy money but never quite says it. A before-and-after photo where the “achievement” is visible only if one is feeling generous.
This is where many AI workflows quietly become theater. The business team says, “We know it when we see it.” The model, tragically, does not attend that meeting.
The paper Agile Deliberation: Concept Deliberation for Subjective Visual Classification addresses exactly this gap.[1] Its central argument is not that vision-language models need one more clever prompt, nor that users simply need to label a few more images. The problem is earlier and more annoying: users often do not yet have a stable definition of the concept they want the model to classify.
That sounds like a human problem. It is. Which is precisely why it becomes a model problem.
The paper introduces Agile Deliberation, a human-in-the-loop framework for subjective visual classification. The key move is to treat the concept definition itself as something that evolves through interaction. The definition is not only a policy note for humans. It is also the prompt that drives the VLM classifier. In other words, the system does not separate “what the team means” from “what the model sees.” It forces them into the same artifact.
That is the useful part.
The real bottleneck is not labeling; it is discovering the boundary
Many human-in-the-loop AI systems assume the user already knows the target concept clearly enough to provide useful supervision. That assumption works when the concept is “dog,” “car,” or “tomato.” It breaks when the concept is subjective, contextual, or policy-shaped.
A content moderation team may agree that certain images should be removed, but disagree on why. A marketplace team may want to promote “high-quality product photos,” but that phrase hides a small war of preferences: lighting, clutter, authenticity, staging, resolution, lifestyle context, and probably one manager’s unresolved trauma from bad catalog photography.
The paper calls the missing process concept deliberation: the iterative work of clarifying what a subjective category includes, excludes, and treats as borderline. The authors ground this design in interviews with five professional content moderation experts and qualitative analysis of twenty expert-authored concept definitions. That expert analysis is not the main performance evidence. It is design evidence: it explains why the system has two stages rather than simply dumping examples into an active learning loop.
The expert pattern is straightforward:
- First, scope the concept into structured positive and negative sub-concepts.
- Then, refine the definition by looking at borderline images that expose where the current wording fails.
The important word is borderline. Ordinary examples help the user confirm what they already believe. Borderline examples force the user to make the belief operational.
Agile Deliberation turns a vague concept into a classifier prompt
Mechanically, Agile Deliberation has two stages.
The first stage is concept scoping. The user starts with a concept name and short description. The system decomposes it into unit concepts and then into candidate positive and negative sub-concepts. For “healthy food,” for example, the positive side may include prepared meals with vegetables, lean proteins, fruit, or healthy beverages. The negative side may include processed food, raw ingredients, or cases where food is not actually the main subject.
The user reviews these sub-concepts, keeps what fits, rejects what does not, and produces an initial structured definition. This definition then becomes the first VLM prompt.
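To make the scoping step concrete, here is a minimal sketch of a structured definition that doubles as a prompt. The `ConceptDefinition` class, the `review` helper, and the prompt wording are illustrative assumptions, not the paper's schema or its actual prompt; the sub-concepts come from the healthy-food example above.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptDefinition:
    """Hypothetical container for a scoped concept (not the paper's schema)."""
    name: str
    description: str
    positives: list[str] = field(default_factory=list)
    negatives: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the definition as plain text a VLM could be prompted with."""
        lines = [
            f"Concept: {self.name}",
            f"Description: {self.description}",
            "Answer YES if the image matches any of these:",
            *[f"  + {p}" for p in self.positives],
            "Answer NO if the image matches any of these:",
            *[f"  - {n}" for n in self.negatives],
        ]
        return "\n".join(lines)


def review(name: str, description: str,
           candidates: list[tuple[str, str]], keep: set[int]) -> ConceptDefinition:
    """Keep only the candidate sub-concepts whose index the user accepted."""
    definition = ConceptDefinition(name, description)
    for index, (polarity, text) in enumerate(candidates):
        if index not in keep:
            continue
        target = definition.positives if polarity == "positive" else definition.negatives
        target.append(text)
    return definition


# Candidate sub-concepts as a decomposition step might propose them.
candidates = [
    ("positive", "prepared meals with vegetables"),
    ("positive", "lean proteins"),
    ("positive", "healthy beverages"),
    ("negative", "processed food"),
    ("negative", "raw ingredients"),
    ("negative", "food is not the main subject"),
]

# This imaginary user rejects "healthy beverages" and "raw ingredients".
healthy_food = review(
    "healthy food",
    "Images of food the user would consider healthy.",
    candidates,
    keep={0, 1, 3, 5},
)
print(healthy_food.to_prompt())
```

The point of the design is that one object serves both audiences: the lists are what the team debates, and `to_prompt()` is what the classifier receives.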
The second stage is concept iteration. The system retrieves semantically borderline images, asks the user to label them, collects optional comments, and refines the definition. Each new definition is evaluated against all labels collected so far, and the system greedily selects the candidate definition that gives the best F1 score on the accumulated user-labeled set.
The paper formalizes this as a loop: build a definition $d$, induce a classifier $f_d$, collect labels $\mathcal{L}_t$, generate candidate refinements $\mathcal{C}_t$, and choose the next definition by maximizing F1 over the labels collected so far. The equation is less interesting than the product implication: the prompt is no longer a static instruction; it is a negotiated boundary object.
That phrase sounds academic because it is useful. The definition must satisfy two parties at once. Humans need it to be readable and consistent. The model needs it to be operational enough to classify images. Agile Deliberation’s main contribution is tying those two needs together instead of pretending they are separate departments.
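The greedy selection step described above can be sketched in a few lines. Everything here is a toy under stated assumptions: `toy_classify` stands in for a VLM call, definitions are plain strings with an `exclude:` list, and keeping the current definition in the candidate pool is a choice of this sketch rather than something the paper is confirmed to do.

```python
from typing import Callable

Labels = dict[str, bool]                 # image id -> user label
Classifier = Callable[[str, str], bool]  # (definition text, image id) -> prediction


def f1_score(predicted: Labels, actual: Labels) -> float:
    """F1 of the positive class over every labeled image."""
    tp = sum(1 for k, y in actual.items() if y and predicted[k])
    fp = sum(1 for k, y in actual.items() if not y and predicted[k])
    fn = sum(1 for k, y in actual.items() if y and not predicted[k])
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_definition(current: str, candidates: list[str],
                      labels: Labels, classify: Classifier) -> str:
    """Greedy step: keep the definition with the best F1 on all labels so far."""
    def score(definition: str) -> float:
        predictions = {k: classify(definition, k) for k in labels}
        return f1_score(predictions, labels)
    # Assumption of this sketch: `current` competes too, so a round cannot
    # lower F1 on the accumulated labels. Ties keep the earlier entry.
    return max([current, *candidates], key=score)


# Toy stand-in for a VLM: an image is positive unless it carries an excluded tag.
IMAGE_TAGS = {
    "img1": {"salad"},
    "img2": {"salad", "cream"},
    "img3": {"candy"},
    "img4": {"grilled fish"},
}


def toy_classify(definition: str, image_id: str) -> bool:
    excluded = {t.strip() for t in definition.split("exclude:")[1].split(",") if t.strip()}
    return not (IMAGE_TAGS[image_id] & excluded)


labels = {"img1": True, "img2": False, "img3": False, "img4": True}
current = "healthy food; exclude:"
candidates = [
    "healthy food; exclude: candy",
    "healthy food; exclude: candy, cream",
    "healthy food; exclude: salad",
]
best = select_definition(current, candidates, labels, toy_classify)
print(best)
```

Because every candidate is scored against the full label set rather than the latest batch, a refinement that fixes the newest edge case by breaking an older one loses the comparison.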
Why semantic borderline cases beat ordinary active learning here
A natural reaction is: “Isn’t this just active learning?”
Not quite.
Classic active learning often looks for model uncertainty: examples where the classifier probability is near the decision boundary. That works when the classifier’s probability is meaningful. But with prompted generative VLMs, the output is not necessarily a calibrated probability of human ambiguity. A model may sound confident while missing the human issue completely. Very impressive. Very modern.
Agile Deliberation instead targets semantic borderline cases: examples near the natural-language boundary implied by the current concept definition. The system generates borderline queries, retrieves candidate images, removes duplicates, clusters visually similar items, selects useful clusters, and mines coherent ambiguity dimensions so each review batch focuses on one type of conceptual tension.
This matters because good deliberation is not “show me random hard cases.” It is “show me several examples that all pressure the same rule.”
For a business workflow, that distinction is large. Random hard cases exhaust reviewers. Coherent borderline batches teach the policy.
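A rough sketch of the batching idea follows, assuming image embeddings are already available. Query generation and retrieval are out of scope here, the similarity thresholds are made-up values, and the greedy clustering is a simple stand-in for whatever clustering and ambiguity mining the actual system uses.

```python
import math

Vector = tuple[float, ...]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedupe(items: dict[str, Vector], threshold: float = 0.999) -> dict[str, Vector]:
    """Drop any image whose embedding nearly matches one already kept."""
    kept: dict[str, Vector] = {}
    for image_id, vec in items.items():
        if all(cosine(vec, other) < threshold for other in kept.values()):
            kept[image_id] = vec
    return kept


def cluster(items: dict[str, Vector], threshold: float = 0.9) -> list[list[str]]:
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters: list[tuple[Vector, list[str]]] = []
    for image_id, vec in items.items():
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(image_id)
                break
        else:
            clusters.append((vec, [image_id]))
    return [members for _, members in clusters]


def next_batch(items: dict[str, Vector], batch_size: int = 3) -> list[str]:
    """Return one coherent batch: members of the largest cluster after dedup."""
    groups = cluster(dedupe(items))
    return max(groups, key=len)[:batch_size]


# Toy 2-D "embeddings"; real ones would come from an image encoder.
retrieved = {
    "creamy_salad_a": (1.0, 0.1),
    "creamy_salad_a_copy": (1.0, 0.1),   # exact duplicate, should be dropped
    "creamy_salad_b": (0.9, 0.2),
    "fried_veg": (0.95, 0.3),
    "smoothie": (0.1, 1.0),
}
print(next_batch(retrieved))
```

Picking a single cluster per batch is what keeps the reviewer arguing about one rule at a time instead of five.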
The workflow is the contribution, not any single module
The paper’s prototype contains three main components: decomposition, borderline image retrieval, and concept refinement. It uses Gemini 2.5 Pro for concept decomposition and Gemini 2.5 Flash for other tasks such as image classification, borderline query generation, and ambiguity mining. The system does not fine-tune the foundation models; it relies on inference access and prompt-based classification.
That implementation choice is not a small detail. It means the framework is positioned as an operational layer around available VLMs, not as a new foundation model. For enterprise adoption, that is usually where the budget conversation becomes less theatrical.
The pipeline can be summarized like this:
| Mechanism | What it does | Operational consequence |
|---|---|---|
| Concept scoping | Breaks a vague concept into positive and negative sub-concepts | Gives teams a structured first draft of the policy boundary |
| Borderline retrieval | Finds images that stress-test the current definition | Moves review time from random browsing to targeted ambiguity discovery |
| User feedback | Captures labels and comments on ambiguous images | Converts tacit judgment into explicit decision rules |
| Greedy prompt refinement | Generates candidate definitions and keeps the one best aligned with accumulated labels | Turns the evolving concept into a usable VLM prompt |
| Optional user editing | Lets users inspect and revise the updated definition | Keeps the process legible instead of hiding it inside model magic |
The most business-relevant feature is not automation alone. It is legible automation. A system that silently “improves” the classifier is hard to govern. A system that updates a structured definition can be reviewed, debated, versioned, and audited.
That is what policy-heavy AI workflows need. Not because everyone loves governance documents. Nobody does. But because when subjective classification goes wrong, someone eventually asks, “What rule did the system apply?” Shrugging at a prompt is not a strategy.
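One way to make "what rule did the system apply?" answerable is to keep every accepted definition as a dated version. The sketch below is an illustration only: `DefinitionVersion` and `DefinitionHistory` are hypothetical names, the F1 values are invented, and the paper does not describe its storage layer.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DefinitionVersion:
    version: int
    text: str            # the human-readable definition that was also the prompt
    reason: str          # why the boundary moved
    f1_on_labels: float  # score on the labels accumulated at commit time
    created_at: datetime


class DefinitionHistory:
    """Append-only log of definitions (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._versions: list[DefinitionVersion] = []

    def commit(self, text: str, reason: str, f1_on_labels: float) -> DefinitionVersion:
        entry = DefinitionVersion(
            version=len(self._versions) + 1,
            text=text,
            reason=reason,
            f1_on_labels=f1_on_labels,
            created_at=datetime.now(timezone.utc),
        )
        self._versions.append(entry)
        return entry

    def active_at(self, moment: datetime) -> Optional[DefinitionVersion]:
        """Return the latest version committed at or before `moment`."""
        eligible = [v for v in self._versions if v.created_at <= moment]
        return eligible[-1] if eligible else None


history = DefinitionHistory()
history.commit("healthy food: prepared meals with vegetables", "initial scoping", 0.50)
history.commit("healthy food: as above, excluding cream-heavy salads",
               "borderline batch: creamy dressings", 0.60)

rule = history.active_at(datetime.now(timezone.utc))
print(rule.version, rule.reason)
```

Storing the reason next to the text is the part that matters: it is the record of the argument, not just of the outcome.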
The main experiment tests alignment under live concept formation
The authors did not evaluate Agile Deliberation using only a fixed benchmark. That would have missed the point, because subjective concepts evolve during the session. Instead, they ran 18 live user sessions, each about 90 minutes long, with nine non-expert participants. Each participant completed two sessions on different concepts: one using Agile Deliberation and one using Manual Deliberation.
The two concepts were:
- Paid to play, a moderation-style concept covering clickbait images that promise unrealistic rewards for online entertainment.
- Healthy food, a curation-style concept involving subjective judgments about food imagery.
The baselines and supporting analyses were well chosen for the paper’s claim:
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Zero-shot VLM classification | Main automated baseline | Shows whether the initial concept description is enough | Does not test deliberation or prompt refinement |
| Modeling Collaborator | Comparison with automated decomposition | Tests whether LLM-generated richer prompts help without user deliberation | Does not capture evolving user intent |
| Manual Deliberation | Human-in-the-loop baseline | Tests whether users can manually search for borderline cases and refine prompts | Does not isolate which Agile module caused improvement |
| F1 across iteration rounds | Process evidence | Shows performance generally trending upward despite fluctuations | Does not prove longer sessions always improve results |
| User survey and interviews | Usability evidence | Tests cognitive effort, stress, and preference | Does not establish enterprise-scale productivity gains |
| Transfer to Qwen VLMs in appendix | Exploratory extension | Suggests definitions may transfer beyond the original Gemini classifier | Does not evaluate the full Agile pipeline on other models |
This distinction matters. The main evidence is the live user study and its performance/user-experience results. The appendix transfer test is useful, but it is not a second full validation of the framework. It applies generated definitions to additional VLMs; it does not re-run the entire human-in-the-loop process with those models.
The numbers say the system narrows the boundary, not merely “improves accuracy”
The paper uses F1, precision, and recall rather than accuracy because each participant’s personalized definition can create imbalanced test labels. That is the right choice. In subjective classification, accuracy can become a decorative number attached to a poorly understood label distribution.
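A small worked example shows why accuracy misleads on imbalanced labels. The counts below are invented for illustration (100 images, 10 of which the user considers positive); they are not from the paper.

```python
def metrics(predicted: list[bool], actual: list[bool]) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    fn = sum(not p and a for p, a in pairs)
    tn = sum(not p and not a for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented label set: 10 positives followed by 90 negatives.
actual = [True] * 10 + [False] * 90

# Classifier A never flags anything.
always_no = [False] * 100

# Classifier B catches 8 of the 10 positives and raises 12 false alarms.
flagger = [True] * 8 + [False] * 2 + [True] * 12 + [False] * 78

lazy = metrics(always_no, actual)
useful = metrics(flagger, actual)
print(lazy)
print(useful)
```

The classifier that does nothing wins on accuracy and scores zero on F1, which is the situation a personalized, lopsided label set invites.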
For participants in the Agile Deliberation condition, the system improved F1 over zero-shot for both concepts:
| Concept | Zero-shot F1 | Modeling Collaborator F1 | Agile Deliberation F1 | Agile F1 gain over zero-shot (absolute) |
|---|---|---|---|---|
| Paid to play | 0.48 | 0.53 | 0.59 | +0.11 |
| Healthy food | 0.48 | 0.50 | 0.58 | +0.10 |
Across the paper’s reported averages, Agile Deliberation achieved about 10.5% higher F1 than zero-shot and about 7% higher F1 than Modeling Collaborator in the main results. These figures are absolute differences in F1 (0.105 and 0.07 on the 0-to-1 scale, consistent with the table above), not relative improvements. The abstract reports 7.5% over automated decomposition baselines; the body rounds the average comparison to 7%.
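The arithmetic behind those averages is easy to check from the table, and it is worth doing because an absolute gain and a relative gain are different numbers. The sketch below only uses the F1 values quoted in the table; the helper names are illustrative.

```python
from statistics import mean

# F1 values copied from the table above.
F1 = {
    "paid to play": {"zero_shot": 0.48, "modeling_collaborator": 0.53, "agile": 0.59},
    "healthy food": {"zero_shot": 0.48, "modeling_collaborator": 0.50, "agile": 0.58},
}


def absolute_gain(baseline: float, improved: float) -> float:
    """Difference on the 0-to-1 F1 scale."""
    return improved - baseline


def relative_gain(baseline: float, improved: float) -> float:
    """Difference as a fraction of the baseline."""
    return (improved - baseline) / baseline


mean_abs_vs_zero_shot = mean(
    absolute_gain(row["zero_shot"], row["agile"]) for row in F1.values()
)
mean_abs_vs_collaborator = mean(
    absolute_gain(row["modeling_collaborator"], row["agile"]) for row in F1.values()
)
mean_rel_vs_zero_shot = mean(
    relative_gain(row["zero_shot"], row["agile"]) for row in F1.values()
)

print(f"absolute vs zero-shot:     {mean_abs_vs_zero_shot:.3f}")
print(f"absolute vs collaborator:  {mean_abs_vs_collaborator:.3f}")
print(f"relative vs zero-shot:     {mean_rel_vs_zero_shot:.1%}")
```

Read as absolute differences, the table reproduces the 10.5 and 7 figures. Read as a relative improvement over a 0.48 baseline, the same gain is roughly twice as large, so it matters which one a slide deck is quoting.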
The more revealing pattern is precision and recall. For Agile participants, precision rose sharply from zero-shot to Agile Deliberation while recall fell:
| Concept | Precision (zero-shot → Agile) | Recall (zero-shot → Agile) | Interpretation |
|---|---|---|---|
| Paid to play | 0.35 → 0.57 | 0.80 → 0.63 | The classifier became much less over-inclusive |
| Healthy food | 0.33 → 0.44 | 0.99 → 0.84 | The classifier stopped treating almost everything plausible as positive |
This is not a generic “model got smarter” story. It is a boundary-sharpening story.
Zero-shot VLMs tended to cast the net too widely. Agile Deliberation helped users articulate exclusions, edge cases, and thresholds. In business terms, this is often exactly what matters. A moderation system that catches everything suspicious but floods reviewers with false positives is not aligned. A brand-safety system that says “yes” to every vaguely clean image is not aligned. A product-quality classifier that cannot say “close, but no” is not aligned.
The paper’s result says that deliberation improved the model’s ability to say “no” in a way closer to the user’s emerging concept. That is a practical achievement. It is also less glamorous than benchmark climbing, which is probably why it is more useful.
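The precision figures in the table above translate directly into reviewer workload. The sketch below converts each precision value into false alarms per 100 flagged images and recomputes F1 as the harmonic mean of precision and recall. One caveat: F1 recomputed from averaged precision and recall need not match the F1 table exactly, presumably because the reported figures are averaged across participants.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def false_alarms_per_100_flags(precision: float) -> float:
    """Of every 100 images the classifier flags, how many the user would reject."""
    return round((1 - precision) * 100, 1)


# (precision, recall) pairs copied from the precision/recall table.
SHIFTS = {
    "paid to play": {"zero_shot": (0.35, 0.80), "agile": (0.57, 0.63)},
    "healthy food": {"zero_shot": (0.33, 0.99), "agile": (0.44, 0.84)},
}

for concept, conditions in SHIFTS.items():
    for condition, (precision, recall) in conditions.items():
        print(
            f"{concept:13s} {condition:10s} "
            f"F1~{f1(precision, recall):.2f}  "
            f"false alarms per 100 flags: {false_alarms_per_100_flags(precision)}"
        )
```

For paid to play, that is a drop from 65 to 43 rejected flags per hundred. The recall cost is real, but a queue where most flags are wrong is a queue reviewers learn to ignore.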
Manual deliberation helps, but it asks users to invent their own edge cases
The comparison with Manual Deliberation is especially important because it separates the value of “having a human involved” from the value of the structured system.
Manual Deliberation gave participants an image search engine and even access to detailed prompts from the automated Modeling Collaborator baseline. That is a generous baseline. Still, participants struggled to find useful borderline images. The paper reports that manual participants tried an average of 7.3 search queries when looking for borderline cases, often focusing on a narrow ambiguity type.
This is the hidden burden in many “human-in-the-loop” designs. They ask the human to provide judgment, but also to invent the diagnostic cases, search strategy, failure taxonomy, and prompt repair plan. At that point the loop contains a human, yes. It also contains a lot of unpaid product management.
Agile Deliberation reduces that burden by structuring the search. It does not ask users to be professional prompt engineers. It asks them to react to concrete borderline examples and explain why the current definition fails.
The user survey supports this reading. Compared with Manual Deliberation, Agile Deliberation showed significantly lower reported effort to achieve good performance: 3.11 versus 4.67 on the survey item where lower is better. It also showed significantly lower negative emotion such as insecurity, stress, or annoyance: 1.67 versus 3.00. All participants preferred Agile Deliberation.
The non-significant survey items should not be overplayed. Success in articulating concept ideas and ease of creating a comprehensive definition moved in Agile Deliberation’s favor but did not reach significance, so they are not among the statistically supported claims. The stronger conclusion is narrower: the structured workflow reduced effort and stress while delivering more consistent classifier gains.
That is already enough.
The appendix transfer test is promising, but it is not a deployment guarantee
The paper includes an additional appendix experiment on transferability. The authors took concept definitions produced using Gemini 2.5 Flash and applied them to two Qwen models: Qwen3-VL-8B and Qwen3-VL-30B-A3B-Instruct.
This test asks a useful question: if Agile Deliberation produces richer definitions, do those definitions remain useful outside the original VLM?
The answer is cautiously positive. Agile definitions still produced gains on the Qwen models, with the stronger pattern appearing on the more capable model. For example, on Qwen3-VL-30B-A3B-Instruct, Agile definitions improved paid-to-play F1 from 0.49 to 0.59 and healthy-food F1 from 0.47 to 0.52 for the Agile participant condition.
But this should be read as an exploratory extension, not as proof that one Agile-generated prompt will travel perfectly across all VLMs. The full pipeline was not re-run across different model families. The paper itself notes that the implementation primarily relies on one generative model in the main pipeline and that broader evaluation across models remains future work.
The business translation is simple: structured definitions are likely more portable than unstructured vibes. But model-specific behavior still matters. Unfortunately, “portable vibes” remains unavailable as an enterprise feature.
What businesses should actually learn from this paper
The obvious summary is that Agile Deliberation improves subjective visual classification. True, but too small.
The better business lesson is that subjective AI deployment needs a concept-authoring workflow before it needs another model comparison spreadsheet.
In many organizations, the workflow currently looks like this:
- A team writes a vague category name.
- A model or vendor returns classifications.
- Reviewers complain about false positives and false negatives.
- Someone edits the prompt.
- Nobody can remember why the boundary moved.
- Repeat until morale improves.
Agile Deliberation suggests a more disciplined alternative:
| Business problem | Agile Deliberation response | Practical value |
|---|---|---|
| Policy terms are vague | Turn them into structured positive and negative sub-concepts | Makes subjective policy inspectable |
| Reviewers disagree on edge cases | Surface coherent borderline batches | Forces useful disagreement early |
| Prompts drift after manual edits | Refine definitions against accumulated labels | Preserves consistency across rounds |
| Experts waste time searching for examples | Automate borderline retrieval and ambiguity mining | Reduces cognitive search cost |
| Model behavior is hard to audit | Keep the evolving definition human-readable | Supports governance and handoff |
This is especially relevant for content moderation, ad review, brand safety, marketplace curation, visual compliance, trust-and-safety queues, and internal document/image triage where the category is partly normative. The framework is less relevant when the label is objective, abundant, and already well-defined. If the task is “detect whether a cat is present,” concept deliberation is probably overkill. The cat, unlike the policy team, has boundaries.
The ROI pathway is not only fewer labels. It is cheaper boundary discovery, faster reviewer alignment, and fewer downstream disputes over what the classifier was supposed to do. That is harder to measure than annotation cost, but in policy-heavy workflows it may be the bigger cost center.
Where the evidence stops
The paper is careful about its limits, and readers should be too.
The study involved nine participants and eighteen sessions. That is substantial interaction time for an HCI-style study, but not a large enterprise deployment. The participants were non-experts recruited within the authors’ organization, not professional moderators operating under production pressure. The study used two concepts, which are plausible and useful but not enough to cover the full variety of subjective visual policy work.
The evaluation also does not isolate component-level effects. We do not know from this paper how much of the gain comes from concept scoping, borderline retrieval, ambiguity clustering, prompt refinement, or the interface design. The authors intentionally evaluate the workflow as an integrated system, which is reasonable because those pieces are tightly coupled in practice. Still, a buyer or product team should not read the paper as proof that every individual module is independently necessary.
There is also a model-dependence boundary. The main implementation uses Gemini models, with an appendix transfer test on Qwen models. That makes the framework plausible as a model-agnostic layer, but not proven across the broader ecosystem of VLMs, image retrieval systems, enterprise data distributions, and latency/cost constraints.
Finally, the paper evaluates classifier alignment after deliberation sessions. It does not measure long-term organizational effects: reviewer consistency over weeks, policy drift, appeal outcomes, audit quality, legal risk, or operational throughput. Those are exactly the metrics a business pilot would need.
The product lesson: build the policy workbench, not just the classifier
The deepest contribution of Agile Deliberation is that it changes where the intelligence is supposed to live.
A weaker system treats the human as a label source and the model as the classifier. A better system treats the human as the evolving source of judgment and the model as a tool for making that judgment explicit, testable, and executable.
That is a different product category. It is not merely “AI image classification.” It is a concept workbench for subjective visual policy.
For Cognaptus readers, the takeaway is not to copy this exact architecture tomorrow. The takeaway is to stop treating subjective classification as if the hard part begins after the policy is written. In many real workflows, the policy is not written. It is discovered through examples, arguments, exceptions, and edge cases. Agile Deliberation gives that discovery process a computational shape.
And that is why the paper matters. It does not promise that a VLM can magically understand what your team means by “high quality,” “unsafe,” “misleading,” or “healthy.” It offers something more useful: a way to help the team find out what it means, and then make the model follow that definition.
Small miracle. No vibes API required.
Cognaptus: Automate the Present, Incubate the Future.
[1] Leijie Wang, Otilia Stretcu, Wei Qiao, Thomas Denby, Krishnamurthy Viswanathan, Enming Luo, Chun-Ta Lu, Tushar Dogra, Ranjay Krishna, and Ariel Fuxman, “Agile Deliberation: Concept Deliberation for Subjective Visual Classification,” arXiv:2512.10821. https://arxiv.org/abs/2512.10821