Humans in the Loop, Not Just the Dataset

TL;DR for operators

AI-assisted monitoring does not become trustworthy because a human occasionally clicks “wrong label.” It becomes useful when the whole product is designed to capture, validate, resolve, and redeploy human judgement.

The paper behind this article studies an open-source Telegram monitoring tool being developed with civil society organisations, using conspiracy-theory classification as the working scenario.¹ Its practical contribution is a workflow: Telegram posts are classified, CSO users review labels during their normal monitoring work, their feedback is stored with metadata, and that accumulated feedback becomes a gold-standard dataset for model evaluation and refinement.

The paper compares two routes. A fine-tuned domain model offers consistency, reproducibility, and control, but needs retraining discipline and technical capacity. A prompted LLM offers flexibility, longer-context handling, and easier task switching, but brings output variability, prompt-governance overhead, and privacy concerns when external services are used. Neither is crowned the winner. Sensible, if disappointingly unfashionable.

For operators, the point is simple: the model choice is secondary to the feedback system. Without representative feedback, disagreement handling, validation splits, rollout procedures, and privacy-aware infrastructure, “human-in-the-loop” is just a slogan wearing a UX badge.

The real product is not the classifier

A familiar AI product fantasy goes like this: take a difficult social problem, add a model, show a confidence score, then invite humans to “review” whatever the system emits. The user becomes a moral decoration on top of automation. Very tasteful. Usually useless.

The paper takes a more practical view. Civil society organisations monitoring harmful content on Telegram are not passive recipients of machine output. They have domain expertise, contextual knowledge, and operational constraints. The proposed tool therefore places feedback inside the monitoring workflow itself: users search Telegram channels, see posts in a feed, receive a binary conspiracy-theory classification with a confidence score, and can correct the label directly where they work.

That sounds modest. It is actually the architectural centre of the paper.

The feedback is not treated as anecdotal irritation. It is stored with the original classification and metadata, gradually forming a gold-standard dataset. That dataset then supports evaluation and refinement. In the fine-tuning path, it can inform future retraining. In the prompting path, it can guide prompt revision or few-shot example selection. Either way, the human judgement becomes system memory rather than vanishing into a support ticket.
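As a minimal sketch of what "stored with the original classification and metadata" could look like in practice: the paper does not publish a schema, so every field name below (`post_id`, `model_version`, `reviewer_id`, and the rest) is an assumption made for illustration. The point is only that the reviewer's judgement travels with what the model originally said and which model version said it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class FeedbackRecord:
    """One reviewer judgement, kept next to the model's original output.

    All field names are hypothetical; the paper does not specify a schema.
    """
    post_id: str
    model_label: bool        # original binary classification
    model_confidence: float  # confidence score shown to the reviewer
    model_version: str       # which classifier or prompt produced the label
    reviewer_id: str         # pseudonymous reviewer identifier
    reviewer_label: bool     # the reviewer's own judgement
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def is_correction(self) -> bool:
        """True when the reviewer overrode the model's label."""
        return self.reviewer_label != self.model_label


def corrections_by_version(records: list[FeedbackRecord]) -> dict[str, int]:
    """Count overrides per model version, which only works with provenance."""
    counts: dict[str, int] = {}
    for rec in records:
        if rec.is_correction:
            counts[rec.model_version] = counts.get(rec.model_version, 0) + 1
    return counts
```

Keeping `model_version` on every record is the design choice that matters here: without it, feedback on last month's prompt and this month's classifier ends up in the same pile.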

This is where the paper’s title matters. Civil society organisations are “in the loop” not merely because someone can override the model. They are in the loop because their interactions change the evidence base on which the next version of the system is judged.

Fine-tuning gives control; prompting gives manoeuvrability

The paper frames the implementation choice as a comparison between two strategies: fine-tuning a domain-specific model and prompting a large language model. The obvious temptation is to ask which one is better. The useful question is duller and more expensive: better under which operating constraints?

Fine-tuning starts from a BERT-based classifier trained on Telegram data for conspiracy-theory detection. Its appeal is control. A fine-tuned classifier can produce more consistent outputs, is easier to reproduce, and can be deployed in a way that gives the organisation stronger control over model behaviour. For monitoring work, that matters. If the same post receives different labels depending on prompt phrasing, API version, or model mood, the tool becomes a source of operational fog.

Prompting has a different strength. It is flexible. A prompted LLM can adapt more easily to task variations, longer inputs, raw or noisy posts, and adjacent tasks such as summarisation or question answering. In a domain like Telegram monitoring, where discourse mutates quickly and posts are messy, multilingual, ironic, or semi-coded, that flexibility is not decorative. It may be the difference between a tool that keeps pace and one that becomes historically interesting.

The trade-off is not symmetrical. Fine-tuning asks for technical capacity up front: retraining schedules, data management, validation discipline, and deployment infrastructure. Prompting lowers some of that engineering burden but creates its own governance problems: prompt versioning, output parsing, example selection, evaluation, and privacy exposure if external APIs are used.

The paper’s comparison can be translated into an operator-facing decision table:

Decision pressure	Fine-tuned model	Prompted LLM	Practical reading
Need stable, reproducible labels	Stronger fit	Weaker fit	Use fine-tuning when consistency is a governance requirement, not just a preference.
Need rapid task switching	Weaker fit	Stronger fit	Use prompting when the monitoring task is still evolving.
Need local control over data	Stronger fit if locally deployed	Riskier if API-based	External LLM services may be operationally convenient and legally awkward. Lovely combination.
Need low technical maintenance	Weaker fit	Stronger fit at first	Prompting is easier to start, but not free to govern.
Need adaptation from user feedback	Stronger path through retraining	Possible through prompt revision and few-shot updates	Both need validation; neither improves just because users complain.
Need robust handling of noisy inputs	More limited	Stronger	Telegram is not a clean benchmark dataset with better manners.

This table should not be read as a scoreboard. It is a procurement map. The right architecture depends on which of these the organisation values most: reproducibility, adaptability, cost control, local deployment, or fast experimentation.

Feedback is biased before it becomes useful

The paper’s most important operational warning concerns feedback quality. Users are more likely to correct a model when it is wrong than to confirm when it is right. This negativity bias is normal human behaviour. Nobody writes a thank-you note to a classifier for correctly labelling a boring post.

For model improvement, though, silence is ambiguous. Did the user agree with the label? Did they skip the post? Did they not understand the interface? Did they scroll past while half-asleep, which is perhaps the most common state of modern knowledge work?
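A small piece of arithmetic shows why this matters for evaluation. The sketch below assumes, purely for illustration, that users flag wrong labels far more often than they confirm right ones; the function name and all the rates are invented, not measured by the paper.

```python
def naive_accuracy_from_feedback(n_posts: int,
                                 true_accuracy: float,
                                 p_flag_wrong: float,
                                 p_confirm_right: float) -> float:
    """Accuracy you would compute if you only looked at posts with feedback.

    p_flag_wrong:    chance a user corrects a wrong label (assumed)
    p_confirm_right: chance a user confirms a correct label (assumed)
    """
    right = n_posts * true_accuracy
    wrong = n_posts - right
    confirmations = right * p_confirm_right
    corrections = wrong * p_flag_wrong
    if confirmations + corrections == 0:
        raise ValueError("no feedback at all, nothing to estimate")
    return confirmations / (confirmations + corrections)


# A model that is right 90% of the time, reviewed by users who correct
# 60% of errors but confirm only 5% of correct labels, looks far worse
# than it is when judged on the logged feedback alone.
estimate = naive_accuracy_from_feedback(1000, 0.90, 0.60, 0.05)
```

With those made-up rates the logged feedback contains about 45 confirmations and 60 corrections, so the naive estimate lands well under one half for a model that is actually right nine times in ten. When the two rates are equal, the estimate matches the true accuracy, which is the case representative sampling is trying to approximate.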

The paper considers two routes. One is explicit representative feedback: ask users to rate a full sample of posts, including examples where the model may be correct. This gives cleaner evaluation data but creates extra work and pulls users away from their normal monitoring routine. The other is implicit affirmation: treat posts that were seen, clicked, or left unmarked as weak signals of agreement, perhaps weighting actions differently.

That second route is seductive and dangerous. It makes the feedback loop less burdensome, but it requires tracking which users saw which posts. That creates technical complexity and privacy concerns. It also risks turning ordinary interface behaviour into labelled evidence without enough epistemic humility. A click is not a belief. A scroll is not a vote. A non-action is not necessarily agreement, no matter how much the dashboard would like it to be.
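One hedged way to keep implicit signals honest is to let them count, but only a little. The sketch below is an assumption-laden illustration: the action names, the weights, and the cap are all hypothetical, and the paper only says that actions might be weighted differently and that explicit feedback should take precedence.

```python
from typing import Optional

# Hypothetical weights: how strongly each passive action counts as
# agreement with the model's label. Illustrative numbers, not from the paper.
IMPLICIT_WEIGHTS = {"seen": 0.1, "clicked": 0.3, "left_unmarked": 0.2}

# Passive behaviour never counts as much as an explicit judgement.
MAX_IMPLICIT_WEIGHT = 0.5


def agreement_signal(explicit_label: Optional[bool],
                     model_label: bool,
                     actions: list[str]) -> tuple[bool, float]:
    """Return (label, weight) for one user and one post.

    Explicit feedback always wins and gets full weight. Otherwise passive
    actions are read as weak agreement with the model, capped below 1.0.
    """
    if explicit_label is not None:
        return explicit_label, 1.0
    weight = sum(IMPLICIT_WEIGHTS.get(action, 0.0) for action in set(actions))
    return model_label, min(weight, MAX_IMPLICIT_WEIGHT)
```

A user who saw and clicked a post without marking it yields the model's own label at a low weight; a user who explicitly disagreed yields their label at full weight, whatever else they did. Whether the weights should be 0.1 or 0.01 is exactly the kind of question that needs evidence rather than a dashboard's optimism.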

For product teams, this is the core design problem. Human feedback is not raw truth. It is behavioural data produced by an interface, under time pressure, inside an organisation with its own incentives. Treat it as a signal, not scripture.

Multi-user disagreement is a governance feature, not a data-cleaning nuisance

The paper also flags a second issue: multiple users may disagree about the same post. In harmful-content monitoring, that is not an edge case. It is the job.

One user may see coded extremist language where another sees ordinary political complaint. One organisation may classify conspiracy-theory content differently from another because its mandate, country context, or risk threshold differs. The easiest technical response is to collapse disagreement into a single label. The better product response is to preserve enough disagreement to make the system governable.

The paper suggests that explicit feedback should take precedence over implicit signals, and that user conflicts may need a dedicated resolution interface. That is more than a UX detail. It is a recognition that classification is partly institutional judgement.

This matters for any enterprise adopting human-in-the-loop AI, not only civil society monitoring. In compliance, credit review, fraud detection, HR screening, medical triage, or public-sector casework, disagreement among experts is not simply noise. It reveals where policy, risk appetite, and domain interpretation diverge.

A serious feedback system therefore needs three things:

Requirement	Why it matters	What happens if ignored
Feedback provenance	The system must know who labelled what, when, and under which context.	Labels become anonymous residue, impossible to audit.
Conflict resolution	Disagreement must be routed, reviewed, or preserved deliberately.	The dataset quietly encodes whoever clicked last.
Governance rules	Teams need rules for when feedback updates prompts, retraining sets, or deployment versions.	“Continuous improvement” becomes continuous drift with better branding.
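The conflict-resolution requirement can be sketched as a small routing rule. This is an illustration under stated assumptions, not the paper's design: the `Resolution` type, the `min_agreement` parameter, and the idea of leaving contested posts unlabelled until reviewed are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resolution:
    post_id: str
    label: Optional[bool]   # None means no settled label yet
    votes: dict[bool, int]  # preserved so disagreement stays auditable
    needs_review: bool      # True routes the post to a resolution interface


def resolve(post_id: str,
            explicit_labels: list[bool],
            min_agreement: float = 1.0) -> Resolution:
    """Settle a label only when enough explicit reviewers agree.

    min_agreement is the share of votes the leading label must reach;
    the default of 1.0 demands unanimity (an assumed policy).
    """
    votes = Counter(explicit_labels)
    total = sum(votes.values())
    if total == 0:
        return Resolution(post_id, None, {}, needs_review=False)
    top_label, top_count = votes.most_common(1)[0]
    if top_count / total >= min_agreement:
        return Resolution(post_id, top_label, dict(votes), needs_review=False)
    return Resolution(post_id, None, dict(votes), needs_review=True)
```

The vote counts are kept even when a label is settled, so a later audit can still see that a post was contested. Last-click-wins is the behaviour this is meant to avoid.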

This is where the business relevance becomes broader than content moderation. The paper is really about turning expert judgement into operational infrastructure.

The figure is a workflow, not an experiment

The paper includes a workflow diagram showing how feedback moves through both the fine-tuned and prompt-based routes. It is useful, but it should be interpreted correctly.

It is not an experimental result. It does not show that one method outperforms the other. It does not report accuracy, precision, recall, user satisfaction, cost reduction, or moderation outcomes. Its purpose is implementation design: classify all data, collect user feedback, evaluate with accumulated feedback, detect performance drops or domain shifts, then refine either by retraining and fine-tuning or by prompt engineering.
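The "evaluate, then detect performance drops" step of that workflow can be sketched in a few lines. Everything specific here is an assumption: the paper does not name a metric, a window size, or a tolerance, so plain accuracy, `max_drop`, and `min_samples` are placeholders. The sketch also assumes the feedback in each window is reasonably representative, which is not guaranteed.

```python
def accuracy(pairs: list[tuple[bool, bool]]) -> float:
    """Share of (model_label, gold_label) pairs that agree."""
    if not pairs:
        raise ValueError("no labelled posts to evaluate")
    return sum(model == gold for model, gold in pairs) / len(pairs)


def needs_refinement(baseline: list[tuple[bool, bool]],
                     recent: list[tuple[bool, bool]],
                     max_drop: float = 0.05,
                     min_samples: int = 50) -> bool:
    """Flag a performance drop between two feedback windows.

    Returns False when the recent window is too small to judge, so a
    handful of angry corrections does not trigger a retraining run.
    """
    if len(recent) < min_samples:
        return False
    return accuracy(baseline) - accuracy(recent) > max_drop
```

A `True` result is a prompt to investigate, not a verdict. The same flag fires for a degraded model, a shifted discourse, or a new reviewer with a stricter reading of the task, and telling those apart still takes a human.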

That distinction matters because AI papers often invite accidental over-reading. A workflow diagram can look like a validated system if one is in a generous mood. Here, the authors are explicit that the work is in progress and that both approaches are being prototyped.

The table comparing fine-tuning and prompting serves a similar role. It is not a benchmark table. It is a design comparison across flexibility, control, resources, and adaptation. It supports a decision framework, not a winner’s podium.

Paper element	Likely purpose	What it supports	What it does not prove
Feedback workflow figure	Implementation detail and system design	How user feedback can feed either retraining or prompt refinement	That the system improves performance in deployment
FT vs P comparison table	Design trade-off comparison	Why fine-tuning and prompting suit different constraints	That either strategy is empirically superior
Discussion of negativity bias	Operational risk analysis	Why feedback capture must be designed carefully	The magnitude of the bias in this tool
Future hybrid strategies	Exploratory extension	Possible direction: pre-label easy cases, route uncertain cases to humans	That hybrid routing has been tested successfully

This disciplined reading protects the paper from both underestimation and hype. It is not “just a concept note,” because workflow design is where many AI systems fail. But it is also not evidence that CSO monitoring has been solved. Nobody gets to declare victory on a four-page work-in-progress paper. Even AI optimism has paperwork.

Prompting is flexible, but prompt governance is still governance

Prompt-based systems are often sold as lighter-weight alternatives to fine-tuning. In some ways, they are. A team can modify instructions, add few-shot examples, change task framing, and extend the system to adjacent tasks without retraining a model.

But the paper correctly notes that prompt effectiveness still requires structured feedback, technical guidance, and robust evaluation. Few-shot prompting depends heavily on the number and selection of examples. A prompt that works on one validation sample may fail when discourse shifts, Telegram channels change tone, or users introduce new categories of concern.

This is the part many business teams miss. Prompting moves work from model training into prompt lifecycle management. It changes the governance object; it does not remove governance.

A prompted monitoring system needs version control for prompts, test sets for prompt revisions, failure logs, parsing rules, and deployment procedures. It also needs rules for when a prompt can be changed immediately and when a change requires validation. The paper notes that prompt updates can be applied quickly, but still need monitoring. In business language: fast deployment is not the same as safe deployment.
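A minimal sketch of that discipline, assuming a team keeps its own prompt registry: the `PromptRegistry` class, the `evaluate` callback, and the `min_gain` rule are all hypothetical, and the paper prescribes none of them. The idea is simply that a prompt becomes active only after it has been scored on a validation set, and that rejected candidates are kept on record.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PromptVersion:
    text: str
    validation_accuracy: float

    @property
    def version_id(self) -> str:
        """Short content hash, so identical prompts share an identifier."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


class PromptRegistry:
    """Keeps every proposed prompt and promotes only validated ones."""

    def __init__(self, min_gain: float = 0.0) -> None:
        self.min_gain = min_gain  # required improvement over the active prompt
        self.active: Optional[PromptVersion] = None
        self.history: list[PromptVersion] = []

    def propose(self, text: str, evaluate: Callable[[str], float]) -> bool:
        """Score a candidate on the validation set; promote it if good enough."""
        candidate = PromptVersion(text, evaluate(text))
        self.history.append(candidate)  # rejected candidates are logged too
        if (self.active is None
                or candidate.validation_accuracy
                >= self.active.validation_accuracy + self.min_gain):
            self.active = candidate
            return True
        return False
```

Here `evaluate` stands in for whatever runs the prompt against held-out gold-standard posts and returns a score. The registry does not care how that is done, only that it happens before the prompt goes live.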

Fine-tuning has a more visible engineering cost. Prompting hides some of its cost in operations. Pick your poison; at least label the bottle.

Privacy and cost decide more than model elegance

The paper is especially grounded when it discusses infrastructure. CSOs often operate with limited resources. They also handle politically sensitive data. That combination constrains model choice more sharply than a leaderboard would.

External LLM APIs may be attractive because they avoid local GPU costs and lower implementation barriers. But sending Telegram monitoring data to third-party services raises privacy and compliance concerns. Local deployment gives more control but may require hardware that smaller organisations do not have, especially for larger LLMs.

The authors therefore suggest prioritising smaller or distilled models to balance performance, cost, and accessibility. That recommendation is less glamorous than “use the most capable model available,” which is exactly why it is useful.

For business teams, the equivalent lesson is straightforward. The best AI architecture is not the one with the most impressive demo. It is the one that can survive budget, compliance, maintenance, audit, and user adoption. A brittle high-performing model is still brittle. A flexible model that cannot be governed is still a liability. A private model nobody can afford to run is a theoretical virtue.

The paper’s CSO context makes this unusually visible, but the same logic applies across enterprises. AI product strategy is mostly constraint management with nicer icons.

Hybrid routing is the likely destination

The paper’s future-work section points toward hybrid strategies: high-confidence cases can be pre-labelled automatically, while uncertain cases are routed to users for validation. This is probably where many real monitoring products will land.

A hybrid system uses automation where the model is confident and preserves human attention for ambiguous cases. Done well, this can reduce workload while improving the quality of feedback. Done poorly, it can create blind spots by hiding false confidence behind automation.

The design challenge is to decide what counts as “uncertain.” Confidence scores from classifiers are not always calibrated. Prompted LLMs may produce fluent explanations for shaky judgements. Domain shift can make yesterday’s high-confidence pattern unreliable today. If hybrid routing is adopted, it needs calibration checks, sampling of supposedly easy cases, and periodic review of automated decisions.
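Those safeguards can be sketched together, with the usual caveats: the threshold, the audit rate, and the route names below are assumptions, and the paper only describes hybrid routing as future work. The first function routes posts by confidence while still sampling some "easy" cases for human review; the second gives a crude check of whether high confidence is deserved.

```python
import random
from typing import Optional


def route(confidence: float,
          threshold: float = 0.9,
          audit_rate: float = 0.05,
          rng: Optional[random.Random] = None) -> str:
    """Decide what happens to one classified post.

    Low-confidence posts go to a human. High-confidence posts are
    auto-labelled, except for a random audit sample that is also reviewed.
    """
    rng = rng or random.Random()
    if confidence < threshold:
        return "human_review"
    if rng.random() < audit_rate:
        return "audit_sample"
    return "auto_label"


def overconfidence(confidences: list[float],
                   correct: list[bool],
                   threshold: float = 0.9) -> float:
    """Mean confidence minus observed accuracy among auto-label candidates.

    A clearly positive value suggests the scores are not calibrated well
    enough to trust the threshold. Returns 0.0 if nothing clears it.
    """
    picked = [(c, ok) for c, ok in zip(confidences, correct) if c >= threshold]
    if not picked:
        return 0.0
    mean_conf = sum(c for c, _ in picked) / len(picked)
    acc = sum(ok for _, ok in picked) / len(picked)
    return mean_conf - acc
```

The audit sample is what supplies the `correct` values for the second function: without humans occasionally checking the supposedly easy cases, there is no way to notice that the confident region has quietly gone wrong.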

There is also a collaboration question. The authors mention the possibility of sharing annotated data or federating models across CSOs. That could improve scalability, but it introduces governance problems. Different organisations may label content differently because their missions differ. Pooling labels without reconciling those differences can produce a dataset that looks larger while becoming conceptually muddier.

Shared infrastructure is attractive. Shared definitions are harder. Naturally, the hard part is the one that matters.

What Cognaptus infers for business practice

The paper directly shows a proposed workflow and a comparison of two model-integration strategies for CSO-operated Telegram monitoring. It does not show production performance, user adoption rates, or cost-benefit outcomes. Cognaptus’ business reading is therefore an inference from the design logic, not a claim of demonstrated ROI.

That inference is still useful.

First, feedback should be treated as a product capability, not a support feature. If user corrections are valuable, the system must capture them with provenance, context, and enough structure to support evaluation. A comment box is not a feedback architecture.

Second, model strategy should follow governance needs. If reproducibility and control dominate, fine-tuning or smaller local models may be preferable. If task flexibility and fast adaptation dominate, prompting may be appropriate, provided prompt changes are evaluated rather than improvised.

Third, validation must sit between feedback and deployment. The paper’s workflow includes evaluation before updated models or prompts are deployed. That step is essential. Without it, feedback loops can amplify bias, overfit to vocal users, or chase recent anomalies.

Fourth, disagreement should be designed for. Expert conflict is not a defect in the dataset. It is information about policy boundaries, organisational priorities, and ambiguous cases. Mature systems preserve and resolve that information deliberately.

Fifth, infrastructure constraints are strategic constraints. Privacy, hardware access, API dependence, and maintenance capacity are not implementation footnotes. They determine what kind of AI system can responsibly exist.

The boundary: this is a design paper, not proof of superiority

The main limitation is simple: the paper is a work in progress. It proposes and compares approaches; it does not report a deployed evaluation showing that feedback improves classification accuracy, reduces analyst workload, or increases trust among CSO users.

There are no ablations. No benchmark table of model performance. No longitudinal field study. No quantitative measurement of negativity bias. No cost model. No user-study evidence showing which feedback interface works best.

That does not make the paper weak. It makes it a design-stage contribution. Its value lies in naming the operational pieces that must exist before performance claims would mean much: representative feedback, implicit-versus-explicit signal handling, conflict resolution, validation sets, retraining schedules, prompt governance, staged rollout, privacy-aware infrastructure, and long-term adaptation.

The wrong reading is “fine-tuning versus prompting: who wins?” The better reading is “what organisational machinery must surround either choice before it becomes useful?”

That machinery is where most AI deployment conversations become less glamorous and more honest. Which is, regrettably, where the work begins.

Conclusion: put the loop where the work happens

“Human-in-the-loop” is often used as a comfort phrase. It suggests that if a human appears somewhere near the model, responsibility has been restored. The paper pushes toward a stricter version: put the loop inside the work, capture the judgement, validate the update, govern the change, and keep the system affordable enough for the people who actually need it.

For civil society organisations monitoring Telegram, that means CSO expertise should shape not only final decisions but also the evolving evidence base behind the classifier or prompt. For businesses, the lesson is broader. A useful AI system is not just a model attached to a workflow. It is a workflow that knows how to learn without forgetting who gets to decide what learning means.

Humans in the loop, then, should not be decorative. They should be structural.

Cognaptus: Automate the Present, Incubate the Future.

¹ Milena Pustet, Elisabeth Steffen, Helena Mihaljević, Grischa Stanjek, and Yannies Illies, “Civil Society in the Loop: Feedback-Driven Adaptation of (L)LM-Assisted Classification in an Open-Source Telegram Monitoring Tool,” arXiv:2507.06734, 2025. https://arxiv.org/abs/2507.06734