Opening — Why this matters now
Health misinformation is not a fringe problem anymore. It is algorithmically amplified, emotionally charged, and often wrapped in scientific‑looking language that fools both humans and machines. Most AI fact‑checking systems respond by doing more — more retrieval, more reasoning, more prompts. This paper argues the opposite: do less first, think harder only when needed.
The result is a clean two‑stage architecture that treats consensus as a signal and debate as a last resort — a surprisingly disciplined approach in an era of over‑confident language models.
Background — From retrieval to reasoning overload
Classic medical fact‑checking pipelines relied on encoder models (BERT and friends): retrieve evidence, classify veracity, move on. They were brittle, database‑dependent, and expensive to maintain.
LLM‑era systems fixed retrieval flexibility but introduced a new failure mode: reasoning without filtering. Web agents, RAG pipelines, and multi‑turn verifiers often assume that more text automatically means more truth. It doesn’t. Conflicting evidence simply gets averaged into confident nonsense.
Recent multi‑agent debate frameworks helped by introducing adversarial reasoning — but at a cost: debate everywhere, all the time. That’s intellectually elegant and computationally wasteful.
This paper’s core insight is blunt and useful: most claims don’t need a debate.
Analysis — What the paper actually does
The system is deliberately split into two asymmetric stages.
Stage 1: Agreement Score Prediction
Before any debate, the system asks a simpler question: Do credible sources broadly agree?
For each claim:
- Key entities are extracted.
- Multiple targeted queries retrieve articles.
- Each article is evaluated along three axes:
  - Relevance: does it truly address all entities?
  - Quality weight: does it resemble a real scientific paper (methods, results, limitations, statistics)?
  - Verdict: support or refute.
These signals are combined into a normalized agreement score between −1 and 1.
If the score exceeds a threshold, the system issues a verdict immediately. No theatrics.
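To make the mechanics concrete, here is a minimal sketch of how such a score could be combined and thresholded, assuming each retrieved article has already been rated for relevance, quality, and verdict. The relevance-times-quality weighting and the 0.6 threshold are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class EvidenceItem:
    relevance: float  # 0..1, does the article address all claim entities?
    quality: float    # 0..1, does it resemble a real scientific paper?
    verdict: int      # +1 = supports the claim, -1 = refutes it

def agreement_score(evidence: list[EvidenceItem]) -> float:
    """Combine per-article signals into a normalized score in [-1, 1]."""
    weights = [e.relevance * e.quality for e in evidence]
    if sum(weights) == 0:
        return 0.0  # no usable evidence -> maximal uncertainty
    weighted_votes = sum(w * e.verdict for w, e in zip(weights, evidence))
    return weighted_votes / sum(weights)

def stage_one_verdict(evidence: list[EvidenceItem], threshold: float = 0.6):
    """Issue a verdict when consensus is strong; otherwise defer to Stage 2."""
    score = agreement_score(evidence)
    if score >= threshold:
        return "supported", score
    if score <= -threshold:
        return "refuted", score
    return "escalate_to_debate", score
```

The important property is the early exit: when consensus is strong, the more expensive machinery in Stage 2 never runs.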
Stage 2: Multi‑Agent Debate (only when needed)
If agreement is weak, then the gloves come off.
- A Support Agent and Refute Agent receive curated, high‑quality evidence.
- They argue in structured rounds.
- A Judge Agent monitors coherence, evidence strength, and argumentative progress.
- Debate ends early if clarity emerges; otherwise it is forcibly resolved after a fixed number of rounds.
Crucially, debate is not used to discover evidence — only to reason over already‑filtered material.
That distinction matters.
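A minimal sketch of the debate loop, assuming the support, refute, and judge roles are supplied as LLM-backed callables. The interfaces, the judge's return signature, and the three-round cap are illustrative assumptions rather than the paper's exact implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DebateState:
    transcript: list[tuple[str, str]] = field(default_factory=list)

def run_debate(
    evidence: list[str],
    support_turn: Callable[[list[str], list[tuple[str, str]]], str],
    refute_turn: Callable[[list[str], list[tuple[str, str]]], str],
    judge: Callable[[list[tuple[str, str]]], tuple[bool, str]],
    max_rounds: int = 3,
) -> tuple[str, list[tuple[str, str]]]:
    """Structured debate over already-filtered evidence.

    The judge returns (is_decisive, verdict): the debate ends early once
    clarity emerges, and is forcibly resolved after max_rounds otherwise.
    """
    state = DebateState()
    verdict = "unresolved"
    for _ in range(max_rounds):
        state.transcript.append(("support", support_turn(evidence, state.transcript)))
        state.transcript.append(("refute", refute_turn(evidence, state.transcript)))
        is_decisive, verdict = judge(state.transcript)
        if is_decisive:  # clarity emerged -> stop early
            break
    return verdict, state.transcript
```

Note that the loop receives evidence as an argument and never retrieves anything itself: debate reasons over pre-filtered material, exactly as the paper prescribes.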
Findings — Results that actually say something
The numbers tell a consistent story across three health datasets.
| Setting | Key Outcome |
|---|---|
| Stage‑1 only | Competitive with strong web‑agent baselines |
| Stage‑1 + Debate | +3 to +8 F1 improvement on contested claims |
| High‑agreement subset | >90% F1 without debate |
Two observations stand out:
- Most claims are settled cheaply. Roughly half are resolved confidently without debate.
- Debate improves precision, not just recall. This is rare — and important — in medical contexts.
The system outperforms a recent step‑by‑step reasoning baseline while being more selective about when it reasons at all.
Implications — Why this design matters beyond healthcare
This architecture quietly challenges several LLM dogmas:
- Reasoning is not always beneficial. Sometimes it amplifies noise.
- Agents should not always talk. Silence can be a signal of confidence.
- Explainability scales better when conditional. Forced explanations degrade quality.
For regulated industries — healthcare, finance, compliance — this pattern is gold:
Score first. Debate only when disagreement is real.
It reduces cost, limits hallucination surfaces, and produces explanations exactly where auditors expect them.
Conclusion — Less drama, more truth
This paper does not propose a smarter language model. It proposes a smarter workflow.
By treating consensus as a computational primitive and debate as an exception, it moves LLM systems one step closer to how careful humans actually reason — quietly confident when evidence aligns, argumentative only when it doesn’t.
In a space flooded with maximalist agent architectures, this restraint is the real innovation.
Cognaptus: Automate the Present, Incubate the Future.