DISARM, but Make It Agentic: When Frameworks Start Doing the Work

Taxonomies do not investigate campaigns by themselves

A framework is a very respectable filing cabinet.

DISARM, the Disinformation Analysis and Risk Management framework, gives analysts a standardized vocabulary for describing foreign information manipulation and interference, or FIMI. It organizes influence operations into tactics, techniques, and procedures. That is useful. It gives researchers, governments, platform teams, and security practitioners a shared language instead of a pile of screenshots, vibes, and mutually incompatible spreadsheets.

But a shared language is not yet an investigation.

The paper An Agentic Operationalization of DISARM for FIMI Investigation on Social Media takes that gap seriously.¹ Its real contribution is not “AI detects disinformation,” which would be a suspiciously convenient headline and therefore should be handled with tongs. The contribution is more operational: the authors turn DISARM from a static taxonomy into an executable investigative workflow.

That difference matters. A static framework helps analysts label what they already understand. An agentic workflow helps them search, hypothesize, test, verify, and document what they may not yet have noticed.

The system in the paper does not replace expert FIMI investigators. It does something more credible: it compresses the early and middle stages of investigation, maps machine-generated findings back into a recognized taxonomy, and forces those findings through statistical and human verification. In other words, it tries to make the analyst faster without pretending the analyst has become unnecessary. A rare moment of restraint. We should enjoy it.

The paper’s central move is making DISARM executable

The paper begins from a practical bottleneck. FIMI campaigns are coordinated, cross-platform, multilingual, and increasingly AI-assisted. DISARM gives investigators a common schema for classifying campaign behavior, but mapping large social media datasets into that schema remains labor-intensive.

The authors’ answer is an agentic pipeline. Importantly, they clarify that “agentic” here does not mean a classical multi-agent society with many independent artificial workers arguing in a conference room. It is a workflow centered on a single LLM-driven agent that runs through multiple investigation iterations, keeps history, selects techniques, generates hypotheses, executes queries and scripts, and stores findings.

The workflow has two broad phases:

Investigation planning and execution, where the system explores data, selects DISARM techniques, formulates hypotheses, gathers evidence, and ranks signals.
Evidence assessment, where complex findings are decomposed into atomic claims, converted into detection conditions, statistically checked against labels or annotations, and finally reviewed by human domain practitioners.

That is the mechanism. Not magic classification. Not a “truth detector.” Not a moderation button with a military haircut.

The interesting part is how the system moves from taxonomy to testable evidence. DISARM supplies the behavioral vocabulary. The agent supplies iterative operationalization. SQLite stores the evidence trail. Statistical verification disciplines the LLM’s claims. Human evaluation catches what still requires judgment.

A simplified version looks like this:

Raw social media data
        ↓
Neutral exploratory data analysis
        ↓
DISARM technique selection
        ↓
Hypothesis and test planning
        ↓
SQL / Python evidence gathering
        ↓
Signal scoring and structured logging
        ↓
Atomic evidence extraction
        ↓
Machine verification against labels
        ↓
Human practitioner review

The framework “starts doing the work” only because the system surrounds it with procedural machinery. DISARM is still the map. The pipeline is the vehicle.
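To make the "structured logging" step concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The paper's evidence trail lives in SQLite, but the `findings` table, its columns, the placeholder technique IDs, and the scores below are all our assumptions for illustration, not the authors' schema.

```python
import json
import sqlite3

# Hypothetical schema: one row per finding, tied to a technique and iteration.
SCHEMA = """
CREATE TABLE findings (
    iteration     INTEGER NOT NULL,
    technique_id  TEXT    NOT NULL,  -- placeholder ID, not a real DISARM code
    hypothesis    TEXT    NOT NULL,
    query         TEXT    NOT NULL,  -- the SQL that produced the result
    signal_score  REAL    NOT NULL,  -- assumed 0.0 to 1.0 scale
    result_json   TEXT    NOT NULL
)
"""

def log_finding(conn, iteration, technique_id, hypothesis, query, signal_score, result):
    """Append one finding so later steps can audit how it was produced."""
    conn.execute(
        "INSERT INTO findings VALUES (?, ?, ?, ?, ?, ?)",
        (iteration, technique_id, hypothesis, query, signal_score, json.dumps(result)),
    )
    conn.commit()

def ranked_findings(conn):
    """Return (technique_id, signal_score) pairs, strongest signal first."""
    return conn.execute(
        "SELECT technique_id, signal_score FROM findings ORDER BY signal_score DESC"
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
log_finding(conn, 1, "TECH-A", "Posting volume is concentrated in a few hours",
            "SELECT substr(created_at, 1, 13), COUNT(*) FROM posts GROUP BY 1",
            0.4, {"peak_hour_share": 0.31})
log_finding(conn, 2, "TECH-B", "Accounts were created in a narrow time window",
            "SELECT created_month, COUNT(*) FROM accounts GROUP BY 1",
            0.9, {"accounts_in_window": 120})
print(ranked_findings(conn))
```

Storing the query next to the result is the design choice that matters here: a reviewer can rerun the exact SQL behind any score instead of trusting a summary.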

Deferred anomaly detection is a quiet but important design choice

One of the better design decisions in the paper is easy to miss: the first iteration performs exploratory data analysis without immediately searching for anomalies or assigning interpretive meaning.

That sounds modest. It is not.

In FIMI analysis, premature interpretation is dangerous. If the system begins by looking for manipulative behavior, it may force normal platform patterns into suspicious categories. High volume becomes “flooding.” Repeated slogans become “coordination.” Similar comments become “automation.” Sometimes those are real signals. Sometimes the internet is just being the internet, an unfortunate baseline condition.

The authors instead separate initial understanding from later attribution. The first stage establishes dataset scale, account counts, temporal range, and platform-specific structure. Only after that does the agent begin selecting DISARM techniques and generating hypotheses.
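A neutral first pass of this kind might look like the sketch below. It assumes a hypothetical SQLite table `posts(account_id, created_at, text)` with ISO-8601 timestamps; the table, columns, and sample rows are illustrative and not taken from the paper. The function describes the data and does nothing else: no thresholds, no flags.

```python
import sqlite3

def describe_dataset(conn: sqlite3.Connection) -> dict:
    """Return descriptive statistics only; no anomaly flags, no interpretation."""
    row = conn.execute(
        """
        SELECT COUNT(*),
               COUNT(DISTINCT account_id),
               MIN(created_at),
               MAX(created_at)
        FROM posts
        """
    ).fetchone()
    return {
        "n_posts": row[0],
        "n_accounts": row[1],
        "first_post": row[2],   # ISO-8601 strings sort chronologically
        "last_post": row[3],
    }

# Hypothetical toy data standing in for a real export.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (account_id TEXT, created_at TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [
        ("a1", "2025-01-01T10:00:00", "hello"),
        ("a1", "2025-01-02T11:30:00", "again"),
        ("a2", "2025-01-03T09:15:00", "news"),
    ],
)
summary = describe_dataset(conn)
print(summary)
```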

This is not merely a cleanliness preference. It changes the epistemic order of the workflow:

Stage	What the system does	Why it matters
Initial EDA	Describes the dataset without assigning suspicious meaning	Reduces premature framework-driven interpretation
Technique selection	Chooses relevant DISARM techniques using prior findings and data characteristics	Keeps investigation tied to a shared taxonomy
Planning	Converts a technique into hypotheses, methods, SQL queries, and success criteria	Makes taxonomy operational rather than decorative
Evidence gathering	Runs SQL and Python analyses, then scores signal strength	Produces measurable findings rather than free-text impressions
Verification	Converts claims into detection predicates and checks them statistically	Prevents LLM-generated evidence from floating away elegantly
Human review	Practitioners evaluate the verified claims in context	Preserves expert judgment where intent and context matter

This sequence is the paper’s real architecture. The authors are not saying, “Let an LLM decide whether something is disinformation.” They are saying, “Let an LLM help structure and execute a hypothesis-driven investigation, then verify the pieces.”

That is a much more useful sentence, even if it is less likely to be printed on a conference banner.

The agent does not merely label TTPs; it chooses what to investigate next

After the initial exploratory pass, the system uses an LLM-guided explore–exploit strategy to select DISARM techniques. The paper describes this as inspired by multi-armed bandit logic, though not a formal statistical bandit algorithm.

The distinction matters. The agent is not optimizing a mathematically specified reward function. It is using prior findings to decide whether to cover untested techniques or dig deeper into promising ones.

The selection criteria include:

whether a technique may produce new insight;
whether it matches observed dataset characteristics;
whether it can be operationalized through SQL queries;
whether a more specific sub-technique is available.
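The explore-exploit idea behind these criteria can be sketched as a simple scoring rule. In the paper the choice is made by the LLM, not by a formula, so the function below is a deterministic stand-in of our own: the `Technique` fields, the `explore_bonus` parameter, and the placeholder IDs are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Technique:
    technique_id: str            # placeholder ID, not a real DISARM code
    times_tested: int            # iterations that already used this technique
    best_signal: float           # strongest signal score so far, 0.0 to 1.0
    sql_operationalizable: bool  # can it be expressed as a query on this data?

def select_next(techniques: list[Technique], explore_bonus: float = 0.5) -> Technique:
    """Pick the technique with the highest exploit-plus-explore score."""
    candidates = [t for t in techniques if t.sql_operationalizable]
    if not candidates:
        raise ValueError("no technique can be operationalized on this dataset")

    def score(t: Technique) -> float:
        # Exploit: prior signal strength. Explore: a bonus that shrinks with reuse.
        return t.best_signal + explore_bonus / (1 + t.times_tested)

    return max(candidates, key=score)

pool = [
    Technique("TECH-A", times_tested=2, best_signal=0.6, sql_operationalizable=True),
    Technique("TECH-B", times_tested=0, best_signal=0.0, sql_operationalizable=True),
    Technique("TECH-C", times_tested=0, best_signal=0.0, sql_operationalizable=False),
]
chosen = select_next(pool)
print(chosen.technique_id)
```

With the default bonus the rule digs deeper into the promising `TECH-A`; raising `explore_bonus` to 1.0 tips it toward the untested `TECH-B`. `TECH-C` is never chosen because it cannot be turned into a query.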

That last point is practical. A broad category may sound impressive but be useless for measurement. A narrower technique can become a query, a metric, a threshold, and eventually an evidence claim. In operational work, specificity is not a luxury. It is the difference between “something is weird here” and “these accounts were created in a suspicious temporal cluster.”

Once a DISARM technique is selected, the system generates an investigation plan. The plan may include a hypothesis, analysis method, SQL query, and success criteria. For example, a technique related to flooding the information space can become a test for abnormal temporal concentration, hourly outliers, or burst metrics.

This is where the paper’s mechanism-first reading pays off. The value is not in the agent knowing the word “flooding.” The value is in converting “flooding” into something observable.

A framework becomes useful when it can produce a test.
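As an illustration of turning "flooding" into something observable, the sketch below buckets posts by hour and flags hours whose volume is a z-score outlier. The z-score rule, the threshold of 3.0, and the synthetic timestamps are our assumptions; the paper mentions hourly outliers and burst metrics but this is not its implementation.

```python
from collections import Counter
from statistics import mean, pstdev

def hourly_bursts(timestamps: list[str], z_threshold: float = 3.0) -> list[tuple[str, int]]:
    """Return (hour, count) pairs whose count is a high z-score outlier.

    Only hours with at least one post are bucketed, so silent hours do not
    lower the baseline in this simplified version.
    """
    if not timestamps:
        return []
    # Bucket ISO-8601 timestamps by hour: "2025-01-01T10:15:00" -> "2025-01-01T10".
    counts = Counter(ts[:13] for ts in timestamps)
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # perfectly flat activity: nothing to flag
    return sorted(
        (hour, n) for hour, n in counts.items() if (n - mu) / sigma > z_threshold
    )

# Synthetic day: 2 posts in every hour except a 50-post spike at noon.
timestamps = []
for h in range(24):
    n = 50 if h == 12 else 2
    timestamps += [f"2025-01-01T{h:02d}:{m:02d}:00" for m in range(n)]

print(hourly_bursts(timestamps))
```

A flagged hour is a candidate signal, not a verdict. Breaking news produces the same spike, which is why the later verification steps exist.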

Atomic evidence is the paper’s main anti-handwaving device

LLM-generated analysis has a familiar failure mode: it produces a finding that sounds coherent but contains several embedded claims. Some may be true. Some may be weak. Some may be untestable. The paragraph smiles politely and hopes nobody asks for receipts.

The authors address this through atomic evidence decomposition.

Instead of treating a complex finding as a single result, the pipeline breaks it into discrete claims that can be independently verified. Each atomic evidence unit is then translated into an executable detection condition, applied to the population, and compared against available labels or annotations.

The verification process includes:

parsing the natural-language evidence into a SQL predicate;
identifying accounts matching that condition;
building a confusion matrix against ground truth or annotations;
applying effect-size and significance tests.
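The first three steps in this list can be sketched with Python's built-in `sqlite3` module. The `accounts` table, its columns, the `is_io` label, and the example predicate are hypothetical; in the paper the predicate is derived by the LLM from a natural-language claim, a step this sketch skips.

```python
import sqlite3

def confusion_matrix(conn: sqlite3.Connection, predicate: str) -> dict:
    """Count TP/FP/FN/TN for a SQL predicate against the is_io label.

    The predicate is interpolated into the query, so it must come from a
    trusted, reviewed source; never pass untrusted input here.
    """
    query = f"""
        SELECT
            COALESCE(SUM(CASE WHEN ({predicate}) AND is_io = 1 THEN 1 ELSE 0 END), 0),
            COALESCE(SUM(CASE WHEN ({predicate}) AND is_io = 0 THEN 1 ELSE 0 END), 0),
            COALESCE(SUM(CASE WHEN NOT ({predicate}) AND is_io = 1 THEN 1 ELSE 0 END), 0),
            COALESCE(SUM(CASE WHEN NOT ({predicate}) AND is_io = 0 THEN 1 ELSE 0 END), 0)
        FROM accounts
    """
    tp, fp, fn, tn = conn.execute(query).fetchone()
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

# Hypothetical toy population: is_io = 1 marks a labeled campaign account.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id TEXT, created_month TEXT, is_io INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [
        ("a1", "2018-08", 1), ("a2", "2018-08", 1), ("a3", "2018-08", 0),
        ("a4", "2017-03", 1), ("a5", "2016-11", 0), ("a6", "2019-01", 0),
    ],
)
cm = confusion_matrix(conn, "created_month = '2018-08'")
print(cm)
```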

A claim passes only if it meets three criteria: the threshold is explicit, the odds ratio reaches the required effect-size threshold, and Fisher’s exact test reaches the significance threshold.

The paper uses an odds ratio threshold of 3.0 and a statistical significance threshold of $p < 0.05$. The exact formula is not the point. The governance point is that evidence claims must become testable objects.
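The three-part gate can be sketched in a few lines of standard-library Python. The thresholds (odds ratio of at least 3.0, p below 0.05) come from the paper as described above. Everything else is our assumption: the function names, the one-sided form of Fisher's exact test, the handling of zero cells, and the example counts.

```python
from math import comb, inf

def odds_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    """Odds ratio of a 2x2 table; zero-cell handling here is an assumption."""
    if fp * fn == 0:
        # 0/0 is undefined; returning 0.0 is a conservative choice that fails the gate.
        return inf if tp * tn > 0 else 0.0
    return (tp * tn) / (fp * fn)

def fisher_exact_greater(tp: int, fp: int, fn: int, tn: int) -> float:
    """One-sided Fisher's exact p-value: probability of at least `tp` labeled
    accounts among the flagged ones if the predicate were unrelated to the label."""
    flagged = tp + fp      # accounts matching the predicate
    positives = tp + fn    # accounts labeled as campaign accounts
    total = tp + fp + fn + tn
    k_max = min(flagged, positives)
    tail = sum(
        comb(positives, k) * comb(total - positives, flagged - k)
        for k in range(tp, k_max + 1)
    )
    return tail / comb(total, flagged)

def claim_passes(tp, fp, fn, tn, has_explicit_threshold, min_or=3.0, alpha=0.05) -> bool:
    """A claim passes only if all three criteria hold."""
    return (
        has_explicit_threshold
        and odds_ratio(tp, fp, fn, tn) >= min_or
        and fisher_exact_greater(tp, fp, fn, tn) < alpha
    )

# Hypothetical counts: a strong claim, and one with a large odds ratio but too little data.
print(claim_passes(8, 2, 2, 8, has_explicit_threshold=True))
print(claim_passes(3, 1, 1, 3, has_explicit_threshold=True))
```

The second example is the instructive one. Its odds ratio is 9, well above 3.0, but with only eight accounts the p-value is 17/70, about 0.24, so the claim fails. Effect size without sample size is exactly the kind of evidence the gate is built to reject.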

This is a useful pattern beyond FIMI. Any organization using LLMs for investigation—fraud analysis, compliance review, insider-risk triage, brand-safety monitoring, supplier-risk scanning—faces the same problem. The model can produce plausible synthesis. The organization needs auditable claims.

Atomic evidence is one way to make the model stop narrating and start submitting workpapers.

The evaluation tests workflow quality, not autonomous truth

The paper evaluates the pipeline on two practitioner-annotated datasets.

The first is a Chinese information operation dataset from X, focused on coordinated attacks on Guo Wengui. The authors merge two China datasets, restrict the data to Mandarin-language campaign-related posts, clean and tokenize the text, and retain posts with at least 10 valid tokens. The resulting dataset contains 617 information-operation accounts and 157 organic accounts, with 40,863 cleaned Chinese-language posts.

The second is a Russian Telegram campaign dataset related to Moldova’s 2025 election. It covers 88 political and news channels with open comments. The final dataset contains 595,814 comments from 30,297 unique accounts, including 584 verified bot accounts that produced 86,448 comments, or 14.5% of all comments.

These datasets matter because the system is not being tested in a vacuum. It is being tested against practitioner-annotated or manually verified evidence. That allows the authors to ask whether the agent’s proposed evidence corresponds to known campaign signals.

But the reader should interpret the evaluation carefully. The test is not: “Can the system fully detect FIMI?” The test is closer to: “Can the system generate DISARM-aligned, statistically verifiable evidence that usefully supports investigation?”

The paper reports evidence quality at two levels, atomic claims and techniques, and supplements them with human evaluation and a cost report:

Evaluation level	Likely purpose	What it supports	What it does not prove
Atomic evidence pass rate	Main evidence of claim-level reliability	Some generated claims survive statistical verification	The system’s full investigation output is reliable by default
Technique pass rate	Main evidence of technique-level usefulness	A technique can be useful if at least one generated claim passes	Every claim under that technique is valid
Human evaluation	Qualitative validation and boundary detection	Practitioners find useful signals and missing dimensions	Fully automated classification is ready
Cost and runtime report	Implementation detail and feasibility signal	The workflow may be inexpensive to run in this setting	Costs generalize across models, platforms, or data regimes

This distinction prevents a common mistake: overreading pass rates as if they were end-to-end detection accuracy. They are not. They are evidence-validation rates inside an investigative pipeline.

That is still valuable. It is just valuable in the way a good analyst’s notebook is valuable, not in the way a production classifier’s ROC curve is valuable.

The numbers show useful selectivity, not blanket reliability

Across both datasets, the system produced 84 atomic evidence claims. Of those, 24 passed verification and 60 failed, for a combined atomic evidence pass rate of 28.6%.

At first glance, that may sound low. It is actually more interesting than low.

If the system were marketed as an autonomous detector, a 28.6% claim-level pass rate would be uncomfortable. If the system is understood as an investigative assistant generating candidate evidence for verification, that rate looks different. It means most generated claims should not be trusted without verification, but a meaningful minority survive. The paper’s own design assumes exactly that.

The dataset-level split is also informative:

Dataset	Atomic evidence claims	Pass rate	PASS	FAIL
Russia / Telegram	42	38.1%	16	26
China / X	42	19.0%	8	34
Combined	84	28.6%	24	60

The Telegram dataset produced a higher atomic evidence pass rate than the X dataset. The paper does not turn this into a sweeping platform theory, and neither should we. A safer interpretation is that the bot-like behavioral patterns in the Telegram case may have been easier for the pipeline’s available metrics to capture than the coordination patterns in the X case.

Technique-level pass rates look stronger:

Dataset	Technique iterations	Pass rate	PASS	FAIL
Russia / Telegram	14	64.3%	9	5
China / X	14	35.7%	5	9
Combined	28	50.0%	14	14

This is where interpretation matters. A technique iteration passes if any evidence in that iteration passes. That means the technique-level metric asks whether a DISARM-guided investigative direction yielded at least one validated signal. It does not mean that the whole bundle of generated evidence was correct.
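The gap between the two metrics follows directly from the "any evidence passes" rule. The sketch below computes both rates from per-iteration results; the iteration names and the toy outcomes are hypothetical, while the final two lines simply re-derive the combined percentages from the counts reported in the tables above.

```python
def pass_rates(iterations: dict[str, list[bool]]) -> tuple[float, float]:
    """Return (atomic pass rate, technique pass rate).

    A technique iteration counts as passing if any of its claims passed.
    """
    claims = [ok for results in iterations.values() for ok in results]
    atomic = sum(claims) / len(claims)
    technique = sum(any(results) for results in iterations.values()) / len(iterations)
    return atomic, technique

# Hypothetical outcomes: four iterations with three claims each.
toy = {
    "iter-1": [True, False, False],
    "iter-2": [False, False, False],
    "iter-3": [True, True, False],
    "iter-4": [False, False, False],
}
atomic_rate, technique_rate = pass_rates(toy)
print(atomic_rate, technique_rate)

# Re-deriving the combined figures from the reported counts.
print(round(100 * 24 / 84, 1))  # atomic: 24 of 84 claims
print(round(100 * 14 / 28, 1))  # technique: 14 of 28 iterations
```

In the toy data, a quarter of the claims pass but half of the iterations do. The technique-level number will always be at least as flattering as the claim-level one, so it should be read as "this direction was worth exploring," not "this bundle was right."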

For business and operational use, that distinction is not pedantic. It tells us how to deploy such a system. Do not use it as a final classifier. Use it as a structured hypothesis generator that can surface promising investigative directions, each of which must still be validated.

The machine is useful. The machine is not absolved.

The China/X case shows the pipeline can rediscover known campaign structure

In the China/X dataset, human evaluation found that the agent surfaced a suspicious burst in account creation dates. Practitioners corroborated this as related to the creation of information-operation accounts. In particular, most accounts created in August 2018 were IO accounts, with reported precision of 0.97.

That is a strong signal. It also illustrates what the pipeline does well: it can find a measurable behavioral cluster that corresponds to campaign infrastructure.

Additional signals such as empty profile descriptions, campaign-related keywords, and interaction patterns were less important as primary detectors because the account-creation burst was already highly precise. Still, these secondary features helped characterize subgroups of IO accounts.

The missing signal is equally important. The paper notes that the current pipeline did not capture temporal synchronization, such as users repeatedly retweeting the same posts within short time windows across multiple days. Human analysts observed this coordination in retweets, replies, and especially mentions, but the agent-generated evidence did not surface it.

That absence is not a small technical footnote. Temporal synchronization is often central to coordinated behavior. A system that detects static or aggregate behavioral features but misses synchronized action can still be useful, but it will under-detect certain campaign mechanics.
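To show what a synchronization signal of this kind might look like in code, the sketch below counts account pairs that retweeted the same post within a short window on more than one day. This is not part of the paper's pipeline; the function, the 60-second window, the two-day minimum, and the sample events are all our assumptions, offered only to make the missing signal concrete.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import combinations

def synchronized_pairs(
    retweets: list[tuple[str, str, str]],
    window: timedelta = timedelta(seconds=60),
    min_days: int = 2,
) -> dict[tuple[str, str], int]:
    """Map account pairs to the number of distinct days on which they
    retweeted the same post within `window` of each other."""
    by_post = defaultdict(list)
    for account, post_id, ts in retweets:
        by_post[post_id].append((datetime.fromisoformat(ts), account))

    days_together = defaultdict(set)
    for events in by_post.values():
        # Pairwise comparison is quadratic per post; fine for a sketch.
        for (t1, a1), (t2, a2) in combinations(sorted(events), 2):
            if a1 != a2 and t2 - t1 <= window:
                days_together[tuple(sorted((a1, a2)))].add(t1.date())

    return {pair: len(days) for pair, days in days_together.items() if len(days) >= min_days}

# Hypothetical events: (account, retweeted post, timestamp).
retweets = [
    ("u1", "p1", "2019-05-01T10:00:00"),
    ("u2", "p1", "2019-05-01T10:00:20"),
    ("u3", "p1", "2019-05-01T15:00:00"),
    ("u1", "p2", "2019-05-02T08:30:00"),
    ("u2", "p2", "2019-05-02T08:30:45"),
    ("u3", "p3", "2019-05-03T12:00:00"),
]
pairs = synchronized_pairs(retweets)
print(pairs)
```

Requiring repetition across days is the part that separates this from coincidence: two accounts reacting to the same viral post within a minute is ordinary, doing so day after day is less so.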

So the China/X case gives a balanced result:

What the pipeline captured	What it missed	Interpretation
Account-creation burst with high precision	Repeated short-window temporal synchronization	Good at surfacing some structural behavioral signals, weaker at dynamic coordination patterns
Secondary profile and content features	Rich temporal interaction sequences	Useful for characterization, not sufficient for full campaign reconstruction

This is exactly why mechanism-first reading is better than headline reading. “Agentic DISARM works” is too vague. “Agentic DISARM can surface certain measurable behavioral clusters but needs better temporal modeling for synchronization” is less glamorous and much more useful.

The Telegram case shows complementary discovery, not replacement of investigators

The Russian Telegram dataset provides the paper’s most operationally striking result: the pipeline surfaced more than 30 additional bot accounts that had not been detected in the initial human-led investigation.

That is the kind of result organizations should pay attention to, but not overinterpret.

The system also identified auxiliary metrics that aligned with patterns human analysts had seen before, including duplicated comments, shorter average comment length, and lower mean reactions for bot-generated comments. These signals helped corroborate suspicious borderline accounts.

However, the paper is careful about their limits. Duplicated comments can be meaningful, but only with context. Short comments, emojis, and platform-native repetition can appear organically. Message length may help, but it is not a standalone bot detector. Reaction counts can provide signal, but are shaped by channel dynamics and audience behavior.
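Two of these auxiliary metrics, duplicated comments and average comment length, are simple enough to sketch per account. The function name, the normalization (trim and lowercase), and the sample comments are our assumptions, and in keeping with the caveats above the output is a descriptive profile, not a bot score.

```python
from collections import Counter
from statistics import mean

def account_metrics(comments: list[str]) -> dict:
    """Summarize one account's comments; descriptive only, not a classifier."""
    if not comments:
        raise ValueError("account has no comments")
    # Treat comments as duplicates if they match after trimming and lowercasing.
    counts = Counter(c.strip().lower() for c in comments)
    duplicated = sum(n for n in counts.values() if n > 1)
    return {
        "n_comments": len(comments),
        "duplicate_share": duplicated / len(comments),
        "mean_length": mean(len(c) for c in comments),
    }

# Hypothetical comments from a single account.
metrics = account_metrics(["Vote now!", "vote now!", "Vote now!", "ok"])
print(metrics)
```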

The human-led investigation also used indicators the algorithm did not surface, including consecutive days of sustained activity and repeated dissemination of identical links to AI-generated video content. These are not minor omissions. They show that expert investigators often notice campaign-specific operational patterns that are hard to capture with generic behavioral metrics.

The Telegram result therefore supports a complementary model:

Agentic pipeline contribution	Human analyst contribution
Rapidly generates candidate patterns	Judges campaign context and intent
Maps findings to DISARM techniques	Distinguishes bots, trolls, and organic high-activity users
Produces auditable, testable claims	Notices evolving infrastructure and narrative tactics
Surfaces additional suspicious accounts	Weighs heterogeneous evidence into final classification

The system is a force multiplier. It is not a court.

That is still a meaningful achievement. In high-tempo investigations, narrowing the candidate pool and producing standardized evidence can save analyst time. The fact that it found more than 30 previously undetected accounts suggests practical search value, especially in gray-zone cases where manual review has already found the obvious targets.

The cost result is promising, but should not be treated as universal economics

The authors report that the workflow ran on Claude Code with Claude Opus 4.5, deliberately split across multiple runs to prevent shortcutting of iterative steps. The process performed 15 iterations, including an initial exploratory iteration, with approximately 35 minutes of API runtime and total cost of $11.40.

This is a useful feasibility signal. It suggests that a structured agentic investigation can be run cheaply enough to support repeated analysis rather than one-off research theater.

But the cost number should be interpreted as implementation-specific. It depends on model pricing, context size, dataset preparation, toolchain design, number of iterations, and how much human review is required afterward. The API bill is not the total cost of an investigation. Anyone pretending otherwise has either never paid analysts or has discovered a new accounting standard called “ignore salaries.”
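A back-of-the-envelope version of that point: the sketch below combines the reported API figures ($11.40 over 15 iterations) with a review cost. The four review hours and the $90 hourly rate are invented purely for illustration and do not come from the paper.

```python
def cost_breakdown(api_cost: float, iterations: int,
                   review_hours: float, hourly_rate: float) -> dict:
    """Put the API bill next to the human review cost it depends on."""
    total = api_cost + review_hours * hourly_rate
    return {
        "api_cost_per_iteration": round(api_cost / iterations, 2),
        "total_cost": round(total, 2),
        "api_share_of_total": round(api_cost / total, 3),
    }

# Reported: $11.40 and 15 iterations. Hypothetical: 4 review hours at $90/hour.
report = cost_breakdown(api_cost=11.40, iterations=15, review_hours=4, hourly_rate=90.0)
print(report)
```

Under those made-up review numbers the API bill is roughly 3% of the total. The exact share is not the point; the direction is. The model is the cheap part.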

Still, the operational point remains strong. If an agentic workflow can produce DISARM-aligned candidate evidence in under an hour for modest direct compute cost, it becomes plausible as an analyst-side tool for triage, exploration, and repeated monitoring.

The business question is not “Can AI replace the FIMI team?” It is:

How much analyst time can be shifted from first-pass pattern hunting into expert judgment, escalation, and response design?

That is the ROI pathway.

The business value is workflow compression with audit trails

For firms, platforms, governments, and research teams, the paper’s practical relevance is broader than military FIMI investigation.

The core pattern is reusable: take a domain taxonomy, let an agent operationalize it into investigative hypotheses, force claims into atomic evidence, validate them where labels exist, and route outputs to human experts.

That pattern could matter in several business settings:

Use case	How the paper’s mechanism transfers	What remains uncertain
Platform trust and safety	DISARM-like behavioral taxonomies can guide investigation of coordinated manipulation	Access to reliable labels and platform metadata varies widely
Brand-risk monitoring	Agentic workflows can surface coordinated attacks, impersonation campaigns, or narrative flooding	Attribution and intent remain difficult outside high-confidence cases
Threat intelligence	Findings can be mapped into standardized TTP-style reporting for cross-team sharing	Multimodal and cross-platform campaigns need stronger integration
Election-risk analysis	Rapid hypothesis generation can help monitor emerging manipulation patterns	False positives have high civic and reputational cost
Compliance and investigations	Atomic evidence decomposition can make LLM-assisted review more auditable	Legal standards may require stricter evidentiary controls
Public-sector situational awareness	Standardized outputs can improve interoperability across agencies	Operational deployment requires governance, security, and review protocols

The most interesting business implication is not “AI for content moderation.” That category is already stuffed with systems that confidently confuse scale with understanding.

The better implication is agentic investigative infrastructure. The paper shows how a workflow can turn a framework into structured work: select the relevant category, produce a hypothesis, gather evidence, verify claims, and preserve traceability.

That is useful wherever organizations face repeated, messy, evidence-heavy investigations.

Standardization is the underrated part of the value proposition

DISARM’s role is not only analytical. It is organizational.

When multiple teams investigate influence operations, they often produce incompatible outputs. One team describes narratives. Another describes accounts. Another tracks hashtags. Another stores suspicious URLs. Everyone is working hard; the resulting intelligence still arrives like five people carrying different pieces of furniture through the same doorway.

A taxonomy reduces that friction. An agentic taxonomy-driven workflow reduces it further by generating outputs already linked to recognized techniques.

The paper frames this partly in defense terms: situational awareness, human-machine teaming, and interoperability. For business readers, the same principle applies under less dramatic names:

shared reporting formats;
comparable incident records;
reusable investigation templates;
auditable evidence trails;
easier handoff between analysts, managers, legal teams, and external partners.

Standardization does not make evidence true. It makes evidence easier to review, compare, escalate, and reuse.

That sounds administrative. It is also where many AI deployments quietly fail. The model produces something interesting, but the organization cannot absorb it into existing decision processes. Agentic output without workflow integration is just a very energetic intern.

This paper’s DISARM alignment is valuable because it gives the agent’s work a destination.

The boundary: behavioral evidence is powerful, but incomplete

The paper’s limitations are not generic “more research is needed” seasoning. They directly affect deployment.

First, the pipeline is currently strongest on textual social media data and associated metadata. The authors describe the framework as modality-agnostic by design, but this implementation focuses mainly on posts, comments, timestamps, account identifiers, and related metadata. Multimodal manipulation—images, video, memes, audio, synthetic personas across platforms—requires additional grounding.

Second, temporal synchronization is underdeveloped. This matters because many coordinated campaigns are defined not just by what accounts say, but by when they act together. The China/X case makes this visible.

Third, behavioral features alone cannot fully distinguish among bot farms, trolls, coordinated human users, and organic high-activity accounts. The paper notes that generative AI blurs some of these categories, but distinctions remain operationally relevant. Countermeasures differ depending on whether the activity is automated, semi-coordinated, human-operated, or infrastructure-linked.

Fourth, the evaluation uses two datasets. They are real and practitioner-annotated, which is valuable, but they do not establish broad generalization across platforms, languages, campaign types, or adversarial adaptations.

Fifth, verification depends on available labels or annotations. In the wild, labels are often incomplete, contested, or delayed. The system’s statistical verification step is only as useful as the reference data permits.

These boundaries do not weaken the paper’s main contribution. They define where the contribution should be used.

Use it for structured investigation support. Do not use it as an autonomous verdict engine. Use it to generate auditable leads. Do not use it to outsource responsibility for attribution, intent, or proportional response.

A boring boundary, perhaps. Also the line between a tool and a liability.

What Cognaptus infers, and what the paper directly shows

The paper directly shows that an LLM-driven agentic workflow can operationalize DISARM across two real-world FIMI datasets, generate atomic evidence claims, verify a subset of those claims statistically, and surface useful patterns for human analysts. It reports a combined 28.6% atomic evidence pass rate, a 50.0% technique pass rate, and more than 30 additional suspicious Telegram bot accounts surfaced beyond the prior human-led investigation.

Cognaptus infers that the larger business opportunity is not limited to disinformation research. The transferable pattern is taxonomy-grounded agentic investigation. Many organizations already have frameworks: risk taxonomies, control libraries, fraud typologies, incident response playbooks, compliance checklists, threat models. Most of those frameworks sit in documents. They help people classify work after the fact. They do not always help people do the work.

The paper suggests a different pattern:

Framework → Agentic operationalization → Atomic evidence → Verification → Human judgment

That sequence is more important than the specific use of DISARM.

What remains uncertain is generality. We do not yet know how well this approach performs across less-labeled datasets, richer multimodal campaigns, cross-platform coordination, adversaries adapting to the detector, or organizations with weaker data infrastructure. Nor does the paper provide a full production architecture for deployment in a live trust-and-safety or intelligence environment.

Still, as a research prototype, it points in a credible direction: make AI useful by embedding it inside a verification workflow, not by asking it to be wise.

When frameworks start doing the work, humans still own the judgment

The best way to read this paper is not as another entry in the “AI will detect disinformation” genre. That genre has produced many confident systems and many future apologies.

This paper is more practical. It asks how an established analytical framework can be converted into an agentic investigation process. It then adds the pieces that make the process less fragile: deferred interpretation, technique-guided hypothesis generation, executable queries, structured evidence logging, atomic claim extraction, statistical verification, and human review.

The results are imperfect in exactly the way useful investigative tools are imperfect. Many generated claims fail. Some technique paths produce no validated evidence. The system misses temporal synchronization and certain expert-recognized campaign signals. It needs labels for verification. It is not ready to become the final judge of coordinated manipulation.

But it does show that an agentic workflow can narrow the search space, rediscover known campaign structures, surface additional suspicious accounts, and produce evidence in a standardized form that humans can review.

That is the right kind of automation: not the theatrical replacement of expertise, but the disciplined compression of repetitive analytical labor.

Frameworks used to sit on the shelf and wait for analysts to apply them. This paper shows a version of the next step: frameworks that can guide agents through the messy middle of investigation.

The filing cabinet has not become an analyst. But it has learned to open the right drawer, pull the right folder, and attach a note saying, “This claim needs verification.”

For now, that is progress enough.

Cognaptus: Automate the Present, Incubate the Future.

¹ Kevin Tseng, Juan Carlos Fernandez Toledano, Bart De Clerck, Yuliia Dukach, and Phil Tinn, “An Agentic Operationalization of DISARM for FIMI Investigation on Social Media,” arXiv:2601.15109v3, 2026.
