When Compliance Blooms: ORCHID and the Rise of Agentic Legal AI

Procurement is where compliance anxiety goes to acquire a purchase order.

A laboratory wants to buy an item. Perhaps it is ordinary. Perhaps it is dual-use. Perhaps it belongs under the U.S. Munitions List (USML), Nuclear Regulatory Commission (NRC) controls, the Commerce Control List (CCL), or the broad residual category of EAR99. The practical question is not just “what is this?” It is “what is this under the rules, according to which rule text, with enough evidence that someone can defend the decision later?”

That last clause is where most AI compliance demos quietly begin to sweat.

The ORCHID paper, Orchestrated Retrieval-Augmented Classification with Human-in-the-Loop Intelligent Decision-Making for High-Risk Property, addresses exactly this kind of problem: High-Risk Property (HRP) classification in U.S. Department of Energy procurement workflows.¹ The system is not presented as a lawyer in a box, thankfully. It is a modular, agentic retrieval-augmented classification workflow that turns item metadata into policy-grounded evidence, a provisional classification, validator review, possible subject-matter-expert (SME) escalation, and an append-only audit trail.

The useful lesson is not that another LLM can classify another thing. We have seen that movie. It has too many sequels and not enough controls.

The useful lesson is that ORCHID treats legal classification as a governed workflow rather than a clever answer. Its real contribution is choreography: retrieval, description refinement, classification, validation, and feedback logging are separated into agents with explicit responsibilities. That distinction matters because compliance failures rarely come from one bad prediction alone. They come from untraceable reasoning, stale policy text, hidden uncertainty, missing reviewer intervention, and the eternal corporate hobby of discovering risk six months after the decision has already shipped.

The important move is not prediction; it is controlled handoff

ORCHID is built around a simple loop: Item → Evidence → Decision. The input is familiar enough: manufacturer, equipment or service, model number, and optionally a user-provided description. From there, the system retrieves relevant policy snippets, refines weak descriptions, proposes a classification, checks the evidentiary basis, and either emits a verified response or routes the case to a human reviewer.

That sounds procedural because it is. In high-risk compliance, procedural is a compliment.

The architecture uses a thin orchestrator sitting above specialised agents. The orchestrator does not generate the substantive model content. Instead, it schedules work, passes context, handles retries and timeouts, annotates each event with provenance, and carries run-card metadata such as retrieval configuration, model versions, hashes, and timestamps. The point is to make the decision path replayable rather than merely plausible.
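To make "replayable rather than merely plausible" concrete, here is a minimal sketch of how an orchestrator might stamp each agent output with provenance. The paper names the categories of run-card metadata but not a schema, so the `RunCard` fields, the `annotate` helper, and the use of SHA-256 over canonical JSON are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class RunCard:
    """Hypothetical run-card fields; the paper lists categories, not a schema."""
    retrieval_config: dict
    model_versions: dict
    corpus_snapshot: str


def fingerprint(payload: dict) -> str:
    """Stable SHA-256 over a canonical (key-sorted) JSON encoding."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def annotate(agent: str, payload: dict, card: RunCard) -> dict:
    """Wrap one agent output with provenance so the step can be replayed later."""
    return {
        "agent": agent,
        "payload": payload,
        "payload_hash": fingerprint(payload),
        "run_card_hash": fingerprint(asdict(card)),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Because the hash is computed over key-sorted JSON, two events with the same content produce the same fingerprint regardless of field order, which is what lets a later reviewer check that a replayed step matches the logged one.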

Component	What it does	Operational consequence
IR, the information retrieval agent	Converts item fields into hybrid search queries over policy-scoped material	Keeps the system anchored to the relevant regulatory corpus
DR, the description refiner	Clarifies or rewrites weak item descriptions	Reduces garbage-in classification without pretending bad input is fine
HRP classifier	Produces a provisional label, confidence score, and cited snippets	Makes the model’s output a proposal, not a verdict
VR, the validator	Checks coverage and conflict, then emits AGREE, REVIEW, or CONFLICT	Turns uncertainty into a routing decision
FL, the feedback logger	Records reviewer decisions and rationales in an append-only store	Preserves institutional judgment for audit and future reference
Orchestrator	Coordinates typed messages and provenance across the workflow	Makes the system reproducible rather than artisanal

This decomposition is the mechanism-first heart of the paper. A monolithic model can produce a confident answer. ORCHID instead creates a sequence of constrained handoffs. Each handoff narrows the opportunity for a quiet hallucination to become an official decision.

There is a small but important cultural shift here. Many enterprise AI deployments still treat “agentic” as meaning “the model can do more things.” In ORCHID, agentic means something stricter: each agent has a bounded job, a defined input/output contract, and a place in the audit trail. The system is not more trustworthy because it has more moving parts. It is more trustworthy only because the moving parts are named, constrained, logged, and interruptible.

A bureaucracy with a message bus, basically. But in this case, that is progress.

Citations are not decoration; they are operating constraints

In ordinary business AI, citations often play the role of decorative seriousness. A model writes a paragraph, appends a few links, and everyone pretends that the chain of reasoning has been civilised.

ORCHID is stricter. Retrieval is limited to a versioned policy corpus covering USML, NRC, CCL, and EAR99 guidance. The system uses hybrid retrieval: lexical BM25 search for exact policy language, embeddings for semantic similarity, and reciprocal rank fusion to combine results. Retrieved text is then reduced into minimally sufficient citation spans that the classifier must cite.
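Reciprocal rank fusion is simple enough to sketch in a few lines. Each retriever contributes 1 / (k + rank) for every snippet it returns, and the contributions are summed. The constant k = 60 is the value commonly used in the information-retrieval literature; the paper's own setting is not stated here, and the snippet ids below are invented for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical snippet ids, not taken from the paper's corpus.
bm25_hits = ["ccl-a", "usml-b", "ear99-c"]
embedding_hits = ["usml-b", "nrc-d", "ccl-a"]
fused = reciprocal_rank_fusion([bm25_hits, embedding_hits])
print(fused)
```

The snippet that ranks well in both lists rises to the top, while a snippet seen by only one retriever still survives lower down. Fusing by rank rather than raw score also avoids having to reconcile BM25 scores with embedding similarities, which live on different scales.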

The significance is not “RAG reduces hallucination,” a phrase now so overused it should be sent to a quiet farm. The significance is that ORCHID constrains what the classifier is allowed to see and what it is allowed to use as support. It cannot freely browse the web. It cannot import random vendor marketing copy as regulatory authority. It cannot cite vibes.

That matters because export-control and high-risk property classifications live in boundary conditions. A phrase in a product description may matter. A policy section may exclude what a similar section appears to include. A commercial item may become sensitive because of performance characteristics, intended use, or category overlap. Retrieval in this setting is not just information gathering; it is jurisdiction selection.

ORCHID’s validator then checks whether the proposed label has enough on-policy support and whether conflicting snippets are present. The validator is not a second philosopher pondering the nature of legality. It is a gate. If the evidence is weak or contradictory, the system can route the case to a subject-matter expert (SME).
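A gate of this kind can be sketched as a small pure function. The three verdicts (AGREE, REVIEW, CONFLICT) come from the paper; the rule ordering, the `Snippet` shape, and the `min_support` and `min_confidence` thresholds are hypothetical placeholders that a real deployment would need to calibrate.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    AGREE = "AGREE"
    REVIEW = "REVIEW"
    CONFLICT = "CONFLICT"


@dataclass
class Snippet:
    snippet_id: str
    supports: str  # the label this policy snippet supports, e.g. "CCL"


def validate(label, confidence, snippets, min_support=2, min_confidence=0.7):
    """Return a routing verdict for a proposed label (thresholds are assumptions)."""
    supporting = [s for s in snippets if s.supports == label]
    conflicting = [s for s in snippets if s.supports != label]
    if conflicting:
        # Any cited evidence pointing at another label is surfaced, not averaged away.
        return Verdict.CONFLICT
    if len(supporting) < min_support or confidence < min_confidence:
        # Too little on-policy coverage or too little confidence: ask a human.
        return Verdict.REVIEW
    return Verdict.AGREE
```

Under these assumed rules, only AGREE lets a case proceed without a reviewer. Conflict is checked before coverage so that a confident label with contradictory citations cannot slip through on confidence alone.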

This is where the design becomes more interesting than the model. The system does not merely ask, “what is the label?” It asks, “is there enough policy-grounded support for this label to move forward without human review?” That second question is usually where compliance actually lives.

Human-in-the-loop means the model is allowed to stop

The obvious misconception is that ORCHID automates export-control decisions. It does not. The paper is explicit that initial model outputs are proposals, uncertain items are routed to SMEs, and final determinations remain with qualified reviewers.

This is not a timid caveat. It is the product design.

A good compliance assistant should not maximise answer production. It should maximise appropriately supported progression through a workflow. Sometimes that means answering. Sometimes it means escalating. Sometimes it means showing the contradictory evidence and refusing to tidy the mess into a suspiciously neat conclusion.

ORCHID’s human-in-the-loop design has three functions.

First, it provides a safety valve. Low-confidence or conflicted classifications do not need to be forced into a label just because the interface demands one. The validator can route the item for review.

Second, it captures expert feedback. The reviewer can accept or override a classification and record a short rationale. That rationale becomes part of the append-only audit store. The paper notes that feedback can be cached and associated with similar future items, although the preliminary reported results disabled the feedback logger due to data sensitivity. That distinction matters: the feedback mechanism is part of the architecture, but the paper’s numeric results should not be read as evidence that the feedback loop has already improved performance.

Third, it changes accountability. A model-generated classification without provenance is operationally awkward. A classification with item inputs, retrieved snippets, prompts, outputs, validator verdicts, reviewer actions, model identifiers, index snapshots, and timestamps is a different object. It is not automatically correct. It is inspectable. In regulated work, inspectable beats impressive.
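One common way to make an append-only store inspectable is to chain each entry to the hash of the one before it, so that any later edit breaks the chain. The paper describes an append-only audit store but does not say how it is implemented; the `AuditLog` class and the hash-chaining scheme below are an illustrative assumption, not ORCHID's actual mechanism.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, record: dict) -> str:
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """In-memory, hash-chained log (hypothetical sketch, not the paper's store)."""

    def __init__(self):
        self._entries = []

    def append(self, record: dict) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        entry_hash = _entry_hash(prev_hash, record)
        self._entries.append({"record": record, "prev": prev_hash, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute every hash; False means some entry was altered or reordered."""
        prev_hash = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != _entry_hash(prev_hash, entry["record"]):
                return False
            prev_hash = entry["hash"]
        return True
```

A reviewer override would be appended as a new record rather than written over the model's proposal, so the log keeps both the original output and the human correction.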

The preliminary evidence is strongest where policy boundaries are sharp

The paper reports preliminary performance on real HRP cases, with the feedback agent disabled. The headline numbers are respectable but not miraculous: 70.37% binary accuracy, 63.12% weighted multi-class average, 88% accuracy for USML, 90% for NRC, 56% for CCL, and 40% for EAR99.

Reported result	Likely purpose in the paper	What it supports	What it does not prove
USML accuracy: 88%	Main evidence	ORCHID handles tightly regulated defence-related categories relatively well	That all export-control subclasses are solved
NRC accuracy: 90%	Main evidence	Nuclear-related high-risk categories appear comparatively separable	That performance generalises to every DOE site or procurement domain
CCL accuracy: 56%	Main evidence and boundary diagnosis	Dual-use classification remains difficult	That the system is ready for unsupervised determinations
EAR99 accuracy: 40%	Main evidence and boundary diagnosis	Low-risk or residual classification is fragile	That “not controlled” decisions can be safely automated
Binary accuracy: 70.37%	Summary evidence	ORCHID can support HRP triage better than a raw multi-class reading suggests	That fine-grained legal category assignment is reliable enough alone
Confusion matrix	Diagnostic evidence	Errors cluster around CCL/EAR99 ambiguity	That the validator is fully calibrated

The pattern is more important than the aggregate. USML and NRC categories perform much better than CCL and EAR99. That is not surprising. Highly regulated categories often have clearer textual anchors. Dual-use and residual categories are messier because they sit near the edge of control. EAR99, in particular, is not a clean affirmative domain so much as a classification after other controls fail to apply. Machines dislike “none of the above” almost as much as compliance teams dislike surprise audits.

The row-normalised confusion matrix sharpens the diagnosis. True CCL items are correctly predicted as CCL 56% of the time, but 36% are predicted as EAR99. True EAR99 items are correctly predicted only 40% of the time, while 50% are predicted as CCL. That is the real fault line: the system struggles at the commercial/dual-use boundary.
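The quoted percentages support a little arithmetic that makes the fault line more visible. The sketch below uses only the four cells cited above and assumes each row of the row-normalised matrix sums to 100; the remaining cells are not reproduced here, so they appear only as a residual.

```python
# Row-normalised percentages quoted in the text; other cells are not reported here.
reported = {
    "CCL": {"CCL": 56, "EAR99": 36},
    "EAR99": {"EAR99": 40, "CCL": 50},
}


def boundary_profile(row, true_label, neighbour):
    """Return (unreported residual %, share of errors landing on the neighbour)."""
    residual = 100 - sum(row.values())   # mass predicted as any other label
    errors = 100 - row[true_label]       # everything not predicted correctly
    return residual, row[neighbour] / errors


for true_label, neighbour in [("CCL", "EAR99"), ("EAR99", "CCL")]:
    residual, share = boundary_profile(reported[true_label], true_label, neighbour)
    print(f"{true_label}: {residual}% to other labels, "
          f"{share:.1%} of errors go to {neighbour}")
```

On these figures, roughly four in five misclassified CCL items land in EAR99 (36 of 44 points), and about five in six misclassified EAR99 items land in CCL (50 of 60 points). Only 8 and 10 points respectively leak to any other label, which is why the problem reads as one confused boundary rather than general unreliability.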

By contrast, true USML (ITAR-controlled) cases are predicted correctly 88% of the time, and true NRC cases 90% of the time. Errors there are smaller and more concentrated. For business interpretation, this means ORCHID’s first practical role is probably not “replace classification staff.” It is “accelerate structured triage while surfacing hard boundary cases.” Less glamorous. More deployable.

The paper also states that ORCHID improves accuracy and traceability over a non-agentic baseline, but the accessible results do not provide a detailed comparative baseline table. That should make readers careful. The traceability improvement is well supported by the architecture. The accuracy improvement is plausible from the reported claim, but hard to size without the missing baseline detail, sample composition, and confidence intervals.

This is not a fatal flaw. It is a boundary. Early system papers often demonstrate feasibility before they deliver mature benchmark accounting. Still, anyone turning the result into a procurement case should distinguish between “promising controlled workflow” and “validated enterprise-grade compliance classifier.” Those are neighbours, not twins.

The demo proves the workflow, not the world

The demo scenario is useful, but it should be read correctly. The paper says the demo runs on-premise using low-sensitivity synthetic data. A chatbot-generated random procurement item is paired with a claimed ground truth, with the video example set to CCL. The purpose is to show the workflow: single-item submission, retrieval, cited reasoning, validator gating, reviewer feedback, and exportable audit artifacts.

That makes the demo an implementation detail, not performance evidence. It shows that the pieces can move together. It does not show that the system can handle the full ugliness of real procurement catalogues, incomplete specifications, inconsistent vendor naming, scanned technical sheets, multilingual documents, or regulatory edge cases with expensive consequences. Reality, inconsiderately, tends not to arrive as a clean UI form.

The user interface itself is still worth noting. ORCHID gives operators a submission form, a predicted HRP status, a control category, a confidence score, step-by-step reasoning, clickable citations, evidence tables, feedback fields, and export options in JSON, CSV, or PDF with version strips. That is not just UX polish. It is governance infrastructure.

A compliance AI system that cannot export its reasoning is not a compliance AI system. It is a liability with a loading animation.

The business value is backlog control, not robo-lawyering

The strongest business interpretation of ORCHID is not “AI can make legal decisions.” That is the wrong story and, in regulated sectors, a wonderfully efficient way to get everyone in the room nervous.

The stronger story is that AI can restructure the pre-decision workflow around evidence, escalation, and auditability.

For organisations with large procurement queues, dual-use technologies, sensitive facilities, or strict internal control requirements, the pain is rarely just the final label. The pain is throughput under uncertainty. Analysts must identify relevant rules, interpret technical descriptions, compare similar cases, document reasoning, and preserve defensible records. ORCHID targets that middle layer.

What the paper directly shows	Cognaptus business inference	Remaining uncertainty
A modular agentic RAG workflow can classify HRP items with policy citations and validator gating	Compliance assistants should be designed as controlled workflows, not answer engines	How performance changes at scale, across sites, and across procurement categories
Preliminary accuracy is strong for USML/NRC and weak for CCL/EAR99	The first ROI may come from routing and prioritisation rather than final automation	Whether CCL/EAR99 ambiguity can be reduced with better descriptions, retrieval tuning, or SME feedback
The system logs prompts, evidence, outputs, verdicts, and feedback	Audit artifacts may reduce review friction and post-hoc reconstruction cost	Whether auditors and legal teams accept these artifacts as sufficient in practice
The architecture supports on-prem operation with no external egress	Sensitive organisations can explore LLM assistance without sending data to hosted services	Cost, maintenance, and model-management burden remain organisation-specific
SME escalation is built into the workflow	Human expertise can be reserved for difficult cases instead of every routine search	The right confidence thresholds require calibration and governance

The architecture travels beyond DOE more easily than the accuracy numbers do. Finance, healthcare procurement, defence supply chains, aviation maintenance, sanctions screening, insurance claims, and regulated research administration all contain variants of the same problem: classify an object or transaction under rules, cite the governing text, escalate ambiguity, and preserve a record.

But the system’s portability depends on institutional plumbing. A company needs curated and versioned policy corpora. It needs stable item metadata. It needs SMEs willing to review escalations and record rationales. It needs governance for thresholds, overrides, corpus updates, and audit retention. Without those things, ORCHID becomes just another RAG app wearing a compliance badge. Very fashionable. Not necessarily useful.

Where the bloom wilts

ORCHID’s limitations are practical, not philosophical.

The first limitation is corpus dependency. The system is only as current and complete as the policy corpus it searches. In fast-moving regulatory areas, drift is not an exception; it is the operating environment. Versioning helps, but it does not eliminate the need for disciplined corpus maintenance.

The second limitation is input quality. The paper notes that sparse or poorly written descriptions reduce retrieval and classification quality, and that “no-description” mode is less reliable. This is exactly what one would expect. Manufacturer, item name, and model number may not encode the technical characteristics that determine control status. A vague description can turn a classification problem into a guessing game with citations.

The third limitation is modality. ORCHID currently supports English text and does not process multimodal inputs such as images or technical specification sheets. In procurement, critical details often live in PDFs, tables, diagrams, datasheets, labels, and vendor attachments. Ignoring those inputs may be acceptable for a demo. It is less acceptable for production.

The fourth limitation is evaluation maturity. The paper provides preliminary summary results and a confusion matrix, but not enough detail to fully assess dataset size, class balance, baseline magnitude, confidence intervals, or failure modes by item type. The reported pattern is useful. It is not yet a deployment guarantee.

The fifth limitation is legal status. ORCHID provides decision support, not legal or regulatory advice. That is not boilerplate. It defines the deployment model. The system can prepare evidence, propose labels, and route uncertainty. It cannot absorb accountability from the organisation using it.

In other words, ORCHID is best understood as an audit-aware classification assistant. It is not an automated compliance department. We may all continue sleeping indoors.

The larger lesson: trustworthy AI is mostly workflow design

ORCHID’s most important contribution is not its 70.37% binary accuracy. That number matters, but it is not the conceptual breakthrough. The important contribution is the design pattern: constrain retrieval, force citations, separate proposal from validation, route ambiguity to humans, and log the entire decision path.

This is what “agentic legal AI” should probably look like in serious environments. Not a theatrical swarm of agents debating policy like junior associates trapped in a slide deck. Not a single chatbot improvising legal confidence. A controlled workflow where each step creates evidence for the next step and leaves a record for the person who eventually has to defend the outcome.

The paper is early. The CCL/EAR99 boundary remains weak. The feedback loop is architecturally present but not numerically validated in the preliminary results. Multimodal and multilingual inputs remain outside scope. A real deployment would need careful calibration, SME governance, corpus maintenance, security review, and legal sign-off. So, the usual enterprise AI picnic, but with fewer ants and more acronyms.

Still, ORCHID points in the right direction. The future of legal and compliance AI is unlikely to be a machine that “knows the law” in some grand, cinematic sense. It is more likely to be a system that keeps the right policy text close, exposes uncertainty, asks humans at the right moment, and remembers exactly how a decision was made.

That may sound less exciting than autonomous legal reasoning. Good. In compliance, excitement is usually a symptom.

Cognaptus: Automate the Present, Incubate the Future.

¹ Maria Mahbub et al., “ORCHID: Orchestrated Retrieval-Augmented Classification with Human-in-the-Loop Intelligent Decision-Making for High-Risk Property,” arXiv:2511.04956, 2025. https://arxiv.org/abs/2511.04956
