Raw Is Not Ready: Why Reliable AI Needs Evidence Architecture
Production AI has entered its awkward teenage phase. It can speak fluently, see impressively, forecast usefully, and still fail in ways that make operators quietly reach for the manual override.
The problem is not simply that models are too small, that not enough tokens have been burned, or that someone forgot to add “think step by step” to a prompt. The deeper problem is that many AI systems are asked to reason directly from raw inputs that have not yet been converted into the right operational form.
A grid forecast is not just a time series. It is a spatially coupled physical system under weather shocks. A legal citation is not just text. It is a hierarchy of authority, interpretation, and professional risk. A traffic video is not just frames. It is a sequence of actors, locations, causes, and consequences.
Three recent papers make this point from very different directions. EnergyMamba builds an uncertainty-aware spatiotemporal forecasting model for energy consumption [1]. “Validate Your Authority” benchmarks LLMs on multi-label legal precedent treatment and proposes a severity-aware metric for legal misclassification [2]. MAVEN designs a multi-stage agentic annotation pipeline that converts raw videos into structured event descriptions before generating reasoning data [3].
Taken separately, they look like papers from energy forecasting, legal NLP, and video reasoning. Taken together, they are more interesting. They form a reliability stack.
The shared lesson is simple: high-stakes AI does not become trustworthy by pointing a bigger model at raw data. It becomes trustworthy when the domain is first translated into explicit, auditable structure, and then judged with calibration or evaluation rules that reflect real-world consequences.
Yes, the boring word here is “structure.” As usual, the boring word is doing most of the work.
The Common Problem: Raw Inputs Hide the Real Task
Many business AI failures begin before the model “reasons.” The system receives the wrong representation of the problem.
In energy forecasting, a purely temporal model may learn yesterday’s load, weekly patterns, and seasonal rhythms. But it may miss the fact that demand in one region is related to neighbouring regions through geography, grid topology, and physical coupling. It may also report a clean point estimate when the operator actually needs a calibrated interval under abnormal conditions.
In legal precedent analysis, an LLM may read the citing case and choose a label. But legal treatment labels are not just text categories. Some labels are semantically close; some errors are professionally tolerable; others are disastrous. Confusing two neighbouring negative-treatment labels is not the same as treating an invalidated precedent as safe authority. Accuracy alone politely pretends those mistakes are comparable. They are not.
In video reasoning, a model may process frames and captions. But a useful accident, safety, or surveillance explanation requires more than “what appears in the scene.” It needs when the key event began, where it occurred, which actor caused what, what changed afterwards, and which details are relevant to downstream questions.
The three papers therefore sit at different layers of the same design principle:
| Layer in the reliability stack | What the paper contributes | Why it matters |
|---|---|---|
| Structured evidence generation | MAVEN turns raw video into a Multi-Scale Spatio-Temporal Event Description before Q&A generation. | Downstream reasoning is grounded in an auditable intermediate representation rather than a fog of frames and captions. |
| Domain-aware predictive modelling | EnergyMamba models energy demand as a spatially coupled temporal system and calibrates uncertainty online. | Forecasts become operational intervals, not just point predictions with false confidence. |
| Consequence-aware evaluation | The legal precedent benchmark uses hierarchical labels and Average Severity Error. | Model errors are judged by practical legal severity, not only by flat accuracy. |
The order is important. Evidence first. Model second. Consequence-aware validation third. This is less glamorous than a leaderboard screenshot, but much closer to how production systems survive contact with reality.
Step 1: Find the Structure the Model Would Otherwise Flatten
The first move is not architectural. It is diagnostic.
Ask: what part of the domain would a generic model flatten?
EnergyMamba starts from the observation that energy consumption is not merely a sequence of readings. The authors motivate their design with spatial autocorrelation and local heterogeneity in regional load data. They also identify heteroscedasticity: prediction residuals scale with load magnitude. This matters because a fixed-width or uncalibrated uncertainty estimate can become misleading across low-load and high-load regions.
So the paper treats the energy system as a graph. Nodes represent spatial regions; edges approximate spatial coupling. The model then injects graph-derived spatial context into a selective state-space temporal backbone. The important business point is not “Mamba is fashionable this quarter.” It is that the model’s temporal dynamics are conditioned on the domain’s spatial structure.
The legal paper finds a different hidden structure: precedent treatment is hierarchical and severity-weighted. The authors work with a dataset of 239 real-world legal citation relationships derived from expert annotations. The fine-grained labels include distinctions such as distinguished by, criticized by, not followed by, overruled, and others. The authors also define a high-level schema that groups fine-grained labels into broader categories such as LIMITED OR DISTINGUISHED, CRITICIZED OR QUESTIONED, INVALIDATED, CONFLICT NOTED, and NEUTRAL CITATION.
That grouping is not cosmetic. It admits that legal labels may be too subtle, sparse, or semantically overlapping for current LLMs to classify reliably at fine granularity. In the paper’s results, several fine-grained labels receive zero successful predictions, and the authors explicitly discuss the thin semantic line between labels such as criticizing and questioning a precedent. The benchmark therefore becomes a way of asking: can the model handle the level of distinction that the actual professional task requires?
MAVEN does the same thing for video. It refuses to treat raw video as sufficient context for reasoning-data generation. Instead, it introduces an intermediate representation, the Multi-Scale Spatio-Temporal Event Description, or MSTED. This is built from three captioning levels: global context, dense timestamped events, and fine-grained chunk descriptions. The MSTED then becomes the sole source for downstream question generation.
That “sole source” rule is the clever part. The Q&A generator cannot invent details that are not in the MSTED. Or, to be more precise, if it does, the system now has a clear checkpoint where the hallucination can be detected. The intermediate representation gives reviewers something to audit before thousands of synthetic reasoning examples are produced.
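To make the checkpoint idea concrete, here is a minimal sketch of a “sole source” audit. The class names, fields, and the convention that each generated answer cites evidence ids are all hypothetical illustrations, not MAVEN's actual schema; the video content is made up.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One dense, timestamped event (hypothetical fields)."""
    event_id: str
    start_s: float
    end_s: float
    description: str


@dataclass
class MSTED:
    """Illustrative three-level description: global, events, chunks."""
    global_context: str
    events: list[Event] = field(default_factory=list)
    chunk_descriptions: dict[str, str] = field(default_factory=dict)

    def evidence_ids(self) -> set[str]:
        # Every id a generated question is allowed to cite.
        return {e.event_id for e in self.events} | set(self.chunk_descriptions)


@dataclass
class QAPair:
    question: str
    answer: str
    cited_ids: list[str]


def audit_qa(msted: MSTED, qa: QAPair) -> list[str]:
    """Return the cited ids that do not exist in the MSTED.

    An empty list means every citation resolves. It does not prove the
    answer is correct, only that it points at auditable evidence.
    """
    allowed = msted.evidence_ids()
    return [cid for cid in qa.cited_ids if cid not in allowed]


# Made-up example content.
msted = MSTED(
    global_context="Four-way intersection, daytime, light rain.",
    events=[
        Event("ev1", 12.0, 15.5, "White van runs the red light."),
        Event("ev2", 15.5, 17.0, "Van strikes a cyclist in the crosswalk."),
    ],
    chunk_descriptions={"ch3": "Cyclist enters crosswalk from the north side."},
)
grounded = QAPair("What caused the collision?", "The van ran a red light.", ["ev1", "ev2"])
ungrounded = QAPair("What colour was the truck?", "Blue.", ["ev9"])

print(audit_qa(msted, grounded))    # []
print(audit_qa(msted, ungrounded))  # ['ev9']
```

The check is deliberately shallow. Its job is to give reviewers a cheap, mechanical first filter before any human reads the generated questions.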
A generic model sees data. A reliable system sees structure.
Step 2: Convert the Structure into a Control Surface
Structure is useful only if it becomes something the system can act on.
EnergyMamba turns spatial structure into graph-enhanced temporal modelling. It uses a graph convolution component to extract spatial context and injects that context into a selective state-space model inside a U-Net-like architecture. The U-Net structure is used to capture multi-scale temporal patterns, while the graph component prevents the model from treating each region as an isolated time series.
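The idea of injecting graph-derived context can be illustrated with a single neighbour-averaging step. This is a generic, weight-free sketch of graph propagation in plain Python, not the paper's layer; the topology, the loads, and the function names are assumptions made for illustration.

```python
def row_normalize(adj):
    """Add self-loops, then scale each row so it sums to 1."""
    n = len(adj)
    out = []
    for i in range(n):
        row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        total = sum(row)
        out.append([v / total for v in row])
    return out


def spatial_context(adj, loads):
    """One propagation step: each region's context is the mean of its
    own load and its neighbours' loads."""
    a_hat = row_normalize(adj)
    return [sum(w * x for w, x in zip(row, loads)) for row in a_hat]


# Three regions in a line: 0 -- 1 -- 2 (made-up topology and loads).
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
loads = [100.0, 200.0, 600.0]

context = spatial_context(adj, loads)
# Pair each region's own reading with its spatial context. A temporal
# model would consume both instead of the isolated reading alone.
features = list(zip(loads, context))
print(features)
```

Region 0 never touches region 2 directly, yet after one step region 1's context already reflects both. A learned graph convolution adds trainable weights on top of this same aggregation pattern.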
Then the paper adds Adaptive Sequential Conformalized Quantile Regression, or AS-CQR. This is where the forecast becomes operationally useful. Instead of merely predicting a median or point estimate, the system predicts intervals and calibrates them over time.
The paper’s nonconformity score is normalized by predicted interval width, making calibration more scale-invariant across regions with different load magnitudes. It also updates the effective miscoverage rate based on recent coverage performance. In plainer English: if the intervals are missing too often, the system learns to widen them; if they are too conservative, it can tighten them.
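A rough sketch of that feedback loop follows. It combines a standard conformalized-quantile-regression score divided by interval width with an adaptive miscoverage update of the form alpha_next = alpha + gamma * (target - miss). The paper's exact AS-CQR formulas are not reproduced here; the class name, the window size, the step size `gamma`, and the toy data stream are all assumptions.

```python
import math
from collections import deque


class AdaptiveWidthNormalizedCQR:
    """Illustrative online calibrator; not the paper's exact AS-CQR."""

    def __init__(self, target_alpha=0.1, gamma=0.02, window=200):
        self.target_alpha = target_alpha    # desired long-run miss rate
        self.alpha_t = target_alpha         # effective miss rate, adapted online
        self.gamma = gamma                  # adaptation step size (assumed)
        self.scores = deque(maxlen=window)  # recent nonconformity scores

    def _quantile(self):
        if not self.scores:
            return 0.0
        # Clip so the level stays usable if alpha_t drifts outside (0, 1).
        level = min(max(1.0 - self.alpha_t, 0.0), 1.0)
        ordered = sorted(self.scores)
        k = min(len(ordered) - 1, max(0, math.ceil(level * len(ordered)) - 1))
        return ordered[k]

    def calibrate(self, lo, hi):
        """Widen or tighten a raw interval in proportion to its width."""
        width = max(hi - lo, 1e-8)
        # Floor at -0.5 so tightening can never invert the interval.
        q = max(self._quantile(), -0.5)
        return lo - q * width, hi + q * width

    def update(self, lo, hi, y):
        """Observe the truth y, adapt alpha_t, and store the new score."""
        cal_lo, cal_hi = self.calibrate(lo, hi)
        missed = 0.0 if cal_lo <= y <= cal_hi else 1.0
        # Missing more often than the target pushes alpha_t down, which
        # raises the quantile level and widens future intervals.
        self.alpha_t += self.gamma * (self.target_alpha - missed)
        width = max(hi - lo, 1e-8)
        # CQR-style score divided by width, so high-load regions do not
        # dominate the calibration set.
        self.scores.append(max(lo - y, y - hi) / width)
        return missed


cal = AdaptiveWidthNormalizedCQR(target_alpha=0.1, gamma=0.05)
start_alpha = cal.alpha_t
# Made-up stream: the raw interval is [90, 110] but the truth sits at 130.
misses = [cal.update(90.0, 110.0, 130.0) for _ in range(5)]
print(misses)                      # [1.0, 0.0, 0.0, 0.0, 0.0]
print(cal.calibrate(90.0, 110.0))  # (70.0, 130.0)
```

After one miss the calibrator has a score to work with, and the interval stretches far enough to cover the shifted truth. The effective alpha also ends below its starting value, which is the "widen when missing too often" behaviour described above.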
That is a control surface. Not a vibe. Not a dashboard decoration. A real adjustment mechanism.
The legal paper creates a different control surface: a severity scale. It assigns high-level legal treatment categories an ordinal severity from NEUTRAL CITATION to INVALIDATED, then computes Average Severity Error. The metric is not mathematically exotic, but it is operationally sane.
A flat accuracy metric asks, “Was the label exactly right?” A severity-aware metric asks, “How bad was the mistake?”
For legal AI, that difference is not academic. If a system confuses two adjacent labels that both indicate some negative treatment, the business response may be review or clarification. If it fails to recognise that precedent has been invalidated, the response is closer to “please do not automate malpractice at scale.”
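A small worked example shows why the two questions diverge. The sketch below assumes Average Severity Error is the mean absolute gap between true and predicted severity ranks, and it treats each citation as having one label for simplicity. Only the endpoints of the scale (NEUTRAL CITATION lowest, INVALIDATED highest) come from the description above; the ordering of the middle categories and the sample predictions are assumptions.

```python
# Ordinal severity ranks. The middle ordering is an assumption.
SEVERITY = {
    "NEUTRAL CITATION": 0,
    "CONFLICT NOTED": 1,
    "LIMITED OR DISTINGUISHED": 2,
    "CRITICIZED OR QUESTIONED": 3,
    "INVALIDATED": 4,
}


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_severity_error(y_true, y_pred):
    """Mean absolute distance between true and predicted severity ranks."""
    gaps = [abs(SEVERITY[t] - SEVERITY[p]) for t, p in zip(y_true, y_pred)]
    return sum(gaps) / len(gaps)


truth = [
    "INVALIDATED",
    "CRITICIZED OR QUESTIONED",
    "LIMITED OR DISTINGUISHED",
    "NEUTRAL CITATION",
]
# Model A mistakes an invalidated precedent for a criticized one.
model_a = ["CRITICIZED OR QUESTIONED"] + truth[1:]
# Model B mistakes the same precedent for a neutral citation.
model_b = ["NEUTRAL CITATION"] + truth[1:]

print(accuracy(truth, model_a), average_severity_error(truth, model_a))  # 0.75 0.25
print(accuracy(truth, model_b), average_severity_error(truth, model_b))  # 0.75 1.0
```

Both models score 75% accuracy. Only the severity metric shows that model B's single error is the one that treats dead authority as safe.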
MAVEN’s control surface is the MSTED plus the pipeline hierarchy. When human reviewers identify systematic annotation issues, the agent classifies the problem using an error taxonomy, traces the root cause backward through the pipeline, and decides whether to rewrite a prompt or insert a new pipeline stage. For example, when fixed-length chunking splits important events across boundaries, the system can introduce event-centred highlight chunks.
This is more useful than telling the model to “be more accurate.” Very few production failures are fixed by scolding the model in natural language. Mature systems need to know where the failure entered the pipeline.
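The chunking fix mentioned above is easy to picture in code. The sketch below detects events that straddle fixed-length chunk boundaries and builds a padded window around each one. It is an illustration of the idea only; the function names, the padding parameter `pad_s`, and the timings are hypothetical rather than taken from MAVEN.

```python
def fixed_chunks(duration_s, chunk_s):
    """Split [0, duration_s) into consecutive fixed-length windows."""
    chunks = []
    start = 0.0
    while start < duration_s:
        chunks.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s
    return chunks


def split_events(events, chunks):
    """Events that are not fully contained in any single chunk."""
    return [
        (s, e)
        for s, e in events
        if not any(cs <= s and e <= ce for cs, ce in chunks)
    ]


def highlight_chunks(events, duration_s, pad_s=2.0):
    """One padded window around each event, clipped to the video."""
    return [(max(0.0, s - pad_s), min(duration_s, e + pad_s)) for s, e in events]


# Made-up 30-second clip with two events.
duration = 30.0
chunks = fixed_chunks(duration, 10.0)
events = [(8.5, 12.0), (14.0, 16.0)]

broken = split_events(events, chunks)
print(broken)                              # [(8.5, 12.0)]
print(highlight_chunks(broken, duration))  # [(6.5, 14.0)]
```

The first event crosses the 10-second boundary, so neither fixed chunk sees it whole. The highlight window restores a single view of the event, which is the kind of structural repair that prompt rewording cannot deliver.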
Step 3: Calibrate or Evaluate Against Consequence, Not Generic Accuracy
This is where the three papers become one argument.
EnergyMamba evaluates both deterministic prediction and uncertainty quantification. The paper reports that EnergyMamba improves average prediction accuracy by around 5% and uncertainty quantification by around 6% over 15 baselines across datasets from Florida, New York, and California. More importantly, the paper evaluates abnormal conditions such as hurricanes and heat waves. Under these conditions, the question is not whether the model looks elegant in normal weeks. The question is whether its uncertainty estimates remain useful when demand patterns shift.
For a utility operator, a narrow but wrong interval is not confidence. It is a trap wearing a lab coat.
The legal precedent paper similarly shows that flat accuracy hides risk. Gemini 2.5 Flash performs best on the high-level classification task, while GPT-5-mini performs best on the more complex fine-grained schema. But the paper’s more interesting contribution is not the model ranking. The useful contribution is the Average Severity Error metric, because it recognises that legal misclassification has unequal costs.
The paper also reports qualitative error patterns that matter for deployment. One model error involved misattributing the target of a court’s action: the model noticed a narrowing function but failed to recognise that the court was correcting a party’s argument rather than treating the precedent itself negatively. This is exactly the kind of mistake that looks “reasoned” while still being legally wrong. Delightful, in the way a polished trapdoor is delightful.
MAVEN evaluates whether structured annotation can produce reasoning capability that transfers. The paper applies MAVEN to more than 5,300 traffic videos, fine-tunes Cosmos-Reason2-8B, and evaluates on a private CCTV set and AccidentBench. Training on CCTV annotations alone improves performance, and the resulting model matches Gemini 2.5 Pro on AccidentBench despite never seeing dashcam videos during training. Adding agent-adapted dashcam annotations and RL post-training pushes performance higher.
But again, the model comparison is not the deepest lesson. The deeper lesson is that structured intermediate representations can turn raw domain data into transferable training signal. The MSTED is not merely documentation. It is the compression boundary where messy perception becomes usable reasoning evidence.
A Practical Framework: Evidence Architecture Before Model Automation
For business teams, the combined message can be translated into a simple deployment framework.
| Design question | Bad shortcut | Better design pattern |
|---|---|---|
| What is the real object being modelled? | Treat the input as generic text, frames, or time series. | Identify the domain structure: graph, hierarchy, event chain, authority relationship, causal timeline. |
| What should the model see? | Feed raw data and hope the model infers the structure. | Build an intermediate representation that exposes the relevant structure. |
| How should uncertainty or error be handled? | Report a score, label, or point forecast. | Calibrate intervals, score severity, or expose audit checkpoints. |
| How should failures be improved? | Add more examples or rewrite the whole prompt. | Trace failures to the pipeline stage where information was lost or distorted. |
| How should business users trust the system? | Show benchmark rank. | Show evidence path, calibration behaviour, and consequence-aware error analysis. |
This framework applies beyond the three paper domains.
A logistics company forecasting delivery demand should not only forecast volume. It should model regional coupling, weather sensitivity, depot constraints, and uncertainty intervals.
A compliance platform should not only classify documents. It should map clauses, obligations, exceptions, severity tiers, and reviewer escalation paths.
A video analytics vendor should not only detect events. It should preserve actor roles, timestamps, spatial locations, causes, and consequences in a form that a human can inspect.
The pattern is the same: make the domain structure visible before allowing the model to act.
What the Papers Show—and What They Do Not
It is worth separating the papers’ evidence from the broader business interpretation.
The papers show that structured design improves domain-specific AI reliability in their respective settings. EnergyMamba shows that spatial context and adaptive uncertainty calibration can improve energy forecasting and interval reliability, including under abnormal weather-related shifts. The legal paper shows that LLM evaluation on precedent treatment requires hierarchical labels, expert data, and severity-aware metrics because flat accuracy misses the practical meaning of mistakes. MAVEN shows that an explicit intermediate event representation can support scalable video reasoning data generation, domain adaptation, and downstream model improvement.
The business interpretation is broader: production AI systems should be designed as evidence architectures, not just model wrappers.
That conclusion is not directly “proven” by any single paper. It is a synthesis across the three. But it is a useful synthesis because the papers solve different parts of the same reliability problem.
EnergyMamba answers: how do we make prediction under distribution shift less overconfident?
The legal paper answers: how do we evaluate errors when mistakes have unequal consequences?
MAVEN answers: how do we prevent downstream reasoning data from being built on unstructured, unauditable perception?
Together, they imply a useful operating principle:
Reliable AI is not raw input plus a powerful model. Reliable AI is structured evidence plus calibrated action plus consequence-aware evaluation.
Not as catchy as “AI agents will run the company by Friday,” admittedly. But considerably less likely to end in an incident report.
The Misconception to Avoid: “Just Add a Bigger Model”
The most likely misreading is that each paper recommends a different technical gadget.
Use Mamba for energy. Use GPT-5-mini or Gemini for legal classification. Use an agentic annotation pipeline for video. End of story.
That reading misses the point.
EnergyMamba’s value comes from aligning model architecture with physical and statistical structure. The legal paper’s value comes from aligning evaluation with professional consequence. MAVEN’s value comes from aligning data generation with auditable event structure.
The common theme is not a model family. It is disciplined representation.
A bigger model may still help. But without the right evidence architecture, it may simply become more fluent at flattening the wrong input. In high-stakes settings, that is not intelligence. That is expensive compression damage.
What Business Teams Should Do Next
A manager evaluating AI automation in a serious workflow should ask five questions before asking which model is best.
First, what information must be made explicit before the model acts? For energy, that may be spatial coupling and uncertainty. For law, it may be treatment hierarchy and severity. For video, it may be event timing, actor roles, cause, and consequence.
Second, where is the audit checkpoint? MAVEN has MSTED. Legal precedent classification has supporting excerpts and severity analysis. EnergyMamba has prediction intervals and empirical coverage. If the system has no inspectable checkpoint, it is not a workflow. It is a magic show with invoices.
Third, how are errors weighted? A generic F1 score may be useful during research. It is not enough for operations where one class of error triggers legal exposure, safety risk, reserve misallocation, or customer harm.
Fourth, can the system adapt without hiding drift? EnergyMamba updates calibration online. MAVEN traces annotation errors to pipeline stages and can revise prompts or structure. Business systems need this kind of controlled adaptation, not silent model decay.
Fifth, what is the escalation rule? If uncertainty widens, if severity risk is high, or if evidence is incomplete, the system should know when not to proceed autonomously. “The model was confident” is not a governance policy. It is a post-mortem sentence.
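An escalation rule can be as small as a function that returns reasons to stop. The sketch below is one hypothetical shape for it: the field names, the thresholds, and the three sample cases are assumptions, and real thresholds would have to come from the domain owner rather than from a default argument.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical per-decision summary fed to the escalation rule."""
    width_ratio: float       # current interval width / typical width
    severity: int            # predicted severity tier, 0 = benign
    evidence_complete: bool  # did every required evidence field resolve?


def escalation_reasons(sig, max_width_ratio=2.0, max_severity=2):
    """Return the reasons to pause for human review (empty = proceed)."""
    reasons = []
    if sig.width_ratio > max_width_ratio:
        reasons.append("uncertainty widened")
    if sig.severity > max_severity:
        reasons.append("high severity")
    if not sig.evidence_complete:
        reasons.append("incomplete evidence")
    return reasons


# Made-up cases.
routine = Signal(width_ratio=1.1, severity=0, evidence_complete=True)
storm = Signal(width_ratio=3.4, severity=1, evidence_complete=True)
risky = Signal(width_ratio=1.0, severity=4, evidence_complete=False)

for name, sig in [("routine", routine), ("storm", storm), ("risky", risky)]:
    reasons = escalation_reasons(sig)
    print(name, "->", "auto" if not reasons else "review: " + ", ".join(reasons))
```

Returning the reasons, rather than a bare yes or no, means the reviewer sees why the system stopped, and the log later shows which rule fired.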
The Larger Point: Evidence Is the Product
Enterprise AI is often sold as a model capability. In practice, the durable asset is the evidence architecture around the model.
The model can change. GPT today, Gemini tomorrow, Qwen next quarter, a fine-tuned domain model after procurement finally exhales. But the evidence architecture—the graph, schema, severity metric, event representation, calibration loop, review checkpoint, and failure taxonomy—becomes the organisation’s reusable intelligence layer.
That is why these three papers matter together. They do not just offer three technical solutions. They point toward a production discipline: before AI systems are allowed to predict, classify, recommend, or generate at scale, the business must decide what counts as evidence, what counts as uncertainty, what counts as a severe mistake, and where the system must pause for review.
Raw inputs are cheap. Structured evidence is expensive. That is precisely why it matters.
The next wave of useful AI will not be won by teams that merely attach larger models to messier workflows. It will be won by teams that build the intermediate layers where domain reality becomes legible to machines and auditable by humans.
Not bigger leaderboards. Better evidence architecture.
Cognaptus: Automate the Present, Incubate the Future.
1. Dahai Yu, Rongchao Xu, Lin Jiang, and Guang Wang, “EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction,” arXiv:2606.00506, 2026. https://arxiv.org/abs/2606.00506
2. M. Mikail Demir and M. Abdullah Canbaz, “Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification,” arXiv:2605.17691, 2026. https://arxiv.org/abs/2605.17691
3. Han Zhang, Wanting Jiang, Tomasz Kornuta, Tian Zheng, and Vidya Murali, “MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks,” arXiv:2605.21917, 2026. https://arxiv.org/abs/2605.21917