Rules are easy when the facts repeat themselves. The previous case had a bribe, this case has a bribe; the previous decision went one way, so the new decision should probably follow. That is the comforting version of precedent. It is also the version most likely to make legal AI look coherent in a demo and naïve in production. A small inconvenience, but tradition has survived worse.
Real precedents do not usually repeat at the same level of detail. They repeat through abstraction. One case says a child “tidied the room” because he folded his clothes. Another says a child “tidied the room” because she made the bed. Both share the same intermediate concept, but perhaps not with the same strength. The first act may be enough tidying for one purpose; the second may be more tidying, less relevant tidying, or tidying outweighed by worse conduct elsewhere. The problem is not that legal concepts are hierarchical. The problem is that hierarchy without strength can quietly flatten judgment.
That is the point of Henry Prakken and Wijnand van Woerkom’s paper, “Defending the Hierarchical Result Models of Precedential Constraint.”[1] The paper replies to Trevor Bench-Capon’s criticism of hierarchical models of precedent, especially the concern that intermediate factors can be established with different strengths by different base-level facts. Prakken and van Woerkom do not simply say, “the model still works.” Their more useful answer is narrower and sharper: the factor-based hierarchical result model is more flexible than the criticism assumes, and the dimension-based hierarchical result model handles the real strength problem more naturally.
For legal AI, that distinction matters. A precedent engine that only records whether a factor is present will eventually confuse “same label” with “same legal force.” A better system needs to know whether it is comparing facts, intermediate concepts, or ordered degrees of a concept. That is not philosophical decoration. It is a data-modeling decision.
The debate is about constraint, not prediction
This paper is not about whether an AI model can predict case outcomes. It is about precedential constraint: when a previous decision forces, or constrains, a decision in a later fact situation. That difference is easy to miss.
A predictive model asks: “Given past cases, what outcome is likely?” A precedential constraint model asks: “Given this precedent, would deciding the new case differently be inconsistent?” The second question is stricter. It is closer to consistency checking than forecasting.
The paper focuses on result models, not reason models. In a result model, what matters is whether the previous outcome constrains the new outcome. It does not necessarily reconstruct the full explicit preference ordering a judge had in mind. Bench-Capon’s criticism, however, is often expressed in terms of preferences between factors: for example, whether behaving at home is more important than behaving at school, or whether folding clothes is more important than being inattentive in class. Prakken and van Woerkom point out that this is not exactly the language of the result model.
That sounds like a technical distinction. It is. It is also the kind of technical distinction that prevents a whole product category from being built on a category error.
A legal AI system may be asked to do at least three different things:
| Task | What the system must answer | Why factor labels alone are risky |
|---|---|---|
| Outcome prediction | “What is likely to happen?” | Correlation can hide why a precedent matters. |
| Consistency checking | “Would this decision conflict with prior cases?” | The relevant comparison may occur at different abstraction levels. |
| Explanation | “Why is this case treated like or unlike that case?” | The system must state whether a shared concept has the same strength. |
The paper is mainly about the second task, with implications for the third. It does not claim to solve legal prediction, legal drafting, or empirical validation. Good. One paper should not try to be a department store.
Flat factors are simple, but they compare cases too literally
Start with the flat result model. A case is described by factors, each supporting one side or the other. A later case is forced toward the same outcome if it contains every factor that favored the winning side in the precedent and no factor for the losing side beyond those the precedent already contained.
In plain English: if the new case is at least as strong for the winning side as the old case, the old case constrains the new one.
This works when the comparison is direct. If the precedent involved factor $F_1$ and the new case also has $F_1$, the system can compare them. If the new case lacks $F_1$, the system can distinguish it. The logic is clean.
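As a minimal sketch of that comparison, assume each case is stored as two sets of factor names. The `FlatCase` class and `forces_same_outcome` function are illustrative names, not taken from the paper, and the factor labels reuse $F_1$, $F_2$, $F_5$, and $F_6$ from the ice-cream example. Under those assumptions the check reduces to two subset tests:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlatCase:
    pro: frozenset  # factors favoring the side that won the precedent
    con: frozenset  # factors favoring the other side


def forces_same_outcome(precedent: FlatCase, new_case: FlatCase) -> bool:
    """True if new_case is at least as strong for the winner as precedent."""
    keeps_all_pro = precedent.pro <= new_case.pro   # no winning factor is lost
    adds_no_con = new_case.con <= precedent.con     # no new losing factor appears
    return keeps_all_pro and adds_no_con


precedent = FlatCase(pro=frozenset({"F1"}), con=frozenset({"F5"}))
same_or_stronger = FlatCase(pro=frozenset({"F1", "F2"}), con=frozenset({"F5"}))
lacks_f1 = FlatCase(pro=frozenset({"F2"}), con=frozenset({"F5"}))
extra_con = FlatCase(pro=frozenset({"F1"}), con=frozenset({"F5", "F6"}))

print(forces_same_outcome(precedent, same_or_stronger))  # True
print(forces_same_outcome(precedent, lacks_f1))          # False: can be distinguished
print(forces_same_outcome(precedent, extra_con))         # False: new case is weaker
```

The sketch makes the literalness visible: swapping $F_1$ for $F_2$ breaks the constraint even if both facts would count as the same kind of behavior.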
The problem is that legal reasoning often does not remain at the base-factor level. “Folded clothes” and “made bed” may both instantiate a more abstract factor: “tidied room.” “Was inattentive in class” and “interrupted teacher” may both instantiate “behaved badly at school.” A flat model can record the base facts, but it does not naturally represent why different base facts belong to the same intermediate category.
That limitation is not unique to family ice-cream examples. Legal domains are full of intermediate concepts: reasonable care, questionable means, reliance, notice, substantial similarity, control, disclosure, good faith. These concepts are not decorative labels placed over facts after the real reasoning is done. They often are the reasoning.
A flat model can avoid some hierarchical confusion by refusing abstraction. But that is a cure with side effects. It makes consistency depend on exact factor matching, and exact factor matching is a poor description of how precedent travels across fact patterns.
Factor hierarchies add abstraction, but they do not automatically force rigidity
Bench-Capon’s criticism targets hierarchical models. The worry is that once base-level factors are grouped under intermediate factors, the model may overgeneralize. If Max received ice cream because folding clothes counted as tidying the room and being inattentive counted as misbehaving at school, then Emma may argue that making the bed also counts as tidying the room and interrupting the teacher also counts as misbehaving at school. If the intermediate comparison is simply “behaved at home beats behaved at school,” then Emma should also receive ice cream.
But perhaps that is wrong. Maybe interrupting the teacher is worse than being inattentive. Maybe making the bed tidies the room more than folding clothes, but not enough to offset the worse school behavior. The intermediate labels are the same, yet their strengths differ.
Prakken and van Woerkom’s first reply is that the factor-based hierarchical result model is not as rigid as the criticism assumes. In their model, cases may or may not include intermediate factors in the case description. That choice matters.
If Max’s case includes only base-level factors, then Emma’s case is compared at the base level. Max folded clothes; Emma made the bed. Max was inattentive; Emma interrupted the teacher. Since the base factors differ, Max’s case does not automatically force Emma’s outcome.
If Max’s case includes intermediate factors, then the model can compare at the abstract level. If both cases include “behaved at home” and “behaved at school,” then the precedent may constrain the later decision. The hierarchy is not a compulsory elevator that drags every case to the same abstraction floor. The model allows the case description to determine whether intermediate concepts are part of what was decided.
This is the paper’s first important correction. The reader’s likely misconception is that hierarchical precedent models mechanically force the same outcome whenever two cases share an abstract factor. They do not, at least not in the factor-based hierarchical result model defended here. The presence or absence of intermediate factors in the case description changes what is being compared.
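The effect of that representation choice can be illustrated with a deliberately simplified sketch. It is not the paper's formal definition: it assumes a case description exposes either base facts or their intermediate factors, and it reuses a plain subset test for constraint. The `HIERARCHY` mapping, the factor names, and both helper functions are hypothetical.

```python
# Hypothetical mapping from base facts to binary intermediate factors.
HIERARCHY = {
    "folded_clothes": "behaved_at_home",
    "made_bed": "behaved_at_home",
    "inattentive_in_class": "misbehaved_at_school",
    "interrupted_teacher": "misbehaved_at_school",
}


def describe(base_facts, use_intermediate):
    """Return the factor set that the case description exposes for comparison."""
    if use_intermediate:
        return frozenset(HIERARCHY[fact] for fact in base_facts)
    return frozenset(base_facts)


def forces(prec_pro, prec_con, new_pro, new_con):
    """Subset test: all winning factors kept, no new losing factors added."""
    return prec_pro <= new_pro and new_con <= prec_con


def max_forces_emma(use_intermediate):
    max_pro = describe({"folded_clothes"}, use_intermediate)
    max_con = describe({"inattentive_in_class"}, use_intermediate)
    emma_pro = describe({"made_bed"}, use_intermediate)
    emma_con = describe({"interrupted_teacher"}, use_intermediate)
    return forces(max_pro, max_con, emma_pro, emma_con)


print(max_forces_emma(use_intermediate=False))  # False: base facts differ
print(max_forces_emma(use_intermediate=True))   # True: binary labels coincide
```

The second result is exactly the outcome Bench-Capon worries about, and it only arises once the intermediate factors are treated as part of what was decided.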
The operational version is simple:
| Representation choice | What it means | Consequence for precedent |
|---|---|---|
| Omit intermediate factors | The precedent is treated mainly through base facts. | Later cases with different base facts can be distinguished. |
| Include intermediate factors | The decision maker is treated as having determined the intermediate concept. | Later cases sharing those intermediate concepts may be constrained. |
| Treat intermediate concepts as dimensions | The system records not only presence but degree. | Later cases can share a concept while differing in strength. |
That last row is where the paper moves from defense to genuine design lesson.
The real weakness is not hierarchy; it is binary abstraction
The factor-based model can answer part of Bench-Capon’s criticism. It can say: “Do not assume the intermediate factor was legally established unless the case description includes it.” That is useful. But it does not fully solve the strength problem.
Suppose two cases share the intermediate concept “behaved at school,” but one child was merely inattentive while another interrupted the teacher. If both are represented as the same binary factor, the model can only say the intermediate factor applies or does not apply. It cannot say that one case establishes the factor more strongly, or establishes its negative counterpart more strongly.
This is where dimensions enter.
A factor is binary: present or absent. A dimension is ordered: one value can be at least as favorable to one outcome as another value. In the dimension-based result model, each dimension has a set of possible values and a partial order. A precedent constrains a new case when the new case is at least as strong, according to those ordered values.
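One hedged way to write that condition down, using notation that is an assumption of this article rather than the paper's own: let $D$ be the set of dimensions, let $X(d)$ and $Y(d)$ be the values the precedent and the new case take on dimension $d$, and write $v \preceq_s v'$ when $v'$ is at least as favorable to side $s$ as $v$. If the precedent was decided for $s$, then:

$$
\text{forced}_s(X, Y) \iff \forall d \in D:\; X(d) \preceq_s Y(d)
$$

Because $\preceq_s$ is only a partial order, two values may be incomparable. On this reading the condition then fails for that dimension, and the precedent does not constrain the new case. A binary factor is the special case of a dimension with two values.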
The hierarchical dimension-based result model extends this idea through the hierarchy. It does not abandon intermediate concepts. It gives them ordered values.
That is the central move. The paper does not say “use dimensions instead of legal concepts.” It says legal concepts can remain intermediate nodes, but they should sometimes be represented as dimensions rather than binary factors. “Tidied room” can be a matter of degree. “Behaved at school” can be a matter of degree. The model then asks whether the precedent lower-bounds or upper-bounds the later case on the relevant dimension.
For a legal AI system, this is the difference between a tag and a measurement scale. The tag says: “factor present.” The scale says: “present at this level, relative to this ordering.” Most deployed systems are comfortable with tags. Fewer are honest about scales.
The paper’s example shows how shared concepts can still permit distinction
The paper formalizes the Max and Emma example using dimensions. Most dimensions remain two-valued, but the intermediate dimensions $P$, $Q$, and $R$ receive natural-number values. Here, $P$ represents tidied room, $Q$ represents behaved at home, and $R$ represents behaved at school. Higher values are better for the pro-ice-cream outcome.
The model assigns Max’s case values such as:
- Max folded clothes: $F_1 = 1$
- Max was inattentive in class: $F_5 = 1$
- Max’s tidied-room value: $P = 2$
- Max’s behaved-at-home value: $Q = 2$
- Max’s behaved-at-school value: $R = 3$
- Max received ice cream: $\pi = 1$
Emma’s case is modeled differently:
- Emma made the bed: $F_2 = 1$
- Emma interrupted the teacher: $F_6 = 1$
- Emma’s tidied-room value: $P = 3$
- Emma’s behaved-at-home value: $Q = 3$
- Emma’s behaved-at-school value and outcome are left undefined initially.
This is not an empirical scorecard. The numbers are not measured from data. They are a formal device to represent the intuition that making the bed may establish tidying more strongly than folding clothes, while interrupting the teacher may be worse than merely being inattentive.
The key result is that Max’s case does not force Emma to receive ice cream. Emma is better on the home-behavior dimension, but worse on the school-behavior dimension. Because she is not at least as strong as Max on every relevant dimension, the precedent gives no lower bound that reaches her case, and the same decision is not forced. In the paper’s terms, Emma can be denied ice cream without violating precedential constraint.
Then the authors vary the example. Replace Emma’s interruption of the teacher with the same school behavior Max had: being inattentive in class. Now Emma is at least as good as Max on the relevant subordinate dimensions. In that altered case, Max’s precedent does force the decision that Emma should receive ice cream.
That paired comparison is doing the real work. It shows that the dimension-based hierarchy can distinguish cases that share intermediate concepts when their strengths differ, while still constraining cases when the relevant dimensional comparison supports constraint.
| Example | Shared abstraction | Strength-sensitive difference | Model implication |
|---|---|---|---|
| Max vs. Emma with teacher interruption | Both involve home and school behavior. | Emma’s school behavior is worse. | Max’s precedent does not force Emma’s ice cream. |
| Max vs. Emma with same inattentiveness | Both involve home and school behavior. | Emma is at least as good on relevant dimensions. | Max’s precedent forces Emma’s ice cream. |
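The paired comparison can be replayed in a few lines of Python. This is a simplified sketch, not the paper's full hierarchical definition: it compares the two cases directly on $Q$ and $R$. Max's values and Emma's $Q = 3$ come from the example above. Emma's $R$ is left undefined in the example, so `R = 1` for the teacher-interruption scenario is an assumed placeholder chosen only to sit below Max's value, and `R = 3` in the variant assumes that identical school behavior earns the same value as Max.

```python
def at_least_as_strong(new_case, precedent, dimensions):
    """True if new_case scores at least the precedent's value on every dimension."""
    return all(new_case[d] >= precedent[d] for d in dimensions)


DIMENSIONS = ("Q", "R")  # behaved at home, behaved at school

MAX = {"Q": 2, "R": 3}               # values stated in the example
EMMA_INTERRUPTS = {"Q": 3, "R": 1}   # R = 1 is an ASSUMED value (undefined in the example)
EMMA_INATTENTIVE = {"Q": 3, "R": 3}  # R = 3 ASSUMED: same school behavior as Max

print(at_least_as_strong(EMMA_INTERRUPTS, MAX, DIMENSIONS))   # False: not forced
print(at_least_as_strong(EMMA_INATTENTIVE, MAX, DIMENSIONS))  # True: forced
```

Any assumed value for the interruption scenario that is lower than Max's $R$ gives the same result; the specific number carries no meaning beyond its position in the ordering.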
This is the business-relevant mechanism hiding inside a formal toy example: two cases can share the same intermediate labels without being equivalent, and two cases can differ in base facts while still being equivalent at the right abstraction level.
This is formal evidence, not an experiment
There are no benchmark results, ablation tables, model accuracy numbers, or deployment case studies in the paper. The evidence is formal reconstruction: define the model, apply it to disputed examples, and show whether the alleged counterexample still goes through.
That matters for interpretation.
| Paper element | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| MaxMonday example | Main argumentative setup | Shows ambiguity between flat and hierarchical interpretations. | Does not establish how courts actually encode precedent. |
| EmmaMonday example | Main stress test | Shows why strength of intermediate factors matters. | Does not quantify frequency of this issue in real legal data. |
| HRM reconstruction | Defense of factor-based model | Shows the model need not always decide through intermediate factors. | Does not fully solve strength representation. |
| DHRM reconstruction | Main formal answer | Shows dimensions can preserve hierarchy while modeling strength. | Does not validate a deployed legal AI architecture. |
| Trade-secret reference | Domain plausibility illustration | Suggests strength variation appears beyond toy examples. | Does not provide empirical performance evidence. |
So the correct reading is not “dimension-based legal AI is now proven superior.” That would be TED-talk mathematics, and we try to avoid that in daylight. The better reading is: if your legal AI product claims to support precedent-based consistency or explanation, then a purely binary factor representation is structurally incomplete for some classes of reasoning.
The paper demonstrates possibility and coherence, not production readiness.
The product lesson: separate facts, abstractions, and strengths
Legal AI systems often compress legal reasoning into labels. A document is tagged as containing notice, reliance, disclosure, control, injury, authorization, or breach. That may be useful for retrieval. It is not enough for precedential constraint.
The paper suggests a cleaner architecture:
- Base facts should record concrete factual predicates.
- Intermediate concepts should record legally meaningful abstractions.
- Dimensions should record ordered strength where a concept admits degree.
- Constraint logic should know which layer is being compared.
This matters because different legal-tech tasks need different layers. Search may tolerate crude factors. Risk scoring may benefit from weighted factors. Consistency checking needs a more careful structure: it must know when a precedent binds because the new case is stronger, and when it can be distinguished because a shared label hides a weaker or stronger underlying value.
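The layered separation described above could be sketched as a small data structure. Everything here is an assumption made for illustration: the `CaseRecord` fields, the `Layer` enum, and the helper functions are hypothetical names, and integer strengths stand in for whatever ordered scale a real system would define.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    BASE = "base"
    INTERMEDIATE = "intermediate"


@dataclass
class CaseRecord:
    case_id: str
    base_facts: set = field(default_factory=set)   # concrete factual predicates
    concepts: set = field(default_factory=set)     # intermediate abstractions
    strengths: dict = field(default_factory=dict)  # ordered value per dimension


def shared_at_layer(a: CaseRecord, b: CaseRecord, layer: Layer) -> set:
    """Return what two cases have in common at the layer being compared."""
    if layer is Layer.BASE:
        return a.base_facts & b.base_facts
    return a.concepts & b.concepts


def unscored_dimensions(precedent: CaseRecord, new_case: CaseRecord) -> set:
    """Dimensions the precedent records but the new case has not been scored on."""
    return precedent.strengths.keys() - new_case.strengths.keys()


CONCEPTS = {"tidied_room", "behaved_at_home", "behaved_at_school"}
max_case = CaseRecord("MaxMonday", {"folded_clothes", "inattentive_in_class"},
                      set(CONCEPTS), {"P": 2, "Q": 2, "R": 3})
emma_case = CaseRecord("EmmaMonday", {"made_bed", "interrupted_teacher"},
                       set(CONCEPTS), {"P": 3, "Q": 3})

print(shared_at_layer(max_case, emma_case, Layer.BASE))
print(sorted(shared_at_layer(max_case, emma_case, Layer.INTERMEDIATE)))
print(unscored_dimensions(max_case, emma_case))
```

Keeping the three fields separate means the comparison code has to name the layer it is using, which is the point of the fourth bullet.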
A practical legal AI system could use this distinction in several ways:
| System component | Factor-only design | Dimension-aware design |
|---|---|---|
| Case retrieval | Finds cases sharing labels. | Ranks cases by abstract similarity and strength-sensitive comparability. |
| Precedent checker | Flags same-factor cases as potentially binding. | Tests whether the new case is at least as strong on ordered dimensions. |
| Explanation generator | Says “both cases involve the same factor.” | Says “both cases involve the same intermediate concept, but the new case is weaker on this dimension.” |
| Compliance workflow | Treats policy triggers as binary. | Tracks severity, completeness, or degree of satisfaction. |
| Human review | Leaves lawyers to infer strength manually. | Surfaces where strength assumptions drive the result. |
The ROI angle is not “automate lawyers away.” Very imaginative, very 2023. The more realistic value is cheaper diagnosis. A dimension-aware system can tell reviewers where the case comparison depends on strength assumptions, rather than burying those assumptions inside unexplained factor labels.
That is valuable in legal operations, compliance review, insurance claims, procurement disputes, HR policy enforcement, and any domain where organizations want consistent decisions across repeated but non-identical situations.
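A dimension-aware explanation step of the kind described above might look like the following sketch. The function names, the status strings, and the wording of the messages are all hypothetical, and Emma's `R = 1` is again an assumed placeholder because the example leaves that value undefined.

```python
def compare_strengths(precedent, new_case):
    """Classify each precedent dimension relative to the new case."""
    report = {}
    for dim, old_value in precedent.items():
        new_value = new_case.get(dim)
        if new_value is None:
            report[dim] = "unassessed"          # reviewer still has to assign a value
        elif new_value < old_value:
            report[dim] = "weaker"
        else:
            report[dim] = "at_least_as_strong"
    return report


def explain(report, labels):
    """Turn the per-dimension report into a reviewer-facing sentence."""
    weaker = [labels[d] for d, status in report.items() if status == "weaker"]
    unassessed = [labels[d] for d, status in report.items() if status == "unassessed"]
    if weaker:
        return ("Same intermediate concepts, but the new case is weaker on: "
                + ", ".join(weaker) + ".")
    if unassessed:
        return "Strength not yet assessed on: " + ", ".join(unassessed) + "."
    return "The new case is at least as strong on every compared dimension."


LABELS = {"Q": "behaved at home", "R": "behaved at school"}
max_values = {"Q": 2, "R": 3}
emma_values = {"Q": 3, "R": 1}  # R = 1 is an ASSUMED value

report = compare_strengths(max_values, emma_values)
message = explain(report, LABELS)
print(message)
```

The "unassessed" branch matters in practice: it tells the reviewer that the comparison is waiting on a strength judgment instead of silently treating a missing value as a match.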
The business inference has boundaries
Here is what the paper directly shows: Bench-Capon’s criticism does not defeat the dimension-based hierarchical result model. It also does not apply as broadly to the factor-based hierarchical result model as one might think, because intermediate factors are not automatically included in every case description.
Here is what Cognaptus infers for business use: systems that support legal or compliance decision-making should avoid treating every legally relevant feature as a binary tag. They need representational choices for abstraction level and strength. Otherwise, they will overstate similarity in some cases and miss legally meaningful similarity in others.
Here is what remains uncertain: how to populate such dimensions in real systems.
That last problem is not small. The paper uses stylized values like $P = 2$ and $P = 3$. A deployed system would need a defensible way to assign values. Those values might come from expert rules, structured rubrics, human annotation, learned models, or a combination. Each route creates governance questions. Who defines the scale? How stable is it across domains? Can users contest a value? Does the system explain why one fact establishes a concept more strongly than another?
There is also a cut-off problem. Bench-Capon’s alternative, as discussed by the authors, is to transform dimensions back into factors by defining thresholds: for example, above this value the factor supports one side, below that value it supports the other. Prakken and van Woerkom note a drawback: cut-off points are often hard to identify. For product design, this is painfully familiar. Every elegant continuous score eventually meets a policy owner asking, “So what number means approve?”
Dimension-aware systems do not eliminate hard decisions. They relocate them to where they are visible.
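The cut-off difficulty is easy to reproduce. The sketch below assumes a single ordered value per case and a threshold rule in which values at or above the cut-off support one side; both the rule and the function names are illustrative assumptions, and the value `1` for the weaker case is a placeholder.

```python
def to_factor(value: int, cutoff: int) -> str:
    """Collapse an ordered value into a binary factor for one side or the other."""
    return "pro" if value >= cutoff else "con"


def same_label(value_a: int, value_b: int, cutoff: int) -> bool:
    """True if two different strengths end up with the same binary label."""
    return to_factor(value_a, cutoff) == to_factor(value_b, cutoff)


stronger, weaker = 3, 1  # 3 matches Max's R; 1 is an ASSUMED weaker value

for cutoff in (1, 2, 4):
    print(cutoff, to_factor(stronger, cutoff), to_factor(weaker, cutoff),
          same_label(stronger, weaker, cutoff))
```

With a cut-off of 1 or 4 the two cases receive the same label and the strength difference disappears; only a cut-off between the two values preserves it. Choosing that number is the policy decision the paragraph above describes.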
The deeper lesson: explanation requires consistency at the intermediate level
One subtle point near the paper’s conclusion is easy to underread. Even if intermediate concepts are not used to constrain outcomes directly, they may be used to explain or justify outcomes. If so, those intermediate concepts should themselves be applied consistently.
That is important for AI explainability. Many legal and compliance AI systems produce explanations that mention intermediate concepts: “the customer exercised control,” “the employee had notice,” “the vendor failed to use reasonable safeguards.” But if those concepts are not part of the system’s constraint logic, the explanation can drift away from the decision structure.
In other words, the system may predict with one representation and explain with another. That is how explainability becomes theater: polished, fluent, and structurally disconnected from the actual decision logic.
A hierarchical model offers one way to reduce that gap. If intermediate concepts are represented explicitly, the system can check whether they are being used consistently. If they are also dimensions, the system can explain not merely that a concept applies, but how strongly it applies relative to prior cases.
For businesses, this is less glamorous than “AI lawyer.” It is also more useful. A decision-support system that can say, “This case is similar at the abstract level but weaker on the severity dimension,” gives human reviewers something they can audit. A system that only says, “similar precedent found,” gives them a puzzle with a progress bar.
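One possible reading of "applied consistently" is that the same base fact should not establish the same concept at different strengths in different cases. The sketch below checks only that narrow property. The record layout and the function name are assumptions, and `case_3` is an invented record added purely to trigger the check.

```python
from collections import defaultdict


def inconsistent_applications(records):
    """Find (base_fact, concept) pairs that were given more than one value.

    records: iterable of (case_id, base_fact, concept, value) tuples.
    """
    seen = defaultdict(set)
    for _case_id, base_fact, concept, value in records:
        seen[(base_fact, concept)].add(value)
    return {key: values for key, values in seen.items() if len(values) > 1}


records = [
    ("MaxMonday", "folded_clothes", "P", 2),
    ("EmmaMonday", "made_bed", "P", 3),
    ("case_3", "folded_clothes", "P", 3),  # HYPOTHETICAL record, not from the paper
]

flagged = inconsistent_applications(records)
print(flagged)
```

A flagged pair is not automatically an error, since context may justify a different value, but it is a place where the explanation and the decision logic could drift apart and where a reviewer should look.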
Where this should change legal AI design
The practical design recommendation is not to replace all factors with dimensions. That would be another kind of overengineering. Some legal features really are close enough to binary for the task at hand. A filing was submitted or it was not. A deadline passed or it did not. A clause exists or it does not, at least before lawyers begin their usual sport of arguing about what “exists” means.
The better rule is selective dimensionalization.
Use binary factors when the legal question genuinely turns on presence or absence. Use intermediate factors when abstraction is legally meaningful. Use dimensions when the same intermediate concept can be established with materially different strength, severity, completeness, or quality.
A useful implementation framework would look like this:
| Question for system designers | If yes | If no |
|---|---|---|
| Does this fact matter only as present or absent? | Use a binary factor. | Consider abstraction or dimension. |
| Do several base facts support the same legal concept? | Add an intermediate factor or concept node. | Keep comparison at base level. |
| Can that concept be satisfied more or less strongly? | Represent it as an ordered dimension. | A binary intermediate factor may be enough. |
| Does precedent depend on whether the new case is at least as strong? | Use constraint logic over ordered values. | Retrieval or classification may suffice. |
| Will the concept appear in explanations? | Track consistency of concept application. | Avoid using it as decorative explanation. |
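The first three rows of that framework can be condensed into a small decision helper. This is a hedged sketch of one way to encode the table, not a prescribed procedure: the enum, the function name, and the two boolean inputs are all assumptions, and the answers to the questions still have to come from a domain expert.

```python
from enum import Enum


class Representation(Enum):
    BINARY_FACTOR = "binary factor"
    INTERMEDIATE_FACTOR = "binary intermediate factor"
    ORDERED_DIMENSION = "ordered dimension"


def choose_representation(groups_base_facts: bool, admits_degree: bool) -> Representation:
    """Pick a representation from two designer judgments about a legal feature."""
    if admits_degree:
        # The feature can be satisfied more or less strongly.
        return Representation.ORDERED_DIMENSION
    if groups_base_facts:
        # Several base facts support it, but only presence matters.
        return Representation.INTERMEDIATE_FACTOR
    # Presence or absence of a single fact is all that matters.
    return Representation.BINARY_FACTOR


print(choose_representation(groups_base_facts=False, admits_degree=False).value)
print(choose_representation(groups_base_facts=True, admits_degree=False).value)
print(choose_representation(groups_base_facts=True, admits_degree=True).value)
```

In the article's terms, "a filing was submitted" would fall in the first branch and "tidied room" in the last.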
This is where the paper becomes relevant beyond AI and law research. Many business decision systems already contain informal hierarchies. Compliance teams map raw facts to risk indicators. HR teams map incidents to policy categories. Credit teams map borrower behavior to risk grades. Procurement teams map vendor behavior to performance concepts. The same representational problem appears again and again: the organization wants consistency, but its categories are too binary for the judgments they carry.
The legal domain simply makes the problem more visible because precedent turns inconsistency into doctrine instead of inconvenience.
Conclusion: precedent needs structure, but structure needs scale
The cleanest business takeaway from this paper is that “factor present” is too small a unit for serious precedent-based AI. It can support useful retrieval and some forms of classification. It cannot, by itself, capture the way legal reasoning moves between concrete facts, intermediate concepts, and strength-sensitive comparisons.
Prakken and van Woerkom’s defense of hierarchical result models is therefore not merely a formal reply to Bench-Capon. It is a reminder that abstraction is not the enemy. Bad abstraction is the enemy. A hierarchy can make precedent more rational if the system knows when intermediate concepts were actually established. A dimension-based hierarchy goes further by allowing those concepts to vary in strength.
That matters for any organization building AI decision support in rule-heavy domains. If the system treats every concept as binary, it will eventually confuse “same category” with “same case.” If it treats every difference as decisive, it will miss the point of precedent. The hard part is the middle: representing when two situations are alike enough, at the right level, and with the right strength.
Legal AI does not need more labels pretending to be reasoning. It needs representations that can admit nuance without dissolving into vibes.
Cognaptus: Automate the Present, Incubate the Future.
[1] Henry Prakken and Wijnand van Woerkom, “Defending the Hierarchical Result Models of Precedential Constraint,” arXiv:2512.13505, 2025, https://arxiv.org/abs/2512.13505.