Question banks work well until the examinee obtains the question bank.

After that, the test still produces scores. It may even produce beautifully precise rankings. What it no longer reliably produces is evidence that the examinee can solve unseen problems.

Large-language-model benchmarks face the same awkward lifecycle. A fixed evaluation set is published, discussed, copied into repositories, used in model-development pipelines, and eventually absorbed into training corpora. The benchmark remains visible; its diagnostic value quietly depreciates.

Encyclo-K proposes a structural response.[1] Instead of treating finished questions as the permanent assets of a benchmark, it treats standalone knowledge statements as reusable components. Questions are assembled from those components at evaluation time.

The important change is not that Encyclo-K writes harder questions. It changes what benchmark builders curate.

That single design choice produces three consequences:

  1. Evaluation sets can be refreshed without rebuilding the underlying knowledge base.
  2. Each question can require the model to evaluate several knowledge statements together.
  3. Human effort shifts away from repeatedly writing complete questions and toward maintaining validated source material.

The result is closer to a test-generation system than a traditional question bank. It is also more complicated than the convenient slogan “dynamic benchmarks prevent memorization” suggests.

The benchmark problem begins with the unit being curated

Traditional knowledge benchmarks usually treat the question-answer pair as the smallest valuable object.

Someone writes or collects a question. Someone verifies its answer. The finished item enters a dataset. Every future evaluation reuses approximately the same object.

This creates three familiar weaknesses.

| Question-level weakness | Operational consequence |
| --- | --- |
| The complete question can enter training data | A model may reproduce an answer without demonstrating generalization |
| One question usually targets one main concept | Strong scores can hide difficulty combining several judgments |
| Each new item requires substantial design and review | Expanding or refreshing the benchmark remains expensive |

Encyclo-K changes the atomic unit from the question to the knowledge statement.

A statement is a self-contained description of a concept, rule, relationship, or factual claim. Once validated, it can appear in many different questions, alongside different neighboring statements and within different answer combinations.

This resembles the difference between storing finished reports and storing structured records. Finished reports are immediately readable, but difficult to recombine. Structured records demand more careful preparation, then support many downstream uses.

For benchmark construction, the preparation cost moves upstream. The benchmark team must create a reliable statement collection. Once that exists, software can repeatedly generate new evaluation sets.
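
To make the "structured record" idea concrete, here is a minimal Python sketch of what one statement record might look like. The field names, the `derived_from` link, and the two example statements are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    """One reusable knowledge statement (hypothetical schema)."""
    statement_id: str
    text: str
    is_correct: bool
    discipline: str
    # For an incorrect variant: id of the correct statement it was derived from.
    derived_from: Optional[str] = None


def split_by_label(statements):
    """Separate a statement collection into correct and incorrect pools."""
    correct = [s for s in statements if s.is_correct]
    incorrect = [s for s in statements if not s.is_correct]
    return correct, incorrect


# Illustrative records only; not taken from the Encyclo-K collection.
bank = [
    Statement("s1", "Water boils at 100 °C at standard atmospheric pressure.",
              True, "physics"),
    Statement("s1-x", "Water boils at 90 °C at standard atmospheric pressure.",
              False, "physics", derived_from="s1"),
]
correct_pool, incorrect_pool = split_by_label(bank)
```

Because each record carries its own truth label, the same record can be dropped into any number of generated questions without further review.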

Encyclo-K turns benchmark construction into a three-stage system

The paper builds Encyclo-K through three stages:

Authoritative textbook material → Validated standalone correct statements → Plausible but incorrect statement variants → Dynamically assembled multi-statement questions → Repeated model evaluation

Stage 1: Extract standalone correct statements

The authors collect material from 62 university- and graduate-level textbooks. The resulting collection spans 11 disciplines, 44 fields, and 62 subfields.

A vision-language model extracts knowledge statements from textbook screenshots. Statements are then filtered to remove unsuitable material, including references to external chapters, formula numbers, images, chemical equations, and ambiguous wording.
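
One part of that filtering step, rejecting statements that point outside themselves, can be approximated with a simple pattern check. The sketch below is a stand-in heuristic, not the authors' filter: the regular expression and the function name `is_standalone` are assumptions, and it does not attempt to detect chemical equations or ambiguous wording.

```python
import re

# Hypothetical pattern: a structural keyword followed by a number,
# e.g. "Figure 3.2", "Eq. (4)", "Chapter 7".
EXTERNAL_REFERENCE = re.compile(
    r"\b(chapter|section|figure|fig\.|table|equation|eq\.)\s*\(?\d",
    re.IGNORECASE,
)


def is_standalone(text: str) -> bool:
    """Return True when the statement contains no numbered cross-reference."""
    return EXTERNAL_REFERENCE.search(text) is None


candidates = [
    "As shown in Figure 3.2, pressure rises with temperature.",
    "Substituting into Eq. (4) gives the result.",
    "Table salt is sodium chloride.",
]
kept = [c for c in candidates if is_standalone(c)]
```

Only the third candidate survives, because "Table" is not followed by a number there. A production filter would need more than a keyword pattern, which is one reason the manual review described next still matters.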

The final collection contains 21,525 correct statements.

Three annotators manually review all retained correct statements and report no significant issues. Human review therefore remains part of the pipeline. The cost reduction comes from avoiding repeated expert question-writing, not from making quality control disappear through the traditional academic technique of giving it a more exciting name.

Stage 2: Generate plausible incorrect statements

DeepSeek-R1 transforms the correct statements into incorrect counterparts.

The generation prompt asks for errors that remain professional, coherent, and deceptive. Suggested transformations include concept substitution, causal inversion, detail alteration, chronological displacement, logical gaps, and changes in scope.

This produces 21,494 incorrect statements, bringing the total collection to 43,019 statements.

The distinction between the two statement types matters. Correct statements inherit authority from the textbook source and receive complete manual review. Incorrect statements are model-generated and receive sampled review.

The authors manually inspect 200 generated incorrect statements. They find five cases where the supplied explanation of the error is inconsistent with the annotation, although the statements themselves remain factually incorrect. No additional processing is applied.

That review supports the feasibility of the method. It does not establish the correctness of every generated false statement with the same confidence as the fully reviewed correct collection.
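
A rough calculation shows how much a clean sample of 200 can and cannot say. The sketch below assumes the 200 statements were a simple random sample and that a "defect" means a generated statement that is accidentally true; the paper reports none of those among the 200. Both assumptions are mine, and the function name is hypothetical.

```python
def zero_failure_upper_bound(sample_size: int, confidence: float = 0.95) -> float:
    """Largest defect rate still compatible, at the given confidence,
    with observing zero defects in a random sample of this size."""
    return 1.0 - (1.0 - confidence) ** (1.0 / sample_size)


bound = zero_failure_upper_bound(200)
print(f"upper bound on defect rate: {bound:.2%}")          # roughly 1.5%
print(f"statements possibly affected: {bound * 21_494:.0f}")  # a few hundred
```

Under those assumptions, a clean sample of 200 still leaves room for a defect rate of up to about 1.5%, which across 21,494 generated statements is on the order of a few hundred items.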

Stage 3: Assemble questions dynamically

At evaluation time, software samples statements and inserts them into predefined question templates.

Under the standard configuration:

  • each question contains 8–10 statements;
  • each question provides 4–8 candidate options;
  • each option combines 2–4 statements;
  • the generated evaluation set contains 5,038 questions.

A model must inspect the statements, determine which are correct or incorrect, and select the option containing the appropriate combination.
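
The assembly step can be sketched in a few lines of Python. This is an illustration under an assumed template, in which exactly one option consists solely of correct statements and every distractor contains at least one incorrect statement. The paper's actual templates may differ. The function names, the fixed sizes (9 statements, 6 options, 3 statements per option), and the placeholder banks are all hypothetical.

```python
import random
from itertools import combinations


def assemble_question(correct, incorrect, rng,
                      n_statements=9, n_options=6, option_size=3):
    """Build one question under the ASSUMED template described above."""
    # Enough true statements to form the answer, and at least one false one.
    n_true = rng.randint(option_size, n_statements - 1)
    pool = [(text, True) for text in rng.sample(correct, n_true)]
    pool += [(text, False) for text in rng.sample(incorrect, n_statements - n_true)]
    rng.shuffle(pool)

    combos = list(combinations(range(n_statements), option_size))
    all_true = [c for c in combos if all(pool[i][1] for i in c)]
    flawed = [c for c in combos if not all(pool[i][1] for i in c)]

    answer = rng.choice(all_true)
    options = rng.sample(flawed, n_options - 1) + [answer]
    rng.shuffle(options)
    return {
        "statements": [text for text, _ in pool],
        "labels": [label for _, label in pool],
        "options": options,  # each option is a tuple of statement indices
        "answer_index": options.index(answer),
    }


# Placeholder banks, for illustration only.
correct_bank = [f"correct statement {i}" for i in range(50)]
incorrect_bank = [f"incorrect statement {i}" for i in range(50)]


def build_set(seed, size=3):
    """Generate a small evaluation set; the seed fully determines its content."""
    rng = random.Random(seed)
    return [assemble_question(correct_bank, incorrect_bank, rng) for _ in range(size)]
```

Calling `build_set(1)` twice returns identical questions, while a different seed draws a different composition from the same banks. That is the property the refresh mechanism relies on.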

The standard questions average approximately 3,113 tokens. Encyclo-K is therefore testing more than isolated factual recognition. It requires sustained attention across a long prompt, multiple local judgments, and a final mapping from those judgments to an answer option.

Dynamic composition refreshes the test without rebuilding the knowledge base

The most commercially useful feature of Encyclo-K is its separation between the stable knowledge layer and the temporary question layer.

A conventional benchmark refresh requires new questions. Encyclo-K can generate a new question set by changing the random seed used to select and combine existing statements.

The authors test this mechanism by generating five different evaluation sets and comparing three models across them:

| Model | Accuracy range across five generated sets |
| --- | --- |
| DeepSeek-R1 | 46.32%–48.84% |
| Qwen3-32B | 36.34%–38.28% |
| Qwen2.5-32B-Instruct | 23.35%–26.55% |

The models retain the same ranking across all five sets, with no observed rank reversals.

This is a robustness test for refresh comparability. It shows that, for the evaluated models and configuration, random recombination can produce different question sets without making the resulting leaderboard unstable.
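
The reported ranges already imply the stable ranking: when no model's range overlaps another's, no seed can reverse the order. A small check, using only the figures quoted above, makes that explicit. The function name is an assumption, and non-overlap is a sufficient condition rather than the authors' stated test.

```python
ranges = {
    "DeepSeek-R1": (46.32, 48.84),
    "Qwen3-32B": (36.34, 38.28),
    "Qwen2.5-32B-Instruct": (23.35, 26.55),
}


def ranking_is_separated(score_ranges):
    """True when each model's worst score beats the next model's best score."""
    ordered = sorted(score_ranges.values(), reverse=True)
    return all(upper[0] > lower[1] for upper, lower in zip(ordered, ordered[1:]))


widest_spread = max(round(high - low, 2) for low, high in ranges.values())
print(ranking_is_separated(ranges))  # True
print(widest_spread)                 # 3.2
```

The widest within-model spread is 3.20 percentage points, while the closest pair of models is separated by 8.04 points (46.32 versus 38.28). Models that sit closer together than these three would need their own seed-sensitivity check.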

It does not directly test whether Encyclo-K is immune to training-data contamination.

Dynamic composition makes memorizing complete question-answer pairs less useful because the complete questions can change. Individual statements, however, remain stable assets. Should a model learn the truth label of every statement in the underlying collection, recombination alone would not prevent it from solving newly assembled questions.

The practical conclusion is narrower and more useful: Encyclo-K reduces dependence on a fixed answer key and makes repeated evaluation harder to overfit at the finished-question level. Protecting the statement bank and refreshing its contents remain necessary governance tasks.

Nine statements turn a fact check into a coordination problem

A model may correctly judge many individual claims and still fail when those claims appear together.

That is the misconception Encyclo-K exposes most effectively. High factual-benchmark performance does not necessarily mean a model can maintain several judgments, detect subtle errors, and convert the result into one coherent answer.

The main comparison is substantial:

| Model | Single-statement judgment | Multi-statement question | Decline |
| --- | --- | --- | --- |
| DeepSeek-R1 | 69.28% | 48.99% | 20.29 percentage points |
| Qwen3-32B | 66.89% | 37.28% | 29.61 percentage points |
| Qwen2.5-32B-Instruct | 59.11% | 25.31% | 33.80 percentage points |

The performance loss is real. Its interpretation requires care.

Part of the decline reflects additional cognitive demands: statements compete for attention, errors may be subtle, and the model must preserve several intermediate conclusions. Part also follows mechanically from aggregating multiple decisions.

In a simplified example, a system that judges each statement correctly 90% of the time has only about a 39% probability of judging all nine statements correctly:

$$ 0.9^9 \approx 0.39 $$

Encyclo-K does not always require an exact independent judgment on every statement because the candidate options permit comparison and elimination. Still, the example explains why multi-statement accuracy should not be interpreted as a pure measure of deeper reasoning. Aggregation turns several moderate error probabilities into one fragile final answer.
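
The same arithmetic can be run for the other question sizes in the standard configuration. The sketch assumes independent judgments with identical per-statement accuracy, which is a simplification and not a claim about how any model behaves.

```python
def all_correct_probability(per_statement_accuracy: float, n_statements: int) -> float:
    """Probability of judging every statement correctly, assuming independence."""
    return per_statement_accuracy ** n_statements


for n in (8, 9, 10):
    print(n, round(all_correct_probability(0.9, n), 3))
# 8 0.43
# 9 0.387
# 10 0.349

# DeepSeek-R1's single-statement accuracy, pushed through the same toy model.
naive_r1 = all_correct_probability(0.6928, 9)
```

Under this toy model, DeepSeek-R1's 69.28% single-statement accuracy would give only a few percent on nine-statement questions (`naive_r1` is below 0.05). The observed 48.99% is far higher, which is consistent with the point that candidate options allow comparison and elimination instead of demanding nine exact judgments.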

The appendix separates context interference from answer synthesis

Appendix D provides a particularly useful diagnostic for DeepSeek-R1:

| Evaluation setting | Accuracy | What the setting adds |
| --- | --- | --- |
| Statements judged individually | 69.28% | Isolated factual verification |
| Statements judged within a question, then averaged | 64.50% | Shared context and multi-statement interference |
| Final multi-statement multiple-choice answer | 48.99% | Combination selection and answer synthesis |

The decline from 69.28% to 64.50% suggests that merely placing statements together makes judgment harder.

The larger decline from 64.50% to 48.99% shows that producing the correct combined option introduces another substantial burden. A model can locally identify many correct and incorrect claims, then lose the question while assembling the final response.
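
Written out as arithmetic on the three DeepSeek-R1 figures above, the total decline splits into a context component and a synthesis component:

$$ \underbrace{(69.28 - 64.50)}_{4.78\ \text{(context)}} + \underbrace{(64.50 - 48.99)}_{15.51\ \text{(synthesis)}} = 20.29 $$

Roughly a quarter of the lost points (4.78 of 20.29, about 24%) appear when statements merely share a prompt. The remaining three quarters appear at the answer-selection step. The split is descriptive for this one model and should not be read as a general decomposition.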

For business evaluation, that distinction is valuable. A system may possess the necessary knowledge but fail at orchestration. The remedy for missing knowledge differs from the remedy for unreliable synthesis.

Explicit reasoning helps most when the test requires bookkeeping

The paper examines whether reasoning-oriented modes improve performance on the multi-statement task.

A first comparison looks across related model variants. Qwen3-30B-A3B-Thinking scores 40.57%, compared with 36.72% for Qwen3-30B-A3B-Instruct. Qwen3-4B-Thinking scores 33.45%, compared with 28.60% for Qwen3-4B-Instruct.

These comparisons suggest an advantage from reasoning, but model variants may differ in training and optimization. The authors therefore conduct a more controlled comparison using hybrid Qwen3 models that can operate in thinking and non-thinking modes while retaining the same architecture and parameters.

Thinking mode performs better at every tested scale from 0.6 billion to 32 billion parameters. More interestingly, the advantage increases with model size: from 0.52 percentage points at 0.6B to 9.47 percentage points at 32B.

The likely mechanism is straightforward. Larger models may possess more useful knowledge and reasoning capacity, but the multi-statement task only benefits from that capacity when the model actually performs the intermediate work.

The 14B hybrid model illustrates the problem. In non-thinking mode, it scores 17.98%, below the 8B and 4B models. Its average response is only 40.73 tokens, compared with 818 tokens for the 8B model and 996 tokens for the 4B model.

The authors describe this as “lazy answering.” That is a plausible exploratory explanation, rather than proof that longer responses automatically cause higher accuracy. Excess verbosity remains perfectly capable of being wrong at greater length. Here, however, unusually short answers coincide with skipped analysis and unusually weak performance.

For organizations evaluating reasoning models, the operational lesson is that model identity alone is insufficient. Inference configuration can materially affect observed capability. A large reasoning model evaluated in a shortcut-prone mode may look less capable than a smaller model allowed to work through the task.

The benchmark separates models, but difficulty is partly designed into the interface

Across more than 50 evaluated models, the strongest reasoning model in the paper reaches 62.07% average accuracy. The strongest chat model reaches 50.40%.

Reasoning-model scores range from 16.04% to 62.07%, while chat-model scores range from 9.71% to 50.40%. This broad spread gives Encyclo-K useful discriminative power and leaves substantial headroom for future systems.

The lower end of the range requires context. An average question contains six answer options, implying an approximate random-choice baseline of 16.7% when options are equally likely. Because individual questions contain between four and eight options, the precise chance level varies. Scores near the lower end may therefore reflect little more than weakly informed selection.
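
The chance level depends on how option counts are distributed across questions, not only on their average. The sketch below assumes, purely for illustration, that counts of 4 through 8 are equally common. The paper's actual distribution is not given here, and the function name is hypothetical.

```python
from fractions import Fraction


def chance_accuracy(option_counts):
    """Expected accuracy of uniform random guessing over a set of questions,
    given the number of options in each question."""
    return sum(Fraction(1, k) for k in option_counts) / len(option_counts)


six_everywhere = chance_accuracy([6])            # every question has 6 options
uniform_mix = chance_accuracy([4, 5, 6, 7, 8])   # ASSUMED even mix of 4..8

print(float(six_everywhere))  # about 0.167
print(float(uniform_mix))     # about 0.177
```

Both mixes average six options, yet the mixed case gives a slightly higher guessing baseline (about 17.7% versus 16.7%) because questions with fewer options are disproportionately easy to guess.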

Difficulty is also configurable.

When the authors increase the option count from four to ten:

  • DeepSeek-R1 falls from 64.08% to 44.85%;
  • Qwen3-32B falls from 49.71% to 29.51%;
  • Qwen2.5-32B-Instruct falls from 34.56% to 14.95%.

This sensitivity analysis shows that the generator can create more demanding tests and widen performance differences among models. It also means scores are inseparable from generation settings. An organization cannot casually change the option count, statement count, or combination rules and continue treating historical scores as directly comparable.
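
One lightweight way to enforce that discipline is to stamp every score with a fingerprint of the generation settings and compare scores only when fingerprints match. The sketch below is an assumption about how an organization might do this, not something the paper describes. The configuration keys and the `generator_version` value are hypothetical, and the random seed is deliberately left out because it is expected to change between refreshes.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Short, stable hash of the generation settings (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


standard = {
    "statements_per_question": [8, 10],
    "options_per_question": [4, 8],
    "statements_per_option": [2, 4],
    "generator_version": "0.1.0",  # hypothetical version label
}
harder = dict(standard, options_per_question=[10, 10])


def comparable(config_a: dict, config_b: dict) -> bool:
    """Scores are directly comparable only under identical generation settings."""
    return config_fingerprint(config_a) == config_fingerprint(config_b)
```

Here `comparable(standard, harder)` is false, so a dashboard built on this check would refuse to plot the ten-option scores on the same trend line as the standard ones.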

The paper runs additional tests to determine whether the headline results are driven by simpler artifacts.

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Five random seeds | Robustness test | Refreshed sets can preserve rankings for the tested models | Universal stability or complete contamination resistance |
| Single- versus multi-statement evaluation | Mechanism test | Combining judgments creates substantial additional difficulty | That every lost point represents missing knowledge |
| Thinking versus non-thinking modes | Controlled comparison | Explicit reasoning improves performance, especially at larger scales | That longer reasoning always improves correctness |
| Option-count variation | Sensitivity test | Difficulty can be adjusted programmatically | That scores remain comparable across configurations |
| Option-position analysis | Artifact diagnostic | No consistent position preference appears for the three tested models | Absence of position effects for every model |
| Aggregation-level comparison | Reporting sensitivity test | Domain weighting changes reported averages | A single universally correct weighting scheme |

This distinction matters because supporting experiments often receive more interpretation than they can comfortably carry. The appendix tests strengthen specific claims. They do not collectively turn a long-form multiple-choice benchmark into a complete theory of intelligence.

The business value is a reusable diagnostic engine

Encyclo-K’s most transferable contribution is architectural.

An organization could build a private evaluation system around the same four layers:

  1. Authoritative statement registry: Convert policies, product specifications, operating procedures, technical documents, or professional guidance into self-contained, versioned statements.

  2. Adversarial variant library: Create plausible incorrect versions that reflect common employee mistakes, model hallucinations, outdated rules, or dangerous changes in scope.

  3. Configurable question generator: Assemble fresh evaluation sets with controlled difficulty, domain balance, and answer formats.

  4. Diagnostic reporting layer: Track performance by statement type, knowledge domain, model version, inference configuration, and evaluation date.
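
The reporting layer in particular needs very little machinery. A minimal sketch, assuming each evaluated question is logged as a dictionary with a boolean `correct` field plus whatever dimensions the organization tracks. The field names and the four sample rows are invented for illustration.

```python
from collections import defaultdict


def accuracy_by(results, key):
    """Group logged results by one dimension and return accuracy per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [hits, attempts]
    for row in results:
        bucket = totals[row[key]]
        bucket[0] += row["correct"]  # True counts as 1, False as 0
        bucket[1] += 1
    return {group: hits / attempts for group, (hits, attempts) in totals.items()}


# Invented log rows, for illustration only.
log = [
    {"domain": "refund_policy", "inference_mode": "thinking", "correct": True},
    {"domain": "refund_policy", "inference_mode": "non-thinking", "correct": False},
    {"domain": "data_retention", "inference_mode": "thinking", "correct": True},
    {"domain": "data_retention", "inference_mode": "thinking", "correct": True},
]
by_domain = accuracy_by(log, "domain")
by_mode = accuracy_by(log, "inference_mode")
```

The same function serves every dimension in the list, so adding a new one (evaluation date, model version) means logging one more field and does not require a new report.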

This design can support several recurring tasks.

Model and vendor selection

A company comparing several models can evaluate them against the same underlying knowledge base while generating different question compositions. This makes it harder to optimize for one visible test set and allows performance to be examined by business domain.

Regression testing after model changes

Model upgrades, prompt revisions, retrieval changes, and inference-setting adjustments can be tested against newly generated sets. Stable generation rules provide repeatability without requiring the exact same questions every time.

Internal knowledge-quality testing

The same mechanism can test whether a model distinguishes current policy from plausible but outdated or subtly altered versions. A statement bank can also reveal which topics repeatedly produce errors.

Training evaluation

Employees or AI agents can be assessed on combinations of related rules rather than isolated recall. That is especially useful when real failures arise from applying several individually familiar rules together.

The economic value comes from reuse. Once authoritative statements and validated false variants exist, the organization can generate many evaluations without commissioning a new expert-written question set for every testing cycle.

The quality burden, however, moves into maintaining the statement registry.

| Paper result | Cognaptus business inference | Boundary |
| --- | --- | --- |
| Randomly generated sets preserve rankings across five seeds for three models | Private evaluations can be refreshed while retaining useful comparability | Organizations must keep generation settings stable and test their own seed sensitivity |
| Multi-statement questions produce large performance declines | Isolated policy questions may overestimate a model’s ability to combine rules | The decline includes interface and aggregation effects, not only reasoning failure |
| Thinking modes improve performance | Inference settings should be evaluated alongside model choice | Higher reasoning cost and latency must be measured separately |
| Statements can be reused across many questions | Recurring evaluation costs may fall after the knowledge layer is built | Initial source curation, versioning, and validation remain substantial |
| Difficulty changes with option count | Tests can be calibrated for different model tiers | Scores from different configurations are not directly interchangeable |

The ROI case is therefore cheaper repeated diagnosis, rather than automatic proof of production readiness.

Where the “expert-free” promise stops

The paper describes an expert-free annotation pipeline because annotators do not need to repeatedly design specialist questions.

That is a meaningful reduction in labor. It should not be interpreted as permission to remove expertise from high-stakes evaluation.

Correct statements still require trustworthy sources, careful extraction, and version control. Incorrect statements must be false for the intended jurisdiction, time period, and operational context. A statement that is incorrect in one market may be correct in another. A policy that was false last quarter may become mandatory next quarter.
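
In a private statement registry, that means a truth label is a function of scope and not a fixed property of the text. The sketch below shows one possible shape for scoped labels. The class, field names, jurisdictions, and dates are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ScopedLabel:
    """Truth label for one statement within one jurisdiction and date window."""
    statement_id: str
    jurisdiction: str
    is_correct: bool
    valid_from: date
    valid_to: Optional[date] = None  # None means still in force


def label_on(labels, statement_id, jurisdiction, day):
    """Return the label in force on `day`, or None when no label applies."""
    for label in labels:
        in_window = label.valid_from <= day and (
            label.valid_to is None or day <= label.valid_to
        )
        if (label.statement_id == statement_id
                and label.jurisdiction == jurisdiction and in_window):
            return label.is_correct
    return None


# Invented example: a rule that becomes mandatory at the start of a quarter.
labels = [
    ScopedLabel("p42", "market-a", False, date(2025, 1, 1), date(2025, 3, 31)),
    ScopedLabel("p42", "market-a", True, date(2025, 4, 1)),
]
```

A generator built on this lookup would skip any statement for which `label_on` returns `None`, so an evaluation never includes a claim whose truth value is undefined for the scope being tested.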

The paper’s quality-control process also places different confidence levels on different components:

  • all retained correct statements receive manual review;
  • only a 200-statement sample of the 21,494 incorrect statements receives manual review;
  • five reviewed samples contain inconsistent explanations, although the statements remain incorrect.

For an academic feasibility study, that may be sufficient to demonstrate a scalable pipeline. For medical, legal, financial, or safety-related deployment, sampled review of generated false statements would rarely be enough. Deceptive errors are useful only when evaluators are certain which part is deceptive and why.

Other boundaries are equally important.

Encyclo-K currently covers English and Chinese and a selected collection of textbook disciplines. Its questions are long-form multiple-choice tasks. The benchmark does not directly evaluate retrieval quality, tool use, uncertainty calibration, interaction with users, or performance inside a real business workflow.

Its discipline distribution is proportional to the available statement collection. That is sensible for dataset construction, but businesses should weight evaluation areas according to operational risk rather than document volume. Ten pages of rarely used guidance should not automatically outweigh one paragraph governing a catastrophic failure mode.
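
The difference between the two weighting choices is easy to see with invented numbers. In the sketch below, the domain names, accuracies, and weights are all hypothetical; the only point is that the same per-domain results produce very different headline scores.

```python
def weighted_accuracy(accuracy, weights):
    """Weighted mean of per-domain accuracy."""
    total = sum(weights[domain] for domain in accuracy)
    return sum(accuracy[domain] * weights[domain] for domain in accuracy) / total


# Invented per-domain results.
accuracy = {"travel_expenses": 0.90, "emergency_shutdown": 0.50}

volume_weights = {"travel_expenses": 10, "emergency_shutdown": 1}  # by page count
risk_weights = {"travel_expenses": 1, "emergency_shutdown": 10}    # by consequence

by_volume = weighted_accuracy(accuracy, volume_weights)
by_risk = weighted_accuracy(accuracy, risk_weights)
print(round(by_volume, 3), round(by_risk, 3))  # 0.864 0.536
```

Weighted by document volume, the model looks comfortable at about 86%. Weighted by operational risk, the same results come out near 54%, which is the number a safety owner would want to see.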

Finally, dynamic generation requires disciplined configuration management. Option count materially changes difficulty. Domain weighting changes aggregate scores. A refreshed test remains comparable only when the generation process itself is controlled.

Question banks are not dead; fixed answer keys are

Encyclo-K’s strongest idea is not that every benchmark should become an encyclopedia-sized multiple-choice examination.

Its strongest idea is that evaluation should be built from reusable, governed knowledge components rather than a permanent list of finished questions.

That change makes refreshes cheaper. It reduces dependence on fixed answer keys. It exposes models that can recognize facts individually but struggle to maintain and combine several judgments. It also creates new responsibilities: statement-level quality assurance, generator versioning, difficulty calibration, and careful interpretation of what the resulting scores actually measure.

The benchmark becomes less like a published exam and more like an evaluation engine.

For organizations, that is the useful direction. A static test tells you how a model performed on yesterday’s questions. A governed statement bank and a controlled generator can keep asking whether the model still understands the rules after the questions change.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yiming Liang et al., “Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements,” arXiv:2512.24867, January 2026.