ASKing Smarter Questions: When Scholarly Search Learns to Explain Itself

Search used to be a polite negotiation with a database.

You typed keywords. The system returned papers. You inspected titles, opened tabs, skimmed abstracts, cursed quietly, adjusted the keywords, and repeated the ritual until either the literature became clear or your soul left the building.

Large language models changed the ritual, but not always for the better. Now a system can answer a research question directly, which feels magical until one remembers that “fluent” and “correct” are not synonyms. In scholarly work, this distinction is not academic decoration. It is the difference between literature discovery and very confident misinformation wearing a lab coat.

That is the useful starting point for ORKG ASK, introduced by Allard Oelen, Mohamad Yaser Jaradeh, and Sören Auer as an AI-driven scholarly literature search and exploration system built around vector search, retrieval-augmented generation, large language models, and knowledge-graph-based filtering.¹ ASK is not presented as a chatbot that replaces reading. It is a search system that attaches generated explanations to retrieved literature so researchers can decide what deserves closer inspection.

That framing matters. Many AI search tools are sold as if the answer is the product. ASK’s more interesting claim is that the workflow is the product: retrieve relevant papers, extract question-specific information, synthesize a sourced overview, expose the generation context, and let users narrow the corpus with semantic filters. Fewer fireworks. More plumbing. Often, the plumbing is where serious systems either become useful or quietly flood the basement.

ASK is search with answers attached, not answer generation with citations attached

The common misconception is easy to see: a user enters a research question, the system generates answers, therefore ASK must be a research question answering system.

The authors push against that interpretation. ASK’s primary purpose is scholarly information retrieval. The generated answer is a support layer, a way to help researchers assess whether a paper is relevant before investing time in reading it. That is a narrower, safer, and frankly more realistic ambition than “AI reads the literature for you.”

The distinction changes how the system should be judged.

If ASK were mainly a question-answering engine, the central evaluation question would be: are the generated answers correct, complete, and superior to alternative AI answer systems? The paper does not fully establish that. It reports operational feedback, usability results, interaction analytics, and a small comparison against Google Scholar. Useful evidence, yes. A leaderboard-style victory over proprietary AI research assistants, no.

If ASK is treated as an AI-assisted scholarly search system, the evidence becomes more coherent. The right question becomes: does the system make literature discovery easier, more transparent, and more inspectable than keyword-only search? On that question, the paper has a more defensible story.

The system is built around five functional requirements: literature search, information extraction, answer synthesis, result filtering, and bibliography management. It also defines non-functional requirements around reproducibility, accessibility, usability, maintainability, and interoperability. This is already a hint that the authors are not merely showing a clever prompt. They are describing a service architecture.

That difference is important for business readers. A prompt can impress a demo audience. A service architecture has to survive users, latency, data quality, accessibility requirements, interface confusion, reproducibility concerns, and the eternal joy of production maintenance. ASK lives closer to the second world.

The mechanism: retrieve, extract, synthesize, filter, inspect

ASK’s mechanism can be understood as a controlled sequence rather than a single act of generation.

Research question
      ↓
Vector search over indexed literature
      ↓
Relevant article contexts: abstract and, when available, full text
      ↓
LLM extraction for each retrieved article
      ↓
Synthesized answer with citations to listed results
      ↓
Knowledge-graph and semantic filters
      ↓
Reproducibility menu: prompt, model, parameters, context

This sequence is the core reason the paper is more interesting than another “LLM for literature review” announcement.
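To make the sequence concrete, here is a minimal Python sketch of the control flow. It is not ASK's code: the `Article` class, the `ask` function, and the stand-in bodies of `retrieve`, `extract`, and `synthesize` are all hypothetical, and word overlap replaces real vector search so the example runs without a model. The filter is shown as an optional argument, whereas in ASK it is a refinement the user applies to the results.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    id: str
    text: str  # abstract or, when available, full text
    concepts: set = field(default_factory=set)


def retrieve(question, corpus, k):
    # Stand-in for vector search: rank by number of shared lowercase words.
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda a: len(q_words & set(a.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def extract(question, article):
    # Stand-in for the per-article LLM call: echo a snippet with its source id.
    return f"{article.text[:50]} [{article.id}]"


def synthesize(extractions):
    # Stand-in for LLM synthesis: join the article-level extractions.
    return " ".join(extractions)


def ask(question, corpus, k=2, concept=None):
    hits = retrieve(question, corpus, k)
    if concept is not None:  # semantic filter, applied only on request
        hits = [a for a in hits if concept in a.concepts]
    extractions = [extract(question, a) for a in hits]
    return {
        "answer": synthesize(extractions),
        "sources": [a.id for a in hits],
        # The "recipe" travels with the answer so it can be inspected later.
        "recipe": {"question": question, "k": k, "concept": concept},
    }


corpus = [
    Article("a1", "Vector search for scholarly retrieval", {"information retrieval"}),
    Article("a2", "Soil chemistry of alpine lakes", {"chemistry"}),
    Article("a3", "Dense retrieval with vector embeddings", {"embeddings"}),
]
result = ask("vector search retrieval", corpus)
print(result["sources"])  # ['a1', 'a3']
```

The design point is the return value: the answer never travels without its sources and the settings that produced it.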

The retrieval layer narrows the universe. ASK indexes scholarly documents using embeddings and vector search, with Nomic’s nomic-embed-text-v1.5 model described as using 768-dimensional embeddings and an 8K token context window. The point is not that this particular embedding model is magically final. The point is that ASK gives the language model a bounded set of documents to work with. In the authors’ terminology, this is the non-parametric memory: an updatable document store that can be expanded without retraining the model.
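A toy version of that retrieval step, assuming cosine similarity as the ranking measure (a common choice, though the ranking function ASK actually uses is not stated here). The three-dimensional vectors and paper ids are invented for illustration; real embeddings from the model above would have 768 dimensions.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k):
    # Score every indexed document against the query and keep the k best ids.
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


# Hypothetical index: document id -> embedding.
index = {
    "paper-a": [1.0, 0.0, 0.0],
    "paper-b": [0.7, 0.7, 0.0],
    "paper-c": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]
print(top_k(query, index, k=2))  # ['paper-a', 'paper-b']
```

Adding a document is a dictionary insert, not a training run. That is the practical meaning of non-parametric memory.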

The generation layer then works inside that retrieved context. ASK uses Mistral 7B Instruct v0.2, described as a relatively small model with a 32K token context window. The paper emphasizes that the authors did not need custom fine-tuning at this stage because the LLM is mostly used for extraction and text generation over retrieved material. That is a quiet but useful lesson: for many enterprise search products, the expensive part may not be training a private model. It may be designing the retrieval corpus, prompts, parsers, interface, caching, and audit trail well enough that a smaller model can do useful work.
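A hedged sketch of what “working inside the retrieved context” can look like in code. The prompt wording, the function name `build_extraction_prompt`, and the word-based budget are assumptions for illustration; they are not ASK's actual prompt, and a real system would count tokens with the model's tokenizer rather than splitting on whitespace.

```python
def build_extraction_prompt(question, article_text, max_context_words=200):
    # Crude budget: keep only the first max_context_words whitespace-separated words.
    words = article_text.split()
    context = " ".join(words[:max_context_words])
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Question: {question}\n\n"
        f"Context:\n{context}\n\n"
        "Answer:"
    )


prompt = build_extraction_prompt(
    "Which embedding model is used?",
    "The system indexes documents with an embedding model and vector search.",
)
print(prompt)
```

Nothing here requires fine-tuning. The model is asked to read a bounded passage and report, which is the kind of job a smaller instruction-tuned model can plausibly handle.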

The symbolic layer narrows and organizes the search space. ASK’s “neuro-symbolic” label refers to vector search and LLMs on the neural side, and knowledge graphs on the symbolic side. The practical expression is semantic filtering: users can refine results through structured concepts rather than relying only on keyword reformulation. In business terms, this is the difference between “ask the model again” and “constrain the search environment more intelligently.”
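One way to picture concept-based narrowing, assuming results carry concept tags and the knowledge graph supplies broader-to-narrower links. The `NARROWER` graph, the tags, and both function names are hypothetical; how ASK's filters are implemented is not described here.

```python
# Hypothetical concept hierarchy: broader concept -> narrower concepts.
NARROWER = {
    "machine learning": ["deep learning", "reinforcement learning"],
    "deep learning": ["transformers"],
}


def expand(concept, graph):
    # Collect the concept and everything below it in the hierarchy.
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def filter_results(results, concept, graph):
    # Keep results tagged with the concept or any of its descendants.
    allowed = expand(concept, graph)
    return [r for r in results if allowed & set(r["concepts"])]


results = [
    {"id": "r1", "concepts": ["transformers"]},
    {"id": "r2", "concepts": ["soil science"]},
    {"id": "r3", "concepts": ["reinforcement learning"]},
]
print([r["id"] for r in filter_results(results, "machine learning", NARROWER)])
# ['r1', 'r3']
```

A keyword search for “machine learning” would miss both surviving results, since neither contains the phrase. The graph is what connects them.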

Then comes the inspection layer. ASK exposes the prompt, model, model parameters, and context used for generated content through a reproducibility menu. This is not cosmetic. In scholarly and professional knowledge work, a generated answer without its construction history is a very smooth liability.

The governance layer is not a nice extra; it is the product boundary

The most business-relevant part of ASK is not that it uses RAG. Everyone and their procurement slide now uses RAG, or at least claims to.

The more serious feature is the reproducibility menu. For LLM-generated content, ASK exposes the prompts, model, parameters such as temperature and seed, and the context used for generation. That moves the system away from “trust this answer” and toward “inspect how this answer was produced.”
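As a sketch, the recipe behind a generated answer fits in a small record. The fields mirror what the menu is described as exposing; the class name, the example values, and the `fingerprint` method are assumptions added here for illustration, not features the paper reports.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationRecord:
    prompt: str
    model: str
    temperature: float
    seed: int
    context_ids: tuple  # ids of the documents passed to the model

    def fingerprint(self):
        # Hypothetical helper: a stable hash, so two runs can be compared at a glance.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


record = GenerationRecord(
    prompt="Answer the question using only the context below.",
    model="mistral-7b-instruct-v0.2",
    temperature=0.0,  # illustrative value, not ASK's reported setting
    seed=42,          # illustrative value
    context_ids=("paper-a", "paper-b"),
)
print(record.fingerprint()[:12])
```

If two users report different answers to the same question, comparing two such records shows whether the prompt, the parameters, or the retrieved context changed.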

This is where the paper speaks to enterprise AI beyond academia.

In a company, “knowledge search” rarely fails only because retrieval is weak. It fails because people cannot tell whether the answer came from the right documents, whether the system used outdated context, whether two users received different outputs, whether the model invented a bridge between sources, or whether the answer is good enough for a decision with consequences. The question is not merely accuracy. It is traceability under uncertainty.

ASK’s design suggests a practical pattern:

| System layer | What ASK does | Enterprise translation |
| --- | --- | --- |
| Corpus boundary | Uses a defined indexed corpus, mainly CORE open-access papers plus selected special collections | Define what the system is allowed to know before asking it to explain anything |
| Retrieval | Uses vector search to rank relevant documents | Treat search as evidence selection, not just text matching |
| Extraction | Generates article-level answers to the user’s research question | Help users triage documents before deeper reading |
| Synthesis | Produces an overview answer tied to listed results | Provide a first-pass map, not a final verdict |
| Filtering | Supports semantic narrowing through knowledge-graph concepts | Add structured control instead of relying only on repeated prompting |
| Reproducibility | Shows prompt, model, parameters, and context | Make AI outputs auditable enough for serious workflows |

The table also shows the boundary. ASK is not claiming that traceability eliminates hallucination. The authors explicitly state that hallucinations can still occur and that users are expected to manually check generated information. The point is more modest and more useful: when AI is used for discovery, the system should make verification easier rather than pretending verification is unnecessary.

That is the grown-up version of AI search. Less “the model knows.” More “the system helps you find, inspect, and verify.”

The corpus is large, open, and uneven by design

ASK’s indexed corpus is built mainly on CORE open-access research papers. The paper reports that ASK imported 76.4 million articles from CORE after excluding items that did not meet basic requirements such as valid titles and abstracts. Of these imported articles, 36.9% have a DOI and 25% have full text available.

Those numbers are impressive, but they should be read carefully.

A 76.4 million article corpus gives ASK broad coverage, especially for open-access literature. But only a quarter having full text available means that many generated answers will depend on abstracts rather than complete papers. In literature search, that is still useful. Abstracts often contain the research question, method, and headline finding. But for detailed claims, methods, limitations, and nuanced interpretation, abstract-level evidence is not enough.
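The percentages are easier to weigh as absolute counts. A quick calculation from the reported figures, with the caveat that the inputs are rounded, so the outputs are approximations:

```python
total_articles = 76_400_000  # imported from CORE, as reported
share_with_doi = 0.369
share_full_text = 0.25

with_doi = round(total_articles * share_with_doi)
with_full_text = round(total_articles * share_full_text)
# Assumption: anything without full text is represented by its abstract.
abstract_only = total_articles - with_full_text

print(f"with DOI:       {with_doi:,}")
print(f"with full text: {with_full_text:,}")
print(f"abstract only:  {abstract_only:,}")
```

That is roughly 28 million articles with a DOI, about 19 million with full text, and about 57 million where the abstract is all the model can see.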

The paper also notes that the CORE data requires curation before ingestion because it is crawled from repositories and publisher websites. The authors use heuristics to decide which items are suitable for indexing, with valid abstracts playing a major role. Again, this is not a flaw so much as a reminder: AI search quality is heavily dependent on data ingestion quality. Garbage in, fluent garbage out. We have met this species before.
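The actual heuristics are not spelled out here, so the following is only a guess at their shape: a gate that rejects records without a usable title or abstract. The function name and both thresholds are hypothetical.

```python
def is_indexable(record, min_title_chars=5, min_abstract_words=30):
    # Treat missing fields as empty strings so crawled records with None do not crash.
    title = (record.get("title") or "").strip()
    abstract = (record.get("abstract") or "").strip()
    if len(title) < min_title_chars:
        return False
    if len(abstract.split()) < min_abstract_words:
        return False
    return True


records = [
    {"title": "Neuro-symbolic scholarly search", "abstract": "word " * 40},
    {"title": "", "abstract": "word " * 40},
    {"title": "Untitled scan", "abstract": None},
]
print([is_indexable(r) for r in records])  # [True, False, False]
```

Every threshold in a gate like this trades coverage for quality, which is why ingestion rules deserve the same scrutiny as prompts.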

The special collection example is also instructive. ASK includes approximately 310 BMBF-conform research reports related to autonomous driving, demonstrating how a library or organization could use the system to expose a curated institutional collection. This may be the more transferable business case. Large public corpora are useful, but many organizations have smaller, high-value document collections where semantic search, extraction, and reproducibility controls could create immediate value.

Think regulatory filings, internal research reports, engineering documentation, legal memos, market intelligence archives, medical guideline libraries, or procurement records. In these settings, the ASK pattern is not “build a universal search engine.” It is “make a bounded knowledge collection explorable without making the user surrender judgment to a chatbot.”

The evaluation is useful, but it is not a quality benchmark

The paper evaluates ASK through operational feedback, a small user experiment, and production analytics. The results are worth discussing because they show how the system behaves in real use. They should not be inflated into stronger claims.

Here is the cleanest way to read the evidence.

| Evidence source | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Question-specific feedback from 1,212 form submissions by 1,032 browser-fingerprinted users | Operational usability and perceived answer usefulness | Users interacted with generated answers and provided mixed-to-neutral feedback on helpfulness, correctness, and completeness | It does not establish objective answer correctness |
| UMUX-Lite feedback from approximately 443 users, with 409 included in the final score | Usability measurement | ASK is generally usable, with a reported UMUX score of 65.7/100 | It does not prove ASK meets every research need or beats other AI search tools |
| User satisfaction responses from 363 users | General satisfaction signal | Satisfaction leans more positive than negative | It does not diagnose which features caused satisfaction |
| Within-subject experiment with 9 participants comparing ASK and Google Scholar | Exploratory comparison with established search | ASK showed lower perceived task load than Scholar in this small setup | It is too small and specific to serve as a definitive product benchmark |
| Web analytics from May 15, 2024 to February 1, 2025 | Production adoption and behavior | ASK attracted substantial usage and users performed real search-related actions | It does not measure the correctness of individual outputs |

The small Google Scholar comparison is especially tempting to overstate, so let us not.

In the experiment, 9 participants answered four predefined research questions, two under each condition. They had to use at least two references per answer and to verify correctness manually against the source article. The study measured perceived task load using an unweighted NASA-TLX scale and recorded completion time.

The reported task load was much lower for ASK than for Google Scholar: 26.76% versus 61.3%. Time was also lower for ASK when one outlier was excluded: 984.41 seconds versus 1241.39 seconds. With the outlier included, ASK took longer: 1628.79 seconds versus 1267.49 seconds.
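Expressed as relative differences, using only the figures above (the helper name is ours):

```python
def relative_change(ask_value, scholar_value):
    # Percentage difference of ASK relative to Google Scholar; negative means ASK was lower.
    return (ask_value - scholar_value) / scholar_value * 100


task_load = relative_change(26.76, 61.3)
time_without_outlier = relative_change(984.41, 1241.39)
time_with_outlier = relative_change(1628.79, 1267.49)

print(f"task load:              {task_load:+.1f}%")
print(f"time, outlier excluded: {time_without_outlier:+.1f}%")
print(f"time, outlier included: {time_with_outlier:+.1f}%")
```

Task load comes out about 56% lower for ASK. Time comes out about 21% lower with the outlier excluded and about 29% higher with it included. A single participant flips the sign of the time result, which says most of what needs saying about a sample of nine.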

This is promising, not conclusive. The result suggests ASK may reduce the cognitive burden of literature search by turning raw result lists into question-specific summaries and article-level extractions. But the study has only 9 participants, includes participants with prior ASK exposure, uses predefined questions, and compares against Google Scholar rather than proprietary AI search assistants, so it is best read as exploratory evidence.

The authors themselves are careful here. They note that ASK and Google Scholar may not be directly comparable and that a quantitative comparison with similar AI scholarly search services is out of scope.

Good. The paper does not need to pretend to be a final tournament bracket. Its stronger contribution is showing what a transparent, open, production scholarly search system can look like.

The production analytics say users are actually doing search work

The analytics are less glamorous than a benchmark chart, but they may be more useful for product thinking.

From May 15, 2024 to February 1, 2025, the paper reports 74,145 visits, 26,354 returning visits, 219,189 pageviews, an average visit duration of 4 minutes and 1 second, and a 3% bounce rate. It also reports 67,949 queries asked, 7,595 downloads, 19,723 outlinks, 723 custom filters added, 415 custom columns added, and substantial use of “load more” interactions.

These numbers suggest that users were not merely visiting a demo page and leaving. They were asking queries, opening external materials, downloading items, and continuing through result pages. The 3% bounce rate is particularly notable as a usage signal, though it should not be interpreted as proof of answer quality.

The feature-level behavior is also interesting. Custom filters and custom columns were used relatively infrequently compared with queries, outlinks, and downloads. The authors suggest two possible interpretations: users may have been satisfied with default extracted properties, or they may not have understood how to use those features.
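Normalizing the reported counts makes the gap visible. The per-100-queries framing is our choice of denominator, not a metric from the paper:

```python
counts = {
    "visits": 74_145,
    "pageviews": 219_189,
    "queries": 67_949,
    "outlinks": 19_723,
    "downloads": 7_595,
    "custom_filters": 723,
    "custom_columns": 415,
}


def per_100_queries(action):
    # How often an action occurred for every 100 queries asked.
    return counts[action] / counts["queries"] * 100


for action in ("outlinks", "downloads", "custom_filters", "custom_columns"):
    print(f"{action}: {per_100_queries(action):.1f} per 100 queries")

print(f"pageviews per visit: {counts['pageviews'] / counts['visits']:.2f}")
print(f"queries per visit:   {counts['queries'] / counts['visits']:.2f}")
```

For every 100 queries there were roughly 29 outlinks and 11 downloads, but only about one custom filter and fewer than one custom column. The advanced controls exist. Most sessions never touch them.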

That ambiguity is useful. For enterprise AI search, advanced controls often sound excellent in architecture diagrams and then disappear in actual usage. Users may want the system to be inspectable, but they may not want to operate a cockpit. The design challenge is to make governance available without making the interface feel like a compliance seminar with buttons.

ASK also reports meaningful non-desktop usage: 76.9% desktop, 21.2% smartphone, and 1.9% other devices. This supports the authors’ emphasis on responsive design and accessibility. For scholarly search, mobile use may sound secondary. For organizational knowledge tools, it is not. Decision-makers often consume knowledge products between meetings, while traveling, or from whatever device happens to be nearby. A serious search product should not assume everyone is sitting peacefully at a 27-inch monitor, because reality enjoys comedy.

The business lesson is auditable knowledge navigation

The direct paper result is about scholarly search. The broader business inference is about enterprise knowledge navigation.

Many companies already have the ingredients ASK uses: document repositories, metadata, internal taxonomies, search indexes, and growing pressure to add LLM interfaces. The mistake is to bolt a chatbot onto the repository and call it transformation. That produces a pleasant demo and a governance headache.

ASK points to a more durable pattern:

Start with the corpus. Decide which documents belong in the system and what quality thresholds they must meet. ASK’s CORE ingestion process illustrates this with title and abstract requirements.
Use retrieval to constrain generation. The LLM should operate on retrieved context, not free-associate from parametric memory.
Generate intermediate evidence, not only final summaries. ASK displays article-level extracted answers as well as a synthesized answer. This helps users see how the system moves from documents to overview.
Preserve structured controls. Knowledge-graph-based filters matter because business users often need to narrow by concept, source, domain, region, product, jurisdiction, or time period.
Expose the generation recipe. Prompt, model, parameters, and context are not backend trivia. They are part of the trust surface.
Design for verification. ASK explicitly warns users that generated information must be manually checked. That warning is not a weakness. It is alignment between system capability and responsible use.
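Part of that verification can be mechanical. As one small example, assuming a synthesized answer cites listed results with bracketed numbers such as `[1]` (a hypothetical format, not necessarily ASK's), a few lines can flag citations that point nowhere and results the overview never used:

```python
import re


def check_citations(answer, num_results):
    # Citation markers are assumed to look like [1], [2], ... (1-based).
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    valid = set(range(1, num_results + 1))
    return {
        "dangling": sorted(cited - valid),  # cited, but no such result is listed
        "uncited": sorted(valid - cited),   # listed, but never cited
    }


answer = "Retrieval grounding helps [1][3], though the evidence is mixed [5]."
print(check_citations(answer, num_results=4))
# {'dangling': [5], 'uncited': [2, 4]}
```

A check like this cannot tell whether a cited paper supports the sentence it is attached to. It only catches the cheapest failures, so the human attention the warning asks for can go to the expensive ones.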

For Cognaptus readers, the ROI relevance is not simply “AI makes search faster.” That claim is too cheap. The more precise value is reduced triage cost under traceability constraints.

A researcher, analyst, lawyer, compliance officer, investor, or product strategist rarely needs a machine to produce one beautiful paragraph. They need a machine to shorten the path from question to relevant evidence, while preserving enough visibility to decide whether the evidence deserves trust. ASK is interesting because it treats that path as a workflow rather than a magic answer box.

Where the result applies, and where it does not

The paper’s strongest claim is architectural and operational: ASK demonstrates a public, open-source, production scholarly search system that combines vector retrieval, LLM extraction and synthesis, knowledge-graph filtering, reproducibility controls, and usage evaluation.

Its weaker claim would be answer-quality superiority. The paper does not establish that ASK’s generated answers are more accurate than those of commercial AI search tools such as Elicit, Consensus, or SciSpace. It also does not provide a large-scale objective evaluation of generated answer correctness. The authors acknowledge that such comparisons are future work.

Three boundaries matter for business interpretation.

First, corpus quality remains decisive. ASK imports a large CORE-based corpus, but full text is available for only 25% of imported articles. In enterprise settings, this means the value of the system will depend heavily on document completeness, metadata quality, access permissions, and update discipline.

Second, user feedback is informative but noisy. Operational feedback is collected from real users in an uncontrolled environment. That increases ecological validity but limits causal interpretation. Users may misunderstand questions, test the form casually, or submit partial impressions. Browser fingerprinting is also imperfect for counting unique users.

Third, advanced features need adoption design. Custom filters and custom columns were used much less often than basic query and navigation actions. If semantic filters and custom extraction are key differentiators, future product design may need better onboarding, defaults, prompts, or interface cues.

These limitations do not weaken the paper’s main contribution. They prevent the wrong conclusion. ASK is not proof that open AI scholarly search has solved literature review. It is evidence that a more inspectable architecture is feasible, useful enough to attract real usage, and promising enough to deserve more rigorous evaluation.

That is already a meaningful contribution. Not every paper needs to declare the future has arrived. Sometimes it is enough to show that the future has a backend, a reproducibility menu, and fewer excuses.

The quiet shift: from search results to explainable discovery

Traditional scholarly search gives users a list. LLM chat gives users an answer. ASK tries to occupy the more useful middle: a search result page that explains itself.

That middle ground is where many professional AI tools should probably live.

A fully automated answer is attractive when the decision is low-stakes. Scholarly work is not that. Neither are legal reviews, investment memos, regulatory monitoring, clinical guideline search, technical due diligence, or strategic intelligence. In these domains, the user does not only ask, “What is the answer?” The user asks, “Where did this come from, what did the system see, what did it ignore, and can I verify the path?”

ASK’s answer is not perfect, but the system design is pointed in the right direction. It narrows the search space with retrieval, generates question-specific extractions, synthesizes an overview, supports semantic filtering, and exposes generation details. It uses AI to reduce the cost of discovery rather than pretending discovery has become unnecessary.

That is the sober version of AI search. It may not produce the loudest demo. It may, however, be much closer to what organizations actually need.

Because in serious knowledge work, the smartest system is not the one that answers most confidently. It is the one that helps users ask, inspect, narrow, verify, and then think.

A tragic concept, I know. The human remains in the loop.

Cognaptus: Automate the Present, Incubate the Future.

¹ Allard Oelen, Mohamad Yaser Jaradeh, and Sören Auer, “Introducing ORKG ASK: an AI-driven Scholarly Literature Search and Exploration System Taking a Neuro-Symbolic Approach,” arXiv:2512.16425, 2025, https://arxiv.org/abs/2512.16425.
