Query the Receipt, Not the Vibe: DualGraph and the RAG Catalog Problem

A product catalog is not a paragraph with a search box

Catalogs look deceptively friendly to RAG systems. A product page has descriptions, feature bullets, specification tables, prices, variants, categories, and marketing copy. Feed those pages into a vector database, ask an LLM a question, and the system should answer. This is the comforting story. It is also where many enterprise RAG demos begin their quiet decline into customer-support theater.

The problem is not that semantic retrieval is useless. It is that many business questions over semi-structured data are not really asking for “relevant text.” They are asking for a controlled operation over a set: filter these products, combine these conditions, compare these families, list every matching item, and do not forget the awkward variant hidden behind a slightly different page layout. “Which phones under £500 support 5G, have a 120Hz display, and include at least a 5000mAh battery?” is not a vibe. It is a database query wearing natural-language clothing.

That is the useful lens for reading “Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering,” a Samsung AI paper introducing DualGraph and the SpecsQA benchmark.¹ The paper is not merely another GraphRAG variant. Its more practical claim is sharper: GraphRAG-style semantic structure still does not reliably solve exact filtering, aggregation, and exhaustive enumeration. For that, a RAG system needs a symbolic path as well as a semantic one.

The failure mode is retrieval that finds the neighborhood, not the answer set

A standard dense RAG pipeline usually asks: which chunks are semantically close to the question? That works well when the answer is concentrated in a passage. Ask “What is wireless charging?” and semantic retrieval has a reasonable job: find explanatory text, pass it to the generator, produce an answer.

But product QA often asks a different kind of question. The answer may be spread across hundreds or thousands of product records. The evidence may live in specification tables, variant metadata, price fields, and duplicated product-family pages. The right output may be a list, not a paragraph. Worse, the absence of a product from the list matters as much as its presence. Missing one qualifying device is not a graceful stylistic variation. It is an error.

This is where the misconception the paper targets matters. A reader might assume that GraphRAG already solves this because it builds graphs. Not necessarily. Many graph-based RAG systems still retrieve semantically: they organize text into entities, relations, summaries, communities, or graph neighborhoods, then select context that looks relevant to the query. That can improve multi-document retrieval, but it does not automatically provide formal operations such as:

Operation the user implicitly asks for	Why semantic retrieval struggles	What symbolic querying adds
Exact filtering	Similar chunks can mention related products without satisfying every constraint	Explicit predicates and filters
Exhaustive list retrieval	Top-$k$ context may omit valid matches outside the retrieved neighborhood	Query execution over the full structured graph
Numeric comparison	Embeddings do not naturally enforce thresholds such as price or battery capacity	Typed values, normalized units, arithmetic filters
Aggregation and group comparison	Relevant evidence may be distributed across product families	Structured traversal and controlled joins
Variant-sensitive answers	Product pages may mix variants, prices, and model identifiers	Entity normalization and product-variant representation
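
The contrast in the table can be made concrete with a toy example. The sketch below is not the paper's code: the `Phone` records, their values, and the assumed ranking order are all invented for illustration. It shows the difference between filtering the whole set and filtering only what a top-k retriever passed along.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phone:
    name: str
    price_gbp: float
    has_5g: bool
    refresh_hz: int
    battery_mah: int


# Hypothetical records, invented for illustration only.
# The list order stands in for a similarity ranking.
CATALOG = [
    Phone("Phone A", 449.0, True, 120, 5000),
    Phone("Phone B", 399.0, True, 90, 5000),
    Phone("Phone C", 499.0, True, 120, 5200),
    Phone("Phone D", 549.0, True, 120, 5000),
    Phone("Phone E", 299.0, False, 120, 6000),
    Phone("Phone F", 479.0, True, 120, 4500),
]


def matches(p: Phone) -> bool:
    """Every constraint must hold; 'close enough' does not count."""
    return (
        p.price_gbp < 500
        and p.has_5g
        and p.refresh_hz >= 120
        and p.battery_mah >= 5000
    )


def exhaustive_answer(catalog):
    """Apply the predicate to every record, as a symbolic query would."""
    return sorted(p.name for p in catalog if matches(p))


def top_k_answer(ranked, k):
    """Apply the predicate only to the k records a retriever passed on."""
    return sorted(p.name for p in ranked[:k] if matches(p))


print(exhaustive_answer(CATALOG))  # ['Phone A', 'Phone C']
print(top_k_answer(CATALOG, 2))    # ['Phone A']
```

With k set to 2, the second qualifying phone never reaches the generator, so no amount of careful prompting can put it back in the list.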

DualGraph’s central mechanism is built around this split. It represents the same corpus through two views: a Textual Knowledge Graph, or TKG, for semantic retrieval over natural-language descriptions, and a Symbolic Knowledge Graph, or SKG, for exact querying over typed subject–predicate–object triples. The paper’s title asks whether to query symbolically or retrieve semantically. The practical answer is: both, but not for the same job.

SpecsQA makes catalog QA harder than a polished demo

The paper introduces SpecsQA to test this problem in a setting that resembles an actual commercial catalog rather than a clean academic table. The dataset is built from a snapshot of the Samsung UK online shop collected on November 14, 2025. The authors scraped 2,162 webpages across 26 product categories. They also extracted structured metadata such as product names, categories, prices, model identifiers, variant configurations, and specification attributes.

That detail matters. A product catalog is not just one table. It is a living website with multiple page layouts, missing fields, repeated variants, prices stored outside product pages, specification pages that may describe several variants at once, and product families whose names are almost—but not quite—consistent. The appendix notes three rough page layouts. One layout is general, another is used for premium smartphones with a separate specifications page, and a third, mostly for accessories, may not include a specification table at all. In other words: exactly the kind of mildly untidy environment where enterprise data engineering earns its salary.

SpecsQA contains 117 manually written questions. The categories include inverse queries, multi-condition queries, group comparisons, and reasoning queries. The first two categories are especially important for the paper’s thesis because they require filtering over attributes and often demand exhaustive product lists. The latter two categories require more synthesis and sometimes more subjective recommendation logic.

The benchmark also distinguishes objective from subjective questions. In the full results, there are 93 objective questions and 24 subjective questions. That split is useful because list matching is meaningful when there is a factual answer set, but recommendation questions can have multiple acceptable answers. If a benchmark treats “good phone for an elderly person” as if it were the same kind of answer as “phones with 8GB RAM under £600,” the evaluation starts doing interpretive gymnastics. Nobody enjoys that, except maybe benchmark tables.
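
To see why list matching only makes sense against a factual answer set, here is a minimal sketch of a set-based F1. The paper's exact metric definition may differ; the function name `list_matching_f1` and the case-insensitive name matching are assumptions made for this illustration.

```python
def list_matching_f1(predicted, gold):
    """Set-based F1 between a predicted product list and a gold answer set."""
    pred = {p.strip().casefold() for p in predicted}
    ref = {g.strip().casefold() for g in gold}
    if not pred and not ref:
        return 1.0  # both empty: nothing to find, nothing claimed
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Two of four gold items found, plus one wrong item:
# precision 2/3, recall 1/2, F1 = 4/7.
score = list_matching_f1(["A", "B", "X"], ["A", "B", "C", "D"])
print(round(score, 3))  # 0.571
```

A recommendation question has no single `gold` set to compare against, which is exactly why the benchmark scores it separately.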

DualGraph separates language matching from constraint satisfaction

The core architecture is easier to understand if we treat it as two retrieval instruments mounted over the same corpus.

The TKG is the semantic instrument. It preserves natural-language context and ambiguity. The authors build it using an open information extraction method based on UnWeaver: chunks are mapped to entity mentions, entity descriptions are aggregated, and entity-centric retrieval uses similarity between the question and entity descriptions to vote for relevant chunks. In implementation, the paper retains entity nodes and drops explicit TKG edges for efficiency.

The SKG is the symbolic instrument. It converts structured components, especially specification tables, into ontology-consistent triples. Products, product ranges, categories, specification rows, features, values, and numeric attributes become typed graph objects. The SKG supports SPARQL-style exact querying: filtering, comparison, arithmetic constraints, and exhaustive listing.
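
As a rough illustration of what "typed graph objects" buys, the sketch below turns specification rows into triples with normalized numeric values and then filters them. It uses plain Python tuples rather than an RDF store and SPARQL, and the predicate names (`batteryCapacity`, `price`, `supports5G`) are hypothetical, not the paper's ontology.

```python
import re


def parse_value(raw: str):
    """Turn '5,000 mAh' or '£499.00' into (number, unit); leave other text as-is."""
    text = raw.strip()
    m = re.fullmatch(r"£\s*([\d,]+(?:\.\d+)?)", text)
    if m:
        return (float(m.group(1).replace(",", "")), "GBP")
    m = re.fullmatch(r"([\d,]+(?:\.\d+)?)\s*([A-Za-z]+)", text)
    if m:
        return (float(m.group(1).replace(",", "")), m.group(2).lower())
    return text


def to_triples(product: str, spec_rows: dict):
    """One (subject, predicate, object) triple per specification row."""
    return [(product, pred, parse_value(raw)) for pred, raw in spec_rows.items()]


def products_where(triples, predicate, unit, minimum):
    """Exhaustively list subjects whose typed value meets a numeric threshold."""
    hits = set()
    for subject, pred, obj in triples:
        if (
            pred == predicate
            and isinstance(obj, tuple)
            and obj[1] == unit
            and obj[0] >= minimum
        ):
            hits.add(subject)
    return sorted(hits)


# Hypothetical rows, invented for illustration only.
triples = to_triples(
    "Phone A",
    {"batteryCapacity": "5,000 mAh", "price": "£449.00", "supports5G": "Yes"},
) + to_triples("Phone B", {"batteryCapacity": "4500mAh"})

print(products_where(triples, "batteryCapacity", "mah", 5000))  # ['Phone A']
```

The threshold comparison works only because "5,000 mAh" became a number with a unit. Left as a string, it is just more text for an embedding to approximate.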

The important point is not merely “there are two graphs.” The important point is that the two graphs lose different kinds of information, and therefore fail in different ways.

Semantic retrieval is forgiving. If a user phrases a question messily, the TKG can still retrieve text that appears contextually related. But it is not reliable for exact set operations.

Symbolic querying is precise. If the graph contains normalized triples for product features, prices, batteries, and display types, a SPARQL query can enforce conditions directly. But symbolic querying is brittle when the question is vague, the schema lacks a field, the data appears only in marketing text, or the natural-language-to-SPARQL translation fails.

DualGraph’s design therefore offers several orchestration strategies:

DualGraph variant	Mechanism	Operational interpretation
TKG only	Semantic retrieval only	Good for open-ended textual context, weak for exact lists
SKG only	Symbolic querying only	Stronger for structured constraints, brittle when symbolic retrieval fails
SKG + TKG concatenation	Use both contexts together	Higher context coverage, higher query cost
SKG + TKG fallback	Try symbolic retrieval; fall back to semantic retrieval if needed	Practical default for specification-heavy catalogs
Router	LLM chooses symbolic or semantic retrieval	Useful when question mix is balanced, but routing can fail
Router + TKG fallback	LLM routes, with semantic fallback after symbolic failure	Better general-purpose trade-off in the paper
Agentic router	Iterative LLM agent calls retrieval tools and self-reflects	More expensive; not automatically better on core metrics
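
Two of the simpler strategies in the table, fallback and concatenation, can be sketched in a few lines. This is a minimal illustration under stated assumptions: `fake_skg` and `fake_tkg` are stand-in retrievers invented here, and treating "an exception or an empty result" as symbolic failure is my assumption, not a criterion taken from the paper.

```python
from typing import Callable, List, Optional, Tuple

Retriever = Callable[[str], Optional[List[str]]]


def skg_then_tkg(question: str, skg: Retriever, tkg: Retriever) -> Tuple[str, List[str]]:
    """Symbolic first; semantic only if the symbolic path fails or is empty."""
    try:
        rows = skg(question)
    except Exception:
        rows = None
    if rows:
        return ("skg", rows)
    return ("tkg", tkg(question) or [])


def skg_concat_tkg(question: str, skg: Retriever, tkg: Retriever) -> List[str]:
    """Always run both retrievers and hand the generator the combined context."""
    try:
        rows = skg(question) or []
    except Exception:
        rows = []
    return rows + (tkg(question) or [])


# Stand-in retrievers, invented for illustration only.
def fake_skg(question: str) -> List[str]:
    if "5G" in question:
        return ["Phone A", "Phone C"]
    raise ValueError("could not translate question into a symbolic query")


def fake_tkg(question: str) -> List[str]:
    return ["chunk about " + question]


print(skg_then_tkg("phones with 5G", fake_skg, fake_tkg))
print(skg_then_tkg("good phone for travel", fake_skg, fake_tkg))
```

The fallback version pays for semantic retrieval only when the symbolic path comes back empty-handed, which is the cost difference the table's "higher query cost" note points at.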

This is the mechanism-first story. DualGraph is not saying semantic retrieval is obsolete. It is saying semantic retrieval should not be forced to imitate a database.

SPARQL generation needs patterns, not just a schema pasted into a prompt

One easily missed technical detail is how the SKG side turns natural-language questions into symbolic queries. The paper uses an LLM to generate SPARQL, but it does not simply hand the model a schema and hope for disciplined behavior. The prompt includes a common component—persona, query-generation rules, domain description, schema, numerical handling rules, and output format—and a question-specific component based on retrieved graph patterns.

The graph patterns are important because they ground SPARQL generation in actual structures found in the SKG. The paper uses four pattern types:

Pattern type	What it represents	Why it matters
Spec	Product specification entries and values	The main bridge from product-table rows to queryable constraints
Feature	Product capabilities	Helps with reusable product-level capabilities
Category	Product family or category information	Helps locate product groups, though not always beneficial
Singular Node	Individual graph entities	Provides concrete anchors for query generation
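
The two-part prompt can be pictured as a fixed block plus a handful of retrieved patterns. The sketch below is an assumption-heavy illustration: the pattern texts are invented, and selecting patterns by token overlap is a stand-in for whatever retrieval the paper actually uses.

```python
import re

# Hypothetical pattern texts, invented for illustration only.
PATTERNS = [
    {"type": "Spec", "text": "product hasSpec battery capacity value mAh"},
    {"type": "Feature", "text": "product hasFeature wireless charging"},
    {"type": "Category", "text": "product inCategory smartphones"},
    {"type": "Singular Node", "text": "Galaxy Buds"},
]


def tokens(text):
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_patterns(question, patterns, limit=2):
    """Keep the patterns sharing the most tokens with the question (assumed heuristic)."""
    q = tokens(question)
    scored = [(len(q & tokens(p["text"])), i, p) for i, p in enumerate(patterns)]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [p for _, _, p in scored[:limit]]


def build_prompt(question, common_component, patterns):
    """Common component first, then question-specific patterns, then the question."""
    chosen = select_patterns(question, patterns)
    pattern_lines = [f"- {p['type']}: {p['text']}" for p in chosen]
    return "\n".join(
        [common_component, "Relevant graph patterns:", *pattern_lines, f"Question: {question}"]
    )


QUESTION = "Which smartphones have a battery capacity of at least 5000 mAh?"
print(build_prompt(QUESTION, "<persona, rules, schema, output format>", PATTERNS))
```

Only the Spec and Category patterns overlap with this question, so only they reach the prompt. The point of the mechanism is that the model sees structures that exist in the graph, not just a schema it is free to misread.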

This pattern mechanism is not decorative engineering. The ablation results show that Spec patterns are the most critical. In the SKG + TKG fallback setting, the full pattern setup achieves 50.2% successful SKG retrieval, 0.293 factual-correctness F1, and 0.372 list-matching F1. Removing the Spec pattern drops successful SKG retrieval to 37.7%, factual-correctness F1 to 0.247, and list-matching F1 to 0.292. Removing all patterns is worse: 29.4% successful SKG retrieval, 0.230 factual-correctness F1, and 0.253 list-matching F1.

The likely purpose of this experiment is ablation, not a second thesis. It tests whether the symbolic retriever’s performance comes from the graph-pattern grounding rather than merely from having an SKG somewhere in the architecture. The answer is mostly yes. Specification-level patterns are the workhorse because SpecsQA is built around product attributes.
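
The size of those drops is easier to judge in relative terms. The short calculation below uses only the three ablation rows quoted above; the helper name `drop_vs_full` is mine.

```python
# (successful SKG retrieval %, factual-correctness F1, list-matching F1)
ABLATION = {
    "all patterns": (50.2, 0.293, 0.372),
    "no Spec pattern": (37.7, 0.247, 0.292),
    "no patterns": (29.4, 0.230, 0.253),
}


def drop_vs_full(setting):
    """Percentage points of retrieval success lost, and relative list-F1 loss in %."""
    full_success, _, full_list_f1 = ABLATION["all patterns"]
    success, _, list_f1 = ABLATION[setting]
    points_lost = round(full_success - success, 1)
    relative_f1_loss = round(100 * (full_list_f1 - list_f1) / full_list_f1, 1)
    return points_lost, relative_f1_loss


print(drop_vs_full("no Spec pattern"))  # (12.5, 21.5)
print(drop_vs_full("no patterns"))      # (20.8, 32.0)
```

Removing the Spec pattern alone costs roughly a fifth of the list-matching F1, and removing everything costs about a third.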

The Category pattern is more ambiguous. Removing it slightly improves some downstream metrics in the SKG + TKG fallback variant: factual-correctness F1 rises to 0.317 and list-matching F1 to 0.380. The authors still keep all patterns in the main experiments because the full configuration gives robust retrieval success and may generalize better across question types. That is a sensible default. It also quietly reminds us that “more schema context” is not always better. Sometimes extra structure gives the model another elegant way to be wrong.

The main evidence: symbolic-semantic hybrids beat the usual suspects

The paper compares DualGraph against dense RAG, graph-based RAG, symbolic RAG, table-oriented RAG, agentic RAG, and an LLM-only baseline. The models and evaluation setup are held consistent: GPT-OSS-120B is used as the underlying LLM, Qwen3-Embedding-4B as the embedding model, and results are averaged over multiple indexing, query-generation, and evaluation runs to reduce stochastic variation.

The headline result is clear. DualGraph’s representative variants outperform the baselines on SpecsQA.

System	Factual Correctness F1	List Matching F1	LLM-as-a-Judge
LLM only	0.107	0.028	0.559
Vector RAG	0.092	0.118	0.438
Microsoft GraphRAG	0.153	0.203	0.575
RAPTOR	0.207	0.216	0.568
Wikontic	0.135	0.290	0.523
TableRAG	0.091	0.118	0.453
DualGraph Router + TKG fallback	0.298	0.357	0.640
DualGraph SKG + TKG fallback	0.293	0.372	0.644

The strongest non-DualGraph factual-correctness baseline is RAPTOR at 0.207. The strongest non-DualGraph list-matching baseline is Wikontic at 0.290. DualGraph improves on both: Router + TKG fallback reaches 0.298 factual-correctness F1, while SKG + TKG fallback reaches 0.372 list-matching F1.
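
For readers who prefer margins to rankings, the gaps can be computed directly from the table above. Nothing here goes beyond the reported numbers; the helper names are mine.

```python
# (factual-correctness F1, list-matching F1), copied from the results table.
BASELINES = {
    "LLM only": (0.107, 0.028),
    "Vector RAG": (0.092, 0.118),
    "Microsoft GraphRAG": (0.153, 0.203),
    "RAPTOR": (0.207, 0.216),
    "Wikontic": (0.135, 0.290),
    "TableRAG": (0.091, 0.118),
}
DUALGRAPH = {
    "Router + TKG fallback": (0.298, 0.357),
    "SKG + TKG fallback": (0.293, 0.372),
}


def best(systems, metric):
    """Name and scores of the top system on one metric index."""
    return max(systems.items(), key=lambda item: item[1][metric])


def margin(metric):
    """Best baseline, best DualGraph variant, absolute gap, relative gap in %."""
    base_name, base = best(BASELINES, metric)
    dual_name, dual = best(DUALGRAPH, metric)
    gap = dual[metric] - base[metric]
    return base_name, dual_name, round(gap, 3), round(100 * gap / base[metric], 1)


print(margin(0))  # ('RAPTOR', 'Router + TKG fallback', 0.091, 44.0)
print(margin(1))  # ('Wikontic', 'SKG + TKG fallback', 0.082, 28.3)
```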

For a business reader, the absolute numbers matter as much as the ranking. A list-matching F1 of 0.372 is not a victory parade. It is a sign that semi-structured QA remains hard. The more useful interpretation is diagnostic: if state-of-the-art systems struggle to retrieve complete product lists from a controlled website snapshot, then enterprise RAG over messy internal catalogs, procurement databases, policy manuals, and support knowledge bases should be treated as an engineering problem, not a prompt template.

The paper’s result is not “DualGraph is production-ready.” The result is “this class of questions exposes a capability gap, and a symbolic-semantic split closes part of it.” That is a much more valuable conclusion, because it tells us where to put engineering effort.

The ablation says SKG carries the catalog burden

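The internal DualGraph ablation is the most useful part of the paper for system designers. TKG-only retrieval reaches 0.140 factual-correctness F1 and 0.136 list-matching F1. SKG-only retrieval reaches 0.273 and 0.321. Most hybrid variants do better: concatenation and both fallback variants beat SKG-only on both metrics, while the plain router and the agentic router do not beat it consistently.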

DualGraph variant	Factual Correctness F1	List Matching F1	LLM-as-a-Judge	Likely purpose of test
TKG only	0.140	0.136	0.352	Tests semantic retrieval alone
SKG only	0.273	0.321	0.507	Tests symbolic retrieval alone
SKG concat TKG	0.306	0.367	0.561	Tests joint context coverage
SKG + TKG fallback	0.293	0.372	0.523	Tests symbolic-first robustness
Router	0.268	0.303	0.485	Tests question-dependent routing
Router + TKG fallback	0.298	0.357	0.528	Tests routing plus failure recovery
Agentic router	0.240	0.341	0.764	Tests iterative tool-use orchestration

The interpretation is fairly direct. On SpecsQA, symbolic retrieval carries much of the catalog burden. The benchmark is specification-heavy, so SKG-based variants have a natural advantage. The TKG still matters, but mainly as a robustness layer: it helps when a question is open-ended, underspecified, or when symbolic retrieval fails.

The agentic result is particularly useful because it punctures a fashionable assumption. The agentic router has the highest LLM-as-a-judge score, 0.764, but weaker factual correctness and list matching than simpler hybrid variants. The paper notes that LLM-as-a-judge may prefer longer answers, and the appendix shows a length-related bias in the judge. This does not mean agentic retrieval is useless. It means iterative tool use is not a magic solvent for structured retrieval. Sometimes the agent produces a more judge-pleasing answer while still failing the set operation. Elegant prose remains a poor substitute for the right product list. Shocking, I know.

SpecsQA is a benchmark for diagnosis, not a consumer-search leaderboard

The benchmark’s question categories help reveal which retrieval capability is being tested.

Inverse and multi-condition questions are the symbolic heartland. These ask for all products satisfying properties or combinations of constraints. DualGraph performs especially well here because the SKG can enforce structured filters. This is the cleanest support for the paper’s main thesis.

Group-comparison questions are harder. They require aggregation and contrast across product families, often mixing specifications with narrative explanation. Pure symbolic retrieval may find facts, but answer quality depends on synthesis.

Reasoning questions are harder again, especially when subjective recommendation criteria enter the question. “Best for multitasking” or “good for someone who only uses basic apps” may depend on user preferences, price sensitivity, product availability, and implicit trade-offs. Here, the TKG is more useful because it preserves descriptive context. But evaluation also becomes less deterministic.

The objective/subjective split confirms this. On objective questions, Router + TKG fallback reaches 0.354 factual-correctness F1 and 0.441 list-matching F1, while SKG + TKG fallback reaches 0.337 and 0.434. On subjective questions, scores are lower and noisier; SKG + TKG fallback reaches 0.124 factual-correctness F1 and 0.188 list-matching F1, while Router + TKG fallback reaches 0.082 and 0.111. The benchmark is therefore not just measuring “better RAG.” It is separating different tasks that often get blurred inside the same chatbot interface.

That diagnostic separation is valuable. It tells product teams not to use one confidence story for every question type. A catalog assistant answering “which products meet these exact constraints?” should be evaluated differently from an assistant answering “which product would you recommend for my use case?” The first needs set accuracy. The second needs grounded reasoning, preference elicitation, and possibly interaction. Different disease, different medicine.

What the paper directly shows

The paper directly supports three claims.

Claim	Evidence in the paper	Business meaning	Boundary
Semi-structured QA exposes a gap in standard RAG	Vector RAG, GraphRAG, table RAG, symbolic, and agentic baselines underperform DualGraph on SpecsQA	Product and enterprise assistants need more than semantic chunk retrieval	SpecsQA is one commercial website snapshot, not all enterprise data
Symbolic retrieval improves exact catalog questions	SKG-only beats TKG-only; SKG-based hybrids are strongest on list matching	Structured attributes should be queryable, not merely embedded	SKG construction depends on schema, extraction, normalization, and rules
Semantic retrieval remains necessary	Hybrid and fallback variants beat single-view retrieval in important cases	Open-ended and underspecified questions still need textual context	Current SKG-TKG alignment is lightweight and not truly joint retrieval

The evidence does not show that GraphRAG is obsolete. It shows that graph-shaped semantic retrieval is not the same as symbolic query execution. The distinction is subtle in architecture diagrams and painfully obvious when a user asks for every product satisfying four constraints.

What Cognaptus infers for business systems

For business use, the practical pathway is straightforward.

If a knowledge base contains mostly policy prose, incident reports, meeting notes, and narrative documentation, semantic retrieval may be a reasonable starting point. It still needs evaluation, but the main job is often to find relevant passages.

If the knowledge base contains catalogs, prices, SKUs, compliance attributes, customer entitlements, product specifications, contract fields, HR policy thresholds, insurance coverage rules, or financial line items, semantic retrieval alone is structurally misaligned. The system needs a symbolic or database-like layer.

A useful production architecture would therefore separate at least four steps:

1. Normalize structured facts. Extract product attributes, prices, variants, dates, thresholds, identifiers, and categorical fields into a controlled representation.
2. Preserve textual context. Keep natural-language descriptions, policy explanations, FAQs, and marketing copy available for semantic retrieval.
3. Route by question type. Use symbolic querying for exact filters, counts, comparisons, and exhaustive lists; use semantic retrieval for open-ended explanation and recommendation context.
4. Evaluate by output type. Measure list accuracy for list questions, factual support for explanatory answers, and preference alignment for recommendation questions.
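
The routing and evaluation steps can be sketched as a small dispatch table. The paper's router is an LLM; the keyword heuristic below is a deliberately crude stand-in, and the cue lists and names (`route`, `METRIC_BY_OUTPUT`) are assumptions made for illustration.

```python
import re

# Assumed cue lists; a production router would need something sturdier.
OPEN_CUES = re.compile(r"\b(recommend|best|good for|should i|why|explain)\b", re.I)
EXACT_CUES = re.compile(
    r"\b(under|over|at least|at most|more than|less than|all|every|list|how many|which)\b",
    re.I,
)

# Evaluate each output type with a metric that fits it.
METRIC_BY_OUTPUT = {
    "list": "list-matching F1",
    "explanation": "factual support",
    "recommendation": "preference alignment",
}


def route(question: str) -> str:
    """Pick a retrieval path from surface cues in the question."""
    if OPEN_CUES.search(question):
        return "semantic"  # open-ended or preference-laden wording wins
    if EXACT_CUES.search(question) or re.search(r"\d", question):
        return "symbolic"  # thresholds, quantifiers, or numbers suggest a filter
    return "semantic"


print(route("Which phones under £500 support 5G?"))                # symbolic
print(route("Which product would you recommend for my use case?"))  # semantic
print(route("What is wireless charging?"))                          # semantic
```

Checking the open-ended cues first is a design choice: a recommendation question that happens to mention a number is still a recommendation question.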

This is not glamorous. It is also how many useful enterprise AI systems will actually be built. The fancy model should not be asked to hallucinate a database. Give it a database.

The uncertainty boundary is where the engineering bill lives

The paper is careful about its limitations, and they materially affect adoption.

First, the SKG is built mainly from structured parts of the corpus, especially product specification tables. That is the right move for SpecsQA, but it means information only present in marketing copy, FAQs, images, or videos may not enter the symbolic layer. The appendix notes that the original webpages are multimodal, while the paper focuses on textual information and structured specifications. A full commercial assistant would need a broader extraction stack.

Second, the symbolic layer requires manual domain work. The authors manually design a lightweight ontology, SPARQL retrieval patterns, and Datalog rules. This gives the system interpretability and control, but it is still an engineering cost. Businesses should read this as a trade-off, not a flaw. The cost of symbolic modeling buys better behavior on exact queries. Whether that trade is worthwhile depends on query volume, error cost, and how often the underlying schema changes.

Third, cross-graph alignment is still shallow. The paper aligns SKG and TKG nodes largely through normalized names and reports that a learned contrastive alignment model did not improve performance. That means DualGraph’s two views are complementary, but not deeply integrated. Future systems could do more: expand symbolic results with textual evidence, constrain semantic retrieval with symbolic filters, or traverse both views jointly.
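
Name-based alignment is simple enough to show in full, which is part of why it counts as shallow. The normalization rules below are my assumption of what "normalized names" might involve, not the paper's procedure, and the product names are invented.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Casefold, replace punctuation with spaces, and collapse whitespace.

    Non-ASCII letters are treated as separators in this sketch.
    """
    text = unicodedata.normalize("NFKC", name).casefold()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(text.split())


def align(skg_nodes, tkg_nodes):
    """Map each SKG node to the TKG nodes that share its normalized name."""
    index = {}
    for node in tkg_nodes:
        index.setdefault(normalize_name(node), []).append(node)
    return {s: index.get(normalize_name(s), []) for s in skg_nodes}


# Hypothetical node names, invented for illustration only.
links = align(
    ["Phone X Ultra 5G", "Buds Pro"],
    ["phone-x ultra (5G)", "PHONE X  ULTRA 5G", "Watch Classic"],
)
print(links)
```

Here "Buds Pro" aligns with nothing because no TKG node shares its normalized name. A lookup like this cannot recover a link that depends on meaning rather than spelling.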

Fourth, the benchmark is one snapshot of one commercial website. This is a strength for reproducibility and memorization resistance, but it limits claims about generality. Catalogs in finance, insurance, healthcare operations, B2B procurement, and government services will have different schemas, error patterns, and compliance constraints.

Finally, list scores remain modest. This is perhaps the most honest result in the paper. DualGraph improves the frontier, but exact semi-structured QA is still not solved. For production, the system would need confidence thresholds, abstention behavior, user-visible filters, audit trails, and regression tests whenever the catalog updates.
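
Of those safeguards, regression testing is the easiest to sketch. The function below compares the answer sets a system produced before and after a catalog update. The name `diff_answers`, the question, and the product names are hypothetical.

```python
def diff_answers(before: dict, after: dict) -> dict:
    """Report, per question, which list items appeared or disappeared."""
    report = {}
    for question in sorted(set(before) | set(after)):
        old = set(before.get(question, []))
        new = set(after.get(question, []))
        if old != new:
            report[question] = {
                "added": sorted(new - old),
                "removed": sorted(old - new),
            }
    return report


# Hypothetical answer sets from two catalog snapshots.
before = {"phones under £500 with 5G": ["Phone A", "Phone C"]}
after = {"phones under £500 with 5G": ["Phone A", "Phone G"]}

print(diff_answers(before, after))
```

A change in the report is not automatically a bug, since the catalog may really have changed. It is a prompt for a human to check whether the data moved or the pipeline did.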

The real lesson: RAG needs a retrieval operating system, not a bigger embedding drawer

DualGraph’s strongest contribution is not that it adds “a graph” to RAG. We have enough graph-shaped things in AI papers to pave a small airport. Its stronger contribution is a clean division of labor.

Semantic retrieval is for meaning. Symbolic querying is for constraints. Routing is for deciding which kind of evidence the question demands. Evaluation is for checking the output in the same form the user needs it: paragraph, product list, comparison, or recommendation.

That division maps directly to business practice. A customer-support assistant over a product catalog should not merely cite a few relevant pages. It should know when the user is asking for a faceted search, when the answer must be exhaustive, when price or availability is time-sensitive, and when the correct response is not an answer but a clarifying question. DualGraph does not solve all of that. It does give a credible mechanism for the first and hardest distinction: text relevance versus structured truth.

The mildly uncomfortable conclusion is that many RAG systems marketed as knowledge assistants are still glorified document finders. They retrieve the neighborhood around an answer and hope the generator walks the last mile. For prose-heavy questions, that may be acceptable. For semi-structured business questions, the last mile contains the actual task.

Query the receipt. Retrieve the explanation. Do not confuse the two.

Cognaptus: Automate the Present, Incubate the Future.

¹ Mateusz Czyżnikiewicz, Ryszard Tuora, Adam Kozakiewicz, Tomasz Ziętkiewicz, Mateusz Galiński, Michał T. Godziszewski, Michał Karpowicz, Timothy Hospedales, and Cristina Cornelio, “Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering,” arXiv:2605.27164v1, 26 May 2026, https://arxiv.org/abs/2605.27164
