Ground Control to Synthetic Data: Why Enterprise LLMs Need a Source of Truth

TL;DR for operators

Synthetic data is having its predictable enterprise moment: everyone wants more of it, faster, cheaper, and preferably without involving humans who ask inconvenient questions like “is this correct?”

The two papers here are useful because they push against that lazy version of the story. StateGen, from PayPal AI, focuses on generating multi-turn training conversations for tool-augmented LLM agents, using an authoritative world-state object, tool simulation, persona variation, and multi-axis judging.¹ CYQUARK focuses on generating Text-To-Cypher fine-tuning data from a target property graph and schema, expanding query expressivity while filtering natural-language paraphrases for logical fidelity.²

Different domains, same lesson: synthetic data is valuable only when it is tied to a source of truth.

For operators, the takeaway is not “generate more data.” That is the slogan on the cheap T-shirt. The useful version is: generate data from the actual structure of your business systems, verify that the generated examples preserve meaning, score failure modes separately, and only then use the corpus to train or evaluate smaller, domain-specific models.

The commercial prize is obvious: less dependence on giant proprietary models for every query, better data sovereignty, lower inference cost, and training data that reflects business workflows rather than internet-flavored improvisation. The caveat is equally obvious, though less popular at conferences: if the source system is wrong, the schema is misleading, the state manager drifts, or the judge is uncalibrated, your synthetic data pipeline will scale errors with admirable efficiency.

The shared problem: enterprise AI is data-hungry, but production data is awkward

The current AI deployment bottleneck is no longer only model access. Most companies can call a powerful model. Many can even host a capable open-weight model. The harder question is whether they can make that model behave reliably inside their own systems.

That means training and evaluating against domain-specific behavior: customer workflows, tool calls, database schemas, exception cases, compliance boundaries, and all the tedious operational reality that never appears in a clean demo.

Real production data has three problems.

First, there is not enough of the right kind. Agentic systems need multi-turn, tool-grounded trajectories, not just single-turn instructions. Graph query systems need paired natural-language questions and executable Cypher queries over the target graph, not generic SQL examples with a name tag.

Second, the data is often sensitive. Customer conversations, payment records, CRM histories, medical graphs, supply-chain nodes, internal tickets, and security logs are not ideal material for casual model training. Apparently regulators remain unimpressed by the phrase “but the model needed context.”

Third, human annotation does not scale cleanly. It is expensive, slow, inconsistent, and often requires domain expertise. For many enterprises, the question becomes: can we create training data synthetically without creating synthetic nonsense?

The two papers answer yes, but only under a condition: the synthetic generator must be grounded. Not prompted. Not vibes-aligned. Grounded.

The shared insight: synthetic data needs a source of truth

StateGen and CYQUARK operate in different technical settings. StateGen is about tool-using agents in multi-turn workflows. CYQUARK is about Text-To-Cypher semantic parsing over property graphs. What they share is one high-level insight:

Synthetic data becomes operationally useful when generation is constrained by authoritative structure and then filtered or judged for semantic reliability.

This is a stricter claim than “LLMs can generate training data.” They can, of course. LLMs can also generate legal advice, corporate strategy, and a recipe for lasagna that includes eighteen cloves of garlic and no remorse. Generation is not the scarce capability. Controlled generation is.

The papers each build a different kind of grounding layer.

Paper	Domain	Grounding object	Verification mechanism	Business purpose
StateGen	Multi-turn tool-augmented agents	Structured world-state object maintained across turns	Backend-is-truth invariant, scenario validation, multi-axis LLM judge	Generate scored agent conversations for fine-tuning and evaluation
CYQUARK	Text-To-Cypher over property graphs	Target property graph plus schema and grounded query patterns	Semantic constraints and LLM-based logical meaning verification	Fine-tune small local models for graph querying

The important point is not that both use synthetic data. That would be a shallow connection. The deeper point is that both papers treat synthetic data generation as an engineering discipline built around invariants.

CYQUARK’s invariant is that generated Cypher must come from the graph schema and executable graph-grounded patterns, while the natural-language question must preserve the logical meaning of that query. StateGen’s invariant is that simulated tool responses must remain consistent with an authoritative backend state. In both cases, the model is not trusted to invent reality. Sensible. Late, but sensible.
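To make the query-side invariant concrete, here is a minimal sketch of generating Cypher from a schema and rejecting anything that steps outside it. The `SCHEMA` dictionary, the single counting pattern, and the regex check are all illustrative assumptions, not CYQUARK's actual generator; a real pipeline would parse Cypher properly and execute each query against the target graph.

```python
import re

# Hypothetical toy schema: node labels plus (source, relationship, target) triples.
SCHEMA = {
    "labels": {"Person", "Movie"},
    "relationships": {
        ("Person", "ACTED_IN", "Movie"),
        ("Person", "DIRECTED", "Movie"),
    },
}


def pattern_queries(schema):
    """Emit one counting query per relationship triple in the schema."""
    return [
        f"MATCH (a:{src})-[:{rel}]->(b:{dst}) RETURN count(a)"
        for src, rel, dst in sorted(schema["relationships"])
    ]


# Matches simple directed edges such as (a:Person)-[:ACTED_IN]->(b:Movie).
EDGE = re.compile(r"\((\w+):(\w+)\)-\[:(\w+)\]->\((\w+):(\w+)\)")


def uses_only_schema(query, schema):
    """True if the query has at least one edge and every edge is in the schema."""
    edges = EDGE.findall(query)
    return bool(edges) and all(
        (src, rel, dst) in schema["relationships"]
        for _, src, rel, _, dst in edges
    )


generated = pattern_queries(SCHEMA)
invented = "MATCH (a:Person)-[:PRODUCED]->(b:Movie) RETURN a"
```

Every query in `generated` passes `uses_only_schema`, while `invented` fails because `PRODUCED` is not a relationship the schema knows about. The point is the direction of travel: the schema produces the queries, rather than a model guessing at the schema.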

What StateGen shows: agent training data needs a backend, not theater

StateGen targets a familiar enterprise problem: training tool-using agents requires many conversations in which users express goals, agents call APIs or sub-agents, tools return results, state changes, errors occur, and the conversation eventually succeeds or fails.

The paper’s architecture is a four-role loop: a persona-conditioned user simulator, an agent under test, a tool simulator, and an LLM judge. Around that loop sits a state manager maintaining a structured world-state object. The authors describe this as enforcing a “backend-is-truth” invariant: a fact can enter the world state only if it was present initially or legally created through a write-authorized tool call.

That matters because tool simulation without state grounding is dangerously easy to fake. A tool simulator can produce a plausible response containing a transaction, account, policy, order, or customer record that does not exist. The next turn builds on it. Then the agent learns from it. Now the training data has taught the system that confident fiction is a valid integration pattern. Lovely.

StateGen’s design turns tool responses into state-conditioned outputs rather than free-form improvisations. It also extends this setup to hierarchical multi-agent systems by treating sub-agents as tools that share the same authoritative state object. That is a meaningful architectural point: in multi-agent workflows, contradiction is not a philosophical inconvenience; it is a production bug with better branding.
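A minimal sketch of what a backend-is-truth check could look like. It assumes a deterministic in-memory state, which is a simplification for illustration and not how the paper implements its state manager. The class, the tool names, and the record shapes are hypothetical.

```python
from dataclasses import dataclass, field


class InvariantViolation(Exception):
    """Raised when a tool call or response breaks the backend-is-truth rule."""


@dataclass
class WorldState:
    # Facts keyed by entity id, e.g. {"order-1": {"status": "shipped"}}.
    facts: dict = field(default_factory=dict)
    # Names of tools allowed to create or change facts (assumed, illustrative).
    write_tools: frozenset = frozenset()

    def apply_write(self, tool: str, entity_id: str, record: dict) -> None:
        """Create or update a fact, but only through a write-authorized tool."""
        if tool not in self.write_tools:
            raise InvariantViolation(f"{tool} is not write-authorized")
        self.facts[entity_id] = {**self.facts.get(entity_id, {}), **record}

    def check_response(self, tool: str, response: dict) -> None:
        """Reject a simulated tool response that invents or contradicts state."""
        for entity_id, record in response.items():
            if entity_id not in self.facts:
                raise InvariantViolation(f"{tool} invented entity {entity_id}")
            for key, value in record.items():
                if self.facts[entity_id].get(key) != value:
                    raise InvariantViolation(
                        f"{tool} contradicts backend on {entity_id}.{key}"
                    )


state = WorldState(
    facts={"order-1": {"status": "shipped"}},
    write_tools=frozenset({"create_refund"}),
)
# A read response that matches the backend passes silently.
state.check_response("get_order", {"order-1": {"status": "shipped"}})
# A write-authorized tool may legally add a new fact.
state.apply_write("create_refund", "refund-1", {"order": "order-1", "amount": 20})
```

A simulated `get_order` response that mentions `order-9` would raise `InvariantViolation` here, which is the whole idea: the conversation never gets the chance to build on a record that does not exist.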

The paper reports results across 64,698 evaluated conversations from three corpora. In the mixed corpus, the tool-call hallucination criterion scores near the ceiling, while goal achievement remains meaningfully harder. That distinction is useful. The system can reduce one failure mode (fabricated tool behavior) without magically solving all reasoning, planning, or user-handling problems. The paper is refreshingly clear on this point: hallucination and goal success are separable axes, not a single moral score.

For business readers, this is the operational lesson: do not evaluate an agent training corpus only by whether the final answer looks reasonable. Split the failure modes. A conversation can be polite, fluent, and completely wrong about what the backend actually said. The customer will not admire the prose while their refund disappears into an imaginary ledger.

What CYQUARK shows: query data needs schema-grounded meaning, not prompt roulette

CYQUARK addresses a different but related problem: building accurate natural-language interfaces for property graphs. Property graphs are increasingly used to represent complex relationships across domains, but querying them requires Cypher expertise. A Text-To-Cypher model converts user language into executable graph queries.

The easy route is to prompt a large proprietary model with the graph schema and ask for Cypher. The paper argues that this is often unsuitable when data sovereignty matters or when organizations want smaller locally deployed models. So CYQUARK generates synthetic fine-tuning data for the target property graph.

The method builds on earlier graph-driven synthetic generation but expands expressivity. Earlier tree-pattern approaches could generate basic queries but struggled with more complex Cypher constructs such as nested subqueries, complex return patterns, regular expression constraints, richer aggregations, and multiple answer nodes. CYQUARK broadens the generated query space and adds two important controls: semantic consistency checks after Cypher generation and logical meaning verification after LLM paraphrasing.

That second point is crucial. The pipeline can generate a precise Cypher query and a proto-natural-language realization. Then an LLM paraphrases the proto-language into something a human might actually ask. But paraphrasing is where meaning goes to die if nobody is watching. CYQUARK therefore uses a verifier to check whether the paraphrased question still preserves the logical meaning of the Cypher query.
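The generate, paraphrase, verify sequence can be sketched as a filter with a pluggable verifier. Everything named here is an assumption for illustration: the `Example` fields, the `Verifier` signature, and especially `toy_verifier`, which is a crude keyword stand-in for the LLM-based meaning check the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Example:
    cypher: str          # executable query generated from the graph
    proto_question: str  # templated, literal rendering of the query
    paraphrase: str      # fluent rewrite produced by an LLM


# A verifier answers one question: does the paraphrase still mean the query?
Verifier = Callable[[Example], bool]


def filter_examples(examples: Iterable[Example], verify: Verifier) -> list[Example]:
    """Keep only examples whose paraphrase passes the meaning check."""
    return [ex for ex in examples if verify(ex)]


def toy_verifier(required_terms: dict[str, list[str]]) -> Verifier:
    """Stand-in for an LLM verifier: every required term must be mentioned."""
    def verify(ex: Example) -> bool:
        text = ex.paraphrase.lower()
        return all(term.lower() in text for term in required_terms[ex.cypher])
    return verify


query = "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE m.year > 2000 RETURN count(p)"
proto = "count Person that ACTED_IN Movie with year > 2000"
examples = [
    Example(query, proto, "How many people acted in movies released after 2000?"),
    # This paraphrase reads fine but drops both the count and the year filter.
    Example(query, proto, "Which people acted in movies?"),
]
kept = filter_examples(examples, toy_verifier({query: ["how many", "2000"]}))
```

Only the first example survives. The second is fluent, plausible, and wrong about what the query asks, which is exactly the kind of pair that should never reach fine-tuning.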

The experimental results are the business hook. The authors evaluate on four major Text-To-Cypher benchmarks across 19 graphs. They report that small models fine-tuned with CYQUARK-generated data substantially outperform their zero-shot versions and come close to much larger proprietary models on average execution accuracy. In their aggregate discussion, the fine-tuned 4B model doubles the average accuracy of the zero-shot 4B baseline, while the 0.8B fine-tuned model also performs strongly.

That does not mean every company should immediately replace frontier models with tiny local graph parsers. It means something narrower and more useful: when the task is bounded, structured, and tied to a known schema, grounded synthetic data can shift performance economics. You do not always need a giant general model when a smaller specialized model has been trained on the actual structure of the task.

This is the part procurement teams tend to enjoy. It has costs, but not the thrilling recurring-per-query kind.

The relationship between the papers: same insight, two operating levels

The right frame for these two papers is a shared high-level insight. They are not making opposing arguments. They are not merely sequential steps in one pipeline either. They are two implementations of the same principle at different layers of the enterprise AI stack.

CYQUARK operates at the structured query layer. It asks: how do we turn a company’s graph schema and graph contents into synthetic examples that teach a model to produce executable queries?

StateGen operates at the agentic workflow layer. It asks: how do we turn business scenarios, personas, tool definitions, and backend state into synthetic conversations that teach an agent to interact with systems reliably?

One is about semantic parsing. The other is about multi-turn tool use. But both reject unguided generation.

A useful way to see the relationship:

Layer	Bad synthetic-data approach	Grounded approach
Query generation	Ask an LLM to invent question-query pairs	Generate from target graph, schema, executable query patterns, and meaning checks
Tool-agent conversation	Ask an LLM to role-play tools and users	Use scenario facts, world state, tool schemas, state updates, and multi-axis judging
Fine-tuning	Train on everything that looks fluent	Filter by semantic consistency, logical fidelity, and task-specific judge scores
Evaluation	Report one average score	Separate hallucination, goal achievement, tool usage, communication, compliance, and generalization gaps

This is the shift from synthetic data as “cheap examples” to synthetic data as “controlled simulation.”

And yes, controlled simulation is less glamorous than prompting a model to generate 100,000 examples before lunch. It is also much less likely to create a magnificent landfill of plausible garbage.

The business interpretation: synthetic data as a control plane

The papers show technical methods. The business interpretation is broader: grounded synthetic data can become an AI control plane.

A control plane is not just a place where data is produced. It is where assumptions, boundaries, checks, and failure modes become explicit. That matters for enterprises because AI systems are increasingly expected to operate inside workflows rather than sit beside them as decorative chat windows.

For a business, a grounded synthetic data pipeline can support five practical capabilities.

1. Data sovereignty without performance surrender

CYQUARK’s most direct business relevance is local deployment. If a company can fine-tune a small model on synthetic data generated from its own property graph, it may reduce reliance on sending every graph-query request to a proprietary external model.

This is not ideological open-source theater. It is a practical trade-off: latency, cost, privacy, governance, and infrastructure control. In regulated sectors, “just use the biggest API” is not always a strategy. Sometimes it is a risk memo waiting for formatting.

2. Cheaper specialization

Both papers point toward the same economic pattern: use stronger or specialized systems to generate and verify training data, then deploy smaller models for repeated domain tasks.

In simplified form:

$$ \text{Total Cost} = C_{\text{generation}} + C_{\text{fine-tuning}} + n \cdot C_{\text{inference}} $$

If $n$ is large, reducing $C_{\text{inference}}$ matters. Synthetic data generation and fine-tuning are upfront costs; repeated calls to a large proprietary model are recurring costs. The business case improves when the task is frequent, structured, and stable enough for specialization.
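A short worked example of that break-even logic. The figures are hypothetical and expressed in cents purely to keep the arithmetic clean, and the comparison assumes the proprietary-API route has no upfront cost, which is a simplification.

```python
import math


def total_cost(c_generation, c_fine_tuning, c_inference, n):
    """Total Cost = C_generation + C_fine-tuning + n * C_inference."""
    return c_generation + c_fine_tuning + n * c_inference


def break_even_queries(c_generation, c_fine_tuning, c_inference_local, c_inference_api):
    """Smallest query count n at which the specialized model is strictly cheaper."""
    saving_per_query = c_inference_api - c_inference_local
    if saving_per_query <= 0:
        return None  # the local model never pays back its upfront cost
    upfront = c_generation + c_fine_tuning
    return math.floor(upfront / saving_per_query) + 1


# Hypothetical: 200,000 cents to generate data, 300,000 to fine-tune,
# 1 cent per local query versus 5 cents per proprietary API query.
n_star = break_even_queries(200_000, 300_000, 1, 5)  # 125001

local_at_n_star = total_cost(200_000, 300_000, 1, n_star)
api_at_n_star = total_cost(0, 0, 5, n_star)
```

With these made-up numbers the specialized model becomes cheaper from query 125,001 onward. Below that volume, the upfront spend has not yet paid for itself, which is why the case depends on the task being frequent and stable.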

3. Workflow coverage beyond observed logs

Production logs show what users have already done. They may underrepresent rare edge cases, adversarial inputs, tool failures, compliance traps, vague requests, and awkward persona variations.

StateGen explicitly uses persona vectors, query-complexity overlays, scenario validation, and held-out golden scenarios. CYQUARK uses graph coverage and query expressivity to generate cases that may not be well represented in hand-labeled data. Both aim to expand coverage without waiting passively for the real world to produce enough mistakes.

This is a major advantage. It lets organizations rehearse before deployment. AI teams could try that more often.

4. Better failure-mode accounting

A single model score is comforting and usually unhelpful. StateGen’s eight-axis judge is valuable because it separates goal achievement from tool-call hallucination, reasoning hallucination, communication, consistency, and error handling. CYQUARK’s filtering analysis similarly distinguishes generation from paraphrase quality and logical preservation.

For managers, this changes how model readiness should be discussed. The right question is not “what is the score?” It is:

Which failure modes did we reduce?
Which ones remain?
Which failures are tolerable in this workflow?
Which failures are disqualifying even if the average score looks fine?

A beautiful average can hide a catastrophic tail. Spreadsheets have been doing this for decades; AI just added better lighting.
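A small sketch of how an average can look healthy while one axis fails its floor. The axis names follow the ones mentioned above. The scores, the floors, and the convention that higher is better on every axis are assumptions made for illustration.

```python
from statistics import mean


def per_axis_means(conversations):
    """Average each judging axis separately across all conversations."""
    axes = conversations[0].keys()
    return {axis: mean(c[axis] for c in conversations) for axis in axes}


def failing_axes(axis_means, floors):
    """Axes whose mean falls below the minimum the workflow can tolerate."""
    return sorted(axis for axis, floor in floors.items() if axis_means[axis] < floor)


# Hypothetical judge scores in [0, 1]; higher is better on every axis.
conversations = [
    {"goal_achievement": 0.5, "tool_call_hallucination": 1.0, "communication": 1.0},
    {"goal_achievement": 0.25, "tool_call_hallucination": 1.0, "communication": 1.0},
]
axis_means = per_axis_means(conversations)
overall = mean(axis_means.values())

# Hypothetical floors: which failures are disqualifying for this workflow.
floors = {"goal_achievement": 0.7, "tool_call_hallucination": 0.95}
blocked_by = failing_axes(axis_means, floors)
```

The overall average here is above 0.75, which would look fine on a slide. The per-axis view shows `goal_achievement` at 0.375, well under its floor, so `blocked_by` contains it and the corpus does not ship.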

5. A reusable operating model for domain AI

The shared pattern can be generalized:

Step	Operator question	Implementation pattern
Ground	What is the source of truth?	Graph schema, database state, system facts, tool definitions
Generate	What examples should exist?	Query patterns, scenarios, personas, edge cases
Verify	Did meaning survive generation?	Semantic constraints, execution checks, logical verifiers, state invariants
Score	Which failure modes matter?	Multi-axis judging, per-criterion metadata
Filter	What should enter training?	Thresholds, top-decile selection, removal of noisy samples
Evaluate	Does performance generalize?	Held-out graphs, held-out scenarios, golden sets

This is the practical framework businesses should take from the papers. It is not tied to Cypher or payment agents. It is a design pattern for synthetic training data in structured domains.
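The Score and Filter steps can be sketched as a two-stage selection: hard floors first, then a ranked cut. The function name, the axis names, and the thresholds are hypothetical, and neither paper is claimed to use this exact selection rule.

```python
def select_for_training(scored, hard_floors, rank_axis, keep_fraction=0.1):
    """Return ids of examples that clear every floor, best-ranked first.

    scored: list of (example_id, {axis: score}) pairs; higher scores are better.
    hard_floors: {axis: minimum score}; failing any floor disqualifies.
    rank_axis: axis used to rank the survivors.
    keep_fraction: share of survivors to keep (0.1 means top decile).
    """
    eligible = [
        (example_id, axes)
        for example_id, axes in scored
        if all(axes[axis] >= floor for axis, floor in hard_floors.items())
    ]
    if not eligible:
        return []
    eligible.sort(key=lambda item: item[1][rank_axis], reverse=True)
    keep = max(1, int(len(eligible) * keep_fraction))
    return [example_id for example_id, _ in eligible[:keep]]


# Hypothetical scored corpus: odd-numbered examples hallucinate tool output.
scored = [
    (
        f"ex{i}",
        {
            "tool_call_hallucination": 1.0 if i % 2 == 0 else 0.5,
            "goal_achievement": i / 20,
        },
    )
    for i in range(20)
]
selected = select_for_training(
    scored,
    hard_floors={"tool_call_hallucination": 0.9},
    rank_axis="goal_achievement",
)
```

The ordering matters. `ex19` has the highest goal score in this toy corpus but never reaches the ranking stage, because it fails the hallucination floor. Disqualifying failures are removed before anything gets to compete on quality.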

The misconception to avoid: bigger LLMs do not automatically make better synthetic data

The most tempting misunderstanding is that synthetic data quality mainly comes from asking a stronger model to generate more examples. The papers argue otherwise.

CYQUARK includes an experiment where prompting a powerful model to generate data does not match the performance of its grounded generation pipeline on tested settings. The authors’ point is not that frontier models are useless. It is that unguided prompt-based generation struggles to achieve the coverage, executability, and logical fidelity needed for this task.

StateGen makes the parallel point for tool agents. If the tool simulator is just an LLM inventing plausible responses, it can fabricate backend facts. The fix is not merely a better prompt. The fix is a state manager and an invariant.

This distinction matters because enterprises are often sold “synthetic data” as if it were a volume knob. Turn the knob, get more training data. But synthetic data without grounding is often just overconfident augmentation. More examples do not help if the examples encode the wrong task.

A useful rule:

The value of synthetic data is not proportional to how much you generate. It is proportional to how much verified task structure survives generation.

Volume comes later. First, the data needs to be true in the ways the business cares about.

Where the papers are strong

The strongest part of this pair of papers is that both treat structure as central rather than decorative.

StateGen’s world-state object gives the agentic generation loop a memory of what is true. Its judge scores multiple criteria rather than collapsing quality into one number. Its train-versus-golden setup also recognizes that generated data can become memorization bait if scenario overlap is not controlled.

CYQUARK’s strength is similar but applied to semantic parsing. It does not merely generate natural-language questions and hope they map to plausible Cypher. It grounds query generation in the graph and schema, expands the supported Cypher constructs, and then checks whether paraphrasing preserved logical meaning.

Both papers also acknowledge limits. StateGen notes that its judge is not yet calibrated against human gold labels, that persona covariance is rule-described rather than learned, and that its state manager is still an LLM rather than a deterministic database. CYQUARK notes that it currently supports English and Cypher, that paraphrasing and filtering can improve, and that real property-graph interaction is often conversational and iterative rather than single-query.

That last limitation is especially interesting. CYQUARK points toward multi-round semantic parsing. StateGen already lives in multi-turn land. The two papers do not merge into one system, but they rhyme loudly enough.

What businesses should do next

The operational recommendation is not to copy either paper wholesale. Most companies do not need a research-grade synthetic data platform by next Tuesday. They do need to stop treating generated data as automatically useful.

A practical starting point:

Pick one narrow workflow where the system of record is explicit.
Define the source of truth: schema, API state, database snapshot, policy table, or business rules.
Generate examples from that structure, not from generic prompting alone.
Add a verification layer that checks executable correctness or semantic fidelity.
Score outputs by failure mode, not one global “quality” label.
Fine-tune or evaluate smaller models only on filtered examples.
Keep a held-out benchmark that the generator cannot casually memorize.
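For the last step in that list, one cheap sanity check is to flag held-out items that are near-duplicates of training items. The sketch below uses token-level Jaccard similarity with an arbitrary 0.7 threshold. Both choices are assumptions for illustration, the example scenarios are invented, and this is a crude lexical check rather than the overlap control either paper uses.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of tokens two sets have in common, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leaked_golden(train, golden, threshold=0.7):
    """Held-out items that are suspiciously close to some training item."""
    train_tokens = [tokens(t) for t in train]
    return [
        g for g in golden
        if any(jaccard(tokens(g), t) >= threshold for t in train_tokens)
    ]


# Hypothetical scenario descriptions.
train = [
    "refund for a duplicate charge on order 1182",
    "change the shipping address before dispatch",
]
golden = [
    "refund for a duplicate charge on order 7731",
    "dispute a chargeback raised by the card issuer",
]
leaks = leaked_golden(train, golden)
```

The first held-out scenario differs from a training scenario only by its order number, so it is flagged. A model that scores well on it has demonstrated recall, not generalization.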

The right first use cases are not open-ended assistants. They are bounded, repetitive, structured tasks where the organization already knows what truth looks like: graph querying, CRM workflows, transaction disputes, internal policy lookup, claims routing, order management, booking flows, compliance triage, and similar domains.

This is also where the business case is easiest to measure. Compare annotation cost, inference cost, accuracy, failure modes, latency, privacy exposure, and escalation rates. Avoid mystical KPIs such as “agent helpfulness,” unless the plan is to bury the project in sentiment fog.

The strategic read: the future is not synthetic data, it is grounded data factories

The AI industry likes to rename old disciplines when a GPU is nearby. In this case, what matters is not synthetic data in the abstract. It is the emergence of grounded data factories: repeatable pipelines that convert business structure into verified training and evaluation material.

That is a serious capability. It lets companies train models on workflows they cannot safely expose, generate coverage for cases they rarely observe, and deploy smaller models where proprietary frontier models are too costly, too external, or too blunt.

But the standard is higher than “the examples look plausible.” The examples must be tied to a real schema, a real state model, executable queries, scenario invariants, semantic checks, and failure-mode scores.

StateGen and CYQUARK both point to the same practical conclusion: synthetic data is not a shortcut around truth. It is a way to operationalize truth at scale.

Which is less magical than the usual AI pitch. Also more useful. Funny how often those two travel together.

Cognaptus: Automate the Present, Incubate the Future.

¹ Rahul Khedar et al., “State-Grounded Multi-Agent Synthetic Data Generation for Tool-Augmented LLMs,” arXiv:2606.16307, 2026, https://arxiv.org/abs/2606.16307.
² Francesco Cazzaro, Jessica Lennon, and Ariadna Quattoni, “Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation,” arXiv:2606.14325, 2026, https://arxiv.org/abs/2606.14325.
