Ground Control to Synthetic Data: Why Enterprise LLMs Need a Source of Truth
TL;DR for operators
Synthetic data is having its predictable enterprise moment: everyone wants more of it, faster, cheaper, and preferably without involving humans who ask inconvenient questions like “is this correct?” The two papers here are useful because they push against that lazy version of the story. StateGen, from PayPal AI, focuses on generating multi-turn training conversations for tool-augmented LLM agents, using an authoritative world-state object, tool simulation, persona variation, and multi-axis judging.¹ CYQUARK focuses on generating Text-To-Cypher fine-tuning data from a target property graph and schema, expanding query expressivity while filtering natural-language paraphrases for logical fidelity.² ...