Data Governance

Sovereign Syntax: How Poland Built Its Own LLM Empire

A citizen-facing AI assistant is where the PLLuM story becomes interesting. Not because a chatbot in a government app is a dazzling concept. It is not. Most public-sector chatbots have the charisma of a PDF with a search bar and the legal confidence of a nervous intern. The interesting part is what Poland had to build before such an assistant could be considered remotely serious: a rights-managed national corpus, Polish-native instruction data, preference alignment, safety filters, RAG evaluation, retrieval tooling, and a family of public models with different licence regimes. ...

Automate All the Things? Mind the Blind Spots

A research report lands on your desk. It has a neat abstract, respectable tables, clean code attached, and just enough methodological language to sound like someone suffered through the usual academic rituals. Except this time, no one did. An AI scientist system generated the idea, wrote the code, ran the experiments, selected the result, and drafted the paper. ...

Map Before You Train: Data Cartography to Defuse LLM Memorization

TL;DR for operators: Training data does not become risky only after a model has memorised it. It often leaves signals while training is still happening. That is the useful idea behind Generative Data Cartography, or GenDataCarto: track how each pretraining sample behaves during early training, then use that behaviour to decide which data should be kept, up-sampled, down-weighted, or removed.[1] The method uses two signals. The first is early loss, which approximates how difficult a sample is. The second is the frequency of “forget events”, where a sample appears learned and later becomes poorly fitted again. In the paper’s framing, frequent forget events are not just training noise. They are a warning that a sample may be unusually influential, repeatedly re-entering the model’s attention like a guest who refuses to leave the meeting. ...
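
To make the two signals concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name `cartography_signals`, the `early_epochs` window, and the rule that a sample counts as "learned" when its loss is at or below `learned_threshold` are all assumptions chosen for illustration. The input is a hypothetical per-epoch, per-sample loss log.

```python
def cartography_signals(loss_history, early_epochs=3, learned_threshold=1.0):
    """Compute two per-sample signals from a loss log.

    loss_history: list of epochs, each a list of per-sample losses.
    The threshold-based definition of "learned" is an illustrative
    assumption, not necessarily the definition used in the paper.
    """
    n_samples = len(loss_history[0])

    # Signal 1: early loss, the mean loss over the first few epochs,
    # used here as a rough proxy for sample difficulty.
    early = loss_history[:early_epochs]
    early_loss = [
        sum(epoch[i] for epoch in early) / len(early)
        for i in range(n_samples)
    ]

    # Signal 2: forget events, counted whenever a sample was "learned"
    # at one epoch (loss <= threshold) and is no longer at the next.
    forget_events = [0] * n_samples
    for prev, curr in zip(loss_history, loss_history[1:]):
        for i in range(n_samples):
            if prev[i] <= learned_threshold < curr[i]:
                forget_events[i] += 1

    return early_loss, forget_events


# Hypothetical loss log: 5 epochs, 3 samples (made-up numbers).
losses = [
    [2.0, 0.5, 3.0],
    [0.8, 0.4, 2.5],
    [1.5, 0.3, 0.9],
    [0.7, 0.2, 1.4],
    [1.2, 0.2, 0.8],
]
early_loss, forget_events = cartography_signals(losses, early_epochs=2)
```

In this toy log the first sample crosses the threshold and falls back twice, which is the kind of behaviour the method would flag for review. What to do with a flagged sample (keep, down-weight, or remove) is a policy decision the sketch deliberately leaves out.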

Faking It to Make It: When Synthetic Data Actually Works

TL;DR for operators: Synthetic data is not magic fake data that politely becomes real after a procurement cycle. It is a set of techniques for generating artificial records that imitate useful properties of real datasets, and its value depends on what bottleneck you are trying to remove. Li et al.’s tutorial proposal, “Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era”, is best read as a map of the modern synthetic-data stack: GANs, diffusion models, and LLMs; text, tabular, graph, sequential, visual, and multimodal data; evaluation criteria; and practical deployment settings in health, finance, and education.[1] It is not a benchmark paper. It does not run a new experiment showing that synthetic data improves business outcomes by some conveniently rounded percentage. That is inconvenient, but also useful. The paper is trying to organise the field, not sell a miracle. ...