Data Governance

Anonymize Customer Data with AI

How to use AI to redact, mask, or pseudonymize customer data safely, and where automated anonymization can fail in practice.

Many Voices, One Label: How Pluralistic AI Flattens the World

TL;DR for operators: An AI project can interview communities, collect thousands of preference judgments, preserve several user perspectives, and still impose one rigid interpretation of the world. That is the central warning in Rashid Mushkani’s AI Pluralism and the Worlds It Misses.1 The paper names the failure ontological flattening: the process by which contested concepts such as safety, accessibility, inclusion, comfort, or belonging become fixed labels, measurable proxies, aggregation rules, or benchmark targets that are subsequently treated as neutral. ...

CQ, AI & The Question of Questions

Questions look cheap. That is why they are dangerous. In most enterprise AI projects, the visible work arrives late: dashboards, RAG demos, knowledge graphs, compliance assistants, workflow copilots, and executive slides with arrows pointing to a “semantic layer.” The invisible work arrives earlier and is less glamorous: deciding what the system must actually know, answer, retrieve, distinguish, reject, and explain. ...

SEALing the Gap: When Synthetic Data Learns Accountability

Network data is easy to fake. Accountability is not. That is the uncomfortable little problem sitting behind synthetic data. A team can simulate users, devices, traffic surges, mobility patterns, channel interference, and edge-network behavior long before a full 6G deployment exists. This is useful. It is also slightly dangerous. A synthetic dataset can look realistic, train a model successfully, and still carry hidden bias, brittle assumptions, weak provenance, or regulatory gaps. Reality is not only a distribution. It is also a chain of responsibility. ...

Noise Without Regret: How Error Feedback Fixes Differentially Private Image Generation

Photos are annoying data. They are useful because they contain details: the handle of a bag, the edge of a sleeve, the texture of a face, the faint classroom gesture that matters only after someone trains a model on it. They are risky for exactly the same reason. If a generated image looks too much like the real training data, it may quietly leak what the organization was trying not to reveal. If it is protected too aggressively, it becomes a blurry souvenir from a dataset that used to be useful. ...

SD‑RAG: Don’t Trust the Model, Trust the Pipeline

A chatbot should not be the only employee in the company responsible for keeping secrets. That sounds obvious until we look at how many enterprise RAG systems are designed. A user asks a question. The system retrieves internal documents. The documents are placed into the model context. A policy instruction is added somewhere above the user prompt: do not reveal sensitive information. Then everyone hopes the model behaves. ...

Redundancy Overload Is Optional: Finding the FDs That Actually Matter

Tables have a talent for pretending to be tidy. A customer table may have fifty columns. A transaction table may have a hundred. A log table may contain derived fields, timestamps, status codes, copied identifiers, normalized labels, and a few columns that nobody remembers creating but everybody is afraid to delete. Then a data profiling tool arrives, dutifully discovers functional dependencies, and returns several hundred thousand “valid” relationships. ...

When Models Start to Forget: The Hidden Cost of Training LLMs Too Well

Duplicates are supposed to be boring. In data engineering, duplicate records are usually treated as a hygiene problem: remove them, clean the pipeline, reduce noise, move on. In language-model training, repetition is less innocent. Repeated text can help a model learn an underrepresented domain. It can also teach the model to reproduce specific sequences too well. Somewhere between “useful exposure” and “verbatim recall,” a model stops learning only the pattern and starts carrying around the document. ...

When Data Comes in Boxes: Why Hierarchies Beat Sample Hoarding

Data rarely arrives as loose sand. Data teams like to speak as if training data arrives one sample at a time: one image, one row, one document, one carefully chosen datapoint. Procurement departments, research consortia, hospitals, vendors, and public repositories are less poetic. They ship data in boxes. A box might be a dataset from one partner institution. A folder from a public repository. A domain-specific archive. A vendor package. A department export. It arrives with source, license, schema, quirks, and hidden failure modes already attached. The operational question is not only “Which samples should we keep?” It is also “Which boxes are worth opening?” ...

Probe and Error: Why Off‑Policy Training Warps LLM Behaviour Detectors

A monitor is only useful if it fails in the boring place. The boring place is production: the real domain, the real prompt style, the real user incentives, the real model generating the real response. Not the tidy benchmark. Not the synthetic dataset. Not the “please pretend to be deceptive” prompt that makes everyone in the lab feel productive. Production is where a detector either catches the thing it was built to catch, or quietly becomes a compliance ornament with a nice AUROC score. ...